
Taking YouTube links out of a list of GitHub repo READMEs

Submitted by: @import:stackexchange-codereview

Problem

I have a .csv file containing student GitHub repository assignment submissions. I wrote a script that goes to each repository and extracts the YouTube video link that the student was required to provide in their README file.

The structure of the CSV file is as follows:

Timestamp,Name,Student Number,Git Repo link
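
A data row in that format might look like this (the values are made up for illustration):

```
2017-01-01 10:00:00,Alice Example,12345,https://github.com/alice/assignment
```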


```
#!/usr/bin/python3

import csv
import github3
import time
import re
import argparse
from secrets import username, password

# API rate limit for authenticated requests is way higher than anonymous, so log in.
gh = github3.login(username, password=password)
# gh = github3.GitHub()  # Anonymous


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=str, metavar="filepath",
                        help="Filepath to the input csv file.")

    args = parser.parse_args()
    args = vars(args)  # Turn into dict-like view.

    return args


def get_row_count(filename):
    with open(filename, 'r') as file:
        return sum(1 for row in csv.reader(file))


def get_repositories(link):
    if gh.rate_limit()['resources']['search']['remaining'] == 0:
        sleep_time = gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1)
        print("API rate exceeded, sleeping for {0} seconds.".format(sleep_time))
        time.sleep(sleep_time)

    return gh.search_repositories(link.replace("https://github.com/", "", 1), "", 1)


def main():
    filepath = parse_args()['filepath']
    if not filepath.endswith('.csv'):
        print("Input file must be a .csv file.")
        exit()

    # From http://stackoverflow.com/a/3726073/6549676
    p = re.compile(r"http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)"
                   r"([\w\-\_]*)(&(amp;)?[\w\?\=]*)?")
    row_counter = 0
    row_count = get_row_count(filepath)

    with open(filepath, 'r') as infile, \
            open(filepath[:-4] + "_ytlinks.csv", "w") as outfile:
        reader = csv.reader(infile)
        next(reader, None)  # Skip header

        writer = csv.writer(outfile)
        writer.writerow(["Name", "Student Number", "YouTube link"])

        for row in reader:
            row_counter += 1
            print("Processing row {0} of {1}.".format(row_counter, row_count))
            for repo in get_repositories(row[3]):
                readme = repo.repository.readme().decoded.decode('utf-8')
                ids = p.findall(readme)
                if len(ids) == 0:
                    writer.writerow([row[1], row[2], "No link found"])
                else:
                    writer.writerow([row[1], row[2],
                                     "https://youtu.be/" + ids[0][0]])


if __name__ == '__main__':
    main()
```

Solution

Here are some concerns/suggestions:

  • you are reading the file twice: once to get the row count and once to read the links. Also, you don't need to initialize a csv.reader to get the row count; simply use sum() over the lines in the file. You would probably need to call infile.seek(0) after getting the count and before initializing the csv reader



  • use _ for the throw-away variables (when counting the number of lines)



  • if len(ids) == 0: can be simplified as if not ids:



  • it looks like you don't need .findall() and should use the .search() method instead, since you only need a single match



  • if there is a single repository link per line, you should probably have a get_repository() method instead of get_repositories() and avoid the for repo in get_repositories(row[3]): loop - remember, "Flat is better than nested"



  • instead of handling the enumeration with row_counter manually, use enumerate()



  • instead of accessing the current row fields by index - e.g. row[1] or row[3] - you can unpack the row in the for loop, something like (an example, I don't know your actual CSV input format):

    for index, username, _, github_link in reader:

    Or, you can use a csv.DictReader - accessing the fields by column names instead of indexes would improve readability - e.g. row["github_link"] instead of row[3]

  • you don't have to convert the args to a dictionary - return args and then access the arguments using dot notation - e.g. args.filepath
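
Taken together, the reading loop could look something like the minimal sketch below. It is not the full script: it uses an in-memory stand-in for the CSV file and a hypothetical extract_youtube_id() helper in place of the GitHub calls, so it runs without network access. The regex is the same one the question links to, trimmed to the video-id group.

```python
import csv
import io
import re

# Pattern from the same source as the question: http://stackoverflow.com/a/3726073/6549676
YOUTUBE_RE = re.compile(
    r"http(?:s?)://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w\-_]*)"
)


def extract_youtube_id(readme_text):
    """Return the first YouTube video id found in the text, or None."""
    match = YOUTUBE_RE.search(readme_text)  # .search(): only one match is needed
    return match.group(1) if match else None


# In-memory stand-in for the input CSV, so the sketch runs without a file.
infile = io.StringIO(
    "Timestamp,Name,Student Number,Git Repo link\n"
    "2017-01-01,Alice,12345,https://github.com/alice/assignment\n"
)

row_count = sum(1 for _ in infile) - 1  # '_' marks the throw-away variable
infile.seek(0)  # rewind before handing the same object to csv.reader

reader = csv.reader(infile)
next(reader, None)  # skip the header row
for index, (timestamp, name, student_number, repo_link) in enumerate(reader, 1):
    print("{0}/{1}: {2} -> {3}".format(index, row_count, name, repo_link))
```

The same loop works unchanged with a real open file object, and swapping csv.reader for csv.DictReader replaces the tuple unpacking with named access such as row["Git Repo link"].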

Context

StackExchange Code Review Q#154951, answer score: 14
