Taking YouTube links out of a list of GitHub repo READMEs
Problem
I have a .csv file that contains student GitHub repository assignment submissions. I made a script to go to each repository and extract the YouTube video that they must have provided in their README file.
The structure of the CSV file is as follows:

```
Timestamp,Name,Student Number,Git Repo link
```

My script:

```
#!/usr/bin/python3
import csv
import github3
import time
import re
import argparse

from secrets import username, password

# API rate limit for authenticated requests is way higher than anonymous, so login.
gh = github3.login(username, password=password)
# gh = github3.GitHub()  # Anonymous


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=str, metavar="filepath",
                        help="Filepath to the input csv file.")
    args = parser.parse_args()
    args = vars(args)  # Turn into dict-like view.
    return args


def get_row_count(filename):
    with open(filename, 'r') as file:
        return sum(1 for row in csv.reader(file))


def get_repositories(link):
    if gh.rate_limit()['resources']['search']['remaining'] == 0:
        print("API rate exceeded, sleeping for {0} seconds.".format(
            gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1)))
        time.sleep(gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1))
    return gh.search_repositories(link.replace("https://github.com/", "", 1), "", 1)


def main():
    filepath = parse_args()['filepath']
    if not filepath.endswith('.csv'):
        print("Input file must be a .csv file.")
        exit()
    # From http://stackoverflow.com/a/3726073/6549676
    p = re.compile(r"http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?")
    row_counter = 0
    row_count = get_row_count(filepath)
    with open(filepath, 'r') as infile, open(filepath[:3] + "_ytlinks.csv", "w") as outfile:
        reader = csv.reader(infile)
        next(reader, None)  # Skip header
        writer = csv.writer(outfile)
        writer.
```
Solution
Here are some concerns/suggestions:
- you are reading the file twice - once to get the row count and once when reading the links. And, you don't need to initialize a `csv.reader` to get the row count; simply use `sum()` over the lines in the file. You would probably need to use `infile.seek(0)` after getting the count and before initializing the csv reader
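A minimal sketch of that single-pass idea, using `io.StringIO` to stand in for the real opened file (the sample rows are made up, matching the header from the question):

```python
import csv
import io

def get_row_count(infile):
    # File objects are iterable line by line, so no csv.reader is needed here.
    count = sum(1 for _ in infile)
    infile.seek(0)  # Rewind so the same handle can feed csv.reader afterwards.
    return count

# io.StringIO stands in for the real opened CSV file.
infile = io.StringIO(
    "Timestamp,Name,Student Number,Git Repo link\n"
    "2017-01-01,Alice,1234,https://github.com/alice/repo\n"
)
row_count = get_row_count(infile)  # 2
reader = csv.reader(infile)        # Same handle, now rewound to the start.
header = next(reader)
```

This way the file is opened once and read once for counting, then rewound instead of reopened.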
- use `_` for the throw-away variables (when counting the number of lines)
- `if len(ids) == 0:` can be simplified to `if not ids:`
- it looks like you don't need `.findall()` and should use the `.search()` method, since you only expect a single match
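For illustration, with a simplified variant of the YouTube pattern (the README text here is invented):

```python
import re

# Simplified version of the question's YouTube-ID pattern, for illustration.
p = re.compile(r"https?://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w\-]+)")

readme = "Demo video: https://youtu.be/dQw4w9WgXcQ and some other text."

# .search() stops at the first match and returns a match object (or None),
# which is all you need when a single link is expected.
match = p.search(readme)
video_id = match.group(1) if match else None
```

Unlike `.findall()`, which always scans the whole string and builds a list, `.search()` returns as soon as the first match is found.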
- if there is a single repository link per line, you probably should have a `get_repository()` method instead of `get_repositories()` and avoid the `for repo in get_repositories(row[3]):` loop - remember, "Flat is better than nested"
- instead of handling the enumeration with `row_counter` manually, use `enumerate()`
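Roughly like this, with hypothetical sample rows:

```python
rows = [
    ["2017-01-01", "Alice", "1234", "https://github.com/alice/repo"],
    ["2017-01-02", "Bob", "5678", "https://github.com/bob/repo"],
]

progress = []
# enumerate() hands you the counter; start=1 makes it human-friendly.
for index, row in enumerate(rows, start=1):
    progress.append("{0}/{1}: {2}".format(index, len(rows), row[3]))
```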
- instead of accessing the current row fields by index - e.g. `row[1]` or `row[3]` - you can unpack the row in the `for` loop, something like (an example, I don't know your actual CSV input format): `for index, username, _, github_link in reader:`. Or, you can use a `csv.DictReader` - accessing the fields by column names instead of indexes would improve readability - e.g. `row["github_link"]` instead of `row[3]`
- you don't have to convert the `args` to a dictionary - return `args` and then access the arguments using dot notation - e.g. `args.filepath`
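The `csv.DictReader` variant could look like this (sample rows invented; `DictReader` takes the column names from the header row, so the keys below assume the header from the question):

```python
import csv

# Sample lines standing in for the real file; header matches the question's CSV.
lines = [
    "Timestamp,Name,Student Number,Git Repo link",
    "2017-01-01,Alice,1234,https://github.com/alice/repo",
]

links = []
for row in csv.DictReader(lines):
    # Fields are accessed by column name instead of a bare index like row[3].
    links.append(row["Git Repo link"])
```

`DictReader` also skips the header row for you, so the explicit `next(reader, None)` is no longer needed.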
Code Snippets
```
for index, username, _, github_link in reader:
```
Context
StackExchange Code Review Q#154951, answer score: 14