Sample scraping Project Gutenberg using Beautiful Soup and requests
Problem
I am trying to learn web scraping in Python using Beautiful Soup and requests. My program goes to the book page on Project Gutenberg with the given book number (Example). It then finds the link for the given format (text in this case) and then writes the contents of the book to a file on the hard disk.
How and what can I improve in this code?
import requests
from bs4 import BeautifulSoup

def go_gutenberg(file_format,book_no):
    url = "https://www.gutenberg.org/ebooks/"
    r =requests.get(url+str(book_no))
    r_html = r.text
    soup = BeautifulSoup(r_html,"html.parser")
    for file in soup.find_all('a',class_="link"):
        if file_format in file.text:
            get_book=file.get('href')
            g = requests.get("https:"+get_book)
            with open("C:\\Users\\syed\\Documents\\Gutenberg\\Book"+str(book_no)+".txt",'wb') as open_file:
                for chunk in g.iter_content(10000):
                    open_file.write(chunk)

def main():
    go_gutenberg("Text",1000)

if __name__=="__main__":main()

And this is sample HTML of the book links:
Plain Text UTF-8

Solution
Easy stuff
There are a lot of stylistic issues (indentation, spacing between operators, etc.) that violate PEP 8; you can use a tool like pylint to find them.
Moving on
r_html = r.text
Do you really need this? You don't save any characters (not that it is always about saving characters) and you use r_html only once. I would just get rid of the line.
I would also say the same about:
url = "https://www.gutenberg.org/ebooks/"
Just get rid of it. In this instance in particular I would get rid of it because, when I read it, I start wondering whether there is a way to abstract it ("Maybe let the user change the url, somehow?"). There isn't a particularly good way to "abstract" this url out.
Furthermore, it is considered good practice to use string formatting instead of string concatenation, so:
"https://www.gutenberg.org/ebooks/" + str(book_no)
Becomes:
"https://www.gutenberg.org/ebooks/%d" % book_no
This, on the other hand:
"C:\\Users\\syed\\Documents\\Gutenberg\\Book"
Seems like it can be abstracted. What if the user wants to change the location? Use this location by default, but allow the user to change it: create a default argument with the value "C:\\Users\\syed\\Documents\\Gutenberg\\Book" so that the user has the option to specify a different one.
Same with:
for chunk in g.iter_content(10000):
What is 10000? Maybe make a default argument for this with the value of 10000 and the name of (possibly) chunk_size?
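The suggestions above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the reviewer's own code: the helper names book_url, book_path and save_chunks and the parameter name directory are hypothetical, chosen only for illustration. Network access is deliberately left out so the pieces can be run on their own; in the real script the chunks would come from g.iter_content(chunk_size).

```python
import os
import tempfile


def book_url(book_no):
    """Build the book page URL with string formatting instead of concatenation."""
    return "https://www.gutenberg.org/ebooks/%d" % book_no


def book_path(book_no, directory="C:\\Users\\syed\\Documents\\Gutenberg"):
    """Return the output file path.

    `directory` is a hypothetical default argument: it keeps the original
    location as the default but lets the caller choose another one.
    """
    return os.path.join(directory, "Book%d.txt" % book_no)


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`.

    In the original script the iterable would be g.iter_content(chunk_size),
    with chunk_size exposed as a default argument (for example 10000).
    """
    with open(path, "wb") as open_file:
        for chunk in chunks:
            open_file.write(chunk)


if __name__ == "__main__":
    # Stand-in data instead of a real download, written to a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        target = book_path(1000, directory=folder)
        save_chunks([b"first chunk, ", b"second chunk"], target)
        print(book_url(1000))
        print(os.path.basename(target))
```

With requests in place, go_gutenberg could take directory and chunk_size as extra keyword arguments and end with something like save_chunks(g.iter_content(chunk_size), book_path(book_no, directory)), so that neither the folder nor the chunk size is hard-coded in the body.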
Context
StackExchange Code Review Q#141898, answer score: 2