Scraping efficiently with mechanize and bs4
Problem
I have written some code that scrapes data on asteroids, but the problem is that it is super slow! I understand that it has a lot to scrape, but as of now it has been running for 5 days and is not even a tenth of the way through. Here is my code; the part I'm talking about is under GET-EPHEMERIDES:
```
from mechanize import Browser
from bs4 import BeautifulSoup
import datetime
from dateutil.relativedelta import relativedelta
##from SendEmail import Send
import os
import httplib
import time
import sys
print "###################################################################################\n Scrapes targets from MPO Opposition Database \n \n and Ephemerides from MPC's website \n Author: Tarik Joseph Zegmott \n###################################################################################\n\n"
#-----EXTRACTION-CODE-FOR-MINORPLANET.INFO------------------------------------------
def extract(soup):
    table = soup.find('table', border=1)
    for row in table.findAll('tr')[1:]: # uses BS findAll() to pull all tr tags (html row) into a list, the [1:] modifier skips the first line which is just a header
        col = row.findAll('td') # A list that will grab all the td tags (html column)
        num = col[0].font.string
        name = col[1].font.string
        odate = col[2].font.string # Opposition Date (mm/dd.d)
        omag = col[3].font.string # Opposition Mag (V)
        mddate = col[4].font.string # Date of Minimum Distance (mm/dd.d)
        mdist = col[5].font.string # Minimum Distance from Earth (AU)
        bdate = col[7].font.string # Date of Brightest Apparition (mm/dd.d)
        bmag = col[8].font.string # Brightest Magnitude (V)
        bdec = col[9].font.string # Declination on Date of Brightest Apparition
r
```
Solution
- Remove nasty comments. For example, your "block separator comments", `###...###`, are completely and utterly useless. Remove them. Likewise, don't create "title comments" such as `#----...----`. Comments should be helpful; many of yours are not.
- Another tip on comments: many of yours describe things that are already obvious from the code. For example, you have a comment that says `# Reruns this loop until 'try' works, but can make run indefinately (not sure why?)`. It's clear that once the `try` block works, the loop is exited. Obvious comments like these can be removed.
- Variables and functions should be named in `snake_case`, and classes in `PascalCase`. If a variable is constant, it should be in `UPPERCASE_SNAKE_CASE`.
- You need better variable names. For example, `d1` and `d2` give no hint of their purpose. Variable names should be as descriptive as possible without being overly long. There are many other places where renaming would help; `d1` and `d2` are just two examples.
- When getting user input, instead of making the user enter everything in lowercase, lower the text yourself with `str.lower()`. Here's an example: `user_input = raw_input("> ").lower()`.
- At the beginning of your code, you print the same character many times. Instead of repeating this character over and over in your string, use string multiplication. For example, to print 50 spaces, I would write `print " " * 50`.
- Why on Earth are you using `sys.stdout.write()`? Just use `print` to print something. `sys.stdout.write()` is unnecessary.
- Rather than checking whether something equals false, as you do on the line `if os.path.exists('./Targets') is False:`, you can just write `if not os.path.exists("./Targets"):`.
- Finally, you have many PEP8 errors. There are far too many to list in one answer, so I'm going to point you to the PEP8 style guide instead, and you can read through it.
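Several of the points above (constant naming, string multiplication, lowering user input, and `if not` instead of `is False`) can be sketched together as follows. This is only an illustration, not a rewrite of the original script: the names `BANNER_WIDTH`, `build_banner`, `normalize_answer` and `ensure_directory` are hypothetical and do not appear in the question. It is written to run on Python 3; the question's code is Python 2, where `raw_input` takes the place of `input`.

```python
import os

# Hypothetical constant; UPPERCASE_SNAKE_CASE signals that it never changes.
BANNER_WIDTH = 60


def build_banner(title, width=BANNER_WIDTH):
    """Return a three-line banner built with string multiplication."""
    rule = "#" * width  # instead of typing the same character `width` times
    return "\n".join([rule, title.center(width), rule])


def normalize_answer(raw_answer):
    """Strip and lower-case input so 'Y', ' y' and 'y' are treated alike."""
    return raw_answer.strip().lower()


def ensure_directory(path):
    """Create `path` if it is missing; return True only if it was created."""
    if not os.path.exists(path):  # rather than `... is False`
        os.makedirs(path)
        return True
    return False


if __name__ == "__main__":
    print(build_banner("Scrapes targets from MPO Opposition Database"))
```

In the script itself, the input prompt would then become something like `normalize_answer(input("> "))`, so the user no longer has to type in lowercase.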
Context
StackExchange Code Review Q#73743, answer score: 3