Scraping efficiently with mechanize and bs4

Submitted by: @import:stackexchange-codereview

Problem

I have written some code that scrapes data on asteroids, but the problem is that it is super slow! I understand that it has a lot to scrape, but it has now been running for 5 days and is not even a tenth of the way through. Here is my code; the part I'm talking about is under GET-EPHEMERIDES:

```
from mechanize import Browser
from bs4 import BeautifulSoup
import datetime
from dateutil.relativedelta import relativedelta
##from SendEmail import Send
import os
import httplib
import time
import sys

print "###################################################################################\n Scrapes targets from MPO Opposition Database \n \n and Ephemerides from MPC's website \n Author: Tarik Joseph Zegmott \n###################################################################################\n\n"

#-----EXTRACTION-CODE-FOR-MINORPLANET.INFO------------------------------------------

def extract(soup):
    table = soup.find('table', border=1)
    for row in table.findAll('tr')[1:]: # uses BS findAll() to pull all tr tags (html row) into a list, the [1:] modifier skips the first line which is just a header
        col = row.findAll('td') # A list that will grab all the td tags (html column)
        num = col[0].font.string
        name = col[1].font.string
        odate = col[2].font.string # Opposition Date (mm/dd.d)
        omag = col[3].font.string # Opposition Mag (V)
        mddate = col[4].font.string # Date of Minimum Distance (mm/dd.d)
        mdist = col[5].font.string # Minimum Distance from Earth (AU)
        bdate = col[7].font.string # Date of Brightest Apparition (mm/dd.d)
        bmag = col[8].font.string # Brightest Magnitude (V)
        bdec = col[9].font.string # Declination on Date of Brightest Apparition
        r
```

Solution


  • Remove unhelpful comments. Your "block separator" comments (###...###) add nothing, so remove them, and drop the "title" comments (#----...----) as well. Comments should be helpful; many of yours are not.



  • Another tip on comments: many of yours describe things that are already obvious from the code. For example, one comment says: # Reruns this loop until 'try' works, but can make run indefinately (not sure why?). It is clear that the loop exits once the try block succeeds, so obvious comments like this one can be removed.



  • Variables and functions should be named in the style of snake_case, and classes should be in the style of PascalCase. If a variable is constant, it should be UPPERCASE_SNAKE_CASE.



  • You need better variable names. For example, d1 and d2 give no indication of their purpose. Names should be as descriptive as possible without becoming unwieldy. d1 and d2 are just two examples; there are many other places where renaming would help.



  • When getting user input, instead of making the user enter everything in lowercase, lower the text using str.lower(). Here's an example: user_input = raw_input("> ").lower().



  • At the beginning of your code, you print the same character many times. Instead of repeating this character over and over again in your string, use string multiplication. For example, if I wanted to print 50 spaces, I would do print " " * 50.



  • Why on Earth are you using sys.stdout.write()? Just use print to print something. sys.stdout.write() is unnecessary.



  • Rather than checking if something equals false, as you're doing on this line: if os.path.exists('./Targets') is False:, you can just do if not os.path.exists("./Targets"):.



  • Finally, you have many PEP 8 violations. There are far too many to list in one answer, so I'm going to point you to the style guide instead, and you can read through it. PEP 8 is published on python.org.
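
To make a few of the points above concrete, here is a minimal sketch rather than a rewrite of your script. The names BANNER_WIDTH, build_banner, normalize_answer and ensure_directory are hypothetical and chosen only for illustration, and the width of 83 is an arbitrary value. It shows a constant in UPPERCASE_SNAKE_CASE, snake_case function names, string multiplication for the banner, str.lower() on user input, and the `not os.path.exists(...)` test. It is written to run on Python 3; the single-argument print calls should also work on Python 2, which your code appears to use.

```python
import os

# Constant in UPPERCASE_SNAKE_CASE; 83 is an arbitrary width for illustration.
BANNER_WIDTH = 83


def build_banner(title, width=BANNER_WIDTH):
    """Return a banner whose rule lines come from string multiplication."""
    rule = "#" * width
    return "\n".join([rule, title.center(width), rule])


def normalize_answer(raw_text):
    """Lower-case user input so the user does not have to type in lowercase."""
    return raw_text.lower()


def ensure_directory(path):
    """Create `path` if it is missing, using `not` instead of `is False`."""
    if not os.path.exists(path):
        os.makedirs(path)
    return path


print(build_banner("Scrapes targets from MPO Opposition Database"))
print(normalize_answer("YES"))  # prints "yes"
```

In the script itself, the text passed to normalize_answer would come from raw_input("> "), and ensure_directory would be called with './Targets'.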

Context

StackExchange Code Review Q#73743, answer score: 3
