eBay scraper with BeautifulSoup

Submitted by: @import:stackexchange-codereview

Problem

A friend of mine needs a scraper that reads product titles, prices and pictures from eBay and saves them to an Excel-ready .csv file. The program reads product pages into a list and then loops through it to get the relevant data, which it then saves to a .csv file.

```
import pdb
import time
import urllib
import urllib2
import re
import sys
import os
import shutil
import codecs
import unicodedata
from BeautifulSoup import BeautifulSoup

#debugger, just in case :S
#pdb.set_trace()
print("eBay allGrab Scraper by Ben Fishman, 2013\n")
print("\n")

#Create Input.csv
file = open("input.csv", "w")
file.close()
#Set counter to 0
i=0

urlinn = raw_input('Number of search pages you want to enter:\n')
urlinnn=int(urlinn)
for i in range(0, urlinnn):
    print "Please copy & paste search page URL #",i+1,":"
    urlin = raw_input('')
    soup = BeautifulSoup(urllib2.urlopen(urlin).read())
    for link in soup.findAll('a',{'itemprop':'name'}):
        url = link.get('href')
        file = open("Input.csv", "a")
        file.write(url)
        file.write(",")
        file.close()
#Delete the last comma so it doesn't screw up the processing later
with open("input.csv", 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
print "Done."

dlcheck = raw_input('Do you want to download the product pictures? (y/n)\n')
print("\n")
#if the images folder doesn't exist, create a new one
if dlcheck == "y":
    if not os.path.exists('images'):
        os.mkdir('images')

print("Reading input file...")
#open and read Input.csv, split at commas and create a list of URLs

try:
    with open('input.csv') as f:
        content = f.read().split(',')
    print("Input file read.")
    print("Creating output file...")
    #create the Output.csv file and write a header
    file = open("Output.csv", "w")
    file.write('Name,Price,icIMG URL\n')
    file.close()
    #get list lenght
    llen=content.__len__()
    #throw an error if the list is invalid
    if llen==1:
        print "Hey! The Input file is empty!"
        print "Please put some URLs in
```

Solution

First things first

When a script starts to grow unruly like this, it is usually a good idea to break it down into its component parts and compartmentalize functionality. I'm going to go through just a couple of parts of your script with a few ideas that might help to 'modularize' it. Doing so can help with debugging in the future, and it clarifies your thinking about what the script currently does and does not do.

I haven't tried to radically re-implement what you're doing but your use of a CSV file as a container for the search page links seems a little unnecessary. Much of your script is spent doing something like:


parsed html >> CSV >> list

It could be easier to work directly on the list and data instead, unless you want to use the CSV files for something else later. The list of HTML links won't be a memory issue unless it grows to many thousands of links (a 10k-character string takes 10033 bytes on my machine).
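
As a rough illustration of both points, the sketch below keeps the links in a plain list and measures a 10,000-character string with `sys.getsizeof`. It is written for Python 3 (the question's code is Python 2), the URLs are made-up placeholders, and the exact byte count will differ between interpreters and platforms.

```python
import sys

# Hypothetical item links, standing in for what the HTML parser would return.
links = ["http://example.com/item/%d" % n for n in range(3)]

# No CSV round trip needed: iterate over the list directly.
for link in links:
    print(link)

# Size of a 10,000-character string. The overhead on top of the
# 10,000 characters varies between Python versions and platforms.
big_string = "a" * 10000
print(sys.getsizeof(big_string))
```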

Modules You may be interested in

There are two modules I would recommend. The first is the csv module, which simplifies much of what you're doing and extends your ability to work with CSV files. The second is the Requests library, which I've used in these examples instead of urllib/urllib2. That is more a style choice than a critique of your code, but you might look into it and see which you prefer.

Parse HTML function

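import requests
# Assumption: find_all is the BeautifulSoup 4 spelling, so bs4 is imported here.
from bs4 import BeautifulSoup

def build_list_of_links(ebay_page_url):
    # Download the search page and parse it.
    page = requests.get(ebay_page_url).text
    soup = BeautifulSoup(page)
    list_of_links = []
    # Item titles are the <a> tags marked with itemprop="name".
    for item in soup.find_all('a', {'itemprop': 'name'}):
        list_of_links.append(item.get('href'))
    return list_of_links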


Arguably, the first useful thing your script does is parse the webpage, which is what I've amended above. You mention that your code is currently prone to crashing. This won't necessarily do anything to improve reliability but it can provide a clearer picture of where things are breaking. I have opted to build a list of links rather than writing to a file.
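
One way to see where things break is to make the link extraction testable without touching the network. The sketch below is not the BeautifulSoup approach from above: it swaps in the standard library's `html.parser` so that it runs without third-party packages (Python 3 assumed). The class name, function name and sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collect href values of <a> tags whose itemprop attribute is 'name'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("itemprop") == "name":
            href = attributes.get("href")
            if href is not None:
                self.links.append(href)


def extract_item_links(html_text):
    """Return the item links found in a string of HTML."""
    parser = ItemLinkParser()
    parser.feed(html_text)
    return parser.links


# Made-up markup, loosely imitating a search results page.
sample_html = """
<a itemprop="name" href="http://example.com/item/1">First</a>
<a href="http://example.com/help">Help</a>
<a itemprop="name" href="http://example.com/item/2">Second</a>
"""
print(extract_item_links(sample_html))
```

Because the function takes a string rather than a URL, a crash can be pinned down to either the download step or the parsing step.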

But writing those links to a CSV file could be done like this:

Writing CSV

import csv

def write_links_file(output_links_file, links_list):
    # "a" appends, so rows already in the file are kept.
    with open(output_links_file, "a") as output_file:
        link_writer = csv.writer(output_file)
        link_writer.writerow(links_list)


The main thing to note here is the use of the with statement. It simplifies file handling by removing the need to open and close the file manually. By calling

with open(....)


the file stays open inside the with block and is closed automatically when the block ends, even if an exception is raised. That makes

file = open("input.csv", "w")
file.close()


unnecessary: opening a file in "w" or "a" mode creates it if it does not already exist. And by using the "a" (append) flag, as you have done, you make sure that existing data is not overwritten.
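
To make the append behaviour concrete, here is a minimal sketch for Python 3 that writes two rows with `csv.writer` and reads them back with `csv.reader`. The helper names, the temporary directory and the example URLs are assumptions made for the demonstration; `newline=""` follows the csv module's recommendation for Python 3.

```python
import csv
import os
import tempfile


def append_links(csv_path, links):
    """Append one row of links; the file is created on first use."""
    with open(csv_path, "a", newline="") as output_file:
        csv.writer(output_file).writerow(links)
    # Leaving the with block has already closed the file.


def read_rows(csv_path):
    """Read every row back as a list of lists."""
    with open(csv_path, newline="") as input_file:
        return list(csv.reader(input_file))


# Work in a throwaway directory so the sketch leaves no files behind in the project.
csv_path = os.path.join(tempfile.mkdtemp(), "links.csv")
append_links(csv_path, ["http://example.com/item/1", "http://example.com/item/2"])
append_links(csv_path, ["http://example.com/item/3"])

rows = read_rows(csv_path)
print(rows)
```

Since the csv module handles the separators, there is no trailing comma to strip afterwards.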

To achieve the same thing that you have currently you could then just chain those two functions together:

write_links_file("/some/dir/", build_list_of_links("some.page"))


Nothing says this is the best way to do it - but it is much more apparent what is happening and where the execution is taking place.

Jumping down a bit further, I've written your image directory check into a function:

If Not vs. If/Else

def create_img_dir():
    save_images = raw_input('Do you want to download the product pictures? (y/n)\n\n')
    if save_images == "y":
        if os.path.exists('images'):
            pass
        else:
            os.mkdir('images')
    else:
        ...


This is a readability issue and my personal preference, but instead of

if not os.path.exists('images'):
    os.mkdir('images')


I would opt for:

if os.path.exists('images'):
    pass
else:
    os.mkdir('images')


which is functionally identical but reads more like what is actually happening (to me, anyway).
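
The equivalence of the two forms can be checked with a short sketch. It assumes Python 3, uses a temporary directory in place of the real `images` folder, and the function names are made up. It also shows `os.makedirs(..., exist_ok=True)`, available from Python 3.2, as a possible third option that needs no explicit check.

```python
import os
import tempfile


def ensure_dir_with_not(path):
    if not os.path.exists(path):
        os.mkdir(path)


def ensure_dir_with_else(path):
    if os.path.exists(path):
        pass
    else:
        os.mkdir(path)


base = tempfile.mkdtemp()
first = os.path.join(base, "images_a")
second = os.path.join(base, "images_b")
third = os.path.join(base, "images_c")

# Call everything twice: the second pass finds the directories
# already present and must not raise.
for _ in range(2):
    ensure_dir_with_not(first)
    ensure_dir_with_else(second)
    os.makedirs(third, exist_ok=True)

print(os.path.isdir(first), os.path.isdir(second), os.path.isdir(third))
```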

Various things to note:

- Instead of

content.__len__()


you can call

len(content)


- Writing

for i in range(0, urlinnn)


is the same as

for i in range(urlinnn)


- urlin/urlinn/urlinnn are totally inscrutable variable names.

- Writing:

print 'Oh dear.\n'
print 'Something went horribly wrong.\n'
print 'The Input file is corrupt!'
print 'Did you check the Input.csv file for mistakes?'
print 'Pay attention to double commas!'


can also be written with a single print statement like this:

print("Oh dear.\n\n\
Something went horribly wrong.\n\n\
The Input file is corrupt!\n\
Did you check the Input.csv file for mistakes?\n\
Pay attention to double commas!")
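
The points in this list can be tried out in a few lines. The sketch below assumes Python 3, and the names `page_count` and `error_message` are made up. It also shows why the question's code treats a list of length 1 as an empty input file, and it builds the message from adjacent string literals, which is one more way to avoid both the repeated print calls and the backslash continuations.

```python
# Splitting an empty string still yields one (empty) element.
content = "".split(",")
print(content)                             # ['']
print(len(content) == content.__len__())   # True

# range(0, n) and range(n) produce the same numbers.
page_count = 3
print(list(range(0, page_count)) == list(range(page_count)))  # True

# Adjacent string literals are joined by the parser, so no
# backslashes are needed and indentation does not leak into the text.
error_message = (
    "Oh dear.\n"
    "Something went horribly wrong.\n"
    "The Input file is corrupt!\n"
    "Did you check the Input.csv file for mistakes?\n"
    "Pay attention to double commas!"
)
print(error_message)
```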


You've actually done most of the hard work already in getting it to work. I think a good next step would be to go through your script and split it into logical 'partitions' (this section reads/writes CSV, this section gets webpage data, this section logs errors, etc.). That, and improving your variable names!
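
As one possible shape for such partitions, here is a sketch for Python 3 in which fetching, parsing and writing are separate functions. Every name in it is hypothetical, and the fetch and parse steps are replaced by stand-ins so that it runs offline. A real version would plug in a download function and an HTML parser instead.

```python
import csv
import io


def collect_links(page_urls, fetch, parse):
    """Fetch each search page and gather the item links it contains."""
    links = []
    for page_url in page_urls:
        try:
            links.extend(parse(fetch(page_url)))
        except Exception as error:
            # Broad on purpose in this sketch: report the page and carry on.
            print("Skipping %s: %s" % (page_url, error))
    return links


def write_links(output_file, links):
    """Write all links as a single CSV row."""
    csv.writer(output_file).writerow(links)


# Stand-ins for the real download and parsing steps.
def fake_fetch(url):
    if "broken" in url:
        raise IOError("could not download")
    return "item-for-" + url


def fake_parse(html_text):
    return [html_text.upper()]


buffer = io.StringIO()  # in-memory stand-in for the output file
links = collect_links(["page1", "broken-page", "page2"], fake_fetch, fake_parse)
write_links(buffer, links)
print(buffer.getvalue())
```

With the steps separated like this, a failing page is reported by name and the rest of the run continues.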

Context

StackExchange Code Review Q#32667, answer score: 6
