eBay scraper with BeautifulSoup
Problem
A friend of mine needs a scraper that reads product titles, prices and pictures from eBay and saves them to an Excel-ready .csv file. The program reads product pages into a list and then loops through it to get the relevant data, which it then saves to a .csv file.
```
import pdb
import time
import urllib
import urllib2
import re
import sys
import os
import shutil
import codecs
import unicodedata
from BeautifulSoup import BeautifulSoup
#debugger, just in case :S
#pdb.set_trace()
print("eBay allGrab Scraper by Ben Fishman, 2013\n")
print("\n")
#Create Input.csv
file = open("input.csv", "w")
file.close()
#Set counter to 0
i=0
urlinn = raw_input('Number of search pages you want to enter:\n')
urlinnn=int(urlinn)
for i in range(0, urlinnn):
    print "Please copy & paste search page URL #",i+1,":"
    urlin = raw_input('')
    soup = BeautifulSoup(urllib2.urlopen(urlin).read())
    for link in soup.findAll('a',{'itemprop':'name'}):
        url = link.get('href')
        file = open("Input.csv", "a")
        file.write(url)
        file.write(",")
        file.close()
#Delete the last comma so it doesn't screw up the processing later
with open("input.csv", 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
print "Done."
dlcheck = raw_input('Do you want to download the product pictures? (y/n)\n')
print("\n")
#if the images folder doesn't exist, create a new one
if dlcheck == "y":
    if not os.path.exists('images'):
        os.mkdir('images')
print("Reading input file...")
#open and read Input.csv, split at commas and create a list of URLs
try:
    with open('input.csv') as f:
        content = f.read().split(',')
    print("Input file read.")
    print("Creating output file...")
    #create the Output.csv file and write a header
    file = open("Output.csv", "w")
    file.write('Name,Price,icIMG URL\n')
    file.close()
    #get list lenght
    llen=content.__len__()
    #throw an error if the list is invalid
    if llen==1:
        print "Hey! The Input file is empty!"
        print "Please put some URLs in
```
Solution
First things first
When your scripts start to grow unruly like this it is usually a good idea to try and break them down into their component parts and compartmentalize functionality. I'm going to go through just a couple parts of your script with a few ideas that might help to 'modularize'. Doing so can help you with debugging in the future, as well as clarifying your thought process on what you are currently doing or not doing.
I haven't tried to radically re-implement what you're doing but your use of a CSV file as a container for the search page links seems a little unnecessary. Much of your script is spent doing something like:
parsed html >> CSV >> list
It could be easier to work directly on the list and the data instead, unless you want to use the CSV files for something else later. The list of HTML links won't be a memory issue unless it explodes to thousands and thousands of links (a 10k character string takes 10033 bytes on my machine).
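To get a feel for the numbers on your own machine, a rough sketch along these lines measures the in-memory size of a list of link strings. It is written for Python 3 (the script in the question is Python 2), the helper name and the example.com URLs are made up for illustration, and the byte counts depend on the Python version and platform, so none are quoted here.

```python
import sys


def approx_size_in_bytes(links):
    """Rough size of a list of strings: the list object plus each string."""
    return sys.getsizeof(links) + sum(sys.getsizeof(link) for link in links)


# Hypothetical product URLs, only to have something to measure.
links = ["http://www.example.com/itm/%d" % n for n in range(10000)]

size = approx_size_in_bytes(links)
print("%d links take roughly %.1f MB" % (len(links), size / 1e6))
```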
Modules you may be interested in
Two modules I would recommend: the csv module, which simplifies much of what you're doing and extends your ability to work with CSV files, and the Requests library, which I've used in these examples in place of urllib/urllib2. The latter is more a style choice than a critique of your code, but you might look into it and see which you prefer.
Parse HTML function
```
import requests
from bs4 import BeautifulSoup

def build_list_of_links(ebay_page_url):
    # Download the search page and parse it
    page = requests.get(ebay_page_url).text
    soup = BeautifulSoup(page, "html.parser")
    list_of_links = []
    # Product links are the anchors marked with itemprop="name"
    for item in soup.find_all('a', {'itemprop': 'name'}):
        list_of_links.append(item.get('href'))
    return list_of_links
```
Arguably, the first useful thing your script does is parse the webpage, which is what I've amended above. You mention that your code is currently prone to crashing. Moving the parsing into a function won't necessarily improve reliability, but it can give a clearer picture of where things are breaking. I have opted to build a list of links rather than writing to a file.
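One benefit of separating parsing from fetching is that the parsing step can be tried on a saved or hand-written snippet without touching the network. The sketch below does that with the standard library's html.parser in place of BeautifulSoup (a swapped-in technique, chosen only so the example has no third-party dependencies). It is Python 3; the class name, function name and sample markup are invented for illustration, and real eBay markup may well differ.

```python
from html.parser import HTMLParser


class NameLinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry itemprop="name"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("itemprop") == "name":
            href = attributes.get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    """Return the product links found in a string of HTML."""
    collector = NameLinkCollector()
    collector.feed(html_text)
    return collector.links


# Hand-written sample, not real eBay output.
SAMPLE_HTML = """
<div>
  <a itemprop="name" href="http://www.example.com/itm/1">First item</a>
  <a href="http://www.example.com/help">Help</a>
  <a itemprop="name" href="http://www.example.com/itm/2">Second item</a>
</div>
"""

print(extract_links(SAMPLE_HTML))
```

If a page later fails to yield any links, running the parser on the saved HTML alone shows whether the download or the selector is at fault.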
Writing CSV
Writing those links to a CSV file could be done like this:
```
import csv

def write_links_file(output_links_file, links_list):
    # "a" appends, and creates the file if it does not exist yet
    with open(output_links_file, "a") as output_file:
        link_writer = csv.writer(output_file)
        link_writer.writerow(links_list)
```
The main thing to note here is the use of the with statement. It simplifies operations on files by removing the need to open/close the file manually. By calling with open(....), the file remains open within the scope of the statement and is closed automatically when the block ends. That means
```
file = open("input.csv", "w")
file.close()
```
is unnecessary: simply calling with open... creates the file if it does not exist yet. By designating the "a" (append) flag, you also ensure that you don't lose any data this way (as you have done).
To achieve the same thing that you have currently, you could then just chain those two functions together:
```
write_links_file("/some/dir/", build_list_of_links("some.page"))
```
Nothing says this is the best way to do it, but it is much more apparent what is happening and where the execution is taking place.
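Returning to the CSV step, a variation that may be easier to read back later is to write one link per row rather than all links in a single row. This is only a sketch in Python 3 (where the csv documentation recommends opening files with newline=""); the function names, the file name and the URLs are hypothetical.

```python
import csv
import os
import tempfile


def append_links(csv_path, links):
    """Append one link per row; the file is created if it is missing."""
    with open(csv_path, "a", newline="") as output_file:
        writer = csv.writer(output_file)
        for link in links:
            writer.writerow([link])


def read_links(csv_path):
    """Read the links back, skipping empty rows."""
    with open(csv_path, newline="") as input_file:
        return [row[0] for row in csv.reader(input_file) if row]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "links.csv")
    append_links(path, ["http://www.example.com/itm/1"])
    append_links(path, ["http://www.example.com/itm/2,with-comma"])
    print(read_links(path))
```

Because the csv module quotes fields that contain commas, there is no trailing comma to strip and a comma inside a URL does not split it into two entries.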
If Not vs. If/Else
Jumping down a bit further, I've written your image directory check into a function:
```
import os

def create_img_dir():
    save_images = raw_input('Do you want to download the product pictures? (y/n)\n\n')
    if save_images == "y":
        if os.path.exists('images'):
            pass  # directory is already there, nothing to do
        else:
            os.mkdir('images')
    else:
        ...
```
This is a readability issue and my personal preference, but instead of
```
if not os.path.exists('images'):
    os.mkdir('images')
```
I would opt for:
```
if os.path.exists('images'):
    pass
else:
    os.mkdir('images')
```
which is functionally identical but now reads more like what is actually happening (to me anyway).
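If the prompt and the directory creation are kept in separate functions, each can be exercised on its own. The following is a sketch in Python 3, where os.makedirs accepts exist_ok=True and so makes the explicit existence check unnecessary. The function names are hypothetical, and this is one more alternative rather than a replacement for either style above.

```python
import os
import tempfile


def wants_images(answer):
    """Interpret the user's reply to the y/n question."""
    return answer.strip().lower() in ("y", "yes")


def ensure_dir(path):
    """Create the directory if needed; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    image_dir = os.path.join(tmp_dir, "images")
    if wants_images("y"):
        ensure_dir(image_dir)
        ensure_dir(image_dir)  # calling it again is harmless
    print(os.path.isdir(image_dir))
```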
Various things to note:
- Instead of content.__len__() you can call len(content).
- for i in range(0, urlinnn) is the same as for i in range(urlinnn).
- urlin/urlinn/urlinnn are totally inscrutable variable names.
- Writing:
```
print 'Oh dear.\n'
print 'Something went horribly wrong.\n'
print 'The Input file is corrupt!'
print 'Did you check the Input.csv file for mistakes?'
print 'Pay attention to double commas!'
```
can also be written with a single print statement like this (apart from the extra blank lines that the trailing \n produced):
```
print("Oh dear.\n\
Something went horribly wrong.\n\
The Input file is corrupt!\n\
Did you check the Input.csv file for mistakes?\n\
Pay attention to double commas!")
```
You've actually done most of the hard work already in getting it to work. I think a good next step would be to go through your script and adjust things into logical 'partitions' (this section reads/writes CSV, this section gets webpage data, this section logs errors, etc.). That and improving your variable names!
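As one possible way to act on the notes about variable names and the multi-line message, the sketch below (Python 3) uses descriptive names for the input loop and builds the error text from a list of lines. All names here are invented for illustration, and the prompt function is passed in as a parameter only so the loop can be run without a keyboard.

```python
ERROR_LINES = [
    "Oh dear.",
    "Something went horribly wrong.",
    "The Input file is corrupt!",
    "Did you check the Input.csv file for mistakes?",
    "Pay attention to double commas!",
]


def format_error(lines):
    """Join the message lines so a single print call shows them all."""
    return "\n".join(lines)


def collect_search_urls(page_count, ask=input):
    """Ask for page_count search page URLs and return them as a list."""
    search_urls = []
    for page_number in range(1, page_count + 1):
        prompt = "Please copy & paste search page URL #%d:\n" % page_number
        search_urls.append(ask(prompt).strip())
    return search_urls


print(format_error(ERROR_LINES))
```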
Context
StackExchange Code Review Q#32667, answer score: 6