Parsing a website
Problem
Following is the code I wrote to download the information of different items in a page.
I have one main website which has links to different items. I parse this main page to get the list; this is handled by the `Items` class. I also parse each of the links in the list using the `Item` class. I have implemented a `Handler` class, which is the base class for both of these classes.
```
class Handler:
    def __init__(self, url):
        self.url = url
        self.property = {}
        self.homeDir = os.path.dirname(__file__)
        self.parser = self.getParser()
        self.name = self.getTitle()
        self.setupFolder()

    def updateName(self, name):
        self.name = name

    def setupFolder(self):
        dataDir = os.path.join(self.homeDir, self.name)
        if not os.path.exists(dataDir):
            os.makedirs(dataDir)

    def getTitle(self):
        return "".join(char
                       for char in self.parser.title.string
                       if char.isalnum() or char == " ")

    def getFilePath(self):
        return os.path.join(self.homeDir, self.name)

    def getRequest(self, url):
        return urllib.urlopen(url).read()

    def getParser(self):
        parser = BeautifulSoup(self.getRequest(self.url))
        return parser

    def saveProperty(self, key, value):
        self.property[key] = value

    def writeProperty(self):
        fileName = self.getTitle() + ".property"
        with open(fileName, 'w') as f:
            f.write("\n".join(
                key + ":" + self.property[key]
                for key in self.property))

class Items(Handler, object):
    def __init__(self, url, category) :
        super(Items, self).__init__(url)
        self.category = category
        self.updateName(category)

    def extractContents(self):
        self.parser = self.getParser()
        contents = self.parser.find("ul",{"class" : "galerie"}).findAll('li')
        print len(contents)
        return contents

    def downloadContents(self):
        # (remainder truncated in the source)
```
Solution
The general naming style in Python is `snake_case` for functions and variables, and `PascalCase` for classes. You should also have two blank lines between top-level functions, classes, and code blocks, not an arbitrary amount. You have a few other style violations. To fix these, consult PEP 8, Python's official style guide.
Secondly, it looks like you're using Python 2.x. If so, your classes need to inherit explicitly from `object`. For example: `class MyClass(object):`, not `class MyClass:`. If you are using Python 3.x, you can use the second form.
Finally, you can use string multiplication to print many characters. For example, the line `print "==============================="` can be shortened to `print "=" * 31`.
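As a rough illustration of the three points in this answer, the sketch below renames a few of the question's methods to `snake_case`, inherits explicitly from `object`, and builds the separator line with string multiplication. `PageHandler`, `home_dir`, `make_separator`, and the example URL are hypothetical names chosen for illustration, not part of the original code. The downloading and parsing logic is left out so the snippet runs on its own. It is written for Python 3 and should also run on Python 2.

```python
import os


class PageHandler(object):  # explicit `object` base, required for new-style classes on Python 2
    """Hypothetical stand-in for the question's Handler, renamed per PEP 8."""

    def __init__(self, url, name, home_dir="."):
        self.url = url
        self.name = name
        self.home_dir = home_dir  # was homeDir
        self.properties = {}

    def save_property(self, key, value):  # was saveProperty
        self.properties[key] = value

    def get_file_path(self):  # was getFilePath
        return os.path.join(self.home_dir, self.name)


def make_separator(width=31, char="="):
    """Build a separator line with string multiplication instead of a long literal."""
    return char * width


# Assumed example values; no network access is performed.
handler = PageHandler("http://example.com/", "books")
handler.save_property("category", "books")
print(make_separator())
print(handler.get_file_path())
```

Keeping the separator width as a parameter also makes it easy to change later without counting characters by hand.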
Context
StackExchange Code Review Q#49754, answer score: 2