patternpythonMinor
Scraping names of directors from a website
Viewed 0 times
websitedirectorsscrapingnamesfrom
Problem
I am scraping the names of the directors from a website using Python / ScraPy. I am very new to coding (under a year and after work) - any views would be appreciated.
The reason I have a
```
import scrapy
import re
from CompaniesHouse.items import CompanieshouseItem
class CompaniesHouseSpider(scrapy.Spider):
name = "companieshouse"
allowed_domains = ["companieshouse.gov.uk"]
start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]
def parse(self, response):
for count in range(0,100):
for sel in response.xpath('//*[@id="content-container"]'):
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
string1 = "officer-name-" + str(count)
names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
names = [name.strip() for name in names]
namerefs = sel.xpath('//[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).?(?=/appointments)')
namerefs = [nameref.strip() for nameref in namerefs]
string2 = "officer-role-" + str(count)
roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
roles = [role.strip() for role in roles]
string3 = "officer-date-of-birth-" + str(count)
if sel.xpath('//*[@id="%s"]/text()' %string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
result = zip(companys, names,
The reason I have a
for loop with count from 0 to 100 is that not all the names on the website have a date of birth, and hence where there are blanks I need to return a value ("n/a" in this case) otherwise the lists of names / namerefs / roles / dateofbirths will get out of order.```
import scrapy
import re
from CompaniesHouse.items import CompanieshouseItem
class CompaniesHouseSpider(scrapy.Spider):
name = "companieshouse"
allowed_domains = ["companieshouse.gov.uk"]
start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]
def parse(self, response):
for count in range(0,100):
for sel in response.xpath('//*[@id="content-container"]'):
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
string1 = "officer-name-" + str(count)
names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
names = [name.strip() for name in names]
namerefs = sel.xpath('//[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).?(?=/appointments)')
namerefs = [nameref.strip() for nameref in namerefs]
string2 = "officer-role-" + str(count)
roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
roles = [role.strip() for role in roles]
string3 = "officer-date-of-birth-" + str(count)
if sel.xpath('//*[@id="%s"]/text()' %string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
result = zip(companys, names,
Solution
DRY
Reduce duplicated logic using helper functions.
Currently you have 2 lines of code for each field you extract,
for example:
This is tedious. You could capture the common logic in a helper function, for example:
With that, much of the code can be simplified:
Repeated operations
Here, an xpath lookup is performed twice:
It would be better to avoid that:
Use
The
Formatting
Python has a style guide called PEP8, I suggest to follow it.
Reduce duplicated logic using helper functions.
Currently you have 2 lines of code for each field you extract,
for example:
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]This is tedious. You could capture the common logic in a helper function, for example:
def to_list(xpath):
return [v.strip() for v in xpath.extract()]With that, much of the code can be simplified:
companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())Repeated operations
Here, an xpath lookup is performed twice:
if sel.xpath('//*[@id="%s"]/text()' % string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]It would be better to avoid that:
dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
dateofbirths = ["n/a"]Use
"...".format(...)The
"%s" % ... style formatting is old, it's recommended to use the format function instead, for example:names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()Formatting
Python has a style guide called PEP8, I suggest to follow it.
Code Snippets
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]def to_list(xpath):
return [v.strip() for v in xpath.extract()]companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())if sel.xpath('//*[@id="%s"]/text()' % string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
dateofbirths = ["n/a"]Context
StackExchange Code Review Q#150339, answer score: 3
Revisions (0)
No revisions yet.