HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Scraping names of directors from a website

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
websitedirectorsscrapingnamesfrom

Problem

I am scraping the names of the directors from a website using Python / ScraPy. I am very new to coding (under a year and after work) - any views would be appreciated.

The reason I have a for loop with count from 0 to 100 is that not all the names on the website have a date of birth, and hence where there are blanks I need to return a value ("n/a" in this case) otherwise the lists of names / namerefs / roles / dateofbirths will get out of order.

```
import scrapy
import re

from CompaniesHouse.items import CompanieshouseItem

class CompaniesHouseSpider(scrapy.Spider):
name = "companieshouse"
allowed_domains = ["companieshouse.gov.uk"]
start_urls = ["https://beta.companieshouse.gov.uk/company/OC361003/officers",
]

def parse(self, response):
for count in range(0,100):
for sel in response.xpath('//*[@id="content-container"]'):
companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]
string1 = "officer-name-" + str(count)
names = sel.xpath('//*[@id="%s"]/a/text()' %string1).extract()
names = [name.strip() for name in names]
namerefs = sel.xpath('//[@id="%s"]/a/@href' %string1).re(r'(?<=/officers/).?(?=/appointments)')
namerefs = [nameref.strip() for nameref in namerefs]
string2 = "officer-role-" + str(count)
roles = sel.xpath('//*[@id="%s"]/text()' %string2).extract()
roles = [role.strip() for role in roles]
string3 = "officer-date-of-birth-" + str(count)
if sel.xpath('//*[@id="%s"]/text()' %string3):
dateofbirths = sel.xpath('//*[@id="%s"]/text()' %string3).extract()
else:
dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
result = zip(companys, names,

Solution

DRY

Reduce duplicated logic using helper functions.
Currently you have 2 lines of code for each field you extract,
for example:

companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]

names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]


This is tedious. You could capture the common logic in a helper function, for example:

def to_list(xpath):
    return [v.strip() for v in xpath.extract()]


With that, much of the code can be simplified:

companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())


Repeated operations

Here, an xpath lookup is performed twice:

if sel.xpath('//*[@id="%s"]/text()' % string3):
    dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
    dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]


It would be better to avoid that:

dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
    dateofbirths = ["n/a"]


Use "...".format(...)

The "%s" % ... style formatting is old, it's recommended to use the format function instead, for example:

names = sel.xpath('//*[@id="{}"]/a/text()'.format(string1)).extract()


Formatting

Python has a style guide called PEP8, I suggest to follow it.

Code Snippets

companys = sel.xpath('//*[@id="company-name"]/text()').extract()
companys = [company.strip() for company in companys]

names = sel.xpath('//*[@id="%s"]/a/text()' % string1).extract()
names = [name.strip() for name in names]
def to_list(xpath):
    return [v.strip() for v in xpath.extract()]
companys = to_list(sel.xpath('//*[@id="company-name"]/text()'))
names = to_list(sel.xpath('//*[@id="%s"]/a/text()' % string1).extract())
if sel.xpath('//*[@id="%s"]/text()' % string3):
    dateofbirths = sel.xpath('//*[@id="%s"]/text()' % string3).extract()
else:
    dateofbirths = ["n/a"]
dateofbirths = [dateofbirth.strip() for dateofbirth in dateofbirths]
dateofbirths = to_list(sel.xpath('//*[@id="%s"]/text()' % string3))
if not dateofbirths:
    dateofbirths = ["n/a"]

Context

StackExchange Code Review Q#150339, answer score: 3

Revisions (0)

No revisions yet.