HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Simple Python job vacancies downloader

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
simpledownloadervacanciespythonjob

Problem

I have created a BeautifulSoup vacancies parser, which works, but I do not like how it looks. Therefore, I'd be very happy if somebody could give me any improvements.

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2')
soup = BeautifulSoup(page, "lxml")
vacancies = {}

a = soup.find_all("a", class_="t")
inx = 0
for x in a:
    inx += 1
    vacancies[inx] = {'position': x.contents[0].strip(),
                      'link': 'http://rabota.ua{}'.format(x.get('href'))}

a = soup.find_all("a", class_="rua-p-c-default")[2:]
inx = 0
for x in a:
    inx += 1
    vacancies[inx].update({'company': x.get_text()})

for x in vacancies:
    print('{}\n{}\n{}\n'.format(vacancies[x]['company'], vacancies[x]['position'], vacancies[x]['link']))


Here is the HTML code:

Solution

A couple of suggestions:

  • Use requests instead of urllib. This is mostly a matter of preference but I think you'll find it easier to work with.



An example on how this would look is:

import requests

page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content


  • Fix the bug



In some cases you'll get:


Traceback (most recent call last):

http://rabota.ua/company209907/vacancy6312501
vacancies[x]['link'])) UnicodeEncodeError: 'ascii' codec can't encode characters in position 26-33: ordinal not in range(128)

Which means you're trying to parse characters which cannot be decoded as they are. I'd recommend using the encode function to get rid of it:

for x in vacancies:
    print('{}\n{}\n{}\n'.format(
        vacancies[x]['company'].encode('ascii', 'ignore'),
        vacancies[x]['position'].encode('ascii', 'ignore'),
        vacancies[x]['link'].encode('ascii', 'ignore')))


Instead of using a counter, you could use the builtin that Python already has: enumerate().

This:

inx = 0
for x in a:
    inx += 1
    vacancies[inx].update({'company': x.get_text()})


Will become:

for counter, x in enumerate(a, 1):
    vacancies[counter].update({'company': x.get_text()})


So far, we have this:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content
soup = BeautifulSoup(page, "lxml")

a = soup.find_all("a", class_="t")

vacancies = {}
for counter, x in enumerate(a, 1):
    vacancies[counter] = {
        'position': x.contents[0].strip(),
        'link': 'http://rabota.ua{}'.format(x.get('href'))
    }

a = soup.find_all("a", class_="rua-p-c-default")[2:]

for counter, x in enumerate(a, 1):
    vacancies[counter].update({'company': x.get_text()})

for x in vacancies:
    print('{}\n{}\n{}\n'.format(
        vacancies[x]['company'].encode('ascii', 'ignore'),
        vacancies[x]['position'].encode('ascii', 'ignore'),
        vacancies[x]['link'].encode('ascii', 'ignore')))


Other suggestions:

I'm not a fan of your naming conventions:

  • a might become href_tag



  • x might become element



More, I'd split everything into functions as this will make your code readable and easier to maintain:

from bs4 import BeautifulSoup
import requests

URL = 'http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2'

def get_html():
    return BeautifulSoup(requests.get(URL).content, 'lxml')

def parse_html():
    content = get_html()
    href_tag = content.find_all('a', class_='t')

    vacancies = {}
    for counter, element in enumerate(href_tag, 1):
        vacancies[counter] = {
            'position': element.contents[0].strip(),
            'link': 'http://rabota.ua{}'.format(element.get('href'))
        }

    href_tag = content.find_all("a", class_="rua-p-c-default")[2:]

    for counter, element in enumerate(href_tag, 1):
        vacancies[counter].update({'company': element.get_text()})

    for element in vacancies:
        print('{}\n{}\n{}\n'.format(
            vacancies[element]['company'].encode('ascii', 'ignore'),
            vacancies[element]['position'].encode('ascii', 'ignore'),
            vacancies[element]['link'].encode('ascii', 'ignore'))
        )

if __name__ == '__main__':
    parse_html()


Yet another way of splitting the logic of your program:

from bs4 import BeautifulSoup
import requests

URL = 'http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2'

def get_html():
    return BeautifulSoup(requests.get(URL).content, 'lxml')

def get_position_and_link(content):
    href_tag = content.find_all('a', class_='t')

    vacancies = {}
    for counter, element in enumerate(href_tag, 1):
        vacancies[counter] = {
            'position': element.contents[0].strip(),
            'link': 'http://rabota.ua{}'.format(element.get('href'))
        }

    return vacancies

def update_info(content):
    company_href_tag = content.find_all("a", class_="rua-p-c-default")[2:]
    vacancies = get_position_and_link(content)

    for counter, element in enumerate(company_href_tag, 1):
        vacancies[counter].update({'company': element.get_text()})
    return vacancies

def main():
    content = get_html()
    vacancies = update_info(content)

    for element in vacancies:
        print('{}\n{}\n{}\n'.format(
            vacancies[element]['company'].encode('ascii', 'ignore'),
            vacancies[element]['position'].encode('ascii', 'ignore'),
            vacancies[element]['link'].encode('ascii', 'ignore'))
        )

if __name__ == '__main__':
    main()


You can see that I also added if __name__ == '__main__'. By doing the main check, you can have that code only execute when you want to run the module as a program and not have it execute when someone just wants to import your module and call your functions themselves.

As @Mathias suggested in comments, instead of using the encode() function, you might as well do:

```
de

Code Snippets

import requests

page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content
for x in vacancies:
    print('{}\n{}\n{}\n'.format(
        vacancies[x]['company'].encode('ascii', 'ignore'),
        vacancies[x]['position'].encode('ascii', 'ignore'),
        vacancies[x]['link'].encode('ascii', 'ignore')))
inx = 0
for x in a:
    inx += 1
    vacancies[inx].update({'company': x.get_text()})
for counter, x in enumerate(a, 1):
    vacancies[counter].update({'company': x.get_text()})
from bs4 import BeautifulSoup
import requests

page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content
soup = BeautifulSoup(page, "lxml")

a = soup.find_all("a", class_="t")

vacancies = {}
for counter, x in enumerate(a, 1):
    vacancies[counter] = {
        'position': x.contents[0].strip(),
        'link': 'http://rabota.ua{}'.format(x.get('href'))
    }

a = soup.find_all("a", class_="rua-p-c-default")[2:]

for counter, x in enumerate(a, 1):
    vacancies[counter].update({'company': x.get_text()})

for x in vacancies:
    print('{}\n{}\n{}\n'.format(
        vacancies[x]['company'].encode('ascii', 'ignore'),
        vacancies[x]['position'].encode('ascii', 'ignore'),
        vacancies[x]['link'].encode('ascii', 'ignore')))

Context

StackExchange Code Review Q#146384, answer score: 11

Revisions (0)

No revisions yet.