patternpythonModerate
Simple Python job vacancies downloader
Viewed 0 times
simpledownloadervacanciespythonjob
Problem
I have created a BeautifulSoup vacancies parser, which works, but I do not like how it looks. Therefore, I'd be very happy if somebody could give me any improvements.
Here is the HTML code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2')
soup = BeautifulSoup(page, "lxml")
vacancies = {}
a = soup.find_all("a", class_="t")
inx = 0
for x in a:
inx += 1
vacancies[inx] = {'position': x.contents[0].strip(),
'link': 'http://rabota.ua{}'.format(x.get('href'))}
a = soup.find_all("a", class_="rua-p-c-default")[2:]
inx = 0
for x in a:
inx += 1
vacancies[inx].update({'company': x.get_text()})
for x in vacancies:
print('{}\n{}\n{}\n'.format(vacancies[x]['company'], vacancies[x]['position'], vacancies[x]['link']))Here is the HTML code:
Solution
A couple of suggestions:
An example on how this would look is:
In some cases you'll get:
Traceback (most recent call last):
http://rabota.ua/company209907/vacancy6312501
vacancies[x]['link']))
Which means you're trying to parse characters which cannot be decoded as they are. I'd recommend using the
Instead of using a counter, you could use the builtin that Python already has:
This:
Will become:
So far, we have this:
Other suggestions:
I'm not a fan of your naming conventions:
More, I'd split everything into functions as this will make your code readable and easier to maintain:
Yet another way of splitting the logic of your program:
You can see that I also added
As @Mathias suggested in comments, instead of using the
```
de
- Use
requestsinstead ofurllib. This is mostly a matter of preference but I think you'll find it easier to work with.
An example on how this would look is:
import requests
page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content- Fix the bug
In some cases you'll get:
Traceback (most recent call last):
http://rabota.ua/company209907/vacancy6312501
vacancies[x]['link']))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 26-33: ordinal not in range(128)Which means you're trying to parse characters which cannot be decoded as they are. I'd recommend using the
encode function to get rid of it:for x in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[x]['company'].encode('ascii', 'ignore'),
vacancies[x]['position'].encode('ascii', 'ignore'),
vacancies[x]['link'].encode('ascii', 'ignore')))Instead of using a counter, you could use the builtin that Python already has:
enumerate().This:
inx = 0
for x in a:
inx += 1
vacancies[inx].update({'company': x.get_text()})Will become:
for counter, x in enumerate(a, 1):
vacancies[counter].update({'company': x.get_text()})So far, we have this:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content
soup = BeautifulSoup(page, "lxml")
a = soup.find_all("a", class_="t")
vacancies = {}
for counter, x in enumerate(a, 1):
vacancies[counter] = {
'position': x.contents[0].strip(),
'link': 'http://rabota.ua{}'.format(x.get('href'))
}
a = soup.find_all("a", class_="rua-p-c-default")[2:]
for counter, x in enumerate(a, 1):
vacancies[counter].update({'company': x.get_text()})
for x in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[x]['company'].encode('ascii', 'ignore'),
vacancies[x]['position'].encode('ascii', 'ignore'),
vacancies[x]['link'].encode('ascii', 'ignore')))Other suggestions:
I'm not a fan of your naming conventions:
amight becomehref_tag
xmight becomeelement
More, I'd split everything into functions as this will make your code readable and easier to maintain:
from bs4 import BeautifulSoup
import requests
URL = 'http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2'
def get_html():
return BeautifulSoup(requests.get(URL).content, 'lxml')
def parse_html():
content = get_html()
href_tag = content.find_all('a', class_='t')
vacancies = {}
for counter, element in enumerate(href_tag, 1):
vacancies[counter] = {
'position': element.contents[0].strip(),
'link': 'http://rabota.ua{}'.format(element.get('href'))
}
href_tag = content.find_all("a", class_="rua-p-c-default")[2:]
for counter, element in enumerate(href_tag, 1):
vacancies[counter].update({'company': element.get_text()})
for element in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[element]['company'].encode('ascii', 'ignore'),
vacancies[element]['position'].encode('ascii', 'ignore'),
vacancies[element]['link'].encode('ascii', 'ignore'))
)
if __name__ == '__main__':
parse_html()Yet another way of splitting the logic of your program:
from bs4 import BeautifulSoup
import requests
URL = 'http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2'
def get_html():
return BeautifulSoup(requests.get(URL).content, 'lxml')
def get_position_and_link(content):
href_tag = content.find_all('a', class_='t')
vacancies = {}
for counter, element in enumerate(href_tag, 1):
vacancies[counter] = {
'position': element.contents[0].strip(),
'link': 'http://rabota.ua{}'.format(element.get('href'))
}
return vacancies
def update_info(content):
company_href_tag = content.find_all("a", class_="rua-p-c-default")[2:]
vacancies = get_position_and_link(content)
for counter, element in enumerate(company_href_tag, 1):
vacancies[counter].update({'company': element.get_text()})
return vacancies
def main():
content = get_html()
vacancies = update_info(content)
for element in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[element]['company'].encode('ascii', 'ignore'),
vacancies[element]['position'].encode('ascii', 'ignore'),
vacancies[element]['link'].encode('ascii', 'ignore'))
)
if __name__ == '__main__':
main()You can see that I also added
if __name__ == '__main__'. By doing the main check, you can have that code only execute when you want to run the module as a program and not have it execute when someone just wants to import your module and call your functions themselves.As @Mathias suggested in comments, instead of using the
encode() function, you might as well do:```
de
Code Snippets
import requests
page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').contentfor x in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[x]['company'].encode('ascii', 'ignore'),
vacancies[x]['position'].encode('ascii', 'ignore'),
vacancies[x]['link'].encode('ascii', 'ignore')))inx = 0
for x in a:
inx += 1
vacancies[inx].update({'company': x.get_text()})for counter, x in enumerate(a, 1):
vacancies[counter].update({'company': x.get_text()})from bs4 import BeautifulSoup
import requests
page = requests.get('http://rabota.ua/zapros/python/%D0%BA%D0%B8%D0%B5%D0%B2').content
soup = BeautifulSoup(page, "lxml")
a = soup.find_all("a", class_="t")
vacancies = {}
for counter, x in enumerate(a, 1):
vacancies[counter] = {
'position': x.contents[0].strip(),
'link': 'http://rabota.ua{}'.format(x.get('href'))
}
a = soup.find_all("a", class_="rua-p-c-default")[2:]
for counter, x in enumerate(a, 1):
vacancies[counter].update({'company': x.get_text()})
for x in vacancies:
print('{}\n{}\n{}\n'.format(
vacancies[x]['company'].encode('ascii', 'ignore'),
vacancies[x]['position'].encode('ascii', 'ignore'),
vacancies[x]['link'].encode('ascii', 'ignore')))Context
StackExchange Code Review Q#146384, answer score: 11
Revisions (0)
No revisions yet.