patternpythonMinor
Trivago hotels price checker
Viewed 0 times
pricetrivagocheckerhotels
Problem
I've decided to write my first project in Python. I would like to hear some opinion from you.
Description of the script:
Settings:
I wrote this script to learn how website scraping in Python works, and how to use SQL. I also wanted to learn something about object-oriented programming.
```
__author__ = ''
import datetime
import pymysql
import lxml.html as lh
import re
import sys
from selenium import webdriver
class TrivagoPriceChecker():
from_year = ''
from_month = ''
from_day = ''
days_number = ''
city_id = ''
hotel_id = ''
result_file = ''
browser = webdriver.PhantomJS()
def __init__(self):
print("Trivago Price Checker ver 1.0")
def generate_url(self):
from_date = datetime.date(int(self.from_year), int(self.from_month), int(self.from_day))
to_date = datetime.date(int(self.from_year),int(self.from_month),int(self.from_day)) + datetime.timedelta(days=int(self.days_number))
url_list = []
while(from_date < to_date):
day_plus = from_date + datetime.timedelta(days=1)
url = 'http://www.trivago.pl/?aDateRange%5Barr%5D=' + str(from_date) + '&aDateRange%5Bdep%5D=' + str(day_plus) + '&iRoomType=7&iPathId=' + str(self.city_id) + '&iGeoDistanceItem=' + str(self.hotel_id) + '&iViewType=0&bIsSeoPage=false&bIsSitemap=false&'
url_list.append(url)
from_date += datetime.timedelta(days=1)
return url_list
def get_hotel_price(self, hotel_url):
self.browser.get(hotel_url)
content = self.browser.page_source
website = lh.fromstring(content)
for price in website.
Description of the script:
- Generate Trivago URLs for 5 star hotels in specified city.
- Scrape these URLs to get prices.
- Save results as SQL queries in text file
- Execute queries from text file to upload results to database.
Settings:
- The date from which the script is going to check prices.
- Number of days to check.
- City to check.
- Name of file with results of the script.
I wrote this script to learn how website scraping in Python works, and how to use SQL. I also wanted to learn something about object-oriented programming.
```
__author__ = ''
import datetime
import pymysql
import lxml.html as lh
import re
import sys
from selenium import webdriver
class TrivagoPriceChecker():
from_year = ''
from_month = ''
from_day = ''
days_number = ''
city_id = ''
hotel_id = ''
result_file = ''
browser = webdriver.PhantomJS()
def __init__(self):
print("Trivago Price Checker ver 1.0")
def generate_url(self):
from_date = datetime.date(int(self.from_year), int(self.from_month), int(self.from_day))
to_date = datetime.date(int(self.from_year),int(self.from_month),int(self.from_day)) + datetime.timedelta(days=int(self.days_number))
url_list = []
while(from_date < to_date):
day_plus = from_date + datetime.timedelta(days=1)
url = 'http://www.trivago.pl/?aDateRange%5Barr%5D=' + str(from_date) + '&aDateRange%5Bdep%5D=' + str(day_plus) + '&iRoomType=7&iPathId=' + str(self.city_id) + '&iGeoDistanceItem=' + str(self.hotel_id) + '&iViewType=0&bIsSeoPage=false&bIsSitemap=false&'
url_list.append(url)
from_date += datetime.timedelta(days=1)
return url_list
def get_hotel_price(self, hotel_url):
self.browser.get(hotel_url)
content = self.browser.page_source
website = lh.fromstring(content)
for price in website.
Solution
try:
query.execute(line)
progress += 1
print(progress)
except:
passHere you used a bare (with nothing next to it) except, this is considered bad practice because if an unexpected error happens, it will be silently ignored, instead use;
except TheExceptionIThinkWillHappen:print("City not found!")
exit()This looks like an error, please use:
raise ValueError("City not found")Possible bug
---
for price in website.xpath('//*[@id="js_item_' + str(self.hotel_id) + '"]/div[1]/div[2]/div[2]/strong[2]'):
return price.text
This returns only the first price because
return terminates the function, is this intended or a bug? (If you want all the prices you should use yield).If you want to follow PEP8 remember to separate the standard library imports from the third library imports, so you should write:
# The other imports
from selenium import webdriver
import pymysqlThis allows the users of your script to check quickly if they have or not the required 3-rd part modeles.
url_list = []
while(from_date < to_date):
day_plus = from_date + datetime.timedelta(days=1)
url = 'http://www.trivago.pl/?aDateRange%5Barr%5D=' + str(from_date) + '&aDateRange%5Bdep%5D=' + str(day_plus) + '&iRoomType=7&iPathId=' + str(self.city_id) + '&iGeoDistanceItem=' + str(self.hotel_id) + '&iViewType=0&bIsSeoPage=false&bIsSitemap=false&'
url_list.append(url)
from_date += datetime.timedelta(days=1)In general the
while loop should be avoided, I would use a for loop:url = "http://www.trivago.pl/?aDateRange%5Barr%5D={}&aDateRange%5Bdep%5D={}&iRoomType=7&iPathId={}&iGeoDistanceItem={}&iViewType=0&bIsSeoPage=false&bIsSitemap=false&"
url_list = []
day = datetime.timedelta(days=1)
for date in range(from_date,to_date,day):
url_list.append(url.format(date,date + day,
self.city_id,self.hotel_id))or even (but you may not want to compress so much):
return [url.format(date,date + day,
self.city_id,self.hotel_id) for date in range(from_date,to_date,day)]for price in website.xpath('//*[@id="js_item_' + str(self.hotel_id) + '"]/div[1]/div[2]/div[2]/strong[2]'):
return price.textreturn is executed only once but the for loops usually does a thing many times, I would have written:prices = [ price.text for price in website.xpath('//*[@id="js_item_' + str(self.hotel_id) + '"]/div[1]/div[2]/div[2]/strong[2]')]
return prices[0]so that the reader immediately understands that you want to return only the first item.
Veedrac noted that the solution above is too slow, I think the best thing to do here is just adding a comment to make explicit that you only take one item
for price in website.xpath('//*[@id="js_item_' + str(self.hotel_id) + '"]/div[1]/div[2]/div[2]/strong[2]'):
# Return the first price found ONLY
return price.textfile = open(self.result_file, "a")This line contains a problem:
fileis abuilt-innever re-assign (give another value to) a built-in as a comment noted, in Python 3 it is not a built-in so don't worry about shadowing it.
- Opening and closing manually is obsolete, you should use
:
with open(self.result_file,"a") as f:
# some code
f.write(value)Closing is handled automatically so you will not be in danger of forgetting it.
Code Snippets
try:
query.execute(line)
progress += 1
print(progress)
except:
passexcept TheExceptionIThinkWillHappen:print("City not found!")
exit()raise ValueError("City not found")# The other imports
from selenium import webdriver
import pymysqlContext
StackExchange Code Review Q#77324, answer score: 5
Revisions (0)
No revisions yet.