Selenium-based link checker for a shopping site
Problem
I just started learning Python and wrote my first useful script for work. I did a bunch of the basic tutorials and have really enjoyed learning Python so far.
I am looking for any advice on how to make things more Pythonic. What areas can I improve going forward? I want to make this script better and move on to my next project, but I don't want to build on bad fundamentals.
My script works and I am using it. It visits a set of pages with the Selenium webdriver, pulls all the links into a list, and deletes the duplicates. Then I use the requests module to verify a 200 response code for each link.
I incorporated multiprocessing because the first version took way too long, over 5 hours to scan 7300 links. That got the run time down to about an hour.
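The dedupe-then-check pipeline described above can be sketched like this (a hedged sketch, not the script itself; `dedupe` and `check_link` are illustrative names, and `requests.head` is used instead of `requests.get` as a common speed-up, since only the status code is needed):

```python
import requests

def dedupe(links):
    # dict.fromkeys removes duplicates while preserving first-seen order (Python 3.7+)
    return list(dict.fromkeys(links))

def check_link(url, timeout=10):
    # a HEAD request returns the status code without downloading the page body
    try:
        return url, requests.head(url, allow_redirects=True, timeout=timeout).status_code
    except requests.RequestException as exc:
        return url, str(exc)
```

With thousands of links, most of the time is network latency, which is why fanning the checks out over a pool of workers helps so much.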
CustomFunctions.py
```
import requests
from selenium import webdriver
import time
import multiprocessing

def get_links(x):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-application-cache')
    driver = webdriver.Chrome('/Desktop/project/SiteCheck/LinkCheckV03/chromedriver', chrome_options=chrome_options)
    driver.get(x)
    links = driver.find_elements_by_xpath('//*[@href]')
    time.sleep(4)
    return links

def check_links(links):
    try:
        r = requests.get(links)
        rc = r.status_code
        strRc = str(rc)
        result = links, strRc
        return result
    except Exception as e:
        logz = open('exception.log', 'w')
        logz.write(str(e) + '\n')

def main(func, mlist):
    pool = multiprocessing.Pool(4)
    results = pool.map(func, mlist)
    pool.close()
    pool.join()
    return results
```
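One subtlety in check_links worth knowing about: opening exception.log with mode 'w' truncates the file on every failing link, so only the last exception of a run survives. A hedged sketch of the same handler using append mode and a with block (the timeout value is an illustrative addition, not in the original):

```python
import requests

def check_links(link):
    try:
        r = requests.get(link, timeout=10)
        return link, str(r.status_code)
    except Exception as e:
        # mode 'a' appends, so every failure in the run is kept;
        # the with block guarantees the file handle is closed
        with open('exception.log', 'a') as logz:
            logz.write('{}: {}\n'.format(link, e))
```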
LinkCheck.py
```
from CustomFunctions import get_links, check_links, main
import fileinput
import sys

# redirecting stdout to file
old_stdout = sys.stdout
log_file = open("output.log", "w")
sys.stdout = log_file

# Gateways
home = 'http://www.bonton.com'
#brands = 'http://www.bonton.com/sc1/brands/'
women = 'http://www.bonton.com/sc1/women/'
shoe
```
Solution
Your LinkCheck.py can be greatly simplified:

```
from CustomFunctions import get_links, check_links, main
import fileinput
import sys

# redirecting stdout to file
old_stdout = sys.stdout
log_file = open("output.log", "w")
sys.stdout = log_file

# Gateways
gateways = {'Homepage': 'http://www.bonton.com',
            ...,
            'Handbags&Accessories GW': 'http://www.bonton.com/sc1/handbags-accessories/',
            ...}

# Fetch links from each gateway
all_links = []
for gateway, url in gateways.items():
    links = get_links(url)
    print('Total number of links on {}: {}'.format(gateway, len(links)))
    all_links.extend(link.get_attribute('href') for link in links)

# Print link totals and get rid of duplicates
print('Total number of links before duplicates are removed:', len(all_links))
all_links = list(set(all_links))
print('Total number of links after duplicates are removed: ', len(all_links))

# Execute the check_links commands with multiprocessing
if __name__ == '__main__':
    logd = open('linklist.log', 'w')
    line = main(check_links, all_links)
    for items in line:
        logd.write(str(items) + '\n')
    logd.close()

# Only print links that do not have a 200 response code
with open('linklist.log', 'r') as searchfile:
    for line in searchfile:
        if '200' not in line:
            print(line)

# ending stdout logging to file
sys.stdout = old_stdout
log_file.close()
```

That being said, you should have a look at Python's official style-guide, PEP8. Also, it is quite weird to have code (especially including print) outside of the if __name__ == '__main__': guard, so I would move it inside of it (or even inside of a main function), unless this interferes with the multiprocessing.
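The guard structure the answer recommends might look like this (a minimal sketch; check is an illustrative stand-in for the real requests-based link check; with multiprocessing the guard matters because platforms that spawn workers re-import the module):

```python
import multiprocessing

def check(url):
    # stand-in for the real requests-based status check
    return url, '200'

def main():
    links = ['http://example.com/a', 'http://example.com/b']
    with multiprocessing.Pool(2) as pool:
        results = pool.map(check, links)
    # only report links whose status is not 200
    for url, status in results:
        if status != '200':
            print(url, status)

if __name__ == '__main__':
    # everything with side effects lives behind the guard, so worker
    # processes importing this module do not re-run it
    main()
```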
Context
StackExchange Code Review Q#161653, answer score: 4