HiveBrain v1.2.0
pattern | python | Minor

Selenium-based link checker for a shopping site

Submitted by: @import:stackexchange-codereview
Tags: shopping, selenium, site, checker, link

Problem

I just started learning Python and I wrote my first useful script for work. I did a bunch of the basic tutorials and have really enjoyed learning Python so far.

I am looking for any advice on how to make things more Pythonic. What areas can I improve going forward? I want to make this script better and move on to my next project, but I don't want to build on bad fundamentals.

My script works and I am using it. It goes to a set of websites using the Selenium WebDriver and pulls all the links down into a list. I then delete the duplicates. Then I use the requests module to verify a 200 response code for each link.

I incorporated multiprocessing because the first version took way too long: 5+ hours to scan 7300 links. That brought the script's run time down to about an hour.
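The pipeline described above (collect links, de-duplicate, check status codes in parallel) can be sketched independently of Selenium. This is a minimal illustration, not the script itself: `check_status` is a stub standing in for a real `requests.get` call, and a thread pool is used here since the real work is I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

def check_status(url):
    # Stand-in for requests.get(url).status_code; a real checker
    # would issue the HTTP request here (ideally with a timeout).
    return url, '200'

links = ['http://example.com/a', 'http://example.com/b', 'http://example.com/a']

# De-duplicate, as the script does with list(set(...))
unique_links = list(set(links))

# Check links in parallel; threads suit I/O-bound HTTP calls
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(check_status, unique_links))

for url, code in results:
    print(url, code)
```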

CustomFunctions.py

import requests
from selenium import webdriver
import time
import multiprocessing

def get_links(x):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-application-cache')
    driver = webdriver.Chrome('/Desktop/project/SiteCheck/LinkCheckV03/chromedriver', chrome_options=chrome_options)
    driver.get(x)
    links = driver.find_elements_by_xpath('//*[@href]')
    time.sleep(4)
    return links

def check_links(links):
    try:
        r = requests.get(links)
        rc = r.status_code
        strRc = str(rc)
        result = links, strRc
        return result
    except Exception as e:
        logz = open('exception.log', 'w')
        logz.write(str(e) + '\n')

def main(func, mlist):
    pool = multiprocessing.Pool(4)
    results = pool.map(func, mlist)

    pool.close()
    pool.join()

    return results


LinkCheck.py

from CustomFunctions import get_links, check_links, main
import fileinput
import sys

# redirecting stdout to file
old_stdout = sys.stdout
log_file = open("output.log","w")
sys.stdout = log_file

#Gateways
home = 'http://www.bonton.com'
#brands = 'http://www.bonton.com/sc1/brands/'
women = 'http://www.bonton.com/sc1/women/'
shoe

Solution

Your LinkCheck.py can be greatly simplified:

from CustomFunctions import get_links, check_links, main
import fileinput
import sys

#redirecting stdout to file
old_stdout = sys.stdout
log_file = open("output.log","w")
sys.stdout = log_file

#Gateways
gateways = {'Homepage': 'http://www.bonton.com',
            ...,
            'Handbags&Accessories GW': 'http://www.bonton.com/sc1/handbags-accessories/',
            ...}

# Fetch Links from host
all_links = []
for gateway, url in gateways.items():
    links = get_links(url)
    print('Total number of links on {}: {}'.format(gateway, len(links)))
    all_links.extend(link.get_attribute('href') for link in links)

# Print link totals and get rid of duplicates    
print('Total number of links before duplicates are removed:', len(all_links))
all_links = list(set(all_links))
print('Total number of links after duplicates are removed: ', len(all_links))

# execute the check_links commands with multiprocessing
if __name__ == '__main__':
    logd = open('linklist.log', 'w')
    line = main(check_links, all_links)
    for items in line:
        logd.write(str(items) + '\n')
    logd.close()

# Only print links that do not have a 200 response code
with open('linklist.log', 'r') as searchfile:
    for line in searchfile:
        if '200' in line:
            pass            
        else:
            print(line)

# ending stdout logging to file            
sys.stdout = old_stdout

log_file.close()


That being said, you should have a look at Python's official style guide, PEP 8. Also, it is quite weird to have code (especially including print) outside of the if __name__ == '__main__': guard, so I would move it inside of it (or even inside a `main` function), unless this interferes with the multiprocessing.
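A sketch of that restructuring, with hypothetical stand-ins (`gather_links`, `check_link`) for the Selenium and requests steps; the real script would parallelize the check with multiprocessing.Pool(4).map rather than the sequential list comprehension shown here:

```python
def gather_links(gateways):
    # Placeholder for the Selenium step; returns de-duplicated links
    return list(set(gateways))

def check_link(url):
    # Placeholder for requests.get(url).status_code
    return url, '200'

def main():
    # All top-level work lives here, behind the __main__ guard
    gateways = ['http://www.bonton.com', 'http://www.bonton.com']
    links = gather_links(gateways)
    results = [check_link(url) for url in links]  # original: Pool(4).map
    # Keep only the links that did not return a 200
    return [(url, code) for url, code in results if code != '200']

if __name__ == '__main__':
    for url, code in main():
        print(url, code)
```

Structuring it this way means nothing runs on import, which is exactly what multiprocessing's worker processes need when they re-import the module.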


Context

StackExchange Code Review Q#161653, answer score: 4

Revisions (0)

No revisions yet.