HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Simple image scraping

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
simpleimagescraping

Problem

I wrote this code over the last few days and I learned a lot, and it works as I expect. However I suspect it's woefully inefficient:

import requests, bs4, os

os.chdir('../../Desktop')
wallpaperFolder = os.makedirs('Wallpapers', exist_ok=True)

pageCount = 1
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)

linkList = []
while not GalleryUrl.endswith('page=3'):
    # Download the page
    GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
    print('Opening gallary... %s' % (GalleryUrl))
    res = requests.get(GalleryUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # find urls to image page
    section = soup.find_all('div', attrs={'class': 'boxgrid'})

    for images in section:
        imageLink = images.find_all('a')
        for a in imageLink:
            imageUrl = a.get('href')
            linkList.append(imageUrl)
            print('Image found at... ' + imageUrl)
    linkCount = 0
# Follow links and download images on page.
    while linkCount != len(linkList):
        print('Downloading image at... %s' % (str(linkList[linkCount])))
        imagePageUrl = ('https://wall.alphacoders.com/' + str(linkList[linkCount]))
        res = requests.get(imagePageUrl)

        wallpaperSoup = bs4.BeautifulSoup(res.text, 'html.parser')

        link = wallpaperSoup.find(id='main_wallpaper')
        imageSrc = link['src']

        # Save image to folder

        res = requests.get(imageSrc)
        imageFile = open(os.path.join(os.getcwd(), 'Wallpapers', os.path.basename(imageSrc)), 'wb')
        for chunk in res.iter_content(12000):
            imageFile.write(chunk)
        imageFile.close()
        linkCount += 1

    # Move to next page and empty link list
    pageCount += 1
    del linkList[:]       

print('Done')


Aside from cutting the print statements and simply letting the program do its thing, how could I optimize

Solution

os.chdir('../../Desktop')


Halt. Stop right there. This assumes the script is located at some specific location. Desktop happens to be always in the same place, but this assumes the script to be in a place relative to Desktop. On what OS do you intent to run this? If it's Linux-only, you could replace it by:

os.chdir('~/Desktop')


Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.

Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse).

But why don't you simply drop it instead? A map named Wallpapers will simply be created in the current directory. This would be obvious and expected behaviour.

for images in section:
    imageLink = images.find_all('a')
    for a in imageLink:
        imageUrl = a.get('href')
        linkList.append(imageUrl)
        print('Image found at... ' + imageUrl)


This isn't your bottleneck (not by a long shot), but that second for reeks of bad time complexity for something simple as telling the user where the images are.

One of the major time-consumers appears to be the downloading itself.

I ran cProfile against your code to find the bottleneck(s):

python3 -m cProfile -s tottime CR_133564.py


This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile. Let's take a look at everything with a total time higher than 0.5 seconds:

8989953 function calls (8984594 primitive calls) in 108.880 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
20393 59.314 0.003 59.314 0.003 {method 'read' of '_ssl._SSLSocket' objects}
183 28.853 0.158 28.853 0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
183 9.689 0.053 9.689 0.053 {method 'connect' of '_socket.socket' objects}
183 1.547 0.008 1.548 0.008 {built-in method _socket.getaddrinfo}
183 1.005 0.005 1.005 0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
78129 0.621 0.000 3.383 0.000 parser.py:301(parse_starttag)
585660 0.580 0.000 0.580 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
93 0.549 0.006 5.750 0.062 parser.py:134(goahead)


That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.

Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read (AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.

Code Snippets

os.chdir('../../Desktop')
os.chdir('~/Desktop')
for images in section:
    imageLink = images.find_all('a')
    for a in imageLink:
        imageUrl = a.get('href')
        linkList.append(imageUrl)
        print('Image found at... ' + imageUrl)
python3 -m cProfile -s tottime CR_133564.py

Context

StackExchange Code Review Q#133564, answer score: 6

Revisions (0)

No revisions yet.