HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Saving images from a web page

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pagewebimagesfromsaving

Problem

I'm just getting my feet on the ground with Python (maybe 3 days now). The only issue with teaching my self is that I have no one to critique my work.

Correct me if I'm wrong but I think my algorithm/method for solving this problem is quite promising; but, the code not so much.

The program basically strips a web-page and puts it in to the designated directory. My favorite part is the method for deciding the image's extensions :)

import os, re, urllib

def Read(url):
    uopen = urllib.urlopen(url)
    uread = ''
    for line in uopen: uread += line
    return uread

def Find(what, where):
    found = re.findall(what, where)
    match = []
    for i in found:
        i = i[9:-1]
        match.append(i)
    return match    

def Retrieve(urls, loc):
    loc = os.path.realpath(loc)
    os.system('mkdir ' + loc +'/picretrieved')
    loc += '/picretrieved'
    x = 0
    exts = ['.jpeg', '.jpg', '.gif', '.png']
    for url in urls:
        x += 1
        for i in exts:
            ext = re.search(i, url)
            if ext != None:
                ext = ext.group()
                urllib.urlretrieve(url, loc + '/' + str(x) + ext)
            else:
                continue

    print 'Placed', str(x), 'pictures in:', loc

def main():
    url = raw_input('URL to PicRetrieve (google.com): ')
    url = 'http://' + url
    loc = raw_input('Location for PicRetrieve older ("." for here): ')
    html = Read(url)
    urls = Find('img src=".*?"',  html)
    print urls
    Retrieve(urls, loc)

if __name__ == '__main__':
    main()

Solution

import os, re, urllib

def Read(url):


Python convention is to name function lowercase_with_underscores

uopen = urllib.urlopen(url)


I recommend against abbreviations

uread = ''
    for line in uopen: uread += line
    return uread


Use uopen.read() it'll read the whole file

def Find(what, where):
    found = re.findall(what, where)
    match = []
    for i in found:


avoid single letter variable names

i = i[9:-1]
        match.append(i)
    return match


Combine short lines of code

for i in re.findall(what, where):
      match.append(i[9:-1])


I think its cleaner this way. Also use capturing groups rather then indexes.

def Retrieve(urls, loc):
    loc = os.path.realpath(loc)
    os.system('mkdir ' + loc +'/picretrieved')


Use os.mkdir rather than using system

loc += '/picretrieved'


Use os.path.join to construct paths

x = 0
    exts = ['.jpeg', '.jpg', '.gif', '.png']
    for url in urls:
        x += 1


use for x, url in enumerate(urls):

for i in exts:
            ext = re.search(i, url)


Don't use regular expressions to search for simple strings. In this case I think you really want to use url.endswith(i). Also, stop using i everywhere.

if ext != None:
                ext = ext.group()
                urllib.urlretrieve(url, loc + '/' + str(x) + ext)
            else:
                continue


This continue does nothing

print 'Placed', str(x), 'pictures in:', loc


You don't need to use str when you are using print. It does it automatically.

def main():
    url = raw_input('URL to PicRetrieve (google.com): ')
    url = 'http://' + url
    loc = raw_input('Location for PicRetrieve older ("." for here): ')
    html = Read(url)
    urls = Find('img src=".*?"',  html)


Generally, using an html parser is preferred to using a regex on html. In this simple case you are ok.

print urls
    Retrieve(urls, loc)

if __name__ == '__main__':
    main()

Code Snippets

import os, re, urllib

def Read(url):
uopen = urllib.urlopen(url)
uread = ''
    for line in uopen: uread += line
    return uread
def Find(what, where):
    found = re.findall(what, where)
    match = []
    for i in found:
i = i[9:-1]
        match.append(i)
    return match

Context

StackExchange Code Review Q#10804, answer score: 3

Revisions (0)

No revisions yet.