
Pure Python script that saves an HTML page with all images

Submitted by: @import:stackexchange-codereview

Problem

Here is a pure Python script that saves an HTML page without CSS but with all the images on it, and replaces each image's src with the path of the saved image on the hard drive.

I know that there are great libraries like BeautifulSoup and others, but I would like to try it myself in pure Python.

There is actually no practical use for this script; it was just a test task from one of the companies I applied to.

Once again, here is what it does:

The script is launched from the command line and takes two arguments:

  • Address of the web page to save (required)

  • Name of the folder where images from the page should be saved (optional)

The script saves the HTML content of the page (without CSS), finds all images on the page and saves them too, replacing each image's src attribute with the actual path of the image on the hard drive.

How can I improve it?

```
import random
import string
import sys
import urllib2
import os
import re
from urlparse import urlparse

def page_loader(url_name, dir_name='imgs'):

    page_to_open = urllib2.urlopen(url_name)
    target_page = page_to_open.read()
    base_dir = os.path.dirname(os.path.realpath(__file__))
    dir_to_save = os.path.join(base_dir, dir_name)
    new_file_name = '%s.html' % ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase) for _ in range(10))
    if not os.path.exists(dir_to_save):
        os.makedirs(dir_to_save)

    images_on_page = re.findall('img .*?src="(.*?)"', target_page)
    internal_images = [img for img in images_on_page if img.startswith('/')]
    external_images = [img for img in images_on_page if not img.startswith('/')]

    for image in internal_images:
        image_url = '%s%s' % (page_to_open.geturl()[:-1], image)
        new_image_name = urlparse(image_url).path.split('/')[-1]
        with open(os.path.join(dir_to_save, new_image_name), 'w') as new_image:
            new_image.write(urllib2.urlopen(image_url).read())
        target_page = re.sub(image, new_image.name, target_page)

    for image_url in externa
```

Solution

Your code is pretty good! I do have a few tips regarding style and such.

  • As mentioned in the comments, even though it looks like you aren't, you shouldn't be trying to parse HTML with regexes.

  • Where are your comments? While good code can be pretty readable, comments are still a valuable asset. You should probably flesh out your page_loader function with a docstring, and annotate any unclear blocks of code with inline comments.

  • If you're only going to use part of a module, e.g., one function or variable, you should use from ... import ....
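
To make the three tips concrete, here is a minimal sketch of how the image lookup could be done without a regex while still staying in the standard library. It is only an illustration, not a drop-in replacement: it assumes Python 3 module names (html.parser and urllib.parse; in Python 2 the equivalents are the HTMLParser module and urlparse.urljoin), and the names ImageSourceCollector and find_image_urls are made up for this example. It also shows a docstring, inline comments, and the from ... import ... style.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser.

    Hypothetical helper for illustration; not part of the original script.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us and passes
        # attributes as a list of (name, value) pairs.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def find_image_urls(html_text, page_url):
    """Return absolute URLs for all images referenced in html_text.

    page_url is the address the page was loaded from; relative src
    values are resolved against it with urljoin, so there is no need
    to treat paths starting with '/' as a special case.
    """
    collector = ImageSourceCollector()
    collector.feed(html_text)
    return [urljoin(page_url, src) for src in collector.sources]
```

Calling find_image_urls(target_page, page_to_open.geturl()) would then give one list of absolute image URLs, which could stand in for the separate internal_images and external_images lists, assuming the page text has been decoded to a string first.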



That's about all I can really think of right now. If there's anything else you want me to comment on, just mention it below, and I'll see if I can cover it. Hope this helps!

Context

StackExchange Code Review Q#78775, answer score: 7
