
Find missing web-pages

Submitted by: @import:stackexchange-codereview

Problem

You are writing your web-page and relentlessly adding links, as you think that links are the best thing a web-page can offer.

In fact, you even write links to pages that do not exist yet, because you will write them later.

But later becomes a lot later, and now you have your website in your directory with a lot of links pointing to nothing.

You decide to systematically write a page for each link, so as to eliminate all the pending links, and you use the script below to list all such pending links.

An example output looks like:

['addition.html', 'definition.html', 'division.html', 'infinity.html', 'multiplication.html', 'primitive_concept.html', 'recursion.html', 'set.html', 'subtraction.html']


The code is very straightforward, but there is always potential for improvement:

"""
Given a folder of html pages, gives a list of all the pages that are linked to,
but do not exist.
"""
import doctest
import itertools
import os
import re

PATH = os.path.dirname(os.path.realpath(__file__))
flatten = itertools.chain.from_iterable

def destinations(html):
    """
    >>> destinations('''The natural numbers are an <a href="infinity.html">infinite</a> <a href="set.html">set</a> defined <a href="recursion.html">recursively</a> as follows:''')
    ['infinity.html', 'set.html', 'recursion.html']
    """
    return re.findall(r'[a-z_]+\.html', html)

def read(f):
    with open(f) as c:
        return c.read()

def missing_pages(directory=PATH):
    """
    Lists all the pending links of the html pages of the `directory`.
    """
    all_pages = sorted(set((flatten(destinations(read(i)) for i in os.listdir(directory)))))
    return list(i for i in all_pages if i not in os.listdir(directory))

if __name__ == "__main__":
    doctest.testmod()
    print(missing_pages())

Solution

If the script works fine for you, the points about robustness below won't
matter too much.

  • The way PATH is defined is a bit unusual, since in most cases I want
to be able to invoke the script from any directory, but this default
forces it to the directory of the script. I'd almost say that the
normal default argument of os.listdir, namely ".", is way better.

  • os.listdir is called too often; the result can just be reused.

  • Don't use list(...) around a generator expression when you can use
a list comprehension instead.

  • The output is fine I guess, except that for passing it to other
scripts you'd usually want a more "standard" format, i.e. a single file
per line without quotes.
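
The last point could be handled with a small helper. This is only a sketch: print_pages is a hypothetical name that is not part of the original script, and the sample list stands in for whatever missing_pages returns.

```python
import sys


def print_pages(pages, stream=sys.stdout):
    """Write one filename per line, without quotes or brackets."""
    for page in pages:
        stream.write(page + "\n")


if __name__ == "__main__":
    # Hypothetical sample input standing in for the result of missing_pages().
    print_pages(["addition.html", "definition.html"])
```

Output in this shape can be piped straight into tools such as xargs or a shell loop.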

For the record missing_pages should look more like this (filtering
filenames and directories is left as an exercise though):

def missing_pages(directory="."):
    """
    Lists all the pending links of the html pages of the `directory`.
    """
    files = os.listdir(directory)
    all_pages = sorted(set(flatten(destinations(read(i)) for i in files)))
    return [i for i in all_pages if i not in files]


Otherwise looks good, test does what it says, docstrings where needed
and the script even has a top-level description.

Robustness

I ran the script on a random directory just to see what happens. With
Python 3 I get a UnicodeDecodeError; with Python 2 it errors out on a
directory - it wouldn't hurt to skip them. It probably should only read
files with the correct ending as well.
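
Those points could be combined roughly as follows. This is only a sketch for Python 3, under the assumption that the pages are UTF-8 encoded; html_files is a hypothetical helper name. Unlike the original, it also joins each filename with the directory before opening it, since os.listdir returns bare names.

```python
import os
import re


def destinations(html):
    """Same pattern as the original script: lowercase names with underscores."""
    return re.findall(r'[a-z_]+\.html', html)


def html_files(directory="."):
    """Names of regular files in `directory` that end in .html."""
    return [
        name for name in os.listdir(directory)
        if name.endswith(".html")
        and os.path.isfile(os.path.join(directory, name))
    ]


def read(path):
    # Assumption: pages are UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def missing_pages(directory="."):
    """Lists linked-to pages that do not exist in `directory`."""
    files = html_files(directory)
    linked = set()
    for name in files:
        # os.listdir returns bare names, so join them with the directory.
        linked.update(destinations(read(os.path.join(directory, name))))
    return sorted(linked - set(files))
```

Subdirectories and non-HTML files are never opened here, so neither of the two errors described above should occur on a mixed directory.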

Next, the regular expression is quite limited: it only matches
lowercase names with underscores, and it cannot tell a file that is
merely mentioned from one that is actually linked to. The docstring of
destinations implies that the function does more than it can, so you
could either leave out all the HTML fluff and just mention the
filenames separated by spaces, or actually document the corner cases.
I'd like to have a docstring saying something like "Finds mentions of HTML files in a string. Only lowercase filenames with underscores are returned."
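
If only real links should count, one alternative is to parse the HTML instead of matching filenames with a regular expression. The following is a sketch using html.parser from the standard library, not what the original script does; LinkCollector is a hypothetical name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # An attribute without a value is reported as None; skip those.
            if name == "href" and value:
                self.links.append(value)


def destinations(html):
    """Return the href targets of all <a> tags that end in .html."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if link.endswith(".html")]
```

This version ignores filenames that appear only in running text and accepts names with uppercase letters or hyphens. It still treats external URLs ending in .html as local pages and skips targets with a fragment such as page.html#section, so those cases would need extra handling.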

Context

StackExchange Code Review Q#114068, answer score: 7
