patternpythonMinor
Find missing web-pages
Viewed 0 times
pagesmissingwebfind
Problem
You are writing your web-page and relentlessly adding links, as you think that links are the best thing a web-page can offer.
The fact is that you write links even to pages that do not exist just yet, because you will write them later.
But later becomes a lot later and now you have your website in your directory with a lot of links pointing to nothing.
You decide to systematically write a page for each link, as to eliminate all the pending links, and using this script to list all such pending links:
An example output looks like:
The code is very straightforward, but there is always potential for improvement:
The fact is that you write links even to pages that do not exist just yet, because you will write them later.
But later becomes a lot later and now you have your website in your directory with a lot of links pointing to nothing.
You decide to systematically write a page for each link, as to eliminate all the pending links, and using this script to list all such pending links:
An example output looks like:
['addition.html', 'definition.html', 'division.html', 'infinity.html', 'multiplication.html', 'primitive_concept.html', 'recursion.html', 'set.html', 'subtraction.html']The code is very straightforward, but there is always potential for improvement:
"""
Given a folder of html pages, gives a list of all the pages that are linked to,
but do not exist.
"""
import doctest
import itertools
import os
import re
PATH = os.path.dirname(os.path.realpath(__file__))
flatten = itertools.chain.from_iterable
def destinations(html):
"""
>>> destinations('''The natural numbers are an infinitesetdefined recursivelyas follows:''')
['infinity.html', 'set.html', 'recursion.html']
"""
return re.findall('[a-z_]+\.html', html)
def read(f):
with open(f) as c:
return c.read()
def missing_pages(directory=PATH):
"""
Lists all the pending links of the html pages if the `directory`.
"""
all_pages = sorted(set((flatten(destinations(read(i)) for i in os.listdir(directory)))))
return list(i for i in all_pages if i not in os.listdir(directory))
if __name__ == "__main__":
doctest.testmod()
print(missing_pages())Solution
If the script works fine for you the points about robustness below won't
matter too much.
to be able to invoke the script from any directory, but this default
forces it to the directory of the script. I'd almost say that the
normal default argument of
you'd usually want a more "standard" format, i.e. a single file per
line without quotes.
For the record
filenames and directories is left as an exercise though):
Otherwise looks good, test does what it says, docstrings where needed
and the script even has a top-level description.
Robustness
I ran the script on a random directory just to see what happens. With
Python 3 I get a
directory - it wouldn't hurt to skip them. It probably should only read
files with the correct ending as well.
Next is that the regular expression to match is quite limited; it also
will not really deal with the situation where a file is mentioned, but
not linked to. The docstring to
function does more than what it can - so you could just leave out all
the HTML fluff and just mention the filenames separated by spaces; or,
actually document the corner cases, meaning I'd like to have a docstring
saying something like "Finds mentions of HTML files in a string. Only lowercase filenames with underscores are returned." or something similar.
matter too much.
- The way
PATHis defined is a bit unusual since in most cases I want
to be able to invoke the script from any directory, but this default
forces it to the directory of the script. I'd almost say that the
normal default argument of
os.listdir, namely "." is way better.os.listdiris called too often, the result can just be reused.
- Don't use
listwhen you can have literal list syntax instead.
- The output is fine I guess, except for passing it to other scripts
you'd usually want a more "standard" format, i.e. a single file per
line without quotes.
For the record
missing_pages should look more like this (filteringfilenames and directories is left as an exercise though):
def missing_pages(directory="."):
"""
Lists all the pending links of the html pages of the `directory`.
"""
files = os.listdir(directory)
all_pages = sorted(set(flatten(destinations(read(i)) for i in files)))
return [i for i in all_pages if i not in files]Otherwise looks good, test does what it says, docstrings where needed
and the script even has a top-level description.
Robustness
I ran the script on a random directory just to see what happens. With
Python 3 I get a
UnicodeDecodeError; with Python 2 it errors out on adirectory - it wouldn't hurt to skip them. It probably should only read
files with the correct ending as well.
Next is that the regular expression to match is quite limited; it also
will not really deal with the situation where a file is mentioned, but
not linked to. The docstring to
destinations implies that thefunction does more than what it can - so you could just leave out all
the HTML fluff and just mention the filenames separated by spaces; or,
actually document the corner cases, meaning I'd like to have a docstring
saying something like "Finds mentions of HTML files in a string. Only lowercase filenames with underscores are returned." or something similar.
Code Snippets
def missing_pages(directory="."):
"""
Lists all the pending links of the html pages of the `directory`.
"""
files = os.listdir(directory)
all_pages = sorted(set(flatten(destinations(read(i)) for i in files)))
return [i for i in all_pages if i not in files]Context
StackExchange Code Review Q#114068, answer score: 7
Revisions (0)
No revisions yet.