
Extracting a div from parsed HTML

Submitted by: @import:stackexchange-codereview

Problem

It seems that lxml's etree module is generally imported as from lxml import etree. Why is that? It keeps the code tidier, and the potential namespace ambiguity might not be a concern, but I have no incentive to import it that way because I understood this style to be generally frowned upon.

I know for a script of this size it doesn't matter much, but I'm going to be using these modules for a lot more. I'm also curious about what others have to say.

#!/usr/bin/python
# Stuart Powers http://sente.cc/

import sys
import urllib
import lxml.html
from cStringIO import StringIO

""" This script parses HTML and extracts the div with an id of 'search-results':
  ex:  ...

$ python script.py "http://www.youtube.com/result?search_query=python+stackoverflow&page=1"
The output, if piped to a file would look like: http://c.sente.cc/E4xR/lxml_results.html

"""

parser = lxml.html.HTMLParser()

filecontents = urllib.urlopen(sys.argv[1]).read()

tree = lxml.etree.parse(StringIO(filecontents), parser)

node = tree.xpath("//div[@id='search-results']")[0]

# Serialize only the extracted div (node), not the whole document tree.
print lxml.etree.tostring(node, pretty_print=True)

Solution

You might be confusing two different things. from lxml import etree is a legitimate (even preferred) form of absolute import. What is discouraged is the use of relative imports for intra-package imports. See the "Imports" section of PEP 8: http://www.python.org/dev/peps/pep-0008/
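As a rough illustration of that distinction, the sketch below shows that the "import package.module" and "from package import module" spellings are both absolute imports that bind the same module object. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption, chosen only so the snippet runs without lxml installed); the same relationship holds between import lxml.etree and from lxml import etree. The SAMPLE markup is hypothetical and not taken from the original script.

```python
# Both statements below are absolute imports; they differ only in which
# name they bind in the current namespace.
import xml.etree.ElementTree        # binds the top-level name "xml"
from xml.etree import ElementTree   # binds the name "ElementTree"

# The two spellings refer to the very same module object.
same_module = ElementTree is xml.etree.ElementTree

# Hypothetical, well-formed sample markup (not from the original script).
SAMPLE = (
    "<html><body>"
    "<div id='header'>ignored</div>"
    "<div id='search-results'><p>first hit</p></div>"
    "</body></html>"
)

root = ElementTree.fromstring(SAMPLE)

# ElementTree supports a limited XPath subset, including [@attr='value'].
node = root.find(".//div[@id='search-results']")

# encoding="unicode" makes tostring() return str rather than bytes.
extracted = ElementTree.tostring(node, encoding="unicode")

print(same_module)
print(extracted)
```

The shorter name from the second form is what keeps call sites tidy (ElementTree.tostring(...) rather than xml.etree.ElementTree.tostring(...)), and that is the usual reason for writing from lxml import etree.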

Context

StackExchange Code Review Q#7430, answer score: 3
