HiveBrain v1.2.0

Simplification of Python code (HTML extraction)

Submitted by: @import:stackexchange-codereview
Tags: extraction, python, simplification, code, html

Problem

I'd like to extract the HTML content of websites and save it to a file. To achieve this, I have the following (working) code:

output = codecs.open("test.html", "a", "utf-8")

def first():
    for i in range(1, 10):

        root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()

        for empty in root.xpath('//*[self::b or self::i][not(node())]'):
            empty.getparent().remove(empty)

        tables = root.cssselect('table.main')
        tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

        txt = []

        txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])

        text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

        output.write(text.decode("utf-8"))
        output.write("\n\n")


Is this "nice" code? I ask because I'm not sure whether it's a good idea to work with strings, and because other HTML sources I've seen put one closing tag on each line, whereas my code sometimes produces several closing tags per line.

Is it possible to avoid strings, and/or to get output with one closing tag per line instead of several on the same line? Thanks for any suggestions :)
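
On the layout question: lxml's serializer takes a `pretty_print` flag that adds line breaks between some elements, so closing tags are less likely to pile up on one line. A minimal sketch, assuming Python 3 and a made-up `SAMPLE` fragment standing in for a downloaded page. The exact placement of line breaks is decided by the underlying libxml2 HTML serializer, so the result may not be strictly one tag per line:

```python
import lxml.html

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE = '<table class="main"><tr><td>a</td><td>b</td></tr></table>'

table = lxml.html.fromstring(SAMPLE)

# Default serialization: no extra line breaks are inserted.
compact = lxml.html.tostring(table, encoding="unicode")

# pretty_print=True asks the serializer to add line breaks where it can.
pretty = lxml.html.tostring(table, encoding="unicode", pretty_print=True)

print(compact)
print(pretty)
```

Passing `encoding="unicode"` makes `tostring` return text instead of bytes, which also avoids a separate decode step before writing to a file.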

Solution

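Try something like this. It opens the output file in a with block, so the file is closed automatically, and it comments out the cssselect call, whose result was immediately overwritten by the xpath call. You can definitely clean up the two list comprehensions further, but for now this should suffice.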

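import codecs
import re

import lxml.html


def first():
    # The with block closes the file even if a download or parse fails.
    with codecs.open("test.html", "a", "utf-8") as output:
        for i in range(1, 10):
            txt = []
            root = lxml.html.parse('http://test'+str(i)+'.xyz'+'?action=source').getroot()
            # Remove <b> and <i> elements that have no content.
            for empty in root.xpath('//*[self::b or self::i][not(node())]'):
                empty.getparent().remove(empty)

            # tables = root.cssselect('table.main')  <-- not needed; it was being overwritten
            # Select only the outermost table.main elements.
            tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

            txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])
            # Strip the [:T...:] and [:/T...:] markers from each serialized table.
            text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt)

            # tostring(..., encoding="utf-8") returns a byte string (this is Python 2 code),
            # so decode it before writing to the codecs file object.
            output.write(text.decode("utf-8") + "\n\n")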
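
One way the remaining clean-up could look is sketched below. It assumes Python 3, which the original code does not target, and it introduces names that are not in the original: `MARKER_RE`, `EMPTY_INLINE_XPATH`, `OUTER_TABLES_XPATH`, `extract_tables`, and the `path` parameter are all hypothetical. The idea is to serialize straight to text with `encoding="unicode"`, so no decode step is needed, and to separate the per-page transformation from the download loop:

```python
import re

import lxml.html

# Matches the [:T...:] and [:/T...:] markers that the original regex strips.
MARKER_RE = re.compile(r'\[:/?T.*?:\]')

# Empty <b> and <i> elements (no child nodes at all).
EMPTY_INLINE_XPATH = '//*[self::b or self::i][not(node())]'
# table.main elements that are not nested inside another table.main.
OUTER_TABLES_XPATH = '//table[@class="main" and not(ancestor::table[@class="main"])]'


def extract_tables(root):
    """Return the outermost table.main elements of root as one text string."""
    for empty in root.xpath(EMPTY_INLINE_XPATH):
        # Same call as the original; note that it also drops the element's tail text.
        empty.getparent().remove(empty)

    tables = root.xpath(OUTER_TABLES_XPATH)
    # encoding="unicode" yields str, so the regex and the join work on text.
    return "\n".join(
        MARKER_RE.sub('', lxml.html.tostring(t, method="html", encoding="unicode"))
        for t in tables
    )


def first(path="test.html"):
    """Append the cleaned tables of each page to path (hypothetical parameter)."""
    with open(path, "a", encoding="utf-8") as output:
        for i in range(1, 10):
            # Same URL scheme as the original code.
            url = 'http://test{}.xyz?action=source'.format(i)
            root = lxml.html.parse(url).getroot()
            output.write(extract_tables(root) + "\n\n")
```

Because `extract_tables` takes an already parsed tree, it can be tried on an in-memory document built with `lxml.html.fromstring` without any network access.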

Context

StackExchange Code Review Q#30627, answer score: 2
