Parsing locally stored HTML files

Submitted by: @import:stackexchange-codereview

Problem

I am working with this code to parse through HTML files stored on my computer and extract HTML text by defining a certain tag that should be found:

from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
                        break
trade_spider()


The code works perfectly thanks to the help of the Stack Overflow community! As I am not an expert in Python, I am wondering whether some of you know any tricks to speed up my code and reduce processing time, as it has to parse roughly 4 million files.

In a nutshell, this is what my code does:
-> open a text file -> parse all HTML documents in the set directory -> if the regex matches, print the result into the open text file -> break (no more than one match per file) and continue to the next file...

I am open to any suggestions on improving this code.

Update:

Further explanation: basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document, and if this attribute is found, I want the name of the file, the name attribute, and the corresponding HTML text to be printed into a separate text file.

An example string that I extracted from a single HTML file is:

`

Solution

I don't know if this would be significant, but a first suggestion would be to replace the relatively costly re operation with the basic string operation item['name'].endswith("AuditFeesExpenses").
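One caveat before swapping the check: re.match only anchors at the start of the string, so the original pattern also accepts names with extra text after AuditFeesExpenses, while endswith does not. The sketch below compares the two, plus a plain substring test, on a few made-up attribute values (the sample names are assumptions for illustration, not taken from the real files).

```python
import re

SUFFIX = "AuditFeesExpenses"

def matches_regex(name):
    # The original check: re.match anchors only at the start of the string,
    # so any text after the suffix is still accepted.
    return re.match(".*" + SUFFIX, name) is not None

def matches_endswith(name):
    # The suggested replacement: True only when the name ends with the suffix.
    return name.endswith(SUFFIX)

def matches_contains(name):
    # Plain substring test; behaves like the regex for single-line names.
    return SUFFIX in name

# Hypothetical attribute values, for illustration only.
sample_names = [
    "uk-gaap:AuditFeesExpenses",
    "uk-gaap:AuditFeesExpensesTotal",
    "uk-gaap:Turnover",
]

for name in sample_names:
    print(name, matches_regex(name), matches_endswith(name), matches_contains(name))
```

If names with trailing text should keep matching, the substring test is the closer drop-in; if the name must really end with AuditFeesExpenses, endswith is the stricter and arguably more accurate choice.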

Another possible suggestion, based on @Dex'ter's comment, would be to replace the stdout redirection with a regular .write() on the output file.
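A minimal sketch of what that could look like, assuming a small helper (write_record is a hypothetical name, not part of the original script) that receives the open output file and writes one line in the same "file| name| text" layout as the three print() calls:

```python
import io
import os

def write_record(out, path, name, text):
    # Produces the same "file| name| text" line as the three print() calls.
    out.write(f"{os.path.basename(path)}| {name}| {text}\n")

# Demonstration with an in-memory buffer; in the real script `out` would be
# the handle from open("auditfeesexpenses.txt", "w", encoding="utf8").
buffer = io.StringIO()
write_record(
    buffer,
    os.path.join("2015", "report.html"),  # hypothetical path
    "uk-gaap:AuditFeesExpenses",          # hypothetical attribute value
    "1,200",                              # hypothetical element text
)
print(buffer.getvalue(), end="")
```

Opening the output file once in a plain with block and passing the handle around also makes the stdout2file context manager unnecessary, along with the risk of leaving sys.stdout redirected if an exception occurs mid-run.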

But what I'd really recommend is to profile the script to figure out the hot spots. I suspect that the bottleneck is within BeautifulSoup, and if that's the case (given that you're only searching for a substring and not really parsing), perhaps you could find an alternative search method.
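As a rough sketch of both ideas, the standard library's cProfile and pstats modules can report where the time goes, and a cheap substring test can decide whether a file is worth handing to BeautifulSoup at all. The helper names and sample documents below are assumptions for illustration; whether the pre-filter helps depends on how many of the files actually contain the target string.

```python
import cProfile
import io
import pstats

TARGET = "AuditFeesExpenses"

def might_contain_target(contents):
    # Cheap substring test: a file without this text cannot have a matching
    # name attribute, so it would not need to be parsed at all.
    return TARGET in contents

def count_candidates(documents):
    # Stand-in for the real per-file work; counts files that pass the filter.
    return sum(1 for doc in documents if might_contain_target(doc))

def profile_call(func, *args):
    # Run func under cProfile and return its result plus a text report
    # sorted by cumulative time (top 10 entries).
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()

# Hypothetical file contents, for illustration only.
docs = [
    '<ix:nonfraction name="uk-gaap:AuditFeesExpenses">1,200</ix:nonfraction>',
    "<p>no tagged values here</p>",
]

count, report = profile_call(count_candidates, docs)
print(count)
print(report)
```

In the real script, the same profile_call wrapper could be pointed at trade_spider() running on a small sample directory; if BeautifulSoup dominates the report, adding the might_contain_target check right after f.read() would skip parsing for every file that cannot match.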

Context

StackExchange Code Review Q#128515, answer score: 2
