Parsing locally stored HTML files
Problem
I am working with the code below to parse HTML files stored on my computer and extract the text of a certain tag that should be found:
from bs4 import BeautifulSoup
import glob
import os
import re
import sys
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    # Redirect stdout to the given file for the duration of the block
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())
                    break  # no more than one match per file
trade_spider()

The code works perfectly, thanks to the help of the Stack Overflow community! As I am not an expert in Python coding, I am wondering whether there are some magic tricks some of you might know to speed up my code and reduce processing time, as it has to parse through ~4 million files.
In a nutshell, what my code does:
-> open the output text file -> parse every HTML document in the set directory -> if the regex is found, print the result into the open text file -> break (no more than one match) and continue to the next file...
I am open to any suggestions on improving this code.
Update:
Further explanation: Basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document, and if this attribute is found, I want the name of the file, the name attribute, and the correlating HTML text printed into a separate text file.
An example string that I extracted from a single HTML file is:
Solution
I don't know if this would be significant, but a first suggestion would be to replace the relatively costly re operation with the basic string operation item['name'].endswith("AuditFeesExpenses").

Another possible suggestion, based on @Dex'ter's comment, would be to change the stdout redirection into a regular .write() on the output file.

But what I'd really recommend is to profile the script to figure out the hot spots. I suspect that the bottleneck is within BeautifulSoup, and if that's the case (given that you're only searching for a substring and not parsing), perhaps you could find an alternative search method.
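Taken together, the first two suggestions (plus a cheap substring pre-check, one possible form of the "alternative search method") could look like the sketch below. The paths and tag names are the ones from the question; the `wanted()` helper is hypothetical, and bs4 is assumed to be installed:

```python
import glob
import os

def wanted(name):
    # Basic string method instead of re.match(".*AuditFeesExpenses", name)
    return name.endswith("AuditFeesExpenses")

def trade_spider():
    # bs4 imported here so the pure-string parts of the sketch stand alone
    from bs4 import BeautifulSoup
    os.chdir(r"C:\Users\Independent Auditors Report")
    # Explicit .write() on one output file instead of redirecting stdout
    with open("auditfeesexpenses.txt", "w") as out:
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
            # Cheap substring pre-check: skip files that cannot possibly
            # match before paying for a full BeautifulSoup parse
            if "AuditFeesExpenses" not in contents:
                continue
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if wanted(item['name']):
                    out.write("{}| {}| {}\n".format(
                        file.split(os.path.sep)[-1],
                        item['name'],
                        item.get_text()))
                    break  # no more than one match per file
```

Note that on 4 million files the substring pre-check alone may save the most time, since it lets you skip the parser entirely for non-matching files.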
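For the profiling step, the standard library's cProfile is enough to see where the time goes. A minimal sketch (profiling a stand-in function here; in practice you would wrap the call to trade_spider() itself):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in workload; profile trade_spider() instead in practice
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Report the ten most expensive calls by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

If BeautifulSoup's parsing functions dominate the cumulative-time column, that would confirm the suspicion above and justify looking for an alternative search method.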
Context
StackExchange Code Review Q#128515, answer score: 2