XSLT 2.0: Crawl HTML and add links
Problem
Background: I have 4 GB of text data dispersed in 250,000 HTML files. I want to interlink the files with links for the reader to click on. I have a 12 MB file of regex patterns to identify the sites.
Situation: I have developed a working proof of concept consisting of three files:
- An XML file of regex patterns marking where we want to place a touch-link
- A test HTML file
- An XSLT file to read the regex patterns and apply them to the HTML file
Concern: I have slow performance when I apply the proof of concept to full production data.
The regex patterns (test-anchor-sites.xml):
The test HTML:
Set Anchor IDs: Test File
Spinal Surgery
Degeneration of one or more disc(s) of the spine is called degenerative disc disease (DDD).
Often, degenerative DDD can be successfully treated without surgery. Chapter 1 describes these non-surgical treatments.
Chapter 2 describes a Laminectomy, which is a surgical procedure that removes a portion of the vertebral bone called the lamina.
A discectomy is the surgical removal of herniated disc material that presses on a nerve root or the spinal cord. It is covered in Chapter 3 and Chapter 4.
Open disectomy is done through a large incision, and is described in Chapter 3.
Microdisectomy is minimally invasive surgery, described in Chapter 4, and is often the most appropriate treatment after conservative treatments fail to provide relief.
A percutaneous discectomy is a surgical procedure in which the central portion of an intervertebral disc is accessed and removed through a cannula.
The style sheet to load the regex patterns and apply them to the HTML:
Solution
If it runs in 24 hours, that may well be the best way to do it. The only speed-up I can think of is to build an index (using xsl:key) of the words that appear in the links, and then pre-filter each text node to check whether any of its words are present in the index before applying the regular expressions. This won't give quite the same result, of course, because your current code does not take word boundaries into account.
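A minimal sketch of that pre-filter, not a drop-in replacement for your stylesheet. Since the pattern file is not reproduced here, the layout in the comment (`site`, `@href`, `word`, `regex`) is a hypothetical one, as are the names `site-doc`, `site-by-word` and `link`. It also assumes the HTML is well-formed XML in no namespace and that no regex matches the empty string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- ASSUMPTION: hypothetical layout for test-anchor-sites.xml:
         <sites>
           <site href="chapter3.html">
             <word>discectomy</word>
             <regex>[Oo]pen discectomy</regex>
           </site>
         </sites>
       Each <word> is a literal word the regex cannot match without. -->
  <xsl:variable name="site-doc" select="doc('test-anchor-sites.xml')"/>

  <!-- Index every site by its lower-cased required words -->
  <xsl:key name="site-by-word" match="site" use="word/lower-case(.)"/>

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Explicit priority avoids a conflict with node() in the identity template -->
  <xsl:template match="text()" priority="1">
    <!-- Distinct lower-cased words in this text node -->
    <xsl:variable name="words" as="xs:string*"
        select="distinct-values(tokenize(lower-case(.), '\W+')[. ne ''])"/>
    <!-- Pre-filter: only sites sharing at least one word with the text -->
    <xsl:variable name="candidates" as="element(site)*"
        select="key('site-by-word', $words, $site-doc)"/>
    <xsl:call-template name="link">
      <xsl:with-param name="text" select="string(.)"/>
      <xsl:with-param name="todo" select="$candidates"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Apply the candidate regexes one at a time; text that a regex does not
       match is handed on to the remaining candidates -->
  <xsl:template name="link">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="todo" as="element(site)*"/>
    <xsl:choose>
      <xsl:when test="empty($todo)">
        <xsl:value-of select="$text"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:analyze-string select="$text" regex="{$todo[1]/regex}">
          <xsl:matching-substring>
            <a href="{$todo[1]/@href}"><xsl:value-of select="."/></a>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:call-template name="link">
              <xsl:with-param name="text" select="."/>
              <xsl:with-param name="todo" select="$todo[position() gt 1]"/>
            </xsl:call-template>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

A text node that shares no word with the index gets an empty candidate list and is copied unchanged, so no regex is run against it. Because the sketch tokenizes on `\W+`, a pattern that currently matches inside a longer word would be filtered out, which is the word-boundary difference mentioned above.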
Context
StackExchange Code Review Q#56635, answer score: 3