
XSLT 2.0: Crawl HTML and add links

Submitted by: @import:stackexchange-codereview

Problem

Background: I have 4 GB of text data spread across 250,000 HTML files. I want to interlink the files with `<a>` links for the reader to click on. I have a 12 MB file of regex patterns that identify the link sites.

Situation: I have developed a working proof of concept consisting of three files:

  • An XML file of regex patterns marking where we want to place a touch-link
  • A test HTML file
  • An XSLT file that reads the regex patterns and applies them to the HTML file

Concern: Performance is slow when I apply the proof of concept to the full production data.

The regex patterns (test-anchor-sites.xml):


    
    
    
    
    
    


The test HTML:


    
        Set Anchor IDs: Test File
    
    
        
            Spinal Surgery
        
        
            Degeneration of one or more disc(s) of the spine is called degenerative disc disease (DDD).
            Often, degenerative DDD can be successfully treated without surgery. Chapter 1 describes these non-surgical treatments.
            Chapter 2 describes a Laminectomy, which is a surgical procedure that removes a portion of the vertebral bone called the lamina.
            
                A discectomy is the surgical removal of herniated disc material that presses on a nerve root or the spinal cord. It is covered in Chapter 3 and Chapter 4.
                Open disectomy is done through a large incision, and is described in Chapter 3.
                Microdisectomy is minimally invasive surgery, described in Chapter 4, and is often the most appropriate treatment after conservative treatments fail to provide relief.
                A percutaneous discectomy is a surgical procedure in which the central portion of an intervertebral disc is accessed and removed through a cannula.
            
        
    


The style sheet to load the regex patterns and apply them to the HTML:


























Solution

If the full job does run in 24 hours, then your current approach might well be the best way to do it. The only way I can think of to speed it up is to build some kind of index (using xsl:key) of the words that appear in the links, and then pre-filter each text node to see whether any of its words are present in the index before applying the regular expressions. This of course won't give quite the same result, because you aren't currently taking word boundaries into account: your regexes can match inside a longer word, whereas a word index only ever sees whole words.
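The pre-filtering idea can be sketched outside XSLT. The following is a minimal Python illustration, not the stylesheet from the question: the `PATTERNS` list, the link targets, and the helper names `tokens`, `build_index`, and `add_links` are all hypothetical, and the sketch assumes each pattern is a plain phrase with no capturing groups and that the text is a single plain-text node. Each pattern is indexed under its longest word, and a text node only pays for regex matching when one of its words is in the index.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the real pattern file: (regex, link target).
PATTERNS = [
    (r"Chapter 1", "chapter-1.html"),
    (r"Laminectomy", "laminectomy.html"),
    (r"disc", "disc.html"),
]

WORD = re.compile(r"\w+")


def tokens(text):
    """Return the set of lower-cased word tokens in text."""
    return {w.lower() for w in WORD.findall(text)}


def build_index(patterns):
    """Map a key word to the positions of the patterns that contain it.

    Assumption: patterns are plain phrases, so the longest word of the
    regex source is a word the matched text must contain.
    """
    index = defaultdict(list)
    for i, (regex, _href) in enumerate(patterns):
        key = max(WORD.findall(regex.lower()), key=len)
        index[key].append(i)
    return index


def add_links(text, patterns, index):
    """Wrap matches in <a> elements, trying only the pre-filtered patterns."""
    candidates = sorted({i for w in tokens(text) for i in index.get(w, ())})
    if not candidates:
        return text  # fast path: no regex is evaluated for this text node

    # One alternation over the candidates; the group name records which
    # pattern matched so the right link target can be looked up.
    combined = re.compile(
        "|".join(f"(?P<p{i}>{patterns[i][0]})" for i in candidates)
    )

    def link(match):
        href = patterns[int(match.lastgroup[1:])][1]
        return f'<a href="{href}">{match.group(0)}</a>'

    return combined.sub(link, text)
```

With `index = build_index(PATTERNS)`, calling `add_links("Chapter 2 describes a Laminectomy.", PATTERNS, index)` links only "Laminectomy", and a sentence with no indexed words is returned untouched without compiling any regex. The word-boundary caveat shows up too: the raw regex `disc` matches inside "discectomy", but the word set of "A discectomy removes material." does not contain "disc", so the pre-filter skips that pattern and the text stays unlinked.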

Context

StackExchange Code Review Q#56635, answer score: 3
