Regex to get all image links

Submitted by: @import:stackexchange-codereview

Problem

I have some pretty basic Regex that scans the output of an HTML file (the whole document source) and attempts to extract all of the absolute links that look like images. Whether they are actually images or not isn't too important, as those checks would be made later.

I am just trying to see if this Regex looks okay, since I kind of hacked it together from various sources.

The Raw Regex Pattern:

\bhttps?:[^)''"]+\.(?:jpg|jpeg|gif|png)


The Regex in Practice (I am coding in Railo - an open source CFML engine):



NOTE: variables.getDocument contains the raw HTML code.

It seems to work great right now and I'm not seeing any issues, but I'd like to know whether you can see any potential pitfalls with this method. Is there room for improvement?

Solution

If you're truly only looking for images, why not parse every src attribute you can find? As it stands, all of the provided regular expressions will fail on local images, as well as on dynamically generated images, such as:

<img src="/local/images/get_profile_pic.php?id=12345" title="John Doe" />

This is perfectly valid markup, and none of the other expressions would capture it. Generally, for parsing HTML, you're better off traversing it with something like an XML reader or a DOM builder, depending on how well formed the HTML is.
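To make that limitation concrete, here is a minimal Java sketch that runs the pattern from the question against two img tags. Java is used because it is the library the answer falls back on; the class name `AbsoluteImageLinks`, the method `find`, and the sample HTML are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbsoluteImageLinks {
    // The pattern from the question: an absolute http(s) URL that ends
    // in one of a few known image extensions.
    private static final Pattern ABSOLUTE_IMAGE =
            Pattern.compile("\\bhttps?:[^)''\"]+\\.(?:jpg|jpeg|gif|png)");

    // Collects every substring of the HTML that the pattern matches.
    static List<String> find(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = ABSOLUTE_IMAGE.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical input: one absolute image URL, one local dynamic image.
        String html = "<img src=\"http://example.com/a.png\" />"
                + "<img src=\"/local/images/get_profile_pic.php?id=12345\" title=\"John Doe\" />";
        System.out.println(find(html)); // prints [http://example.com/a.png]
    }
}
```

The second tag is skipped for two independent reasons: its URL has no http(s) scheme, and it does not end in an image extension.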

Ben Nadel has a pretty good blog article on working with badly formed HTML and translating into an XML document. Once you have an XML document, you can unleash the power of XPath to do some very nice searching.
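As a rough illustration of the XPath route, the sketch below uses only the XML classes that ship with the JDK. It assumes the markup is already well-formed XML, so the clean-up step described in the blog article is not shown. The class name `XPathImageSources`, the method `find`, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathImageSources {
    // Returns the src attribute of every img element in a well-formed document.
    static List<String> find(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "//img/@src" selects the src attribute of img elements at any depth.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//img/@src", doc, XPathConstants.NODESET);
            List<String> sources = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                sources.add(nodes.item(i).getNodeValue());
            }
            return sources;
        } catch (Exception e) {
            // Malformed input ends up here; real HTML would need cleaning first.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already well-formed input.
        String xhtml = "<html><body>"
                + "<img src=\"/local/images/get_profile_pic.php?id=12345\" title=\"John Doe\" />"
                + "<p><img src='http://example.com/a.png' /></p>"
                + "</body></html>";
        System.out.println(find(xhtml));
    }
}
```

Unlike the regex approaches, this finds local and dynamically generated images without caring about quoting style or attribute order, at the cost of requiring parseable input.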

If, however, you just want a quick and dirty RegEx to get the job done, I'd recommend actually using the underlying Java Regular Expression library, as it's probably more efficient when matching multiple items over a large document.

The main RegEx I'll be using is as follows:

<img\s+[^>]*?src=("|')([^"']+)\1


Which looks for the src attribute in img tags. You can vary this as you see fit.

 
]*?src=("|')([^"']+)\1') />

    


UPDATE :
Here's an explanation of the pattern I chose:

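<img           - Literal string "<img", match the opening tag
\s+            - Match one or more whitespace characters, so <img\t is valid
[^>]*?         - Lazily match any character that is not a '>' while looking for the next literal string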
src=           - Literal string "src="
("|')          - Match either a single or a double quote, both are valid in HTML
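([^"']+)       - Match one or more characters that are neither a single nor a double quote. Note: [^\1] is not an option here, because a backreference has no meaning inside a character class, so both quote characters are excluded explicitly. As a side effect, malformed HTML attributes with mismatched quotes are rejected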
\1             - Match the value of the first group (either a single or double quote)
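The breakdown above can be checked with a small Java sketch that applies the same pattern directly through java.util.regex. The class name `ImgSrcExtractor`, the method `extract`, and the sample tags are hypothetical; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // Same pattern as in the answer, with (?i) for case-insensitive matching.
    // Group 1 is the opening quote character, group 2 is the src value.
    private static final Pattern IMG_SRC =
            Pattern.compile("(?i)<img\\s+[^>]*?src=(\"|')([^\"']+)\\1");

    // Returns the src value of every img tag the pattern matches.
    static List<String> extract(String html) {
        List<String> sources = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            sources.add(matcher.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Hypothetical input: double quotes, single quotes with upper-case
        // names, and a malformed tag whose quotes do not match.
        String html = "<img src=\"/a.png\">"
                + "<IMG alt='x' SRC='b.gif' />"
                + "<img src=\"c.jpg'>";
        System.out.println(extract(html)); // prints [/a.png, b.gif]
    }
}
```

The third tag is dropped because \1 requires the closing quote to be the same character as the opening one.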

Code Snippets

<img src="/local/images/get_profile_pic.php?id=12345" title="John Doe" />

<img\s+[^>]*?src=("|')([^"']+)\1

<cfset html = variables.getDocument /> <!--- your HTML --->
<cfset pattern = CreateObject("java","java.util.regex.Pattern").compile('(?i)<img\s+[^>]*?src=("|')([^"']+)\1') />
<cfset matcher = pattern.matcher(html) />

<!--- loop through the matches --->
<cfloop condition="matcher.find()">
    <cfset src = matcher.group(2) />
</cfloop>

<img           - Literal string "<img", match the opening tag
\s+            - Match one or more whitespace characters, so <img\t is valid
[^>]*?         - Lazily match any character that is not a '>' while looking for the next literal string
src=           - Literal string "src="
("|')          - Match either a single or a double quote, both are valid in HTML
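([^"']+)       - Match one or more characters that are neither a single nor a double quote. Note: [^\1] is not an option here, because a backreference has no meaning inside a character class, so both quote characters are excluded explicitly. As a side effect, malformed HTML attributes with mismatched quotes are rejected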
\1             - Match the value of the first group (either a single or double quote)

Context

StackExchange Code Review Q#20126, answer score: 6