Get InnerHTML, OuterHTML, and plain text of an element by ID or class

Submitted by: @import:stackexchange-codereview

Problem

I do a fair amount of scraping, but I am by no means a good PHP programmer. I always struggle to get the innerHTML of elements using PHP, DOMDocument, and XPath.

I have cobbled together a couple of functions that appear to do what I need. My question is: are there any major holes in my logic, and how can the code be improved?

Disclaimer: I did not write all of the code. The functions below are amalgamations of others' code with a little rewriting of my own. If you recognize the code as yours, please let me know so I can credit you.

```

Untitled Document


This paragraph is in the first child div

This is a standalone paragraph

Span in a div
This paragraph is in the second child div


';

$return = getHTMLElementsByID('attachment_371',$htmlstring,array('div'));
echo "";
print_r($return);

/*$return = getHTMLElementsByID('attachment_371',$htmlstring);
echo ""; print_r($return);

$return = getHTMLElementsByClass('testclass',$htmlstring);
echo ""; print_r($return);*/

function getHTMLElementsByID($id,$htmlstring,$tags = array('*')) {
$contents = array();
$pattern = "/]?)(([\s]\/>)|(>((([^)|(?R))*)))/sm";
$dom = new DOMDocument();
$libxml_previous_state = libxml_use_internal_errors(true);
$dom->loadHTML($htmlstring);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($libxml_previous_state);
$xpath = new DOMXPath($dom);
foreach ($tags as $tagname) {
$elements = $xpath->query('//'.$tagname.'[@id="'.$id.'"]');
foreach ($elements as $element) {
$elementhtml = $dom->saveXML($element);
preg_match_all($pattern, $elementhtml, $matches, PREG_OFFSET_CAPTURE);
foreach ($matches[0] as $key => $match) {
$x = new SimpleXMLElement("");
$plaintext = isset($matches[6][$key][0]) ? $matches[6][$key][0] : '';
$plaintext = preg_replace('/<[^>]*>/', ' ', $plaintext);
```

Solution

Except for the difference between @id=… and @class=… (and an indentation error on that line), the two functions look identical. You should avoid cut-and-paste code by implementing both getElementsById() and getElementsByClass() in terms of a common helper function, perhaps getElementsByXPath().
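A minimal sketch of that refactoring follows. It is not the asker's code: `attributeQuery()`, the array keys `outer`, `inner`, and `text`, and the exact signatures are hypothetical choices for illustration, and the attribute value is assumed to contain no double quote. As in the question, `@class="…"` compares the whole attribute value, so an element with several classes would not match.

```php
<?php
// Sketch only: getElementsByXPath(), attributeQuery() and the returned array
// keys are hypothetical names chosen for illustration.

/** Build one //tag[@attribute="value"] path per tag and join them as an XPath union. */
function attributeQuery(string $attribute, string $value, array $tags): string
{
    $paths = [];
    foreach ($tags as $tag) {
        // Assumes $value contains no double quote.
        $paths[] = '//' . $tag . '[@' . $attribute . '="' . $value . '"]';
    }
    return implode(' | ', $paths);
}

/** Shared helper: run an XPath query and describe every matching node. */
function getElementsByXPath(string $query, string $htmlstring): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($htmlstring);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $results = [];
    foreach ($nodes as $node) {
        // Inner HTML is the serialization of each child node, concatenated.
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $dom->saveHTML($child);
        }
        $results[] = [
            'outer' => $dom->saveHTML($node),
            'inner' => $inner,
            'text'  => trim($node->textContent),
        ];
    }
    return $results;
}

// The two public functions now differ only in the attribute they test.
function getElementsById(string $id, string $htmlstring, array $tags = ['*']): array
{
    return getElementsByXPath(attributeQuery('id', $id, $tags), $htmlstring);
}

function getElementsByClass(string $class, string $htmlstring, array $tags = ['*']): array
{
    return getElementsByXPath(attributeQuery('class', $class, $tags), $htmlstring);
}

// Usage with a made-up snippet (not the HTML from the question).
$html = '<div id="box" class="note"><p>Hello <b>world</b></p></div><p class="note">Second</p>';
print_r(getElementsByClass('note', $html, ['p']));
```

Serializing nodes with `saveHTML()` rather than a regular expression over `saveXML()` output is a swapped-in technique here, not something the answer prescribes; it keeps the inner/outer HTML logic in one place.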

In any document, an id is supposed to uniquely identify at most one element; having two elements with the same id is an error. Therefore, getElementsById() should be getElementById(), and it should return just the first element it finds (or NULL if not found).


> [The Element.id property] must be unique in a document, and is often used to retrieve the element using getElementById.

That means that your example HTML is invalid.
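The single-element lookup suggested above could look roughly like this. It is a sketch under assumptions: the signature and the choice to return a `DOMElement` (rather than an array of strings) are hypothetical, and the id is assumed to contain no double quote. The `[1]` predicate on the parenthesized node set keeps only the first match in document order, so duplicate ids in invalid markup do not produce several results.

```php
<?php
/**
 * Hypothetical sketch: return the first element with the given id, or null.
 */
function getElementById(string $id, string $htmlstring): ?DOMElement
{
    $dom = new DOMDocument();
    // Duplicate ids and sloppy markup trigger libxml warnings; keep them internal.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($htmlstring);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumes $id contains no double quote.
    $nodes = $xpath->query('(//*[@id="' . $id . '"])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}

// Usage with deliberately invalid markup: two elements share the id "dup".
$html = '<div id="a"><p id="dup">first</p><p id="dup">second</p></div>';
$element = getElementById('dup', $html);
echo $element === null ? 'not found' : $element->textContent, PHP_EOL; // prints "first"
```

Callers that need markup rather than text can serialize the returned node through its owner document, for example `$element->ownerDocument->saveHTML($element)`.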

Context

StackExchange Code Review Q#88235, answer score: 2
