Get InnerHTML, OuterHTML, and plain text of an element by ID or class
Problem
I do a fair amount of scraping, but I am by no means a good PHP programmer. I always struggle to get the innerHTML of elements using PHP, DOMDocument, and XPath.
I have cobbled together a couple of functions that appear to do what I need. Are there any major holes in my logic, and how can it be improved?
Disclaimer: I did not write all of the code; the functions below are amalgamations of others' code with a little rewriting by me. If you can easily identify the code as yours, please let me know so I can credit you.
```
Untitled Document
This paragraph is in the first child div
This is a standalone paragraph
Span in a div
This paragraph is in the second child div
';
$return = getHTMLElementsByID('attachment_371',$htmlstring,array('div'));
echo "";
print_r($return);
/*$return = getHTMLElementsByID('attachment_371',$htmlstring);
echo ""; print_r($return);
$return = getHTMLElementsByClass('testclass',$htmlstring);
echo ""; print_r($return);*/
function getHTMLElementsByID($id,$htmlstring,$tags = array('*')) {
    $contents = array();
    $pattern = "/]?)(([\s]\/>)|(>((([^)|(?R))*)))/sm";
    $dom = new DOMDocument();
    $libxml_previous_state = libxml_use_internal_errors(true);
    $dom->loadHTML($htmlstring);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($libxml_previous_state);
    $xpath = new DOMXPath($dom);
    foreach ($tags as $tagname) {
        $elements = $xpath->query('//'.$tagname.'[@id="'.$id.'"]');
        foreach ($elements as $element) {
            $elementhtml = $dom->saveXML($element);
            preg_match_all($pattern, $elementhtml, $matches, PREG_OFFSET_CAPTURE);
            foreach ($matches[0] as $key => $match) {
                $x = new SimpleXMLElement("");
                $plaintext = isset($matches[6][$key][0]) ? $matches[6][$key][0] : '';
                $plaintext = preg_replace ('/]*>/', ' ', $plaintext);
```
Solution
Except for the difference between `@id=…` and `@class=…` (and an indentation error on that line), the two functions look identical. You should avoid cut-and-paste code by implementing both `getElementsById()` and `getElementsByClass()` in terms of a common helper function, perhaps `getElementsByXPath()`.
In any document, an `id` is supposed to uniquely identify at most one element; having two elements with the same `id` is an error. Therefore, `getElementsById()` should be `getElementById()`, and it should return just the first element it finds (or `NULL` if not found).
"[The `Element.id` property] must be unique in a document, and is often used to retrieve the element using `getElementById`."
That means that your example HTML is invalid.
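The refactoring suggested above might look something like the following. This is only a minimal sketch, assuming PHP 7.1 or later: `getElementsByXPath()` is the helper name proposed in this answer, while `loadHtmlDocument()`, `describeElement()`, `xpathLiteral()`, and the array keys are hypothetical names chosen for illustration. It also swaps the regex-based extraction for DOM calls (`saveHTML()` on a node and `textContent`) and matches a class as one token of the `class` attribute rather than by exact `@class=…` comparison; both are assumptions, not the original poster's approach.

```php
<?php
// Sketch only: every function name except getElementsByXPath() is hypothetical.

/** Parse an HTML string, keeping libxml warnings out of the output. */
function loadHtmlDocument(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

/** Quote a value as an XPath 1.0 string literal (XPath has no escapes). */
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types are present: join the pieces with concat().
    return 'concat("' . str_replace('"', '", \'"\', "', $value) . '")';
}

/** Outer HTML, inner HTML, and plain text of a single element. */
function describeElement(DOMElement $element): array
{
    $dom = $element->ownerDocument;
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return [
        'outerhtml' => $dom->saveHTML($element),
        'innerhtml' => $inner,
        'plaintext' => $element->textContent,
    ];
}

/** Common helper: describe every element matched by an XPath expression. */
function getElementsByXPath(string $html, string $expression): array
{
    $xpath = new DOMXPath(loadHtmlDocument($html));
    $results = [];
    foreach ($xpath->query($expression) as $node) {
        if ($node instanceof DOMElement) {
            $results[] = describeElement($node);
        }
    }
    return $results;
}

/** First element with the given id, or null when none matches. */
function getElementById(string $html, string $id): ?array
{
    $matches = getElementsByXPath($html, '//*[@id=' . xpathLiteral($id) . ']');
    return $matches[0] ?? null;
}

/** All elements whose class attribute contains $class as a whole token. */
function getElementsByClass(string $html, string $class): array
{
    $expression = '//*[contains(concat(" ", normalize-space(@class), " "), '
        . xpathLiteral(' ' . $class . ' ') . ')]';
    return getElementsByXPath($html, $expression);
}
```

With this shape, a call such as `getElementById($htmlstring, 'attachment_371')` returns a single array or `null`, and `getElementsByClass($htmlstring, 'testclass')` returns a list of such arrays. Each call reparses the HTML string, which keeps the sketch short; parsing once and passing the `DOMDocument` around would be the obvious next step.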
Context
StackExchange Code Review Q#88235, answer score: 2