patternphpMinor
Get contents from file
Viewed 0 times
filefromgetcontents
Problem
The following PHP code does exactly what I need it to do but, as a complete PHP novice, I am wondering: what can be improved here?
$theurl='./section-1.html';
function getbody1($filename) {
$file = file_get_contents($filename);
$bodypattern = ".*";
$bodyendpattern = ".*";
$thecontainer = "id=\"container\"";
$thecleanup = "";
$noheader = preg_replace("/".$bodypattern."/", "", $file);
$noheader = preg_replace("~".$bodyendpattern."~", "", $noheader);
$noheader = preg_replace("|".$thecontainer."|", "", $noheader);
$noheader = preg_replace("#".$thecleanup."#", "", $noheader);
return $noheader;
}
echo getbody1("$theurl");Solution
First things first, you are not using the right tool for the job. Regular expressions can't reliably parse markup. Full stop. But more on that later, let's first take a look at your patterns, as this may help you in the future:
Now, whenever you quote a string into a regex pattern, you have to stop and think: what if the string happens to contain special chars?. I mean: half of the characters can mean a variety of things depending on the context. Thankfully, there's a handy function called
Now, you seem to be under the impression that a dot matches anything. And it does, except line-feeds. If I were to apply
You need a PCRE modifier:
That's more like it. However, this is just as an aside, because what you really should be looking into are ways to actually treat markup as markup: a DOM needs to be parsed, if you want to get the most out of it. Here's my actual (and as-originally-posted) answer:
Ouch, The first thing I noticed is that you're processing markup, and have chosen to do so using regular expressions. That's a bad idea from the off.
It's especially bad considering you have a tool at your disposal that is specifically built to parse markup, the
Basically, what you're trying to do can be written as simple as this:
The reason why you need to remove the attributes for the body tag, is because
$bodypattern = ".*";//for example
//later
preg_replace('/'.$bodypattern.'/',...);Now, whenever you quote a string into a regex pattern, you have to stop and think: what if the string happens to contain special chars?. I mean: half of the characters can mean a variety of things depending on the context. Thankfully, there's a handy function called
preg_quote to help you with that. The safe option here would be:preg_replace('/'.preg_quote($bodypattern, '/').'/', ...);Now, you seem to be under the impression that a dot matches anything. And it does, except line-feeds. If I were to apply
preg_match('/.+/', $thisAnswer, $matches); to this answer, the pattern would only match the text until a \n is encountered.You need a PCRE modifier:
s:preg_match('/.+/s', $thisAnswer, $match);That's more like it. However, this is just as an aside, because what you really should be looking into are ways to actually treat markup as markup: a DOM needs to be parsed, if you want to get the most out of it. Here's my actual (and as-originally-posted) answer:
Ouch, The first thing I noticed is that you're processing markup, and have chosen to do so using regular expressions. That's a bad idea from the off.
It's especially bad considering you have a tool at your disposal that is specifically built to parse markup, the
DOMDocument object.Basically, what you're trying to do can be written as simple as this:
$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
$container->removeAttribute('id');//remove the id attribute
}
$body = $contents = $dom->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
//remove attribute
$body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML
//as saveHTML adds DOCTYPE, html, head and title tag again
$body = substr($DOM->saveXML($body), 6, -7);The reason why you need to remove the attributes for the body tag, is because
DOMDocument::saveXML will return the markup string with the outer ` and tags. Their length is a given, but only if the opening tag hasn't got any attributes. Thankfully, we can remove them quite easily, and slice them off with substr.
If you want, you could easily use:
$body = str_replace(
array('', ''),
'',
$DOM->saveXML($body)
);
Perhaps this is a bit easier to understand. Anyway: read the DOMDocument docs, there's a lot you can do there. Just click on the methods I used, and click the objects they return, too, so you know what instances you're working with.
Now wrap all this in a function, and we can add some checks to avoid exceptions or errors from showing up:
function getBody($file, DOMDocument $dom = null)
{//optionally pass existing DOMDocument instance
//to avoid creating a new one each time we call this function
if (!file_exists($file))
{//check if the file exists, if not:
throw new InvalidArgumentException($file. ' does not exist!');
}
$dom = $dom instanceof DOMDocument ? $dom : new DOMDocument;
$dom->loadHTMLFile($file);
$container = $dom->getElementById('container');
if ($container)
$container->removeAttribute('id');
$body = $dom->getElementsByTagName('body')->item(0);
while($body->hasAttributes())
{
$body->removeAttribute(
$body->attributes->item(0)->nodeName
);
}
$body = substr($dom->saveXML($body), 6, -7);
//optionally remove trailing whitespace:
return trim($body);
}
Call this function either like this:
$body1 = getBody('path/to/file1.html');
or, since you're going to process more than 1 file, and we can avoid creating a new DOMDocument instance for every function call:
$dom = new DOMDocument;
$files = array('file1.html', 'file2.html');
$bodyStrings = array();
foreach ($files as $file)
$bodyStrings[] = getBody($file, $dom);
That's it.
As a guide-to-the-DOMDocument-docs:
Every class extends from the DOMNode class, except for the DOMNodeList class, which is a Traversable containing only DOMNode instances, accessible through the item method. Anyway, check the DOMNode methods and properties. Learn that class by heart, as almost every object you work with inherits its properties and methods.
The other class to check is the DOMElement class. Instances of this class represent a node (a tag if you will). It has some methods that I've used here (like removeAttribute).
Finally, everything, of course starts with the DOMDocument` class. I had aCode Snippets
$bodypattern = ".*";//for example
//later
preg_replace('/'.$bodypattern.'/',...);preg_replace('/'.preg_quote($bodypattern, '/').'/', ...);preg_match('/.+/s', $thisAnswer, $match);$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
$container->removeAttribute('id');//remove the id attribute
}
$body = $contents = $dom->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
//remove attribute
$body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML
//as saveHTML adds DOCTYPE, html, head and title tag again
$body = substr($DOM->saveXML($body), 6, -7);$body = str_replace(
array('<body>', '</body>'),
'',
$DOM->saveXML($body)
);Context
StackExchange Code Review Q#42691, answer score: 2
Revisions (0)
No revisions yet.