HiveBrain v1.2.0
Get Started
← Back to all entries
patternphpMinor

Get contents from file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filefromgetcontents

Problem

The following PHP code does exactly what I need it to do but, as a complete PHP novice, I am wondering: what can be improved here?

$theurl='./section-1.html';
function getbody1($filename) {
  $file = file_get_contents($filename);
  $bodypattern = ".*";
  $bodyendpattern = ".*";
  $thecontainer = "id=\"container\"";
  $thecleanup = "";
  $noheader = preg_replace("/".$bodypattern."/", "", $file);
  $noheader = preg_replace("~".$bodyendpattern."~", "", $noheader);
  $noheader = preg_replace("|".$thecontainer."|", "", $noheader);
  $noheader = preg_replace("#".$thecleanup."#", "", $noheader);
  return $noheader;
}
echo getbody1("$theurl");

Solution

First things first, you are not using the right tool for the job. Regular expressions can't reliably parse markup. Full stop. But more on that later, let's first take a look at your patterns, as this may help you in the future:

$bodypattern = ".*";//for example
//later
preg_replace('/'.$bodypattern.'/',...);


Now, whenever you quote a string into a regex pattern, you have to stop and think: what if the string happens to contain special chars?. I mean: half of the characters can mean a variety of things depending on the context. Thankfully, there's a handy function called preg_quote to help you with that. The safe option here would be:

preg_replace('/'.preg_quote($bodypattern, '/').'/', ...);


Now, you seem to be under the impression that a dot matches anything. And it does, except line-feeds. If I were to apply preg_match('/.+/', $thisAnswer, $matches); to this answer, the pattern would only match the text until a \n is encountered.
You need a PCRE modifier: s:

preg_match('/.+/s', $thisAnswer, $match);


That's more like it. However, this is just as an aside, because what you really should be looking into are ways to actually treat markup as markup: a DOM needs to be parsed, if you want to get the most out of it. Here's my actual (and as-originally-posted) answer:

Ouch, The first thing I noticed is that you're processing markup, and have chosen to do so using regular expressions. That's a bad idea from the off.

It's especially bad considering you have a tool at your disposal that is specifically built to parse markup, the DOMDocument object.

Basically, what you're trying to do can be written as simple as this:

$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
    $container->removeAttribute('id');//remove the id attribute
}
$body = $contents = $dom->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
    //remove attribute
    $body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML
//as saveHTML adds DOCTYPE, html, head and title tag again
$body = substr($DOM->saveXML($body), 6, -7);


The reason why you need to remove the attributes for the body tag, is because DOMDocument::saveXML will return the markup string with the outer ` and tags. Their length is a given, but only if the opening tag hasn't got any attributes. Thankfully, we can remove them quite easily, and slice them off with substr.

If you want, you could easily use:

$body = str_replace(
    array('', ''),
    '',
    $DOM->saveXML($body)
);


Perhaps this is a bit easier to understand. Anyway: read the
DOMDocument docs, there's a lot you can do there. Just click on the methods I used, and click the objects they return, too, so you know what instances you're working with.

Now wrap all this in a function, and we can add some checks to avoid exceptions or errors from showing up:

function getBody($file, DOMDocument $dom = null)
{//optionally pass existing DOMDocument instance
 //to avoid creating a new one each time we call this function
    if (!file_exists($file))
    {//check if the file exists, if not:
        throw new InvalidArgumentException($file. ' does not exist!');
    }
    $dom = $dom instanceof DOMDocument ? $dom : new DOMDocument;
    $dom->loadHTMLFile($file);
    $container = $dom->getElementById('container');
    if ($container)
        $container->removeAttribute('id');
    $body = $dom->getElementsByTagName('body')->item(0);
    while($body->hasAttributes())
    {
        $body->removeAttribute(
            $body->attributes->item(0)->nodeName
        );
    }
    $body = substr($dom->saveXML($body), 6, -7);
    //optionally remove trailing whitespace:
    return trim($body);
}


Call this function either like this:

$body1 = getBody('path/to/file1.html');


or, since you're going to process more than 1 file, and we can avoid creating a new
DOMDocument instance for every function call:

$dom = new DOMDocument;
$files = array('file1.html', 'file2.html');
$bodyStrings = array();
foreach ($files as $file)
    $bodyStrings[] = getBody($file, $dom);


That's it.

As a guide-to-the-DOMDocument-docs:

Every class extends from the
DOMNode class, except for the DOMNodeList class, which is a Traversable containing only DOMNode instances, accessible through the item method. Anyway, check the DOMNode methods and properties. Learn that class by heart, as almost every object you work with inherits its properties and methods.

The other class to check is the
DOMElement class. Instances of this class represent a node (a tag if you will). It has some methods that I've used here (like removeAttribute).
Finally, everything, of course starts with the
DOMDocument` class. I had a

Code Snippets

$bodypattern = ".*";//for example
//later
preg_replace('/'.$bodypattern.'/',...);
preg_replace('/'.preg_quote($bodypattern, '/').'/', ...);
preg_match('/.+/s', $thisAnswer, $match);
$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
    $container->removeAttribute('id');//remove the id attribute
}
$body = $contents = $dom->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
    //remove attribute
    $body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML
//as saveHTML adds DOCTYPE, html, head and title tag again
$body = substr($DOM->saveXML($body), 6, -7);
$body = str_replace(
    array('<body>', '</body>'),
    '',
    $DOM->saveXML($body)
);

Context

StackExchange Code Review Q#42691, answer score: 2

Revisions (0)

No revisions yet.