HiveBrain v1.2.0

Optimizing PHP script fetching entire HTML pages

Submitted by: @import:stackexchange-codereview

Problem

The following script reads links stored in a text file (one per line), puts them into an array, and then scans each link's source code for a certain line. Whether that line is found is recorded in a CSV file.

It works fine so far, but it takes ages to finish, since each link is opened and its complete source code is scanned for this specific line.

I'm looking for ideas on how to optimize the code to run faster.

Here is my code:

$filename = "products.txt";
$writecsv = "notavailable.csv";
global $products;

$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/abCk.txt");
curl_setopt($ch, CURLOPT_URL,"https://www.websitegoeshere.com");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "Login=USERNAME&Password=PASSWORD");

ob_start();      // prevent any output
curl_exec ($ch); // execute the curl command
ob_end_clean();  // stop preventing output

curl_close ($ch);
unset($ch);

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/abCk.txt");

// Open the file
$fp = @fopen($filename, 'r') or die("products.txt not found"); 

// Add each line to an array
if ($fp) {
   $products = explode("\n", fread($fp, filesize($filename)));
}

fclose($fp);

$fpcsv = fopen($writecsv, 'w') or die("notavailable.csv not found");

foreach ($products as $key => $val) {
    curl_setopt($ch, CURLOPT_URL, $val);
    $buf2 = curl_exec($ch);
    // Search the raw HTML directly: the needle contains no characters that
    // htmlentities() would change, so encoding the whole page is wasted work.
    // Note fputcsv() expects an array of fields, not a bare string.
    if (strpos($buf2, "/extension/silver.project/design/sc_base/images/available_yes.gif") !== false) {
        fputcsv($fpcsv, array("available"));
    } else {
        fputcsv($fpcsv, array("not available"));
    }
}

fclose($fpcsv);

curl_close ($ch);
echo "csv written successfully.";


Any help is really welcomed. Thanks in advance!

Solution

Performing this task in parallel would help greatly, although PHP is not the best language for it.

I would do this by spawning PHP processes: POST asynchronously to other PHP scripts, passing each one the links it should fetch. You would need to store whether the line is available in a database; I would use SQLite here, since it doesn't need a server to be set up.

So a possible setup could look something like:

Master Process: Splits the main link file into n parts, then POSTs the parts to the child pages. It needs to know when the child processes are finished; one way is a polling loop that checks whether the number of rows in the database equals the number of links. Make sure to put in a sleep so it doesn't poll too often. Once the children are done and the polling loop exits, take the data in the database and convert it to CSV.
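A minimal sketch of such a master under these assumptions: the worker.php endpoint, the results.db SQLite file, and the results table schema are all illustrative names, not from the original answer.

```php
<?php
// Illustrative names: worker.php endpoint, results.db SQLite file.

// Split the links into $n roughly equal parts.
function split_links(array $links, int $n): array {
    return array_chunk($links, max(1, (int) ceil(count($links) / $n)));
}

// Fire-and-forget POST of one chunk to the (hypothetical) worker page.
function dispatch_chunk(array $chunk): void {
    $ch = curl_init('http://localhost/worker.php');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['links' => implode("\n", $chunk)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200); // don't wait for the worker
    curl_exec($ch);
    curl_close($ch);
}

if (is_readable('products.txt')) {
    $links = file('products.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach (split_links($links, 4) as $chunk) {
        dispatch_chunk($chunk);
    }

    // Poll until every link has a row, sleeping between checks.
    $db = new PDO('sqlite:results.db');
    do {
        sleep(2);
        $done = (int) $db->query('SELECT COUNT(*) FROM results')->fetchColumn();
    } while ($done < count($links));

    // Convert the collected rows to CSV.
    $out = fopen('notavailable.csv', 'w');
    foreach ($db->query('SELECT url, status FROM results') as $row) {
        fputcsv($out, [$row['url'], $row['status']]);
    }
    fclose($out);
}
```

The short CURLOPT_TIMEOUT_MS makes the POSTs roughly fire-and-forget; the workers must then be written to keep running after the master disconnects.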

Child Processes: Each page receives a set of links via POST, goes through each one checking whether it contains the string, and records the result in the database.
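A minimal sketch of such a worker, assuming the same illustrative results.db database and that the links arrive newline-separated in a links POST field. If the site requires the login from the question, that cookie step would also have to happen here.

```php
<?php
// Hypothetical worker page: receives newline-separated links in
// $_POST['links'], checks each page for the marker image, and records
// the result in SQLite.
ignore_user_abort(true); // keep running if the master disconnects early
set_time_limit(0);

$marker = '/extension/silver.project/design/sc_base/images/available_yes.gif';

$db = new PDO('sqlite:results.db');
$db->exec('CREATE TABLE IF NOT EXISTS results (url TEXT PRIMARY KEY, status TEXT)');
$stmt = $db->prepare('INSERT OR REPLACE INTO results (url, status) VALUES (?, ?)');

$links = array_filter(array_map('trim', explode("\n", $_POST['links'] ?? '')));

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

foreach ($links as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $html   = curl_exec($ch);
    $status = (is_string($html) && strpos($html, $marker) !== false)
        ? 'available' : 'not available';
    $stmt->execute([$url, $status]);
}
curl_close($ch);
```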

I would not use this in production code; PHP is not made for this, and there are many things that could go wrong. If it were possible, I would do this sort of thing in a language with built-in parallelism, such as Golang or Clojure.
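As an aside not in the original answer: PHP's built-in curl_multi API can also run the downloads concurrently inside a single process, which avoids spawning child pages entirely. A rough sketch:

```php
<?php
// Rough sketch: fetch a batch of URLs concurrently with curl_multi,
// returning each response body keyed by URL.
function fetch_all(array $urls): array {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

The caller would then run strpos() over each body, batching the link list (say, 20 URLs at a time) so memory stays bounded.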

Context

StackExchange Code Review Q#23849, answer score: 2
