
Why is my web scraping script so slow?

Submitted by: @import:stackexchange-codereview

Problem

My web scraping script is very slow; I even had to add set_time_limit(0);

This is the whole script: http://phpfiddle.org/main/code/9qt-78n

I think the problem is here:

foreach ($array_with_links as $url_job) {
    $info = array(
        getTitle($url_job, getID($url_job)), getTitle_Short($url_job, getID($url_job)),
        getCity($url_job), getDepartment($url_job), getSalary($url_job), getJobNumber($url_job),
        getPositionStartDate($url_job), getFullTimeEquivalent($url_job), getPermTermCasual($url_job),
        getLocation($url_job), getQualifications($url_job), getDuties($url_job),
        getClosingDate($url_job), getContact($url_job), getEmail($url_job),
        getCreated_On($url_job), getID($url_job)
    );
    array_push($data, $info);
}


Example of some of those functions:

function getCity($url)
    {
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 86)->plaintext;
    }

function getDepartment($url)
    {
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 90)->plaintext;
    }


And this is my cURL function:

function curl_get_contents($url)
{
  $curl_moteur = curl_init();
  curl_setopt($curl_moteur, CURLOPT_URL, $url);
  curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);

  curl_setopt($curl_moteur,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

  curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
  $web = curl_exec($curl_moteur);
  curl_close($curl_moteur);
  return $web;
}


Each of those getX functions fetches its value from the URL, one at a time. Is there a way to fill that array with several values simultaneously?

I really don't know what to do or what my mistake is.

Solution

It seems that each of your getCity(), getDepartment(), etc. functions loads the same web page over and over, so every job URL is downloaded and parsed once per field instead of once in total.

You should load each URL once with your curl_get_contents(), then pass its result into each get*() function to parse it.
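
A minimal sketch of that change, assuming each getter only needs the parsed page and not the URL itself. The getters keep their original td indexes but now receive the parsed object. scrapeJob() and its $fetch and $parse parameters are hypothetical names introduced here for illustration, and parsing once per URL (not only downloading once) is an extra step beyond what is described above. Only two of the getters are shown.

```php
<?php
// Getters now take the already-parsed page, so none of them downloads anything.
function getCity($html_object)
{
    return $html_object->find('td', 86)->plaintext;
}

function getDepartment($html_object)
{
    return $html_object->find('td', 90)->plaintext;
}

// Hypothetical helper: $fetch downloads a URL, $parse turns the HTML string
// into an object with a find() method. Both are passed in as callables.
function scrapeJob($url_job, callable $fetch, callable $parse)
{
    // One download and one parse per job URL, shared by every getter.
    $html_object = $parse($fetch($url_job));

    return array(
        getCity($html_object),
        getDepartment($html_object),
        // ...the remaining getters would follow the same pattern
    );
}
```

Inside the original foreach loop this could be called as array_push($data, scrapeJob($url_job, 'curl_get_contents', 'str_get_html')); so that each page is requested once no matter how many fields are extracted from it.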

Context

StackExchange Code Review Q#40538, answer score: 11
