Why is my web scraping script so slow?
Problem
My web scraping script is very slow; I even had to add set_time_limit(0); to it.
This is the whole script: http://phpfiddle.org/main/code/9qt-78n
I think the problem is here:
foreach ($array_with_links as $url_job) {
    $info = array(
        getTitle($url_job, getID($url_job)), getTitle_Short($url_job, getID($url_job)), getCity($url_job),
        getDepartment($url_job), getSalary($url_job), getJobNumber($url_job), getPositionStartDate($url_job), getFullTimeEquivalent($url_job),
        getPermTermCasual($url_job), getLocation($url_job), getQualifications($url_job), getDuties($url_job),
        getClosingDate($url_job), getContact($url_job), getEmail($url_job), getCreated_On($url_job), getID($url_job)
    );
    array_push($data, $info);
}

Example of some of those functions:
function getCity($url)
{
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 86)->plaintext;
}

function getDepartment($url)
{
    $url = curl_get_contents($url);
    $html_object = str_get_html($url);
    return $ret = $html_object->find('td', 90)->plaintext;
}

And this is my cURL function:
function curl_get_contents($url)
{
    $curl_moteur = curl_init();
    curl_setopt($curl_moteur, CURLOPT_URL, $url);
    curl_setopt($curl_moteur, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_moteur, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl_moteur, CURLOPT_FOLLOWLOCATION, 1);
    $web = curl_exec($curl_moteur);
    curl_close($curl_moteur);
    return $web;
}

Those getX values come from a URL one by one. Is there maybe a way to make multiple simultaneous insertions into that array?
I really don't know what to do or what my mistake is.
Solution
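A minimal sketch of the restructuring this answer recommends, not the asker's or the answerer's actual code. It assumes the question's curl_get_contents() and str_get_html() are available; scrapeJob() is a hypothetical helper introduced here for illustration, and only two of the get*() functions are shown. Each page is downloaded and parsed once, and every extractor reads from that single parsed object:

```php
<?php
// Sketch only: the get*() helpers now take the already-parsed page
// instead of a URL, so calling them no longer triggers a download.

function getCity($html_object)
{
    return $html_object->find('td', 86)->plaintext;
}

function getDepartment($html_object)
{
    return $html_object->find('td', 90)->plaintext;
}

// Hypothetical helper (not in the original script): downloads and parses
// one job page exactly once, then runs every extractor on that result.
// $fetch and $parse are passed in as callables; in the question's script
// they would be 'curl_get_contents' and 'str_get_html'.
function scrapeJob($url, callable $fetch, callable $parse)
{
    $html_object = $parse($fetch($url)); // one HTTP request, one parse

    return array(
        getCity($html_object),
        getDepartment($html_object),
        // ... the remaining get*() calls, all reading $html_object
    );
}
```

Inside the question's foreach loop, this would be used as `$data[] = scrapeJob($url_job, 'curl_get_contents', 'str_get_html');`, which replaces one download per get*() call with one download per URL.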
It seems that every one of your getCity(), getDepartment() etc. functions loads the same web page over and over. You should load each URL once with your curl_get_contents(), then pass its result into each get*() function to parse it.
Context
StackExchange Code Review Q#40538, answer score: 11