HiveBrain v1.2.0
Get Started
← Back to all entries
patternphpMinor

Retrieving all the email addresses on a web page

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
theemailalladdressespagewebretrieving

Problem

I recently discovered the file_get_contents function and wanted to put it to some use, alongside preg_match_all, off the information in this Stack Overflow thread.

$count email addresses in total: $emailsAsString";

?>


I trust my comments are enough to justify the choices I made. I used the stackoverflow thread I mentioned as a test case, because it had many test/fake emails, but all in an appropriate format for the regex.

Here's the output for this case:

array (size=19)
  0 => string 'apple-touch-icon@2.png' (length=22)
  1 => string 'example@slu.edu' (length=15)
  2 => string 'a+b@google.com.sg' (length=17)
  3 => string 'test1+2@gmail.com' (length=17)
  4 => string 'test-2@yahoo.co.jp' (length=18)
  5 => string 'test@test.com' (length=13)
  6 => string 'test@test.co.uk' (length=15)
  7 => string 'test@google.com.sg' (length=18)
  8 => string 'email@domain.info' (length=17)
  9 => string 'email@domain.inf' (length=16)
  10 => string 'first.lastname@domain.be' (length=24)
  11 => string 'lastname@domain.be' (length=18)
  12 => string 'HIDDENFORLOGICALREASONS@cameranh.rs.gov.br' (length=42)
  13 => string 'HIDDENFORLOGICALREASONS@cameranh.rs.go' (length=38)
  14 => string 'myemail@office21.company.com' (length=28)
  15 => string 'mymail@yahoo.com' (length=16)
  16 => string 'my-e.mail@yahoo.com' (length=19)
  17 => string 'joe@mysite.com' (length=14)
  18 => string 'name@example.com.sv' (length=19)
19 email addresses in total:

apple-touch-icon@2.png, example@slu.edu, a+b@google.com.sg, test1+2@gmail.com, test-2@yahoo.co.jp, test@test.com, test@test.co.uk, test@google.com.sg, email@domain.info, email@domain.inf, first.lastname@domain.be, lastname@domain.be, HIDDENFORLOGICALREASONS@cameranh.rs.gov.br, HIDDENFORLOGICALREASONS@cameranh.rs.go, myemail@office21.company.com, mymail@yahoo.com, my-e.mail@yahoo.com, joe@mysite.com, name@example.com.sv


I realise the string at the end isn't particularly necessary, I just wanted to make use of implode

Solution

First, your code isn't exactly substantial, so it's hard to write a good review.

Up first, file_get_contents() works, but curl is faster.

It's a little longer, but try this:

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}


Moving on to your regex:

/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/


Instead of A-Za-z, you can end with a /i, and make your search case insensitive:

/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i


Depending on how complex you'll allow emails to be, you can add (or remove), some valid characters.

Next, you perform a array_values(array_unique($matches[0])) twice.

var_dump(array_values(array_unique($matches[0])));

$neaterArray = array_values(array_unique($matches[0]));


Into:

$neaterArray = (array_values(array_unique($matches[0])));
var_dump($neaterArray);


Your variable naming follows a camelCase standard all-throughout, so that's alright.

Your comments are generally good, but //store above in array for upcoming bit is incorrect as I pointed out above.


Is this a good way to go about solving the problem? Are there parts that could be optimised with more suitable functions?

Swapping file_get_contents() with curl will improve the speed of page download.


Is initialising the array imperative for this program? I tried it without the original declaration (in the preg_match_all line) and it still functioned.

You don't need to initialise the variable first, you really only would if it where in a for loop, in this case you can just output to a new variable.


Is there a part/are there parts of the code that comes across as poorly written?

Not really, variable naming could be expressed better by replacing $string with $input and $neaterArray with $duplicateRemovedArray, but other than that, the code is good.

Code Snippets

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}
/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/
/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/i
var_dump(array_values(array_unique($matches[0])));

$neaterArray = array_values(array_unique($matches[0]));
$neaterArray = (array_values(array_unique($matches[0])));
var_dump($neaterArray);

Context

StackExchange Code Review Q#92204, answer score: 5

Revisions (0)

No revisions yet.