patternphpMinor
Retrieving all the email addresses on a web page
Viewed 0 times
theemailalladdressespagewebretrieving
Problem
I recently discovered the
I trust my comments are enough to justify the choices I made. I used the stackoverflow thread I mentioned as a test case, because it had many test/fake emails, but all in an appropriate format for the regex.
Here's the output for this case:
I realise the string at the end isn't particularly necessary, I just wanted to make use of
file_get_contents function and wanted to put it to some use, alongside preg_match_all, off the information in this Stack Overflow thread.$count email addresses in total: $emailsAsString";
?>I trust my comments are enough to justify the choices I made. I used the stackoverflow thread I mentioned as a test case, because it had many test/fake emails, but all in an appropriate format for the regex.
Here's the output for this case:
array (size=19)
0 => string 'apple-touch-icon@2.png' (length=22)
1 => string 'example@slu.edu' (length=15)
2 => string 'a+b@google.com.sg' (length=17)
3 => string 'test1+2@gmail.com' (length=17)
4 => string 'test-2@yahoo.co.jp' (length=18)
5 => string 'test@test.com' (length=13)
6 => string 'test@test.co.uk' (length=15)
7 => string 'test@google.com.sg' (length=18)
8 => string 'email@domain.info' (length=17)
9 => string 'email@domain.inf' (length=16)
10 => string 'first.lastname@domain.be' (length=24)
11 => string 'lastname@domain.be' (length=18)
12 => string 'HIDDENFORLOGICALREASONS@cameranh.rs.gov.br' (length=42)
13 => string 'HIDDENFORLOGICALREASONS@cameranh.rs.go' (length=38)
14 => string 'myemail@office21.company.com' (length=28)
15 => string 'mymail@yahoo.com' (length=16)
16 => string 'my-e.mail@yahoo.com' (length=19)
17 => string 'joe@mysite.com' (length=14)
18 => string 'name@example.com.sv' (length=19)
19 email addresses in total:
apple-touch-icon@2.png, example@slu.edu, a+b@google.com.sg, test1+2@gmail.com, test-2@yahoo.co.jp, test@test.com, test@test.co.uk, test@google.com.sg, email@domain.info, email@domain.inf, first.lastname@domain.be, lastname@domain.be, HIDDENFORLOGICALREASONS@cameranh.rs.gov.br, HIDDENFORLOGICALREASONS@cameranh.rs.go, myemail@office21.company.com, mymail@yahoo.com, my-e.mail@yahoo.com, joe@mysite.com, name@example.com.svI realise the string at the end isn't particularly necessary, I just wanted to make use of
implodeSolution
First, your code isn't exactly substantial, so it's hard to write a good review.
Up first,
It's a little longer, but try this:
Moving on to your regex:
Instead of
Depending on how complex you'll allow emails to be, you can add (or remove), some valid characters.
Next, you perform a
Into:
Your variable naming follows a
Your comments are generally good, but
Is this a good way to go about solving the problem? Are there parts that could be optimised with more suitable functions?
Swapping
Is initialising the array imperative for this program? I tried it without the original declaration (in the
You don't need to initialise the variable first, you really only would if it where in a
Is there a part/are there parts of the code that comes across as poorly written?
Not really, variable naming could be expressed better by replacing
Up first,
file_get_contents() works, but curl is faster.It's a little longer, but try this:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}Moving on to your regex:
/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/Instead of
A-Za-z, you can end with a /i, and make your search case insensitive:/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/iDepending on how complex you'll allow emails to be, you can add (or remove), some valid characters.
Next, you perform a
array_values(array_unique($matches[0])) twice.var_dump(array_values(array_unique($matches[0])));
$neaterArray = array_values(array_unique($matches[0]));Into:
$neaterArray = (array_values(array_unique($matches[0])));
var_dump($neaterArray);Your variable naming follows a
camelCase standard all-throughout, so that's alright.Your comments are generally good, but
//store above in array for upcoming bit is incorrect as I pointed out above.Is this a good way to go about solving the problem? Are there parts that could be optimised with more suitable functions?
Swapping
file_get_contents() with curl will improve the speed of page download.Is initialising the array imperative for this program? I tried it without the original declaration (in the
preg_match_all line) and it still functioned.You don't need to initialise the variable first, you really only would if it where in a
for loop, in this case you can just output to a new variable.Is there a part/are there parts of the code that comes across as poorly written?
Not really, variable naming could be expressed better by replacing
$string with $input and $neaterArray with $duplicateRemovedArray, but other than that, the code is good.Code Snippets
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}//[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/ivar_dump(array_values(array_unique($matches[0])));
$neaterArray = array_values(array_unique($matches[0]));$neaterArray = (array_values(array_unique($matches[0])));
var_dump($neaterArray);Context
StackExchange Code Review Q#92204, answer score: 5
Revisions (0)
No revisions yet.