73 Lines of Mayhem - Parse, Sort and Save to CSV in PHP CLI

Submitted by: @import:stackexchange-codereview
php csv parse save mayhem cli and sort lines

Problem

Inside a folder named txt I have 138 text files (totaling 349 MB) full of email addresses, one per line. I don't know yet how many addresses there are. I wrote the following script to read all of these files into an array, discard the duplicates, sort the addresses alphabetically, and save them in groups of 10K per CSV file. It works correctly, but it has also been running for over 8 hours (dual-core i3 with 4 GB of RAM and a 7200 RPM SATA HDD), which seems excessive to me. top also tells me that the script's CPU usage is 100%, and it has been the whole time the script has been running. Give my script a looksie and advise me on where I've gone so terribly wrong.

```
function writeFile($fileName, $fileData)
{
    $writeFileOpen = fopen('csv/' . $fileName, 'w');
    fwrite($writeFileOpen, $fileData) or die('Unable to write file: ' . $fileName);
    fclose($writeFileOpen);
}

function openFiles()
{
    $addressList = array();
    $preventRepeat = array();

    if ($handle = opendir('txt')) {
        while (false !== ($file = readdir($handle))) {
            if ($file != '.' && $file != '..') {
                $newList = explode("\n", trim(file_get_contents('txt/' . $file)));
                foreach ($newList as $key => $val) {
                    $val = str_replace(array(',', '"'), '', $val);
                    if (in_array($val, $preventRepeat) || !strpos($val, '@') || !$val) {
                        unset($newList[$key]);
                    }
                    $preventRepeat[] = $val;
                }
                if (empty($addressList)) {
                    $addressList = $newList;
                } else {
                    $addressList = array_merge($addressList, $newList);
                }
                unset($newList);
            }
        }
        closedir($handle);
    } else {
        echo 'Unable to Read Directory';
    }

    $lineNum = 1;
    $fileNum = 1;
    $fileData = '"Email Address"' . "\n";

    sor
```

Solution

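This will be much more efficient. It reads each file line by line instead of loading it whole, and it stores every address as an array key, so a duplicate is caught by a hash lookup rather than by an in_array() scan over every address seen so far: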

```
$result = array();

if (($handle = opendir('./txt/')) !== false)
{
    // Lift the execution-time and memory limits for this one-off CLI run.
    set_time_limit(0);
    ini_set('memory_limit', -1);

    while (($file = readdir($handle)) !== false)
    {
        if (($file != '.') && ($file != '..'))
        {
            // Read line by line rather than loading the whole file at once.
            if (is_resource($stream = fopen('./txt/' . $file, 'rb')) === true)
            {
                while (($email = fgets($stream)) !== false)
                {
                    $email = trim(str_replace(array(',', '"'), '', $email));

                    if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false)
                    {
                        // Array keys are unique, so duplicates collapse without in_array().
                        $result[strtolower($email)] = true;
                    }
                }

                fclose($stream);
            }
        }
    }

    closedir($handle);

    if (empty($result) !== true)
    {
        ksort($result);

        // The outer keys from array_chunk() are 0-based chunk indexes, hence $key + 1.
        foreach (array_chunk($result, 10000, true) as $key => $value)
        {
            file_put_contents('./emailList-' . ($key + 1) . '.csv', implode("\n", array_keys($value)), LOCK_EX);
        }
    }

    echo 'Done!';
}
```
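
The two ideas in the answer, de-duplicating through array keys and writing fixed-size chunks, can be separated from the directory handling. The following is a minimal sketch rather than the answerer's code: the function names `collectUniqueEmails` and `writeEmailChunks`, the `$dir` and `$perFile` parameters, and the choice to restore the "Email Address" header row from the question's script are all assumptions made for illustration.

```php
<?php
// Hypothetical helpers; names, parameters, and the header row are assumptions.

/**
 * Return the unique, lower-cased, valid addresses found in $lines, sorted.
 * Uniqueness comes from using each address as an array key.
 */
function collectUniqueEmails(iterable $lines): array
{
    $seen = array();

    foreach ($lines as $line) {
        // Same clean-up as the answer: drop commas and quotes, then trim.
        $email = strtolower(trim(str_replace(array(',', '"'), '', $line)));

        if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false) {
            $seen[$email] = true; // a repeated address just overwrites its own key
        }
    }

    ksort($seen, SORT_STRING);

    return array_keys($seen);
}

/**
 * Write $emails to numbered CSV files of at most $perFile rows each.
 * Every file starts with an "Email Address" header row.
 * Returns the paths that were written.
 */
function writeEmailChunks(array $emails, string $dir, int $perFile = 10000): array
{
    $paths = array();

    // array_chunk() yields 0-based chunk indexes, so add 1 for the file number.
    foreach (array_chunk($emails, $perFile) as $index => $chunk) {
        $path = $dir . '/emailList-' . ($index + 1) . '.csv';
        $rows = array_merge(array('"Email Address"'), $chunk);

        file_put_contents($path, implode("\n", $rows) . "\n", LOCK_EX);
        $paths[] = $path;
    }

    return $paths;
}
```

Because `collectUniqueEmails` accepts any iterable, it could be fed a generator that yields `fgets()` lines from each file in turn, which would keep the streaming behaviour of the answer. The target directory passed to `writeEmailChunks` is assumed to exist already.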

Context

StackExchange Code Review Q#1393, answer score: 3
