patternMinor

N-gram generation

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

generationgramstackoverflow

Problem

I have the following code, but it's too slow. How can I make it faster?

generateNGram();
}

private function getFilePath() {
    $files = array();
    $excludes = array('.', '..');
    $path = rtrim(self::SAMPLE_DIRECTORY, DIRECTORY_SEPARATOR . '/');
    $files = scandir($path);
    $files = array_diff($files, $excludes);
    foreach ($files as $file) {

        if (is_dir($path . DIRECTORY_SEPARATOR . $file))
            fetchdir($path . DIRECTORY_SEPARATOR . $file, $callback);
        else if (!preg_match('/^.*\\' . self::SOURCE_EXTENSION . '$/', $file))
            continue;
        else
            $filesPath[] = $path . DIRECTORY_SEPARATOR . $file; 
    }
    unset($file);
    return $filesPath;
}
protected function removeUniCharCategories($string){
    //Replace punctuation(' " # % & ! . : , ? ¿) become space " "
    //Example : 'You&me', become 'You Me'.
    $string = preg_replace( "/\p{Po}/u", " ", $string );
    //--------------------------------------------------
    $string = preg_replace( "/[^\p{Ll}|\p{Lm}|\p{Lo}|\p{Lt}|\p{Lu}|\p{Zs}]/u", "", $string );
    $string = trim($string);
    $string = mb_strtolower($string,'UTF-8');
    return $string;
}
private function generateNGram() {
    $files = $this->getFilePath();
    foreach($files as $file) {
        $file_content = file_get_contents($file, FILE_TEXT);
        $file_content = $this->removeUniCharCategories($file_content);
        $words = explode(" ", $file_content);
        $tokens = array();
        foreach ($words as $word) {
            $word = "_" . $word . "_";
            $length = mb_strlen($word, 'UTF-8');
            for ($i = self::N_GRAM_MIN_LENGTH, $min =  min(self::N_GRAM_MAX_LENGTH, $length); $i

Solution

This code seems to be very good as far as performance is concerned. I cannot see any way to improve it.

I used xdebug and cachegrind to see that my test samples were bound by the performance of the loop that creates the tokens. I investigated switching the $i and $j loops around, but the performance was no better.

I cannot see any way to improve the code, so I would look at a faster CPU or a compiled language like C. There are also large improvements in performance with PHP 5.4, so moving to that might be a very good option.

Response to Other Answer

We now have two answers both claiming different bottlenecks. This makes things interesting! I'll list the evidence that makes me believe it is a CPU bottleneck. The first evidence I'll investigate will be the profiling information. The second will be a simple proof of removing the I/O and observing the runtimes. Skip to the second in the "Removed I/O" section if you just want the quick answer.

Profiling

The tools I am using are xdebug and kcachegrind.

First I created four some sample files. My first 3 files were lorem ipsum type texts in Greek, Chinese and Latin. My fourth was the concatenation of these three files and is shown below:

遺健問使物済表績象全問前禎。画更予服的負要前行年語断。肉購会際以面長力攻以野所。治条育録療常強門守更校尊無場審言団。早患社支科視徐取告性表変全意主場。度坂信神暮明死載高深典導昇話。無稿乳横面祝揚倉厘積卒食当正使。食考割闘出護空樫京速地間作夜半文性新局。前馬報惑般供筋子真車東生然歌論略形。政図警情調並浜校素先間後善経完。

Lorem ipsum dolor sit amet, eu ius nisl impedit. Tritani denique ut nam, brute aliquid iudicabit eos ei. Ad vix hinc numquam, unum utamur vix cu. Natum ubique vocibus ei duo.

Quo te eius propriae voluptatibus, id nostro forensibus signiferumque est, graecis senserit gloriatur per et. Eu munere facete scripserit vel, cu vim laudem noster. His cu dictas prodesset, aliquando constituto reprimique sed ex. Mazim decore imperdiet ut vel. Scripta delicatissimi an pro, quo ex porro nominati.

Δε τις έργων κανείς γραμμή, κι χρόνου φακέλους γειτονιάς ήδη. Σε πακέτων επενδυτής λες, που κρατήσουν επεξεργασία δε, μια ακούσει ξεχειλίζει παρατηρούμενη τα. Νέων πετάνε συνηθίζουν δε για', στη να σωστά ευκολότερο βαθμολόγησε. Απλό τέτοιο διοίκηση στα μη, δε νέο τότε περιβάλλον, οι δώσε απαρατήρητο των.

Βαθμολόγησε επιδιόρθωση επιχειρηματίες αν ματ, σου δεδομένη αγοράζοντας δωροδοκηθούν με, τα ένα αναφορά βουτήξουν. Αν καρέκλα υποψήφιο εξαρτάται όσο, και δε τεράστιο προκύπτουν σημαντικός, αν ναι καθώς διορθώσει. Υόρκη ιδιαίτερα τη άρα, τα λοιπόν παράγοντες μας. Πόρτες γραμμές σκεφτείς λες με. Σου βγήκε αρχεία δε. Και εφαμοργής κακόκεφος δε.

I then ran the code, with a small modification to help benchmarking:

$startTime = microtime(true);

for ($i = 1; $i <= 10; $i++)
{
    $ii = new Ngram();
}

$finishTime = microtime(true);
echo 'Took: ' . ($finishTime - $startTime) . 'seconds';

It took about 0.61 seconds without xdebug enabled and 29.1 seconds with it. Note how much slower it is with xdebug (it takes a lot of time to keep track of the execution for profiling)! I then observed the xdebug information using kcachegrind:

You can see that the time listed against file_put_contents (which is highlighted in the image is very small. Of the 29.1 seconds it only spends .01 seconds writing the files (4 of them). This was on my SSD with an ext4 filesystem. You can see that there are 40 calls to file_put_contents 4 files, 10 times each. Actually, more time is spent by getFilePath which has more costly file stat operations.

The more interesting part is how the 29 seconds gets spent in generateNGram. I have a picture of what I believe to be the "hot" part of the code:

We get into 4 levels of looping here where there are 121280 calls to the inner loop functions. There are no times listed for the for loop structures or the addition to the tokens array ($tokens[] = $token;). I don't exactly know where the 29 seconds got used, but I am assuming that it was within these loops.

Removed I/O

The simple way of determining whether this is I/O bound is to remove the I/O. I commented out the file_put_contents line and ran the benchmark again. I found that there was no difference in time with 0.63 seconds without xdebug and 29.2 seconds with it (compared to 0.61 and 29.1). It was actually slower (due to timing fluctuations), but this just shows the negligible influence of the I/O.

Summary

I still believe this is a CPU bottleneck and maintain my recommendation for a faster CPU, PHP 5.4 or a compiled language. The other answer raises good points in the "Optimization, what for?" section.

Code Snippets

遺健問使物済表績象全問前禎。画更予服的負要前行年語断。肉購会際以面長力攻以野所。治条育録療常強門守更校尊無場審言団。早患社支科視徐取告性表変全意主場。度坂信神暮明死載高深典導昇話。無稿乳横面祝揚倉厘積卒食当正使。食考割闘出護空樫京速地間作夜半文性新局。前馬報惑般供筋子真車東生然歌論略形。政図警情調並浜校素先間後善経完。

Lorem ipsum dolor sit amet, eu ius nisl impedit. Tritani denique ut nam, brute aliquid iudicabit eos ei. Ad vix hinc numquam, unum utamur vix cu. Natum ubique vocibus ei duo.

Quo te eius propriae voluptatibus, id nostro forensibus signiferumque est, graecis senserit gloriatur per et. Eu munere facete scripserit vel, cu vim laudem noster. His cu dictas prodesset, aliquando constituto reprimique sed ex. Mazim decore imperdiet ut vel. Scripta delicatissimi an pro, quo ex porro nominati.

Δε τις έργων κανείς γραμμή, κι χρόνου φακέλους γειτονιάς ήδη. Σε πακέτων επενδυτής λες, που κρατήσουν επεξεργασία δε, μια ακούσει ξεχειλίζει παρατηρούμενη τα. Νέων πετάνε συνηθίζουν δε για', στη να σωστά ευκολότερο βαθμολόγησε. Απλό τέτοιο διοίκηση στα μη, δε νέο τότε περιβάλλον, οι δώσε απαρατήρητο των.

Βαθμολόγησε επιδιόρθωση επιχειρηματίες αν ματ, σου δεδομένη αγοράζοντας δωροδοκηθούν με, τα ένα αναφορά βουτήξουν. Αν καρέκλα υποψήφιο εξαρτάται όσο, και δε τεράστιο προκύπτουν σημαντικός, αν ναι καθώς διορθώσει. Υόρκη ιδιαίτερα τη άρα, τα λοιπόν παράγοντες μας. Πόρτες γραμμές σκεφτείς λες με. Σου βγήκε αρχεία δε. Και εφαμοργής κακόκεφος δε.

$startTime = microtime(true);

for ($i = 1; $i <= 10; $i++)
{
    $ii = new Ngram();
}

$finishTime = microtime(true);
echo 'Took: ' . ($finishTime - $startTime) . 'seconds';

Context

StackExchange Code Review Q#3244, answer score: 6

Revisions (0)

No revisions yet.