
Splitting text into n-grams and analyzing statistics on them


Problem

I have made an algorithm that splits text into n-grams (collocations) and counts probabilities and other statistics of these collocations. When the file is more than 50 megabytes it takes a long time to run; maybe someone can help me improve it.

```python
import math
import re
import csv
from itertools import zip_longest
from datetime import datetime

def tokenize(input_file, encoding):
    lst = []
    with open(input_file, 'r', encoding=encoding) as f:
        for sent in f:
            sent = sent.lower()
            sent = re.sub("[A-z0-9\'\"\|\/\+\#\,\)\(\?\!\B\-\:\=\;\.\«\»\—\@]", '', sent)
            sent = re.findall('\w+', sent)
            for word in sent:
                lst.append(word)
    return lst

def ngrams_split(lst, n):
    counts = dict()
    grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
    for gram in grams:
        if gram not in counts:
            counts[gram] = 1
        else:
            counts[gram] += 1
    return counts

def list_add(counts):
    ngrams = []
    for key, val in counts.items():
        ngrams.append((val, key))
    return ngrams

def gram_add(lst, n):
    ng = []
    grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
    for gram in grams:
        ng.append(gram)
    return ng

def two_gram_count(input_file, encoding, n_filter, n):
    output_file = []
    lst = tokenize(input_file, encoding) #tokenize
    n_words = len(lst)
    counts = ngrams_split(lst, n) #splitting into ngrams
    ngrams = list_add(counts) #adding ngrams to list
    for key, val in ngrams:
        if int(key) >= n_filter:
            ngram_freq = math.log(key/n_words)
            num = key*n_words
            f1 = lst.count(val.split()[0])
            f2 = lst.count(val.split()[1])
            mi = math.pow(math.log(num/(f1*f2), 10), 2)
            ngram_prob = math.log(key/f1, 10)
            output_file.append((ngram_freq, mi, ngram_prob, key, val))
    return output_file

def three_gram_count(input_file, encoding, n_filter, n):
    outp
```

Solution

Code organisation

Your code is split into small-ish functions, which is good. However, there is something that could easily be improved: you could move the code that actually does something (as opposed to merely defining things) behind an if __name__ == "__main__": guard.
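A minimal sketch of what that guard looks like; the run function and the file names here are placeholders, not taken from the original script:

```python
def run(input_file, output_file):
    # placeholder for the real work: tokenize, count n-grams, write results
    return f"{input_file} -> {output_file}"

if __name__ == "__main__":
    # This branch only executes when the script is run directly,
    # not when the module is imported (e.g. by tests).
    print(run("input.txt", "output.csv"))
```

With the guard in place, other scripts and test suites can import the functions without triggering the file processing.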

User interface

Your code contains hardcoded file paths. Obviously, this doesn't work on my machine without changing the code. The right way to handle this is probably to use command-line arguments to provide the paths for the input file and the output file. Alternatively, you could get rid of the input and/or output file entirely, read from standard input, write to standard output, and assume the user knows enough about redirection to use your script.
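A minimal sketch of command-line arguments using argparse; the argument names and defaults here are illustrative, not from the original script:

```python
import argparse

def parse_args(argv=None):
    # argv=None makes argparse read sys.argv; passing a list eases testing.
    parser = argparse.ArgumentParser(description="Count n-gram statistics.")
    parser.add_argument("input_file", help="path to the text file to analyse")
    parser.add_argument("output_file", help="path for the results file")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input file")
    parser.add_argument("-n", type=int, default=2,
                        help="n-gram size (2 for bigrams, 3 for trigrams, ...)")
    return parser.parse_args(argv)
```

Because parse_args accepts an explicit argument list, the interface itself can be unit-tested without touching sys.argv.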

Do not repeat yourself

Your code contains a lot of duplicated code which makes it hard to read and/or to maintain.

In your code, it may also make things less efficient than they could be, because the file may be read and tokenized multiple times instead of only once.

For instance, we could imagine providing the list of tokens to the xxx_grams_count functions instead of filename and encoding. This could make things more straightforward but also easier to test.

Most of your functions do many things and do not respect the Single Responsibility Principle.
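A sketch of that refactoring, with a simplified stand-in for the counting function (the helper bodies here are illustrative, and the n-gram range is assumed to include the final n-gram):

```python
def ngrams_split(tokens, n):
    # count each space-joined n-gram in the token list
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = ' '.join(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

def two_gram_count(tokens, n_filter, n=2):
    # takes the already-tokenized list, so the file is read only once
    counts = ngrams_split(tokens, n)
    return [(val, key) for key, val in counts.items() if val >= n_filter]

tokens = "the cat sat on the cat".split()
print(two_gram_count(tokens, n_filter=2))  # [(2, 'the cat')]
```

Passing the token list also makes the function trivially testable: no files are needed to exercise the counting logic.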

Variable names

Your variable names are in general not very descriptive. lst for a list of tokens is not good, especially because you handle multiple other lists; tokens seems to be a better idea.

Also, output_file for a list of frequencies is a terrible name: it is not even slightly related to a file (except for the fact that the returned value might eventually be written to one). output_tuples, output_freq or ngram_count might be slightly less confusing. You could probably find an even better name, as you know more about the code than I do.

Comments

It is a good thing to have comments in your code, but if a comment just states the obvious, it adds noise without adding any value.

For instance, this comment lst = tokenize(input_file, encoding) #tokenize doesn't add anything.

Also, it may be a good idea to add docstrings to your functions describing their inputs, outputs and behavior.
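A sketch of what such a docstring could look like on one of the functions (the body is a simplified stand-in for the original, with the n-gram range assumed to include the final n-gram):

```python
def ngrams_split(tokens, n):
    """Count n-gram occurrences in a token list.

    Args:
        tokens: list of word strings, in text order.
        n: size of the n-grams (2 for bigrams, 3 for trigrams, ...).

    Returns:
        dict mapping each space-joined n-gram to its occurrence count.
    """
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = ' '.join(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts
```

Docstrings also show up in help(ngrams_split), which makes the module easier to explore interactively.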

Useless list manipulation

You are performing many useless list manipulations.

In:

def gram_add(lst, n):
    ng = []
    grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
    for gram in grams:
        ng.append(gram)
    return ng


you are creating a new list, iterating over each item of that list to add it to an other list that you can return.

This could just be:

def gram_add(lst, n):
    return [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]


Pretty similarly, in:

def n_grams_stat(input_file, encoding, n_filter, n):
    ngram_count = []
    tokens = tokenize(input_file, encoding)
    if n == 2:
        for i in two_gram_count(tokens, n_filter, n):
            ngram_count.append(i)
    elif n == 3:
        for i in three_gram_count(tokens, n_filter, n):
            ngram_count.append(i)
    elif n == 4:
        for i in four_grams_count(tokens, n_filter, n):
            ngram_count.append(i)
    return ngram_count


you are getting a list, iterating over each item of that list to add it to an other list that you can return.

This could be written:

def n_grams_stat(input_file, encoding, n_filter, n):
    tokens = tokenize(input_file, encoding)
    if n == 2:
        return two_gram_count(tokens, n_filter, n)
    elif n == 3:
        return three_gram_count(tokens, n_filter, n)
    elif n == 4:
        return four_grams_count(tokens, n_filter, n)
    return []


List comprehension

You seem to be aware of the existence of list comprehensions, but you haven't used them in a few places where they could be quite useful. For instance:

def list_add(counts):
    ngrams = []
    for key, val in counts.items():
        ngrams.append((val, key))
    return ngrams


could just be:

def list_add(counts):
    return [(val, key) for key, val in counts.items()]


This version still has three problems: the name of the function is pretty bad (we don't "add" anything to any list); the name of the parameter is just as bad (counts suggests a list of numbers, not a dictionary; maybe dict_ would be a better name); and the function can be simplified even further.

I'll explain the third point, which makes the first two irrelevant. What the function actually does is consume a list (or a view, depending on the Python version, but that doesn't matter here) of (key, val) pairs and return a list of (val, key) pairs. But at the end of the day you don't really care about the order: you could just swap the variable names when you iterate over the result.

Thus, changing for key, val in ngrams: to for val, key in ngrams:, you could write:

def list_add(dict_):
    return dict_.items()


which makes the function useless.
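To illustrate the swap with concrete values (the counts dictionary below is made up for the example): iterating counts.items() directly with the variable names exchanged reproduces the (count, ngram) view the original loop expected, with no intermediate list at all.

```python
counts = {'the cat': 3, 'cat sat': 1}

filtered = []
# items() yields (ngram, count); swapping the names keeps the original
# loop body unchanged: key is the count, val is the n-gram string.
for val, key in counts.items():
    if key >= 2:
        filtered.append((key, val))

print(filtered)  # [(3, 'the cat')]
```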

Using the correc


Context

StackExchange Code Review Q#124245, answer score: 8
