patternpythonMinor
Splitting text into n-grams and analyzing statistics on them
Viewed 0 times
gramsintotextanalyzingsplittingstatisticsandthem
Problem
I have made the algorithm that split text into n-grams (collocations) and it counts probabilities and other statistics of this collocations. When file is more then 50 megabytes it takes long time to count maybe some one will help to improve it.
``
sent = re.findall('\w+', sent)
for word in sent:
lst.append(word)
return lst
def ngrams_split(lst, n):
counts = dict()
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
if gram not in counts:
counts[gram] = 1
else:
counts[gram] += 1
return counts
def list_add(counts):
ngrams = []
for key, val in counts.items():
ngrams.append((val, key))
return ngrams
def gram_add(lst, n):
ng = []
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
ng.append(gram)
return ng
def two_gram_count(input_file, encoding, n_filter, n):
output_file = []
lst = tokenize(input_file, encoding) #tokenize
n_words = len(lst)
counts = ngrams_split(lst, n) #spliting into ngrams
ngrams = list_add(counts) #ading ngrmas to list
for key, val in ngrams:
if int(key) >= n_filter:
ngram_freq = math.log(key/n_words)
num = key*n_words
f1 = lst.count(val.split()[0])
f2 = lst.count(val.split()[1])
mi = math.pow(math.log(num/(f1*f2), 10), 2)
ngram_prob = math.log(key/f1, 10)
output_file.append((ngram_freq, mi, ngram_prob, key, val))
return output_file
def three_gram_count(input_file, encoding, n_filter, n):
outp
``
import math
import re
import csv
from itertools import zip_longest
from datetime import datetime
def tokenize(input_file, encoding):
lst =[]
with open(input_file, 'r', encoding=encoding) as f:
for sent in f:
sent = sent.lower()
sent = re.sub("[A-z0-9\'\"\|\/\+\#\,\)\(\?\!\B\-\:\=\;\.\«\»\—\@]", '', sent)sent = re.findall('\w+', sent)
for word in sent:
lst.append(word)
return lst
def ngrams_split(lst, n):
counts = dict()
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
if gram not in counts:
counts[gram] = 1
else:
counts[gram] += 1
return counts
def list_add(counts):
ngrams = []
for key, val in counts.items():
ngrams.append((val, key))
return ngrams
def gram_add(lst, n):
ng = []
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
ng.append(gram)
return ng
def two_gram_count(input_file, encoding, n_filter, n):
output_file = []
lst = tokenize(input_file, encoding) #tokenize
n_words = len(lst)
counts = ngrams_split(lst, n) #spliting into ngrams
ngrams = list_add(counts) #ading ngrmas to list
for key, val in ngrams:
if int(key) >= n_filter:
ngram_freq = math.log(key/n_words)
num = key*n_words
f1 = lst.count(val.split()[0])
f2 = lst.count(val.split()[1])
mi = math.pow(math.log(num/(f1*f2), 10), 2)
ngram_prob = math.log(key/f1, 10)
output_file.append((ngram_freq, mi, ngram_prob, key, val))
return output_file
def three_gram_count(input_file, encoding, n_filter, n):
outp
Solution
Code organisation
Your code seems to be splitted into small-ish functions which is good. However, there is something that could easily be improved : you could move your code actually doing something (by opposition to merely define things) behind an
User interface
Your code contains hardcoded file paths. Obviously, this doesn't work on my machine without changing the code. The right way to do this is probably to use command-line arguments to be able to provide the path for the input file and for the output file. Alternatively, you could get rid of input and/or output file, just read from standard input and write in standard output and assume that the user knows enough about redirection to use your script.
Do not repeat yourself
Your code contains a lot of duplicated code which makes it hard to read and/or to maintain.
In your code, it may also make it less efficient than it could be because one may read and tokenize the file multiple times instead of only once.
For instance, we could imagine providing the list of tokens to the
Most of your functions do many things and do not respect the Single Responsability Principle.
Variable names
Your variables names are in general not very descriptive.
Also
Comments
It is a good thing to have comments in your code but if the comments is just stating the obvious, it doesn't add any value while adding some noise.
For instance, this comment
Also, it may be a good idea to add docstrings to your function describing their input/output/behavior.
Useless list manipulation
You are performing many useless list manipulations.
In:
you are creating a new list, iterating over each item of that list to add it to an other list that you can return.
This could just be:
Pretty similarly, in :
you are getting a list, iterating over each item of that list to add it to an other list that you can return.
This could be written:
List comprehension
You seem to be aware of the existence of list comprehension but you haven't used them in a few places where it could be quite useful. For instance :
could just be :
also this still has 3 problems: the name of the function seems pretty bad (we don't "add" anything to any list) and the name of the parameter seems just as bad (
I'll explain the third point which will make the 2 first points irrelevant. What you are actually doing in the function is consuming a list (or a view depending on the Python version but it doesn't matter for you) of (key, val) pairs and returning a list of (val, key) pairs but at the end of the day, you don't really care about the order, you could just swap your variable names when you iterate on the result.
Thus, changing :
which makes the function useless.
Using the correc
Your code seems to be splitted into small-ish functions which is good. However, there is something that could easily be improved : you could move your code actually doing something (by opposition to merely define things) behind an
if __name__ == "__main__": guard.User interface
Your code contains hardcoded file paths. Obviously, this doesn't work on my machine without changing the code. The right way to do this is probably to use command-line arguments to be able to provide the path for the input file and for the output file. Alternatively, you could get rid of input and/or output file, just read from standard input and write in standard output and assume that the user knows enough about redirection to use your script.
Do not repeat yourself
Your code contains a lot of duplicated code which makes it hard to read and/or to maintain.
In your code, it may also make it less efficient than it could be because one may read and tokenize the file multiple times instead of only once.
For instance, we could imagine providing the list of tokens to the
xxx_grams_count functions instead of filename and encoding. This could make things more straightforward but also easier to test.Most of your functions do many things and do not respect the Single Responsability Principle.
Variable names
Your variables names are in general not very descriptive.
lst for a list of tokens is not good especially because you handle multiple other lists, tokens seems to be a better idea.Also
output_file for a list of frequencies is a terrible name : it is not even slightly related to file (except for the fact that the value returned might eventually be written in a file). output_tuples or output_freq or ngram_count might be slightly less confusing. You could probably find an even better name as you know more about the code than I do.Comments
It is a good thing to have comments in your code but if the comments is just stating the obvious, it doesn't add any value while adding some noise.
For instance, this comment
lst = tokenize(input_file, encoding) #tokenize doesn't add anything.Also, it may be a good idea to add docstrings to your function describing their input/output/behavior.
Useless list manipulation
You are performing many useless list manipulations.
In:
def gram_add(lst, n):
ng = []
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
ng.append(gram)
return ngyou are creating a new list, iterating over each item of that list to add it to an other list that you can return.
This could just be:
def gram_add(lst, n):
return [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]Pretty similarly, in :
def n_grams_stat(input_file, encoding, n_filter, n):
ngram_count = []
tokens = tokenize(input_file, encoding)
if n == 2:
for i in two_gram_count(tokens, n_filter, n):
ngram_count.append(i)
elif n == 3:
for i in three_gram_count(tokens, n_filter, n):
ngram_count.append(i)
elif n == 4:
for i in four_grams_count(tokens, n_filter, n):
ngram_count.append(i)
return ngram_countyou are getting a list, iterating over each item of that list to add it to an other list that you can return.
This could be written:
def n_grams_stat(input_file, encoding, n_filter, n):
tokens = tokenize(input_file, encoding)
if n == 2:
return two_gram_count(tokens, n_filter, n)
elif n == 3:
return three_gram_count(tokens, n_filter, n)
elif n == 4:
return four_grams_count(tokens, n_filter, n)
return []List comprehension
You seem to be aware of the existence of list comprehension but you haven't used them in a few places where it could be quite useful. For instance :
def list_add(counts):
ngrams = []
for key, val in counts.items():
ngrams.append((val, key))
return ngramscould just be :
def list_add(counts):
return [(val, key) for key, val in counts.items()]also this still has 3 problems: the name of the function seems pretty bad (we don't "add" anything to any list) and the name of the parameter seems just as bad (
counts suggest a list of number, not a dictionnary; maybe dict_ would be a better name), the fact that this function can be simplified more.I'll explain the third point which will make the 2 first points irrelevant. What you are actually doing in the function is consuming a list (or a view depending on the Python version but it doesn't matter for you) of (key, val) pairs and returning a list of (val, key) pairs but at the end of the day, you don't really care about the order, you could just swap your variable names when you iterate on the result.
Thus, changing :
for key, val in ngrams: for for val, key in ngrams:, you could write :def list_add(dict_):
return dict_.items()which makes the function useless.
Using the correc
Code Snippets
def gram_add(lst, n):
ng = []
grams = [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]
for gram in grams:
ng.append(gram)
return ngdef gram_add(lst, n):
return [' '.join(lst[i:i+n]) for i in range(len(lst)-n)]def n_grams_stat(input_file, encoding, n_filter, n):
ngram_count = []
tokens = tokenize(input_file, encoding)
if n == 2:
for i in two_gram_count(tokens, n_filter, n):
ngram_count.append(i)
elif n == 3:
for i in three_gram_count(tokens, n_filter, n):
ngram_count.append(i)
elif n == 4:
for i in four_grams_count(tokens, n_filter, n):
ngram_count.append(i)
return ngram_countdef n_grams_stat(input_file, encoding, n_filter, n):
tokens = tokenize(input_file, encoding)
if n == 2:
return two_gram_count(tokens, n_filter, n)
elif n == 3:
return three_gram_count(tokens, n_filter, n)
elif n == 4:
return four_grams_count(tokens, n_filter, n)
return []def list_add(counts):
ngrams = []
for key, val in counts.items():
ngrams.append((val, key))
return ngramsContext
StackExchange Code Review Q#124245, answer score: 8
Revisions (0)
No revisions yet.