patternpythonMinor
Identifying "Matching Bigrams" in Large Text Collection
Viewed 0 times
textcollectionbigramslargeidentifyingmatching
Problem
I have a large number of plain text files (north of 20 GB), and I wish to find all "matching" "bigrams" between any two texts in this collection. More specifically, my workflow looks like this: for each text, for each sentence in that text, for each possible combination of two non-stop-words ("bigram") in that sentence, identify all texts in which that "bigram" appears within a single sentence. (I am working on fuzzy plagiarism detection.)
Because I wish to know whether two given authors in my corpus share a given "bigram," I am currently pursuing the following approach:
```
'''This script reads in a directory of files, and for each of that files, for each sentence in that file, for each non-stop-word in that sentence, for each combination of those words, creates a bigram entry in a table. Each bigram in the bigram table corresponds to a sentence id value,
and these sentence id values correspond to a text id value, which in turn correspond to a filename id value. Using separate tables for each of these values allows us to compress our bigram csv enormously'''
def create_bigram_tables():
#define the name of authors_and_texts_file and specify its sep
authors_and_texts = "authors_and_paths_reduced.csv"
sep = "\t"
from collections import Counter
import string, codecs, nltk, glob, re, itertools, os, errno
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#b = tokenizer.tokenize("sample string")
def clean_text(s):
#to preserve contractions:
s = s.replace("'","")
#re1 = match all consecutive non-alpha, non-space strings of length 1 or more.
re1 = re.compile( "[^a-zA-Z ]+" )
#re2 = match all strings of 2 or more spaces.
re2 = re.compile( " +" )
p = re2.sub( " ", re1.sub( " ", s ) )
p = p.split()
return p
def make_sure_path_exists(path):
try:
os.makedirs(path)
except OSError as exception:
if exception.errno
Because I wish to know whether two given authors in my corpus share a given "bigram," I am currently pursuing the following approach:
```
'''This script reads in a directory of files, and for each of that files, for each sentence in that file, for each non-stop-word in that sentence, for each combination of those words, creates a bigram entry in a table. Each bigram in the bigram table corresponds to a sentence id value,
and these sentence id values correspond to a text id value, which in turn correspond to a filename id value. Using separate tables for each of these values allows us to compress our bigram csv enormously'''
def create_bigram_tables():
#define the name of authors_and_texts_file and specify its sep
authors_and_texts = "authors_and_paths_reduced.csv"
sep = "\t"
from collections import Counter
import string, codecs, nltk, glob, re, itertools, os, errno
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#b = tokenizer.tokenize("sample string")
def clean_text(s):
#to preserve contractions:
s = s.replace("'","")
#re1 = match all consecutive non-alpha, non-space strings of length 1 or more.
re1 = re.compile( "[^a-zA-Z ]+" )
#re2 = match all strings of 2 or more spaces.
re2 = re.compile( " +" )
p = re2.sub( " ", re1.sub( " ", s ) )
p = p.split()
return p
def make_sure_path_exists(path):
try:
os.makedirs(path)
except OSError as exception:
if exception.errno
Solution
- Introduction
This code is not ready to be optimized because it is such a mess. It is hard to read, hard to follow and hard to understand. I've written some detailed comments below, but really I've just scratched the surface.
Your best bet is to rewrite the code so that the algorithm and dataflow are clear, and then you might find it easier to speed it up. If you are still stuck at that point, submit the revised code for review.
(If you resubmit, it would be helpful if you could include enough information for us to be able to run your code: we'll need example data, and your tokenizer.)
- Review
-
The top level of your script consists of a call to
cProfile.run. This can't be right: why would the user of your script want to profile it? I can only presume you put this in there to help you during development and forgot to take it out again. But that's a very error-prone approach: you should get out of the habit of modifying code to debug it, and instead learn how to use the tools. In this case you can run the profiler from the command line:python -m cProfile -o output_file myscript.py-
The code is hard to follow because the lines are so long that we have to scroll the window horizontally to read it. You should endeavour to follow the Python style guide (PEP8), which says,
Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.
Limiting the required editor window width makes it possible to have several files open side-by-side, and works well when using code review tools that present the two versions in adjacent columns.
-
You have nested all your functions inside the
create_bigram_tables function. This makes it hard to write unit tests (or test the code interactively) because there's no way to call clean_text and so on. You should move nested function definitions to top level, unless there's a good reason not to.-
There are no docstrings for your functions. What do they do and how are they supposed to be called? Having well-written specifications of the behaviour of functions makes them easier to use, review and modify. So I'd expect something like:
def clean_text(s):
"""Return a list of the words in the string s, where a "word" is a
consecutive sequence of ASCII letters and apostrophes in s, from
which the apostrophes have been removed.
"""-
I think a better name for this function would be
words: naming a function after the thing it returns often makes code easier to understand.-
Do you really mean to restrict words to ASCII letters? Won't that become a problem later if you want to analyze languages other than English?
-
There's no point in calling
re.compile unless you're going to use the compiled regular expression more than once. Instead of:re1 = re.compile("regexp")
re1.sub(s)just write:
re.sub("regexp", s)-
You manipulate the text by repeatedly calling
str.replace and re.sub. But this is an expensive operation: Python has to build a whole new string in memory each time, and if s is long then so are the replaced strings. It would be more efficient to extract the words you want rather than replacing the non-words. So I would write the function like this:def words(s):
"""Return a list of the words in the string s, where a "word" is a
consecutive sequence of ASCII letters and apostrophes in s, from
which the apostrophes have been removed.
"""
return re.findall(r"[a-zA-Z]+", s.replace("'", ""))This is about 25% faster than
clean_text:>>> s = open('pg2600.txt', encoding='utf8').read()
>>> len(s)
3226622
>>> from timeit import timeit
>>> timeit(lambda:clean_text(s), number=1)
0.39841552218422294
>>> timeit(lambda:words(s), number=1)
timeit(lambda:words(s), number=1)
0.30522726010531187-
In
make_sure_path_exists, you catch all OSError exceptions and re-raise the ones you don't want:try:
os.makedirs(path)
except OSError as exception:
if exception.errno != errno.EEXIST:
raiseIn Python 3, you could catch
FileExistsError:try:
os.makedirs(path)
except FileExistsError:
passBut it would be even better to pass the keyword argument
exist_ok=True to os.makedirs.-
You don't need to write:
make_sure_path_exists(os.getcwd() + "\\tables")because file operations are relative to the current directory. Just write:
make_sure_path_exists("tables")This would also have the advantage of being portable to operating systems other than Microsoft Windows.
-
Similarly, instead of:
os.getcwd() + "\\tables\\" + "bigram_ids.txt"write:
os.path.join("tables", "bigram_ids.txt")-
All those nested
with open(...) as ...: statements push the actual code so far over to the right that it is unreadable in an 80-column window. One way to fix this wouCode Snippets
python -m cProfile -o output_file myscript.pydef clean_text(s):
"""Return a list of the words in the string s, where a "word" is a
consecutive sequence of ASCII letters and apostrophes in s, from
which the apostrophes have been removed.
"""re1 = re.compile("regexp")
re1.sub(s)re.sub("regexp", s)def words(s):
"""Return a list of the words in the string s, where a "word" is a
consecutive sequence of ASCII letters and apostrophes in s, from
which the apostrophes have been removed.
"""
return re.findall(r"[a-zA-Z]+", s.replace("'", ""))Context
StackExchange Code Review Q#57046, answer score: 3
Revisions (0)
No revisions yet.