Paragraph Matching in Python
Problem
So I developed some code as part of a larger project. I came upon a problem with how to match paragraphs, and wasn't sure how to proceed, so I asked on Stack Overflow here. You can find an in-depth description of my problem there if you're curious.
Just to be clear, I am not reposting the same question here to get an answer.
I came up with a solution to my own problem, but I'm unsure of limits/pitfalls, and here seems like the perfect place for that.
The short version of the explanation is this: I have two strings, one of which is a revised version of the other. I want to generate markup and preserve the paragraph spacing, so I need to correlate the lists of paragraphs in each, match them, and then mark the remaining ones as either new or deleted.
So I have a function (paraMatcher()) which matches paragraphs and returns a list of tuples as follows:
- (num1, num2) means that the best match for revised paragraph num1 is original paragraph num2
- (num, '+') means that there is no match for revised paragraph num, so it must be new (designated by the '+')
- (num, '-') means that no revised paragraph was matched to original paragraph num, so it must have been deleted (designated by the '-')

So without further ado, here is my function:
```
import difflib

def paraMatcher(orParas, revParas):
    THRESHOLD = 0.75
    matchSet = []
    shifter = 0
    for revPara in revParas:
        print "Checking revPara ", revParas.index(revPara)
        matchTuples = [(difflib.SequenceMatcher(a=orPara,b=revPara).ratio(), orParas.index(orPara)) for orPara in orParas]
        print "MatchTuples: ", matchTuples
        if matchTuples:
            bestMatch = sorted(matchTuples, key = lambda tup: tup[0])[-1]
            print "Best Match: ", bestMatch
            if bestMatch[0] > THRESHOLD:
                orParas.pop(bestMatch[1])
                print orParas
                matchSet.append((revParas.index(revPara), bestMatch[1] + shifter))
                shifter += 1
```
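For readers unfamiliar with difflib, the similarity score that paraMatcher compares against THRESHOLD can be reproduced in isolation. The following is only a minimal sketch in Python 3; the helper name similarity and the sample paragraphs are invented for illustration and do not come from the question.

```python
import difflib

THRESHOLD = 0.75  # same cutoff as in paraMatcher


def similarity(original, candidate):
    """Hypothetical helper: similarity ratio in [0, 1] between two strings."""
    return difflib.SequenceMatcher(a=original, b=candidate).ratio()


# Made-up sample paragraphs, not taken from the question.
original = "The committee will meet on Monday to review the budget."
lightly_edited = "The committee will meet on Tuesday to review the budget."
unrelated = "Quantum"

for candidate in (lightly_edited, unrelated):
    ratio = similarity(original, candidate)
    verdict = "match" if ratio > THRESHOLD else "no match"
    print(round(ratio, 2), verdict)
```

A lightly edited paragraph scores well above the cutoff, while an unrelated one scores far below it, which is the distinction the function relies on.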
Solution
A few tips for leaner code:
- Use enumerate while iterating when you need the index, to avoid the sequential search and extra verbosity of orParas.index(orPara). For example, the last loop becomes:

      for orIndex, orPara in enumerate(orParas):
          matchSet.insert(orIndex + shifter, (orIndex + shifter, "-"))

- max(matchTuples) achieves the same as sorted(matchTuples, key = lambda tup: tup[0])[-1] and avoids sorting. You could even give a key argument to max, but tuples are compared item by item anyway, and here the second item is an ascending integer, so including it in the sort key does not change the order.
- Unpacking bestRatio, bestIndex = max(matchTuples) improves readability, as you can use bestRatio instead of bestMatch[0] in the code that follows.
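The three tips can be combined in one place. This is a rough sketch in Python 3 rather than a rewrite of the full function: the helper name best_match and the sample paragraphs are assumptions made for illustration, and the caller is assumed to pass a non-empty list (the original guards this with if matchTuples).

```python
import difflib

THRESHOLD = 0.75  # same cutoff as in the question


def best_match(rev_para, or_paras):
    """Hypothetical helper: (ratio, index) of the original paragraph most similar to rev_para.

    Assumes or_paras is non-empty; max() raises ValueError on an empty list.
    """
    match_tuples = [
        (difflib.SequenceMatcher(a=or_para, b=rev_para).ratio(), or_index)
        # enumerate supplies the index directly, so no or_paras.index() search
        for or_index, or_para in enumerate(or_paras)
    ]
    # max compares tuples item by item: ratio first, then index on ties
    return max(match_tuples)


# Made-up sample data, not taken from the question.
or_paras = ["The quick brown fox.", "Jumps over the lazy dog."]
rev_para = "The quick brown foxes."

# Unpacking gives each part of the result a readable name
best_ratio, best_index = best_match(rev_para, or_paras)
if best_ratio > THRESHOLD:
    print("best match is original paragraph", best_index)
```

Keeping the index as the second tuple item is what makes the plain max call behave like the original sorted(...)[-1]: on equal ratios both pick the entry with the highest index.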
Code Snippets
for orIndex, orPara in enumerate(orParas):
    matchSet.insert(orIndex + shifter, (orIndex + shifter, "-"))
Context
StackExchange Code Review Q#41167, answer score: 4