patternpythonMinor
Finding duplicate strings within a file
Viewed 0 times
fileduplicatewithinfindingstrings
Problem
This code basically finds duplicate strings within a file.
An example: DUPLICATE_LENGTH set to 6, file contains:
The output will be
higher
The following code shows my approach to solving this issue. If there are completely different ideas how to solve it, I'm open too, but primarily I'd like to get some feedback of the code I wrote.
I'm also searching performance optimizations as it gets kinda slow on large files.
An example: DUPLICATE_LENGTH set to 6, file contains:
process hash michael honeycomb interrupt system call deadlock scheduling michaelThe output will be
michael, as its a duplicate with a length of 6 orhigher
The following code shows my approach to solving this issue. If there are completely different ideas how to solve it, I'm open too, but primarily I'd like to get some feedback of the code I wrote.
I'm also searching performance optimizations as it gets kinda slow on large files.
import codecs
# Constants
DUPLICATE_LENGTH = 30;
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True
def showDuplicate(duplicate):
print("Duplicate found:{0}".format(duplicate))
fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()
substringList = [] #contains all possible duplicates.
duplicatesList = [] #contains all duplicates
for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
end = i + DUPLICATE_LENGTH
duplicate = fileContent[i:end]
if duplicate in substringList and '|' not in duplicate:
duplicatesList.append(duplicate)
else:
substringList.append(duplicate)
resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
for i in range(1, len(duplicatesList)):
if currentMatch[currentMatchPos:] in duplicatesList[i]:
currentMatch += duplicatesList[i][-1]
currentMatchPos += 1
else:
if SHOW_RESULTS_WHILE_PROCESSING:
showDuplicate(currentMatch)
resultList.append(currentMatch) # this match is closed, add it!
currentMatch = duplicatesList[i]
currentMatchPos = 1
if not SHOW_RESULTS_WHILE_PROCESSING:
for duplicate in resultList:
showDuplicate(duplicate)
if not resultList:
print "Awesome, no duplicates were found!"Solution
Firstly, your code doesn't actually work. Run it against your example data and DUPLICATE_LENGTH. It has no results. So I'm going to look at your code stylistically and ignore the algorithm for now:
No need for that semicolon
python style guide recommends seperation_by_underscores for function names
Python style guide recommends words_seperated_by_underscores for variable names.
No need for the 0,
I'd combine those two lines. I'd also not call it a duplicate as its not a duplicate, just a substring.
I not clear what significance the
Any sort of complex logic should really be in a function, not at the module level.
Do you really need the indexes? Maybe you should use
I dislike the structure of this code. You are manipulating several variables across loop iterations which makes the code hard to follow. But since the algorithm is simply incorrect for what you are doing I can't really tell you how to fix it.
Ok, now for the actual algorithm. I think wilberforce has misinterpreted what you want. His algorithm requires the duplication to be aligned, which I don't think you want.
The simple brute force strategy is as follows:
A more efficient and clever algorithm is to use dynamic programming:
My two programs produce slightly different results. The first says:
The second says:
The difference is because the actual match is 7 characters long. As a result, it matches twice. I don't know what results you actually do want, so I haven't looked into it.
import codecs
# Constants
DUPLICATE_LENGTH = 30;No need for that semicolon
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True
def showDuplicate(duplicate):python style guide recommends seperation_by_underscores for function names
print("Duplicate found:{0}".format(duplicate))
fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()Python style guide recommends words_seperated_by_underscores for variable names.
substringList = [] #contains all possible duplicates.
duplicatesList = [] #contains all duplicates
for i in range(0, len(fileContent) - DUPLICATE_LENGTH):No need for the 0,
range(x) == range(0, x)end = i + DUPLICATE_LENGTH
duplicate = fileContent[i:end]I'd combine those two lines. I'd also not call it a duplicate as its not a duplicate, just a substring.
if duplicate in substringList and '|' not in duplicate:I not clear what significance the
| hasduplicatesList.append(duplicate)
else:
substringList.append(duplicate)
resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1Any sort of complex logic should really be in a function, not at the module level.
for i in range(1, len(duplicatesList)):Do you really need the indexes? Maybe you should use
for duplicate in duplicatesList[1:]if currentMatch[currentMatchPos:] in duplicatesList[i]:
currentMatch += duplicatesList[i][-1]
currentMatchPos += 1
else:
if SHOW_RESULTS_WHILE_PROCESSING:
showDuplicate(currentMatch)
resultList.append(currentMatch) # this match is closed, add it!
currentMatch = duplicatesList[i]
currentMatchPos = 1I dislike the structure of this code. You are manipulating several variables across loop iterations which makes the code hard to follow. But since the algorithm is simply incorrect for what you are doing I can't really tell you how to fix it.
if not SHOW_RESULTS_WHILE_PROCESSING:
for duplicate in resultList:
showDuplicate(duplicate)
if not resultList:
print "Awesome, no duplicates were found!"Ok, now for the actual algorithm. I think wilberforce has misinterpreted what you want. His algorithm requires the duplication to be aligned, which I don't think you want.
The simple brute force strategy is as follows:
# look at every position where the duplicates could start
for x_idx in range(len(file_content)):
for y_idx in range(x_idx):
# count the length of the duplicate
for length, (x_letter, y_letter) in enumerate(zip(file_content[x_idx:], file_content[y_idx:])):
if x_letter != y_letter:
break
# if its good enough, print it
if length > DUPLICATE_LENGTH:
showDuplicate(file_content[x_idx:x_idx + length])A more efficient and clever algorithm is to use dynamic programming:
# we create 2D list (list of lists) where lengths[x][y] will be length
# of the duplicate which ends on letter x-1 and y-1 in the file content
# so we will calculate the length of duplicate for every combination of two positions
file_size = len(file_content) + 1
lengths = [ [None] * file_size for x in range(file_size) ]
# If we are before the characters, the length of the duplicate string is
# obviously zero
for x in range(file_size):
lengths[x][0] = 0
lengths[0][x] = 0
for x in range(1, file_size):
# we don't check cases where x == y or y > x
# x == y is trivial and y > x is just
# the transpose
for y in range(1, x):
if file_content[x-1] == file_content[y-1]:
# if two letters match, the duplicate length
# is 1 plus whatever we had before
duplicate_length = lengths[x][y] = lengths[x-1][y-1] + 1
if duplicate_length > DUPLICATE_LENGTH:
showDuplicate(file_content[x-duplicate_length:x])
else:
# if the files don't match the duplicate ends here
lengths[x][y] = 0My two programs produce slightly different results. The first says:
Duplicate found: michael
Duplicate found:michaelThe second says:
Duplicate found: michae
Duplicate found: michaelThe difference is because the actual match is 7 characters long. As a result, it matches twice. I don't know what results you actually do want, so I haven't looked into it.
Code Snippets
import codecs
# Constants
DUPLICATE_LENGTH = 30;FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True
def showDuplicate(duplicate):print("Duplicate found:{0}".format(duplicate))
fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()substringList = [] #contains all possible duplicates.
duplicatesList = [] #contains all duplicates
for i in range(0, len(fileContent) - DUPLICATE_LENGTH):end = i + DUPLICATE_LENGTH
duplicate = fileContent[i:end]Context
StackExchange Code Review Q#7213, answer score: 5
Revisions (0)
No revisions yet.