HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Finding duplicate strings within a file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fileduplicatewithinfindingstrings

Problem

This code basically finds duplicate strings within a file.


An example: DUPLICATE_LENGTH set to 6, file contains:

process hash michael honeycomb interrupt system call deadlock scheduling michael




The output will be michael, as its a duplicate with a length of 6 or
higher

The following code shows my approach to solving this issue. If there are completely different ideas how to solve it, I'm open too, but primarily I'd like to get some feedback of the code I wrote.

I'm also searching performance optimizations as it gets kinda slow on large files.

import codecs

# Constants
DUPLICATE_LENGTH = 30;
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True

def showDuplicate(duplicate):
    print("Duplicate found:{0}".format(duplicate))

fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()

substringList = []  #contains all possible duplicates.
duplicatesList = [] #contains all duplicates

for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
    end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]
    if duplicate in substringList and '|' not in duplicate:
        duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)

resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
for i in range(1, len(duplicatesList)):
    if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)

        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1

if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)

if not resultList:
    print "Awesome, no duplicates were found!"

Solution

Firstly, your code doesn't actually work. Run it against your example data and DUPLICATE_LENGTH. It has no results. So I'm going to look at your code stylistically and ignore the algorithm for now:

import codecs

# Constants
DUPLICATE_LENGTH = 30;


No need for that semicolon

FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True

def showDuplicate(duplicate):


python style guide recommends seperation_by_underscores for function names

print("Duplicate found:{0}".format(duplicate))

fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()


Python style guide recommends words_seperated_by_underscores for variable names.

substringList = []  #contains all possible duplicates.
duplicatesList = [] #contains all duplicates

for i in range(0, len(fileContent) - DUPLICATE_LENGTH):


No need for the 0, range(x) == range(0, x)

end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]


I'd combine those two lines. I'd also not call it a duplicate as its not a duplicate, just a substring.

if duplicate in substringList and '|' not in duplicate:


I not clear what significance the | has

duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)

resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1


Any sort of complex logic should really be in a function, not at the module level.

for i in range(1, len(duplicatesList)):


Do you really need the indexes? Maybe you should use for duplicate in duplicatesList[1:]

if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)

        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1


I dislike the structure of this code. You are manipulating several variables across loop iterations which makes the code hard to follow. But since the algorithm is simply incorrect for what you are doing I can't really tell you how to fix it.

if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)

if not resultList:
    print "Awesome, no duplicates were found!"


Ok, now for the actual algorithm. I think wilberforce has misinterpreted what you want. His algorithm requires the duplication to be aligned, which I don't think you want.

The simple brute force strategy is as follows:

# look at every position where the duplicates could start
for x_idx in range(len(file_content)):
    for y_idx in range(x_idx):
        # count the length of the duplicate
        for length, (x_letter, y_letter) in enumerate(zip(file_content[x_idx:], file_content[y_idx:])):
            if x_letter != y_letter:
                break
        # if its good enough, print it
        if length > DUPLICATE_LENGTH:
            showDuplicate(file_content[x_idx:x_idx + length])


A more efficient and clever algorithm is to use dynamic programming:

# we create 2D list (list of lists) where lengths[x][y] will be length
# of the duplicate which ends on letter x-1 and y-1 in the file content
# so we will calculate the length of duplicate for every combination of two positions
file_size = len(file_content) + 1
lengths = [ [None] * file_size for x in range(file_size) ]

# If we are before the characters, the length of the duplicate string is 
# obviously zero
for x in range(file_size):
    lengths[x][0] = 0
    lengths[0][x] = 0

for x in range(1, file_size):
    # we don't check cases where x == y or y > x
    # x == y is trivial and y > x is just
    # the transpose
    for y in range(1, x):
        if file_content[x-1] == file_content[y-1]:
            # if two letters match, the duplicate length
            # is 1 plus whatever we had before
            duplicate_length = lengths[x][y] = lengths[x-1][y-1] + 1
            if duplicate_length > DUPLICATE_LENGTH:
                showDuplicate(file_content[x-duplicate_length:x])
        else:
            # if the files don't match the duplicate ends here
            lengths[x][y] = 0


My two programs produce slightly different results. The first says:

Duplicate found: michael
Duplicate found:michael


The second says:

Duplicate found: michae
Duplicate found: michael


The difference is because the actual match is 7 characters long. As a result, it matches twice. I don't know what results you actually do want, so I haven't looked into it.

Code Snippets

import codecs

# Constants
DUPLICATE_LENGTH = 30;
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True


def showDuplicate(duplicate):
print("Duplicate found:{0}".format(duplicate))



fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()
substringList = []  #contains all possible duplicates.
duplicatesList = [] #contains all duplicates


for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]

Context

StackExchange Code Review Q#7213, answer score: 5

Revisions (0)

No revisions yet.