HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Remove repetitive strings from a given sentence efficiently

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
efficientlyremoverepetitivesentencefromstringsgiven

Problem

I recently gave a test on HackerRank:


Given a string, return the most concise string that can be formed from it.


For example:

string = 'watson  Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into  Cognitive|Cognitive Systems; a new era|what does watson represent'





  • The following string contains many duplicates like "Watson represents." We have to ignore extra spacing between characters or lower/upper case. "watson Represents" and "watson represents" are the same thing.



  • Semicolons and commas represent the same thing. For example, "Cognitive Systems; a new era" is present inside "Watson represents a first step into cognitive systems, a new era of computing."



  • Your final string should not contain any duplicates. Ignore lowercase/uppercase or extra space if you have any.




My answer:

watson = 'watson  Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into  Cognitive|Cognitive Systems; a new era|what does watson represent'

import re

watson = re.sub(r';', r',', watson)  #replace the semicolon with colon
watson = watson.strip().split('|')
removeWatson = list(watson)

for i in range(len(watson)):

    for j in range(len(watson)):

        if j == i:
            continue

        if " ".join(watson[i].strip().lower().split()) in " ".join(watson[j].strip().lower().split()):
            removeWatson[i] = ''

print "|".join(filter(None, removeWatson))


My answer is definitely inefficient and I am wondering if you can suggest an alternative way of solving this problem.

Most concise string:


Watson represents a first step into cognitive systems, a new era of computing.|what does watson represent

Solution

Here's my shot at this puzzle.

Benchmark of original code, over 1,000,000 iterations (with print() commented out):


real 1m8.486s

user 0m0.000s

sys 0m0.000s

Benchmark of my code, over 1,000,000 iterations (with print() commented out):


real 0m16.226s

user 0m0.000s

sys 0m0.015s

Performance gain: 422.07%

It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.

The strategy here is to preprocess the strings, instead of going through strip(), lower(), and split() each time you need to do a comparison.
Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.

WATSON_RAW = 'watson  Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into  Cognitive|Cognitive Systems; a new era|what does watson represent'

# 1. Preprocess the string.
#    Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')

# 2. Split string by delimiter.
split = modified.split('|')

# 3. Generate string metadata, sorted by length
#    Keep track of the string contents, its index in the "split" variable, and its length.
#    Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])

# 4. Uniqify the strings.
indexes_of_unique_strings = []

for index, _, string_lowercase1 in metadata:
    # Only search from current index on, skipping comparisions with earlier indexes.
    for _, _, string_lowercase2 in metadata[i:]:
        if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
            # This string is redundant.
            break
    else:
        # This string is unique.
        indexes_of_unique_strings.append(index)

# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)

print(final)

Code Snippets

WATSON_RAW = 'watson  Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into  Cognitive|Cognitive Systems; a new era|what does watson represent'

# 1. Preprocess the string.
#    Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')


# 2. Split string by delimiter.
split = modified.split('|')


# 3. Generate string metadata, sorted by length
#    Keep track of the string contents, its index in the "split" variable, and its length.
#    Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])


# 4. Uniqify the strings.
indexes_of_unique_strings = []

for index, _, string_lowercase1 in metadata:
    # Only search from current index on, skipping comparisions with earlier indexes.
    for _, _, string_lowercase2 in metadata[i:]:
        if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
            # This string is redundant.
            break
    else:
        # This string is unique.
        indexes_of_unique_strings.append(index)


# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)

print(final)

Context

StackExchange Code Review Q#108092, answer score: 4

Revisions (0)

No revisions yet.