patternpythonMinor
Remove repetitive strings from a given sentence efficiently
Viewed 0 times
efficientlyremoverepetitivesentencefromstringsgiven
Problem
I recently gave a test on HackerRank:
Given a string, return the most concise string that can be formed from it.
For example:
My answer:
My answer is definitely inefficient and I am wondering if you can suggest an alternative way of solving this problem.
Most concise string:
Watson represents a first step into cognitive systems, a new era of computing.|what does watson represent
Given a string, return the most concise string that can be formed from it.
For example:
string = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'- The following string contains many duplicates like "Watson represents." We have to ignore extra spacing between characters or lower/upper case. "watson Represents" and "watson represents" are the same thing.
- Semicolons and commas represent the same thing. For example, "Cognitive Systems; a new era" is present inside "Watson represents a first step into cognitive systems, a new era of computing."
- Your final string should not contain any duplicates. Ignore lowercase/uppercase or extra space if you have any.
My answer:
watson = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
import re
watson = re.sub(r';', r',', watson) #replace the semicolon with colon
watson = watson.strip().split('|')
removeWatson = list(watson)
for i in range(len(watson)):
for j in range(len(watson)):
if j == i:
continue
if " ".join(watson[i].strip().lower().split()) in " ".join(watson[j].strip().lower().split()):
removeWatson[i] = ''
print "|".join(filter(None, removeWatson))My answer is definitely inefficient and I am wondering if you can suggest an alternative way of solving this problem.
Most concise string:
Watson represents a first step into cognitive systems, a new era of computing.|what does watson represent
Solution
Here's my shot at this puzzle.
Benchmark of original code, over 1,000,000 iterations (with
real 1m8.486s
user 0m0.000s
sys 0m0.000s
Benchmark of my code, over 1,000,000 iterations (with
real 0m16.226s
user 0m0.000s
sys 0m0.015s
Performance gain: 422.07%
It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.
The strategy here is to preprocess the strings, instead of going through
Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.
Benchmark of original code, over 1,000,000 iterations (with
print() commented out): real 1m8.486s
user 0m0.000s
sys 0m0.000s
Benchmark of my code, over 1,000,000 iterations (with
print() commented out):real 0m16.226s
user 0m0.000s
sys 0m0.015s
Performance gain: 422.07%
It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.
The strategy here is to preprocess the strings, instead of going through
strip(), lower(), and split() each time you need to do a comparison.Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.
WATSON_RAW = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
# 1. Preprocess the string.
# Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')
# 2. Split string by delimiter.
split = modified.split('|')
# 3. Generate string metadata, sorted by length
# Keep track of the string contents, its index in the "split" variable, and its length.
# Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])
# 4. Uniqify the strings.
indexes_of_unique_strings = []
for index, _, string_lowercase1 in metadata:
# Only search from current index on, skipping comparisions with earlier indexes.
for _, _, string_lowercase2 in metadata[i:]:
if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
# This string is redundant.
break
else:
# This string is unique.
indexes_of_unique_strings.append(index)
# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)
print(final)Code Snippets
WATSON_RAW = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
# 1. Preprocess the string.
# Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')
# 2. Split string by delimiter.
split = modified.split('|')
# 3. Generate string metadata, sorted by length
# Keep track of the string contents, its index in the "split" variable, and its length.
# Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])
# 4. Uniqify the strings.
indexes_of_unique_strings = []
for index, _, string_lowercase1 in metadata:
# Only search from current index on, skipping comparisions with earlier indexes.
for _, _, string_lowercase2 in metadata[i:]:
if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
# This string is redundant.
break
else:
# This string is unique.
indexes_of_unique_strings.append(index)
# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)
print(final)Context
StackExchange Code Review Q#108092, answer score: 4
Revisions (0)
No revisions yet.