patternpythonMinor
Finding the shortest excerpt that contains all search terms
Viewed 0 times
theshortestallsearchthatcontainsfindingtermsexcerpt
Problem
Write a function called
returns the shortest snippet of the document, containing all of the
given search terms. The search terms can appear in any order.
My program below is giving the correct answer but the time complexity is very high since I am doing the Cartesian product. If the input is very high then I am not able to clear to those test cases. I am not able to reduce the complexity of this program, and any help will be greatly appreciated.
If anyone wants to clone my program here is code
answer(document, searchTerms) whichreturns the shortest snippet of the document, containing all of the
given search terms. The search terms can appear in any order.
Inputs:
(string) document = "many google employees can program"
(string list) searchTerms = ["google", "program"]
Output:
(string) "google employees can program"
Inputs:
(string) document = "a b c d a"
(string list) searchTerms = ["a", "c", "d"]
Output:
(string) "c d a"My program below is giving the correct answer but the time complexity is very high since I am doing the Cartesian product. If the input is very high then I am not able to clear to those test cases. I am not able to reduce the complexity of this program, and any help will be greatly appreciated.
import itertools
import sys
def answer(document, searchTerms):
min = sys.maxint
matchedString = ""
stringList = document.split(" ")
d = dict()
for j in range(len(searchTerms)):
for i in range(len(stringList)):
if searchTerms[j] == stringList[i]:
d.setdefault(searchTerms[j], []).append(i)
for element in itertools.product(*d.values()):
sortedList = sorted(list(element))
differenceList = [t - s for s, t in zip(sortedList, sortedList[1:])]
if min > sum(differenceList):
min = sum(differenceList)
sortedElement = sortedList
if sum(differenceList) == len(sortedElement) - 1:
break
try:
for i in range(sortedElement[0], sortedElement[len(sortedElement)-1]+1):
matchedString += "".join(stringList[i]) + " "
except:
pass
return matchedStringIf anyone wants to clone my program here is code
Solution
Pythonicity
Your solution feels very procedural and not really Pythonic. I believe that this function accomplishes the same task using roughly the same approach:
Observations:
-
You wrote
$$(n_1 - n_0) + (n_2 - n_1) + (n_3 - n_2) + \ldots + (n_N - n_{N - 1})$$
is a telescoping series where all of the terms cancel except \$-n_0\$ and \$n_N\$.
Bugs
This function only recognizes words that are delimited by spaces. Any other punctuation could cause words to fail to be recognized.
Your algorithm works based on word count rather than by character positions. Therefore, you end up finding the excerpt with the fewest intervening words, which is not necessarily what I would consider to be the "shortest snippet". Example:
I've also written a different solution to solve the same problem. There, as here, I've made no attempt to choose a speedy algorithm, but rather aimed for clarity.
Your solution feels very procedural and not really Pythonic. I believe that this function accomplishes the same task using roughly the same approach:
from itertools import product
def answer(document, search_terms):
words = document.split(" ")
search_term_positions = [
[i for i, word in enumerate(words) if word == term]
for term in search_terms
]
combos = [sorted(combo) for combo in product(*search_term_positions)]
if not combos:
return None # Could not find all search terms
best_combo = min(combos, key=lambda combo: combo[-1] - combo[0])
return " ".join(words[best_combo[0] : best_combo[-1] + 1])Observations:
dis a variable name that suggests that it's adict, but is not much more helpful than that. Instead ofd.values(), I've definedsearch_term_positions, which is a much more descriptive variable name. You don't really even need a dictionary, since you only care about the values, not the keys.
- You can simplify the second
forloop by taking full advantage of the built-inmin()function.
-
You wrote
sum(differenceList) three times! sum(differenceList) is just sortedList[-1] - sortedList[0], because$$(n_1 - n_0) + (n_2 - n_1) + (n_3 - n_2) + \ldots + (n_N - n_{N - 1})$$
is a telescoping series where all of the terms cancel except \$-n_0\$ and \$n_N\$.
sortedElement[len(sortedElement)-1]can be written assortedElement[-1].
- Swallowing all exceptions with
except: passis a no-no.
Bugs
This function only recognizes words that are delimited by spaces. Any other punctuation could cause words to fail to be recognized.
Your algorithm works based on word count rather than by character positions. Therefore, you end up finding the excerpt with the fewest intervening words, which is not necessarily what I would consider to be the "shortest snippet". Example:
>>> answer('many google supercalifragilistic program to google a program', ['google', 'program'])
'google supercalifragilistic program'I've also written a different solution to solve the same problem. There, as here, I've made no attempt to choose a speedy algorithm, but rather aimed for clarity.
Code Snippets
from itertools import product
def answer(document, search_terms):
words = document.split(" ")
search_term_positions = [
[i for i, word in enumerate(words) if word == term]
for term in search_terms
]
combos = [sorted(combo) for combo in product(*search_term_positions)]
if not combos:
return None # Could not find all search terms
best_combo = min(combos, key=lambda combo: combo[-1] - combo[0])
return " ".join(words[best_combo[0] : best_combo[-1] + 1])>>> answer('many google supercalifragilistic program to google a program', ['google', 'program'])
'google supercalifragilistic program'Context
StackExchange Code Review Q#102351, answer score: 4
Revisions (0)
No revisions yet.