HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Minimum window substring

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
windowminimumsubstring

Problem

The minimum window substring problem from leetcode.com asks:


Given a string S and a string T, find the minimum window in S which will contain all the characters in T in complexity \$O(n)\$.


For example,


S = "ADOBECODEBANC"

T = "ABC"


Minimum window is "BANC".

I post two versions of code. The differences between two versions of code is, the first one will move the start index when there is a match, the 2nd version will not move the start index when there is a move.

Wondering if logics are both correct? If both have the correct logic, which one do you think has better performance? I have did basic testing and post my test cases below. Using Python 2.7.

```
from collections import defaultdict
import sys

def minWindow(source, pattern):
targetCount = defaultdict(int)
currentCount = defaultdict(int)
for ch in pattern:
targetCount[ch] += 1
start = 0
validCount = 0
minSize = sys.maxint
for index in range(len(source)):
if source[index] not in targetCount:
continue
if currentCount[source[index]] targetCount[source[start]]:
currentCount[source[start]] -= 1
start += 1
else:
minSize = min(minSize, index-start+1)
currentCount[source[start]] -= 1
validCount -= 1
start += 1
return (minSize, start-1)

def minWindowv2(source, pattern):
targetCount = defaultdict(int)
currentCount = defaultdict(int)
for ch in pattern:
targetCount[ch] += 1
start = 0
validCount = 0
minSize = sys.maxint
for index in range(len(source)):
if source[index] not in targetCount:
continue
if currentCount[source[index]] targetCount[source[start]]:
currentCount[source[start]] -= 1
start += 1
else:
minSize = min(minSize, index-start+1)
break
return (minSize, start)

Solution


  1. Review



I don't like comparative reviews, so I only looked in detail at the first version of your code. Possibly some of these comments apply to the other version too.

-
The code does not return the correct answer if there are repeated letters in pattern. For example, given S = 'ABCA' and T = 'AA', the minimum window should be the whole string. But:

>>> minWindow('ABCA', 'AA')
(1, 3)


To fix this, use len(pattern) instead of len(targetCount).

-
It would be better to raise an exception if no window is found rather than returning the meaningless result (4294967295, -1).

-
sys.maxint is not portable to Python 3. If you need a starting value that's bigger than any integer, use float('inf'). But it's possible to reorganize the code so that there's no need for a starting value — see the revised code below.

-
Instead of defaultdict(int), use Counter, which you can initialize directly from an iterable:

targetCount = Counter(pattern)


-
Instead of iterating over indexes into source, use enumerate:

for end, c in enumerate(source):
    if c not in targetCount:
        continue
    if currentCount[c] < targetCount[c]:
        validCount += 1
    currentCount[c] += 1
    # ... and so on ...


This avoids lots of lookups of source[index].

-
There's repetition of lines like start += 1 and currentCount[source[start]] -= 1. It would be easy to reorder the code so that each line appears only once.

-
Similarly, instead of maintaining and incrementing a start index, create an iterable that enumerates the source string:

window_start = enumerate(source)


and call next each time you need another character from it:

start, c = next(window_start)
if c not in targetCount:
    continue
# ... and so on ...


This avoids lots of lookups of source[start].

-
The inner loop is:

while validCount == len(targetCount):


but validCount does not change on every loop iteration (only on loop iterations where the start character causes one of the current counts to drop below its target). So some of these tests are wasted. Instead, you could loop using:

for start, c in window_start:


and then exit the loop only after reducing validCount:

validCount -= 1
    if validCount < len(pattern):
        break


-
The test:

if source[index] not in targetCount:
    continue


is not necessary — the algorithm works just as well without it. And similarly for the corresponding test in the inner loop.

  1. Revised code



In this implementation I've tried to emphasize the symmetry between the outer and the inner loop. Really what we have here is a pair of coroutines, each iterating over the input, and yielding to the other one at the appropriate point.

from collections import Counter

def min_window(source, pattern):
    """Return the shortest substring of source that contains all the
    characters in pattern (with the required multiplicity). Raise
    ValueError if there is no such substring.

    """
    def windows():
        pattern_length = len(pattern)
        target = Counter(pattern) # character -> required count
        current = Counter()       # character -> count in current window
        matched = 0               # matched characters in current window
        window_start, window_end = enumerate(source), enumerate(source)
        for end, c in window_end:
            current[c] += 1
            if current[c] <= target[c]:
                matched += 1
                if matched == pattern_length:
                    for start, c in window_start:
                        current[c] -= 1
                        if current[c] < target[c]:
                            matched -= 1
                            if matched < pattern_length:
                                yield end + 1 - start, start
                                break
    length, start = min(windows())
    return source[start: start + length]

Code Snippets

>>> minWindow('ABCA', 'AA')
(1, 3)
targetCount = Counter(pattern)
for end, c in enumerate(source):
    if c not in targetCount:
        continue
    if currentCount[c] < targetCount[c]:
        validCount += 1
    currentCount[c] += 1
    # ... and so on ...
window_start = enumerate(source)
start, c = next(window_start)
if c not in targetCount:
    continue
# ... and so on ...

Context

StackExchange Code Review Q#132227, answer score: 4

Revisions (0)

No revisions yet.