Counting the number of elements in a sorted array
Problem
Suppose I have a sorted array, and each element may appear multiple times.
To make it simple, here is an example:

```
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

The expected output is the number of each element. In this example, the results are:

```
number of 1 is 17
number of 2 is 6
number of 3 is 9
```

I want to find a solution where the algorithm time complexity is lower than \$O(n)\$. My idea is similar to binary search:

- When I meet equal neighbours, I jump the tail pointer ahead by 1, by 2, by 4, and so on, and at the same time move the cur pointer to the old tail position
- When I meet an unequal pair, I shrink the tail pointer back to the midpoint between cur and tail
- The difference between cur and the previous element's tail is the count of the current element

Any advice on algorithm time complexity improvement, code bugs or general code style is appreciated.

```
from collections import defaultdict
import random


def generate_count(numbers):
    last_index = -1
    result = defaultdict(int)
    cur_index = last_index + 1
    tail_index = cur_index + 1
    while cur_index < len(numbers):
        while cur_index < tail_index and cur_index < len(numbers) and tail_index < len(numbers):
            if numbers[cur_index] == numbers[tail_index]:
                pre = cur_index
                cur_index = tail_index
                tail_index += (tail_index - pre)
            else:
                tail_index = (cur_index + tail_index) / 2
        result[numbers[cur_index]] = cur_index - last_index
        last_index = cur_index
        cur_index = last_index + 1
        tail_index = cur_index + 1
    return result


if __name__ == "__main__":
    numbers = []
    for i in range(random.randint(1, 20)):
        numbers.append(1)
    print 'number of 1 is', i + 1
    for i in range(random.randint(1, 20)):
        numbers.append(2)
    print 'number of 2 is', i + 1
    for i in range(random.randint(1, 20)):
        numbers.append(3)
    print 'number of 3 is', i + 1
    print generate_count(numbers)
```
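As a side note on the approach itself: the same run-skipping idea can also be expressed with the standard library's `bisect` module, which performs the binary search for the end of each run directly. This is only an illustrative sketch (written for Python 3, unlike the original post's Python 2 code), not part of the question:

```python
from bisect import bisect_right

def count_runs(numbers):
    # Count occurrences of each value in a sorted list.
    # One binary search per distinct value: O(k log n) for k distinct values.
    counts = {}
    i = 0
    while i < len(numbers):
        # bisect_right(numbers, numbers[i], i) is the index just past the
        # last copy of numbers[i], i.e. the end of the current run.
        j = bisect_right(numbers, numbers[i], i)
        counts[numbers[i]] = j - i
        i = j
    return counts

print(count_runs([1] * 17 + [2] * 6 + [3] * 9))  # {1: 17, 2: 6, 3: 9}
```

On the example input above this does three binary searches instead of one scan, which is where any sub-linear behaviour would have to come from.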
Solution
> I want to find a solution [whose] algorithm time complexity is lower than \$O(n)\$.

You'll have to be more specific. What you asked for literally isn't possible; but sometimes people use "big-O" notation loosely, either out of laziness or because they haven't yet been exposed to the notation expressing their intended precise gradations of meaning.

Notice that if I give you the input

```
1 2 3 4 5 6 7 8 9 10 11 12 13 ...
```

then you must output

```
number of 1 is 1
number of 2 is 1
number of 3 is 1
number of 4 is 1
...
```

— that is, if the size of the input is \$n\$, then the worst-case size of the output is \$O(n)\$. Printing those \$O(n)\$ bytes of output will take \$O(n)\$ time just by itself. Anything else the program does (besides print its output) can only make the overall program slower, not faster.

On the above worst-case input, your "repeated binary search" algorithm runs in \$O(n \log n)\$ time, which is worse than \$O(n)\$.

Now, if you're not concerned about worst-case inputs, but merely with the average input... well, then you'd have to define a notion of "the average input". You'll have to come up with some model of what kinds of input you expect. One way to do this is to come up with an algorithm for generating inputs, and then treat it as your model. For example:

```
def get_random_input(n):
    return sorted([random.randrange(n) for i in xrange(n)])
```

Or:

```
def get_random_input(n):
    return sorted([random.randrange(100) for i in xrange(n)])
```

On these two different models, your algorithm performs very differently. On the latter, you do well: \$O(\log n)\$. On the former, you do poorly: \$O(n \log n)\$.

This is kind of like asking for a compression algorithm that compresses all possible inputs. You can get compression for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting" and which can safely be blown up. In this case, you can get sub-\$O(n)\$ performance for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting", which is to say, deciding on a model for your "average" or "expected" input.

EDIT: My analogy above is flawed, as indicated by LinMa's comments below this answer. See, if you can get \$O(\log n)\$ performance on your "expected" inputs while blowing up the "unexpected" inputs by only a constant factor (e.g. "\$O(n)\$ plus a constant number of comparisons" or "\$O(n)\$ times two"), then you can keep your algorithm's worst-case big-O "constant" while improving the best-case.

Hope this helps!
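To see the input-model point in running code, here is a small sketch (my own illustration, not the answerer's) that instruments a bisect-based variant of the question's run-skipping idea and counts the binary searches it performs under the two models above — the search count is exactly \$k\$, the number of distinct values:

```python
import random
from bisect import bisect_right

def count_runs_instrumented(numbers):
    # Returns (counts, searches): one binary search per distinct value.
    counts, searches, i = {}, 0, 0
    while i < len(numbers):
        j = bisect_right(numbers, numbers[i], i)  # end of the current run
        searches += 1
        counts[numbers[i]] = j - i
        i = j
    return counts, searches

n = 10000
# Model 1: about n distinct values -> roughly n searches, O(n log n) overall.
many_distinct = sorted(random.randrange(n) for _ in range(n))
# Model 2: at most 100 distinct values -> at most 100 searches, O(log n) overall.
few_distinct = sorted(random.randrange(100) for _ in range(n))

_, s_many = count_runs_instrumented(many_distinct)
_, s_few = count_runs_instrumented(few_distinct)
print(s_many, s_few)  # s_many is in the thousands; s_few is at most 100
```

Under the second model the work stays essentially flat as \$n\$ grows, which is exactly the regime where the questioner's approach pays off.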
Context
StackExchange Code Review Q#155808, answer score: 6