snippetMinor
Can we remove duplicates faster than we can sort?
Viewed 0 times
canthanremovefastersortduplicates
Problem
The problem is (integer) duplicate removal, which can also be perceived as producing the image of an evaluated function (of integers):
Given a sequence $S_\text{in}$ of $n$ integers, produce a sequence $S_\text{out}$ of elements such that any element in $S_\text{in}$ also appears in $S_\text{out}$, and all elements in $S_\text{out}$ are distinct.
Ignoring some details regarding the elements' size in bits, this can be done by sorting in $O(n \log n)$ time; or by hashing in expected time $O(n)$.
Given that we don't require $S_\text{out}$ to be sorted - is it possible to do better than sorting in the worst case?
Notes:
Given a sequence $S_\text{in}$ of $n$ integers, produce a sequence $S_\text{out}$ of elements such that any element in $S_\text{in}$ also appears in $S_\text{out}$, and all elements in $S_\text{out}$ are distinct.
Ignoring some details regarding the elements' size in bits, this can be done by sorting in $O(n \log n)$ time; or by hashing in expected time $O(n)$.
Given that we don't require $S_\text{out}$ to be sorted - is it possible to do better than sorting in the worst case?
Notes:
- Some dependence on the output size (let's call it $m$) rather than $n$ is an improvement, but I'm mostly interested in getting closer to linearity in $n$.
- The details I've ignored may not be so insignificant.
- Algorithms need not be restricted to algebraic computation, i.e. you can tear into the bit representation if it helps.
- We cannot make any assumptions regarding the input distribution.
Solution
It is a classical result that the element distinctness problem requires $\Omega(n\log n)$ comparisons in the comparison model (the one used to analyze sorting algorithms); in fact, it also requires $\Omega(n\log n)$ time in stronger models such as algebraic decision trees, in which we are allowed to compute the sign of arbitrary bounded degree polynomials (rather than just $x_i - x_j$).
The element distinctness problem asks whether all elements in the input are distinct. This is clearly easier than your problem, since you can just compare the sizes of the input and output arrays. Hence in these computation models, you cannot do asymptotically better than $n\log n$.
The element distinctness problem asks whether all elements in the input are distinct. This is clearly easier than your problem, since you can just compare the sizes of the input and output arrays. Hence in these computation models, you cannot do asymptotically better than $n\log n$.
Context
StackExchange Computer Science Q#100786, answer score: 6
Revisions (0)
No revisions yet.