HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Algorithms for almost sorting

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
almostsortingforalgorithms

Problem

I want to find a comparison sorting algorithm that can almost sort a set of data, using the least comparisons possible.

What I mean by "almost" is that if the perfectly sorted data is $[x_1, x_2, …, x_n]$ and the almost sorted data is $[x_{i_1}, x_{i_2}, …, x_{i_n}]$, then $\forall j \in [\![1, n]\!], |i_j - j| \leq C$, $C$ being a parameter of the algorithm.

I want to emphasize that I want to minimize the number of comparisons. Other operations don't matter in the complexity. I would like to know if a number of comparisons next to $\frac{n\log_2 n}{C}$ can be possible.

In practice, I have to work with $n \simeq 1000$ and $C \simeq 10$ and only the comparisons are done by a human.

Are there any algorithm of the kind? Could I have some insight about it? Thanks.

EDIT: I got two ideas to almost sort:

  • The first one is to quicksort the array, and stop when I have to sort a sub-array of $\leq C$ data. The best case scenario requires roughly $\sum\limits_{k=0}^{\log_2(n/C)}2^k\frac{n}{2^k} = n\log_2(n/C)$ comparisons.


The algorithm is correct because sub-arrays are relatively sorted, and since they are of size $\leq C$, a value can't be farther than $C$ from its sorted position. The problem is that the worst case scenario is still in $\Omega(n^2)$ ;

  • The second idea is to sort using an insertion sort with a dichotomic search, and stop the search when the bounds of the search are distant of at most $C$. With this method, I get roughly $\sum\limits_{k=1}^{n-1}\log_2(k/C) = \log_2((n-1)!) - n\log_2(C)\simeq n\log_2(n/C) - n$ in the worst case, which is better than the previous algorithm, but I am not quite sure that the array will be $C$-almost sorted.



EDIT 2: I found that the second algorithm is not correct. If, during the first insertion, $x_n$ is put before $x_1$, then their relative order will not change during the rest of the execution, therefore one of them will be farther than $C$ from its sorted index if $n > 2C$.

Solution

Call an array $C$-sorted if each element is at-most $C$ places away from its place had the array been sorted (this is the same as "almost"-sorted array in the question, but in plain English).

Assume we will use comparison sort.

Claim One: Given an array of $n$ elements that is $C$-sorted, we can sort it with at most $n\lceil\log_2 (C+1)\rceil$ comparisons.

Proof: Here is how we can sort the given array.

-
Create a buffer that is an array of size $C+1$.

-
One by one move the first $\min(n, C+1)$ elements to the buffer, keeping the buffer sorted using binary search to find the right place for each new element.

-
One by one move the min element from the buffer array to the result array, and, if the given array is not empty yet, add the next element to the buffer, keeping the buffer sorted using binary search to find the right place for that next element.

For each element in the array, at most $\lceil\log_2 (C+1)\rceil$ comparisons is used to find its right place in the buffer. The number of total comparisons used is at most $n\lceil\log_2 (C+1)\rceil$. $\quad\checkmark$

Claim Two: At least $\log_2 (n!) - n\lceil\log_2 (C+1)\rceil$ comparisons are needed to $C$-sort an array of size $n$ in the worst case.

Proof: Otherwise, suppose there is an algorithm that can $C$-sort any given array of size $n$ with less than $\log_2 (n!) - n\lceil\log_2 (C+1)\rceil$ comparisons. Extending that algorithm with the algorithm shown in the proof of claim one, we obtain an algorithm that can sort any given array of size $n$ with less than $\log_2(n!)$ comparisons. That is well-known to be impossible. $\quad\checkmark$

Claim Three: There is no algorithm that can $C$-sort an array of size $n$ with $O(\frac{n\log_2 n}{C})$ comparisons.

Proof: We have, if $n\ge C^{1+\epsilon}$ for some fixed $\epsilon\gt0$,
$$\begin{aligned}\lim_{C\to\infty} \frac{\log_2 (n!) - n\lceil\log_2 (C+1)\rceil}{\frac{n\log_2 n}{C}} &= \lim_{C\to\infty} C\left(\frac{\log_2 (n!)}{n\log_2 n}-\frac{\lceil\log_2 (C+1)\rceil}{\log_2 n}\right)\\&\ge \lim_{C\to\infty} C\left(1-\frac1{1+\epsilon}\right)=\infty.
\end{aligned}\quad\checkmark$$

Indeed, $\log_2 (n!) - n\lceil\log_2 (C+1)\rceil$ gives a pretty strong lower bound for the number of comparison needed in the worst case since
$\log_2 (n!)$ is the lower bound to sort an array by comparison sort in the worst case, and, as $\log_2(n!)\approx n\log n$, for any fixed $C$,

$$n\lceil\log_2 (C+1)\rceil = o(\log_2 (n!)). \ \ \color{#d0d0d0}{\text{(small o, not big O!)}}$$

That is, at least $(1-\epsilon)\log_2(n!)$ comparison is needed to $C$-sort array when $n$ is sufficiently large, for all fixed $\epsilon\gt0$ and $C$.

The statements above support the usage of "almost sort" and "almost sorted" to refer to "$C$-sort" and "$C$-sorted" respectively.

We also note that

on average, quicksort performs only about 39% worse than in its best case. In this sense, it is closer to the best case than the worst case. A comparison sort cannot use less than $\log_2(n!)$ comparisons on average to sort $n$ items (as explained in the article Comparison sort) and in case of large $n$, Stirling's approximation yields $log_2(n!) \approx n(\log_2 n − log_2 e)$, so quicksort is not much worse than an ideal comparison sort.

Now let us have an array of $n=1000$ elements and $C=10$.

  • Quicksort will perform about $2n\log n\approx 19931$ comparisons on average to sort the array, as shown on Wikipedia.



  • Claim Two says $\log_2 (n!) - n\lceil\log_2 (C+1)\rceil\approx4529$ are needed to $10$-sort the array in the worse case, which is sort of information-theoretical the best.



  • Note that $\frac{n\log_2 n}{C}\approx997$, which is much less than


of 4529.

So I believe, (some more argument are omitted here), an algorithm that performs about $n\log_2(n/C)\approx6643$ comparisons to $10$-sort the array is pretty much about the best we can get.

Context

StackExchange Computer Science Q#136548, answer score: 7

Revisions (0)

No revisions yet.