"partial sorting" algorithms (aka "partitioning")
Problem
Context:
When trying to tame real-world datasets that contain outliers and noise, the interquartile mean is a handy tool: you sort the data, throw away the top and bottom 25% of the data and take the mean of what's left. (Of course, you can choose cut-offs other than the top and bottom 25%.)
Which led me to wonder: is there any efficiency to be gained by only partially sorting the array? That is, if we describe three groups: A is the low quartile, B is the middle, and C is the high quartile, we don't care if A or C are sorted: we're going to discard them. And we don't care if B is sorted, since we're only going to take the mean of its values. It's sufficient that the data is partitioned into those three groups.
The question:
- Is there a "partial sorting" algorithm that is more efficient than a full sort that will yield those three groups?
- Are there additional savings if the array length is always a power of 2 (assume N >= 4)?
- What if you want to place the partition boundaries somewhere other than at the quartiles? Does that make it less efficient?
Update
I've added "partitioning" to the title, since (I now know) that's the correct term for what this question is about. Thank you to everyone with good answers!
Solution
The quickselect algorithm returns the $k$-th smallest value of an unordered array in expected linear time. It can be "improved" (though it rarely helps in practice) by choosing pivots with the median of medians, which guarantees worst-case linear time.
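To make the idea concrete, here is a minimal sketch of quickselect with a random pivot. The function name `quickselect` and the 0-based rank `k` are my own choices for illustration; this version copies the input and builds new lists, so it shows the selection logic rather than an in-place partition, and it does not use the median of medians.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element of values (k is 0-based).

    Runs in expected linear time thanks to the random pivot;
    the worst case is still quadratic.
    """
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)  # work on a copy, leave the input untouched
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(smaller):
            # The answer lies among the elements below the pivot.
            items = smaller
        elif k < len(smaller) + len(equal):
            # The rank falls inside the block of pivot-valued elements.
            return pivot
        else:
            # Skip everything <= pivot and search the larger elements.
            k -= len(smaller) + len(equal)
            items = [x for x in items if x > pivot]
```

For example, `quickselect([7, 1, 5, 3, 9], 2)` returns the median of those five values.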
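Using that, you can quickselect the $\frac{N}{4}$-th, $\frac{N}{2}$-th and $\frac{3N}{4}$-th values. As a side effect, each in-place selection partitions the array around the selected value, so the three calls leave the array split into the four quarters; your group B is the two middle quarters, and for that alone the $\frac{N}{2}$-th selection can be skipped. All this can be done in linear time, which is asymptotically optimal since you need to look at each element at least once.
As long as you use a constant number of cut points, you can use values other than quartiles (deciles, for example).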
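As a rough sketch of how this applies to the interquartile mean from the question, the code below partitions a copy of the data in place with two selections and averages the middle block. The names `partition_around` and `interquartile_mean` are hypothetical, and I assume the simple convention of dropping `N // 4` elements from each end (a weighted interquartile mean for `N` not divisible by 4 is not handled). Only two selections are used, since the median is not needed for groups A, B and C.

```python
import random


def partition_around(a, k, lo=0, hi=None):
    """Rearrange a[lo:hi] in place so that a[k] holds the value it would
    have if a[lo:hi] were sorted, with nothing larger before it and
    nothing smaller after it (within a[lo:hi]). Requires lo <= k < hi."""
    if hi is None:
        hi = len(a)
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        # Three-way partition of a[lo:hi] around the pivot value.
        lt, i, gt = lo, lo, hi
        while i < gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                gt -= 1
                a[gt], a[i] = a[i], a[gt]
            else:
                i += 1
        # Now a[lo:lt] < pivot, a[lt:gt] == pivot, a[gt:hi] > pivot.
        if k < lt:
            hi = lt
        elif k >= gt:
            lo = gt
        else:
            return  # k falls in the block equal to the pivot


def interquartile_mean(data):
    """Mean of the data after dropping len(data) // 4 values from each end
    (an assumed convention, see the note above)."""
    a = list(data)  # keep the caller's data unchanged
    n = len(a)
    if n == 0:
        raise ValueError("data must not be empty")
    q = n // 4
    lo, hi = q, n - q
    partition_around(a, lo)             # a[:lo] is now the low group A
    partition_around(a, hi - 1, lo, n)  # a[hi:] is now the high group C
    middle = a[lo:hi]                   # group B, in no particular order
    return sum(middle) / len(middle)
```

With `[1000, 3, 7, 1, 5, 2, 6, 4]`, the two smallest and two largest values are discarded and the mean of 3, 4, 5 and 6 is returned, so the outlier 1000 has no effect on the result.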
Context
StackExchange Computer Science Q#150417, answer score: 26