SpaceSort - A new sorting algorithm
Problem
A sorting algorithm I have written.
Pros
- Gives each element a position once.
- Few comparisons needed.
Cons
- For certain data sets, a large amount of extra storage is needed.
- Only works with integers.
The main reason I post the algorithm here is to learn whether it already exists. From the research I have done I can't find anything similar. Though I am sceptical, it would be fun to know whether it is new. What I mostly want to know, however, is whether the algorithm could be useful and whether my implementation of it is reasonable.
Algorithm:
Basic idea: When you divide the number you want to sort by the largest number in the set, you get a percentage. This percentage gives a rough position for that number in the sorted list.
Because it is difficult to know how the data is distributed, the mean/average is used. By dividing the mean value by the number of unique values in the set, you can estimate how much of the data set lies above and below the mean.
Applying this idea to all the elements, each element gets a position and can be placed in a new, sorted list.
Lists with a high amount of clustering, or a small number of elements with large values, can produce the same calculated position for different elements. Therefore all the elements are first placed in a larger array to reduce the number of collisions, and if a collision occurs the element is moved slightly up or down. Because the new array is larger, elements can be moved up and down without causing problems when inserting new ones.
Duplicates are handled by counting the collisions of each element.
When all elements have been placed in the larger array it is then collapsed into an array of the correct size.
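The placement scheme described above can be sketched as follows. This is my own minimal Python illustration of the idea (oversized array, interpolated positions, collision counting), not the original implementation; the oversize factor of 4 is an arbitrary choice.

```python
def space_sort(data, oversize=4):
    """Sketch of the interpolated-placement idea (integers only)."""
    if len(data) < 2:
        return list(data)
    lo, hi = min(data), max(data)
    unique = hi - lo + 1                      # possible distinct values
    scale = len(data) * (oversize - 1)        # spread positions, leave headroom
    slots = [None] * (len(data) * oversize)   # oversized to reduce collisions
    counts = {}                               # duplicates: count collisions per value
    for x in data:
        if x in counts:                       # duplicate: just count it
            counts[x] += 1
            continue
        counts[x] = 1
        pos = int((x - lo) / unique * scale)  # rough position from the "percentage"
        while slots[pos] is not None and slots[pos] < x:
            pos += 1                          # nudge up past smaller occupants
        if slots[pos] is not None:            # occupied by a larger value:
            j = pos                           # shift that run up one slot
            while slots[j] is not None:
                j += 1
            slots[pos + 1:j + 1] = slots[pos:j]
        slots[pos] = x
    out = []                                  # collapse to the correct size
    for x in slots:
        if x is not None:
            out.extend([x] * counts[x])
    return out
```

Each distinct value is placed exactly once; duplicates only bump a counter, and the final pass expands them while collapsing the sparse array.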
Example:
[4, 20, 6, 4, 4]
min = 4, max = 20, mean = 8
Calculating the possible amount of unique elements:
(max - min + 1) = (20 - 4 + 1) = 17
Calculating a percentage for the mean:
(1 - (mean - min)/unique) = (1 - (8 - 4)/17) ≈ 0.76
This tells us roughly how large a part of the possible value range lies above the mean.
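The numbers in the worked example can be checked directly. Note that 38/5 = 7.6, so I assume the text's mean of 8 comes from rounding:

```python
# Check the worked example [4, 20, 6, 4, 4] from the text.
data = [4, 20, 6, 4, 4]
lo, hi = min(data), max(data)            # 4 and 20
mean = round(sum(data) / len(data))      # 38 / 5 = 7.6, rounded to the text's 8
unique = hi - lo + 1                     # 20 - 4 + 1 = 17 possible values
pct = 1 - (mean - lo) / unique           # 1 - 4/17 ≈ 0.7647
print(lo, hi, mean, unique, round(pct, 2))  # 4 20 8 17 0.76
```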
Solution
So you are using a function mapping each element to the interval (0, 1) to sort them. If you wanted to make this fully generic, here is the way to do that:
First, assume arbitrary precision. Your doubles only allow about 2^53 different values to be represented, which means you will probably run into problems with maliciously constructed 64-bit integer inputs that map a large range of numbers to the same double. For example, take an array of 10 elements where 5 of them are the integers 0-4 and the other 5 are, say, 2^61 to 2^61 + 4: probably only the first 5 get properly sorted, because for the latter 5 the doubles are likely all equal even though the integers are not.
Second, array indexing is going to mess you up. It is not a constant-time operation with regard to the length of the array, so it probably creates a hidden O(n^2) scaling. Moreover, the people who care about the speed of sorting algorithms usually have gigabytes or more of data to sort; they cannot easily say "oh, I will just allocate an array that is substantially bigger than the amount of data I need to sort."
I think when you look at both of these criteria together, you realize that what you are really interested in is a sort of high-fanout tree. The first step is to start with a low-fanout tree (like a binary tree, where each bit in the binary expansion tells you whether to go into the left or right subtree) for correctness, and then optimize for speed.
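To make the low-fanout starting point concrete, here is a small Python sketch (my own illustration, assuming fixed-width non-negative keys): a binary trie that consumes one bit per level, walked left-before-right, emits the elements in sorted order.

```python
def trie_sort(xs, bits=8):
    # Build a binary trie: each level consumes one bit, most significant first.
    # Assumes all keys are non-negative and fit in `bits` bits.
    root = {}
    for x in xs:
        node = root
        for i in range(bits - 1, -1, -1):
            b = (x >> i) & 1                   # 0 = left subtree, 1 = right subtree
            node = node.setdefault(b, {})
        node.setdefault('vals', []).append(x)  # stack of equal elements at the leaf
    # In-order walk (left child before right child) emits keys in sorted order.
    out = []
    def walk(node):
        if 'vals' in node:                     # only leaves carry values here
            out.extend(node['vals'])
        for b in (0, 1):
            if b in node:
                walk(node[b])
    walk(root)
    return out
```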
At the end of this analysis you'll be taking N bits of the binary expansion at a time, say 6, and using them to index into a 64-element array. Then, to handle equal elements, we need stacks at the bottom and some way to determine whether the arbitrary-precision numbers are equal, etc. So for the pathological 5+5 example that you couldn't sort, we can just divide the 64-bit ints into bitstrings with bit-shifts and masks in the obvious way. In JSON (but not JS, which doesn't have 64-bit integers!) you could write the result without trailing nulls as:
[
[[[[[[[[[[
0,
1,
2,
3,
4
]]]]]]]]]],
null,
[[[[[[[[[[
2305843009213693952,
2305843009213693953,
2305843009213693954,
2305843009213693955,
2305843009213693956
]]]]]]]]]]
]

In fact it seems like you would do best to use a trie data structure to reduce this overhead a little more; using base64 encoding, I think this is the trie:
{
"AAAAAAAAAAA": ["A", "B", "C", "D", "E"],
"CAAAAAAAAAA": ["A", "B", "C", "D", "E"]
}

Now that you see that this is generally what you are doing, I can tell you that the name of this class of algorithms is "radix sort". Something similar can be seen in Haskell if you look at the implementation of the standard container Data.IntSet. The direct generalization, where you say "wait, now that I see that I really need this tree structure, what happens if I just let the data direct the shape of the tree?", is called a self-balancing search tree, so you might want to start by looking up the red-black tree and go from there.
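Tying this together, the 64-way fanout described above is a most-significant-digit radix sort over 6-bit digits. A Python sketch (my own illustration, assuming unsigned 64-bit keys):

```python
def radix_sort(xs, width=64, digit_bits=6):
    # MSD radix sort: peel off digit_bits at a time with shifts and masks,
    # recursing into 2**digit_bits buckets per level.
    mask = (1 << digit_bits) - 1

    def rec(group, shift):
        if len(group) <= 1 or shift < 0:
            return group               # equal keys: emit in arrival order
        buckets = [[] for _ in range(1 << digit_bits)]
        for x in group:
            buckets[(x >> shift) & mask].append(x)
        out = []
        for b in buckets:
            out.extend(rec(b, shift - digit_bits))
        return out

    # Align so the last digit covers bits 0..digit_bits-1; for width=64 the
    # shifts are 60, 54, ..., 6, 0, and the top digit covers only 4 bits.
    top = digit_bits * ((width - 1) // digit_bits)
    return rec(list(xs), top)
```

The top digit spanning only 4 bits (64 = 6 × 10 + 4) is, I believe, also why the base64 keys in the trie above are 11 characters, with only the first character varying between the two clusters.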
Context
StackExchange Code Review Q#158511, answer score: 5