snippetsqlMajor
How to treat numbers within strings as numbers when sorting ("A3" sorts before "A10", not after)
Viewed 0 times
aftersortinga10notnumberssortswithinhowbeforestrings
Problem
For all of these queries:
The output is always:
But, I want and expect:
I'm out of collations to try now according to
Please note that I've read the existing seemingly related SE questions as well as many articles on
I'm using PostgreSQL 12.4
SELECT label FROM personal.storage_disks ORDER BY label ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "C" ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "POSIX" ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "default" ASC;The output is always:
DISK 1, DISK 10, DISK 2, DISK 3, [...]But, I want and expect:
DISK 1, DISK 2, DISK 3, [...] DISK 10I'm out of collations to try now according to
SELECT * FROM pg_collation;... unless I'm supposed to use one of the many really weird ones with cryptic names. (I even tried a bunch of those with the same result.)Please note that I've read the existing seemingly related SE questions as well as many articles on
SORT BY, but they didn't help and didn't clear up anything for me.I'm using PostgreSQL 12.4
Solution
Sorting strings naturally puts "15" before "2" because the first digit in the "15" is a "1", which sorts before "2". Sorting a "2" stored in a string type before the "15" can be done in a few ways. The most efficient way is to have the collation itself handle this internally. This option is not well known and not even available in most places, but any system that implements ICU (International Components for Unicode)** has the potential for allowing this type of sorting (as long as it allows for customizing the sort options), which is often referred to as "natural" sorting.
Handling natural sorting is often done programmatically by chopping the string into its pure alpha and numerical pieces and then sorting on those individually. This is a necessary evil in many cases, but fortunately, PostgreSQL (at least as of version 10), does allow for this internally. You would need to create a custom collation (this one even comes straight from their documentation):
and use that in the
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=58763b51a8ccb2360cf387d8c2b91d51
NOTE
Due to collations not being generally well-understood, and the ability to customize them being more recent (for databases, at least) and even more esoteric, I would recommend doing the following when implementing this solution:
https://www.postgresql.org/docs/12/collation.html#id-1.6.10.4.5.7.5
You could even mention that the
BONUS
To more fully demonstrate how the "numeric" collation option works, I added some data to the previous example to show:
The additional data is:
Here is the updated example:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=20416b0dd731b2cc28b6fdee8ef70ec7
** To be fair, ICU / Unicode isn't "required" for doing this type of sorting, given that any collation or system can implement the same algorithm. However, it is built into ICU, and more and more systems are integrating ICU.
Handling natural sorting is often done programmatically by chopping the string into its pure alpha and numerical pieces and then sorting on those individually. This is a necessary evil in many cases, but fortunately, PostgreSQL (at least as of version 10), does allow for this internally. You would need to create a custom collation (this one even comes straight from their documentation):
CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
and use that in the
ORDER BY. Here is a working example of it:https://dbfiddle.uk/?rdbms=postgres_11&fiddle=58763b51a8ccb2360cf387d8c2b91d51
NOTE
Due to collations not being generally well-understood, and the ability to customize them being more recent (for databases, at least) and even more esoteric, I would recommend doing the following when implementing this solution:
- Prefix the name of the new collation with "custom_" to increase awareness that this is indeed a custom collation that might have non-obvious behavior and might not exist on other systems (hence might need to be added to a system or application setup process)
- Place a comment just after each query that uses this custom collation, pointing out that it is a custom collation, and include a link to the official documentation:
https://www.postgresql.org/docs/12/collation.html#id-1.6.10.4.5.7.5
You could even mention that the
-kn-true part enables "numeric" sorting.BONUS
To more fully demonstrate how the "numeric" collation option works, I added some data to the previous example to show:
- multiple / separate groups of numbers within a string are handled separately
- different non-digit characters are handled as expected
- leading 0's do not affect the result
The additional data is:
DISK 2A
DISK 2B
DISK 2B 33
DISK 2B 4
FILE 62
FILE 7
DIRECTORY 1000000
DIRECTORY 57
DIRECTORY 9999
DIRECTORY 57000
DIRECTORY 057
DIRECTORY 0057
DIRECTORY 52Here is the updated example:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=20416b0dd731b2cc28b6fdee8ef70ec7
** To be fair, ICU / Unicode isn't "required" for doing this type of sorting, given that any collation or system can implement the same algorithm. However, it is built into ICU, and more and more systems are integrating ICU.
Code Snippets
DISK 2A
DISK 2B
DISK 2B 33
DISK 2B 4
FILE 62
FILE 7
DIRECTORY 1000000
DIRECTORY 57
DIRECTORY 9999
DIRECTORY 57000
DIRECTORY 057
DIRECTORY 0057
DIRECTORY 52Context
StackExchange Database Administrators Q#285222, answer score: 30
Revisions (0)
No revisions yet.