snippetsqlMinor
ORDER BY and comparison of mixed strings of letters and numbers
Viewed 0 times
ordernumberscomparisonandstringsmixedletters
Problem
We need to do some reporting on values that are usually mixed strings of numbers and letters that need to be sorted 'naturally'. Things like, e.g. "P7B18" or "P12B3". @The strings will mostly be sequences of letters then numbers alternating. The number of these segments and the length of each could vary, though.
We'd like the numeric portions of these to be sorted in numeric order. Obviously, if I just handle those string values directly with
I'd also like to be able to do range comparisons, e.g.
In our case, string casing isn't important, though if there's a way to do this in a collation-aware fashion, others might find that useful. The ugliest part of all this is I'd like to be able to do both ordering, and range filtering in the
If I were doing this in C#, it would be a pretty simple task: do some parsing to separate the alpha from the numeric, implement IComparable, and you're basically done. SQL Server, of course, doesn't appear to offer any similar functionality, at least as far as I'm aware.
Anybody know any good tricks to make this work? Is there some little-publicized ability to create custom CLR types that implement IComparable and have this behave as expected? I'm also not opposed to Stupid XML Tricks (see also: list concatenation), and I've got CLR regex matching/extracting/replacement wrapper functions available on the server as well.
EDIT:
As a slightly more detailed example, I'd want the data to behave something like this.
```
SELECT bin FROM bins ORDER BY bin
bin
--------------------
M7R16L
P8RF6JJ
P16B5
PR7S19
PR7S19L
S2F3
S
We'd like the numeric portions of these to be sorted in numeric order. Obviously, if I just handle those string values directly with
ORDER BY, then "P12B3" is going to come before "P7B18", since "P1" is earlier than "P7", but I'd like the reverse, as "P7" naturally precedes "P12".I'd also like to be able to do range comparisons, e.g.
@bin < 'P13S6' or some such. I don't have to handle floating point or negative numbers; these will strictly be non-negative integers that we're dealing with. String lengths and number of segments could potentially be arbitrary, with no fixed upper bounds.In our case, string casing isn't important, though if there's a way to do this in a collation-aware fashion, others might find that useful. The ugliest part of all this is I'd like to be able to do both ordering, and range filtering in the
WHERE clause.If I were doing this in C#, it would be a pretty simple task: do some parsing to separate the alpha from the numeric, implement IComparable, and you're basically done. SQL Server, of course, doesn't appear to offer any similar functionality, at least as far as I'm aware.
Anybody know any good tricks to make this work? Is there some little-publicized ability to create custom CLR types that implement IComparable and have this behave as expected? I'm also not opposed to Stupid XML Tricks (see also: list concatenation), and I've got CLR regex matching/extracting/replacement wrapper functions available on the server as well.
EDIT:
As a slightly more detailed example, I'd want the data to behave something like this.
```
SELECT bin FROM bins ORDER BY bin
bin
--------------------
M7R16L
P8RF6JJ
P16B5
PR7S19
PR7S19L
S2F3
S
Solution
Want a sensible, efficient means of sorting numbers in strings as actual numbers? Consider voting for my Microsoft Connect suggestion: Support "natural sorting" / DIGITSASNUMBERS as a Collation option
There is no easy, built-in means of doing this, but here is a possibility:
Normalize the strings by reformatting them into fixed-length segments:
Example:
In this approach, you can sort via:
And you can do range filtering via:
or:
Both the
Equality comparisons would still be done on the original value column.
Other thoughts:
-
Use a SQLCLR UDT. This might could work, though it is unclear if it presents a net-gain as compared to the approach described above.
Yes, a SQLCLR UDT can have its comparison operators overridden with custom algorithms. This handles situations where the value is being compared to either another value that is already the same custom type, or one that needs to be implicitly converted. This should handle the range filter in a
With regards to sorting the UDT as a regular column type (not a computed column), this is only possible if the UDT is "byte ordered". Being "byte ordered" means that the binary representation of the UDT (which can be defined in the UDT) naturally sorts in the appropriate order. Assuming that the binary representation is handled similarly to the approach described above for the VARCHAR(50) column that has fixed-length segments that are padded, that would qualify. Or, if it was not easy to ensure that the binary representation would naturally be ordered in the proper way, you could expose a method or property of the UDT that outputs a value that would be properly ordered, and then create a
Benefits of this approach are:
Consequences of this approach are:
There is no easy, built-in means of doing this, but here is a possibility:
Normalize the strings by reformatting them into fixed-length segments:
- Create a sort column of type
VARCHAR(50) COLLATE Latin1_General_100_BIN2. The max length of 50 might need to be adjusted based on the max number of segments and their potential maximum lengths.
- While the normalization could be done in the app layer more efficiently, handling this in the database using a T-SQL UDF would allow for placing the scalar UDF into an
AFTER [or FOR] INSERT, UPDATETrigger such that you are guaranteed of properly setting the value for all records, even those coming in via ad hoc queries, etc. Of course, that scalar UDF can also be handled via SQLCLR, but it would need to be tested to determine which one was actually more efficient. **
- The UDF (regardless of being in T-SQL or SQLCLR) should:
- Process an unknown number of segments by reading each character and stopping when the type switches from alpha to numeric or numeric to alpha.
- Per each segment it should return a fixed-length string set to the maximum possible characters/digits of any segment (or maybe max + 1 or 2 to account for future growth).
- Alpha segments should be left-justified and right-padded with spaces.
- Numeric segments should be right-justified and left-padded with zeroes.
- If alpha characters can come in as mixed-case but the ordering needs to be case-insensitive, apply the
UPPER()function to the final result of all segments (so that it only needs to be done once and not per segment). This will allow for proper sorting given the binary collation of the sort column.
- Create an
AFTER INSERT, UPDATETrigger on the table that calls the UDF to set the sort column. To improve performance, use theUPDATE()function to determine if this code column is even in theSETclause of theUPDATEstatement (simplyRETURNif false), and then join theINSERTEDandDELETEDpseudo-tables on the code column to only process rows that have changes in the code value. Be sure to specifyCOLLATE Latin1_General_100_BIN2on that JOIN condition to ensure accuracy in determining if there is a change.
- Create an Index on the new sort column.
Example:
P7B18 -> "P 000007B 000018"
P12B3 -> "P 000012B 000003"
P12B3C8 -> "P 000012B 000003C 000008"
In this approach, you can sort via:
ORDER BY tbl.SortColumn
And you can do range filtering via:
WHERE tbl.SortColumn BETWEEN dbo.MyUDF('P7B18') AND dbo.MyUDF('P12B3')
or:
DECLARE @RangeStart VARCHAR(50),
@RangeEnd VARCHAR(50);
SELECT @RangeStart = dbo.MyUDF('P7B18'),
@RangeEnd = dbo.MyUDF('P12B3');
WHERE tbl.SortColumn BETWEEN @RangeStart AND @RangeEnd
Both the
ORDER BY and the WHERE filter should use the binary collation defined for SortColumn due to Collation Precedence.Equality comparisons would still be done on the original value column.
Other thoughts:
-
Use a SQLCLR UDT. This might could work, though it is unclear if it presents a net-gain as compared to the approach described above.
Yes, a SQLCLR UDT can have its comparison operators overridden with custom algorithms. This handles situations where the value is being compared to either another value that is already the same custom type, or one that needs to be implicitly converted. This should handle the range filter in a
WHERE condition.With regards to sorting the UDT as a regular column type (not a computed column), this is only possible if the UDT is "byte ordered". Being "byte ordered" means that the binary representation of the UDT (which can be defined in the UDT) naturally sorts in the appropriate order. Assuming that the binary representation is handled similarly to the approach described above for the VARCHAR(50) column that has fixed-length segments that are padded, that would qualify. Or, if it was not easy to ensure that the binary representation would naturally be ordered in the proper way, you could expose a method or property of the UDT that outputs a value that would be properly ordered, and then create a
PERSISTED computed column on that method or property. The method needs to be deterministic and marked as IsDeterministic = true.Benefits of this approach are:
- No need for an "original value" field.
- No need to call a UDF to insert the data or to compare values. Assuming that the
Parsemethod of the UDT takes in theP7B18value and converts it, then you should be able to simply insert the values naturally asP7B18. And with the implicit conversion method set in the UDT, the WHERE condition would also allow for using simply P7B18`.
Consequences of this approach are:
- Simply selecting the field will return the binary representation, if using the b
Context
StackExchange Database Administrators Q#126744, answer score: 9
Revisions (0)
No revisions yet.