HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Relationships between like matching and comparison

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
likecomparisonbetweenandrelationshipsmatching

Problem

Naively I assumed that the relationship between a like 'foo%' and a >= 'foo' is that the latter matches the former and some additional rows that come after in the index order. More generally, I thought that for a given text t an index on a character column splits into four regions, being

  • the entries before (



  • the entries equal to (=) t,



  • the entries greater than (>) t that still have t as a prefix (like concat(t, '%')) and



  • the rest,



and that those regions are contiguous and in that order.

Also, all rows having
t as a prefix (like concat(t, '%')) form a contiguous region in the index as well and that this region contains region 2 at its start (meaning it's exactly the regions 2 and 3 together).

It may not be that simple though. For example, in SQL Server the word
Häuser compares equal to Haeuser in the collation German_PhoneBook_CI_AI and both are >= 'Ha'.

However, only
Haeuser matches like 'Ha%'`, so the additional assumptions about prefixes don't entirely hold.

You can see this in this fiddle.

My question is what other assumptions here may be wrong. Does the result of an anchored like even always give a contiguous slice of the index or can there be gaps?1 Do I at least still have those four contiguous regions?

The underlying problem I'm working on is a generic database UI that shows index slices for a given search term. It's beneficial from a user experience point of view to not have the user select whether they want to make a prefix, equality or greater-than search, but to simply start at the equality (in the case of ascending seeking), following the index and then marking the two further boundaries "no longer equal from here" and "not even a prefix anymore here".

Only character types have this issue as it comes from the complications of collations.

This question is tagged SQL Server, but I feel if there's a RDMS where this isn't true that it's probably a deeper issue I need to know about - so feel free to comme

Solution

For example, in SQL Server the word Häuser comes before Haeuser in the collation German_PhoneBook_CI_AI

No it doesn't. They are equal.

select case when  N'Häuser' < N'Haeuser' Collate German_PhoneBook_CI_AI 
              then 'true' else 'false' end


returns

false

N'Hä' = N'Hae' Collate German_PhoneBook_CI_AI


but

N'Ha' < N'Hä'  Collate German_PhoneBook_CI_AI


And so N'Häuser' is not like 'Ha%' but is like N'Hae%' and like N'Hä%'.

It's possible that SQL Server will not be able to completely evaluate the like predicate in an index seek+ordered scan, instead having to seek+scan a larger range and apply the like operator to the rows in the larger range. See Dynamic Seeks and Hidden Implicit Conversions

Code Snippets

select case when  N'Häuser' < N'Haeuser' Collate German_PhoneBook_CI_AI 
              then 'true' else 'false' end
N'Hä' = N'Hae' Collate German_PhoneBook_CI_AI
N'Ha' < N'Hä'  Collate German_PhoneBook_CI_AI

Context

StackExchange Database Administrators Q#316446, answer score: 7

Revisions (0)

No revisions yet.