patternsqlMinor
Why do two identical strings have a different length but the same binary value?
Viewed 0 times
whythesamelengthbutidenticalvaluedifferenttwobinary
Problem
I am trying to return a distinct set of department names from a table - nothing special. However, when using the following query duplicates are displayed:
I also tried:
So this led me to believe that I might have hidden characters in the values and sure enough, when I checked the length of the strings they returned different values. So, I decided to locate the hidden characters using the function from this question on stack overflow. Strangely enough this only return SPACE. I then tried the following query and it made no difference at all:
Intrigued, I cast the values to
I have tried casting between
UPDATE
After some further investigation, this issue seems to only happen for strings that end with the hex value
UPDATE 2
If I remove the
But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the
select distinct department_name
from dbo.departments;I also tried:
select distinct department_name
from dbo.departments
group by department_name;So this led me to believe that I might have hidden characters in the values and sure enough, when I checked the length of the strings they returned different values. So, I decided to locate the hidden characters using the function from this question on stack overflow. Strangely enough this only return SPACE. I then tried the following query and it made no difference at all:
select distinct ltrim(rtrim(department_name)) as department_name
from dbo.departments;Intrigued, I cast the values to
VARBINARY and noticed that they have the exact same binary value, and performing a DISTINCT on the binary value does indeed produce a unique result set.I have tried casting between
VARCHAR and NVARCHAR and to a different collation also (the values are in the same column, within the same database using Latin1_General_CI_AI). I really need to be able to get a distinct set from this table. Does anybody know what might be causing this issue?UPDATE
After some further investigation, this issue seems to only happen for strings that end with the hex value
0xA000. Any value in the column not ending with this character are fine.UPDATE 2
If I remove the
0xA000 character from the string I can then apply DISTINCT as normal like so:DECLARE @binary VARBINARY(8) = 0xA000;
DECLARE @string VARCHAR(8) = CONVERT(VARCHAR(MAX), @binary);
UPDATE dbo.departments
SET department_name = REPLACE(department_name, @string, '');But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the
WHERE clause. I am using a workaround right now which is no more than using MIN tSolution
Good job investigating so far. Some initial notes:
-
I wouldn't worry about that function from the S.O. answer.
-
-
Adding
-
Whether
Still, something is amiss. You simply cannot have different length (or different anything) with the same binary value. How was length determined:
In order to help further, we need to know two things (please update the question with the results):
-
The exact datatype of the
-
The output of the following query:
Just find a set of rows for a name that looks the same but is showing duplicates via
For the moment, I can say that if you have character 0 (ASCII value 0, Code Point U+0000,
Enjoy: Peter Paul Almond Joy & Mounds - "Feel Like A Nut" (1980)
TEMPORARY UPDATE (WAITING FOR QUESTION TO BE UPDATED WITH QUERY OUTPUT):
Based on this info from a comment:
I used both LEN and DATALENGTH to perform the check. For two exact same strings I got (20,40) and (21,42) respectively.
it is clear that:
That means that the
No-Break Space
As in our favorite HTML character:
All you need to do is remove those characters on the way into the DB, using:
For example:
To be clear, all of that work messing with
But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the
True, updating every WHERE clause is not a workable solution. This is why you need to sanitize the input on the way in. There are a limited number of entry points for the data (INSERT / UPDATE procs for the UI, possibly some ETL processes), so it shouldn't be that bad. You can request that the developers strip out "bad" characters before they call the stored procedure(s), but there is no guarantee that they will, or that new code will, or that things won't change later, or that they will be able to fix ETL processes, etc.
-
I wouldn't worry about that function from the S.O. answer.
-
RTRIM and LTRIM only trim spaces, not white-space in general:SELECT RTRIM('A ') + 'a';
-- Aa
SELECT RTRIM('A ' + CHAR(9)) + 'a'; -- CHAR(9) = tab
-- A a-
Adding
GROUP BY (2nd query) doesn't change that query since it was implied in the first query ;-).-
Whether
0xA000 is 2 VARCHAR characters or 1 NVARCHAR character, there does not seem to be any special behavior with this sequence of bytes for either datatype, using either Latin1_General_CI_AI or Latin1_General_100_CI_AI.Still, something is amiss. You simply cannot have different length (or different anything) with the same binary value. How was length determined:
LEN or DATALENGTH? Is it possible that the value got truncated somewhere in the testing so that it then appeared to be the same?In order to help further, we need to know two things (please update the question with the results):
-
The exact datatype of the
department_name. Please find via:SELECT typ.[name], col.*
FROM sys.columns col
INNER JOIN sys.types typ
ON typ.[user_type_id] = col.[user_type_id]
WHERE col.[object_id] = OBJECT_ID(N'dbo.departments')
AND col.[name] = N'department_name';-
The output of the following query:
SELECT dept.department_name,
LEN(dept.department_name) AS [name_chars],
DATALENGTH(dept.department_name) AS [name_bytes],
CONVERT(VARBINARY(MAX), dept.department_name) AS [name_hex]
FROM dbo.departments dept
ORDER BY dept.department_nameJust find a set of rows for a name that looks the same but is showing duplicates via
DISTINCT.For the moment, I can say that if you have character 0 (ASCII value 0, Code Point U+0000,
CHAR(0), NCHAR(0)) in your values then appearances can be deceiving since char(0) is the "null terminator" for strings. So neither it nor anything after it will be displayed, but it and everything after are still part of the string:DECLARE @Test TABLE
(
[Something] VARCHAR(50) COLLATE Latin1_General_CI_AI
);
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'don''t');
SELECT DISTINCT [Something],
LEN([Something]) AS [Something_chars],
DATALENGTH([Something]) AS [Something_bytes]
FROM @Test;
/*
Something Something_chars Something_bytes
Sometimes you 19 19
Sometimes you 29 29
*/Enjoy: Peter Paul Almond Joy & Mounds - "Feel Like A Nut" (1980)
TEMPORARY UPDATE (WAITING FOR QUESTION TO BE UPDATED WITH QUERY OUTPUT):
Based on this info from a comment:
I used both LEN and DATALENGTH to perform the check. For two exact same strings I got (20,40) and (21,42) respectively.
it is clear that:
- these are definitely different values, and
- the datatype of the column is
NVARCHAR(sinceDATALENGTHis twiceLEN)
That means that the
0xA000 value is a single, UTF16LE character. Due to being Little Endian (bytes in reverse order), the actual Code Point is U+00A0. That character is:No-Break Space
As in our favorite HTML character:
All you need to do is remove those characters on the way into the DB, using:
REPLACE(@InputParam, NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')For example:
SELECT CONVERT(VARBINARY(MAX),
REPLACE(N'test' + NCHAR(0x00A0), NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')
);To be clear, all of that work messing with
VARBINARY and VARCHAR, etc is unnecessary.But this won't work long term as users can update this table and I would need to adjust every query to do a replacement in the
WHERE clause.True, updating every WHERE clause is not a workable solution. This is why you need to sanitize the input on the way in. There are a limited number of entry points for the data (INSERT / UPDATE procs for the UI, possibly some ETL processes), so it shouldn't be that bad. You can request that the developers strip out "bad" characters before they call the stored procedure(s), but there is no guarantee that they will, or that new code will, or that things won't change later, or that they will be able to fix ETL processes, etc.
Code Snippets
SELECT RTRIM('A ') + 'a';
-- Aa
SELECT RTRIM('A ' + CHAR(9)) + 'a'; -- CHAR(9) = tab
-- A aSELECT typ.[name], col.*
FROM sys.columns col
INNER JOIN sys.types typ
ON typ.[user_type_id] = col.[user_type_id]
WHERE col.[object_id] = OBJECT_ID(N'dbo.departments')
AND col.[name] = N'department_name';SELECT dept.department_name,
LEN(dept.department_name) AS [name_chars],
DATALENGTH(dept.department_name) AS [name_bytes],
CONVERT(VARBINARY(MAX), dept.department_name) AS [name_hex]
FROM dbo.departments dept
ORDER BY dept.department_nameDECLARE @Test TABLE
(
[Something] VARCHAR(50) COLLATE Latin1_General_CI_AI
);
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'feel like a nut');
INSERT INTO @Test ([Something])
VALUES ('Sometimes you' + CHAR(0) + 'don''t');
SELECT DISTINCT [Something],
LEN([Something]) AS [Something_chars],
DATALENGTH([Something]) AS [Something_bytes]
FROM @Test;
/*
Something Something_chars Something_bytes
Sometimes you 19 19
Sometimes you 29 29
*/REPLACE(@InputParam, NCHAR(0x00A0) COLLATE Latin1_General_100_BIN2, N'')Context
StackExchange Database Administrators Q#210340, answer score: 6
Revisions (0)
No revisions yet.