patternsqlModerate
Collate issues with wrong characters
Viewed 0 times
collateissueswithwrongcharacters
Problem
Well, the problem is well known, but I'm looking for a smarter solution if there's one.
For some reason the system doesn't recognize some characters and I can't compare the columns
Here is an example of the text:
Right
ASPIRADOR ULTRASSONICO-LOCAÇAO (NOTA FISCAL SERVIÇO)
Wrong
ASPIRADOR ULTRASSONICO-LOCA€AO (NOTA FISCAL SERVI€O)
Actually I'm fixing this through this function
So, my question is: is this the best way or have I missed something here?
EDIT
Just a complementary test
```
select dbo.fixcollation(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§')
select dbo.FixCodePage850toCodePage1252(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå
For some reason the system doesn't recognize some characters and I can't compare the columns
Here is an example of the text:
Right
ASPIRADOR ULTRASSONICO-LOCAÇAO (NOTA FISCAL SERVIÇO)
Wrong
ASPIRADOR ULTRASSONICO-LOCA€AO (NOTA FISCAL SERVI€O)
Actually I'm fixing this through this function
create function fixcollation(@ps_Texto VARCHAR(4000)) returns VARCHAR(4000)
as
begin
declare @vlgsv1itu INT declare @nxn68ezzi INT declare @dw17rsyva VARCHAR(50) declare @iw8a2z01i VARCHAR(50) declare @t64e98xq6 VARCHAR(50) declare @zwjs2imy3 INT declare @jsyt85sy8 VARCHAR(4000)
----------------------------------------------------
set @dw17rsyva = ' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§'
set @iw8a2z01i = 'áàãâäÁÀÃÂÄéèêëÈÉÊËíìïÍÌÏóòõôöÓÒÕÔÖúùûüÚÙÛÜçǺØ'
set @jsyt85sy8 = @ps_Texto set @zwjs2imy3 = IsNull(datalength(@ps_Texto), 0)
set @nxn68ezzi = 1
while(@nxn68ezzi <= IsNull(datalength( @ps_Texto), 0))
begin
set @vlgsv1itu = 1
while(@vlgsv1itu <= IsNull(datalength(@dw17rsyva), 0))
begin
IF(ASCII(SUBSTRING(@ps_Texto, @nxn68ezzi, 1) COLLATE LATIN1_GENERAL_CS_AS) = ASCII(SUBSTRING(@dw17rsyva, @vlgsv1itu, 1) COLLATE LATIN1_GENERAL_CS_AS))
BEGIN
set @t64e98xq6 = SUBSTRING( @iw8a2z01i, @vlgsv1itu, 1) set @jsyt85sy8 = SUBSTRING(@jsyt85sy8, 1, @nxn68ezzi -1) + @t64e98xq6 + SUBSTRING(@jsyt85sy8, @nxn68ezzi + 1, @zwjs2imy3 - @nxn68ezzi)
break
end
set @vlgsv1itu = @vlgsv1itu + 1
end
set @nxn68ezzi = @nxn68ezzi + 1
end
return @jsyt85sy8
endSo, my question is: is this the best way or have I missed something here?
EDIT
Just a complementary test
```
select dbo.fixcollation(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå♣—–éëꚇ€§')
select dbo.FixCodePage850toCodePage1252(' …ƃ„µ·Ç¶Ž‚Šˆ‰ÔÒÓ¡‹ÖÞØ¢•ä“”àãå
Solution
This is an incorrect encoding issue. The characters are coming in encoded as DOS Code Page 850 yet the target Code Page you are using (based on the
The incorrect characters are incorrect due to being imported into SQL Server with the wrong encoding being specified for the source. This is not happening within SQL Server as that would be a code page conversion issue, in which case the same "character" would have its value translated for the same character in the target Code Page (if the character exists in the target Code Page, else you get
Returns:
Meaning, this is happening most likely during a file import — BCP.exe, SQLCMD.exe,
It should be noted that this could also happen with data coming in from app code if the driver (ODBC, etc) is being told to use the wrong code page.
Now, regarding the method of fixing this:
-
Use the following function which does the conversion. First it gets the bytes of the current string, then it injects those bytes into a
Testing both functions returns the same results:
I came up with a test to check the mappings of all characters just in case the company providing the translation function missed any mappings. I filtered out the graphics characters and dotless "i" that are only found in Code Page 850.
```
USE [tempdb];
GO
;WITH nums AS
(
SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num]
FROM master.sys.columns
), vals AS
(
SELECT nums.[num] AS [Value],
CHAR(nums.[num]) AS [Character],
dbo.fixcollation(CHAR(nums.[num])) AS [OldWay],
dbo.FixCodePage850toCodePage1252(CHAR(nums.[num])) AS [NewWay]
FROM nums
)
SELECT vals.*,
ASCII(vals.[NewWay]) AS [NewValue]
FROM vals
WHERE vals.[Character] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[OldWay] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[Value] NOT IN (176, 177, 178, 179,
Latin1_General Collations) is Windows Code Page 1252. For example, in DOS Code Page 850, the Ç character has a value of 0x80 (or 128 in Decimal). However, that same value of 0x80 in Windows Code Page 1252 gives you €. Likewise, Ã in DOS Code Page 850 has a value of 0xC7 (or 199 in Decimal). However, that same value of 0xC7 in Windows Code Page 1252 gives you Ç.The incorrect characters are incorrect due to being imported into SQL Server with the wrong encoding being specified for the source. This is not happening within SQL Server as that would be a code page conversion issue, in which case the same "character" would have its value translated for the same character in the target Code Page (if the character exists in the target Code Page, else you get
?). For example:SELECT ASCII('Ç' COLLATE Latin1_General_CI_AS) AS [CP1252 Value],
'Ç' COLLATE SQL_Latin1_General_CP850_CI_AS AS [CharacterInCP850],
ASCII('Ç' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CP850 Value];Returns:
CP1252 Value CharacterInCP850 CP850 Value
199 Ç 128
Meaning, this is happening most likely during a file import — BCP.exe, SQLCMD.exe,
BULK INSERT, OPENROWSET(BULK...), custom app code that reads a file, etc — where either the wrong source Code Page is being specified, or no Code Page at all is being specified for the source. If an import is being done that specifies Code Page 1252 for this file, it will have the effect that you are seeing here since those bytes are encoded for Code Page 850, not Code Page 1252.It should be noted that this could also happen with data coming in from app code if the driver (ODBC, etc) is being told to use the wrong code page.
Now, regarding the method of fixing this:
- Ideally the method of importing the data would be updated / fixed to properly account for the actual code page with which the data has been encoded.
- If it is not possible to fix the import process, then the function that the other company provided is not the best way to go. In fact, it is probably the slowest, most convoluted approach which is also prone to errors (if they didn't map all of the characters). There is no reason to do two loops along with
SUBSTRINGwhen loading the characters, in pairs, into a table variable would have allowed for a single loop using theREPLACEfunction. And using theASCIIfunction and a case-sensitive, accent-sensitive Collation is unnecessary and prone to error (if two characters match what is being searched for) when using a_BIN2Collation would have been better.
-
Use the following function which does the conversion. First it gets the bytes of the current string, then it injects those bytes into a
VARCHAR column that uses Code Page 850, then it selects that value from the table variable into a local variable (necessary anyway to return the value) which has the effect of converting the string into the Code Page used by the default Collation of the Database (which here would have to be Code Page 1252 else you would not be getting the "correct" string out of the function):USE [tempdb];
GO
CREATE FUNCTION dbo.FixCodePage850toCodePage1252
(
@CodePage850String VARCHAR(8000)
)
RETURNS VARCHAR(8000)
WITH SCHEMABINDING
AS
BEGIN
DECLARE @Convert850to1252 TABLE
(
[String] VARCHAR(8000) COLLATE SQL_Latin1_General_CP850_CI_AS
);
DECLARE @ReturnValue VARCHAR(8000);
INSERT INTO @Convert850to1252 ([String])
VALUES (CONVERT(VARBINARY(8000), @CodePage850String, 0));
SELECT @ReturnValue = [String] -- automatic conversion to Code Page of database
FROM @Convert850to1252;
RETURN @ReturnValue;
END;
GOTesting both functions returns the same results:
SELECT dbo.fixcollation('lj§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)
SELECT dbo.FixCodePage850toCodePage1252('lj§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)I came up with a test to check the mappings of all characters just in case the company providing the translation function missed any mappings. I filtered out the graphics characters and dotless "i" that are only found in Code Page 850.
```
USE [tempdb];
GO
;WITH nums AS
(
SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num]
FROM master.sys.columns
), vals AS
(
SELECT nums.[num] AS [Value],
CHAR(nums.[num]) AS [Character],
dbo.fixcollation(CHAR(nums.[num])) AS [OldWay],
dbo.FixCodePage850toCodePage1252(CHAR(nums.[num])) AS [NewWay]
FROM nums
)
SELECT vals.*,
ASCII(vals.[NewWay]) AS [NewValue]
FROM vals
WHERE vals.[Character] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[OldWay] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[Value] NOT IN (176, 177, 178, 179,
Code Snippets
SELECT ASCII('Ç' COLLATE Latin1_General_CI_AS) AS [CP1252 Value],
'Ç' COLLATE SQL_Latin1_General_CP850_CI_AS AS [CharacterInCP850],
ASCII('Ç' COLLATE SQL_Latin1_General_CP850_CI_AS) AS [CP850 Value];USE [tempdb];
GO
CREATE FUNCTION dbo.FixCodePage850toCodePage1252
(
@CodePage850String VARCHAR(8000)
)
RETURNS VARCHAR(8000)
WITH SCHEMABINDING
AS
BEGIN
DECLARE @Convert850to1252 TABLE
(
[String] VARCHAR(8000) COLLATE SQL_Latin1_General_CP850_CI_AS
);
DECLARE @ReturnValue VARCHAR(8000);
INSERT INTO @Convert850to1252 ([String])
VALUES (CONVERT(VARBINARY(8000), @CodePage850String, 0));
SELECT @ReturnValue = [String] -- automatic conversion to Code Page of database
FROM @Convert850to1252;
RETURN @ReturnValue;
END;
GOSELECT dbo.fixcollation('lj§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)
SELECT dbo.FixCodePage850toCodePage1252('lj§ ULTRASSONICO-LOCA€AO (NOTA SERVI€O)');
-- Ãëº ULTRASSONICO-LOCAÇAO (NOTA SERVIÇO)USE [tempdb];
GO
;WITH nums AS
(
SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num]
FROM master.sys.columns
), vals AS
(
SELECT nums.[num] AS [Value],
CHAR(nums.[num]) AS [Character],
dbo.fixcollation(CHAR(nums.[num])) AS [OldWay],
dbo.FixCodePage850toCodePage1252(CHAR(nums.[num])) AS [NewWay]
FROM nums
)
SELECT vals.*,
ASCII(vals.[NewWay]) AS [NewValue]
FROM vals
WHERE vals.[Character] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[OldWay] <> vals.[NewWay] COLLATE Latin1_General_BIN2
AND vals.[Value] NOT IN (176, 177, 178, 179, 180, 185, 186, 187, 188,
191, 192, 193, 194, 195, 196, 197, 200, 201,
202, 203, 204, 205, 206, 217, 218, 219, 220,
223, 254, 213); -- characters only in CP850Context
StackExchange Database Administrators Q#197263, answer score: 10
Revisions (0)
No revisions yet.