HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Encoding Debug UTF8 & Latin 1

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
debugutf8latinencoding

Problem

We have on our DB some international people that got some characters saved incorrectly on our DB (MS SQL-Server 2008 R2).

For example we have strings like 'WilcoxonÃ'.

We got this data into the db because we take international submissions but management didn't want to use any character set even though we tried for years to make that happen. Now we finally convinced them to implement character set and new data is saved correctly.

What we are trying to do now, is do a search and replace for existing records (around 300 or so) with the incorrect data on them.

We've taken the UTF 8 debug table from here UTF8 Debug and turned it into SQL Table.

This is the code to create the table (I've tried using SQL Fiddle but it's taking ages to create the schema)

```
CREATE TABLE UTF8Encoding (
id int NOT NULL identity,
[unicode] nvarchar(250) NOT NULL default '',
[windows1252] nvarchar(250) NOT NULL default '',
[expected] nvarchar(250) NOT NULL default '',
[actual] nvarchar(250) NOT NULL default '',
[utf8bytes] nvarchar(250) NOT NULL default '',
PRIMARY KEY (id)
);

INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+20AC','0x80',N'€','€','%E2 %82 %AC');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('','0x81',N'','','');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+201A','0x82',N'‚','‚','%E2 %80 %9A');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+0192','0x83',N'ƒ','Æ’','%C6 %92');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+201E','0x84',N'„','„','%E2 %80 %9E');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+2026','0x85',N'…','…','%E2 %80 %A6');
INSERT INTO UTF8Encoding (unicode,windows1252,expected,actual,utf8bytes) VALUES ('U+2020','0x86',N'†','â€','%E2 %80 %A0');
INSERT INTO UTF8Encodin

Solution

The reason your function takes ages is because you have empty values for actual in UTF8Encoding. The patindex expression returns 1 when you check for an empty actual so you never exit the inner loop. You can fix that by adding and actual <> '' to the query against UTF8Encoding. Next issue is where you use @expected as parameter to nchar(). The parameter should be an integer so if you remove nchar() your code returns something but I don't think it is what you are looking for. Wilcoxonà is translated to WilcoxonÁƒƒÁ††™.

Another approach you can try is to use the XML capabilities in SQL Server. XML in SQL Server is UTF-16 but it is able to load UTF-8 encoded strings and that can be used.

Concatenate your string with a UTF-8 xml declaration and use the value() function to fetch the value from the constructed XML.

I guess you eventually want to use this on a table so here is an example that uses a table variable.

declare @T table(InputString varchar(max))

insert into @T values
('åäöÅÄÖ'),
('WilcoxonÃ')

select cast(''+T.InputString as xml).value('text()[1]', 'nvarchar(max)') as Value
from @T as T


Result:

Value
---------------
åäöÅÄÖ
WilcoxonÃ

Context

StackExchange Database Administrators Q#76794, answer score: 3

Revisions (0)

No revisions yet.