snippetsqlMinor
SQL Server to MySQL migration - how can I remove UCS-2 surrogate pairs in SQL Server?
Viewed 0 times
pairscanmigrationsqlremovemysqlsurrogatehowserverucs
Problem
I have been tasked with migrating a SQL Server 2005 database to MySQL 5.6 (these are both database servers runnig locally) and would really appreciate some help.
-------Edit - Answered-------
This issue has now been resolved. I used Remus Rusanu's suggestion for finding the rows with these surrogate pair characters using
-----------------------------
-
SQL Server source database has
-
MySQL target database wants the character set UTF-8.
I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the SQL Server database.
The migration toolkit copytable program did not provide a very useful error message:
Error during charset conversion of wstring: No error.
It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking
-------Edit - Answered-------
This issue has now been resolved. I used Remus Rusanu's suggestion for finding the rows with these surrogate pair characters using
CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:UPDATE test
SET a = SUBSTRING(a, 1, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)-----------------------------
-
SQL Server source database has
latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.-
MySQL target database wants the character set UTF-8.
I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the SQL Server database.
The migration toolkit copytable program did not provide a very useful error message:
Error during charset conversion of wstring: No error.
It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking
Solution
You need to take the data from UTF-8 and convert it into UCS-2LE using something like iconv. For example, using the character in your example:
Now I'm not sure what character UTF-8 \xdcb3 is, but apparently it's correct translation to UCS-2LE is
I'm not an expert in the MySQL tool set so I can't say what step is missing that was supposed to do the iconv.
Update
to locate the records with surrogates you must turn to the binary representation, since any character function will threat the surrogates as 'special':. Luckily the string manipulation functions work on binary too with the expected semantics. Eg. CHARINDEX:
echo "010000: dcb3" | xxd -r -s -0x10000 | iconv -f "UTF-8" -t "UCS-2LE" | xxd
0000000: 3307Now I'm not sure what character UTF-8 \xdcb3 is, but apparently it's correct translation to UCS-2LE is
\U0733. If you have \0xDCB3 in the SQL Server it means it was not translated into UCS-2LE before import. You should not have surrogates in the NVARCHAR fields, UCS-2 is "surrogate agnostic". See UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer).I'm not an expert in the MySQL tool set so I can't say what step is missing that was supposed to do the iconv.
Update
to locate the records with surrogates you must turn to the binary representation, since any character function will threat the surrogates as 'special':. Luckily the string manipulation functions work on binary too with the expected semantics. Eg. CHARINDEX:
insert into test(a) values (N'a');
insert into test(a) values (NCHAR(0xdc83));
insert into test(a) values (N'b');
go
select * from test where charindex(0x83dc, cast(a as varbinary(8000))) > 0;Code Snippets
echo "010000: dcb3" | xxd -r -s -0x10000 | iconv -f "UTF-8" -t "UCS-2LE" | xxd
0000000: 3307insert into test(a) values (N'a');
insert into test(a) values (NCHAR(0xdc83));
insert into test(a) values (N'b');
go
select * from test where charindex(0x83dc, cast(a as varbinary(8000))) > 0;Context
StackExchange Database Administrators Q#37343, answer score: 4
Revisions (0)
No revisions yet.