patternsqlMinor
Character Conversion Issue Importing From CSV File
Viewed 0 times
conversionfileissuecsvcharacterimportingfrom
Problem
After doing a load of CSV files, there are various words that are "written" to the database incorrectly.
Some examples:
Is there any way to translate the symbols to the correct characters?
I've done several tests with different
Some examples:
Diã¡riashould beDiária
Crã©ditoshould beCrédito
Ligaã§ãµesshould beLigações
Usuã¡rioshould beUsuário
Nãºmerosshould beNúmeros
Is there any way to translate the symbols to the correct characters?
I've done several tests with different
collations and functions that could search the internet, but without success.Solution
there are various words that are "written" to the database incorrectly.
No, the characters are being read incorrectly. They are being written correctly. Or, another way to look at this is: the characters are being written incorrectly to the data file. Either way, SQL Server is doing exactly what it is being asked to do.
This is a simple encoding issue. The data was originally exported as UTF-8, but then that UTF-8 encoded file was read into SQL Server as if it were an Extended ASCII file using Code Page 1252. There are three clues which indicate that this is the issue:
UTF-8 is a multi-byte encoding: it uses a different number of bytes depending on the character being encoded. The first 128 Code Points (U+0000 - U+007F), which contain the US English alphabet, all use 1 byte. Code Points above that range take 2 - 4 bytes. This is why some characters transferred between systems as expected:
-
The
-
The
Your choices are:
-
Encode the file as (Extended) ASCII using Code Page 1252 (as it is exported), and do not change how it is being read into SQL Server. (it sounds like you have already done this).
-
Continue exporting the file using the UTF-8 encoding, but change how the file is being read into SQL Server by specifying that it is encoded as UTF-8. Please note that for anyone using BCP.exe,
No, the characters are being read incorrectly. They are being written correctly. Or, another way to look at this is: the characters are being written incorrectly to the data file. Either way, SQL Server is doing exactly what it is being asked to do.
This is a simple encoding issue. The data was originally exported as UTF-8, but then that UTF-8 encoded file was read into SQL Server as if it were an Extended ASCII file using Code Page 1252. There are three clues which indicate that this is the issue:
- Most characters come through correctly. The ones that don't are the accented characters.
- The characters that are incorrect are coming through as two characters instead of one.
- You mentioned that exporting to an "ASCII" file worked.
UTF-8 is a multi-byte encoding: it uses a different number of bytes depending on the character being encoded. The first 128 Code Points (U+0000 - U+007F), which contain the US English alphabet, all use 1 byte. Code Points above that range take 2 - 4 bytes. This is why some characters transferred between systems as expected:
N (upper-case Latin "N") is 0x4E in UTF-8 as well as in Code Page 1252 (in fact, it's 0x4E across all 8-bit Code Pages supported by SQL Server). This is one of the reasons that UTF-8 is so popular. However, the accented characters are not 1 byte in UTF-8:-
The
á ( U+00E1 ) is encoded as two bytes in UTF-8: 0xC3 and 0xA1. When those two bytes are read by something expecting Code Page 1252, they are interpreted as being à ( 0xC3 on Code Page 1252 ) and ¡ ( 0xA1 on Code Page 1252 ). Then, either you or your import process, lower-cased the à (probably because it was in the middle of a word) which is how you ended up with: Usuã¡rio.-
The
ú ( U+00FA ) is encoded as two bytes in UTF-8: 0xC3 and 0xBA. When those two bytes are read by something expecting Code Page 1252, they are interpreted as being à ( 0xC3 on Code Page 1252 ) and º ( 0xBA on Code Page 1252 ). Then, either you or your import process, lower-cased the à (probably because it was in the middle of a word) which is how you ended up with: Nãºmero.Your choices are:
-
Encode the file as (Extended) ASCII using Code Page 1252 (as it is exported), and do not change how it is being read into SQL Server. (it sounds like you have already done this).
-
Continue exporting the file using the UTF-8 encoding, but change how the file is being read into SQL Server by specifying that it is encoded as UTF-8. Please note that for anyone using BCP.exe,
BULK INSERT, or OPENROWSET(BULK...), this option only became available starting in SQL Server 2016. The Code Page to use is 65001 (which usually means UTF-8 with Byte Order Mark, but I am not sure if SQL Server requires the Byte Order Mark in these cases).Context
StackExchange Database Administrators Q#204887, answer score: 6
Revisions (0)
No revisions yet.