HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlMinor

How do I fix the Code Page in SSIS Lookup Transformation to be 65001?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fixthessispagehowcodelookuptransformation65001

Problem

I have an SQL server 2019 with databases and tables all set with the collation Latin1_General_100_CI_AS_SC_UTF8.

The relevant table has code and desc columns which are both of data type varchar.

In SSIS project I have a single Data Flow component:

I have a UTF-8 CSV file read with flat file connection, text column code to match is DT_STR, 65001

I have a Lookup that is set to "Full Cache" and loads the Latin1_General_100_CI_AS_SC_UTF8 table, but SSIS thinks the varchar columns are DT_STR, 1252

Finally the code in both CSV and lookup are matched and desc is sent to destination table which is on the same Latin1_General_100_CI_AS_SC_UTF8 collation. The destination component is set to AlwaysUseDefaultCodePage True and DefaultCodePage 65001.

I now get an error saying the column has more than one code page and cannot run the package.

If not for the mislabeled 1252, this package should run. I believe its something to do with ExternalMetadataXml, which is read-only and says all my lookup varchar columns are CodePage="1252".

If I manually edit the package .dtsx with npp and replace all instances of 1252 with 65001, the package can run and seems to do what I expected, as long as I never touch the lookup component again.. That seems a bit messed up of a solution tho, I am hoping there's someone else who has a cleaner way to fix this. Thanks.

edit 2021/03/10: I thought I should clarify, the lookup is using OLE to read a table from the SQL server, and is trying to match code column onto the code column read from the flat file. The problem is not the flat file, which is already recognised as DT_STR, 65001. The problem is the Lookup widget reading from a table that should be 65001, somehow ends up with 1252. If I didn't use OLE, and instead read from a UTF8 flat file to create the Lookup widget, there is no error.

Solution

With the disclaimer that I'm a "dumb American" who doesn't deal with non-English data but did work with a friend recently on using bulk import with UTF-8 data, here's what I see.

I have a pipe separated value file that looks like this

level|name
7|"Ovasino Poste de Santé"


Notepad++ indicates I have saved it as UTF-8.

I created two flat file connection managers in SSIS: Codepage65001STR and Codepage65001WSTR. They both use a Code Page of 65001 (UTF-8)

In the advanced tab for the STR variant, I left the data type as DT_STR

In the advanced tab for WSTR variant, I changed the data type to DT_WSTR

I also created a table and loaded it with the same data

DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.dba_286478
(
    level
,   name
)
VALUES
(
    7 -- level - int
,   'Ovasino Poste de Santé' -- name - varchar(75)
);

DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
    level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);


I then created a data flow task with a Flat File Source using the different Flat File Connection Managers and added data viewers between them and an empty derived column (so I had an anchor point for the data viewer).

I did the same thing with an OLE DB Source pointing at my table as well as a custom query of

SELECT
    T.level
,   CAST(T.name AS varchar(75)) AS name
FROM 
    dbo.dba_286478 AS T;


as well as explicitly defining the collation as it makes no different in SSIS

,   CAST(T.name COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS varchar(75)) AS name


The results all show the same, the final word is an accented Sante. If the UTF-8 hadn't happened, it'd show as Santé

At this point, it doesn't matter whether we DT_STR or DT_WSTR in our flat file source column definition, the component understands UTF-8 and UTF-16.

Properties, metadata of each. The Codepage 65001 STR looks as expected. code page of 65001 and data type DT_STR

Unicode, DT_WSTR looks good

The OLE components however, they're a different animal. The component is returning a metadata of DT_WSTR (full Unicode/UTF-16) regardless of whether we do an explicit cast to DT_STR, optionally specifying collation, or let the natural metadata flow through.

Either way, it doesn't detect the code page/collation stuff and just says Nope, you're Unicode

So, when we get to trying to use a Lookup task with an OLE DB connection manager, we can expect and receive the same inability to delineate between UTF-8 string/varchar and UTF-16/nvarchar

The error would indicate and that's true, DT_STR can't match DT_WSTR

Cannot map the input column, 'name', to the lookup column, 'name', because the data types do not match.

So what do I do?

You must have type alignment to make the lookup component work which means the source data needs to be of type DT_WSTR. You can either bring the data in from the Flat File as Unicode or leave it as string with code page 65001. If you go the latter route, then you need to make a copy, Derived Column or Data Conversion work, of that column and use it in the Lookup component.

If you're pulling text out of the lookup component, that's now in your pipeline as Unicode so you probably want to then convert that to a string type with code page. Again, Derived Column or Data Conversion will be used.

SSIS OLE components don't understand UTF8

We saw with the source and lookup component that SSIS is going to treat the UTF-8 strings as UTF-16 but I assumed it would handle storing to the table just fine. Not so much.

My server collation is Latin1_General_100_CI_AI_SC_UTF8 and while I switched accent sensitivities between server and table definition of dbo.dba_286478, it doesn't matter in this case as it's UTF-8 all the way down.

For my Flat File Source, I use the STR based file which has the metadata shown above with the yellow highlighting. The Codepage 65001 for data type DT_STR is what we want.

I added an OLE DB Destination and pointed it at my table which again has the "name" column defined as UTF-8

name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8


Check this error!

Validation error. Data Flow Task OLE DB Destination [138]: The column "name" cannot be processed because more than one code page (65001 and 1252) are specified for it.

We only have code page 65001 at play in this data flow and yet, something in SSIS space is inferring/defaulting to a 1252 code page during validation.
Making it work

The componentry in a data flow task was built with OLE DB connections in mind. That's why the Lookup task supported OLE DB Connections for 2005, 2008 and maybe 2008R2? Long time ago now, I know but the Cache Connection Manager (aka anything else) option was added in later iterations because of the need to use something besides OLE connection managers especially given the push then was to deprecate t

Code Snippets

level|name
7|"Ovasino Poste de Santé"
DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.dba_286478
(
    level
,   name
)
VALUES
(
    7 -- level - int
,   'Ovasino Poste de Santé' -- name - varchar(75)
);

DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
    level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
SELECT
    T.level
,   CAST(T.name AS varchar(75)) AS name
FROM 
    dbo.dba_286478 AS T;
,   CAST(T.name COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS varchar(75)) AS name
name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8

Context

StackExchange Database Administrators Q#286478, answer score: 2

Revisions (0)

No revisions yet.