HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlMinor

Is it possible to use OPENROWSET to import fixed width UTF8 encoded files?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
useutf8encodedpossiblewidthfilesfixedopenrowsetimport

Problem

I have an example data file with following contents and saved with UTF8 encoding.

oab~opqr
öab~öpqr
öab~öpqr


The format of this file is fixed width with columns 1 to 3 each being allocated 1 character and column 4 reserved 5 characters.

I have created an XML format file as below

  
  
     
             
       
       
       
       
     
     
        
       
        
       
     


Disappointingly running the following SQL...

SELECT *
FROM OPENROWSET
(
BULK 'mydata.txt',
FORMATFILE = 'myformat_file.xml',
CODEPAGE = '65001'
) AS X


Produces the following results

Col1 Col2 Col3 Col4
---- ---- ---- -----
o    a    b    ~opqr
�    �    a    b~öp
�    �    a    b~öp


from which I conclude the LENGTH is counting bytes rather than characters.

Is there any way I can get this working correctly for fixed character widths with UTF8 encoding?

(Target environment is Azure SQL Database reading from Blob storage)

NB: It was suggested in the comments that adding COLLATION="LATIN1_GENERAL_100_CI_AS_SC_UTF8" to the FIELD elements might help but the results remain unchanged with this.

Solution

One workaround is to just change the format file to bring the whole line in, in bulk, and do the substring-ing in TSQL

With format file

  
  
     
       
     
     
       
     


The following does return the desired results

SELECT SUBSTRING(WholeLine, 1,1) AS Col1,
       SUBSTRING(WholeLine, 2,1) AS Col2,
       SUBSTRING(WholeLine, 3,1) AS Col3,
       SUBSTRING(WholeLine, 4,5) AS Col4
FROM OPENROWSET
(
BULK 'mydata.txt',
FORMATFILE = 'myformat_file.xml',
CODEPAGE = '65001'
) AS X

Code Snippets

<?xml version = "1.0"?>  
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
   <RECORD>  
      <FIELD xsi:type="CharTerm" ID="WholeLine" TERMINATOR="\n"/> 
   </RECORD>  
   <ROW>  
      <COLUMN SOURCE="WholeLine" NAME="WholeLine" xsi:type="SQLNVARCHAR"/> 
   </ROW>  
</BCPFORMAT>
SELECT SUBSTRING(WholeLine, 1,1) AS Col1,
       SUBSTRING(WholeLine, 2,1) AS Col2,
       SUBSTRING(WholeLine, 3,1) AS Col3,
       SUBSTRING(WholeLine, 4,5) AS Col4
FROM OPENROWSET
(
BULK 'mydata.txt',
FORMATFILE = 'myformat_file.xml',
CODEPAGE = '65001'
) AS X

Context

StackExchange Database Administrators Q#303337, answer score: 9

Revisions (0)

No revisions yet.