HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Unicode storage of \u202b RLE and \u202c PDE in a Unicode-aware database?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rlestorageu202cunicodepdedatabaseandu202baware

Problem

I'm building a new product for toponyms and in it the Arabic shows kinda like this:

^IArabic^Iﺰﻤﺑﺎﺑﻮﻳ^Iﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ$


Actually not quite. This is a real problem for my ASCII-spewing terminal, so I'll make an exception and screenshot text.

My question is about those U202B "Right-To-Left Embedding" (RLE), and U202C "Pop Directional Formatting" (PDF). Do those get stored as data? My first assumption was that the characters were rendered and not in the file, but alas they are there..

360    5E03 97E6 5171 548C 56FD 000A 0009 0041 0072 0061 0062 0069 0063 0009 202B 0632    布韦共和国..Arabic..ز
.............................................................................^HERE
389    0645 0628 0627 0628 0648 064A 202C 0009 202B 064F 062C 0647 0648 0631 064A 0629    مبابوي...ُجهورية
.....................................^HERE.....^HERE
422    0020 0632 0645 0628 0627 0628 0648 064A 202C 000A 0009 004E 006F 0074 0065 0073    .زمبابوي...Notes
...............................................^HERE


When storing Arabic in a database, do you typically store \u202b, and \u202c? They seem like they're rendering characters and not technically data? I'm simply wanting to process this text to throw in a database, and wondering if these characters should be present in the database, or stripped before insert.

Background

  • The screenshot was taken with VIM in a terminal (Kitty) which does not support the Arabic text because all characters get displayed on a grid.



  • The text comes from text-extraction (using pdftotext)



  • The pdf was produced by the "United Nations Group of Experts on Geographical Names". You can find the pdf (E/CONF.105/13) freely available here.

Solution

Arabic (as well as Hebrew and Syriac) are right-to-left languages. Hence they display in the opposite direction that the bytes are physical stored in. Having the proper display is controlled through non-printable characters that are interpreted only by the font / rendering system. These two characters in particular are used to control this (see original Unicode spec for starters: https://www.unicode.org/charts/PDF/U2000.pdf ), especially in the context of embedding right-to-left text in the same paragraph as left-to-right text (and the other way around).

So, you must keep them stored or else attempted to display this data later will render it backwards from how the language is supposed to appear and will hence be considered data loss. These are among many formatting control characters that are non-printable / zero-width.

The "official" description of how to work with these characters, from the Unicode Consortium, is (taken from "Chapter 23: Special Areas and Format Characters" top of page 868):


As with other format control characters, bidirectional ordering controls affect the layout of the text in which they are contained but should be ignored for other text processes, such as sorting or searching. However, text processes that modify text content must maintain these characters correctly, because matching pairs of bidirectional ordering controls must be coordinated, so as not to disrupt the layout and interpretation of bidirectional text. Each instance of a lre, rle, lro, or rlo is normally paired with a corresponding pdf. Likewise, each instance of an lri, rli, or fsi is normally paired with a corresponding pdi.

Regarding the importance of keeping (not discarding) these hidden formatting code points, the "Unicode® Standard Annex #9: UNICODE BIDIRECTIONAL ALGORITHM", in section "2.7 Markup and Formatting Characters" states (emphasis mine):


The explicit formatting characters introduce state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

and:


Whenever plain text is produced from a document containing markup (ed: HTML and/or CSS), the equivalent formatting characters should be introduced, so that the correct ordering is not lost.

Further explanation is provided in the (excellent) "Understanding Bidirectional (BIDI) Text in Unicode" document by Cal Henderson (taken from the O.P.'s answer) states:


... we could disallow these explicit characters (U+202A - U+202E) which is pretty easy. This does mean that anybody who wants to use them to include Neutrals at the edges of their Arabic usernames will be out of luck - and that sucks more when it's a comment they're posting, where the period jumps to the 'beginning' of the text.


If we want to allow use of these characters, the solution is fairly simple (if hard to implement): we need to make sure that every opening marker has a paired closing marker (PDF) so that the state stack coming out of the string is at the same state as when we went in. We also need to be careful that we don't allow any PDFs to be used without accompanying push markers, else we can't use any ourselves outside of the block.

So, even if the text of a particular cell is supposed to be entirely a right-to-left language, removing these markers could alter the placement of neutral characters (such as punctuation). For example (using SQL Server):

SELECT NCHAR(0x0671) + NCHAR(0x0679) +  N'!';
-- ٱٹ!

SELECT NCHAR(0x202B) + NCHAR(0x0671) + NCHAR(0x0679) + N'!' + NCHAR(0x202C);
-- ‫ٱٹ!‬


Planning to add them back in later, or having a client app add them back in, won't work because there is no inherent means of knowing that they were even being used, and if so, where they were placed.

The safest approach is to keep these characters

For example, you are attempting to include some of this text at the top of the question:

^IArabic^Iﺰﻤﺑﺎﺑﻮﻳ^Iﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ$

but clearly that's not displaying in the correct order. The bytes, however, are in the correct order:

Looking at just the first ... section (again, using SQL Server, hence it's Little Endian):

SELECT CONVERT(VARBINARY(MAX), N'ﺰﻤﺑﺎﺑﻮﻳ');


the bytes are:

3C00 3200 3000 3200 6200 3E00 B0FE E4FE 91FE 8EFE 91FE EEFE F3FE 3C00 3200 3000 3200 6300 3E00
    .    .    .    .    .    .    .    


As you can see, there are no additional formatting characters. Because the Arabic characters are strong right-to-left, the characters that follow – `). And to be clear, the 202 itself displays left-to-right, which would be clearer if the number wasn't a palindrome. If the number was 203, then it would still show as 203 and not 302. But the c` is strong left-to-right so it (and the characters that follow) display as expected.

How to fix? Just add the implicit left-to-right markers jus

Code Snippets

SELECT NCHAR(0x0671) + NCHAR(0x0679) +  N'!';
-- ٱٹ!

SELECT NCHAR(0x202B) + NCHAR(0x0671) + NCHAR(0x0679) + N'!' + NCHAR(0x202C);
-- ‫ٱٹ!‬
SELECT CONVERT(VARBINARY(MAX), N'<202b>ﺰﻤﺑﺎﺑﻮﻳ<202c>');
3C00 3200 3000 3200 6200 3E00 B0FE E4FE 91FE 8EFE 91FE EEFE F3FE 3C00 3200 3000 3200 6300 3E00
<    2    0    2    b    >    .    .    .    .    .    .    .    <    2    0    2    c    >

Context

StackExchange Database Administrators Q#225996, answer score: 6

Revisions (0)

No revisions yet.