HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Having charset "utf8mb4" - does it mean every character takes 4 bytes, or only those that need 4 bytes?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thoseneedmeanhavingeverycharacterthatbytesdoesutf8mb4

Problem

I have a VARCHAR(80) utf8mb4 column, and I'm testing adding some ASCII & Emoji characters, and use LENGTH() and CHAR_LENGTH() to understand the differences.

From reading in different places, my understanding is that each character on a utf8mb4 column would take 4 bytes. However, it appears that it may not be the case, if I understand LENGTH() as giving me the actual size that the specific contents are taking on that field.

Is it correct that a row with "aaaa" contents would take less storage than a row with ""?

Solution

Actually, my answer is in the Documentation:
https://mariadb.com/kb/en/unicode/

UTF-8 encoding using one to three bytes per character. Basic Latin
letters, numbers and punctuation use one byte. European and Middle
East letters mostly fit into 2 bytes. Korean, Chinese, and Japanese
ideographs use 3-bytes. No supplementary characters are stored.

utf8mb4 -- Same as utf8, but stores supplementary characters in four
bytes.

Also, as per https://mariadb.com/kb/en/data-type-storage-requirements/

String Data Types

In the descriptions below, M is the declared column length (in characters or in bytes), while len is the actual length in bytes of the value.

VARCHAR(M) -- len + 1 bytes if column is 0 – 255 bytes, len + 2 bytes if column may require more than 255 bytes

So, for VARCHAR(80) utf8mb4,

80x4 > 255, so a field with "aaaa" will take 6 bytes (len + 2)

Context

StackExchange Database Administrators Q#285336, answer score: 7

Revisions (0)

No revisions yet.