HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Find non-UTF-8 data in mysql

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
utfnonmysqlfinddata

Problem

I have a MySQL database containing data that is input by our customers.

Customers are told their data should only be encoded as UTF-8, however this is currently not being enforced and some people are using characters not encoded in UTF-8.

We intend to add some checking, however how should we detect data that isn't encoded using UTF-8? Is there a SQL or MySQL query (or admin command) that will show me the data that is NOT encoded as UTF-8?

Solution

You can check for the existence of (non-)UTF-8 data by comparing byte length to character length on a column, e.g.:

SELECT * FROM MyTable
WHERE LENGTH(MyColumn) <> CHAR_LENGTH(MyColumn)


Multibyte characters will have a greater LENGTH (bytes), so you'll need to look for where that condition isn't met.

Note that MySQL's utf8 character set isn't true Unicode UTF-8 as it only supports a maximum of 3 bytes per character. If your MySQL is later than 5.5.3 you can use utf8mb4 to get 4 bytes per character.

Code Snippets

SELECT * FROM MyTable
WHERE LENGTH(MyColumn) <> CHAR_LENGTH(MyColumn)

Context

StackExchange Database Administrators Q#77101, answer score: 14

Revisions (0)

No revisions yet.