gotcha | Moderate
Unicode handling pitfalls — emoji length, string slicing, normalization
emoji length, grapheme cluster, utf8mb4, Intl.Segmenter, normalize, code point
Problem
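String operations produce unexpected results with emoji and many non-Latin scripts. In JavaScript, the family emoji 👨‍👩‍👧‍👦 (four people joined by zero-width joiners) reports a .length of 11, and slicing by index can cut an emoji in half, leaving a lone surrogate that renders as a replacement character.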
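A minimal sketch of both symptoms, assuming a JavaScript runtime such as Node.js or a browser. The emoji are written as escapes so the zero-width joiners stay visible; the variable names are illustrative only.

```javascript
// Family emoji: man, woman, girl, boy joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// .length counts UTF-16 code units, not what the user sees.
console.log(family.length); // 11

// Slicing by code-unit index can split a surrogate pair.
const thumbsUp = "\u{1F44D}";           // one emoji, two code units
const broken = thumbsUp.slice(0, 1);    // keeps only the high surrogate
console.log(thumbsUp.length);           // 2
console.log(broken === "\uD83D");       // true: a lone surrogate, not valid text
```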
Solution
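(1) Iterate with Array.from(str) or [...str] so that surrogate pairs stay intact; both iterate by code point. A code point is still not a user-perceived character, so for emoji families, flags, and combining sequences use Intl.Segmenter with granularity 'grapheme'.
(2) Database: use utf8mb4 in MySQL. MySQL's utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore cannot hold code points outside the Basic Multilingual Plane, including most emoji.
(3) Enforce length limits in grapheme clusters rather than in .length.
(4) Normalize with str.normalize('NFC') before comparing or storing, so that precomposed and decomposed forms of the same text compare equal.
(5) Sort with Intl.Collator instead of the default code-unit comparison.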
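The JavaScript-side fixes can be sketched roughly as follows. The helpers graphemeLength and truncateGraphemes are hypothetical names introduced for illustration, and the "en" and "de" locales are assumptions; the sketch also assumes a runtime that ships Intl.Segmenter (recent Node.js and current browsers).

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function graphemeLength(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _ of segmenter.segment(str)) count++;
  return count;
}

// Hypothetical helper: keep at most maxGraphemes clusters, never splitting one.
function truncateGraphemes(str, maxGraphemes, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count >= maxGraphemes) break;
    out += segment;
    count++;
  }
  return out;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log([...family].length);      // 7 code points: better than 11, still not 1
console.log(graphemeLength(family));  // 1
console.log(truncateGraphemes("a" + family + "b", 2) === "a" + family); // true

// Normalization: precomposed "é" vs "e" followed by a combining acute accent.
const composed = "\u00E9";
const decomposed = "e\u0301";
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize("NFC")); // true

// Sorting: default sort compares code units; Intl.Collator follows locale rules.
const words = ["z", "ä", "a"];
const byCodeUnit = [...words].sort();
const byLocale = [...words].sort(new Intl.Collator("de").compare);
console.log(byCodeUnit); // [ 'a', 'z', 'ä' ]
console.log(byLocale);   // [ 'a', 'ä', 'z' ]
```

Creating a segmenter or collator has some cost, so code that calls these helpers in a hot path would likely construct the Intl objects once and reuse them.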
Why
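Unicode text can be measured at several layers: bytes (which depend on the encoding), code units (UTF-16 in JS), code points, and grapheme clusters. .length counts UTF-16 code units, not user-perceived characters. The family emoji is seven code points (four people plus three zero-width joiners), and each person is a surrogate pair, so it occupies 4 × 2 + 3 = 11 code units.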
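To make the layers concrete, this sketch measures the same family emoji four ways. It assumes a runtime with TextEncoder and Intl.Segmenter available as globals (recent Node.js and current browsers); the variable names are illustrative.

```javascript
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Bytes: depends on the encoding; here UTF-8 (4 bytes per person, 3 per joiner).
const utf8Bytes = new TextEncoder().encode(family).length;

// Code units: what String.prototype.length reports (UTF-16).
const codeUnits = family.length;

// Code points: what string iteration yields.
const codePoints = [...family].length;

// Grapheme clusters: what the user perceives as one character.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)].length;

console.log({ utf8Bytes, codeUnits, codePoints, graphemes });
// { utf8Bytes: 25, codeUnits: 11, codePoints: 7, graphemes: 1 }
```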