What is the best way to remove accents (normalize) in a Python unicode string?
Problem
I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found an elegant way to do this (in Java):
- convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
- remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about Python 3?
I would like to avoid explicitly mapping characters.
Solution
How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, umlauts etc. are not "decoration".
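For comparison, here is a minimal Python 3 sketch of the unicodedata.combining variant mentioned above. The function name strip_accents_combining is made up for this illustration and is not taken from MiniQuark's answer. It assumes that dropping every character with a non-zero canonical combining class is what you want, which is similar to filtering on "Mn" but not guaranteed to agree with it on every character.

```python
import unicodedata

def strip_accents_combining(s):
    """Decompose to NFD, then drop characters that have a non-zero combining class."""
    decomposed = unicodedata.normalize('NFD', s)
    # unicodedata.combining(c) returns 0 for characters that are not combining marks
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Same input as the example above; in Python 3 the u prefix is optional.
greek = strip_accents_combining("A \u00c0 \u0394 \u038e")   # 'A A \u0394 \u03a5'

# The result can be a different word: German "schön" becomes "schon".
german = strip_accents_combining("sch\u00f6n")              # 'schon'
```

The second call illustrates the warning about meaning: the stripped result is a valid but different German word.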
Context
Stack Overflow Q#517923, score: 444