
Sort the characters in a UnicodeString

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

The task: sort the characters in a string held in an ICU UnicodeString. I want to find anagrams using the approach suggested in "Programming Pearls": compute a "signature" for each word in the dictionary (its characters in sorted order), then sort the words by signature. It is enough if it works for European languages and scripts.
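The signature idea can be sketched as follows. This is a minimal illustration, not the "Programming Pearls" program itself: it groups words with a std::map instead of sorting the list by signature, it uses plain std::string bytes as characters (so it only makes sense for single-byte text such as ASCII), and the names signature and group_anagrams are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Signature of a word: its characters in sorted order.
// Assumption: each byte of the std::string is one character (ASCII-like input).
std::string signature(std::string word) {
    std::sort(word.begin(), word.end());
    return word;
}

// Group words by signature. Any group with more than one member
// is a set of anagrams of each other.
std::map<std::string, std::vector<std::string>>
group_anagrams(const std::vector<std::string>& words) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const std::string& w : words) {
        groups[signature(w)].push_back(w);
    }
    return groups;
}
```

Calling group_anagrams on the words "pots", "stop", "tops" and "cat" puts the first three under the signature "opst" and leaves "cat" alone under "act".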

The input is, for example, the word list in /usr/share/dict/words. For now, I read words from standard input and, for each one, print the word, sort its characters, and print the sorted word:

$ cat sort-each-word.cpp 
#include <algorithm>
#include <iostream>
#include "unicode/ustream.h"
#include "unicode/unistr.h"
#include "unicode/schriter.h"

int main()
{
    icu::UnicodeString word;
    while (std::cin >> word) {
        std::cout << word << '\t';
        auto n = word.length();
        UChar *begin = word.getBuffer(n+1);
        UChar *end = begin + n;
        std::sort(begin, end);
        *(begin+n) = 0;
        word.releaseBuffer(n);
        std::cout << word << '\n';
    }
}

To compile it:

g++  -pedantic -Wall -O2 --std=c++0x    -L/usr/lib -licui18n -licuuc -licudata   -licuio   sort-each-word.cpp   -o sort-each-word

Then, for example:

$ cat naivete
naiveté
$ < naivete ./sort-each-word
naiveté aeintvé

I feel most uncomfortable about getBuffer(n+1), where n is the current string length. I need the extra space to terminate the sorted string, but must I check that length() + 1 <= getCapacity()?

Other suggestions and comments are of course most welcome.

Solution

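You should not need the n + 1: UnicodeString keeps track of the string's length itself, whether or not the buffer is null-terminated. You also do not need to compare against getCapacity(), because getBuffer(minCapacity) guarantees that the returned buffer holds at least minCapacity code units. Both points are covered in the ICU documentation for getBuffer() and releaseBuffer().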

So the line *(begin+n) = 0; is unnecessary, especially since you tell ICU that the new length is n, which means that begin[n] is ignored anyway.

A general remark: the code may be easier to read like this:

auto len = word.length();
UChar *buf = word.getBuffer(len);
std::sort(buf, buf + len);
word.releaseBuffer(len);

I renamed n to len, renamed begin to buf and removed end altogether. Given the idiomatic C++ usage of the (begin, begin + len) pattern, it saves some code and makes the intention of the variables a little clearer.

If you want to support full Unicode, you should not sort UChar values, since a UChar is only a UTF-16 code unit. Code points at U+10000 and beyond (emoji, extended CJK ideographs, Byzantine musical symbols and several more scripts) are stored as two code units, a surrogate pair, and sorting the units individually can separate the two halves of a pair.
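A minimal sketch of sorting by code point rather than by code unit. It is written against std::u16string so that it does not depend on ICU, it assumes well-formed UTF-16 input, and the function names are hypothetical. With ICU, the same decode, sort, encode shape should be possible through UnicodeString::toUTF32 and UnicodeString::fromUTF32, but that variant is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-16 code units into code points.
// Assumption: input is well-formed; an unpaired surrogate is simply
// copied through as its own value.
std::vector<char32_t> decode_utf16(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t hi = s[i];
        bool is_high = hi >= 0xD800 && hi <= 0xDBFF;
        if (is_high && i + 1 < s.size()) {
            char16_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the surrogate pair into one supplementary code point.
                out.push_back(0x10000 + ((char32_t(hi) - 0xD800) << 10)
                                      + (char32_t(lo) - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(hi);
    }
    return out;
}

// Encode code points back into UTF-16 code units.
std::u16string encode_utf16(const std::vector<char32_t>& cps) {
    std::u16string out;
    for (char32_t cp : cps) {
        assert(cp <= 0x10FFFF);  // outside the Unicode range otherwise
        if (cp >= 0x10000) {
            char32_t v = cp - 0x10000;
            out.push_back(char16_t(0xD800 + (v >> 10)));
            out.push_back(char16_t(0xDC00 + (v & 0x3FF)));
        } else {
            out.push_back(char16_t(cp));
        }
    }
    return out;
}

// Sort whole code points, so surrogate pairs are never split.
std::u16string sort_code_points(const std::u16string& s) {
    std::vector<char32_t> cps = decode_utf16(s);
    std::sort(cps.begin(), cps.end());
    return encode_utf16(cps);
}
```

For a word containing two different emoji, sorting the raw code units would place both high surrogates before both low surrogates and so produce ill-formed UTF-16, while sort_code_points keeps each pair together. Note that this still orders by numeric code point value, not by any language's collation rules.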

Context

StackExchange Code Review Q#136465, answer score: 2
