Sort the characters in a UnicodeString
Problem
The task: sort the characters in a string, as provided by the ICU
UnicodeString. This is because I want to find anagrams using the suggestion from "Programming Pearls": compute a "signature" for each word in the dictionary, then sort the words by signature. It is enough if it works for the European languages and scripts.
The input file is, for example, the word list found in /usr/share/dict/words. For now, I just read words from standard input and, for each word, print it to standard output, sort its characters, and print the sorted word:
$ cat sort-each-word.cpp
#include <algorithm>
#include <iostream>
#include "unicode/ustream.h"
#include "unicode/unistr.h"
#include "unicode/schriter.h"
int main()
{
    icu::UnicodeString word;
    while (std::cin >> word) {
        std::cout << word << '\t';
        auto n = word.length();
        UChar *begin = word.getBuffer(n+1);
        UChar *end = begin + n;
        std::sort(begin, end);
        *(begin+n) = 0;
        word.releaseBuffer(n);
        std::cout << word << '\n';
    }
}
To compile it:
g++ -pedantic -Wall -O2 --std=c++0x -L/usr/lib -licui18n -licuuc -licudata -licuio sort-each-word.cpp -o sort-each-word
Then, for example:
$ cat naivete
naiveté
$ < naivete ./sort-each-word
naiveté aeintvé
I feel most uncomfortable about getBuffer(n+1), where n is the current string length. I need the extra space to terminate the sorted string, but must I check that length() + 1 <= getCapacity()?
Other suggestions and comments are of course most welcome.
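For context, the "signature" idea from "Programming Pearls" mentioned above could be sketched roughly as follows. This is only an illustration in standard C++ without ICU: it works on plain bytes in a std::string, so it is meaningful for single-byte text only, and the helper names signature and group_anagrams are hypothetical, not taken from the book or from ICU.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: a word's signature is its characters in sorted order.
// Sorts bytes, so this is only meaningful for single-byte (e.g. ASCII) text.
std::string signature(std::string word) {
    std::sort(word.begin(), word.end());
    return word;
}

// Hypothetical helper: group words that share a signature.
// Every group with more than one member is a set of anagrams.
std::map<std::string, std::vector<std::string>>
group_anagrams(const std::vector<std::string>& words) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& w : words) {
        groups[signature(w)].push_back(w);
    }
    return groups;
}
```

Using a std::map keyed by signature is a stand-in for the "sort according to the signatures" step: words with equal signatures end up in the same bucket.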
Solution
You should not need the
n + 1, since UnicodeString keeps track of the string's length whether or not the buffer is null-terminated. This is described in the ICU documentation.
So the line *(begin+n) = 0; is unnecessary, especially since you tell ICU that the new length is n, which means that begin[n] will be ignored anyway.
A general remark: the code may be easier to read as follows:
auto len = word.length();
UChar *buf = word.getBuffer(len);
std::sort(buf, buf + len);
word.releaseBuffer(len);
I renamed n to len, renamed begin to buf, and removed end altogether. Given the idiomatic C++ usage of the (begin, begin + len) pattern, this saves some code and makes the intention of the variables a little clearer.
If you want to support full Unicode, you should not use UChar, since that is only a UTF-16 code unit. Sorting code units cannot handle emoji, extended CJK ideographs, Byzantine musical symbols, and other characters at U+10000 and beyond, which UTF-16 encodes as two code units each.
Code Snippets
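To illustrate the remark about UChar being only a UTF-16 code unit, here is a minimal sketch that sorts by code point instead of by code unit. It deliberately avoids ICU and uses std::u16string as a stand-in for the UnicodeString buffer; the function names decode_utf16, encode_utf16 and sort_code_points are hypothetical, and the input is assumed to be well-formed UTF-16.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Decode UTF-16 code units into code points. Assumes well-formed input;
// an unpaired surrogate is passed through unchanged.
std::u32string decode_utf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        const bool lead = c >= 0xD800 && c <= 0xDBFF;
        if (lead && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a lead/trail surrogate pair into one code point.
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        out.push_back(c);
    }
    return out;
}

// Encode code points back into UTF-16 code units.
std::u16string encode_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t c : in) {
        if (c >= 0x10000) {
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(c));
        }
    }
    return out;
}

// Sort whole code points, so surrogate pairs are never split apart.
std::u16string sort_code_points(const std::u16string& word) {
    std::u32string cps = decode_utf16(word);
    std::sort(cps.begin(), cps.end());
    return encode_utf16(cps);
}
```

With real ICU code, the same round trip could presumably be done with UnicodeString::toUTF32 and UnicodeString::fromUTF32 rather than hand-written conversion. Note that this sketch still treats a base letter followed by a combining mark as two separate items.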
auto len = word.length();
UChar *buf = word.getBuffer(len);
std::sort(buf, buf + len);
word.releaseBuffer(len);
Context
StackExchange Code Review Q#136465, answer score: 2