Source code level portable C++ Unicode literals

Submitted by: @import:stackexchange-codereview
Problem

Unfortunately, Windows console windows do not support stream I/O of international characters. For instance, in Windows 7 you can still do "chcp 65001" (which sets the active code page to UTF-8), type "more", and get a crash.

This means that it's practically impossible for a novice to write a "Hello, world!" program that:

  • has national characters in literals
  • will yield the same results in *nix and Windows regardless of country

After a little discussion about this on a Boost mailing list, I sat down to implement more fully an idea that I had sketched there: use the natural encoding of the OS for internal strings, which in practice means UTF-8 on *nix and UTF-16 on Windows. It's worth noting that e.g. the ICU library, at least according to its documentation, uses UTF-16 encoded strings.

With UTF-8 as a common external encoding, one has:

Extern Intern E.g. ICU library
UTF-8 - UTF-8 UTF-16 UTF-16 - UTF-16

To support that I define a macro with the incredibly short name U, which adapts a literal to the platform one compiles for and creates a strongly typed string or character. The typing helps to avoid using functions in a non-portable manner, and it enables argument-dependent lookup, like this:

#include
using namespace std;
namespace u = progrock::cppx::u;

int main()
{
u::out

Here u::out is either std::cout (for *nix) or std::wcout (for Windows). The source code needs to be stored as UTF-8 with BOM in order to compile cleanly with both g++ and Visual C++.
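
The definition of U is not shown in this excerpt. As a rough illustration of the idea only, a platform-selecting literal macro might look like the sketch below. All names here (SKETCH_U, sketch_u::Char, sketch_u::String, sketch_u::out) are hypothetical, and the sketch leaves out the strong typing that the real macro is said to provide.

```cpp
#include <iostream>
#include <string>

#ifdef _WIN32
    // Windows: wide literals, assumed to be UTF-16 encoded by the compiler.
    #define SKETCH_U(literal) L##literal
    namespace sketch_u {
        typedef wchar_t Char;
        inline std::wostream& out() { return std::wcout; }
    }
#else
    // *nix: narrow literals, assumed to be UTF-8 encoded by the compiler.
    #define SKETCH_U(literal) literal
    namespace sketch_u {
        typedef char Char;
        inline std::ostream& out() { return std::cout; }
    }
#endif

namespace sketch_u {
    // String type matching the platform's "natural" character type.
    typedef std::basic_string<Char> String;
}
```

With this in place, `sketch_u::out() << SKETCH_U( "Hello" )` compiles unchanged on both kinds of platform, because the literal and the stream always agree on the character type.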

If standard output goes to a Windows console window, then the Norwegian, Russian and Chinese characters result in correct Unicode code points in the console window's text buffer. The Norwegian and Russian text displays OK, the Chinese displays as rectangles (on my machine), and the text can be copied correctly to e.g. Notepad, which can display it all. If standard output is

Solution

Since the code is supposed to work on Unix too, the non-standard pragma is a bad idea:

#pragma once


Prefer to use normal include guards.
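
A minimal sketch of a conventional include guard follows. The guard macro name and the placeholder function are made up for illustration and are not taken from the reviewed code.

```cpp
// Hypothetical header. The guard macro name is an assumption; any name
// that is unique within the project works.
#ifndef PROGROCK_CPPX_U_SKETCH_HEADER_H
#define PROGROCK_CPPX_U_SKETCH_HEADER_H

namespace example {
    // Placeholder standing in for the real header content.
    inline int answer() { return 42; }
}

#endif  // PROGROCK_CPPX_U_SKETCH_HEADER_H
```

If the header is included twice in one translation unit, the second inclusion expands to nothing, so the declarations are seen only once.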

You are imbuing streams here:

std::locale const  utf8Locale( stream.getloc(), new CodecUtf8() );
stream.imbue( utf8Locale );
return stream;


The only problem I see with this is that once you have started using the stream, any attempt to imbue can silently fail (or at least it used to; they may have fixed that in C++11).
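
To stay clear of that problem, the locale can be imbued before any I/O has happened on the stream. The sketch below shows this for a file stream. It uses std::codecvt_utf8 (deprecated since C++17, but still shipped by common standard libraries) as a stand-in for the CodecUtf8 class in the reviewed code, and the helper names writeUtf8File and readBytes are hypothetical.

```cpp
#include <codecvt>    // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Writes `text` to `path` as UTF-8. The locale is imbued BEFORE the file is
// opened, so no I/O can have taken place when the conversion facet changes.
inline bool writeUtf8File(const std::string& path, const std::wstring& text)
{
    std::wofstream file;
    file.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    file.open(path.c_str(), std::ios::binary);
    if (!file) { return false; }
    file << text;
    return static_cast<bool>(file.flush());
}

// Reads the raw bytes of a file back, without any conversion.
inline std::string readBytes(const std::string& path)
{
    std::ifstream file(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

Writing the wide string L"bl\u00E5" this way should produce the four bytes 62 6C C3 A5, i.e. the UTF-8 encoding of "blå".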

Now I assume you are trying to force this initialization before use with:

static IStream& in      = naturalEncoding::StdStreams< encoding >::inStream();
static OStream& out     = naturalEncoding::StdStreams< encoding >::outStream();
static OStream& err     = naturalEncoding::StdStreams< encoding >::errStream();
static OStream& log     = naturalEncoding::StdStreams< encoding >::logStream();


This will work 99% of the time, but if somebody starts logging (using one of the std:: streams in/out/err/log) in the constructor of a global object with static storage duration, then all bets are off. Since this is a rare case I am not too worried, but you should document it somewhere, for example at the top of the header file (assuming it is still a problem).
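
One common way to narrow that window is the "construct on first use" idiom: hand the stream out through a function whose local static performs the setup on the first call. The sketch below is an assumption about how that could look, not the reviewed library's code. The names sketch::out, sketch::setupCount and sketch::EarlyLogger are hypothetical, and the classic locale stands in for the UTF-8 locale. It only helps callers that go through the accessor; code that writes to std::cout directly still bypasses it.

```cpp
#include <iostream>
#include <locale>

namespace sketch {
    // Counts how many times the one-time setup ran (for demonstration only).
    inline int& setupCount() { static int n = 0; return n; }

    // Hypothetical accessor: performs the setup exactly once, on first call.
    inline std::ostream& out()
    {
        static std::ostream& stream = []() -> std::ostream& {
            std::cout.imbue(std::locale::classic());  // stand-in for the UTF-8 locale
            ++setupCount();
            return std::cout;
        }();
        return stream;
    }

    // A type whose constructor logs. Because it reaches the stream through
    // out(), the setup has run even if an instance is a global object that
    // is constructed before main() starts.
    struct EarlyLogger {
        EarlyLogger() { out() << "constructed before main\n"; }
    };
}

// Global instance: its constructor runs during static initialization.
static sketch::EarlyLogger earlyLogger;
```

Since C++11 the initialization of a function-local static is thread-safe and happens once, so repeated calls to sketch::out() do not imbue the stream again.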

I don't see definitions for U(), writeTo(), raw() or CodingValue.

Context

StackExchange Code Review Q#5889, answer score: 4