CppGraph Application Framework©
Copyright © 2004-202x Geoff Goldberg
Unicode

Overview

CppGraph© uses UTF-8 for all character and string data and operations (see utf8everywhere). The STL and platform APIs, on the other hand, use wchar_t.

Advantages of UTF-8

The advantages of UTF-8 include:

Disadvantages of wchar_t

The disadvantages of using wchar_t for character and string data and operations include:

UTF Types

UTF-related classes include:

Iteration

Range-based iteration is supported by utf8_string, utf16_string, utf8_character, and utf16_character.

for ( auto const & character : my_utf8_string ) // Yields a utf8_character const &.
{
    std::cout << character; // Writes a UTF-8 character (1 to 4 bytes).
}
for ( auto char_ : my_utf8_character ) // Yields each of the character's 1 to 4 bytes.
{
    std::cout << char_; // Writes 1 byte of the UTF-8 character.
}
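
The "1 to 4 bytes" per character follows from the UTF-8 lead byte, whose high bits state how long the sequence is. The following is a minimal sketch of that grouping using plain std::string rather than the framework's classes; utf8_sequence_length and split_utf8_characters are hypothetical names chosen for illustration and are not CppGraph functions. The sketch assumes well-formed input and does not reject overlong forms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a CppGraph function): number of bytes in the
// UTF-8 sequence introduced by `lead`. Returns 0 for a continuation byte
// (10xxxxxx) or for 0xF8 and above, which cannot start a sequence.
inline std::size_t utf8_sequence_length( unsigned char lead )
{
    if ( lead < 0x80 ) return 1;             // 0xxxxxxx
    if ( ( lead & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx
    if ( ( lead & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx
    if ( ( lead & 0xF8 ) == 0xF0 ) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: split a UTF-8 string into one std::string per character.
inline std::vector<std::string> split_utf8_characters( std::string const & text )
{
    std::vector<std::string> characters;
    std::size_t index = 0;
    while ( index < text.size() )
    {
        std::size_t length = utf8_sequence_length( static_cast<unsigned char>( text[index] ) );
        if ( length == 0 || index + length > text.size() )
        {
            length = 1; // Malformed input: emit the byte alone instead of reading past the end.
        }
        characters.push_back( text.substr( index, length ) );
        index += length;
    }
    return characters;
}
```

Each element of the returned vector holds the bytes of one character, which mirrors what the outer loop above visits and what the inner loop then walks byte by byte.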

Encoding Conversion

Whenever interaction with STL and platform APIs is necessary, the framework performs temporary conversions between UTF-8 and wchar_t. User code must perform the same conversions when interacting with STL and platform APIs. Note that the STL's wstring on Windows uses 2 bytes per character (it assumes UCS-2), whereas the Windows character encoding is UTF-16 (a given character may be 2 or 4 bytes). Therefore, STL wstring functions that are sensitive to this difference, such as those that count or index characters, cannot be used.
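
The kind of conversion described above can be sketched in portable C++. This is an illustration only: utf8_to_utf16_sketch is a hypothetical name, not a CppGraph function, and it produces a std::u16string rather than a wstring so that the code-unit width is the same on every platform. It throws on malformed lead bytes, truncated sequences, and out-of-range values, but does not check for overlong forms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a CppGraph function): decode UTF-8 bytes into
// code points and re-encode each code point as UTF-16 code units.
inline std::u16string utf8_to_utf16_sketch( std::string const & utf8 )
{
    std::u16string utf16;
    std::size_t i = 0;
    while ( i < utf8.size() )
    {
        unsigned char const lead = static_cast<unsigned char>( utf8[i] );
        std::uint32_t code_point = 0;
        std::size_t length = 0;
        if ( lead < 0x80 )                  { code_point = lead;        length = 1; }
        else if ( ( lead & 0xE0 ) == 0xC0 ) { code_point = lead & 0x1F; length = 2; }
        else if ( ( lead & 0xF0 ) == 0xE0 ) { code_point = lead & 0x0F; length = 3; }
        else if ( ( lead & 0xF8 ) == 0xF0 ) { code_point = lead & 0x07; length = 4; }
        else throw std::invalid_argument( "invalid UTF-8 lead byte" );

        if ( i + length > utf8.size() )
            throw std::invalid_argument( "truncated UTF-8 sequence" );

        // Each continuation byte contributes its low 6 bits.
        for ( std::size_t k = 1; k < length; ++k )
        {
            unsigned char const next = static_cast<unsigned char>( utf8[i + k] );
            if ( ( next & 0xC0 ) != 0x80 )
                throw std::invalid_argument( "invalid UTF-8 continuation byte" );
            code_point = ( code_point << 6 ) | ( next & 0x3F );
        }
        i += length;

        if ( code_point > 0x10FFFF )
            throw std::invalid_argument( "code point out of Unicode range" );

        if ( code_point < 0x10000 )
        {
            utf16.push_back( static_cast<char16_t>( code_point ) ); // One 2-byte unit.
        }
        else
        {
            code_point -= 0x10000; // Two 2-byte units: a surrogate pair.
            utf16.push_back( static_cast<char16_t>( 0xD800 + ( code_point >> 10 ) ) );
            utf16.push_back( static_cast<char16_t>( 0xDC00 + ( code_point & 0x3FF ) ) );
        }
    }
    return utf16;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a wstring just before a wide-character API call and discarded afterwards, which keeps the wchar_t data temporary.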

Functions that perform string encoding conversions include:

Functions that perform character encoding conversions include:

Encoding Comparisons

The following table presents attributes of the relevant encodings. Note that UTF-8 is the only encoding that both supports all Unicode characters and is endian-agnostic.

encoding    bytes per character    Unicode characters    endian-agnostic?
ASCII       1                      128 or 256            yes
UTF-8       1, 2, 3, or 4          all                   yes
UTF-16      2 or 4                 all                   no
UTF-32      4                      all                   no
UCS-2       2                      common subset         no
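
The "bytes per character" column can be reproduced from a code point's numeric value. The functions below are a small sketch with hypothetical names (they are not part of CppGraph), and they assume the argument is a valid Unicode code point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers: encoded size in bytes of one Unicode code point.
constexpr std::size_t utf8_bytes( std::uint32_t code_point )
{
    return code_point < 0x80 ? 1 : code_point < 0x800 ? 2 : code_point < 0x10000 ? 3 : 4;
}

constexpr std::size_t utf16_bytes( std::uint32_t code_point )
{
    return code_point < 0x10000 ? 2 : 4; // Above U+FFFF a surrogate pair is needed.
}

constexpr std::size_t utf32_bytes( std::uint32_t )
{
    return 4; // Always one 4-byte unit.
}

// UCS-2 has no surrogate pairs, so it can only represent U+0000 to U+FFFF.
constexpr bool fits_in_ucs2( std::uint32_t code_point )
{
    return code_point < 0x10000;
}

static_assert( utf8_bytes( 0x41 ) == 1, "U+0041 'A' is 1 byte in UTF-8" );
static_assert( utf8_bytes( 0x20AC ) == 3, "U+20AC is 3 bytes in UTF-8" );
static_assert( utf16_bytes( 0x1F600 ) == 4, "U+1F600 is 4 bytes in UTF-16" );
static_assert( !fits_in_ucs2( 0x1F600 ), "U+1F600 is outside UCS-2" );
```

The last assertion is the "common subset" limitation in the UCS-2 row: any code point above U+FFFF has no UCS-2 representation.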

The following table presents the attributes of the wchar_t data type:

platform    STL encoding    platform encoding
POSIX       UTF-32          UTF-32
Windows     UCS-2           UTF-16

Note that, on Windows, the STL assumes UCS-2, whereas the actual encoding is UTF-16. The STL's assumption fails for 4-byte UTF-16 characters (surrogate pairs)!
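
The mismatch can be shown with 16-bit code units directly. The sketch below uses std::u16string so that it behaves the same on every platform, and utf16_character_count is a hypothetical helper rather than a CppGraph function. It assumes well-formed UTF-16, in which every low surrogate follows a high surrogate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count characters (code points) in well-formed UTF-16
// by skipping low (trailing) surrogates, which never start a character.
inline std::size_t utf16_character_count( std::u16string const & text )
{
    std::size_t count = 0;
    for ( char16_t unit : text )
    {
        if ( unit < 0xDC00 || unit > 0xDFFF )
        {
            ++count;
        }
    }
    return count;
}

// U+1F600 encoded as a surrogate pair: one character, two 16-bit code units.
inline std::u16string make_surrogate_pair_example()
{
    return std::u16string{ static_cast<char16_t>( 0xD83D ), static_cast<char16_t>( 0xDE00 ) };
}
```

For the example string, size() reports 2 because it counts code units, while utf16_character_count reports 1. Any code that treats each 16-bit unit as a whole character will therefore miscount or split such characters.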