CppGraph Application Framework©
Copyright © 2004-202x Geoff Goldberg
CppGraph© uses UTF-8 for all character and string data and operations (see utf8everywhere). The STL and platform APIs, on the other hand, use wchar_t.
The advantages of UTF-8 include:
The disadvantages of using wchar_t for character and string data and operations include:
- std::wstring algorithms may provide incorrect results on Windows.

UTF-related classes include:
- cppgraph::utf8_string
- cppgraph::utf16_string
- cppgraph::utf8_string_iterator
- cppgraph::utf16_string_iterator
- cppgraph::utf8_character
- cppgraph::utf16_character
- cppgraph::utf8_character_iterator
- cppgraph::utf16_character_iterator

Range-based iteration is supported by utf8_string, utf16_string, utf8_character, and utf16_character.
Whenever interaction with STL and platform APIs is necessary, the framework performs temporary conversions between UTF-8 and wchar_t. User code must perform the same conversions when interacting with STL and platform APIs. Note that STL's wstring on Windows uses 2 bytes per character (it assumes UCS-2), whereas the Windows character encoding is UTF-16 (a given character may occupy 2 or 4 bytes). Therefore, STL wstring functions that are sensitive to this difference cannot be used.
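To make the kind of conversion described above concrete, here is a minimal, self-contained sketch of a UTF-8 to UTF-16 conversion. It is not the CppGraph implementation: the function name utf8_to_utf16_sketch is hypothetical, std::u16string stands in for a platform wide string, and validation is deliberately minimal (overlong forms and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only (not part of CppGraph):
// decode UTF-8 bytes into code points and re-encode them as UTF-16.
std::u16string utf8_to_utf16_sketch(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t len = 0;    // number of bytes in this sequence

        // The lead byte determines the sequence length (1 to 4 bytes).
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes 6 payload bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");
        i += len;

        if (cp < 0x10000) {
            // Fits in a single 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair: two 16-bit code units (4 bytes).
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Passing the four UTF-8 bytes F0 9F 98 80 (U+1F600) yields two UTF-16 code units, which is exactly the case a 2-bytes-per-character assumption mishandles.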
Functions that perform string encoding conversions include:
- cppgraph::convert< std::wstring >( cppgraph::utf8_string const & )
- cppgraph::convert< utf16_string >( cppgraph::utf8_string const & )
- cppgraph::convert< utf16_string >( cppgraph::utf32_string const & )
- cppgraph::convert< utf32_string >( cppgraph::utf8_string const & )
- cppgraph::convert< utf32_string >( cppgraph::utf16_string const & )
- cppgraph::convert< utf8_string >( cppgraph::utf16_string const & )
- cppgraph::convert< utf8_string >( cppgraph::utf32_string const & )
- cppgraph::convert< utf8_string >( std::wstring const & )

Functions that perform character encoding conversions include:
- cppgraph::convert< utf32_character >( utf8_character const & )
- cppgraph::convert< utf32_character >( utf16_character const & )

The following table presents attributes of the relevant encodings. Note that UTF-8 is the only encoding listed that both supports all Unicode characters and is endian-agnostic.
| encoding | bytes per character | Unicode characters | endian-agnostic? |
|---|---|---|---|
| ASCII | 1 | 128 or 256 | yes |
| UTF-8 | 1, 2, 3, or 4 | all | yes |
| UTF-16 | 2 or 4 | all | no |
| UTF-32 | 4 | all | no |
| UCS-2 | 2 | common subset | no |
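
The "1, 2, 3, or 4" bytes listed for UTF-8 in the table can be illustrated with a small encoder. This is a sketch, not CppGraph code: the name encode_utf8_sketch is hypothetical, and the function assumes its input is a valid Unicode scalar value (at most U+10FFFF and not a surrogate).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only (not part of CppGraph):
// encode one Unicode code point as a UTF-8 byte sequence.
// Assumes cp is a valid scalar value; no validation is performed.
std::string encode_utf8_sketch(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (the ASCII range)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because the output is a sequence of single bytes in a fixed order, no byte-order mark or endianness convention is needed, which is what the "endian-agnostic" column refers to.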
The following table presents the attributes of the wchar_t data type:
| platform | STL encoding | platform encoding |
|---|---|---|
| POSIX | UTF-32 | UTF-32 |
| Windows | UCS-2 | UTF-16 |
Note that, for Windows, STL assumes UCS-2, whereas the actual encoding is UTF-16. STL's encoding fails for 4-byte UTF-16 characters!
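
The mismatch can be seen by comparing the number of 16-bit code units in a UTF-16 string with the number of characters it actually encodes. The sketch below uses std::u16string so that it behaves the same on every platform; the helper name count_code_points_sketch is hypothetical and is not part of CppGraph, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not part of CppGraph):
// count Unicode code points in well-formed UTF-16 by skipping the
// trailing (low) surrogate of each surrogate pair.
std::size_t count_code_points_sketch(const std::u16string& s)
{
    std::size_t count = 0;
    for (char16_t unit : s) {
        const bool is_low_surrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!is_low_surrogate)
            ++count;
    }
    return count;
}

// Build the two-character text "a" followed by U+1F600, which UTF-16
// encodes as the surrogate pair D83D DE00.
std::u16string make_sample_sketch()
{
    std::u16string s;
    s.push_back(u'a');
    s.push_back(static_cast<char16_t>(0xD83D));
    s.push_back(static_cast<char16_t>(0xDE00));
    return s;
}
```

For the sample string, size() reports 3 code units while count_code_points_sketch reports 2 characters. Any algorithm that treats each 16-bit unit as a whole character, as a UCS-2 assumption does, would miscount or split the second character.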