Reichwein.IT Unicode Library
============================

This software package contains a C++ library for Unicode encoding conversion and command line tools which apply those functions in example runtime programs: recode and validate.

Properties
----------

* Supports C++17 and C++20
* Locale independent validation and conversion
* Supports UTF-8, UTF-16, UTF-32, ISO-8859-1 and ISO-8859-15
* Supports Linux and Windows
* Supports current compilers (clang++-11, clang++-13, g++-11, msvc-19.28.29337)

C++ interface (package libunicode-dev)
--------------------------------------

This library offers several ways to specify encodings. Source and destination encodings can be specified explicitly, or, for the Unicode UTF encodings, implicitly via the respective C++ types: for char8_t, char16_t and char32_t, UTF-8, UTF-16 and UTF-32 are used automatically. In C++17, where char8_t is not available, char is used instead. The same applies to the std::basic_string<> specializations std::u8string (or std::string in C++17), std::u16string and std::u32string.

The main purpose of this library is conversion (and validation) between Unicode encodings. However, Latin-1 (i.e. ISO 8859-1) and Latin-9 (i.e. ISO 8859-15) are also implemented for practical reasons. Since text in the Latin character sets is also stored in char and std::string (necessarily so in C++17), the Latin encodings must be specified explicitly to disambiguate them from Unicode, which is the default otherwise: UTF-8 for all 8 bit character types, UTF-16 for 16 bit character types and UTF-32 for 32 bit character types.

Besides the character and string types from the STL, common container types like std::vector, std::deque, std::list and std::array (the latter only as source) are supported.
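The need to name the Latin encodings explicitly can be illustrated without the library itself. The same byte stored in a char means different characters in ISO 8859-1 and ISO 8859-15, so the type alone cannot tell them apart. The following is a minimal standalone sketch; the function names are hypothetical and not part of this library's interface, and it makes no claim about how the library implements its conversions.

```cpp
#include <string>

// Hypothetical helpers for illustration only; not part of libunicode.

// ISO 8859-1 (Latin-1): every byte value equals its Unicode code point.
constexpr char32_t latin1_to_code_point(unsigned char byte) {
    return static_cast<char32_t>(byte);
}

// ISO 8859-15 (Latin-9): identical to ISO 8859-1 except for eight positions.
constexpr char32_t latin9_to_code_point(unsigned char byte) {
    switch (byte) {
    case 0xA4: return 0x20AC; // EURO SIGN
    case 0xA6: return 0x0160; // LATIN CAPITAL LETTER S WITH CARON
    case 0xA8: return 0x0161; // LATIN SMALL LETTER S WITH CARON
    case 0xB4: return 0x017D; // LATIN CAPITAL LETTER Z WITH CARON
    case 0xB8: return 0x017E; // LATIN SMALL LETTER Z WITH CARON
    case 0xBC: return 0x0152; // LATIN CAPITAL LIGATURE OE
    case 0xBD: return 0x0153; // LATIN SMALL LIGATURE OE
    case 0xBE: return 0x0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
    default:   return static_cast<char32_t>(byte);
    }
}

// Decodes a byte string to Unicode code points with the given per-byte decoder.
std::u32string decode_bytes(const std::string& bytes,
                            char32_t (*decode)(unsigned char)) {
    std::u32string result;
    result.reserve(bytes.size());
    for (char c : bytes) {
        result.push_back(decode(static_cast<unsigned char>(c)));
    }
    return result;
}
```

Decoding the byte 0xA4 with latin1_to_code_point yields U+00A4 (CURRENCY SIGN), while latin9_to_code_point yields U+20AC (EURO SIGN). Interpreted as UTF-8, the same lone byte would be invalid, which is why a plain char or std::string source needs an explicit encoding whenever it does not hold UTF-8.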
The basic convention for the conversion interface is:

    to = unicode::convert<FromType, ToType>(from);

where FromType and ToType can be one of:

  (1) Character type like char, char8_t, char16_t and char32_t
  (2) Container type like std::string, std::list, std::deque
  (3) Explicit encoding like unicode::UTF_8, unicode::UTF_16, unicode::UTF_32, unicode::ISO_8859_1 or unicode::ISO_8859_15

For the validation interface, the same principle applies:

    bool flag = unicode::is_valid_utf<FromType>(from);

There is also a Unicode character validation function which operates on Unicode character values directly, i.e. no specific encoding is used but 32 bit (or smaller) values are checked for being a valid Unicode character:

    bool flag = unicode::is_valid_unicode(character_value);

While this validates a Unicode value in general, it does not tell whether the value is actually assigned in a particular Unicode version. E.g. as of 2022, in Unicode version 14.0, the character 0x1FABA "NEST WITH EGGS" is assigned, but 0x1FABB is not. Both of them are reported as valid by unicode::is_valid_unicode().

See also: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

Examples:

    #include ...

C++17 conversion of a UTF-8 string to UTF-16:

    std::string utf8_value {u8"äöü"};
    std::u16string utf16_value{unicode::convert<char, char16_t>(utf8_value)};

C++20 conversion of a UTF-8 string to UTF-16:

    std::u8string utf8_value {u8"äöü"};
    std::u16string utf16_value{unicode::convert<char8_t, char16_t>(utf8_value)};

The following encodings are implicitly deduced from types:

* char or char8_t (C++20): UTF-8
* char16_t: UTF-16
* char32_t: UTF-32

Specification via container types:

    std::deque<char> utf8_value {...};
    std::list<char16_t> utf16_value{unicode::convert<std::deque<char>, std::list<char16_t>>(utf8_value)};

Explicit encoding specification:

    std::string value {"äöü"};
    std::u32string utf32_value{unicode::convert<unicode::ISO_8859_1, unicode::UTF_32>(value)};

Supported encodings are:

* unicode::UTF_8
* unicode::UTF_16
* unicode::UTF_32
* unicode::ISO_8859_1
* unicode::ISO_8859_15

Supported basic types for source and target characters:

* char
* char8_t (C++20)
* wchar_t (UTF-16 on Windows, UTF-32 on Linux)
* char16_t
* char32_t
* uint8_t, int8_t
* uint16_t, int16_t
* uint32_t, int32_t
* basically, all basic 8-bit, 16-bit and 32-bit types that can encode UTF-8, UTF-16 and UTF-32, respectively.

Supported container types:

* All std container types that can be iterated (vector, list, deque, array)
* Source and target containers can be different container types

Validation can be done like this:

    bool valid{unicode::is_valid_utf<char16_t>(utf16_value)};

Or via explicit encoding specification:

    bool valid{unicode::is_valid_utf<unicode::UTF_8>(utf8_value)};

CLI interface (package unicode-tools)
-------------------------------------

* unicode-recode

    Usage: recode

    Format:
      UTF-8        UTF-8
      UTF-16       UTF-16, native endian
      UTF-16LE     UTF-16, little endian
      UTF-16BE     UTF-16, big endian
      UTF-32       UTF-32, native endian
      UTF-32LE     UTF-32, little endian
      UTF-32BE     UTF-32, big endian
      ISO-8859-1   ISO-8859-1 (Latin-1)
      ISO-8859-15  ISO-8859-15 (Latin-9)

    Exit code: 0 if valid, 1 otherwise.

* unicode-validate

    Usage: validate

    Format:
      UTF-8        UTF-8
      UTF-16       UTF-16, big or little endian
      UTF-16LE     UTF-16, little endian
      UTF-16BE     UTF-16, big endian
      UTF-32       UTF-32, big or little endian
      UTF-32LE     UTF-32, little endian
      UTF-32BE     UTF-32, big endian

    Exit code: 0 if valid, 1 otherwise.

Contact
-------

Reichwein IT