Unicode Library

C++ and command line Unicode library.

News

Introduction

Unicode encoding, conversion and validation are well understood and supported in C++ and on Linux. So why provide another Unicode library for C++?

Currently (with C++20 as the latest C++ standard), the C++ standard library provides Unicode conversion (std::wstring_convert and the <codecvt> facets), but these facilities have been marked as deprecated since C++17. boost::locale also offers Unicode conversion, but as the name suggests, it is locale dependent, and using it can add dozens of megabytes to a simple executable just for Unicode conversion, which should not depend on locales at all.

Therefore, this library provides a C++17 and C++20 conformant way to perform the basic task of converting between UTF-8 (default encoding under Linux), UTF-16 (default encoding under Windows) and UTF-32 (default encoding in Qt, and generally in GUI/typesetting like FreeType).

The command line interface is just a runtime application of the provided library. There are other tools available that offer the same functionality, see below.

Features

Command line interface (package unicode-tools)

unicode-recode

  Usage: unicode-recode <from-format> <from-file> <to-format> <to-file>
  Format:
      UTF-8       UTF-8
      UTF-16      UTF-16, native endian
      UTF-16LE    UTF-16, little endian
      UTF-16BE    UTF-16, big endian
      UTF-32      UTF-32, native endian
      UTF-32LE    UTF-32, little endian
      UTF-32BE    UTF-32, big endian
      ISO-8859-1  ISO-8859-1 (Latin-1)
      ISO-8859-15 ISO-8859-15 (Latin-9)
  Exit code: 0 if valid, 1 otherwise.
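
The LE and BE variants in the format list differ only in the order in which the bytes of each code unit are written. The following is a minimal sketch of that difference for UTF-16, not the tool's actual implementation; the names ByteOrder and utf16_to_bytes are hypothetical and chosen for illustration only.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of the library: serialize UTF-16 code
// units into bytes with an explicit byte order.
enum class ByteOrder { Little, Big };

std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& units, ByteOrder order)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto low  = static_cast<std::uint8_t>(unit & 0xFF);
        const auto high = static_cast<std::uint8_t>((unit >> 8) & 0xFF);
        if (order == ByteOrder::Little) {
            bytes.push_back(low);   // least significant byte first
            bytes.push_back(high);
        } else {
            bytes.push_back(high);  // most significant byte first
            bytes.push_back(low);
        }
    }
    return bytes;
}
```

With this sketch, the code unit U+00E4 is written as E4 00 in little endian and as 00 E4 in big endian.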

unicode-validate

  Usage: unicode-validate <format> <file>
  Format:
      UTF-8     UTF-8
      UTF-16    UTF-16, big or little endian
      UTF-16LE  UTF-16, little endian
      UTF-16BE  UTF-16, big endian
      UTF-32    UTF-32, big or little endian
      UTF-32LE  UTF-32, little endian
      UTF-32BE  UTF-32, big endian
  Exit code: 0 if valid, 1 otherwise.

C++ interface (package libunicode-dev)

Example:

#include <unicode.h>
...

  std::string utf8_value {"äöü"};
  std::u16string utf16_value{unicode::convert<char, char16_t>(utf8_value)};

And for C++20:

  std::u8string utf8_value {u8"äöü"};
  std::u16string utf16_value{unicode::convert<char8_t, char16_t>(utf8_value)};
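
To illustrate what such a UTF-8 to UTF-16 conversion involves, here is a stand-alone sketch that does not use the library. It is only an assumption about the general technique (decode each UTF-8 sequence to a code point, then emit one code unit or a surrogate pair), not the library's code; the function name utf8_to_utf16_sketch is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-alone helper (not the library's code): decode UTF-8
// and re-encode as UTF-16, throwing std::invalid_argument on bad input.
std::u16string utf8_to_utf16_sketch(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const auto cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // 20 bits, split into high and low surrogate
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Whether the library reports malformed input by exception or by other means is not stated here; the exception in this sketch is purely a choice made for the example.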

The following encodings are implicitly deduced from the types:
  * char or char8_t (C++20): UTF-8
  * char16_t: UTF-16
  * char32_t: UTF-32
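
One way such a deduction could work is to select the encoding from the width of the code unit type at compile time. The sketch below is an assumption for illustration, not the library's actual mechanism; deduced_encoding is a hypothetical name, and 8-bit bytes are assumed.

```cpp
#include <string_view>

// Hypothetical trait, not the library's actual mechanism: pick a UTF
// encoding name from the size of the code unit type (assumes 8-bit bytes).
template <typename CodeUnit>
constexpr std::string_view deduced_encoding()
{
    static_assert(sizeof(CodeUnit) == 1 || sizeof(CodeUnit) == 2 || sizeof(CodeUnit) == 4,
                  "code unit must be 8, 16 or 32 bits wide");
    if constexpr (sizeof(CodeUnit) == 1)
        return "UTF-8";
    else if constexpr (sizeof(CodeUnit) == 2)
        return "UTF-16";
    else
        return "UTF-32";
}
```

A size-based rule like this would also explain why wchar_t maps to UTF-16 on Windows and to UTF-32 on Linux.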

You can specify different container types directly:
  
  std::deque<char> utf8_value {...};
  std::list<wchar_t> utf16_value{unicode::convert<std::deque<char>, std::list<wchar_t>>(utf8_value)};

Explicit encoding specification is also possible:

  std::u8string value {u8"äöü"};
  std::u16string utf16_value{unicode::convert<unicode::UTF_8, unicode::UTF_16>(value)};

  std::string value {"\xE4\xF6\xFC"}; // "äöü" as ISO-8859-1 bytes, independent of the source file encoding
  std::u32string utf32_value{unicode::convert<unicode::ISO_8859_1, unicode::UTF_32>(value)};

Supported encodings are:

  * unicode::UTF_8
  * unicode::UTF_16
  * unicode::UTF_32
  * unicode::ISO_8859_1
  * unicode::ISO_8859_15
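
The two ISO-8859 encodings are simple to map to Unicode: in ISO-8859-1 every byte value equals its code point, and ISO-8859-15 differs from it in eight positions. The following sketch shows this mapping without using the library; the function names are hypothetical and this is not the library's implementation.

```cpp
#include <string>

// Hypothetical helpers, not the library's code.

// ISO-8859-1: every byte value equals its Unicode code point.
std::u32string latin1_to_utf32_sketch(const std::string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
    return out;
}

// ISO-8859-15: identical to ISO-8859-1 except for eight positions.
std::u32string latin9_to_utf32_sketch(const std::string& in)
{
    std::u32string out = latin1_to_utf32_sketch(in);
    for (char32_t& cp : out) {
        switch (cp) {
        case 0xA4: cp = 0x20AC; break; // euro sign
        case 0xA6: cp = 0x0160; break; // S with caron
        case 0xA8: cp = 0x0161; break; // s with caron
        case 0xB4: cp = 0x017D; break; // Z with caron
        case 0xB8: cp = 0x017E; break; // z with caron
        case 0xBC: cp = 0x0152; break; // OE ligature
        case 0xBD: cp = 0x0153; break; // oe ligature
        case 0xBE: cp = 0x0178; break; // Y with diaeresis
        default: break;
        }
    }
    return out;
}
```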

Supported basic types:
  * char
  * char8_t (C++20)
  * wchar_t (UTF-16 on Windows, UTF-32 on Linux)
  * char16_t
  * char32_t
  * uint8_t, int8_t
  * uint16_t, int16_t
  * uint32_t, int32_t
  * in general, any basic 8-bit, 16-bit or 32-bit type that can hold
    UTF-8, UTF-16 or UTF-32 code units, respectively.

Supported container types:
  * All std container types that can be iterated (e.g. vector, list, deque)
  * Source and target containers can be different container types
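
Container independence of this kind can be achieved by iterating over the source generically and appending to the target through an output iterator. The sketch below demonstrates the idea for UTF-32 to UTF-8; it is an illustration under that assumption, not the library's implementation, and utf32_to_utf8_sketch and encode_example are hypothetical names. Input is assumed to be valid.

```cpp
#include <cstdint>
#include <deque>
#include <iterator>
#include <list>

// Hypothetical generic helper (not the library's implementation): encode
// UTF-32 code points from any iterable container as UTF-8 code units in
// any container that supports push_back. Input is assumed to be valid.
template <typename To, typename From>
To utf32_to_utf8_sketch(const From& source)
{
    using Unit = typename To::value_type;
    To target;
    auto out = std::back_inserter(target);
    for (const auto element : source) {
        const auto cp = static_cast<char32_t>(element);
        if (cp < 0x80) {
            *out++ = static_cast<Unit>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<Unit>(0xC0 | (cp >> 6));
            *out++ = static_cast<Unit>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<Unit>(0xE0 | (cp >> 12));
            *out++ = static_cast<Unit>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<Unit>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<Unit>(0xF0 | (cp >> 18));
            *out++ = static_cast<Unit>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<Unit>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<Unit>(0x80 | (cp & 0x3F));
        }
    }
    return target;
}

// Example: a std::list of code points becomes a std::deque of UTF-8 bytes.
std::deque<std::uint8_t> encode_example()
{
    const std::list<char32_t> code_points{U'\u00E4', U'\u20AC'};
    return utf32_to_utf8_sketch<std::deque<std::uint8_t>>(code_points);
}
```

In encode_example, the two code points U+00E4 and U+20AC yield the bytes C3 A4 E2 82 AC.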

Validation can be done like this:

  bool valid{unicode::is_valid_utf<char16_t>(utf16_value)};

Or via explicit encoding specification:

  bool valid{unicode::is_valid_utf<unicode::UTF_8>(utf8_value)};
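
As a rough idea of what UTF-16 validation has to check, the following stand-alone sketch tests that surrogates only occur in correct pairs. It is not the library's implementation, and the name is_valid_utf16_sketch is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-alone check (not the library's implementation):
// UTF-16 is well-formed if every high surrogate (D800..DBFF) is directly
// followed by a low surrogate (DC00..DFFF) and no low surrogate stands alone.
bool is_valid_utf16_sketch(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false; // high surrogate without partner
            ++i;              // skip the low surrogate
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return false;     // unpaired low surrogate
        }
    }
    return true;
}
```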

Licensing

This software is released into the public domain under the terms of CC0 1.0 Universal.

Installation

Download is available from https://www.reichwein.it/download

Installation via Debian's APT mechanism is supported for the operating systems listed below. To prepare the APT configuration, add the line matching your system to /etc/apt/sources.list:
# For Debian 11:
deb http://www.reichwein.it/debian/ stable debian11

# For Ubuntu 21.04:
deb http://www.reichwein.it/debian/ stable ubuntu2104

# For Ubuntu 21.10:
deb http://www.reichwein.it/debian/ stable ubuntu2110

The package reichwein-keyring helps apt to establish cryptographic trust in the packages. It can be installed manually from the above sources.

Then, the following commands (as root) will install the packages unicode-tools (Command Line Interface, CLI) and libunicode-dev (C++ development files) via the operating system's package mechanism:
# apt-get update
# apt-get install unicode-tools libunicode-dev

Source Code

Source code is available at https://www.reichwein.it/download

The git repository can be browsed at https://www.reichwein.it/cgit/unicode.git/ and cloned via:

$ git clone http://reichwein.it/git/unicode

For Debian-like systems, you can use the following APT configuration. Add the respective line from the following choices to /etc/apt/sources.list:

# For Debian 11:
deb-src http://www.reichwein.it/debian/ stable debian11

# For Ubuntu 21.04:
deb-src http://www.reichwein.it/debian/ stable ubuntu2104

# For Ubuntu 21.10:
deb-src http://www.reichwein.it/debian/ stable ubuntu2110

See also

The following resources are available for further reading:

Contact

Roland Reichwein
Hauptstr. 101a
82008 Unterhaching
mail@reichwein.it
https://www.reichwein.it