Tips on Using Unicode with C/C++
Recently I did some programming and had to deal with text documents stored in C++ strings and C memory regions. Since I had no need for elaborate string functions, I thought I'd get away with a few pointers to string objects and be done with it. Instead, it took me some days to turn my code from an anagram generator back into something useful (even though I like anagrams). Let me tell you what to keep in mind when dealing with Unicode in C/C++ code.
Encodings and Character Sets
Character sets are sets of characters. This sounds rather tautological, but you have to keep in mind that computers usually work with a certain set of characters that can be displayed. Usually this restriction is due to the fonts used or to the representation of characters as numeric offsets of a fixed length. If someone gives you a string and tells you the character set that has been used, you will most certainly also need to know which encoding was used to write the string into memory. Why is that so important? Well, a single character can be represented by more than one byte. The character encoding determines how many bytes are used to represent one character. The standard choices are as follows:
- one byte
- two bytes
- four bytes
- multiple bytes
The one-byte-per-character notation is well known and works for most text and character sets (such as US-ASCII or the ISO-8859 family). Two bytes per character is used by the UTF-16 encoding, also known as the 16-bit Unicode Transformation Format (strictly speaking, characters outside the Basic Multilingual Plane need two 16-bit units, a so-called surrogate pair). Four bytes per character is used by the UTF-32 encoding (32-bit Unicode Transformation Format). Multiple bytes, meaning characters encoded by a variable number of bytes, is what UTF-7 and UTF-8 use (again, the 7 and the 8 denote the number of bits involved). In UTF-8 the bit pattern of a character's leading byte indicates whether the encoded character occupies 1, 2, 3 or 4 bytes; UTF-7 uses escape sequences instead and guarantees that every byte of the encoded string uses only 7 bits. If you want to explore UTF-8, there's a command line tool called unicode that lets you display the byte sequences of given characters. Make sure your terminal can handle UTF-8 and has a font with the characters you want to display.
rpfeiffer@agamemnon:~ $ unicode é
U+00E9 LATIN SMALL LETTER E WITH ACUTE
UTF-8: c3 a9  UTF-16BE: 00e9  Decimal: &#233;
é (É)
Uppercase: U+00C9
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0065 0301

rpfeiffer@agamemnon:~ $ unicode SMILING
U+263A WHITE SMILING FACE
UTF-8: e2 98 ba  UTF-16BE: 263a  Decimal: &#9786;
☺
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)

U+263B BLACK SMILING FACE
UTF-8: e2 98 bb  UTF-16BE: 263b  Decimal: &#9787;
☻
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)

rpfeiffer@agamemnon:~ $
I deliberately haven't said much about Unicode itself, since it is a full standard (kept in sync with ISO/IEC 10646) and not simply an encoding; Unicode describes a lot more than just encodings.
Strings in C/C++
C doesn't care about the encoding as long as the string consists of a sequence of bytes with a null byte at the end. You only need to deal with encoding if you want to determine the length of the string in terms of characters or if you wish to do string manipulations. As long as you are happy with copying bytes around, there's no difference between a UTF-8 and ISO-8859-15 string in C.
In C++ things look a bit different. Strings are now objects. There is no such thing as a null byte at the end, since the string object "knows" how many bytes it contains. Apart from this fact, dealing with strings is similar to C: you may copy strings in (almost) any encoding and never notice it. However, there is one exception, which I will describe shortly.
Wide Characters and Wide Strings
C and C++ have special types and objects to deal with so-called "wide characters". When I talked about strings above, I implicitly meant UTF-8 or ISO-8859 class encodings. What about UTF-16 and UTF-32? These differ from the usual string notation because they use, respectively, two and four bytes per code unit, which means the byte order within each unit depends on the endianness of the CPU and must be handled properly. UTF-8 is endianness-agnostic: despite the multibyte sequences, a UTF-8 string is still handled as a sequence of single bytes, meaning it is copied byte-wise.
Wide characters use the type wchar_t according to the ISO C90 standard. In C++, there are wide string classes for representing string objects and performing wide character I/O such as reading or writing from/to a file. String literals must be marked with a prefixed L.
if ( file_has_content ) {
    wcout << L"DEBUG: Document " << file_document->toString() << endl;
    wprintf( L"DEBUG: printf -> %ls\n", file_document->toString() );
}
So what's the size of a wchar_t then? 2 or 4 bytes? The answer is yes - that is, the standards don't specify an exact length. The Unicode 4.0 standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
This means that a UNIX-like operating system will usually use 4 bytes (it's best to verify this with sizeof()). If you use the Microsoft® Windows API, you end up with 2 bytes per wchar_t. That's great, isn't it? So whatever you do, make sure you deal with strings in a consistent way and don't cast your data to death. Casting a pointer doesn't translate the string into another encoding. There are conversion functions that can do such transformations (iconv() or the UTF-8 Codecvt Facet from the Boost library, for example). Choose the correct data type and make sure that it matches the encoding you wish/have to use.
One side note about converting data from one encoding to another: avoid it if at all possible. Use the correct data type for your variables and try to "get the data right" from the beginning. If you need to read text files encoded in UTF-8 from disk into a string buffer, use the wide stream I/O classes from the C++ library and copy the data into a wide string (note that the stream only decodes UTF-8 correctly if a suitable codecvt facet is imbued; by default it merely widens single bytes). This saves you the hassle of converting UTF-8 to UTF-16/UTF-32 by hand. Here is a little code snippet that reads a file into a dynamically allocated C++ wide string:
#include <fstream>
#include <iostream>

using namespace std;

wchar_t *file_content;
ifstream::pos_type file_size;
wifstream file_stream;
string file_temporary;    // holds the name of the file to read

file_stream.open( file_temporary.c_str(),
                  ios_base::in | ios_base::binary | ios_base::ate );
if ( file_stream.is_open() ) {
    file_size = file_stream.tellg();      // size in bytes - an upper bound on
                                          // the number of wide characters
    file_content = new wchar_t[file_size];
    file_stream.seekg( 0, ios_base::beg );
    file_stream.read( file_content, (streamsize)file_size );
    file_stream.close();
}
Make sure you treat files with Unicode content as binary in order to avoid "premature conversion". The example above still has one weakness - the file name is a "normal" string. If you have file names with special characters you have to check your locale settings and determine how to encode these special characters.
Portable Libraries and TCHAR
If you use portable C/C++ libraries in your code, you may have noticed the TCHAR data type in the header files. Any code that is intended to run on Microsoft® Windows and has to deal with Unicode will most certainly use the TCHAR data type. TCHAR is translated into a wide character data type when compiling such code with the GNU C Compiler (most portable libraries define TCHAR in their headers and map it to wchar_t). This, in fact, was how I turned my C++ program into an anagram generator: I used standard C++ strings filled with UTF-8 and fed the data to library functions through pointers cast to wchar_t *. UTF-8 data interpreted as UTF-32 equals garbage (but it is tremendously useful for obfuscating data and bugs).
Conclusion
This article was meant to give a short overview; that's the reason for the many links to other resources. I deliberately left out conversion functions and locale settings (mainly because of a lack of example code). You also have to remember the extra storage required for wide strings - i.e. up to four times more than a standard string. This may be important if you have a lot of strings or big string buffers. So I hope you avoid some of the errors I ran into, or at least come up with more creative ones.
Useful resources
- A tutorial on character code issues
- International Components for Unicode
- Notes on the codecvt implementation
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- The ISO 8859 Alphabet Soup
- Unicode-enabling Microsoft C/C++ Source Code (cheat sheet)
- Using Unicode in C/C++
- UTF-8 Codecvt Facet
Talkback: Discuss this article with The Answer Gang
René was born in the year of Atari's founding and the release of the game Pong. From his early youth on he took things apart to see how they work, and he couldn't even pass construction sites without looking for electrical wires that might seem interesting. His interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 bytes of RAM and a 4096-byte operating system, forcing him to learn assembler before any other language.
After finishing school he went to university to study physics. He gathered experience with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS, and finally GNU/Linux on a PC in 1997. He has been using Linux ever since and still likes to take things apart and put them together again. The freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.
Since 1999 he has been offering his skills as a freelancer. His main activities include system/network administration, scripting and consulting. In 2001 he started to give lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware and talking to network equipment, he is fond of scuba diving, writing, and photographing with his digital camera. He would like to have a go at storytelling and roleplaying again as soon as he finds some more spare time on his backup devices.