Dealing with text is a major task for code. Writing text means stringing characters in a row. Characters are the symbols; the encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the size of the code units in bits (for the UTF encodings) or bytes (for UCS-2 and UCS-4). LE and BE indicate the endianness of the encoding. So if you see a software specification saying "let's use Unicode", then this is not a specification. The Universal Coded Character Set (UCS) is an early representation of Unicode, and it is still kept in sync with the Unicode standard.
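To make the encoding and endianness differences concrete, here is a small sketch (C++20, because of char8_t) that dumps the raw bytes of the Euro sign U+20AC in UTF-8, UTF-16, and UTF-32. The byte order of the latter two follows the host's endianness; the expected output in the comments assumes a little-endian machine.

```cpp
#include <cstdio>

// Print the raw bytes of an object in hex.
static void dump(const char *label, const void *data, size_t bytes) {
    std::printf("%-7s:", label);
    const unsigned char *p = static_cast<const unsigned char *>(data);
    for (size_t i = 0; i < bytes; ++i)
        std::printf(" %02x", p[i]);
    std::printf("\n");
}

int main() {
    // The Euro sign U+20AC in three Unicode encodings.
    const char8_t  u8s[]  = u8"\u20AC";  // UTF-8
    const char16_t u16s[] = u"\u20AC";   // UTF-16, native endianness
    const char32_t u32s[] = U"\u20AC";   // UTF-32, native endianness

    dump("UTF-8",  u8s,  sizeof(u8s)  - sizeof(u8s[0]));   // e2 82 ac
    dump("UTF-16", u16s, sizeof(u16s) - sizeof(u16s[0]));  // ac 20 on a LE host
    dump("UTF-32", u32s, sizeof(u32s) - sizeof(u32s[0]));  // ac 20 00 00 on a LE host
    return 0;
}
```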
C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se: you can store arbitrary byte sequences in a string, but you have to take care of the encoding yourself. Wide strings can be stored in the wstring container. Wide strings can accommodate multi-byte characters as used in UTF-16 or UTF-32, but the size of the underlying wchar_t differs between platforms (just like the int data type). C++11 introduced the u16string and u32string containers, and C++20 added u8string, to address this. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out (a short example follows the list):
- simdutf for Unicode validation and transformation; the library has SIMD support
- pcrecpp for regular expressions with Unicode
- UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32
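As an illustration, here is a minimal sketch of the recommended approach: keep UTF-8 bytes in a plain std::string and use UTF8-CPP for validation and conversion. The calls follow UTF8-CPP's documented C++11 interface (utf8::is_valid, utf8::utf8to16); the exact headers and overloads may differ between library versions, so verify against the one you ship.

```cpp
#include <iostream>
#include <string>
#include <utf8.h>   // UTF8-CPP

int main() {
    // UTF-8 bytes stored in a plain std::string; the container does not
    // know or care about the encoding.
    std::string text = "Gr\xC3\xBC\xC3\x9F Gott";  // "Grüß Gott" in UTF-8

    // Validate before doing anything else with untrusted input.
    if (!utf8::is_valid(text.begin(), text.end())) {
        std::cerr << "invalid UTF-8 input\n";
        return 1;
    }

    // Convert to UTF-16 (std::u16string) when an API requires it.
    std::u16string utf16 = utf8::utf8to16(text);
    std::cout << "code units: UTF-8=" << text.size()
              << " UTF-16=" << utf16.size() << '\n';
    return 0;
}
```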
The native string encoding on Microsoft® Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8, as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.
You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, strings can contain far more than the familiar ASCII whitespace: Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions, but there are some caveats. In extended ASCII and ISO-8859, the non-breaking space is the single byte 0xa0; in UTF-8 the same character is encoded as the two-byte sequence 0xc2 0xa0. Filtering may remove only the 0xa0 byte, which leaves you with a stray 0xc2 and thus an invalid byte sequence. The pcrecpp library will do this if you remove all Unicode whitespace characters. It's helpful to explore how Unicode encodes characters. Focus on the additional control and modification characters, because they can even reverse the writing order (see the Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and for the markers of the two-, three-, and four-byte sequences (in UTF-8; other encodings also have markers for extended character sequences and modifiers).
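The non-breaking-space pitfall is easy to reproduce. The following sketch assumes simdutf and its simdutf::validate_utf8() function: it strips the 0xa0 byte naively and shows that the remaining buffer is no longer valid UTF-8.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <simdutf.h>

int main() {
    // "a" + non-breaking space (U+00A0, 0xC2 0xA0 in UTF-8) + "b"
    std::string text = "a\xC2\xA0" "b";

    // Naive byte-level filtering: remove every 0xA0 byte.
    text.erase(std::remove(text.begin(), text.end(), '\xA0'), text.end());

    // The lead byte 0xC2 is left behind, so the string is broken now.
    bool ok = simdutf::validate_utf8(text.data(), text.size());
    std::cout << (ok ? "valid" : "invalid") << " UTF-8 after filtering\n";
    return 0;
}
```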
Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or on the network. Make sure to handle this, along with all localization aspects; the locale and language settings require extra conversions.
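A typical transformation is turning UTF-16LE data from a file or socket into UTF-8 for in-memory processing. The sketch below assumes simdutf again; the function names (simdutf::utf8_length_from_utf16le, simdutf::convert_utf16le_to_utf8) follow simdutf's documented API, and the UTF-16 literal stands in for data that would normally arrive with a defined byte order from the outside.

```cpp
#include <iostream>
#include <string>
#include <simdutf.h>

int main() {
    // Pretend these UTF-16 code units arrived from a file or a socket.
    // (On a little-endian host the literal already matches UTF-16LE.)
    std::u16string wire = u"H\u00E9llo";   // "Héllo"

    // Compute the required output size, then convert to UTF-8.
    std::string utf8(simdutf::utf8_length_from_utf16le(wire.data(), wire.size()), '\0');
    size_t written = simdutf::convert_utf16le_to_utf8(wire.data(), wire.size(), utf8.data());
    if (written == 0) {
        std::cerr << "conversion failed: invalid UTF-16 input\n";
        return 1;
    }
    utf8.resize(written);
    std::cout << utf8 << '\n';
    return 0;
}
```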
If all you have is a Large Language Model (LLM), then you will apply it to all of your problems. People are now trying to find 0-days with the might of LLMs. While it is no surprise that this works, there is a better way of pushing your code to the limit: just use random data! The term fuzzing was coined by Barton Miller in 1988, and people had been using defective punch cards as input for a while before that. When testing input filters, you want to eliminate as much bias as possible, which is exactly why people create the input data from random data. Human testers think too much, too little, or are too constrained. (Pseudo-)random number generators rarely have a bias. LLMs do. This means that the publication about finding 0-days by using LLMs should not be good news. Just like Markov chains, LLMs only "look" in a specific direction when creating input data. The model is a slave to its vectors and training data. The process might use the source code as an "inspiration", but so does a compiler with a fuzzing engine. Understanding that LLMs do not possess any cognitive capabilities is the key point here. You cannot ask an LLM what it thinks of the code in combination with certain input data. You are basically using a fancy data generator that uses more energy and is too complex for the task at hand.
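If you want to push an input filter with random data, a coverage-guided fuzzer is the standard tool. The following sketch is a minimal LLVM libFuzzer harness (built with clang++ -g -fsanitize=fuzzer,address); the filter_input() function is a hypothetical stand-in for whatever string filter your code exposes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the input filter under test: here it just
// strips ASCII control characters. Replace it with your real filter.
static void filter_input(std::string &text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c) { return c < 0x20; }),
               text.end());
}

// libFuzzer feeds this entry point with randomly mutated byte sequences.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    std::string input(reinterpret_cast<const char *>(data), size);
    filter_input(input);   // must not crash, hang, or corrupt memory
    return 0;
}
```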