Tag: Input Validation

Filtering Unicode Strings in C++

On January 3, 2025

The image shows a screenshot of the "iconv -l" command. It shows all character encodings that the iconv tool can convert. Dealing with text is a major task for code. Writing text means to string characters in a row. Characters are the symbols. The encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the bytes and bits. The LE and BE indicate the endianness of the encoding. So if you see a software specification saying „let’s use Unicode“, then this is not a specification. Universal Coded Character Set (UCS) is an early representation of Unicode, but it is still updated by the Unicode group.

C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se. You can store byte sequences in a string. You have to take care of the encoding. Wide strings can be stored in the wstring container. Wide strings can accommodate multi-byte characters as used in UTF-16 or UTF-32. The disadvantage is that this differs between platforms (just as the int data type). C++11 and C++20 introduced the u8string, u16string, and u32string containers to address this. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out:

simdutf for Unicode validation and transformation; the library has SIMD support
pcrecpp for regular expressions with Unicode
UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32

The native string encoding on Microsoft© Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8 as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.

You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, you get over two whitespaces in strings. Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions. There are some caveats. The extended ASCII and ISO-8859 non-breaking space has the code 0xa0. Unicode has the code 0xc2 0xa0. Filtering may only remove the 0xa0, and this leaves you with an invalid code point 0xc2. The pcrecpp library will do this if you remove all Unicode whitespaces. It’s helpful to explore how Unicode encodes characters. Focus on the additional controls and modification characters, because they can also reverse the writing order (see Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and markers for the two-, three-, and four-byte sequences (in UTF-8, other encoding also have markers for extended character sequences and modifiers).

Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or the network. Make sure to handle this and all localization aspects. The language settings require extra conversions.

URL Validation, Unit Tests and the Lost Constructor

By René Pfeiffer

On May 27, 2024

In C/C++, Testing

I have some code that requests URLs, looks for session identifiers or tokens, extracts them, and calculates some indicators of randomness. The tool works, but I decided to add some unit tests in order to play with the Catch2 framework. Unit tests requires some easy to check conditions, so validating HTTP/HTTPS URLs sounds like a good idea to get started. The code uses the curl library for requests, so checking URLs can be done by libcurl or before feeding the URL string to it. Therefore I added some regular expressions. RFC 3986 has a very good description of Uniform Resource Identifiers (URIs). The full regular expression is quite large and match too many variations of URI strings. You can inspect it on the regex101 web site. Shortening the regex to matching URLs beginning with “http” or “https” requires to define what you want to match. Should there be only domain names? Are IP addresses allowed? If so, what about IPv4 and IPv4? Experimenting with the filter variations took a bit of time. The problem was that no regex was matching the pattern. Even patterns that worked fine in other programming languages did not work in the unit test code. The error was hidden in a constructor.

Class definitions in C++ often have multiple variations of constructors. The web interface code can create a default instance where you set the target URL later by using setters. You can also create instances with parameters such as the target or the number of requests. The initialisation code sits in one member function which also initialises the libcurl data structures. So the constructors look like this:

http::http() { }

http::http( unsigned int nreq ) { init_data_structures(); set_max_requests( nreq ); return; }

The function init_data_structures() sets a flag that tells the instance if the libcurl subsystem is working or not. The first constructor does not call the function, so the flag is always false. The missing function call is hard to miss. The reason why the line was missing is that the code had a default constructor at first. The other constructors were added later, and the default constructor function was never used, because the test code never creates instances without an URL. This bring me back to the unit tests. The Catch2 framework does not need a full program code with a main() function. You can directly create instances in your test code snippets and use them. That’s why the error got noticed. Unit tests are not security tests. The missing initialisation function call is most probably not a security weakness, because the code does not run with the web request subsystem flag set to false. It’s still a good way to catch omissions or logic errors. So please do lots of unit tests all of the time.