Secure Software doesn't develop itself.


Tag: C++

Filtering Unicode Strings in C++

The image shows a screenshot of the "iconv -l" command. It shows all character encodings that the iconv tool can convert.

Dealing with text is a major task for code. Writing text means stringing characters in a row. Characters are the symbols; the encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the size of a code unit: bytes for the UCS encodings (UCS-2, UCS-4) and bits for the UTF encodings (UTF-7, UTF-8, UTF-16, UTF-32). The LE and BE suffixes indicate the endianness of the encoding. So if you see a software specification saying “let’s use Unicode”, then this is not a specification. The Universal Coded Character Set (UCS), standardized as ISO/IEC 10646, dates back to the early days of Unicode and is still kept in sync with the Unicode standard.

C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se. You can store arbitrary byte sequences in a string, but you have to take care of the encoding yourself. Wide strings can be stored in the wstring container. Wide strings can accommodate the larger code units used in UTF-16 or UTF-32. The disadvantage is that the size of wchar_t differs between platforms (just as the int data type): it is typically 16 bits on Windows and 32 bits on GNU/Linux. To address this, C++11 introduced the u16string and u32string containers, and C++20 added u8string. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out:

  • simdutf for Unicode validation and transformation; the library has SIMD support
  • pcrecpp for regular expressions with Unicode
  • UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32

The native string encoding on Microsoft© Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8, as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.

You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, you get more than two kinds of whitespace in strings: Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions, but there are some caveats. In extended ASCII and ISO-8859, the non-breaking space has the code 0xa0. In UTF-8, the same character (U+00A0) is encoded as the two bytes 0xc2 0xa0. A byte-oriented filter may only remove the 0xa0, and this leaves you with a stray 0xc2 byte, which is an invalid sequence. The pcrecpp library will do this if you remove all Unicode whitespaces. It’s helpful to explore how Unicode encodes characters. Focus on the additional controls and modification characters, because they can also reverse the writing order (see Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and for the markers of two-, three-, and four-byte sequences (in UTF-8; other encodings also have markers for extended character sequences and modifiers).

Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or the network. Make sure to handle this and all localization aspects. The language settings require extra conversions.

Recommendations for using Exceptions in Code

Exceptions can be useful for handling error conditions. They are suited for better structuring code and avoiding if/else cascades. However, exceptions can disturb the control flow and can make your program skip sections of your code. Cleaning up after errors and the management of resources can be affected by this. Another downside is the performance cost if exceptions are triggered often. If you need to catch errors, you have to be careful when to use exceptions and when to use error flags. The article Exception dangers and downsides in the C++ tutorial has some hints on when to use exceptions:

  • Use exceptions for errors that trigger infrequently.
  • Use exceptions for errors that are critical for the code execution (i.e. error conditions that make subsequent operations impossible).
  • Use exceptions when an error cannot be handled right at the point where it occurs.
  • Use exceptions when returning an error code is not possible or not an option.

The actual cost of exceptions is influenced by the compiler, the hardware, and the operating system. There are signs that exception handling has improved on the x86_64 platform. Having multiple cores can pose a problem, because unwinding exceptions is effectively a single-threaded operation. This can be a problem for locked resources and similar synchronisation techniques. The proposal P2544R0 describes the background of these problems, and it proposes some alternatives to error handling by exceptions. The paper also has some measurements that show the impact of exceptions. My recommendation is to investigate how frequent errors are and to explore handling non-critical errors by using flags or return codes. When in doubt, use the instrumentation of your compiler and measure the actual cost of exceptions.

Go, Go Carbon, Go++, Carbon++, C++, or Go Rust?

I am breaking my rule of not writing titles with question marks. The exception is because of the new programming language, Carbon. I saw the announcement a couple of days ago. The article mentioned that Google engineers are working on a new programming language called Carbon. The author of the article added the tagline “A Hopeful Successor To C++”. My immediate question about the endeavor of creating a new programming language was: Why? There are a lot of programming languages and dialects around. All the languages have their own background. It is easy to bash a specific programming language, but if you take the history of its creation into account, then it gets easier to understand why specific design choices were made. It is easy to recommend different choices in retrospect. Given the existence of Go and the periodic C++ standard updates, I wonder what the design goals of Carbon are. Luckily, the article mentioned them:

  • C++ performance
  • Seamless bidirectional interoperability with C++
  • Easier learning curve for C++ developers
  • Comparable expressivity
  • Scalable migration

If you read the actual Carbon language description from the repository, then there are some additional goals:

  • Safer fundamentals, and an incremental path towards a memory-safe subset
  • Language evolution (Carbon wants to be C++’s TypeScript/Kotlin)

This doesn’t look too bad. Clearly, safer fundamentals and memory-safe features are a good idea. After checking some code examples, the syntax of Carbon looks a lot like Rust. For my taste, the easier learning curve is in the eye of the beholder. Personally, I dislike Rust’s syntax. Carbon adds some grammar and differences in special characters that will probably hinder anyone with C++ experience (well, at least myself). The interoperability goal promises that you can keep using your tried and tested build system for your projects. The Carbon project clearly states that it is presenting a prototype for exploring the desired language features. Apparently, the designers want to use C++ as the foundation and add their own vision.

C++ has evolved a lot in the past decade. The new C++ standards have implemented a lot of the missing features that previously had to be supplied by third-party libraries. Because a programming language needs time to reach a stable version, Carbon will have to catch up with C++’s head start. Judging from the project vision, it seems to be yet another use case for LLVM. Let’s see if we find out the real reason why Google wants to replace C++.
