Tag: C++

Platform-dependent Code, SIMD instructions, and I/O Layers

I did some refactoring of old code that started out as a Bash and Perl script. The idea was to fill a filesystem with random files of random content. The use case was to test filesystems and storage media. My code started out as C++ with some C to interface with the Linux® kernel’s syscalls. Essentially, it was an exercise in the wonderful world of the virtual filesystem layer (VFS), I/O options, memory attributes, and getting lots of random data faster than any disk could write it. The random generator in the code is the SSE-accelerated Mersenne Twister implementation from Hiroshima University. Clearly, this is a sign of pre-C++11 code. Modern code can use the C++ <random> library, which can also use SIMD instructions heavily if compiled for the native architecture. The code works fine and has been used to test I/O performance and storage media. The problem: the SIMD instructions are for the Intel®/AMD™ platform only. What about ARM and RISC-V processors? Alternative processors use less power and make it possible to run the code on smaller platforms.
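
As a rough illustration of the <random> route, the following sketch fills a buffer from std::mt19937_64. The function name make_random_buffer and the explicit seed parameter are assumptions made for this example; they are not taken from the original tool.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Hypothetical helper: fill a byte buffer with pseudo-random data from the
// 64-bit Mersenne Twister of the standard library. The explicit seed is an
// assumption that makes test runs reproducible.
std::vector<std::uint8_t> make_random_buffer(std::size_t size, std::uint64_t seed)
{
    std::mt19937_64 engine(seed);
    std::vector<std::uint8_t> buffer(size);

    std::size_t offset = 0;
    // Copy eight bytes per generator call instead of drawing single bytes.
    while (offset + sizeof(std::uint64_t) <= size) {
        const std::uint64_t value = engine();
        std::memcpy(buffer.data() + offset, &value, sizeof(value));
        offset += sizeof(value);
    }
    // Fill a tail that is shorter than eight bytes.
    if (offset < size) {
        const std::uint64_t value = engine();
        std::memcpy(buffer.data() + offset, &value, size - offset);
    }
    return buffer;
}
```

Whether the compiler vectorises this loop depends on the toolchain and the build flags, so it is worth measuring on the target platform. The code itself contains no platform-specific instructions, which is the point for ARM and RISC-V.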

The code itself is mostly portable. When compiling for Microsoft® Windows® or Apple OS X, you can safely remove the Linux® syscalls. Basically, it is just madvise(), posix_memalign() and posix_fallocate64() to give the VFS layer some hints. C++11 and later standards allow for replacing the SSE MT random generator with the standard library. The first stage of refactoring was a hack, because I just replaced the missing functions with empty stub functions. Don’t do that in production code, because it leaves source code in the project that you just don’t need on a specific platform. The next step is to collect all the platform-dependent code in classes that handle the interface and the random data generation. Alternatively, the random file could be its own class and instance, but this would cause more memory allocations. The current code uses a single memory buffer to prepare the file content before the data is written to the file. The memory buffer size can be increased to 2 GB, which shows nicely alternating processor and I/O cycles. The code does not run anything in parallel, so one memory buffer is fine.

The code’s file I/O layer is very close to the Linux® kernel. open(), write(), and close() are called directly. madvise() is used to tell the kernel that the data will be written sequentially and that it will not be needed for reading soon. This helps the memory management. When combined with O_DIRECT, the code can run on desktops without filling up file and block buffers with write-only random data. This part is hard to refactor to C++’s stream library. I recommend that everyone study the parameters of the syscalls I mentioned. Using the lower layers of the I/O subsystem can actually be useful on low-powered systems or when you need to save memory or gain performance. Using the code since 2008 has shown that the storage media and the I/O bus path to the device are the bottleneck. Both the SSE MT random generator and the C++11 <random> code can create more than enough data to saturate the I/O system.
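
One detail that matters when calling write() directly is that it may write fewer bytes than requested. The sketch below shows one common way to handle this on a POSIX system. The helper name write_all is hypothetical, and the original tool may handle partial writes differently.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>     // write(), POSIX only

// Hypothetical helper: write the whole buffer to a file descriptor.
// write() may return early with a partial count or be interrupted by a
// signal, so the call is repeated until everything is written.
// Returns true on success and false on any other error.
bool write_all(int fd, const void* data, std::size_t size)
{
    const char* cursor = static_cast<const char*>(data);
    std::size_t remaining = size;

    while (remaining > 0) {
        const ssize_t written = ::write(fd, cursor, remaining);
        if (written < 0) {
            if (errno == EINTR) {
                continue;           // interrupted, try again
            }
            return false;           // real error, errno holds the reason
        }
        cursor += written;
        remaining -= static_cast<std::size_t>(written);
    }
    return true;
}
```

The file descriptor would come from open(). With O_DIRECT, the buffer address and the write size usually have to be aligned as well, which is one reason to allocate the buffer with posix_memalign().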

The takeaway from writing cross-platform code is not surprising. Contain all the platform-dependent code in classes or functions. Do this early in your process, even in prototypes. Empty functions that stand in for non-existent syscalls might be optimised away by the compiler, but they make the source code messy. You will need some #ifdef / #ifndef directives, though. It is best to collect them in your include files. If you are interested in the code, let me know. You can also use it for fuzzing, because there is a switch to create file and directory names from fully random bytes.
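
A minimal sketch of this containment could look like the following header-style code. The namespace platform and the three function names are assumptions for illustration, and MADV_SEQUENTIAL is only one of several hints the original code might pass to madvise().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

#if defined(_WIN32)
#include <malloc.h>     // _aligned_malloc(), _aligned_free()
#endif
#if defined(__linux__)
#include <sys/mman.h>   // madvise()
#endif

// Hypothetical wrapper namespace: all #if directives live here, and the
// rest of the program calls these functions without platform checks.
namespace platform {

// Allocate `size` bytes aligned to `alignment` (a power of two that is a
// multiple of sizeof(void*)). Returns nullptr on failure.
inline void* allocate_buffer(std::size_t alignment, std::size_t size)
{
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void* memory = nullptr;
    if (posix_memalign(&memory, alignment, size) != 0) {
        return nullptr;
    }
    return memory;
#endif
}

// Release a buffer obtained from allocate_buffer().
inline void release_buffer(void* memory)
{
#if defined(_WIN32)
    _aligned_free(memory);
#else
    std::free(memory);
#endif
}

// Tell the kernel that the buffer is accessed sequentially. Returns true
// if the hint was accepted and false if it failed or is not available.
inline bool advise_sequential(void* memory, std::size_t size)
{
#if defined(__linux__)
    return madvise(memory, size, MADV_SEQUENTIAL) == 0;
#else
    (void)memory;
    (void)size;
    return false;
#endif
}

} // namespace platform
```

The caller treats advise_sequential() as an optional hint and ignores a false result, so no empty stub functions are scattered through the main code.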

Filtering Unicode Strings in C++

By René Pfeiffer

On January 3, 2025

In C/C++, Development, Secure Coding

Image: a screenshot of the "iconv -l" command, which lists all character encodings that the iconv tool can convert.

Dealing with text is a major task for code. Writing text means stringing characters in a row. Characters are the symbols. The encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the size of a code unit, in bytes for the UCS encodings and in bits for the UTF encodings. The LE and BE suffixes indicate the endianness of the encoding. So if you see a software specification saying “let’s use Unicode”, then this is not a specification. Universal Coded Character Set (UCS, defined in ISO/IEC 10646) is an early representation of Unicode, and it is still kept in sync with the Unicode standard.

C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se. You can store byte sequences in a string, but you have to take care of the encoding yourself. Wide strings can be stored in the wstring container. Wide strings can accommodate multi-byte characters as used in UTF-16 or UTF-32. The disadvantage is that the size of a wide character differs between platforms (just like the int data type). C++11 introduced the u16string and u32string containers, and C++20 added u8string, to address this. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out:
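
To make the "no encoding per se" point concrete, here is a small sketch that counts code points in a std::string. It assumes that the string already holds valid UTF-8, and the function name count_code_points is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a string that is assumed
// to contain valid UTF-8. Every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (const unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "Grüße", size() reports 7 because the container counts bytes, while count_code_points() reports 5. Code points are still not the same as user-perceived characters, because combining marks are counted separately.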

simdutf for Unicode validation and transformation; the library has SIMD support
pcrecpp for regular expressions with Unicode
UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32

The native string encoding on Microsoft® Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8, as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.

You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, you get far more than two whitespace characters in strings: Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions, but there are some caveats. The extended ASCII and ISO-8859 non-breaking space has the code 0xa0. In UTF-8, the same character is encoded as the two bytes 0xc2 0xa0. Filtering may remove only the 0xa0, and this leaves you with a stray byte 0xc2, which is not valid UTF-8. The pcrecpp library will do this if you remove all Unicode whitespaces. It’s helpful to explore how Unicode encodes characters. Focus on the additional control and modifier characters, because they can also reverse the writing order (see Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and for the markers of two-, three-, and four-byte sequences (in UTF-8; other encodings also have markers for extended character sequences and modifiers).
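
The non-breaking space problem disappears when the filter works on decoded code points instead of single bytes. The following standard-library-only sketch validates UTF-8 while it removes whitespace. It is a hand-written illustration with hypothetical function names, not the API of simdutf, UTF8-CPP, or pcrecpp, and production code is better served by one of those libraries.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// The 25 code points with the Unicode White_Space property.
bool is_unicode_whitespace(char32_t cp)
{
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 ||
           cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

// Decode one UTF-8 sequence starting at `pos`. Returns false for stray
// continuation bytes, truncated or overlong sequences, surrogates, and
// values above U+10FFFF.
bool decode_utf8(const std::string& in, std::size_t pos,
                 char32_t& cp, std::size_t& len)
{
    const auto byte = [&in](std::size_t i) {
        return static_cast<unsigned char>(in[i]);
    };
    const unsigned char lead = byte(pos);
    char32_t minimum = 0;   // smallest value allowed for this length

    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; minimum = 0x10000; }
    else { return false; }          // continuation byte or invalid lead

    if (pos + len > in.size()) {
        return false;               // sequence is cut off
    }
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char next = byte(pos + i);
        if ((next & 0xC0) != 0x80) {
            return false;           // expected a continuation byte
        }
        cp = (cp << 6) | (next & 0x3F);
    }
    if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;               // overlong, out of range, or surrogate
    }
    return true;
}

// Remove every whitespace code point. Returns std::nullopt if the input
// is not valid UTF-8, so invalid data never reaches the caller.
std::optional<std::string> strip_whitespace(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t pos = 0;
    while (pos < in.size()) {
        char32_t cp = 0;
        std::size_t len = 0;
        if (!decode_utf8(in, pos, cp, len)) {
            return std::nullopt;
        }
        if (!is_unicode_whitespace(cp)) {
            out.append(in, pos, len);   // copy the whole sequence
        }
        pos += len;
    }
    return out;
}
```

Because a sequence is either copied completely or dropped completely, the bytes 0xc2 0xa0 can never be split. A lone 0xa0 or 0xc2 in the input is reported as invalid instead of being passed on.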

Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or the network. Make sure to handle this and all localization aspects. The language settings require extra conversions.

Recommendations for using Exceptions in Code

By René Pfeiffer

On February 22, 2023

In C/C++

Exceptions can be useful for handling error conditions. They are suited for better structuring code and avoiding if/else cascades. However, exceptions disturb the control flow and can make your program skip sections of your code. Cleaning up after errors and the management of resources can be affected by this. Another downside is the performance cost if exceptions are triggered often. If you need to catch errors, you have to be careful when to use exceptions and when to use error flags. The article “Exception dangers and downsides” in the C++ tutorial has some hints on how to use exceptions:

Use exceptions for errors that trigger infrequently.
Use exceptions for errors that are critical for the code execution (i.e. error conditions that make subsequent operations impossible).
Use exceptions when the errors cannot be handled right at the point where they occur.
Use exceptions when returning an error code is not possible or not an option.

The actual cost of exceptions is influenced by the compiler, the hardware, and the operating system. There are signs that exception handling is being improved on the x86_64 platform. Having multiple cores can pose a problem, because unwinding exceptions is effectively a single-threaded operation. This can be a problem for locked resources and similar synchronisation techniques. The proposal P2544R0 describes the background of these problems, and it proposes some alternatives to error handling by exceptions. The paper also has some measurements that show the impact of exceptions. My recommendation is to investigate how frequent errors are and to explore handling non-critical errors by using flags or return codes. When in doubt, use the instrumentation of your compiler and measure the actual cost of exceptions.
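
The split between frequent and critical errors can be sketched as follows. The port-parsing scenario and the names parse_port and require_port are invented for this illustration. The code needs C++17 for std::optional and std::from_chars.

```cpp
#include <charconv>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <system_error>

// Frequent and expected error (bad user input): report it through the
// return value, so that no unwinding takes place on the hot path.
std::optional<int> parse_port(std::string_view text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto [end, error] = std::from_chars(first, last, value);

    if (error != std::errc{} || end != last || value < 1 || value > 65535) {
        return std::nullopt;
    }
    return value;
}

// Rare and critical error (the program cannot continue without a valid
// port): throw, and let a caller further up the stack decide what to do.
int require_port(std::string_view text)
{
    if (const auto port = parse_port(text)) {
        return *port;
    }
    throw std::invalid_argument("invalid port: " + std::string(text));
}
```

A loop that checks thousands of untrusted inputs would call parse_port(), while start-up code that reads a mandatory configuration value could call require_port(). Which variant is cheaper in a given program is still something to measure.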

Go, Go Carbon, Go++, Carbon++, C++, or Go Rust?

By René Pfeiffer

On July 29, 2022

In Programming, Secure Coding

I am breaking my rule of not writing titles with question marks. The exception is because of the new programming language, Carbon. I saw the announcement a couple of days ago. The article mentioned that Google engineers are working on a new programming language called Carbon. The author of the article added the tagline “A Hopeful Successor To C++”. My immediate question about the endeavor of creating a new programming language was: Why? There are a lot of programming languages and dialects around. All the languages have their own background. It is easy to bash a specific programming language, but if you take the history of its creation into account, then it gets easier to understand why specific design choices were made. It is easy to recommend different choices in retrospect. Given the existence of Go and the periodic C++ standard updates, I wonder what the design goals of Carbon are. Luckily, the article mentioned them:

C++ performance
Seamless bidirectional interoperability with C++
Easier learning curve for C++ developers
Comparable expressivity
Scalable migration

If you read the actual Carbon language description from the repository, then there are some additional goals:

Safer fundamentals, and an incremental path towards a memory-safe subset
Language evolution (Carbon wants to be C++’s TypeScript/Kotlin)

This doesn’t look too bad. Clearly, safer fundamentals and memory-safe features are a good idea. After checking some code examples, the syntax of Carbon looks a lot like Rust. For my taste, the easier learning curve is in the eye of the beholder. Personally, I dislike Rust’s syntax. Carbon adds some grammar and differences in special characters that will probably hinder anyone with C++ experience (well, at least myself). The interoperability goal implies that you can keep using your tried and tested build system for your projects. The Carbon project clearly states that it is presenting a prototype for exploring the desired language features. Apparently, the designers want to use C++ as the foundation and add their own vision.

C++ has evolved a lot in the past decade. The new C++ standards have implemented a lot of the missing features that had to be supported by third-party libraries. Because a programming language needs time to reach a stable version, Carbon will have to catch up with C++’s head start. Judging from the project vision, it seems to be yet another use case for LLVM. Let’s see if we find out the real reason why Google wants to replace C++.
