Dealing with text is a major task for code. Writing text means stringing characters in a row. Characters are the symbols; the encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the size of the code units in bits (for the UTF encodings) or bytes (for UCS-2 and UCS-4). LE and BE indicate the endianness of the encoding. So if you see a software specification saying "let's use Unicode", then this is not a specification. The Universal Coded Character Set (UCS) is an early representation of Unicode, and it is still kept in sync with the Unicode standard.
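To make the encoding and endianness differences concrete, here is a small sketch (C++20, because of char8_t) that dumps the raw bytes of the Euro sign U+20AC in UTF-8, UTF-16, and UTF-32. The byte order of the latter two follows the host's endianness; the expected output in the comments assumes a little-endian machine.

```cpp
#include <cstdio>

// Print the raw bytes of an object in hex.
static void dump(const char *label, const void *data, size_t bytes) {
    std::printf("%-7s:", label);
    const unsigned char *p = static_cast<const unsigned char *>(data);
    for (size_t i = 0; i < bytes; ++i)
        std::printf(" %02x", p[i]);
    std::printf("\n");
}

int main() {
    // The Euro sign U+20AC in three Unicode encodings.
    const char8_t  u8s[]  = u8"\u20AC";  // UTF-8
    const char16_t u16s[] = u"\u20AC";   // UTF-16, native endianness
    const char32_t u32s[] = U"\u20AC";   // UTF-32, native endianness

    dump("UTF-8",  u8s,  sizeof(u8s)  - sizeof(u8s[0]));   // e2 82 ac
    dump("UTF-16", u16s, sizeof(u16s) - sizeof(u16s[0]));  // ac 20 on a LE host
    dump("UTF-32", u32s, sizeof(u32s) - sizeof(u32s[0]));  // ac 20 00 00 on a LE host
    return 0;
}
```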
C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se: you can store arbitrary byte sequences in a string, but you have to take care of the encoding yourself. Wide strings can be stored in the wstring container. Wide strings can accommodate multi-byte characters as used in UTF-16 or UTF-32, but the size of the underlying wchar_t differs between platforms (just like the int data type). C++11 introduced the u16string and u32string containers, and C++20 added u8string, to address this. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out (a short example follows the list):
- simdutf for Unicode validation and transformation; the library has SIMD support
- pcrecpp for regular expressions with Unicode
- UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32
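As an illustration, here is a minimal sketch of the recommended approach: keep UTF-8 bytes in a plain std::string and use UTF8-CPP for validation and conversion. The calls follow UTF8-CPP's documented C++11 interface (utf8::is_valid, utf8::utf8to16); the exact headers and overloads may differ between library versions, so verify against the one you ship.

```cpp
#include <iostream>
#include <string>
#include <utf8.h>   // UTF8-CPP

int main() {
    // UTF-8 bytes stored in a plain std::string; the container does not
    // know or care about the encoding.
    std::string text = "Gr\xC3\xBC\xC3\x9F Gott";  // "Grüß Gott" in UTF-8

    // Validate before doing anything else with untrusted input.
    if (!utf8::is_valid(text.begin(), text.end())) {
        std::cerr << "invalid UTF-8 input\n";
        return 1;
    }

    // Convert to UTF-16 (std::u16string) when an API requires it.
    std::u16string utf16 = utf8::utf8to16(text);
    std::cout << "code units: UTF-8=" << text.size()
              << " UTF-16=" << utf16.size() << '\n';
    return 0;
}
```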
The native string encoding on Microsoft® Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8, as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.
You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, strings can contain far more than the familiar ASCII whitespace: Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions, but there are some caveats. In extended ASCII and ISO-8859, the non-breaking space is the single byte 0xa0; in UTF-8 the same character is encoded as the two-byte sequence 0xc2 0xa0. Filtering may remove only the 0xa0 byte, which leaves you with a stray 0xc2 and thus an invalid byte sequence. The pcrecpp library will do this if you remove all Unicode whitespace characters. It's helpful to explore how Unicode encodes characters. Focus on the additional control and modification characters, because they can even reverse the writing order (see the Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and for the markers of the two-, three-, and four-byte sequences (in UTF-8; other encodings also have markers for extended character sequences and modifiers).
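The non-breaking-space pitfall is easy to reproduce. The following sketch assumes simdutf and its simdutf::validate_utf8() function: it strips the 0xa0 byte naively and shows that the remaining buffer is no longer valid UTF-8.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <simdutf.h>

int main() {
    // "a" + non-breaking space (U+00A0, 0xC2 0xA0 in UTF-8) + "b"
    std::string text = "a\xC2\xA0" "b";

    // Naive byte-level filtering: remove every 0xA0 byte.
    text.erase(std::remove(text.begin(), text.end(), '\xA0'), text.end());

    // The lead byte 0xC2 is left behind, so the string is broken now.
    bool ok = simdutf::validate_utf8(text.data(), text.size());
    std::cout << (ok ? "valid" : "invalid") << " UTF-8 after filtering\n";
    return 0;
}
```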
Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or on the network. Make sure to handle this, along with all localization aspects; the locale and language settings require extra conversions.
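A typical transformation is turning UTF-16LE data from a file or socket into UTF-8 for in-memory processing. The sketch below assumes simdutf again; the function names (simdutf::utf8_length_from_utf16le, simdutf::convert_utf16le_to_utf8) follow simdutf's documented API, and the UTF-16 literal stands in for data that would normally arrive with a defined byte order from the outside.

```cpp
#include <iostream>
#include <string>
#include <simdutf.h>

int main() {
    // Pretend these UTF-16 code units arrived from a file or a socket.
    // (On a little-endian host the literal already matches UTF-16LE.)
    std::u16string wire = u"H\u00E9llo";   // "Héllo"

    // Compute the required output size, then convert to UTF-8.
    std::string utf8(simdutf::utf8_length_from_utf16le(wire.data(), wire.size()), '\0');
    size_t written = simdutf::convert_utf16le_to_utf8(wire.data(), wire.size(), utf8.data());
    if (written == 0) {
        std::cerr << "conversion failed: invalid UTF-16 input\n";
        return 1;
    }
    utf8.resize(written);
    std::cout << utf8 << '\n';
    return 0;
}
```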
If all you have is a Large Language Model (LLM), then you will apply it to all of your problems. People are now trying to find 0-days with the might of LLMs. While it is no surprise that this works, there is a better way of pushing your code to the limit: just use random data! The term fuzzing was coined by Barton Miller in 1988, and people had been using defective punch cards as input for a while before that. When testing input filters, you want to eliminate as much bias as possible, which is exactly why people create the input data from random data. Human testers think too much, too little, or are too constrained. (Pseudo-)random number generators rarely have a bias. LLMs do. This means that the publication about finding 0-days by using LLMs should not be good news. Just like Markov chains, LLMs only "look" in a specific direction when creating input data. The model is a slave to its vectors and training data. The process might use the source code as an "inspiration", but so does a compiler with a fuzzing engine. Understanding that LLMs do not possess any cognitive capabilities is the key point here. You cannot ask an LLM what it thinks of the code in combination with certain input data. You are basically using a fancy data generator that uses more energy and is too complex for the task at hand.
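If you want to push an input filter with random data, a coverage-guided fuzzer is the standard tool. The following sketch is a minimal LLVM libFuzzer harness (built with clang++ -g -fsanitize=fuzzer,address); the filter_input() function is a hypothetical stand-in for whatever string filter your code exposes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the input filter under test: here it just
// strips ASCII control characters. Replace it with your real filter.
static void filter_input(std::string &text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](unsigned char c) { return c < 0x20; }),
               text.end());
}

// libFuzzer feeds this entry point with randomly mutated byte sequences.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    std::string input(reinterpret_cast<const char *>(data), size);
    filter_input(input);   // must not crash, hang, or corrupt memory
    return 0;
}
```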