Secure Design – Opinion

Secure Software doesn't develop itself.

The picture shows the top layer of the Linux kernel's API subsystems. Source: https://www.linux.org/attachments/kernel-jpeg.6497/

Linking against the Threading Building Blocks (oneTBB) library with g++ and clang++

A couple of weeks ago, I created a repository of example code for C and C++ courses. The examples include a source file that uses the Threading Building Blocks (oneTBB) library. Since the examples are rather small, I included no Makefile. Instead, I wrote a Bash script using clang++ and a Ninja build file using g++. Strangely, the clang++ build worked, but the g++ build complained about undefined symbols in the linking phase. The linker errors looked horrible. Here is the top of the long list of errors (added here for search engines to find the error):

/usr/bin/ld: /tmp/cc4i6mNS.o: in function `tbb::detail::d1::wait_context::add_reference(long)':
parallel_algorithms.cpp:(.text._ZN3tbb6detail2d112wait_context13add_referenceEl[_ZN3tbb6detail2d112wait_context13add_referenceEl]+0x6c): undefined reference to `tbb::detail::r1::notify_waiters(unsigned long)'
/usr/bin/ld: /tmp/cc4i6mNS.o: in function `tbb::detail::d1::execution_slot(tbb::detail::d1::execution_data const&)':
parallel_algorithms.cpp:(.text._ZN3tbb6detail2d114execution_slotERKNS1_14execution_dataE[_ZN3tbb6detail2d114execution_slotERKNS1_14execution_dataE]+0x14): undefined reference to `tbb::detail::r1::execution_slot(tbb::detail::d1::execution_data const*)'
/usr/bin/ld: /tmp/cc4i6mNS.o: in function `tbb::detail::d1::spawn(tbb::detail::d1::task&, tbb::detail::d1::task_group_context&)':
parallel_algorithms.cpp:(.text._ZN3tbb6detail2d15spawnERNS1_4taskERNS1_18task_group_contextE[_ZN3tbb6detail2d15spawnERNS1_4taskERNS1_18task_group_contextE]+0x30): undefined reference to `tbb::detail::r1::spawn(tbb::detail::d1::task&, tbb::detail::d1::task_group_context&)'
/usr/bin/ld: /tmp/cc4i6mNS.o: in function `tbb::detail::d1::execute_and_wait(tbb::detail::d1::task&, tbb::detail::d1::task_group_context&, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&)':
parallel_algorithms.cpp:(.text._ZN3tbb6detail2d116execute_and_waitERNS1_4taskERNS1_18task_group_contextERNS1_12wait_contextES5_[_ZN3tbb6detail2d116execute_and_waitERNS1_4taskERNS1_18task_group_contextERNS1_12wait_contextES5_]+0x2c): undefined reference to `tbb::detail::r1::execute_and_wait(tbb::detail::d1::task&, tbb::detail::d1::task_group_context&, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&)'…

There was no obvious difference between the compiler calls.

g++ -Wall -Werror -Wpedantic -std=c++20 -ltbb -g -O0 -o parallel_algorithms parallel_algorithms.cpp
clang++ -Wall -Werror -Wpedantic -std=c++20 -ltbb -march=native -o parallel_algorithms parallel_algorithms.cpp

After browsing through a lot of search results that did not explain the problem, I tried putting -ltbb at the end of the command. With this change, everything works fine with g++:

g++ -Wall -Werror -Wpedantic -std=c++20 -g -O0 -o parallel_algorithms parallel_algorithms.cpp -ltbb

🥳 oneTBB doesn’t require much, but the -ltbb option has to come after the source and object files. The linker processes its arguments in order, so a library should appear after the objects that reference it; with the --as-needed option (the default on many distributions), a library listed too early is dropped entirely. Apparently clang++ sets up the link step differently. Good to know.
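The original parallel_algorithms.cpp is not shown in the post, but a minimal source file that pulls in the affected oneTBB symbols could look roughly like the following sketch. Compiling it with the corrected g++ command above reproduces the behaviour.

#include <tbb/parallel_for.h>

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1000, 1);
    // parallel_for spawns tasks and waits for them, which references the
    // tbb::detail::r1 symbols from libtbb that the linker complained about.
    tbb::parallel_for(std::size_t{0}, data.size(),
                      [&data](std::size_t i) { data[i] *= 2; });
    std::printf("data[0] = %d\n", data[0]);
    return 0;
}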

Data Conversion in Code – Casting Data Types

A lot of code deals with transforming data from one data type to another. There are many reasons for doing this: different integer sizes and changes of signedness are one example. In the past, casting was easy but error-prone. C++ has several distinct ways to cast data types. The full list of cast operators comprises:

  • static_cast<T> is for conversions between related types. The checks are performed at compile time, not at run time.
  • dynamic_cast<T> is useful for pointers and references within a class hierarchy. It checks the data type at run time and prevents unsafe downcasts. A failed pointer cast yields a null pointer; a failed reference cast throws an exception.
  • const_cast<T> does not change the underlying type. It only adds or removes const (and volatile) qualifiers, for example to pass data to a legacy API that is not const-correct. Writing to an object that was originally declared const is undefined behaviour.
  • reinterpret_cast<T> is the most dangerous option for converting data. The compiler reinterprets the bit pattern as the new type without any checks. This should only be an option if you know exactly what you are doing.
  • std::move was introduced in C++11. It does not change the data type, and it does not move anything by itself; it casts its argument to an rvalue reference and thereby marks it as movable. The main reason for moving is to transfer ownership of resources instead of copying them.

While you can still use the old-fashioned C-style cast, you should never do this in C++. The named operators document more clearly what you intend to do in your code. For more complex operations, it is recommended to create a dedicated function for performing the transformation. Such a function can also enforce range checks or other tests before converting data. There is a blog article that illustrates the conversion methods by using Han Solo from Star Wars.
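To make the intent of each operator visible, here is a small sketch. The types and the legacy function are made up for illustration; they are not from the examples repository.

#include <cstdint>
#include <string>
#include <utility>

struct Base { virtual ~Base() = default; };
struct Derived : Base { int payload = 42; };

void legacy_api(char* /*text*/) {}  // an old interface that is not const-correct

int main() {
    // static_cast: conversion between related types, checked at compile time only
    std::int64_t big = 100000;
    auto small = static_cast<std::int32_t>(big);

    // dynamic_cast: checked at run time, a failed pointer cast yields a null pointer
    Derived derived_object;
    Base* base_pointer = &derived_object;
    if (auto* derived = dynamic_cast<Derived*>(base_pointer)) {
        small += derived->payload;
    }

    // const_cast: removes const, for example to call a legacy API that does not write
    const std::string name = "example";
    legacy_api(const_cast<char*>(name.c_str()));

    // reinterpret_cast: reinterprets the bit pattern without any checks
    auto address = reinterpret_cast<std::uintptr_t>(base_pointer);

    // std::move: casts to an rvalue reference so resources can be transferred
    std::string source = "a string that is long enough to own heap memory";
    std::string target = std::move(source);  // source is left in a valid but unspecified state

    (void)small;
    (void)address;
    (void)target;
    return 0;
}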

Using constants and constant expressions is something you should absolutely adopt in your coding style. Marking data storage as read-only allows the compiler to create faster code. In addition, you will get errors when other parts of your code modify data that should not be changed.
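A small illustration of the effect; the names are made up:

#include <array>
#include <cstddef>

constexpr std::size_t buffer_size = 4096;       // evaluated at compile time
std::array<char, buffer_size> shared_buffer{};  // the array size must be a constant expression

void reset_buffer() {
    // buffer_size = 8192;   // compile error: assignment of read-only variable
    shared_buffer.fill('\0');
}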

C++ Style Improvements

The international tidy person symbol. Source: https://commons.wikimedia.org/wiki/File:International_tidyman.svg

Recently I experimented with Clang Tidy. I let it analyse some of my code and looked at the results. It was really helpful, because it suggested a lot of slight improvements that do not radically change your code. If you started with C and switched to C++ at some point, you will definitely still have some C habits in your code. Especially APIs expecting pointers to character buffers often use C-style arrays. There are ways to get rid of these language constructs, but you have to work on your habits. Avoiding C-style arrays is straightforward. You just have to do it. There is a reason why C and C++ are different programming languages.
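A typical finding of this kind is a fixed-size character buffer that can be replaced by standard library types. A small sketch, with made-up function names:

#include <cstdio>
#include <string>
#include <string_view>

// C habit: fixed-size char buffer and manual size bookkeeping.
void print_label_c_style(const char* name) {
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "label: %s", name);
    std::puts(buffer);
}

// C++ replacement: std::string owns the memory, std::string_view avoids copies.
void print_label(std::string_view name) {
    std::string buffer = "label: ";
    buffer += name;
    std::puts(buffer.c_str());
}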

Then there are the new C++ standards. In terms of security and convenience, the C++11 and C++17 standards are milestones. You should update your code to at least the C++11 standard. Clang Tidy can help you. There is a class of checks covering the C++ Core Guidelines. The guidelines are a compilation of recommendations to refactor your code into a version that is less error-prone and more maintainable. Keep in mind that you can never apply all rules and recommendations to your code base. Pick a set of five to seven issues and try to implement them. Repeat when you have time or need to change something in the code anyway.
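For example, a clang-tidy run restricted to the modernize and Core Guidelines checks could look like this; the file name is just a placeholder:

clang-tidy my_module.cpp -checks='-*,modernize-*,cppcoreguidelines-*' -- -std=c++20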

Please stop anthropomorphising Algorithms

The picture shows a box with lots of cables and technical-looking decorations. It symbolises a machine and was created by the Midjourney image generator.

If you read the news, the articles are full of hype for the new Large Language Models (LLMs). People do amazing and stupid things by chatting with these algorithms. The problem is that LLMs are just algorithms. They have been “trained” with large amounts of data, hence the first “L”. There is no personality or cognitive process involved. Keep the following properties in mind when using these algorithms.

  • LLMs stem from the field of artificial intelligence, which is a field of computer science.
  • LLMs perform no cognitive work. The algorithms do not think; they just remix, filter, and repeat. They cannot create anything more novel than a random generator combined with a Markov chain of sufficiently high order.
  • LLMs have no opinion.
  • LLMs feature in-built hallucinations. This means the algorithms can lie. Any of the current models (and possibly future models of the same kind) will generate nonsense at some point.
  • LLMs can be manipulated by conversation (by using variations of prompts).
  • LLMs are highly biased, because their training data lacks diverse perspectives (given the size of the data sets and their cultural background, especially the languages used for training).
  • LLMs can’t be debugged reliably. Most fixes to problems comprise input or output filters. Essentially, these filters are allow or block lists. This is the weakest form of using filters for protection.

So there are a lot of arguments for not treating LLMs as something with an opinion, personality, or intelligence. These models can mimic language. During training, they learn language patterns, nothing more. Yes, LLMs belong to the research field of artificial intelligence, but there is no thinking going on. The label “AI” doesn’t describe what the algorithm is doing. This is not news. Emily Bender published an article titled “Look behind the curtain: Don’t be dazzled by claims of ‘artificial intelligence’” in the Seattle Times in May 2022. Furthermore, there is a publication by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. Its title is On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. The publication date is March 2021. They published both articles well before the “new” chat algorithms and LLM-powered “search engines” hit the marketing departments. You have been warned.

So, this article is not meant to discourage anyone from exploring LLMs. You just need to be careful. Remember that algorithms are seldom a silver bullet that can solve all your problems or make them go away. This is especially true if you want to use LLMs for your software development cycle.

Update: If you really want to explore the uses of synthetic text generation, please watch the presentation “ChatGP-why: When, if ever, is synthetic text safe, appropriate, and desirable?” by Emily Bender.

Continuous Integration is no excuse for a complex Build System

The picture shows a computer screen with a keyboard in cartoon style. The screen shows a diagram of code flows with red squares as markers for errors.

Continuous Integration (CI) is a standard in software development. A lot of companies use it for their development process. It basically means using automation tools to test new code more frequently. Instead of continuous, you could also say automated, because CI can’t work manually. Modern build systems comprise scripts and descriptive configurations that invoke components of the toolchain in order to produce executable code. Applications built with different programming languages can invoke a lot of tools with individual configurations. The build system is also a part of the code development process. What does this mean for CI in terms of secure coding?

First, if you use CI methods in your development cycle, make sure you understand the build system. When working with external consultants who audit your code, the review must be possible without the CI pipeline. In theory, this is always the case, but I have seen code collections that cannot be built easily because of the many configuration parameters hidden in hundreds of directories. Some configurations are old and use environment variables to control how the toolchain translates the source. Especially cross-platform code is difficult to analyse because of the mixture of tools. Often it is only possible to study the source. This is a problem, because a code review also needs to rebuild the code with changed parameters (for example, changing compiler flags, replacing compilers, adding analyzers, etc.). If the build process doesn’t allow this, then you have a problem. It also makes switching to different tools impossible, which becomes necessary when you need to test new versions of your programming language or migrate old parts of your code to a newer standard.
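As a sketch, assuming a CMake-based project, a reviewer should be able to rebuild the code outside the pipeline with a different compiler and additional instrumentation by overriding a few variables:

CXX=clang++ cmake -B build-review -S . -DCMAKE_CXX_FLAGS="-Wall -fsanitize=address"
cmake --build build-review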

Furthermore, if your code cannot be built outside your CI pipeline, then reviews are basically impossible. Just reading the source means that a lot of testing cannot be done. Ensure that your build systems do not grow into a complex creation no one wants to touch any more. The rules of secure and clean coding also apply to your CI pipeline. Create individual packages. Divide the build into modules, so that you can assemble the final application from independent building blocks. Also, refactor your build configuration. Make it simpler and remove all workarounds. Once the review gets stuck and auditors have to read your code like the newspaper, it is too late.

Using AI Language Models for Code Creation

The picture shows the inside of a circuit box, created by the Midjourney AI graphic generation algorithm.

The trend of large language models (LLMs) continues. Many people are experimenting and exploring how these algorithms can help them when developing software. Most integrated development environments have features that help you while writing code. Access to documentation, function call parameters, static checks, and suggestions are standard tools to help you. LLMs are the new kid on the block. Some articles describe how questions (or prompts) to chat engines were used to create code samples. The output depends a lot on the prompt. Changing words or rephrasing the prompt can lead to different results. This differs from the way other tools work. Getting useful results means playing with the prompt and engaging in trial-and-error cycles. Algorithms such as ChatGPT are not sentient. They cannot think. The algorithm just remixes and repeats parts of its training data. Asking for code examples is probably most useful for getting templates or single functions. This use case is disappointingly close to browsing tutorials or Stackoverflow.

Designing prompts is a new skill artificially created by LLM algorithms. This is another problem, because you now need to collect prompts for creating the most useful code. The work shifts to another domain, but you don’t actually save time unless you have a compendium of prompts. Creating useful and well-tested templates is a better use of resources. The correct use of patterns governs code creation with or without LLMs.

The security questions have not been addressed yet. There are studies that analyse the quality of tool-generated and human-written code. According to the Open Source Security and Risk Analysis (OSSRA) report from 2022, the code created by assistant features contained vulnerabilities 40% of the time. An examination of code created by GitHub’s Copilot shows that autogenerated code contains bugs that belong to specific software weaknesses. The code created by humans has a distinct pattern of weaknesses. A more detailed analysis can only be done with larger statistical samples, because Copilot’s training data is proprietary and not accessible. There is room for more research, but it is safe to say that LLMs also make mistakes. Output from these algorithms must be included in the quality assurance cycle, with no exceptions. Code generators cannot work magic.

If you are interested in using LLMs for code creation, make sure that you understand the implications. Developing safe and useful templates is a better way than engineering prompts for the best code output. Furthermore, the output can change whenever the LLM changes its version or training data. Using algorithms to create code is not a novel approach. Your toolchains most probably already contain code generators. In either case, you absolutely have to understand how they work. This is not something an AI algorithm can replace. The best approach is to study the code suggested by the algorithm, transfer it into pseudo-code, and write the actual code yourself. Avoid cut & paste, otherwise you introduce more bugs to fix later.

Presentation: Supply Chain Attacks and Software Manifests

Today I held a presentation about supply chain attacks and software manifests. The content covers my experience with exploring standards for Software Bills of Materials (SBOMs). While most build systems support creating the manifests, the first step is to identify what components you use and where they come from. Typical software projects use a mixture of sources, such as the package managers of programming languages, operating system packages, and direct downloads from software repositories. It is important to focus on the components your code directly relies on. Supporting applications that manage a database or host application programming interfaces (APIs) are a requirement, but usually not part of your software.

The presentation can be found by using this link. The slides are in German, but you will find plenty of links to sources in English.

Error message “undefined reference to `Mcuda_compiled’ in nvhpc”

Using different compilers for either production or testing purposes can be challenging. While the GNU Compiler Collection is the standard on GNU/Linux systems, using more than one compiler toolchain helps to identify bugs in the code. The project I am working on uses the Clang compiler. I added command-line options for NVIDIA’s nvc and nvc++ from the high-performance compiler collection. The first compilation runs yielded the error message “undefined reference to `Mcuda_compiled'” during the linker phase. Inspecting the object files and the libraries did not help at first. Several libraries define the Mcuda_compiled symbol. The key is to use consistent -M options across the compile and link phases. The compiler uses the optimizer flags -Mcuda -Mvect. Adding the same options to the linker phase solves the problem. Inspecting the output of ldd shows that the library libcudanvhpc.so contains the symbol.
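As a sketch with hypothetical file names, the same -M options have to appear in both the compile and the link step:

nvc++ -Mcuda -Mvect -c gpu_kernels.cpp -o gpu_kernels.o
nvc++ -Mcuda -Mvect gpu_kernels.o -o gpu_application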

Creating Software Bill of Materials (SBOM) for your Code

All software comprises components. Given the prevalence of supply chain attacks against modules and libraries, it is important to know which parts your code uses. This is where the Software Bill of Materials (SBOM) comes into play. An SBOM is a list of all components your code relies on. So much for the theory, because what exactly is a component? Do you count all the header files of C/C++ as individual components or just as parts of the library? Is it just the run-time environment, or do you also list the moving parts you used to create the final application? The latter includes all tools working in intermediate processes. A good choice is to focus on the application in its deployment state.

How do you get all the components used in the run-time version of your application? The build process is the first source you need to query. The details depend on the build tool you use. You need to extract the version, the source of the packaged component (this can be a link to the download source or a package URL), the name of the component, and hashes of either the files or the component itself. If you inspect the CycloneDX or SPDX specifications, you will find that many more data fields are specified. You can group components, name authors, add a description, model complex distribution processes, attach generic properties, and more details. The build systems and package managers do not provide easy access to all of this information. The complete view requires using the original source of the component, the operating system, and external data sources (for example, the CPE or links to published vulnerabilities). Don’t get distracted by the sheer amount of information that can be included in your manifest. Focus on the relevant and easy-to-obtain data about your components.

Hashes of your components are important. When focussing on the run-time of your application, make sure you identify all parts of it. To give an example for C/C++ code: you can identify all libraries your application loads dynamically and then calculate the hashes of these libraries on the target platform. SBOMs can differ between target platforms. If you use containers, then the set of components is fixed. Linking dynamically against libraries means that your code will use different incarnations on different systems. Make sure that you calculate more than one hash for your manifest. MD5 and SHA-1 are legacy algorithms. Do not use them. Instead, use SHA-2 with 256 bit or more and one hash of the SHA-3 family. This guards against hash collisions, and SHA-3 is additionally not prone to length-extension attacks.
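For a dynamically linked C/C++ application, a sketch of the manual steps could look like this; the binary name and the library path are placeholders:

ldd ./my_application                                            # list the shared libraries loaded at run time
sha256sum /usr/lib/x86_64-linux-gnu/libtbb.so.12                # SHA-2 hash for the manifest
openssl dgst -sha3-256 /usr/lib/x86_64-linux-gnu/libtbb.so.12   # additional SHA-3 hash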

Recommendations for using Exceptions in Code

Exceptions can be useful for handling error conditions. They are suited to structuring code better and avoiding if/else cascades. However, exceptions can disturb the control flow and make your program skip sections of your code. Cleaning up after errors and the management of resources can be affected by this. Another downside is the performance cost if exceptions are triggered often. If you need to catch errors, you have to be careful about when to use exceptions and when to use error flags. The article Exception dangers and downsides in the C++ tutorial has some hints on how to use exceptions:

  • Use exceptions for errors that trigger infrequently.
  • Use exceptions for errors that are critical for the code execution (i.e. error conditions that make subsequent operations impossible).
  • Use exceptions when the errors cannot be handled right at the point where they occur.
  • Use exceptions when returning an error code is not possible or not an option.

The actual cost of exceptions is influenced by the compiler, the hardware, and the operating system. There are signs that exception handling is being improved on the x86_64 platform. Having multiple cores can pose a problem, because stack unwinding is effectively serialised and does not scale across threads. This can be a problem for locked resources and similar synchronisation techniques. The proposal P2544R0 describes the background of these problems and proposes some alternatives for error handling by exceptions. It also contains some measurements that show the impact of exceptions. My recommendation is to investigate how frequent errors are and to explore handling non-critical errors by using flags or return codes. When in doubt, use the instrumentation of your compiler and measure the actual cost of exceptions.
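A minimal sketch of this split, with made-up function names: a frequent, recoverable input error is reported through a return value, while a rare and critical condition throws.

#include <charconv>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Frequent and recoverable: invalid user input does not need an exception.
std::optional<int> parse_port(std::string_view text) {
    int value = 0;
    auto result = std::from_chars(text.data(), text.data() + text.size(), value);
    if (result.ec != std::errc{} || value < 1 || value > 65535) {
        return std::nullopt;
    }
    return value;
}

// Rare and critical: without the configuration, subsequent operations are impossible.
std::string load_config(const std::string& path) {
    std::ifstream file(path);
    if (!file) {
        throw std::runtime_error("cannot open configuration file: " + path);
    }
    return std::string(std::istreambuf_iterator<char>(file), {});
}

int main() {
    if (auto port = parse_port("8080")) {
        std::printf("port: %d\n", *port);
    }
    try {
        const auto config = load_config("/etc/example.conf");
        std::printf("configuration size: %zu\n", config.size());
    } catch (const std::exception& error) {
        std::fprintf(stderr, "fatal: %s\n", error.what());
        return 1;
    }
    return 0;
}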
