Secure Software doesn't develop itself.

The picture shows the top layer of the Linux kernel's API subsystems. Source: https://www.linux.org/attachments/kernel-jpeg.6497/

Category: Development

Researching Code Examples for Secure Coding

The image shows shredded paper strips from a shredded document. Source: http://securology.blogspot.com/2012/09/destroying-paper-documents.html

Learning by doing means you spend a lot of time reading documentation and exploring example code that illustrates the features of your favourite development toolchain. Getting well-written example code has become substantially more difficult in the past years. Once upon a time, Google offered a search engine just for source code. It was active between 2006 and 2012. Now you are stuck with general search engines and their deteriorating quality. The amount of AI-generated content, copy-&-paste from documentation, and hyperlinks to gigantic forum discussions filled with errors and even more copy-&-paste snippets destroys classical Internet research. You have to select your sources carefully. So what is a good strategy here? I have compiled a short checklist that helps you avoid wasting time.

  • Start with the tutorials and documentation of your development tools/languages. Some have sections with examples and a well-written explanation. It depends on the developers, because writing didactically valuable explanations takes some effort.
  • Actively look for content from schools, colleges, or universities. Sometimes courses are online and contain the information you need. Try to prefer this source category.
  • When using search engines, keep the following in mind:
    • Skip results pushed by Search Engine Optimization (SEO); SEO is basically a way to push results to the top by adding noise and following the search engine company’s policy of the day. You can recognise this content by summary texts that don’t tell you the facts in brief, the obnoxious Top N phrase in the title, and even more variations of copy-&-paste text fragments.
    • Do not “AI-enhance” the results! While Large Language Model (LLM) algorithms may have used actual sources relevant to your research during training, their results are merely a statistical remix subtly altered by hallucinations. LLM-generated code will contain more bugs, or bugs more frequently. Go directly to software/coding forums and look for relevant threads instead.
    • Do not use content sponsored by companies pushing their development products. Research is all about good examples, good explanations, and facts, not marketing.
    • Mind the date of the results. AI spammers and companies following the AI hype have changed dates of published articles to sell them as new or updated. Don’t fall for that.
  • Inspect secure coding standards and policy documents. Some contain useful sections with examples. You can also use them to verify your search results, because they help you recognise outdated advice (deprecated algorithms, old standards, etc.).
  • Inspect version control repositories and look for example code. A lot of projects have samples and test code that is part of the release.
  • Write your own test code and explore! Add the created test code to your personal/project toolbox. You can later turn this code into unit tests or use it to check if major version changes broke something.

Unfortunately, these hints won’t change the degrading quality of the current search engines. They will, however, help you filter out the noise.

Filtering Unicode Strings in C++

The image shows a screenshot of the "iconv -l" command. It shows all character encodings that the iconv tool can convert.

Dealing with text is a major task for code. Writing text means stringing characters in a row. Characters are the symbols. The encoding determines how these characters are represented in memory. There are single-byte and multi-byte encodings. The Unicode family aims to represent all characters and symbols of all writing systems. If you specify Unicode, you still need to select a specific encoding. Unicode can be expressed in UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-7-IMAP, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The numbers indicate the size of a code unit: bytes for the UCS encodings (UCS-2, UCS-4) and bits for the UTF encodings (UTF-8, UTF-16, UTF-32). The LE and BE suffixes indicate the endianness of the encoding. So if you see a software specification saying “let’s use Unicode”, then this is not a specification. Universal Coded Character Set (UCS) is an early representation of Unicode, standardised as ISO/IEC 10646, and it is still updated in sync with the Unicode standard.

C++ has multiple string classes. The string container follows the C behaviour and has no encoding per se. You can store arbitrary byte sequences in a string, but you have to take care of the encoding yourself. Wide strings can be stored in the wstring container. Wide strings can accommodate the larger code units used in UTF-16 or UTF-32. The disadvantage is that the size of wchar_t differs between platforms (just as the size of the int data type does). C++11 introduced the u16string and u32string containers, and C++20 added u8string, to address this. You still need to track the encoding of the data. A good choice is to stick with the standard string container and handle the encoding issues yourself. However, the C++ standard library lacks some functionality that is frequently needed. The following libraries can help you out:

  • simdutf for Unicode validation and transformation; the library has SIMD support
  • pcrecpp for regular expressions with Unicode
  • UTF8-CPP for Unicode string operations with UTF-8 and conversions to UTF-16 / UTF-32

The native string encoding on Microsoft© Windows® is UTF-16LE. GNU/Linux® systems usually use UTF-8 as does the World Wide Web. Web servers can also serve UTF-16 content. Web standards do not allow UTF-32 for text content.
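Because the standard string container stores bytes and knows nothing about the encoding, its size() reports bytes, not characters. The following is a minimal sketch of the difference, assuming the text has already been validated as UTF-8. The function name count_code_points is hypothetical and not part of any of the libraries listed above.

```cpp
#include <cstddef>
#include <string>

// Counts the code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8; this function does not validate.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or single-byte (ASCII) character
        }
    }
    return count;
}
```

For the German word "Grüße" stored as UTF-8, size() yields seven bytes, while the sketch counts five code points. Code points are still not the same as user-perceived characters once combining marks are involved.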

You must validate all strings entering your code. Both simdutf and UTF8-CPP have validation functions. You can store the text in the standard string container. Using Unicode adds a lot of extra characters and code that you need to track. For example, you get far more than two whitespace characters in strings. Unicode has 25 characters with the whitespace property. Filtering is easiest with regular expressions. There are some caveats. The extended ASCII and ISO-8859 non-breaking space has the code 0xa0. In UTF-8, the same character is encoded as the two bytes 0xc2 0xa0. Byte-wise filtering may only remove the 0xa0, and this leaves you with a dangling lead byte 0xc2, which is invalid UTF-8. The pcrecpp library will do this if you remove all Unicode whitespaces. It’s helpful to explore how Unicode encodes characters. Focus on the additional controls and modification characters, because they can also reverse the writing order (see Unicode bidirectional formatting characters for more information). The best way to avoid trouble is to use allow lists and remove everything else, if possible. Some special cases will require looking for byte sequences that never occur and for the markers of the two-, three-, and four-byte sequences (in UTF-8; other encodings also have markers for extended character sequences and modifiers).
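The non-breaking space problem goes away when the filter works on decoded code points instead of single bytes. The following sketch uses only the standard library. It validates the UTF-8 input while decoding and drops the 25 code points that carry the whitespace property. The function names is_unicode_whitespace and strip_unicode_whitespace are hypothetical. For production code, the validation functions of simdutf or UTF8-CPP are the better choice.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// The 25 code points with the Unicode White_Space property.
bool is_unicode_whitespace(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 ||
           cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

// Removes all Unicode whitespace from a UTF-8 string.
// Returns an empty optional if the input is not valid UTF-8.
std::optional<std::string> strip_unicode_whitespace(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (i + len > in.size()) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        if (!is_unicode_whitespace(cp)) {
            out.append(in, i, len);  // copy the complete byte sequence
        }
        i += len;
    }
    return out;
}
```

A sequence such as 0xc2 0xa0 is either copied as a whole or dropped as a whole, so the filter never leaves a dangling lead byte behind. Input that ends in a lone 0xc2 is rejected instead of being passed on.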

Transformations will also be a frequent issue. The in-memory representation of the C++ string classes is independent of the representation on storage subsystems or the network. Make sure to handle this and all localization aspects. The language settings require extra conversions.

Static Tests and Code Coverage

The picture shows a warning sign indicating that a laser beam is operating in the area. Source: https://commons.wikimedia.org/wiki/File:Laser-symbol-text.svg

Testing software and measuring the code coverage is a critical ritual for most software development teams. The more code lines you cover, the better the results. Right? Well, yes, and no. Testing is fine, but you should not get excited about maximising the code coverage. Measuring code coverage can turn into a game and a quest for the highest score. Some basic combinatorics can show you how many code paths your tests need to cover. Imagine that you have a piece of code containing 32 independent if/else statements. Testing all branch combinations means you will have to run through 2^32 = 4,294,967,296 different paths. Now add some loops, function calls, and additional if statements (because 32 comparisons are quite low for a sufficiently big code base). This will increase the number of paths considerably. Multiply the number by the time needed to complete a test run. This shows that tests are limited by physics and mathematics.
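The arithmetic can be sketched in a few lines. The function names and the run time of one millisecond per test are assumptions chosen for illustration. The sketch also assumes that all branches are independent of each other.

```cpp
#include <cstdint>

// Number of distinct paths through `branches` independent if/else statements.
// Valid for 0 <= branches <= 63, because the result must fit into 64 bits.
constexpr std::uint64_t path_count(unsigned branches) {
    return std::uint64_t{1} << branches;
}

// Days needed to run every path once if a single run takes `millis_per_run`.
constexpr double days_for_all_paths(unsigned branches, double millis_per_run) {
    const double seconds =
        static_cast<double>(path_count(branches)) * millis_per_run / 1000.0;
    return seconds / 86400.0;  // 86,400 seconds per day
}
```

With 32 branches and one millisecond per run, covering every combination once already takes a little under 50 days. Every additional branch doubles that time.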

Static analysis is a standard tool which helps you detect bugs and problems in your code. Remember that all testing tries to determine the behaviour of your application. Mathematics has more bad news for you. Rice’s Theorem states that all non-trivial semantic properties of programs are undecidable. An undecidable problem is a decision problem that no algorithm can solve for all possible inputs. Rice published the theorem with a proof in 1951, and it relates to the halting problem. It implies that you cannot decide in general if an application is correct. You also cannot decide if the code executes without errors. The theorem sounds odd, because clearly you can run code and see if it shows any errors given a specific set of input data. This is a special case. Rice’s theorem is a generalisation and applies to all possible input data. So your successful tests basically work with special cases that do not cause harm. Security testing checks for dangerous behaviour or signs of weaknesses. Increasing the input data variations can cover more cases, but Rice’s theorem still holds, no matter how much effort you put into your testing pipeline.

Let’s get back to the code coverage metric. Of course, you should test all of your code. The major goal for your code is to handle errors correctly, fail safely (i.e. without creating damage), and keep control of the code execution. You can achieve these goals with any code coverage per test above 0%. Don’t fall prey to gamification!

Mixing Secure Coding with Programming Lessons

The picture shows a fantasy battle where a witch attacks a wizard with spells. Source: https://wiki.alexissmolensk.com/index.php/File:Spellcasting.jpg

Learning about programming first and then learning secure coding afterwards is a mistake. Even if you are new to a programming language or its concepts, you need to know what can go wrong. You need to know how to handle errors. You need to do some basic checks of data received, no matter what your toolchain looks like. This is part of the learning process. So instead of learning how to use code constructs or language features twice, take the shortcut and address security and understanding of the concepts at once. One example is the methods of classes and their behaviour. If you think in instances, then you will have to deal with the occasional exception. No one would learn the methods first, ignore all error conditions, and then get back to learn about errors.

Another example is variables with numerical values. Numbers are notorious. Even the integer data types have stayed in the Top 25 CWE list since 2019. Integer overflow or underflow simply happens with the standard arithmetic operators. There is no fancy bug involved, just basic counting. You have to implement range checks. There is no way around this. Even Rust requires you to do extra bound checks by using methods such as checked_add(). Secure coding always means more code, not less. This starts with basic data types and operators. You can add these logical pitfalls to exercises and examples. By using this approach, you can convey new techniques and show how a security mindset improves the code. There is also the possibility of switching between “normal” exercises and security lessons with a focus on how things go wrong. It’s not helpful to pretend that code won’t run into bugs or security weaknesses. Put the examples of failure and how to deal with it right into your course from the start.
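A range check for addition fits well into an early exercise. The following is a minimal C++ sketch of a checked addition for the int type. The name checked_add is borrowed from Rust for illustration and is not a function of the C++ standard library.

```cpp
#include <limits>
#include <optional>

// Adds two int values and reports overflow or underflow instead of
// triggering undefined behaviour. The checks run before the addition,
// so the operation a + b is only executed when the result is representable.
std::optional<int> checked_add(int a, int b) {
    if (b > 0 && a > std::numeric_limits<int>::max() - b) {
        return std::nullopt;  // result would exceed the largest int
    }
    if (b < 0 && a < std::numeric_limits<int>::min() - b) {
        return std::nullopt;  // result would fall below the smallest int
    }
    return a + b;
}
```

The caller has to inspect the returned optional before using the value. This is the extra code that secure coding asks for.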

If you don’t know where to start, then consult the secure coding guidelines and top lists of well-known vulnerabilities. Here are some good pointers to get started:

The Ghost of Legacy Code and its Relation to Security

The picture shows a spade and the wall of a pit dug into the earth. The wall shows the different layers created by sedimentation over time. Source: http://www.thesubversivearchaeologist.com/2014/11/back-to-basics-stratigraphy-101.html

The words legacy and old carry a negative meaning when used with code or software development. Marketing has ingrained in us the belief that everything new is good and everything old should be replaced to ensure people spend money and time. Let me tell you that this is not the case, and that age is not always a suitable metric. Would you rather have your brain surgery from a surgeon with 20+ years of experience or a freshly graduated surgeon on his or her first day at the hospital?

So what is old code? In my dictionary, the label “not maintained anymore” is assigned to legacy and old code. This is where the mainstream definition fails. You can have legacy code which is still maintained. There is a sound reason for using code like this: stability and fewer errors introduced by creating code from scratch. Reimplementing code always means that you start from nothing. Computer science basic courses teach everyone to reuse code in order to avoid these situations. Basically, reusing code means that you allow code to age. Just don’t forget to maintain parts of your application that work and experience few changes. This is the sane version of old code. There is another one.

An old codebase can serve as a showstopper for changes. If you made some poor design decisions in the past, then parts of your code will resist fresh development and features. Prototypes often exhibit this behaviour (a prototype usually never sees the production phase unaltered). When you see this in your application, then it is time to think about refactoring. Refactoring has fewer restrictions if you can do this in your own code. Once components or your platform are part of the legacy code, then you are in for a major upgrade. Operating systems and run-time environments can push changes to your application by requiring a refactoring. Certifications can do the same. Certain environments only allow certified components. Your configuration becomes frozen once applications or run-time environments get the certification. All changes may require a re-certification. Voilà, here is your stasis, and your code ages.

Legacy code is not a burden per se. It all depends on whether the code is still subject to maintenance, patches, and security checks. Besides, older code usually has fewer bugs.

Code, Development, Agile, and the Waterfall – Dynamics

The picture shows the waterfalls of Gullfoss under the snow in Iceland. Source: https://commons.wikimedia.org/wiki/File:Iceland_-_2017-02-22_-_Gullfoss_-_3684.jpg

Code requires a process to create it. The collection of processes, tasks, requirements, and checks is called software development. The big question is how to do it right. Frankly, the answer to this question does not exist. First, not all code is equal. A web server, a filesystem, a database, and a kernel module for network communication contain distinct code, with only a few functions that can be shared. When it comes to adding secure coding practices, some attendees of my courses question the application of checklists and the cleaning of suspicious data. Security is old-fashioned, because you have to think of risks, how to address them, and how to improve sections of your code that connect to the outside world. People like the term agile, where small teams bathe in outbursts of creativity and sprint to implement requested features. You can achieve anything you set your mind to. Tear down code, write it new, deliver the features. This is not how secure coding works, and this is not how your software development process should look (regardless of what type of paradigm you follow).

It is easy to drift into a rant about the agile manifesto. Condensing the entire development process into 68 words, all done during three days of skiing in Colorado, is bound to create very general statements whose implementation wildly differs. This is not the point I want to make. You can shorten secure coding to 10 to 13 principles. The SEI CERT secure coding documents feature a list with the top 10 methods. It’s still incomplete, and you still have to actually integrate security into your writing-code-process. So you can interpret secure coding as a manifesto, too. Neglecting the implementation has advantages. You can use secure coding with all existing and future programming languages. You can use it on all platforms, also current and yet to be invented. The principles are always true. Secure coding is a model that you can use to improve how your team creates, tests, and deploys code. This also means that adopting a security stance requires you to alter your toolbox. All of us have a favourite development environment. This is the first place where you can get started with secure coding. It’s not all about having the right plugins, but it is important to see what code does while it is being developed.

The title features the words agile and waterfall. Please do yourself a favour and stop thinking about buzzwords. It doesn’t matter how your development process produces code. It matters that the code has next to no security vulnerabilities, shows no undefined behaviour, and cannot be abused by third parties. Secure code is possible with any development process provided you follow the principles. Use the principles’ freedoms to your advantage and integrate what works best.

Continuous Integration is no excuse for a complex Build System

The picture shows a computer screen with keyboard in cartoon style. The screen shows a diagram of code flows with red squares as a marker for errors.

Continuous Integration (CI) is a standard in software development. A lot of companies use it for their development process. It basically means using automation tools to test new code more frequently. Instead of continuous, you can also use the word automated, because CI can’t work manually. Modern build systems comprise scripts and descriptive configurations that invoke components of the toolchain in order to produce executable code. Applications built with different programming languages can invoke a lot of tools with individual configurations. The build system is also a part of the code development process. What does this mean for CI in terms of secure coding?

First, if you use CI methods in your development cycle, then make sure you understand the build system. When working with external consultants that audit your code, the review must be possible without the CI pipeline. In theory, this is always the case, but I have seen code collections that cannot be built easily, because of the many configuration parameters hidden in hundreds of directories. Some configuration is old and uses environment variables to control how the toolchain has to translate the source. Cross-platform code is especially difficult to analyse because of the mixture of tools. Often it is only possible to study the source. This is a problem, because a code review also needs to rebuild the code with changing parameters (for example, changing compiler flags, replacing compilers, adding analyzers, etc.). If the build process doesn’t allow this, then you have a problem. It also makes switching to different tools impossible, and switching is necessary when you need to test new versions of your programming language or need to migrate old parts of your code to a newer standard.

Furthermore, if your code cannot be built outside your CI pipeline, then reviews are basically impossible. Just reading the source means that a lot of testing cannot be done. Ensure that your build systems do not grow into a complex creation no one wants to touch any more. The rules of secure and clean coding also apply to your CI pipeline. Create individual packages. Divide the build into modules, so that you can assemble the final application from independent building blocks. Also, refactor your build configuration. Make it simpler and remove all workarounds. Once the review gets stuck and auditors have to read your code like the newspaper, it is too late.

Using AI Language Models for Code Creation

The picture shows the inside of a circuit box created by the Midjourney AI graphic generation algorithm.

The trend of large language models (LLMs) continues. Many people are doing experiments and exploring how these algorithms can help them when developing software. Most integrated development environments have features that help you while writing code. Access to documentation, function call parameters, static checks, and suggestions are standard tools to help you. LLMs are the new kid on the block. Some articles describe how questions (or prompts) to chat engines were used to create code samples. The output depends a lot on the prompt. Changing words or rephrasing the prompt can lead to different results. This differs from the way other tools work. Getting useful results means playing with the prompt and engaging in trial-and-error cycles. Algorithms such as ChatGPT are not sentient. They cannot think. The algorithm just remixes and repeats part of its training data. Asking for code examples is probably most useful for getting templates or single functions. This use case is disappointingly close to browsing tutorials or Stack Overflow.

Designing prompts is a new skill artificially created by LLM algorithms. This is another problem, because you now need to collect prompts for creating the most useful code. The work shifts to another domain, but you don’t actually save time unless you have a compendium of prompts. Creating useful and well-tested templates is a better use of resources. The correct use of patterns governs code creation with or without LLMs.

The security questions have not been addressed yet. There are studies that compare the quality of tool-generated and human-generated code. According to the Open Source Security and Risk Analysis (OSSRA) report from 2022, the code created by assistant features contained vulnerabilities 40% of the time. An examination of code created by GitHub’s Copilot shows that autogenerated code contains bugs that belong to specific software weaknesses. The code created by humans has a distinct pattern of weaknesses. A more detailed analysis can only be done with larger statistical samples, because Copilot’s training data is proprietary and not accessible. There is room for more research, but it is safe to say that LLMs also make mistakes. Output from these algorithms must be included in the quality assurance cycle, with no exceptions. Code generators cannot work magic.

If you are interested in using LLMs for code creation, make sure that you understand the implications. Developing safe and useful templates is a better approach than engineering prompts for the best code output. Furthermore, the output can change whenever the LLM changes its version or training data. Using algorithms to create code is not a novel approach. Your toolchains most probably already contain code generators. In either case, you absolutely have to understand how they work. This is not something an AI algorithm can replace. The best approach is to study the code suggested by the algorithm, transfer it into pseudo-code, and write the actual code yourself. Avoid cut & paste, otherwise you introduce more bugs to fix later.

Presentation Supply Chain Attacks and Software Manifests

Today I gave a presentation about supply chain attacks and software manifests. The content covers my experience with exploring standards for Software Bill of Materials (SBOMs). While most build systems support creating the manifests, the first step is to identify what components you use and where they come from. Typical software projects will use a mixture of sources such as package managers from programming languages, operating systems, and direct downloads from software repositories. It is important to focus on the components your code directly relies on. Supporting applications that manage a database or host application programming interfaces (APIs) are a requirement, but usually not part of your software.

The presentation can be found by using this link. The slides are in German, but you will find plenty of links to sources in English.

Creating Software Bill of Materials (SBOM) for your Code

All software comprises components. Given the prevalence of supply chain attacks against modules and libraries, it is important to know what parts your code uses. This is where the Software Bill of Materials (SBOM) comes into play. A SBOM is a list of all components your code relies on. So much for the theory, because what is a component? Do you count all the header files of C/C++ as individual components or just as parts of the library? Is it just the run-time environment or do you also list the moving parts you used to create the final application? The latter includes all tools working in intermediate processes. A good choice is to focus on the application in its deployment state.

How do you get all the components used in the run-time version of your application? The build process is the first source you need to query. The details depend on the build tool you use. You need to extract the version, the source of the packaged component (this can be a link to the download source or a package URL), the name of the component, and hashes of either the files or the component itself. If you inspect the CycloneDX or SPDX specifications, then you will find that there are a lot more data fields specified. You can group components, name authors, add a description, describe complex distribution processes, attach generic properties, and add more details. The build systems and package managers do not provide easy access to all information. The complete view requires using the original source for the component, the operating system, and external data sources (for example, the CPE or links to published vulnerabilities). Don’t get distracted by the sheer amount of information that can be included in your manifest. Focus on the relevant and easy-to-get data about your components.

Hashes of your components are important. When focussing on the run-time of your application, make sure you identify all parts of it. To give an example of C/C++ code, you can identify all libraries your applications load dynamically. Then calculate the hashes of the libraries on the target platform. SBOMs can be different for various target platforms. If you use containers, then you can fix the components. Linking dynamically against libraries means that your code will use different incarnations on different systems. Make sure that you calculate more than one hash for your manifest. MD5 and SHA-1 are legacy algorithms. Do not use them. Instead, use SHA-2 with 256 bits or more and one hash of the SHA-3 family. Using two different hash families guards against a weakness discovered in one of them, and SHA-3 is not prone to content appending (length extension) attacks.
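The minimal data set and the hash rule from above can be written down as a small data structure with a policy check. This is only a sketch. The type names, the function names, and the algorithm labels are assumptions made for illustration and do not follow the CycloneDX or SPDX schema. The check looks at the algorithm labels only and does not verify the digests themselves.

```cpp
#include <string>
#include <vector>

// One hash entry of a component; the label spelling is an assumption.
struct Hash {
    std::string algorithm;   // e.g. "SHA-256" or "SHA3-256"
    std::string hex_digest;  // digest calculated on the target platform
};

// The minimal data the text recommends collecting per component.
struct Component {
    std::string name;
    std::string version;
    std::string source;  // download link or package URL
    std::vector<Hash> hashes;
};

bool is_legacy(const std::string& a) { return a == "MD5" || a == "SHA-1"; }

bool is_sha2(const std::string& a) {
    return a == "SHA-256" || a == "SHA-384" || a == "SHA-512";
}

bool is_sha3(const std::string& a) {
    return a == "SHA3-256" || a == "SHA3-384" || a == "SHA3-512";
}

// True if the component carries no legacy hash and has at least one
// SHA-2 hash (256 bits or more) plus one hash of the SHA-3 family.
bool satisfies_hash_policy(const Component& component) {
    bool has_sha2 = false;
    bool has_sha3 = false;
    for (const Hash& hash : component.hashes) {
        if (is_legacy(hash.algorithm)) {
            return false;
        }
        has_sha2 = has_sha2 || is_sha2(hash.algorithm);
        has_sha3 = has_sha3 || is_sha3(hash.algorithm);
    }
    return has_sha2 && has_sha3;
}
```

A check like this can run at the end of the build and reject a manifest entry that only carries a single hash or still lists MD5 or SHA-1.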
