Month: May 2024

URL Validation, Unit Tests and the Lost Constructor

On May 27, 2024

I have some code that requests URLs, looks for session identifiers or tokens, extracts them, and calculates some indicators of randomness. The tool works, but I decided to add some unit tests in order to play with the Catch2 framework. Unit tests requires some easy to check conditions, so validating HTTP/HTTPS URLs sounds like a good idea to get started. The code uses the curl library for requests, so checking URLs can be done by libcurl or before feeding the URL string to it. Therefore I added some regular expressions. RFC 3986 has a very good description of Uniform Resource Identifiers (URIs). The full regular expression is quite large and match too many variations of URI strings. You can inspect it on the regex101 web site. Shortening the regex to matching URLs beginning with “http” or “https” requires to define what you want to match. Should there be only domain names? Are IP addresses allowed? If so, what about IPv4 and IPv4? Experimenting with the filter variations took a bit of time. The problem was that no regex was matching the pattern. Even patterns that worked fine in other programming languages did not work in the unit test code. The error was hidden in a constructor.

Class definitions in C++ often have multiple variations of constructors. The web interface code can create a default instance where you set the target URL later by using setters. You can also create instances with parameters such as the target or the number of requests. The initialisation code sits in one member function which also initialises the libcurl data structures. So the constructors look like this:

http::http() { }

http::http( unsigned int nreq ) { init_data_structures(); set_max_requests( nreq ); return; }

The function init_data_structures() sets a flag that tells the instance if the libcurl subsystem is working or not. The first constructor does not call the function, so the flag is always false. The missing function call is hard to miss. The reason why the line was missing is that the code had a default constructor at first. The other constructors were added later, and the default constructor function was never used, because the test code never creates instances without an URL. This bring me back to the unit tests. The Catch2 framework does not need a full program code with a main() function. You can directly create instances in your test code snippets and use them. That’s why the error got noticed. Unit tests are not security tests. The missing initialisation function call is most probably not a security weakness, because the code does not run with the web request subsystem flag set to false. It’s still a good way to catch omissions or logic errors. So please do lots of unit tests all of the time.

Floating Point Data Types and Computations

By René Pfeiffer

On May 7, 2024

In C/C++, Data, Mathematics

Floating point data types are available in most programming languages. C++ knows about float, double, and long double data types. Other programming languages feature longer (256 bit) and shorter (16 bit and lower) representations. All data types are specified in the IEEE Standard for Floating-Point Arithmetic (IEEE 754). IEEE 754 is the standard for all implementations. Hardware also supports storage and operations. Floating point data storage is usually used in numerical calculations. Since the use case is to represent real numbers, the accuracy is a problem. Mathematically there is an infinite amount of other real numbers between two arbitrary chosen real numbers. Computers are notoriously bad at storing an infinite amount of data. For the purposes of programming, this means that all choices for using any floating point data type have to deal with error conditions and how to handle them. Obvious errors include the division by zero. Less obvious conditions are rounding errors, special numbers (infinity, not a number, signed zeroes, subnormal numbers), and overflows.

Not all of the error conditions may pose a threat for your applications. It depends on what type of numerical calculations your code does or consumes. Comparisons have to be implemented in a thoughtful way. Test for equality may fail, because of rounding errors. Using the “real zero” can backfire. The C and C++ standard libraries supply you with a list of constants. Among them is the minimal difference that can be represented in a floating point data type. It is called the epsilon value. Epsilon (ε) is often used to denote very small values. cfloat or float.h defines FLT_EPSILON (for float), DBL_EPSILON (for double), and LDBL_EPSILON (for long double). Using this value as the smallest difference possible is usually a good idea. There is another method for finding neighbouring floating point numbers. C++11 has introduced functions to find the next neighbour value. The accuracy is determined by the unit of least precision (ULP). ULPs are defined by the value of the least significant bit. Using ULPs or the epsilon values is a different approach. ULP checking requires transformation of the values into integer registers. Both methods work well away from the zero. If you are near the zero value, then consider using multiples of the epsilon value as a comparison value.

There is another overlooked fact. The float data type has 32 bit of storage. This means you can use 4 billions different bit combinations, which is not a lot. Looping through all values and stress testing a numerical function can be done in minutes. There is a blog post covering this technique complete with example code.

I have compiled some useful resources for this topic.