The day I started believing in Unit Tests

Right after getting my degree, I was lucky to be hired as an embedded software engineer into a small but decent R&D department. My first task was to create a Unit Test project and have it run in our build pipeline after compilation. I was not particularly excited about this task; it seemed like busywork to me, and I believed that Unit Tests were largely worthless throwaway code. They probably didn't want me to dive into production code right away. Fair enough, I thought. I'm the newcomer, and this is something requested by engineers who are way more experienced than I am. So for now, I should just go along with it, even if I believe it's a waste of time.

So, ahead I went. I implemented the test suite and some basic tests, including some that I ported from our old, deprecated test framework to googletest. The build pipeline started the test after compilation: it transferred the test binary to a physical device which was (and still is) situated in our office and runs the QNX operating system, then executed it there for on-target testing. This was necessary because the code used many QNX primitives and had to be tested on that operating system, so the test was something between a Unit Test and an Integration Test.

The mechanism worked well, and the test ran multiple times per day. Since the test coverage was quite small and nobody ever touched that code anyway, it always succeeded - another reason why I thought it was a waste of time. New tests were rarely added, either. It's very hard to write Unit Tests for embedded firmware and hardware abstractions, because the very thing you want to test is the interaction with the hardware, which is precisely what you can't exercise in pure code. Because the test ran automatically and always succeeded, we all soon forgot about it and moved on to other things.
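For a sense of what those basic tests looked like, here is a minimal googletest sketch. Timer and its interface are made up for illustration; the real tests exercised QNX-specific primitives on the target device:

#include <chrono>
#include <thread>

#include <gtest/gtest.h>

#include "timer.h"  // hypothetical header for the code under test

// Hypothetical test: verify that a Timer reports itself as expired
// once its timeout has elapsed.
TEST(TimerTest, ReportsExpiredAfterTimeout) {
    Timer timer(std::chrono::milliseconds(10));
    timer.start();
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    EXPECT_TRUE(timer.expired());
}

int main(int argc, char** argv) {
    // The pipeline copied this binary to the QNX device and executed it there.
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}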

Fast forward about a year. The test had run hundreds if not thousands of times successfully. What a waste of time... But then, one day, we started observing test failures. Not many, maybe three over the course of a few weeks. The test actually crashed with a Segmentation Fault, so it was clear that this was a severe error. Interestingly, none of the code under test had actually changed. Well, that was definitely something we had to investigate! I'll spare you the details of the search for the error, but eventually I was able to reproduce the problem with a debugger attached, so the entire context of the problem was handed to me on a silver platter.

The problem had to do with how our threading abstraction, which is based on inheritance, was used in the test framework. A base class starts a thread that repeatedly calls a virtual function, and the user of the abstraction is supposed to override that function, kind of like this:

/* Library code: */

class Thread {
public:
    Thread() { /* ... */ }
    virtual ~Thread() { stop(); }

    void stop() { /* join the thread if running */ }

protected:
    virtual void singlepassThreadWork() = 0;
};

/* User code: */

class MyThread : public Thread {
protected:
    void singlepassThreadWork() override { /* do stuff with foobar */ }

private:
    std::vector<int> foobar;
};

Now, what happens when MyThread::singlepassThreadWork() uses a member variable of MyThread like foobar, and we delete the MyThread object while the thread is still running? Destruction runs from the most derived class upwards: the destructor of MyThread executes first and destroys its members, including foobar, and only then does the destructor of the base class Thread run and join the thread. Thus, there is a race condition: singlepassThreadWork() may access the vector foobar after it has already been destroyed. We can fix the user code by explicitly stopping the thread in its destructor:

/* User code: */

class MyThread : public Thread {
public:
    ~MyThread() { stop(); }

protected:
    void singlepassThreadWork() override { /* do stuff with foobar */ }

private:
    std::vector<int> foobar;
};
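To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern using std::thread. The start()/stop() mechanics and the running_ flag are my assumptions for illustration; the real abstraction was built on QNX primitives:

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

class Thread {
public:
    virtual ~Thread() { stop(); }  // too late for derived members: they are already gone

    void start() {
        running_ = true;
        worker_ = std::thread([this] {
            while (running_) {
                singlepassThreadWork();  // virtual call into the derived class
            }
        });
    }

    void stop() {
        running_ = false;
        if (worker_.joinable()) {
            worker_.join();
        }
    }

protected:
    virtual void singlepassThreadWork() = 0;

private:
    std::atomic<bool> running_{false};
    std::thread worker_;
};

class MyThread : public Thread {
public:
    MyThread() { start(); }
    ~MyThread() { stop(); }  // the fix: join before foobar is destroyed

protected:
    void singlepassThreadWork() override { foobar.push_back(42); }

private:
    std::vector<int> foobar;
};

int main() {
    MyThread t;  // worker thread runs while t is alive
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}   // without the stop() in ~MyThread, this destruction would race with the worker

Note that without the explicit stop() in ~MyThread, the worker does not just race on foobar: once ~MyThread has finished, the object's dynamic type reverts to Thread, so the next virtual call can even abort with a pure virtual call error.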

My disappointment was immeasurable. The bug was in the test framework itself, not in the code under test. Unit Tests really are worthless and this is a waste of time... right? But then I was reminded of a comic that I had found on the internet, printed out, and put on the wall in our office:

Even though this comic is older than I am and comes from a long-bygone hacker era, it resonated with me a lot because the wisdom within it still holds true. Whenever you encounter a bug, ask yourself the following three questions:

  1. Have I made this error anywhere else?
  2. What happens when I fix the bug?
  3. How can I change my ways to make this bug impossible?

It seems so simple, but it's a really powerful methodology for preventing further errors and raising code quality. With this comic in mind, I asked myself whether this error existed anywhere else in the code - and boy, did I find a boatload of instances of this race condition. It was everywhere. A co-worker and I combed through the entire code base and fixed this type of error in a few large commits, adding a stop() call to the destructor of the lowest class in each thread's inheritance hierarchy. Furthermore, we made all the developers aware of this pitfall and kept an eye on new code that could be affected. We have never observed this bug again in the years since. In addition, we now understood that this inheritance-based abstraction is defective by design, since most of its uses suffer from a race condition that the programmer has to mitigate manually. To keep new abstractions free of such pitfalls, we encouraged developers to prefer composition and dependency injection over inheritance, which raised the overall code quality significantly.
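For illustration, here is a minimal sketch of the composition-based alternative (all names are made up): the thread is owned as a member instead of being a base class, so the ordinary member destruction order joins it before the data it works on disappears.

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Owns a thread and joins it in its own destructor.
class Worker {
public:
    explicit Worker(std::function<void()> work) : work_(std::move(work)) {
        thread_ = std::thread([this] {
            while (running_) {
                work_();
            }
        });
    }

    ~Worker() {
        running_ = false;
        if (thread_.joinable()) {
            thread_.join();
        }
    }

private:
    std::function<void()> work_;
    std::atomic<bool> running_{true};
    std::thread thread_;
};

class MyComponent {
public:
    MyComponent() : worker_([this] { doWork(); }) {}

private:
    void doWork() { foobar_.push_back(42); }

    std::vector<int> foobar_;
    Worker worker_;  // declared last, so it is destroyed (and joined) first
};

No inheritance, no virtual dispatch during destruction, and no derived class that can forget to call stop() - the compiler-enforced destruction order of members does the work.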

We never found out why the race condition suddenly started causing crashes after a year's worth of successful runs. As mentioned, none of the involved code had changed, and as far as I know, the operating system on the test system was not modified during that time period either. The crashes must have been caused by subtle differences in how the threads were scheduled. Perhaps the addition of unrelated tests made the Unit Test binary larger and caused side effects on CPU caches and timings? We'll never know.

The day I found this race condition thanks to Unit Tests was the day I truly started to believe in their value. I kept expanding the test project and even achieved 100% test coverage of a crucial library we depended on, which later prevented disaster when a code modification introduced a subtle but critical bug that was caught by the test suite. The effort spent on Unit Tests is worth it.

This event also taught me not to reject concepts and development methodologies based on half-knowledge and prejudice. When in doubt, just try it. What's the worst that could happen? You waste a bit of time. On the other hand, the potential upside is that you add a useful tool to your toolbox for the rest of your life.


Comments

Stable wrote on 31 December 2023
Just read your post about the unit testing epiphany. It really illustrates the importance of a well-placed stop() call in a class destructor. It's a classic example of those small details that can make or break a project. Happy New Year! ;-)

Tom Van Vleck wrote on 21 December 2023
Nice story. I wrote https://multicians.org/thvv/threeq.html in 1989 and it was published in ACM SIGSOFT Software Engineering Notes (before the WWW).


D-Coder wrote on 20 December 2023
This article made the news! Hacker News: https://news.ycombinator.com/item?id=38695938

yashchester wrote on 19 December 2023
Love this comments section and great article! Woop woop

Christopher Smith wrote on 19 December 2023
Was there a change in the toolchain, especially the compiler itself, that might have surfaced this bit of "personality"?

Steven wrote on 19 December 2023
Hello, I'm Steven. I enjoy programming.

Bobby wrote on 19 December 2023
Wow, I write tests now on my dev program. I have a personal project, check it out.