The weirdest bug I've ever encountered

I am responsible for the reliability of the firmware updates on our devices and this work never gets boring. The thing about firmware updates is that by definition, every bug is caused by the updater because before installing the affected version, the bug was not present. You could say that updates are the cause and the solution of all software problems. So, a lot of update bug reports that find their way to me have nothing to do with the updater per se, they just manifest themselves immediately, giving the impression that the updater is broken. But I'm getting ahead of myself, so let's start from the beginning.

A new bug comes in: Update failed from version X to version Y, error getting log files from sensor. After that, the user manually rebooted the machine and it worked again.

I connect to the machine and download all the logfiles. In the logs, I see that the update installed successfully. After the reboot which starts the newly installed firmware, there were some minor problems: Some processes took a bit longer to start than expected. Other than that, the log looks clean, no errors that could explain this behaviour.

I try to install the affected Firmware version Y on various devices. The problem does not appear. But as luck has it, a co-worker encountered the problem and had the presence of mind to recognize it and leave the system undisturbed for analysis. Connecting to the system is difficult because it is laggy and slow. The device can barely handle simple commands. Weird. Let's run top.

60 processes; 646 threads; CPU states: 16.0% user, 35.0% kernel CPU 0 Idle: 55.4% CPU 1 Idle: 42.4% Memory: 0 total, 543M avail, page size 4K PID TID PRI STATE HH:MM:SS CPU COMMAND 1 12 10 Run 1:51:21 17.72% kernel 1 5 10 Run 2:13:10 17.04% kernel 602168 1 10 Rdy 3:30:01 13.98% ps 229398 1 21 Rcv 0:09:32 0.67% smb-tunnelcreek 778301 1 10 Rply 0:00:00 0.62% top 253978 2 21 NSlp 0:05:43 0.40% devi-tsc2007 1 6 10 Rcv 2:08:09 0.25% kernel 7 2 21 Rcv 0:01:41 0.07% io-pkt-v4-hc 581675 15 10 CdV 0:00:17 0.07% telescope 8200 9 21 Rcv 0:00:35 0.05% io-usb Min Max Average CPU 0 idle: 55% 55% 55% CPU 1 idle: 42% 42% 42% Mem Avail: 543MB 543MB 543MB Processes: 60 60 60 Threads: 646 646 646

Okay, we immediately have some clues here. The QNX kernel is doing a lot of stuff and so is ps. We see that the CPU load of the two cores is about 50%. But we know that the system has hyperthreading and only has one physical core, so the real load is actually more like 100%. To me, it looks like ps is saturating the core completely, the kernel load is probably caused by whatever ps is doing.

Who is calling ps? With ps -ef we can see the PPID, the parent process id, to track down who spawned this process. Ironically, we use ps to debug this ps problem. Like this, I find that it was spawned by the shell sh, which in turn was spawned by one of our proprietary processes. Equipped with the knowledge of which process is affected and analysing the logs to find out where this process got stuck, I quickly find the offending line of code. It looks like this: int rc = system( "ps -e | grep Foo" ); where Foo is some program whose presence we need to check. Not the prettiest way to implement it, but it is legacy code that worked for many years. I also see that this code only gets executed on certain old hardwares. This explains why the problem never reproduced on most systems, they have newer hardware. Even on the old hardware, it is exceedingly rare and hard to reproduce. Killing the offending ps process, which loops infinitely otherwise, unblocks the system and restores normal operation.

We now know a lot about this problem, but what is the root cause? The ps program causes 100% CPU load... What is this? We use QNX 6.6 and ps is supplied as a closed source binary. To understand this problem, we need to analyse the ps utility itself. We can already do that at the assembly level. The QNX Momentics IDE has an option "Attach to Remote Process via QConn" with which I can tap right into the running process. We see that it is stuck in a loop that calls devctl() over and over again. The return value of this devctl() is 3 which is ESRCH: No such process. This error comes straight from MsgSendv_r which returned -3. It documents that ESRCH means The server died while the calling thread was SEND-blocked or REPLY-blocked..

Okay, so, ps gets stuck in an infinite loop. Dare I say it: Do we have a ps bug on our hands? The QNX ecosystem is generally quite robust. In the past, it almost always ended up being my own mistake when I suspected problems in QNX and its utilities. But how can the infinite loop seen in live assembly debugging be explained with a user error of ps?

At this point, an intermezzo with some QNX history is in order. A bit more than a decade ago, the QNX source code was available to the public. Back then, QNX had a vibrant open source community. People would experiment with the kernel, write various useful utilities and help each other in forums. QNX even had a fully featured Desktop GUI, ran Firefox and was self-hosting, so you could develop for QNX right on QNX itself with full IDE and compiler support. It was beautiful. Then QNX was bought, source code access was revoked and the community largely withered away. Questions were increasingly asked via private support tickets directly to QNX, locked away from the public. QNX know-how becomes harder and harder to acquire, open source software for modern QNX releases is essentially non-existent and the driver situation is a catastrophe. The QNX kernel is the most beautiful and interesting kernel I have ever had the pleasure of working with, but it lies in the shackles of corporate ownership.

Back to the bug. So there is old QNX source code lying around. Do you think anyone has modified the source code of the ps utility throughout the last 15 years? Me neither! Let's dive right into the old code.

The old ps utility compiles for QNX 6.6 on the first try. Nice! It works just like the closed source binary we have. I install my newly compiled ps, write some software to automatically test reboots in a loop overnight and am able to reproduce the problem with this custom ps binary. Perfect! Now we can use the "Attach to Remote Process via QConn" feature again, but this time we have the source code of our ps. I already read the source code of the old ps before I did all of this, and I already had my suspicions as to which devctl() call was being repeated endlessly in the loop, so it comes as no surprise when the debugger points me to the following line:

if (devctl (fd, DCMD_PROC_TIDSTATUS, &tinfo, sizeof (tinfo), 0) == EOK)

The root cause of the bug is the following. For every process in the /proc directory, ps opens a file descriptor to this process' address space (as) file to read the process information to be displayed. For each process, it loops through all the threads of this process, and the bug is that this loop has insufficient termination criteria (in other words, in some cases, it loops infinitely). The problem occurs when the process whose threads we want to inspect terminates right before ps enters this loop. In that case, the devctl() call fails and as you can see in the following simplified snippet, it will never terminate because the if-clause is never entered.

while (1) { if (devctl (fd, DCMD_PROC_TIDSTATUS, &tinfo, sizeof (tinfo), 0) == EOK) { //[...] tcount++; // stop when we have gone through each thread, or when // the user only want process info if ((usingThreads == 0) || (tcount == info.num_threads)) break; } tinfo.tid++; }

I also verify this hypothesis by halting the ps process in the debugger, killing the process whose threads we are about to inspect and resuming ps and it hangs in precisely the same way. This is what it looks like in the QNX Momentics IDE:

After this analysis, I'm quite sure that this 15 years old bug is still alive and well in our closed source binary.

Why did this problem suddenly appear in our firmware? We can only speculate, it must have been changes in the scheduling order or timing at boot time. The bug is a race condition, so it can rear its ugly head whenever it wants. The affected code was old and deployed in production for many years.

How did I fix it? I briefly considered shipping our own ps utility, but I was still unsure about other bugs that might potentially be fixed in the latest closed source binary, those fixes would be lost again if I revert to the old open source version. At the end of the day, I decided to comb through our code base and just eliminate the usage of ps in non-interactive code, there weren't that many instances. Our ps utility remains buggy as it is, but it's pretty much impossible to reproduce this bug in an interactive terminal, and our firmware no longer uses it. Needless to say, this particular update problem never occurred again after that.

Does the ps bug still exist in QNX ecosystems more recent than QNX 6.6? Most likely, yes. If it was open source, I would fix it and send a pull request. Because it's not, they have to deal with the issue themselves. Maybe in a decade, an unfortunate soul runs into this bug again. Let's hope this blog post will save them some trouble.

What do we learn?

  • No matter how battle-tested and old the code and how reputable the distributor - the code contains bugs.
  • Old bugs can manifest themselves seemingly out of nowhere, caused by subtle changes in timing or memory layout.
  • Whenever the file system is involved, there is a significant danger that bugs are caused by race conditions.
  • Closed source operating systems and ecosystems are a pain to debug. Even old open source releases help.
  • Do not use interactive shell utilities in non-interactive code. Avoid system() whenever possible. Not only is the performance terrible, it can give rise to bugs like this one.
  • Make sure your loops terminate. Bounded loop variants that increase or decrease strictly monotonically guarantee loop termination. Don't be too clever with loops!

Post comment

CAPTCHA
* required field

Comments

Asu wrote on 17 November 2022
This is a topic that's close to my heart... Cheers!
Exactly where are your contact details though?
reply

phim heo xnxx wrote on 14 November 2022
Its like you learn my thoughts! You appear to understand so much about this,
such as you wrote the guide in it or something.
I believe that you just can do with some % to force the message house a bit, but other than that, this is wonderful blog.

An excellent read. I'll certainly be back.
reply

dohm sound machine wrote on 11 November 2022
Ꭲһanks for tһe marvelous posting! I seriously enjoyed reading іt, you might
be a great author.I will ensure that I bookmark your blog and will eventually come ƅack very soon. I
want to encoᥙrɑge yourself to continue your great work, have a niϲe evening!
reply

Ditme88.Com wrote on 10 November 2022
I am not very excellent with English but I get hold this very easygoing to understand.
reply

betta fish for sale wrote on 18 October 2022
If you want your betta fish to reside long and healthy lives, an aquarium tank heater is often a necessity to be sure the temperature will consistently stay within the safe range.
reply

ufabet889 wrote on 14 October 2022
I believe this is among the most important information for
me. And i'm glad reading your article. However wanna statemnt on some common things,
The website style is great, the articles is in point
of fact excellent : D. Excellent task, cheers
reply

translation equipment SIMULTANEOUS INTERPRETATION SYSTEMS wrote on 05 October 2022
I know this site gives quality dependent posts and extra information, is there any other web page which
presents these things in quality?
reply

Marrakech to Merzouga wrote on 02 October 2022
Write more, thats all I have to say. Literally, it seems as though you relied on the video to make your point.
You obviously know what youre talking about, why waste your intelligence on just posting videos to your site when you could be giving us
something informative to read?
reply

slot pulsa indosat wrote on 16 September 2022
Excellent beat ! I would like to apprentice even as you amend your website, how could i subscribe
for a blog website? The account helped me a appropriate deal.
I had been a little bit acquainted of this your broadcast provided vivid clear concept
reply

situs dewa poker wrote on 12 September 2022
Hello to all, for the reason that I am really keen of reading this
webpage's post to be updated on a regular basis.
It carries fastidious stuff.
reply

anjing wrote on 05 September 2022
Great ? I should certainly pronounce, impressed with your web site.
I had no trouble navigating through all the tabs and related information ended up being truly simple to do to
access. I recently found what I hoped for before you know it in the least.
Reasonably unusual. Is likely to appreciate it for those who add forums or something, website theme .
a tones way for your client to communicate. Nice task.
reply

dumps with pin wrote on 05 September 2022
Inspiring story there. What occurred after? Take care!
reply

https://sipetra.id wrote on 01 September 2022
After looking at a few of the blog articles on your web page, I seriously appreciate your way of writing a blog.
I bookmarked it to my bookmark website list and will be checking
back in the near future. Please visit my web site
as well and tell me how you feel.
reply

beha bolong wrote on 27 August 2022
Wow! This could be one particular of the most useful blogs We've ever arrive across on this subject.
Actually Magnificent. I'm also an expert in this topic therefore I can understand your
hard work.
reply

oggi cripto wrote on 13 July 2022
OggiⲤrypto è un bellissimo sito per aggiornarsi sulla nicchia delle cripto valute.
Se anche tu vuoi imparare di più su crіptovaluta governo cinese
viita ѕu oggi cripto
reply

bp77 daftar wrote on 05 July 2022
Hey there! I've been following your weblog for
a while now and finally got the bravery to go ahead and
give you a shout out from Austin Texas! Just wanted to mention keep
up the excellent job!
reply

rtp live slot agus bet wrote on 28 June 2022
I think this is one of the such a lot significant info for me.
And i am satisfied reading your article. But should statement on some
normal things, The site taste is perfect, the articles is in reality great : D.
Just right activity, cheers
reply

pragmatic slot wrote on 27 June 2022
Pretty great post. I just stumbled upon your weblog and wanted to mention that I have really enjoyed surfing around your
blog posts. After all I'll be subscribing for your feed and I'm hoping you write again soon!
reply

poker wrote on 27 June 2022
hi!,I like your writing very much! share we keep in touch extra
about your article on AOL? I require a specialist in this space to unravel my problem.
May be that's you! Looking forward to peer you.
reply

میوه خشک کن wrote on 17 June 2022
Hi there, I discovered your blog via Google even as
looking for a comparable matter, your website came up, it seems to be
good. I have bookmarked it in my google bookmarks.


Hi there, simply become alert to your weblog through Google, and found that it's really informative.
I'm gonna watch out for brussels. I will appreciate in the event
you continue this in future. A lot of folks will likely be benefited out of your writing.
Cheers!
reply

سبزی خشک کن wrote on 17 June 2022
This is my first time visit at here and i
am truly pleassant to read all at single place.
reply

jos777 wrote on 02 November 2022
Hi are using Wordpress for your blog platform?
I'm new to the blog world but I'm trying to get started and set up my
own. Do you need any coding knowledge to make your own blog?
Any help would be really appreciated!
reply

situs slot gacor wrote on 26 October 2022
Definitely believe that which you said. Your favorite reason appeared to be on the web the easiest thing
to be aware of. I say to you, I definitely get annoyed while people think about
worries that they just do not know about.
You managed to hit the nail upon the top and also defined out the whole thing without having
side effect , people could take a signal.

Will likely be back to get more. Thanks
reply

event parlay wrote on 16 October 2022
Terrific work! This is the type of info that should be shared across the net.
Disgrace on Google for no longer positioning this post upper!

Come on over and discuss with my web site .

Thank you =)
reply

bu siteye de bakın wrote on 10 June 2022
Pretty great post. I jᥙst stumbled սpon your webⅼog and wished
to say that I've гeally loved browsіng your Ƅlog posts.
In ɑny case I ѡill be subscribing for үour rss feed and
I hoρe you write again very soon!
reply

situs casino online wrote on 08 June 2022
Very good post. I definitely appreciate this site.
Continue the good work!
reply

https://linux-index.org/ wrote on 07 June 2022
If some one wants expert view concerning blogging afterward i propose him/her
to go to see this webpage, Keep up the fastidious job.
reply

agen slot wrote on 07 June 2022
I am extremely impressed with your writing skills as well as with the layout on your weblog.
Is this a paid theme or did you customize it yourself?
Anyway keep up the excellent quality writing, it's rare to see a nice blog like this one
nowadays.
reply

rupiahqq wrote on 26 November 2022
It's going to be finish of mine day, but before end I am
reading this enormous article to increase my experience.
reply

sarang188 wrote on 22 October 2022
Hi there! Someone in my Myspace group shared this site with us
so I came to give it a look. I'm definitely loving the information. I'm bookmarking and will
be tweeting this to my followers! Great blog and outstanding design.
reply

daftar bola88|sbobet online wrote on 10 October 2022
When I initially commented I clicked the "Notify me when new comments are added" checkbox and now each time
a comment is added I get three emails with the same comment.
Is there any way you can remove people from that service?

Appreciate it!
reply

slot online wrote on 07 June 2022
I needed to thank you for this good read!! I
definitely enjoyed every bit of it. I have you saved as a favorite to look at new stuff you post?
reply

bandar poker qq wrote on 07 June 2022
Wonderful beat ! I wish to apprentice while you amend your website,
how could i subscribe for a blog site? The account helped me a acceptable deal.
I were tiny bit acquainted of this your broadcast offered bright transparent concept.
reply

domino qq wrote on 06 June 2022
Hello my loved one! I want to say that this post is amazing, nice written and come with approximately all vital infos.

I'd like to see extra posts like this .
reply

smm services buy wrote on 04 June 2022
It's going to be end of mine day, except before finish I am reading this impressive piece of
writing to improve my know-how.
reply

mcafee com activate wrote on 28 April 2022
I do ϲonsider all of the ideas ʏou һave offered on your post.
Tһey aгe very convincing аnd wiⅼl cеrtainly woгk.
Still, the posts ɑre toο short for starters.
Ꮇay you pleaѕe lengthen them a little from
next time? Tһank yօu foг the post.
reply

Narcissistic Abuse Recovery Tips wrote on 24 April 2022
Sweet blog! I found it while searching on Yahoo News.
Do you have any tips on how to get listed in Yahoo News?
I've been trying for a while but I never seem to get there!
Appreciate it
reply

POWERBALL wrote on 22 April 2022
I don't know whether it's just me or if everyone else
encountering issues with your website. It appears as if some
of the written text in your content are running off the screen.
Can somebody else please provide feedback and let me know
if this is happening to them too? This might be a issue with my internet browser because I've had this happen before.
Thanks
reply

595955995 wrote on 15 November 2021
Many such cases! Sad!
reply