Solving the Most Interesting Bug of My Career in 15 Steps

Written by quoraanswers | Published 2017/11/17
Tech Story Tags: programming | bugs | computer-bugs | most-interesting-bugs | quora-partnership


By Udayan Banerji, works at Quora. Originally published on Quora.

TL;DR: I spent two weeks investigating a bug, and the fix was a one-line change.

While working as a compiler engineer at Intel, I once got assigned a weird bug. It was an Android app, basically a Java benchmark, and it would randomly crash. The app had one button, and clicking that button started a long-running execution of the whole suite of benchmarks.

  1. I did not have the source code of the app, but I could see the bytecode. So I first tried running it through the debugger. I tried at least 30 times, and it never crashed.
  2. I ran the app normally again, and it randomly crashed. Eventually I figured out that it crashed every 20th time I ran the benchmark.
  3. I scoured through the bytecodes for anything that had 20 in it. Any loops of 20, any recursions. Nothing. The program kept crashing.
  4. This was getting serious now. It seemed easier to just smash the computer keyboard on this Android phone and make the pain go away.
  5. After a weekend, I came back to the issue. I went back to the crash in Java. The core issue was an assertion failure: a large floating-point number was not equal to NaN (“Not a Number”).
  6. I went back to the bytecodes and looked for floating-point divisions. One by one, I isolated about a dozen bytecode sequences, converted them to x86 assembly, put each in a long-running loop, and executed them. Finally, one of them crashed every 20th time. I could see the light at the end of my carpal tunnel.
  7. I analyzed the assembly code and saw 8 divide-by-zero operations. Aha! Divide-by-zeros produce NaN! So our compiler’s divide by 0 is broken … umm, somehow?
  8. Except no, a handwritten assembly divide by zero worked fine. Frustrated, I ran a loop of 20 divide-by-zeros, and it passed as well. I then wrote a bunch of random assembly instructions after those, and the first one gave the wrong result.
  9. Wait what?
  10. Finally, I went to gdb and dumped the values of all CPU registers for these operations.
  11. That is when I noticed a trend. The x87 register stack was filling up slowly, and then staying put at capacity (8 items).
  12. Turns out, the bug was in how we used the ancient x87 unit in the chip, the one responsible for floating-point operations. Our compiler used it for all floating-point operations, and every path except the divide-by-zero path emptied its register stack after use.
  13. It seems that on a stack overflow, it did not throw an error but returned NaN no matter what you ran through it, which is also the value you get when you divide by 0. (Basically, the stack overflow error, called a stack fault, is sticky. Once it happens, you have to clear it manually in the compiler, or it keeps happening.)
  14. So after every 8 divide-by-zeros, the stack would fill up, and from then on it would treat any operation as a divide by zero and return NaN.
  15. The fix was a one-line code change: clear the stack on the divide-by-zero path.
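
Steps 11 to 15 can be sketched as a toy model in Python. This is an illustration under simplifying assumptions, not the compiler's code and not an accurate x87 emulator: the class ToyFpu, its clear_on_zero_divide flag, and its divide method are hypothetical names, and the model only counts occupied stack registers. It follows the story above in returning NaN on the divide-by-zero path; strict IEEE 754 arithmetic gives NaN for 0/0 and infinity for a nonzero value divided by zero.

```python
import math

X87_STACK_DEPTH = 8  # the x87 register stack holds eight values


class ToyFpu:
    """Toy model of the leak described above (hypothetical, heavily simplified)."""

    def __init__(self, clear_on_zero_divide):
        # True models the fixed compiler, False models the buggy one.
        self.clear_on_zero_divide = clear_on_zero_divide
        self.depth = 0  # number of occupied stack registers

    def divide(self, a, b):
        if self.depth == X87_STACK_DEPTH:
            # Stack fault: no free register, so the result is NaN
            # no matter what the operands are.
            return math.nan
        self.depth += 1  # the operation occupies one register
        if b == 0:
            # Divide-by-zero path: the buggy version never pops its entry.
            if self.clear_on_zero_divide:
                self.depth -= 1
            # Assumption: NaN here, as reported in the story above.
            return math.nan
        self.depth -= 1  # every other path empties the stack after use
        return a / b


buggy = ToyFpu(clear_on_zero_divide=False)
fixed = ToyFpu(clear_on_zero_divide=True)
for _ in range(X87_STACK_DEPTH):
    buggy.divide(1.0, 0.0)
    fixed.divide(1.0, 0.0)

print("buggy:", buggy.divide(6.0, 3.0))
print("fixed:", fixed.divide(6.0, 3.0))
```

With the flag off, eight divide-by-zeros fill the stack and an ordinary division such as 6.0 / 3.0 comes back as NaN. With the flag on, the same division returns 2.0.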

EDIT: Jay Shah asked me for the actual code. It is here: Gerrit Code Review. Note that most changes are comments. There are 4 lines of code changes, but 3 are identical and 1 is loading a value.




Published by HackerNoon on 2017/11/17