You too can read disassembly

Written by okaleniuk | Published 2016/12/11
Tech Story Tags: programming | cpp | assembler | debugging | optimization


There is an interactive version on wordsandbuttons.online.

Reading disassembly is more like reading tracks than reading a book. You have to know the language to read a book, but reading tracks, although it gets better with skill and experience, mostly requires attentiveness and logical thinking.

Most of the time we browse disassembly to answer one simple question: does the compiler do what we expect it to? In 3 simple exercises, I’ll show you that you too can answer such questions even if you have never seen disassembly before. I’ll use C++ as a source language, but what I’m trying to show is more or less universal, so it doesn’t matter if you write in C or Java, C# or Rust: if you compile to some sort of machine code, you can benefit from understanding your compiler.

1. Compile time computation

Any decent compiler tries to make your code do as little work as possible. Sometimes it can even conduct the whole computation at compile time, so your machine code will simply contain the answer.

This source code defines the number of bits in a byte, provides a function template that accepts a type T and returns the size of T in bits, then calls it from main with T = int.

#include <cstddef>

static int BITS_IN_BYTE = 8;

template <typename T>
size_t bits_in() { return sizeof(T) * BITS_IN_BYTE; }

int main() { return bits_in<int>(); }

Since the compiler knows the size of int, it can compute bits_in<int>() at compile time. But since this isn’t guaranteed by the standard, it might not.

Now look at two possible disassemblies for this source code and decide which variant does the computation at compile time and which doesn’t.

Variant A

01021002 in al,dx
01021003 mov eax,dword ptr ds:[01023000h]
01021008 shl eax,2
0102100B pop ebp
0102100C ret

Variant B

003C1000 push 20h
003C1002 pop eax


Well, that’s a no-brainer. Of course variant B does.

On a 32-bit platform, an int is 4 bytes, which is 32 bits, which is 20h in hexadecimal. You might not know the convention by which a function returns its integer result in eax, but you can see that the first variant is long enough to contain an actual multiplication, while the second one has only two lines: one with the precomputed answer and one other.
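The arithmetic behind that 20h can be checked in a few lines. This is a minimal sketch, not part of the original exercise: the names `BITS_IN_BYTE_CONST` and `bits_in_checked` are assumptions introduced here for illustration, and it needs C++11 or later. It also hints at why Variant A's shl eax,2 is a multiplication in disguise: shifting left by 2 multiplies by 4, which is the size of int on that platform.

```cpp
#include <cstddef>

// Hypothetical variation on the example above: a constexpr constant
// lets the result be used where a compile-time constant is required.
constexpr std::size_t BITS_IN_BYTE_CONST = 8;

template <typename T>
constexpr std::size_t bits_in_checked() {
    return sizeof(T) * BITS_IN_BYTE_CONST;
}

// 20h is hexadecimal notation for 32.
static_assert(0x20 == 32, "0x20 is 32 in decimal");

// Shifting left by 2 multiplies by 4, so 8 << 2 is 8 * 4.
static_assert((8 << 2) == 8 * 4, "a left shift by 2 multiplies by 4");

// A static_assert only accepts compile-time constants, so this line
// compiles only if the call is evaluated at compile time.
static_assert(bits_in_checked<int>() == sizeof(int) * 8,
              "bits_in_checked<int>() is a compile-time constant");
```

Unlike the plain static int in the exercise, which the compiler may or may not fold, the constexpr version fails to build if the value cannot be computed at compile time.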

2. Function inlining

Calling a function carries some overhead: the caller prepares the input data in a particular order, execution jumps to another piece of memory, the function prepares its output data, and execution jumps back.

If you only use a function once, you don’t actually have to call it. It makes sense to inline the function body at the place it is called from and skip all the formalities. Compilers can do this for you.

This code:

inline int twice(int x){return x + x;}

int main(){return twice(2);}

May effectively become this:

// not really source code, just explaining the idea
int main()
{
    return 2 + 2; // twice gets inlined here
}

But the standard does not guarantee that all the functions marked inline will get inlined. Now look at these two disassembly variants below and choose the one in which the function twice gets inlined after all.

Variant A

00E71002 in al,dx
00E71003 mov eax,2
00E71008 add eax,2
00E7100B pop ebp
00E7100C ret

Variant B

00261002 in al,dx
00261003 mov eax,dword ptr [x]
00261006 add eax,dword ptr [x]
00261009 pop ebp
0026100A ret
...
008F1010 push ebp
008F1011 mov ebp,esp
008F1013 push 2
008F1015 call twice (08F1000h)
008F101A add esp,4
008F101D pop ebp
008F101E ret


Not really a mystery either. It’s Variant A. You might not know that the instruction to call a function is actually named call, but since Variant A’s disassembly contains no mention of twice, the function must have been inlined.
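To reproduce both variants on your own machine, it can help to tell the compiler explicitly not to inline. The sketch below assumes GCC, Clang, or MSVC. The `NO_INLINE` macro and the functions `twice_inlinable`, `twice_not_inlined`, and `sum_of_both` are hypothetical names introduced here, and the attributes are compiler extensions rather than standard C++. The optimizer still has the final say on the inlinable version.

```cpp
// Compiler-specific spelling of "do not inline this function".
#if defined(_MSC_VER)
#define NO_INLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define NO_INLINE __attribute__((noinline))
#else
// Unknown compiler: no hint available, the macro expands to nothing.
#define NO_INLINE
#endif

// The compiler is free to inline this one, as in Variant A.
inline int twice_inlinable(int x) { return x + x; }

// This one is asked to stay a separate function, so a call
// instruction is expected at the call site, as in Variant B.
NO_INLINE int twice_not_inlined(int x) { return x + x; }

// Both compute the same value; only the generated code should differ.
int sum_of_both(int x) { return twice_inlinable(x) + twice_not_inlined(x); }
```

Building this with optimizations on and asking for an assembly listing (for example -S with GCC or Clang, or /FA with MSVC) should show a call to twice_not_inlined inside sum_of_both and, most likely, no call to twice_inlinable.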

3. Loop unrolling

Just like calling functions, doing loops implies some overhead. You have to increment the counter, then compare it against some number, then jump back to the beginning of the loop.

Compilers know that in some contexts it is more efficient to unroll the loop, that is, to do something several times in a row instead of messing with the counter, the comparison, and the jumps.

So given these two similar variants of source code with their respective disassembly, please choose the one that actually has an unrolled loop.

Variant A

int main(int argc, char**)
{
    int result = 1;
    for(short int i = 0; i < 4; ++i)
        result *= argc;
    return result;
}

And respective disassembly:

00EB1002 in al,dx
00EB1003 mov edx,dword ptr [argc]
00EB1006 mov eax,1
00EB100B mov ecx,4
00EB1010 imul eax,edx
00EB1013 dec ecx
00EB1014 jne main+10h (0EB1010h)
00EB1016 pop ebp
00EB1017 ret

Variant B

#include <cstddef>

int main(int argc, char**)
{
    int result = 1;
    for(size_t i = 0; i < 4; ++i)
        result *= argc;
    return result;
}

With this:

00BF1002 in al,dx
00BF1003 mov ecx,dword ptr [argc]
00BF1006 mov eax,ecx
00BF1008 imul eax,ecx
00BF100B imul eax,ecx
00BF100E imul eax,ecx
00BF1011 pop ebp
00BF1012 ret


And it’s variant B.

Once again, you might not know that j<something-something> is the family of jump instructions and cmp stands for “compare”, but variant B clearly has a repeating pattern of three identical imul lines, while variant A has a counter and a jump back to an earlier address instead.
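Written back as source, the two disassemblies correspond roughly to the two functions below. This is a hand-made illustration rather than compiler output: `power4_rolled` and `power4_unrolled` are hypothetical names introduced here, and the unrolled body simply mirrors Variant B's mov followed by three imul instructions.

```cpp
// Mirrors Variant A: a counter, a multiplication, and a jump back.
int power4_rolled(int x) {
    int result = 1;
    for (int i = 0; i < 4; ++i)
        result *= x;
    return result;
}

// Mirrors Variant B: start from x, then multiply three more times.
int power4_unrolled(int x) {
    int result = x;  // mov  eax,ecx
    result *= x;     // imul eax,ecx
    result *= x;     // imul eax,ecx
    result *= x;     // imul eax,ecx
    return result;
}
```

Note that the unrolled version starts from x instead of 1, so the first of the four multiplications disappears altogether. That matches the three imul lines in Variant B.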

Conclusion

You could argue that these examples were made up deliberately to be obvious. It’s only a half-truth. I did refine them to be more demonstrative, but conceptually they are all taken from my own practice.

Using static dispatch instead of dynamic made our image processing pipeline up to 5 times faster. Repairing broken inlining helped to prevent a 50% performance loss in an edge-to-edge distance function. And changing a counter type to enable loop unrolling is my favorite optimization ever. It only won us about 10% on matrix transformation for software rendering, but its entire cost was changing short int to size_t in one place.

Even with somewhat simplified examples, my point remains valid. You can read disassembly to some degree without learning assembler, and you can certainly benefit from reading it. Of course, without proper skill and knowledge, you might not always succeed, but you would definitely not succeed without trying.


Published by HackerNoon on 2016/12/11