Programming Pearls and Python Fails

Written by n0mad | Published 2016/09/18
Tech Story Tags: programming | javascript | programming-performance | python


Computer science books contain timeless wisdom, but performance advice doesn’t always age well. While reading Programming Pearls by Jon Bentley, I found that modern hardware advances have turned some of its conventional wisdom on its head.

Consider the following problem:

Rotate an array of n elements left by i positions. For example, rotating “abcdefgh” by 3 gives “defghabc”. Is it possible to do this in place, in time proportional to n?

The question models a few problems in real computing scenarios:

  • Moving text in a text editor can be seen as rotating the text to be moved to the new location.
  • Swapping unequal chunks of memory that are next to each other.

Considering the use cases, the operation should not be done lazily.

One algorithm involves a delicate juggling act: take the first element out, then move the element at position i into its place, then the element at 2i into position i, and so on, taking the indices modulo n. This is repeated, starting one position over each time, until the entire array has been rotated.

Credits: Programming Pearls, 2nd Edition, page 14.

Another way to think about this problem is to treat the array as two segments, A and B; rotating the array is then the same as transforming AB into BA. If we can reverse A in place to produce A’ and reverse B to produce B’, then reversing the whole array gives the rotation:

(A’B’)’ = BA

For example, with “abcdefgh” and i = 3: A = “abc” and B = “defgh”, so A’B’ = “cbahgfed”, and reversing that gives “defghabc”.

The juggling algorithm reads and writes to each array element once. The reversing rotate reads and writes to each array element twice. Conventional wisdom would lead us to conclude that the juggling algorithm runs faster. In Programming Pearls, the juggle code is reported to be twice as fast. Does this result still hold true in the 21st century?

Here is an implementation of the juggle rotate in Python:

Juggle Swap
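A minimal sketch of the juggling rotation described above (the function and parameter names are mine; the input can be a list or an array.array of integers):

```python
from math import gcd

def juggle_rotate(arr, n, i):
    # Rotate arr[0:n] left by i positions in place (juggling algorithm).
    # There are gcd(n, i) independent cycles; in each cycle we hold one
    # element aside and pull the others backwards in strides of i.
    i %= n                      # normalise the rotation distance
    if i == 0:
        return
    for start in range(gcd(n, i)):
        temp = arr[start]
        j = start
        while True:
            k = j + i
            if k >= n:
                k -= n          # wrap around the end of the array
            if k == start:
                break
            arr[j] = arr[k]     # move the element i positions ahead into slot j
            j = k
        arr[j] = temp           # the held element lands at the end of the cycle
```

Rotating list("abcdefgh") by 3 with this reproduces “defghabc” from the example above.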

I thought that using a typed Python array (from the array module) would give better performance than a list. Unfortunately, I’d committed the sin of premature optimization.

Running the juggle rotate on a Python array of 10000 elements and repeating 10000 times gives the following result:

……..15 seconds?
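A minimal sketch of such a harness, using timeit with the juggle_rotate sketch above (the rotation distance of 3 is my arbitrary choice):

```python
import timeit
from array import array

def bench(rotate, n=10000, repeats=10000, dist=3):
    # Rotate an n-element typed array left by `dist`, `repeats` times over.
    data = array('l', range(n))
    for _ in range(repeats):
        rotate(data, n, dist)

print(timeit.timeit(lambda: bench(juggle_rotate), number=1), "seconds")
```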

I ran the tests on a 4GHz Intel i7-6700K. As a quick sanity check, my CPU should be able to crunch through a few billion integer operations per second. The juggle rotate moves roughly 100 million integers (10000 elements * 10000 repeats), so 15 seconds seems ridiculously long for the work it has to do.

What happens when a list is used?

12.56 seconds…. eh….

It’s better, but still not great. Reading the CPython source code reveals that during element retrieval, Python creates a PyObject to wrap the result, a cost incurred every time it accesses an integer it has never seen before (only small integers are cached). What if we inline the function?
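As a sketch of the idea (the wrap_index helper name is hypothetical): fold the per-element helper’s body straight into the loop, since CPython pays full call overhead on every Python-level function call.

```python
from math import gcd

def juggle_rotate_inlined(arr, n, i):
    # Same juggling rotation as before, with the per-element index arithmetic
    # written inline instead of calling a helper for every element moved.
    i %= n
    if i == 0:
        return
    for start in range(gcd(n, i)):
        temp = arr[start]
        j = start
        while True:
            k = j + i            # was: k = wrap_index(j, i, n) (hypothetical helper)
            if k >= n:
                k -= n
            if k == start:
                break
            arr[j] = arr[k]
            j = k
        arr[j] = temp
```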

A 5 second improvement via inlining….not bad

Right: CPython does no function inlining. That’s not nice for things like map() and filter(), where you are essentially calling a function over and over again.

What about the reverse rotate?

Starting from the two ends of a segment, we walk inward one element at a time and swap each pair, and we do three such segment reversals. Pretty straightforward.
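A sketch of that in Python (again, the names are mine):

```python
def reverse(arr, start, end):
    # Reverse arr[start:end] in place, swapping inward from both ends.
    end -= 1
    while start < end:
        arr[start], arr[end] = arr[end], arr[start]
        start += 1
        end -= 1

def reverse_rotate(arr, n, i):
    # Rotate arr[0:n] left by i positions via three reversals: (A'B')' = BA.
    reverse(arr, 0, i)   # reverse segment A
    reverse(arr, i, n)   # reverse segment B
    reverse(arr, 0, n)   # reverse the whole array
```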

Setting up a profiling bench….
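A sketch of such a bench, reusing the bench() helper from the timing sketch above and cProfile for per-function cumulative times (my setup, not necessarily the original one):

```python
import cProfile

# Profile one full run of each rotation; sort by cumulative time per function.
cProfile.run('bench(juggle_rotate)', sort='cumulative')
cProfile.run('bench(reverse_rotate)', sort='cumulative')
```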

Let’s go:

At this point, while the reverse function seems pretty slow, everything just seems to be slow in general.

Well, we could run a Python environment with a JIT, such as PyPy.

Let’s give that a go

We’ve managed to achieve a 10x speedup in the juggle function and a 200x speedup in the reverse function. The reverse rotate is now much faster than the juggle rotate for some reason.

Can we improve even further?

With Cython and some type annotations, the code can be compiled, eliminating more interpreter overhead. Here’s the converted function:
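Below is a sketch of what that conversion might look like, written with Cython’s pure-Python-mode annotations (an assumption of mine) and compiled with cythonize; the typed memoryview and index types are what let Cython compile the loop down to plain C.

```python
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def reverse(arr: cython.long[:], start: cython.Py_ssize_t, end: cython.Py_ssize_t):
    # Reverse arr[start:end] in place; with the types above, element access
    # compiles to raw C indexing instead of Python object operations.
    end -= 1
    while start < end:
        arr[start], arr[end] = arr[end], arr[start]
        start += 1
        end -= 1

def reverse_rotate(arr: cython.long[:], n: cython.Py_ssize_t, i: cython.Py_ssize_t):
    # Same three-reversal rotation as before.
    reverse(arr, 0, i)
    reverse(arr, i, n)
    reverse(arr, 0, n)
```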

Note how similar it is to the Python code we wrote before

How did we go?

Wow

As another sanity check: if our CPU can punch through a few billion instructions per second, then going through 100 million integers should take on the order of a few hundredths of a second, and our reverse swap does this in 0.07 seconds.

All of this still left us with one question: why is the reverse rotate function faster than the juggle function, despite doing more work?

I’ve re-implemented both functions in C and profiled them with perf under Linux.

Hmm….

Interestingly, our C code seems to be slower than our Cython-optimized code. How can compiled Python run faster than C?

I’ve increased the number of elements to 1 million integers, with 5000 repeats this time. Let’s rerun the Cython benchmark:

How do our C programs perform?

Important lesson about benchmarks: Always make sure they run long enough so that the results aren’t skewed by other factors.

From our benchmarks, we can deduce that Cython has generated pretty much pure C. By profiling our C program, we should be able to get conclusive results about our performance.

Running perf gives us the final puzzle pieces.

Reverse Rotate

Juggle Rotate

There are a few statistics I’d like to highlight:

  1. Looking at the L1 data cache loads (L1-dcache-loads), the reverse swap does about twice as many loads as the juggle program, which is what we expect: twice as many memory accesses.
  2. The reverse swap has far fewer cache misses, in both the L1 data cache and the last-level cache (LLC), whereas the juggle program incurs quite a penalty from cache misses.

In a nutshell, accessing main memory is expensive compared to doing calculations. On a Skylake CPU, the cost of accessing RAM is 42 cycles + 51 nanoseconds. CPU caches are built on the assumption that code will frequently access data that are close to each other in memory, and that those accesses are close together in time. By keeping such data in a cache, the CPU only has to go to the much faster cache rather than constantly hitting main memory.

Some people may be wondering how there could be 128% cache misses. There’s a complicated answer, but the tl;dr is that the CPU may try multiple times to retrieve a result from a cache.

Looking at the juggle code again:
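The relevant part is the stride in the inner loop (repeated here from the sketch earlier):

```python
k = j + i        # jump ahead by the rotation distance i
if k >= n:
    k -= n       # wrap around the end of the array
arr[j] = arr[k]  # every iteration touches memory i elements away
j = k
```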

The juggle code accesses memory by constantly jumping ahead by a fixed offset. With larger jumps this is detrimental to the CPU cache: successive accesses are not close to each other, which constantly results in cache misses. Looking back at the perf results, there are 32 million last-level cache hits in the juggle rotate compared to only 63 thousand in the reverse rotate.

In the reverse rotate, memory is accessed sequentially from both ends, so the data accesses are close together, and hence we get much better speed, despite reading and writing twice as much.

It is worth noting that both algorithms have the same asymptotic growth: both are O(n). Yet, due to cache usage, they are an order of magnitude apart in run time. It is also worth noting that in C++, the reverse rotate is used as part of the std::rotate implementation.

If you would like to run any of the code in this blogpost, they are all collected here:

FrozenXZeus/pyfails on GitHub: a set of benchmarks against different implementations of Python (github.com)

In Conclusion

  • CACHE ACCESS PATTERNS ARE IMPORTANT! They can often make or break performance.
  • Beware of premature “optimizations”. Verify assumptions with careful profiling.
  • Consider different Python interpreters and Cython when deploying code to production. An example is Pyston, a Python JIT written by Dropbox.

Published by HackerNoon on 2016/09/18