Building a hybrid spin mutex in C++

Written by phostershop | Published 2016/12/21
Tech Story Tags: programming | concurrency | cplusplus


Knock knock. “Race condition.” “Who’s there?”

Doing laundry in the dorms at UCLA was a lesson in both gambling and cold war tactics. The stage is set with a hundred undergrads, two washing machines, and two dryers. First you sign up for a time slot. When the time comes, you notice both washers are already full of someone else’s day-old clothes. With a grumble, you shuffle them from the machines to a dryer, and start your wash. Then you go to lunch hoping your clothes are still there when you get back. Upon your return, the clothes are there, but are now a dripping mass on top of one of the dryers and all four machines are running. You check on the room every half hour until a dryer pops free. Three hours and seven trips later, your laundry is clean, complete with a foreign yellow sock you somehow acquired along the way.

Little did I know how dorm laundry would prepare me for the world of asynchronous software development.

At the office, I am teaching a single-threaded dog some multithreaded tricks. This requires using mutexes† to block out critical sections and avoid shared resource stomping. There are two types of mutexes that I have been playing with, and they differ in how they behave while a thread waits to lock.

Blocking Mutexes

A blocking mutex will halt the thread until acquisition. It is useful because it consumes negligible computer resources while blocked. This leaves the CPU free to perform other tasks, including whatever other task currently owns the mutex. All this goodness is not cheap, however: it takes a decent amount of time to block a thread. If your critical section is brief, you could be spending a disproportionate amount of time protecting it instead of running it.

Generally, blocking mutexes should be used when your critical section will take a while, such as I/O operations, calling out to the OS, or doing laundry in a collegiate dorm.
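This is what C++11’s std::mutex gives you out of the box. As a minimal sketch (the log file name and function here are just illustrative), guarding a slow, I/O-bound critical section might look like:

    #include <fstream>
    #include <mutex>
    #include <string>

    std::mutex log_mutex; // guards access to the shared log file

    void append_to_log(const std::string& line) {
        // lock() blocks the calling thread until the mutex is free,
        // consuming essentially no CPU while it waits.
        std::lock_guard<std::mutex> guard(log_mutex);

        std::ofstream out("app.log", std::ios::app); // slow: touches the filesystem
        out << line << '\n';
    } // guard unlocks the mutex here, even if an exception was thrown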

Spinning Mutexes

A spinning mutex will enter into an infinite loop (spin) until acquisition. It is useful because it can resume very quickly once the lock has been obtained, resulting in minimal overhead while protecting a critical section. However, since the thread remains active on the CPU, it can reduce (or eliminate!) the ability of the CPU to do other work††. If your critical section is long, you could be spending a disproportionate amount of time protecting it instead of running it.

Generally, spin mutexes should be used when your critical section is brief, such as reading or writing a memory-resident data structure.
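The standard library doesn’t ship a spinlock, but the textbook version is a busy loop over an atomic flag. A minimal sketch (this is not the hybrid mutex built later in this post):

    #include <atomic>

    class naive_spin_mutex_t {
        std::atomic_flag _flag = ATOMIC_FLAG_INIT;

    public:
        void lock() {
            // Spin: burn CPU until this thread successfully claims the flag.
            while (_flag.test_and_set(std::memory_order_acquire))
                ; // busy-wait
        }

        void unlock() {
            _flag.clear(std::memory_order_release);
        }
    };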

Finding a middle ground

The dichotomy between the two mutex behaviors has left me stuck more than once. What if I was trying to protect a global resource that occasionally required a call to the OS? In those cases a blocking mutex is not a good fit, as modifying the memory-resident structure is pretty quick. However a spin mutex would be equally bad, because I do need to go to the OS time and again, and it would be a pessimization to spike a CPU while doing so.

So what are we looking for? Ideally we’d have a mutex that does both: it spins for a time in anticipation of a quick lock, but eventually blocks the thread if things take too long, relinquishing the CPU. This would be a perfect solution for my above scenario: most of the locks would be acquired during the spin phase, but if the resource is delayed (say, because an OS call was necessary) the thread will block.

Can it be done? Sure. They’re called hybrid or adaptive mutexes†††, and they’ve been around forever.

Does C++11 have one? Well, eh… not really. Even with threading primitives now native to C++11, many standard library implementations go straight to their native threading counterparts under the hood (e.g., pthreads on macOS), so the automatic use of hybrid spinlocks should not be assumed.

Besides, it would be a mistake to use spinlocks by default, hybrid or otherwise. If a spinning mutex ever spins for as long as it would have taken to block, a blocking mutex is preferred. Since the operating system doesn’t know how big your critical sections will be, it assumes the worst and (rightly) provides blocking as the de facto mutex behavior. Additionally, if your application is ever run on a single-CPU system, a spinning mutex will always spin pointlessly, then block, making it slower than blocking to begin with.

So I want a hybrid mutex, but C++11 only provides blocking mutexes. Can I make this horse into a zebra? It’s worth a try. Fortunately std::mutex implements the try_lock() routine, a nonblocking routine that returns a boolean indicating the success of locking the mutex. We can use try_lock() to drive the spin phase of a hybrid mutex, then lock() when needed:

    class spin_mutex_t {
        std::mutex _m;

    public:
        void lock();
        void unlock();
    };

The only other member variable required for our class is an adjustable high-water mark that determines when the mutex should block:

    class spin_mutex_t {
        std::mutex             _m;
        std::atomic<long long> _p{1};

    public:
        void lock();
        void unlock();
    };

The most interesting elements of the code are in the lock() routine, which I’ll detail here:

    void spin_mutex_t::lock() {
        using clock_t = std::chrono::high_resolution_clock;

        auto      before = clock_t::now(); // timestamp the start of this lock attempt
        long long measured{0};

        // Spin phase: keep trying to grab the mutex without blocking.
        while (!_m.try_lock()) {
            measured = (clock_t::now() - before).count();

            // Spun for twice the predicted acquisition time? Give up and block.
            if (measured >= _p * 2) {
                _m.lock();

                break;
            }
        }

        // Fold this attempt's measurement into the predictor.
        _p += (measured - _p) / 8;
    }

The routine starts by timestamping when it begins, which it will use in deciding if/when to block as well as adjusting the predictor.

The spin loop repeatedly calls try_lock(). If it succeeds, we have locked during the spin phase, and move on. If not, we measure the amount of time elapsed while spinning. If that elapsed time reaches twice the predictor, we give up on spinning and fall through to the blocking phase by calling lock().

At this point in the routine the mutex is locked, regardless of the phase in which it was acquired. The only step that remains is to adjust the predictor based on what we learned this last time through the lock routine. The technique is to derive the next predictor value with a proportioned combination of the previous predictor and the measured value:

[Figure: Adjusting the predictor]
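Spelled out, that last line of lock() nudges the predictor one eighth of the way from its old value toward the latest measurement, which is just an exponential moving average with a smoothing factor of 1/8:

    // _p_next = _p + (measured - _p) / 8
    //         = (7/8) * _p + (1/8) * measured
    _p += (measured - _p) / 8;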

Results

With such a basic implementation, I decided to see how it behaved. The first thing I added was a probe callback to get numbers out of the mutex based on its behavior after every lock acquisition. I then ran a battery of tests.
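The probe itself isn’t part of the listing above. One way it could be wired up (the member name and signature here are my own guesses, not necessarily what the GitHub code does) is an optional std::function that lock() invokes with the phase in which the lock was acquired, right before adjusting the predictor:

    #include <atomic>
    #include <chrono>
    #include <functional>
    #include <mutex>

    class spin_mutex_t {
        std::mutex             _m;
        std::atomic<long long> _p{1};

    public:
        // Hypothetical instrumentation hook: called with 0 if the lock was
        // acquired while spinning, 1 if the thread had to block.
        std::function<void(int)> probe;

        void lock() {
            using clock_t = std::chrono::high_resolution_clock;

            auto      before = clock_t::now();
            long long measured{0};
            int       phase{0};

            while (!_m.try_lock()) {
                measured = (clock_t::now() - before).count();

                if (measured >= _p * 2) {
                    _m.lock();
                    phase = 1; // fell through to the blocking phase

                    break;
                }
            }

            _p += (measured - _p) / 8;

            if (probe) probe(phase);
        }

        void unlock() { _m.unlock(); }
    };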

Test 1: “0 slow”

Then I spun up five threads that each locked the spin_mutex_t 100 times as quickly as they could, and charted the results.
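As a rough sketch of that harness (the thread and iteration counts come from the description above; the scaffolding is my own, assuming the spin_mutex_t shown earlier):

    #include <thread>
    #include <vector>

    spin_mutex_t mutex; // shared instance under test

    void hammer() {
        for (int i = 0; i != 100; ++i) {
            mutex.lock();   // empty critical section: lock and release immediately
            mutex.unlock();
        }
    }

    int main() {
        std::vector<std::thread> threads;

        for (int i = 0; i != 5; ++i)
            threads.emplace_back(hammer);

        for (auto& t : threads)
            t.join();
    }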

[Chart: 0 slow acquisition by phase]

This bears some explanation. The probe callback was told in which phase the lock was acquired: 1 for blocking, 0 for spinning. The red dots represent each of the data points, while the black line represents a rolling average of the last 50 lock attempts. The trend clearly starts very high, which is to be expected as the predictor adjusts to how long it takes to lock and unlock immediately. About 70 locks in, the predictor is accurate enough that locks are acquired in the spin phase, and the trendline starts to dip down. Towards the end of the test, it is plain to see that more locks are acquired during the spin phase than by blocking.

Test 2: “1 slow”

The next test is designed to imitate the scenario described above: most threads have a short critical section time, but some might be in there a while. I spun up the same five threads, each immediately unlocking save for one which has a 3ms delay:

[Chart: 1 slow acquisition by phase]

The behavior is similar, but the trendline settles closer to 50% than in the 0-slow test.
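For completeness, the only change this test needs over the harness sketched above is that a single thread sleeps while holding the lock; hypothetically, something like:

    #include <chrono>
    #include <thread>

    void hammer(bool slow) {
        for (int i = 0; i != 100; ++i) {
            mutex.lock();

            if (slow) // the one slow thread holds the lock for ~3ms per iteration
                std::this_thread::sleep_for(std::chrono::milliseconds(3));

            mutex.unlock();
        }
    }

    // launched as threads.emplace_back(hammer, i == 0), so only the first thread is slow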

Test 3: “5-slow”

This is a worst-case scenario: every thread is saddled with the 3ms slowdown. Although such a case would be better served by a blocking mutex, I wanted to make sure the behavior was at least tolerable, even though it would be a pessimization:

[Chart: 5-slow acquisition by phase]

It is interesting to note that as more threads are given the delay, the variation of the trendline goes down. I suspect this is due to the side effects of the operating system (e.g., yielding) becoming more muted as the predictor values become larger.

Moving Forward

There are several questions and next steps I am interested to pursue with the implementation:

  • Are the slowdown times appropriate? What would be better numbers to model real-life use cases?
  • Auto-detecting when the spin phase would exceed the time required to block the thread, and blocking immediately when that would be faster.
  • Fine-tuning the constants of the lock. Could adjusting the time-to-block multiplier make the mutex more performant overall? What about the multiplier used to adjust the predictor?
  • We know the critical section ends with a call to unlock(). Might that be used to adjust the predictor, or to improve performance in lock() by anticipating how long the next critical section might be? For example, if the critical section is found to take more time than blocking the thread would, maybe it is worthwhile to simply block at the get-go of the next call to lock()?

Open Source

I have made the code available on GitHub. Please let me know what you think, including snafus or ways in which I might improve things.

Footnotes

† For the uninitiated, a critical section is a block of code wherein no more than one thread should be running at a time. This is solved by a thread first acquiring (or locking) a mutex that guards the critical section. Mutexes are built such that locking is an atomic operation: it will only complete on one thread at a time, regardless of how many threads are attempting to lock. All competing threads will be unable to continue until the mutex is unlocked. It’s not unlike the conch in Lord of the Flies, but with less bloodshed.

†† It should be noted here that threads waiting on either type of mutex are subject to the OS, which can preempt them at any time. This is similar to a block but is imposed by the OS, not the mutex architecture. I won’t be covering OS thread scheduling in any more detail.

††† See, for example, A description of pthread’s adaptive mutex from the original implementer.

