Everything You Need to Know About Multithreading: The Realistic Release Consistency [Part 2]

Written by xylophoon | Published 2020/11/04
Tech Story Tags: multithreading | c++ | assembly | programming | software-development | atomic-swaps | software-engineering | threads

TLDR: The most common memory consistency model in practice is release consistency. The beauty of this model is that while it is much cheaper for systems to implement, we can formally prove that, given proper locking, it is equivalent to sequential consistency. Locks are so essential to release consistency's intuition and implementation that we first build a spinlock from scratch, then use it to show why properly locked programs keep behaving as if they were sequentially consistent.

In our last installment, we introduced sequential consistency, the threading programming contract between programmers and computer systems.
We are about to dive further into the most common memory consistency model in practice, release consistency. The beauty of this model is that while it is much cheaper for systems to implement, we can formally prove that, given proper locking, it is equivalent to sequential consistency.
After finishing this article, C++'s
std::memory_order
documentation will actually look like English to you. You will also feel more at ease navigating the waters of crafting your own synchronization methods, without resorting to off-the-shelf solutions such as library-provided locks.
But first, it's necessary to make a detour to get acquainted with the staple of multithreaded programming: locks, which are so essential to release consistency's intuition and implementation.

Locks

Use case
In our last episode, we covered an example in which two threads race to update two variables
i
and
j
. Let's reuse it as Example 1 in this article:
int i = 0;
int j = 0;
int output[2];

void func0() {
  i = 1;
  output[1] = j;
}

void func1() {
  j = 1;
  output[0] = i;
}
Honestly, this was only a contrived example, perfectly suited to demonstrating broken memory consistency. The most common synchronization need actually arises from the necessity of grouping multiple memory operations into one indivisible unit. This need is as old as the concept of multithreading; it even predates SMP and memory consistency models. Let's look at a more realistic Example 2.
int id = 0;
int my_id[2];

void func0() { my_id[0] = id++; }
  
void func1() { my_id[1] = id++; }
Here we have two threads wanting to generate their own unique IDs. The hazard here is that the innocent
id++
increment is actually two memory operations slammed together. It's commonly referred to as a read-modify-write. If we paste this snippet into Compiler Explorer, we can see that on x86 each function takes roughly 4 instructions -- read, add, and write for
id++
and assigning the value to
my_id
:
func0():                              # @func0()
        movl    id(%rip), %eax
        leal    1(%rax), %ecx
        movl    %ecx, id(%rip)
        movl    %eax, my_id(%rip)
        retq
Of course, we're not counting the function return as a proper instruction here.
The race condition is evident. If both threads do their reads before the other completes the whole read-modify-write sequence, it's possible for both threads to get an ID of 0.
We can also write a full program that produces this result:
#include <iostream>
#include <thread>

int id = 0;
int my_id[2];

void func0() { my_id[0] = id++; }

void func1() { my_id[1] = id++; }

void run_test() {
  id = 0;

  std::thread t0(func0);
  std::thread t1(func1);

  t0.join();
  t1.join();
  std::cout << my_id[0] << my_id[1] << std::endl;
}

int main() {
  for (;;) {
    run_test();
  }
  return 0;
}
Do you see that we're fundamentally dealing with a different issue than Example 1 here? Example 1 can break because instructions can be reordered; here we break because instructions may be interleaved, even when they execute entirely in program order!
One way to solve the problem is to wrap
id
in an
std::atomic
construct and convert the increment to its
fetch_add
API, which just generates a special hardware atomic instruction that takes care of the read-modify-write business. But that's just a shortcut. Prior to the C++11 standard, we would have just put the read-modify-write sequence of
id
in a good old critical section, protected by locks, so that access to it is mutually exclusive.
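For reference, here is a minimal sketch of that shortcut (my own illustration), with Example 2's counter wrapped in std::atomic so the increment becomes a single indivisible read-modify-write:
#include <atomic>

std::atomic<int> id{0};
int my_id[2];

// fetch_add(1) returns the previous value and performs the whole
// read-modify-write as one indivisible hardware operation, so the two
// threads can never both observe 0.
void func0() { my_id[0] = id.fetch_add(1); }

void func1() { my_id[1] = id.fetch_add(1); }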
Spinlock implementation
How are locks implemented, though? It turns out that locking is itself just a read-modify-write. Let's take a look at the simplest kind of lock -- the spinlock. Here is a simplistic working implementation using C++'s
std::atomic
. In the old days, we would have written inline assembly ourselves.
class spinlock {
public:
  void lock() {
    for (;;) {
      bool expected_val = false;
      if (busy.compare_exchange_strong(expected_val, true))
        return;
    }
  }
  void unlock() { busy.store(false); }

private:
  std::atomic<bool> busy = false;
};
It's called a spinlock because acquisition keeps spinning on the lock state, hoping it won't be busy the next time around, without ever blocking. A spinlock's state can be as small as 1 bit. Compiler Explorer shows that acquiring the lock generates a
cmpxchg
(standing for compare-and-exchange) instruction:
main:                                   # @main
        movb    $0, -8(%rsp)
        movb    $1, %cl
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        xorl    %eax, %eax
        lock            cmpxchgb        %cl, -8(%rsp)
        jne     .LBB0_1
        movb    $0, -8(%rsp)
        xorl    %eax, %eax
        retq
This instruction does the following: 1) it reads out the lock state, 2) compares it to the expected "not busy" value, and 3) if equal, writes the "busy" value back to the lock state. The operation can succeed or fail, and if it fails we just keep retrying until we actually acquire the lock. Either way, the three steps are done together without interruption. Again: read, modify, write.
The lock release doesn't have to be retried; we unconditionally write a "not busy" value to the lock state.
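To make the semantics concrete, here is a hedged pseudocode sketch (my own, not the standard library's implementation) of what compare_exchange_strong does, written as ordinary C++ but with the understanding that the whole body executes as one indivisible step -- on x86 it compiles down to the single lock cmpxchg shown above:
#include <atomic>

// Illustrative only: imagine this entire function body runs atomically.
bool compare_exchange_sketch(std::atomic<bool> &state,
                             bool &expected, bool desired) {
  bool observed = state.load();   // 1) read the lock state
  if (observed == expected) {     // 2) compare with the expected value
    state.store(desired);        // 3) if equal, write the new value
    return true;                 // acquisition succeeded
  }
  expected = observed;           // on failure, report what was actually seen
  return false;                  // the caller will retry
}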
Note that for now you can pretend we have sequential consistency here. You will have to trust me that this lock code works in both uniprocessor and multiprocessor environments; after finishing the article you will see why.
Now let's protect the
id
variable using our spinlock, so that both IDs are guaranteed to be unique.
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

class spinlock {
public:
  void lock() {
    for (;;) {
      bool expected_val = false;
      if (busy.compare_exchange_strong(expected_val, true))
        return;
    }
  }
  void unlock() { busy.store(false); }

private:
  std::atomic<bool> busy = false;
};

spinlock m;
int id = 0;
int my_id[2];

void func0() {
  std::lock_guard g(m);
  my_id[0] = id++;
}

void func1() {
  std::lock_guard g(m);
  my_id[1] = id++;
}

void run_test() {
  id = 0;

  std::thread t0(func0);
  std::thread t1(func1);

  t0.join();
  t1.join();
  std::cout << my_id[0] << my_id[1] << std::endl;
}

int main() {
  for (;;) {
    run_test();
  }
  return 0;
}

Release consistency

Finally, we've gotten around to the main course of this episode. Recall that sequential consistency, for the most part, requires a global ordering of all memory operations, which is very hard to implement efficiently on modern processors. x86 TSO is one attempt at relaxing memory access ordering, and we saw in the last episode that it breaks sequential consistency if accesses to shared variables aren't properly ordered.
People have noticed that, more often than not, shared variables ARE actually protected by locks. As we mentioned before, this habit predates multiprocessors, so it has nothing to do with coping with memory reordering. But since locks are such special memory locations, it's tempting to think: instead of enforcing memory order everywhere, maybe we can do something clever with just the locks to maintain the sequential consistency illusion, while letting the general population of instructions slide?
This is where release consistency (RC) comes in. There are two flavors of RC, RCsc and RCpc, and for our purposes they are equivalent. Let's start with RCsc, which is easier to understand but less practical to implement than RCpc.
RCsc
These points summarize RCsc:
  1. Memory accesses are categorized into a) synchronization ops, i.e. acquiring and releasing a lock, and b) ordinary ops.
  2. No local memory accesses can be reordered after a release op.
  3. No local memory accesses can be reordered before an acquire op.
  4. Synchronization ops themselves observe sequential consistency, hence the "sc" subscript in the consistency name.
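Before the proof, here is a minimal C++ sketch (my own illustration, not part of the original argument) of how rules 2 and 3 surface in code: the release store keeps the ordinary write before it, and the acquire load keeps the ordinary read after it. Rule 4 additionally asks the sync ops themselves to be sequentially consistent, which C++ would express with std::memory_order_seq_cst.
#include <atomic>

int data = 0;                     // ordinary shared location
std::atomic<bool> ready{false};   // synchronization variable

void producer() {
  data = 42;                                      // ordinary write
  ready.store(true, std::memory_order_release);   // rule 2: nothing above moves below this
}

void consumer() {
  while (!ready.load(std::memory_order_acquire))  // rule 3: nothing below moves above this
    ;
  int observed = data;                            // guaranteed to read 42
  (void)observed;
}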
This is where I want to put in a thousand words to describe the mechanism. So let's throw in a picture instead. This illustrates why these points will maintain the sequential consistency illusion.
Imagine we have two critical sections (CS) executing on two processors A and B, referred to as CS-A and CS-B respectively. These are shown as two boxes in the diagram above. The critical sections are protected by a single lock. Inside the CS'es there are some accesses to locations private to the processors themselves, and some accesses to shared variables. It's easy to prove that all accesses in CS-A will strictly happen before all accesses in CS-B, and this is observed globally. The most important trick is the sequential consistency ordering between the CS-A release and the CS-B acquire established in rule No. 4. That is,
CS-A release < CS-B acquire
where the "<" sign denotes the global happens-before relationship.
From rule No. 2 and 3, any op in CS-A will finish before the CS-A release. Also any op in CS-B will finish after the CS-B acquire. So we have the following:
Op in CS-A < CS-A release
, and
CS-B acquire < Op in CS-B
Finally, transitively, any op in CS-A will globally happen before any op in CS-B.
The private accesses are irrelevant here, because at any time they are done by a single processor, and the global order is defined by that processor's order, which is guaranteed to be equivalent to sequential consistency, even though some reordering might happen.
Our strong happens-before relationship also puts the shared accesses in the same situation. At any time there is only one processor doing those accesses, exclusively, so the order in which it executes them can be taken as the global order. And a single processor's own order must be equivalent to a sequential consistency order -- the golden rule no system has dared to break so far.
You can see that all release consistency is banking on is that there shouldn't be any shared accesses outside the critical sections. If you make them anyway, you're on your own for defining the access ordering. This is why C++11 effectively outlawed accesses unprotected by locks or
std::atomic
and related constructs. Going back to Example 1 at the very beginning, as an alternative to using
std::atomic
, we could also use a proper lock to achieve the same purpose, as sketched below.
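Here is a minimal sketch of that idea (my own, using std::mutex for brevity; our spinlock would work just as well): with both critical sections guarded by the same lock, one runs entirely before the other, so the "both outputs are 0" outcome from the last episode becomes impossible.
#include <mutex>

int i = 0;
int j = 0;
int output[2];
std::mutex m;

void func0() {
  std::lock_guard<std::mutex> g(m);  // acquire
  i = 1;
  output[1] = j;
}                                    // release at scope exit

void func1() {
  std::lock_guard<std::mutex> g(m);  // acquire
  j = 1;
  output[0] = i;
}                                    // release at scope exit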
RCpc
We just proved that as long as data is always protected by locks, RCsc is equivalent to sequential consistency. However, ruminating over RCsc's requirements, you might find it very hard to implement on actual processors. It asks for two kinds of ordering enforcement: sequential consistency between sync ops, and a local ordering between sync and ordinary ops. Think about how you would provide that interface when you design instructions. Common architectures just give you local memory fences. For example, x86 has an MFENCE instruction that prevents local memory operations from being reordered across it.
Now RCpc comes along and tells us to stop worrying. It says sequential consistency between sync ops may be more than we need: we can replace it with a consistency model called processor consistency, and the world will keep revolving.
Processor consistency assumes that each processor has its own local memory (which is a very realistic assumption, since commonly "cores" have their own private L1 cache).
I have deliberately refrained from linking a Wikipedia definition of processor consistency here. If you must look, I have to caution you that this is one of the rare cases where Wikipedia is wrong! Look at one of its figures:
This is an example given to explain processor consistency. What this says is Processor1 issues a write of value
1
to location
x
. Processor2 receives that value, and writes
2
to the same location
x
. Processors 3 and 4 observe the two writes in different orders. This is definitely not allowed under processor consistency, because cache coherency protocols would have prevented it. This article is getting long, so maybe we will go through cache coherency when we look at some locking performance issues. For now, this survey paper sums up the concepts very nicely, in case some of you have extra time for extensive reading.
Essentially, processor consistency is only concerned with memory ordering with respect to a single memory location. Suppose we have a single lock: processor consistency for that lock can be thought of as a series of writes to the lock, and these writes are globally ordered. Reads from different threads, however, may observe them at different times.
To illustrate why processor consistency isn't a problem in the context of release consistency, let's look at the figure above and consider a theoretical case where three processors -- yellow, purple, and green -- are executing a few operations. First, yellow grabs the lock using a compare-and-exchange instruction, which is three ops executed without interruption. Shortly after, it releases the lock.
The write operations (acquire and release) are propagated to purple first. On seeing the lock released, purple seizes the opportunity and acquires the lock for its own use.
Meanwhile, green only sees the release from yellow, and it tries its 3-op sequence to acquire the lock too. But processor consistency dictates that this acquire has to fail, because it is a write op and it has to fit into the global ordering of writes to the lock. So no harm is done -- green can simply retry later without jeopardizing the lock semantics.
In reality, cache coherency protocols usually let only one processor at a time execute the read-modify-write on the lock's cache line. So realistically the chart will look more like this:
You might think this looks awfully like sequential consistency, but it's not. We are only dealing with one memory location here (the lock), whereas sequential consistency mandates a total operation ordering, across all memory locations.
In cases where you need to acquire multiple locks to get into the critical section, this model guarantees correctness as well. The proof is left to you, the reader, as an exercise. 😀
This is great news! As long as a processor implements processor consistency as its default memory consistency, we can safely build on top of it to present programmers with sequential consistency, provided all shared memory accesses are protected by locks. In reality processor consistency is a very reasonable thing to assume, given how cache coherency is implemented. More on that topic later.
It might be obvious, but let's spend a few words on building release consistency on top of processor consistency. Given the RCpc requirements, with native processor consistency the only thing missing is the local ordering between acquire/release ops and ordinary memory operations.
As we said, the instruction set architecture usually gives you some control in the form of barrier or fence instructions that enforce local ordering. x86 has three of them -- LFENCE, SFENCE, and MFENCE -- but only MFENCE is actually of any use to us. ARM, for example, has DMB and DSB instructions that serve a similar purpose. Now we're good to go!
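As an illustration (my own sketch, not from the original), here is roughly how that missing local ordering can be expressed portably in C++ with standalone fences; the compiler lowers them to the appropriate architecture-specific barrier instructions only where the hardware actually needs one:
#include <atomic>

std::atomic<bool> locked{false};
int shared_data = 0;

void critical_section_sketch() {
  // Spin until we flip the lock from free to busy (relaxed read-modify-write)...
  while (locked.exchange(true, std::memory_order_relaxed))
    ;
  // ...then an acquire fence keeps later ordinary accesses from being
  // reordered before this point (the "acquire" half of the local ordering).
  std::atomic_thread_fence(std::memory_order_acquire);

  shared_data++;  // ordinary op inside the critical section

  // A release fence keeps earlier ordinary accesses from being reordered
  // after the upcoming unlock write (the "release" half)...
  std::atomic_thread_fence(std::memory_order_release);
  locked.store(false, std::memory_order_relaxed);  // ...the release itself
}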
Even though there is a myriad of consistency models, release consistency is the weakest one that still works. That's why you can ignore the other consistency models -- causal consistency, weak ordering, PowerPC's model, and so on. Even x86 TSO isn't all that interesting to understand, aside from the fact that it's the consistency model of THE x86.

Coda: Revisit our spinlock implementation

Now that you understand how release consistency works, you can probably see that our spinlock implementation from the beginning can be improved somewhat: the atomic accesses to the lock state don't have to be sequentially consistent.
There are other spinlock implementations from established organizations. One of the simplest is in Facebook's folly library. Our example differs from it only a tiny bit, in that:
  1. Folly tries to avoid sequential consistency when possible.
  2. If a spinlock acquire attempt fails, it executes the x86 PAUSE instruction to wait for a tiny bit. This hints to the processor to back off from speculative execution and make room for the other hyperthreads on the same core (a rough sketch of this follows the list).
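Here is a hedged sketch of how that PAUSE hint might be folded into our retry loop (my own illustration using the x86-specific _mm_pause intrinsic from <immintrin.h>, not folly's actual code):
#include <atomic>
#include <immintrin.h>  // _mm_pause, x86-specific spin-wait hint

class spinlock_with_pause {
public:
  void lock() {
    for (;;) {
      bool expected_val = false;
      if (busy.compare_exchange_strong(expected_val, true))
        return;
      _mm_pause();  // we lost the race; relax the core briefly before retrying
    }
  }
  void unlock() { busy.store(false); }

private:
  std::atomic<bool> busy = false;
};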
As for point 1, we can easily incorporate the memory order weakening into our implementation:
class spinlock {
public:
  void lock() {
    for (;;) {
      bool expected_val = false;
      if (busy.compare_exchange_strong(expected_val, true,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed))
        return;
    }
  }
  void unlock() { busy.store(false, std::memory_order_release); }

private:
  std::atomic<bool> busy = false;
};
Now there's a funny thing. I won't paste it here, but if you compare the binaries generated from the original spinlock and this acquire/release version, you won't observe the tiniest bit of difference. Why? At the very least, wouldn't you expect some memory fence instructions to disappear in the new implementation?
The reason is the
cmpxchg
instruction itself brings an implicit memory barrier. So a
compare_exchange
call on x86 with weakened memory ordering isn't all that meaningful, because the architecture doesn't give the compiler an option.
As for the release, remember that x86 TSO only moves loads forward, letting them bypass earlier stores to different locations? It never reorders operations in the other direction, so the "release" semantics comes for free.

Recapitulation and what's to come

In this post we learned:
  1. Most of the time, shared variables are protected by locks. This has been the case since the uniprocessor days.
  2. Release consistency takes advantage of this fact, and presents a much more feasible model for hardware, while maintaining the illusion of sequential consistency when variables are properly protected by locks.
  3. Because lock acquisition can be retried, release consistency can work with weaker consistency models between synchronization ops.
  4. On x86, no extra memory fence is needed for locks, because the relevant instructions already act as fences.
This is quite a long post, and thank you for following along this far. With the fundamental architectural concepts finally out of the way, in our next article we'll switch gears and discuss some very practical topics. We'll first introduce some operating system concepts and the spinlock's kin, the mutex. We'll then look at how they perform in a very typical workload in real programming languages, pointing out potential directions for more advanced synchronization techniques.

Written by xylophoon | Xoogler, Ph.D. in computer systems. Startup founder in the San Francisco Bay Area.
Published by HackerNoon on 2020/11/04