The C++ Memory Model, or Why We Can't Have Nice Things

This post gives a brief overview of the memory model problem, its history, and the approach taken by C++. Memory models are a notoriously difficult subject justifying book-length introductions, so many details are left out - I consider this post a refresher on the topic with links to relevant literature. Please see referenced materials in footnotes for a more serious introduction.

The Problem

Here’s an over-simplified mental model of how a C++ program is executed on a computer:

     compiler              CPU(s)
C++ -----------> assembly --------> memory reads/writes

While not particularly accurate, it calls to attention the two steps of transformation a program goes through before operating on memory: the compiler, and the CPU. On modern computers, both steps may re-order memory operations in the original the program, in order to optimize its performance. Compilers achieve this through literal program transformation, emitting assembly instructions out of order with respect to the original program. CPUs achieve this either directly by executing instructions out-of-order, or indirectly through the use of per-core caches and write buffers.

Such re-ordering has to follow certain rules to preserve the correctness of the program. On uniprocessor systems, the rule is simple - the optimized execution of a program must have the same observable behavior as a non-optimized, statement-by-statement sequential execution. In other words, the optimized program must provide the illusion of sequential execution. This rule provides programmers the ability to reason about a program without knowledge of the mountain of optimizations it goes through.

Unfortunately, the world has moved beyond uniprocessor architectures. Symmetric Multiprocessing (SMP) systems existed in the 1960s, and the first commercially available multi-processor CPU was released in 2001. Any computer manufactured after 2010 is likely to have multiple CPU cores. Here, I’m using SMP to loosely refer to any system where all CPU cores share the same view of memory, including architectures like NUMA. Shared memory makes coordination among processor cores easy - one may simply access the same variable from threads running on different cores - but also makes it possible for one core to observe the memory effects of another core’s possibly out-of-order instructions. The sequential execution illusion is on longer “free” - extra measures must be taken if we want to preserve it.

This is really a cross-cutting concern for two separate systems - the hardware (i.e. the combination of CPU, cache, memory module, etc), and the compiler (i.e. the high-level programming language). In order to support multi-threading on a multi-core architecture, both the hardware and the language have to specify how memory operations behave when shared variables are accessed from multiple threads. In addition, this specification had better hide the implementation details of the hardware and the compiler, so that programmers can continue to reason about their programs. This is the problem addressed by memory models.

The Theory

This is not a new problem. Much research has already gone into writing correct multi-threaded programs on multi-core computers. Intuitively, when accessing shared memory in a multi-threaded program, we’d still like each memory operation from a single thread to happen atomically, so a correct execution of a multi-threaded program should be some interleaving of operations from each thread. Lamport formalized this condition as sequential consistency¹:

… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Things would be simpler if we stopped here and enforced that hardware and programming languages must maintain the illusion of SC. Unfortunately, guaranteeing sequential consistency unconditionally is prohibitively expensive, ruling out fundamental optimizations like write buffers².

Researchers realized that we can divide memory operations into synchronization operations used to order events across cores, and other operations that read or write data. By communicating synchronization operations explicitly to hardware, we enable optimizations to be applied to “data” operations, achieving higher performance. This, combined with a desire to define a formal contract between software and hardware, led to the Data-Race-Free-0 (SC-DRF0) synchronization model³:

A program obeys the synchronization model Data-Race-Free-0 (DRF0), if and only if

all synchronization operations are recognizable by the hardware and each accesses exactly one memory location, and

for any execution …, all conflicting accesses are ordered by the happens-before relation corresponding to the execution.

Informally, the happens-before relation is the combination of program order (sequential op order on each thread) and synchronization order (order of synchronizing ops on the same memory location). A data race, as implied from the above definition, is constituted by two or more accesses to the same memory location that are (1) not ordered by happens-before, and (2) not all reads (i.e. conflicting).

In other words, in SC-DRF0, the hardware promises to appear sequentially consistent to the software if the software does not introduce data races. This simple model gives hardware vendors space to optimize performance, while allowing software programmers to keep reasoning about sequential execution without knowing the intricate details o implementation. SC-DRF0 was initially conceived as a contract between hardware and assembly instructions, but later became the basis for the memory models of all multi-threaded programming languages, including C, C++, Java, JavaScript, Rust, and Swift⁴.

C++11 and SC-DRF0

So far, we’ve described the first half of the atomics and memory model proposal for C++11, which is summarized as the following in the C++11 standard⁵:

… a C++ program can have more than one thread of execution (a.k.a. thread) running concurrently … The execution of the entire program consists of an execution of all of its threads. [Note: Usually the execution can be viewed as an interleaving of all its threads. However some kinds of atomic operations, for example, allow executions inconsistent with a simple interleaving, as described below. —end note ]

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. [ Note: It can be shown that programs that correctly use mutexes and memory_order_seq_cst operations to prevent all data races and use no other synchronization operations behave as if the operations executed by their constituent threads were simply interleaved, with each value computation of an object being taken from the last side effect on that object in that interleaving. This is normally referred to as “sequential consistency”. However, this applies only to data-race-free programs, and data-race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation. — end note ]”

As hinted here, C++11 chose to define an atomics library as the vehicle for “synchornization operations” in SC-DRF0. The atomics library integrates ordering constraints along with its operations instead of using explicit memory barriers, the reasoning of which is explained in ⁶. It provides memory ordering options weaker than sequential consistency, such as acquire/release ordering, but the default behavior is SC. If we restrict the use of atomics library to only the default ordering constraint, the behavior is exactly SC-DRF0.

Another peculiarity of the C++ standard is the undefined behavior, or “catch fire” semantics of programs with data races. UB is fairly common, if controversial, in the C++ standard. The reasoning is again given in ⁶.

Things would be simpler if we stopped here and enforced that both hardware and programming languages must maintain the illusion of SC-DRF0 with no caveats. Unfortunately, SC-DRF0 means that no memory operations can be re-ordered before or past synchronizing operations, and this is still too expensive on some architectures (although some argue to the contrary⁷). What we really want is to restrict the ordering of just enough operations in just the right directions.

The Hardware

It is necessary to look into details about hardware to appreciate the cost of SC-DRF0. x86 architectures implement “Total Store Order”, or TSO⁸, so called because all processors agree on a total order in which writes commit to shared memory. This model can be understood as a simple multi-thread shared memory system with per-thread store buffers. Reads consult the store buffer first, and then access shared memory. Writes are simply queued in the local store buffer. If we consider the four possible pairs of memory operations, i.e. (load; load), (load; store), (store; store), (store; load), only the (store; load) pair can be re-ordered in the TSO model. All other pairs are guaranteed to preserve ordering. This is reflected in Intel’s Software Developer Manual⁹:

Reads are not reordered with other reads.

Writes are not reordered with older reads.

Writes to memory are not reordered with other writes … (with exceptions)

Reads may be reordered with older writes to different locations but not with older writes to the same location.

Unlike x86, POWER and ARM architectures are examples of much weaker memory models. Their models can be roughly understood as each thread having its own copy of memory, such that writes from each thread may propagate to other threads in any order. All four pairs of memory operations can be re-ordered in POWER and ARM⁸ ¹⁰.

In all architectures, special instructions can be used to preserve memory ordering. This typically comes in the form of memory barrier instructions, which can be inserted between two regular instructions so they cannot be re-ordered across the memory barrier. Sometimes, there are also special instructions like load-exclusive/store-exclusive on ARM that directly implement acquire/release semantics. Notably, enforcing ordering between a (store; load) pair is typically the most expensive constraint among the four pairs of memory operation. On some architectures, a heavy-weight full memory barrier must be used between the two operations - a sync instruction on POWER, as opposed to the cheaper lwsync. This is the cost incurred by SC-DRF0 (i.e. memory_order_seq_cst) without a weaker memory model. In addition, most higher-level synchronization primitives do not need such a strong barrier to achieve SC semantics (with the exception of Dekker’s algorithm or IRIW access patterns¹¹). As such, weaker memory ordering options were introduced into C++11¹².

C++11 and Weak Ordering

As mentioned, acquire/release ordering is a weaker constraint than sequential consistency. Unlike SC, they only place “unidirectional” constraints on surrounding operations:

A read-acquire executes before all reads and writes by the same thread that follow it in program order.
A write-release executes after all reads and writes by the same thread that precede it in program order.

More intuitively, a load-acquire can be thought of as a lock acquisition, and a store-release can be thought of as a lock release. Operations in the critical section in between cannot move out, but operations outside the critical section can move in. I believe this is also how the ordering got its name¹³.

Acquire/release is often all the ordering you need to implement widely applicable synchronization primitives - classic examples include message-passing¹³ and double-checked locking¹⁴. Neither acquire nor release require the use of full memory barriers, so it’s often cheaper and equally portable when this works, although producing a correct implementation is significantly more difficult. Additionally, since the standardization of acquire/release semantics, their cost are becoming increasingly lower on hardware - ARMv8 supports them as native instructions.

Things would be simpler if we stopped here and enforced that both hardware and programming languages must support acquire/release ordering at a minimum, and no weaker model can be exposed. Unfortunately, sometimes we’d like to not enforce ordering at all, but simply say “this variable should be accessed atomically”. Because C++11 forbids unannotated data races, this unordered-but-atomic use case led to memory_order_relaxed.

memory_order_relaxed imposes almost no ordering constraints, except for atomic operations on the same memory location, to rule out some extremely unintuitive behavior¹⁵. This is covered by a clause about modification order in the C++11 standard⁵.

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens before B, then A shall precede B in the modification order of M …

Relaxed ordering is even more difficult to use correctly than acquire/release ordering, and the exact semantics of relaxed ordering in (arguably) corner cases is still under-specified¹⁶, but there are some well-known use cases, including sequence locks, simple counters whose value does not affect control flow, reference counts (such as the one found in std::shared_ptr), etc¹² ¹⁷.

C++20

C++20 introduced some (arguably) minor revisions to the existing memory model standard. See this post or ¹⁸.

Epilogue/Rant

It’s daunting to realize that a handful of sentences in the C++ standard can hide this amount of complexity and decades of research. While I didn’t expect C++ atomics to be a simple matter, memory models are something else.

Some variation of Conway’s Law also seems to be at work here to exacerbate the situation. Three groups are involved in this problem - application programmers, language designers, hardware designers - and many more organizations behind them. The memory model has to reconcile concerns across all these groups while remaining (mostly) backwards compatible. A concrete example is that despite the strong memory model of x86, C++ has to accommodate the hardware with the weakest memory model. Surely, if we designed the hardware and the language together, this is not the solution we’d end up with?

L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” Sep. 1979. ↩︎
S. V. Adve and K. Gharachorloo, Shared Memory Consistency Models: A Tutorial. ↩︎
S. V. Adve and M. D. Hill, “Weak Ordering - A New Deﬁnition.” ↩︎
R. Cox, “Programming Language Memory Models.” ↩︎
H. Boehm, “N2429: Concurrency memory model (final revision).” ↩︎ ↩︎
H. Boehm, “N2176: Memory Model Rationales.” ↩︎ ↩︎
D. Marino, T. Millstein, M. Musuvathi, and A. Singh, “The Silently Shifting Semicolon,”. ↩︎
L. Maranget, S. Sarkar, and P. Sewell, “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.” ↩︎ ↩︎
Intel 64 and IA-32 Architectures Software Developer Manual, Volume 3, § 8.2.2. ↩︎
P. E. McKenney, Is Parallel Programming Hard, And, If So, What Can You Do About It?, § 15.5. ↩︎
H. Boehm and S. V. Adve, Foundations of the C++ Concurrency Memory Model. ↩︎
H. Sutter, “C++ and Beyond 2012: Herb Sutter - atomic Weapons 2 of 2.” ↩︎ ↩︎
J. Preshing, “Acquire and Release Semantics. ↩︎ ↩︎
J. Preshing, Double-Checked Locking is Fixed In C++11. ↩︎
H. Boehm, N2480: A Less Formal Explanation of the Proposed C++ Concurrency Memory Model. ↩︎
H. Boehm, P1217R0: Out-of-thin-air, revisited, again. ↩︎
H. Boehm and P. E. McKenney, P2055R0: A Relaxed Guide to memory_order_relaxed. ↩︎
H. Boehm et al, P0668R5: Revising the C++ memory model. ↩︎

The C++ Memory Model, or Why We Can’t Have Nice Things

2022/05/28