Wednesday, December 3, 2014


The C# Memory Model in Theory and Practice

This is the first of a two-part series that will tell the long story of the C# memory model. The first part explains the guarantees the C# memory model makes and shows the code patterns that motivate the guarantees; the second part will detail how the guarantees are achieved on different hardware architectures in the Microsoft .NET Framework 4.5.
One source of complexity in multithreaded programming is that the compiler and the hardware can subtly transform a program’s memory operations in ways that don’t affect the single-threaded behavior, but might affect the multithreaded behavior. Consider the following method:
void Init() {
  _data = 42;
  _initialized = true;
}
If _data and _initialized are ordinary (that is, non-volatile) fields, the compiler and the processor are allowed to reorder the operations so that Init executes as if it were written like this:
void Init() {
  _initialized = true;
  _data = 42;
}
There are various optimizations in both compilers and processors that can result in this kind of reordering, as I’ll discuss in Part 2.
In a single-threaded program, the reordering of statements in Init makes no difference in the meaning of the program. As long as both _initialized and _data are updated before the method returns, the order of the assignments doesn’t matter. In a single-threaded program, there’s no second thread that could observe the state in between the updates.
In a multithreaded program, however, the order of the assignments may matter because another thread might read the fields while Init is in the middle of execution. Consequently, in the reordered version of Init, another thread may observe _initialized=true and _data=0.
The C# memory model is a set of rules that describes what kinds of memory-operation reordering are and are not allowed. All programs should be written against the guarantees defined in the specification.
However, even if the compiler and the processor are allowed to reorder memory operations, it doesn’t mean they always do so in practice. Many programs that contain a “bug” according to the abstract C# memory model will still execute correctly on particular hardware running a particular version of the .NET Framework. Notably, the x86 and x64 processors reorder operations only in certain narrow scenarios, and similarly the CLR just-in-time (JIT) compiler doesn’t perform many of the transformations it’s allowed to.
Although the abstract C# memory model is what you should have in mind when writing new code, it can be helpful to understand the actual implementation of the memory model on different architectures, in particular when trying to understand the behavior of existing code.

C# Memory Model According to ECMA-334

The authoritative definition of the C# memory model is in the Standard ECMA-334 C# Language Specification (bit.ly/MXMCrN). Let’s discuss the C# memory model as defined in the specification.
Memory Operation Reordering According to ECMA-334, when a thread reads a memory location in C# that was written to by a different thread, the reader might see a stale value. This problem is illustrated in Figure 1.
Figure 1 Code at Risk of Memory Operation Reordering
public class DataInit {
  private int _data = 0;
  private bool _initialized = false;
  void Init() {
    _data = 42;            // Write 1
    _initialized = true;   // Write 2
  }
  void Print() {
    if (_initialized)            // Read 1
      Console.WriteLine(_data);  // Read 2
    else
      Console.WriteLine("Not initialized");
  }
}
Suppose Init and Print are called in parallel (that is, on different threads) on a new instance of DataInit. If you examine the code of Init and Print, it may seem that Print can only output “42” or “Not initialized.” However, Print can also output “0.”
The C# memory model permits reordering of memory operations in a method, as long as the behavior of single-threaded execution doesn’t change. For example, the compiler and the processor are free to reorder the Init method operations as follows:
void Init() {
  _initialized = true;   // Write 2
  _data = 42;            // Write 1
}
This reordering wouldn’t change the behavior of the Init method in a single-threaded program. In a multithreaded program, however, another thread might read _initialized and _data fields after Init has modified one field but not the other, and then the reordering could change the behavior of the program. As a result, the Print method could end up outputting a “0.”
The reordering of Init isn’t the only possible source of trouble in this code sample. Even if the Init writes don’t end up reordered, the reads in the Print method could be transformed:
void Print() {
  int d = _data;     // Read 2
  if (_initialized)  // Read 1
    Console.WriteLine(d);
  else
    Console.WriteLine("Not initialized");
}
Just as with the reordering of writes, this transformation has no effect in a single-threaded program, but might change the behavior of a multithreaded program. And, just like the reordering of writes, the reordering of reads can also result in a 0 printed to the output.
In Part 2 of this article, you’ll see how and why these transformations take place in practice when I look at different hardware architectures in detail.
Volatile Fields The C# programming language provides volatile fields that constrain how memory operations can be reordered. The ECMA specification states that volatile fields provide acquire-release semantics (bit.ly/NArSlt).
A read of a volatile field has acquire semantics, which means it can’t be reordered with subsequent operations. The volatile read forms a one-way fence: preceding operations can pass it, but subsequent operations can’t. Consider this example:
class AcquireSemanticsExample {
  int _a;
  volatile int _b;
  int _c;
  void Foo() {
    int a = _a; // Read 1
    int b = _b; // Read 2 (volatile)
    int c = _c; // Read 3
    ...
  }
}
Read 1 and Read 3 are non-volatile, while Read 2 is volatile. Read 2 can’t be reordered with Read 3, but it can be reordered with Read 1. Figure 2 shows the valid reorderings of the Foo body.
Figure 2 Valid Reorderings of Reads in AcquireSemanticsExample

// Ordering 1 (as written)
int a = _a; // Read 1
int b = _b; // Read 2 (volatile)
int c = _c; // Read 3

// Ordering 2
int b = _b; // Read 2 (volatile)
int a = _a; // Read 1
int c = _c; // Read 3

// Ordering 3
int b = _b; // Read 2 (volatile)
int c = _c; // Read 3
int a = _a; // Read 1
A write of a volatile field, on the other hand, has release semantics, and so it can’t be reordered with prior operations. A volatile write forms a one-way fence, as this example demonstrates:
class ReleaseSemanticsExample
{
  int _a;
  volatile int _b;
  int _c;
  void Foo()
  {
    _a = 1; // Write 1
    _b = 1; // Write 2 (volatile)
    _c = 1; // Write 3
    ...
  }
}
Write 1 and Write 3 are non-volatile, while Write 2 is volatile. Write 2 can’t be reordered with Write 1, but it can be reordered with Write 3. Figure 3 shows the valid reorderings of the Foo body.
Figure 3 Valid Reorderings of Writes in ReleaseSemanticsExample

// Ordering 1 (as written)
_a = 1; // Write 1
_b = 1; // Write 2 (volatile)
_c = 1; // Write 3

// Ordering 2
_a = 1; // Write 1
_c = 1; // Write 3
_b = 1; // Write 2 (volatile)

// Ordering 3
_c = 1; // Write 3
_a = 1; // Write 1
_b = 1; // Write 2 (volatile)
I’ll come back to the acquire-release semantics in the “Publication via Volatile Field” section later in this article.
Atomicity Another issue to be aware of is that in C#, values aren’t necessarily written atomically into memory. Consider this example:
class AtomicityExample {
  Guid _value;
  void SetValue(Guid value) { _value = value; }
  Guid GetValue() { return _value; }
}
If one thread repeatedly calls SetValue and another thread calls GetValue, the getter thread might observe a value that was never written by the setter thread. For example, if the setter thread alternately calls SetValue with Guid values (0,0,0,0) and (5,5,5,5), GetValue could observe (0,0,0,5) or (0,0,5,5) or (5,5,0,0), even though none of those values was ever assigned using SetValue.
The reason behind the “tearing” is that the assignment “_value = value” doesn’t execute atomically at the hardware level. Similarly, the read of _value also doesn’t execute atomically.
The C# ECMA specification guarantees that the following types will be written atomically: reference types, bool, char, byte, sbyte, short, ushort, uint, int and float. Values of other types—including user-defined value types—could be written into memory in multiple atomic writes. As a result, a reading thread could observe a torn value consisting of pieces of different values.
One caveat is that even the types that are normally read and written atomically (such as int) could be read or written non-atomically if the value is not correctly aligned in memory. Normally, C# will ensure that values are correctly aligned, but the user is able to override the alignment using the StructLayoutAttribute class (bit.ly/Tqa0MZ).
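If a value that's too large to be written atomically must be shared between threads, a lock can restore atomicity. Here's a minimal sketch of that fix (the AtomicityExampleFixed class is my own, not part of the original sample):
class AtomicityExampleFixed {
  Guid _value;
  readonly object _lock = new object();
  // Both accesses take the same lock, so a reader can never observe
  // a torn value made up of pieces of two different writes
  void SetValue(Guid value) { lock (_lock) { _value = value; } }
  Guid GetValue() { lock (_lock) { return _value; } }
}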
Non-Reordering Optimizations Some compiler optimizations may introduce or eliminate certain memory operations. For example, the compiler might replace repeated reads of a field with a single read. Similarly, if code reads a field and stores the value in a local variable and then repeatedly reads the variable, the compiler could choose to repeatedly read the field instead.
Because the ECMA C# spec doesn’t rule out the non-reordering optimizations, they’re presumably allowed. In fact, as I’ll discuss in Part 2, the JIT compiler does perform these types of optimizations.
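For example, a read-elimination transformation might look like this (a hypothetical sketch; whether the JIT actually fuses a particular pair of reads depends on the version and the surrounding code):
class ReadFusionSketch {
  int _x;
  int SumTwice() {
    int a = _x; // Read 1
    int b = _x; // Read 2
    // The compiler may fuse these two reads into a single read, so a write
    // by another thread between Read 1 and Read 2 might never be observed
    return a + b;
  }
}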

Thread Communication Patterns

The purpose of a memory model is to enable thread communication. When one thread writes values to memory and another thread reads from memory, the memory model dictates what values the reading thread might see.
Locking Locking is typically the easiest way to share data among threads. If you use locks correctly, you basically don’t have to worry about any of the memory model messiness.
Whenever a thread acquires a lock, the CLR ensures that the thread will see all updates made by the thread that held the lock earlier. Let’s add locking to the example from the beginning of this article, as shown in Figure 4.
Figure 4 Thread Communication with Locking
public class Test {
  private int _a = 0;
  private int _b = 0;
  private object _lock = new object();
  void Set() {
    lock (_lock) {
      _a = 1;
      _b = 1;
    }
  }
  void Print() {
    lock (_lock) {
      int b = _b;
      int a = _a;
      Console.WriteLine("{0} {1}", a, b);
    }
  }
}
Adding a lock that Print and Set acquire provides a simple solution. Now, Set and Print execute atomically with respect to each other. The lock statement guarantees that the bodies of Print and Set will appear to execute in some sequential order, even if they’re called from multiple threads.
The diagram in Figure 5 shows one possible sequential order that could happen if Thread 1 calls Print three times, Thread 2 calls Set once and Thread 3 calls Print once.
Figure 5 Sequential Execution with Locking
When a locked block of code executes, it’s guaranteed to see all writes from blocks that precede the block in the sequential order of the lock. Also, it’s guaranteed not to see any of the writes from blocks that follow it in the sequential order of the lock.
In short, locks hide all of the unpredictability and complexity of the memory model: you don’t have to worry about the reordering of memory operations if you use locks correctly. However, the locking has to be correct. If only Print or Set uses the lock, or if Print and Set acquire two different locks, memory operations can become reordered and the complexity of the memory model comes back.
Publication via Threading API Locking is a very general and powerful mechanism for sharing state among threads. Publication via threading API is another frequently used pattern of concurrent programming.
The easiest way to illustrate publication via threading API is by way of an example:
class Test2 {
  static int s_value;
  static void Run() {
    s_value = 42;
    Task t = Task.Factory.StartNew(() => {
      Console.WriteLine(s_value);
    });
    t.Wait();
  }
}
When you examine the preceding code sample, you’d probably expect “42” to be printed to the screen. And, in fact, your intuition would be correct. This code sample is guaranteed to print “42.”
It might be surprising that this case even needs to be mentioned, but in fact there are possible implementations of StartNew that would allow “0” to be printed instead of “42,” at least in theory. After all, there are two threads communicating via a non-volatile field, so memory operations can be reordered. The pattern is displayed in the diagram in Figure 6.
Figure 6 Two Threads Communicating via a Non-Volatile Field
The StartNew implementation must ensure that the write to s_value on Thread 1 will not move after <start task t>, and the read from s_value on Thread 2 will not move before <task t starting>. And, in fact, the StartNew API really does guarantee this.
All other threading APIs in the .NET Framework, such as Thread.Start and ThreadPool.QueueUserWorkItem, also make a similar guarantee. In fact, nearly every threading API must have some barrier semantics in order to function correctly. These are almost never documented, but can usually be deduced simply by thinking about what the guarantees would have to be in order for the API to be useful.
Publication via Type Initialization Another way to safely publish a value to multiple threads is to write the value to a static field in a static initializer or a static constructor. Consider this example:
class Test3
{
  static int s_value = 42;
  static object s_obj = new object();
  static void PrintValue()
  {
    Console.WriteLine(s_value);
    Console.WriteLine(s_obj == null);
  }
}
If Test3.PrintValue is called from multiple threads concurrently, is it guaranteed that each PrintValue call will print “42” and “false”? Or, could one of the calls also print “0” or “true”? Just as in the previous case, you do get the behavior you’d expect: Each thread is guaranteed to print “42” and “false.”
The patterns discussed so far all behave as you’d expect. Now I’ll get to cases whose behavior may be surprising.
Publication via Volatile Field Many concurrent programs can be built using the three simple patterns discussed so far, used together with concurrency primitives in the .NET System.Threading and System.Collections.Concurrent namespaces.
The pattern I’m about to discuss is so important that the semantics of the volatile keyword were designed around it. In fact, the best way to remember the volatile keyword semantics is to remember this pattern, instead of trying to memorize the abstract rules explained earlier in this article.
Let’s start with the example code in Figure 7. The DataInit class in Figure 7 has two methods, Init and Print; both may be called from multiple threads. If no memory operations are reordered, Print can only print “Not initialized” or “42,” but there are two possible cases when Print could print a “0”:
  • Write 1 and Write 2 were reordered.
  • Read 1 and Read 2 were reordered.
Figure 7 Using the Volatile Keyword
public class DataInit {
  private int _data = 0;
  private volatile bool _initialized = false;
  void Init() {
    _data = 42;            // Write 1
    _initialized = true;   // Write 2
  }
  void Print() {
    if (_initialized) {          // Read 1
      Console.WriteLine(_data);  // Read 2
    }
    else {
      Console.WriteLine("Not initialized");
    }
  }
}
If _initialized were not marked as volatile, both reorderings would be permitted. However, when _initialized is marked as volatile, neither reordering is allowed! In the case of writes, you have an ordinary write followed by a volatile write, and a volatile write can’t be reordered with a prior memory operation. In the case of the reads, you have a volatile read followed by an ordinary read, and a volatile read can’t be reordered with a subsequent memory operation.
So, Print will never print “0,” even if called concurrently with Init on a new instance of DataInit.
Note that if the _data field were volatile but _initialized were not, both reorderings would still be permitted. As a result, remembering this example is a good way to remember the volatile semantics.
Lazy Initialization One common variant of publication via volatile field is lazy initialization, as illustrated in Figure 8.
Figure 8 Lazy Initialization
class BoxedInt
{
  public int Value { get; set; }
}
class LazyInit
{
  volatile BoxedInt _box;
  public int LazyGet()
  {
    var b = _box;  // Read 1
    if (b == null)
    {
      lock(this)
      {
        b = new BoxedInt();
        b.Value = 42; // Write 1
        _box = b;     // Write 2
      }
    }
    return b.Value; // Read 2
  }
}
In this example, LazyGet is always guaranteed to return “42.” However, if the _box field were not volatile, LazyGet would be allowed to return “0” for two reasons: the reads could get reordered, or the writes could get reordered.
To further emphasize the point, consider this class:
class BoxedInt2
{
  public readonly int _value = 42;
  void PrintValue()
  {
    Console.WriteLine(_value);
  }
}
Now, it’s possible—at least in theory—that PrintValue will print “0” due to a memory-model issue. Here’s a usage example of BoxedInt that allows it:
class Tester
{
  BoxedInt2 _box = null;
  public void Set() {
    _box = new BoxedInt2();
  }
  public void Print() {
    var b = _box;
    if (b != null) b.PrintValue();
  }
}
Because the BoxedInt2 instance was incorrectly published (through a non-volatile field, _box), the thread that calls Print may observe a partially constructed object! Again, making the _box field volatile would fix the issue.
Interlocked Operations and Memory Barriers Interlocked operations are atomic operations that can be used at times to reduce locking in a multithreaded program. Consider this simple thread-safe counter class:
class Counter
{
  private int _value = 0;
  private object _lock = new object();
  public int Increment()
  {
    lock (_lock)
    {
      _value++;
      return _value;
    }
  }
}
Using Interlocked.Increment, you can rewrite the program like this:
class Counter
{
  private int _value = 0;
  public int Increment()
  {
    return Interlocked.Increment(ref _value);
  }
}
As rewritten with Interlocked.Increment, the method should execute faster, at least on some architectures. In addition to the increment operations, the Interlocked class (bit.ly/RksCMF) exposes methods for various atomic operations: adding a value, conditionally replacing a value, replacing a value and returning the original value, and so forth.
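For example (a quick sketch of a few of those methods; the variable names are mine):
int value = 0;
Interlocked.Add(ref value, 10);                  // Atomically add 10
int old = Interlocked.Exchange(ref value, 42);   // Replace and return the original value
Interlocked.CompareExchange(ref value, 100, 42); // Replace with 100 only if the current value is 42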
All Interlocked methods have one very interesting property: They can’t be reordered with other memory operations. So no memory operation, whether before or after an Interlocked operation, can pass an Interlocked operation.
An operation that’s closely related to Interlocked methods is Thread.MemoryBarrier, which can be thought of as a dummy Interlocked operation. Just like an Interlocked method, Thread.Memory­Barrier can’t be reordered with any prior or subsequent memory operations. Unlike an Interlocked method, though, Thread.MemoryBarrier has no side effect; it simply constrains memory reorderings.
Polling Loop A polling loop is a pattern that’s generally not recommended but, somewhat unfortunately, frequently used in practice. Figure 9 shows a broken polling loop.
Figure 9 Broken Polling Loop
class PollingLoopExample
{
  private bool _loop = true;
  public static void Main()
  {
    PollingLoopExample test1 = new PollingLoopExample();
    // Set _loop to false on another thread
    new Thread(() => { test1._loop = false;}).Start();
    // Poll the _loop field until it is set to false
    while (test1._loop) ;
    // The previous loop may never terminate
  }
}
In this example, the main thread loops, polling a particular non-volatile field. A helper thread sets the field in the meantime, but the main thread may never see the updated value.
Now, what if the _loop field was marked volatile? Would that fix the program? The general expert consensus seems to be that the compiler isn’t allowed to hoist a volatile field read out of a loop, but it’s debatable whether the ECMA C# specification makes this guarantee.
On one hand, the specification states only that volatile fields obey the acquire-release semantics, which doesn’t seem sufficient to prevent hoisting of a volatile field. On the other hand, the example code in the specification does in fact poll a volatile field, implying that the volatile field read can’t be hoisted out of the loop.
On x86 and x64 architectures, PollingLoopExample.Main will typically hang. The JIT compiler will read the test1._loop field just once, save the value in a register, and then loop until the register value changes, which will obviously never happen.
If the loop body contains some statements, however, the JIT compiler will probably need the register for some other purpose, so each iteration may end up rereading test1._loop. As a result, you may end up seeing loops in existing programs that poll a non-volatile field and yet happen to work.
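If you need one thread to wait for a signal from another, an event is a more robust alternative to polling. Here’s a minimal sketch using ManualResetEventSlim (my own example, not from the original figure):
class EventInsteadOfPolling
{
  private readonly ManualResetEventSlim _done = new ManualResetEventSlim();
  public void Run()
  {
    // The helper thread signals the event instead of writing a shared flag
    new Thread(() => { _done.Set(); }).Start();
    // Wait blocks until the event is signaled; no polling is needed
    _done.Wait();
  }
}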
Concurrency Primitives Much concurrent code can benefit from high-level concurrency primitives that became available in the .NET Framework 4. Figure 10 lists some of the .NET concurrency primitives.
Figure 10 Concurrency Primitives in the .NET Framework 4
  • Lazily initialized values: Lazy<>, LazyInitializer
  • Thread-safe collections: BlockingCollection<>, ConcurrentBag<>, ConcurrentDictionary<,>, ConcurrentQueue<>, ConcurrentStack<>
  • Primitives to coordinate execution of different threads: AutoResetEvent, Barrier, CountdownEvent, ManualResetEventSlim, Monitor, SemaphoreSlim
  • Container that holds a separate value for every thread: ThreadLocal<>
By using these primitives, you can often avoid low-level code that depends on the memory model in intricate ways (via volatile and the like).
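As an illustration, here’s a small producer/consumer sketch built on BlockingCollection<int> (a hypothetical example; in real code the producer and consumer would do actual work):
var queue = new BlockingCollection<int>();
Task consumer = Task.Factory.StartNew(() =>
{
  // GetConsumingEnumerable blocks until items arrive and
  // completes after CompleteAdding is called
  foreach (int item in queue.GetConsumingEnumerable())
    Console.WriteLine(item);
});
for (int i = 0; i < 3; i++) queue.Add(i);
queue.CompleteAdding();
consumer.Wait();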

Coming Up

So far, I’ve described the C# memory model as defined in the ECMA C# specification, and discussed the most important patterns of thread communication that define the memory model.
The second part of this article will explain how the memory model is actually implemented on different architectures, which is helpful for understanding the behavior of programs in the real world.

Best Practices

  • All code you write should rely only on the guarantees made by the ECMA C# specification, and not on any of the implementation details explained in this article.
  • Avoid unnecessary use of volatile fields. Most of the time, locks or concurrent collections (System.Collections.Concurrent.*) are more appropriate for exchanging data between threads. In some cases, volatile fields can be used to optimize concurrent code, but you should use performance measurements to validate that the benefit outweighs the extra complexity.
  • Instead of implementing the lazy initialization pattern yourself using a volatile field, use the System.Lazy<T> and System.Threading.LazyInitializer types.
  • Avoid polling loops. Often, you can use a BlockingCollection<T>, Monitor.Wait/Pulse, events or asynchronous programming instead of a polling loop.
  • Whenever possible, use the standard .NET concurrency primitives instead of implementing equivalent functionality yourself.


The C# Memory Model in Theory and Practice, Part 2


This is the second article in a two-part series that discusses the C# memory model. As explained in the first part in the December issue of MSDN Magazine (msdn.microsoft.com/magazine/jj863136), the compiler and the hardware may subtly transform the memory operations of a program in ways that don’t affect single-threaded behavior, but can impact multi-threaded behavior. As an example, consider this method:
void Init() {
  _data = 42;
  _initialized = true;
}
If _data and _initialized are ordinary (that is, non-volatile) fields, the compiler and the processor are allowed to reorder the operations so that Init executes as if it were written like this:
void Init() {
  _initialized = true;
  _data = 42;
}
In the previous article, I described the abstract C# memory model. In this article, I’ll explain how the C# memory model is actually implemented on different architectures supported by the Microsoft .NET Framework 4.5.

Compiler Optimizations

As mentioned in the first article, the compiler might optimize the code in a way that reorders memory operations. In the .NET Framework 4.5, the csc.exe compiler that compiles C# to IL doesn’t do many optimizations, so it won’t reorder memory operations. However, the just-in-time (JIT) compiler that converts IL to machine code will, in fact, perform some optimizations that reorder memory operations, as I’ll discuss.
Loop Read Hoisting Consider the polling loop pattern:
class Test
{
  private bool _flag = true;
  public void Run()
  {
    // Set _flag to false on another thread
    new Thread(() => { _flag = false; }).Start();
    // Poll the _flag field until it is set to false
    while (_flag) ;
    // The loop might never terminate!
  }
}
In this case, the .NET 4.5 JIT compiler might rewrite the loop like this:
if (_flag) { while (true); }
In the single-threaded case, this transformation is entirely legal and, in general, hoisting a read out of a loop is an excellent optimization. However, if _flag is set to false on another thread, the optimization can cause a hang.
Note that if the _flag field were volatile, the JIT compiler would not hoist the read out of the loop. (See the “Polling Loop” section in the December article for a more detailed explanation of this pattern.)
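For reference, here’s the same loop with the field marked volatile (a sketch of the fix just described):
class Test
{
  private volatile bool _flag = true;
  public void Run()
  {
    // Set _flag to false on another thread
    new Thread(() => { _flag = false; }).Start();
    // Because _flag is volatile, the read is not hoisted out of the loop,
    // so each iteration rereads the field and the loop eventually terminates
    while (_flag) ;
  }
}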
Read Elimination Another compiler optimization that can cause errors in multi-threaded code is illustrated in this sample:
class Test
{
  private int _A, _B;
  public void Foo()
  {
    int a = _A;
    int b = _B;
    ...
  }
}
The class contains two non-volatile fields, _A and _B. Method Foo first reads field _A and then field _B. However, because the fields are non-volatile, the compiler is free to reorder the two reads. So, if the correctness of the algorithm depends on the order of the reads, the program contains a bug.
It’s hard to imagine what the compiler would gain by switching the order of the reads. Given the way Foo is written, the compiler probably wouldn’t swap the order of the reads.
However, the reordering does happen if I add another innocuous statement at the top of the Foo method:
public bool Foo()
{
  if (_B == -1) throw new Exception(); // Extra read
  int a = _A;
  int b = _B;
  return a > b;
}
On the first line of the Foo method, the compiler loads the value of _B into a register. Then, the second load of _B simply uses the value that’s already in the register instead of issuing a real load instruction.
Effectively, the compiler rewrites the Foo method as follows:
public bool Foo()
{
  int b = _B;
  if (b == -1) throw new Exception(); // Extra read
  int a = _A;
  return a > b;
}
Although this code sample gives a rough approximation of how the compiler optimizes the code, it’s also instructive to look at the disassembly of the code:
if (_B == -1) throw new Exception();
  push        eax
  mov         edx,dword ptr [ecx+8]
  // Load field _B into EDX register
  cmp         edx,0FFFFFFFFh
  je          00000016
int a = _A;
  mov         eax,dword ptr [ecx+4]
  // Load field _A into EAX register
return a > b;
  cmp         eax,edx
  // Compare registers EAX and EDX
...
Even if you don’t know assembly, what’s happening here is pretty easy to understand. As a part of evaluating the condition _B == -1, the compiler loads the _B field into the EDX register. Later, when field _B is read again, the compiler simply reuses the value it already has in EDX instead of issuing a real memory read. Consequently, the reads of _A and _B get reordered.
In this case, the correct solution is to mark field _A as volatile. If that’s done, the compiler shouldn’t reorder the reads of _A and _B, because the load of _A has load-acquire semantics. However, I should point out that the .NET Framework, through version 4, doesn’t handle this case correctly and, in fact, marking the _A field as volatile will not prevent the read reordering. This issue has been fixed in the .NET Framework version 4.5.
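To make the fix concrete, here’s how the corrected class would look (a sketch; as just noted, the volatile annotation is honored for this case only starting with the .NET Framework 4.5):
class Test
{
  private volatile int _A; // The read of _A now has load-acquire semantics
  private int _B;
  public bool Foo()
  {
    if (_B == -1) throw new Exception(); // Extra read
    int a = _A; // Acquire read: the read of _B below can't move before it
    int b = _B;
    return a > b;
  }
}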
Read Introduction As I just explained, the compiler sometimes fuses multiple reads into one. The compiler can also split a single read into multiple reads. In the .NET Framework 4.5, read introduction is much less common than read elimination and occurs only in very rare, specific circumstances. However, it does sometimes happen.
To understand read introduction, consider the following example:
public class ReadIntro {
  private Object _obj = new Object();
  void PrintObj() {
    Object obj = _obj;
    if (obj != null) {
      Console.WriteLine(obj.ToString()); // May throw a NullReferenceException
    }
  }
  void Uninitialize() {
    _obj = null;
  }
}
If you examine the PrintObj method, it looks like the obj value will never be null in the obj.ToString expression. However, that line of code could in fact throw a NullReferenceException. The CLR JIT might compile the PrintObj method as if it were written like this:
void PrintObj() {
  if (_obj != null) {
    Console.WriteLine(_obj.ToString());
  }
}
Because the read of the _obj field has been split into two reads of the field, the ToString method may now be called on a null target.
Note that you won’t be able to reproduce the NullReferenceException using this code sample in the .NET Framework 4.5 on x86-x64. Read introduction is very difficult to reproduce in the .NET Framework 4.5, but it does nevertheless occur in certain special circumstances.

C# Memory Model Implementation on the x86-x64

As the x86 and x64 have the same behavior with respect to the memory model, I’ll consider both processor variants together.
Unlike some architectures, the x86-x64 processor provides fairly strong ordering guarantees on memory operations. In fact, the JIT compiler doesn’t need to use any special instructions on the x86-x64 to achieve volatile semantics; ordinary memory operations already provide those semantics. Even so, there are still specific cases when the x86-x64 processor does reorder memory operations.
x86-x64 Memory Reordering Even though the x86-x64 processor provides fairly strong ordering guarantees, a particular kind of hardware reordering still happens.
The x86-x64 processor will not reorder two writes, nor will it reorder two reads. However, the one (and only) possible reordering effect is that when a processor writes a value, that value will not be made immediately available to other processors. Figure 1 shows an example that demonstrates this behavior.
Figure 1 StoreBufferExample
class StoreBufferExample
{
  // On x86 .NET Framework 4.5, it makes no difference
  // whether these fields are volatile or not
  volatile bool A = false;
  volatile bool B = false;
  volatile bool A_Won = false;
  volatile bool B_Won = false;
  public void ThreadA()
  {
    A = true;
    if (!B) A_Won = true;
  }
  public void ThreadB()
  {
    B = true;
    if (!A) B_Won = true;
  }
}
Consider the case when methods ThreadA and ThreadB are called from different threads on a new instance of StoreBufferExample, as shown in Figure 2. If you think about the possible outcomes of the program in Figure 2, three cases seem to be possible:
  1. Thread 1 completes before Thread 2 starts. The outcome is A_Won=true, B_Won=false.
  2. Thread 2 completes before Thread 1 starts. The outcome is A_Won=false, B_Won=true.
  3. The threads interleave. The outcome is A_Won=false, B_Won=false.
Figure 2 Calling ThreadA and ThreadB Methods from Different Threads
But, surprisingly, there’s a fourth case: It’s possible that both the A_Won and B_Won fields will be true after this code has finished! Because of the store buffer, stores can get “delayed,” and therefore end up reordered with a subsequent load. Even though this outcome is not consistent with any interleaving of Thread 1 and Thread 2 executions, it can still happen.
This example is interesting because we have a processor (the x86-x64) with relatively strong ordering, and all fields are volatile, and yet we still observe a reordering of memory operations. Even though the write to A is volatile and the read from B is also volatile, the fences are both one-directional, and in fact allow this reordering. So, the ThreadA method may effectively execute as if it were written like this:
public void ThreadA()
{
  bool tmp = B;
  A = true;
  if (!tmp) A_Won = true;
}
One possible fix is to insert a memory barrier into both ThreadA and ThreadB. The updated ThreadA method would look like this:
public void ThreadA()
{
  A = true;
  Thread.MemoryBarrier();
  if (!B) A_Won = true;
}
The CLR JIT will insert a “lock or” instruction in place of the memory barrier. A locked x86 instruction has the side effect of flushing the store buffer:
mov         byte ptr [ecx+4],1
lock or     dword ptr [esp],0
cmp         byte ptr [ecx+5],0
jne         00000013
mov         byte ptr [ecx+6],1
ret
As an interesting side note, the Java programming language takes a different approach. The Java memory model has a slightly stronger definition of “volatile” that doesn’t permit store-load reordering, so a Java compiler on the x86 will typically emit a locked instruction after a volatile write.
x86-x64 Remarks The x86 processor has a fairly strong memory model, and the only source of reordering at the hardware level is the store buffer. The store buffer can cause a write to get reordered with a subsequent read (store-load reordering).
Also, certain compiler optimizations can result in reordering of memory operations. Notably, if several reads access the same memory location, the compiler might choose to perform the read only once and keep the value in a register for subsequent reads.
One interesting piece of trivia is that the C# volatile semantics closely match the hardware reordering guarantees made by x86-x64 hardware. As a result, reads and writes of volatile fields require no special instructions on the x86: Ordinary reads and writes (for example, using the MOV instruction) are sufficient. Of course, your code shouldn’t depend on these implementation details because they vary among hardware architectures and possibly .NET versions.

C# Memory Model Implementation on the Itanium Architecture

The Itanium hardware architecture has a memory model weaker than that of the x86-x64. Itanium was supported by the .NET Framework until version 4.
Even though Itanium is no longer supported in the .NET Framework 4.5, understanding the Itanium memory model is useful when you read older articles on the .NET memory model and have to maintain code that incorporated recommendations from those articles.
Itanium Reordering Itanium has a different instruction set than the x86-x64, and memory model concepts show up in the instruction set. Itanium distinguishes between an ordinary load (LD) and load-acquire (LD.ACQ), and an ordinary store (ST) and store-release (ST.REL).
Ordinary loads and stores can be freely reordered by the hardware, as long as the single-threaded behavior doesn’t change. For example, look at this code:
class ReorderingExample
{
  int _a = 0, _b = 0;
  void PrintAB()
  {
    int a = _a;
    int b = _b;
    Console.WriteLine("A:{0} B:{1}", a, b);
  }
  ...
}
Consider two reads of _a and _b in the PrintAB method. Because the reads access an ordinary, non-volatile field, the compiler will use ordinary LD (not LD.ACQ) to implement the reads. Consequently, the two reads might be effectively reordered in the hardware, so that PrintAB behaves as if it were written like this:
void PrintAB()
{
  int b = _b;
  int a = _a;
  Console.WriteLine("A:{0} B:{1}", a, b);
}
In practice, whether the reordering happens or not depends on a variety of unpredictable factors: what’s in the processor cache, how busy the processor pipeline is and so forth. However, the processor will not reorder two reads if they’re related via data dependency. Data dependency between two reads occurs when the value returned by one read determines the location accessed by a subsequent read.
This example illustrates data dependency:
class Counter { public int _value; }
class Test
{
  private Counter _counter = new Counter();
  void Do()
  {
    Counter c = _counter; // Read 1
    int value = c._value; // Read 2
  }
}
In the Do method, Itanium will never reorder Read 1 and Read 2, even though Read 1 is an ordinary load and not load-acquire. It might seem obvious that these two reads can’t be reordered: The first read determines which memory location the second read should access! However, some processors—other than Itanium—may in fact reorder the reads. The processor might guess on the value that Read 1 will return and perform Read 2 speculatively, even before Read 1 has completed. But, again, Itanium will not do that.
I’ll get back to the data dependency in Itanium discussion in a bit, and its relevance to the C# memory model will become clearer.
Also, Itanium will not reorder two ordinary reads if they’re related via control dependency. Control dependency occurs when the value returned by a read determines whether a subsequent instruction will execute.
So, in this example, the reads of _initialized and _data are related via control dependency:
void Print() {
  if (_initialized)            // Read 1
    Console.WriteLine(_data);  // Read 2
  else
    Console.WriteLine("Not initialized");
}
Even if _initialized and _data are ordinary (non-volatile) reads, the Itanium processor will not reorder them. Note that the JIT compiler is still free to reorder the two reads, and in some cases will.
Also, it’s worth pointing out that, like the x86-x64 processor, Itanium also uses a store buffer, so the StoreBufferExample shown in Figure 1 will exhibit the same kind of reorderings on Itanium as it did on the x86-x64. An interesting piece of trivia is that if you use LD.ACQ for all reads and ST.REL for all writes on Itanium, you basically get the x86-x64 memory model, where the store buffer is the only source of reordering.
Compiler Behavior on Itanium The CLR JIT compiler has one surprising behavior on Itanium: all writes are emitted as ST.REL, and not ST. Consequently, a volatile write and a non-volatile write will typically emit the same instruction on Itanium. However, an ordinary read will be emitted as LD; only reads from volatile fields are emitted as LD.ACQ.
This behavior might come as a surprise because the compiler is certainly not required to emit ST.REL for non-volatile writes. As far as the European Computer Manufacturers Association (ECMA) C# specification is concerned, the compiler could emit ordinary ST instructions. Emitting ST.REL is just something extra that the compiler chooses to do, in order to ensure that a particular common (but in theory incorrect) pattern will work as expected.
It can be difficult to imagine a pattern where emitting ST.REL for writes helps even though ordinary LD is used for reads. In the PrintAB example presented earlier in this section, constraining just the writes wouldn’t help, because the reads could still be reordered.
There’s one very important scenario in which using ST.REL with ordinary LD is sufficient: when the loads themselves are ordered using data dependency. This pattern comes up in lazy initialization, which is an extremely important pattern. Figure 3 shows an example of lazy initialization.
Figure 3 Lazy Initialization
// Warning: Might not work on future architectures and .NET versions;
// do not use
class LazyExample
{
  private BoxedInt _boxedInt;
  int GetInt()
  {
    BoxedInt b = _boxedInt; // Read 1
    if (b == null)
    {
      lock(this)
      {
        if (_boxedInt == null)
        {
          b = new BoxedInt();
          b._value = 42;  // Write 1
          _boxedInt = b; // Write 2
        }
      }
    }
    int value = b._value; // Read 2
    return value;
  }
}
In order for this bit of code to always return 42—even if GetInt is called from multiple threads concurrently—Read 1 must not be reordered with Read 2, and Write 1 must not be reordered with Write 2. The reads will not be reordered by the Itanium processor because they’re related via data dependency. And, the writes will not be reordered because the CLR JIT emits them as ST.REL.
Note that if the _boxedInt field were volatile, the code would be correct according to the ECMA C# spec. That’s the best kind of correct, and arguably the only real kind of correct. However, even if _boxedInt is not volatile, the current version of the compiler will ensure that the code still works on Itanium in practice.
Of course, loop read hoisting, read elimination and read introduction may be performed by the CLR JIT on Itanium, just as they are on x86-x64.
Itanium Remarks The reason Itanium is an interesting part of the story is that it was the first architecture with a weak memory model that ran the .NET Framework.
As a result, in a number of articles about the C# memory model and the volatile keyword and C#, the authors generally had Itanium in mind. After all, until the .NET Framework 4.5, Itanium was the only architecture other than the x86-x64 that ran the .NET Framework.
Consequently, the author might say something like, “In the .NET 2.0 memory model, all writes are volatile—even those to non-volatile fields.” What the author means is that on Itanium, CLR will emit all writes as ST.REL. This behavior isn’t guaranteed by the ECMA C# spec, and, consequently, might not hold in future versions of the .NET Framework and on future architectures (and, in fact, does not hold in the .NET Framework 4.5 on ARM).
Similarly, some folks would argue that lazy initialization is correct in the .NET Framework even if the holding field is non-­volatile, while others would say that the field must be volatile.
And of course, developers wrote code against these (sometimes contradictory) assumptions. So, understanding the Itanium part of the story can be helpful when trying to make sense of concurrent code written by someone else, reading older articles or even just talking to other developers.

C# Memory Model Implementation on ARM

The ARM architecture is the most recent addition to the list of architectures supported by the .NET Framework. Like Itanium, ARM has a weaker memory model than the x86-x64.
ARM Reordering Just like Itanium, ARM is allowed to freely reorder ordinary reads and writes. However, the solution that ARM provides to tame the movement of reads and writes is somewhat different from that of Itanium. ARM exposes a single instruction—DMB—that acts as a full memory barrier. No memory operation can pass over DMB in either direction.
In addition to the constraints imposed by the DMB instruction, ARM also respects data dependency, but doesn’t respect control dependency. See the “Itanium Reordering” section earlier in this article for explanations of data and control dependencies.
Compiler Behavior on ARM The DMB instruction is used to implement the volatile semantics in C#. On ARM, the CLR JIT implements a read from a volatile field using an ordinary read (such as LDR) followed by the DMB instruction. Because the DMB instruction will prevent the volatile read from getting reordered with any subsequent operations, this solution correctly implements the acquire semantics.
A write to a volatile field is implemented using the DMB instruction followed by an ordinary write (such as STR). Because the DMB instruction prevents the volatile write from getting reordered with any prior operations, this solution correctly implements the release semantics.
Just as with the Itanium processor, it would be nice to go beyond the ECMA C# spec and keep the lazy initialization pattern working, because a lot of existing code depends on it. However, making all writes effectively volatile is not a good solution on ARM because the DMB instruction is fairly costly.
In the .NET Framework 4.5, the CLR JIT uses a slightly different trick to get lazy initialization working. The following are treated as “release” barriers:
  1. Writes to reference-type fields on the garbage collector (GC) heap
  2. Writes to reference-type static fields
As a result, any write that might publish an object is treated as a release barrier.
This is the relevant part of LazyExample (recall that none of the fields are volatile):
b = new BoxedInt();
b._value = 42;  // Write 1
// DMB will be emitted here
_boxedInt = b; // Write 2
Because the CLR JIT emits a DMB instruction prior to the publication of the object into the _boxedInt field, Write 1 and Write 2 will not be reordered. And because ARM respects data dependency, the reads in the lazy initialization pattern will not be reordered either, so the code works correctly on ARM.
So, the CLR JIT makes an extra effort (beyond what’s mandated in the ECMA C# spec) to keep the most common variant of incorrect lazy initialization working on ARM.
As a final comment on ARM, note that—as on x86-x64 and Itanium—loop read hoisting, read elimination and read introduction are all legitimate optimizations as far as the CLR JIT is concerned.

Example: Lazy Initialization

It can be instructive to look at a few different variants of the lazy initialization pattern and think about how they’ll behave on different architectures.
Correct Implementation The implementation of lazy initialization in Figure 4 is correct according to the C# memory model as defined by the ECMA C# spec, and so it’s guaranteed to work on all architectures supported by current and future versions of the .NET Framework.
Figure 4 A Correct Implementation of Lazy Initialization
class BoxedInt
{
  public int _value;
  public BoxedInt() { }
  public BoxedInt(int value) { _value = value; }
}
class LazyExample
{
  private volatile BoxedInt _boxedInt;
  int GetInt()
  {
    BoxedInt b = _boxedInt;
    if (b == null)
    {
      b = new BoxedInt(42);
      _boxedInt = b;
    }
    return b._value;
  }
}
Note that even though the code sample is correct, in practice it’s still preferable to use the Lazy<T> or the LazyInitializer type.
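For comparison, here’s what the same lazily initialized value looks like with Lazy<T> (a sketch; Lazy<T> defaults to the thread-safe ExecutionAndPublication mode):
class LazyExampleWithLazyT
{
  private readonly Lazy<BoxedInt> _boxedInt =
    new Lazy<BoxedInt>(() => new BoxedInt(42));
  int GetInt()
  {
    // Value creates the BoxedInt on first access, in a thread-safe way
    return _boxedInt.Value._value;
  }
}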
Incorrect Implementation No. 1 Figure 5 shows an implementation that isn’t correct according to the C# memory model. In spite of this, the implementation will probably work on the x86-x64, Itanium and ARM in the .NET Framework. Because _boxedInt isn’t volatile, a C# compiler is permitted to reorder Read 1 with Read 2, or Write 1 with Write 2; either reordering could result in GetInt returning 0.
Figure 5 An Incorrect Implementation of Lazy Initialization
// Warning: Bad code
class LazyExample
{
  private BoxedInt _boxedInt; // Note: This field is not volatile
  int GetInt()
  {
    BoxedInt b = _boxedInt; // Read 1
    if (b == null)
    {
      b = new BoxedInt(42); // Write 1 (inside constructor)
      _boxedInt = b;        // Write 2
    }
    return b._value;        // Read 2
  }
}
However, this code will behave correctly (that is, always return 42) on all architectures in the .NET Framework versions 4 and 4.5:
  • x86-x64:
    • Writes and reads will not be reordered. There is no store-load pattern in the code, and also no reason the compiler would cache values in registers.
  • Itanium:
    • Writes will not be reordered because they are ST.REL.
    • Reads will not be reordered due to data dependency.
  • ARM:
    • Writes will not be reordered because DMB is emitted before “_boxedInt = b.”
    • Reads will not be reordered due to data dependency.
Of course, you should use this information only to try to understand the behavior of existing code. Do not use this pattern when writing new code.
Incorrect Implementation No. 2 The incorrect implementation in Figure 6 may fail on both ARM and Itanium.
Figure 6 A Second Incorrect Implementation of Lazy Initialization
// Warning: Bad code
class LazyExample
{
  private int _value;
  private bool _initialized;
  int GetInt()
  {
    if (!_initialized) // Read 1
    {
      _value = 42;
      _initialized = true;
    }
    return _value;     // Read 2
  }
}
This version of lazy initialization uses two separate fields to track the data (_value) and whether the field is initialized (_initialized). As a result, the two reads, Read 1 and Read 2, are no longer related via data dependency. Additionally, on ARM the writes might also get reordered, because neither is a write to a reference-type field and so neither gets the release barrier described in the ARM section.
As a result, this version may fail and return 0 on ARM and Itanium in practice. Of course, GetInt is allowed to return 0 on x86-x64 (and also as a result of JIT optimizations), but that behavior doesn’t seem to happen in the .NET Framework 4.5.
Incorrect Implementation No. 3 Finally, it’s possible to get the example to fail even on x86-x64. I just have to add one innocuous-looking read, as shown in Figure 7.
Figure 7 A Third Incorrect Implementation of Lazy Initialization
// WARNING: Bad code
class LazyExample
{
  private int _value;
  private bool _initialized;
  int GetInt()
  {
    if (_value < 0) throw new Exception(); // Note: extra read to get _value
                                           // preloaded into a register
    if (!_initialized)      // Read 1
    {
      _value = 42;
      _initialized = true;
      return _value;
    }
    return _value;          // Read 2
  }
}
The extra read that checks whether _value < 0 can now cause the compiler to cache the value in a register. As a result, Read 2 will get serviced from a register, and so it gets effectively reordered with Read 1. Consequently, this version of GetInt may in practice return 0 even on x86-x64.

Wrapping Up

When writing new multi-threaded code, it’s generally a good idea to avoid the complexity of the C# memory model altogether by using high-level concurrency primitives like locks, concurrent collections, tasks, and parallel loops. When writing CPU-intensive code, it sometimes makes sense to use volatile fields, as long as you only rely on the ECMA C# specification guarantees and not on architecture-specific implementation details.
