
02-24-09 - Low Level Threading - Part 3.1

Atomic RMW and Hardware

The next building block we need is some kind of atomic RMW (read-modify-write). We've talked about getting atomic loads & stores and memory ordering, but we also need a way to change a value and know that we are changing the value we originally read.

The classic example is two threads trying to increment a shared counter :

Atomic< int > g_counter = 0;

// thread1 :
g_counter ++;

// thread2 :
g_counter ++;

After this runs, g_counter might be 1 or 2 !! The problem is what ++ is actually doing :

g_counter ++ :

temp = load g_counter
temp ++;
store g_counter , temp

Now you should be able to see that both threads can load g_counter at 0, both can increment to 1 in their temp vars, then both store a 1. To fix this we could take a lock :

lock(g_counter)
temp = load g_counter
temp ++;
store g_counter , temp
unlock(g_counter)

And that is exactly the Atomic read-modify-write (RMW) that we need. If we use this for g_counter++ then it will always be = 2 after our program runs.

Fortunately almost all hardware provides some kind of cheap atomic micro-lock mechanism to do this. On x86 you have the "lock prefix" instructions (the Interlocked intrinsics directly correspond to lock instructions for the most part). On Power you have load-linked/store-conditional in the form of lwarx/stwcx. On Power you are almost literally writing a micro-lock and unlock on the cache line. On x86 the lock is more implicit - it's taken for the duration of the op, but it pretends to be a simple instruction. A bit more on this later.
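
For concreteness, here's what the atomic increment looks like through the Win32 intrinsics (a minimal sketch; on x86 InterlockedIncrement compiles down to a lock-prefixed instruction) :

#include <windows.h>

volatile LONG g_counter = 0;

void IncrementCounter()
{
    // one atomic RMW ; the CPU holds the cache line for the duration :
    InterlockedIncrement( &g_counter );
}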

The most powerful and useful atomic RMW is probably the CAS (Compare and Swap). I'm going to write CAS using the C++0x style since that is becoming more and more standard around the web and in the literature :


// T must be Atomic
template <typename T>
bool CAS(T * pVar, T & oldVal, T newVal)
{
    lock(pVar);  // "micro-lock" the variable
    
    bool match = (*pVar == oldVal);
    oldVal = *pVar;    // oldVal receives the value we actually saw
    if ( match )
        *pVar = newVal;

    unlock(pVar);

    return match;
}

CAS checks if *pVar == oldVal, and if it is, then it stuffs in newVal. Note that oldVal is a *reference* and CAS also reads the current value of *pVar into it. (I really despise that style of coding but it's the way the C++0x standard CAS is defined and you will see people rely on that behavior, so be aware).
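
In actual C++0x that's std::atomic<T>::compare_exchange_strong, and you can see the reference-update behavior in action (a minimal sketch) :

#include <atomic>

std::atomic<int> g_value(5);

void Example()
{
    int expected = 7;
    // fails because g_value is 5 , and writes the observed 5 into expected :
    bool ok = g_value.compare_exchange_strong( expected, 10 );
    // ok == false , expected == 5 now
    
    // retry with the value we just observed ; this one succeeds :
    ok = g_value.compare_exchange_strong( expected, 10 );
    // ok == true , g_value == 10
}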

Actually C++0x defines CAS a little differently; there is a "weak" variant (compare_exchange_weak) that can have spurious failures. Basically that means that the micro-lock on the variable can fail and CAS can return false even though *pVar was == oldVal :


template <typename T>
bool CAS(T * pVar, T & oldVal, T newVal)
{
    if ( ! trylock(pVar) )
        return false;   // spurious failure ; note oldVal is not updated
    
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    

    unlock(pVar);

    return match;
}

On x86 that doesn't really make any sense, but on architectures where the RMW is implemented with a load-linked/store-conditional loop it gives them the flexibility to fail - eg. the store-conditional can fail if the cache line reservation was lost, even if the value still matched.

Note that the arguments are in the opposite order of InterlockedCompareExchange which can be a bit confusing.
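
Just to make the argument order concrete, here's the Win32 call next to our CAS convention (a sketch; the function and variable names here are mine) :

#include <windows.h>

volatile LONG g_dest = 0;

bool TryPublish( LONG oldVal, LONG newVal )
{
    // Win32 takes (dest, new, old) and returns the value it saw ;
    // the swap happened iff the return value == oldVal :
    LONG seen = InterlockedCompareExchange( &g_dest, newVal, oldVal );
    return ( seen == oldVal );
}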

CAS is super useful and can be used to implement almost anything. For example, CAS could be used to implement increment :

void AtomicIncrement( Atomic * pVar )
{
    Atomic oldVal, newVal;
    do
    {
        oldVal = Load( pVar );
        newVal = oldVal + 1;
    } while ( ! CAS(pVar,oldVal,newVal) );
}

Note that I failed to take advantage of the fact that CAS is reading oldVal for me in the second argument, really this should be :

void AtomicIncrement( Atomic * pVar )
{
    Atomic oldVal = Load( pVar );
    Atomic newVal;
    do
    {
        newVal = oldVal + 1;
    } while ( ! CAS(pVar,oldVal,newVal) );  // CAS reloads oldVal on failure
}

When CAS fails, oldVal is reloaded with the current value, we add 1 to that and try again. Obviously this is a crazy way to implement increment, the hardware probably has a direct atomic increment, this just shows the power of CAS, and also why C++0x made it reload oldVal in the second arg, it makes stuff like this very handy.
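
The same loop in actual C++0x atomics would use the weak CAS, since a spurious failure just means one more trip around the loop (a minimal sketch) :

#include <atomic>

void AtomicIncrement( std::atomic<int> * pVar )
{
    int oldVal = pVar->load();
    // compare_exchange_weak reloads oldVal on failure , exactly as above :
    while ( ! pVar->compare_exchange_weak( oldVal, oldVal + 1 ) )
    {
        // empty ; oldVal was refreshed , just try again
    }
}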

Note that I haven't been talking about the memory order semantics here. The reason is that the client algorithm needs to specify the memory order semantics that are needed for that operation. CAS itself does not have a memory order requirement, you might want to CAS_Acquire or CAS_Release or CAS_Acq_Rel or CAS_Sequential. They all have their uses depending on what the actual operation needs, and they enforce ordering of memory ops outside the CAS against the CAS (which is treated as one op).

The other really basic operation that we will use a lot is "Exchange". Exchange is almost always a lot faster than CAS. It just reads the old value and stuffs the new value atomically.
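
In C++0x terms that's just std::atomic<T>::exchange (a sketch) :

#include <atomic>

std::atomic<int> g_item(0);

int GrabAndReplace()
{
    // atomically write 1 and get back whatever was there before ,
    // with no compare and no way to fail :
    return g_item.exchange( 1 );
}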

We are now ready to implement our own simple mutex :


// 0 is unlocked, 1 is locked
typedef Atomic< int > SimpleMutex; 

void SimpleLock(SimpleMutex * mut)
{    
    SpinBackOff backoff;
    
    int expected = 0;
    while( ! CAS_Sequential(mut,expected,1) )
    {
        // mut was not 0 ; spin/yield/backoff and try again
        expected = 0;   // CAS wrote the value it saw into expected, reset it
        backoff.BackOffYield();
    }
    // mut is now 1 and we hold the lock
}
    
void SimpleUnlock(SimpleMutex * mut)
{
    // *mut should now be 1
    StoreRelease(mut,0);
}

(there are a lot of possible memory semantics you might want for your mutex, this is only one possibility; there are subtle reasons that you want mutex locks to be globally sequential and not just acquire, which I won't go into)
(obviously this isn't reentrant or any of that stuff, and I'm not suggesting you use this instead of critsec - critsec is better in almost every way, this is just an example)
(backoff is a little helper struct I use that spins a while then sleeps with increasing intervals).
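
For completeness, here's a minimal sketch of what a backoff helper like that might look like - the structure is the real idea, but the thresholds and intervals here are made-up placeholders :

#include <windows.h>

struct SpinBackOff
{
    int m_count;
    
    SpinBackOff() : m_count(0) { }
    
    void BackOffYield()
    {
        m_count++;
        if ( m_count < 10 )
            YieldProcessor();       // spin ; the "pause" instruction on x86
        else if ( m_count < 20 )
            SwitchToThread();       // give up our time slice
        else
            Sleep( m_count - 19 );  // sleep with increasing interval
    }
};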

Sequential memory semantics means that memory operations cannot move past the sequential op in either direction, and ALSO that there is a global total ordering. That is, if we have three threads doing this :

// thread 1
Store(g_global,1)

// thread 2
Store(g_global,2)

// thread 3
Store(g_global,3)

And each thread runs on its own processor, then processor 1 might see g_global be 2 then 3 then 1. Processor 2 might see g_global be 3 then 2 then 1. Obviously that makes no sense in a total order, but the memory ops are allowed to be seen by the processors in different orders as long as they satisfy the constraints placed on them.

If we make all those stores into StoreSequential, then all processors will see the value change in the same order. They might see the changes at different times but they see them in the same order. Think about it like a room full of people. Everybody needs to tell everyone else their name. If you just let everyone go and tell their name to other people, then each individual will hear all the others' names in very different order. eg. Mike might hear Alex then Andy then Dave then Chris. Dave might hear Alex then Mike then Chris then Andy. The total order basically means first one person raises their hand, then they say their name to everyone, then the next person raises their hand. Thus everybody hears each name in the same order.

Obviously Sequential ops are a lot more expensive because they require all the processors in the system to agree on a sync point that they will not reorder other sequential operations across.
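
In C++0x terms, StoreSequential is a store with memory_order_seq_cst (which is also the default order) - a sketch :

#include <atomic>

std::atomic<int> g_global(0);

void Thread1() { g_global.store( 1, std::memory_order_seq_cst ); }
void Thread2() { g_global.store( 2, std::memory_order_seq_cst ); }
void Thread3() { g_global.store( 3, std::memory_order_seq_cst ); }

// now any threads doing seq_cst loads of g_global will all observe
// these three stores in the same total order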

Another common memory ordering we haven't really talked about is "Acq_Rel" or Acquire+Release. This means that later loads can't move up before it, and earlier stores can't move down past it, but earlier loads can move down past it and later stores can move up before it. Acq_Rel also doesn't imply a total order. You often want to use Acq_Rel when you are creating a sync point that does both a read and a publish, but doesn't need full Sequential order. For example :


Load(a); // can move down
Store(b); // cannot move down

CAS_Acq_Rel( stuff );

Load(c); // cannot move up
Store(d); // can move up


// so it could change to this :
// (assuming a,b,c,d don't overlap)

Store(b);
Store(d);

CAS_Acq_Rel( stuff );

Load(a);
Load(c);

Acq_Rel vs. Sequential is sort of like lwsync vs. hwsync on Power (though not exactly, I don't think). Acq_Rel is useful when I am trying to use CAS to both publish something and to get some values.
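
In C++0x you spell that order right on the CAS - a sketch (compare_exchange takes a success order and a separate failure order, and the failure order can't be a release) :

#include <atomic>

std::atomic<int> g_stuff(0);

bool PublishAndFetch( int & expected, int desired )
{
    // acq_rel on success : acts as an acquire for the values we read
    // and a release for the values we published before this ;
    // plain acquire on failure , since we didn't store anything :
    return g_stuff.compare_exchange_strong( expected, desired,
                        std::memory_order_acq_rel,
                        std::memory_order_acquire );
}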

Okay, I think that's the majority of the tools we need. I'm going to talk a bit about what actually happens on x86, but you can just ignore this part if you want.

We already talked last time about how x86 is automatically acquire & release, that is, there are #LoadLoad and #StoreStore barriers everywhere - but acquire and release say nothing about how loads move against stores (that would be #LoadStore and #StoreLoad barriers - the notation is that #XY means X must occur before Y). Well, it turns out x86 also automatically has #LoadStore ordering. The x86 multiprocessor model was designed to make single-proc code "just work" and they did a fuck of a good job of it. The crucial thing is that stores coming out of any core come out in the order of the program code; obviously the #StoreStore barriers everywhere enforce that. What it all means is that stores must be done at the very end of the x86 execution pipe and retired in order. Thus there's no way for a load to move after a store - in fact we've seen that loads can get executed speculatively, but there's no way to execute a store speculatively, because that would invalidate the order coming out of the chip.

The remaining barrier that we don't have is #StoreLoad , to keep loads after stores. Note that #StoreLoad is automatic for access to the same variable (and doesn't cause a Load-Hit-Store stall on x86, because we have the awesome thing where the load can peek into the write buffer for the new value), however it is not automatic for *different* variables. eg. :

g_var1 = a;
...
b = g_var2;

The read of g_var2 can move before the write to g_var1. As I've said before, this is *good* - you want your processor to be able to reorder things. You just need to be able to tell it when not to. On x86 the only way to make a #StoreLoad barrier is with an mfence or any interlocked op.
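
For example, with the SSE2 fence intrinsic (a sketch; the function name is mine) :

#include <emmintrin.h>

volatile int g_var1 = 0;
volatile int g_var2 = 0;

int StoreThenLoad( int a )
{
    g_var1 = a;
    
    _mm_mfence();   // #StoreLoad : the store above must be visible
                    // before the load below is allowed to happen
    
    return g_var2;  // without the fence this load could pass the store
}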

Interlocked ops on x86 are always sequentially consistent. That makes coding easier, but means they are very expensive. (Itanium has some Acquire & Release interlocked ops ; presumably future Intel stuff will have some weaker memory model instructions). Basically to execute a "lock" instruction the x86 CPU does the pseudo code that I wrote above for CAS. First it goes out to the bus and reserves a cache line (x86 interlocks actually hold whole cache lines which makes any op inside a cache line atomic) then it runs the cmpxchg instruction or whatever you wanted, then it releases the cache line.

When it reserves the cache line, all other processors who try to acquire that line are blocked and might stall. When it releases the cache line, all processors that had work based on that line in flight must dump it. That is, any speculative reads and all computations based on them must be invalidated from the deep execution pipe of modern x86 chips.

On modern multi-core chips with shared cache this is bad, but it's not awful. On SMP machines where the cache line reservation has to go out to bus, it's very bad indeed. And what's really awful is that a CAS on one chip doesn't just slow down that chip - it syncs the bus, and it *really* slows things down if anybody else is touching that cache line.

Even though Interlocked CAS is "lock free" it is by no means cheap, and the people who write really efficient lock free stuff try to do so even without CAS.

Note that the crucial issue here was *contention* over a *cache line*. That is a common theme and is the main thing that you should actually worry about. If you have some shared data item which is heavily contended (lots of threads need to hit it often), then changing it from a CritSec to an Interlocked CAS will help some, but the CPUs are still going to be stalling on each other like crazy. You need to reduce the contention over single data items. You also need to worry about so-called "false sharing". Any data items on the same cache line count for contention, so even though you were a good boy and made your threads work on separate variables, if they are on the same cache line you still pay much of the price, because they will sit and fight over that cache line over and over.
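
The standard fix is to pad hot per-thread data out to its own cache line. A minimal sketch (64 bytes is the line size on current x86 chips; the names and thread count are made up) :

// pad each thread's counter to a full cache line so the threads
// don't fight over one line :
#define CACHE_LINE_SIZE  64
#define NUM_THREADS      4

struct PaddedCounter
{
    volatile long m_count;
    char m_pad[ CACHE_LINE_SIZE - sizeof(long) ];
};

// without the padding these would share cache lines and contend
// even though no data is actually shared ;
// (ideally you'd also align the array itself to the line size) :
PaddedCounter g_counters[ NUM_THREADS ];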

Okay, that's enough for basics. Next time we'll look at some techniques for lock free algorithms.
