9/18/2013

09-18-13 - Per-Thread Global State Overrides

I wrote about this before ( cbloom rants 11-23-12 - Global State Considered Harmful ) but I'm doing it again because I think almost nobody does it right, so I'm gonna be really pedantic.

For concreteness, let's talk about a Log system that is controlled by bit-flags. So you have a "state" variable that is an OR of bit flags. The flags are things like where you output to (LOG_TO_FILE, LOG_TO_OUTPUTDEBUGSTRING, etc.) and maybe things like subsection enables (LOG_SYSTEM_IO, LOG_SYSTEM_RENDERER, ...) or verbosity (LOG_V0, LOG_V1, ...). Maybe some bits of the state are an indent level, etc.

So clearly you have a global state where the user/programmer has set the options they want for the log.

But you also need a TLS state. You want to be able to do things like disable the log in scopes :


..

U32 oldState = Log_SetState(0);

FunctionThatLogsTooMuch();

Log_SetState(oldState);

..

(and in practice it's nice to use a scoper-class to do that for you). If you do that on the global variable, your thread is fucking up the state of other threads, so clearly it needs to be per-thread, eg. in the TLS. (similarly, you might want to inc the indent level for a scope, or change the verbosity level, etc.).

(note of course this is the "system has a stack of states which is implemented in the program stack").
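A minimal sketch of that scoper-class, assuming C++11 thread_local as a stand-in for the per-thread storage. LogStateScoper and tls_log_state are hypothetical names for illustration, and this Log_SetState is a stub, not the real log's implementation :

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t U32;

// Stand-in per-thread state (assumption: starts with everything enabled).
static thread_local U32 tls_log_state = 0xFFFFFFFFu;

// Stub: sets the calling thread's state and returns the previous one.
static U32 Log_SetState(U32 state)
{
    U32 old = tls_log_state;
    tls_log_state = state;
    return old;
}

// Hypothetical scoper: installs a state in the constructor and
// restores the previous one in the destructor.
class LogStateScoper
{
public:
    explicit LogStateScoper(U32 state) : m_old(Log_SetState(state)) { }
    ~LogStateScoper() { Log_SetState(m_old); }

    LogStateScoper(const LogStateScoper &) = delete;
    LogStateScoper & operator=(const LogStateScoper &) = delete;

private:
    U32 m_old;
};

static void Example()
{
    U32 before = tls_log_state;
    {
        LogStateScoper quiet(0); // log disabled on this thread only
        assert(tls_log_state == 0);
        // FunctionThatLogsTooMuch() would go here
    }
    // the destructor put the old state back
    assert(tls_log_state == before);
}
```

The "stack of states" lives in the m_old members of nested scopers, so early returns and exceptions restore the state without any extra bookkeeping.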

So clearly, those scoped calls need to be Log_SetLocalState. Then the functions that are used to set the overall options should be something like Log_SetGlobalState.

Now some notes on how the implementation works.

The global state has to be thread safe. It should just be an atomic var :


static U32 s_log_global_state;

U32 Log_SetGlobalState( U32 state )
{
    // set the new state and return the old; this must be an exchange

    U32 ret = Atomic_Exchange(&s_log_global_state, state , mo_acq_rel);

    return ret;
}

U32 Log_GetGlobalState( )
{
    // probably could be relaxed but WTF let's just acquire

    U32 ret = Atomic_Load(&s_log_global_state, mo_acquire);

    return ret;
}

(note that I sort of implicitly assume that there's only one thread (a "main" thread) that is setting the global state; generally it's set by command line or .ini options, and maybe from user keys in a HUD; the global state is not being fiddled by lots of threads at program time, because that creates races. eg. if you wanted to do something like turn on the LOG_TO_FILE bit, it should be done with a CAS loop or an Atomic OR, not by doing a _Get and then _Set).
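For the "Atomic OR or CAS loop" point, here is a rough sketch using std::atomic rather than the Atomic_ wrappers above. Log_OrGlobalState and Log_ModifyGlobalState are hypothetical names, and the bit values are made up for illustration :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef uint32_t U32;

// Hypothetical bit layout, for illustration only.
static const U32 LOG_TO_FILE = 1u << 0;
static const U32 LOG_V1      = 1u << 4;
static const U32 LOG_V_MASK  = 3u << 4;

static std::atomic<U32> s_log_global_state(0);

// Atomic OR : turn bits on in one indivisible step; returns the old state.
static U32 Log_OrGlobalState(U32 bits)
{
    return s_log_global_state.fetch_or(bits, std::memory_order_acq_rel);
}

// CAS loop : replace the bits under 'mask' with 'bits', leave the rest alone.
// Returns the old state.
static U32 Log_ModifyGlobalState(U32 mask, U32 bits)
{
    U32 old_state = s_log_global_state.load(std::memory_order_acquire);
    U32 new_state;
    do
    {
        new_state = (old_state & ~mask) | (bits & mask);
        // on failure, compare_exchange_weak reloads old_state for the retry
    } while (!s_log_global_state.compare_exchange_weak(
                 old_state, new_state,
                 std::memory_order_acq_rel, std::memory_order_acquire));
    return old_state;
}
```

Either form avoids the lost update you can get from a separate _Get followed by a _Set when another thread changes the state in between.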

Now the Local functions need to store two things in the TLS : the local state bits, and *also* a mask of which bits are actually set (overridden) in the local state. So the actual function is like :


per_thread U32_pair tls_log_local_state;

U32_pair Log_SetLocalState( U32 state , U32 state_set_mask )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    // write TLS :

    tls_log_local_state = U32_pair( state, state_set_mask );

    return ret;
}

U32_pair Log_GetLocalState( )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    return ret;
}

Note obviously no atomics or mutexes are needed in per-thread functions.

So now we can get the effective combined state :


U32 Log_GetState( )
{
    U32_pair local = Log_GetLocalState();
    U32 global = Log_GetGlobalState();

    // take local state bits where they are set, else global state bits :

    U32 state = (local.first & local.second) | (global & (~local.second) );

    return state;
}
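To make the masking concrete, here is the same combine written as a pure function with a few worked cases. CombineState is a hypothetical helper and the bit values are invented for illustration; only the expression itself comes from Log_GetState above :

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t U32;

// Hypothetical bit layout, for illustration only.
static const U32 LOG_TO_FILE              = 1u << 0;
static const U32 LOG_TO_OUTPUTDEBUGSTRING = 1u << 1;
static const U32 LOG_TO_MASK              = LOG_TO_FILE | LOG_TO_OUTPUTDEBUGSTRING;
static const U32 LOG_SYSTEM_IO            = 1u << 8;
static const U32 LOG_SYSTEM_RENDERER      = 1u << 9;

// Take local bits where the mask says they are set, else global bits.
static U32 CombineState(U32 local_bits, U32 local_mask, U32 global)
{
    return (local_bits & local_mask) | (global & ~local_mask);
}

static void Example()
{
    const U32 global = LOG_TO_FILE | LOG_SYSTEM_IO | LOG_SYSTEM_RENDERER;

    // No local override : mask is zero, so the global state passes through.
    assert(CombineState(0, 0, global) == global);

    // Mute all outputs on this thread : override the LOG_TO bits with zeros.
    // The system bits still come from the global state.
    U32 muted = CombineState(0, LOG_TO_MASK, global);
    assert(muted == (LOG_SYSTEM_IO | LOG_SYSTEM_RENDERER));

    // Redirect this thread to the debugger only.
    U32 redirected = CombineState(LOG_TO_OUTPUTDEBUGSTRING, LOG_TO_MASK, global);
    assert(redirected == (LOG_TO_OUTPUTDEBUGSTRING | LOG_SYSTEM_IO | LOG_SYSTEM_RENDERER));
}
```

Because the mask is per-bit, a thread can override just the output bits and still pick up later changes to the global verbosity or subsystem bits.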

So internally to the log's operation you start every function with something like :

static bool NoState( U32 state )
{
    // if all outputs or all systems are turned off, no output is possible
    return ((state & LOG_TO_MASK) == 0) ||
        ((state & LOG_SYSTEM_MASK) == 0);
}

void Log_Printf( const char * fmt, ... )
{
    U32 state = Log_GetState();

    if ( NoState(state) )
        return;

    ... more here ...

}

So note that up to the "... more here ..." we have not taken any mutexes or in any way synchronized the threads against each other. So when the log is disabled we just exit there before doing anything painful.

Now the point of this post is not about a log system. It's that you have to do this any time you have global state that can be changed by code (and you want that change to only affect the current thread).

In the more general case you don't just have bit flags, you have arbitrary variables that you want to be per-thread and global. Here's a helper struct to do a global atomic with thread-overridable value :

            
struct tls_intptr_t
{
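    DWORD m_index; // TlsAlloc returns a DWORD
    
    tls_intptr_t()
    {
        m_index = TlsAlloc();
        ASSERT( m_index != TLS_OUT_OF_INDEXES ); // TlsAlloc can fail
        ASSERT( get() == 0 ); // freshly allocated slots start out zero
    }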
    
    intptr_t get() const { return (intptr_t) TlsGetValue(m_index); }

    void set(intptr_t v) { TlsSetValue(m_index,(LPVOID)v); }
};

struct intptr_t_and_set
{
    intptr_t val;
    intptr_t set; // bool ; is "val" set
    
    intptr_t_and_set(intptr_t v,intptr_t s) : val(v), set(s) { }
};
    
struct overridable_intptr_t
{
    atomic<intptr_t>    m_global;
    tls_intptr_t    m_local;    
    tls_intptr_t    m_localset;
        
    overridable_intptr_t(intptr_t val = 0) : m_global(val)
    {
        ASSERT( m_localset.get() == 0 );
    }       
    
    //---------------------------------------------
    
    intptr_t set_global(intptr_t val)
    {
        return m_global.exchange(val,mo_acq_rel);
    }
    intptr_t get_global() const
    {
        return m_global.load(mo_acquire);
    }
    
    //---------------------------------------------
    
    intptr_t_and_set get_local() const
    {
        return intptr_t_and_set( m_local.get(), m_localset.get() );
    }
    intptr_t_and_set set_local(intptr_t val, intptr_t set = 1)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(set);
        if ( set )
            m_local.set(val);
        return old;
    }
    intptr_t_and_set set_local(intptr_t_and_set val_and_set)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(val_and_set.set);
        if ( val_and_set.set )
            m_local.set(val_and_set.val);
        return old;
    }
    intptr_t_and_set clear_local()
    {
        intptr_t_and_set old = get_local();
        m_localset.set(0);
        return old;
    }
    
    //---------------------------------------------
    
    intptr_t get_combined() const
    {
        intptr_t_and_set local = get_local();
        if ( local.set )
            return local.val;
        else
            return get_global();
    }
};

//=================================================================         

// test code :  

static overridable_intptr_t s_thingy;

int main(int argc,char * argv[])
{
    (void) argc; (void) argv; // unused
    
    s_thingy.set_global(1);
    
    s_thingy.set_local(2,0);
    
    ASSERT( s_thingy.get_combined() == 1 );
    
    intptr_t_and_set prev = s_thingy.set_local(3,1);
    
    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_global(2);
    
    ASSERT( s_thingy.get_combined() == 3 );
    
    s_thingy.set_local(prev);
    
    ASSERT( s_thingy.get_combined() == 2 );
        
    return 0;
}

Or something.

Of course this whole post is implicitly assuming that you are using the "several threads that stay alive for the length of the app" model. An alternative is to use micro-threads that you spin up and down, and rather than inheriting from a global state, you would want them to inherit from the spawning thread's current combined state.
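A rough sketch of that inherit-from-spawner alternative, assuming std::thread and thread_local instead of the Win32 TLS helper above. SpawnInheriting, GetCombined and SetLocal are hypothetical names, and a real job system would do this in its own thread-start hook :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

static std::atomic<intptr_t> g_value(1);      // global default
static thread_local intptr_t t_value = 0;     // per-thread override
static thread_local bool t_value_set = false; // is the override active

static intptr_t GetCombined()
{
    return t_value_set ? t_value : g_value.load(std::memory_order_acquire);
}

static void SetLocal(intptr_t v)
{
    t_value = v;
    t_value_set = true;
}

// Start 'work' on a new thread whose local override is the spawner's
// combined value at the moment of the spawn.
template <typename F>
static std::thread SpawnInheriting(F work)
{
    intptr_t inherited = GetCombined(); // read on the spawning thread
    return std::thread([inherited, work]() mutable
    {
        SetLocal(inherited);            // written on the new thread
        work();
    });
}
```

The value is captured by copy on the spawning thread, so the child never touches the parent's TLS, and later changes in the parent don't leak into a child that is already running.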

09-18-13 - Fast TLS on Windows

For the record; don't use this blah blah unsafe unnecessary blah blah.


extern "C" DWORD __cdecl FastTlsGetValue_x86(int index)
{
  __asm
  {
    mov     eax,dword ptr fs:[00000018h]
    mov     ecx,index

    cmp     ecx,40h // 40h = 64
    jae     over64  // Jump if above or equal 

    // return Teb->TlsSlots[ dwTlsIndex ]
    // +0xe10 TlsSlots         : [64] Ptr32 Void
    mov     eax,dword ptr [eax+ecx*4+0E10h]
    jmp     done

  over64:   
    mov     eax,dword ptr [eax+0F94h]
    mov     eax,dword ptr [eax+ecx*4-100h]

  done:
  }
}

DWORD64 FastTlsGetValue_x64(int index)
{
    if ( index < 64 )
    {
        return __readgsqword( 0x1480 + index*8 );
    }
    else
    {
        DWORD64 * table = (DWORD64 *)  __readgsqword( 0x1780 );
        return table[ index - 64 ];
    }
}

the ASM one is from nynaeve originally. ( 1 2 ). I'd rather rewrite it in C using __readfsdword but haven't bothered.

Note that these may cause a bogus failure in MS App Verifier.

Also, as noted many times in the past, you should just use the compiler __declspec thread under Windows when that's possible for you. (eg. you're not in a DLL pre-Vista).
