5/20/2013

05-20-13 - Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. ("MMO" here means not just MMOs, but any central game server with a very large number of clients that cares about total bandwidth.)

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.

I'm going to assume that you're using something like TCP so you have guaranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between them, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant; if you just compare an encoder running adaptively vs the same encoder with a static model, that's not fair, because the static model is so much faster that you could afford a more sophisticated encoder in the same CPU budget. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better; those have a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.
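
For concreteness, here's a tiny sketch of that switching idea (the names are mine and the entropy coder is elided). Each static model is just a trained order-0 byte distribution, and you pick the one that gives the smallest estimated codelen :

#include <stddef.h>
#include <stdint.h>
#include <math.h>

struct StaticModel
{
    double prob[256]; // trained offline; all entries > 0, summing to ~1
};

// estimated code length in bits of one packet under one static model :
double EstimateBits( const StaticModel & m, const uint8_t * packet, size_t size )
{
    double bits = 0;
    for ( size_t i = 0; i < size; i++ )
        bits -= log2( m.prob[ packet[i] ] );
    return bits;
}

// pick the cheapest of 4 trained models; send the 2-bit selector,
// then code the packet with that static model :
int PickModel( const StaticModel models[4], const uint8_t * packet, size_t size )
{
    int best = 0;
    double bestBits = EstimateBits( models[0], packet, size );
    for ( int i = 1; i < 4; i++ )
    {
        double bits = EstimateBits( models[i], packet, size );
        if ( bits < bestBits ) { bestBits = bits; best = i; }
    }
    return best;
}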

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.
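
A toy sketch of that shape (the names and the fixed 3:1 blend are mine; a real CM would mix in logistic domain with trained weights) : the big context table is read-only and shared by every channel, and only the tiny order-0 model adapts per channel :

#include <stdint.h>

struct StaticCM
{
    // trained offline : P(bit=1) in 12-bit fixed point, indexed by a
    // context hash; read-only at runtime so it can be huge and shared
    const uint16_t * contextP1;
    uint32_t hashMask; // (tableSize - 1)
};

struct ChannelState
{
    uint16_t order0P1; // per-channel adaptive P(bit=1), 12-bit
    ChannelState() : order0P1(2048) { }
};

// predict one bit as a fixed 3:1 blend of static and per-channel models :
inline uint32_t PredictBit( const StaticCM & cm, const ChannelState & ch,
                            uint32_t contextHash )
{
    uint32_t pStatic = cm.contextP1[ contextHash & cm.hashMask ];
    return ( 3*pStatic + ch.order0P1 ) >> 2; // 12-bit P(bit=1)
}

// after coding each bit, only the small per-channel part adapts :
inline void UpdateBit( ChannelState & ch, int bit )
{
    if ( bit ) ch.order0P1 += (4096 - ch.order0P1) >> 5;
    else       ch.order0P1 -= ch.order0P1 >> 5;
}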

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, and a node's probability (relative to its siblings) is proportional to the number of leaves under it.
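
A sketch of what that reference coding could look like (the types are mine; the trie construction and the actual arithmetic coder are elided). Each child of a node gets a coding interval proportional to its leaf count, which is exactly the {low, frequency, total} triple an arithmetic coder consumes :

#include <stdint.h>
#include <vector>

struct TrieNode
{
    std::vector<TrieNode *> children;
    uint32_t leafCount; // leaves under this node = occurrences of its string
};

// cumulative-frequency interval [lo,lo+freq) out of total for one child;
// cost of coding this step is ~ log2(total/freq) bits :
struct CodeRange { uint32_t lo, freq, total; };

CodeRange ChildCodeRange( const TrieNode & parent, int childIndex )
{
    CodeRange r = { 0, 0, 0 };
    for ( size_t i = 0; i < parent.children.size(); i++ )
    {
        uint32_t c = parent.children[i]->leafCount;
        if ( (int)i <  childIndex ) r.lo += c;
        if ( (int)i == childIndex ) r.freq = c;
        r.total += c;
    }
    return r;
}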

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compress using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural : you just mix the result of the static model and the local adaptive model.
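
For the LZ77 flavor, a sketch of the split offset space (the constants are mine) : offsets below 64k refer to the channel's local adaptive window, and everything above maps into the shared static dictionary :

#include <stdint.h>
#include <assert.h>

const uint32_t kLocalWindowSize = 1u<<16;                      // 64k per-channel window
const uint32_t kStaticDictSize  = (1u<<22) - kLocalWindowSize; // rest of the 4M offset space

struct MatchRef { bool isStatic; uint32_t pos; };

inline uint32_t EncodeOffset( const MatchRef & m )
{
    if ( ! m.isStatic ) { assert( m.pos < kLocalWindowSize ); return m.pos; }
    assert( m.pos < kStaticDictSize );
    return kLocalWindowSize + m.pos;
}

inline MatchRef DecodeOffset( uint32_t off )
{
    MatchRef m;
    m.isStatic = ( off >= kLocalWindowSize );
    m.pos = m.isStatic ? off - kLocalWindowSize : off;
    return m;
}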

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model preconditioning means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, high memory use per channel, and it doesn't allow you to use the more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).
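
For concreteness, here's roughly what that looks like with zlib (a sketch; error checking elided) : each channel gets its own z_stream whose window is preloaded with a trained dictionary via deflateSetDictionary, and each packet is flushed with Z_SYNC_FLUSH so the receiver can decode it immediately. The per-channel z_stream is exactly the memory and startup cost complained about above :

#include <zlib.h>
#include <string.h>

struct ChannelEncoder
{
    z_stream strm;

    void Init( const unsigned char * dict, unsigned dictLen )
    {
        memset( &strm, 0, sizeof(strm) );
        deflateInit( &strm, Z_BEST_SPEED );
        // preload the adaptive LZ window with trained static data :
        deflateSetDictionary( &strm, dict, dictLen );
    }

    // compress one packet; Z_SYNC_FLUSH ends output on a byte boundary
    // so the decoder can use it without waiting for more input :
    unsigned EncodePacket( const unsigned char * in, unsigned inLen,
                           unsigned char * out, unsigned outCap )
    {
        strm.next_in   = (Bytef *) in;
        strm.avail_in  = inLen;
        strm.next_out  = out;
        strm.avail_out = outCap;
        deflate( &strm, Z_SYNC_FLUSH );
        return outCap - strm.avail_out;
    }
};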

I believe if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go (rather than trying to track the different received and transmitted states to use for a dynamic model). Also, if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history, so a static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state), you would use a totally different scheme for the large transfers.

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.

5/08/2013

05-08-13 - A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec
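
To make the code below easier to read, here's a minimal sketch of the bit packing those ops imply (the widths and shifts are illustrative, and handle and state share one 64-bit type for simplicity; the real layout is in the zip, and a shipping version might use a 32-bit handle with fewer guid bits). A handle packs {guid, slot index} and the per-slot state word packs {guid, refcount}, so one atomic op on the state can both check the guid and adjust the refcount :

    #include <stdint.h>

    typedef uint64_t handle_type;

    const int handle_index_bits = 16;  // -> 64k slots
    const int handle_guid_shift = 32;  // guid lives in the high 32 bits
    const handle_type c_num_slots = (handle_type)1 << handle_index_bits;
    const handle_type state_max_refcount = ((handle_type)1 << handle_guid_shift) - 1;

    // handle = { guid , slot index } :
    inline handle_type handle_get_index( handle_type h ) { return h & (c_num_slots-1); }
    inline handle_type handle_get_guid ( handle_type h ) { return h >> handle_guid_shift; }

    // state = { guid , refcount } ; refcount is in the low bits so that
    // state+1 / state-1 are refcount inc/dec that can't touch the guid
    // (the asserts below guard against refcount overflow into the guid) :
    inline handle_type state_get_refcount( handle_type s ) { return s & state_max_refcount; }
    inline handle_type state_get_guid    ( handle_type s ) { return s >> handle_guid_shift; }
    inline handle_type make_state( handle_type guid, handle_type rc )
        { return ( guid << handle_guid_shift ) | rc; }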

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        (void) state_prev; // silence unused-variable warnings when asserts compile out
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == old_state + (1<<handle_guid_shift)
  
            if ( s->m_state.compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thingy * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::good_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enhancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle and a 64-bit state word, then you have 32 bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically take the lock and inc the refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.
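
A sketch of that idea, reusing the table's helpers from above; the field layout and names here are illustrative, not Oodle's actual packing (the refcount field gives up two of its bits to a status field) :

    // hypothetical state layout : [ guid : 32 | status : 2 | refcount : 30 ]
    enum Status { Status_Invalid, Status_Pending, Status_Done, Status_Error };

    inline Status state_get_status( handle_type s )
    {
        return (Status)( ( s >> 30 ) & 3 );
    }

    // query object status from a weak ref : one atomic load, no pointer
    // chase, no ref inc/dec :
    Status QueryStatus( handle_type h )
    {
        Slot * s = &s_slots[ handle_get_index(h) ];
        handle_type state = s->m_state.load(mo_acquire);
        if ( state_get_guid(state) != handle_get_guid(h) )
            return Status_Invalid; // weak ref is stale; object is gone
        return state_get_status(state);
    }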

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }

5/02/2013

05-02-13 - Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use the real C++0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a C++0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};
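
Typical usage is one producer thread and one consumer thread (that's the contract; any more of either and it's not correct). A trivial sketch, using std::thread for brevity (on an old compiler stuck with "cblibCpp0x.h" you'd use whatever thread primitive you have) :

#include <thread>

void spsc_example()
{
    lf_spsc_fifo_t<int> fifo;

    std::thread producer( [&]() {
        for ( int i = 0; i < 100; i++ )
            fifo.push(i);
    } );

    std::thread consumer( [&]() {
        int x;
        for ( int got = 0; got < 100; )
            if ( fifo.pop(&x) ) got++; // busy-spin; real code would wait or back off
    } );

    producer.join();
    consumer.join();
}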

Download : cbliblf.zip
