On LZ Optimal Parsing

So the LZ-Huffman I've done for RAD/Oodle does some new stuff with optimal parsing. I want to write up what I'm doing so I remember it, and I'm also going to try to write about the semi-optimal parsing that a lot of people are doing now.

First let's review a bit. In LZ coding you have a lot of choices about how you code things. Greedily choosing the longest match at each point dOpoes not produce the best code stream. There is a very good ancient paper by Storer and Szymanski that introducing optimal parsing, and their method is both fast and 100% optimal, but it is only "optimal" in the sense of coding the fewest number of literals possible. See my old notes on Storer-Szymaski optimal parsing . Now, for the old LZSS coder that used fixed bit coding (8 bit literals, 24 bit matches, or whatever), that optimality is also the same as code length optimality. But in reality what we care about is code length optimality - we want to produce the shortest possible file - which is not necessarilly the same as coding the fewest possible literals.

There are sort of two interesting domains of modern LZ coders, and optimal parsing matters a whole lot to both of them. One domain is the super fast, super simple LZ coder like my LZH. Optimal parsing is important to LZH because the matches are so important to compression, you really want to get the best matches you can. The other domain is something like LZMA or BALZ (an ROLZ) that does a lot of context coding of literals. In those cases, the context coder for literals is so good that you really want to only code good matches, and often the literals will be so cheap you want to send them as literals; also the cost of literals and matches is highly variable for those coders.

Optimal parsing is a path search in a graph. For game developers you can think about A* as I talk about this. You start yourself at the head of the file, and your goal is to walk to the end of the file in the shortest possible path. From each node in the graph, your exit paths are the possible coding operations at that point in the file, eg. write a literal, write match X of length Y. Each coding operation has a certain cost which is equal to the number of bits output if you do that coding operation, and then you step along that link to the next point in the file as specified by the match length.

Note that with the original LZSS we only needed to consider either matching or not matching at each position in the file, since any time you wrote a match the best thing to do was to write the maximum match length. With true code-optimal parsing, you have to consider all possible match offsets and all possible match lengths at each location (!!). This is a huge number of coding choices, but we can rule out almost all of them. (more on this later).

So, the brute force retarded optimal parser should be obvious at this point. Start with an empty code stream and push position 0 on the stack. Repeat forever : pop the stack. Try all coding operations at the position specified in the stack. Push all of those coding operations on the stack with the code stream they made, and the position of the end of the match. This works fine and produces a 100% optimal parse, but is insanely slow. If you have N coding choices at each position it makes N^L operations for a length L file.

With adaptive coding (eg. you're using an adaptive entropy coder for literals matches and offsets), you cannot get a 100% optimal parse unless you do this full search, because it does not have the optimal subgraph property. That is if you have the optimal path from A to C, and that path goes through B, that optimal path is not the same as taking he optimal path from A to B and the optimal path from B to C. To speed things up we are going to start making approximations, and we are going to generally assume a looser version of the optimal subgraph property. What this means in terms of LZ coding is that if you are at a certain position in the file, the best way to code the whole file does not necessarily include the best way to code from {begin} to the current position.

Let's revisit the original LZSS parse now. LZSS does no entropy coding, so it doesn't matter how you get to a given byte of the stream, you will code the remainder in the same way. This is obviously a "Dynamic Programming" problem. You have shared sub-portions of the problem that can be collapsed by storing their result in a table. The optimal path from point P to the end is the same regardless of how you get to P. You could now proceed by doing the brute force approach and search depth-first, and each time you find a path, you store all the sub-paths that you visited, and then you reuse them instead of recomputing them. Or you can do it a more elegant way which is identical, which is to walk backwards through the file and at each position P store the optimal path from the P to the end. When you find the optimal path at P obviously all the future subgraphs that you might need are already done because you're going backwards. So, we now think of the LZSS optimal parse as a way of solving the graph search using dynamic programming.

The case of a static entropy coder is simpler so let me focus on that first. Offsets and lengths and literals are written in some # of bits that is known in advance and we wish to make the shortest possible code stream. We can do a sort of LZSS backwards parse to do this quickly. It's not as simple as LZSS because we cannot assume that the best match choice at each step is the longest one.

Optimal Parser for Static Statistics :

We're going to walk backwards like LZSS. At each position already considered (future positions in the file), we have already found the optimal choice at that spot. We have various coding choices at the current position, either literal or various {offset,length} possibilities. The total cost of each of those coices is : {Cost to code the current choice} + {Table cost at current pos + match length} , that is the already computed cost to finish the walk after you step match length.

It should be obvious that we can solve this just by considering every coding operation possible at each position and walk backwards. That gives us a total cost of O(N*L) , which is a lot better than O(N^L) , but it's still slow because N is very large. So let's start making approximations.

What are all these coding choices that we have? We found various offsets that we matched at least min match length (usually 3). This gives us a bunch of possible offsets and the maximum match length at each offset : {Off1,Len1} , {Off2,Len2}. In general we also have to consider all lengths >= 3 at each offset.

The first approximation is the assumption that for any given match length, the lowest code cost comes from the smallest possible offset that matches with that length. That is, we assume that if Off1 < Off2 , then cost(Off1) <= cost(Off2). This is certainly true for all non-pathological files in the real world. This kind of assumption is also self-satisfying, that is by making this assumption, the optimal parser will prefer shorter offsets, which makes them more likely, which makes their code length shorter. It's a positive feedback loop; I'll talk more about feedback later.

With this approximation, we no longer need to consider all offsets and all match lengths. Now we only need to consider all match lengths, and the offset for any length is chosen for us - it's the lowest offset that codes that length. Our code choices now look like :

you find a match of length 7 at Off1, length 11 at Off2, length 22 at Off3
with Off1 < Off2 < Off3

choose from :

Length 3-7 at Off1
Length 8-11 at Off2
Lebgth 12-22 at Off3

Okay, so we could do this, but it's still too slow. We need to reduce our choices further.

One option is to only consider the longest possible length at each offset. That is, assume that if you code a length at a given offset, the best choice is the longest length at that offset. So in our example you would only choose from {7 at Off1, 11 at Off2, 22 at Off3}. That is sort of okay, but it actually gives up a lot of compression. It misses too many good choices. Apparently there are some good coding paths that involve taking a shorter than maximal match.

To get to the really cool optimization, we need to stop and think about this intuitively. Why would you not write a maximal match? It should almost always give us a shorter total code len, because writing a longer match is very cheap and lets us not code symbols after the match. The cost of a given match is

Cost(offset) + Cost(match length) + CostTable( at pos + match length )

Let's assume for now that the cost of the offset and the match length are seperable (they aren't used to predict each other). The cost of each match length is an entropy code that generally codes shorter lengths in fewer bits. It's sub-linear in match length, but it is generally increasing. However, it increases less than the average cost of future symbols. Note that for long match lengths, the cost of len and (len+1) may be exactly equal depending on how you code your lengths. For example of you use a semi-log2 code, then the cost of many large match lengths is identical. That means coding longer matches is free in terms of the current code cost.

The CostTable generally increases backwards linearly proportional to the average entropy. That is :

CostTable[ {end} ] = 0
CostTable[ end - 1 ] = cost to code form (end-1) to end
CostTable[ end - N ] = N * H

on average ; where H is the average entropy of one symbol

However, even though CostTable is linearly increasing backwards on average, it is not always. Consider for example what CostTable looks like in the neighborhood of a very good very long match. Say at some position P this long match is available with length L. At position (P+1) the best coding choice will be that same match with length (L-1), same at (P+2) etc. The cost table in this area will look like :

CostTable[ P-1 ] = Cost to code literal at (P-1) + Cost(L) + CostTable[ P + L ];
CostTable[ P   ] = Cost(L  ) + CostTable[ P + L ];
CostTable[ P+1 ] = Cost(L-1) + CostTable[ P + L ];
CostTable[ P+2 ] = Cost(L-2) + CostTable[ P + L ];
CostTable[ P+3 ] = Cost(L-3) + CostTable[ P + L ];

If L is large, like say 100 or more, the Cost(L) and the Cost(L-1) will be very very similar. In this region, CostTable[] is increasing as you step backwards, but it's only increasing very slowly, far more slowly than the average slope of H. We see that CostTable goes backwards like N*H in a line, but it deviates up and down across the line. Some areas you code worse than average entropy, some areas you code much better than average entropy. The areas where you code better than average entropy are candidates for choosing a shorter than maximal match length!

What are we really doing as we step back through the file and choose the best match length? We have this constant table Cost(L) that tells us the cost of coding each match length. This table is increasing, it's smallest at Cost(3) and largest at Cost(infinity). We find the maximum match length at a given offset and the minimum at a given offset (the minimum is the smallest length that can't be coded from a lower offset). We take the Cost(L) table in that range and add it onto CostTable() in that range, then pick the lowest value after the sum. Generally CostTable increases backwards faster than Cost(L) increases forwards, but not always! As we saw in the example above with a long match, CostTable can increase backwards slower than Cost(L) increases forwards.

So, how do we find this spots quickly? Well, when we are looking at coding at spot P, we have already visited all the CostTable spots > P. What we'd like to know is, Is CostTable ahead of us increasing sublinearly? And if so, where is it MOST sublinear? That is, Where is CostTable[P] as far below (end - P)*H ? That will tell us the most likely spot to code a less than maximal match.

We could build a binary search tree for these spots as we go backwards, which would be logN and all, but I used a simpler solution. At each spot P, I store the best sublinear spot preceding that spot (preceding in file order, actually occuring later in the backwards walk). That is :

I chose a coding at pos P and updated CostTable[P]

BestSpotPreceding[P] = P;
int step = 1;
while( CostTable[P] + step * TinyCost < CostTable[P + step] )
 BestSpotPreceding[P + step] = P;

Normally when CostTable is increasing a lot as you step back, this does not get triggered at all. That is, CostTable[P+1] is usually a H smaller than CostTable[P]. But in weird cases, the cost of coding this symbol was almost free, so BestSpotPreceding points back to us. BestSpotPreceding gets updated in waves that terminate at cheap code spots. Normally you keep stepping back, then you hit a really cheap code spot and a wave percolates forward, filling out BestSpotPreceding, then you step some more normally, then you hit a cheap code spot and it waves through again. Sort of carries in an arithmetic coder. Anyway, this is technically still O(N*L), but it's a very cheap operation and it's way way less than N*L in practice.

So, how do we use this? Instead of considering only the maximum length at each offset ( {7 at Off1, 11 at Off2, 22 at Off3} in our previous example ), we also consider matches at BestSpotPreceding[] in that offset. That is :

Look in 3-7 at Off1
Consider a 7 at Off1
Also consider BestSpotPreceding[Pos + 7] at Off1 (if it's >= 3 )

Look in 8-11 at Off2
Consider an 11 at Off2
Also consider BestSpotPreceding[Pos + 11] at Off2 (if it's >= 8)

Look in 12-22 at Off3
Consider a 22 at Off3
Also consider BestSpotPreceding[Pos + 22] at Off3 (if it's >= 12)

And those are all the matches we have to consider. I did this and found it hurt less than 0.001 bpb compared to the full optimal parse that considers all possible coding operations at each position.

ADDENDUM #1 : I should mention there's another cool way to get this optimization if you don't want to do the BestSpotPreceding thing. We again want to ramp up our intuition for the problem. The cases where we need to make a non-trivial decision about match lengths are cases when we are writing two matches in a row, both greater than minimum length, and they overlap. That means there's a big range of ways we can code the same block. For example :


At pos P we find a match of length 7
At pos P+4 we find a match of length 10

The area from pos P to [P+14] can be coded with two matches, the ways are :


The area of overlap of the matches all codes to the same offset, it's the same offset that was found at P+4, but if you match past that at P, then you just match less of it. The amount of overlap is the number of different coding choices we have.

We're coding the same two offsets and skipping the same number of symbols, so the only difference in code cost is the cost to code the match lengths. The code costs of the choices are :

Cost(len 4) + Cost(len 10)
Cost(len 5) + Cost(len 9)
Cost(len 6) + Cost(len 8)
Cost(len 7) + Cost(len 7)

But we don't even need to compute these. The cost of coding a match len is almost always increasing in len, and increasing more slowly for larger lens, that is :

(len1 > len2) implies codecost( len1 ) >= codecost( len2 )

(len1 > len2) implies ( codecost( len1 +1 ) - codecost(len1) ) <= ( codecost( len2 + 1 ) - codecost( len2 ) )

That is, code costs go up for larger lens, but they go up more slowly. It's like a sqrt(x) function - sharp at the beginning and then levels out. That means the lowest total code cost is the one that's most skewed. You want to extend the longer match as much as possible because that's very cheap, and you want to shorten the shorter match as much as possible because that saves you a lot. Intuitively you can think the difference between coding a len of 3 and 4 is very big, but the difference between 511 and 512 is tiny tiny.

In our example that means the best choice is :

Cost(len 4) + Cost(len 10)

The simple heuristic to do this in general is :

Consider coding option at pos P
Find max match len possible at each offset
Consider coding with max match len

len = max match len
while ( matchoffset[ P + len - 1] == matchoffset[ P + len ] )
if ( len != max match len )
 consider coding at len

Note the similarity with the BestSpotPreceding heuristic. We try coding with our longest match len. Then we also try coding with the *shortest* match len that still reaches the same match after we jump forward. We're trying the two most skew possibilities, our match as long as possible and our match as short as possible, constrained to be getting the same match following. (matchoffset is already computed because we're going backwards).

ADDENDUM #2 : More early outs. LowestCostPreceding. @@ todo : write this.

There are some more minor issues we need to consider.

The first one is that if we are doing a Huffman (or static arithmetic) coder to entropy code the length, offset & literal, then we can't actually know the code length in advance. And the way we choose to do the coding will affect the statistics which will affect the code lengths, which affects how we choose to do the coding, ad infinitum. It's a feedback loop.

As usual in these kind of optimization searches, we make the semi-static assumption. We make a good guess for the code lengths, find the optimal parse for those code lengths, use that parse to guess new code lengths, optimal parse again, etc. Hopefully this smoothly converges. Well, it doesn't. The problem is that the coding alphabet is actually pretty sparse. On medium size files there are not that many literals and there are plenty of offsets and match lengths you might not code at all in one particular encoding choice. For example, in one pass you might not code any matches of length {31} at all. Now when you do the next pass you need something reasonable in there. If you use the actual code lengths, you can strangely skew how you do the coding, and it can actually ping pong a lot in the search and not settle down.

To improve this, I suggest smoothing the counts in the early passes. That is, the chance of a length 31 match should be somewhere between the chance of a length 30 and length 32 match, regardless of the coincidence that we happened not to code with it in the last pass. To do this, I fit a laplacian probability model to lengths and offsets, and in the early passes I blend that model in to smooth out the regularize the distribution. As I do more passes I decrease the weight of that blended simple model and use the raw counts more.

In practice, you can avoid doing multiple passes of the actual optimal parser without hurting compression too much. You can do a first pass with just a normal non optimal parser and build statistics from that. Smooth those statistics with a regularizing simple model. Then do the optimal parse with those statistics, and you're done. Doing more passes will improve the coding decisions, though.

The other issue is that it's still a bit slow. Most of the slowness comes from the areas where very long matches are possible. If there's a long match of length L at pos P, we also consider (L-1) at (P+1), and (L-2) at (P+2), etc., and for each of those we generally find many closer offsets to also consider, roughly O(L) of them, which makes us O(L^2) over that region. On super degerate files like "pic" in the Calgary Corpus, this makes us very slow. However, it's easy to fix. In this area with the long match, it really doesn't matter much what we do, because all of them are good choices. This is a property that in areas of super high compression, the importance to the whole file is much less. You want to maximize your wins on the hard to compress spots, because in terms of output bytes they are more important. So, in these big long match areas, we just cut out the whole area inside the long match, and we always code the long match at the beginning. (actually I allow coding at P and P+1 up to P+K for a small K ; K is like 4 and L is like 100).

Optimal Parser for Adaptive Coders :

Now, with real modern coders you aren't doing static statistics, they're adapting as you go. You could just make a guess at the overall statistics and use the semi-static optimal parser that I just described, and then code the stream with your adaptive coders. That's better than not optimal parsing at all, but it's giving up a lot. Adaptive coders get a big win by changing their statistics as they scan through the file, and an optimal parser can make much better decisions if it uses the actual local statistics. For example, in one part of the file, literals might code very well, and the optimal parser should favor literals more, in another part of the file there might be lots of very short match lengths, and the optimal parser should code for that, etc.

One way to solve this is to use a different kind of semi-static approximation and optimal parse forward in blocks. For each block, you pretend your coder are static and use some fixed code lengths. Now, run the semi-static optimal parser that I described above to find the optimal coding for the next 1000 bytes or so. Now step up through that coding and output codes with your adaptive entropy coder. Only output 900 bytes or so, don't go all the way to the end of the block so that you avoid edge effects. Now hold use the current statistics to find the optimal parse for the next 1000 bytes. This in fact works very well, and I believe was first used by RKive many years ago. He called it "dynamic programming" which is sort of right in the sense that all backward optimal parses are a type of dynamic programming.

There's another form of adaptive optimal parse that a lot of people do now using forward walking. It's called "flexible parsing" by the encode.ru guy, and I believe a form of this is used in LZMA.

The simplest form of "flexible parse" is just the old "lazy evaluation" of Zip. Lazy matching walks through the file and considers either writing {match} or {literal + match at next pos}. It's a slightly non-greedy parser. Flexible parsing is basically the same thing, but we consider a few different choices.

To do this, we run through the whole file first and find the best match at each position. (you don't actually have to do this in practice, you could just look ahead some amount N from the current position to avoid multiple passes). In any cases, the match search at each pos is only done once, and we have them ready far enough in the future to do the flexible parse.

So, we consider a few coding choices at the current location. We choose the one that gives the lowest code cost to the end of the file. To make good decisions, it needs to also consider the coding after you take the current operation. That is, it's like a recursive function :

CodeCostRecurive( P )
 consider n in N choices
  cost of choice n = Cost(code n) + CodeCostRecurive( P + len(n) )
 return best of N

now of course if we actually did this full recursion we would be doing the whole O(N^L) tree like we didn't want to do. But we don't need to do that because we're only using this to make the decision at the current byte. That is, we can use the "short horizon" assumption - coding choices far in the future don't matter that much to our current coding choice. It's like only considering a few moves ahead in a Chess program (which is a bad thing to do there, but a good thing here). In fact, you only need to look 2 moves ahead here to make good coding choices.

Also, you don't need to consider very many N choices. N = 4 or 8 is plenty. Certainly consider coding a literal and coding the longest possible match, and then heuristically pick some other likely good choices in between (such as the longest matches available with some shorter offsets, or the longest matches possible from previous-match-references if you have those).

How do we terminate the recursion? After 2 steps (considering one op and then one more op after that) we just need an estimate of the cost of coding the rest of the file. That's easy, it's just :

CodeCostEstimate( P ) = ( {end} - P ) * H

That is, just assume the remainder is coded at the entropy. The entropy you use here should be a conservative overestimate of the entropy; for example just use the entropy of what you've done so far and add a fudge factor. This ensures that you actually prefer writing matches and don't do something strange if you start the file in a very low entropy region.

This flexible parse sounds kind of expensive, but in fact with N = 4, it's only 16 table lookups to compute all the code lengths, then we pick the best one. This of course is not a true optimal parse, but it's pretty damn good, and the fact that it's using the true current adaptive statistics makes up for the fact that it's not doing the best possible parse.

ADDENDUM : I should note that this kind of forward parse is also better at making good decisions with PMR's (Previous Match References). PMR's are cheap to code offsets, so they should always be considered in the optimal parse when they are possible matches. Going backwards you can't make good decisions about trying to use PMR's because you haven't done the earlier parts of the file yet.


10-07-08 - 2

Random file stuff I've learned :

Good article on Windows file caching . The Windows file cache architecture is pretty sweet really. Files are cached in 256k blocks. The caching is also file "buffering". That is, file IO in the application doesn't get any buffer just for that app - you get access to the cache pages, that is your buffer. It means multiple apps on the same file share buffers. This is all pretty sweet. The caching is just a file mapping of pages in the system address space. When you create a memory mapped file, you are just getting direct access to those pages, which you can then lock to get pointers in your address space. You can lock & unlock the pages just like any pages using VirtualAlloc and VirtualFree. When you write to pages, you can control them getting flushed out the write cache by unlocking them.

Windows file caching is *extremely* greedy. It will basically throw away anything it has cached in order to cache the latest request. This makes it look very good in benchmarks, but can be pretty retarded in practice. For example, I was testing by doing linear streaming IO across a 2 GB file. Windows caches every single byte of the 2 GB file so that future touches to it are instant; in order to do that, it will evict the baby and the bathwater from the cache (including stuff like the cache of the NTFS MFT table).

One thing I haven't been able to figure out is a way to manually control the prefetching that windows does. When you do a bunch of ReadFile calls in a row, the cache will asynchronously prefetch ahead the next 256k chunk. This means that even if you just call the synchronous ReadFile (or fread) to get bits of the file, the return is almost immediate. What I would like, however, is a way to manually say "prefetch at this byte". The only way I can do that at the moment is to issue a read at that byte on a thread, but that's not really ideal, it would be better to be able to signal the underlying cache directly.

Also it appears that if you use memory mapped files, windows stops doing the smart async prefetch ahead. Say you memory map a huge file, then you run through streaming touching pages in order. Windows uses the virtual page fault from the memory manager to interrupt and pull in pages from disk. It seems to not prefetch because you get a lot of faults and your app just stalls when you get to pages that aren't in memory yet. Memory mapped files are pretty awesome, it's cool that you just get access to the cache, but without the ability to request prefetches they're pretty worthless since the plain old ReadFile beats them so badly. (if you are in fact doing totally random file accesses on a huge file, then they are just about perfect).

Stdio in MSVC has a 4k buffer and then calls straight to ReadFile to fill it. The 4k buffer isn't really providing you any "buffering" it just hides function call overhead and makes the reads page aligned; the real buffering is inside ReadFile.

If you're linking with the multithreaded libs, stdio becomes outrageously slow. Personally I like my threading model to work such that individual objects, such as a "FILE" are not protected from concurrent use by the library. That is, once the API hands out an atomic object to the app, it is up to the app if it wants to protect that object from multiple threads accessing it or not. The library should only do protection from access to shared/global data structures. (I know they can't do this with stdio because they want to make sure it's safe, and they want to protect stdout/stdin/stdierr). Anyway, the result is that "getc" goes from being around 10 clocks to being around 170 clocks. Fortunately, you can fix that pretty easily by just bringing back the macro yourself. Just define this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((U8)*(_stream)->_ptr++) : _filbuf(_stream))

And use it to do your IO when you know the FILE is only being used by one thread.

Something I just discovered today : while Windows buffering is in fact very good for reads, it sucks for writes. If I write out a stream of data through a buffered file, I only get around 25 MB/sec. If I write out chunks with async non-buffered calls, I get 100 MB/sec. (conversely, buffered streaming input is actually faster than doing a bunch of async non-buffered reads).


10-05-08 - 5

Rant on New Arithmetic Coders

Can we fucking stop calling arithmetic coding "range coding" !? It's fucking arithmetic coding. "Range coding" differs from the ancient CACM arithmetic coder in exactly two ways :

1. It uses a "virtual queue" for the carry propagation, exactly as I described in my VirtQ paper . (warning : ancient!)

2. It introduces an approximation to the range update which lets it use more bits by changing the order of multiplication :


range = range_high - range_low + 1;
range_high = range_low + (symhigh * range) / symtot - 1;
range_low += (symlow * range) / symtot;

"range coder" :

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

This small rearrangement of the math does in fact make a huge difference, but dude look, it's literally just moving where you put the parentheses.

Of course it means you save a divide, and it means range can go up to 31 bits since you never multiply by it. Getting range up to 31 bits means you can afford to shuffle out whole bytes at a time. In CACM the multiplicataion has to fit in 32 bits so your range can only be 16 bits so outputting bytes is a no go; range stays between 15 and 16 bits all the time in CACM while a "range coder" can go down to 23 bits; when it's down at 23 bits you only have 9 bits of coding precision because of the approximation of doing the divide before the multiply.

I'm sick of all these web pages that are like "LZMA uses a statistical algorithm called range coding". No, it uses a fucking entropy coder called arithmetic coding.

Okay, this was just an angry rant, but it reminds me of something interesting. I recently stumbled on some new arithmetic coders and they blew my mind for a minute until I figured out how they work. (BTW For reference, the canonical "range coder" is by Michael Schindler )

(BTW #2 it's sort of curious to me historically that I missed the range coder; I was so close to inventing it myself, but somehow got right next to it and didn't see it. All the pieces of the range coder were thought of previously by various people, including the original CACM guys, such as trying to use 32 bits, shifting out bytes, approximating the division, and the virtual queue, but on their own each of those ideas doesn't quite work or make a big difference, nobody put those ideas together and realized that together they make this "range coder" thing that's a big improvement).

The new arithmetic coders I've discovered are the fpaq0p binary range coder and "rc" the "russian range coder". FPAQ0p is by Ilia Muraviev, the guy from "encode.ru" who's done lots of good stuff; you can find it on Matt Mahoney's PAQ page . The "russian range coder" used to have a web page but it appears to have disappeared. I found the code in MRP (Minimum Rate Predictors, a decent lossless image coder).

The "fpaq0p" arithmetic encoder is brutally simple. It's what you would write if you just heard of arithmetic encoding; in fact it was my first idea for a coder ever, but it didn't work, so everybody dropped it. Well, it's back. If you remember the super basic idea of an arithmetic coder is you're specifying a code stream by writing an infinite precision number; at any time you know your desired number is in some interval, low and high, you shrink the interval down to specify the symbol, and to do it in finite precision you shift out top bits of low & high as they become the same. (See my ancient write up of basic arithcoding )

So what would you write?

range_low = 0;
range_high = 0xFFFFFFFF; // 32 bits on

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

while( (range_low>>24) == (range_high>>24) )
	*outPtr+= = (range_low>>24);
	range_low <<= 8;
	range_high <<= 8;
	range_high += 0xFF;

Okay, good job, but this is badly broken. First of all, you're overflowing your 32 bit code. Second of all, it doesn't handle carry propagation, and it doesn't handle underflow. The problem with carries is that range can step up to the next bit and push a bit over code size. The problem with underflow is that we aren't handling the case of the range getting very very small but while the top bytes are different.

In particular, when range_high comes down close to range_low from above and range_low comes up close to range_high from below, but the top bytes are different, eg, they're like 0x800000AA and 0x7FFFFFBB , the top bytes are different so they don't get shifted out, but the range can get arbitrarily small. (in particular the top byte of "low" is X, the top byte of "hi" is X+1, the second byte of low is 0xFF and the second byte of hi is 0x00).

But fpaq0p works, so how does it do it?

The main trick is to only do binary coding. Then you just don't worry about underflow. With the underflow problem, your range can get as small as only 1 , with 0x01000000 and 0x00FFFFFF, but you implicitly add 1 to the top, so you still have enough range to code 2 symbols, which is what you need for your binary alphabet to be decodable. Any alphabet larger than binary is not codable there.

The other minor trick is just about where you do your "+1". The high is always implicitly +1 above what you actually store in range_high. We started it with 0xFFFFFFFF but we want an implicit FFFFF... repeating to infinity after it. That's why we shift in FF, but we also have to do it right in the range update in a way that doesn't blow the 32 bits or ever generate a carry.

So the working fpaq0p coder is :

range_low = 0;
range_high = 0xFFFFFFFF; // 32 bits on

range = range_high - range_low;
range_mid = range_low + symp0 * (range / symtot);
if ( bit )
	range_low = range_mid+1;
	range_high = range_mid;

while( (range_low>>24) == (range_high>>24) )
	*outPtr+= = (range_low>>24);
	range_low <<= 8;
	range_high <<= 8;
	range_high += 0xFF;

Which looks like it's doing the same trivial thing, but is subtly different.

Note that in the horrible underflow case your coding precision can get as bad as only 1 bit (that is, the probabilities are forced to 50/50 regardless of what they really are). Obviously that hurts coding, but it's actually not that bad, because it's statistically incredibly rare. In fact it happens something like 2^-24 of the time, since all spots on the code range are equally likely. And even when it does happen you only lose a few bits of coding efficiency.


To do this faster :

	while( (range_low>>24) == (range_high>>24) )

You can do :

	while( ((x1 ^ x2)>>24) == 0 )


	while( ((x1 ^ x2)&0xFF000000) == 0 )


	while( (x1 ^ x2) < (1<<24) )

but in practice it doesn't really matter because you're stalled out on branches anyway.

Also in a binary arithmetic coder you never actually divide by symtot. You use symtot = some fixed power of 2, and you do a probability update that keeps tot the same. My code's got this old rung/ladder thing that does different styles of adaptation with a power of 2 total. Or you can just do the super simple thing that everybody seems to do these days :

#define symbits		(12) // or whatever < 14
#define symtot		(1<< symbits)
#define updshift	(6) // or anything in {1,symbits-1}

void update(int & p0,bool bit)

	if ( bit )
		p0 -= p0 >> updshift;
		p0 += (symtot - p0) >> updshift;

Here "updshift" controls how fast your adaptation is vs. how precise you become in the long term. This update makes no distinction between the initial rollup when you have no stats and the adapted state, which the rung/ladder does. This is an approximation of a normal n0/n1 counter with an implicit max count of (1 << updshift). Compare to an n0/n1 update :

void update(int & p0,bool bit)
	int n0 = p0;
	int n1 = symtot - p0;
	int increment = symtot >> updshift;
	if ( bit )
		n1 += increment;
		n0 += increment;
	p0 = (n0 * symtot) / (n0 + n1);

We can see the equivalence and approximation by looking at the bit on case :

	I'll use := to mean assignment while I do maths :

	p0 := (n0 * symtot) / (n0 + n1);
	p0 := (p0 * symtot) / (symtot + incr);
	p0 := (p0 * symtot)*(symtot - incr) / (symtot + incr)*(symtot - incr);
		now approximate incr^2 is much less than symtot^2
	p0 := (p0 * symtot)*(symtot - incr) / (symtot^2);
	p0 := (p0)*(symtot - incr) / (symtot);
	p0 := p0 - (p0*incr)/symtot);
	p0 -= p0/(symtot/incr);
	p0 -= p0>>updshift;

BTW this approximation goes back to the ancient stuff by Howard & Vitter on fast arithmetic coding. Of course in practice you also want to merge your modelling with arithmetic coding since they do the same random branch on bit.

Most people these days are good at designing their compressor to drive the binary arithmetic coder well. A well driven binary arithmetic coder is generally coding probabilities close to 50%. That's better because it keeps precision high and keeps you in the accurate part of the coarse probability update approximation. It's also better because it means you're calling the coder many fewer times. If you only call the arithmetic coder with probablities close to 50% it means you're only doing coding roughly as many times as the number of bits you output, which is low. You accomplish this by coding symbols that are "natural" for the problem. For example, for LZ77 offsets you code in the log2 of the offset (similarly with BWT MTF codes). etc. etc.

So I also found this range coder implementation called the "Russian Range Coder". It's extremely similar to the standard Michael Schindler coder except the renorm & bit output loop is slightly different :

"Russian Range Coder" :

    while((rc->low ^ (rc->low + rc->range)) < (1<<24))
		*rc->ptr++ = (U8) (rc->low >> 24);
		rc->range <<= 8;
		rc->low <<= 8;
    while(rc->range < (1 << 16))
		*rc->ptr++ = (U8) (rc->low >> 24);
		rc->range = ((~rc->low) & 0xFFFF) << 8;
		rc->low <<= 8;

So what's going on here? The first part we can recognize is exactly the simple loop from fpaq0p that just outputs the top byte when it's the same. But if you just do that you're fucked for multisymbol alphabets because of underflow, so looky here, the second part handles that. It checks if the range is getting too small to resolve symbols. But normally in the virtual queue / schindler coder that would mean setting up a carry queue and such. This coder avoids that. It does it by just giving up a huge chunk of the range.

The bad case occurs when the range is small and the top bytes are off by 1. The second bytes are 0x00 and 0xFF. With a virtual queue, you would put the top byte of "low" in the virtual queue spot, and count the number of 0xFF bytes but don't write them yet. If you get a carry, you propagate the carry by writing the top of low+1, then implicitly flipping the 0xFF's to 0x00's and writing them. This is the virtual Q / schindler way.

Here we don't check for underflow and do the carry and such, we just smash down the "high" to the "low". We know the top bytes are not equal, so the next bytes must be 00 and FF, eg. we have something like 0x0B00XXXX and 0x0AFFXXXX , we just smash the high down to 0x0AFFFFFF. This bit :

	rc->range = ((~rc->low) & 0xFFFF) << 8;
	rc->low <<= 8;

is the same as :

	high = low | 0xFFFF;
	range = high - low;
	range <<= 8;
	low <<= 8;

which I think makes it clearer what's happening. You just chop down to make your top bytes the same, eliminating the bad straddle situation which caused underflow. Now you can renormalize the simple way by shifting out equal top bytes and get a big enough range back.

Now obviously this is an approximation that loses some efficiency, you're throwing away on average half your range in the cases that you have this bad underflow, but again it's pretty rare. I don't know the exact probabilities, but the measured efficiency is very close to the full coder.

In practice, however, this coder is not any faster than the full correct schindler coder, so there's no point to using it. It is however, very simple and elegant once you see how it works.

ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand. It seems to offer no advantages and many disadvantages, but I would like to grok it for my own brain's happiness.

old rants