4/13/2017

Understanding Marlin

I will be using "Tunstall" to mean any variable-to-fixed coder. I am considering the large alphabet case (eg. 8-bit input alphabet) with a "plural" (eg. non-prefix-free) dictionary. I am also considering only modeling order-0 statistics.

I label the symbols 'a','b','c' from most probable to least probable. I will use single quotes for symbols, and double quotes for words in the dictionary. So :


P('a') , P('b'), etc. are given

P('a') >= P('b') >= P('c') are ordered

I will use the term "Marlin" to describe the way they estimate the probability of dictionary words. Everything else in the paper is either obvious or well known (eg. the way the decoder works), so the innovation and interesting part is the word probability estimation; that is what I will call "Marlin", the rest is just "Tunstall".

Ok. To build a Tunstall dictionary your goal is to maximize the average input length, which is :


average input length = Sum[words] { P(word) * L(word) }

since the output codeword length is fixed, maximizing the average input length maximizes the compression ratio.

In the original Tunstall algorithm on a binary input alphabet, this is easily optimized by repeatedly splitting the most probable word and adding its two children. This can be done in linear time using two queues, one for left (0) children and one for right (1) children.
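
Here's a little sketch of what I mean (plain Python, my own illustration, not from the Marlin code; this is the classic prefix-free binary Tunstall build, where splitting a leaf removes it from the dictionary and adds its two children) :

from collections import deque

def tunstall_binary(p0, p1, dict_size):
    # p0 >= p1 are the two symbol probabilities, dict_size is the target word count
    q0 = deque([("0", p0)])   # leaves whose last character is 0
    q1 = deque([("1", p1)])   # leaves whose last character is 1
    num_leaves = 2
    while num_leaves < dict_size:
        # each queue stays in decreasing probability order because parents are
        # split in decreasing order, so the most probable leaf is at one of the heads
        take0 = q0 and (not q1 or q0[0][1] >= q1[0][1])
        word, prob = (q0 if take0 else q1).popleft()
        # split it : the leaf stops being a word, its two children are added
        q0.append((word + "0", prob * p0))
        q1.append((word + "1", prob * p1))
        num_leaves += 1
    return list(q0) + list(q1)   # the remaining leaves are the dictionary words

# eg. tunstall_binary(0.7, 0.3, 8) builds an 8-word (3-bit output) dictionary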

The Marlin algorithm is all about estimating P(word).

The first naive estimate (what I did in my 12/4/2015 report) is just to multiply the character probabilities :


P(word) = Prod[c in word] P('c')

that is

P("xyz") = P('x') * P('y') * P('z')

but that's obviously not right. The reason is that the existence of words in the dictionary affects the probability of other words.

In particular, the general trend is that the dictionary will accumulate words made of the most probable characters ('a','b','c'), which makes the effective probability of the other letters greater in what remains to be coded.

For example :


Start the dictionary with the 256 single-letter words

At this point the naive probabilities are exact, that is :

P("a") (the word "a") = P('a') (the letter 'a')

Now add the most probable bigram "aa" to the dictionary.

We now have a 257-word dictionary.  What are the probabilities when we code from it ?

Some of the occurrences of the letter 'a' will now be coded with the word "aa"

That means P("a") in the dictionary is now LESS than P('a')

Now add the next most probable words, "ab" and "ba"

The probability of "a" goes down more, as does P("b")

Now if we consider the choice of what word to add next - is it "ac" or "bb" ?

The fact that some of the probability of those letters has been used by words in the dictionary affects
our estimate, which affects our choice of how to build the dictionary.

so that's the intuition of the problem, and the Marlin algorithm is one way to solve it.

Let's do it intuitively again in a bit more detail.

There are two issues : the way the probability of a shorter word is reduced by the presence of longer words, and the way the probabilities of the raw characters that start the next word are changed by which word came before them.


Say you have word W in your dictionary

and also some of the most probable children.

W, Wa, Wb are in dictionary

Wc, Wd are not

We'll say word "W" has 2 children (Wa and Wb).

So word "W" will only be used from the dictionary if the children are NOT a or b
(since in that case the longer word would be used).

So if you have seen word "W" so far, to use word W, the next character must be terminal,
eg. one that doesn't correspond to another child.

So the probability of word W should be adjusted by :

P(W) *= P(c) + P(d)

Because we are dealing with a probability-sorted alphabet, we can describe the child set with just one integer indicating how many children are in the dictionary. In Marlin terminology this is c(w), and corresponds to the state S_i.

If we make the tail cumulative probability sum :


Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
Ptail(255) = P(255)
Ptail(254) = P(254) + P(255)
Ptail(0) = sum of all P('c')  = 1.0

then the adjustment is :

P(W) *= P(c) + P(d)
P(W) *= Ptail('c')
P(W) *= Ptail(first char that's not a child)

P(W) *= Ptail( num_children(W) )
(I'm zero-indexing, so there's no +1 here as in the Marlin paper; they use 1-based indexing)
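
In code, Ptail is just a suffix sum and the adjustment is one multiply (a sketch, same conventions as above; these helper names are mine, not from the paper) :

def make_ptail(p):
    # Ptail(i) = sum of p[c] for c >= i , so Ptail(0) = 1.0 and Ptail(len(p)) = 0.0
    t = [0.0] * (len(p) + 1)
    for i in range(len(p) - 1, -1, -1):
        t[i] = t[i + 1] + p[i]
    return t

def adjusted_word_prob(p_naive_W, num_children_W, ptail):
    # word W is only used when the next character does not extend to a child of W
    # that's in the dictionary, ie. the next character index is >= num_children(W)
    return p_naive_W * ptail[num_children_W]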

ADD : I realized there's a simpler way to think about this. When you add a child word, you simply remove that probability from the parent. That is :

let P_init(W) be the initial probability of word W
when it is first added to the dictionary and has no children

track running estimate of P(W)

when you add child 'x' making word "Wx"

The child word probability is initialized from the parent's whole probability :

P_init(Wx) = P_init(W) * P('x')

And remove that from the running P(W) :

P(W) -= P_init(Wx)

That is, you just make P(W) the probability of word W excluding the children that exist.
Once you add all children, P(W) will be zero and the word is useless.
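
A sketch of that bookkeeping (again my own code; "dictionary" maps each word to its current child count, words are tuples of symbol indices) :

def split_word(word, dictionary, p_init, p_run, p):
    # add the next most probable child of 'word' to the dictionary
    k = dictionary[word]                   # children 0..k-1 already exist, so add child k
    child = word + (k,)
    p_init[child] = p_init[word] * p[k]    # P_init(Wx) = P_init(W) * P('x')
    p_run[child]  = p_init[child]          # the new child has no children yet
    p_run[word]  -= p_init[child]          # remove that mass from the running P(W)
    dictionary[word] = k + 1
    dictionary[child] = 0

Note that p_run[word] winds up equal to P_init(word) * Ptail(num_children(word)), which is exactly the adjustment above, just maintained incrementally.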

Okay, so that does the first issue (probability of words reduced by the presence of longer words). Now the next issue. Consider the same simple example case first :


W,Wa,Wb are in dictionary, Wc,Wd are not
(no longer words descending from W are in the dictionary either)

Say you reach node "Wa"

there are no children of "Wa" in the dictionary, so any following character is possible (none are excluded)

This means that starting the next word, the character probabilities are equal to their original true probabilities

But say you reach node "W" and leave via 'c' or 'd'

In that case the next character must be 'c' or 'd' , it can never be 'a' or 'b'

So the probability of the next character being 'c' goes up by the probability of using word "W" times the
probability that the character after "W" is a 'c' , that's :

estimate_P('c') +=  P("W") * P('c') / ( P('c') + P('d') )

or

estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) )

now you have to do this for all paths through the dictionary. But all ways to exit with a certain child count are similar, so you can merge those paths to reduce the work. All words with 2 children will be in the same exit probability state ('a' and 'b' can't occur but chars >= 'c' can).

This is the Marlin state S_i. S_i means that the next character is >= i. You get there because you left the tree with a word that had i children.

When you see character 2, that can happen from state 0, 1, or 2, but never from states >= 3.


for estimating probability of word W

W can only occur in states where the first character W[0] is possible

that is state S_i with i <= W[0]

When character W[0] does occur in state S_i , the probability of that character is effectively higher,
because we know that chars < i can't occur.

Instead of just being P(char W[0]) , it's divided by Ptail(i)

(as in estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) ) above)


So :

P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

the probability of W is the sum of the probability of states it can start from
(recall states = certain terminal character sets)
times the probability of W given that state

let

P_naive(W) = Product[ chars c in W ] P(char 'c')

be the naive word probability, then :

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )


is what we need.  This is equation (1) in the Marlin paper.

The first term increases the probability of W for higher chars, because we know the more probable lower chars can't occur in this state (if the next char had been one of them, the longer child word would have been used instead).

The last term decreases the probability of W because W will only be used when the following character doesn't cause a longer word in the dictionary to be used.
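
In code form (a sketch; 'ptail' is the suffix-sum table from before, words are tuples of symbol indices) :

def p_word_given_state(word, num_children, i, p, ptail):
    # P(W | S_i) = (1/Ptail(i)) * P_naive(W) * Ptail( num_children(W) )
    pn = 1.0
    for c in word:
        pn *= p[c]
    return pn * ptail[num_children] / ptail[i]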

Now of course there's a problem. This P(W) probability estimate for words requires the probability of starting the word from state S_i, which we don't know. If we had the P(W) then the P of states is just :


P(S_i) = Sum[ words W that have i children ] { P(W) }

so to solve this you can just iterate. Initialize the P(S_i) to some guess; the Marlin code just does :

P(state 0) = 1.0
P(all others) = 0.0

(recall state 0 is the state where all chars are possible, no exclusions, so
characters just have their original order-0 probability)

feeds that in to get P(W), feeds that to update P(S_i), and repeats to convergence.
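
Here's a sketch of that iteration on a fixed dictionary (plain Python, my own code, not the Marlin implementation; "dictionary" maps each word, a tuple of symbol indices, to its child count, and all symbol probabilities are assumed nonzero) :

def word_probs(dictionary, p, state_prob, ptail):
    # P(W) = Sum[ i <= W[0] ] P(S_i) * P(W|S_i)
    # with P(W|S_i) = P_naive(W) * Ptail(num_children(W)) / Ptail(i)
    pw = {}
    for w, nc in dictionary.items():
        pn = 1.0
        for c in w:
            pn *= p[c]
        start = sum(state_prob[i] / ptail[i] for i in range(w[0] + 1))
        pw[w] = pn * ptail[nc] * start
    return pw

def state_probs(dictionary, pw, num_states):
    # P(S_i) = Sum[ words W that have i children ] { P(W) }
    sp = [0.0] * num_states
    for w, nc in dictionary.items():
        sp[nc] += pw[w]
    total = sum(sp)                        # should already be ~1.0
    return [x / total for x in sp]         # normalize to guard against drift

def solve_state_probs(dictionary, p, iters=10):
    ptail = [0.0] * (len(p) + 1)
    for i in range(len(p) - 1, -1, -1):
        ptail[i] = ptail[i + 1] + p[i]
    sp = [1.0] + [0.0] * len(p)            # seed : all mass in state 0
    for _ in range(iters):
        pw = word_probs(dictionary, p, sp, ptail)
        sp = state_probs(dictionary, pw, len(p) + 1)
    return sp, word_probs(dictionary, p, sp, ptail)

# eg. alphabet {a,b,c}, dictionary {"a","b","c","aa","ab"} ("a" has 2 children) :
# solve_state_probs({(0,): 2, (1,): 0, (2,): 0, (0,0): 0, (0,1): 0}, [0.6, 0.3, 0.1])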

To build the dictionary you simply find the word W with highest P(W) and split it (adding its next most probable child and increasing its child count by 1).

The Marlin code does this :


seed state probabilities

iterate
{

build dictionary greedily using fixed state probabilities

update state probabilities

}

That is, during the dictionary creation, word probabilities are estimated using state probabilities from the previous iteration. They hard-code this to 3 iterations.

There is an alternative, which is to update the state probabilities as you go. Any time you do a greedy word split, you only change 3 state probabilities, so that's not terrible. But changing the state probabilities means all your previous word probability estimates are now wrong, so you have to go back through them and recompute them. That makes it O(N^2) in the dictionary size, which is bad.

For reference, combining our above equations to make the word probability estimate :


P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )

so:

P(W) = P_naive(W) * Ptail( num_children(W) ) * Sum[ i <= W[0] ] P( state S_i ) / Ptail(i)

the second half of that can be tabulated between iterations :

P_state_tail(j) = Sum[ i <= j ] P( state S_i ) / Ptail(i)

so :

P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )
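
Putting it all together, here's a sketch of the whole construction as I understand it (plain Python, my own code, not the actual Marlin implementation; the greedy split uses a lazy max-heap, and the "update state probabilities" step is a single pass of the estimate rather than iterating it to convergence) :

import heapq

def build_marlin_dictionary(p, dict_size, outer_iters=3):
    # p[] : symbol probabilities sorted from most to least probable (all > 0)
    A = len(p)
    ptail = [0.0] * (A + 1)               # Ptail(i) = sum of p[c] for c >= i
    for i in range(A - 1, -1, -1):
        ptail[i] = ptail[i + 1] + p[i]

    state_prob = [1.0] + [0.0] * A        # seed : all mass in state 0

    for _ in range(outer_iters):
        # tabulate P_state_tail(j) = Sum[ i <= j ] P(S_i) / Ptail(i)
        pst, acc = [0.0] * A, 0.0
        for j in range(A):
            acc += state_prob[j] / ptail[j]
            pst[j] = acc

        def word_prob(word, pn, num_children):
            # P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )
            return pn * ptail[num_children] * pst[word[0]]

        # build dictionary greedily using these fixed state probabilities
        children = {}                     # word -> current child count
        naive = {}                        # word -> P_naive(W)
        heap = []                         # entries : (-P(W), word, child count when pushed)
        for c in range(A):
            w = (c,)
            children[w], naive[w] = 0, p[c]
            heapq.heappush(heap, (-word_prob(w, p[c], 0), w, 0))

        while len(children) < dict_size and heap:
            _, w, nc = heapq.heappop(heap)
            if nc != children[w] or nc == A:
                continue                  # stale entry, or word already fully split
            # split w : add its next most probable child, bump its child count
            child = w + (nc,)
            children[child] = 0
            naive[child] = naive[w] * p[nc]
            heapq.heappush(heap, (-word_prob(child, naive[child], 0), child, 0))
            children[w] = nc + 1
            heapq.heappush(heap, (-word_prob(w, naive[w], nc + 1), w, nc + 1))

        # update state probabilities : P(S_i) = Sum[ words W with i children ] { P(W) }
        state_prob = [0.0] * (A + 1)
        for w, nc in children.items():
            state_prob[nc] += word_prob(w, naive[w], nc)
        total = sum(state_prob)
        state_prob = [x / total for x in state_prob]

    return children, state_prob

# eg. build_marlin_dictionary([0.5, 0.25, 0.15, 0.1], 16) for a 4-symbol alphabet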

you can see the "Marlin" aspect of all this is just in using this P(W) rather than P_naive(W) . How important is it exactly to get this P(W) right? (and is it right?) We'll find out next time...
