12-01-14 - AC Quantization

I just read the Xiph PVQ demo page ; pretty interesting, go have a read. (BTW it kills me that Daala is using lapped transforms, oy!) (also - thank god some people still write real blogs and RSS them so I can find them!)

It reminded me of some things I've been talking about for a long time, but I couldn't find a link to myself on my blog. Anyway, internally at RAD I've been saying that we should be sending AC coefficients with more precision given to the *sum* of the values, and less given to the *distribution* of the values.

ADD : some searching finds these which are relevant -

cbloom rants 08-25-09 - Oodle Image Compression Looking Back
cbloom rants 08-27-09 - Oodle Image Compression Looking Back Pictures
cbloom rants 03-10-10 - Distortion Measure
cbloom rants 10-30-10 - Detail Preservation in Images

I've never actually pursued this, so I don't have the answers. But it is something that's been nibbling at my brain for a long time, and I still think it's a good idea, so that tells me there's something there.

I'll scribble a few notes -

1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).

2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics. In a standard DCT you'd make categories something like :

01447777
23669999
566BBC99
56BBBCCC
8ABBBCCC
8ADDDEEE
8AADDEEE
8AADDEEE
and what's most important is preserving the sum in each category. (that chart was pulled out of my ass but you get the idea).
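
A minimal sketch of what "preserve the sum in each category" means in code (the category map is just the chart above transcribed; using absolute values vs. squares is an arbitrary choice here) :

#include <cstdlib>

// Category map transcribed from the chart above (hex digits 0-E).
static const int kCategoryMap[8][8] = {
    { 0, 1, 4, 4, 7, 7, 7, 7 },
    { 2, 3, 6, 6, 9, 9, 9, 9 },
    { 5, 6, 6,11,11,12, 9, 9 },
    { 5, 6,11,11,11,12,12,12 },
    { 8,10,11,11,11,12,12,12 },
    { 8,10,13,13,13,14,14,14 },
    { 8,10,10,13,13,14,14,14 },
    { 8,10,10,13,13,14,14,14 },
};

// Accumulate per-category energy; |coeff| here, could just as well be coeff^2.
// Category 0 contains DC, which you'd normally exclude.
void category_sums(const int coeffs[8][8], int sums[15])
{
    for (int c = 0; c < 15; c++) sums[c] = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sums[kCategoryMap[y][x]] += std::abs(coeffs[y][x]);
}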

The sums should be preserved in a kind of wavelet subband quadtree type of way. Like preserve the sum of each of the 4x4 blocks; then go only to the upper-left 4x4 and divide it into 2x2's and preserve those sums, and then go to only the upper-left 2x2, etc.

3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area. (doing the RD optimization just lets you choose the cheapest way to specify that noise).

What you really want is a perceptual quantizer that fundamentally gives up information in the right way as you reduce bitrate. At low bits you just want to say "hey there's some noise" and not spend bits specifying the details of it.

The normal scalar quantizers that we use are just not right. As you remove bits, they kill energy, which looks bad. It looks better for that energy to be there, but wrong.

3 1/2. The normal zig-zag coding schemes we use are really bad.

In order to specify any energy way out in the highest-frequency region (E in the chart above) you have to send a ton of zeros to get there. This makes it prohibitively costly in bits.
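
Just to put a number on "way out there" - a quick sketch that builds the standard JPEG-style zig-zag order and reports how many coefficients precede a given position; an isolated value in the bottom-right has 63 slots in front of it, most of which will be zeros you have to code past :

#include <cstdio>

// Return the zig-zag scan index of coefficient (x,y) in an 8x8 block
// (x = horizontal frequency, y = vertical), using the usual JPEG
// convention of alternating direction along each anti-diagonal.
int zigzag_index(int tx, int ty)
{
    int idx = 0;
    for (int d = 0; d <= 14; d++)              // d = x + y
    {
        for (int i = 0; i <= d; i++)
        {
            int x = (d & 1) ? (d - i) : i;     // odd diagonals start at the top row
            int y = d - x;
            if (x > 7 || y > 7) continue;
            if (x == tx && y == ty) return idx;
            idx++;
        }
    }
    return -1;
}

int main()
{
    printf("zigzag index of (1,1) = %d\n", zigzag_index(1,1)); // 4
    printf("zigzag index of (7,7) = %d\n", zigzag_index(7,7)); // 63
    return 0;
}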

One of the first things that you notice when implementing an R/D optimized coder with TQ is that it starts killing all your high frequency detail. This is because under any kind of normal D norm, with zig-zag-like schemes, the R to send those values is just not worth it.

But perceptually that's all wrong. It makes you over-smooth images.

3 3/4. Imagine that the guy allocating bits is standing on the shore. The shore is the DC coefficient at 00. He's looking out to sea. Way out at the horizon are the super-high-frequency coefficients in the bottom right of the DCT. In the foreground (that's AC10 and AC01) he can see individual waves and where rocks are. Way out at sea he shouldn't be spending bandwidth trying to describe individual waves, but he should still be saying things like "there are a lot of big waves out there" or "there's a big swell but no breakers". Yeah.

4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.

Like by the time that AC11 has a scalar quantizer of 8, the highest frequency lower-right area has its total energy sent with a quantizer of 32, and zero bits are sent for the location of that energy.
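
To make that concrete - a toy schedule in code, where every constant is made up, just to show the shape of the idea : low frequencies keep a normal scalar quantizer, the high-frequency group gets a coarser quantizer on its energy sum, and the bits for *where* that energy goes drop to zero as the base quantizer grows :

// All numbers here are invented; only the shape of the tradeoff matters.
struct BandSchedule
{
    int scalarQ_low;    // scalar quantizer for the low-frequency AC's
    int energyQ_high;   // quantizer for the high-frequency energy *sum*
    int distBits_high;  // bit budget for the high-frequency *distribution*
};

BandSchedule make_schedule(int baseQ)
{
    BandSchedule s;
    s.scalarQ_low   = baseQ;
    s.energyQ_high  = baseQ * 4;                        // eg. 8 -> 32 as above
    s.distBits_high = (baseQ >= 8) ? 0 : (32 / baseQ);  // made-up falloff to zero
    return s;
}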

5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)
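
For the "restore with some randomness" option, a minimal sketch (deterministic noise seeded from the block position, scaled to hit the decoded energy; prediction-based or ML restoration would replace the hash) :

#include <cmath>
#include <cstdint>

static uint32_t hash32(uint32_t x)    // any cheap deterministic hash will do
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// Fill a group of n coefficients whose individual values were never sent,
// so that their total energy (sum of squares) matches the decoded value.
void restore_group(float * coeffs, int n, float targetEnergy, uint32_t blockSeed)
{
    float sumSq = 0.f;
    for (int i = 0; i < n; i++)
    {
        uint32_t h = hash32(blockSeed + (uint32_t)i);
        float r = (float)h * (1.0f / 4294967296.0f) * 2.0f - 1.0f;  // roughly [-1,1)
        coeffs[i] = r;
        sumSq += r * r;
    }
    if (sumSq <= 0.f) return;
    float scale = sqrtf(targetEnergy / sumSq);
    for (int i = 0; i < n; i++)
        coeffs[i] *= scale;
}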

6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.
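
Roughly, the structure would look like this (a lossless version just to show the shape; signs are ignored, the entropy coder is a placeholder, and this recurses into every non-zero child the way the emailed version below does - the subband-style variant in point 2 would only keep subdividing the upper-left) :

#include <cstdlib>

typedef void (*PutFn)(int value, int knownMax);   // stand-in for a real entropy coder

static int block_sum(const int * c, int stride, int size)
{
    int s = 0;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            s += std::abs(c[y*stride + x]);
    return s;
}

static void send_quadtree(const int * c, int stride, int size, int total, PutFn put)
{
    if (total == 0 || size == 1)
        return;   // nothing left to say (signs are ignored in this sketch)

    int h = size / 2;
    const int * child[4] = { c, c + h, c + h*stride, c + h*stride + h };

    int childSum[4];
    for (int i = 0; i < 4; i++)
        childSum[i] = block_sum(child[i], stride, h);

    // The four sums add up to 'total', so the last one is implied.
    for (int i = 0; i < 3; i++)
        put(childSum[i], total);

    for (int i = 0; i < 4; i++)
        send_quadtree(child[i], stride, h, childSum[i], put);
}

// Top level : send the whole-block sum (caller has already excluded DC),
// then recursively send how it splits.
void send_block(const int coeffs[64], PutFn put)
{
    int total = block_sum(coeffs, 8, 8);
    put(total, -1);   // -1 : no known bound at the top
    send_quadtree(coeffs, 8, 8, total, put);
}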

7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.

8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.

9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.

10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.

For example, packJPG first sends the (quantized) location of the last-non-zero coefficient in Z-scan order. This turns out to be the best classifier of blocks for normal JPEG, so that is used as the primary bit of context for coding further information about the block.

You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that, which is not redundant with sending them; it's just the first gross bit of information about their distribution.
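
Something like this, in code terms (the 4x4-quadrant grouping is just one example of "coefficient groups", and the zigzag_index() helper from the sketch back at 3 1/2 is assumed) :

int zigzag_index(int x, int y);   // from the sketch at 3 1/2

// packJPG-style classifier : index of the last non-zero coefficient in
// zig-zag order (-1 if the AC's are all zero; DC is skipped).
int last_nonzero_zigzag(const int coeffs[8][8])
{
    int last = -1;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            if ((x | y) && coeffs[y][x] != 0)
            {
                int zi = zigzag_index(x, y);
                if (zi > last) last = zi;
            }
    return last;
}

// Crude "which groups are alive" mask : one bit per 4x4 quadrant that
// contains any non-zero AC coefficient.
int quadrant_nonzero_mask(const int coeffs[8][8])
{
    int mask = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            if ((x | y) && coeffs[y][x] != 0)
                mask |= 1 << ((y / 4) * 2 + (x / 4));
    return mask;
}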


I found this in email; I think there are some more somewhere ...

(feel free to ignore this) Currently every video coder in the world does something like mocomp then DCT (or a similar transform, eg. one that is orthonormal, complete, ...) on the residuals. The thing is, the DCT really doesn't make much sense on the residuals. Heck, even just removing the DC doesn't usually do much on the residuals, because the mocomp will generally match the DC, so the DC of the residuals is near zero. But the DCT still has value. The big advantage it has is the ability to send just a few coefficients and get a rough approximation of simple shapes like flat vs. gradient vs. noisy. For example if the residual is sort of speckly all over, to send that by coding values one by one you would have to send lots of values. With the DCT you can just send some signal in AC11 or AC22 and you get some rough approximation of splotches all over.

The DCT has another advantage; with quantization and R/D optimization, you want to be able to identify which coefficients are okay to crush. You want to be able to put the basis functions in order of perceptual importance so that you can send them in a sequence where truncating the tail is an okay way to send less data. The subtle issue in all cases is that the coder choice implies a general type of error that you will create, because the coder creates an ordering, and under R/D optimization, you will prefer to send the parts of the signal that are cheaper. eg. in the normal DCT + zigzag type of coder, sending high frequency signals is much more expensive than low frequency. The result is that R/D optimization on DCT+zigzag coders tends to make things smoother. That is both good and bad. It means you can't really let the R/D optimization go nuts or you lose all your high frequency energy and your video appears to have no "detail". It would be nice to have coders that create different types of error under heavy R/D optimization.

But maybe there's something better than the DCT?

1. Basis transform other than DCT. This is just a small change, not a huge shift in method, but some subsets are special :

1.a. Multiple bases. This makes the coder over-complete, which is bad in theory but can help in practice. For example you could have 3 different bases possible - DCT (cosine), Haar (steps), and Legendre (polynomials), and the coder could first send a selection of bases then the residuals.

1.b. KLT - optimal bases per frame. The encoder looks at the actual residuals and finds the optimum (in the sense of "most energy compacting") basis. You must then transmit the basis at the start of the frame. I believe this is not possible for 8x8 transforms, because sending the basis costs too much; you would have to send 64*64 coefficients, which is large compared to typical video frame sizes. It is possible, however, for 4x4 bases, and that could be cascaded with a 2x2 Haar to make an 8x8 transform. There are some problems with this, though - because the KLT bases are computed, the perceptual character of them is unknown; eg. how to order them by energy is known, but that order might not be their perceptual order. Also there is no "fast" transform because the bases are unknown; you must use straight basis addition to do the reverse transform. (there are weird cases where the KLT basis is clearly good; for example scenes that don't match the "natural/smooth" assumption of DCT, such as movie titles or credits, or scenes full of the same pattern (eg. videos of grass); but just like with color transforms, the disadvantage of KLT is that you no longer know exactly what your bases are, and you can't pre-decide their perceptual importance (eg. the JPEG quantizer matrix))

1.c. Overcomplete basis. Rather than multiple bases or per-frame bases, use a static pre-chosen overcomplete set. Maybe some cosines, some linear ramps, some exact hard edges at various locations and angles.

2. Quadtree with "don't care" distribution. I haven't quite figured this out, but the thinking goes like this : Perhaps the most important (perceptual) single piece of information about the residuals is the total amount of AC energy. This is just the L-something sum of the residuals. Less important is exactly how that energy is distributed. So maybe we could come up with a way of encoding the residuals such that we first send the sum of energy, and then we send information about the distribution, and when we increase the "quantizer" or we do R/D truncation, what we lose first is information about the distribution, not the total. That is, loss causes the location of energy to become imprecise, not the total. I don't have a great idea for how to do this, but maybe something like this : First send the AC total for the whole block. Then consider blocks in a quadtree hierarchy. For each block, you already know the total of energy, so transmit how that energy is split among the 4. Then for any child whose total is not zero, recurse onto that child and repeat. At this point it's lossless, so it's not very interesting; the question is how to introduce some loss. Two ideas :

2.a. Add some amount of "don't care" energy to each quad chunk. That is, you know the total for each quadtree block as you descend. Rather than sending the 4 portions which must add up to the total, send 4 portions which are LE the total; the remainder (missing energy) is "don't care" energy and is distributed in some semi-random way.

2.b. Quantize the distribution. Send the total first. Then the distribution is sent with some larger quantizer, and the missing parts are fixed. eg. you send a total of 13. The distribution gets an extra quantizer of 3. So for the first block you would send a distribution of 13/3 = 4. So you send how the 4 is spread. That only accounts for 4*3 = 12, so 1 was not sent; you put it somewhere random. If the distribution quantizer is 1, it's lossless; if the distribution quantizer is reasonably large, then you send no info about the distribution, just the total energy. This method is pretty poor at sending linear ramps or edges, though.

3. S/D/E separate coders ; I've been talking about this idea for a while without much substance. The idea is that the perceptually important thing really changes dramatically based on the local character. For smooth areas, you really must keep it smooth, but the exact level or type of ramp is not that important. For edges, you want to keep a sharp edge, and not introduce ringing or blocking, but any other detail near the edge gets severely visually masked, so you can fuck that up. For detail areas, the dominating thing is the amount of energy and the rough frequency of it; the exact speckles don't matter. Anyway, it occurred to me that rather than trying to do an SDE coder from the ground up, what you can do is just provide various coders, like maybe 1c and 2b and also normal DCT coding. You don't specifically ever decide "this block is smooth" and use the smooth coder, instead you just try all the coders and take the one that comes out best.

One of the difficulties with this is that you don't know the perceptual impact of your coding decisions. With a normal DCT zigzag type coder, you can just use the magnitude of the values as a rough estimate of the perceptual importance, and add some fudge factors for the DC being more important than AC and so on. With weird bases and weird coders you don't have a good way to make R/D decisions other than by actually running a perceptual D calculation for each R choice, which is several orders of magnitude slower.
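
The 2.b scheme in code, roughly (a toy largest-remainder allocation of the quantized units; the 13-with-distribution-quantizer-3 example above gives 4 units to spread and 1 unit of unplaced energy) :

struct QuantSplit
{
    int sent[4];     // quantized child portions actually transmitted
    int leftover;    // energy the quantized split doesn't account for
};

// childSum[] are the true child totals (they add up to 'total');
// distQ is the extra quantizer applied to the distribution only.
QuantSplit quantize_split(const int childSum[4], int total, int distQ)
{
    QuantSplit qs;
    int units = total / distQ;             // eg. 13/3 = 4 units to spread
    qs.leftover = total - units * distQ;   // eg. 1, placed "somewhere random"

    int rem[4], used = 0;
    for (int i = 0; i < 4; i++)
    {
        qs.sent[i] = childSum[i] / distQ;
        rem[i]     = childSum[i] % distQ;
        used      += qs.sent[i];
    }
    // Hand any remaining units to the children with the largest remainders.
    while (used < units)
    {
        int best = 0;
        for (int i = 1; i < 4; i++)
            if (rem[i] > rem[best]) best = i;
        qs.sent[best]++;
        rem[best] = -1;
        used++;
    }
    return qs;
}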

(there were some good responses to that mail as well...)

(If someone wants to give me a professorship and a few grad students so that I can solve these problems, just let me know anytime...)
