7/06/2009

07-06-09 - Small Image Compression Notes

Lapping appears to be a complete red herring. I've wasted a lot of time on it and I'm very angry. I've been trying to work up a lapped block DCT image coder. The idea is that a block-DCT-based coder is good for speed and parallelization on micro-core architectures, good for memory bandwidth, etc., and the lapping theoretically lets you avoid some of the nasty block artifacts by effectively extending your basis functions.

In practice it just doesn't work. I've tried lots of different lapping methods, and in all of them, if I parameterize the lap amount with a Kaiser-Bessel-derived window and then tweak the lap amount to maximize SSIM, it tunes to no lapping at all. Basically what's happening is that the extra bit rate cost caused by the forward lap scrambling things up is too great for the win of smoother basis functions on decompress to make up. Obviously in a few contrived cases it does help, such as on very smooth images at very high compression. (Of course the large lapped basis functions are a form of modeling: they will help any time the image is smooth over the larger area, and hurt when it is not.)

The really silly thing about this is that areas where the image is very smooth over a large area are the cases we already handle very well!! Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily. In areas that aren't smooth, lapping actually makes artifacts like ringing worse.

The other issue is I'm having a little trouble with lagrange bitstream optimization. Basically my DCT block coder does a form of "trellis quantization" (which I wrote about before) where it can selectively zero coefficients if it decides it gets an R/D win by doing so. Obviously this gives you a nice RMSE win at a given rate (by design it does so - any time it finds a coefficient to zero, it steps up the R/D slope). But what does this actually do?

Think about trying to make the best bit stream for a given rate, say two bits per pixel. If we don't do any lagrange optimization at all, we might pick some quantizer, say Q = 16. Now we turn on lagrange optimization. It finds some coefficients to zero, which reduces the bit rate, so to get back to the target bit rate we can use a lower quantizer. It searches for the right lagrange lambda by iterating a few times, and we wind up with something like Q = 12, some values zeroed, and a better RMSE. What's happened is that we got to use a lower quantizer, so we made more, larger, nonzero coefficients, and then we selectively zeroed the few that gave the biggest R/D win.
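As a rough illustration of the zeroing decision described above, here is a minimal sketch. It is not the actual coder: the rate model `rate_bits` is a made-up stand-in for a real entropy coder, the names `rd_zero_pass`, `q` and `lam` are hypothetical, and each coefficient is decided greedily on its own rather than with a real trellis. The only point is the test J = D + lambda * R, where a coefficient gets zeroed when that lowers J.

```python
import math

def rate_bits(level):
    """Hypothetical rate model (an assumption, not a real entropy coder):
    a zero costs 1 bit, larger magnitudes cost more."""
    if level == 0:
        return 1.0
    return 2.0 + 2.0 * math.log2(abs(level) + 1)

def rd_zero_pass(coeffs, q, lam):
    """Quantize each coefficient with step q, then zero it whenever
    that lowers the Lagrangian cost J = D + lam * R."""
    levels = []
    for c in coeffs:
        level = int(round(c / q))
        if level != 0:
            j_keep = (c - level * q) ** 2 + lam * rate_bits(level)
            j_zero = c ** 2 + lam * rate_bits(0)
            if j_zero < j_keep:
                level = 0
        levels.append(level)
    return levels

coeffs = [100.0, 40.0, 14.0, -9.0, 7.0]
print(rd_zero_pass(coeffs, 12, 0.0))   # [8, 3, 1, -1, 1]  (plain quantization)
print(rd_zero_pass(coeffs, 12, 50.0))  # [8, 3, 1, 0, 0]   (small values zeroed)
```

With lam = 0 this is just plain quantization. As lam goes up, the small trailing values get killed first, which is what frees up the bits to drop Q in the first place.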

But what does this actually do to the image qualitatively? It increases the quality everywhere (Q = 16 goes to Q = 12), but then it stomps on the quality in a few isolated spots (trellis quantization zeros some coefficients). If you compare the two images, the lagrange optimized one looks better everywhere, but is very smooth and blurred out in a few spots. Normally this is not a big deal and it's just a win, but sometimes I've found it actually looks really awful.

Even if you optimize for some perceptual metric like SSIM it doesn't detect how bad this is, because SSIM is still a local measurement and this is a nonlocal artifact. Your eyes very quickly pick out that part of the image has been blurred way more than the rest of it. (In other cases it does the same thing, but it's actually good. It sort of acts like a bilateral filter: it will give bits to the high contrast edges and kill coefficients in the texture part, so for images of skin, say, it does a nice job of keeping the edges sharp and just smoothing out the interior, as opposed to non-lagrange-optimized JPEG, which allocates bits equally and will preserve the skin pore detail and make the edges all ringy and chopped up.)

I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.

I guess this is also an example of a computer problem that I've observed many times in various forms : when you let a very aggressive optimizer run wild seeking some path to maximize some metric, it will do so, and if your metric does not perfectly measure exactly the thing that you actually want to optimize, you can get some very strange/bad results.

10 comments:

ryg said...

Lapping: Yeah, that mirrors my experience as well. It looks nice on paper, but the gains in PSNR are relatively small and completely vanish once you compare based on SSIM or (subjective) perceptual quality. Good idea to turn "how much does lapping actually help" into an optimization problem.

"I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.": There's the Psy RDO stuff that's been added to x264 some months ago. I couldn't find details (didn't spend much time looking though), but looking at the images, what seems to be happening is that it tries to keep the amount of "noisiness", measured as e.g. sum of absolute difference between all pixels in their block and their immediate neighbors.

Put differently, you treat the low-frequency part as usual, and for the high-frequency part, you try to converge on something that has roughly the same energy but not necessarily at the right frequencies. To give a 1D example, if your DCT coefficients are "18, 4, 3, 2, 1, 1, 1, 1", such an algorithm would give you something like "18, 4, 3, 2, 0, 0, 2, 0".
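A quick check of that 1D example, as a sketch only. It assumes "energy" means the sum of squared coefficients and that the last four coefficients count as the high-frequency part. Both are assumptions made for illustration (`SPLIT` and the helper names are hypothetical), not details of what x264 does.

```python
def energy(coeffs):
    """Sum of squared coefficients (assumed definition of energy)."""
    return sum(c * c for c in coeffs)

def nonzero_count(coeffs):
    return sum(1 for c in coeffs if c != 0)

SPLIT = 4  # assumption: indices 4..7 are the "high-frequency" part

original  = [18, 4, 3, 2, 1, 1, 1, 1]
psy       = [18, 4, 3, 2, 0, 0, 2, 0]  # the energy-preserving version
truncated = [18, 4, 3, 2, 0, 0, 0, 0]  # what plain zeroing would give

print(energy(original[SPLIT:]))   # 4
print(energy(psy[SPLIT:]))        # 4
print(energy(truncated[SPLIT:]))  # 0
print(nonzero_count(original), nonzero_count(psy))  # 8 5
```

So the substituted block keeps the same high-frequency energy with a single nonzero coefficient in the tail instead of four, whereas just truncating throws all of that energy away.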

So if you have an image region that's mainly smooth with some noise on top (due to film grain, fine details, whatever), you reconstruct the smooth part properly, and for the noise you just substitute some arbitrary noise at roughly the right frequency and roughly the right intensity - playing on the fact that the human visual system is very good at detecting the presence of noise and very bad at telling one 8x8 block of noise from another.

The main question would be how many high-frequency coefficients you need to make noise still seem noisy and not let the transform basis functions shine through too much.

nothings said...

"The main question would be how many high-frequency coefficients you need to make noise still seem noisy and not let the transform basis functions shine through too much."

Maybe add some "noise" basis functions that are less correlated than the normal ones! (they'd be redundant, but...)

Autodidactic Asphyxiation said...

"Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily"

What do people use for JPEG deblocking? All I could find was this.

cbloom said...

I'll write a post about deblocking, I've collected a bunch of papers on it.

"Maybe add some "noise" basis functions that are less correlated than the normal ones! (they'd be redundant, but...)"

Yeah, this is something I've been thinking about too, maybe I'll write a post instead of a long comment...

Jon Olick said...

I've done something similar before. I detect a noisy region at a particular bit-plane and just generate noise on the decompression side instead of trying to encode the noise exactly. Works really well.

cbloom said...

Actually I just tried something totally brain-dead and it works great.

I only allow the RD optimization to consider killing the very last non-zero value, and only if it's 1, and only if it's not in a very important AC coefficient.

This eliminates all the unsightly variation and surprisingly preserves 90% of the RDO win. It also makes the encoder way faster.
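Roughly, that rule could look like the sketch below. The assumptions are that `levels` holds the quantized coefficients of one block in zigzag order, and that "very important" means the first few positions. The `protected` cutoff of 3 is an invented value for illustration, and the function names are hypothetical.

```python
def last_nonzero_index(levels):
    """Index of the last non-zero level in scan order, or -1 if none."""
    for i in range(len(levels) - 1, -1, -1):
        if levels[i] != 0:
            return i
    return -1

def restricted_candidate(levels, protected=3):
    """Return the index of the single coefficient the RD search may zero,
    or None if the block offers no allowed candidate.

    protected: number of leading positions (DC plus the first AC terms)
    that are never zeroed. This cutoff is an assumption.
    """
    i = last_nonzero_index(levels)
    if i < 0:
        return None              # block is already all zero
    if abs(levels[i]) != 1:
        return None              # only a trailing +/-1 may be killed
    if i < protected:
        return None              # too important to touch
    return i

print(restricted_candidate([8, 3, 1, 0, 1, 0, 0, 0]))   # 4
print(restricted_candidate([8, 3, 1, 0, 2, 0, 0, 0]))   # None
print(restricted_candidate([8, 1, 0, 0, 0, 0, 0, 0]))   # None
```

The encoder would then run its usual R/D comparison on that one candidate only, instead of searching every coefficient, which fits with it being much faster.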

nothings said...

I've probably posted about this here before, but for quantized DCT deblocking, I still think there's something clever to be done by knowing about the quantizer.

I.e. when you dequantize a DCT coefficient, you put it in the middle of its range -- since this will minimize error across all possible images that would have had that quantized coefficient. It would be better to instead choose a dequantized value within the allowed range which prefers decoding to an image that is "more like what you find in real images" instead of decoding to the mathematical average of all mathematically possible images.
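To make the "allowed range" concrete, here is a small sketch. It assumes a plain round-to-nearest quantizer with step `q`, so level n covers the interval from (n - 0.5) * q to (n + 0.5) * q. The `toward_zero` shift is just a fixed, hypothetical stand-in for whatever smarter image prior would actually pick the value; it only demonstrates that the decoder is free to move inside the bin.

```python
def bin_range(level, q):
    """Interval of original values that round to this level
    (assuming a round-to-nearest quantizer with step q)."""
    return ((level - 0.5) * q, (level + 0.5) * q)

def dequantize(level, q, toward_zero=0.0):
    """Reconstruct a coefficient from its quantized level.

    toward_zero = 0.0 gives the usual bin midpoint. Values up to 0.5
    pull the reconstruction toward zero while staying inside the bin.
    The fixed shift is an assumption standing in for a real prior.
    """
    if not 0.0 <= toward_zero <= 0.5:
        raise ValueError("toward_zero must be in [0, 0.5]")
    if level == 0:
        return 0.0
    sign = 1 if level > 0 else -1
    return sign * (abs(level) - toward_zero) * q

lo, hi = bin_range(3, 16)
print(lo, hi)                     # 40.0 56.0
print(dequantize(3, 16))          # 48.0 (midpoint)
print(dequantize(3, 16, 0.25))    # 44.0 (still inside the bin)
print(dequantize(-3, 16, 0.25))   # -44.0
```

Any choice inside the bin re-quantizes to the same level, so the decoded image stays consistent with the bitstream. Picking those values jointly across a block (or across blocks) is the hard part that this sketch does not attempt.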

Smartly choosing the right dequantized values for multiple coefficients in a given block might make it possible to get rid of ringing; deblocking might require looking at cross-block properties, and I don't have the slightest clue where to even begin to figure out what mathematics would be involved in a decoder with either of those behaviors.

cbloom said...

Yeah, Sean, there are papers on exactly that. I'll mention it when I write about deblocking.

nothings said...

http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=723513

nothings said...

Ah, good, there you go. I've never read any of these papers because of the pay wall, so I could never be sure what they were even about.
