11/18/2008

11-18-08 - DXTC Part 2

First of all, let's dispell the illusion that some people have that DXT1 is "pretty good". DXT1 is fucking awful. It compresses at 1.33333 bits per byte (4 bits per pixel). That's very large as far as image compressors are concerned. For typical images, around 4.0 bpb is true lossless, around 1.5 is perceptually lossless, and around 0.5 is "very good". In fact wavelet compressors can get as low as 0.1 bpb and acheive about the same quality as DXT1. Despite this I've heard smart people saying that "DXT1 is pretty good". Yes, it is a convenient fixed size block, and yes the decoder is extremely fast and simple, but as far as quality is concerned it is not even close. At 1.33333 it should be near lossless.

Here are some numbers on various DXT1 compressors. The numbers in the table are the RMSE (sqrt of the L2 error). The far right column is a wavelet compressor for comparison; it's not the best wavelet compressor in the world by a long shot, it's "cbwave" which is very old and which I designed for speed, not maximum quality. In any case it gives you an idea how far off DXT1 is. (BTW I always try to show results in RMSE because it is linear in pixel magnitude, unlike MSE or PSNR which is a very weird nonlinear scale). More discussion after the table...

(ADD: the "ryg" results shown here have a bug and should be ignored; see DXTC Post Summary )

file Squish opt Squish CB opt CB ryg D3DX8 FastDXT cbwave
kodim01.bmp 8.2808 8.3553 8.352 8.6924 9.374 9.8466 9.9565 2.6068
kodim02.bmp 6.1086 6.2876 6.1287 6.3025 7.52 7.4308 8.456 1.6973
kodim03.bmp 4.7804 4.9181 4.8312 5.0225 5.855 6.094 6.4839 1.3405
kodim04.bmp 5.6913 5.8116 5.7285 5.9394 6.9408 7.1032 7.3189 1.8076
kodim05.bmp 9.6472 9.7223 9.707 10.112 10.8934 11.273 12.0156 2.9739
kodim06.bmp 7.1472 7.2171 7.1777 7.44 8.1005 8.5195 8.6202 2.0132
kodim07.bmp 5.7804 5.8834 5.8379 6.0583 6.8153 7.2182 7.372 1.4645
kodim08.bmp 10.2391 10.3212 10.346 10.747 11.3992 11.8703 12.2668 3.2936
kodim09.bmp 5.2871 5.3659 5.3306 5.5234 5.9884 6.5332 6.6716 1.6269
kodim10.bmp 5.2415 5.3366 5.2777 5.4633 5.9377 6.4601 6.4592 1.7459
kodim11.bmp 6.7261 6.8206 6.7643 7.0216 7.8221 8.1056 8.2492 1.8411
kodim12.bmp 4.7911 4.8718 4.8204 4.9863 5.6651 6.005 6.0748 1.5161
kodim13.bmp 10.8676 10.9428 10.925 11.4237 12.402 12.7139 12.9978 4.1355
kodim14.bmp 8.3034 8.3883 8.3398 8.6722 9.4258 9.896 10.8481 2.4191
kodim15.bmp 5.8233 5.9525 5.8568 6.0862 6.6749 7.3085 7.4932 1.6236
kodim16.bmp 5.0593 5.1629 5.0863 5.2851 5.8093 6.3361 6.1592 1.546
kodim17.bmp 5.5019 5.6127 5.5313 5.7358 6.4975 6.7395 6.8989 1.7166
kodim18.bmp 7.9879 8.0897 8.0192 8.3716 9.7744 9.5357 9.7857 2.9802
kodim19.bmp 6.5715 6.652 6.6692 6.91 8.0128 7.9229 8.0096 2.0518
kodim20.bmp 5.4533 5.5303 5.4895 5.6864 6.3457 6.4878 6.8629 1.5359
kodim21.bmp 7.1318 7.2045 7.1724 7.4582 8.1637 8.4703 8.6508 2.0659
kodim22.bmp 6.43 6.5127 6.4644 6.7137 7.8264 8.0046 7.9488 2.2574
kodim23.bmp 4.8995 5.0098 4.9244 5.0906 5.6989 6.3057 6.888 1.3954
kodim24.bmp 8.4271 8.5274 8.4699 8.8564 9.3906 9.9389 10.5156 2.4977
clegg.bmp 14.9733 15.2566 15.1755 15.7168 16.3563 21.471 32.7192 10.5426
FRYMIRE.bmp 10.7184 12.541 12.132 12.8278 12.989 16.7308 28.9283 6.2394
LENA.bmp 7.138 7.2346 7.1763 7.4264 8.1203 8.742 9.5143 4.288
MONARCH.bmp 6.5526 6.6292 6.5949 6.846 7.5162 8.1053 8.6993 1.6911
PEPPERS.bmp 6.3966 6.5208 6.4557 6.677 7.3618 8.1855 8.8893 2.3022
SAIL.bmp 8.3233 8.3903 8.3598 8.6627 9.8685 9.7838 10.5673 2.9003
SERRANO.bmp 6.3508 6.757 6.8385 7.9064 7.5303 9.0549 18.3631 4.6489
TULIPS.bmp 7.5768 7.656 7.6146 7.8786 8.4084 9.3817 10.5873 2.2228

Back to comparing the DXT1 encoders. BTW the test set here is the Kodak image set plus the Waterloo Bragzone image set. The Kodak set is all photographs that are pretty noisy, and there's not a huge difference in the coders. The Bragzone image set has some synthetic images with things like gradients which are harder to compress well, and there you can really dramatically see the bad encoders fall apart. In particular if you look at the results on "clegg" and "frymire" and "serrano" you can see how bad the "FastDXT" coder is.

The "Squish" in the table is the iterative cluster fit with uniform weighting. All coders work on linear RGB error; and the MSE is mean per pixel not per component.

The "CB" encoder is a simple 2-means fit. I seed the means by doing the PCA to find the best fit line. I put a plane perpendicular to that line through the average and take all the points on each half to be the two clusters, average the cluster to seed the means, and then iteratively refine the means by reclustering. Once I have the 2-means I do a simple search to find the best 565 DXT end points to find the two means. There are 3 cases to try :

1. put c0 and c1 = the two means

2. put c2 and c3 = the two means (so c0 and c1 are pushed out)

3. make (c0,c2) straddle mean 1 and (c3,c1) straddle mean 2 - this is best for gaussian clusters around the mean

A 2-means like this is slightly better than doing a pca "range fit" like Simon's fast method or the "ryg" method. If the data was all Gaussian noise, they would be equivalent, but of course it's not. You often get blocks that have a bunch of pixels at the low end which are all exactly the same color ( for example, all perfect black), and then a spread of a bunch of junk at the high end (like some orange, some yellow, etc.). You want to put one end point exactly on perfectly black and put the other endpoint at the center of the cluster on the high end.

"CB opt" and "Squish opt" take the results of the CB and Squish compressors and then improve them by iterative search. Simon Brown on his page mentions something about doing greedy endpoint refinement but claims it "doesn't converge because the indeces change". That's silly, of course it converges.

To do a greedy search : try wiggling one or both end points in 565 space. Find new best index fit for the new end points. Measure new error. If the error is lower, take the step.

Of course that works and it does converge and it's pretty simple and improves the encoding after Squish.

In fact you can do even better by doing simulated annealing instead of a plain greedy search. We should know by now that any time we have a greedy search like this where we can measure the error and it can get stuck in local minima, we can improve it with something like simulated annealing. I use 256 steps of annealing with a sinusoidal decreasing temperature pattern. I randomly pick a way to wiggle the two end points (you need to consider wiggling both, not just single wiggles). I try the wiggle and measure the error delta. Negative errors (improvements) are always taken, positive errors are taken probabilitistically based on the temperature. This is what was done in the "opt" coders above.

Most of the time the annealing search just makes a small improvement, but once in a while (such as on "frymire" and "serrano") it really finds something remarkable and makes a big difference.

Simon's cluster fit alg is very sweet. I didn't really understand it at first from the description of the code, so I'm going to redescribe it for myself here.

The basic idea is you find an assignment of indices, and from that solve a best fit to give you the end points for those indices. If you think about all the colors in the block living in 3x3 color space, assigning indices is like giving labels to each point. All the points with the same label are a "cluster". Then each cluster is fit with either an end point or one of the interpolated points.

Once you know the clusters, putting the best two end points is a simple linear optimization. You can solve the least squares analytically, it's not like a matrix iteration least squares problem or anything.

So the trick is how to decide the indices. If you tried them all, it would be 4^16 ways, which is way too many. So what Simon does is create a linear ordering of the points using the PCA axis, then try all clusterings that can be created by planes perpendicular to that linear axis. That's equivalent to just trying all groupings of the indeces sorted by their dot along the PCA axis. That is, the clusters are

[0,i) , [i,j) [j,k) [k,16)

for all i,j,k
which is a manageable amount to search, and gives you most of the interesting clusterings that you care about. Something that might improve Squish is tring a few different axes and picking the best.

BTW this end point optimization is very approximate. One issue is that it assumes the best index for each point doesn't change. It also of course just uses real number arithmetic to make the 1/3 points, not the actual integer rounding that's done. Those factors are actually pretty big.

4 comments:

castano said...

I think my "DXT compression in CUDA" article has a fairly good explanation of the squish method:

http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/dxtc/doc/cuda_dxtc.pdf

castano said...

Have you tried using the axis returned by 2-means fit as the starting point for squish?

Do you have any thoughts about how to compress the Alpha blocks? I currently use an iterative end point optimization:

- Compute the endpoints based on the best fit line or bounding box
- Assign indices for these endpoints.
- Compute end points that minimize error for this index assignment.
- Repeat second step until the algorithm converges.

cbloom said...

No, I haven't tried any other axes for Squish.

Actually one of the ways you could definitely beat "Squish opt" is just by randomly trying a few axes to start squish with.

I haven't really thought about alpha blocks at all yet.

castano said...

Yes, but that would multiply the compression time by the number of trials. It would be nice if it were possible to pick a better axis without spending too much time.

BTW, these are the results on NVTT with normal and fast compression options:

'kodim01.png' 8.3178 8.8283
'kodim02.png' 6.1999 6.5642
'kodim03.png' 4.8679 5.199
'kodim04.png' 5.7507 6.0604
'kodim05.png' 9.6922 10.1898
'kodim06.png' 7.1862 7.5565
'kodim07.png' 5.8445 6.1761
'kodim08.png' 10.2765 10.8359
'kodim09.png' 5.3387 5.6826
'kodim10.png' 5.2946 5.6115
'kodim11.png' 6.7792 7.1448
'kodim12.png' 4.8437 5.1419
'kodim13.png' 10.9068 11.5045
'kodim14.png' 8.3499 8.7971
'kodim15.png' 5.8989 6.1742
'kodim16.png' 5.1197 5.4364
'kodim17.png' 5.5792 5.8778
'kodim18.png' 8.0427 8.4399
'kodim19.png' 6.6181 7.0284
'kodim20.png' 5.4907 5.7841
'kodim21.png' 7.18 7.5852
'kodim22.png' 6.4762 6.8052
'kodim23.png' 4.967 5.2579
'kodim24.png' 8.4657 8.8632
'clegg.tif' 15.0331 16.114
'frymire.tif' 12.1675 13.317
'lena.tif' 7.0679 7.3861
'monarch.tif' 6.6051 6.915
'sail.tif' 8.3798 8.7851
'serrano.tif' 6.5715 8.4293
'tulips.tif' 7.6531 8.0117
Average: 7.3214 7.7904

It looks like my squish version is slightly better. Hmm... I thought that my fast compressor was equivalent to ryg's....

old rants