12/09/2010

12-09-10 - Perceptual vs TID

My TID2008 score is the RMSE of fitting to match MOS values (0-9) ; fit from perceptual metric to MOS score is of the form Ax + Bx^C , so 0 -> 0 , and for fitting purposes I invert and normalize the scale so that 0 = no error and 1 = maximum error. For fitting I try {value},{acos(value)}, and {log(value)} and use whichever is best.

I also exclude the "exotic" distortions from TID because they are weird and not something well handled by most of the metrics (they favor SSIM which seems to be one of the few that handles them okay). I believe that including them is a mistake because they are very much unlike any distortion you would ever get from a lossy compressor.

I also weight the MOS errors by 1/Variance of each MOS ; basically if the humans couldn't agree on a MOS for a certain image, there's no reason to expect the synthetic metric to agree. (this is also like using an entropy measure for modeling the human data; see here for example)


MSE        : 1.169795  [1,2]
MSE-SCIELAB: 0.808595  [2]
MS-SSIM    : 0.691881  [1,2]
SSIM       : 0.680292  [1]
MS-SSIM-fix: 0.639193  [2] (*)
WSNR       : 0.635510  [1]
PSNR-HVS   : 0.583396  [1,2] (**)
VIF        : 0.576115  [1]
PSNR-HVS-M : 0.564279  [1] (**)
IW-MS-SSIM : 0.563879  [2]
PSNR-HVS-M : 0.552151  [2] (**)
MyDctDelta : 0.548938  [2]

[1] = scores from TID reference "metrics_values"
[2] = scores from me
[1,2] = confirmed scores from me and TID are the same

the majority of these metrics are Y only, using rec601 Y (no gamma correction) (like metric mux).

WARNING : the fitting is a nasty evil business, so these values should be considered plus/minus a few percent confidence. I use online gradient descent (aka single layer neural net) to find the fit, which is notoriously tweaky and sensitive to annoying shit like the learning rate and the order of values and so on.

(*) = MS-SSIM is the reference implementation and I confirmed that I match it exactly and got the same score. As I noted previously , the reference implementation actually uses a point subsample to make the multiscale pyramid. That's obviously a bit goofy, so I tried box subsample instead, and the result is "MS-SSIM-fix" - much better.

(**) = PSNR-HVS have received a PSNR-to-MSE conversion (ye gods I hate PSNR) ; so my "PSNR-HVS-M" is actually "MSE-HVS-M" , I'm just sticking to their name for consistency. Beyond that, for some reason my implementation of PSNR-HVS-M ([2]) is better than theirs ([1]) and I haven't bothered to track down why exactly (0.564 vs 0.552).

WSNR does quite well and is very simple. It makes the error image [orig - distorted] and then does a full image FFT, then multiplies each tap by the CSF (contrast sensitivity function) for that frequency, and returns the L2 norm.

PSNR-HVS is similar to WSNR but uses 8x8 DCT's instead of full-image FFT. And of course the CSF for an 8x8 DCT is just the classic JPEG quantization matrix. That means this is equivalent to doing the DCT and scaling by one over the JPEG quantizers, and then taking the L2 delta (MSE). Note that you don't actually quantize to ints, you stay in float, and the DCT should be done at every pixel location, not just 8-aligned ones.

VIF is the best of the metrics that comes with TID ; it's extremely complex, I haven't bothered to implement it or study it that much.

PSNR-HVS-M is just like PSNR-HVS, but adds masked thresholds. That is, for each tap of the 8x8 DCT, a just noticeable threshold is computed, and errors are only computed above the threshold. The threshold is just proportional to the L2 AC sum of the DCT - this is just variance masking. They do some fiddly shit to compensate for gross variance vs. fine variance. Notably the masking is only used for the threshold, not for scaling above threshold (though the CSF still applies to scaling above threshold).

IW-MS-SSIM is "Information Weighted" MS-SSIM. It's a simple weight term for the spatial pooling which makes high variance areas get counted more (eg. edges matter). As I've noted before, SSIM actually has very heavy variance masking put in (it's like MSE divided by Variance), which causes it to very severely discount errors in areas of high variance. While that might actually be pretty accurate in terms of "visibility", as I noted previously - saliency fights masking effects - that is, the edge areas are more important to human perception of quality. So SSIM sort of over-crushes variance and IW-SSIM puts some weight back on them. The results are quite good, but not as good as the simple DCT-based metrics.

MyDctDelta is some of the ideas I wrote about here ; it uses per-pixel DCT in "JPEG space" (that is, scaled by 1/JPEG Q's), and does some simple contrast-band masking, as well as multi-scale sum deltas.

There's a lot of stuff in MyDctDelta that could be tweaked that I haven't , I have no doubt the score on TID could easily be made much better, but I'm a bit afeared of overtraining. The TID database is not big enough or varied enough for me to be sure that I'm stressing the metric enough.


Another aside about SSIM. The usual presentation says compute the two terms :


V = (2*sigma1*sigma2 + C2)/(sigma1_sq + sigma2_sq + C2);
C = (sigma12 + C3)/(sigma1*sigma2 + C3);

for variance and correlation dot products. We're going to wind up combining these multiplicatively. But then notice that (with the right choice of C3=C2/2)

V*C = (2*sigma12 + C2/2)/(sigma1_sq + sigma2_sq + C2);

so we can avoid computing some terms.

But this is a trick. They've actually changed the computation, because they've changed the pooling. The original SSIM has three terms : mean, variance, and correlation dot products. They are each computed on a local window, so you have the issue of how you combine them into a single score. Do you pool each one spatially and then cross-multiply? Or do you cross-multiply and then pool spatially ?


SSIM = Mean{M} * Mean{V} * Mean{C}

or

SSIM = Mean{M*V*C}

which give quite different values. The "efficient" V*C SSIM winds up using :

SSIM = Mean{M} * Mean{V*C}

which is slightly worse than doing separate means and multiplying at the end (which is what MS-SSIM does).

No comments:

old rants