
10-05-12 - OodleLZ Encoder Speed Variation with Worker Count

Thought I would look into this. One thing I've been wondering is whether putting workers on the hyper-threads helps or not.

Measured speed on enwik8. This is with the slow optimal encoder, to give it something to do. enwik8 is encoded by breaking it into 4 MB chunks (24 of them). Each chunk gets a 4 MB dictionary overlap precondition. Matches before the overlap are found using the LRM (Long Range Matcher). The LRM is created once for the whole file and shared between all chunks.
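
In code terms, the chunked setup is roughly this; just a sketch, where the names (EncodeFile, EncodeChunk, LongRangeMatcher) are made up for illustration and are not the actual Oodle API :

#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real per-chunk encoder; in the real thing this is the slow
// optimal LZ encode, and LongRangeMatcher stands in for whatever the LRM is.
struct LongRangeMatcher { };

static std::vector<uint8_t> EncodeChunk(const uint8_t* dictionaryStart, // start of overlap precondition
                                        const uint8_t* chunkStart,      // bytes this chunk actually emits
                                        size_t chunkLen,
                                        const LongRangeMatcher* lrm)    // finds matches before the overlap
{
    (void)dictionaryStart; (void)chunkStart; (void)chunkLen; (void)lrm;
    return {};  // placeholder
}

// Break the file into 4 MB chunks, each preconditioned with up to 4 MB of
// overlap dictionary; the LRM is built once for the whole file and shared.
std::vector<std::vector<uint8_t>> EncodeFile(const uint8_t* file, size_t fileLen,
                                             const LongRangeMatcher* lrm)
{
    const size_t kChunkSize = 4 << 20;  // 4 MB chunks (24 of them for enwik8)
    const size_t kOverlap   = 4 << 20;  // 4 MB dictionary overlap precondition

    std::vector<std::vector<uint8_t>> compressed;
    for (size_t pos = 0; pos < fileLen; pos += kChunkSize)
    {
        size_t len     = (fileLen - pos < kChunkSize) ? (fileLen - pos) : kChunkSize;
        size_t overlap = (pos < kOverlap) ? pos : kOverlap;  // clamp at the start of the file
        // chunks are independent, so these calls are what get handed to workers
        compressed.push_back(EncodeChunk(file + pos - overlap, file + pos, len, lrm));
    }
    return compressed;
}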

What we see :

The speed dip from 0 to 1 workers is expected; it's the cost of firing up threads and communication and chunking and such. (0 = synchronous, just encode on the main thread).

My machine has 4 real cores and 8 hyperthreads. From 1 to 4 workers we see not-quite-linear speedup, but big steps. Once we get into the hyperthreads the benefit is smaller, but I'm still seeing steady speedup, which surprises me a bit; I thought it would flatten out more after 4 workers.

(the wiggle at 7 is probably just a random fluctuation in Windows (some service doing something I didn't ask it to do, you bastards); I only ran this test once so the numbers are not very solid; normally I run 40 trials or so when measuring speeds on Windows).
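For reference, the shape of the test is roughly this; just a sketch using std::thread and an atomic chunk counter, not the actual Oodle worker system :

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Each worker grabs chunks off a shared atomic counter until they're all done.
static void EncodeAllChunks(int numWorkers, int numChunks)
{
    std::atomic<int> next(0);
    auto work = [&]() {
        for (int i = next++; i < numChunks; i = next++)
        {
            // EncodeChunk(...) for chunk i would go here
        }
    };
    if (numWorkers == 0) { work(); return; }  // 0 = synchronous, encode on the main thread
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) workers.emplace_back(work);
    for (std::thread& t : workers) t.join();
}

int main()
{
    for (int numWorkers = 0; numWorkers <= 8; ++numWorkers)
    {
        auto t0 = std::chrono::steady_clock::now();
        EncodeAllChunks(numWorkers, 24);  // enwik8 in 4 MB chunks = 24 chunks
        std::chrono::duration<double> seconds = std::chrono::steady_clock::now() - t0;
        std::printf("%d workers : %.3f s\n", numWorkers, seconds.count());
    }
    return 0;
}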

And here's the Oodle ThreadProfile of the encode, showing what's happening on all the threads :



Of course part of the reason for the not-quite-linear speedup is the gap at the end when not all the workers are busy. You can fix that by using smaller chunks, but it's really not anything to worry too much about. While it does affect the latency of this single "encode enwik8" operation, it doesn't affect throughput of the overall system under multiple workloads.
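To put a rough number on that end-of-file gap, here's an idealized model; it only accounts for the tail where workers run out of chunks, and it assumes equal chunk times and fully independent workers (which hyperthreads certainly are not) :

#include <cstdio>

// Idealized model: 24 equal chunks, fully independent workers. It shows where
// the end-of-file gap comes from, not the shape of the real measured curve.
int main()
{
    const int numChunks = 24;  // enwik8 in 4 MB chunks
    for (int workers = 1; workers <= 8; ++workers)
    {
        int waves = (numChunks + workers - 1) / workers;        // ceil(chunks / workers)
        double speedup    = (double)numChunks / (double)waves;  // vs 1 worker
        double efficiency = speedup / (double)workers;          // fraction of worker time busy
        std::printf("%d workers : %2d chunk-times of latency, %3.0f%% efficiency\n",
                    workers, waves, efficiency * 100.0);
    }
    return 0;
}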


OodleLZHLW enwik8 compressed size variation with different chunkings :


28,326,489   4 MB chunks - no LRM
27,559,112   4 MB chunks with LRM
27,098,361   8 MB chunks with LRM, 4 matches
26,976,079   16 MB chunks, 4 matches
26,939,463   16 MB chunks, 8 matches
26,939,812   16 MB chunks, 8 matches, with thresholds

In each case the amount of overlap is equal to the chunk size (it's really the overlap that affects the amount of compression). After the first one, all the others are with LRM. Note that the effective local dictionary size varies as you parse through a chunk; eg. with 4 MB chunks you start with 4 MB of overlap, so you have an effective 4 MB local window, and as you parse the window effectively grows up to a max of 8 MB, so the end of each chunk is better compressed than the beginning.
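Or in code, something like this (illustrative only) :

#include <cstddef>

// Effective local match window at a given position, per the description above:
// you start a chunk with `overlap` bytes behind you and the window grows as
// you parse, reaching overlap + chunkSize at the end of the chunk.
size_t EffectiveLocalWindow(size_t filePos, size_t chunkSize, size_t overlap)
{
    size_t posInChunk = filePos % chunkSize;       // where we are within the current chunk
    size_t window     = overlap + posInChunk;      // e.g. 4 MB at chunk start, 8 MB at chunk end
    return (window > filePos) ? filePos : window;  // clamped near the start of the file
}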

My LZHLW optimal parse only considers 4 matches normally; as the overlap gets bigger, that becomes a worse compromise. Part of the problem is how those matches are chosen - I just take the 4 longest matches (and the lowest offset at each unique length). Normally this compromise is okay - you get a decent sampling of matches to choose from, and on moderate file sizes the cost of going from infinite matches to 16 to 4 is not that great - but as the dictionary gets bigger, you will sometimes fill all 4 slots with high-offset matches (because they provide the longest match lengths) and not keep any low offsets to try.
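That selection heuristic is basically this; a sketch, where the Match struct and the names are made up and are not the actual LZHLW code :

#include <algorithm>
#include <cstddef>
#include <vector>

struct Match { size_t length; size_t offset; };

// Keep the maxCandidates longest matches, and at each unique length keep only
// the lowest offset. Note that long high-offset matches can fill every slot.
std::vector<Match> SelectCandidates(std::vector<Match> all, size_t maxCandidates)
{
    // longest first; among equal lengths, lowest offset first
    std::sort(all.begin(), all.end(), [](const Match& a, const Match& b) {
        return (a.length != b.length) ? (a.length > b.length) : (a.offset < b.offset);
    });

    std::vector<Match> kept;
    for (const Match& m : all)
    {
        if (!kept.empty() && kept.back().length == m.length)
            continue;  // already have the lowest offset at this length
        kept.push_back(m);
        if (kept.size() == maxCandidates)
            break;     // normally maxCandidates = 4
    }
    return kept;
}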

At 16 MB chunks (+16 overlap = 32 MB total window) it becomes necessary to consider more matches. (in fact there's almost no benefit in going from 8 MB to 16 MB chunks without increasing the number of matches).

I tried adding "thresholds" (requiring that some of the matches found be in certain offset windows), but it didn't help; that merits more investigation. Intuitively it seems to me that the optimal parser wants to be able to choose between some long high-offset matches and some shorter low-offset matches, so the question is how to provide it with a few good selections to consider. I think there's definitely some more win possible in my optimal parser, either by considering more matches or by having a better heuristic to choose which matches to consider.
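For concreteness, one way to read the "thresholds" idea is to reserve candidate slots per offset range so the low offsets can't be crowded out; the ranges and slot counts below are made up for illustration, and as noted above this particular scheme didn't help :

#include <algorithm>
#include <cstddef>
#include <vector>

struct Match { size_t length; size_t offset; };

// Reserve candidate slots per offset range: 1 slot for offsets <= 64K, 1 for
// offsets in (64K, 4M], and 2 for anything beyond that; 4 candidates total.
// (A real version would let unused slots spill over to the other buckets.)
std::vector<Match> SelectWithOffsetBuckets(std::vector<Match> all)
{
    const size_t bucketMaxOffset[3] = { 64 << 10, 4 << 20, ~(size_t)0 };
    const size_t slotsPerBucket[3]  = { 1, 1, 2 };

    // longest first; among equal lengths, lowest offset first
    std::sort(all.begin(), all.end(), [](const Match& a, const Match& b) {
        return (a.length != b.length) ? (a.length > b.length) : (a.offset < b.offset);
    });

    std::vector<Match> kept;
    for (int b = 0; b < 3; ++b)
    {
        size_t taken = 0;
        for (const Match& m : all)
        {
            if (m.offset > bucketMaxOffset[b]) continue;               // too far for this bucket
            if (b > 0 && m.offset <= bucketMaxOffset[b - 1]) continue; // belongs to a lower bucket
            kept.push_back(m);
            if (++taken == slotsPerBucket[b]) break;
        }
    }
    return kept;
}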
