10:04 am - Curious... very curious. But in a good way.
Small bit of trivia regarding the HQ codepaths... it turns out there are only 12 'distinct' transformations per corner of the scaled-up area.
And further testing shows that any given corner has over 80% of its possibilities being redundant, meaning I can reduce that 16k+1k table down quite far.
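The deduplication idea above can be sketched roughly like this: walk a pattern-to-transformation table, keep only the distinct entries, and remap each pattern to a small index. (Names and sizes here are illustrative, not the actual HQ tables.)

```c
/* Hypothetical sketch: collapse a pattern->transformation table down to
   its distinct entries, plus an index mapping each pattern to one of
   them.  'raw' stands in for the full table, which is far larger. */
int dedup_table(const unsigned char *raw, int n,
                unsigned char *distinct, unsigned char *index_of)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; j < count; j++)
            if (distinct[j] == raw[i])
                break;                      /* already seen this one */
        if (j == count)
            distinct[count++] = raw[i];     /* new distinct entry */
        index_of[i] = (unsigned char)j;     /* pattern -> small index */
    }
    return count;                           /* distinct transformations */
}
```

With >80% redundancy per corner, `count` comes out a small fraction of `n`, which is what makes the table shrink below worthwhile.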
Some more random numbers:
Remember that 16k table? Make it do the comparisons for each quadrant of the upscaled pixel separately, and suddenly we only need 176 bytes. We also lose the '1k' table entirely for translations, so we still have the same number of memory accesses (2), but with far, FAR less data accessed.
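The per-quadrant trick works because each output quadrant only depends on the few neighbours on its side, so its comparison pattern needs far fewer bits than the full 8-neighbour pattern. A minimal sketch, assuming the top-left quadrant looks at just the W, NW, and N neighbours (the exact neighbour set and 176-byte layout are the author's, not shown here):

```c
/* Stand-in for the real colour-similarity test, which would compare
   with a threshold rather than exact equality. */
static int similar(unsigned a, unsigned b)
{
    return a == b;
}

/* Top-left quadrant pattern: 3 comparison bits instead of 8, so the
   lookup table per quadrant is tiny. */
unsigned quad_pattern_tl(unsigned c, unsigned w, unsigned nw, unsigned n)
{
    return (unsigned)((similar(c, w)  << 0) |
                      (similar(c, nw) << 1) |
                      (similar(c, n)  << 2));
}
```

A 3-bit pattern indexes an 8-entry table per quadrant; four quadrants of small tables like this is how the footprint collapses from 16k into the low hundreds of bytes.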
In fact, that reduction will shrink the actual 'coefficient' tables as well. Down to 96 bytes per pixel of output. So all the data for an entire transformation will be reduced to 176+(96*X*Y) bytes total. Severe cache localization kicks in here, making the code scream.
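The quoted footprint is just arithmetic: one shared 176-byte comparison table plus 96 coefficient bytes per output pixel of an X-by-Y upscale.

```c
/* Total table data for an X-by-Y transformation, per the figures above:
   shared 176-byte comparison table + 96 coefficient bytes per output
   pixel. */
unsigned table_bytes(unsigned x, unsigned y)
{
    return 176u + 96u * x * y;
}
```

For 2x that is 560 bytes and for 3x it is 1040, comfortably inside L1 cache, which is where the "cache localization" win comes from.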
And what will these changes do to the actual code-size? Reduce it, actually. Replace the sub-function calls with the multi-pass calculation method and the code is shrunk noticeably. Replace the hard-coded transformations with compact coefficient-calculated methods, and the code is DRASTICALLY reduced...
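The "coefficient-calculated" replacement for hard-coded transformations can be pictured as one generic blend loop: each output pixel is a weighted mix of a few source pixels, with the weights read from the table instead of a giant switch of special cases. A sketch, assuming four contributing pixels and weights summing to 16 as a fixed-point denominator (both assumptions, not the author's actual layout):

```c
/* Generic coefficient-driven output pixel: weighted mix of four source
   pixels.  One loop like this replaces every hard-coded case, which is
   where the drastic code-size reduction comes from. */
unsigned blend4(const unsigned char px[4], const unsigned char w[4])
{
    unsigned acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (unsigned)px[i] * w[i];   /* weights assumed to sum to 16 */
    return acc >> 4;                     /* normalise back to 0..255 */
}
```

One function, fed different 96-byte weight sets per output pixel, versus dozens of unrolled special cases.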
Fitting the code+data (sans initialization components) in 8k is now a reasonable goal, surprisingly. The slowest part may well become the input-buffer conversion before the subsequent comparison and filtering passes.