Okay... there's a really nifty, quite wonderful image-scaling library for low-colour, low-res pixel artwork. Specifically, it's meant for re-rasterizing older video-game artwork to a higher resolution while keeping angles, lines, and what-not intact. It was originally named HQ3x, though it now supports 2x and 4x rescaling as well.
Here's where it gets stupid. First off, it requires being handed a 16bpp image in 565 format. Why? Because it converts to YUV colour-space for comparisons, so it can be separately sensitive to colour versus brightness changes when determining borders between areas on sprites; that's how it knows where to put the hard edges and where to blur. This is a good thing. So what's the bad?
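For reference, here's a minimal sketch of what that 565-to-YUV step looks like when computed on the fly instead of read from a table. The fixed-point BT.601-style coefficients are my assumption, not necessarily the exact constants the library uses:

```c
#include <stdint.h>

/* Sketch only: derive a YUV triple from a 16bpp 565 pixel with plain
   integer math.  Coefficients are a common BT.601 fixed-point
   approximation (scaled by 256); an arithmetic right shift is assumed
   for negative intermediates. */
static void rgb565_to_yuv(uint16_t p, int *y, int *u, int *v)
{
    /* expand the 5/6/5 fields to 8 bits by replicating the high bits */
    int r = (p >> 11) & 0x1F; r = (r << 3) | (r >> 2);
    int g = (p >>  5) & 0x3F; g = (g << 2) | (g >> 4);
    int b =  p        & 0x1F; b = (b << 3) | (b >> 2);

    *y = ( 77 * r + 150 * g +  29 * b) >> 8;          /* luma      */
    *u = ((-43 * r -  85 * g + 128 * b) >> 8) + 128;  /* blue diff */
    *v = ((128 * r - 107 * g -  21 * b) >> 8) + 128;  /* red diff  */
}
```

A handful of multiplies and shifts like this is the kind of work that competes with a single table fetch once the table stops fitting in cache.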
It uses a massively pre-calculated lookup table for this. And another one to convert the 16bpp graphics to 32bpp before actually blending! So it wastes 512k, HALF A MEG, of very randomly accessed memory. That means you can assume the majority of accesses to both arrays will be cache misses, each generating a stall in the CPU until the memory is read, limiting the code to the speed of the memory bus.
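That second table (65536 entries of 4 bytes, 256k on its own) is especially silly, because the expansion it caches is just a few shifts and ORs. A sketch of doing it inline, assuming 0x00RRGGBB output order:

```c
#include <stdint.h>

/* Sketch: expand a 565 pixel to 32bpp with shifts and masks instead of
   a 65536-entry lookup table.  A few ALU ops beat a likely cache miss. */
static uint32_t expand565(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1F;
    uint32_t g = (p >>  5) & 0x3F;
    uint32_t b =  p        & 0x1F;
    /* replicate the high bits into the low bits so 0x1F maps to 0xFF,
       not 0xF8, keeping full-range white actually white */
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;  /* 0x00RRGGBB */
}
```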
Modern CPUs run as much as 10x faster than the memory bus they're attached to, and only have around 32-64k of full-speed cache. While a lookup table used to be an always-win, on modern CPUs it's a HUGE loss, specifically because it's often faster to execute as many as 100 clock cycles of real work than to wait out one cache miss. That's around 100-150 instructions on those same modern CPUs, because they can run multiple bits of non-interacting code at once.
Yes, this is a complete reversal of earlier days. CPUs used to take 100 or more cycles to perform a divide. Now they take on the order of 5-9, in some cases even less, in part due to better design, in part due to fancier mathematical tricks. But gods... to see code specifically targeted at higher-end CPUs because it's CPU-intensive... and to realize it's set up in a way that specifically cripples modern CPUs... it explains why this code returns similar performance on everything from a P2 to a modern P4/Athlon. It's purely limited by how fast system RAM can be accessed.
Another bit of boneheadedness: it has a nicely-designed 256-entry 'function lookup table' that's based on a slightly sub-optimal pattern-calculation block I won't pick apart just yet. Over half of those sub-functions perform 2-4 further colour-comparisons, raising the total to 10-12 per pixel (of which there are only 4 unique comparisons possible), and introducing severe jump-prediction issues. Simply expanding the function-lookup table to 4096 entries allows the other 4 pattern-comparisons to be included in the single jump-table, letting them be merged with the other 8 and simplified down to 4 TOTAL comparisons per pixel, AND removing jump-prediction entirely from the inner-loop code.
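The dispatch idea looks roughly like this. This is my own sketch of the technique, not the library's actual code: each centre-vs-neighbour comparison packs into one bit of the table index, and widening the index by 4 bits folds the extra comparisons into the same single indirect jump:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: build a pattern index from a 3x3 neighbourhood (n[0..8],
   centre at n[4]).  8 centre-vs-neighbour bits give 0..255; folding in
   the 4 cross-neighbour comparisons (bits 8..11) gives 0..4095, so the
   handlers selected from the table need no data-dependent branches. */

static bool differs(uint32_t a, uint32_t b)
{
    /* placeholder for the real YUV threshold test */
    return a != b;
}

static unsigned pattern_index(const uint32_t n[9])
{
    static const int nb[8] = { 0, 1, 2, 3, 5, 6, 7, 8 };
    unsigned idx = 0;

    /* bits 0..7: centre vs its 8 neighbours */
    for (int i = 0; i < 8; i++)
        if (differs(n[4], n[nb[i]]))
            idx |= 1u << i;

    /* bits 8..11: the 4 comparisons otherwise re-tested per handler */
    if (differs(n[3], n[1])) idx |= 1u << 8;
    if (differs(n[1], n[5])) idx |= 1u << 9;
    if (differs(n[5], n[7])) idx |= 1u << 10;
    if (differs(n[7], n[3])) idx |= 1u << 11;

    return idx;  /* index into a 4096-entry function-pointer table */
}
```

One indirect jump through the table replaces a pile of unpredictable conditional branches inside the per-pattern handlers.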
Now what's with the ranting on jump prediction? Older CPUs simply assumed jumps would either pass or fail, and you programmed accordingly to the clock-count charts. But most of the time, it's complex to reverse a conditional jump without noticeably rearranging the surrounding code, which can break the instruction pairing that allows 2 or more instructions to be processed at once. This, again, breaks severely with earlier CPUs. Previously, you always assumed jumps would be taken; now, you don't need that assumption, you simply need the same jump location to go the same way the majority of the time. The problem is... once again, due to the layout of this function, the conditional jumps are essentially perfectly random, ruining jump prediction and causing more CPU stalls, on the order of 25-ish cycles for Athlons to as many as another 75-ish for P4s. That's, again, quite a few instructions.
So anything else this library does that's bad? Not really. It does a very obtuse per-pixel-per-situation macro definition block that doesn't really fix anything. But that's simply a coding-style complaint, and actually an understandable one. The macros are good; they just could be broken up into 2 parts instead of being kept as one contiguous unit, and simplified along the way.
But gah... so far I'm managing to reduce the memory load from 513k to 16k, AND to cut the code size and sub-function calls by over half. That's not counting the colour-blurring functions, which I'm rebuilding to handle full 32bpp input using bit-shifting instead of the bit-masked two-stage multiplication combinations the compiler can't optimize. The result will use the same 'block' of code with a single switch between three 4k-entry function-pointer tables to select 2x, 3x, or 4x scaling, instead of entirely separate functions with fully duplicated internal logic.
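To illustrate the bit-shifting style of blend I mean, here's a sketch, my own construction, not the library's code: the classic masked-average trick handles all four byte channels of a 32bpp pixel in one pass, with no per-channel multiplies, at the cost of floor rounding:

```c
#include <stdint.h>

/* Sketch: blend two 32bpp pixels with shifts and masks.  The XOR/AND
   form computes floor((a+b)/2) per byte channel with no carry bleeding
   between channels. */

/* 1:1 average of two pixels, all channels at once */
static uint32_t blend_1_1(uint32_t a, uint32_t b)
{
    return (((a ^ b) >> 1) & 0x7F7F7F7F) + (a & b);
}

/* roughly (3a + b) / 4, built from two halvings (floor rounding) */
static uint32_t blend_3_1(uint32_t a, uint32_t b)
{
    return blend_1_1(a, blend_1_1(a, b));
}
```

Ratios like 3:1 are typical of this family of scalers' interpolation steps; the exact weights the library uses vary by pattern.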
Would this be considered re-writing procedural code into an object-oriented format, or am I misunderstanding what object-oriented means? =^.^=