The expanses of WolfWings' land
scratched on the wall for all to see

June 21st, 2006
June 21st, 2006
June 21st, 2006
June 21st, 2006
June 21st, 2006

[User Picture]01:04 am - Memory versus Code optimizations...
As some of you may know, I'm into programming.

As more of you should know, I'm a huge fan of emulation.

I started work on this once before, trying to rebuild the HQ#x systems to use a dynamic set of look-up tables.

Problem is, before I attempted to do most of it by hand. This time, I sat down and made tools specifically to let me look at the code from a much higher vantage point. Specifically, I've only looked at the 2x and 4x modes as they possess a useful trait, especially the 2x mode. They calculate each quarter of their output effectively seperately, with the 'stock' code housing this in a massive switch statement like a poorly-structured lookup table and brute-force loop unrolling.

On the 2x case this isn't utterly horrible, as the code compiles down to around 30k or so. On the 4x case... well, running any filter where the 'inner loop' is over 100k long is going to be slow no matter what you do. And that's not even counting the 256k of lookup tables both versions use to do a 'fast' YUV comparison.

Right now, I'm admitedly focussing specifically on the 15-bit rendering zSNES uses, as that's the emulator I'm the most familiar with developing for, and the SNES games tend to be the 'sweet spot' that the HQ filter looks near-ideal on. NES games don't have enough gradiants to take advantage of the blending-detecting of the HQ filter, later consoles have tended to be higher-resolution or 3D to begin with, both cases where the HQ filter isn't as useful.

So, I've reduced the YUV 'difference detection' down to this:
unsigned int Diff(unsigned int c1, unsigned int c2) {
  unsigned int r, g, b;
  unsigned int Y, U, V;
  unsigned int t;
  r = ((c1 & (31 << 10)) - (c2 & (31 << 10)));
  g = ((c1 & (31 <<  5)) - (c2 & (31 <<  5))) <<  5;
  b = ((c1 &  31       ) - (c2 &  31       )) << 10;
  Y = ((24 << 10) + r + b + g)       > (24 << 11);
  U = (( 3 << 10) + r - b)           > ( 3 << 11);
  V = (( 6 << 10) - r - b + (g * 2)) > ( 6 << 11);
  t = (Y || U || V);
  return t;

This compiles down to 39(-Os/-O1) or 40(-O3/-O2) instructions when simulated as an inline through GCC with -mregparm=2 -fomit-frame-pointer, with no jumps in the code and only 3 conditional instructions spaced at least five instructions apart so the resulting code should in theory be highly pipeline-friendly.

This trades 256k of lookup tables and 18-21 inline instructions for 0k of lookup tables and 39-40 inline instructions. I believe this is a net win on older architectures primarilly due to the highly-limited cache sizes, and a win on newer architectures primarilly because of the longer parallel pipelines being able to run most of the code two or three instructions at a time so it ends up pipeline-equivilant and with reduced cache pollution.

Next, the core logic of the HQ function. Using some parsing tools I wrote in QBasic of all things, I was able to perform some visualizations on the massive switch statement, and I noticed something. The code wasn't using simplistic axial mirroring, but it was using rotational mirroring between the four pixels. And the majority of the comparisons were highly identical and could be handled with a three-deep jump tree which showed a high prediction rate, and usually only needed one comparison before short-circuiting to the pixel-calculation code. The final and most rarely-visited jump was the actual 'core' of the HQ system it turns out, and a highly chaotic and hand-tuned table of choices which I was able to re-arrange the bitmask to maintain a jump-location table for using 64 entries. And the rotation between pixels is also readilly done with two shifts and an or.

So, trading one 30k function of highly chaotic and jump-filled code with 25-75% prediction rates for one 4k function with a total of seven jumps, six of which show 90% prediction-rate and which are successively halving the search-space. The final jump being the 64-entry un-predictable jump table. There are also five conditional moves buried in the code at equivilant depths to the jump table to handle special cases that are buried in the otherwise very simplistic filter logic, though the conditional moves may be more effecient as full-blown jumps on newer architectures as they are FAR beyond 95% predictable, being nothing more than a bitmask and equivilancy comparison for between 1-out-of-8 and 1-out-of-128 chances.

I need to wait to get my laptop back before I can run a full set of benchmarks, but the code is if nothing else far more maintainable and easier to comprehend by most measures now than it was before. But overall I believe for HQ2x at least this will be a dramatic win due to the fact that the cache pressure is reduced from just under 300k of total code and data to under 8k of code and data, though with around twice the raw 'instruction count' per pass.

I still need to dig up an old 486 or a cycle-accurate 486 emulator to benchmark the compiled code against to prove or disprove my theory that this will be a win on that platform as well, and I still need to analyze the 4x code, but for now I'm happy and moderately content with the improvements to the 2x code. I just need to run some brute-force checks to verify I end up with identical results, then hopefully in the next month or so I'll push the code changes upstream to the zSNES tree.

EDIT: After some benchmarking on a Pentium 90 I found laying around, I'm re-building the code at all but the outermost, simple comparison sections to directly store the 'blend' amounts and calculate the final blend directly from those instead of using the mask-and-compare special-case situations. This will result in an additional 1k of data, but will reduce the code complexity further and remove ~1.5k as well for an overall savings.1 commentLeave a comment

Log in

No account? Create an account