Very low-level programming

In most computer programs where speed is critical, there are a small number of chunks of code in which the CPU spends most of its time - the innermost loops. For a fractal program, it will be the iteration function. For a high-precision number library, it will probably be the multiplication routine. For a video encoder, it might be the loop that searches for motion compensation vectors.

In each case, programmers will generally spend a lot of time optimizing these innermost loops for the most important CPU microarchitectures in order to squeeze out every last drop of performance where it really matters. This often involves hand-writing assembly code, modelling the execution pipelines of the target machine and poring over optimization documentation.

Doing this, programmers will often think things like "if only this CPU had an instruction that does X, which could be easily done in hardware, this routine could be so much faster". This is why every couple of years or so a new set of instructions is added to the x86 architecture: MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX. These instructions are getting more and more complicated with time, which makes the optimization process all the more difficult - one might have to look at dozens of ways of doing something to find the best one for your specific task.

All this makes me wonder if there might be a better way. With each new instruction set extension, new functional units are added along with new ways of hooking them up. Suppose that instead of programming these inner loops the usual way (deciding which registers represent which variables, choosing instructions and ordering them) we could instead set some state inside the CPU that directly connected the output of multiplier 1 to an the input of adder 2 (for example) in effect creating a special purpose machine for the duration of the inner loop. The wiring up probably takes longer than running the original code would, but once you've wired it up each iteration through the loop can run extremely quickly, since there's no single execution thread causing a bottleneck. It would be essentially like having an FPGA built into your CPU.

I can think of several reasons why this isn't currently done. One is that writing applications at such a "low level" creates a dependency between the software and the most intimate details of the hardware. Tomorrow's CPU might be able to run today's software more quickly by having some extra multiplication units (for example) available to each core. But if the applications are addressing the multiplication units directly, that won't help unless the software is rewritten to take advantage of them. One possible answer to this might be to create a programming interface that involves a very large number of virtual functional units and allow the CPU to decide how to map that on to the actual hardware in the optimal way (possibly turning parts into more traditional code if it there aren't sufficient functional units) - a sort of "JIT Verilog" if you will.

A second reason is that when you increase the amount of state on the CPU affiliated with a particular execution thread, you make context switching more difficult, because there's more registers that need to get swapped out to memory. FPGAs are generally not reprogrammed extremely frequently because they contain a lot of state and it takes time to reprogram them. This could also be solved by virtualizing functional units, or it could be solved by keeping track of whether the CPU was reprogrammed since the last time the thread run, and only reprogramming when necessary. That would solve the common case at the expense of speed when doing dumb things like running two different video encoders at the same time (something that is also likely to be rather sub-optimal with todays CPUs, since it isn't generally something that is optimized for).

What would be even cleverer is if the CPU could examine the code sequence and wire things up automatically to maximize the speed of that inner loop. To some extent, CPUs are starting to do such things. Modern CPUs keep a queue of decoded instructions and if an inner loop fits into this instruction queue, no decoding needs to be done while it is running. Cache hints are another example of a transparent feature designed to allow optimization when specific hardware details are known.

Leave a Reply