Archive for July, 2010

I am not (just) a strange loop

Sunday, July 4th, 2010

A while back, I read Douglas Hofstadter's book "I am a strange loop". As one might expect from Hofstadter, it's a fascinating book packed with good ideas. However, it happens that I disagree with some of them. Hofstadter believes that the human brain fundamentally works in an entirely mechanistic, deterministic manner and that all of the mysteries of consciousness can be explained in terms of symbols being triggered by other symbols in the brain. Our subjective "awareness" is, according to Hofstadter, just an illusion - a hallucination. I'm not convinced by this - the concept of a hallucination implies that there is something (someone) there to experience the hallucination. But if a hallucination has an experience, how can it be a hallucination? It's sort of like how the concept of "creating time" is meaningless, because the concept of creation implies a time before and a time after.

If "souls" (or whatever less loaded word you'd prefer to use to mean the part of us which has subjective experiences) are not made of particles or patterns of particles, how do they get distributed so that there's one per human brain? I think that's asking the wrong question, because it presupposes that souls are localized entities like particles. I think there are many other possibilities. Unfortunately because we have no way to do experiments on subjective experience, answering this question seems to be out of the reach of science (at least for the moment).

Floating-point faster than integers

Saturday, July 3rd, 2010

In an attempt to get a little extra speed in my fractal program, I decided to try rewriting the innermost loop to use 32-bit fixed-point numbers instead of double-precision floating-point numbers. The precision would be lower, but I needed to write some precision-switching code anyway. I had noticed that my CRT simulator got faster when I switched from floating-point to integer code, so I assumed (incorrectly, it turns out) that the same would be true in this program.

It turns out that the best way to multiply two signed 32-bit integers and then shift the 64-bit result right by 22 bits (the number of fractional bits I'm using) is to use the one-operand IMUL instruction followed by the SHRD instruction. Unfortunately, the only register you get to pick with this combination is one of the input registers - both the output and the other input are in EAX. This is a problem because it means that you have to load EAX right before every multiplication and stash the result somewhere else right after, before starting the next multiplication. All this shuffling slows the code right down - my 2GHz Core Duo laptop peaks at ~150 million iterations per second for the double-precision routine and ~100 million iterations per second for the integer routine. To make matters worse, you also lose the use of the EDX register (which is stomped by the IMUL) so even with frame pointer omission you're down to just 5 registers (ESI, EDI, EBP, ECX and EBX).
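In C, the operation in question is just a signed 32x32->64 multiply followed by a 22-bit shift (a 10.22 fixed-point format, matching the 22 fractional bits above). This is only a sketch - the function name is mine - but a compiler targeting 32-bit x86 will typically turn it into exactly the IMUL/SHRD pair described:

    #include <stdint.h>

    /* Multiply two 10.22 fixed-point numbers: a full signed 32x32->64
       multiply, then drop the bottom 22 fractional bits. */
    static int32_t fixedMultiply(int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * b) >> 22);
    }

A Mandelbrot-style inner loop needs three such multiplies per iteration (zr*zr, zi*zi and zr*zi), which is what makes all the EAX shuffling so costly.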

Another possibility is to use the MMX registers and the PMULUDQ instruction, but that has two problems: the multiplication is unsigned, and there's no 64-bit arithmetic right-shift in the MMX/SSE ISA, so it seems unlikely that it could be made faster than the IMUL version.

This makes me wonder if floating-point instructions would also be faster for other uses where integers have traditionally reigned supreme. Bignums for example. Instead of storing 32 bits of precision in each 32-bit register, you can store 52 bits of precision in the mantissa part of each 64-bit register. There is a little extra complexity involved since the floating point units aren't designed for this (and for multiplication you need to do four 26-bit*26-bit->52-bit multiplies instead of one 52-bit*52-bit->104-bit multiply). However, the extra bits might make this a win. Poking around at the gmp source suggests that they aren't currently using any such tricks for x86, though there is some FPU hackery on other architectures.
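As a sketch of the idea (and not how gmp does anything - the values and the check below are just for illustration): split each 52-bit number into 26-bit halves, so that every partial product, and the sum of the two middle ones, fits exactly in a double's 53-bit significand. A real bignum routine would carry-propagate these partial products into limbs; here they're just reassembled and checked against a native 128-bit multiply:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t a = 0xFEDCBA98765ULL;  /* arbitrary values below 2^52 */
        uint64_t b = 0x123456789ABULL;

        /* Split into 26-bit halves: a = aHi*2^26 + aLo, likewise b. */
        double aHi = (double)(a >> 26), aLo = (double)(a & 0x3FFFFFF);
        double bHi = (double)(b >> 26), bLo = (double)(b & 0x3FFFFFF);

        /* Four multiplies giving three exact partial products; the middle
           sum stays below 2^53, so it is exact too. */
        double p0 = aLo * bLo;              /* weight 2^0  */
        double p1 = aHi * bLo + aLo * bHi;  /* weight 2^26 */
        double p2 = aHi * bHi;              /* weight 2^52 */

        /* Reassemble and compare with a native 128-bit product
           (unsigned __int128 is a gcc/clang extension). */
        unsigned __int128 sum = (unsigned __int128)(uint64_t)p0
                              + ((unsigned __int128)(uint64_t)p1 << 26)
                              + ((unsigned __int128)(uint64_t)p2 << 52);
        unsigned __int128 ref = (unsigned __int128)a * b;
        puts(sum == ref ? "partial products agree" : "mismatch");
        return 0;
    }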

Very low-level programming

Friday, July 2nd, 2010

In most computer programs where speed is critical, there are a small number of chunks of code in which the CPU spends most of its time - the innermost loops. For a fractal program, it will be the iteration function. For a high-precision number library, it will probably be the multiplication routine. For a video encoder, it might be the loop that searches for motion compensation vectors.

In each case, programmers will generally spend a lot of time optimizing these innermost loops for the most important CPU microarchitectures in order to squeeze out every last drop of performance where it really matters. This often involves hand-writing assembly code, modelling the execution pipelines of the target machine and poring over optimization documentation.

Doing this, programmers will often think things like "if only this CPU had an instruction that does X, which could be easily done in hardware, this routine could be so much faster". This is why every couple of years or so a new set of instructions is added to the x86 architecture: MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX. These instructions are getting more and more complicated with time, which makes the optimization process all the more difficult - one might have to look at dozens of ways of doing something to find the best one for a specific task.

All this makes me wonder if there might be a better way. With each new instruction set extension, new functional units are added along with new ways of hooking them up. Suppose that instead of programming these inner loops the usual way (deciding which registers represent which variables, choosing instructions and ordering them) we could instead set some state inside the CPU that directly connected the output of multiplier 1 to the input of adder 2 (for example), in effect creating a special-purpose machine for the duration of the inner loop. The wiring up probably takes longer than running the original code would, but once you've wired it up, each iteration through the loop can run extremely quickly, since there's no single execution thread causing a bottleneck. It would be essentially like having an FPGA built into your CPU.

I can think of several reasons why this isn't currently done. One is that writing applications at such a "low level" creates a dependency between the software and the most intimate details of the hardware. Tomorrow's CPU might be able to run today's software more quickly by having some extra multiplication units (for example) available to each core. But if the applications are addressing the multiplication units directly, that won't help unless the software is rewritten to take advantage of them. One possible answer to this might be to create a programming interface that involves a very large number of virtual functional units and allow the CPU to decide how to map that onto the actual hardware in the optimal way (possibly turning parts into more traditional code if there aren't sufficient functional units) - a sort of "JIT Verilog" if you will.

A second reason is that when you increase the amount of state on the CPU affiliated with a particular execution thread, you make context switching more difficult, because there are more registers that need to get swapped out to memory. FPGAs are generally not reprogrammed very frequently because they contain a lot of state and it takes time to reprogram them. This could also be solved by virtualizing functional units, or by keeping track of whether the CPU was reprogrammed since the last time the thread ran, and only reprogramming when necessary. That would solve the common case at the expense of speed when doing dumb things like running two different video encoders at the same time (something that is also likely to be rather sub-optimal with today's CPUs, since it isn't a case that is generally optimized for).

What would be even cleverer is if the CPU could examine the code sequence and wire things up automatically to maximize the speed of that inner loop. To some extent, CPUs are starting to do such things. Modern CPUs keep a queue of decoded instructions and if an inner loop fits into this instruction queue, no decoding needs to be done while it is running. Cache hints are another example of a transparent feature designed to allow optimization when specific hardware details are known.

Random number generation on 8-bit AVR

Thursday, July 1st, 2010

For my physical tone matrix I wanted a mode where at the start (or end) of each cycle the machine would pick a random "on" light and turn it off, and a random "off" light and turn it on. With 5-10 lights on, this makes a fairly nice little bit of random music - continually changing so it doesn't get too repetitive.

To implement this I decided the best way was to pick a frame (the second one of the cycle turned out to be the most convenient) and decrement one of two variables, "randomOn" or "randomOff", depending on whether the current LED within the frame was on or off. If the variable reached zero the LED would be toggled. That doesn't take too many cycles. But in the sample before this frame, we need to generate random numbers and initialize randomOn and randomOff with them. randomOn needs to be in the range 0..lightsLit-1 (where lightsLit is the number of "on" LEDs) and randomOff needs to be in the range 0..(255-lightsLit). So we need to generate two random numbers in the range 0..n-1, where n is in the range 1..255, as quickly as possible. I wanted to generate these numbers in a single sample, since spreading the work across multiple samples would make the program much more complicated. That left me with about 300 cycles to generate two random numbers.
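A sketch of the toggling pass in C (collapsed into a single loop for clarity - on the real hardware the work is spread across the frame, a little per sample - and with the LED state held in a made-up 256-bit frame buffer):

    #include <stdint.h>

    static uint8_t frame[32];           /* 256 LEDs, one bit each */
    static uint8_t randomOn, randomOff; /* set from the RNG the sample before */

    static uint8_t ledOn(uint8_t i) { return (frame[i >> 3] >> (i & 7)) & 1; }
    static void ledToggle(uint8_t i) { frame[i >> 3] ^= 1 << (i & 7); }

    void randomToggle(void)
    {
        uint8_t i = 0;
        do {
            if (ledOn(i)) {
                if (randomOn-- == 0)   /* the randomOn'th lit LED */
                    ledToggle(i);
            } else {
                if (randomOff-- == 0)  /* the randomOff'th unlit LED */
                    ledToggle(i);
            }
        } while (++i != 0);            /* 8-bit counter wraps after LED 255 */
    }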

The usual way of generating a random number in the range 0..n-1 is to generate a random number in a much larger range (say 0..65535) and use the modulo operation ("remainder after division by n") to get it into the range 0..n-1.

Generating a 16-bit random number is done in a fairly standard way using a Linear Congruential Generator (I used the X = 214013*X + 2531011 variant because it cuts out a few multiplications). That by itself takes nine 8-bit by 8-bit multiplications and 56 cycles. I rewrote it myself in assembly language rather than using the built-in generator from avr-libc, because the latter is not very optimized (it uses a full 32-bit multiplication, which is not necessary).
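In C the generator is a one-liner (the assembly version just computes the low 32 bits of the product with the minimum number of 8x8 multiplies). Taking the high 16 bits of the state as the output is my assumption here - it's the usual choice, since the low-order bits of an LCG are the weakest:

    #include <stdint.h>

    static uint32_t state = 1;  /* any seed will do */

    uint16_t rand16(void)
    {
        state = state * 214013UL + 2531011UL;
        return (uint16_t)(state >> 16);  /* high bits are the most random */
    }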

If you use the 16-bit modulo function from libgcc it takes something like 215 cycles, which was too many. Even unrolled it's something like 144, which is still too many. I was about to embark on the work to spread this loop across several samples when I read this, which describes a way of turning a division by a constant into a multiplication by a constant. That's not very useful for arbitrary division, since computing the multiplier constant is more work than just doing the division. But we're not doing arbitrary division here - we know something about our divisor n: it is no larger than 255. So it becomes practical to precalculate a table of multipliers and look up an entry in this table at runtime.

The next problem is that that method only works when the division is exact (has remainder zero), which is no good for this application, since it's the remainder we're interested in. But the classic bit-twiddling reference book Hacker's Delight describes a variation which does work (for some divisors the dividend needs to be added back after the multiplication, and for some the result needs to be shifted right by a number of bits depending on the divisor). So our mod algorithm looks like this (a sketch for one particular divisor follows the list):

  1. Look up multiplier in table indexed by divisor
  2. Multiply dividend by multiplier
  3. Look up helper routine in table indexed by divisor (there are 18 variations - 9 possible shifts and either "add" or "don't add", but not all of them are used) and jump to it.
  4. Multiply by the divisor and subtract the result from the original dividend - the result is the remainder.
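Here's the shape of it in C for one fixed divisor, n = 10, which happens to be one of the divisors that needs no add-back step: 52429 = ceil(2^19/10), and (x*52429) >> 19 equals x/10 for every 16-bit x. The real routine does the same thing, but fetches the multiplier (step 1) and the shift/add variant (step 3) from tables indexed by n:

    #include <stdint.h>

    uint8_t mod10(uint16_t x)
    {
        uint32_t q = ((uint32_t)x * 52429UL) >> 19;  /* steps 1-3: q = x/10 */
        return (uint8_t)(x - q * 10);                /* step 4: remainder */
    }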

The resulting mod routine takes 40-53 cycles (depending on divisor) giving 96-109 cycles for the entire random number routine. With various other bits of overhead, the random routine takes 253 cycles on this "init" sample and up to 29 per sample on the first frame of the cycle.