I am not (just) a strange loop

July 4th, 2010

A while back, I read Douglas Hofstadter’s book “I am a strange loop”. As one might expect from Hofstadter, it’s a fascinating book packed with good ideas. However, it happens that I disagree with someone of them. Hofstadter believes that the human brain fundamentally works in an entirely mechanistic, deterministic manner and that all of the mysteries of consciousness can be explained in terms of symbols being triggered by other symbols in the brain. Our subjective “awareness” is, according to Hofstadter, just an illusion – a hallucination. I’m not convinced by this – the concept of a hallucination implies that there is something (someone) there to experience the hallucination. But if a hallucination has an experience, how can it be a hallucination? It’s sort of like how the concept of “creating time” is meaningless, because the concept of creation implies a time before and a time after.

If “souls” (or whatever less loaded word you’d prefer to use to mean the part of us which has subjective experiences) are not made of particles or patterns of particles, how do they get distributed so that there’s one per human brain? I think that’s asking the wrong question, because it presupposes that souls are localized entities like particles. I think there are many other possibilities. Unfortunately because we have no way to do experiments on subjective experience, answering this question seems to be out of the reach of science (at least for the moment).

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Floating-point faster than integers

July 3rd, 2010

In an attempt to get a little extra speed in my fractal program, I decided to try rewriting the innermost loop to use 32-bit fixed-point numbers instead of double-precision floating-point numbers. The precision would be less good but I need to write some precision-switching code anyway. I had noticed that my CRT simulator got faster when I switched from floating-point to integer code, so I assumed (incorrectly, it turns out) that the same would be true in this program.

It turns out that the best way to multiply two signed 32-bit integers and then shift the 64-bit result right by 22 bits (the number of fractional bits I’m using) is to use the one-operand IMUL instruction followed by the SHRD instruction. Unfortunately the only register you get to pick with this combination is one of the input registers – both the output and the other input are in EAX. This is a problem because it means that you have to load EAX right before every multiplication and stash the result somewhere else right after, before starting the next multiplication. All this shuffling slows the code right down – my 2GHz Core Duo laptop peaks at ~150 million iterations per second for the double-precision routine and ~100 million iterations per second for the integer routine. To make matters worse, you also lose the use of the EDX register (which is stomped by the IMUL) so even with frame pointer omission you’re down to just 5 registers (ESI, EDI, EBP, ECX and EBX).

Another possible way is to use the MMX registers and the PMULUDQ instruction, but that has two problems: the multiplication is unsigned and there’s no arithmetic right-shift in the MMX/SSE ISA so it seems unlikely that it could be made faster than the IMUL version.

This makes me wonder if floating-point instructions would also be faster for other uses where integers have traditionally reigned supreme. Bignums for example. Instead of storing 32 bits of precision in each 32-bit register, you can store 52 bits of precision in the mantissa part of each 64-bit register. There is a little extra complexity involved since the floating point units aren’t designed for this (and for multiplication you need to do four 26-bit*26-bit->52-bit multiplies instead of one 52-bit*52-bit->104-bit multiply). However, the extra bits might make this a win. Poking around at the gmp source suggests that they aren’t currently using any such tricks for x86, though there is some FPU hackery on other architectures.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Very low-level programming

July 2nd, 2010

In most computer programs where speed is critical, there are a small number of chunks of code in which the CPU spends most of its time – the innermost loops. For a fractal program, it will be the iteration function. For a high-precision number library, it will probably be the multiplication routine. For a video encoder, it might be the loop that searches for motion compensation vectors.

In each case, programmers will generally spend a lot of time optimizing these innermost loops for the most important CPU microarchitectures in order to squeeze out every last drop of performance where it really matters. This often involves hand-writing assembly code, modelling the execution pipelines of the target machine and poring over optimization documentation.

Doing this, programmers will often think things like “if only this CPU had an instruction that does X, which could be easily done in hardware, this routine could be so much faster”. This is why every couple of years or so a new set of instructions is added to the x86 architecture: MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX. These instructions are getting more and more complicated with time, which makes the optimization process all the more difficult – one might have to look at dozens of ways of doing something to find the best one for your specific task.

All this makes me wonder if there might be a better way. With each new instruction set extension, new functional units are added along with new ways of hooking them up. Suppose that instead of programming these inner loops the usual way (deciding which registers represent which variables, choosing instructions and ordering them) we could instead set some state inside the CPU that directly connected the output of multiplier 1 to an the input of adder 2 (for example) in effect creating a special purpose machine for the duration of the inner loop. The wiring up probably takes longer than running the original code would, but once you’ve wired it up each iteration through the loop can run extremely quickly, since there’s no single execution thread causing a bottleneck. It would be essentially like having an FPGA built into your CPU.

I can think of several reasons why this isn’t currently done. One is that writing applications at such a “low level” creates a dependency between the software and the most intimate details of the hardware. Tomorrow’s CPU might be able to run today’s software more quickly by having some extra multiplication units (for example) available to each core. But if the applications are addressing the multiplication units directly, that won’t help unless the software is rewritten to take advantage of them. One possible answer to this might be to create a programming interface that involves a very large number of virtual functional units and allow the CPU to decide how to map that on to the actual hardware in the optimal way (possibly turning parts into more traditional code if it there aren’t sufficient functional units) – a sort of “JIT Verilog” if you will.

A second reason is that when you increase the amount of state on the CPU affiliated with a particular execution thread, you make context switching more difficult, because there’s more registers that need to get swapped out to memory. FPGAs are generally not reprogrammed extremely frequently because they contain a lot of state and it takes time to reprogram them. This could also be solved by virtualizing functional units, or it could be solved by keeping track of whether the CPU was reprogrammed since the last time the thread run, and only reprogramming when necessary. That would solve the common case at the expense of speed when doing dumb things like running two different video encoders at the same time (something that is also likely to be rather sub-optimal with todays CPUs, since it isn’t generally something that is optimized for).

What would be even cleverer is if the CPU could examine the code sequence and wire things up automatically to maximize the speed of that inner loop. To some extent, CPUs are starting to do such things. Modern CPUs keep a queue of decoded instructions and if an inner loop fits into this instruction queue, no decoding needs to be done while it is running. Cache hints are another example of a transparent feature designed to allow optimization when specific hardware details are known.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Random number generation on 8-bit AVR

July 1st, 2010

For my physical tone matrix I wanted a mode where at the start (or end) of each cycle the machine would pick a random “on” light and turn it off, and a random “off” light and turn it on. With 5-10 lights on, this makes a fairly nice little bit of random music – continually changing so it doesn’t get too repetitive.

To implement this I decided the best way was to pick a frame (the second one of the cycle turned out to be the most convenient) and decrement one of two variables “randomOn” or “randomOff” depending on whether the current LED within the frame was on or off. If the variable reached zero the LED would be toggled. That doesn’t take too many cycles. But the sample before this frame we need to generate random numbers and initialize randomOn and randomOff with them. randomOn needs to be in the range 0..lightsLit-1 (where lightsLit is the number of “on” LEDs) and randomOff needs to be in the range 0..(255-lightsLit). So we need to generate two random numbers in the range 0..n-1 where n is in the range 1..255 as quickly as possible. I wanted to try to generate these numbers in a single sample as trying to spread the work across multiple samples makes the program much more complicated. That left me with about 300 cycles to generate two random numbers.

The usual way of generating a random number in the range 0..n-1 is to generate a random number in a much larger range (say 0..65535) and use the modulo operation (“remainder after division by n”) to get it into the range 0..n-1.

Generating a 16-bit random number is done in a fairly standard way using a Linear Congruential Generator (I used the X = 214013*X + 2531011 variant because it cuts out a few multiplications). That by itself takes nine 8-bit by 8-bit multiplications and 56 cycles. I rewrote it myself in assembly language rather than using the built-in generator from avr-libc, because the latter it not very optimized (it uses a full 32-bit multiplication which is not necessary).

If you use the 16-bit modulo function from libgcc it takes something like 215 cycles, which was too many. Even unrolled it’s something like 144, which is still too many. I was about to embark on the work to spread this loop across several samples when I read this which describes a way of turning a division by a constant into a multiplication by a constant. That’s not very useful for arbitrary division, since computing the multiplier constant is more work than just doing the division. But we’re not doing arbitrary division here – we know something about our divisor n – it is no larger than 255. So it becomes practical to precalculate a table of multipliers and look up an entry in this table at runtime.

The next problem is that that method only works when the division is exact (has remainder zero) which is no good for this application since it’s the remainder we’re interested in. But the classic bit-twiddling reference book Hacker’s Delight describes a variation which does work (for some divisors the dividend needs to be added back after the multiplication, and for some the result needs to be shifted right by a number of bits depending on the divisor). So our mod algorithm looks like this:

  1. Look up multiplier in table indexed by divisor
  2. Multiply dividend by multiplier
  3. Look up helper routine in table indexed by divisor (there are 18 variations – 9 possible shifts and either “add” or “don’t add”, but not all of them are used) and jump to it.
  4. Multiply by the divisor and subtract the result from the original dividend – the result is the remainder.

The resulting mod routine takes 40-53 cycles (depending on divisor) giving 96-109 cycles for the entire random number routine. With various other bits of overhead, the random routine takes 253 cycles on this “init” sample and up to 29 per sample on the first frame of the cycle.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Physical tone matrix screen construction details

June 30th, 2010

I think the most unusual thing about the physical tone matrix I posted about yesterday is the screen – I hadn’t seen one like it before I made it, and having made one I now know why – they’re very fiddly to make. I’d like to go into some more details about how it is made and how it works.

Step 1 is to cut out a 6″ square piece of 1/8″ thick acrylic. I bought a 12″x24″ piece from Delvie’s plastics (the cheapest shipped price I could find for such a small quantity), which is enough for 8 such screens (though I doubt I’ll make another 7 since threading the wire takes so long – I hope I’ll be able to make something else fun out of the rest, though). Cutting this stuff is really easy – just score it a few times with a knife and a straightedge where you want it to break, clamp it to a table so that the scored line runs along the edge of the table and then push down hard on the bit that protrudes over the edge of the table – it’ll break cleanly along the scored line. Leave the backing paper on for now.

Step 2 is to drill the holes. I deliberated a bit about the hole pattern to use. I originally thought of threading the wires just horizontally and vertically, so that the switch was formed at the place where the two wires “cross over” (but don’t quite touch) in the following picture (red is top, blue is bottom):

But then I decided I’d rather have the central point of the switch not have wire over it, so that the LED would be able to shine through at that point. I also realized that I’d be able to make the switches longer (and therefore easier to press) if the wires were arranged diagonally and fairly close to each other. Working backwards from where I wanted the wires to be on the top, I came up with this pattern:

This requires about 67 feet of wire altogether – because of the zig-zagging, each horizontal and vertical wire is about 15″ on the matrix itself, with 2″ spare on the top, left and right sides and 14″ spare on the bottom to connect to the PCB. Use 22AWG solid-core hook-up wire – this should work nicely.

Here is a page showing where all the holes need to be drilled. Print this out and glue it on to one side of the acrylic sheet. Use a drill press to drill all 1,024 holes. I used this which is the cheapest one I could find. It’s a bit flimsy but good enough for the purpose (as well as for PCBs, which is what it’s meant for). It doesn’t quite have enough clearance to reach the center holes if you put it together in the obvious way, but it does come with an attachment which lets you move the drill further away from the metal bar, if you jam a piece of plastic or something in the gap. I used these bits which seemed to work fine. The positioning doesn’t have to be perfect but the closer the better. I think I used a number 68 drill bit or thereabouts – make sure your wire fits through the first hole you drill before drilling the rest. The plastic swarf from the drill will accumulate under the glued paper a bit but that doesn’t really matter.

While you’ve got the drill press handy, make a hole of ~2mm diameter in each corner for the screws to hold it onto the PCB. The way I built it, the switch matrix is screwed to the PCB using 1-1/2″ long screws and then the PCB is screwed to the bottom box, so the entire thing is rigid (the top of the box also screwed to the bottom of the box – only these screws need to be removed to change the batteries). Choose the positions of these holes carefully, since you will need to make holes in the same position in the PCB.

Step 3 is to remove the backing paper and sand the acrylic on the bottom side using a piece of sandpaper. Shine an LED through it to make sure it’s diffusing properly. If you don’t sand it the screen will have a very narrow viewing angle (depending on the LEDs used – cheap high brightness ones tend to have a very narrow viewing angle though) and when you can see them they will dazzle you. I think I used a #100 sandpaper or thereabouts – I don’t think it matters much but try on a scrap piece first if you’re worried. An orbital sander will probably get you a more homogeneous finish, but I just did it by hand (you can see the swirl patterns I made if you look at the screen very closely).

Step 4 is to cut and strip the wire. Cut 16 lengths of wire 19″ long and 16 lengths of wire 31″ long. Avoid kinking/bending the wire at this stage, to the extent you can. Use wire strippers to strip of all but 2″ of insulation from the 19″ lengths and all but 14″ of insulation from the 31″ lengths. You’ll need to do this about 6″ at a time. You might need to grip the 2″ piece of insulation with pliers when stripping the last bit, otherwise you’ll remove the wrong bit. Keep the pieces of insulation for step 6.

Step 5 is to thread the wire. Take each piece of wire in turn and thread it into the acrylic, starting at the bottom (31″ sections) and the right (19″ sections). Push the wire in so that the remaining insulation is right up against the bottom of the acrylic. Follow the pattern carefully – the top wire of each switch goes horizontally and the bottom goes vertically. Bear in mind that the pattern alternates each row, so if you start with the top one in the first row, you’ll start with the bottom the second. The PCB has holes for soldering the top and left sides directly underneath the switch matrix, so make sure you pick the alternating pattern that gets it to line up. Make sure none of the wires touch each other – you can always pull them apart slightly with needlenose pliers if they do.

There is a bit of a knack to getting the wires flat and tight with no kinks. Here is how I did it. Suppose you have one segment threaded on the bottom and you’re doing the next one on the top side.
a) Thread the loose end of the wire through the next hole. Pull it through. As you are doing so, make sure the wire is in the plane that is perpendicular to the acrylic sheet and that passes through the two holes. If the wire starts to twist over, adjust it so that it is back in this plane. If you don’t do this, you’ll get a kink in the wire when you pull it tight, which makes it difficult to get it to go where you want it to.
b) get a flat piece of metal (like the side of a pair of pliers) and push against the threaded segment on the bottom. This will prevent the bottom segment from being pushed up in the next step.
c) get another smaller flat piece of metal (like the end of another pair of pliers) and push against the newly threaded segment on the top. Start pushing at the “old” end (bending the wire into a right angle) and work your way along to the “new” end until it’s totally flat against the acrylic sheet. If you don’t do this there will be slack in the wire which will cause the switches to move when you touch them.

It gets more difficult once you get more of the wires in place, because you’ve got to navigate around the protruding ends with your flat edges. It might make it easier to do the outermost wires first and then work in towards the middle, but I didn’t think that far ahead.

Once I got into practice, it took me about 20 minutes to do each wire, so about 11 hours of threading altogether (this is why I’m not planning to make any more). It’s not too bad if you do a couple a day for a couple of weeks – one can carry on a conversation or watch TV at the same time.

Step 6 is to re-insulate the uninsulated loose ends. Cut 1″ sections of insulation from the leftovers from step 4 and push them back onto the ends of the wire. If there are any bends or kinks in these parts of the wires you’ll have to straighten them out with pliers first. Twist the insulation slightly as you push it back on and it’ll go on more easily. This will help avoid short circuits when the circuit is partially assembled and you’re testing it.

The switch matrix works by strobing each row and then testing each column. The rows are connected to the output of the “row” 74HC595 shift registers. They are connected via 10K resistors so that if something metal touches the switches they won’t cause short out anything. The “active” row (which changes about 1,000 times per second) is brought to logic high (+5V), the others to logic low.

The switch columns are connected (again via 10K resistors) to 74HC4051 analogue multiplexers so that the microcontroller can specify (with three output bits) which of the 8 columns in each half of the matrix it wants to read. This column selection changes 15,625 times per second. The outputs of the two 4051s are connected to Darlington pairs which amplify the signal (which is very weak after it’s passed through a finger) and pass it on to two input pins of the microcontroller (one for the left half, one for the right). Immediately after reading these pins, the microcontroller changes the multiplexer selection bits to maximize the time the circuit has to settle into the correct state for the next switch.

The Darlington pair bases are each connected to ground through a capacitor – they won’t register as “on” until this capacitor charges up. The larger the capacitor the less sensitive the switches. If the capacitor is too small, you’ll get “false on” readings from switches that you aren’t actually touching (if the effect could be controlled this might make an interesting proximity sensor sort of input device but it’s probably too finicky). If the capacitor is too large then you’ll have to press the switches really hard or have damp fingers to for the touch to register.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Physical tone matrix

June 29th, 2010

Inspired by Andre’ Michelle’s Tone Matrix flash applet I decided to make a physical version – i.e. a box with buttons, flashing lights and knobs to turn. I wanted 16×16 – the Bliptronic 5000 just doesn’t have enough lights. All the existing 16×16 equivalents were more money than I wanted to pay: A Tenori-on costs something like $1000. Four Bliptronic 5000s cost $200. A monome two fifty six costs $1400. I was thinking more along the lines of $50. Besides which, I thought to myself, it would be fun to make it myself.

I decided to base it around the ATmega328 microcontroller used in the Arduino, after reading how easy it is to get started hardware hacking with the Arduino. Yes, I know a Cortex M0 is a cheaper and more powerful chip, but I didn’t know about these until I’d already started with the ATmega328. Also, they’re more difficult to solder.

Because of my assembly programming roots, I wanted to push the CPU to its limits. Sketching out the core audio and video assembly routines, I realized that with the Arduino’s 16MHz clock rate, I could do 16 channel PWM audio at a sample rate of 15.625KHz (i.e. a sample every 1024 cycles) with a 256-element 8-bit waveform and arbitrary volumes and frequencies for each channel. At the same time I could control a 16×16 LED matrix with a refresh rate of 61Hz (line rate of 976Hz) with each individual LED having an independent brightness (duty cycle of 0-16 sample periods per frame, i.e. up to 1/16 in 1/256 increments) using 32 bits of shift registers and the SPI lines (at maximum speed it takes a significant number cycles to output so many bits, so this is interleaved with other code). Unfortunately (because of the large gamma of LEDs) only a few of these are distinguishable. With all of this going on in the background (written in heavily optimized assembly language) I still had enough spare CPU cycles to do some interesting foreground things.

The next problem was how to make the switch matrix and LED matrix. I thought about buying 4 Bliptronic 5000s and tearing them apart, or using sixteen Sparkfun 4×4 button pads with PCBs, but both of those options cost more than I wanted to pay. The cheapest high-brightness LEDs I could find cost $0.06 in quantity from Mouser, giving a screen cost of $15.36 – much more like it. (Even cheaper prices are possible in greater volumes from Transistor Parts Wholesale). I think this is the cheapest way of making a 6″x6″ screen.

I had trouble thinking of a way to set up 256 switches over the LEDs without obscuring them or spending too much, until I remembered an effect I had noticed messing about with transistors years ago – you can make a touch switch out of a couple of transistors arranged in a Darlington configuration. I could arrange them in a matrix like the LEDs so that the only per-switch cost was a couple of small sections of wire. This could be threaded through an acrylic sheet which would serve a dual purpose – to diffuse the LEDs and to hold the switch wires. I originally thought I would have 4 solder connections to the PCB per switch, but then I realized that that would be too difficult to solder and that it would be better to just connect my switch matrix at the edges.

The row strobes used for the switches are the same as the ones used for the LEDs, and the column strobes are accomplished with a couple of 8-bit analogue multiplexers, so that only two Darlington pairs are needed for the entire matrix (the final thing includes a third for a “menu” switch). One tricky thing here turns out to be the capacitance – since the finger resistance is about 10 megaohms and we move on to the next switch 15,625 times per second, we need a capacitance of no more than 6pF, which one gets from having just a few centimeters of wire in close proximity. Fortunately that’s just for between the multiplexer and the Darlington pair – the switch matrix itself only changes configuration 976 times per second so we can get away with a larger capacitance. Even so, I think this is at about the limit of practical resolution for such a matrix. It proved necessary to put a capacitor between the base of the Darlington pair and ground to counteract the admittance of the switch matrix and reduce its sensitivity.

I used a double sided circuit board (in order that I could have LED row wires on one side and column wires on the other), but I think if I were doing it again I’d use a single sided board and just connect the columns by soldering the LED anodes to straight pieces of wire laid across the top of the board. Between aligning the two sides correctly, doing very fiddly soldering under the components on the top side, making lots of vias and not being able to test most of the board until almost everything was soldered (due to some of the component legs acting as vias) it was more trouble than it was worth.

For debugging purposes, I made a connector so that the device could be connected via an Arduino and a USB port to a computer. There is essentially a “non stop” debug interface built into the program – as it’s running, one can send commands over the ATmega’s UART to peek and poke memory.

The sound quality isn’t great at the moment – it was okay on the breadboard but the breadboarded version of the circuit had a “screen” of only 4 LEDs. With 256 LEDs the ripples on the power supply are much bigger and there’s a lot of noise on the speaker when lots of the pixels are lit. I did have some decoupling capacitors in the circuit but I drastically underestimated the amount of capacitance I would need – I’ll replace the capacitors with larger ones after my next Mouser order.

The final design has 4 potentiometers: tuning, volume, decay/sustain and tempo.

The software I wrote for it has lots of features:

  • Random mode: after each cycle through the pattern, extinguish one LED at random and light another at random. This keeps the pattern varying.
  • Game of Life mode: after each cycle through the pattern, transform the pattern according to Conway’s rules.
  • Various waveforms: choose from sine wave, square wave, triangle wave (which unfortunately sounds indistinguishable from the sine wave) or two different types of random noise. There are also a couple of different waveform editors so you can make up your own.
  • Tuning editor: the default scale is pentatonic but you can change it to use whichever frequencies you like.
  • Overrides for decay, tempo and tuning so they can be set via either digital or analogue controls.
  • Microtone mode: sets the matrix up as a 256-key keyboard spanning 7.5 octaves with a 34-TET tuning. LEDs corresponding to the notes of the C major scale are lit (which makes a pretty pattern). Because of the way the switch matrix works, only chording within a row or a column is possible without introducing spurious notes.
  • Saving and loading patterns, waveforms, tunings and other settings to/from EEPROM. Unfortunately there is a bug in the software at the moment which causes it to crash after saving.
  • Multi-pattern mode: loads a new pattern from EEPROM each time the current one finishes.
  • Sync in and sync out sockets: I believe these should be compatible with those on the Bliptronic 5000, but I don’t have one to try it out with.
  • Ability to have less than 16 beats before repeat (for making rhythms with different time signatures).
  • Just for fun, a red/green/blue LED triplet. This displays a hue corresponding to the beat currently being played within the pattern. It seems totally frivolous, but was quite handy for debugging this problem when the program was so broken that the serial code didn’t even work.

There are a few more that I’ve thought of but haven’t implemented yet. Almost half the flash is unused currently – enough to add a few games as well.

The machine runs great on 4 AA batteries (6.52V according to my multimeter) or from a 5V supply. I think the ICs are rated up to 15V or so, but the LED current limiting resistors would need to be increased to use a higher voltage.

One tricky thing about making this is threading all the wire through the acrylic sheet – I used fairly thick wire (22 AWG) for strength so I had to be careful to avoid kinks and make sure the wire was tight each time. Using thinner wire would be easier but flimsier. I imagine that it could be mass produced more easily using techniques similar to making a double sided PCB (i.e. using tracks and through-hole plated vias instead of a solid piece of wire).

In the end the parts cost for this adds up to $64.83 (not counting time, broken tools and supplies like solder, toner, paper, acetone, steel wool, glue and ferric chloride), though not all of the parts I actually used were bought new (I salvaged the speaker and a capacitor from an old alarm clock, for example). It could probably be mass produced for significantly less (particularly the case I imagine).

The schematics, source code, PCB layout and parts list are available here. If you make a copy or derivative I’d love to hear about it.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Volatile registers in GCC

June 14th, 2010

When writing microcontroller code in assembler, it’s nice to be able to keep some commonly used variables in registers. The 8-bit AVR devices in particular have a nice big set of registers, most of which are rarely used in compiled code. Usually it’s a waste of time to write all the code in assembler, though – it’s much better to write the non time-critical bits in C, compile them with GCC and then link with the assembly code bits.

When accessing the register variables from the C code, the natural thing to do is just to make a global register variable:

register uint8_t frame __asm__ ("r3");

That has two effects – it allows you do access the variable as if it were a normal variable:

void initFrame()
{
    frame = 0;
}

And it prevents the compiler from using r3 for something else (though one also has to be careful that the register isn’t used by any other linked in libraries, including the C library and the compiler support functions in libgcc).

The trouble comes when you try to read those register variables. If optimizations are turned on, then the following code might just be an infinite loop:

void waitForFrame(int frameToWaitFor)
{
    while (frame != frametoWaitFor);
}

The compiler hoists the read of frame outside the loop, and never sees the updates. If frame was a normal variable we could fix this just by adding volatile, but using volatile with a register variable doesn’t work. This seems odd until we think about what volatile actually means. A read from or write to a volatile variable is considered an observable effect (like doing IO) so the compiler won’t optimize it away. But the compiler has no concept of a “read from” or “write to” a register – registers are just used or not used and the optimizations around them are unaffected by the notion of volatile.

There is a reasonably easy and not-too-invasive way to fix this, though, through the use of inline assembly. If you write:

#define getFrame() ({ \
    __asm__ volatile ("" : "=r"(frame)); \
    frame; \
})

void waitForFrame(int frameToWaitFor)
{
    while (getFrame() != frametoWaitFor);
}

The compiler will treat the "" as a block of assembly code which writes to r3 and which has unknown side effects (so that it can’t be hoisted out of a loop for example). The code doesn’t actually do anything (so the generated code won’t be adversely affected) but it essentially provides a read barrier to the register. Unfortunately you can’t use getFrame() to write back to frame, so to increment it for example you have to do frame = getFrame() + 1; but that’s actually kind of helpful because it makes the possibility of a race condition (for example by an interrupt routine also incrementing frame at the same time) more obvious.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Improved GPU usage for fractal plotter

November 1st, 2009

I’ve been tinkering with my fractal plotter again. One thing that annoyed me about it was the pauses when you zoomed in or out past a power of 2. I thought this was due to matrix operations until I did some profiling and discovered that it was actually dilation (both doubling and halving) of the “tiles” of graphical data to which squares are plotted and which themselves are painted to the screen.

This is work that can quite easily be done on the GPU, without even having to resort to pixel shaders, by using the ability to render to a texture. Here is the result and here is the source. In order to do this I moved all the tiles to video memory (default pool instead of managed pool) and used ColorFill() to actually plot blocks instead of locking and writing directly to textures. All this adds up to much more CPU time available for fractal iterations.

Another change is that instead of an array of grids of tiles, I’ve switched to using a grid of “towers” each of which is itself a tile and can point to 4 other towers. This simplifies the code somewhat.

There is still some glitchiness when zooming but it is much less noticable now.

This reminds me of something I meant to write about here. When I originally converted my fractal program to use Direct3D, I figured that locking and unlocking textures was probably an expensive operation so rather than locking and unlocking every time I needed to plot a square, I kept them all locked most of the time and just unlocked them to paint. However, it turns out that this “optimization” was actually a terrible pessimization – now all the tiles were dirtied each frame and had to be copied from system memory to video memory for each paint, and because of the locking nothing else could happen during that time. I was able to get a big speed up by locking and unlocking around each plot operation – that caused only the parts of tiles that were actually plotted on to be dirtied. It just goes to show that when optimizing you do have to be careful to actually measure performance and see where the slow bits really are.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Wireless mice

October 31st, 2009

This is fascinating. I wonder how long it will be before people are implanting tiny devices into mouse brains that receive commands from the internet via the cellular networks and transmit video and audio back, so that the mice can be driven around by remote control and used to spy on people and things.

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark

Fall foliage

October 30th, 2009

Inspired by this XKCD comic:

I made this:

And this, going in the opposite direction:

  • Reddit
  • Digg
  • Facebook
  • StumbleUpon
  • Twitter
  • Delicious
  • Share/Bookmark