Archive for August, 2011

Blog updates

Thursday, August 11th, 2011

I've been doing a little spring-cleaning of the old blog now that I've started writing in it again. I installed the WP-Syntax and WP LaTeX plugins and went back through my old posts to prettify the code blocks and equations. I think I've got most of them (with the exception of a few bits of code that are in languages that I made up, so the plugin doesn't understand them, and a few odd ones here and there).

There are so many LaTeX and syntax highlighting plugins available for WordPress, though - I have no idea if the ones I'm using are the best. I ended up just using the "most popular" search on the WordPress plugins page and picked the first results for "syntax" and "latex" that looked like they would do what I wanted, on the theory that the most popular ones would be the best. They seem to work well enough but I do worry a bit that at some point I will find they're missing some important feature and I'll have to switch plugins and go back and update all my old posts again.

I also went back and fixed any typos and mistakes in equations that I noticed while doing the updates, and updated with new developments such as the mention of the Eclipse of Mars illusion in The undiscovered colour and the 75 ohm load in CRTC emulation for MESS.

I also want to change my theme (Default is okay but I think it belies how much work I put into this thing). I'll probably change it to Twenty Eleven with a custom header unless someone with better design taste than me has a better suggestion. I need to find a better way to do centering of large images as well. The current way still mostly works but occasionally collides with the links over on the right, and there are a lot of pictures without the correct stylesheet class. I also want to fix the unordered list style - I'd much prefer nice round bullets to those chevrons.

To make the banner I wanted to take a big pile of pictures that I had made especially for blog entries and just sort of stick them all together at random in a big jumbled collage with some nice blending between them. I thought it would be easy to find software to do this but there's about a million "collage maker" programs, almost all of them "trial-ware" and all of them completely useless for the purpose. It seems that to do a jumbled collage in any of them I'd have to individually add, place and rotate each picture. I found one with a wizard that did an automatic layout with blending, but you can't change the blend radius to anything other than "hardly noticeable at all", the pictures are forced into a grid layout and the program is buggy anyway (and won't let you save in the trial version, but doesn't seem to mind you taking a screenshot of the finished result).

So I think I'm going to write my own automatic collage maker. Shouldn't be too hard, and that way it'll be easy to update the header as I add new images.

One other thought that came to mind as I was going through all my old posts - this blog used to be a lot more picture heavy than it is at the moment. I have a big backlog of pictures (over 2 years worth) that I do eventually want to put up here - I'll probably do a month a day at some point. Most of our pictures these days are taken by Gennie, though, and she tends to put them up on her flickr. There will be some overlap with that, though I don't think I'll be including all the food pictures here (I know I have in the past but that was before Gennie started using flickr).

PIC12 architecture review

Wednesday, August 10th, 2011

There are probably very few ways of getting to know an architecture better than trying to optimize the hell out of a piece of assembly code written for it, so I think I know PIC12 quite well by now. It's a cute little architecture - a very small number of instructions (though it could be smaller still - why are TRIS and OPTION instructions rather than memory locations?). Most operations work on the W register and/or a memory location, so it's either like having a single register or 32 registers (depending on whether you consider the memory locations registers or not).

As you might expect, the W register does become a bit of a bottleneck on occasion (though not completely - you can set, clear and test individual bits of any memory location without affecting W, amongst a few other things). The upside is that there are so few instructions that it's easy to commit the architecture to memory.

It's a bit unfortunate that returning from a function always sets W to a constant - I very rarely found use for this feature and it seemed to get in the way fairly often. At the very least a RET variant which doesn't set W would have been helpful.

Most instructions take 1 cycle (or 2 for jumps), just as with the AVR8 architecture; however, one of these "cycles" is actually 4 clock cycles. This makes me suspect that "1 cycle operation" is mostly a marketing feature - they might have been able to make the instructions 2-5 clock cycles, but making them 4 or 8 (while slower) makes cycle counting easier. It may also simplify the internal circuitry, and it may improve code density (not so many NOPs needed to make different code paths take the same time).

It may be that the 8-bit AVR devices do something similar but also have an internal frequency multiplier so that they really do 1 cycle operation. I suspect that "multiply frequency by 4" circuitry is much more complicated and finicky than "divide frequency by 4" circuitry.

The architecture has call and return instructions, but it's prudent to avoid them in the most time-critical code, since they cost 4 cycles (2 for the call, 2 for the return). Also, the stack depth is very limited (just two return addresses) so often a continuation-stashing style can be useful (especially since an indirect jump to W is just a single cycle).
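
To illustrate the continuation-stashing idea, here's a rough Python sketch (all names are made up for illustration). On the PIC12 the "stashed continuation" would live in W or a memory location and the dispatch would be an indirect jump; here a variable and a trampoline loop stand in for them:

```python
# Continuation-stashing sketch: instead of using the (two-deep) call/return
# stack, each caller stashes where execution should resume in a variable
# (standing in for the PIC12's W register), and the shared routine ends
# with an indirect jump to that continuation.

def make_machine():
    state = {"cont": None, "out": []}

    def shared_routine():
        state["out"].append("work")   # do the shared work...
        return state["cont"]          # ...then "jump to W"

    def caller_a():
        state["cont"] = after_a       # stash the continuation
        return shared_routine
    def after_a():
        state["out"].append("back in A")
        return None                   # halt

    def caller_b():
        state["cont"] = after_b
        return shared_routine
    def after_b():
        state["out"].append("back in B")
        return None

    def run(entry):
        # Trampoline: follow jump targets until the program halts
        step = entry
        while step is not None:
            step = step()
        return state["out"]

    return run, caller_a, caller_b

run, caller_a, caller_b = make_machine()
```

Calling `run(caller_a)` yields `["work", "back in A"]` - the shared routine returns to the right place without ever consuming a hardware stack slot.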

A PIC12 programmer from an Arduino

Tuesday, August 9th, 2011

In order to program the microcontrollers for interval bars, I needed a programmer. This is a device which connects to the chip and to a computer, and which allows one to transfer a program from the computer to the chip. You can buy them for $40 or so but I decided to make my own - since I already have an Arduino to act as a low level computer interface, I could make it very cheaply. The only complicated bit is the 13V power supply - I used this little boost converter circuit.

The Arduino program is fairly straightforward - it just reads a hex file over the serial line, checks the checksums and converts it to the sequence of signals according to the PIC12's programming specifications. It has facilities for reading back the microcontroller's contents and passing them back to the computer as well, and also a way of calibrating the clock rate. I added this not because I accidentally overwrote the OSCCAL and backup OSCCAL values (well, I did, but I had already read them at that point so I knew what they were) but because I wanted to know what the effect of the OSCCAL value was and whether the factory-programmed one was optimal.
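
The checksum check is the standard Intel HEX rule: all the bytes in a record (length, address, type, data and the checksum itself) must sum to zero modulo 256. A minimal sketch of that check (the record strings are standard Intel HEX examples, not from my files):

```python
# Verify an Intel HEX record: the byte count, address, record type, data
# and checksum bytes must sum to 0 mod 256.

def verify_record(record):
    if not record.startswith(":"):
        return False
    data = bytes.fromhex(record[1:])
    return sum(data) % 256 == 0

print(verify_record(":0300300002337A1E"))  # True: valid data record
print(verify_record(":00000001FF"))        # True: end-of-file record
print(verify_record(":0300300002337A1F"))  # False: corrupted checksum
```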

I discovered that it wasn't quite the best (the preprogrammed value of 0x24 gave a within-spec clock speed of 996.5kHz but 0x26 gave a clock speed of 1002.0kHz). I also discovered that (with the one part I tested) the low bit of the OSCCAL was ignored, and that the top 7 bits, interpreted as a signed quantity between -64 and 63, gave an extremely linear relationship to frequencies between 587.1kHz and 1230.1kHz with a resolution of about 5kHz. The frequency stability was also very good - a standard deviation of only about 30Hz.
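
Fitting a straight line through just the two endpoints above gives a rough model of the OSCCAL response (this is derived only from the figures quoted in this post for one part, not from any datasheet):

```python
# Rough linear model of the measured OSCCAL response, using only the
# endpoints quoted above (587.1 kHz at -64, 1230.1 kHz at +63). The
# slope works out to about 5.06 kHz per step, matching the ~5 kHz
# resolution observed.

F_LOW, F_HIGH = 587.1, 1230.1            # kHz at OSCCAL = -64 and +63
SLOPE = (F_HIGH - F_LOW) / (63 - (-64))  # kHz per OSCCAL step

def predicted_khz(osccal):
    # osccal: top 7 bits of the register as a signed value, -64..63
    return F_LOW + (osccal - (-64)) * SLOPE

print(round(SLOPE, 2))              # kHz per step
print(round(predicted_khz(0), 1))   # mid-range estimate
```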

Here's the schematic for the programmer:

To build the programmer, edit the second line of build_programmer.bat to point to your Arduino installation (if it's somewhere other than C:\Program Files\Arduino) and run it. To upload it to the Arduino you'll need WinAVR: modify upload.bat to point to your WinAVR installation and run it. We can't use the avrdude from the Arduino installation here because it won't toggle the reset line (that's normally done by the Arduino software).

To use the programmer, connect to it with a serial monitor (like the one in the Arduino IDE). There are several one-character commands:

  • W - Write program from Arduino to PIC.
  • O - Write program from Arduino to PIC, including OSCCAL value (necessary if write failed, or if we want to use a different OSCCAL value).
  • P - Write program from Arduino to PIC, including OSCCAL value and backup OSCCAL value - only use if the backup OSCCAL value got screwed up somehow.
  • R - Read program from PIC to Arduino.
  • U - Upload program from PC to Arduino (follow this with the contents of your hex file).
  • L - Download program from Arduino to PC.
  • T - Start timing mode.
  • Q - Stop timing mode.
  • A - Perform automatic calibration.
  • B - Output a nybble to port B (for debugging purposes).

The returned status codes are as follows:

  • K - Success.
  • EF - Error: config fuses didn't verify.
  • EV - Error: Program, OSCCAL, user IDs or backup OSCCAL didn't verify.
  • ER - Address in hex file out of range.
  • EU - Unknown record type in hex file.
  • EZ - Non-zero byte count in end-of-file marker in hex file.
  • EC - Hex file checksum failure.
  • EH - Incorrect hex digit.
  • E: - Colon expected but not received.

In automatic calibration mode, the Arduino also sends calibration information as three numbers per line: OSCCAL value (3 digits), frequency in Hz (7 digits) and standard deviation of frequency in Hz (5 digits).

Don't overdo it with the automatic calibration - it reprograms the PIC with each possible OSCCAL value so you'll wear out the flash if you do it too much.

Final interval bars code (I hope)

Monday, August 8th, 2011

I think I've now got the optimal implementation of the PIC12 code for interval bars. It's gone through many rewrites since my last post on the subject. I decided to get rid of the "heartbeat" after all in favor of a burst system which sends and receives 9 bits (all the information for one bar) at a time every 100 clock cycles or so and synchronizes once in each direction during that period, right before data transmission. This means we can use 3-pin connectors instead of 4-pin connectors. A 2-pin connector isn't really practical since extra electronics would have to be added to separate and recombine the power and data signals.

The downside to this approach is that the microcontroller in the "root" box now has two time-critical tasks - reading the bits as they come in (can't just request one when we're ready for it any more) and outputting audio samples. But I think that is manageable, especially if there are queues so that the interrupt routines can just add to or remove from their respective queue and the real work can be done in the foreground. The average data transmission rate is one bar every 317 cycles - the other 217 are spent in the "prime" phase where each bar discovers which bars are connected to it and cycles are broken to turn the graph into a tree. The data rate is about 3,154 bars per second or about 32ms per update with 100 bars - a workable latency for musical purposes.
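
The queue idea can be sketched as a pair of single-producer, single-consumer ring buffers - the interrupt side only ever touches the head and the foreground only ever touches the tail, so no locking is needed. The sizes and names below are illustrative, not taken from the real code:

```python
# Sketch of the queueing idea: interrupt handlers only push to or pop
# from fixed-size ring buffers, and the foreground loop does the real
# work at its leisure.

class RingBuffer:
    def __init__(self, size):
        self.buf = [0] * size
        self.head = 0        # next slot to write (producer only)
        self.tail = 0        # next slot to read (consumer only)
        self.size = size

    def push(self, item):    # called from the "interrupt" side
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            return False     # queue full; drop or signal overrun
        self.buf[self.head] = item
        self.head = nxt
        return True

    def pop(self):           # called from the foreground loop
        if self.tail == self.head:
            return None      # queue empty
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return item

incoming_bits = RingBuffer(16)   # filled by the receive "interrupt"
audio_samples = RingBuffer(16)   # drained by the output "interrupt"
```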

The program takes 496 instruction slots of the available 512 and 21 bytes of RAM of the available 25. The source file is just 354 lines.

I realized early on that there wasn't going to be space (in flash or RAM) or time for any debugging code, so this program would be impossible to debug on real hardware. I knew I'd never get it right the first time, so it was necessary to write a simulator. There is an existing simulator included with the Microchip tools, but I couldn't get it to work properly, and in any case it certainly wouldn't support programmatically plugging and unplugging as many as 100 instances at random. So I wrote my own cycle-exact simulator. Actually it had to be rather better than cycle-exact to simulate the fact that the microcontrollers run at slightly different speeds. My simulated timebase is about 39 picoseconds, giving a frequency resolution of about 39Hz - 512 steps between 0.99MHz and 1.01MHz.
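
The timebase arithmetic checks out: quantizing a period to steps of Δt gives a frequency resolution of about f²·Δt near frequency f, which for a ~39ps tick at 1MHz is about 39Hz - and the 20kHz span between 0.99MHz and 1.01MHz divided by 39Hz is indeed roughly 512 steps:

```python
# Checking the simulator timebase arithmetic: a period quantized to
# ~39 ps steps gives a frequency resolution near 1 MHz of roughly
# f^2 * dt = (1e6)^2 * 39e-12 = 39 Hz, i.e. about 512 distinct clock
# rates across the 0.99-1.01 MHz range.

TIMEBASE = 39e-12            # simulated tick, seconds
F_NOMINAL = 1e6              # nominal PIC12 clock, Hz

freq_resolution = F_NOMINAL ** 2 * TIMEBASE     # Hz per timebase step
steps = (1.01e6 - 0.99e6) / freq_resolution     # distinct clock rates

print(round(freq_resolution))  # 39
print(round(steps))            # 513
```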

After getting the simulator working, I spent a long time cycling through a process that looks like this:

  1. Run the program in the simulator.
  2. Discover that after a while it locks up or produces incorrect results for an extended period.
  3. Look at various diagnostics I added (up to and including full interleaved instruction traces) to figure out what went wrong.
  4. Adjust the program to avoid the problem, and repeat.

Most of the time, a trip through this loop increases the time-to-failure by a factor of between 2 and 10, but occasionally, it's turned out that there was no simple fix to the problem - the program required substantial rewrites to avoid the situation. These rewrites in turn have their own bugs, and the time-to-failure again becomes very small. It got easier, though - the same sorts of problems kept cropping up and I got better at recognizing them with time. Also, at the beginning I kept having to interrupt the cycle to write more diagnostic code when my existing techniques proved insufficient.

With the current version the simulation ran for more than a week of real time (91 hours of simulated time), went through 15,371,546 configurations with a worst settling time of 92ms.

The best version before this one ran for 774430 reconfigurations and 9 hours of real time (about 4.5 hours of simulated time) before getting itself into a state from which some of the bars stopped responding. That problem took a week to track down because it happens so rarely. The story of how it happens is kind of like a distilled version of a theatrical farce. There is one signal line for communication which can be in one of two states. As the program progresses, signals of different meanings need to be exchanged (there are about 27 different meanings in the latest version). The two communicating bars need to be "on the same page" about what the meaning of a 0 and 1 will be. But because bars can be connected or disconnected at any time, these meanings can become confused. The farce happens when one signal is confused for another and (due to coincidences that would be amazing if I wasn't conspiring to arrange them) this causes a worse confusion later on and so on, escalating until we get to a state from which the system can't recover.

The way out of this mess is by making some of the messages more complicated than a single bit. For example, the "prime" signal which initiates the data transfer is a 6-cycle low, a 31-cycle high and another 6-cycle low. The receiving code checks the line twice (37 cycles apart) for lows and in the middle somewhere for a high. This means that it can't be confused with either the 9-bit data signal (which is at most 36 cycles of low in a row) or for a single long low signal. The response to this is an 8-cycle low, an 8-cycle high and then a low of variable length (in previous versions it was a single long low of varying length). This increases the number of "this can't happen" signals. When we detect one of these we can put the program into a state that is robust against further unexpected input.
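
A sketch of the "prime" pulse recognition described above - sample the line at two points 37 cycles apart (both must be low) and once somewhere in the middle (must be high), which rejects both data bits and long lows. The exact sampling offsets here are illustrative:

```python
# Recognizing the 6-cycle-low / 31-cycle-high / 6-cycle-low "prime"
# pulse: check for lows 37 cycles apart with a high in between, so
# data (at most 36 consecutive low cycles) and plain long lows can't
# be mistaken for it.

def is_prime_pulse(line, t):
    # line[t] is the signal level (0 or 1) at cycle t
    return (line[t] == 0 and          # first low
            line[t + 18] == 1 and     # a high somewhere in the middle
            line[t + 37] == 0)        # second low, 37 cycles later

def make_prime_pulse():
    # 6 cycles low, 31 cycles high, 6 cycles low
    return [0] * 6 + [1] * 31 + [0] * 6

print(is_prime_pulse(make_prime_pulse(), 0))   # True
print(is_prime_pulse([0] * 50, 0))             # False: one long low
```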

A continual battle with this program has been making it as fast as possible whilst still being reliable and fitting in the available space. There isn't enough space to duplicate any significant part of the program for each of the 12 combinations of input and output pin, so I initially divided it up into "read from pin x" and "write to pin x" subroutines. The "write to pin x" subroutines can then be folded together by means of a couple of variables whose values can be written to the IO port to signify a low and a high respectively. Since reading from a memory location takes the same time as loading a constant, there's no cost to this indirection (apart from the setup code which has to initialize these variables depending on the pin we need to write to). The read subroutines can't be factored this way because the bit to read from is encoded in the opcode of the instruction which tests a bit. Using an "extract bit x" subroutine would have slowed the program down too much.

Phew. I think that (per line and per byte) this was the most difficult-to-write program I've ever written. Brian Kernighan is often quoted as saying "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." However, there is a corollary to this - if you write a program that's as clever as you can make it and then force yourself to debug it, you become twice as smart in the process.

Edit 14th July 2013:

LFT said it better than I could.

Rethinking memory

Sunday, August 7th, 2011

There are currently two major ways in which programming languages deal with memory. One is explicit deallocation (as used by C, C++ and assembly language) and the other is garbage collection, which is used by almost everything else.

Explicit deallocation is generally thought of as being very error prone and difficult to get right, but if your language has good features to support it (like destructors in C++) it can be very nearly as easy as garbage collection. It also has some other advantages over garbage collection:

  1. Less actual memory is needed to accomplish any given task (most garbage collected languages need about twice the memory for any given task as a non-garbage collected language).
  2. Non-memory resources can be cleared up in the same way as memory.
  3. Everything is deterministic (which is not the case if you have a garbage collector running on a separate thread).
  4. There are no collection pauses (most garbage collectors need to pause your program to work), making hard real-time systems feasible.
  5. Locality of reference is better preserved, leading to more cache-friendly memory access patterns.

Garbage collection also has its own advantages:

  1. You never need to deallocate memory explicitly - the computer essentially emulates infinite memory for you.
  2. You don't need to explicitly break (or avoid making) cycles of references - the garbage collector will find them.
  3. With a moving collector, allocation can be much faster (requiring just a pointer increment).
  4. If you can always arrange for garbage collection to happen in time that would otherwise be idle, the system can be faster overall.
  5. With a moving collector, memory fragmentation can be avoided.

Generally I prefer explicit deallocation to garbage collection. This might be because my parents drummed it into me as a child that I should put away my toys when I finished playing with them, not leave them lying around to be tidied away by someone else. The one thing that does bother me about explicit deallocation, though, is the possibility of memory fragmentation. Lately I have been wondering if it would be possible to fix this problem whilst retaining the advantages of explicit deallocation. Specifically, what got me thinking about this is the crazy memory tricks I implemented for my fractal plotter, and how to generalize them to other programs. The result is something I call ravioli allocation (code here although it doesn't actually do anything at the time of writing).

The idea there is that you have a pile of objects which each take the same amount of memory. So there is a simple algorithm for keeping them defragmented - whenever you deallocate an object, move the highest object in memory into the space it vacated.
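
The compaction rule is simple enough to sketch in a few lines (a toy model, not the real allocator - the heap here is just a Python list of equal-size objects):

```python
# Sketch of the defragmentation rule: freeing an object moves the
# highest-allocated object down into the vacated slot, so the live
# objects always form one contiguous block.

class RavioliHeap:
    def __init__(self):
        self.slots = []          # each entry is one fixed-size object

    def allocate(self, obj):
        self.slots.append(obj)
        return len(self.slots) - 1     # "address" = slot index

    def free(self, index):
        last = self.slots.pop()        # highest object in memory
        if index < len(self.slots):
            self.slots[index] = last   # move it into the hole
            # (a real implementation must now fix up every pointer
            # that referred to the moved object)

heap = RavioliHeap()
a = heap.allocate("A"); b = heap.allocate("B"); c = heap.allocate("C")
heap.free(a)             # "C" moves down into slot 0
print(heap.slots)        # ['C', 'B']
```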

There are two difficulties with doing this more generally:

  1. To move object X, you have to know where all the pointers pointing to it are so you can fix them up. This caused many long and painful debugging sessions with my fractal program.
  2. All the objects have to be the same size.

The second problem can be solved if the objects are large enough to hold two pointers, by simulating arrays and structures using binary trees (though access time for elements increases a bit - from O(1) to O(log n)).
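
Here's a toy sketch of that simulation (Python tuples standing in for the fixed-size two-pointer objects; the element count is a power of two to keep the tree complete):

```python
# Simulating an array with fixed-size two-pointer nodes: the elements
# live at the leaves of a binary tree, so reading element i follows
# O(log n) pointers instead of doing one O(1) indexed load.

def build_tree(elements):
    # Pair up nodes level by level; each internal node holds exactly
    # two pointers. len(elements) must be a power of two here.
    nodes = list(elements)
    while len(nodes) > 1:
        nodes = [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def get(tree, index, size):
    # Follow one pointer per level, steering by which half holds index
    span = size
    while span > 1:
        span //= 2
        if index < span:
            tree = tree[0]
        else:
            tree = tree[1]
            index -= span
    return tree

tree = build_tree([10, 20, 30, 40])
print(get(tree, 2, 4))   # 30
```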

The first problem can be solved by linking together in a linked list all the pointers which point to the same thing. The list needs to be a doubly linked list because we need to be able to remove a pointer from the list it's in. So, each "high level" pointer really consists of three "low level" pointers - one to the referent itself and two for the previous and next "high level" pointers in the list. An object itself will also need space for two other "low level" pointers - one to the first element of the linked list of pointers pointing to that object, and one to tell the allocator whether it contains 0, 1 or 2 "high level" pointers (this can also be used as a vtable pointer for other purposes).
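
The scheme above can be sketched like this - each "high level" pointer is a node in a doubly linked list hanging off its referent, so moving an object means walking one list and rewriting one field per pointer (class and field names are made up for illustration):

```python
# Sketch of the pointer-tracking scheme: every pointer to an object is
# linked into that object's list of incoming pointers, so when the
# object moves, walking the list finds all the pointers to fix up.

class Obj:
    def __init__(self):
        self.first_ptr = None     # head of the incoming-pointer list

class Ptr:
    def __init__(self, target):
        self.target = None        # the referent itself
        self.prev = None          # previous pointer to same referent
        self.next = None          # next pointer to same referent
        self.retarget(target)

    def retarget(self, target):
        self.unlink()
        self.target = target
        # insert at the head of the target's pointer list
        self.next = target.first_ptr
        if self.next is not None:
            self.next.prev = self
        self.prev = None
        target.first_ptr = self

    def unlink(self):
        if self.prev is not None:
            self.prev.next = self.next
        elif self.target is not None:
            self.target.first_ptr = self.next
        if self.next is not None:
            self.next.prev = self.prev
        self.prev = self.next = None

def move(old, new):
    # Fix up every pointer that referred to old so it refers to new
    p = old.first_ptr
    while p is not None:
        p.target = new
        p = p.next
    new.first_ptr, old.first_ptr = old.first_ptr, None

a, b = Obj(), Obj()
p1, p2 = Ptr(a), Ptr(a)
move(a, b)
print(p1.target is b and p2.target is b)   # True
```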

To avoid splitting objects across cache lines, we want the size of each object to be a power of two, so they will need to be able to hold 8 "low level" pointers (i.e. be 32 bytes on 32-bit systems, 64 bytes on 64-bit systems). That means that each 32-bit object can store 6 32-bit integers or 24 characters. On a 64-bit machine, an object could store 12 32-bit integers or 48 characters. So the small string optimization is very natural with ravioli allocation.

Once we have the ability to move objects around for defragmentation purposes, we might want to move them around for other purposes too. For example, if we expect that whenever object A is used, object B is also very likely to be used (and vice versa) we might want to put them on the same cache line. We can also set up separate memory arenas for separate purposes and move objects between them to improve locality for paging purposes.

One big downside of this system seems to be that deallocation (or moving an object in general) can be quite an expensive operation - chasing one pointer (and one potential cache miss) for each pointer that points to the moved object. However, I suspect that this wouldn't be as expensive as it sounds most of the time, since the objects that we move will tend to be ones which were allocated recently (since they will be in higher memory locations) and won't have been around long enough to have accumulated too many pointers. The objects that tend to have a lot of objects pointing to them will tend to be long lived and will migrate towards the beginning of memory (even if they don't start off there). However, walking this linked list should probably be done with non-temporal loads to avoid polluting the cache with a bunch of objects that you don't really care about. Also, consider the average case - each object can only point to two other objects, which means that the average number of pointers pointing to an object will be at most 2. And there is the nice property that a routine which doesn't deallocate data that it didn't allocate, also won't move data that it didn't allocate, so performance of small routines can be quantified nicely.

There is a fair bit of overhead to all these linked lists - 44% in the best case (one big array) and 300% in the worst case (a linked list of small records). However, some of this is offset against existing overhead (heap information, vtable pointers, fragmentation, existing binary tree pointers, reference counts[1]) so for real programs the costs might be much smaller or even negative.

Multithreading might seem to be a problem at first - one thread can't hold on to a pointer to an object in another thread's heap. However, you can transfer data to or from another thread if the other thread is in a known (waiting) state. You have to be careful transferring data between threads, but that's true anyway.

Local (stack) variables present a bit of a problem - they can't hold pointers (unless you walk the entire stack looking for pointers whenever you deallocate something). One way would be to store "how to get there" (from a global root object which is allocated first and never moves) instead of "where it is" on the stack. Another would be for the stack itself to be allocated on the heap. This is probably the faster solution, and has some nice implications for implementing things like arbitrary closures and continuations. Hardware support would definitely help here though.

[1] There is a nice synergy between ravioli allocation and reference counting - whenever you change a pointer you remove it from its linked list, and if the linked list then contains zero entries you know you can remove the object you just stopped referencing. Cycles must be explicitly broken, though - you can't have a raw/weak pointer (because it would go stale when the object it referred to moved).

Edit 14th July 2013:

Here's a good rant against garbage collection.

Code blog

Saturday, August 6th, 2011

It's been a year since I last wrote something here, but I haven't been idle. I made a resolution to write some code every day and commit some kind of functional change to my GitHub repository each day, even if it's just a one line change. So far I've kept to it pretty well (apart from travel days). I'm adding more stuff there all the time, most of it brand new but some of it is stuff that I wrote previously and which hadn't been under version control before (updated to work with the library code I'm also committing). I know my github workflow is rather unusual - lots of projects in one repository and commits corresponding to days rather than to features. But the version control is mainly for my benefit at the moment (since it's just me working on this stuff) - I just want to have all this backed up offsite, to be able to see previous versions and to have a record of what code I wrote when. The current set of subdirectories under the main repository is:

  • 4to8 - little program I wrote to convert the audio data for the game Fire! by New Deal Productions from 4-bit to 8-bit format so that I could play it back properly.
  • 8088 - various projects relating to the original IBM PC/XT - a cycle-exact emulator and a demo I'm working on.
  • UnityALFE - a compiled language I'm playing about with. This name is a bit overloaded, I might call it ALFE (A Language For Everything) instead.
  • codec - an idea I had for a method of compressing audio with extremely low playback overhead. Doesn't currently work, and I have no idea if it ever will.
  • collage - a program to generate a header image to use on this blog (not yet finished).
  • crc32 - a handy utility for computing CRC32 checksums of files.
  • crtsim - a CRT simulator.
  • euclids_orchard - a program to generate a Euclid's Orchard image.
  • fractals - some fractal plotter programs. Currently there is only one - a zoomable Bifurcation fractal.
  • image_resample - my custom image resampler.
  • include - various libraries used in multiple projects.
  • intervals - code and simulator for my intervals bar toy that is in progress.
  • multifunction - a program to search for useful multifunction gates.
  • oscilloscope - an oscilloscope program based on an idea I had for rendering waveforms. Not really started.
  • perceptual - a program to find maximally distributed colours in perceptual colour space.
  • primes - prime number experiments.
  • ravioli - an idea I had for eliminating memory fragmentation.
  • run_random - recursively enumerates all files in a directory and then picks one at random to run.
  • strobe - an Arduino program for a strobe light with controllable duty cycle and accurately controllable frequency.
  • test - some unit tests for libraries in the include subdirectory.
  • tone_matrix - the code for my physical tone matrix.

I'll blog more about these in upcoming entries. And I'll try to remember to update this list as I add more projects to the repository.