More 8088 MPH how it's done

April 12th, 2015

While the 1K colour mode and 4-channel audio player are the big technical achievements I wrote for 8088 MPH, there are a few other bits and pieces in the demo I wanted to write up here. Specifically, three tricks which make possible full-screen 60Hz movement that isn't scrolling.

Starfield (particles)

This is the was the first effect I wrote for the demo, about 4 years ago. The inner loop looks like:

patchAddress:
  mov di,9999
  stosb
  shl di,1
  mov di,[di]
  es: mov [di],ah
  mov [patchAddress + 1],di

This is unrolled 948 times (once for each moving particle, with a separate patchAddress for each one). This is the number we calculated it would be possible to have if each one is updated at a rate of 60Hz (though the frame rate isn't critical here - since the particles are scanned in random order, tearing can never happen). The initial (random) value loaded into DI is the position of the particle (a value between 0 and 16383 inclusive, a byte address in video RAM). The value in AL is zero, so STOSB erases the particle. Then the lines "shl di,1 mov di,[di]" move the particle - the data table at DS:0 to DS:32767 is a sort of cross between a circular linked list and a vector field, describing the trajectory that each particle follows. So topologically speaking the particles are all moving around in a loop, but the points along this loop are arranged so that they form trajectories which look like the movement we want. The shift is necessary because the vector table is two bytes per element but the other uses of DI are one byte per element.

The line "es: mov [di],ah" plots the star on the screen. We actually have 7 different colours of stars, held in AH, BL, BH, CL, CH, DL and DH. These registers are initialized appropriately at startup, and the register that each iteration of the unrolled loop uses is set at random.

My first version of this effect included 21 different vector fields but unfortunately they don't compress well (and the best ones take ages to compute from first principles) so we ended up just using a single one and including it in the binary instead of generating it at run-time.

Moire

This can be done with practically 0 CPU time on Amiga by means of multiple bitplanes scrolling independently. We don't have that facility on CGA so it's a bit more CPU-intensive but still pretty straightforward. We use 40-column text mode to avoid snow, and do the same "right half bar" (0xde, ' ▐') trick that is used for 160x100x16 mode (based on 80-column text mode). The number of scanlines per row is set to 4 to create an 80x50 "chunky pixel" mode (though actually we only had enough time for 47 rows at 60Hz, so we just left the last few rows blank).

The large bitmap (picture of circles) is actually stored in the executable rather than computed at runtime. A second copy is created at runtime (shifted over by one half-character). Two pointers into these bitmaps are maintained in the registers SP and BP and we process four pixels per iteration of the inner loop, which looks like this:

  pop ax
  xor ax,[bp+99]
  stosb
  inc di
  mov al,ah
  stosb
  inc di

This is unrolled 940 times (20 times per row, 47 rows) with appropriate "+99" adjustments in each row, and adjustments to SP and BP are made after each row of 20:

  add sp,stride-40
  add bp,stride

To make the transition effect at the end, we overwrite blocks of these instructions by replacing the "xor ax,[bp+99]" instruction with "mov ax,0".

Unfortunately once we added the music the effect no longer runs at quite 60Hz and some tearing can be seen with close observation. This is something we hope to fix in a final version.

Kefrens bars (+raster bars)

The idea behind Kefrens bars is to get rid of the frame buffer altogether and just have a line buffer (Atari 2600 style). Then any video memory changes that you make on one scanline appear on all lower scanlines (unless overwritten by change on a lower scanline). During vertical overscan/blank/sync the line buffer is cleared. TThis has been done lots of times on other platforms, but never on CGA before. Several things make it tricky on this platform: the CGA wait states severely limit how many VRAM writes you can do in any given scanline (we ended up with 2 reads and 4 writes per scanline, plotting 7 nybble-wide pixels). The other is that you need to synchronize the routine with the CRT raster (aka "racing the beam"), meaning cycle counting and tuning the routine to take exactly 304 cycles.

Normally it's not possible for piece of 8088 code to always take exactly 304 cycles - the reason being that every 72 cycles PIT channel 1 wraps and triggers DMA channel 0 to refresh a row of DRAM, stealing the bus from the CPU for 4 cycles. As 304 is not divisible by 72, these refresh cycles appear at different places on the screen from scanline to scanline, so you need potentially 9 different scanline routines for the 9 different refresh phases. The number of scanlines per frame (262) is not divisible by 9 so you need potentially 9 different frame routines as well.

Fortunately there's another way that is both easier and requires less RAM. It's possible to reprogram PIT channel 1 so that the refreshes come every 76 cycles instead (exactly 4 times per scanline). While this may technically be out of spec for the DRAM chips, in practice we haven't found any machines on which the slower refresh doesn't work - in fact many require massively longer refresh intervals before they start to decay. Even the ones that crash with a refresh period of 84 or 80 cycles work at 76 just due to manufacturing tolerances. This is one ingredient we used to make the Kefrens bars work. Note that we had to put it back to 72 for the credits part as that routine's inner loop is a multiple of 72 cycles but not 76.

Another complication is that the CRTC in our CGA cards (the MC6845) does not cope well with a one-scanline-high frame (I believe it can be done, but that it requires reprogramming the CRTC a couple of times per scanline.) So instead our "line buffer" is actually two scanlines high. This gives a nice dithering effect on our Kefrens bars - we liked the way it looked, so we stopped trying to get the single-scanline version working.

The inner loop of the effect is 200 unrolled iterations of the 304-cycle:

  mov ax,9999
  mov ds,ax
  mov sp,[bx]
  pop di
  mov al,[es:di]
  pop cx
  and ax,cx
  pop cx
  or ax,cx
  stosw
  pop ax
  and ah,[es:di+1]
  pop cx
  or ax,cx
  stosw
  pop ax
  out dx,al
  mov ds,bp
  lodsb
  out soundPort,al

The routine uses two big tables. One (at DS:BX) is 200*838 elements in size (one for each combination of scanline and frame number - 200 scanlines, 838 frames). Each element of this table is 2 bytes, a pointer into the other, smaller table (at SS:SP). The total size of this table is 327kB, the largest single table in the entire demo. This big table is stored sideways (much like the sample table in the MOD player). Since we need to reload DS a couple of times each scanline already, we might as well reload it with a different value on each scanline. The "9999"s are patched with the right DS values at loop unrolling times. The entries in the SS:SP tables are 12 bytes each, and hold all the information needed for a combination of a Kefrens bar position and a raster bar colour. There are 154*16 entries (one for each such combination) and the whole table needs to be doubled in order to have different entries on odd and even scanlines. So 12*154*16*2 = 59136 bytes, fits quite happily in a single segment.

The small table can easily be generated at runtime, but the big table proved troublesome. With some heavy optimization, I got the precomputing of Puppeh's nice Kefrens bars motions and raster bars patterns down to 28 seconds or so but my target was less than 15 seconds. After a few false starts I figured out that the Kefrens bars part of the table was quicker to compute but compressed less well while the raster bars part of the table took longer to compute but compressed much better. So I ended up just splitting the tables, compressing the rasters and precalculating the Kefrens.

If you try the demo out on real hardware and the Kefrens bars effect seems unstable, try disabling any network card drivers and networking software you may have installed. IRQs from a NIC or similar hardware can mess up the delicate timings. This is one of the more timing-sensitive effects in the whole demo, though - without a 4.77MHz and genuine MC6845-based IBM CGA card the image is unlikely to stabilize.

One other thing you might notice about this routine is that it outputs to the speaker on each scanline! This was something I didn't get working in time for Revision, but my plan for this effect was that we would have PWM music playing in the background. So there's more that can be done with this effect that what we did in 8088 MPH (if not the sound, then those cycles could be used for something else).

Source code

The source file for all the bits of the demo I wrote are now up on my github:

Effects:

Tools:

8088 PC Speaker MOD player: How it's done

April 10th, 2015

The last 100 seconds of 8088 MPH sound very different to the rest of the demo. The end tune is actually a 4-channel Amiga MOD file (which you can download here) composed by coda. Playing back a MOD through PC speaker on such a slow machine has never been done before. Here is how we did it.

We knew early on that we wanted to push the limits of audio from the machine as well as video, but also that we wanted to use the PC speaker for output. The first Sound Blaster card didn't arrive on the scene until 1989. Also, Galaxy Player (glx212) can play a MOD on a 4.77MHz 8088 with a Sound Blaster, so doing that wouldn't have broken any records. Galaxy Player is a very impressive piece of code - I once broke into it with a debugger to figure out how it works. It can also play back through the PC speaker, but that requires a much faster computer.

Most software that uses the PC speaker plays square wave beeps. This can be done with minimal CPU involvement as channel 2 of the PC's 8253 Programmable Interval Timer (PIT) is connected (via an AND gate) to the speaker. Setting that channel to square wave mode and programming an appropriate frequency (as a divisor of 105MHz/88 ~= 1.193MHz) is about all you need to do for an unattended beep. Changing the beep frequency 60 times a second (via an interrupt) is how the music is played for most of 8088 MPH. By rapidly switching between several notes (arpeggiation) you can give the impression of playing multiple notes at once (though it's a poor substitute for true chords) and by varying the duty cycles of arpeggios you can get a very limited impression of volume attenuation.

Playing back pre-recorded sampled sound over the PC speaker was less common, but still a well understood technique. As well as a square wave mode, the PIT has a "one shot" mode where the output goes low while the counter counts down and then goes high waiting for further input. By loading new values into the PIT's count register regularly, Pulse Width Modulation can be achieved. I remember being stunned the first time I heard this being done on the family Amstrad PC1512 - a crystal-clear (for the time) 6 bits of dynamic range at a sample rate of 18.6kHz! The "regularly" bit is the problem, though. Most (if not all) PC speaker software using this technique used a hardware timer interrupt (IRQ0, PIT channel 0) to trigger the CPU regularly to reload the channel 2 count with the next sample. However, interrupts on a 4.77MHz 8088 are really slow, and when playing back samples at ~20kHz there is basically no time for anything else (like the mixing required for a MOD player). All you can do is play back pre-rendered sound, and 100 seconds of 16.6kHz audio would have been 1.6MB of data - way too much for our 360kB disk space budget even with pklite helping out, not to mention our 640kB of RAM.

However, since we were targeting a specific CPU (and a specific clock speed) for this demo, we had a way of timing things without the timer interrupt - counting cycles. Conceivably, the same techniques could be used in a more portable player - having several predefined routines for the most common slow CPUs (and an IRQ-driven version for faster CPUs), or calibrating the speed on startup.

When writing highly-optimized 8088 code, there are two techniques which are particularly important - loop unrolling and self-modifying code. There is a tension between these two techniques, though - the more unrolling you do the lower your looping overhead but the more code you have to self-modify. Galaxy player plays these two techniques off each other quite nicely - picking an unrolled loop size which minimizes the total time the routine takes. Unfortunately this technique isn't a good fit for CPU-timed PC speaker audio because the samples aren't generated sequentially - it first renders a frame of samples for channel 0, then adds in a frame of samples for channel 1 and so on. It might be possible to statically intersperse the PIT count "out" instructions into the mixing code but that's really difficult code to write (it really needs to be written by a sophisticated tool that has knowledge of the exact 8088 timings to minimize jitter). Perhaps for a future project...

After playing about with (and measuring the execution speed of) a whole lot of different routines, I settled on one that doesn't unroll the sample loop at all (but does completely unroll the channel loop). The mixing code itself looks like this:

  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al

This code can't play back arbitrary MODs because it has a limitation on sample length - rather than being of arbitrary length, samples are all exactly 256 bytes long. The idea is that samples repeat with one oscillation (i.e. an up and a down) over those 256 bytes. With a 16.6kHz output sample rate that translates to an output frequency range of 0 to 8.3kHz with a frequency resolution of 0.25Hz. So it's capabilities are somewhat closer to the C64's SID chip than the Amiga's Paula (in fact we codenamed the routine SID during development - we had others codenamed VIC and Paula which may get an airing in the future).

Our "SID" also has no volume tables - the volumes have to be baked into the samples themselves, so there may need to be several copies of any given sample at different volume levels (one for each volume level at which the sample is played). If we have too many samples, volume can be quantized (the program that converts from the source .mod to the player's internal format does the volume baking and optimizes the number of quantization levels). If we can squeeze it into 288 CPU cycles (spoiler: we can!), that's 72 PIT cycles which divides nicely into 4 - each channel's samples go from 1 to 18, and the final sample goes from 4 to 72 (we have to take care not to program 0 into the PIT or it'll count down for 65536 cycles and we'll get no audio for about 55ms). It also works out well for DRAM refresh - using the default DRAM refresh period of 18 we'll get exactly four refresh bus cycles per sample, and they'll be in the same places in the execution of the code every sample, so won't cause any jitter.

The four registers bp, si, di and dx each hold the respective channel's current position within the waveform (as an 8.8 bit fixed point number). The "9999"s are the respective channels' frequencies. The higher the frequency, the further the position gets advanced each sample (direct digital synthesis - similar to that which I used in the Physical Tone Matrix). These frequency values are modified right in the code itself during runtime (self-modifying code). This reduces register pressure (which is important as there are not a lot of registers to spare!)

Similarly, the "99"s (also patched at runtime) are the waveform numbers for each channel. The 256x256 waveform table is turned "sideways" (the low byte of the address is the waveform number and the high byte is the position) in order to avoid having to shift the high 8 bits from the position register to the low 8 bits of the sample pointer.

Now, if we run this code in a loop we'll get a chord playing from the PC speaker. But only a single chord - we want to play a whole song, where the note frequencies and waveform numbers change potentially 50 times per second. So we need to add a way to patch the frequencies and waveform numbers in the code. The fastest way to do that is to use the stack as our stream of patch data:

  pop bx
  pop word[cs:bx]

This means we need to run with interrupts disabled, but we need to do that anyway - the delays introduced by the timer interrupt would cause massive audio quality degradation.

Now we have enough ingredients to play an actual tune, but not a long one! If we're pulling 4 bytes off the stack for every sample we play, we're going to run out of stack in under a second. We might as well just pull preprocessed samples from the stack if we're going to do that - it would last 4 times longer!

However, most of the time we don't need to patch anything - we just want to leave the sample playing until we next want to change something, and then we probably want to change everything at once after 20ms (331 samples). So we want some kind of loop that counts down and then we do the patching once it reaches zero:

loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop loopTop
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

The "99" in the "mov cl,99" instruction is (you guessed it) another value that is patched at runtime.

This code takes exactly 288 cycles to run in the case where we're patching, but it's quite a bit shorter in the no-patch case. We need it to run at 288 cycles every iteration (patch or no patch) to keep the samples coming regularly. Fortunately, there's a nice place to stick some NOPs where they will be executed only in the non-patch case, making two loops that mutually overlap without either being nested within the other:

v:
  times 15 nop
loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop v
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

That's it - that's the entire inner loop as it is when the CPU executes its first instruction. All the remaining magic is in the data that's pointed to by the stack pointer. Let's think about how fast we're burning through that data now. 50 times a second we need to patch (worst case) 10 locations (4 frequencies, 4 sample numbers and the loop counter twice). That's 40 bytes, 50 times per second or 2000 bytes per second. That means we get through our 64kB of stack data in less than 33 seconds. That's better than 1 second, but still too short for our song by a factor of 3. We could use a larger stack, but then we'd need to update our stack segment somehow every 33 seconds at least. There's no code to do that, though, and nowhere to put such code that wouldn't execute far too often.

Or is there? Take another look at those 15 NOPs at the start of the routine. If you squint a bit, don't they sort of look like a blank canvas just waiting to be painted with some amazing work of art? (No? Maybe it's just me then). Yes, if we keep the "mov cl,99" line patched to be "mov cl,1" we can patch as many times as we like without the code between v and loopTop being executed at all, which means that we can patch some code into there and then switch the CL value to 2 for a sample in order to execute it. We have to make sure that these little "patched routines" (I call them v-instructions, hence the label) take exactly the same time as the 15 NOPs. This turns out to be possible for all the v-instructions we need for 8088 MPH. However, some of them are *smaller* than the 15 NOPs (in particular those which use some of their bus cycles to access memory or IO ports instead of fetch instruction bytes). This means we need to jump to a different place in the code to start them - that's easy enough, though, we can just patch the destination byte of the "loop v" instruction the same way we're patching everything else (almost half the bytes in this little routine get patched at some point!)

The next thing to notice is that the set of v-instructions that we can execute form a small (albeit verbose) bytecode interpreted language - we can do whatever we like in there provided we meet the space and time requirements. Our little mod player has become Turing complete! In particular, we can modify the stack pointer in order to do loops. That means instead of being hundreds of kilobytes, our v-instruction program can be relatively small (the one in 8088 MPH is just 652 bytes long). It's tricky to write, because it needs to have inside it the locations of the various points within the program that we need to patch, which might move around as we debug things. So rather than writing them directly I wrote some assembler macros to generate them for me. Oh, and because a v-instruction is made up of several CPU instructions, I ended up writing a sort-of mini assembler in the assembler's macro language! Here is one of the v-instructions from the CRTC update v-instruction routine in 8088 MPH:

  forget 7
  startAt 4
  MOV_BX_DX
  MOV_DX_iw
  w 0x3d4
  MOV_AX_iw
  w 0x990d
  OUT_DX_AX
  MOV_DX_BX
  runV 1

This translates to the 8088 instructions:

  mov bx,dx
  mov dx,0x03d4
  mov ax,0x990d
  out dx,ax
  mov dx,bx

The "0x99" (AH value) is, you guessed it, patched to the desired CRTC start address byte (yes, I used self-modifying code in the v-instructions as well as in the 8088 instructions).

The main v-instruction routine runs 50 times per second and updates the frequency and waveform data in the actual mixing routine. It also fetches more song data from the pointer ES:0 (and increments ES). This means that we can burn through just 800 bytes of data per second for our song instead of 2000, dramatically reducing the memory usage. Only 12 of the 16 bytes in the paragraph are used for the actual musical data, the other 4 bytes hold the address of another subroutine to set the stack pointer to, and an argument for that routine. These "h-instruction" subroutines do things like printing a character on the screen, changing the cursor position for the print routine, changing the CRTC start address (for hardware scrolling) and finishing the routine (by patching the final "jmp looptop" instruction. So there's 3 layers of code here: the actual 8088 mixer code, the v-instruction code in the stack and the h-instruction code interspersed into the song data.

Somewhat surprisingly, there's still plenty of v-instruction time left - even the longest of these takes only 119 samples of the 331 available. So the routine actually ends up using only about 87% of the available time (less when just scrolling slowly without printing any extra characters). Getting the routine to do more is tricky, though - using more than 4 bytes per 20ms frame for the non-song stuff would probably involve moving through the song data at a non-integer number of segments per frame. Also, the fact that the only persistent registers available to v-instructions are ES and AH (though AL and BX can be stomped) makes writing new v-instructions tricky. More can certainly be done, though.

You might notice that the "mov cl,99" instruction directly follows the instruction that patches it ("pop word[cs:bx]"). This means that the 8088 will execute the unpatched version of the instruction (as it's already in the prefetch queue by the time it's patched). The v-instruction program generation macros take this into account. In theory this instruction could be anywhere between the "loop v" instruction and the "jmp loopTop" instruction, but in practice if I put it anywhere except where it is, the routine ends up taking more than 288 CPU cycles. For debugging in DOSBox (which executes the patched version) I have a debug switch which moves the instruction before the pop (the timing is wrong on DOSBox anyway, but I can at least test functional changes that way). Debugging is still tricky, though - breakpoints don't work so well when your entire program is executed from the same 15 byte memory region!

Some pre-processing of the source .mod file is performed on a modern computer before transferring it to the old PC - resampling the looping samples to 256 bytes and changing the note and effect data into frequencies and waveform numbers in the 800 bytes per second format. The .mod interpreter is based on a older version of PT2PLAY. The source is here and there is a compiled binary for Win32 here. Only mode 1 works at present. All Protracker effects should work with the exception of 9xx (set sample offset), E0x (hardware filter), E9x (retrigger) and CIA timer modes. The latter could be accomplished to some extent by adjusting the number of samples per tick.

As well as generating the data for the routine, mod_convert generates a .wav so that musicians can hear how their work sounds on the PC without actually having to have one. To do this, the program generates a 1-bit waveform at 1.193MHz and then resamples it to 44.1kHz, so even the high frequency carrier wave is reproduced. It's essentially a little emulator of the PC speaker circuit.

One nice feature of this routine that I didn't find a way to use in "8088 MPH" is ring modulation. If the second "mov bl,99" instruction is patched to be "mov bl,al" then the output from the first channel is used as the second channel's waveform number. If the waveform numbers in slots 1..18 are all the same basic waveform multiplied by a suitable set of amplitudes, then the output of the second channel is ring modulated by the first!

One other nice little finishing touch with the credits part is the way that it exits to DOS at the end, with the "A:\>_" prompt right after. This is a fully functioning DOS prompt, not a mockup. The idea is that the "What can you do?" line is immediately followed by the prompt to actually do it! In order to get this to work, the screen had to be in the "unscrolled" location at when the demo exits (DOS and the BIOS scroll the screen by shifting video RAM data, not by changing the CRTC start address). That's easy enough then, just subtract the number of character positions that we scroll from 8192 and program that value into the start address initially. We also need to move our initial write pointer and VileR's awesome background 40-column ANSI art so that they start in the right place. I accomplished this without needing to add code to handle wrapping, by taking advantage of the fact that the CGA ignores address bit 14, causing physical addresses 0xB8000-0xBBFFF to be mirrored in the address range 0xBC000-0xBFFFF. This turns out to be another emulator-breaking change, though (at least in DOSBox)!

The source for the player itself can be found on my github.

1K colours on CGA: How it's done

April 8th, 2015

[Update: VileR's writeup of the 1K colour mode is now up. His has fewer technical details but is much easier to understand than mine as it has pictures!]

When displaying graphics on an original IBM Color Graphics Adapter (CGA), normally only 4 colours (from a palette of 16) are possible at once. A few games written for such systems took advantage of the artifacting on the card's NTSC composite output to get 16 colours at once. On Saturday, a team of people including myself, Trixter, Scali and VileR released a demo ("8088 MPH") which smashed this limit and won first place in the "Oldskool Demo" compo at the Revision 2015 demoparty in Saarbrücken, Germany. Some commenters have suggested that the production is a fake and that what we claimed to have done is impossible. Others have suggested it's dithered or flickered to get more colours. But it is none of these things. Here is how we did it.

First of all, what defines a colour on the composite output? There's only one signal line on a composite connection (plus a ground return path) so you can't have separate red, green and blue analog levels like you have on a VGA card (or separate red, green, blue and intensity lines like you have on an RGBI connection.) Instead, a composite signal effectively sequences the red, green and blue signals in time. A composite colour is a signal which repeats at a frequency of 3.57MHz (half of the width of a text character in 80-column mode). Given such a signal, you can compute its DC component (average voltage), oscillation amplitude (about this average) and phase (relative to the color burst pulse at the start of each scanline). These three parameters directly correspond to brightness (luminance), saturation and hue respectively. Higher frequencies (2nd and greater multiples of 3.57MHz) are not involved in colour decoding and would normally have been filtered out by the decoding circuitry in the composite monitors CGA cards would have been connected to in 1981.

The most common CGA composite mode works by putting the card in 1 bit-per-pixel (1bpp) mode - i.e. each pixel is either off (black) or on (white, generally, though this could be changed via the palette register). A single period of color carrier oscillation contains 4 pixels in this mode (the pixel rate is 14.318MHz), so there are 16 possible waveforms you can make with patterns of lit and unlit pixels and hence 16 artifact colours.

Separately to artifact colour, the CGA card has 16 "direct" colours (the ones that are available in text modes). These are just the 16 possible RGBI bit patterns on an RGBI output, but how does the card generate these colours on the composite output? It does so by generating 3.57MHz waveforms on the card at 6 different phases using flip-flops. These are the colours blue, green, cyan, red, magneta and yellow. Including the constant digital signals 0 and 1 (GND and +5V) gives 8 basic colours. To get the intense versions of these, an additional DC offset is applied when the digital signal is turned into an analogue one at the output.

The 1K colour trick hinges on noticing that the direct colours are not the same as the artifact colours. In 1bpp mode you can change the palette register to get different sets of artifact colours. Suppose you change the palette register to blue - then any black pixel will "turn off" the corresponding part of the "blue" waveform. These "chopped up" colours are different yet again from the 16 direct colours and the 16 normal artifact colours. So you could get 256 colours that way (though you can't put them wherever you like because there are limits to how often you can change the value in the palette register).

Suppose that in 1bpp mode you had a second palette register so that you could change the colour corresponding to the 0 bit as well as the one corresponding to the 1 bit. Then using the same techniques you could generate 2K colours (16 foreground colours, 16 background colours and 16 bit patterns for choosing which colour goes where - but swapping foreground and background and inverting the bit pattern yields the same colour). Here we come to the crucial part of the trick: in text mode you can kind of do that - the attribute byte for a character (when flashing is disabled) lets you choose the foreground and background colours independently. Unfortunately you don't get to choose the bit patterns you want - those are defined by the bits in the CGA's character ROM, which can't be changed from software.

VileR is the one who deserves credit for the next observation. He pointed out to me that the characters 'U' (capital letter U, 0x55) and '‼' (double exclamation mark, 0x13) both have bit patterns in their top two rows 11001100 and 01100110 respectively) which are the same for the left nybble as for the right nybble, and the same in both rows. Therefore, if we change the number of scanlines per character row to 2 (as is done in a number of other CGA games to get a 160x100x2 mode using a "vertical half solid" character - 0xDD or 0xDE) we should be able to get ~500 colours (2 useful characters times 16 foreground colours times 16 background colours) at a resolution of 80x100.

In order to get from there to 1024 colours we need to find some more characters with the same properties as 0x55 and 0x13. It would be fantastic if there happened to be for every nybble value X a character with that bit pattern in its top 4 nybbles, but unfortunately only the nybble patterns 1100 and 0110 are obtainable that way. However, if we consider just the top scanline instead of the top two, we find two more characters with the right property - '░' (light shade, 0xb0, bit pattern 00100010) and '▒ ' (medium shade, 0xb1, bit pattern 01010101). Unfortunately the second scanlines of these characters don't play ball, and if we tried to use them with 2 scanlines per row we'd get horizontal stripes instead of solid-coloured pixels.

So to get those extra colours we need to use 1 scanline per row. However, there's are several complications in doing so. One is that the CRTC on the CGA card (Motorola MC6845) cannot generate more than 128 rows (plus up to 32 extra scanlines) per frame and we need to generate 262 scanlines per frame in order to maintain the correct ratio of hsync pulses to vsync pulses that the monitor requires to generate a stable picture.

It is possible to do this, though, by generating multiple CRTC frames per CRT frame (and suppressing the vsync pulse for all but one of them). This is how we generated the wide picture before the credits part (the one with our faces) - in that image there's a 100 scanline frame with 1 scanline per row and immediately below it a 162 scanline frame with 2 scanlines per row.

But there were several 1K colour images in the demo that filled the entire screen - how did we do those? The answer is very similar but instead of having one frame 100 scanlines high, we have 100 frames that are 2 scanlines high (all with 1 scanline per row). In the middle of each of these frames the memory address is advanced by one row by the CRTC. In each frame we advance the CRTC start address register by one row's worth of characters, so that the top row of one frame is the same as the bottom row of the frame above it. So each frame straddles two pixel rows and each 2-scanline-high "pixel" straddles two CRTC frames.

So we're done, right? That's all there is to the trick? Well, not quite - there are more complications. If you do the obvious thing and set 80-column text mode, colour burst enabled via the BIOS, you will see either no colours at all on your composite display or colours that flash in and out and change hue (on monitors that don't have a properly functioning colour-killer circuit). The reason for this is that the CGA card was never designed to be used in 80-column text mode with composite colour display (the text doesn't have enough horizontal resolution to be readable) and there's a hardware bug that prevents it from working properly anyway.

The bug is that the CGA card takes the horizontal sync (hsync) signal from the CRTC (which just goes high and low once per scanline) and uses it to trigger a more complicated composite pulse signal consisting of front porch, sync, breezeway, color burst and back porch. The whole process takes 10 character periods in modes other than 80-column text (-HRES modes) so the BIOS programs the CGA's hsync width register to 10. But in +HRES (80-column text) mode these 10 characters are half the width, so the hsync process gets interrupted half way through leading to a truncated sync pulse and no burst at all.

This is well-known and the usual way of dealing with the problem is to set the border colour (palette register) to 6 (dark yellow - not brown as it is on the 5153 RGBI TTL monitor) so that the monitor picks up its color burst from the border instead. However, on our hardware we found that doing this made +HRES modes significantly darker than -HRES modes. This is because monitors and capture devices calibrate their gain to normalize the amplitude of the burst pulse, and colour 6 is brighter than the normal burst pulse. Not all of the demo uses +HRES mode and we found that we could not use a single set of calibration settings for both -HRES and +HRES parts - if we tried then either the +HRES parts were too dark or the -HRES parts were washed out, leaving colours 9-15 barely distinguishable shades of white. We didn't really want to have to edit our capture to brighten up just some parts of the demo. Another problem was that both of the capture devices we had brought with us to the party were giving a shimmery picture (unstable horizontal sync) with this fix.

Instead what we ended up doing is leaving the border colour as black but increasing the horizontal sync width to its maximum value of 16 characters (programmed as zero, which looks wrong, but it's a 4-bit register and the compare is done after the increment - at least on the MC6845 CRTCs on the CGA cards we were using). This gives a burst of either half or three-quarters the standard width (depending on whether the character it starts on corresponds to a rising or falling edge of the CGA's internal +LCLK signal that is used to time the hsync sequence. I think we managed to arrange it so that it's always three-quarters but there may be bugs in that part of the code.

That fixes the brightness problem but unfortunately some capture devices (including the one that Trixter used to do some test/failsafe captures before the party) cope less well with this than with the border colour 6 change. If we release a "final version" (with a few minor improvements and bug fixes) we might include a "calibration screen" that people can use to choose the border colour, hsync width and phase that works best with their output device.

Yet another complication is that there were multiple revisions of the IBM CGA card. They had (to a good approximation) the same standard composite artifact colours but different direct colours. On the older CGA cards, colours 1-6 were all the same brightness, as were colours 9-14. This made them indistinguishable on monochrome composite monitors, so for the second revision of the CGA card, IBM added some more resistors to the output DAC in order to make different colours different brightnesses. They also removed the -BLANK signal so that the burst pulse is the same amplitude no matter whether it comes from border colour 6 or from the hsync burst (the truncated burst problem is still present, though).

Different direct colours mean that our 1K colour mode displays a different set of colours on old CGA cards as on new CGA cards. We debated a bit about whether we should target old or new CGA cards for our demo, but in the end we decided to go for old CGA cards, mainly because the set of colours you can get from an old CGA card are more useful (artistically speaking) than those from a new CGA card.

In order to make the hand-drawn 1K colour pictures, VileR and I made some test captures of the old CGA card's output with all the useful combinations of attribute and character, which he then used as a palette to paint his pictures. Happily he was able to find in there some close correspondances to the 16 RGBI colours and the 16 colours of the Commodore 64's palette.

For the pictures that were converted from photographs, I wanted to be able to use more characters than just 0x13, 0x55, 0xB0 and 0xB1 - I wanted to be able to try all different characters (even those that have different left and right nybbles in their top scanline) to get a closer match to the source image. However, getting calibration images for all 65536 combinations (let alone the 4 billion artifacts that can be generated from adjacent characters) was impractical. To make that work, I really needed to have a mathematical model of the CGA's composite output stage that I could use to generate the right colours. Ideally I would be able to generalize this to new CGA as well.

My first attempt at this was the one I used for the Hydra image - I assumed that the direct colours had hue/phase angles that were multiples of exactly 45 degrees, and that the CGA's pixel colour multiplexer chip was able to switch instantaneously between them. However, the hydra didn't come out looking how I expected on real hardware. Much later, I learnt that the main reason for this is that the TTL logic chips used on the CGA card don't switch instantaneously - there are logic delays between a signal coming in to an input pin and the corresponding change happening on the output pin. When your color carrier period is 279ns, a delay of just 7ns causes a noticable phase shift of 9 degrees.

There are several logic chips on the various signal paths of interest here, all with their own logic delays. My second attempt at modelling the CGA involved looking up the data sheets for all these chips, finding typical values for the logic delays (most of them were listed as a range) and generating an accurate model that way. This worked excellently for 1bpp mode, reasonably well for 2bpp mode, and not so well at all for +HRES mode. This is the implementation that is in the current SVN versions of DOSBox. I kept adding more and more parameters to my model and attempted to tune them to match my captured calibration images but I could not get good results that way. The trouble seemed to be in the guts of the multiplexer chip itself - the output signal depends in a complicated and mysterious way on all of the input signals, so the number of parameters required to describe its behavior quickly becomes impractical.

The final breakthrough came when I realized that I didn't need to model the composite signal *exactly* - I just needed to model it well enough to describe the observed colours. All the relevant colour information is at frequencies below 7.16MHz. By the sampling theorem, if we can reconstruct a version of the signal sampled at 14.318MHz, it'll be exactly correct not including frequencies at or above 7.16MHz (which we don't care about). The key insight is that we don't care about what happens to the signal *in between* those samples - it can bounce around, transition as slow or fast as you like as long as we know where it ends up when we measure the sample - all that extra freedom just manifests in the frequencies we don't care about.

The multiplexer takes a while to transition from one colour to another - on the order of 70ns (one 1bpp pixel time). So there isn't a place in the signal that we can sample and be sure that the previous transition has stopped and the next transition has not yet started. I theorized that at any given time there will not be more than one transition taking place. So a transition (and hence a sample) can be completely described by 1024 parameters - one for each combination of left colour (16 possibilities), right colour (16 possibilities) and phase within the color carrier cycle (4 possibilties).

I made a test pattern which does a very good try at getting swatches of all 4096 foreground/background/pattern combinations in just a couple of screenfuls (some can only be obtained for a short stretch, as transitions). This was quite a feat in itself - I needed an area of screen consisting of scanline 3 of several characters repeating vertically, necessitating having four CRTC scanlines within a single CRT scanline - a hairier CRTC manipulation trick than any that we actually used in 8088 MPH itself!)

I set up a model with these parameters and tried to match it to my captures. I initially tried to use a gradient-ascent hill-climbing algorithm to search the parameter space but before I could get it to work I realized that most of the parameters affected very few of the test swatches - any transition between two different colours X and Y can only affect colours with X and Y as foreground and background colours (32 of the 4096). If the left colour and the right colour are the same then any swatch with that colour as either foreground or background can be affected (496 of the 4096). That observation made a more naive hillclimbing algorithm much more practical, and it just takes a few minutes to find a set of 1024 parameters that match the measured values to within a few percent.

I wanted to model the new CGA as well as the old CGA, but didn't have an easy way to get good captures for that. However, the difficult bit of the old CGA to model (the multiplexer) is identical in the new CGA. So instead of applying the technique described above to the final output, I applied it to the multiplexer output and the intensity bit output separately. This yields a 256-element table and a 16-element table respectively. The outputs from these are summed to get the final output. There was a small amount of degradation from the 1024-element table version but it's too small to notice directly. To generate new CGA output, I just duplicated the intensity bit logic and applied it to the R, G and B bits (with appropriate scaling based on the resistor values in IBM's schematics). I haven't yet tested if this really matches new CGA output, but I don't currently know of any reason why it wouldn't.

This CGA simulation algorithm is implemented in a program I made called CGA2NTSC which has two main functions. One is to act as a CGA composite output stage emulator and NTSC decoder (taking a picture such as one might find on an RGBI monitor and show what it would look like on the composite output (old and new CGA). The other is to take a (24-bit colour) input image and try to find a set of data which, when loaded into the CGA's video memory, will best reproduce that image on the composite output. This is what we used to make the faces picture in the demo. It supports 1bpp and 2bpp modes as well as both text modes (though we only used it in +HRES mode for 8088 MPH). The program uses error diffusion (which can be turned down or off). I've had a couple of requests to make it use ordered dithering instead. That's possible for 1bpp and 2bpp modes but doesn't really make sense for text modes where you don't get an arbitrary choice of bit patterns. The program is a bit unpolished but should be reasonably usable.

Next time: how I played a 4 channel MOD at a sample rate of 16.6kHz through the PC speaker on a 4.77MHz 8088 CPU.

Multiplying faster with squares

October 30th, 2012

The 8088 has a hardware multiplier, but it's quite slow:

MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0
MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0

Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there's either some very weird behavior going on with the 8088's multiplier or that there's a bug in my timing program (possibly both).

Can we beat the 8088's hardware multiplier with a software routine? There's a clever trick which can sometimes be used to speed up multiplications, based on the following:

(a+b)2 = a2 + b2 + 2ab
(a-b)2 = a2 + b2 - 2ab

Subtracting these gives:
4ab = (a+b)2 - (a-b)2

Or:
ab = (a+b)2/4 - (a-b)2/4

So if we keep in memory a table of x2/4 we can do a multiply with an add, two subtractions and two table lookups.

Does this still work if the entries in the table are fractional? Yes, it does, and here's why. x2/4 is an integer for even x, and for odd x it will always be an integer plus a quarter, as a quick induction will show. (a+b) and (a-b) are always both odd or both even (since they differ by 2b which is always even) so the quarters always cancel and we can just ignore them.

In 8088 assembly code, this can be written like this (assuming the operands are in ax and cx and that we don't care about the high word of the result):

  xchg ax,bx
  mov si,offset table
  shl bx,1
  shl cx,1
  add bx,cx
  mov ax,[si+bx]
  sub bx,cx
  sub bx,cx
  sub ax,[si+bx]

That works out to about 88 cycles (93 with DRAM refresh) - already quite an improvement over the hardware multiplier. We can do even better, though - if we place our table of squares at DS:0 and assume that our operands are in si and bx we get:

  shl bx,1
  shl si,1
  mov ax,[si+bx]
  neg bx
  sub ax,[si+bx]

Which is just 56 cycles (59 with DRAM refresh) - more than twice as fast as the hardware multiplier!

For byte-sized multiplies we can do something similar in just 34 cycles (36 with DRAM refresh):

  mov al,[si+bx]
  neg bx
  sub al,[si+bx]

However, it's less important for byte-sized multiplies since we can fit an entire 8-bit multiply table in memory and do:

  xchg bh,bl
  mov al,[si+bx]

Which is only about 20 cycles (21 with DRAM refresh). There's obviously a bit of a memory-usage vs time tradeoff there though.

There are some more nice things about these algorithms:

  1. There's more choices about which registers to use for the operands and results.
  2. If you're doing a lot of multiplies and want the results of them all scaled by a constant factor, you can build this into the table.
  3. If you're summing the results of several multiplies (e.g. computing a dot product) you can do it right in the accumulator (just change the "mov" to an "add".

The downsides (apart from the memory usage and time to initialize the table) are that you don't get the high byte/word for free and that it only works for signed multiplies.

How to get away with disabling DRAM refresh

October 29th, 2012

On the original IBM PC 5150 (and it's mostly electrically-equivalent derivatives, the 5155 and 5160) the operation of the bus (the data channel between the CPU and it's memory and peripherals) is interrupted is interrupted 2,187,500 times every 33 seconds (a rate of about 66KHz) for 11/13,125,000 of a second each time (i.e. 4 out of every 72 CPU cycles). During that time, very little of the machine can operate - no RAM can be read or written and no peripherals can be accessed (the CPU might be able to continue doing it's thing if it's in the middle of a long calculation, and the peripherals will continue to operate - it's just that nothing can communicate with each other).

Why does this happen? Well, most computers (including the one this blog post is about) use DRAM (Dynamic RAM) chips for their main memory, as it's fast and much cheaper than the slightly faster SRAM (Static RAM) chips. That's because each DRAM bit consists of a single capacitor and transistor as opposed to the 4 or more transistors that make up a bit of SRAM. That capacitor saves a lot of hardware but it has a big disadvantage - it discharges with time. So DRAM cells have to be "refreshed" periodically (every 2ms for the 16 kbit 4116 DRAMs in the original 5150) to maintain their contents. Reading a bit of DRAM involves recharging the capacitor if it's discharged, which refreshes it.

But a computer system won't generally read every bit of RAM in any given interval. In fact, if it's sitting in a tight idle loop it might very well not access most of the memory for many minutes at a time. But we would be justified in complaining if our computers forgot everything in their memories whenever they were left idle! So all general-purpose computers using DRAM will have some kind of circuitry for automatically accessing each memory location periodically to refresh it. In the 5150, this is done by having channel 1 of the 8253 Programmable Interval Timer (PIT) hooked up to the Direct Memory Access (DMA) controller's channel 0. The BIOS ROM programs the PIT for the 66KHz frequency mentioned above, and programs the DMA controller to read a byte each time it's triggered on channel 0. The address it reads counts up from 0 to 65,535 for each access, then goes back down to 0 again.

If the DRAM needs to be refreshed every 2ms why does the refresh circuit run at 66KHz and not 500Hz, or for that matter 8.192MHz? To answer that question, we need to know something about how the memory is organized. The original 5150 had banks of 8 chips (plus a 9th for parity checking). Each chip is 16 kbit, so a bank is 16KBytes. If you had a full 640KB of RAM organized this way, that would be 40 banks or 360 separate chips! (By the time that much memory become common, we were mostly using 64 kbit chips though.) Within each chip, the 16 kbits are organized in a grid of 128 "rows" and 128 "columns". To read a bit, you input the "row" address, then the "column" address, then read back the result (hence the chips could have just 16 pins each, as each address pin corresponds to both a "row" bit and a "column" bit). Happily, whenever a row is accessed, all the DRAM cells on that row are refreshed no matter what column address is ultimately accessed. Also, the low 7 bits of the physical byte address correspond to rows and the next 7 bits correspond to columns (the remaining 6 bits correspond to the bank address). So actually you could quite happily get away with just refreshing addresses 0-127 instead of addresses 0-65,535 on this machine (though there was a good reason for doing so, as we'll see later).

To ensure that they meet tolerances, electronic components (including DRAM chips) are manufactured with certain margins of error, which means that often one could get away with reprogramming the PIT to reduce the DRAM refresh rate a bit in order to squeeze a little bit more performance out of these old machines - it was a common hack to try, and I remember trying it on the family computer (an Amstrad PC1512) after reading a little bit about DRAM refresh in a computer magazine. I seem to recall that I got it up from the standard 1/18 to maybe 1/19 or 1/20 before it became unstable, but the performance improvement was too small to notice, so the little .COM file I made with DEBUG never made it as far as my AUTOEXEC.BAT.

For many of the timing experiments and tight loops I've been playing with on my XT, I've been disabling DRAM refresh altogether. This squeezes out a bit more performance which is nice but more importantly it makes the timings much more consistent (which is essential for anything involving lockstep). However, whenever I've told people about this the reaction is "doesn't that make the machine crash?" The answer is "no, it doesn't - if you're careful". If you turn off the refresh circuitry altogether you have to be sure that the program you're running accesses each DRAM row itself, which happens automatically if you're scanning through consecutive areas of RAM at a rate of more than 66KB/s, or for that matter if you've done enough loop unrolling that your inner loop covers more than 127 consecutive bytes of code. Since these old machines don't have caches as such, unrolled loops are almost always faster than rolled up ones anyway, so that's not such a great hardship.

Not all of the machines I'm tinkering with use 4116 DRAM chips. Later (64KB-256KB) 5150 motherboards and XTs use 4164 (64 kbit) chips, and modified machines (and possibly also clones) use 41256 (256 kbit) chips. The principles are exactly the same except these denser chips are arranged as 256x256 and 512x512 bits respectively, which means that there are 8 or 9 row bits, which means that instead of accessing 128 consecutive bytes every 2ms you have to access 256 consecutive bytes every 4ms or 512 consecutive bytes every 8ms respectively (the PIT and DMA settings were kept the same for maximum compatibility - fortunately the higher density DRAMs decay more slowly so this is possible). So when disabling DRAM refresh, one should be sure to access 512 consecutive bytes every 8ms since this will work for all 3 DRAM types.

The cycle-exact emulator I'm writing will be able to keep track of how long it's been since each DRAM row has been refreshed and will emit a warning if a row is unrefreshed for too long and decays. That will catch DRAM refresh problems that are missed due to the margins of error in real hardware, and also problems affecting only 41256 chips (my machine uses 4164s).

Modern PCs still use DRAM, and still have refresh cycles, though the overhead has gone down by an order of magnitude and the exact mechanisms have changed a few times over the years.

What I want from an online store

October 28th, 2012

Amazon's great - I buy all sorts of stuff from there. But their searching interface still seems to be optimized for the books, CDs and DVDs that they started out with. It works fine if you know what you're looking for and just need the search results to decide if you want the hardback or paperback edition. But if you need a waste paper basket that fits into a particular space, a digital alarm clock with a red LED display or a shaver light/socket with an 8" spacing between the mounting holes (all things I've tried to find there) - you're limited to sorting a single department's search results by price or score and then going to each of the thousands of results one by one to see if any have sufficiently detailed information on the product page (most don't).

What I really want is an interface like Mouser's - also known as a parametric search, whereby you can narrow down by (multiple values of) all kinds of different parameters and then sort the results by various parameters as well (Mouser's great by the way - I've bought lots of bits from there and it's got the best combination of prices, interface and selection of all the places I've tried). I suspect that the reasons that this isn't done by Amazon already are:

  1. Electronic component manufacturers often make ranges of components with many parameters that you can choose independently (for example, these resistors have 7 independent parameters mentioned in the datasheet).
  2. Electronic component manufacturers (by necessity) provide much more detailed information about their product than manufacturers of consumer goods (which hitherto have mostly just had to look nice on a store shelf).
  3. Amazon sells a lot more different products than Mouser (I make it 3,489,158 for Mouser and 166,367,811 for Amazon at the time of writing).

This seems like it ought to be surmountable, if enough people cared. It would be great to see an open, crowdsourced, wiki-like worldwide database of SKUs with as much technical information as possible about each product that Amazon, eBay and anyone else who wanted to could plug in to. It wouldn't need to be perfect to be useful (Mouser's isn't) - in fact it's usefulness probably increases roughly linearly with the number of products in it.

A new protected class: things you've said

October 27th, 2012

Some years ago, when this site was much smaller than it was today, it was suggested to me that I might want to be careful about what I wrote here lest it get read by a prospective employer who might find a reason in it to decline me employment. Having my website be nothing more than a online resume would be very boring, though, so I declined - in rather more polite terms than I really felt. Besides which, any employer who would do such a thing would clearly not be a good employer to work for. I'm lucky in that I have a pretty desirable skillset, though - not everyone is so fortunate.

I bring this up now because of this horrifying story that I read this morning. The very suggestion that such a thing might be done will have a massively chilling effect on participation in publicly archived discussions. Blogging is already hard enough knowing that everything I say is really part of my permanent record without imagining that it will be data-mined to discover all sorts of things about me that I didn't want to share in the first place!

We talk a lot about free speech in the western world, and take very seriously any possibility that government might limit that speech. But I think we don't take seriously enough threats to our free speech from public sector. Knowing that we can't get arrested for stuff we say online isn't terribly useful if that same stuff can make us unemployable.

So, I'd like to see some kind of legal framework that would prevent employers from discriminating against prospective hires based on things they've said. Such a framework wouldn't be completely unprecedented - there are already several pieces of information that are technically available to employers which they can't use in employment decisions. I propose that we just expand that to make "stuff you've said" a protected class. Naturally, that would also make it illegal to fire someone over something that they said (though exceptions would probably have to be made for things directly related to their job - it should still be possible to fire someone for violating an NDA, for example).

Companies don't like to have employees who say terrible things on the internet, because it reflects badly on them (and their hiring practices). But it only does so because they have the power to do something about employees who say terrible things on the internet. If they didn't have that power, they can just say "it's not work related - it's nothing to do with us". Essentially, because it's not prohibited it's essentially compulsory. So companies ought to be clamouring for this legislation - it would ensure they could concentrate on their core business and not have to go googling for dirt on their employees. It would also mean that they could choose the best person for the job without having to take into account stuff that fundamentally doesn't matter to them. And it would make it less likely that they would be left short-handed due to an ill-advised comment.

Practical considerations of a consumption tax

October 26th, 2012

I enjoy listening to NPR's Planet Money podcasts and I have learned a lot about economics from them.

One recent show that was particularly interesting and that has had me thinking is the one about the no brainer economic platform. That has made me reconsider several of my own economic opinions and throw several sacred cows on the barbecue (to mix up some metaphors).

In particular, I had always thought that income taxes were the best kind of taxes, being progressive and cheap to collect. However, as Planet Money points out - taxing something discourages that behavior and we don't want to discourage income and employment! A consumption tax makes much more sense in terms of incentives - discouraging consumption is environmentally friendly and, while it isn't naturally the most progressive form of taxation, can be tweaked to make it reasonably progressive. How to make it practical to collect is a different matter, though. Taxing something is always extremely invasive, as the government will require lots of information and documentation about that thing in order to make sure that you're not cheating on your taxes.

A carbon emission tax seems like one of the best kinds of consumption tax. We're pretty universally agreed that putting carbon dioxide in the atmosphere is an undesirable thing. Most of that net carbon comes from fossil fuels which tend to come from a few big mining operations, so it's natural and easy to tax these operations per tonne of carbon they sell. They will then pass these costs on to their customers and ultimately to the consumer, discouraging the consumption by means of increased fuel costs (of course, there are political problems with raising fuel prices but that's a bit out of the scope of this discussion - after all, Planet Money does point out that their economic platform would be political suicide!)

Fossil fuels aren't the only way carbon gets into the atmosphere, though. Should we also charge dairy farmers for the methane emissions from cows? That has a big effect on climate change too, although the carbon involved came out of the atmosphere much more recently than that from fossil fuels. The fairer we try to be, the more invasive the tax becomes. For example, consider someone who lives completely off the grid, growing trees and felling them for firewood. He's completely self-sufficient in terms of carbon cycles (while burning the trees puts out carbon dioxide, it's completely cancelled by growing them in the first place). So it doesn't seem like he should be paying any carbon tax. So we'd have to offset the emission tax against carbon absorption - in other words pay a credit to individuals and organizations causing a net reduction in the amount of carbon in the atmosphere. Growing trees is one such activity. What about landfill site operators, who sequester enormous amounts of carbon in the ground? That's not generally considered to be a hugely "green" activity, but by this measure it would be.

Of course, as we inevitably move away from fossil fuels, that form of tax becomes increasingly useless. We can tax other forms of "digging up stuff from the ground and using it up" consumption, but I suspect that getting all the tax money we need that way would have some really undesirable consequences (like pricing a lot of useful things like electronics right out of the range of what most people can afford).

What other things are consumed that we could tax? Well, there's a lot of energy falling on the earth in terms of sunlight that we can collect and reuse in various forms (and indeed I expect that's where much of our energy will come from in a few decades). But it's hard to think of that as "consumption" when it's so plentiful and so much of it just goes to waste. Harnessing solar energy is also something we don't really have any interest in discouraging.

There is one resource that we each have a finite amount of and have to be very careful how we use it. That is our time. Could we tax consumption of time? That sounds like a very regressive poll tax on the face of it. But what if we tax only wasted time? I subscribe to the view that time is only wasted if you're spending it doing something that you don't truly enjoy. There's an easy way to tell (from a taxation point of view) if you're doing something that you don't really enjoy - somebody has to pay you to do it. If you do it without being paid, then you're probably doing it because you want to and therefore not wasting your time.

So that brings us right back around to the income tax again. There's even a justification here for making the income tax progressive - if you need to be paid more to do some piece of work, obviously it's more unpleasant and therefore should be taxed at a higher rate (yes, I know that people don't really get paid in proportion to the unpleasantness of their jobs but it could be a useful legal fiction).

There's one aspect to income taxes that I don't really like, which is casual labour. Suppose I wanted to hire the kid next door to trim my hedges - if I had to fill in a big pile of tax paperwork in order to do so I probably wouldn't bother - I'd just do it myself instead. Lots of "under the table" work goes on and many blind eyes are turned to it which is a sad state of affairs - our laws ought to reflect actual practice rather than making huge swathes of our population (technically) outlaws. How do we draw the line between which jobs should be taxed and which shouldn't?

I haven't thought through all the ramifications of this, but I have an idea that perhaps income tax ought be linked to limited liability. We give this great gift to corporations, limiting the liability of investors to only the amount of money they invest - if the corporation does something which costs society more than that, society absorbs the remainder of that cost. The idea is that by doing this we encourage entrepreneurship, which is all very well but it seems like there should be a cost to limiting liability, so perhaps income tax should only be raised when the employer is a limited liability organization. The deal ends up being the same as it currently is for such organizations, but if you don't need to limit your liability, then your employees don't need to pay income tax either - I think it's quite a neat way to delineate. On the small business end of things, there will be a population of companies with limited liability and a population without, they will compete with each other and market forces will determine the size of business at which limiting liability becomes worth it. Perhaps there could be some kind of sliding scale of liability (and therefore taxation) to ease the transition for companies growing across that boundary. I suspect I don't have the economic chops to have a good sense of how well that would work out in the real world though.

Intervals: clock and power on the same line

October 25th, 2012

One of the main design considerations of Interval bars was trying to minimize the number of connections between adjacent bars. The current design has three connections - two for power and one which is a bidirectional, self-synchronizing data signal. The logic to synchronize everything ended up being a fairly complicated bit of code. It would be much simpler and faster if all the microcontrollers were driven from a single clock signal, so that all the signals could be synchronous. However, that would end up using 4 pins instead of 3.

What if instead of combining the clock signal with data, we combined it with power? Well, it's only worth doing this if we can do it without too much external hardware. A simple resistor/capacitor filter would be acceptable, but that's not enough to properly restore both power and clock signals. The DC part (across the capacitor) would be okay but then across the resistor we would get an AC signal, when what we really want is a square wave that oscillates between 0 and 5V, so we'd need some additional components to fix up the clock signal.

The worse problem, though, is that with each interval bar added to the circuit, we'd be adding an additional capacitor (in series with a resistor) between ground and (power+clock), so the clock signal is going to get degraded (it's amplitude will be reduced) with each bar. We can reduce this problem by increasing the resistance of the resistor, but then we'd have to use a higher voltage for the power line, increasing the complexity of the root box.

In the end I decided that a solution that makes the hardware simpler but the software more complicated was preferable, since the software only needs to be written once, but who knows how many times an interval bar will get built?

Variadic templates in ALFE

October 24th, 2012

Some of the types in ALFE are templates which can accept different numbers of parameters - in other words, variadic templates. How does this fit in with the Kind system I wrote about before?

As well as the Kind constructors "<>" (template) and "" (type), we need a third Kind constructor which represents zero or more Kinds in a sequence suitable for template arguments. Let's call this Kind constructor "...". As with C++ templates, we should allow the members of the sequence to have arbitrary kind. So <...> is the kind of a template that takes zero or more arguments whose kinds are type, while <<>...> is the kind of a template that takes zero or more arguments whose kinds are <> (a single argument template whose argument is of kind type). The kind of Function is <, ...> (the comma is required because Function has one or more arguments rather than zero or more). More complicated kinds like <<...>, ...> are also possible.

To actually create a variadic template, we need to implement "recursive case" and "base case" templates and have the compiler choose between them by pattern matching, just as in C++. So the definition of Tuple might look something like:

Tuple<@T, ...> = Structure { T first; Tuple<...> rest; };
Tuple<> = Structure { };

Is that enough? Well, I believe this gives ALFE's kind system parity with C++'s. However, there is one more possible Kind constructor that I can imagine - the Kind constructor which can represent any Kind at all - let's call it $. A template with an argument of this kind (i.e. has kind <$>) can be instantiated with an argument of any kind. So you could have:

Foo<@T> = { ... };
Foo<@T<>> = { ... };
Foo<@T<@R<$>>> = { ... };

The first template is used when Foo is instantiated with a type, the second is used when it's instantiated with a template of kind <> and the third is used when it's instantiated with any other single-argument template, no matter what the kind of the argument is. The third template could then pass R as an argument to another instantiation of Foo and thereby "peel apart" the kind of the argument (as long as none of the kinds involved have multiple arguments).

By combining $ with ... I think we could potentially peel apart completely arbitrary kinds. However, I'm not sure if this is worth implementing since I can't think of any use for such a thing. Still, it's good to have some idea about how to proceed if I do eventually come across such a use!