Archive for April, 2015

More 8088 MPH how it's done

Sunday, April 12th, 2015

While the 1K colour mode and 4-channel audio player are the big technical achievements I wrote for 8088 MPH, there are a few other bits and pieces in the demo I wanted to write up here. Specifically, three tricks which make possible full-screen 60Hz movement that isn't scrolling.

Starfield (particles)

This was the first effect I wrote for the demo, about 4 years ago. The inner loop looks like:

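  mov di,9999                ; particle position (patched on each pass)
  stosb                      ; AL = 0: erase the particle
  shl di,1                   ; table entries are two bytes each
  mov di,[di]                ; look up the next position
  es: mov [di],ah            ; plot the star
  mov [patchAddress + 1],di  ; store the new position for next time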

This is unrolled 948 times (once for each moving particle, with a separate patchAddress for each one). This is the number we calculated it would be possible to have if each one is updated at a rate of 60Hz (though the frame rate isn't critical here - since the particles are scanned in random order, tearing can never happen). The initial (random) value loaded into DI is the position of the particle (a value between 0 and 16383 inclusive, a byte address in video RAM). The value in AL is zero, so STOSB erases the particle. Then the lines "shl di,1 mov di,[di]" move the particle - the data table at DS:0 to DS:32767 is a sort of cross between a circular linked list and a vector field, describing the trajectory that each particle follows. So topologically speaking the particles are all moving around in a loop, but the points along this loop are arranged so that they form trajectories which look like the movement we want. The shift is necessary because the vector table is two bytes per element but the other uses of DI are one byte per element.

The line "es: mov [di],ah" plots the star on the screen. We actually have 7 different colours of stars, held in AH, BL, BH, CL, CH, DL and DH. These registers are initialized appropriately at startup, and the register that each iteration of the unrolled loop uses is set at random.
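The table lookup can be modelled in a few lines of Python. This is only a sketch of the idea, not the demo's code: the real table encodes carefully designed trajectories, whereas the toy `next_pos` table below (a hypothetical name) is just one random closed loop through all 16384 positions.

```python
import random

VRAM_SIZE = 16384  # particle positions are byte addresses 0..16383

def make_cycle_table(seed=0):
    """Build a toy next-position table forming one closed loop.
    Assumption: the real table arranges the loop to look like star motion."""
    rng = random.Random(seed)
    order = list(range(VRAM_SIZE))
    rng.shuffle(order)
    next_pos = [0] * VRAM_SIZE
    for here, there in zip(order, order[1:] + order[:1]):
        next_pos[here] = there
    return next_pos

def step(vram, next_pos, pos, colour):
    """One unrolled iteration: erase, follow the table, plot."""
    vram[pos] = 0          # erase the particle at its old position
    pos = next_pos[pos]    # the table lookup moves the particle
    vram[pos] = colour     # plot it at the new position
    return pos             # the real code patches this back into the loop

next_pos = make_cycle_table()
vram = bytearray(VRAM_SIZE)
pos = 1234
for _ in range(VRAM_SIZE):
    pos = step(vram, next_pos, pos, 0x0F)
```

Because the toy table is a single cycle, the particle is back where it started after 16384 steps, which is the "topologically a loop" property described above.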

My first version of this effect included 21 different vector fields but unfortunately they don't compress well (and the best ones take ages to compute from first principles) so we ended up just using a single one and including it in the binary instead of generating it at run-time.


This can be done with practically 0 CPU time on Amiga by means of multiple bitplanes scrolling independently. We don't have that facility on CGA so it's a bit more CPU-intensive but still pretty straightforward. We use 40-column text mode to avoid snow, and do the same "right half bar" (0xde, '▐') trick that is used for 160x100x16 mode (based on 80-column text mode). The number of scanlines per row is set to 4 to create an 80x50 "chunky pixel" mode (though actually we only had enough time for 47 rows at 60Hz, so we just left the last few rows blank).

The large bitmap (picture of circles) is actually stored in the executable rather than computed at runtime. A second copy is created at runtime (shifted over by one half-character). Two pointers into these bitmaps are maintained in the registers SP and BP and we process four pixels per iteration of the inner loop, which looks like this:

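  pop ax             ; next word from the first bitmap (via SP)
  xor ax,[bp+99]     ; combine with the second bitmap (via BP)
  stosb              ; write the first attribute byte
  inc di             ; skip the character byte
  mov al,ah
  stosb              ; write the second attribute byte
  inc di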

This is unrolled 940 times (20 times per row, 47 rows) with appropriate "+99" adjustments in each row, and adjustments to SP and BP are made after each row of 20:

  add sp,stride-40
  add bp,stride

To make the transition effect at the end, we overwrite blocks of these instructions by replacing the "xor ax,[bp+99]" instruction with "mov ax,0".
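As a rough model of what one row of this loop computes, here is a Python sketch. It assumes (this is not stated above) that each 16-bit word holds two attribute bytes, low byte first, and that each attribute byte carries two nybble-wide pixels. The function name `render_row` and the `blanked` parameter are made up for illustration.

```python
def render_row(words_a, words_b, blanked=frozenset()):
    """Combine one row of two bitmaps, one 16-bit word (4 pixels) at a time.

    words_a and words_b are lists of 16-bit integers. Indices listed in
    `blanked` model iterations whose XOR has been overwritten with
    "mov ax,0" for the transition effect.
    """
    attributes = []
    for i, (a, b) in enumerate(zip(words_a, words_b)):
        word = 0 if i in blanked else (a ^ b) & 0xFFFF
        attributes.append(word & 0xFF)   # low byte: first two pixels
        attributes.append(word >> 8)     # high byte: next two pixels
    return attributes

row = render_row([0x1234], [0x00FF])
```

With 20 words per row this yields the 40 attribute bytes of one 40-column text row.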

Unfortunately once we added the music the effect no longer runs at quite 60Hz and some tearing can be seen with close observation. This is something we hope to fix in a final version.

Kefrens bars (+raster bars)

The idea behind Kefrens bars is to get rid of the frame buffer altogether and just have a line buffer (Atari 2600 style). Then any video memory changes that you make on one scanline appear on all lower scanlines (unless overwritten by a change on a lower scanline). During vertical overscan/blank/sync the line buffer is cleared. This has been done lots of times on other platforms, but never on CGA before. Two things make it tricky on this platform. One is that the CGA wait states severely limit how many VRAM writes you can do in any given scanline (we ended up with 2 reads and 4 writes per scanline, plotting 7 nybble-wide pixels). The other is that you need to synchronize the routine with the CRT raster (aka "racing the beam"), meaning cycle counting and tuning the routine to take exactly 304 cycles.

Normally it's not possible for a piece of 8088 code to always take exactly 304 cycles - the reason being that every 72 cycles PIT channel 1 wraps and triggers DMA channel 0 to refresh a row of DRAM, stealing the bus from the CPU for 4 cycles. As 304 is not divisible by 72, these refresh cycles appear at different places on the screen from scanline to scanline, so you need potentially 9 different scanline routines for the 9 different refresh phases. The number of scanlines per frame (262) is not divisible by 9 so you need potentially 9 different frame routines as well.

Fortunately there's another way that is both easier and requires less RAM. It's possible to reprogram PIT channel 1 so that the refreshes come every 76 cycles instead (exactly 4 times per scanline). While this may technically be out of spec for the DRAM chips, in practice we haven't found any machines on which the slower refresh doesn't work - in fact many require massively longer refresh intervals before they start to decay. Even the ones that crash with a refresh period of 84 or 80 cycles work at 76 just due to manufacturing tolerances. This is one ingredient we used to make the Kefrens bars work. Note that we had to put it back to 72 for the credits part as that routine's inner loop is a multiple of 72 cycles but not 76.
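The phase counts quoted above follow from a little modular arithmetic, sketched here in Python. The helper name `refresh_phases` is made up for this example.

```python
from math import gcd

SCANLINE_CYCLES = 304        # CPU cycles per CGA scanline
SCANLINES_PER_FRAME = 262

def refresh_phases(refresh_period, span=SCANLINE_CYCLES):
    """Count the distinct offsets (mod refresh_period) at which successive
    spans of `span` CPU cycles can begin relative to the refresh timer."""
    return refresh_period // gcd(refresh_period, span)

default_phases = refresh_phases(72)   # 9: a different routine per phase
tuned_phases = refresh_phases(76)     # 1: every scanline looks the same
frame_phases = refresh_phases(72, SCANLINE_CYCLES * SCANLINES_PER_FRAME)  # 9
```

With a 76-cycle refresh period the refreshes land at the same four points of every scanline, so a single routine suffices.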

Another complication is that the CRTC in our CGA cards (the MC6845) does not cope well with a one-scanline-high frame (I believe it can be done, but that it requires reprogramming the CRTC a couple of times per scanline.) So instead our "line buffer" is actually two scanlines high. This gives a nice dithering effect on our Kefrens bars - we liked the way it looked, so we stopped trying to get the single-scanline version working.

The inner loop of the effect is 200 unrolled iterations of this 304-cycle routine:

  mov ax,9999
  mov ds,ax
  mov sp,[bx]
  pop di
  mov al,[es:di]
  pop cx
  and ax,cx
  pop cx
  or ax,cx
  pop ax
  and ah,[es:di+1]
  pop cx
  or ax,cx
  pop ax
  out dx,al
  mov ds,bp
  out soundPort,al

The routine uses two big tables. One (at DS:BX) is 200*838 elements in size (one for each combination of scanline and frame number - 200 scanlines, 838 frames). Each element of this table is 2 bytes, a pointer into the other, smaller table (at SS:SP). The total size of this table is 327kB, the largest single table in the entire demo. This big table is stored sideways (much like the sample table in the MOD player). Since we need to reload DS a couple of times each scanline already, we might as well reload it with a different value on each scanline. The "9999"s are patched with the right DS values at loop unrolling time. The entries in the SS:SP tables are 12 bytes each, and hold all the information needed for a combination of a Kefrens bar position and a raster bar colour. There are 154*16 entries (one for each such combination) and the whole table needs to be doubled in order to have different entries on odd and even scanlines. So 12*154*16*2 = 59136 bytes, fits quite happily in a single segment.
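The table sizes are easy to check with a few lines of Python. The constant names are made up for this sketch, and kB is taken to mean 1024 bytes.

```python
SCANLINES = 200
FRAMES = 838
POINTER_BYTES = 2
big_table_bytes = SCANLINES * FRAMES * POINTER_BYTES      # 335200 bytes
big_table_kb = big_table_bytes / 1024                      # about 327kB

ENTRY_BYTES = 12          # six words popped per scanline
BAR_POSITIONS = 154
RASTER_COLOURS = 16
ODD_AND_EVEN = 2
small_table_bytes = ENTRY_BYTES * BAR_POSITIONS * RASTER_COLOURS * ODD_AND_EVEN

SEGMENT_BYTES = 64 * 1024
fits_in_segment = small_table_bytes <= SEGMENT_BYTES
```

The big table is more than five segments long, which is why DS has to be reloaded as the routine moves through it.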

The small table can easily be generated at runtime, but the big table proved troublesome. With some heavy optimization, I got the precomputing of Puppeh's nice Kefrens bars motions and raster bars patterns down to 28 seconds or so but my target was less than 15 seconds. After a few false starts I figured out that the Kefrens bars part of the table was quicker to compute but compressed less well while the raster bars part of the table took longer to compute but compressed much better. So I ended up just splitting the tables, compressing the rasters and precalculating the Kefrens.

If you try the demo out on real hardware and the Kefrens bars effect seems unstable, try disabling any network card drivers and networking software you may have installed. IRQs from a NIC or similar hardware can mess up the delicate timings. This is one of the more timing-sensitive effects in the whole demo, though - without a 4.77MHz 8088 and a genuine MC6845-based IBM CGA card the image is unlikely to stabilize.

One other thing you might notice about this routine is that it outputs to the speaker on each scanline! This was something I didn't get working in time for Revision, but my plan for this effect was that we would have PWM music playing in the background. So there's more that can be done with this effect than what we did in 8088 MPH (if not the sound, then those cycles could be used for something else).

Source code

The source files for all the bits of the demo I wrote are now up on my github.



8088 PC Speaker MOD player: How it's done

Friday, April 10th, 2015

The last 100 seconds of 8088 MPH sound very different to the rest of the demo. The end tune is actually a 4-channel Amiga MOD file (which you can download here) composed by coda. Playing back a MOD through PC speaker on such a slow machine has never been done before. Here is how we did it.

We knew early on that we wanted to push the limits of audio from the machine as well as video, but also that we wanted to use the PC speaker for output. The first Sound Blaster card didn't arrive on the scene until 1989. Also, Galaxy Player (glx212) can play a MOD on a 4.77MHz 8088 with a Sound Blaster, so doing that wouldn't have broken any records. Galaxy Player is a very impressive piece of code - I once broke into it with a debugger to figure out how it works. It can also play back through the PC speaker, but that requires a much faster computer.

Most software that uses the PC speaker plays square wave beeps. This can be done with minimal CPU involvement as channel 2 of the PC's 8253 Programmable Interval Timer (PIT) is connected (via an AND gate) to the speaker. Setting that channel to square wave mode and programming an appropriate frequency (as a divisor of 105MHz/88 ~= 1.193MHz) is about all you need to do for an unattended beep. Changing the beep frequency 60 times a second (via an interrupt) is how the music is played for most of 8088 MPH. By rapidly switching between several notes (arpeggiation) you can give the impression of playing multiple notes at once (though it's a poor substitute for true chords) and by varying the duty cycles of arpeggios you can get a very limited impression of volume attenuation.

Playing back pre-recorded sampled sound over the PC speaker was less common, but still a well understood technique. As well as a square wave mode, the PIT has a "one shot" mode where the output goes low while the counter counts down and then goes high waiting for further input. By loading new values into the PIT's count register regularly, Pulse Width Modulation can be achieved. I remember being stunned the first time I heard this being done on the family Amstrad PC1512 - a crystal-clear (for the time) 6 bits of dynamic range at a sample rate of 18.6kHz! The "regularly" bit is the problem, though. Most (if not all) PC speaker software using this technique used a hardware timer interrupt (IRQ0, PIT channel 0) to trigger the CPU regularly to reload the channel 2 count with the next sample. However, interrupts on a 4.77MHz 8088 are really slow, and when playing back samples at ~20kHz there is basically no time for anything else (like the mixing required for a MOD player). All you can do is play back pre-rendered sound, and 100 seconds of 16.6kHz audio would have been 1.6MB of data - way too much for our 360kB disk space budget even with pklite helping out, not to mention our 640kB of RAM.
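The numbers in this paragraph can be reproduced with a short Python sketch. The function names are invented for illustration, and the calculation assumes one byte per pre-rendered sample.

```python
PIT_HZ = 105_000_000 / 88          # about 1.193MHz

def pwm_sample_rate(bits):
    """Highest sample rate at which every pulse width of the given
    bit depth still fits within one sample period."""
    return PIT_HZ / (1 << bits)

def prerendered_bytes(seconds, rate_hz):
    """Storage needed for pre-rendered audio at one byte per sample."""
    return seconds * rate_hz

six_bit_rate = pwm_sample_rate(6)                  # roughly 18.6kHz
tune_bytes = prerendered_bytes(100, 16_600)        # roughly 1.6MB
```

Both the 360kB disk budget and the 640kB of RAM are far smaller than `tune_bytes`, hence the need to mix in real time.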

However, since we were targeting a specific CPU (and a specific clock speed) for this demo, we had a way of timing things without the timer interrupt - counting cycles. Conceivably, the same techniques could be used in a more portable player - having several predefined routines for the most common slow CPUs (and an IRQ-driven version for faster CPUs), or calibrating the speed on startup.

When writing highly-optimized 8088 code, there are two techniques which are particularly important - loop unrolling and self-modifying code. There is a tension between these two techniques, though - the more unrolling you do the lower your looping overhead but the more code you have to self-modify. Galaxy player plays these two techniques off each other quite nicely - picking an unrolled loop size which minimizes the total time the routine takes. Unfortunately this technique isn't a good fit for CPU-timed PC speaker audio because the samples aren't generated sequentially - it first renders a frame of samples for channel 0, then adds in a frame of samples for channel 1 and so on. It might be possible to statically intersperse the PIT count "out" instructions into the mixing code but that's really difficult code to write (it really needs to be written by a sophisticated tool that has knowledge of the exact 8088 timings to minimize jitter). Perhaps for a future project...

After playing about with (and measuring the execution speed of) a whole lot of different routines, I settled on one that doesn't unroll the sample loop at all (but does completely unroll the channel loop). The mixing code itself looks like this:

  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al

This code can't play back arbitrary MODs because it has a limitation on sample length - rather than being of arbitrary length, samples are all exactly 256 bytes long. The idea is that samples repeat with one oscillation (i.e. an up and a down) over those 256 bytes. With a 16.6kHz output sample rate that translates to an output frequency range of 0 to 8.3kHz with a frequency resolution of 0.25Hz. So its capabilities are somewhat closer to the C64's SID chip than the Amiga's Paula (in fact we codenamed the routine SID during development - we had others codenamed VIC and Paula which may get an airing in the future).

Our "SID" also has no volume tables - the volumes have to be baked into the samples themselves, so there may need to be several copies of any given sample at different volume levels (one for each volume level at which the sample is played). If we have too many samples, volume can be quantized (the program that converts from the source .mod to the player's internal format does the volume baking and optimizes the number of quantization levels). If we can squeeze it into 288 CPU cycles (spoiler: we can!), that's 72 PIT cycles which divides nicely into 4 - each channel's samples go from 1 to 18, and the final sample goes from 4 to 72 (we have to take care not to program 0 into the PIT or it'll count down for 65536 cycles and we'll get no audio for about 55ms). It also works out well for DRAM refresh - using the default DRAM refresh period of 18 we'll get exactly four refresh bus cycles per sample, and they'll be in the same places in the execution of the code every sample, so won't cause any jitter.

The four registers bp, si, di and dx each hold the respective channel's current position within the waveform (as an 8.8 bit fixed point number). The "9999"s are the respective channels' frequencies. The higher the frequency, the further the position gets advanced each sample (direct digital synthesis - similar to that which I used in the Physical Tone Matrix). These frequency values are modified right in the code itself during runtime (self-modifying code). This reduces register pressure (which is important as there are not a lot of registers to spare!)
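To make the fixed-point arithmetic concrete, here is a small Python sketch of how a note frequency maps to one of those 16-bit increments. The exact clock value (315/22 MHz divided by 3) is an assumption about the machine's crystal, and `increment_for` is a made-up helper name, not something from the converter.

```python
CPU_HZ = 315_000_000 / 22 / 3      # assumed exact value of the "4.77MHz" clock
CYCLES_PER_SAMPLE = 288
SAMPLE_HZ = CPU_HZ / CYCLES_PER_SAMPLE     # about 16.6kHz

def resolution_hz():
    """Smallest frequency step: an increment of 1 in the 8.8 phase."""
    return SAMPLE_HZ / 65536

def increment_for(note_hz):
    """Phase increment that plays a 256-byte, one-cycle waveform at note_hz."""
    return round(note_hz * 65536 / SAMPLE_HZ)

a440 = increment_for(440.0)
```

An increment of 32768 would step through half the waveform per sample, which is the 8.3kHz upper limit mentioned earlier.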

Similarly, the "99"s (also patched at runtime) are the waveform numbers for each channel. The 256x256 waveform table is turned "sideways" (the low byte of the address is the waveform number and the high byte is the position) in order to avoid having to shift the high 8 bits from the position register to the low 8 bits of the sample pointer.
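The following Python sketch models one pass through the mixing code with the sideways table layout. It is an illustration under simplifying assumptions (a dictionary of waveforms as input, lists instead of registers), and the names `make_table` and `mix` are hypothetical.

```python
WAVE_LEN = 256

def make_table(waveforms):
    """Lay waveforms out sideways: address = position * 256 + waveform number."""
    table = bytearray(WAVE_LEN * 256)
    for number, wave in waveforms.items():
        for position, value in enumerate(wave):
            table[(position << 8) | number] = value
    return table

def mix(table, phases, freqs, waves):
    """Produce one output sample from four channels."""
    total = 0
    for ch in range(4):
        phases[ch] = (phases[ch] + freqs[ch]) & 0xFFFF   # add bp,9999
        address = (phases[ch] & 0xFF00) | waves[ch]      # mov bx,bp / mov bl,99
        total += table[address]                          # mov or add al,[bx]
    return total & 0xFF                                  # value for port 0x42

# A square wave using the per-channel range 1..18 as waveform number 1.
square = [18] * 128 + [1] * 128
table = make_table({1: square})
phases = [0, 0, 0, 0]
outputs = [mix(table, phases, [256] * 4, [1] * 4) for _ in range(128)]
```

Because the waveform number sits in the low byte of the address, the high byte of the phase can be used directly as the position with no shifting.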

Now, if we run this code in a loop we'll get a chord playing from the PC speaker. But only a single chord - we want to play a whole song, where the note frequencies and waveform numbers change potentially 50 times per second. So we need to add a way to patch the frequencies and waveform numbers in the code. The fastest way to do that is to use the stack as our stream of patch data:

  pop bx
  pop word[cs:bx]

This means we need to run with interrupts disabled, but we need to do that anyway - the delays introduced by the timer interrupt would cause massive audio quality degradation.

Now we have enough ingredients to play an actual tune, but not a long one! If we're pulling 4 bytes off the stack for every sample we play, we're going to run out of stack in under a second. We might as well just pull preprocessed samples from the stack if we're going to do that - it would last 4 times longer!

However, most of the time we don't need to patch anything - we just want to leave the sample playing until we next want to change something, and then we probably want to change everything at once after 20ms (331 samples). So we want some kind of loop that counts down and then we do the patching once it reaches zero:

loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop loopTop
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

The "99" in the "mov cl,99" instruction is (you guessed it) another value that is patched at runtime.

This code takes exactly 288 cycles to run in the case where we're patching, but it's quite a bit shorter in the no-patch case. We need it to run at 288 cycles every iteration (patch or no patch) to keep the samples coming regularly. Fortunately, there's a nice place to stick some NOPs where they will be executed only in the non-patch case, making two loops that mutually overlap without either being nested within the other:

v:
  times 15 nop
loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop v
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

That's it - that's the entire inner loop as it is when the CPU executes its first instruction. All the remaining magic is in the data that's pointed to by the stack pointer. Let's think about how fast we're burning through that data now. 50 times a second we need to patch (worst case) 10 locations (4 frequencies, 4 sample numbers and the loop counter twice). That's 40 bytes, 50 times per second or 2000 bytes per second. That means we get through our 64kB of stack data in less than 33 seconds. That's better than 1 second, but still too short for our song by a factor of 3. We could use a larger stack, but then we'd need to update our stack segment somehow every 33 seconds at least. There's no code to do that, though, and nowhere to put such code that wouldn't execute far too often.
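The stack budget figures can be checked with a few lines of Python. The exact clock value is an assumption, and the names are made up for this sketch.

```python
CPU_HZ = 315_000_000 / 22 / 3      # assumed exact value of the "4.77MHz" clock
SAMPLE_HZ = CPU_HZ / 288
STACK_BYTES = 64 * 1024
TICKS_PER_SECOND = 50

def seconds_of_stack(bytes_per_second):
    """How long one 64kB stack segment lasts at a given consumption rate."""
    return STACK_BYTES / bytes_per_second

# One 4-byte patch (address + value) after every single sample.
patch_every_sample = seconds_of_stack(4 * SAMPLE_HZ)

# Worst case of 10 patches, 4 bytes each, on every 20ms tick.
patch_every_tick = seconds_of_stack(10 * 4 * TICKS_PER_SECOND)

samples_per_tick = SAMPLE_HZ / TICKS_PER_SECOND
```

Neither rate gets anywhere near the 100 seconds the tune needs from a single segment.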

Or is there? Take another look at those 15 NOPs at the start of the routine. If you squint a bit, don't they sort of look like a blank canvas just waiting to be painted with some amazing work of art? (No? Maybe it's just me then). Yes, if we keep the "mov cl,99" line patched to be "mov cl,1" we can patch as many times as we like without the code between v and loopTop being executed at all, which means that we can patch some code into there and then switch the CL value to 2 for a sample in order to execute it. We have to make sure that these little "patched routines" (I call them v-instructions, hence the label) take exactly the same time as the 15 NOPs. This turns out to be possible for all the v-instructions we need for 8088 MPH. However, some of them are *smaller* than the 15 NOPs (in particular those which use some of their bus cycles to access memory or IO ports instead of fetching instruction bytes). This means we need to jump to a different place in the code to start them - that's easy enough, though, we can just patch the destination byte of the "loop v" instruction the same way we're patching everything else (almost half the bytes in this little routine get patched at some point!)

The next thing to notice is that the set of v-instructions that we can execute form a small (albeit verbose) bytecode interpreted language - we can do whatever we like in there provided we meet the space and time requirements. Our little mod player has become Turing complete! In particular, we can modify the stack pointer in order to do loops. That means instead of being hundreds of kilobytes, our v-instruction program can be relatively small (the one in 8088 MPH is just 652 bytes long). It's tricky to write, because it needs to have inside it the locations of the various points within the program that we need to patch, which might move around as we debug things. So rather than writing them directly I wrote some assembler macros to generate them for me. Oh, and because a v-instruction is made up of several CPU instructions, I ended up writing a sort-of mini assembler in the assembler's macro language! Here is one of the v-instructions from the CRTC update v-instruction routine in 8088 MPH:

  forget 7
  startAt 4
  w 0x3d4
  w 0x990d
  runV 1

This translates to the 8088 instructions:

  mov bx,dx
  mov dx,0x03d4
  mov ax,0x990d
  out dx,ax
  mov dx,bx

The "0x99" (AH value) is, you guessed it, patched to the desired CRTC start address byte (yes, I used self-modifying code in the v-instructions as well as in the 8088 instructions).

The main v-instruction routine runs 50 times per second and updates the frequency and waveform data in the actual mixing routine. It also fetches more song data from the pointer ES:0 (and increments ES). This means that we can burn through just 800 bytes of data per second for our song instead of 2000, dramatically reducing the memory usage. Only 12 of the 16 bytes in the paragraph are used for the actual musical data, the other 4 bytes hold the address of another subroutine to set the stack pointer to, and an argument for that routine. These "h-instruction" subroutines do things like printing a character on the screen, changing the cursor position for the print routine, changing the CRTC start address (for hardware scrolling) and finishing the routine (by patching the final "jmp loopTop" instruction). So there's 3 layers of code here: the actual 8088 mixer code, the v-instruction code in the stack and the h-instruction code interspersed into the song data.

Somewhat surprisingly, there's still plenty of v-instruction time left - even the longest of these takes only 288 samples of the 331 available. So the routine actually ends up using only about 87% of the available time (less when just scrolling slowly without printing any extra characters). Getting the routine to do more is tricky, though - using more than 4 bytes per 20ms frame for the non-song stuff would probably involve moving through the song data at a non-integer number of segments per frame. Also, the fact that the only persistent registers available to v-instructions are ES and AH (though AL and BX can be stomped) makes writing new v-instructions tricky. More can certainly be done, though.

You might notice that the "mov cl,99" instruction directly follows the instruction that patches it ("pop word[cs:bx]"). This means that the 8088 will execute the unpatched version of the instruction (as it's already in the prefetch queue by the time it's patched). The v-instruction program generation macros take this into account. In theory this instruction could be anywhere between the "loop v" instruction and the "jmp loopTop" instruction, but in practice if I put it anywhere except where it is, the routine ends up taking more than 288 CPU cycles. For debugging in DOSBox (which executes the patched version) I have a debug switch which moves the instruction before the pop (the timing is wrong on DOSBox anyway, but I can at least test functional changes that way). Debugging is still tricky, though - breakpoints don't work so well when your entire program is executed from the same 15 byte memory region!

Some pre-processing of the source .mod file is performed on a modern computer before transferring it to the old PC - resampling the looping samples to 256 bytes and changing the note and effect data into frequencies and waveform numbers in the 800 bytes per second format. The .mod interpreter is based on an older version of PT2PLAY. The source is here and there is a compiled binary for Win32 here. Only mode 1 works at present. All Protracker effects should work with the exception of 9xx (set sample offset), E0x (hardware filter), E9x (retrigger) and CIA timer modes. The latter could be accomplished to some extent by adjusting the number of samples per tick.

As well as generating the data for the routine, mod_convert generates a .wav so that musicians can hear how their work sounds on the PC without actually having to have one. To do this, the program generates a 1-bit waveform at 1.193MHz and then resamples it to 44.1kHz, so even the high frequency carrier wave is reproduced. It's essentially a little emulator of the PC speaker circuit.
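The idea of emulating the speaker can be sketched in Python. This is a deliberately simplified model and not mod_convert's actual algorithm: it uses a plain box filter over a whole number of PIT ticks rather than a proper resampler to 44.1kHz, and the function names are invented.

```python
TICKS_PER_SAMPLE = 72   # PIT ticks in one 288-cycle sample period

def one_bit_stream(counts):
    """Expand PIT counts into a 1-bit signal at the PIT rate: the output
    is low while the counter runs and high for the rest of the period."""
    bits = []
    for n in counts:
        bits.extend([0] * n + [1] * (TICKS_PER_SAMPLE - n))
    return bits

def box_resample(bits, ticks_per_output):
    """Average consecutive groups of ticks into one output level each."""
    usable = len(bits) - len(bits) % ticks_per_output
    return [
        sum(bits[i:i + ticks_per_output]) / ticks_per_output
        for i in range(0, usable, ticks_per_output)
    ]

levels = box_resample(one_bit_stream([4, 72, 36]), TICKS_PER_SAMPLE)
```

Averaging over exactly one sample period recovers a level proportional to the pulse width, while a shorter window would let some of the carrier through.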

One nice feature of this routine that I didn't find a way to use in "8088 MPH" is ring modulation. If the second "mov bl,99" instruction is patched to be "mov bl,al" then the output from the first channel is used as the second channel's waveform number. If the waveform numbers in slots 1..18 are all the same basic waveform multiplied by a suitable set of amplitudes, then the output of the second channel is ring modulated by the first!

One other nice little finishing touch with the credits part is the way that it exits to DOS at the end, with the "A:\>_" prompt right after. This is a fully functioning DOS prompt, not a mockup. The idea is that the "What can you do?" line is immediately followed by the prompt to actually do it! In order to get this to work, the screen had to be in the "unscrolled" location when the demo exits (DOS and the BIOS scroll the screen by shifting video RAM data, not by changing the CRTC start address). That's easy enough then, just subtract the number of character positions that we scroll from 8192 and program that value into the start address initially. We also need to move our initial write pointer and VileR's awesome background 40-column ANSI art so that they start in the right place. I accomplished this without needing to add code to handle wrapping, by taking advantage of the fact that the CGA ignores address bit 14, causing physical addresses 0xB8000-0xBBFFF to be mirrored in the address range 0xBC000-0xBFFFF. This turns out to be another emulator-breaking change, though (at least in DOSBox)!
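The mirroring and the start address calculation can be modelled in Python as follows. The function names are hypothetical and the scroll distance in the example is arbitrary.

```python
CGA_BASE = 0xB8000
CGA_RAM_BYTES = 16 * 1024
CGA_RAM_WORDS = CGA_RAM_BYTES // 2      # 8192 character positions

def cga_offset(physical):
    """Offset into the card's 16kB of RAM; address bit 14 is ignored."""
    return (physical - CGA_BASE) & (CGA_RAM_BYTES - 1)

def initial_start_address(chars_scrolled):
    """Start address that ends up back at 0 after scrolling this far."""
    return CGA_RAM_WORDS - chars_scrolled

start = initial_start_address(2000)
final = (start + 2000) % CGA_RAM_WORDS
```

Writes that run past 0xBBFFF simply land at the start of video RAM, so the write pointer never needs explicit wrapping code.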

The source for the player itself can be found on my github.

1K colours on CGA: How it's done

Wednesday, April 8th, 2015

[Update: VileR's writeup of the 1K colour mode is now up. His has fewer technical details but is much easier to understand than mine as it has pictures!]

When displaying graphics on an original IBM Color Graphics Adapter (CGA), normally only 4 colours (from a palette of 16) are possible at once. A few games written for such systems took advantage of the artifacting on the card's NTSC composite output to get 16 colours at once. On Saturday, a team of people including myself, Trixter, Scali and VileR released a demo ("8088 MPH") which smashed this limit and won first place in the "Oldskool Demo" compo at the Revision 2015 demoparty in Saarbrücken, Germany. Some commenters have suggested that the production is a fake and that what we claimed to have done is impossible. Others have suggested it's dithered or flickered to get more colours. But it is none of these things. Here is how we did it.

First of all, what defines a colour on the composite output? There's only one signal line on a composite connection (plus a ground return path) so you can't have separate red, green and blue analog levels like you have on a VGA card (or separate red, green, blue and intensity lines like you have on an RGBI connection.) Instead, a composite signal effectively sequences the red, green and blue signals in time. A composite colour is a signal which repeats at a frequency of 3.57MHz (half of the width of a text character in 80-column mode). Given such a signal, you can compute its DC component (average voltage), oscillation amplitude (about this average) and phase (relative to the color burst pulse at the start of each scanline). These three parameters directly correspond to brightness (luminance), saturation and hue respectively. Higher frequencies (2nd and greater multiples of 3.57MHz) are not involved in colour decoding and would normally have been filtered out by the decoding circuitry in the composite monitors CGA cards would have been connected to in 1981.

The most common CGA composite mode works by putting the card in 1 bit-per-pixel (1bpp) mode - i.e. each pixel is either off (black) or on (white, generally, though this could be changed via the palette register). A single period of color carrier oscillation contains 4 pixels in this mode (the pixel rate is 14.318MHz), so there are 16 possible waveforms you can make with patterns of lit and unlit pixels and hence 16 artifact colours.
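To illustrate how a 4-pixel pattern turns into brightness, saturation and hue, here is a Python sketch that treats each pattern as an idealised 0/1 waveform and extracts its DC component and fundamental. It ignores real-world signal levels and filtering, and the hue is measured from an arbitrary reference rather than from the actual colour burst phase. The function name `decode` is made up.

```python
import cmath
import math

def decode(pattern):
    """Return (luma, saturation, hue_degrees) for a 4-bit pixel pattern.

    The most significant bit is the first pixel of the carrier period.
    """
    samples = [(pattern >> (3 - i)) & 1 for i in range(4)]
    luma = sum(samples) / 4                        # DC component
    fundamental = sum(
        s * cmath.exp(-2j * cmath.pi * i / 4) for i, s in enumerate(samples)
    ) / 4
    saturation = abs(fundamental)                  # oscillation amplitude
    hue = math.degrees(cmath.phase(fundamental)) % 360
    return luma, saturation, hue

colours = {p: decode(p) for p in range(16)}
greys = [p for p in range(16) if colours[p][1] < 1e-9]
```

In this idealised model four of the sixteen patterns (0000, 0101, 1010 and 1111) have no 3.57MHz component at all and so come out as greys.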

Separately to artifact colour, the CGA card has 16 "direct" colours (the ones that are available in text modes). These are just the 16 possible RGBI bit patterns on an RGBI output, but how does the card generate these colours on the composite output? It does so by generating 3.57MHz waveforms on the card at 6 different phases using flip-flops. These are the colours blue, green, cyan, red, magenta and yellow. Including the constant digital signals 0 and 1 (GND and +5V) gives 8 basic colours. To get the intense versions of these, an additional DC offset is applied when the digital signal is turned into an analogue one at the output.

The 1K colour trick hinges on noticing that the direct colours are not the same as the artifact colours. In 1bpp mode you can change the palette register to get different sets of artifact colours. Suppose you change the palette register to blue - then any black pixel will "turn off" the corresponding part of the "blue" waveform. These "chopped up" colours are different yet again from the 16 direct colours and the 16 normal artifact colours. So you could get 256 colours that way (though you can't put them wherever you like because there are limits to how often you can change the value in the palette register).

Suppose that in 1bpp mode you had a second palette register so that you could change the colour corresponding to the 0 bit as well as the one corresponding to the 1 bit. Then using the same techniques you could generate 2K colours (16 foreground colours, 16 background colours and 16 bit patterns for choosing which colour goes where - but swapping foreground and background and inverting the bit pattern yields the same colour). Here we come to the crucial part of the trick: in text mode you can kind of do that - the attribute byte for a character (when flashing is disabled) lets you choose the foreground and background colours independently. Unfortunately you don't get to choose the bit patterns you want - those are defined by the bits in the CGA's character ROM, which can't be changed from software.

VileR is the one who deserves credit for the next observation. He pointed out to me that the characters 'U' (capital letter U, 0x55) and '‼' (double exclamation mark, 0x13) both have bit patterns in their top two rows (11001100 and 01100110 respectively) which are the same for the left nybble as for the right nybble, and the same in both rows. Therefore, if we change the number of scanlines per character row to 2 (as is done in a number of other CGA games to get a 160x100x16 mode using a "vertical half solid" character - 0xDD or 0xDE) we should be able to get ~500 colours (2 useful characters times 16 foreground colours times 16 background colours) at a resolution of 80x100.

In order to get from there to 1024 colours we need to find some more characters with the same properties as 0x55 and 0x13. It would be fantastic if there happened to be for every nybble value X a character with that bit pattern in its top 4 nybbles, but unfortunately only the nybble patterns 1100 and 0110 are obtainable that way. However, if we consider just the top scanline instead of the top two, we find two more characters with the right property - '░' (light shade, 0xB0, bit pattern 00100010) and '▒' (medium shade, 0xB1, bit pattern 01010101). Unfortunately the second scanlines of these characters don't play ball, and if we tried to use them with 2 scanlines per row we'd get horizontal stripes instead of solid-coloured pixels.

So to get those extra colours we need to use 1 scanline per row. However, there are several complications in doing so. One is that the CRTC on the CGA card (Motorola MC6845) cannot generate more than 128 rows (plus up to 32 extra scanlines) per frame and we need to generate 262 scanlines per frame in order to maintain the correct ratio of hsync pulses to vsync pulses that the monitor requires to generate a stable picture.

It is possible to do this, though, by generating multiple CRTC frames per CRT frame (and suppressing the vsync pulse for all but one of them). This is how we generated the wide picture before the credits part (the one with our faces) - in that image there's a 100 scanline frame with 1 scanline per row and immediately below it a 162 scanline frame with 2 scanlines per row.

But there were several 1K colour images in the demo that filled the entire screen - how did we do those? The answer is very similar but instead of having one frame 100 scanlines high, we have 100 frames that are 2 scanlines high (all with 1 scanline per row). In the middle of each of these frames the memory address is advanced by one row by the CRTC. In each frame we advance the CRTC start address register by one row's worth of characters, so that the top row of one frame is the same as the bottom row of the frame above it. So each frame straddles two pixel rows and each 2-scanline-high "pixel" straddles two CRTC frames.
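To make the straddling concrete, here is a minimal Python sketch of which row of video memory ends up on each scanline of the 200-line picture area. The names are hypothetical and it only models the address arithmetic described above, not the actual CRTC register programming:

```python
CHARS_PER_ROW = 80        # 80-column text mode
NUM_FRAMES = 100          # CRTC frames making up the picture area
SCANLINES_PER_FRAME = 2   # each CRTC frame is 2 scanlines high


def frame_start_address(frame):
    """Start address (in characters) for a given CRTC frame: one row per frame."""
    return frame * CHARS_PER_ROW


def memory_row_for_scanline(scanline):
    """Row of video memory displayed on a given scanline of the picture area."""
    frame, line_in_frame = divmod(scanline, SCANLINES_PER_FRAME)
    # With 1 scanline per row, the CRTC steps on by one row for the
    # second scanline of the frame.
    address = frame_start_address(frame) + line_in_frame * CHARS_PER_ROW
    return address // CHARS_PER_ROW


rows = [memory_row_for_scanline(s) for s in range(NUM_FRAMES * SCANLINES_PER_FRAME)]
print(rows[:6])  # [0, 1, 1, 2, 2, 3]
```

In this model every row except the first and last is shown on two consecutive scanlines, one from the bottom of a CRTC frame and one from the top of the next.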

So we're done, right? That's all there is to the trick? Well, not quite - there are more complications. If you do the obvious thing and set 80-column text mode, colour burst enabled via the BIOS, you will see either no colours at all on your composite display or colours that flash in and out and change hue (on monitors that don't have a properly functioning colour-killer circuit). The reason for this is that the CGA card was never designed to be used in 80-column text mode with composite colour display (the text doesn't have enough horizontal resolution to be readable) and there's a hardware bug that prevents it from working properly anyway.

The bug is that the CGA card takes the horizontal sync (hsync) signal from the CRTC (which just goes high and low once per scanline) and uses it to trigger a more complicated composite pulse signal consisting of front porch, sync, breezeway, color burst and back porch. The whole process takes 10 character periods in modes other than 80-column text (-HRES modes) so the BIOS programs the CGA's hsync width register to 10. But in +HRES (80-column text) mode these 10 characters are half the width, so the hsync process gets interrupted half way through leading to a truncated sync pulse and no burst at all.

This is well-known and the usual way of dealing with the problem is to set the border colour (palette register) to 6 (dark yellow - not brown as it is on the 5153 RGBI TTL monitor) so that the monitor picks up its color burst from the border instead. However, on our hardware we found that doing this made +HRES modes significantly darker than -HRES modes. This is because monitors and capture devices calibrate their gain to normalize the amplitude of the burst pulse, and colour 6 is brighter than the normal burst pulse. Not all of the demo uses +HRES mode and we found that we could not use a single set of calibration settings for both -HRES and +HRES parts - if we tried then either the +HRES parts were too dark or the -HRES parts were washed out, leaving colours 9-15 barely distinguishable shades of white. We didn't really want to have to edit our capture to brighten up just some parts of the demo. Another problem was that both of the capture devices we had brought with us to the party were giving a shimmery picture (unstable horizontal sync) with this fix.

Instead what we ended up doing is leaving the border colour as black but increasing the horizontal sync width to its maximum value of 16 characters (programmed as zero, which looks wrong, but it's a 4-bit register and the compare is done after the increment - at least on the MC6845 CRTCs on the CGA cards we were using). This gives a burst of either half or three-quarters the standard width (depending on whether the character it starts on corresponds to a rising or falling edge of the CGA's internal +LCLK signal that is used to time the hsync sequence). I think we managed to arrange it so that it's always three-quarters but there may be bugs in that part of the code.
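The arithmetic behind the truncated pulse can be sketched in a few lines of Python. The hdots-per-character figures (16 in -HRES modes, 8 in +HRES) are my assumption here rather than something stated above, and the function names are just for illustration:

```python
HDOT_HZ = 14.318e6                           # CGA pixel clock, 4x the colour carrier
HDOTS_PER_CHAR = {"-HRES": 16, "+HRES": 8}   # assumed character widths in hdots


def effective_sync_width(programmed):
    """Sync width in characters for a 4-bit register value (0 behaves as 16)."""
    return programmed if programmed != 0 else 16


def sync_duration_us(mode, programmed):
    """Length of the CRTC hsync pulse in microseconds."""
    hdots = effective_sync_width(programmed) * HDOTS_PER_CHAR[mode]
    return hdots / HDOT_HZ * 1e6


standard = sync_duration_us("-HRES", 10)
print(round(sync_duration_us("+HRES", 10) / standard, 2))  # 0.5
print(round(sync_duration_us("+HRES", 0) / standard, 2))   # 0.8
```

This only models the total length of the window the CRTC provides. How much of the burst fits into that window depends on where the sequence starts relative to +LCLK, which the sketch does not attempt to model.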

That fixes the brightness problem but unfortunately some capture devices (including the one that Trixter used to do some test/failsafe captures before the party) cope less well with this than with the border colour 6 change. If we release a "final version" (with a few minor improvements and bug fixes) we might include a "calibration screen" that people can use to choose the border colour, hsync width and phase that works best with their output device.

Yet another complication is that there were multiple revisions of the IBM CGA card. They had (to a good approximation) the same standard composite artifact colours but different direct colours. On the older CGA cards, colours 1-6 were all the same brightness, as were colours 9-14. This made them indistinguishable on monochrome composite monitors, so for the second revision of the CGA card, IBM added some more resistors to the output DAC in order to make different colours different brightnesses. They also removed the -BLANK signal so that the burst pulse is the same amplitude no matter whether it comes from border colour 6 or from the hsync burst (the truncated burst problem is still present, though).

Different direct colours mean that our 1K colour mode displays a different set of colours on old CGA cards than on new CGA cards. We debated a bit about whether we should target old or new CGA cards for our demo, but in the end we decided to go for old CGA cards, mainly because the set of colours you can get from an old CGA card is more useful (artistically speaking) than the set from a new CGA card.

In order to make the hand-drawn 1K colour pictures, VileR and I made some test captures of the old CGA card's output with all the useful combinations of attribute and character, which he then used as a palette to paint his pictures. Happily he was able to find in there some close correspondences to the 16 RGBI colours and the 16 colours of the Commodore 64's palette.

For the pictures that were converted from photographs, I wanted to be able to use more characters than just 0x13, 0x55, 0xB0 and 0xB1 - I wanted to be able to try all different characters (even those that have different left and right nybbles in their top scanline) to get a closer match to the source image. However, getting calibration images for all 65536 combinations (let alone the 4 billion artifacts that can be generated from adjacent characters) was impractical. To make that work, I really needed to have a mathematical model of the CGA's composite output stage that I could use to generate the right colours. Ideally I would be able to generalize this to new CGA as well.

My first attempt at this was the one I used for the Hydra image - I assumed that the direct colours had hue/phase angles that were multiples of exactly 45 degrees, and that the CGA's pixel colour multiplexer chip was able to switch instantaneously between them. However, the hydra didn't come out looking how I expected on real hardware. Much later, I learnt that the main reason for this is that the TTL logic chips used on the CGA card don't switch instantaneously - there are logic delays between a signal coming in to an input pin and the corresponding change happening on the output pin. When your color carrier period is 279ns, a delay of just 7ns causes a noticeable phase shift of 9 degrees.
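The delay-to-phase conversion is just a proportion of the carrier period. A minimal Python sketch (the function name is made up, and the carrier frequency is the standard NTSC value):

```python
COLOUR_CARRIER_HZ = 3.579545e6  # NTSC colour subcarrier frequency


def phase_shift_degrees(delay_ns):
    """Hue/phase error caused by a logic delay of delay_ns nanoseconds."""
    period_ns = 1e9 / COLOUR_CARRIER_HZ  # about 279ns
    return 360.0 * delay_ns / period_ns


print(round(phase_shift_degrees(7)))  # 9
```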

There are several logic chips on the various signal paths of interest here, all with their own logic delays. My second attempt at modelling the CGA involved looking up the data sheets for all these chips, finding typical values for the logic delays (most of them were listed as a range) and generating an accurate model that way. This worked excellently for 1bpp mode, reasonably well for 2bpp mode, and not so well at all for +HRES mode. This is the implementation that is in the current SVN versions of DOSBox. I kept adding more and more parameters to my model and attempted to tune them to match my captured calibration images but I could not get good results that way. The trouble seemed to be in the guts of the multiplexer chip itself - the output signal depends in a complicated and mysterious way on all of the input signals, so the number of parameters required to describe its behavior quickly becomes impractical.

The final breakthrough came when I realized that I didn't need to model the composite signal *exactly* - I just needed to model it well enough to describe the observed colours. All the relevant colour information is at frequencies below 7.16MHz. By the sampling theorem, if we can reconstruct a version of the signal sampled at 14.318MHz, it'll be exactly correct not including frequencies at or above 7.16MHz (which we don't care about). The key insight is that we don't care about what happens to the signal *in between* those samples - it can bounce around, transition as slow or fast as you like as long as we know where it ends up when we measure the sample - all that extra freedom just manifests in the frequencies we don't care about.

The multiplexer takes a while to transition from one colour to another - on the order of 70ns (one 1bpp pixel time). So there isn't a place in the signal that we can sample and be sure that the previous transition has stopped and the next transition has not yet started. I theorized that at any given time there will not be more than one transition taking place. So a transition (and hence a sample) can be completely described by 1024 parameters - one for each combination of left colour (16 possibilities), right colour (16 possibilities) and phase within the color carrier cycle (4 possibilities).
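In other words, the model boils down to a table lookup per sample. Here is a rough Python sketch of that idea. The index layout, the way the phase is derived from the sample position and the placeholder table are all assumptions for illustration, not the layout CGA2NTSC actually uses:

```python
COLOURS = 16
PHASES = 4  # 14.318MHz samples per colour carrier cycle


def table_index(left, right, phase):
    """Index of the parameter for a left-colour/right-colour/phase combination."""
    return (left * COLOURS + right) * PHASES + phase


def composite_samples(colours, table):
    """Look up one composite sample per transition between adjacent pixel colours.

    colours: direct colour (0-15) for each 1bpp pixel time along a scanline.
    table:   1024 sample values, indexed as in table_index().
    """
    samples = []
    for i in range(1, len(colours)):
        phase = i % PHASES  # assumed alignment of sample position to carrier phase
        samples.append(table[table_index(colours[i - 1], colours[i], phase)])
    return samples


placeholder_table = list(range(COLOURS * COLOURS * PHASES))  # stand-in values
print(len(placeholder_table))  # 1024
print(composite_samples([0, 0, 15, 15], placeholder_table))  # [1, 62, 1023]
```

A run of identical colours is treated as a "transition" from a colour to itself, which is why the table has entries where left and right are equal.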

I made a test pattern which makes a very good attempt at getting swatches of all 4096 foreground/background/pattern combinations in just a couple of screenfuls (some can only be obtained for a short stretch, as transitions). (This was quite a feat in itself - I needed an area of screen consisting of scanline 3 of several characters repeating vertically, necessitating having four CRTC scanlines within a single CRT scanline - a hairier CRTC manipulation trick than any that we actually used in 8088 MPH itself!)

I set up a model with these parameters and tried to match it to my captures. I initially tried to use a gradient-ascent hill-climbing algorithm to search the parameter space but before I could get it to work I realized that most of the parameters affected very few of the test swatches - any transition between two different colours X and Y can only affect swatches with X and Y as foreground and background colours (32 of the 4096). If the left colour and the right colour are the same then any swatch with that colour as either foreground or background can be affected (496 of the 4096). That observation made a more naive hill-climbing algorithm much more practical, and it just takes a few minutes to find a set of 1024 parameters that match the measured values to within a few percent.
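Those two counts are easy to confirm by brute force. A small Python sketch (the names are made up) that enumerates the swatches as (foreground, background, pattern) triples:

```python
from itertools import product

# Every swatch is a (foreground, background, 4-bit pattern) combination.
SWATCHES = list(product(range(16), range(16), range(16)))


def affected_swatches(left, right):
    """Swatches that a given left/right transition parameter could influence."""
    if left != right:
        # Both colours must be present, one as foreground and one as background.
        return [s for s in SWATCHES if {s[0], s[1]} == {left, right}]
    # A same-colour "transition" can occur wherever that colour is used at all.
    return [s for s in SWATCHES if left in (s[0], s[1])]


print(len(SWATCHES))                 # 4096
print(len(affected_swatches(1, 2)))  # 32
print(len(affected_swatches(3, 3)))  # 496
```

This follows the counting in the text, so it gives an upper bound on what a parameter can touch rather than checking which bit patterns actually contain the transition.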

I wanted to model the new CGA as well as the old CGA, but didn't have an easy way to get good captures for that. However, the difficult bit of the old CGA to model (the multiplexer) is identical in the new CGA. So instead of applying the technique described above to the final output, I applied it to the multiplexer output and the intensity bit output separately. This yields a 256-element table and a 16-element table respectively. The outputs from these are summed to get the final output. There was a small amount of degradation from the 1024-element table version but it's too small to notice directly. To generate new CGA output, I just duplicated the intensity bit logic and applied it to the R, G and B bits (with appropriate scaling based on the resistor values in IBM's schematics). I haven't yet tested if this really matches new CGA output, but I don't currently know of any reason why it wouldn't.
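The table sizes fall out of splitting each 4-bit colour into its RGB bits and its intensity bit. The Python sketch below shows one way the two lookups could be combined. The index layout, the assumption that intensity is bit 3 of the colour number and the placeholder tables are all mine, for illustration only:

```python
PHASES = 4


def split_colour(colour):
    """Split a 4-bit colour into (rgb, intensity), assuming intensity is bit 3."""
    return colour & 7, colour >> 3


def old_cga_sample(left, right, phase, mux_table, intensity_table):
    """Sum the multiplexer and intensity contributions for one sample."""
    left_rgb, left_i = split_colour(left)
    right_rgb, right_i = split_colour(right)
    mux = mux_table[(left_rgb * 8 + right_rgb) * PHASES + phase]
    intensity = intensity_table[(left_i * 2 + right_i) * PHASES + phase]
    return mux + intensity


mux_table = [0.0] * (8 * 8 * PHASES)        # 256 entries, placeholder values
intensity_table = [0.0] * (2 * 2 * PHASES)  # 16 entries, placeholder values
print(len(mux_table), len(intensity_table))  # 256 16
```

The real tables would be filled in by the fitting process described above. The zeros here just keep the sketch runnable.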

This CGA simulation algorithm is implemented in a program I made called CGA2NTSC which has two main functions. One is to act as a CGA composite output stage emulator and NTSC decoder (taking a picture such as one might find on an RGBI monitor and showing what it would look like on the composite output of old and new CGA). The other is to take a (24-bit colour) input image and try to find a set of data which, when loaded into the CGA's video memory, will best reproduce that image on the composite output. This is what we used to make the faces picture in the demo. It supports 1bpp and 2bpp modes as well as both text modes (though we only used it in +HRES mode for 8088 MPH). The program uses error diffusion (which can be turned down or off). I've had a couple of requests to make it use ordered dithering instead. That's possible for 1bpp and 2bpp modes but doesn't really make sense for text modes where you don't get an arbitrary choice of bit patterns. The program is a bit unpolished but should be reasonably usable.

Next time: how I played a 4 channel MOD at a sample rate of 16.6kHz through the PC speaker on a 4.77MHz 8088 CPU.