[Update: VileR's writeup of the 1K colour mode is now up. His has fewer technical details but is much easier to understand than mine as it has pictures!]
When displaying graphics on an original IBM Color Graphics Adapter (CGA), normally only 4 colours (from a palette of 16) are possible at once. A few games written for such systems took advantage of the artifacting on the card's NTSC composite output to get 16 colours at once. On Saturday, a team of people including myself, Trixter, Scali and VileR released a demo ("8088 MPH") which smashed this limit and won first place in the "Oldskool Demo" compo at the Revision 2015 demoparty in Saarbrücken, Germany. Some commenters have suggested that the production is a fake and that what we claimed to have done is impossible. Others have suggested it's dithered or flickered to get more colours. But it is none of these things. Here is how we did it.
First of all, what defines a colour on the composite output? There's only one signal line on a composite connection (plus a ground return path) so you can't have separate red, green and blue analog levels like you have on a VGA card (or separate red, green, blue and intensity lines like you have on an RGBI connection). Instead, a composite signal effectively sequences the red, green and blue signals in time. A composite colour is a signal which repeats at a frequency of 3.57MHz (one period is half of the width of a text character in 80-column mode). Given such a signal, you can compute its DC component (average voltage), oscillation amplitude (about this average) and phase (relative to the color burst pulse at the start of each scanline). These three parameters directly correspond to brightness (luminance), saturation and hue respectively. Higher frequencies (2nd and greater multiples of 3.57MHz) are not involved in colour decoding and would normally have been filtered out by the decoding circuitry in the composite monitors CGA cards would have been connected to in 1981.
The most common CGA composite mode works by putting the card in 1 bit-per-pixel (1bpp) mode - i.e. each pixel is either off (black) or on (white, generally, though this could be changed via the palette register). A single period of color carrier oscillation contains 4 pixels in this mode (the pixel rate is 14.318MHz), so there are 16 possible waveforms you can make with patterns of lit and unlit pixels and hence 16 artifact colours.
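To make the link between bit patterns and colours concrete, here is a minimal Python sketch that takes each of the 16 possible 4-pixel patterns and computes the DC component, amplitude and phase of its 3.57MHz fundamental. It assumes idealised pixels (exactly 0.0 or 1.0, with instantaneous transitions), which real hardware only approximates, and the phase reference is arbitrary rather than measured against a real color burst. The names `decompose` and `artifact_table` are made up for this illustration.

```python
import math

def decompose(pattern):
    """Return (dc, amplitude, phase_degrees) for a 4-pixel 1bpp pattern.

    `pattern` is a 4-bit integer whose bit 3 is the leftmost pixel.
    Pixels are idealised as 0.0 (off) or 1.0 (on) - an assumption.
    """
    samples = [(pattern >> (3 - i)) & 1 for i in range(4)]
    dc = sum(samples) / 4  # average level: brightness
    # Single DFT bin at one cycle per 4 samples (the colour carrier).
    re = sum(s * math.cos(2 * math.pi * i / 4) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * i / 4) for i, s in enumerate(samples))
    amplitude = math.hypot(re, im) / 2  # oscillation amplitude: saturation
    # Phase is meaningless when there is no oscillation at the carrier.
    phase = math.degrees(math.atan2(im, re)) % 360 if amplitude > 1e-9 else None
    return dc, amplitude, phase

# One entry per possible pattern of lit and unlit pixels.
artifact_table = {p: decompose(p) for p in range(16)}
```

In this idealised model the patterns 0101 and 1010 come out with no amplitude at the carrier frequency at all, because their energy sits at twice the carrier frequency, which is not involved in colour decoding.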
Separately to artifact colour, the CGA card has 16 "direct" colours (the ones that are available in text modes). These are just the 16 possible RGBI bit patterns on an RGBI output, but how does the card generate these colours on the composite output? It does so by generating 3.57MHz waveforms on the card at 6 different phases using flip-flops. These are the colours blue, green, cyan, red, magenta and yellow. Including the constant digital signals 0 and 1 (GND and +5V) gives 8 basic colours. To get the intense versions of these, an additional DC offset is applied when the digital signal is turned into an analogue one at the output.
The 1K colour trick hinges on noticing that the direct colours are not the same as the artifact colours. In 1bpp mode you can change the palette register to get different sets of artifact colours. Suppose you change the palette register to blue - then any black pixel will "turn off" the corresponding part of the "blue" waveform. These "chopped up" colours are different yet again from the 16 direct colours and the 16 normal artifact colours. So you could get 256 colours that way (though you can't put them wherever you like because there are limits to how often you can change the value in the palette register).
Suppose that in 1bpp mode you had a second palette register so that you could change the colour corresponding to the 0 bit as well as the one corresponding to the 1 bit. Then using the same techniques you could generate 2K colours (16 foreground colours, 16 background colours and 16 bit patterns for choosing which colour goes where - but swapping foreground and background and inverting the bit pattern yields the same colour). Here we come to the crucial part of the trick: in text mode you can kind of do that - the attribute byte for a character (when flashing is disabled) lets you choose the foreground and background colours independently. Unfortunately you don't get to choose the bit patterns you want - those are defined by the bits in the CGA's character ROM, which can't be changed from software.
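The 2K figure can be checked by brute force. The sketch below (illustrative Python, with a made-up helper called `canonical`) enumerates every foreground/background/pattern triple and treats a triple as identical to the one with foreground and background swapped and the pattern inverted. It counts register settings only, so it says nothing about which of those settings look visibly different on a monitor.

```python
def canonical(fg, bg, pattern):
    """Pick one representative for a triple and its swapped/inverted twin."""
    swapped = (bg, fg, pattern ^ 0xF)  # swap colours, invert the 4-bit pattern
    return min((fg, bg, pattern), swapped)

combos = {
    canonical(fg, bg, pattern)
    for fg in range(16)
    for bg in range(16)
    for pattern in range(16)
}
print(len(combos))
```

No triple is its own twin (a 4-bit pattern never equals its inverse), so the 4096 triples pair off exactly.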
VileR is the one who deserves credit for the next observation. He pointed out to me that the characters 'U' (capital letter U, 0x55) and '‼' (double exclamation mark, 0x13) both have bit patterns in their top two rows (11001100 and 01100110 respectively) which are the same for the left nybble as for the right nybble, and the same in both rows. Therefore, if we change the number of scanlines per character row to 2 (as is done in a number of other CGA games to get a 160x100x2 mode using a "vertical half solid" character - 0xDD or 0xDE) we should be able to get ~500 colours (2 useful characters times 16 foreground colours times 16 background colours) at a resolution of 80x100.
In order to get from there to 1024 colours we need to find some more characters with the same properties as 0x55 and 0x13. It would be fantastic if there happened to be for every nybble value X a character with that bit pattern in its top 4 nybbles, but unfortunately only the nybble patterns 1100 and 0110 are obtainable that way. However, if we consider just the top scanline instead of the top two, we find two more characters with the right property - '░' (light shade, 0xB0, bit pattern 00100010) and '▒' (medium shade, 0xB1, bit pattern 01010101). Unfortunately the second scanlines of these characters don't play ball, and if we tried to use them with 2 scanlines per row we'd get horizontal stripes instead of solid-coloured pixels.
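The property being looked for is easy to state in code. This Python sketch checks the four top-scanline bit patterns quoted above and counts the resulting character/attribute combinations. It uses only those four patterns and does not contain the rest of the character ROM, and `TOP_ROWS` and `same_nybbles` are names invented for the example.

```python
# Top-scanline bit patterns for the four characters named in the text.
TOP_ROWS = {
    0x55: 0b11001100,  # 'U'
    0x13: 0b01100110,  # double exclamation mark
    0xB0: 0b00100010,  # light shade
    0xB1: 0b01010101,  # medium shade
}

def same_nybbles(row):
    """True if the left and right halves of an 8-pixel row are identical,
    i.e. the pattern repeats once per colour carrier cycle."""
    return (row >> 4) == (row & 0x0F)

usable = sorted(code for code, row in TOP_ROWS.items() if same_nybbles(row))
colour_count = len(usable) * 16 * 16  # characters x foreground x background
print([hex(code) for code in usable], colour_count)
```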
So to get those extra colours we need to use 1 scanline per row. However, there are several complications in doing so. One is that the CRTC on the CGA card (Motorola MC6845) cannot generate more than 128 rows (plus up to 32 extra scanlines) per frame and we need to generate 262 scanlines per frame in order to maintain the correct ratio of hsync pulses to vsync pulses that the monitor requires to generate a stable picture.
It is possible to do this, though, by generating multiple CRTC frames per CRT frame (and suppressing the vsync pulse for all but one of them). This is how we generated the wide picture before the credits part (the one with our faces) - in that image there's a 100 scanline frame with 1 scanline per row and immediately below it a 162 scanline frame with 2 scanlines per row.
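The scanline budget for that layout can be sanity-checked with a few lines of Python. This is a simplified sketch: it models only the 128-row limit and ignores the extra adjustment scanlines the CRTC can add, and the 81-row figure is simply 162 scanlines divided by 2 scanlines per row. The function and constant names are made up for the illustration.

```python
CRT_SCANLINES = 262   # scanlines per CRT frame the monitor expects
MAX_CRTC_ROWS = 128   # row limit of a single CRTC frame

def total_scanlines(crtc_frames):
    """Sum the scanlines of a list of (rows, scanlines_per_row) CRTC frames."""
    total = 0
    for rows, scanlines_per_row in crtc_frames:
        if rows > MAX_CRTC_ROWS:
            raise ValueError(f"{rows} rows exceeds the CRTC limit")
        total += rows * scanlines_per_row
    return total

# 100 rows of 1 scanline, then 81 rows of 2 scanlines (162 scanlines).
faces_layout = [(100, 1), (81, 2)]
print(total_scanlines(faces_layout))
```

A single CRTC frame of 262 one-scanline rows is rejected by the same check, which is why the picture has to be split.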
But there were several 1K colour images in the demo that filled the entire screen - how did we do those? The answer is very similar but instead of having one frame 100 scanlines high, we have 100 frames that are 2 scanlines high (all with 1 scanline per row). In the middle of each of these frames the memory address is advanced by one row by the CRTC. In each frame we advance the CRTC start address register by one row's worth of characters, so that the top row of one frame is the same as the bottom row of the frame above it. So each frame straddles two pixel rows and each 2-scanline-high "pixel" straddles two CRTC frames.
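A small simulation shows how the overlapping frames turn each row of video memory into a 2-scanline-high pixel. This Python sketch assumes 80 characters per row and a start address measured in characters. It models only the 200 visible scanlines and is not the demo's actual code.

```python
COLUMNS = 80  # characters per row in 80-column text mode

def rows_shown(frames=100):
    """Return, for each scanline, which row of video memory is displayed."""
    scanlines = []
    for frame in range(frames):
        start = frame * COLUMNS  # start address, advanced one row per frame
        scanlines.append(start // COLUMNS)              # first scanline of the frame
        scanlines.append((start + COLUMNS) // COLUMNS)  # CRTC has advanced one row
    return scanlines

shown = rows_shown()
print(shown[:8])
```

The bottom scanline of each frame and the top scanline of the next frame show the same memory row, so every row except the very first and very last appears on two consecutive scanlines.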
So we're done, right? That's all there is to the trick? Well, not quite - there are more complications. If you do the obvious thing and set 80-column text mode with colour burst enabled via the BIOS, you will see either no colours at all on your composite display or colours that flash in and out and change hue (on monitors that don't have a properly functioning colour-killer circuit). The reason for this is that the CGA card was never designed to be used in 80-column text mode with composite colour display (the text doesn't have enough horizontal resolution to be readable) and there's a hardware bug that prevents it from working properly anyway.
The bug is that the CGA card takes the horizontal sync (hsync) signal from the CRTC (which just goes high and low once per scanline) and uses it to trigger a more complicated composite pulse signal consisting of front porch, sync, breezeway, color burst and back porch. The whole process takes 10 character periods in modes other than 80-column text (-HRES modes) so the BIOS programs the CGA's hsync width register to 10. But in +HRES (80-column text) mode these 10 characters are half the width, so the hsync process gets interrupted half way through leading to a truncated sync pulse and no burst at all.
This is well-known and the usual way of dealing with the problem is to set the border colour (palette register) to 6 (dark yellow - not brown as it is on the 5153 RGBI TTL monitor) so that the monitor picks up its color burst from the border instead. However, on our hardware we found that doing this made +HRES modes significantly darker than -HRES modes. This is because monitors and capture devices calibrate their gain to normalize the amplitude of the burst pulse, and colour 6 is brighter than the normal burst pulse. Not all of the demo uses +HRES mode and we found that we could not use a single set of calibration settings for both -HRES and +HRES parts - if we tried then either the +HRES parts were too dark or the -HRES parts were washed out, leaving colours 9-15 barely distinguishable shades of white. We didn't really want to have to edit our capture to brighten up just some parts of the demo. Another problem was that both of the capture devices we had brought with us to the party were giving a shimmery picture (unstable horizontal sync) with this fix.
Instead what we ended up doing is leaving the border colour as black but increasing the horizontal sync width to its maximum value of 16 characters (programmed as zero, which looks wrong, but it's a 4-bit register and the compare is done after the increment - at least on the MC6845 CRTCs on the CGA cards we were using). This gives a burst of either half or three-quarters the standard width (depending on whether the character it starts on corresponds to a rising or falling edge of the CGA's internal +LCLK signal that is used to time the hsync sequence). I think we managed to arrange it so that it's always three-quarters but there may be bugs in that part of the code.
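The "programmed as zero means 16" behaviour falls out of a toy model of a 4-bit counter that is incremented first and compared afterwards. The Python below is only a sketch of that described behaviour, not a cycle-accurate MC6845 model, and the function names are invented for the example.

```python
def hsync_width_in_chars(register_value):
    """Count character clocks until a 4-bit counter matches the register.

    Assumes increment-then-compare, as described for the CRTCs we used.
    """
    counter = 0
    chars = 0
    while True:
        counter = (counter + 1) & 0x0F  # 4-bit counter wraps from 15 to 0
        chars += 1
        if counter == (register_value & 0x0F):
            return chars

def width_in_lowres_chars(register_value, hres):
    """Express the hsync width in -HRES character times.

    In +HRES mode each character is half as wide.
    """
    chars = hsync_width_in_chars(register_value)
    return chars / 2 if hres else chars

print(hsync_width_in_chars(10), hsync_width_in_chars(0))
print(width_in_lowres_chars(10, hres=True), width_in_lowres_chars(0, hres=True))
```

Under this model the BIOS value of 10 covers only 5 low-resolution character times in +HRES mode, and the maximum setting covers 8 of the 10 that the full sequence needs.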
That fixes the brightness problem but unfortunately some capture devices (including the one that Trixter used to do some test/failsafe captures before the party) cope less well with this than with the border colour 6 change. If we release a "final version" (with a few minor improvements and bug fixes) we might include a "calibration screen" that people can use to choose the border colour, hsync width and phase that works best with their output device.
Yet another complication is that there were multiple revisions of the IBM CGA card. They had (to a good approximation) the same standard composite artifact colours but different direct colours. On the older CGA cards, colours 1-6 were all the same brightness, as were colours 9-14. This made them indistinguishable on monochrome composite monitors, so for the second revision of the CGA card, IBM added some more resistors to the output DAC in order to make different colours different brightnesses. They also removed the -BLANK signal so that the burst pulse is the same amplitude no matter whether it comes from border colour 6 or from the hsync burst (the truncated burst problem is still present, though).
Different direct colours mean that our 1K colour mode displays a different set of colours on old CGA cards than on new CGA cards. We debated a bit about whether we should target old or new CGA cards for our demo, but in the end we decided to go for old CGA cards, mainly because the set of colours you can get from an old CGA card is more useful (artistically speaking) than the set you get from a new CGA card.
In order to make the hand-drawn 1K colour pictures, VileR and I made some test captures of the old CGA card's output with all the useful combinations of attribute and character, which he then used as a palette to paint his pictures. Happily he was able to find in there some close correspondences to the 16 RGBI colours and the 16 colours of the Commodore 64's palette.
For the pictures that were converted from photographs, I wanted to be able to use more characters than just 0x13, 0x55, 0xB0 and 0xB1 - I wanted to be able to try all different characters (even those that have different left and right nybbles in their top scanline) to get a closer match to the source image. However, getting calibration images for all 65536 combinations (let alone the 4 billion artifacts that can be generated from adjacent characters) was impractical. To make that work, I really needed to have a mathematical model of the CGA's composite output stage that I could use to generate the right colours. Ideally I would be able to generalize this to new CGA as well.
My first attempt at this was the one I used for the Hydra image - I assumed that the direct colours had hue/phase angles that were multiples of exactly 45 degrees, and that the CGA's pixel colour multiplexer chip was able to switch instantaneously between them. However, the Hydra didn't come out looking how I expected on real hardware. Much later, I learnt that the main reason for this is that the TTL logic chips used on the CGA card don't switch instantaneously - there are logic delays between a signal coming in to an input pin and the corresponding change happening on the output pin. When your color carrier period is 279ns, a delay of just 7ns causes a noticeable phase shift of 9 degrees.
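The arithmetic behind that last figure is just the delay expressed as a fraction of the carrier period. A quick Python check, using the standard NTSC colour carrier frequency of 3.579545MHz (the function name is made up for the example):

```python
COLOR_CARRIER_HZ = 3_579_545  # NTSC colour carrier, period of about 279ns

def phase_shift_degrees(delay_ns):
    """Phase shift caused by a logic delay, as a fraction of one carrier cycle."""
    return 360.0 * delay_ns * 1e-9 * COLOR_CARRIER_HZ

print(round(phase_shift_degrees(7), 2))
```

A few chips in series, each with a delay of this size, quickly add up to a hue error that is easy to see.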
There are several logic chips on the various signal paths of interest here, all with their own logic delays. My second attempt at modelling the CGA involved looking up the data sheets for all these chips, finding typical values for the logic delays (most of them were listed as a range) and generating an accurate model that way. This worked excellently for 1bpp mode, reasonably well for 2bpp mode, and not so well at all for +HRES mode. This is the implementation that is in the current SVN versions of DOSBox. I kept adding more and more parameters to my model and attempted to tune them to match my captured calibration images but I could not get good results that way. The trouble seemed to be in the guts of the multiplexer chip itself - the output signal depends in a complicated and mysterious way on all of the input signals, so the number of parameters required to describe its behavior quickly becomes impractical.
The final breakthrough came when I realized that I didn't need to model the composite signal *exactly* - I just needed to model it well enough to describe the observed colours. All the relevant colour information is at frequencies below 7.16MHz. By the sampling theorem, if we can reconstruct a version of the signal sampled at 14.318MHz, it'll be exactly correct not including frequencies at or above 7.16MHz (which we don't care about). The key insight is that we don't care about what happens to the signal *in between* those samples - it can bounce around, transition as slow or fast as you like as long as we know where it ends up when we measure the sample - all that extra freedom just manifests in the frequencies we don't care about.
The multiplexer takes a while to transition from one colour to another - on the order of 70ns (one 1bpp pixel time). So there isn't a place in the signal that we can sample and be sure that the previous transition has stopped and the next transition has not yet started. I theorized that at any given time there will not be more than one transition taking place. So a transition (and hence a sample) can be completely described by 1024 parameters - one for each combination of left colour (16 possibilities), right colour (16 possibilities) and phase within the color carrier cycle (4 possibilities).
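In code, such a model is nothing more than a three-dimensional lookup table. The Python sketch below shows the shape of it. The table contents are placeholders (the real values come from fitting against captures), and taking the phase as the pixel position modulo 4 is an assumption made for the illustration rather than a statement about how CGA2NTSC indexes its table.

```python
def make_placeholder_table():
    """Build a 16 x 16 x 4 table of made-up sample values in the range 0..1.

    Real values would be fitted to captured output, and these are NOT measured.
    """
    return [
        [[(left + right) / 30.0 for _phase in range(4)] for right in range(16)]
        for left in range(16)
    ]

def synthesize(colours, table):
    """Turn a run of per-pixel colour indices into one sample per transition.

    Each sample depends only on the colour to its left, the colour to its
    right, and the position within the colour carrier cycle.
    """
    return [
        table[colours[x - 1]][colours[x]][x % 4]
        for x in range(1, len(colours))
    ]

table = make_placeholder_table()
print(synthesize([0, 15, 15, 0], table))
```

Note that a "transition" between two identical colours is still a table entry, so runs of a single colour need no special case.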
I made a test pattern which does a very good job of getting swatches of all 4096 foreground/background/pattern combinations in just a couple of screenfuls (some can only be obtained for a short stretch, as transitions). This was quite a feat in itself - I needed an area of screen consisting of scanline 3 of several characters repeating vertically, necessitating having four CRTC scanlines within a single CRT scanline - a hairier CRTC manipulation trick than any that we actually used in 8088 MPH itself!
I set up a model with these parameters and tried to match it to my captures. I initially tried to use a gradient-ascent hill-climbing algorithm to search the parameter space but before I could get it to work I realized that most of the parameters affected very few of the test swatches - any transition between two different colours X and Y can only affect colours with X and Y as foreground and background colours (32 of the 4096). If the left colour and the right colour are the same then any swatch with that colour as either foreground or background can be affected (496 of the 4096). That observation made a more naive hillclimbing algorithm much more practical, and it just takes a few minutes to find a set of 1024 parameters that match the measured values to within a few percent.
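Those two counts can be verified by enumerating all 4096 swatches. This Python sketch does only the counting and is not the fitting code itself. The function name `affected_swatches` is invented for the example.

```python
def affected_swatches(left, right):
    """Count swatches (fg, bg, pattern) that a given transition can influence."""
    count = 0
    for fg in range(16):
        for bg in range(16):
            for _pattern in range(16):
                if left != right:
                    # Both colours must be present, in either role.
                    hit = {fg, bg} == {left, right}
                else:
                    # A same-colour "transition" needs that colour in one role.
                    hit = left in (fg, bg)
                count += hit
    return count

print(affected_swatches(1, 2), affected_swatches(3, 3))
```

Because each parameter touches so few swatches, the error only needs to be re-evaluated over that small subset after a parameter is nudged, which is what makes a simple one-parameter-at-a-time search cheap.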
I wanted to model the new CGA as well as the old CGA, but didn't have an easy way to get good captures for that. However, the difficult bit of the old CGA to model (the multiplexer) is identical in the new CGA. So instead of applying the technique described above to the final output, I applied it to the multiplexer output and the intensity bit output separately. This yields a 256-element table and a 16-element table respectively. The outputs from these are summed to get the final output. There was a small amount of degradation from the 1024-element table version but it's too small to notice directly. To generate new CGA output, I just duplicated the intensity bit logic and applied it to the R, G and B bits (with appropriate scaling based on the resistor values in IBM's schematics). I haven't yet tested if this really matches new CGA output, but I don't currently know of any reason why it wouldn't.
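The split model can be sketched as two smaller lookups whose results are added. The table sizes suggest indexing the multiplexer table by the RGB bits of the left and right colours plus the phase (8 x 8 x 4 = 256) and the intensity table by the two intensity bits plus the phase (2 x 2 x 4 = 16). That indexing is an inference from the sizes, not something taken from the CGA2NTSC source, and the tables here start out empty rather than holding fitted values.

```python
def empty_tables():
    """Return zero-filled (mux, intensity) tables of 256 and 16 elements."""
    mux = [[[0.0] * 4 for _right in range(8)] for _left in range(8)]
    intensity = [[[0.0] * 4 for _right in range(2)] for _left in range(2)]
    return mux, intensity

def old_cga_sample(left, right, phase, mux, intensity):
    """Sum the multiplexer and intensity contributions for one sample.

    `left` and `right` are 4-bit RGBI colour indices, where bit 3 is intensity.
    """
    chroma = mux[left & 7][right & 7][phase]
    boost = intensity[left >> 3][right >> 3][phase]
    return chroma + boost

mux, intensity = empty_tables()
mux[1][2][3] = 0.4          # hypothetical fitted value
intensity[1][0][3] = 0.25   # hypothetical fitted value
print(old_cga_sample(9, 2, 3, mux, intensity))
```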
This CGA simulation algorithm is implemented in a program I made called CGA2NTSC which has two main functions. One is to act as a CGA composite output stage emulator and NTSC decoder (taking a picture such as one might find on an RGBI monitor and showing what it would look like on the composite output of old and new CGA). The other is to take a (24-bit colour) input image and try to find a set of data which, when loaded into the CGA's video memory, will best reproduce that image on the composite output. This is what we used to make the faces picture in the demo. It supports 1bpp and 2bpp modes as well as both text modes (though we only used it in +HRES mode for 8088 MPH). The program uses error diffusion (which can be turned down or off). I've had a couple of requests to make it use ordered dithering instead. That's possible for 1bpp and 2bpp modes but doesn't really make sense for text modes where you don't get an arbitrary choice of bit patterns. The program is a bit unpolished but should be reasonably usable.
Next time: how I played a 4 channel MOD at a sample rate of 16.6kHz through the PC speaker on a 4.77MHz 8088 CPU.