More 8088 MPH how it's done

While the 1K colour mode and 4-channel audio player are the big technical achievements I wrote for 8088 MPH, there are a few other bits and pieces in the demo I wanted to write up here. Specifically, three tricks which make possible full-screen 60Hz movement that isn't scrolling.

Starfield (particles)

This is the was the first effect I wrote for the demo, about 4 years ago. The inner loop looks like:

patchAddress:
  mov di,9999
  stosb
  shl di,1
  mov di,[di]
  es: mov [di],ah
  mov [patchAddress + 1],di

This is unrolled 948 times (once for each moving particle, with a separate patchAddress for each one). This is the number we calculated it would be possible to have if each one is updated at a rate of 60Hz (though the frame rate isn't critical here - since the particles are scanned in random order, tearing can never happen). The initial (random) value loaded into DI is the position of the particle (a value between 0 and 16383 inclusive, a byte address in video RAM). The value in AL is zero, so STOSB erases the particle. Then the lines "shl di,1 mov di,[di]" move the particle - the data table at DS:0 to DS:32767 is a sort of cross between a circular linked list and a vector field, describing the trajectory that each particle follows. So topologically speaking the particles are all moving around in a loop, but the points along this loop are arranged so that they form trajectories which look like the movement we want. The shift is necessary because the vector table is two bytes per element but the other uses of DI are one byte per element.

The line "es: mov [di],ah" plots the star on the screen. We actually have 7 different colours of stars, held in AH, BL, BH, CL, CH, DL and DH. These registers are initialized appropriately at startup, and the register that each iteration of the unrolled loop uses is set at random.

My first version of this effect included 21 different vector fields but unfortunately they don't compress well (and the best ones take ages to compute from first principles) so we ended up just using a single one and including it in the binary instead of generating it at run-time.

Moire

This can be done with practically 0 CPU time on Amiga by means of multiple bitplanes scrolling independently. We don't have that facility on CGA so it's a bit more CPU-intensive but still pretty straightforward. We use 40-column text mode to avoid snow, and do the same "right half bar" (0xde, ' ▐') trick that is used for 160x100x16 mode (based on 80-column text mode). The number of scanlines per row is set to 4 to create an 80x50 "chunky pixel" mode (though actually we only had enough time for 47 rows at 60Hz, so we just left the last few rows blank).

The large bitmap (picture of circles) is actually stored in the executable rather than computed at runtime. A second copy is created at runtime (shifted over by one half-character). Two pointers into these bitmaps are maintained in the registers SP and BP and we process four pixels per iteration of the inner loop, which looks like this:

  pop ax
  xor ax,[bp+99]
  stosb
  inc di
  mov al,ah
  stosb
  inc di

This is unrolled 940 times (20 times per row, 47 rows) with appropriate "+99" adjustments in each row, and adjustments to SP and BP are made after each row of 20:

  add sp,stride-40
  add bp,stride

To make the transition effect at the end, we overwrite blocks of these instructions by replacing the "xor ax,[bp+99]" instruction with "mov ax,0".

Unfortunately once we added the music the effect no longer runs at quite 60Hz and some tearing can be seen with close observation. This is something we hope to fix in a final version.

Kefrens bars (+raster bars)

The idea behind Kefrens bars is to get rid of the frame buffer altogether and just have a line buffer (Atari 2600 style). Then any video memory changes that you make on one scanline appear on all lower scanlines (unless overwritten by change on a lower scanline). During vertical overscan/blank/sync the line buffer is cleared. TThis has been done lots of times on other platforms, but never on CGA before. Several things make it tricky on this platform: the CGA wait states severely limit how many VRAM writes you can do in any given scanline (we ended up with 2 reads and 4 writes per scanline, plotting 7 nybble-wide pixels). The other is that you need to synchronize the routine with the CRT raster (aka "racing the beam"), meaning cycle counting and tuning the routine to take exactly 304 cycles.

Normally it's not possible for piece of 8088 code to always take exactly 304 cycles - the reason being that every 72 cycles PIT channel 1 wraps and triggers DMA channel 0 to refresh a row of DRAM, stealing the bus from the CPU for 4 cycles. As 304 is not divisible by 72, these refresh cycles appear at different places on the screen from scanline to scanline, so you need potentially 9 different scanline routines for the 9 different refresh phases. The number of scanlines per frame (262) is not divisible by 9 so you need potentially 9 different frame routines as well.

Fortunately there's another way that is both easier and requires less RAM. It's possible to reprogram PIT channel 1 so that the refreshes come every 76 cycles instead (exactly 4 times per scanline). While this may technically be out of spec for the DRAM chips, in practice we haven't found any machines on which the slower refresh doesn't work - in fact many require massively longer refresh intervals before they start to decay. Even the ones that crash with a refresh period of 84 or 80 cycles work at 76 just due to manufacturing tolerances. This is one ingredient we used to make the Kefrens bars work. Note that we had to put it back to 72 for the credits part as that routine's inner loop is a multiple of 72 cycles but not 76.

Another complication is that the CRTC in our CGA cards (the MC6845) does not cope well with a one-scanline-high frame (I believe it can be done, but that it requires reprogramming the CRTC a couple of times per scanline.) So instead our "line buffer" is actually two scanlines high. This gives a nice dithering effect on our Kefrens bars - we liked the way it looked, so we stopped trying to get the single-scanline version working.

The inner loop of the effect is 200 unrolled iterations of the 304-cycle:

  mov ax,9999
  mov ds,ax
  mov sp,[bx]
  pop di
  mov al,[es:di]
  pop cx
  and ax,cx
  pop cx
  or ax,cx
  stosw
  pop ax
  and ah,[es:di+1]
  pop cx
  or ax,cx
  stosw
  pop ax
  out dx,al
  mov ds,bp
  lodsb
  out soundPort,al

The routine uses two big tables. One (at DS:BX) is 200*838 elements in size (one for each combination of scanline and frame number - 200 scanlines, 838 frames). Each element of this table is 2 bytes, a pointer into the other, smaller table (at SS:SP). The total size of this table is 327kB, the largest single table in the entire demo. This big table is stored sideways (much like the sample table in the MOD player). Since we need to reload DS a couple of times each scanline already, we might as well reload it with a different value on each scanline. The "9999"s are patched with the right DS values at loop unrolling times. The entries in the SS:SP tables are 12 bytes each, and hold all the information needed for a combination of a Kefrens bar position and a raster bar colour. There are 154*16 entries (one for each such combination) and the whole table needs to be doubled in order to have different entries on odd and even scanlines. So 12*154*16*2 = 59136 bytes, fits quite happily in a single segment.

The small table can easily be generated at runtime, but the big table proved troublesome. With some heavy optimization, I got the precomputing of Puppeh's nice Kefrens bars motions and raster bars patterns down to 28 seconds or so but my target was less than 15 seconds. After a few false starts I figured out that the Kefrens bars part of the table was quicker to compute but compressed less well while the raster bars part of the table took longer to compute but compressed much better. So I ended up just splitting the tables, compressing the rasters and precalculating the Kefrens.

If you try the demo out on real hardware and the Kefrens bars effect seems unstable, try disabling any network card drivers and networking software you may have installed. IRQs from a NIC or similar hardware can mess up the delicate timings. This is one of the more timing-sensitive effects in the whole demo, though - without a 4.77MHz and genuine MC6845-based IBM CGA card the image is unlikely to stabilize.

One other thing you might notice about this routine is that it outputs to the speaker on each scanline! This was something I didn't get working in time for Revision, but my plan for this effect was that we would have PWM music playing in the background. So there's more that can be done with this effect that what we did in 8088 MPH (if not the sound, then those cycles could be used for something else).

Source code

The source file for all the bits of the demo I wrote are now up on my github:

Effects:

Tools:

4 Responses to “More 8088 MPH how it's done”

  1. […] write-up of the demo. And reenigne has done a piece on the 1024-colour tweak, the mod player, some other stuff, and VileR has also done a piece on the new CGA colours. (Update: I have done pieces on the sprite […]

  2. […] effects and such (although as you may know, reenigne found a way around that, which is how he pulled off the Kefrens bars in 8088 MPH, among other things). The upside is that you don’t have to explicitly write the value during […]

Leave a Reply