computer « Reenigne blog

Archive for the ‘computer’ Category

A tale of two debuggers

Tuesday, February 9th, 2016

Every non-trivial compiler to have an intermediate representation - that is, a form of code that is neither the input source code nor the output binary or assembly code. Apart from the fact that neither of these are good representations on which to do many optimizations, it's also good for a compiler to be able to accept multiple different input languages and target multiple different output architectures.

A particularly important part of writing a compiler is figuring out a good intermediate representation. And one particularly important part of doing that is deciding on how to store debugging information. I'm not planning to write an ALFE debugger immanently but I expect that one would be useful eventually so I want to make sure that the infrastructure is sufficient to do so when I do.

I think that some desirable properties to have in a debugger are consistency, predictability and flexibility even in the face of highly optimized code - ideally it shouldn't even be necessary to have separate "debug" and "release" configurations - you'd debug on the same build of your program that ended up on the user's machine. Though one might still make a separate build with some extra checking, logging or other debugging infrastructure in order to investigate a particular bug.

There's two ways to approach writing a consistent, predictable debugger. One is to approach it from the machine code side: the places the debugger can stop are the spaces between machine instructions (let's call these red locations), and when it's stopped you can examine and modify register contents and memory locations. In other words, an old-style low-level debugger like DEBUG in MS-DOS, that understands nothing about the language that the program was written in.

The other way is to work from the source level. and is kind of a holy grail of debuggers. The places the debugger can stop are the spaces between statements or expressions with side effects (let's call these blue locations). Any source-level expression can be evaluated, any statement can be executed and any lvalue modified. What's more, the order in which things happen is predictable.

To understand what this means, consider the following ALFE program:

Int d = 0;
Int e = 0;
Int a() { ++d; return 2; }
Int b() { ++e; return 3; }
Int c = a() + b();

For the object code that results from the compilation of the last line, the compiler is entitled to generate the code that evaluates a() and b() in either order (since the order doesn't make any difference to the output of the program) or even both at once if that's possible in the target architecture's instruction set. However, when debugging we would like the order to be predictable - i.e. we would want the debugger to act as if a() is evaluated before b(). That's obviously a trivial example, but in real code optimizations can cause debugging to be unpredictable in far more severe ways.

One way this could be done is to ignore the binary program completely and interpret the source program (or, equivalently, rebuild the program with no optimizations and debug that). That would make the program run much slower under the debugger than when running normally (which may make it too slow to be practical - I've written programs where trying to debug using the non-optimized version was just too slow, which made debugging very tricky).

So, most debuggers don't try to be that predictable - they implement some hybrid approach. For example, executing things in machine code order rather than trying to simulate anything, and storing in the debug information an instruction address for each source code line (so that, unless explicitly stepping through instructions rather then the source code) you can only stop on instructions corresponding to source code lines.

I think a better approach (from the point of view of predictability) might be for the debugger to run the normal, optimized binary code until a breakpoint is hit, at which point it switches to interpretation, reconstructing all the information it needs about the program state from the stopped program's registers and memory. That might mean that on hitting a breakpoint in b() you might see d unincremented in the above example, but stepping through the same line of code would give a different result. I'm not sure if that causes more confusion than it solves.

Seeing e incremented before d should have no observable consequences (outside of the debugger) to the execution of the program - otherwise the compiler would be incorrect in swapping the calls. Checking that the compiler and debugger are correct is a specific non-goal of the debugger - we really have to work on the assumption that they are correct, otherwise we'd be second-guessing ourselves.

With an incremental compiler sufficiently well integrated with the debugger, interpretation is unnecessary - the debugger can invoke the compiler to add the breakpoint as a function call and recompile. The breakpoint then becomes an observable of the program, so that only the specific instances of the specific optimizations that need to be are disabled. Care needs to be taken to ensure that the format of in-memory data structures doesn't change in order that the program doesn't have to be restarted, but that shouldn't be too difficult.

Anyway, what does this mean for debug information? We need to somehow relate items in the object code to items in the source code and vice-versa. What does this mean in practice?

For one thing, we'll need to find a relationship between red locations and blue locations. One way to represent this would be a linked list of locations which may be either red or blue or purple (for the case where a red location perfectly coincides with a blue one). Given a location and type, you can obtain the next red location, the next blue location, the next purple location or the next location of any colour. This isn't sufficient for all optimization transformations (for example in a nested loop swapping the inner loop with the outer one so that the former becomes outer and the latter becomes inner) but it's a start.

The other thing we need is enough information to reconstruct the value of any variable given the register values and memory contents (and vice versa, for going back to the "running" state after interpreting for a while). The naive way of doing that would be to have a big matrix with red locations horizontally and variables vertically, and in each cell an algorithm for computing that variable and another for modifying the machine state when that variable is modified. Obviously that would take far too much memory for all but the most trivial programs, but it might be possible to store it in a highly compressed form, such as just storing changes to this table for each instruction.

Going back to the intermediate form, I think all this means is that I want to be able to store two things:

Some "pseudo-instructions" (which don't do anything to the output object code itself, but mark the position of a blue location, and have pointers to the previous and next blue locations, which may not be the same as the previous and next blue locations in the object code code). Some of these pseudo-instructions may also coincide with an actual instruction (when the blue location is between two consecutive red locations).
For each instruction, a list of changes to the "how to compute a high level variable" table.

Posted in computer, language | 1 Comment »

ISA Bus Sniffer update

Monday, February 8th, 2016

Several years ago, I designed the ISA bus sniffer, a sort of special-purpose logic analyzer for capturing the signals from the CPU and ISA bus in an XT-class system. Well, I finally got around to finishing it, writing the three pieces of software needed to make it work (the microcontroller program, the 8088 test harness and the decoder, as well as making the changes to the XT Server to retrieve traces.

Here is what the contraption looks like:

ISA Bus Sniffer, completed

In this picture you can also see the board that goes into the 8087 FPU socket to probe the CPU pins. The yellow wires go to a switch so that I can remotely reset the microcontroller for reflashing.

Here is how it fits into the machine:

ISA Bus Sniffer in situ

Here is what the XT Server currently looks like:

ISA Bus Sniffer in situ

And here is an example of what the output from the sniffer looks like after decoding:

20FFF .p...  00F24 FF 00 FC .......
20FFF Ip...  00F24 FF 01 FC .......                          I
20FF1 SC...  00F24 FF 01 FC .......  T1                      S F6E1         MUL CL
00F25 .C...  00F25 FF 01 FC .....D.  T2 S0
20F25 .C...  00F25 FF 01 FC .....D.  T3 S1
20F25 .C...  0020B FF 10 FC .....D.  Tw S2
20F25 .C...  0020B FF 10 FC ..r..D.  Tw S3
20F25 .C...  0020B FF 10 FC .Wr..D.  Tw S4 FF <-d [   0020B]
20F25 .C...  0020B FF 00 FC .....D.  Tw
20FFF .C...  00F25 FF 00 FC ..r....  Tw
20FF6 .C...  00F25 F6 00 FC ..r....  Tw
20FF6 .p...  00F25 F6 00 FC ..r....  T4    F6 <-f [   00F25]
20FF6 .C...  00F25 F6 00 FC .......  T1
00F26 .C...  00F26 F6 00 FC .......  T2
20F26 .C...  00F26 FF 00 FC ..r....  T3
20FE1 .p...  00F26 E1 00 FC ..r....  T4    E1 <-f [   00F26]
20FE1 .p...  00F26 E1 00 FC .......

From left to right the columns are: CPU address/data, CPU flags, bus address, bus data, DMA requests and acks, interrupt requests, bus flags, bus states for CPU and DA accesses, transfers, prefetch queue status, instruction data and decoded instruction.

You can try this out! Grab these files:
https://github.com/reenigne/reenigne/blob/master/8088/defaults_bin.asm
https://github.com/reenigne/reenigne/blob/master/8088/defaults_common.asm
https://github.com/reenigne/reenigne/blob/master/8088/trace/trace.asm
https://github.com/reenigne/reenigne/blob/master/8088/trace/trace.asm
http://yasm.tortall.net/Download.html (NASM should work too with minor modifications).

Replace the code under "testRoutine:" with the code that you'd like to get a trace of (it needs to run exactly the same each time it's run, so be sure to initialize all registers and memory you use - I made that mistake and it has me scratching my head for a bit). Build trace.bin and upload it to http://www.reenigne.org/xtserver - within seconds you should get a link to a trace like the one above.

Posted in computer, emulation | 6 Comments »

8088 MPH final version announcement

Saturday, August 1st, 2015

Hornet, CRTC and DESiRE are pleased to announce a final version of our demo 8088 MPH which can be downloaded at http://www.reenigne.org/8088MPH_final.ZIP. Also a capture of the final version can be seen and heard at https://www.youtube.com/watch?v=hNRO7lno_DM. This final version has some new graphics, minor bug fixes and compatibility improvements, but it still breaks all your emulators.

Happy birthday Trixter!

Posted in computer | 1 Comment »

More 8088 MPH how it's done

Sunday, April 12th, 2015

While the 1K colour mode and 4-channel audio player are the big technical achievements I wrote for 8088 MPH, there are a few other bits and pieces in the demo I wanted to write up here. Specifically, three tricks which make possible full-screen 60Hz movement that isn't scrolling.

Starfield (particles)

This is the was the first effect I wrote for the demo, about 4 years ago. The inner loop looks like:

patchAddress:
  mov di,9999
  stosb
  shl di,1
  mov di,[di]
  es: mov [di],ah
  mov [patchAddress + 1],di

This is unrolled 948 times (once for each moving particle, with a separate patchAddress for each one). This is the number we calculated it would be possible to have if each one is updated at a rate of 60Hz (though the frame rate isn't critical here - since the particles are scanned in random order, tearing can never happen). The initial (random) value loaded into DI is the position of the particle (a value between 0 and 16383 inclusive, a byte address in video RAM). The value in AL is zero, so STOSB erases the particle. Then the lines "shl di,1 mov di,[di]" move the particle - the data table at DS:0 to DS:32767 is a sort of cross between a circular linked list and a vector field, describing the trajectory that each particle follows. So topologically speaking the particles are all moving around in a loop, but the points along this loop are arranged so that they form trajectories which look like the movement we want. The shift is necessary because the vector table is two bytes per element but the other uses of DI are one byte per element.

The line "es: mov [di],ah" plots the star on the screen. We actually have 7 different colours of stars, held in AH, BL, BH, CL, CH, DL and DH. These registers are initialized appropriately at startup, and the register that each iteration of the unrolled loop uses is set at random.

My first version of this effect included 21 different vector fields but unfortunately they don't compress well (and the best ones take ages to compute from first principles) so we ended up just using a single one and including it in the binary instead of generating it at run-time.

Moire

This can be done with practically 0 CPU time on Amiga by means of multiple bitplanes scrolling independently. We don't have that facility on CGA so it's a bit more CPU-intensive but still pretty straightforward. We use 40-column text mode to avoid snow, and do the same "right half bar" (0xde, ' ▐') trick that is used for 160x100x16 mode (based on 80-column text mode). The number of scanlines per row is set to 4 to create an 80x50 "chunky pixel" mode (though actually we only had enough time for 47 rows at 60Hz, so we just left the last few rows blank).

The large bitmap (picture of circles) is actually stored in the executable rather than computed at runtime. A second copy is created at runtime (shifted over by one half-character). Two pointers into these bitmaps are maintained in the registers SP and BP and we process four pixels per iteration of the inner loop, which looks like this:

  pop ax
  xor ax,[bp+99]
  stosb
  inc di
  mov al,ah
  stosb
  inc di

This is unrolled 940 times (20 times per row, 47 rows) with appropriate "+99" adjustments in each row, and adjustments to SP and BP are made after each row of 20:

  add sp,stride-40
  add bp,stride

To make the transition effect at the end, we overwrite blocks of these instructions by replacing the "xor ax,[bp+99]" instruction with "mov ax,0".

Unfortunately once we added the music the effect no longer runs at quite 60Hz and some tearing can be seen with close observation. This is something we hope to fix in a final version.

Kefrens bars (+raster bars)

The idea behind Kefrens bars is to get rid of the frame buffer altogether and just have a line buffer (Atari 2600 style). Then any video memory changes that you make on one scanline appear on all lower scanlines (unless overwritten by change on a lower scanline). During vertical overscan/blank/sync the line buffer is cleared. TThis has been done lots of times on other platforms, but never on CGA before. Several things make it tricky on this platform: the CGA wait states severely limit how many VRAM writes you can do in any given scanline (we ended up with 2 reads and 4 writes per scanline, plotting 7 nybble-wide pixels). The other is that you need to synchronize the routine with the CRT raster (aka "racing the beam"), meaning cycle counting and tuning the routine to take exactly 304 cycles.

Normally it's not possible for piece of 8088 code to always take exactly 304 cycles - the reason being that every 72 cycles PIT channel 1 wraps and triggers DMA channel 0 to refresh a row of DRAM, stealing the bus from the CPU for 4 cycles. As 304 is not divisible by 72, these refresh cycles appear at different places on the screen from scanline to scanline, so you need potentially 9 different scanline routines for the 9 different refresh phases. The number of scanlines per frame (262) is not divisible by 9 so you need potentially 9 different frame routines as well.

Fortunately there's another way that is both easier and requires less RAM. It's possible to reprogram PIT channel 1 so that the refreshes come every 76 cycles instead (exactly 4 times per scanline). While this may technically be out of spec for the DRAM chips, in practice we haven't found any machines on which the slower refresh doesn't work - in fact many require massively longer refresh intervals before they start to decay. Even the ones that crash with a refresh period of 84 or 80 cycles work at 76 just due to manufacturing tolerances. This is one ingredient we used to make the Kefrens bars work. Note that we had to put it back to 72 for the credits part as that routine's inner loop is a multiple of 72 cycles but not 76.

Another complication is that the CRTC in our CGA cards (the MC6845) does not cope well with a one-scanline-high frame (I believe it can be done, but that it requires reprogramming the CRTC a couple of times per scanline.) So instead our "line buffer" is actually two scanlines high. This gives a nice dithering effect on our Kefrens bars - we liked the way it looked, so we stopped trying to get the single-scanline version working.

The inner loop of the effect is 200 unrolled iterations of the 304-cycle:

  mov ax,9999
  mov ds,ax
  mov sp,[bx]
  pop di
  mov al,[es:di]
  pop cx
  and ax,cx
  pop cx
  or ax,cx
  stosw
  pop ax
  and ah,[es:di+1]
  pop cx
  or ax,cx
  stosw
  pop ax
  out dx,al
  mov ds,bp
  lodsb
  out soundPort,al

The routine uses two big tables. One (at DS:BX) is 200*838 elements in size (one for each combination of scanline and frame number - 200 scanlines, 838 frames). Each element of this table is 2 bytes, a pointer into the other, smaller table (at SS:SP). The total size of this table is 327kB, the largest single table in the entire demo. This big table is stored sideways (much like the sample table in the MOD player). Since we need to reload DS a couple of times each scanline already, we might as well reload it with a different value on each scanline. The "9999"s are patched with the right DS values at loop unrolling times. The entries in the SS:SP tables are 12 bytes each, and hold all the information needed for a combination of a Kefrens bar position and a raster bar colour. There are 154*16 entries (one for each such combination) and the whole table needs to be doubled in order to have different entries on odd and even scanlines. So 12*154*16*2 = 59136 bytes, fits quite happily in a single segment.

The small table can easily be generated at runtime, but the big table proved troublesome. With some heavy optimization, I got the precomputing of Puppeh's nice Kefrens bars motions and raster bars patterns down to 28 seconds or so but my target was less than 15 seconds. After a few false starts I figured out that the Kefrens bars part of the table was quicker to compute but compressed less well while the raster bars part of the table took longer to compute but compressed much better. So I ended up just splitting the tables, compressing the rasters and precalculating the Kefrens.

If you try the demo out on real hardware and the Kefrens bars effect seems unstable, try disabling any network card drivers and networking software you may have installed. IRQs from a NIC or similar hardware can mess up the delicate timings. This is one of the more timing-sensitive effects in the whole demo, though - without a 4.77MHz and genuine MC6845-based IBM CGA card the image is unlikely to stabilize.

One other thing you might notice about this routine is that it outputs to the speaker on each scanline! This was something I didn't get working in time for Revision, but my plan for this effect was that we would have PWM music playing in the background. So there's more that can be done with this effect that what we did in 8088 MPH (if not the sound, then those cycles could be used for something else).

Source code

The source file for all the bits of the demo I wrote are now up on my github:

Effects:

100-scanline faces picture
Credits scrolling with 4 channel music
Kefrens/raster bars
Moire pattern
Particles startfield
1K colour mode (code used in demo was not identical, but was based on this and modified by other coders).

Tools:

Posted in computer | 4 Comments »

8088 PC Speaker MOD player: How it's done

Friday, April 10th, 2015

The last 100 seconds of 8088 MPH sound very different to the rest of the demo. The end tune is actually a 4-channel Amiga MOD file (which you can download here) composed by coda. Playing back a MOD through PC speaker on such a slow machine has never been done before. Here is how we did it.

We knew early on that we wanted to push the limits of audio from the machine as well as video, but also that we wanted to use the PC speaker for output. The first Sound Blaster card didn't arrive on the scene until 1989. Also, Galaxy Player (glx212) can play a MOD on a 4.77MHz 8088 with a Sound Blaster, so doing that wouldn't have broken any records. Galaxy Player is a very impressive piece of code - I once broke into it with a debugger to figure out how it works. It can also play back through the PC speaker, but that requires a much faster computer.

Most software that uses the PC speaker plays square wave beeps. This can be done with minimal CPU involvement as channel 2 of the PC's 8253 Programmable Interval Timer (PIT) is connected (via an AND gate) to the speaker. Setting that channel to square wave mode and programming an appropriate frequency (as a divisor of 105MHz/88 ~= 1.193MHz) is about all you need to do for an unattended beep. Changing the beep frequency 60 times a second (via an interrupt) is how the music is played for most of 8088 MPH. By rapidly switching between several notes (arpeggiation) you can give the impression of playing multiple notes at once (though it's a poor substitute for true chords) and by varying the duty cycles of arpeggios you can get a very limited impression of volume attenuation.

Playing back pre-recorded sampled sound over the PC speaker was less common, but still a well understood technique. As well as a square wave mode, the PIT has a "one shot" mode where the output goes low while the counter counts down and then goes high waiting for further input. By loading new values into the PIT's count register regularly, Pulse Width Modulation can be achieved. I remember being stunned the first time I heard this being done on the family Amstrad PC1512 - a crystal-clear (for the time) 6 bits of dynamic range at a sample rate of 18.6kHz! The "regularly" bit is the problem, though. Most (if not all) PC speaker software using this technique used a hardware timer interrupt (IRQ0, PIT channel 0) to trigger the CPU regularly to reload the channel 2 count with the next sample. However, interrupts on a 4.77MHz 8088 are really slow, and when playing back samples at ~20kHz there is basically no time for anything else (like the mixing required for a MOD player). All you can do is play back pre-rendered sound, and 100 seconds of 16.6kHz audio would have been 1.6MB of data - way too much for our 360kB disk space budget even with pklite helping out, not to mention our 640kB of RAM.

However, since we were targeting a specific CPU (and a specific clock speed) for this demo, we had a way of timing things without the timer interrupt - counting cycles. Conceivably, the same techniques could be used in a more portable player - having several predefined routines for the most common slow CPUs (and an IRQ-driven version for faster CPUs), or calibrating the speed on startup.

When writing highly-optimized 8088 code, there are two techniques which are particularly important - loop unrolling and self-modifying code. There is a tension between these two techniques, though - the more unrolling you do the lower your looping overhead but the more code you have to self-modify. Galaxy player plays these two techniques off each other quite nicely - picking an unrolled loop size which minimizes the total time the routine takes. Unfortunately this technique isn't a good fit for CPU-timed PC speaker audio because the samples aren't generated sequentially - it first renders a frame of samples for channel 0, then adds in a frame of samples for channel 1 and so on. It might be possible to statically intersperse the PIT count "out" instructions into the mixing code but that's really difficult code to write (it really needs to be written by a sophisticated tool that has knowledge of the exact 8088 timings to minimize jitter). Perhaps for a future project...

After playing about with (and measuring the execution speed of) a whole lot of different routines, I settled on one that doesn't unroll the sample loop at all (but does completely unroll the channel loop). The mixing code itself looks like this:

  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al

This code can't play back arbitrary MODs because it has a limitation on sample length - rather than being of arbitrary length, samples are all exactly 256 bytes long. The idea is that samples repeat with one oscillation (i.e. an up and a down) over those 256 bytes. With a 16.6kHz output sample rate that translates to an output frequency range of 0 to 8.3kHz with a frequency resolution of 0.25Hz. So it's capabilities are somewhat closer to the C64's SID chip than the Amiga's Paula (in fact we codenamed the routine SID during development - we had others codenamed VIC and Paula which may get an airing in the future).

Our "SID" also has no volume tables - the volumes have to be baked into the samples themselves, so there may need to be several copies of any given sample at different volume levels (one for each volume level at which the sample is played). If we have too many samples, volume can be quantized (the program that converts from the source .mod to the player's internal format does the volume baking and optimizes the number of quantization levels). If we can squeeze it into 288 CPU cycles (spoiler: we can!), that's 72 PIT cycles which divides nicely into 4 - each channel's samples go from 1 to 18, and the final sample goes from 4 to 72 (we have to take care not to program 0 into the PIT or it'll count down for 65536 cycles and we'll get no audio for about 55ms). It also works out well for DRAM refresh - using the default DRAM refresh period of 18 we'll get exactly four refresh bus cycles per sample, and they'll be in the same places in the execution of the code every sample, so won't cause any jitter.

The four registers bp, si, di and dx each hold the respective channel's current position within the waveform (as an 8.8 bit fixed point number). The "9999"s are the respective channels' frequencies. The higher the frequency, the further the position gets advanced each sample (direct digital synthesis - similar to that which I used in the Physical Tone Matrix). These frequency values are modified right in the code itself during runtime (self-modifying code). This reduces register pressure (which is important as there are not a lot of registers to spare!)

Similarly, the "99"s (also patched at runtime) are the waveform numbers for each channel. The 256x256 waveform table is turned "sideways" (the low byte of the address is the waveform number and the high byte is the position) in order to avoid having to shift the high 8 bits from the position register to the low 8 bits of the sample pointer.

Now, if we run this code in a loop we'll get a chord playing from the PC speaker. But only a single chord - we want to play a whole song, where the note frequencies and waveform numbers change potentially 50 times per second. So we need to add a way to patch the frequencies and waveform numbers in the code. The fastest way to do that is to use the stack as our stream of patch data:

  pop bx
  pop word[cs:bx]

This means we need to run with interrupts disabled, but we need to do that anyway - the delays introduced by the timer interrupt would cause massive audio quality degradation.

Now we have enough ingredients to play an actual tune, but not a long one! If we're pulling 4 bytes off the stack for every sample we play, we're going to run out of stack in under a second. We might as well just pull preprocessed samples from the stack if we're going to do that - it would last 4 times longer!

However, most of the time we don't need to patch anything - we just want to leave the sample playing until we next want to change something, and then we probably want to change everything at once after 20ms (331 samples). So we want some kind of loop that counts down and then we do the patching once it reaches zero:

loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop loopTop
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

The "99" in the "mov cl,99" instruction is (you guessed it) another value that is patched at runtime.

This code takes exactly 288 cycles to run in the case where we're patching, but it's quite a bit shorter in the no-patch case. We need it to run at 288 cycles every iteration (patch or no patch) to keep the samples coming regularly. Fortunately, there's a nice place to stick some NOPs where they will be executed only in the non-patch case, making two loops that mutually overlap without either being nested within the other:

v:
  times 15 nop
loopTop:
  add bp,9999  mov bx,bp  mov bl,99  mov al,[bx]
  add si,9999  mov bx,si  mov bl,99  add al,[bx]
  add di,9999  mov bx,di  mov bl,99  add al,[bx]
  add dx,9999  mov bx,dx  mov bl,99  add al,[bx]
  out 0x42,al
  loop v
  pop bx
  pop word[cs:bx]
  mov cl,99
  jmp loopTop

That's it - that's the entire inner loop as it is when the CPU executes its first instruction. All the remaining magic is in the data that's pointed to by the stack pointer. Let's think about how fast we're burning through that data now. 50 times a second we need to patch (worst case) 10 locations (4 frequencies, 4 sample numbers and the loop counter twice). That's 40 bytes, 50 times per second or 2000 bytes per second. That means we get through our 64kB of stack data in less than 33 seconds. That's better than 1 second, but still too short for our song by a factor of 3. We could use a larger stack, but then we'd need to update our stack segment somehow every 33 seconds at least. There's no code to do that, though, and nowhere to put such code that wouldn't execute far too often.

Or is there? Take another look at those 15 NOPs at the start of the routine. If you squint a bit, don't they sort of look like a blank canvas just waiting to be painted with some amazing work of art? (No? Maybe it's just me then). Yes, if we keep the "mov cl,99" line patched to be "mov cl,1" we can patch as many times as we like without the code between v and loopTop being executed at all, which means that we can patch some code into there and then switch the CL value to 2 for a sample in order to execute it. We have to make sure that these little "patched routines" (I call them v-instructions, hence the label) take exactly the same time as the 15 NOPs. This turns out to be possible for all the v-instructions we need for 8088 MPH. However, some of them are *smaller* than the 15 NOPs (in particular those which use some of their bus cycles to access memory or IO ports instead of fetch instruction bytes). This means we need to jump to a different place in the code to start them - that's easy enough, though, we can just patch the destination byte of the "loop v" instruction the same way we're patching everything else (almost half the bytes in this little routine get patched at some point!)

The next thing to notice is that the set of v-instructions that we can execute form a small (albeit verbose) bytecode interpreted language - we can do whatever we like in there provided we meet the space and time requirements. Our little mod player has become Turing complete! In particular, we can modify the stack pointer in order to do loops. That means instead of being hundreds of kilobytes, our v-instruction program can be relatively small (the one in 8088 MPH is just 652 bytes long). It's tricky to write, because it needs to have inside it the locations of the various points within the program that we need to patch, which might move around as we debug things. So rather than writing them directly I wrote some assembler macros to generate them for me. Oh, and because a v-instruction is made up of several CPU instructions, I ended up writing a sort-of mini assembler in the assembler's macro language! Here is one of the v-instructions from the CRTC update v-instruction routine in 8088 MPH:

  forget 7
  startAt 4
  MOV_BX_DX
  MOV_DX_iw
  w 0x3d4
  MOV_AX_iw
  w 0x990d
  OUT_DX_AX
  MOV_DX_BX
  runV 1

This translates to the 8088 instructions:

  mov bx,dx
  mov dx,0x03d4
  mov ax,0x990d
  out dx,ax
  mov dx,bx

The "0x99" (AH value) is, you guessed it, patched to the desired CRTC start address byte (yes, I used self-modifying code in the v-instructions as well as in the 8088 instructions).

The main v-instruction routine runs 50 times per second and updates the frequency and waveform data in the actual mixing routine. It also fetches more song data from the pointer ES:0 (and increments ES). This means that we can burn through just 800 bytes of data per second for our song instead of 2000, dramatically reducing the memory usage. Only 12 of the 16 bytes in the paragraph are used for the actual musical data, the other 4 bytes hold the address of another subroutine to set the stack pointer to, and an argument for that routine. These "h-instruction" subroutines do things like printing a character on the screen, changing the cursor position for the print routine, changing the CRTC start address (for hardware scrolling) and finishing the routine (by patching the final "jmp looptop" instruction. So there's 3 layers of code here: the actual 8088 mixer code, the v-instruction code in the stack and the h-instruction code interspersed into the song data.

Somewhat surprisingly, there's still plenty of v-instruction time left - even the longest of these takes only 119 samples of the 331 available. So the routine actually ends up using only about 87% of the available time (less when just scrolling slowly without printing any extra characters). Getting the routine to do more is tricky, though - using more than 4 bytes per 20ms frame for the non-song stuff would probably involve moving through the song data at a non-integer number of segments per frame. Also, the fact that the only persistent registers available to v-instructions are ES and AH (though AL and BX can be stomped) makes writing new v-instructions tricky. More can certainly be done, though.

You might notice that the "mov cl,99" instruction directly follows the instruction that patches it ("pop word[cs:bx]"). This means that the 8088 will execute the unpatched version of the instruction (as it's already in the prefetch queue by the time it's patched). The v-instruction program generation macros take this into account. In theory this instruction could be anywhere between the "loop v" instruction and the "jmp loopTop" instruction, but in practice if I put it anywhere except where it is, the routine ends up taking more than 288 CPU cycles. For debugging in DOSBox (which executes the patched version) I have a debug switch which moves the instruction before the pop (the timing is wrong on DOSBox anyway, but I can at least test functional changes that way). Debugging is still tricky, though - breakpoints don't work so well when your entire program is executed from the same 15 byte memory region!

Some pre-processing of the source .mod file is performed on a modern computer before transferring it to the old PC - resampling the looping samples to 256 bytes and changing the note and effect data into frequencies and waveform numbers in the 800 bytes per second format. The .mod interpreter is based on a older version of PT2PLAY. The source is here and there is a compiled binary for Win32 here. Only mode 1 works at present. All Protracker effects should work with the exception of 9xx (set sample offset), E0x (hardware filter), E9x (retrigger) and CIA timer modes. The latter could be accomplished to some extent by adjusting the number of samples per tick.

As well as generating the data for the routine, mod_convert generates a .wav so that musicians can hear how their work sounds on the PC without actually having to have one. To do this, the program generates a 1-bit waveform at 1.193MHz and then resamples it to 44.1kHz, so even the high frequency carrier wave is reproduced. It's essentially a little emulator of the PC speaker circuit.

One nice feature of this routine that I didn't find a way to use in "8088 MPH" is ring modulation. If the second "mov bl,99" instruction is patched to be "mov bl,al" then the output from the first channel is used as the second channel's waveform number. If the waveform numbers in slots 1..18 are all the same basic waveform multiplied by a suitable set of amplitudes, then the output of the second channel is ring modulated by the first!

One other nice little finishing touch with the credits part is the way that it exits to DOS at the end, with the "A:\>_" prompt right after. This is a fully functioning DOS prompt, not a mockup. The idea is that the "What can you do?" line is immediately followed by the prompt to actually do it! In order to get this to work, the screen had to be in the "unscrolled" location at when the demo exits (DOS and the BIOS scroll the screen by shifting video RAM data, not by changing the CRTC start address). That's easy enough then, just subtract the number of character positions that we scroll from 8192 and program that value into the start address initially. We also need to move our initial write pointer and VileR's awesome background 40-column ANSI art so that they start in the right place. I accomplished this without needing to add code to handle wrapping, by taking advantage of the fact that the CGA ignores address bit 14, causing physical addresses 0xB8000-0xBBFFF to be mirrored in the address range 0xBC000-0xBFFFF. This turns out to be another emulator-breaking change, though (at least in DOSBox)!

The source for the player itself can be found on my github.

Posted in computer | 10 Comments »

1K colours on CGA: How it's done

Wednesday, April 8th, 2015

[Update: VileR's writeup of the 1K colour mode is now up. His has fewer technical details but is much easier to understand than mine as it has pictures!]

When displaying graphics on an original IBM Color Graphics Adapter (CGA), normally only 4 colours (from a palette of 16) are possible at once. A few games written for such systems took advantage of the artifacting on the card's NTSC composite output to get 16 colours at once. On Saturday, a team of people including myself, Trixter, Scali and VileR released a demo ("8088 MPH") which smashed this limit and won first place in the "Oldskool Demo" compo at the Revision 2015 demoparty in Saarbrücken, Germany. Some commenters have suggested that the production is a fake and that what we claimed to have done is impossible. Others have suggested it's dithered or flickered to get more colours. But it is none of these things. Here is how we did it.

First of all, what defines a colour on the composite output? There's only one signal line on a composite connection (plus a ground return path) so you can't have separate red, green and blue analog levels like you have on a VGA card (or separate red, green, blue and intensity lines like you have on an RGBI connection.) Instead, a composite signal effectively sequences the red, green and blue signals in time. A composite colour is a signal which repeats at a frequency of 3.57MHz (half of the width of a text character in 80-column mode). Given such a signal, you can compute its DC component (average voltage), oscillation amplitude (about this average) and phase (relative to the color burst pulse at the start of each scanline). These three parameters directly correspond to brightness (luminance), saturation and hue respectively. Higher frequencies (2nd and greater multiples of 3.57MHz) are not involved in colour decoding and would normally have been filtered out by the decoding circuitry in the composite monitors CGA cards would have been connected to in 1981.

The most common CGA composite mode works by putting the card in 1 bit-per-pixel (1bpp) mode - i.e. each pixel is either off (black) or on (white, generally, though this could be changed via the palette register). A single period of color carrier oscillation contains 4 pixels in this mode (the pixel rate is 14.318MHz), so there are 16 possible waveforms you can make with patterns of lit and unlit pixels and hence 16 artifact colours.

Separately to artifact colour, the CGA card has 16 "direct" colours (the ones that are available in text modes). These are just the 16 possible RGBI bit patterns on an RGBI output, but how does the card generate these colours on the composite output? It does so by generating 3.57MHz waveforms on the card at 6 different phases using flip-flops. These are the colours blue, green, cyan, red, magneta and yellow. Including the constant digital signals 0 and 1 (GND and +5V) gives 8 basic colours. To get the intense versions of these, an additional DC offset is applied when the digital signal is turned into an analogue one at the output.

The 1K colour trick hinges on noticing that the direct colours are not the same as the artifact colours. In 1bpp mode you can change the palette register to get different sets of artifact colours. Suppose you change the palette register to blue - then any black pixel will "turn off" the corresponding part of the "blue" waveform. These "chopped up" colours are different yet again from the 16 direct colours and the 16 normal artifact colours. So you could get 256 colours that way (though you can't put them wherever you like because there are limits to how often you can change the value in the palette register).

Suppose that in 1bpp mode you had a second palette register so that you could change the colour corresponding to the 0 bit as well as the one corresponding to the 1 bit. Then using the same techniques you could generate 2K colours (16 foreground colours, 16 background colours and 16 bit patterns for choosing which colour goes where - but swapping foreground and background and inverting the bit pattern yields the same colour). Here we come to the crucial part of the trick: in text mode you can kind of do that - the attribute byte for a character (when flashing is disabled) lets you choose the foreground and background colours independently. Unfortunately you don't get to choose the bit patterns you want - those are defined by the bits in the CGA's character ROM, which can't be changed from software.

VileR is the one who deserves credit for the next observation. He pointed out to me that the characters 'U' (capital letter U, 0x55) and '‼' (double exclamation mark, 0x13) both have bit patterns in their top two rows 11001100 and 01100110 respectively) which are the same for the left nybble as for the right nybble, and the same in both rows. Therefore, if we change the number of scanlines per character row to 2 (as is done in a number of other CGA games to get a 160x100x2 mode using a "vertical half solid" character - 0xDD or 0xDE) we should be able to get ~500 colours (2 useful characters times 16 foreground colours times 16 background colours) at a resolution of 80x100.

In order to get from there to 1024 colours we need to find some more characters with the same properties as 0x55 and 0x13. It would be fantastic if there happened to be for every nybble value X a character with that bit pattern in its top 4 nybbles, but unfortunately only the nybble patterns 1100 and 0110 are obtainable that way. However, if we consider just the top scanline instead of the top two, we find two more characters with the right property - '░' (light shade, 0xb0, bit pattern 00100010) and '▒ ' (medium shade, 0xb1, bit pattern 01010101). Unfortunately the second scanlines of these characters don't play ball, and if we tried to use them with 2 scanlines per row we'd get horizontal stripes instead of solid-coloured pixels.

So to get those extra colours we need to use 1 scanline per row. However, there's are several complications in doing so. One is that the CRTC on the CGA card (Motorola MC6845) cannot generate more than 128 rows (plus up to 32 extra scanlines) per frame and we need to generate 262 scanlines per frame in order to maintain the correct ratio of hsync pulses to vsync pulses that the monitor requires to generate a stable picture.

It is possible to do this, though, by generating multiple CRTC frames per CRT frame (and suppressing the vsync pulse for all but one of them). This is how we generated the wide picture before the credits part (the one with our faces) - in that image there's a 100 scanline frame with 1 scanline per row and immediately below it a 162 scanline frame with 2 scanlines per row.

But there were several 1K colour images in the demo that filled the entire screen - how did we do those? The answer is very similar but instead of having one frame 100 scanlines high, we have 100 frames that are 2 scanlines high (all with 1 scanline per row). In the middle of each of these frames the memory address is advanced by one row by the CRTC. In each frame we advance the CRTC start address register by one row's worth of characters, so that the top row of one frame is the same as the bottom row of the frame above it. So each frame straddles two pixel rows and each 2-scanline-high "pixel" straddles two CRTC frames.

So we're done, right? That's all there is to the trick? Well, not quite - there are more complications. If you do the obvious thing and set 80-column text mode, colour burst enabled via the BIOS, you will see either no colours at all on your composite display or colours that flash in and out and change hue (on monitors that don't have a properly functioning colour-killer circuit). The reason for this is that the CGA card was never designed to be used in 80-column text mode with composite colour display (the text doesn't have enough horizontal resolution to be readable) and there's a hardware bug that prevents it from working properly anyway.

The bug is that the CGA card takes the horizontal sync (hsync) signal from the CRTC (which just goes high and low once per scanline) and uses it to trigger a more complicated composite pulse signal consisting of front porch, sync, breezeway, color burst and back porch. The whole process takes 10 character periods in modes other than 80-column text (-HRES modes) so the BIOS programs the CGA's hsync width register to 10. But in +HRES (80-column text) mode these 10 characters are half the width, so the hsync process gets interrupted half way through leading to a truncated sync pulse and no burst at all.

This is well-known and the usual way of dealing with the problem is to set the border colour (palette register) to 6 (dark yellow - not brown as it is on the 5153 RGBI TTL monitor) so that the monitor picks up its color burst from the border instead. However, on our hardware we found that doing this made +HRES modes significantly darker than -HRES modes. This is because monitors and capture devices calibrate their gain to normalize the amplitude of the burst pulse, and colour 6 is brighter than the normal burst pulse. Not all of the demo uses +HRES mode and we found that we could not use a single set of calibration settings for both -HRES and +HRES parts - if we tried then either the +HRES parts were too dark or the -HRES parts were washed out, leaving colours 9-15 barely distinguishable shades of white. We didn't really want to have to edit our capture to brighten up just some parts of the demo. Another problem was that both of the capture devices we had brought with us to the party were giving a shimmery picture (unstable horizontal sync) with this fix.

Instead what we ended up doing is leaving the border colour as black but increasing the horizontal sync width to its maximum value of 16 characters (programmed as zero, which looks wrong, but it's a 4-bit register and the compare is done after the increment - at least on the MC6845 CRTCs on the CGA cards we were using). This gives a burst of either half or three-quarters the standard width (depending on whether the character it starts on corresponds to a rising or falling edge of the CGA's internal +LCLK signal that is used to time the hsync sequence. I think we managed to arrange it so that it's always three-quarters but there may be bugs in that part of the code.

That fixes the brightness problem but unfortunately some capture devices (including the one that Trixter used to do some test/failsafe captures before the party) cope less well with this than with the border colour 6 change. If we release a "final version" (with a few minor improvements and bug fixes) we might include a "calibration screen" that people can use to choose the border colour, hsync width and phase that works best with their output device.

Yet another complication is that there were multiple revisions of the IBM CGA card. They had (to a good approximation) the same standard composite artifact colours but different direct colours. On the older CGA cards, colours 1-6 were all the same brightness, as were colours 9-14. This made them indistinguishable on monochrome composite monitors, so for the second revision of the CGA card, IBM added some more resistors to the output DAC in order to make different colours different brightnesses. They also removed the -BLANK signal so that the burst pulse is the same amplitude no matter whether it comes from border colour 6 or from the hsync burst (the truncated burst problem is still present, though).

Different direct colours mean that our 1K colour mode displays a different set of colours on old CGA cards as on new CGA cards. We debated a bit about whether we should target old or new CGA cards for our demo, but in the end we decided to go for old CGA cards, mainly because the set of colours you can get from an old CGA card are more useful (artistically speaking) than those from a new CGA card.

In order to make the hand-drawn 1K colour pictures, VileR and I made some test captures of the old CGA card's output with all the useful combinations of attribute and character, which he then used as a palette to paint his pictures. Happily he was able to find in there some close correspondances to the 16 RGBI colours and the 16 colours of the Commodore 64's palette.

For the pictures that were converted from photographs, I wanted to be able to use more characters than just 0x13, 0x55, 0xB0 and 0xB1 - I wanted to be able to try all different characters (even those that have different left and right nybbles in their top scanline) to get a closer match to the source image. However, getting calibration images for all 65536 combinations (let alone the 4 billion artifacts that can be generated from adjacent characters) was impractical. To make that work, I really needed to have a mathematical model of the CGA's composite output stage that I could use to generate the right colours. Ideally I would be able to generalize this to new CGA as well.

My first attempt at this was the one I used for the Hydra image - I assumed that the direct colours had hue/phase angles that were multiples of exactly 45 degrees, and that the CGA's pixel colour multiplexer chip was able to switch instantaneously between them. However, the hydra didn't come out looking how I expected on real hardware. Much later, I learnt that the main reason for this is that the TTL logic chips used on the CGA card don't switch instantaneously - there are logic delays between a signal coming in to an input pin and the corresponding change happening on the output pin. When your color carrier period is 279ns, a delay of just 7ns causes a noticable phase shift of 9 degrees.

There are several logic chips on the various signal paths of interest here, all with their own logic delays. My second attempt at modelling the CGA involved looking up the data sheets for all these chips, finding typical values for the logic delays (most of them were listed as a range) and generating an accurate model that way. This worked excellently for 1bpp mode, reasonably well for 2bpp mode, and not so well at all for +HRES mode. This is the implementation that is in the current SVN versions of DOSBox. I kept adding more and more parameters to my model and attempted to tune them to match my captured calibration images but I could not get good results that way. The trouble seemed to be in the guts of the multiplexer chip itself - the output signal depends in a complicated and mysterious way on all of the input signals, so the number of parameters required to describe its behavior quickly becomes impractical.

The final breakthrough came when I realized that I didn't need to model the composite signal *exactly* - I just needed to model it well enough to describe the observed colours. All the relevant colour information is at frequencies below 7.16MHz. By the sampling theorem, if we can reconstruct a version of the signal sampled at 14.318MHz, it'll be exactly correct not including frequencies at or above 7.16MHz (which we don't care about). The key insight is that we don't care about what happens to the signal *in between* those samples - it can bounce around, transition as slow or fast as you like as long as we know where it ends up when we measure the sample - all that extra freedom just manifests in the frequencies we don't care about.

The multiplexer takes a while to transition from one colour to another - on the order of 70ns (one 1bpp pixel time). So there isn't a place in the signal that we can sample and be sure that the previous transition has stopped and the next transition has not yet started. I theorized that at any given time there will not be more than one transition taking place. So a transition (and hence a sample) can be completely described by 1024 parameters - one for each combination of left colour (16 possibilities), right colour (16 possibilities) and phase within the color carrier cycle (4 possibilties).

I made a test pattern which does a very good try at getting swatches of all 4096 foreground/background/pattern combinations in just a couple of screenfuls (some can only be obtained for a short stretch, as transitions). This was quite a feat in itself - I needed an area of screen consisting of scanline 3 of several characters repeating vertically, necessitating having four CRTC scanlines within a single CRT scanline - a hairier CRTC manipulation trick than any that we actually used in 8088 MPH itself!)

I set up a model with these parameters and tried to match it to my captures. I initially tried to use a gradient-ascent hill-climbing algorithm to search the parameter space but before I could get it to work I realized that most of the parameters affected very few of the test swatches - any transition between two different colours X and Y can only affect colours with X and Y as foreground and background colours (32 of the 4096). If the left colour and the right colour are the same then any swatch with that colour as either foreground or background can be affected (496 of the 4096). That observation made a more naive hillclimbing algorithm much more practical, and it just takes a few minutes to find a set of 1024 parameters that match the measured values to within a few percent.

I wanted to model the new CGA as well as the old CGA, but didn't have an easy way to get good captures for that. However, the difficult bit of the old CGA to model (the multiplexer) is identical in the new CGA. So instead of applying the technique described above to the final output, I applied it to the multiplexer output and the intensity bit output separately. This yields a 256-element table and a 16-element table respectively. The outputs from these are summed to get the final output. There was a small amount of degradation from the 1024-element table version but it's too small to notice directly. To generate new CGA output, I just duplicated the intensity bit logic and applied it to the R, G and B bits (with appropriate scaling based on the resistor values in IBM's schematics). I haven't yet tested if this really matches new CGA output, but I don't currently know of any reason why it wouldn't.

This CGA simulation algorithm is implemented in a program I made called CGA2NTSC which has two main functions. One is to act as a CGA composite output stage emulator and NTSC decoder (taking a picture such as one might find on an RGBI monitor and show what it would look like on the composite output (old and new CGA). The other is to take a (24-bit colour) input image and try to find a set of data which, when loaded into the CGA's video memory, will best reproduce that image on the composite output. This is what we used to make the faces picture in the demo. It supports 1bpp and 2bpp modes as well as both text modes (though we only used it in +HRES mode for 8088 MPH). The program uses error diffusion (which can be turned down or off). I've had a couple of requests to make it use ordered dithering instead. That's possible for 1bpp and 2bpp modes but doesn't really make sense for text modes where you don't get an arbitrary choice of bit patterns. The program is a bit unpolished but should be reasonably usable.

Next time: how I played a 4 channel MOD at a sample rate of 16.6kHz through the PC speaker on a 4.77MHz 8088 CPU.

Posted in computer | 42 Comments »

Multiplying faster with squares

Tuesday, October 30th, 2012

The 8088 has a hardware multiplier, but it's quite slow:

MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0
MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0

Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there's either some very weird behavior going on with the 8088's multiplier or that there's a bug in my timing program (possibly both).

Can we beat the 8088's hardware multiplier with a software routine? There's a clever trick which can sometimes be used to speed up multiplications, based on the following:

(a+b)² = a² + b² + 2ab
(a-b)² = a² + b² - 2ab

Subtracting these gives:
4ab = (a+b)² - (a-b)²

Or:
ab = (a+b)²/4 - (a-b)²/4

So if we keep in memory a table of x²/4 we can do a multiply with an add, two subtractions and two table lookups.

Does this still work if the entries in the table are fractional? Yes, it does, and here's why. x²/4 is an integer for even x, and for odd x it will always be an integer plus a quarter, as a quick induction will show. (a+b) and (a-b) are always both odd or both even (since they differ by 2b which is always even) so the quarters always cancel and we can just ignore them.

In 8088 assembly code, this can be written like this (assuming the operands are in ax and cx and that we don't care about the high word of the result):

  xchg ax,bx
  mov si,offset table
  shl bx,1
  shl cx,1
  add bx,cx
  mov ax,[si+bx]
  sub bx,cx
  sub bx,cx
  sub ax,[si+bx]

That works out to about 88 cycles (93 with DRAM refresh) - already quite an improvement over the hardware multiplier. We can do even better, though - if we place our table of squares at DS:0 and assume that our operands are in si and bx we get:

  shl bx,1
  shl si,1
  mov ax,[si+bx]
  neg bx
  sub ax,[si+bx]

Which is just 56 cycles (59 with DRAM refresh) - more than twice as fast as the hardware multiplier!

For byte-sized multiplies we can do something similar in just 34 cycles (36 with DRAM refresh):

  mov al,[si+bx]
  neg bx
  sub al,[si+bx]

However, it's less important for byte-sized multiplies since we can fit an entire 8-bit multiply table in memory and do:

  xchg bh,bl
  mov al,[si+bx]

Which is only about 20 cycles (21 with DRAM refresh). There's obviously a bit of a memory-usage vs time tradeoff there though.

There are some more nice things about these algorithms:

There's more choices about which registers to use for the operands and results.
If you're doing a lot of multiplies and want the results of them all scaled by a constant factor, you can build this into the table.
If you're summing the results of several multiplies (e.g. computing a dot product) you can do it right in the accumulator (just change the "mov" to an "add".

The downsides (apart from the memory usage and time to initialize the table) are that you don't get the high byte/word for free and that it only works for signed multiplies.

Posted in computer | 5 Comments »

How to get away with disabling DRAM refresh

Monday, October 29th, 2012

On the original IBM PC 5150 (and it's mostly electrically-equivalent derivatives, the 5155 and 5160) the operation of the bus (the data channel between the CPU and it's memory and peripherals) is interrupted is interrupted 2,187,500 times every 33 seconds (a rate of about 66KHz) for 11/13,125,000 of a second each time (i.e. 4 out of every 72 CPU cycles). During that time, very little of the machine can operate - no RAM can be read or written and no peripherals can be accessed (the CPU might be able to continue doing it's thing if it's in the middle of a long calculation, and the peripherals will continue to operate - it's just that nothing can communicate with each other).

Why does this happen? Well, most computers (including the one this blog post is about) use DRAM (Dynamic RAM) chips for their main memory, as it's fast and much cheaper than the slightly faster SRAM (Static RAM) chips. That's because each DRAM bit consists of a single capacitor and transistor as opposed to the 4 or more transistors that make up a bit of SRAM. That capacitor saves a lot of hardware but it has a big disadvantage - it discharges with time. So DRAM cells have to be "refreshed" periodically (every 2ms for the 16 kbit 4116 DRAMs in the original 5150) to maintain their contents. Reading a bit of DRAM involves recharging the capacitor if it's discharged, which refreshes it.

But a computer system won't generally read every bit of RAM in any given interval. In fact, if it's sitting in a tight idle loop it might very well not access most of the memory for many minutes at a time. But we would be justified in complaining if our computers forgot everything in their memories whenever they were left idle! So all general-purpose computers using DRAM will have some kind of circuitry for automatically accessing each memory location periodically to refresh it. In the 5150, this is done by having channel 1 of the 8253 Programmable Interval Timer (PIT) hooked up to the Direct Memory Access (DMA) controller's channel 0. The BIOS ROM programs the PIT for the 66KHz frequency mentioned above, and programs the DMA controller to read a byte each time it's triggered on channel 0. The address it reads counts up from 0 to 65,535 for each access, then goes back down to 0 again.

If the DRAM needs to be refreshed every 2ms why does the refresh circuit run at 66KHz and not 500Hz, or for that matter 8.192MHz? To answer that question, we need to know something about how the memory is organized. The original 5150 had banks of 8 chips (plus a 9th for parity checking). Each chip is 16 kbit, so a bank is 16KBytes. If you had a full 640KB of RAM organized this way, that would be 40 banks or 360 separate chips! (By the time that much memory become common, we were mostly using 64 kbit chips though.) Within each chip, the 16 kbits are organized in a grid of 128 "rows" and 128 "columns". To read a bit, you input the "row" address, then the "column" address, then read back the result (hence the chips could have just 16 pins each, as each address pin corresponds to both a "row" bit and a "column" bit). Happily, whenever a row is accessed, all the DRAM cells on that row are refreshed no matter what column address is ultimately accessed. Also, the low 7 bits of the physical byte address correspond to rows and the next 7 bits correspond to columns (the remaining 6 bits correspond to the bank address). So actually you could quite happily get away with just refreshing addresses 0-127 instead of addresses 0-65,535 on this machine (though there was a good reason for doing so, as we'll see later).

To ensure that they meet tolerances, electronic components (including DRAM chips) are manufactured with certain margins of error, which means that often one could get away with reprogramming the PIT to reduce the DRAM refresh rate a bit in order to squeeze a little bit more performance out of these old machines - it was a common hack to try, and I remember trying it on the family computer (an Amstrad PC1512) after reading a little bit about DRAM refresh in a computer magazine. I seem to recall that I got it up from the standard 1/18 to maybe 1/19 or 1/20 before it became unstable, but the performance improvement was too small to notice, so the little .COM file I made with DEBUG never made it as far as my AUTOEXEC.BAT.

For many of the timing experiments and tight loops I've been playing with on my XT, I've been disabling DRAM refresh altogether. This squeezes out a bit more performance which is nice but more importantly it makes the timings much more consistent (which is essential for anything involving lockstep). However, whenever I've told people about this the reaction is "doesn't that make the machine crash?" The answer is "no, it doesn't - if you're careful". If you turn off the refresh circuitry altogether you have to be sure that the program you're running accesses each DRAM row itself, which happens automatically if you're scanning through consecutive areas of RAM at a rate of more than 66KB/s, or for that matter if you've done enough loop unrolling that your inner loop covers more than 127 consecutive bytes of code. Since these old machines don't have caches as such, unrolled loops are almost always faster than rolled up ones anyway, so that's not such a great hardship.

Not all of the machines I'm tinkering with use 4116 DRAM chips. Later (64KB-256KB) 5150 motherboards and XTs use 4164 (64 kbit) chips, and modified machines (and possibly also clones) use 41256 (256 kbit) chips. The principles are exactly the same except these denser chips are arranged as 256x256 and 512x512 bits respectively, which means that there are 8 or 9 row bits, which means that instead of accessing 128 consecutive bytes every 2ms you have to access 256 consecutive bytes every 4ms or 512 consecutive bytes every 8ms respectively (the PIT and DMA settings were kept the same for maximum compatibility - fortunately the higher density DRAMs decay more slowly so this is possible). So when disabling DRAM refresh, one should be sure to access 512 consecutive bytes every 8ms since this will work for all 3 DRAM types.

The cycle-exact emulator I'm writing will be able to keep track of how long it's been since each DRAM row has been refreshed and will emit a warning if a row is unrefreshed for too long and decays. That will catch DRAM refresh problems that are missed due to the margins of error in real hardware, and also problems affecting only 41256 chips (my machine uses 4164s).

Modern PCs still use DRAM, and still have refresh cycles, though the overhead has gone down by an order of magnitude and the exact mechanisms have changed a few times over the years.

Posted in computer, emulation | 10 Comments »

Variadic templates in ALFE

Wednesday, October 24th, 2012

Some of the types in ALFE are templates which can accept different numbers of parameters - in other words, variadic templates. How does this fit in with the Kind system I wrote about before?

As well as the Kind constructors "<>" (template) and "" (type), we need a third Kind constructor which represents zero or more Kinds in a sequence suitable for template arguments. Let's call this Kind constructor "...". As with C++ templates, we should allow the members of the sequence to have arbitrary kind. So <...> is the kind of a template that takes zero or more arguments whose kinds are type, while <<>...> is the kind of a template that takes zero or more arguments whose kinds are <> (a single argument template whose argument is of kind type). The kind of Function is <, ...> (the comma is required because Function has one or more arguments rather than zero or more). More complicated kinds like <<...>, ...> are also possible.

To actually create a variadic template, we need to implement "recursive case" and "base case" templates and have the compiler choose between them by pattern matching, just as in C++. So the definition of Tuple might look something like:

Tuple<@T, ...> = Structure { T first; Tuple<...> rest; };
Tuple<> = Structure { };

Is that enough? Well, I believe this gives ALFE's kind system parity with C++'s. However, there is one more possible Kind constructor that I can imagine - the Kind constructor which can represent any Kind at all - let's call it $. A template with an argument of this kind (i.e. has kind <$>) can be instantiated with an argument of any kind. So you could have:

Foo<@T> = { ... };
Foo<@T<>> = { ... };
Foo<@T<@R<$>>> = { ... };

The first template is used when Foo is instantiated with a type, the second is used when it's instantiated with a template of kind <> and the third is used when it's instantiated with any other single-argument template, no matter what the kind of the argument is. The third template could then pass R as an argument to another instantiation of Foo and thereby "peel apart" the kind of the argument (as long as none of the kinds involved have multiple arguments).

By combining $ with ... I think we could potentially peel apart completely arbitrary kinds. However, I'm not sure if this is worth implementing since I can't think of any use for such a thing. Still, it's good to have some idea about how to proceed if I do eventually come across such a use!

Edited 31st January 2016: Since writing the above, I have realized that the concept of the $ Kind specifier has a fatal flaw. Just because our Foo template has been instantiated with a template of a particular kind, does not mean that we have any way of obtaining type constructors of suitable kinds for instantiating that template. We can't pass R as an argument to another instantiation of Foo because we have no way to find such a thing to pass - all we have is a template that we know takes such a thing. There might be no suitable type constructors in the entire program! So there is really nothing we can do with our T here at all - we can't instantiate it (even partially) since we don't know the kind of its first argument. We could pass it to another template, but that just moves the problem around without making any progress on it.

Posted in computer, language | No Comments »

Classes vs structures in ALFE

Tuesday, October 23rd, 2012

A lot of the time in my C++ code I find myself writing class hierarchies. However, because of the way that inheritance works in C++, you need to use a pointer to point to an object which is of a particular type or some subtype of that type. In my code these pointers tend to be reference counted smart pointers to alleviate thinking about lifetime issues but the principle is the same.

Then, because I don't want to have to keep typing "Reference foo" everywhere, I encapsulate these smart pointers into value classes which provide functions that call the (virtual) functions in the implementation class. So I end up with code like this:

class Foo
{
public:
    Foo() : _implementation(new Implementation) { }
    void frob() { _implementation->frob(); }
protected:
    Foo(Implementation* implementation) : _implementation(implementation) { }
    class Implementation : public ReferenceCounted
    {
    public:
        virtual void frob() { ... }
    };
private:
    Reference<Implementation> _implementation;
};
 
class Bar : public Foo
{
public:
    Bar() : Foo(new Implementation) { }
protected:
    class Implementation : public Foo::Implementation
    {
    public:
        virtual void frob() { ... }
    private:
        ...
    };
};

That's really more boilerplate than I want to write for every class. I'd really much rather write code that looks more like C# or Java:

class Foo
{
    public Foo() { }
    public virtual void frob() { ... }
};
 
class Bar
{
    public Bar() { }
    public override void frob() { ... }
    ...
};

So I'm thinking that in ALFE I should have some syntax that looks close to the latter and behaves like the former. I'm leaning towards making "Struct" behave like C++ struct and class (i.e. a value type), and "Class" behave like a C# class (reference semantics). So:

Foo = Structure : Reference<Implementation> {
    Foo() : base(new Implementation) { }
    Void frob() { referent()->frob(); }
access<Foo>:
    Foo(Implementation* implementation) : base(implementation) { }
    Implementation = Structure : ReferenceCounted {
        virtual Void frob() { ... }
    };
};
 
Bar = Structure : Foo {
    Bar() : Foo(new Implementation) { }
access<Bar>:
    Implementation = Structure : Foo.Implementation {
        virtual Void frob() { ... }
    access<>:
        ...
    };
};

is the same as:

Foo = Class {
    Foo() { }
    virtual Void frob() { ... }
};
 
Bar = Class : Foo {
    Bar() { }
    virtual Void frob() { ... }
access<>:
    ...
};

There is a comparison to be made here between POD (Plain Old Data) and non-POD classes in C++ (though it's not quite the same because the Implementation classes would be non-POD in C++).

Obviously there's a lot of details here which would need to be fleshed out but I think something along these lines would be useful.