Archive for the ‘Uncategorized’ Category

8086 microcode disassembled

Thursday, September 3rd, 2020

Recently I realised that, as part of his 8086 reverse-engineering series, Ken Shirriff had posted online a high resolution photograph of the 8086 die with the metal layer removed. This was something I had been looking for for some time, in order to extract and disassemble the 8086 microcode. I had previously found very high resolution photos of the die with the metal layer intact, but in those only half of the bits of the microcode ROM were readable. Ken also posted a high resolution photograph of the microcode ROM of the 8088, which is very similar but not identical. I was very curious to know what the differences were.

I used bitract to extract the bits from the two main microcode ROMs, and also from the translation ROM which maps opcode bit patterns onto positions within the main microcode ROM.

The microcode is partially documented in US patent 4363091. In particular, that patent has source listings for several microcode routines. Within these, there are certain patterns of parts of instructions which I was able to find in the ROM dump. This allowed me to figure out how the bit patterns in the ROM correspond to the operands and opcodes of the microcode instruction set, in a manner similar to cracking a monoalphabetic substitution cipher. My resulting disassembly of the microcode ROM can be found here and the code for my disassembler is on github.

This disassembly has answered many questions I had about the 8088 and 8086. The remainder of this post contains the answers to these questions and other interesting things I found in the microcode.

What are the microcode differences between the 8086 and the 8088?

The differences are in the interrupt handling code. I think it comes down to the fact that the 8086 does two special bus accesses to acknowledge an interrupt (one to tell the PIC that it is ready to service the interrupt, the second to fetch the interrupt number for the IRQ that needs to be serviced). These are word-sized accesses for some reason, so the 8088 would break them into four accesses instead of two. This would confuse the PIC, so the 8088 does a single access instead and relies on the BIU to split the access into two. The other changes seem to be fallout related to that.

Are the microcode listings in US4363091 accurate?

Mostly. There are differences, however (which added some complexity to the deciphering process). The differences are in the string instructions. For example, the "STS" (STOSB/STOSW) instruction in the patent is:

CR  S      D      Type  a     b     F
-------------------------------------
0   IK     IND    7     F1    1
1   (M)    OPR    6     w     DA,BL
2   IND    IK     0     F1    0
3                 4     none  RNI

In the actual CPU, this has become:

0   IK    -> IND       7   F1    RPTS
1   M     -> OPR       6   w     DA,BL
2   IND   -> IK        0   NF1      5
3   SIGMA -> tmpc      5   INT   RPTI
4   tmpc  -> BC        0   NZ       1
5                      4   none  RNI

The arrow isn't a difference - I just put that in my disassembly to emphasize the direction of data movement in the "move" part of the microcode instructions. Likewise, the "F1 1" in the patent listing is the same as the "F1 RPTS" in my disassembly - I have replaced subroutine numbers with names to make it easier to read.

The version in the patent does a check for pending interrupts in the "RPTS" routine, before it processes any iterations of the string. This means that if there is a continuous "storm" of interrupts, the string instruction will make no progress. The version in the CPU corrects this, and checks for interrupts on line 3, after it has done the store, allowing it to progress. This was probably not a situation that was expected to occur in normal operation (in fact, I seem to recall crashing my 8088 and 8086 machines by having interrupts happen too rapidly to be serviced). The change was most likely done to accommodate debugging with the trap flag (which essentially means that there is always an interrupt pending when the trap flag is set). Without this change, code that used the repeated string instructions would not have progressed under the debugger.
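The difference between the two orderings can be illustrated with a toy model. The Python sketch below is not a transcription of the microcode: the function name and its parameters are invented for illustration, and it simply assumes that an interrupt (or the trap flag) is either always pending or never pending whenever the routine checks.

```python
def iterations_before_suspend(check_interrupt_first: bool,
                              interrupt_pending: bool,
                              count: int) -> int:
    """Toy model of one (re)start of a repeated string instruction.

    Returns how many iterations complete before the instruction either
    finishes or is suspended to service an interrupt.
    check_interrupt_first=True models the patent listing (check in RPTS,
    before any store); False models the shipped CPU (check after the store).
    """
    done = 0
    while count > 0:
        if check_interrupt_first and interrupt_pending:
            break                      # suspended before doing any work
        count -= 1                     # one store plus the counter decrement
        done += 1
        if not check_interrupt_first and interrupt_pending:
            break                      # suspended, but one iteration completed
    return done


# With an interrupt always pending, the patent ordering never advances,
# while the shipped ordering advances one iteration per restart.
print(iterations_before_suspend(True, True, 10))   # 0
print(iterations_before_suspend(False, True, 10))  # 1
```

With no interrupt pending, both orderings run all iterations in one go, so the change only matters in the "storm" (or trap flag) case.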

How many different instructions does the 8086 have, according to the microcode? What are they?

The CPU has 60 instructions, and they're in a fairly logical sort of order:

(Numbers are: number of opcodes handled, size of top-level microcode routine.)

MOV rm<->r     4  3
LEA            1  1
alu rm<->r    32  4
alu rm,i       4  5
MOV rm,i       2  4
alu r,i       16  4
MOV r,i       16  3
PUSH rw        8  4
PUSH sr        4  4
PUSHF          1  4
POP rw         8  3
POP sr         4  3
POPF           1  3
POP rmw        1  6
CBW            1  2
CWD            1  7
MOV A,[i]      2  4
MOV [i],A      2  4
CALL cd        1  4
CALL cw        1  8
XCHG AX,rw     8  3
rot rm,1       2  3
rot rm,CL      2  8
TEST rm,r      2  3
TEST A,i       2  4
SALC           1  3
XCHG rm,r      2  5
IN A,ib        2  4
OUT ib,A       2  4
IN A,DX        2  2
OUT DX,A       2  2
RET            2  4
RETF           2  2
IRET           1  4
RET/RETF iw    4  4
JMP cw/JMP cb  2  6
JMP cd         1  7
Jcond         32  3
MOV rmw<->sr   2  2
LES            1  4
LDS            1  4
WAIT           1  9 (discontinuous)
SAHF           1  4
LAHF           1  2
ESC            8  1
XLAT           1  5
STOS           2  6 (discontinuous)
CMPS/SCAS      4 13 (discontinuous)
MOVS/LODS      4 11 (discontinuous)
JCXZ           1  5 (discontinuous)
LOOPNE/LOOPE   2  5
LOOP           1  4
DAA/DAS        2  4
AAA/AAS        2  8
AAD            1  4
AAM            1  6
INC/DEC rw    16  2
INT ib         1  2
INTO           1  4
INT 3          1  3

The discontinuous instructions were most likely broken up because they had bug fixes making them too long for their original slots. Similarly "POP rmw" appears to have been shortened by at least 3 instructions as there is a gap after it. Moving code around after it's been written (and updating all the far jump/call locations) would probably have been tricky.

Which instructions, if any, are not handled by the microcode?

There is no microcode for the segment override prefixes (CS:, SS:, DS: and ES:). Nor for the other prefixes (REP, REPNE and LOCK), nor the instructions CLC, STC, CLI, STI, CLD, STD, CMC, and HLT. The "group" opcodes 0xf6, 0xf7, 0xfe and 0xff do not have top level microcode instructions. So none of the instructions with 0xf in the high nybble of the opcode are initially handled by the microcode. Most of these instructions are very simple and probably better done by random logic. HLT is a little surprising - I really thought I'd find a microcode loop for that one since it only seems to check for interrupts every other cycle.

The group instructions are decoded slightly differently but the microcode routines handling them break down as follows:

INC/DEC rm        3
PUSH rm           4
NOT rm            3
NEG rm            3
CALL FAR rm       8
CALL rm           8
TEST rm,i         4
JMP rm            2
JMP FAR rm        4
IMUL/MUL rmb      8
IMUL/MUL rmw      8
IDIV/DIV rmb      8
IDIV/DIV rmw      8

Then there are various subroutines and tail calls (listed in translation.txt). Highlights:

  • interrupt handling (16 microinstructions)
  • sign handling for multiply and divide, flags for multiply (32)
  • effective address computation (16)
  • reset routine (sets CS=0xffff, DS=ES=SS=FLAGS=PC=0) (6)
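As a worked example of what those reset values imply: assuming the usual 8086 address calculation (segment shifted left by 4 bits, plus offset, truncated to 20 bits), the first instruction fetch after reset comes from physical address 0xFFFF0. The helper name below is hypothetical and only illustrates the arithmetic.

```python
def physical_address(segment: int, offset: int) -> int:
    """20-bit physical address from a 16-bit segment and 16-bit offset."""
    return ((segment << 4) + offset) & 0xFFFFF


RESET_CS = 0xFFFF  # value the reset routine loads into CS
RESET_PC = 0x0000  # value the reset routine loads into PC

print(hex(physical_address(RESET_CS, RESET_PC)))  # 0xffff0
```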

Does the microcode contain any "junk code" that doesn't do anything?

It seems to! While most of the unused parts of the ROM (64 instructions) are filled with zeroes, there are a few parts which aren't. The following instructions appear right at the end of the ROM:

A     -> tmpa      5   INT   FARCALL2      011100011.0110
[  5] -> [ a]      5   UNC   INTR     F    011100011.0111

There doesn't appear to be any way for execution to reach these instructions. This code saves AL to tmpa (which doesn't appear to then be used at all) and then does either an interrupt or (if an interrupt is pending) a far call. In the interrupt case it also does a move between a source and a destination that aren't used anywhere else (and hence I have no idea what they are). This makes me wonder if there was at one point a plan for something like an "INT AL" instruction. With the x86 instruction set we ended up with, such a thing has to be done using self-modifying code, a table of INT instructions, or faking the operation of INT in software.

The following code is also inaccessible and appears to do something with the low byte of the last offset read from or written to, and the carry flag:

IND   -> tmpaL     1   LRCY  tmpc     F      01010?10?.1010

No idea what that could be for - nothing else in the microcode treats the IND register as two separate bytes.

Are there any parts of the microcode that are still not understood?

When the WAIT instruction finishes in the non-interrupt case (i.e. by the -TEST pin going active to signal that the 8087 has completed an instruction) the microcode sequence finishes using this sequence:

                   4   [ 1]  none
                   4   none  RNI

I don't know what the "[ 1]" does - it isn't used anywhere else.

There is also a bit (shown as "Q" in the listings) which does not have an obvious function for "type 6" (bus IO) operations. This Q bit is only set for "W" (write) operations; in the listing, write operations without it are shown in lower case ("w"). There seems to be no pattern as to which writes use this bit. The string move instructions use it, as does the stack push for the flags when an interrupt occurs, and the push of the segment for a far call or interrupt (but not the offset). It would make sense if this bit was used to distinguish between memory and port IO bus accesses, but the CPU seems to have another mechanism for this (most likely the group decode ROM, which I have not decoded as there are too many unknowns about what its inputs and outputs are).

Are there any places where the microcode could have been improved to speed up the CPU?

Despite many of the instructions seeming to execute quite ponderously by the standards of later CPUs, the microcode appears to be very tightly written and I didn't find many opportunities for improvement. If the MOVS/LODS opcode was split up into separate microcode routines for LODS and MOVS, the LODS routine could avoid a conditional jump and execute 1 cycle faster. But there is only room for that because of the "POP rmw" shortening, which may have happened quite late in the development cycle (especially if it was a functional bug fix rather than an optimisation - optimisations might not have met the bar at that point).

There may be places where prefetching could be suspended earlier before a jump, but it's not quite so obvious that that would be an optimisation. Especially if the "suspend" operation is synchronous, and waits for the BIU to complete the current prefetch cycle before continuing the microcode program. And especially if that would make the microcode routine longer.

It would of course be possible to make improvements if the random logic is changed as well. The NEC V20 and V30 implement the same instructions at a generally lower number of cycles per instruction, but they have 63,000 transistors instead of 29,000 so probably have a much larger proportion of random logic to microcode.

Does the microcode have any hidden features, opcodes or easter eggs that have not yet been documented?

It does! Using the REP or REPNE prefix with a MUL or IMUL instruction negates the product. Using the REP or REPNE prefix with an IDIV instruction negates the quotient. As far as I know, nobody has discovered these before (or at least documented them).

Signed multiplication and division work by negating negative inputs and then negating the output if exactly one of the inputs was negative. That means that the CPU needs to remember one bit of state (whether or not to negate the output) across the multiplication and division algorithms. But these algorithms use all three temporary registers, and the internal counter, and the ALU (so the bit can't be put in the internal carry flag for example). I was scratching my head about where that bit might be kept. I was also scratching my head about why the multiplication and division algorithms check the F1 ("do we have a REP prefix?") flag. Then I realised that these puzzles cancel each other out - the CPU flips the F1 flag for each negative sign in the multiply/divide inputs! There's already a microcode instruction to check for that, so the 8086's designers just needed to add an instruction to flip it.
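The sign-tracking scheme described above can be sketched as a behavioural model. This Python is an illustration only, not the microcode: the function names are invented, the 16-bit operand width is chosen for the example, and flags and overflow behaviour are ignored.

```python
def to_signed16(x: int) -> int:
    """Interpret the low 16 bits of x as a two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def imul16(a: int, b: int, rep_prefix: bool = False) -> int:
    """Model of signed multiply using F1 as the 'negate the result' bit."""
    f1 = rep_prefix              # F1 starts out set if a REP/REPNE prefix was seen
    a, b = to_signed16(a), to_signed16(b)
    if a < 0:                    # negate a negative input and flip F1
        a = -a
        f1 = not f1
    if b < 0:
        b = -b
        f1 = not f1
    product = a * b              # the core algorithm only sees magnitudes
    if f1:                       # F1 set: negate the output
        product = -product
    return product


print(imul16(3, 4))                    # 12
print(imul16(-3, 4))                   # -12
print(imul16(3, 4, rep_prefix=True))   # -12
```

Because the prefix merely pre-sets the same bit that the sign handling toggles, a prefixed multiply of two same-signed inputs comes out negative, and one with mixed signs comes out positive.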

I was thinking the microcode instruction might set the F1 flag instead of flipping it - that would mean that you could get a (probably negated) "absolute value" operation (almost) for free with a multiply. But an almost-free negation is pretty good too - REP is a byte cheaper than "NEG AX", and with 16-bit multiplies the savings are even greater (it eliminates a NEG AX / ADC DX, 0 / NEG DX sequence). Still small compared to the multiply, but a saving nonetheless.
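For reference, here is why that three-instruction sequence negates the 32-bit value in DX:AX. The Python below is a sketch that models only the register values and the carry flag, assuming the documented behaviour that NEG sets the carry flag when its operand is non-zero; the function name is made up.

```python
def neg_dx_ax(dx: int, ax: int) -> tuple[int, int]:
    """Model NEG AX / ADC DX, 0 / NEG DX on 16-bit register values."""
    carry = 1 if ax != 0 else 0   # NEG AX sets CF unless AX was zero
    ax = (-ax) & 0xFFFF           # NEG AX
    dx = (dx + carry) & 0xFFFF    # ADC DX, 0
    dx = (-dx) & 0xFFFF           # NEG DX
    return dx, ax


# Negating 0x0001_0002 should give 0xFFFE_FFFE (two's complement, 32 bits).
dx, ax = neg_dx_ax(0x0001, 0x0002)
print(hex(dx), hex(ax))  # 0xfffe 0xfffe
```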

I contemplated using this in a demoscene production as another "we break all your emulators" moment, but multiplication and division on the 8086 and 8088 CPUs is sufficiently slow to be of limited use for demos.

The F1ZZ microcode instruction (which controls whether the REPE/REPNE SCAS/CMPS sequences terminate early) is also used in the LOOPE and LOOPNE instructions. Which made me wonder if one of the REP prefixes would also reverse the sense of the test. However, neither prefix seems to have any effect on these instructions.

Update 2nd January 2023

I've made a new version of the disassembly here incorporating some changes from the comments below. I have transcribed the group ROM, got rid of "NWB", added the RNI flag to W microinstructions, and changed XZC to ADC.

Comparison of CGA card versions

Monday, October 8th, 2012

Over at the Vintage Computer Forums I asked about the differences between CGA card versions.

The main change that occurred during the CGA's lifetime seems to be to do with composite output. In particular, the old CGA (part numbers 1804472 and 1501486) had the following formula for the composite output voltage: COMPOSITE = 0.72*CHROMA + 0.28*I. The new CGA (part numbers 1504910 and 1501981) has the formula COMPOSITE = 0.29*CHROMA + 0.1*R + 0.22*G + 0.07*B + 0.32*I. The consequences of this are more obvious on a monochrome monitor, since there CHROMA only makes 3 different shades of grey (0 for colours 0 and 8, 1 for colours 7 and 15 and 0.5 for all the others). So an old CGA will only yield 6 different shades of grey on a monochrome monitor, while on a new CGA the 16 different colours will yield 16 (theoretically) different shades of grey (though some of them may be very similar).
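The shade counts can be checked by evaluating both formulas for all 16 colours. The Python sketch below uses the coefficients and the monochrome CHROMA values given above; the function names are made up, and it assumes the usual CGA colour numbering (bit 3 = I, bit 2 = R, bit 1 = G, bit 0 = B).

```python
def chroma_mono(colour: int) -> float:
    """CHROMA level as seen by a monochrome monitor."""
    if colour in (0, 8):
        return 0.0
    if colour in (7, 15):
        return 1.0
    return 0.5


def old_cga(colour: int) -> float:
    i = (colour >> 3) & 1
    return 0.72 * chroma_mono(colour) + 0.28 * i


def new_cga(colour: int) -> float:
    i = (colour >> 3) & 1
    r = (colour >> 2) & 1
    g = (colour >> 1) & 1
    b = colour & 1
    return 0.29 * chroma_mono(colour) + 0.1 * r + 0.22 * g + 0.07 * b + 0.32 * i


def distinct_shades(formula) -> int:
    # Round to avoid counting floating-point noise as separate shades.
    return len({round(formula(c), 4) for c in range(16)})


print(distinct_shades(old_cga))  # 6
print(distinct_shades(new_cga))  # 16
```

Some of the new-style levels are very close together (for example colours 7 and 10 come out at 0.68 and 0.685), which matches the remark that some shades may be hard to tell apart.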

On a colour monitor, a new CGA will give a lower saturation than an old CGA, but the brightnesses of the different colours will seem more appropriate. On an old CGA, blue seems lighter than it should be while yellow seems darker - new CGA fixes that.

The other thing about new CGA is that its composite output is a better match to standard NTSC than old CGA's, which means that the results will be more consistent between different monitors. The old CGA's color burst has both too high of an amplitude and too high of a DC offset, which causes many NTSC output devices to reduce the gain, making the resulting image too dark (I have to turn the brightness and contrast right up to get a decent image from my 1501486 card on the TV I connect it to).

That's the theory - to check how well it works in practice, I'd like to do some side-by-side comparisons. Rather than trying to buy a new CGA card (which is likely to be expensive, might be unsuccessful and probably wouldn't work too well side-by-side with another CGA card in the same machine anyway), I want to make an add-on card for my old CGA card which adds a second composite output, with new-style colours. The differences between the two cards are localized to a small part of the circuit, so there isn't too much to duplicate.

The future of the past and the past of the future

Sunday, September 23rd, 2012

Today is the 50th anniversary of the first broadcast of the Jetsons. It's always fascinating to look at how people in the past imagined the future would be, and see how different their extrapolations were to how things actually turned out. There are certainly technologies and societal changes that have happened in the last 50 years that would have been impossible to predict.

Equally fascinating, I think, is to imagine how people in the future will think of our present. Okay, it's a rather different problem in that there will (hopefully!) be actual historical records of what life today is like (in fact, our present is probably the most well-documented historical period ever). Still, we surely have misconceptions today about what life was like in the past, and it's interesting to wonder what misconceptions the people of the future will have about us. What technologies that have yet to be invented will be so ubiquitous and game-changing that people will have real trouble imagining what life was like without them? What changes will happen to society which will make today seem unfathomably alien? Given enough time, I'm sure such changes are inevitable, so (despite the excellent records) I think it would be completely unsurprising if the people of tomorrow have some serious misconceptions about the people of today (especially amongst those who don't study the past for a living).

Credit card designed for internet shopping

Friday, July 23rd, 2010

It seems like it would be possible to make a great deal of money by creating a payment system better than credit cards. Paypal has come closest to doing this, but has a lot of problems.

How could one make credit cards better? It would be very difficult to get your payment system into as many retailers as Visa and Mastercard, so perhaps aiming for a niche would be a good idea. One such niche might be internet purchases. Develop a payment system/credit card designed specifically for internet purchases and it could be very popular.

One major difference between a payment system designed for the internet and one predating it is that one could authenticate the transaction, not the identity. Whenever you buy something online, you put in your card number as usual but then (unlike with normal credit cards) there is an extra step - you log into the payment system's website and approve the requested transaction. Until you have done that, the merchant doesn't get any money (and won't deliver the goods). This cuts out all "stolen card" type fraud, since the thief would also need to steal your payment system password (which never goes anywhere near the merchant). This would allow this payment system to undercut the existing credit card issuers and become competitive. Fraud caused by loss of the payment system password would be treated the same as fraud caused by the loss of any other online banking password (which I think varies from place to place).

This system would still need a "chargeback" mechanism to combat fraud from merchants (and mechanisms to combat fraudulent chargebacks) but my impression is that the costs of these are small compared to the "stolen card" costs.

With suitable cellphone applications for authorization, the system could even be used for brick-and-mortar stores as well.

Unlike Paypal, this system would actually be able to lend money like a credit card, and it wouldn't need to be linked to a bank account or credit card.