8086 microcode disassembled

Thursday, September 3rd, 2020

Recently I realised that, as part of his 8086 reverse-engineering series, Ken Shirriff had posted online a high resolution photograph of the 8086 die with the metal layer removed. I had been looking for such a photograph for some time, in order to extract and disassemble the 8086 microcode. I had previously found very high resolution photos of the die with the metal layer intact, but only half of the bits of the microcode ROM were readable in those. Ken also posted a high resolution photograph of the microcode ROM of the 8088, which is very similar but not identical. I was very curious to know what the differences were.

I used bitract to extract the bits from the two main microcode ROMs, and also from the translation ROM which maps opcode bit patterns onto positions within the main microcode ROM.

The microcode is partially documented in US patent 4363091. In particular, that patent has source listings for several microcode routines. Within these, there are certain recurring patterns of instruction parts which I was able to find in the ROM dump. This allowed me to figure out how the bit patterns in the ROM correspond to the operands and opcodes of the microcode instruction set, in a manner similar to cracking a monoalphabetic substitution cipher. My resulting disassembly of the microcode ROM can be found here and the code for my disassembler is on GitHub.

This disassembly has answered many questions I had about the 8088 and 8086. The remainder of this post contains the answers to these questions and other interesting things I found in the microcode.

What are the microcode differences between the 8086 and the 8088?

The differences are in the interrupt handling code. I think it comes down to the fact that the 8086 does two special bus accesses to acknowledge an interrupt (one to tell the PIC that it is ready to service the interrupt, the second to fetch the interrupt number for the IRQ that needs to be serviced). These are word-sized accesses for some reason, so the 8088 would break them into four accesses instead of two. This would confuse the PIC, so the 8088 does a single access instead and relies on the BIU to split the access into two. The other changes seem to be fallout related to that.

Are the microcode listings in US4363091 accurate?

Mostly. There are differences, however (which added some complexity to the deciphering process). The differences are in the string instructions. For example, the "STS" (STOSB/STOSW) instruction in the patent is:

CR  S      D      Type  a     b     F
-------------------------------------
0   IK     IND    7     F1    1
1   (M)    OPR    6     w     DA,BL
2   IND    IK     0     F1    0
3                 4     none  RNI

In the actual CPU, this has become:

0   IK    -> IND       7   F1    RPTS
1   M     -> OPR       6   w     DA,BL
2   IND   -> IK        0   NF1      5
3   SIGMA -> tmpc      5   INT   RPTI
4   tmpc  -> BC        0   NZ       1
5                      4   none  RNI

The arrow isn't a difference - I just put that in my disassembly to emphasize the direction of data movement in the "move" part of the microcode instructions. Likewise, the "F1 1" in the patent listing is the same as the "F1 RPTS" in my disassembly - I have replaced subroutine numbers with names to make it easier to read.

The version in the patent does a check for pending interrupts in the "RPTS" routine, before it processes any iterations of the string. This means that if there is a continuous "storm" of interrupts, the string instruction will make no progress. The version in the CPU corrects this, and checks for interrupts on line 3, after it has done the store, allowing it to progress. This was probably not a situation that was expected to occur in normal operation (in fact, I seem to recall crashing my 8088 and 8086 machines by having interrupts happen too rapidly to be serviced). The change was most likely done to accommodate debugging with the trap flag (which essentially means that there is always an interrupt pending when the trap flag is set). Without this change, code that used the repeated string instructions would not have progressed under the debugger.
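The difference between the two orderings can be illustrated with a toy model. This is only a sketch in Python, not a transcription of the microcode: the function name, its parameters and the "restart" loop are all made up for illustration, and it assumes an interrupt is pending every single time the routine checks (as with the trap flag set).

```python
def run_string_op(count, check_before_store, interrupt_always_pending=True,
                  max_restarts=10):
    """Toy model of a repeated store; returns how many stores completed.

    Each pass of the outer loop models the instruction being restarted
    after the interrupt handler returns. All names here are hypothetical.
    """
    done = 0
    for _ in range(max_restarts):
        while count > 0:
            if check_before_store and interrupt_always_pending:
                break  # patent-style ordering: interrupted before any store
            count -= 1
            done += 1  # the store for this iteration has happened
            if not check_before_store and interrupt_always_pending:
                break  # ROM-style ordering: interrupted only after the store
        if count == 0:
            break
    return done


print(run_string_op(5, check_before_store=True))   # 0: no progress at all
print(run_string_op(5, check_before_store=False))  # 5: one store per restart
```

With the check placed before the store, the count never decreases however many times the instruction is restarted. With the check after the store, every restart completes exactly one iteration, so the string operation eventually finishes.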

How many different instructions does the 8086 have, according to the microcode? What are they?

The CPU has 60 instructions, and they're in a fairly logical sort of order:

(Numbers are: number of opcodes handled, size of top-level microcode routine.)

MOV rm<->r     4  3
LEA            1  1
alu rm<->r    32  4
alu rm,i       4  5
MOV rm,i       2  4
alu r,i       16  4
MOV r,i       16  3
PUSH rw        8  4
PUSH sr        4  4
PUSHF          1  4
POP rw         8  3
POP sr         4  3
POPF           1  3
POP rmw        1  6
CBW            1  2
CWD            1  7
MOV A,[i]      2  4
MOV [i],A      2  4
CALL cd        1  4
CALL cw        1  8
XCHG AX,rw     8  3
rot rm,1       2  3
rot rm,CL      2  8
TEST rm,r      2  3
TEST A,i       2  4
SALC           1  3
XCHG rm,r      2  5
IN A,ib        2  4
OUT ib,A       2  4
IN A,DX        2  2
OUT DX,A       2  2
RET            2  4
RETF           2  2
IRET           1  4
RET/RETF iw    4  4
JMP cw/JMP cb  2  6
JMP cd         1  7
Jcond         32  3
MOV rmw<->sr   2  2
LES            1  4
LDS            1  4
WAIT           1  9 (discontinuous)
SAHF           1  4
LAHF           1  2
ESC            8  1
XLAT           1  5
STOS           2  6 (discontinuous)
CMPS/SCAS      4 13 (discontinuous)
MOVS/LODS      4 11 (discontinuous)
JCXZ           1  5 (discontinuous)
LOOPNE/LOOPE   2  5
LOOP           1  4
DAA/DAS        2  4
AAA/AAS        2  8
AAD            1  4
AAM            1  6
INC/DEC rw    16  2
INT ib         1  2
INTO           1  4
INT 3          1  3

The discontinuous instructions were most likely broken up because they had bug fixes making them too long for their original slots. Similarly "POP rmw" appears to have been shortened by at least 3 instructions as there is a gap after it. Moving code around after it's been written (and updating all the far jump/call locations) would probably have been tricky.

Which instructions, if any, are not handled by the microcode?

There is no microcode for the segment override prefixes (CS:, SS:, DS: and ES:). Nor for the other prefixes (REP, REPNE and LOCK), nor the instructions CLC, STC, CLI, STI, CLD, STD, CMC, and HLT. The "group" opcodes 0xf6, 0xf7, 0xfe and 0xff do not have top level microcode instructions. So none of the instructions with 0xf in the high nybble of the opcode are initially handled by the microcode. Most of these instructions are very simple and probably better done by random logic. HLT is a little surprising - I really thought I'd find a microcode loop for that one since it only seems to check for interrupts every other cycle.

The group instructions are decoded slightly differently but the microcode routines handling them break down as follows:

INC/DEC rm        3
PUSH rm           4
NOT rm            3
NEG rm            3
CALL FAR rm       8
CALL rm           8
TEST rm,i         4
JMP rm            2
JMP FAR rm        4
IMUL/MUL rmb      8
IMUL/MUL rmw      8
IDIV/DIV rmb      8
IDIV/DIV rmw      8

Then there are various subroutines and tail calls (listed in translation.txt). Highlights:

  • interrupt handling (16 microinstructions)
  • sign handling for multiply and divide, flags for multiply (32)
  • effective address computation (16)
  • reset routine (sets CS=0xffff, DS=ES=SS=FLAGS=PC=0) (6)
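The register values set by the reset routine determine where the CPU fetches its first instruction. A minimal worked example in Python, using the standard 8086 address formation (segment shifted left by 4 bits, plus offset, on a 20-bit bus); the function name is just for illustration:

```python
def physical_address(segment, offset):
    """8086 address formation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


# After reset: CS = 0xffff and PC = 0, as set by the reset routine.
print(hex(physical_address(0xFFFF, 0x0000)))  # 0xffff0
```

So the first fetch comes from 16 bytes below the top of the 1MB address space.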

Does the microcode contain any "junk code" that doesn't do anything?

It seems to! While most of the unused parts of the ROM (64 instructions) are filled with zeroes, there are a few parts which aren't. The following instructions appear right at the end of the ROM:

A     -> tmpa      5   INT   FARCALL2      011100011.0110
[  5] -> [ a]      5   UNC   INTR     F    011100011.0111

There doesn't appear to be any way for execution to reach these instructions. This code saves AL to tmpa (which doesn't appear to then be used at all) and then does either an interrupt or (if an interrupt is pending) a far call. In the interrupt case it also does a move between a source and a destination that aren't used anywhere else (and hence I have no idea what they are). This makes me wonder if there was at one point a plan for something like an "INT AL" instruction. With the x86 instruction set we ended up with, such a thing has to be done using self-modifying code, a table of INT instructions, or faking the operation of INT in software.

The following code is also inaccessible and appears to do something with the low byte of the last offset read from or written to, and the carry flag:

IND   -> tmpaL     1   LRCY  tmpc     F      01010?10?.1010

No idea what that could be for - nothing else in the microcode treats the IND register as two separate bytes.

Are there any parts of the microcode that are still not understood?

When the WAIT instruction finishes in the non-interrupt case (i.e. by the -TEST pin going active to signal that the 8087 has completed an instruction), the microcode routine ends with this sequence:

                   4   [ 1]  none
                   4   none  RNI

I don't know what the "[ 1]" does - it isn't used anywhere else.

There is also a bit (shown as "Q" in the listings) which does not have an obvious function for "type 6" (bus IO) operations. This Q bit is only set for "W" (write) operations; in the listing, writes without it are shown in lower case ("w") to distinguish them. There seems to be no pattern as to which writes use this bit. The string move instructions use it, as does the stack push for the flags when an interrupt occurs, and the push of the segment for a far call or interrupt (but not the offset). It would make sense if this bit was used to distinguish between memory and port IO bus accesses, but the CPU seems to have another mechanism for this (most likely the group decode ROM, which I have not decoded as there are too many unknowns about what its inputs and outputs are).

Are there any places where the microcode could have been improved to speed up the CPU?

Despite many of the instructions seeming to execute quite ponderously by the standards of later CPUs, the microcode appears to be very tightly written and I didn't find many opportunities for improvement. If the MOVS/LODS opcode was split up into separate microcode routines for LODS and MOVS, the LODS routine could avoid a conditional jump and execute 1 cycle faster. But there is only room for that because of the "POP rmw" shortening, which may have happened quite late in the development cycle (especially if it was a functional bug fix rather than an optimisation - optimisations might not have met the bar at that point).

There may be places where prefetching could be suspended earlier before a jump, but it's not quite so obvious that that would be an optimisation. Especially if the "suspend" operation is synchronous, and waits for the BIU to complete the current prefetch cycle before continuing the microcode program. And especially if that would make the microcode routine longer.

It would of course be possible to make improvements if the random logic is changed as well. The NEC V20 and V30 implement the same instructions at a generally lower number of cycles per instruction, but they have 63,000 transistors instead of 29,000 so probably have a much larger proportion of random logic to microcode.

Does the microcode have any hidden features, opcodes or easter eggs that have not yet been documented?

It does! Using the REP or REPNE prefix with a MUL or IMUL instruction negates the product. Using the REP or REPNE prefix with an IDIV instruction negates the quotient. As far as I know, nobody has discovered these before (or at least documented them).

Signed multiplication and division work by negating negative inputs and then negating the output if exactly one of the inputs was negative. That means that the CPU needs to remember one bit of state (whether or not to negate the output) across the multiplication and division algorithms. But these algorithms use all three temporary registers, and the internal counter, and the ALU (so the bit can't be put in the internal carry flag for example). I was scratching my head about where that bit might be kept. I was also scratching my head about why the multiplication and division algorithms check the F1 ("do we have a REP prefix?") flag. Then I realised that these puzzles cancel each other out - the CPU flips the F1 flag for each negative sign in the multiply/divide inputs! There's already a microcode instruction to check for that, so the 8086's designers just needed to add an instruction to flip it.
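The sign-tracking trick can be modelled in a few lines. The following is only a behavioural sketch in Python of a 16-bit IMUL as described above, not the microcode algorithm itself; the function names and the `rep_prefix` parameter are invented for illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def imul16(a, b, rep_prefix=False):
    """Behavioural model: F1 starts as 'REP prefix seen' and is flipped
    once for each negative input; the product is negated if F1 ends up set."""
    f1 = rep_prefix
    a, b = to_signed(a, 16), to_signed(b, 16)
    if a < 0:
        a, f1 = -a, not f1
    if b < 0:
        b, f1 = -b, not f1
    product = a * b  # multiply the magnitudes
    if f1:
        product = -product
    return to_signed(product, 32)


print(imul16(3, 4))                   # 12
print(imul16(3, 0xFFFC))              # -12 (0xFFFC is -4)
print(imul16(3, 4, rep_prefix=True))  # -12: the prefix negates the product
```

Because the flag is flipped rather than set, starting it at 1 (prefix present) inverts the final sign decision in every case, which is exactly the "negated product" behaviour.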

I was thinking the microcode instruction might set the F1 flag instead of flipping it - that would mean that you could get a (probably negated) "absolute value" operation (almost) for free with a multiply. But an almost-free negation is pretty good too - REP is a byte cheaper than "NEG AX", and with 16-bit multiplies the savings are even greater (it eliminates a NEG AX / ADC DX, 0 / NEG DX sequence). Still small compared to the multiply, but a savings nonetheless.
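For reference, here is why the three-instruction sequence negates the 32-bit value in DX:AX. This is a small Python emulation of the documented flag behaviour (NEG sets the carry flag when its operand was non-zero); the function name is just for illustration.

```python
def neg32_via_16bit_ops(dx, ax):
    """Emulate NEG AX / ADC DX, 0 / NEG DX on 16-bit register values."""
    carry = 1 if ax != 0 else 0  # NEG sets CF when the operand was non-zero
    ax = (-ax) & 0xFFFF          # NEG AX
    dx = (dx + carry) & 0xFFFF   # ADC DX, 0
    dx = (-dx) & 0xFFFF          # NEG DX
    return dx, ax


dx, ax = neg32_via_16bit_ops(0x0000, 0x000C)  # DX:AX = 12
print(hex(dx), hex(ax))                       # 0xffff 0xfff4, i.e. -12
```

The high word ends up as -(DX + CF), which is the same as ~DX plus a carry-in only when the low word was zero, so the pair is the two's complement of the original 32-bit value.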

I contemplated using this in a demoscene production as another "we break all your emulators" moment, but multiplication and division on the 8086 and 8088 CPUs are sufficiently slow to be of limited use for demos.

The F1ZZ microcode instruction (which controls whether the REPE/REPNE SCAS/CMPS sequences terminate early) is also used in the LOOPE and LOOPNE instructions. Which made me wonder if one of the REP prefixes would also reverse the sense of the test. However, neither prefix seems to have any effect on these instructions.

Update 2nd January 2023

I've made a new version of the disassembly here incorporating some changes from the comments below. I have transcribed the group ROM, got rid of "NWB", added the RNI flag to W microinstructions, and changed XZC to ADC.