Multiplying faster with squares

The 8088 has a hardware multiplier, but it's quite slow:

MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0
MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0

Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there's either some very weird behavior going on with the 8088's multiplier or that there's a bug in my timing program (possibly both).

Can we beat the 8088's hardware multiplier with a software routine? There's a clever trick which can sometimes be used to speed up multiplications, based on the following:

(a+b)2 = a2 + b2 + 2ab
(a-b)2 = a2 + b2 - 2ab

Subtracting these gives:
4ab = (a+b)2 - (a-b)2

ab = (a+b)2/4 - (a-b)2/4

So if we keep in memory a table of x2/4 we can do a multiply with an add, two subtractions and two table lookups.

Does this still work if the entries in the table are fractional? Yes, it does, and here's why. x2/4 is an integer for even x, and for odd x it will always be an integer plus a quarter, as a quick induction will show. (a+b) and (a-b) are always both odd or both even (since they differ by 2b which is always even) so the quarters always cancel and we can just ignore them.

In 8088 assembly code, this can be written like this (assuming the operands are in ax and cx and that we don't care about the high word of the result):

  xchg ax,bx
  mov si,offset table
  shl bx,1
  shl cx,1
  add bx,cx
  mov ax,[si+bx]
  sub bx,cx
  sub bx,cx
  sub ax,[si+bx]

That works out to about 88 cycles (93 with DRAM refresh) - already quite an improvement over the hardware multiplier. We can do even better, though - if we place our table of squares at DS:0 and assume that our operands are in si and bx we get:

  shl bx,1
  shl si,1
  mov ax,[si+bx]
  neg bx
  sub ax,[si+bx]

Which is just 56 cycles (59 with DRAM refresh) - more than twice as fast as the hardware multiplier!

For byte-sized multiplies we can do something similar in just 34 cycles (36 with DRAM refresh):

  mov al,[si+bx]
  neg bx
  sub al,[si+bx]

However, it's less important for byte-sized multiplies since we can fit an entire 8-bit multiply table in memory and do:

  xchg bh,bl
  mov al,[si+bx]

Which is only about 20 cycles (21 with DRAM refresh). There's obviously a bit of a memory-usage vs time tradeoff there though.

There are some more nice things about these algorithms:

  1. There's more choices about which registers to use for the operands and results.
  2. If you're doing a lot of multiplies and want the results of them all scaled by a constant factor, you can build this into the table.
  3. If you're summing the results of several multiplies (e.g. computing a dot product) you can do it right in the accumulator (just change the "mov" to an "add".

The downsides (apart from the memory usage and time to initialize the table) are that you don't get the high byte/word for free and that it only works for signed multiplies.

Leave a Reply