Multiplying faster with squares

The 8088 has a hardware multiplier, but it's quite slow:

MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0
MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0

Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there's either some very weird behavior going on with the 8088's multiplier or that there's a bug in my timing program (possibly both).

Can we beat the 8088's hardware multiplier with a software routine? There's a clever trick which can sometimes be used to speed up multiplications, based on the following:

(a+b)2 = a2 + b2 + 2ab
(a-b)2 = a2 + b2 - 2ab

Subtracting these gives:
4ab = (a+b)2 - (a-b)2

Or:
ab = (a+b)2/4 - (a-b)2/4

So if we keep in memory a table of x2/4 we can do a multiply with an add, two subtractions and two table lookups.

Does this still work if the entries in the table are fractional? Yes, it does, and here's why. x2/4 is an integer for even x, and for odd x it will always be an integer plus a quarter, as a quick induction will show. (a+b) and (a-b) are always both odd or both even (since they differ by 2b which is always even) so the quarters always cancel and we can just ignore them.

In 8088 assembly code, this can be written like this (assuming the operands are in ax and cx and that we don't care about the high word of the result):

  xchg ax,bx
  mov si,offset table
  shl bx,1
  shl cx,1
  add bx,cx
  mov ax,[si+bx]
  sub bx,cx
  sub bx,cx
  sub ax,[si+bx]

That works out to about 88 cycles (93 with DRAM refresh) - already quite an improvement over the hardware multiplier. We can do even better, though - if we place our table of squares at DS:0 and assume that our operands are in si and bx we get:

  shl bx,1
  shl si,1
  mov ax,[si+bx]
  neg bx
  sub ax,[si+bx]

Which is just 56 cycles (59 with DRAM refresh) - more than twice as fast as the hardware multiplier!

For byte-sized multiplies we can do something similar in just 34 cycles (36 with DRAM refresh):

  mov al,[si+bx]
  neg bx
  sub al,[si+bx]

However, it's less important for byte-sized multiplies since we can fit an entire 8-bit multiply table in memory and do:

  xchg bh,bl
  mov al,[si+bx]

Which is only about 20 cycles (21 with DRAM refresh). There's obviously a bit of a memory-usage vs time tradeoff there though.

There are some more nice things about these algorithms:

  1. There's more choices about which registers to use for the operands and results.
  2. If you're doing a lot of multiplies and want the results of them all scaled by a constant factor, you can build this into the table.
  3. If you're summing the results of several multiplies (e.g. computing a dot product) you can do it right in the accumulator (just change the "mov" to an "add".

The downsides (apart from the memory usage and time to initialize the table) are that you don't get the high byte/word for free and that it only works for signed multiplies.

3 Responses to “Multiplying faster with squares”

  1. Digi says:

    Hello Andrew.
    I can't to contact you at.@digger.org. Can you reply to me at my email specified with this comment?

  2. Repose says:

    c64 demo coder here
    I wrote the fastest possible 6502 16x16 mult
    http://www.codebase64.org/doku.php?id=base:fastest_multiplication
    at 192 cycles. The 8 bit version is 38 cycles. Even a 128k table is slower. Was interesting to see how comparable 8088 is. You can use the same general ideas as me; not changing pointers between sub-multiplies and optimized adds. Wonder how long it would take in 8088; would you like to try?

    • Andrew says:

      Hi Repose!

      I had a quick go at an 8088 16x16->32 multiply using no more than 2kB of tables, but came to the conclusion that there's no point as it'll be a lot slower than the hardware multiply instruction. The reason is that when multiplying a+0x100*b by c+0x100*d then it's necessary to do 8 lookups into the quarter-square tables (a+c, a-c, b+c, b-c, a+d, a-d, b+d and b-d). Each of these is going to be 8 cycles just for the bus accesses to these table locations, and another 8 cycles to fetch a couple of instruction bytes for each of these lookups. So that's 128 cycles just for the lookups, and that's without DRAM refresh or doing any negations. I'd love to be proved wrong, though, because I'm in the process of writing some 3D code for 8088 and the multiplies are a significant bottleneck.

Leave a Reply