{"id":1925,"date":"2012-10-30T20:00:18","date_gmt":"2012-10-30T20:00:18","guid":{"rendered":"https:\/\/www.reenigne.org\/blog\/?p=1925"},"modified":"2012-09-27T00:32:50","modified_gmt":"2012-09-26T23:32:50","slug":"multiplying-faster-with-squares","status":"publish","type":"post","link":"https:\/\/www.reenigne.org\/blog\/multiplying-faster-with-squares\/","title":{"rendered":"Multiplying faster with squares"},"content":{"rendered":"<p>The 8088 has a hardware multiplier, but it's quite slow:<\/p>\n<p>MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0<br \/>\nMUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0<\/p>\n<p>Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there's either some very weird behavior going on with the 8088's multiplier or that there's a bug in my timing program (possibly both).<\/p>\n<p>Can we beat the 8088's hardware multiplier with a software routine? There's a clever trick which can sometimes be used to speed up multiplications, based on the following:<\/p>\n<p>(a+b)<sup>2<\/sup> = a<sup>2<\/sup> + b<sup>2<\/sup> + 2ab<br \/>\n(a-b)<sup>2<\/sup> = a<sup>2<\/sup> + b<sup>2<\/sup> - 2ab<\/p>\n<p>Subtracting these gives:<br \/>\n4ab = (a+b)<sup>2<\/sup> - (a-b)<sup>2<\/sup><\/p>\n<p>Or:<br \/>\nab = (a+b)<sup>2<\/sup>\/4 - (a-b)<sup>2<\/sup>\/4<\/p>\n<p>So if we keep in memory a table of x<sup>2<\/sup>\/4 we can do a multiply with an add, two subtractions and two table lookups.<\/p>\n<p>Does this still work if the entries in the table are fractional? Yes, it does, and here's why. x<sup>2<\/sup>\/4 is an integer for even x, and for odd x it will always be an integer plus a quarter, as a quick induction will show. (a+b) and (a-b) are always both odd or both even (since they differ by 2b which is always even) so the quarters always cancel and we can just ignore them.<\/p>\n<p>In 8088 assembly code, this can be written like this (assuming the operands are in ax and cx and that we don't care about the high word of the result):<\/p>\n<pre lang=\"asm\">\r\n  xchg ax,bx\r\n  mov si,offset table\r\n  shl bx,1\r\n  shl cx,1\r\n  add bx,cx\r\n  mov ax,[si+bx]\r\n  sub bx,cx\r\n  sub bx,cx\r\n  sub ax,[si+bx]\r\n<\/pre>\n<p>That works out to about 88 cycles (93 with DRAM refresh) - already quite an improvement over the hardware multiplier. We can do even better, though - if we place our table of squares at DS:0 and assume that our operands are in si and bx we get:<\/p>\n<pre lang=\"asm\">\r\n  shl bx,1\r\n  shl si,1\r\n  mov ax,[si+bx]\r\n  neg bx\r\n  sub ax,[si+bx]\r\n<\/pre>\n<p>Which is just 56 cycles (59 with DRAM refresh) - more than twice as fast as the hardware multiplier!<\/p>\n<p>For byte-sized multiplies we can do something similar in just 34 cycles (36 with DRAM refresh):<\/p>\n<pre lang=\"asm\">\r\n  mov al,[si+bx]\r\n  neg bx\r\n  sub al,[si+bx]\r\n<\/pre>\n<p>However, it's less important for byte-sized multiplies since we can fit an entire 8-bit multiply table in memory and do:<\/p>\n<pre lang=\"asm\">\r\n  xchg bh,bl\r\n  mov al,[si+bx]\r\n<\/pre>\n<p>Which is only about 20 cycles (21 with DRAM refresh). There's obviously a bit of a memory-usage vs time tradeoff there though.<\/p>\n<p>There are some more nice things about these algorithms:<\/p>\n<ol>\n<li>There's more choices about which registers to use for the operands and results.<\/li>\n<li>If you're doing a lot of multiplies and want the results of them all scaled by a constant factor, you can build this into the table.<\/li>\n<li>If you're summing the results of several multiplies (e.g. computing a dot product) you can do it right in the accumulator (just change the \"mov\" to an \"add\".<\/li>\n<\/ol>\n<p>The downsides (apart from the memory usage and time to initialize the table) are that you don't get the high byte\/word for free and that it only works for signed multiplies.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The 8088 has a hardware multiplier, but it's quite slow: MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0 MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-1925","post","type-post","status-publish","format-standard","hentry","category-computer"],"_links":{"self":[{"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/posts\/1925","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/comments?post=1925"}],"version-history":[{"count":2,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/posts\/1925\/revisions"}],"predecessor-version":[{"id":1927,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/posts\/1925\/revisions\/1927"}],"wp:attachment":[{"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/media?parent=1925"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/categories?post=1925"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.reenigne.org\/blog\/wp-json\/wp\/v2\/tags?post=1925"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}