Here's some history: The 8088 CPU takes 4 cycles to access a byte of main memory (for now, don't consider CGA RAM as part of this, because CGA RAM has wait states). Because the 8088 has an 8-bit data path, it doesn't matter how large the transfer unit is -- whether a byte transfer or a word transfer, it's going to take 4 cycles for every 8 bits to get into the CPU. Since instruction code is also data in RAM, reading the next few bytes of code is subject to the same delay. That is the main reason why smaller = faster on 8088.

Here's how the prefetch queue comes into play: The 8088 has two units that operate independently of each other, the Execution Unit (EU) that does the thinking, and the Bus Interface Unit (BIU) that does the data fetching. In most prior chips, like the 8080 and 6800, the BIU just blindly grabs bytes and serves them to the EU. For the 808x, Intel decided to exploit the idle time resulting from the execution of long instructions by adding a buffer to the BIU. It still blindly grabs bytes, but into a tiny 4-byte buffer called the prefetch queue, which, as its name implies, holds as many upcoming instruction bytes as the BIU has been able to grab, one byte every 4 cycles. This is why you get an additional speedup if you rearrange your code to put smaller instructions directly after ones that take a long time to execute: they will already have been prefetched and be waiting in the queue, which saves you waiting 4 cycles per opcode byte.

One neat way to measure this behavior is to observe what happens when shifting by more than one bit. The 8088 supports only two forms, SHL xx,1 and SHL xx,CL. The ,1 form is 2 bytes and takes 2 cycles; the ,CL form is also 2 bytes, but takes 8 cycles to warm up and then an additional 4*CL cycles to perform the shift. So you would think that it is always faster to do this:

SHL AX,1
SHL AX,1
SHL AX,1
...etc.
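Using the cycle counts just quoted (2 cycles for SHL xx,1; 8 + 4*CL cycles for SHL xx,CL), the naive comparison -- ignoring instruction fetch entirely -- can be sketched like this (Python used purely as a calculator; the function names are my own, not anything official):

```python
# Naive execution-cycle model for shifting AX left by n bits on an 8088,
# counting only the documented EU execution times (instruction fetch ignored).

def shl_1_cycles(n: int) -> int:
    """n repeated 'SHL AX,1' instructions at 2 cycles each."""
    return 2 * n

def shl_cl_cycles(n: int) -> int:
    """One 'SHL AX,CL' instruction: 8 cycles of setup plus 4 per bit shifted."""
    return 8 + 4 * n

for n in range(1, 9):
    print(f"shift by {n}: ,1 form = {shl_1_cycles(n)} cycles, "
          f",CL form = {shl_cl_cycles(n)} cycles")
```

By these raw numbers the ,1 form wins for every shift count, since 2n is always less than 8 + 4n -- which is exactly the trap described next.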
However, this starts to backfire: because the SHL xx,1 form only takes 2 cycles to execute, it executes faster than the prefetch queue can stay filled. Once the queue is empty, your 2-cycle instruction now takes 2 + (4 cycles * 2 bytes) = 10 cycles to execute, because it spends 8 cycles just getting the instruction loaded into the CPU! So there is a break-even point: shifting up to 3 places, ,1 is faster... shifting 4 places, both ,1 and ,CL are the same... and shifting 5 places or more, the ,CL form is faster. Fun, eh? :-)

Now that you know all this, you can answer a few questions:

Q: Why does the 8086 execute code faster than the 8088 at the same clock speed?

A: It doesn't execute code any faster than the 8088 does, but the 8086 has a 16-bit BIU instead of an 8-bit one, so word accesses take the same time as byte accesses (4 cycles) as long as the words are aligned on word boundaries. This is why a 7.16 MHz 8086 executes code 80-90% faster than a 4.77 MHz 8088, even though it is clocked only 50% faster... and it's also why it's generally a Good Thing to word-align your data (it can't hurt and can only help).

Q: Why does the NEC V20, an 8088 clone, execute code 20-30% faster than the 8088 at the same clock speed?

A: Mostly because the prefetch queue on the V20 is 2 bytes larger than the 8088's (a 6-byte prefetch queue vs. a 4-byte one). (I say "mostly" because the NEC V20 had other improvements, like faster effective-address calculation, a better loop counter/shift register implementation, dedicated hardware for multiplication, etc. But the main all-purpose speedup was the larger queue.)

Q: In 808x assembler, why is it so important to avoid jumps/branches, even at the expense of increasing code size?

A: Because every jump empties the prefetch queue! So not only do you take the 17-cycle hit for a taken jump (as opposed to 5 cycles for one not taken), you also screw yourself because you've lost any benefit the prefetch queue might have had for you.