Here's some history: The 8088 CPU takes 4 cycles to access a byte of main memory (for now, don't consider CGA RAM as part of this, because CGA RAM has wait states). Because the 8088 has an 8-bit data path, it doesn't matter how large the transfer unit is -- whether a byte transfer or a word transfer, it's going to take 4 cycles for every 8 bits to get into the CPU. Since instruction code is also data in RAM, reading the next few bytes of code is subject to the same delay. That is the main reason why smaller = faster on 8088.

Here's how the prefetch queue comes into play: The 8088 has two units that operate independently of each other, the Execution Unit (EU) that does the thinking, and the Bus Interface Unit (BIU) that does the data fetching. In most prior chips, like the 8080 and 6800, the BIU just blindly grabs bytes and serves them to the EU. For the 808x, Intel decided to exploit the idle time resulting from the execution of long instructions by adding a buffer to the BIU. It still blindly grabs bytes, but into a tiny 4-byte buffer called the prefetch queue, which, as its name implies, holds as many upcoming instruction bytes as the BIU has been able to grab, one byte every 4 cycles. This is why you get an additional speedup if you rearrange your code to put smaller instructions directly after ones that take a long time to execute: they will already have been prefetched and be waiting in the queue, which saves you waiting 4 cycles per opcode byte.

One neat way to measure this behavior is to observe what happens when shifting by more than one bit. The 8088 supports only two forms, SHL xx,1 and SHL xx,CL. The ,1 form is 2 bytes and takes 2 cycles; the ,CL form is also 2 bytes, but takes 8 cycles to warm up and then an additional 4*CL cycles to perform the shift. So you would think that it is always faster to do this:

SHL AX,1
SHL AX,1
SHL AX,1
...etc.
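Using the cycle counts just quoted (2 cycles for SHL xx,1; 8 + 4*CL cycles for SHL xx,CL), the naive comparison -- ignoring instruction fetch entirely -- can be sketched like this (Python used purely as a calculator; the function names are my own, not anything official):

```python
# Naive execution-cycle model for shifting AX left by n bits on an 8088,
# counting only the documented EU execution times (instruction fetch ignored).

def shl_1_cycles(n: int) -> int:
    """n repeated 'SHL AX,1' instructions at 2 cycles each."""
    return 2 * n

def shl_cl_cycles(n: int) -> int:
    """One 'SHL AX,CL' instruction: 8 cycles of setup plus 4 per bit shifted."""
    return 8 + 4 * n

for n in range(1, 9):
    print(f"shift by {n}: ,1 form = {shl_1_cycles(n)} cycles, "
          f",CL form = {shl_cl_cycles(n)} cycles")
```

By these raw numbers the ,1 form wins for every shift count, since 2n is always less than 8 + 4n -- which is exactly the trap described next.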
However, this starts to backfire: because the SHL xx,1 form only takes 2 cycles to execute, it executes faster than the prefetch queue can stay filled. Once the queue is empty, your 2-cycle instruction now takes 2 + (4 cycles * 2 bytes) = 10 cycles to execute, because it spends 8 cycles just getting the instruction loaded into the CPU! So there is a break-even point: shifting up to 3 places, ,1 is faster... shifting 4 places, both ,1 and ,CL are the same... and shifting 5 places or more, the ,CL form is faster. Fun, eh? :-)

Now that you know all this, you can answer a few questions:

Q: Why does the 8086 execute code faster than the 8088 at the same clock speed?

A: It doesn't execute code any faster than the 8088 does, but the 8086 has a 16-bit BIU instead of an 8-bit one, so word accesses take the same time as byte accesses (4 cycles) as long as the words are aligned on word boundaries. This is why a 7.16 MHz 8086 executes code 80-90% faster than a 4.77 MHz 8088, even though it is clocked only 50% faster... and it's also why it's generally a Good Thing to word-align your data (it can't hurt and can only help).

Q: Why does the NEC V20, an 8088 clone, execute code 20-30% faster than the 8088 at the same clock speed?

A: Mostly because the prefetch queue on the V20 is 2 bytes larger than the 8088's (a 6-byte prefetch queue vs. a 4-byte one). (I say "mostly" because the NEC V20 had other improvements, like faster effective-address calculation, a better loop counter/shift register implementation, dedicated hardware for multiplication, etc. But the main all-purpose speedup was the larger queue.)

Q: In 808x assembler, why is it so important to avoid jumps/branches, even at the expense of increasing code size?

A: Because every jump empties the prefetch queue! So not only do you take the 17-cycle hit for a taken jump (as opposed to 5 cycles for one not taken), you also screw yourself because you've lost any benefit the prefetch queue might have had for you.