Devolving the CPU

Until fairly recently, most computers could only do one thing at a time. To put it in technical terms, they had only one "execution unit" to take instructions from memory and actually act upon them. Computers have appeared to run many programs at once for much longer than that, through sleight of hand: the operating system switches the execution unit between different programs many times per second, so fast that you can't tell they're not executing simultaneously.

But in recent years, CPU manufacturers have found it more and more difficult to make execution units faster. CPU clock speeds increased exponentially until a few years ago, but the fastest CPUs you can buy have been stuck at around 3.6GHz for several years now.

But Moore's law (which says that the number of transistors you can put in a single chip doubles every 18 months) has continued to hold. So what do we do with these extra transistors, if we're not using them to make the execution unit faster?

Well, they're instead being used to put multiple execution units on a single CPU, so that computers really can execute multiple programs at the same time, without switching between them. This has the effect of making computers faster: these days they're rarely doing only one thing at a time anyway, and when they are, the work can often be divided amongst execution units to effectively increase the speed.

This means that programmers need to change the way they write programs in order to get maximum speed from the latest machines. They can no longer write a program as a single list of instructions (a single "thread of execution", to use the technical term) and expect it to run 1000 times as fast in 15 years' time. Instead, programmers must write programs that use multiple threads and divide the work between them. This "multi-threaded" style of programming is more difficult, and if you don't know what you're doing, tricky errors can occur. For example, some errors (race conditions) only happen if two particular parts of the program happen to execute at the same time, so they can be very difficult to reproduce.
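To make that concrete, here's a minimal sketch of such an error in Python: two threads increment a shared counter without a lock. Whether the final count actually comes up short depends on how the runtime happens to schedule the threads, which is exactly the reproducibility problem described above.

```python
import threading

counter = 0  # shared mutable state, protected by no lock

def increment(n):
    global counter
    for _ in range(n):
        # This read-modify-write is not atomic: two threads can read
        # the same value, both add one, and one update is lost.
        counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 400000; lost updates can leave it lower, and whether they
# occur depends on scheduling - hence the difficulty of reproducing it.
print(counter)
```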

Once we've got the hang of doing multi-threaded programming properly, having any one execution unit go really fast won't be so important, because programs will be able to go just as fast by dividing up their work between execution units. This means that we will see a redistribution of transistors on future CPUs - we will go from a few very fast but complex execution units to many slower but simpler execution units, because we can make CPUs with better overall throughput that way.

There are several ways to simplify an execution unit. You can get rid of speed features like pipelining, branch prediction, out-of-order execution and speculation. Removing these gives you another advantage - it becomes much easier to determine how fast a thread will execute, which means less optimization and tuning is needed.
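As a sketch of that second advantage: on such a simplified unit, every instruction can take a fixed, known number of cycles (the numbers below are invented), so timing a piece of straight-line code is just addition.

```python
# Hypothetical per-instruction cycle costs for a simple in-order unit
# with no pipelining, branch prediction or speculation.
CYCLES = {"push": 1, "add": 1, "mul": 3, "load": 4, "store": 4}

def static_cycle_count(program):
    """Exact running time of straight-line code: just sum the costs."""
    return sum(CYCLES[op] for op in program)

print(static_cycle_count(["push", "push", "add", "store"]))  # -> 7
```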

Another way is to simplify the instruction set (the operation codes that the execution unit actually interprets). Most instruction sets in current use are very complicated (in order to squeeze as much power as possible out of a single execution unit), so there is great scope for simplification there. All the single instruction, multiple data (SIMD) instructions can be removed, since the parallelism they enable can be realized more easily with multiple execution units.
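Here's a sketch of that last claim: the data parallelism a SIMD add exploits (the same operation applied across many elements) maps directly onto plain scalar loops split between execution units. The chunking below is only illustrative, of course, not a real hardware dispatch mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def add_chunk(a, b, out, lo, hi):
    # The scalar loop a SIMD add would vectorize, run serially per unit.
    for i in range(lo, hi):
        out[i] = a[i] + b[i]

def parallel_add(a, b, units=4):
    out = [0] * len(a)
    step = (len(a) + units - 1) // units
    with ThreadPoolExecutor(max_workers=units) as pool:
        for lo in range(0, len(a), step):
            pool.submit(add_chunk, a, b, out, lo, min(lo + step, len(a)))
    return out

print(parallel_add(list(range(8)), list(range(8, 16))))
# [8, 10, 12, 14, 16, 18, 20, 22]
```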

Another simplification would be to remove all the registers and use a pure stack machine architecture. A few processors using this technique were developed in the past, but they never took off because they were too slow. That may cease to be a consideration in highly parallel architectures.

Stack machines have a number of advantages besides simplicity. Compilers can be made much simpler too: the parse tree can be translated into code almost directly, with no register allocation or complicated instruction selection algorithms needed. The CPU also stores less non-memory state, which makes switching an execution unit from one thread to another much quicker - that state could be as little as an instruction pointer, a stack pointer and some context to identify the thread.
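To show just how direct that translation is, here's a toy compiler for arithmetic expressions (tuples stand in for parse tree nodes, and the opcode names are made up): a post-order walk of the tree is the entire code generator.

```python
def compile_expr(node):
    # Leaves become pushes; an operator node compiles its children,
    # then emits itself - its operands are already on top of the stack.
    if isinstance(node, int):
        return [("push", node)]
    op, left, right = node
    return compile_expr(left) + compile_expr(right) + [(op,)]

# 2 + 3 * 4
print(compile_expr(("add", 2, ("mul", 3, 4))))
# [('push', 2), ('push', 3), ('push', 4), ('mul',), ('add',)]
```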

Stack machines can also have much higher code densities than register machines, because you don't need any bits to specify which registers to work on - all instructions work implicitly on the top of the stack. Code density will continue to be important, as the bus between CPU and RAM is a bottleneck that will only get worse as the number of execution units increases.

Another technique that can improve code density is bit-aligned instructions. By assigning short opcodes to the most commonly used instructions and longer opcodes to the rarer ones, you effectively apply Huffman compression to the instruction stream. The Intel 432 used this technique (and was very slow, but the shifting required to decode such an instruction stream is easy enough to implement in hardware that I don't think it would be a big issue for a massively parallel CPU).
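A toy decoder (with an invented, prefix-free code table) shows why that shifting is cheap: the decoder consumes bits until a codeword matches, emits the instruction, and shifts past it.

```python
CODES = {"0": "push", "10": "add", "110": "mul", "111": "call"}

def decode(bits):
    decoded, cur = [], ""
    for b in bits:
        cur += b
        if cur in CODES:      # prefix-free: the first match is the match
            decoded.append(CODES[cur])
            cur = ""          # "shift" past the decoded opcode
    return decoded

print(decode("0101100111"))
# ['push', 'add', 'mul', 'push', 'call']
```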

With a stack machine, certain patterns of instructions tend to occur over and over again. These too can be assigned shorter codes, as is done in many popular lossless compression formats such as ZIP.
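Finding those patterns is straightforward: count adjacent instruction pairs in compiled code (the trace below is made up), and the most common pairs become candidates for short opcodes of their own.

```python
from collections import Counter

program = ["push", "push", "add", "push", "add", "store",
           "push", "push", "add", "store"]
pairs = Counter(zip(program, program[1:]))
print(pairs.most_common(3))
# [(('push', 'add'), 3), (('push', 'push'), 2), (('add', 'store'), 2)]
```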

Working out the optimal coding for this instruction set would be fairly simple in principle. First, design a minimal instruction set for a stack-based machine without regard to code density (perhaps just one byte per opcode). Then compile a large quantity of software to this architecture, spanning the range of possible uses: an operating system, an application suite, development tools, games, emulators, number crunching, scientific visualization and so on. Next, determine the instruction frequencies across all these codebases and apply Huffman coding to find the optimal opcodes. With some fine-tuning, I think this could give significantly higher code densities than any of the architectures currently in use.
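Here are the frequency-counting and coding steps of that procedure in miniature, with an invented frequency table standing in for statistics gathered from real codebases (the same machinery extends to the pattern idea above, if the common pairs are added to the alphabet first):

```python
import heapq

def huffman_code(freqs):
    """Return {opcode: bitstring}; common opcodes get the short codes."""
    heap = [(n, i, [op]) for i, (op, n) in enumerate(freqs.items())]
    codes = {op: "" for op in freqs}
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, ops1 = heapq.heappop(heap)   # two least frequent groups
        n2, i2, ops2 = heapq.heappop(heap)
        for op in ops1:
            codes[op] = "0" + codes[op]     # left branch of the merge
        for op in ops2:
            codes[op] = "1" + codes[op]     # right branch
        heapq.heappush(heap, (n1 + n2, i2, ops1 + ops2))
    return codes

# Invented instruction-frequency profile
freqs = {"push": 40, "add": 20, "load": 15, "store": 10,
         "mul": 8, "jump": 5, "call": 2}
for op, bits in sorted(huffman_code(freqs).items(), key=lambda kv: len(kv[1])):
    print(f"{op:6s}{bits}")
```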

Another advantage of having many execution units is that transistors can be assigned to different tasks in the most efficient possible way. For example, if you have one ALU per execution unit and each is only in use 70% of the time, you could instead share 7 ALUs amongst every 10 execution units - the ultimate in superscalar design.
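A quick back-of-the-envelope check on that figure (under the simplifying assumption that each unit's demand is independent): with 7 ALUs shared amongst 10 units, demand exceeds supply in roughly 38% of cycles, so the price of the higher ALU utilization is the occasional stalled unit.

```python
from math import comb

# P(more than 7 of 10 units want the ALU in a given cycle), with each
# unit independently needing it 70% of the time.
p, units, alus = 0.7, 10, 7
stall = sum(comb(units, k) * p**k * (1 - p)**(units - k)
            for k in range(alus + 1, units + 1))
print(f"P(demand > {alus} ALUs) = {stall:.3f}")  # ~0.383
```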

3 Responses to “Devolving the CPU”

  1. [...] at the hardware level for making processors fast, but that isn’t a particular concern here). Devolving the CPU has some ideas for generating a good instruction [...]

  2. [...] but is missing performance features such as out-of-order execution. This vindicates what I wrote in devolving the CPU – that individual cores will get simpler and slower again as they get more [...]

  3. anonymous coward says:

    The problem with a stack machine is that it doesn't readily allow for superscalar/wide issue/software pipelining because a stack machine is inherently serial. That means that to get any parallelism you need to invoke interconnect overheads instead of using ILP or wide-issue scheduling. Not to mention the need to run legacy software. https://millcomputing.com/ has plenty to say on these topics.
