Hardware support for ravioli

Ravioli memory is something that could benefit significantly from hardware support. Page operations and inter-thread communication are some specific things that could be accelerated, but there are others as well:

  • Dereferencing through reverse-ravioli trees (and forward-ravioli trees to find parent objects).
  • Allocating and deallocating/moving. The latter requires fixing up any pointers than happen to be in registers, so the set of registers that hold pointers needs to be found somehow (either by using tag bits or having a subset of registers dedicated to pointers).
  • Prefetching the memory for the ravioli pointed to by registers, the ravioli pointed to by those ravioli and so on. Since there are a relatively small number of memory locations that could be accessed at any moment in time. This is in contrast to current CPU architectures, where pretty much any address could be generated at any point in time.

Registers and other ravioli are two places where pointers can be found and which need to be fixed up. Are there any others? What about the stack? Well, searching the entire stack for pointers to the raviolo that's getting moved is an O(n) operation, which we want to avoid. So we can't be storing pointers on the stack.

One thing about stacks as implemented in modern OSes - you have to decide how big they can get in advance. With virtual memory systems you don't have to commit storage for the entire stack, but you do have set some kind of limit on it so that heaps, dynamically-loaded libraries and stacks belonging to other threads can go below them (assuming a downward growing stack). If you choose a limit that is too small you'll get a stack overflow, and choosing a limit that is too large is wasting address space.

Both of these problems could be solved by not having a separate stack at all, and storing activation frames in the ravioli. This eliminates the distinction between stack and heap data, and allows for trivial implementation of Cactus Stacks.

What other chunks of memory can we shoehorn into ravioli? What about the actual code that the CPU runs? Code loading and unloading less often causes fragmentation than heap allocation/deallocation, but it could happen (and might be more likely with dynamic code generation). Putting code into ravioli has some other nice effects as well - it means that ravioli addresses can be embedded directly into code, and those addresses can be updated automatically when those ravioli move around - it's just like any other reference from one raviolo to another. Unless your code raviolo ends with an instruction like "jump to the address in this register" or "return from subroutine" or "halt the CPU", it will need a successor raviolo to jump to when it reaches the end. So all such code ravioli must end with a pointer to the next code raviolo to run, in the tradition of the Royal McBee RPC-4000. It could also end with two addresses for a conditional jump instruction. Some instructions will need a third address as data (e.g. load an immediate address into an address register). Code ravioli are sort of like basic blocks except for all being the same size. They may contain several actual CPU instructions.

Along with code come the vtable pointers previously mentioned. These are a bit like code (in that they are normally generated when the program is compiled) but are not executable. My thinking was that each raviolo would have a pointer's worth of metadata which tells both the CPU and the program what it's looking at. However, the vtables (if not stored as ravioli) are another potential source of fragmentation. Worse, if every single "just data" raviolo has to dedicate 1/4 of its space to a vtable pointer it's quite wasteful. Getting rid of the vtable pointer takes the limiting-case overhead for straight arrays down from 300% to 167% (binary trees), 100% (ternary trees) or 31% (7-child trees with ravioli that are 8 pointers long). Vtable pointers can still be implemented by compilers in the normal way (as a normal pointer field in a structure).

The CPU will still need some metadata about each raviolo to know where the pointers are and whether it's a reference tree raviolo that should be skipped for dereferencing. This is only a few bits worth, however - I've counted 12 different types of ravioli that are the size of 4 pointers:

  • 1 incoming pointer
  • 1 incoming pointer, 1 outgoing pointer
  • 2 incoming pointers, 1 outgoing pointer
  • 1 incoming pointer, 2 outgoing pointers
  • 2 incoming pointers
  • 3 incoming pointers
  • 4 incoming pointers
  • 3 incoming pointers, 1 outgoing pointer
  • 2 incoming pointers, 2 outgoing pointers
  • 1 incoming pointer, 3 outgoing pointers
  • 1 incoming pointer, 3 outgoing pointers - reference only (reverse tree)
  • 3 incoming pointers, 1 outgoing pointer - reference only (forward tree)

Suppose our pointers are 64 bits, so our ravioli are 256 bits and aligned to 32 byte boundaries. Then we know that the low 5 bits of our incoming pointer will all be zero, so we can reuse them for these type bits. The low 5 bits of outgoing pointers might still be useful for pointing to particular bytes inside ravioli (especially for string handling). Of course, for finding matching pointers for the purposes of moving ravioli, these low bits would be ignored.

There we have it - everything raviolized. Except for the CPU registers - there's no possibility of fragmentation there. In fact, one could think of the register file as one extra large and special raviolo that is unique in not having an incoming pointer (it's location is implicit) and whose outgoing pointers don't need to be reference-enumerated (since they're automatically updated when the ravioli they point to move anyway).

Another advantage that a machine implementing ravioli in hardware might have is the possibility of isolating threads from each other even if they share an address space. The CPU knows where all the pointers are and can control access to the places where they are so that only a valid pointer can be placed in a pointer slot. Thus a thread can't get access to data unless it already has a pointer to that data somewhere. In particular, a thread can't access another thread's data. So we can use the same address space for all processes, and we no longer need to distinguish between threads in the same process and threads in another process. The OS keeps a mapping from (non-pointer) thread ID numbers to the initial page of each thread, so if a thread needs to refer to another thread it can do so via this thread ID.

Leave a Reply