Threading and ravioli

One thing that's been bothering me about ravioli memory is the possibility that it doesn't interact very well with threading. Having multiple threads manipulate a single ravioli heap is a non-starter: whenever one thread needs to allocate or deallocate a raviolo (a very common operation), all the other threads would have to be suspended and (in the deallocation case) have any pointers in their registers potentially adjusted.

So rather than having a single heap for all threads, we should instead have a heap per thread. But one thread may not refer to another thread's heap! If it did, we'd again have to synchronize those threads whenever a deallocation occurred.

So essentially when using ravioli, every thread must have a separate address space (logically, even if multiple threads share the same address space in the traditional sense). But that kind of defeats the point of having separate threads in the first place! The whole point of threads is that they share an address space, so that multiple threads can party on the same data without explicitly transferring it from one thread to another.

Threads are the standard way to build interactive applications that remain responsive while doing background processing (or network access, disk access, etc.). They've also become increasingly important as the exponential rise in single-core performance has given way to a rise in the number of hardware threads per CPU. They have low overhead (switching threads can be very quick since the address space doesn't need to change) and are reasonably easy for operating systems to implement.

But that doesn't necessarily mean they're the best way to structure applications to take best advantage of a CPU's resources. There is a serious hidden penalty to threads that applies even when using the most sophisticated lock-free techniques and the latest multi-core CPUs. That penalty is this: sharing memory between cores is not free. If a piece of memory is written to by one core and read by another, the caches that make these memory accesses fast must be flushed. So memory accesses become a great deal slower when there is contention between cores for the cache lines in question.
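This is easy to demonstrate with a standalone C program (nothing to do with ravioli per se, just an illustration of the penalty). Two threads each increment a counter; when the two counters share a cache line the same work takes several times longer than when padding puts them on separate lines, because the line ping-pongs between the cores' caches. The 64-byte line size is an assumption about the target CPU; compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000L

/* Both counters land in the same cache line. */
static struct { volatile long a; volatile long b; } together;

/* Padding pushes the second counter onto its own 64-byte line. */
static struct { volatile long a; char pad[64 - sizeof(long)]; volatile long b; } apart;

static void* bump(void* p)
{
    volatile long* counter = p;
    for (long i = 0; i < ITERATIONS; ++i)
        ++*counter;                 /* not atomic; we only care about timing */
    return NULL;
}

static double elapsed(volatile long* x, volatile long* y)
{
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, bump, (void*)x);
    pthread_create(&b, NULL, bump, (void*)y);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    printf("same cache line:      %.2fs\n", elapsed(&together.a, &together.b));
    printf("separate cache lines: %.2fs\n", elapsed(&apart.a, &apart.b));
    return 0;
}
```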

So, to get the best performance out of multi-threaded code, the threads should work as much as possible with memory that is private to them and share as little memory between threads as possible; in other words, they should avoid being "chatty".

With ravioli, communication between threads essentially amounts to message passing. One thread passes a message consisting of some data structure to another thread. To accomplish that, the ravioli comprising that data structure are moved into a shared-memory buffer, ownership of that buffer is transferred to the other thread, and then the marshalled ravioli are moved onto the other thread's heap. That sounds like a lot of extra work, but each of the extra transfers happens within a single thread and seems likely to be quick compared to all the cache-flushing involved in the actual moving of information from one core to another (though I should qualify that I haven't actually measured it). In fact, the localization of this movement (in space and time) may even have some performance benefits over the ad-hoc chattiness of most multi-threaded programs.
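Here's a rough C sketch of what the first two steps might look like. None of these names come from an actual implementation: raviolo, heap and the one-slot mailbox are all invented for illustration, and the pointer fix-up that real ravioli moves would perform is reduced to rewriting a single next field.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shapes: a raviolo is a small fixed-size cell; a heap
 * allocates them contiguously so there are no gaps below the top. */
typedef struct raviolo {
    int value;
    struct raviolo* next;   /* one pointer slot stands in for the real layout */
} raviolo;

typedef struct {
    raviolo cells[512];     /* assumed big enough for the sketch */
    size_t top;
} heap;

static raviolo* alloc_raviolo(heap* h)
{
    return &h->cells[h->top++];
}

/* Step 1: move (here: copy) the structure into the outgoing buffer,
 * rewriting its internal pointers to refer to the new copies. */
static raviolo* move_list(heap* buffer, const raviolo* node)
{
    raviolo* head = NULL;
    raviolo** link = &head;
    for (; node != NULL; node = node->next) {
        raviolo* copy = alloc_raviolo(buffer);
        copy->value = node->value;
        copy->next = NULL;
        *link = copy;
        link = &copy->next;
    }
    return head;
}

/* Step 2: transfer ownership of the whole buffer. Only this hand-off
 * touches shared state; everything above ran within the sending thread. */
static struct {
    pthread_mutex_t lock;
    heap* in_transit;       /* NULL when the mailbox is empty */
} mailbox = { PTHREAD_MUTEX_INITIALIZER, NULL };

static void send_buffer(heap* buffer)
{
    pthread_mutex_lock(&mailbox.lock);
    mailbox.in_transit = buffer;    /* sender must not touch it after this */
    pthread_mutex_unlock(&mailbox.lock);
}
```

The third step, absorbing the buffer into the receiving thread's heap, is really the compacting operation described further down.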

How can multiple threads share a single address space in the traditional sense? One possible way would be to divide the address space up into pages of (say) 4KB (we might want to experiment with different values to see what gives the best tradeoff of performance to per-thread memory overhead, but there are likely to be advantages in choosing a size which corresponds to the natural page size of the machine's MMU). Whenever a thread needs to allocate a new raviolo, it uses space from the current page. If there is no space left on the current page, it requests a new page (which involves coordination with the other threads). If a thread ends up with two completely empty pages, one of them is returned to the inter-thread pool of empty pages (which again requires coordination). This hysteresis prevents excess inter-thread coordination when a thread's memory usage is close to a whole number of pages. A thread's pages are linked together using the first raviolo of each page (which is reserved for that purpose), as are the pages in the free list. So a thread's pages do not have to be contiguous in the address space and fragmentation is not a problem.
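A sketch of how that allocation discipline might look in C (all names are invented; the first cell of each page holds the list link, as described, and the empty-page count that drives the hysteresis would in practice be maintained by the compactor):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct raviolo { struct raviolo* slot[4]; } raviolo; /* fixed-size cell */

#define PAGE_BYTES 4096
#define CELLS_PER_PAGE (PAGE_BYTES / sizeof(raviolo))

typedef struct page {
    struct page* next;                     /* lives in the reserved first cell */
    raviolo cells[CELLS_PER_PAGE - 1];
} page;

/* Pool of empty pages shared by all threads; this is the only state
 * that needs cross-thread coordination. */
static struct {
    pthread_mutex_t lock;
    page* free_list;
} pool = { PTHREAD_MUTEX_INITIALIZER, NULL };

static page* pool_take(void)
{
    pthread_mutex_lock(&pool.lock);
    page* p = pool.free_list;
    if (p != NULL)
        pool.free_list = p->next;
    pthread_mutex_unlock(&pool.lock);
    return p != NULL ? p : malloc(sizeof(page));
}

static void pool_give(page* p)
{
    pthread_mutex_lock(&pool.lock);
    p->next = pool.free_list;
    pool.free_list = p;
    pthread_mutex_unlock(&pool.lock);
}

/* Per-thread heap: a linked list of pages, bump-allocated. */
typedef struct {
    page* pages;
    size_t used;            /* cells used on the head page */
    int empty_pages;        /* whole pages currently unused */
} thread_heap;

static raviolo* alloc_raviolo(thread_heap* h)
{
    if (h->pages == NULL || h->used == CELLS_PER_PAGE - 1) {
        page* p = pool_take();          /* coordination happens here only */
        p->next = h->pages;
        h->pages = p;
        h->used = 0;
    }
    return &h->pages->cells[h->used++];
}

/* The hysteresis: tolerate one empty page, return any beyond that. This
 * keeps a thread whose usage hovers near a page boundary from hammering
 * the shared pool on every allocation and deallocation. */
static void after_compaction(thread_heap* h)
{
    while (h->empty_pages >= 2) {
        page* p = h->pages;
        h->pages = p->next;
        pool_give(p);
        h->empty_pages--;
    }
}
```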

These pages also form a natural mechanism for the "shared memory buffers" used for inter-thread communication, alluded to above. It's pretty straightforward to move a page from one thread to another (just remove it from one thread's list and place it on another's). The one tricky part is making sure that only the ravioli that actually need to be moved to another thread are placed in the page to be moved. So a thread will need a way of allocating a page separate from its normal heap-top. This page doesn't get its ravioli compacted and isn't automatically freed if there is too much unused space. A shared memory buffer is almost like the address space of a different thread altogether (except that there isn't a corresponding thread of CPU execution). Once the buffer is handed to the other thread, a compacting operation is needed to restore the "no gaps" invariant: ravioli from the end of the new buffer are moved into any previously free space, after which the previously shared buffer is just normal memory.
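The hand-off itself might look something like this (again a hypothetical sketch under the page layout assumed above; the list splice is the whole trick, and the referrer fix-up that makes the subsequent compaction moves safe is elided):

```c
#include <stddef.h>

typedef struct raviolo { struct raviolo* slot[4]; } raviolo;

typedef struct page {
    struct page* next;              /* reserved first cell links the list */
    raviolo cells[127];
} page;

typedef struct { page* pages; } thread_heap;

/* Unlink the buffer page from the sending thread's list... */
static void unlink_page(thread_heap* h, page* p)
{
    page** link = &h->pages;
    while (*link != p)
        link = &(*link)->next;
    *link = p->next;
    p->next = NULL;
}

/* ...and link it into the receiver's. No ravioli move during the splice;
 * the receiver then compacts, moving ravioli from the end of the adopted
 * page into earlier free cells. In ravioli memory each raviolo can locate
 * its referrers, which is what lets those moves fix pointers up as they
 * go; that machinery is omitted here. */
static void adopt_page(thread_heap* h, page* p)
{
    p->next = h->pages;
    h->pages = p;
}
```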

It occurred to me that these special buffers may also be useful for serializing to or deserializing from disk, network, or the address spaces of other processes. In these cases the high bits of the pointers (the ones corresponding to the page address) need to be fixed up to point to the real pages on deserialization. But perhaps this method of serialization would be overly wasteful - for all but the largest regions of data, most of those pointer bits are useless to the deserializer. So perhaps it would make more sense to compress the data before sending it to remove this redundant information.
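The fix-up step might look like this, assuming 4KB pages and a translation table from serialized page addresses to loaded page addresses (both assumptions on my part, and the linear scan stands in for whatever lookup structure a real deserializer would use):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12                            /* 4KB pages */
#define OFFSET_MASK (((uintptr_t)1 << PAGE_BITS) - 1)

/* old_base[i] is the page address recorded in the serialized stream;
 * new_base[i] is where that page actually landed after loading. */
static uintptr_t fix_pointer(uintptr_t p,
                             const uintptr_t* old_base,
                             const uintptr_t* new_base,
                             size_t pages)
{
    uintptr_t offset = p & OFFSET_MASK;         /* low bits survive as-is */
    uintptr_t base = p & ~OFFSET_MASK;          /* high bits get rewritten */
    for (size_t i = 0; i < pages; ++i)
        if (old_base[i] == base)
            return new_base[i] | offset;
    return 0;                                   /* pointer into an unknown page */
}
```

Note that only the offset bits carry information the deserializer can't reconstruct, which is exactly why compressing the stream (or simply serializing page indices instead of full addresses) would squeeze out most of the redundancy.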
