Deconstructing the database

This post elaborates on some database organization ideas that I mentioned in passing in Foundry.

The usual interface presented to a database is one based on "query/update" - the database contains some set of information which can be queried and updates can be made when this information needs to be changed. At any time there is a single "most up to date" state of the database consisting of the data after all "update" operations known so far have been performed.

The usefulness of this format cannot be denied, but there is another possible way of organizing things - I call it "accumulate/cache" (I haven't read much in the way of database theory literature, so this might already exist under some other name). In such a database you never, at some fundamental level, change things - you only add new things. An "update" operation consists of sending a "delta" to the database saying what needs to change, what it needs to change to, and the timestamp of the change being used as the "parent" of this change. These deltas are themselves timestamped and dumped onto the end of one big file (or table, if you're implementing this on top of a traditional relational database - though because the operations on this table are so limited, special-purpose code could probably do the job more efficiently).
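
To make this concrete, here is a rough sketch in Python of the sort of thing I mean. The names (Delta, AccumulationTable and so on) are just illustrative, and an incrementing integer counter stands in for real timestamps to keep the example deterministic:

    from dataclasses import dataclass
    from typing import Any, Dict, Optional

    @dataclass
    class Delta:
        key: str                 # what needs to change
        value: Any               # what it needs to change to
        parent: Optional[int]    # timestamp of the change this one builds on

    class AccumulationTable:
        """Append-only log of deltas - nothing is ever modified, only added."""
        def __init__(self) -> None:
            self._next_ts = 0
            self.log: Dict[int, Delta] = {}   # timestamp -> delta

        def append(self, delta: Delta) -> int:
            """Timestamp the delta and dump it onto the end of the log."""
            ts = self._next_ts
            self._next_ts += 1
            self.log[ts] = delta
            return ts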

To extract information from this database, you need a timestamp as well as the key of the piece of information you want to look up. You can look at the database as it was at any point in history just by supplying the appropriate timestamp. Thus you get what is essentially revision control for free (except for the fact that this kind of database probably uses much more space than the traditional sort - lack of disk space in previous generations may be why this sort of database isn't the way things are usually done).

Given a timestamp and a key, the algorithm for looking up a piece of information is quite simple: first look in the cache and see if that timestamp/key combination is there. If it is, you're done - just return that value. If it's not, look up that timestamp in the accumulation table. If the delta at that timestamp changes the value corresponding to your key, add the result of the query to the cache and you're done. Otherwise, pick the parent timestamp of that delta and repeat.
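
Continuing the sketch above, the lookup might look like this (the cache is just a dictionary keyed on timestamp/key pairs):

    class Database(AccumulationTable):
        """Accumulation table plus a cache of (timestamp, key) -> value results."""
        def __init__(self) -> None:
            super().__init__()
            self.cache: Dict[tuple, Any] = {}

        def lookup(self, ts: Optional[int], key: str) -> Any:
            """Value of `key` as it was at timestamp `ts`."""
            ts0 = ts
            while ts is not None:
                if (ts, key) in self.cache:     # cache hit - we're done
                    value = self.cache[(ts, key)]
                    break
                delta = self.log[ts]
                if delta.key == key:            # this delta set our key
                    value = delta.value
                    break
                ts = delta.parent               # otherwise follow the parent link
            else:
                raise KeyError(key)             # ran off the start of history
            self.cache[(ts0, key)] = value      # remember the answer for next time
            return value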

The caching is necessary to avoid bad asymptotic behavior as the size of the database grows. I think (but haven't proved) that for a database of size N it should be possible to get back to O(log N) amortized performance with an O(N log N) sized cache. The amount of cache used can be tuned by throwing away the least recently accessed entries depending on the database's access patterns (patterns which look back at history a lot will probably require larger caches than patterns which only look at recent timestamps).
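
The eviction policy itself is easy to bolt on. This isn't part of the scheme, just one obvious way to do it (an ordered dictionary used as a least-recently-used structure); since the lookup above only ever tests membership, reads and writes, it can be dropped in place of the plain dictionary cache:

    from collections import OrderedDict

    class LRUCache:
        """A bounded cache that discards the least recently accessed entries."""
        def __init__(self, capacity: int) -> None:
            self.capacity = capacity
            self._data: "OrderedDict[tuple, Any]" = OrderedDict()

        def __contains__(self, key: tuple) -> bool:
            return key in self._data

        def __getitem__(self, key: tuple) -> Any:
            self._data.move_to_end(key)          # mark as recently used
            return self._data[key]

        def __setitem__(self, key: tuple, value: Any) -> None:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)   # evict the least recently used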

Only the accumulation table needs to be backed up as the cache table is automatically reconstructed as necessary when queries are made. The fact that the accumulation table is only ever appended to simplifies this.

The next piece of infrastructure needed on top of this is a "head" mechanism. The last timestamp in the accumulation table is the latest physical change made, but this might not be the latest logical change. We would like the ability to have "atomic" operations made up of several changes. This can be done without any additional locks (so it scales very well). The "head" value is just a timestamp filed under some well-known key. To get the most recent head, get the value of "head" corresponding to the most recent timestamp, and then do all your other queries against that head timestamp.
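
In terms of the sketch above (where "head" is the well-known key, and where I simply scan the log backwards for the most recent "head" entry rather than doing anything cleverer):

    HEAD = "head"   # well-known key whose value is a timestamp

    class VersionedDatabase(Database):
        def current_head(self) -> Optional[int]:
            """Timestamp of the latest committed (logical) state."""
            for ts in sorted(self.log, reverse=True):
                if self.log[ts].key == HEAD:
                    return self.log[ts].value
            return None   # nothing committed yet

        def read(self, key: str) -> Any:
            """Read `key` from the latest committed state."""
            return self.lookup(self.current_head(), key)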

To perform an atomic update, you need to make your changes and then merge them with the latest head. This is trivial if the current head is the parent of your first change. Otherwise something else has written to the database in the meantime and a merge will need to be done, creating a new "parent of first change". Eventually (as long as the time between commits is long compared to the time it takes to do a commit) you'll get to the trivial case, and the only thing that needs to be done atomically is the trivial "compare parent to the value of head and swap with the new head if they're equal" (I think most CPUs can do this atomically with a single instruction, but in any case it is one of the simpler lock-free algorithms).
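
A sketch of that commit loop, continuing from above. A lock stands in for the single atomic compare-and-swap instruction, and the merge step is left abstract since what a merge means is application-specific:

    import threading

    class TransactionalDatabase(VersionedDatabase):
        def __init__(self) -> None:
            super().__init__()
            self._head_lock = threading.Lock()   # stands in for the atomic compare-and-swap

        def commit(self, changes: Dict[str, Any]) -> int:
            """Apply `changes` as one atomic logical update; returns the new head."""
            parent = self.current_head()
            while True:
                # Chain the changes off `parent`. Readers can't see them yet
                # because the head hasn't moved.
                ts = parent
                for key, value in changes.items():
                    ts = self.append(Delta(key, value, ts))
                with self._head_lock:            # the only step that must be atomic
                    head = self.current_head()
                    if head == parent:           # compare...
                        self.append(Delta(HEAD, ts, head))   # ...and swap
                        return ts
                # Something else committed in the meantime: merge onto the new
                # head and try again. The abandoned deltas just stay in the log.
                changes = self.merge(changes, parent, head)
                parent = head

        def merge(self, changes: Dict[str, Any], old_head: Optional[int],
                  new_head: int) -> Dict[str, Any]:
            raise NotImplementedError   # what "merge" means depends on the application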

Some databases already work like this behind the scenes to some extent (log replaying etc) but I think that making this part of the interface simplifies the atomicity logic considerably, not to mention the usefulness of being able to query versions of the database from any point in history.

One complication occurs if it becomes necessary to remove from the database some information which should never have been placed there (e.g. some copyrighted work). Since you want to obliterate history in this case, the accumulation table itself has to be modified in place - the one exception to the append-only rule. The cache (or the portion of it from after that point) must then be regenerated. One also needs to decide whether subsequent updates to the offending value count as "derived works", and either obliterate them too or leave them alone.
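
In terms of the sketch, the obliteration itself amounts to something like this (deciding what counts as a derived work is left to the caller):

    class RedactableDatabase(TransactionalDatabase):
        def obliterate(self, bad_ts: int, replacement: Any) -> None:
            """Rewrite the offending delta in place, then drop any cache
            entries that might have been derived from it."""
            old = self.log[bad_ts]
            self.log[bad_ts] = Delta(old.key, replacement, old.parent)
            # Everything cached at or after the rewritten timestamp is suspect;
            # it will be regenerated lazily as queries come in.
            self.cache = {(ts, key): value
                          for (ts, key), value in self.cache.items() if ts < bad_ts}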

One limiting factor of this architecture is access times - even with the fastest current disks one can only traverse a chain at maybe 500 links per second. It will only be a few years before we're all using solid state flash disks, though, and that will dramatically speed up access times.

Edit 14th July 2013:

Since writing this I came across an interesting article and talk on the same subject.
