Archive for the ‘algorithms’ Category

On positive and negative reputation

Sunday, February 21st, 2016

A nice feature of most places on the internet is that people can easily create a new identity (you might have to solve a captcha but that's about it). This wouldn't work so well in real life as it does on the internet - in real life if someone commits a crime they need to be held accountable for that, so it's important that we each have a real life identity that we can't just replace. Similarly, social safety-net programs need to ensure that any given person does not collect more money than they are entitled to, so they also need to use real-life identities.

Social media websites should not need to know peoples' real-life identities. But if identities can be discarded and replaced, how can we deal with the online equivalent of crimes (i.e. spam, abuse and malware)? I think the answer is just to ignore them with extreme prejudice. To decide if some message (whether it's an email, an RSS feed item, a link from an aggregator or whatever) is worth reading, we should ideally be able to look at the reputation of its originator. Completely new identities, not vouched-for by any identity with actual positive reputation would have no reputation and the messages they send should be consigned to the deepest levels of the spam filter.

Unlike in real life, there's no point in internet identities having negative reputation overall - if one did, the owner of that identity would have nothing to lose by abandoning it and spinning up a new clean one. Blacklists won't work, we'll have to use whitelists.

If you somehow grew up under a rock or something and were new to the internet, how could you build up reputation if nobody's reading your messages? Well, presumably you know some people in real life, and some of those may have some internet reputation. Contact them (offline if need be) and get them to vouch for you until you have enough reputation that strangers can read your messages.

Reputation scores should be subjective, not a single global score. So user A may have a different idea than user B about what user C's reputation is. The centralization of a global score would cause problems, and could be gamed (earning reputation from people who give it away easily and spending it where it more lucrative). My value of a reputation score for user C should be a influenced by whether I have liked their posts, and by the reputation scores for user C according to other people who who have high reputation scores according to me. It's sort of like Google's PageRank algorithm but for users instead of websites.

Abusive messages are mostly anonymous and would therefore generally have an extremely low reputation score. Otherwise, they would stand to quickly lose whatever reputation they had. So abuse is solved the same way as spam (and ends up in the same bucket).

Credit reporting agencies like Experian and Equifax keep reputation scores on our real life identities for non-crime purposes, like determining if it would be wise to lend us money. I sometimes think it would be a good idea if those companies were not allowed to use our real-life identities, so that "bad credit" could be escaped just by creating a new "credit identity". Then nobody would ever lend more money to someone than they had spent building up their credit reputation. The current system allows "no credit" young people to build up huge unsecured debts which they are then shackled with for an extremely long time. Student loan debts in the US cannot be discharged in bankruptcy, on the theory that the benefits obtained by attending college can't be given up, but this system can have some devastating consequences for those who ended up paying more for their degrees than those degrees were worth.

Hardware support for ravioli

Wednesday, February 17th, 2016

Ravioli memory is something that could benefit significantly from hardware support. Page operations and inter-thread communication are some specific things that could be accelerated, but there are others as well:

  • Dereferencing through reverse-ravioli trees (and forward-ravioli trees to find parent objects).
  • Allocating and deallocating/moving. The latter requires fixing up any pointers than happen to be in registers, so the set of registers that hold pointers needs to be found somehow (either by using tag bits or having a subset of registers dedicated to pointers).
  • Prefetching the memory for the ravioli pointed to by registers, the ravioli pointed to by those ravioli and so on. Since there are a relatively small number of memory locations that could be accessed at any moment in time. This is in contrast to current CPU architectures, where pretty much any address could be generated at any point in time.

Registers and other ravioli are two places where pointers can be found and which need to be fixed up. Are there any others? What about the stack? Well, searching the entire stack for pointers to the raviolo that's getting moved is an O(n) operation, which we want to avoid. So we can't be storing pointers on the stack.

One thing about stacks as implemented in modern OSes - you have to decide how big they can get in advance. With virtual memory systems you don't have to commit storage for the entire stack, but you do have set some kind of limit on it so that heaps, dynamically-loaded libraries and stacks belonging to other threads can go below them (assuming a downward growing stack). If you choose a limit that is too small you'll get a stack overflow, and choosing a limit that is too large is wasting address space.

Both of these problems could be solved by not having a separate stack at all, and storing activation frames in the ravioli. This eliminates the distinction between stack and heap data, and allows for trivial implementation of Cactus Stacks.

What other chunks of memory can we shoehorn into ravioli? What about the actual code that the CPU runs? Code loading and unloading less often causes fragmentation than heap allocation/deallocation, but it could happen (and might be more likely with dynamic code generation). Putting code into ravioli has some other nice effects as well - it means that ravioli addresses can be embedded directly into code, and those addresses can be updated automatically when those ravioli move around - it's just like any other reference from one raviolo to another. Unless your code raviolo ends with an instruction like "jump to the address in this register" or "return from subroutine" or "halt the CPU", it will need a successor raviolo to jump to when it reaches the end. So all such code ravioli must end with a pointer to the next code raviolo to run, in the tradition of the Royal McBee RPC-4000. It could also end with two addresses for a conditional jump instruction. Some instructions will need a third address as data (e.g. load an immediate address into an address register). Code ravioli are sort of like basic blocks except for all being the same size. They may contain several actual CPU instructions.

Along with code come the vtable pointers previously mentioned. These are a bit like code (in that they are normally generated when the program is compiled) but are not executable. My thinking was that each raviolo would have a pointer's worth of metadata which tells both the CPU and the program what it's looking at. However, the vtables (if not stored as ravioli) are another potential source of fragmentation. Worse, if every single "just data" raviolo has to dedicate 1/4 of its space to a vtable pointer it's quite wasteful. Getting rid of the vtable pointer takes the limiting-case overhead for straight arrays down from 300% to 167% (binary trees), 100% (ternary trees) or 31% (7-child trees with ravioli that are 8 pointers long). Vtable pointers can still be implemented by compilers in the normal way (as a normal pointer field in a structure).

The CPU will still need some metadata about each raviolo to know where the pointers are and whether it's a reference tree raviolo that should be skipped for dereferencing. This is only a few bits worth, however - I've counted 12 different types of ravioli that are the size of 4 pointers:

  • 1 incoming pointer
  • 1 incoming pointer, 1 outgoing pointer
  • 2 incoming pointers, 1 outgoing pointer
  • 1 incoming pointer, 2 outgoing pointers
  • 2 incoming pointers
  • 3 incoming pointers
  • 4 incoming pointers
  • 3 incoming pointers, 1 outgoing pointer
  • 2 incoming pointers, 2 outgoing pointers
  • 1 incoming pointer, 3 outgoing pointers
  • 1 incoming pointer, 3 outgoing pointers - reference only (reverse tree)
  • 3 incoming pointers, 1 outgoing pointer - reference only (forward tree)

Suppose our pointers are 64 bits, so our ravioli are 256 bits and aligned to 32 byte boundaries. Then we know that the low 5 bits of our incoming pointer will all be zero, so we can reuse them for these type bits. The low 5 bits of outgoing pointers might still be useful for pointing to particular bytes inside ravioli (especially for string handling). Of course, for finding matching pointers for the purposes of moving ravioli, these low bits would be ignored.

There we have it - everything raviolized. Except for the CPU registers - there's no possibility of fragmentation there. In fact, one could think of the register file as one extra large and special raviolo that is unique in not having an incoming pointer (it's location is implicit) and whose outgoing pointers don't need to be reference-enumerated (since they're automatically updated when the ravioli they point to move anyway).

Another advantage that a machine implementing ravioli in hardware might have is the possibility of isolating threads from each other even if they share an address space. The CPU knows where all the pointers are and can control access to the places where they are so that only a valid pointer can be placed in a pointer slot. Thus a thread can't get access to data unless it already has a pointer to that data somewhere. In particular, a thread can't access another thread's data. So we can use the same address space for all processes, and we no longer need to distinguish between threads in the same process and threads in another process. The OS keeps a mapping from (non-pointer) thread ID numbers to the initial page of each thread, so if a thread needs to refer to another thread it can do so via this thread ID.

Threading and ravioli

Tuesday, February 16th, 2016

One thing that's been bothering me about ravioli memory is the possibility that it doesn't interact very well with threading. Trying to have multiple threads manipulating a single ravioli heap is a non-starter: whenever one thread needs to allocate or deallocate a raviolo (a very common operation) all other threads would need to be suspended and (in the deallocation case) have any pointers in their registers potentially adjusted.

So rather than having a single heap for all threads, we should instead have a heap per thread. But one thread may not refer to another thread's heap! If it did, then again we've have to synchronize those threads whenever a deallocation occurred.

So essentially when using ravioli, every thread must have a separate address space (logically, even if multiple threads share the same address space in the traditional sense). But that kind of defeats the point of having separate threads in the first place! The whole point of threads is that they share an address space, so that multiple threads can party on the same data without explicitly transferring it from one thread to another.

Threads are the standard way to build interactive applications that remain responsive when the doing some background processing (or network access, disk access, etc.). They've also become increasingly important as the exponential rise in single-core performance has given way to a rise in the number of hardware threads per CPU. They have low overhead (switching threads can be very quick since the address space doesn't need to change) and are reasonably easy for operating systems to implement.

But that doesn't necessarily mean they're the best way to structure applications to take best advantage of a CPU's resources. There is a serious hidden penalty to threads that applies even when using the most sophisticated lock-free techniques and latest multi-core CPUs. That penalty is this: sharing memory between cores is not free. If a piece of memory is written to from one core and read from another, the caches that make these memory accesses fast must be flushed. So memory accesses become a great deal slower when there is contention between cores for the cache lines in question.

So, to get best performance out of multi-threaded code, the threads should work as much as possible with memory that is private to them and share as little memory between threads as possible, in other words they should avoid being "chatty".

With ravioli, communication between threads essentially amounts to message passing. One thread passes a message consisting of some data structure to another thread. To accomplish that, the ravioli comprising that data structure are moved into some shared-memory buffer, ownership of that buffer is transferred to the other thread and then the marshalling ravioli are moved on to the other thread's heap. That sounds like a lot of extra work but the extra transfers each happen within a single thread and seem likely to be quick compared to all the cache-flushing involved in the actual moving of information from one core to another (though I should qualify that I haven't actually measured it). In fact, the localization of this movement (in space and time) may even have some performance benefits over the ad-hoc chattiness of most multi-threaded programs.

How can multiple threads share a single address space in the traditional sense? One possible way would be to divide up the address space into pages of (say) 4KB (we might want to experiment with different values to see what gives the best tradeoff of performance to per-thread memory overhead, but there are likely to be advantages in choosing a size which corresponds to the natural page size of the machine's MMU). Whenever a thread needs to allocate a new raviolo, it uses space from the current page. If there is no more space left on the current page, it requests a new page (which involves coordination with the other threads). If a thread has two entire pages that are completely empty, one of them is returned to the inter-thread pool of empty pages (which again requires coordination). This hysteresis prevents excess inter-thread coordination when a thread's memory usage is close to a whole number of pages. A thread's pages are linked together using the first raviolo from each page (which is reserved for that purpose), as are the pages in the free list. So a thread's pages do not have to be contiguous in the address space and fragmentation is not a problem.

These pages also form a natural mechanism for the "shared memory buffers" used for inter-thread communication, alluded to above. It's pretty straightforward to move a page from one thread to another (just remove it from one thread's list and place it on another's). The one tricky part is making sure that only the ravioli that actually need to be moved to another thread are placed in the page to be moved. So a thread will need a way of allocating a page separate from it's normal heap-top. This page doesn't get its ravioli compacted and isn't automatically freed if there is too much unused space. A shared memory buffer is almost like the address space of a different thread altogether (except that there isn't a corresponding thread of CPU execution). Once the buffer is placed on the other thread, a compacting operation is needed to restore the "no gaps" invariant - ravioli from the end of the new buffer are moved into any previously free space, after which the previously shared buffer is just normal memory.

It occurred to me that these special buffers may also be useful for serializing to or deserializing from disk, network, or the address spaces of other processes. In these cases the high bits of the pointers (the ones corresponding to the page address) need to be fixed up to point to the real pages on deserialization. But perhaps this method of serialization would be overly wasteful - for all but the largest regions of data, most of those pointer bits are useless to the deserializer. So perhaps it would make more sense to compress the data before sending it to remove this redundant information.

Modifications to a programming language for ravioli

Monday, February 15th, 2016

I'd like to do some experiments with the ravioli memory I've been writing about in the past few blog posts. To give it a fair test I'd need a decent body of code (like a compiler) written in a language that I have control over. This is another reason I want to make my own compiler.

Some language modifications would be necessary to make good use of ravioli. While ravioli could be used with any non-garbage-collected language, to minimize the overhead it would be a good idea to add some language features which ensure that there is no duplication between the "low level" information used by the ravioli themselves and the "high level" information stored in the ravioli. For example, you would not want to implement reference counting on top of ravioli since the ravioli themselves have something even better (reference enumeration). Also you'd want to make sure that the vtable pointers used by the ravioli to determine where the pointers are in a given raviolo (and, if reference trees are used, whether it is normal or reverse) are also used for the high-level atoms or vtables.

I'm not sure quite what modifications to ALFE could be made to help with ravioli. I suspect most of it is just re-implementing low-level data structures, but I guess I'll find out what else can be done when I implement it.

Reference trees for ravioli

Sunday, February 14th, 2016

One thing that always bothered me about ravioli memory is that deleting a node can take O(n) time because the raviolo that was moved into the new gap might have a large number of pointers pointing to it. While the behavior will probably be O(1) amortized, it's quite possible to have a pathological case when deleting a large chunk of memory, leading to O(n2) behavior. Though that could probably be mitigated by not moving any ravioli until we're done deleting (i.e. having separate deallocation and move steps).

Or we could turn the move operation from O(n) to O(1) by limiting the number of pointers which can point at any raviolo. We can do this by having two types of ravioli:

  1. The normal type, which is pointed to by exactly one pointer. Each normal raviolo has a "parent" entry which points to that pointer.
  2. The reverse type, which can only point to one other raviolo (the "target") but which has two ravioli pointing to it. Each reverse raviolo has two "parent" entries which point to those pointers.

So, if you dereference a pointer expecting to find a normal raviolo but instead find a reverse raviolo, you need to (repeatedly) follow the "target" pointer of the reverse raviolo until you find a normal one. This may have to happen O(log(n)) times, so there is an additional cost to dereferencing (although array and object dereferencing is already O(log(n)) with the previous ravioli.) In return, moving (and hence deleting) becomes O(1) as there are at most 2 pointers to update. So this type of ravioli might be better for applications where the data is changed a lot, whereas the earlier type might be better for applications where the data mostly doesn't change but is queried a lot.

With this scheme, each raviolo only needs 4 pointer-sized entries: one for the vtable, one for the parent pointer (or target pointer for a reverse raviolo) and two for normal pointers (parent pointers for a reverse raviolo). Though (possibly depending on the type of data involved) it might be worth using 8-pointer ravioli to make the trees shallower and reduce overhead. 8-pointer reference tree ravioli could also interoperate with linked-list ravioli.

For an array of simple data, 4-pointer ravioli has an overhead of 300%. However, such arrays are growable/shrinkable for no extra cost, and you can even insert and delete elements in the middle without copying the entire array. 8-pointer ravioli has an overhead of 60%.

Modifying a pointer also involves O(log(n)) operations, as it may involve rebalancing the reference tree (the tree of reverse ravioli pointing to the old and new targets of the pointer). It could actually take much longer than that if you're deleting the last reference to a large data structure, since you'd have to delete all those ravioli. That's true for all reference counting systems, though, and there's no scope for pathological behavior since any given raviolo can only be deleted once.

The set of ravioli in a program at any given point in time form a graph. It would be interesting to visualize these graphs for a complicated program and see what patterns they make and how they change with time. This would also be a useful debugging tool - especially if you can delve into the actual values in each raviolo.

2-forward, 1-back list for ravioli

Saturday, February 13th, 2016

It turns out that having three "low-level" pointers per "high-level" pointer is not actually necessary in ravioli memory - two is enough.

The reason for this is that all the entries in the linked list are actually the same, so we don't need to store the pointer value in every element of the list. What we can instead do is to have two types of node:

  1. A node containing the pointer to the target and a pointer to the next node in the linked list.
  2. A node containing a pointer to the next node in the linked list and a pointer that points back in the linked list either 2 or 3 entries.

To find the target pointer if you have a node that doesn't contain it, just advance through the linked list until you do. To add a pointer to the linked list, add a forward pointer if that would cause the next backward pointer to point back 3 entries, otherwise add new backwards pointer so there are two "back 2" entries and the invariants are maintained. To remove a pointer, do the opposite to adding it.

If we want to make each raviolo a power of 2 bytes in size, we can then have 3 "high-level" pointers per raviolo (it doesn't allow us to make the ravioli smaller, since we still need a pointer for the vtable and to the first entry in that node's reference list).

However, I've since moved on to a more conceptually clean version of ravioli, which I'll talk about tomorrow.

Scrolling game edge tile list

Friday, February 12th, 2016

Suppose you're writing a scrolling game for very limited hardware. You have a frame buffer which is larger than the visible screen, but smaller than the entire level. So when the screen scrolls upwards (for example) you may need to plot a new set of tiles into the frame buffer just above the current screen position, so that the correct image will be displayed when the top of the visible screen overlaps the space where those tiles are plotted.

Now suppose that the hardware is so limited that there is not enough CPU time to plot an entire row of tiles each time you pass a tile boundary - perhaps you can get 60fps normally, but every time you pass a tile boundary you drop a frame or two while the new row of tiles is drawn into the frame buffer.

To avoid this glitch, instead of drawing all tiles at once, we can just draw one or two each frame. Suppose there are N tiles horizontally across the screen, and it takes M frames to scroll upwards by the span of 1 tile. Then we can draw N/M tiles each frame, and by the time the top of the screen gets to the bottom of a particular row of tiles, all of those tiles have been drawn.

Now we need to keep track of which tiles have been drawn and which haven't. The obvious way to do this would be to just have an array of bits, 1 for "tile has been drawn" and 0 for "tile needs to be drawn". Now, each frame when we draw a tile we can iterate through the array to find the first tile that hasn't been drawn and draw that one.

But what if there are rather a lot of tiles? Iterating through all those slots to find a "needs to be drawn" one could take a while, especially if all the undrawn slots are all the way over to the right. So we've turned a O(N) operation (N being the number of tiles drawn) into a O(N2) operation. Oh, and did I forget to mention that this is a 2D scrolling game? So we'll have a similiar list for each side of the screen, and each time we pass a tile boundary going vertically, we'll have to shift the left and right lists one tile.

How can we get rid of this quadratic behavior? Well, I think a data structure that should work well is a linked list. If we put the information about one tile slot in each node of the list, that makes it O(1) to shift the whole list one tile forwards or back for lateral scrolling, but doesn't fix the O(N2) search behavior. However, notice that we only need to store 1 bit of information about each tile slot. So what we can do is apply a run-length-encoding transformation to the list. Each node stores a pair of numbers - the length of a run of drawn tiles and the lengh of a run of undrawn tiles immediately following it.

Now all the operations we need are O(1). For the "top of screen" list, scrolling left or right by one tile just involves adding a slot to one end of the list and removing one from another. Finding the next tile to be drawn just involves looking at the first element of the list. We need two more operations: throwing away the entire list and replacing it with an "all drawn" list (for scrolling downwards) and throwing away the entire list (hopefully a single "all drawn" entry) and replacing it with a "none drawn" list (for scrolling downwards). Throwing away a list with potentially many entries sounds like it could potentially be an O(N) operation, but as it's a linked list we can move all the nodes to a "free nodes" list just by modifying the ends.

As always where performance is concerned, be sure to actually benchmark this before using it - depending on the hardware and the number of tiles in question, all that pointer chasing and cleverness may actually be slower than the naive bit array method. This is especially true if you're not in a hard-real-time situation, and the average case behavior matters more than the worst case behavior. Constant time algorithms have a definite appeal from an the aesthetics pespective, though.

Random number generation on 8-bit AVR

Thursday, July 1st, 2010

For my physical tone matrix I wanted a mode where at the start (or end) of each cycle the machine would pick a random "on" light and turn it off, and a random "off" light and turn it on. With 5-10 lights on, this makes a fairly nice little bit of random music - continually changing so it doesn't get too repetitive.

To implement this I decided the best way was to pick a frame (the second one of the cycle turned out to be the most convenient) and decrement one of two variables "randomOn" or "randomOff" depending on whether the current LED within the frame was on or off. If the variable reached zero the LED would be toggled. That doesn't take too many cycles. But the sample before this frame we need to generate random numbers and initialize randomOn and randomOff with them. randomOn needs to be in the range 0..lightsLit-1 (where lightsLit is the number of "on" LEDs) and randomOff needs to be in the range 0..(255-lightsLit). So we need to generate two random numbers in the range 0..n-1 where n is in the range 1..255 as quickly as possible. I wanted to try to generate these numbers in a single sample as trying to spread the work across multiple samples makes the program much more complicated. That left me with about 300 cycles to generate two random numbers.

The usual way of generating a random number in the range 0..n-1 is to generate a random number in a much larger range (say 0..65535) and use the modulo operation ("remainder after division by n") to get it into the range 0..n-1.

Generating a 16-bit random number is done in a fairly standard way using a Linear Congruential Generator (I used the X = 214013*X + 2531011 variant because it cuts out a few multiplications). That by itself takes nine 8-bit by 8-bit multiplications and 56 cycles. I rewrote it myself in assembly language rather than using the built-in generator from avr-libc, because the latter it not very optimized (it uses a full 32-bit multiplication which is not necessary).

If you use the 16-bit modulo function from libgcc it takes something like 215 cycles, which was too many. Even unrolled it's something like 144, which is still too many. I was about to embark on the work to spread this loop across several samples when I read this which describes a way of turning a division by a constant into a multiplication by a constant. That's not very useful for arbitrary division, since computing the multiplier constant is more work than just doing the division. But we're not doing arbitrary division here - we know something about our divisor n - it is no larger than 255. So it becomes practical to precalculate a table of multipliers and look up an entry in this table at runtime.

The next problem is that that method only works when the division is exact (has remainder zero) which is no good for this application since it's the remainder we're interested in. But the classic bit-twiddling reference book Hacker's Delight describes a variation which does work (for some divisors the dividend needs to be added back after the multiplication, and for some the result needs to be shifted right by a number of bits depending on the divisor). So our mod algorithm looks like this:

  1. Look up multiplier in table indexed by divisor
  2. Multiply dividend by multiplier
  3. Look up helper routine in table indexed by divisor (there are 18 variations - 9 possible shifts and either "add" or "don't add", but not all of them are used) and jump to it.
  4. Multiply by the divisor and subtract the result from the original dividend - the result is the remainder.

The resulting mod routine takes 40-53 cycles (depending on divisor) giving 96-109 cycles for the entire random number routine. With various other bits of overhead, the random routine takes 253 cycles on this "init" sample and up to 29 per sample on the first frame of the cycle.

Algorithm for finding "hot" records in a database

Wednesday, October 15th, 2008

Suppose you have a database, and (as often happens with databases) records change from time to time. Suppose also that you'd like to maintain a list of the "hottest" records, that is the ones which have been changing a lot lately.

The first thing you have to determine is whether you want to put the emphasis more on "a lot" or "lately" - i.e. you need to have a characteristic time tc such that n changes tc ago are equivalent to n times e changes now. This time determines how quickly changes "decay" into irrelevance. Depending on your application, this might be a day or so.

The next thing you might try is to keep a table of all the changes made, along with a time for each. Then you can just weight the change times according to how long ago they are and add them up. That's going to be a big table and an expensive operation, though.

A clever trick is to use a running average and "last changed" timestamp in each row of the original table. The running average starts off at 0. Each time the row is modified, calculate the number of characteristic times since the last change N = (tnow-tlast)/tc, update the average by multiplying it by e-N and adding 1 and then update the old "last changed" timestamp to tnow for the next change.

To show that this works, suppose the running average was a=1+e-N1+e-N1-N2+e-N1-N2-N3+... (one term for each change, weighted by how long ago they happened). When we update the running average it becomes 1+e-N(1+e-N1+e-N1-N2+e-N1-N2-N3+...) = 1+e-N+e-N-N1+e-N-N1-N2+... which is just what we want.

That isn't quite the end of the story though because the running averages in the table are not directly comparable to each other - if a record had a burst of activity a long time ago but then hasn't been touched since, it will have a similar activity to a record which had a similar burst of activity which has only just ended. To compute the "current" value of the running average we need to multiply a by the e-N corresponding to the time since it was last updated (without adding one this time, since we haven't added another unit of activity). This requires looking at all the records in the table though, which will be faster than the table of changes approach but might still be rather slow for a big database.

If we only care about the top (10, say) "hottest" records, we can speed it up by caching the results in a small table, and noting that scaling all the activity values by the same factor doesn't affect the ordering of the list. Suppose we have a singleton value tupdate which is the time we last updated the small table and a10 which is the activity of the 10th hottest record the last time it was changed. Whenever we change a record, take the new activity value a, multiply it by eN (note no minus sign here) where N=(tnow-tupdate)/tc and compare it to a10. If it's larger the new record should be inserted into the "top ten" table and the old 10th hottest record shuffled out (if the new record wasn't already in the table) - think of a high score table for a game. When this happens, set tupdate=tnow, multiply all the activity values in the small table by e-N and update a10 with the new value. Then when you need to display the hottest records just display this table.

There is one more complication which comes about from deleting records. If you delete a record it probably shouldn't appear in the "hottest" records list, even it was updated quite recently. But if you delete a record from the small table when it is deleted from the big table, you will only have 9 records in the small table and you'd have to go searching through the entire big table to find the new 10th record.

If records don't get deleted from your database too often, a simple workaround to this problem is to keep maybe 20 records instead of 10 in the small table so that there are plenty of "substitutes" around, and only display the top 10 of them.

The algorithms used by Digg, Reddit, StackOverflow etc. are a little more complicated than this because the records of those sites also have a "rating" which is factored in (higher rated records are considered "hotter") but which can change with time. There might be a way to deal with this by scaling the running average according to the rating and updating the hot records table when the rating changes.

Lispy composable compiler

Friday, July 4th, 2008

When I finally write the compiler for my super-language, I want to make it nice and modular - i.e. make it very easy to compile different languages, target different architectures and include different optimizations.

In order to achieve this I will need to have some kind of intermediate form in which pieces of code in various states of compilation to be transmitted from one module to another.

This form should be as simple as possible but should also make it possible to express all the complexities that occur in real code. In short, it will need to be a tree structure.

While I'm not convinced that the Lisp idea of "code without syntax" is a great one (I think every available visual/syntactical clue makes it easier to understand unfamiliar code) I have to admit that the lisp data structures (SExps and Lists) are perfect for this role. I wouldn't expect people to be reading large quantities of this code but it's handy to be able to print it out for debugging purposes.

A Symbol (what lispers call a SExp, S-Expression or Symbolic Expression) is either an Integer, a String, an Atom, a List or a Vector. Integers are what you'd expect them to be. Strings are used to hold things like variable names. Atoms are displayed as strings but treated as unique, opaque integers internally. All you do with an Atom is compare it to another Atom and see if they are the same. See Stringy Identifiers for another perspective on these. This is not quite the same as the lisp usage of the word, where integers are also atoms.

A List (delimited with parentheses) is effectively an object. It consists one or more Symbols, the first of which must be an Atom. It has a type, which is represented by its Atom and the number and types of the following Symbols.

A Vector [delimited with square brackets] is effectively an array of length possibly not known until run-time. This can be used for a sequence of statements (for example) or for the bytes in the compiled program before it is output to disk. Lisp doesn't distinguish between lists and vectors but I think it is useful to keep them separate.

It is not a coincidence that this is quite similar to the RTL of GCC. Many of the same problems are being solved, and I was learning about RTL when I came up with this. However, this structure is a little more general-purpose and a little simpler. In particular it can also be used for front-ends.

These expressions will generally be manipulated by pattern matchers - that is, we search through the program for an object matching some characteristics, perform some transformation on it and then substitute the result for the original object.

For example, here is a part of an x86 assembler written in a hypothetical language which has support for these Symbols built in:

    with(x) {
        match (`jmp` d:Int):
            if (isSByte(d))
                becomes (`db` [0xeb d]);
                becomes (`db` [0xe9 bytesFromDWord(d)]);

This matches a List which consists of the `jmp` atom followed by an integer (which we will call d). We pass that integer to a function and (depending on the result) we emit one of two possible forms of the x86 JMP instruction. As "bytesFromDWord" is known to be a function and not an Symbol, it is evaluated and the result placed into the vector rather than two Symbols (bytesFromDWord and (d)) being created and placed into the vector.

This works just as well for the front end. The "unless" statement that I have written about before can be broken down using this code:

    match (`unless` condition): becomes (`if` (`!` condition));

And we should have this optimization as well, in case condition is itself a negated condition:

    match (`!` (`!` x)): becomes (x);

Quite a lot of the compiler can be written very easily and tersely this way. There are some complications to do with the order in which transformations are applied, though - we would rather avoid having to specify that explicitly and let the computer figure it out. But the order might depend in quite a complicated way on the context. One way to deal with this might be to keep the pre-transformed Symbol around and continue to match against it even after it has been replaced. Another might be for transformations to transform other transformations as they are applied to the Symbols. I haven't quite figured this bit out yet. This seems like one of those problems that might be technically intractable but always works out to be fast enough in practice.

With some cleverness, it ought to be possible to write a very good compiler this way. Great speed is possible as no string handling is done after the initial parsing phase and symbol table lookups (e.g. we don't write out assembler instructions and rely on a separate assembler to assemble them, though we could still easily obtain assembly code if necessary by having a separate output module.) Many speed improvements are possible (e.g. by performing some transformations when the compiler is compiled).

But the really great thing about this scheme is that code for the compiler is completely self-describing - it's exactly the code one would hope to be able to write.

Having a set format for code also enables the scenario I wrote about in Compiler compilers.

Finally, this structure works just as well for disassemblers and decompilers as it does for compilers and assemblers. Some of the algorithms will be different most of the transformations probably cannot be reversed directly but many of the same patterns show up.