Algorithm for finding "hot" records in a database

Suppose you have a database, and (as often happens with databases) records change from time to time. Suppose also that you'd like to maintain a list of the "hottest" records, that is the ones which have been changing a lot lately.

The first thing you have to determine is whether you want to put the emphasis more on "a lot" or "lately" - i.e. you need to have a characteristic time tc such that n changes tc ago are equivalent to n times e changes now. This time determines how quickly changes "decay" into irrelevance. Depending on your application, this might be a day or so.

The next thing you might try is to keep a table of all the changes made, along with a time for each. Then you can just weight the change times according to how long ago they are and add them up. That's going to be a big table and an expensive operation, though.

A clever trick is to use a running average and "last changed" timestamp in each row of the original table. The running average starts off at 0. Each time the row is modified, calculate the number of characteristic times since the last change N = (tnow-tlast)/tc, update the average by multiplying it by e-N and adding 1 and then update the old "last changed" timestamp to tnow for the next change.

To show that this works, suppose the running average was a=1+e-N1+e-N1-N2+e-N1-N2-N3+... (one term for each change, weighted by how long ago they happened). When we update the running average it becomes 1+e-N(1+e-N1+e-N1-N2+e-N1-N2-N3+...) = 1+e-N+e-N-N1+e-N-N1-N2+... which is just what we want.

That isn't quite the end of the story though because the running averages in the table are not directly comparable to each other - if a record had a burst of activity a long time ago but then hasn't been touched since, it will have a similar activity to a record which had a similar burst of activity which has only just ended. To compute the "current" value of the running average we need to multiply a by the e-N corresponding to the time since it was last updated (without adding one this time, since we haven't added another unit of activity). This requires looking at all the records in the table though, which will be faster than the table of changes approach but might still be rather slow for a big database.

If we only care about the top (10, say) "hottest" records, we can speed it up by caching the results in a small table, and noting that scaling all the activity values by the same factor doesn't affect the ordering of the list. Suppose we have a singleton value tupdate which is the time we last updated the small table and a10 which is the activity of the 10th hottest record the last time it was changed. Whenever we change a record, take the new activity value a, multiply it by eN (note no minus sign here) where N=(tnow-tupdate)/tc and compare it to a10. If it's larger the new record should be inserted into the "top ten" table and the old 10th hottest record shuffled out (if the new record wasn't already in the table) - think of a high score table for a game. When this happens, set tupdate=tnow, multiply all the activity values in the small table by e-N and update a10 with the new value. Then when you need to display the hottest records just display this table.

There is one more complication which comes about from deleting records. If you delete a record it probably shouldn't appear in the "hottest" records list, even it was updated quite recently. But if you delete a record from the small table when it is deleted from the big table, you will only have 9 records in the small table and you'd have to go searching through the entire big table to find the new 10th record.

If records don't get deleted from your database too often, a simple workaround to this problem is to keep maybe 20 records instead of 10 in the small table so that there are plenty of "substitutes" around, and only display the top 10 of them.

The algorithms used by Digg, Reddit, StackOverflow etc. are a little more complicated than this because the records of those sites also have a "rating" which is factored in (higher rated records are considered "hotter") but which can change with time. There might be a way to deal with this by scaling the running average according to the rating and updating the hot records table when the rating changes.

Leave a Reply