Archive for the ‘computer’ Category

Algorithm for finding “hot” records in a database

Wednesday, October 15th, 2008

Suppose you have a database, and (as often happens with databases) records change from time to time. Suppose also that you’d like to maintain a list of the “hottest” records, that is the ones which have been changing a lot lately.

The first thing you have to determine is whether you want to put the emphasis more on “a lot” or “lately” - i.e. you need to have a characteristic time tc such that n changes tc ago are equivalent to n times e changes now. This time determines how quickly changes “decay” into irrelevance. Depending on your application, this might be a day or so.

The next thing you might try is to keep a table of all the changes made, along with a time for each. Then you can just weight the change times according to how long ago they are and add them up. That’s going to be a big table and an expensive operation, though.

A clever trick is to use a running average and “last changed” timestamp in each row of the original table. The running average starts off at 0. Each time the row is modified, calculate the number of characteristic times since the last change N = (tnow-tlast)/tc, update the average by multiplying it by e-N and adding 1 and then update the old “last changed” timestamp to tnow for the next change.

To show that this works, suppose the running average was a=1+e-N1+e-N1-N2+e-N1-N2-N3+… (one term for each change, weighted by how long ago they happened). When we update the running average it becomes 1+e-N(1+e-N1+e-N1-N2+e-N1-N2-N3+…) = 1+e-N+e-N-N1+e-N-N1-N2+… which is just what we want.

That isn’t quite the end of the story though because the running averages in the table are not directly comparable to each other - if a record had a burst of activity a long time ago but then hasn’t been touched since, it will have a similar activity to a record which had a similar burst of activity which has only just ended. To compute the “current” value of the running average we need to multiply a by the e-N corresponding to the time since it was last updated (without adding one this time, since we haven’t added another unit of activity). This requires looking at all the records in the table though, which will be faster than the table of changes approach but might still be rather slow for a big database.

If we only care about the top (10, say) “hottest” records, we can speed it up by caching the results in a small table, and noting that scaling all the activity values by the same factor doesn’t affect the ordering of the list. Suppose we have a singleton value tupdate which is the time we last updated the small table and a10 which is the activity of the 10th hottest record the last time it was changed. Whenever we change a record, take the new activity value a, multiply it by eN (note no minus sign here) where N=(tnow-tupdate)/tc and compare it to a10. If it’s larger the new record should be inserted into the “top ten” table and the old 10th hottest record shuffled out (if the new record wasn’t already in the table) - think of a high score table for a game. When this happens, set tupdate=tnow, multiply all the activity values in the small table by e-N and update a10 with the new value. Then when you need to display the hottest records just display this table.

There is one more complication which comes about from deleting records. If you delete a record it probably shouldn’t appear in the “hottest” records list, even it was updated quite recently. But if you delete a record from the small table when it is deleted from the big table, you will only have 9 records in the small table and you’d have to go searching through the entire big table to find the new 10th record.

If records don’t get deleted from your database too often, a simple workaround to this problem is to keep maybe 20 records instead of 10 in the small table so that there are plenty of “substitutes” around, and only display the top 10 of them.

The algorithms used by Digg, Reddit, StackOverflow etc. are a little more complicated than this because the records of those sites also have a “rating” which is factored in (higher rated records are considered “hotter”) but which can change with time. There might be a way to deal with this by scaling the running average according to the rating and updating the hot records table when the rating changes.

JavaScript vs PHP

Monday, October 13th, 2008

In order to implement Tet4 I learnt two new languages - JavaScript (or JScript, or ECMAScript - the language has a bit of an identity crisis) and PHP. Why PHP? It’s installed on my web hosting server and seems to have a huge community of people writing code in it and pre-written scripts. It may not be the ideal language for writing web server apps, but it does seem to be the most well-supported.

JavaScript seems to be a very clean, pretty language. The whole closure thing seemed a bit weird at first but once I understood that “class” is spelled “function” and “public” is spelled “this.” I got to rather liking it. I especially like how each scope has access
to the variables from all the outer scopes - that saves a lot of messing about. It’s very well integrated with the browser - manipulating the DOM feels very natural and not tacked on.

PHP on the other hand is a bit of a mess. It is as if its designers had a little spinner with markings “C, C++, Perl” which they spun each day to decide what languages features to copy that day. If JavaScript was sent by God, surely PHP was sent by the devil.

W3Schools has been an excellent reference for learning all this.

I have to say though that automatically promoting integers to double-precision floating point numbers on overflow is weird. On IE7, computing the value of 1111111111*1111111111 gives 1234567900987654400 (you can easily see this is wrong because it’s even). This caused a rather hard-to-debug problem with my random number generator (which assumed that when multiplying two 32-bit integers together, at least the low 32 bits of the result should be correct). If you’re going to automatically promote numbers, at least have the decency to use a multiple-precision integer library - there are lots around.

Touch typing is irrelevant for programmers

Saturday, October 11th, 2008

Steve Yegge argues that touch typing is a fundamental skill that all programmers should have.

I disagree. Being able to type at a reasonable speed is important, but touch typing isn’t. I never learnt to do “proper” touch typing, but I type pretty fast and I don’t look at the keyboard when I type. I also don’t always use the same finger for each key - I use the finger which happens to be nearest to the key I want to press. I’m not overly terse when writing comments, emails and documentation. Improving my typing speed wouldn’t make me more productive - the bottleneck is my brain, not my fingers.

Touch typing was invented for typing dictated text on mechanical typewriters which required a lot more force to operate than a computer keyboard. Programming is very different. Working on code there is much more punctuation, much more moving around the document and much more reading. Keeping hands in the touch typing position means that they are not in good places for punctuation and cursor movement, which is slower and leads to RSI.

Touch typists also dislike laptop keyboards. I prefer laptop keyboards to full-size keyboards - the keys are closer together and travel less far, which makes typing faster.

Programs to write programs

Wednesday, October 8th, 2008

When one needs to write a computer program but that computer program would be particularly difficult and (especially) repetitive to write, a useful pattern is to instead write a program to write the program for you. Programming in assembly language is very tedious so we have compilers. Configure scripts are fiddly and tend to be very similar so we have autoconf. Makefiles are hard to get right so we have automake. When building GCC, there are many files written in domain-specific languages (such as machine description files and opt files) which are used to used to generate the final C source code that is compiled to produce the final GCC. Even the C preprocessor is best thought of as a separate language from C itself, to allow for some limited compile-time code generation.

I wonder how far it is sensible to take this. If the program that writes the program is itself difficult to write, can we write a program to write that one too? At some point it seems that you will reach a point of diminishing returns - adding layers of code generation makes things more complicated, not less. Perhaps the kinds of programs we can understand (and therefore write and debug) are limited by the number of layers of code generation which need to be used. Perhaps with better tools for understanding these layers (such as this), we can write programs that are more sophisticated and less buggy.

Another factor is that in many languages (especially compiled languages) it is quite difficult to write code that writes code - you have to make sure that a suitable compiler is installed, invoke it etc. I hope that future compiled languages will include
code generation libraries which make this easier.

Stupid image resizing

Wednesday, October 1st, 2008

It greatly annoys me when web pages don’t resize images properly. In particular, one thing that there seems to be an epidemic of at the moment is web pages that embed images at one resolution and then use styles like “max-width: 100%” in a column narrower than the image. This makes the images look horrible because most browsers (at least IE7 and Firefox 2) resize the images by just dropping columns, causing diagonal lines and curves to be all wavy. At least Firefox 3 gets this right and resamples the image but it’s still a waste of bandwidth.

Along similar lines, here is an interesting article about the interactions between gamma correction and image scaling. I hadn’t thought about that before (my own image resizing routines always just assumed a linear brightness scale) but this has definitely opened my eyes.

How to fix Microsoft

Monday, September 29th, 2008

My post yesterday kind of devolved a bit into ranting about how Microsoft’s processes have the unindented consequence of causing the code to increase in complexity in the long term. This begs the question “how could Microsoft change to avoid this?”

Back in the 90s, Microsoft was generally considered to be an excellent place to work (much as Google is now). They still are to some extent but today that “excellent” means more “sensible” than “exciting”. In the 90s, Microsoft was often described at having a culture that resembled a collection of many small startup companies rather than one large corporation. I think that was pretty much gone by the time I started working there in 2001 - by then it was very corporate (sometimes embarrasingly so). Meanwhile, startup companies are doing the kinds of development that Microsoft can only dream about, on tiny budgets.

Reading Paul Graham’s essays about startups has given me some hints about how these startups can be so effective in the way that Microsoft can’t, as well as why the ones that fail do.

There is a whole spectrum of corporate cultures, from two or three guys in an apartment at one end (let’s call it the left end for the sake of argument) to Microsoft at the other (the right end). The left (when successful) is great at motivating their employees to do great things, rapidly prototyping and turning over problems quickly. The right is great at (eventually) producing very large, complicated, highly polished pieces of software, being consistent and avoiding risk.

To really be successful you need to work at both ends of the spectrum (and everywhere in between). The ecosystem of people founding startups, writing some great software and then getting bought by Microsoft (or Yahoo, Google, whoever) has done some pretty great things. But does Microsoft need all those startups to exist externally to itself in order to do the startup-y things? Perhaps not, if it can return to the culture it had in the 90s.

Startups need only a couple of things to exist - talented people and money. Both of these Microsoft has in spades. But they also need these people to be very motivated. Microsoft isn’t a great motivator of people anymore - the difference in rewards between those who are putting in a reasonable minimum effort and those who are going all out is (in the grand scheme of things) small and it takes a long time of working very consistently hard to rise above the crowd. The average employee knows that it is very unlikely that they will ever be able to make a big difference to anything. Certain philosophical positions adopted by Microsoft mean that they are no longer highly regarded by the geeks that they need the most.

One thing Microsoft can do to revitalize itself would be to incubate startups inside itself. Allow employees to form small teams to solve some specific problem. If they succeed, they are rewarded with lots of money. Even if this approaches the amount that they might get from a successful startup this would probably be much cheaper for Microsoft than buying the equivalent startup, assuming a reasonably high success rate. These internal startups could even compete (especially if there are multiple approaches that seem promising).

These startups should be autonomous for the most part - they should be able to choose the software and programming languages that makes the most sense for them. They should not have to coordinate with other teams who might be doing similar things, or worry about treading on the toes of others. They should be allowed to concentrate on writing great software. Unlike startups in the real world, they won’t have to worry about getting funding (just getting a first version working in the agreed upon time). They can have office space on campus or work in somebody’s apartment if they prefer. If Paul Graham is to be believed, these factors will cause the success rates of these internal startups to be much greater than those of startups in the real world.

While the resulting software might not meet Microsoft’s standards for security, localization etc., these things are probably better done at the more corporate end of the spectrum, by those employees who prefer the stability and “sensible”ness that you get with that corporate culture. The left will quickly produce imaginative, highly competitive software and the right will polish it to corporate standards - the best of both worlds.

The downside for Microsoft of this approach would be the drain of talent from the “classic” product units, and possibly also from the company altogether (if successful “startup” employees were rewarded with what Neil Stephenson describes as something rhyming with “luck you money”). But if that drain is happening already, maybe there’s nothing to lose.

The search for simplicity

Sunday, September 28th, 2008

There are several ways in which computer programming and physics are very similar. Possibly the most important is that both disciplines are, fundamentally, a search for simplicity.

In physics, we have a big pile of experimental results and we want to find the simplest theory that satisfies them all. Just listing all the experiments and their results gives you a theory of physics, but not a particularly useful one since it’s not very simple and doesn’t predict the results of future experiments (only the past ones). Rather than just listing the results, we would like to find a general theory, an equation, a straight line through the points of data which allows for interpolation and extrapolation. This is a much more difficult thing to do as it requires insight and imagination.

In computer programming, we generally have a big pile of specifications about what a program should do - maybe a list of possible interactions with the user (what they input and what they should expect to see as output). These might be encapsulated as testcases. To write a program that satisfies all the testcases, we could just go through them all one by one, write code to detect that particular testcase and hard-code the output for that particular input. That wouldn’t be very useful though, as the program would fail as soon as the user tried to do something that wasn’t exactly one of the scenarios that the designers had anticipated. Instead we want to write programs for the general case - programs that do the right thing no matter what the input is. When the “right thing” isn’t precisely specified, we get to choose the output that makes the most sense according to our internal model of how the program should act.

I think a number of software companies in recent years (Microsoft in particular but others as well) have started to fall into the trap of writing software that concentrates too much on what the behavior of the software should be for particular (sometimes quite specific) scenarios, at the expense of doing the right thing in the most general case. Windows is chock full of “special case” code (”epicycles” if you will) to work around particular problems when the right thing to do would have been to fix the general problem, or sometimes even to explain that this is how we should expect it to work. Here is one example of this kind of band-aiding. I discovered another the other day - I was running some older Windows software in Vista and accessed the “Help” functionality, which was implemented an old-style .hlp file. Vista told me that it no longer includes the .hlp viewer by default (I guess it was a piece of the OS that doesn’t get a lot of use these days, and they had just dropped it from the default distribution to avoid having to bring it up to the latest coding standards). I was pointed to the download location (where I had to install an ActiveX control to verify that my copy of Windows was genuine before I was allowed to download the viewer).

Part of the problem is that (at Microsoft at least) it’s very difficult to make big changes. Rewriting some core piece of functionality, even if the programming itself is easy, would involve months of planning, scheduling, designing, specification writing, testcase writing, test-plan reviewing, management sign off meetings, threat modelling, localization planning, documentation planning, API reviewing, performance testing, static analysis, political correctness checking, code reviewing and integrating. And of course everyone whose code might possibly be affected by the change needs to sign off on it and put in their two cents about the correct design. And it must comply with the coding standards du jour, which change every year or two (so delay too long and you’ll probably have to start all over again.) When you come to understand all this, the long gap between XP and Vista becomes less surprising (in fact, it’s quite a surprise to me that it only took a little over 5 years, considering how many pieces were completely rewritten). All this process exists for a reason (mostly the politician’s fallacy) but is rigorously justified and widely accepted.

Because it’s difficult to make big changes, people tend to make little changes instead (”hey, we can work around this by just doing x in case y - it’s just one extra line of code”) - these don’t require much process (usually just a code review - most of the rest of the processes for such small changes is automated). All these small changes add up to a great deal of extra code complexity which makes it very difficult for newcomers to understand the code, and even more difficult to rewrite it in the future because people will have come to depend on these edge cases.

Ray tracing in GR

Saturday, September 27th, 2008

Following on from this post, a natural generalization is that to non-Euclidean spaces. This is important for simulating gravity, for example rendering a scientifically accurate trip through a wormhole (something I have long wanted to do but never got to work). The main difference is that ones rays are curved in general, which makes the equations much more difficult (really they need to be numerically integrated, making it orders of magnitude slower than normal ray-tracing). One complication of this is that generally the rays will also curve between the eye point and the screen. But the rays between your screen and your eye in real life do not curve, so it would look wrong!

I think the way out of this is to make the virtual screen very small and close to the eye. This doesn’t affect the rendering in flat space (since only the directions of the rays matter) and effectively eliminates the need to take into account curvature between the screen and the eye (essentially it makes the observer into a locally Euclidean reference frame).

Another complications of simulated relativity is the inability to simulate time dilation. Well, you can simulate it perfectly well if you’re the only observer in the simulated universe but this would be a big problem for anyone who wanted to make a relativistically-accurate multiplayer game - as soon as the players are moving fast enough with respect to each other to have different reference frames, they will disagree about their expected relative time dilations.

The hobgoblin of little minds

Thursday, September 25th, 2008

Most software projects with more than one programmer seem to enforce some kind of formatting style for the code - brace positions, indent width, use of tabs - that sort of thing.

At Microsoft, we didn’t spend a lot of time talking about the style but we did have one rule - you should try to make your code consistent with the code around it. (If you were in the fortunate position of starting a brand new project from scratch, you got to choose the style yourself.) At least until the tyrannical StyleCop showed up. I left before using it for very long but I hated having to placate it (especially when I disagreed with its rules - for example, it wouldn’t let me insert extra blank lines to group related functions, or arrange my functions in a more logical order than the standard one).

The GNU coding standards are similarly strict. I haven’t disagreed with them very much (though I dislike the convention of having two spaces after a full stop).

I suppose having style guidelines (provided they are good ones) does make the source code look prettier and more consistent. However, I’m not convinced that it is worth the effort, especially since any programmer will have to be able to read code written in any style anyway (lest they start making assumptions and get fooled by a malevolent patch). In fact, there may be certain benefits to allowing every programmer to adopt their own personal favorite style. For one thing you’d be able to tell at a glance just who wrote any particular piece of code in your program (assuming that there were a small number of contributors and you’re familiar with all their work, which I don’t think are particularly bad assumptions in many cases).

Programmers’ personal style also changes with time, so this can also be a good gauge for how old a particular piece of code is.

However one chooses to format their code, it is important that is readable - not having indentation at all, or having inconsistent indentation in a given class or file, or having indentation that misleads you about which “if” an “else” is paired with should not be acceptable.

One might get the impression from reading the above that my own preferred coding style is not particularly consistent. Nothing could be further from the truth - I’ve spent many hours (probably far too many) reformatting code to my own personal taste (K&R style with the exception of putting opening braces of global functions and classes on column 1) to make it look prettier. I used to prefer indenting by 2 spaces but now I prefer 4 (a habit I picked up at Microsoft). As I like to avoid very deeply nested constructs (and like to be able to see how deeply nested I am easily), I may even increase that further in the future.

What makes a game addictive?

Wednesday, September 24th, 2008

I have an unfortunate habit of getting addicted to computer games - this is one reason why I don’t play them as much as I used to, and why when I do now play games, I usually pick one with a definite ending so that there’s a natural place to stop.

But occasionally I do slip into excess playing of Tet4, Freecell or Spider Solitaire. I think what makes these games particularly addictive is that restarting them is a move which seems to bring you closer to winning. In all these games, the state of the game starts out as quite simple and the moves are obvious, but as you play things get more complexified and constrained until either you lose or (if possible) winning becomes inevitable. Restarting decomplexifies so when you lose your brain (which is still thinking in terms of the game) naturally reaches for the “restart” key.