Random numbers

A commenter asked how random programs like random song shuffle work. How can a computer (which follows the same instructions each time) have unpredictable behavior?

Well, to understand this, first think about how random numbers are generated in the "real world" outside of computers. One way to do this is to throw dice - as long as you don't do something tricky, most people would agree that that's a perfectly reasonable source of randomness.

Except that if you really think about it, it isn't really random at all. In principle, one could make a machine that throws dice exactly the same way each time and always get a double six. You'd have to isolate it from stray air currents, minor gravitational influences from passing traffic and so on (as needs to be done for many sensitive scientific experiments). If you think about it, the randomness you get from throwing dice is really just due to these kinds of factors (and the fact that we human beings can't control our muscles finely enough to exert precisely the same forces each time). So really, what you're doing when you're throwing dice is "amplifying" these unpredictable factors (air currents, position, throwing force etc.) so that any tiny changes in them result in an entirely different number coming up.

Computers generate random numbers the same way. Well, sort of - they don't simulate the physics of three-dimensional dice being buffeted by air currents and bouncing off things - they do something that doesn't require such a complicated program but the same principle is at work. One reasonably common one (not the best, but fine for uses such as games) is to multiply by 22695477, add one and then take the remainder after divison by 4294967296. If I give you the sequence 953210035, 3724055312, 1961185873 you'd be hard pushed to tell me that those numbers were generated by that formula starting from the number 42 (maybe not quite so hard pushed as telling me the precise forces acting on dice given the numbers that came up but you get the idea). Other random number generators use the same idea but with more complex formulae to make the patterns even more obscure.

The problem with this method is that (as with any computer program) given the same input you're going to get the same output every time. And indeed this was a problem with some home computers and calculators in the 80s - if you tried to use the random number generator right after turning the machine on you'd get the same "random" numbers each time. The solution is to "seed" the random number generator somehow. The usual method is to use the current time. While it's predictable in some sense, you will get a different random sequence every time in practice so for most purposes you will never notice a pattern.

For some other purposes (like generating cryptographic keys or passwords, which have to be unpredictable even to people who are trying very hard to predict them) just using the time isn't enough and you have to mix in other sources of unpredictability (or "entropy" to use the technical term). This is why some SSH clients ask you to bash on the keyboard or wiggle the mouse the first time they are run - someone who wants to intercept your communications would have to know exactly which keys you pressed and how you wiggled the mouse (with extremely high precision). This is very easy to get wrong (and difficult to know when you've got it wrong) and can lead to security holes. Hence the old saw that "generation of random numbers is too important to be left to chance"! Some computers even have hardware built in which generates random numbers from unpredictable physical processes like thermal noise or quantum wavefunction collapse.

Once you've got a sequence of (pseudo-)random numbers evenly distributed over some range (for example 0 to 4294967295 in the above example) there are a variety of techniques to massage these into various different forms for different purposes. If you want an integer in the range 1 to 10 you can find the remainder after division by 10 and then add 1. If you want a random real number in the interval 0 to 1 you can divide by 4294967295. If you want numbers from a Gaussian distribution you can use a Box-Muller transform. And if you want to shuffle songs in your playlist you can start with an empty playlist and add a random song to it (removing it from the original playlist each time) until the original playlist is empty.

This last algorithm has a flaw in some sense (though it's more of a flaw in our brains). While all possible orders are equally likely, it will tend to play two songs in a row by the same artist more often than you would expect (just look at the bug database for any open source media player application to see the evidence for this). The problem isn't that the numbers aren't random enough, it's just that that if you have n artists in your playlist you'll see this phenomenon once every n songs on average. The problem is that we tend to notice this when it happens, and because the times when it happens stand out more than the times when it doesn't the problem seems to be worse than it really is. Another factor is that we are used to radio stations, which deliberately avoid doing this by hand-picking the order of songs rather than using a random shuffle. Some media player applications have appeased the complainers by actually making their shuffle features *less* random - adjusting the algorithm so it avoids picking orders which have two songs by the same artist in a row. This seems hacky to me, but if users like it I suppose I can't disagree with that.

Leave a Reply