Archive for May, 2008

Diff algorithm

Wednesday, May 21st, 2008

Determining where the differences lie between two files is an important task in many areas of computing. I use it all the time when programming (to get an overview of my most recent changes). It's also used in file compression (so that only the changed parts of big files need to be stored or transmitted) and in genetics.

The usual algorithm for diff is to find the longest common subsequence (LCS) between the two files, or equivalently the shortest edit script (the set of insertions and deletions you need to transform one file into the other).

However, this may not be the best algorithm for looking at differences between pieces of source code. People who edit source code don't just insert and remove text, they also move things around. This isn't handled so well by the LCS algorithm. Refinements are possible (like trying to match up adds with corresponding deletes to turn them into moves) but I think better results might be obtained by using an algorithm designed from scratch to handle moves.

I have an idea for such an algorithm. First, find the longest common substring (LCSS) between the two files (unlike a subsequence, the characters in a substring must be consecutive in the original string). Then, remove this string from both of the files, and repeat until either no characters are left or the substring is sufficiently short that it might just as well be an add and a remove rather than a move.
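
To make the idea concrete, here is a minimal Python sketch of it. The function name and the minimum-length cutoff are my own, and SequenceMatcher's find_longest_match is only standing in for a proper linear-time longest-common-substring search:

    # Sketch only: repeatedly pull out the longest common substring, treating
    # each one as a matched block (a potential move), until the blocks get too
    # short to be worth more than an add plus a remove.
    from difflib import SequenceMatcher

    def lcss_diff(a, b, min_length=4):
        blocks = []
        while True:
            m = SequenceMatcher(None, a, b, autojunk=False) \
                    .find_longest_match(0, len(a), 0, len(b))
            if m.size < min_length:
                break                        # too short to count as a move
            blocks.append(a[m.a:m.a + m.size])
            a = a[:m.a] + a[m.a + m.size:]   # remove the block from both files
            b = b[:m.b] + b[m.b + m.size:]   # and look for the next-longest one
        return blocks, a, b                  # common blocks, leftover deletes, leftover inserts

    moves, deleted, inserted = lcss_diff("the quick brown fox", "brown fox, the quick")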

This can be done fairly efficiently (the LCSS can be found in O(N) time with a sufficiently sophisticated algorithm, such as one based on a generalized suffix tree) and has a number of advantages over the standard method:

  • The algorithm can be "tuned" for the particular type of files it's working on. For example, you might tell it to prefer edits which affect fewer syntactical units in the language you're working in.
  • If your edit script is too noisy, you can increase the minimum length of found substring.

The day after I originally wrote this, Bram Cohen (of BitTorrent fame) wrote up a diff algorithm of his own. I haven't had a chance to play with his algorithm yet (or indeed mine), but I'd be interested to see how they compare (pun intended).

2D game

Tuesday, May 20th, 2008

I don't tend to have time to play many computer games these days (though I made exceptions for Portal and Lost: Via Domus), but I still like to keep abreast of how technology trends change in games.

A concept that I think is interesting in games is to "raise the stakes" as the game progresses, so that later levels seem to be more important than earlier ones. A couple of contemporary games (neither of which I have played) that embody this principle are Katamari Damacy and Spore. As the game progresses, the very scale of the action increases, causing the early parts of the game to pale in comparison.

3D graphics have become much more impressive in the past few years but the trouble with 3D is that the extra degrees of freedom make the controls more complicated and difficult to learn (especially for people who don't play many 3D games).

I think it would be interesting to apply the "increasing scale" concept to a 2D game using modern graphical techniques. My idea is a 2D-scrolling shoot-em-up. You start off with a very small, simple "asteroids" spaceship with a very simple weapon. As you blow things up, you can collect more weapons which make your spaceship bigger. Using these bigger weapons, you can attack bigger and bigger targets. The view gradually "zooms out" as the size of the ship, weapons and targets increases.

This idea also reminds me of the Seven Force in Gunstar Heroes on the Megadrive. Easier versions of this boss appeared in an early level, and a later level had a "scaled out" version which was much harder and used smaller sprites.

And with modern technology, 2D games can be seriously beautiful. Aquaria does actually zoom in and out a bit, but not by orders of magnitude as I'm imagining.

Devolving the CPU

Monday, May 19th, 2008

Until fairly recently, most computers could only do one thing at a time. To put it in technical terms, they had only one "execution unit" to take instructions from memory and actually act upon them. Computers have been able to apparently run many programs at once for much longer than that, through sleight of hand (switching the execution unit between different programs many times per second - so fast that you can't tell they're not executing simultaneously).

But in recent years, CPU manufacturers have been finding it more and more difficult to make execution units faster. CPU speeds increased exponentially until a few years ago, but the fastest CPUs you can buy have been stuck at around 3.6GHz for several years now.

But Moore's law (which says that the number of transistors you can put in a single chip doubles every 18 months) has continued to hold. So what do we do with these extra transistors, if we're not using them to make the execution unit faster?

Well, they're instead being used to put multiple execution units on a single CPU, so that computers really can execute multiple programs at the same time, without switching between them. This has the effect of making computers faster (since these days they're rarely doing only one thing at a time anyway, or if they are, the work can often be divided amongst execution units to effectively increase the speed).

This means that programmers need to change the way they write programs, to obtain maximum speed from the latest machines. They can no longer write a program as a single list of instructions (or a single "thread of execution" to use the technical term) and expect it to run 1000 times as fast in 15 years' time. Instead, programmers must write programs that use multiple threads and divide the work between them. This "multi-threaded" style of programming is more difficult and (if you don't know what you're doing) tricky errors can occur (for example, some errors only happen if two particular parts of the program happen to be executed at the same time, and so can be very difficult to reproduce).
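
As an illustration of the kind of timing-dependent bug I mean, the Python sketch below deliberately widens the window between reading and writing a shared counter so that updates from other threads get lost. In real code the window is tiny, which is exactly why such bugs are so hard to reproduce:

    # Sketch of a "lost update" race: each thread reads the counter, pauses
    # (standing in for an unlucky context switch), then writes back a stale value.
    import threading, time

    counter = 0

    def increment():
        global counter
        value = counter          # read the shared value
        time.sleep(0.001)        # widen the race window for demonstration
        counter = value + 1      # write back, losing any updates made meanwhile

    threads = [threading.Thread(target=increment) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)               # usually prints far less than 10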

Once we've got the hang of doing multi-threaded programming properly, having any one execution unit go really fast won't be so important, because programs will be able to go just as fast by dividing up their work between execution units. This means that we will see a redistribution of transistors on future CPUs - we will go from a few very fast but complex execution units to many slower but simpler execution units, because we can make CPUs with better overall throughput that way.

There are several ways to simplify an execution unit. You can get rid of speed features like pipelining, branch prediction, out-of-order execution and speculation. Removing these gives you another advantage - it becomes much easier to determine how fast a thread will execute, which means less optimization and tuning is needed.

Another way is to simplify the instruction set (the operation codes that the execution unit actually interprets). Most currently used instruction sets are very complicated (in order to squeeze as much power as possible from a single execution unit), so there is great scope for simplification there. All the single-instruction, multiple-data (SIMD) instructions can be removed, since the parallelism they enable can be realized more easily with multiple execution units.

Another simplification would be to remove all the registers and use a pure stack machine architecture. A few processors that use this technique have been developed in the past, but they never took off as they were too slow. This may cease to be a consideration in highly parallel architectures.

Stack machines have a number of advantages besides simplicity. Compilers can also be made much simpler (the parse tree can be translated into code almost directly, with no register allocation or complicated instruction selection algorithms needed). The CPU stores less non-memory state, making the operation to switch an execution unit from one thread to another much quicker. This state could be as simple as an instruction pointer, a stack pointer and some context to identify the thread.
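
As a sketch of how direct that translation can be (the node layout and opcode names below are invented for illustration), a single post-order walk of an expression tree is essentially all the code generator needs:

    # Sketch: compile an expression tree to stack-machine code with one
    # post-order walk - no register allocation, no instruction selection.
    def emit(node, code):
        if isinstance(node, int):            # leaf: a literal
            code.append(("PUSH", node))
        else:                                # interior node: (operator, left, right)
            op, left, right = node
            emit(left, code)                 # leaves the left operand on the stack
            emit(right, code)                # leaves the right operand on top
            code.append((op,))               # operator pops two values, pushes one

    code = []
    emit(("ADD", 1, ("MUL", 2, 3)), code)    # 1 + 2*3
    # code is now [("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",)]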

Stack machines can also have much higher code densities than register machines, because you don't need to use any bits to specify which registers to work on - all instructions work implicitly on the top of the stack. Code density will continue to be important, as the bus between CPU and RAM is a bottleneck that will only get worse as the number of execution units increases.

Another technique that can improve code density is bit-aligned instructions. By assigning short opcodes to the most commonly used instructions and longer opcodes to the rarer ones, you effectively apply Huffman compression to the instruction stream. The Intel iAPX 432 used this technique (and was very slow, but since the shifting required to decode such an instruction stream is pretty easy to implement in hardware, I don't think this would be a big issue for a massively parallel CPU).

With a stack machine, certain patterns of instructions tend to occur over and over again. These can be assigned shorter codes (as is done in many popular lossless compression techniques like ZIP files).

Working out the optimal coding for this instruction set would be fairly simple in principle. First, design a minimal instruction set for a stack-based language without regard to code density (maybe just use one byte per opcode). Compile a large quantity of software to this architecture (spanning the range of possible uses - an operating system, application suite, development tools, games, emulators, number crunching, scientific visualization etc.). Next, determine the instruction frequencies across all these codebases and apply Huffman coding to figure out the optimal opcodes. With some fine-tuning, I think this could give significantly higher code densities than any of the architectures currently in use.
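
As a sketch of that last step (the opcode names and frequencies below are invented purely to show the mechanics), standard Huffman coding over measured instruction frequencies gives the bit patterns directly:

    # Sketch: derive bit-aligned opcodes from instruction frequencies.
    import heapq

    def huffman_codes(freqs):
        # Each heap entry: (frequency, tie-breaker, {opcode: code-so-far}).
        heap = [(f, i, {op: ""}) for i, (op, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)      # two least frequent groups
            f2, i, c2 = heapq.heappop(heap)
            merged = {op: "0" + code for op, code in c1.items()}
            merged.update({op: "1" + code for op, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, i, merged))
        return heap[0][2]

    print(huffman_codes({"PUSH": 45, "ADD": 20, "LOAD": 15, "STORE": 12, "CALL": 8}))
    # Common opcodes come out with short bit patterns, rare ones with longer patterns.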

Another advantage of having many execution units is that transistors can be assigned to different tasks in the most efficient possible way. For example, if you have one ALU per execution unit and it is only in use 70% of the time, you could instead have 7 ALUs per 10 execution units - the ultimate in superscalar design.

Economics

Sunday, May 18th, 2008

I used to think economics was a terribly dry subject, but lately (especially since I've become part of the working population and a homeowner) I have been finding it interesting to think about how wealth, money and value are created, destroyed and moved around.

For example, would we all be better off if we tried to create wealth that lasts at the expense of ephemeral wealth? Perhaps to some extent, but on the other hand one can have too many marble statues and too few cheese sandwiches.

Where the money goes

Saturday, May 17th, 2008

There was an interesting piece on the radio a while ago about some people who had a religious objection to war - specifically, their religion said that they could not pay taxes if that money would go to funding a war.

That got me wondering - what would be the consequences of trying to accommodate these people? Suppose that your tax forms had a series of boxes which you could check to tell the government what you want your tax money to go towards. Then the government could try to ensure that the war was funded only by the tax money from people who had ticked the "war" box on the tax form, and fund everything else with what is left over.

Of course, the government couldn't guarantee that people's money went where they wanted it to go (if they could, many people would probably just leave all the boxes unticked, or tick only the cheapest boxes, in an attempt to pay less tax or none at all).

But even as a non-binding thing, the aggregate data could be useful as a gauge of public opinion - if the war is costing much more money than the taxes paid by the people who actually think it's a good idea, maybe it's time to elect a government that will end the war. Similarly for roads, schools, healthcare and welfare. Such a thing might encourage governments to be more open about what your money is being spent on and why.

Copyrighting public domain information

Friday, May 16th, 2008

Some people seem to be confused about copyright and the public domain, thinking things like:

  1. I used some public domain code in my program, so I have to release it as public domain as well
  2. I should be able to take some public domain code, add my own copyright to it and then prevent anyone else from using the public domain version
  3. I need a new law passed so that I can copyright this list of facts I compiled.

These are all wrong. Public domain isn't a copyright license like the GPL - it's just the absence of copyright, meaning that you can do what you like with it.

There's nothing stopping anyone from taking copyright-expired works of art or lists of facts and copyrighting them, but such a copyright would be completely useless because it couldn't be enforced. If you sue someone for violating your copyright on such a work, they have an iron-clad defence: they can just say "my work is based on the public domain version, not your version" (in the case of copyright-expired works) or "I compiled my own list of facts, which happens to be the same as yours because both are based on the same reality" (in the case of a list of facts like a phone book), and you would not be able to prove that they had referred to your copyrighted work.

No new law is necessary because there's an easy workaround - just change something slightly to make it a new creative work - add a fake name to your phone book or change a few words in that old story. Then not only do you have a perfectly good legal case against anyone who copies your work, you also have a way to prove it (their copy will also have your changes). And you won't be "removing" anything from the public domain to boot. Sure, you probably can't make much money by taking public domain works, changing something and then releasing a copyrighted version, but that seems quite reasonable to me because you haven't actually contributed much (if anything).

I understand that map makers have used this technique in the past - changing the position of roads slightly or adding features to the map that don't exist in reality in order to make their maps copyrightable "creative works" and to enable them to track down counterfeiters.

Crisp protocol

Thursday, May 15th, 2008

When I was working for Microsoft, I used to take cheese sandwiches to work for lunch every day. I like to have salt and vinegar flavored crisps in my cheese sandwiches. None of the vending machines at Microsoft sold such things, so I had to take my own. This being the USA, salt and vinegar crisps are generally only sold in big bags, which would last me a week or more. I generally remembered to buy more crisps at the supermarket when I was getting close to running out, but remembering to actually take them to work was, for some reason, more problematic.

I therefore needed a crisp refill protocol. I used to email myself at home when I needed more, so I would remember to put crisps in my bag when doing my morning email. But after a while I found myself forgetting even in the short time between answering my email and making my lunch. I realized I needed to be reminded right when I was making my lunch, and worked out an even simpler protocol - I just put a twist tie in my empty lunch box, so that I would see it when refilling it. This was simpler and worked much better (though I still managed to fail to bring in crisps once while using it). This protocol does mean that all the twist ties end up at home, though.

Now that I work at home, this is all so much simpler.

Stack measuring

Wednesday, May 14th, 2008

We generally do what we can to make programs reliable, but there is one particular cause of software failure which seems to have been surprisingly under-attacked, probably because it is fairly rare in real-world situations. That failure is stack overflow.

Many real-world programs have recursive algorithms where the recursion depth depends on user input and is unchecked, and where the activation records are allocated on the stack. Current computer systems do not recover well from stack overflow (you generally get an access violation which the program cannot recover from because it is in an unknown state). This means that it is possible to supply input which crashes the program.

The compiler for my ideal language will detect when a program uses a potentially unbounded recursion and allocate the activation records for such functions on the heap instead of the stack. This means that calling such a function can potentially throw an out-of-memory exception, which is much easier to recover from than a stack overflow (since we still have a stack to work with, and because the state of the program is well-defined - as far as the calling function is concerned the called function threw an exception, and as far as the called function is concerned it never got called).
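
The transformation can be mimicked by hand in today's languages. In the sketch below, the "activation records" of an unbounded recursion live in an ordinary heap-allocated list, so pathologically deep input exhausts memory (catchably) rather than overflowing the machine stack. This only illustrates the idea, not what the proposed compiler would actually generate:

    # Sketch: measure the depth of an arbitrarily deep nested-list structure
    # without using native recursion - the pending "frames" live on the heap.
    def depth(tree):
        frames = [(tree, 1)]                 # heap-allocated activation records
        deepest = 0
        while frames:
            node, d = frames.pop()
            deepest = max(deepest, d)
            if isinstance(node, list):
                frames.extend((child, d + 1) for child in node)
        return deepest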

A nice side effect of this is that every function with an empty exception specification (one that cannot throw at all) now has a well-defined maximum stack size, which means that when you create a thread you know exactly how much stack it needs. This means stack guard pages are no longer necessary (the stack can be treated just like any other area of memory) and potentially means that threads which perform small tasks can be created with very little overhead, possibly making thread pools unnecessary.

A programmer can track the maximum stack size of every thread in their application. If a change causes the stack size to increase dramatically this may be due to calling a very high-level function from a very low-level function, which may indicate a bug. Similarly, adding recursion without realizing it also indicates a possible bug, so it would be nice if the compiler could also alert the programmer to that.

ALFE build system

Tuesday, May 13th, 2008

When I finally get around to building my ideal language, it will have a linker and build system built in. It seems very silly to write your code in one language, your makefile in another, your configure script in a third, and so on.

The compiler will be invoked with the name of one source file on the command line, and that source file includes all the other required source files by means of an "include" statement which works similarly to the "#include" of C and C++, but never includes any given actual file more than once.

The program can also specify the name of the output file to generate, which can be a binary (.exe or .dll for Windows, or equivalent on other platforms) or intermediate file to be used with a C or C++ linker (.obj or .lib, or equivalent). Similarly, object and library files can be included just as source files can.

On a first pass through the program, the compiler will enumerate the complete set of input files and check their timestamps. If no input files are more recent than the output file, compilation will stop early.
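
The check itself is simple. Here is a rough Python equivalent, where collect_inputs stands in for the compiler's include-gathering pass (it is not a real function of any existing tool):

    # Sketch of the early-exit test: if no input file is newer than the output,
    # there is nothing to do.
    import os

    def up_to_date(output, input_files):
        if not os.path.exists(output):
            return False
        output_time = os.path.getmtime(output)
        return all(os.path.getmtime(f) <= output_time for f in input_files)

    # e.g. if up_to_date("program.exe", collect_inputs("main.alfe")): skip compilation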

Eric Clapton story

Monday, May 12th, 2008

A friend of mine told me this story a while ago. I have no idea if it's true. A friend (or possibly a friend of a friend) of his was a big Eric Clapton fan, but lived out in the boonies far away from any potential tour stops. But eventually he learned that Clapton would be playing at a (relatively) nearby city, and started saving his pennies so that he could go.

He managed to gather together enough money to buy a beaten-up old car which (he hoped) would hold together long enough to get him to the city and back. After an epic journey he arrived in the city on the day of the concert and started trying to buy a ticket - only to find out that the show was completely and utterly sold out. Some tickets were being scalped, but for much more money than he had left. Dejected, having got so far only to be foiled at the last hurdle, he wandered into a nearby seedy bar to drink himself into a stupor.

The barman asked why he looked so depressed, and listened with a sympathetic ear as the fan related his story.

Some hours later, the sounds from outside the bar suggested that the concert had ended and our hero decided that he should get back to his car to sleep off the drink before the long drive home. The barman enigmatically said "just stick around a little longer".

And he was right to - because shortly after that, Eric Clapton himself walked into the bar and proceeded to jam for the small audience there until the early hours.