Archive for the ‘algorithms’ Category

Scripted merge

Saturday, May 24th, 2008

I've been thinking a bit about version control systems. It occurs to me that almost every element of a version control system is actually fairly easy technically except for two - diff and merge.

Diff is the process of finding the difference between two files - I've written about some ways to deal with this in the past.

Merge is process of taking two changes X and Y and finding the change Z consisting of both X and Y. This sounds easy but it's really solving the "A is to B as C is to what" problem (the original being A, X being B, Y being C and Z being the answer). Suppose X is adding a method to a class and Y is renaming the same class. Most merge systems would add the new method with the original name (since Y didn't include renaming of the new method).

I think the answer is to treat every change as a little program in its own right. X isn't just "inserting the lines corresponding to the new method", it's "insert the lines corresponding to the new method, substituting the current name of the class where appropriate" and Y isn't just "change this list of instances of Foo to Bar" it's "change Foo to Bar everywhere it appears in the context of a class name in namespace Baz". In other words, have the changes themselves actually have some intelligence about what they are supposed to be doing. Then these programs could just be "run" (in any other) and produce the correct result.

Such changes would be more than just lists of "insert this text in this position in this file" and "delete this text from this position in file", and could therefore not easily be generated from diffing two files.

However, such changes could be generated by a text editor which has some understanding of the format of the files it is used to edit. For example, if such an editor had a "rename class" command, it could generate a change saying exactly that.

Diff algorithm

Wednesday, May 21st, 2008

Determining where the differences are between two files is a very important application in many areas of computing. I use it all the time when programming (to get an overview of my most recent changes). It's also used in file compression (so that only changed parts of big files need to be stored/transmitted) and in genetics.

The usual algorithm for diff is to find the longest common subsequence (LCS) between the two files, or equivalently, the edit script (the set of insertions and deletions you need to transform one file into the other).

However, this may not be the best algorithm for looking at differences between pieces of source code. People who edit source code don't just insert and remove text, they also move things around. This isn't handled so well by the LCS algorithm. Refinements are possible (like trying to match up adds with corresponding deletes to turn them into moves) but I think better results might be obtained by using an algorithm designed from scratch to handle moves.

I have an idea for such an algorithm. First, find the longest common substring (LCSS) between the two files (unlike a subsequence, the characters in a substring must be consecutive in the original string). Then, remove this string from both of the files, and repeat until either no characters are left or the substring is sufficiently short that it might just as well be an add and a remove rather than a move.

This can be done fairly efficiently (the LCSS can be found in O(N) time with a sufficiently sophisticated algorithm) and has a number of advantages over the standard method:

  • The algorithm can be "tuned" for the particular type of files it's working on. For example, you might tell it to prefer edits which affect fewer syntactical units in the language you're working in.
  • If your edit script is too noisy, you can increase the minimum length of found substring.

The day after I originally wrote this, Bram Cohen (of BitTorrent fame) wrote this. I haven't had a chance to play with his algorithm yet (or indeed mine), but I'd be interested to see how they compare (pun intended).