Diff algorithm

Determining where the differences are between two files is a very important application in many areas of computing. I use it all the time when programming (to get an overview of my most recent changes). It's also used in file compression (so that only changed parts of big files need to be stored/transmitted) and in genetics.

The usual algorithm for diff is to find the longest common subsequence (LCS) between the two files, or equivalently, the edit script (the set of insertions and deletions you need to transform one file into the other).

However, this may not be the best algorithm for looking at differences between pieces of source code. People who edit source code don't just insert and remove text, they also move things around. This isn't handled so well by the LCS algorithm. Refinements are possible (like trying to match up adds with corresponding deletes to turn them into moves) but I think better results might be obtained by using an algorithm designed from scratch to handle moves.

I have an idea for such an algorithm. First, find the longest common substring (LCSS) between the two files (unlike a subsequence, the characters in a substring must be consecutive in the original string). Then, remove this string from both of the files, and repeat until either no characters are left or the substring is sufficiently short that it might just as well be an add and a remove rather than a move.

This can be done fairly efficiently (the LCSS can be found in O(N) time with a sufficiently sophisticated algorithm) and has a number of advantages over the standard method:

  • The algorithm can be "tuned" for the particular type of files it's working on. For example, you might tell it to prefer edits which affect fewer syntactical units in the language you're working in.
  • If your edit script is too noisy, you can increase the minimum length of found substring.

The day after I originally wrote this, Bram Cohen (of BitTorrent fame) wrote this. I haven't had a chance to play with his algorithm yet (or indeed mine), but I'd be interested to see how they compare (pun intended).

Leave a Reply