Archive for August, 2008

I never believed in Jesus

Thursday, August 21st, 2008

I used to think of myself as a Christian, back in primary school and the early part of secondary school. I was brought up in the Church of England. My first primary school was a C of E school, meaning that we had prayers in morning assembly and (once or twice a week) the vicar would come and tell us bible stories. For a while I went to Sunday school every week. I found this very boring and I'm sure I always used to say "It's Sunday - why do I have to go to school?".

In secondary school one day they handed out little red Gideon bibles to everyone in my year. I recall watching with horror as some of my classmates mutilated their copies. I read mine every day for probably the best part of a year. I don't recommend it - the writing style is really rather awful (I eventually gave up because it was so unreadable rather than through lack of faith).

Even amidst all this religious upbringing, I don't think I ever believed that the stories about Jesus, or any of the other bible stories, were more than stories. They were just too ridiculous. In fact, I don't think it even occurred to me that there was anyone who thought of them as anything other than parables - certainly none of the religious people I knew ever gave me the impression they did. It wasn't until I got to university that it sunk in that there were people who actually believed in the literal truth of the bible! This may seem strange to my American readers, as this point of view is quite common here, but I suspect it is the exception rather than the rule in England.

The trouble with editors

Wednesday, August 20th, 2008

Since changing jobs I now mostly work with Unix machines rather than Windows ones. Mostly it's a not too unpleasant experience (once things are set up the way I like them) - Unix seems to be much more flexible and configurable than Windows, even if some things don't work quite so well out of the box.

One area I have run into difficulties with is editing text files. When I first started using DOS I used to use edlin, which is horrible. After a while I discovered QEdit on a magazine cover disk and loved it - very quick to load and scroll (even on an 8MHz 8086 machine), easy to use but with powerful macro facilities there when you need them.

I used that up until 2002, when I switched from Windows 98 to Windows XP and QEdit stopped working so well. I bought TSE (the updated, renamed QEdit) and it was worth every penny - everything that was great about QEdit was the same or better in TSE and I've used it almost every day since for almost every text editing task (though I have been known to use the Visual Studio editor occasionally for its Intellisense goodness). Moving from QEdit to TSE I discovered that the Windows shortcuts for select, copy and paste (shift+navigation to select, Ctrl-X to cut, Ctrl-C to copy and Ctrl-V to paste) were much quicker and easier than the QEdit defaults (Ctrl-K B, Ctrl-K K and so on, which I think originated in WordStar).

When I switched to Linux, TSE was one of the things I missed most. It works under Wine but not particularly well (it always defaults to full screen, copy/paste integration doesn't work and scrolling is painfully slow). A Linux version of TSE is in the works but doesn't show any sign of materializing soon.

Joe is the closest thing to TSE that Linux has to offer but it's a bit limited and just different enough to be annoying. It also isn't available on all of the machines that I need to use. The latter requirement basically means I have to make the big choice - vi or Emacs?

I gravitated towards vi initially as it seemed to be a bit more lightweight, and I had some experience with it from my brief stint using Debian sometime around 1999-2000. I relearnt how to do the basics, put it into syntax highlighting mode and fixed the colours (dark blue on black is unreadable to me, and of course comments have to be in green). But it still drives me crazy - I never know which mode I'm in, so I'm always inserting ":q" into files or (worse) typing text which gets interpreted as random commands and then trying to figure out what I did and how to undo it.

I eventually found out that most people who develop GNU software use Emacs (I saw a lot of people using it at the GCC Summit) and that it has great features for writing C code according to the GNU guidelines. So I decided I had better learn Emacs. I had used it a bit at university (the Physics department gave us a brief tutorial in it as part of the Fortran course). I have also learnt a bit of Lisp in the intervening years and having Lisp as the macro language of an editor seemed to be a rather reasonable thing to do.

One big annoyance with all the Unix editors is that none of them support the Windows-style shortcuts I mentioned above (plus Ctrl-Z, Ctrl-Y, Ctrl-A and Ctrl-S for undo, redo, select-all and save) by default. I almost squealed with delight when I discovered the CUA mode in Emacs 22, which brings it significantly closer to being the perfect editor (at least once I figured out how to tweak various terminal settings to actually make it work - this helped). Unfortunately many of the machines at work still use Emacs 21, so there was some more fiddling required to install the CUA package.

Then I need to figure out if I can get the scrolling behaviour to work like QEdit's, fix up some behaviours related to tabs and (once again) fix the syntax highlighting colours. "Eventually I'll have everything the way I like it" may be the mantra of a great many Linux users...

Part of the reason this is so difficult is that customizing editors is generally considered an advanced thing to do - you really have to learn the basics first. But if the first thing you want to do with a new editor is make it act like your old one, you first have to learn the default way of doing things, then do the customization, then unlearn the default way of doing things. My natural inclination is to take shortcuts - try to figure out the advanced stuff without really getting the hang of the basics first. This does seem to have a tendency to complicate things.

Square aeroplanes

Tuesday, August 19th, 2008

Almost all aeroplanes have round cross-sections. This doesn't seem to be the best shape for optimal use of the interior space. In particular, it means that there isn't much headroom above the window seats. On my most recent flights I found that I could not even stretch my arms above my head.

I don't think the reason for these round cross-sections is an aerodynamic one - if I recall my fluid dynamics correctly, drag is a function of turbulence, which is caused by chaotic vortices being shed from the trailing edge. I don't think that would be affected by a square cross section.

I suspect that the real reason is that with our current materials, it's easier to make planes strong enough if they're round - a square cross section would have weak points at the corner edges and might lead to parallelogramming. I wonder if future advances in materials science might be able to counteract this.

Airport integration

Monday, August 18th, 2008

One of the most annoying things about flying is that airports are generally a long way out of town. This means that you have to leave extra early in case of bad traffic on the way to the airport, and it means that upon arriving you still have to get a taxi/bus/shuttle to the place you actually want to go to.

This seems to be an inevitable consequence of aeroplane technology - it's very loud and smelly and requires a lot of space for take off and landing. But if cities and aeroplanes were designed from scratch with each other in mind, one could imagine that this could be different. While take-offs and landings would still have to happen a way out of town, perhaps boarding and deplaning (what a silly word - what's wrong with "disembarking"?) could happen in the middle of the city and then the plane could taxi all the way to the runway. Perhaps pieces of the plane essential for flying but not for passenger comfort (like wings and tails) could be made detachable - they would be attached right before takeoff and detached right after landing. The rest of the plane could then fit into a tunnel for taxiing into the city. Sounds a bit dangerous, perhaps, but I'm sure there's a way it could be done without compromising safety.

Of course, because this would require massive infrastructure investments simultaneously all over the world, it's unlikely to happen for a very long time (if ever) - probably not before our current idea of air travel is superseded by something altogether cheaper, faster, more comfortable and more environmentally friendly anyway. So I guess it will remain just another of my more crazy ideas.

String storage

Sunday, August 17th, 2008

Most applications store strings as a consecutive sequence of characters. Sometimes when the string is copied the characters are copied too. Sometimes strings are reference counted or garbage collected to minimize this copying, but copies are still made when concatenating and performing other "string building" operations (otherwise the characters would no longer be consecutive in memory).

An alternative that might work better (especially for something like a compiler) would be to do the concatenation lazily. Actual character data comes from just a few places (the input files which are kept in memory in their entirety, static character data, and program argument data). There are two subtypes of string - one consists of a pointer to the first character and an integer recording the number of characters in the string. The other subtype consists of a vector of strings which are to be concatenated together. Integers (and maybe also formatting information) could be kept in other subtypes. The resulting tree-like data structure has a lot in common with the one I described in Lispy composable compiler.
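To make this concrete, here is a minimal sketch of the two subtypes in Python. The names Slice and Concat and the details of the interface are just illustrative, not a serious implementation:

class Slice:
    """String subtype: a view into an existing character buffer - no copying."""
    def __init__(self, buffer, start, length):
        self.buffer = buffer    # the underlying source text, kept whole in memory
        self.start = start
        self.length = length
    def __len__(self):
        return self.length
    def chunks(self):
        yield self.buffer[self.start:self.start + self.length]

class Concat:
    """String subtype: a vector of strings to be concatenated lazily."""
    def __init__(self, parts):
        self.parts = list(parts)   # no character data is copied here
        self.length = sum(len(p) for p in self.parts)
    def __len__(self):
        return self.length
    def chunks(self):
        for part in self.parts:
            yield from part.chunks()

def render(s):
    """Force a lazy string into an ordinary flat string (one final copy)."""
    return "".join(s.chunks())

# Example: building a string out of pieces of an input file without
# copying any characters until the final render.
source = "namespace foo { int bar; }"
name = Concat([Slice(source, 10, 3), Slice(source, 20, 3)])   # "foo" + "bar"
assert render(name) == "foobar"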

I'm not sure if this actually saves much (if anything) in terms of memory space or speed over the usual methods (I suppose it depends on how long the average basic string chunk is), but it does have at least one potential advantage - vectors (especially if they grow by doubling) will have many fewer possible lengths than strings, so memory fragmentation may be reduced. I think it's also kind of neat (especially if you have such data structures lying around anyway).

Startups

Saturday, August 16th, 2008

I've been reading a lot about startups lately, specifically in the essays of Paul Graham. There is a part of me that regrets never having tried to start my own startup - unfortunately the bursting of the bubble and the intricacies of US immigration policy meant that there was never really a practical time. While being filthy rich does have a certain appeal, the more compelling reason is the one at the end of this essay - getting the business of earning enough money for life over with as quickly as possible, so that I can work on whatever I want to. The summer/autumn before I started at Microsoft, when I had several months of mostly uninterrupted time to myself to work on whatever I wanted, was one of the most intellectually satisfying (not to mention relaxing) periods of my life.

I have always thought of myself as having progressive politics but this is the best argument I have seen yet against taxing the rich. I guess one possible counterargument is that maybe taking very big risks isn't really necessary to develop new technologies (nor necessarily particularly beneficial to society as a whole) - the only reason that this has been true in the past is that developers and funders of new technologies have very little information about whether these new technologies are likely to be successful or not. With better information, perhaps we will be able to pick the winners before a lot of money has been spent, meaning that the big risks will be unnecessary.

One other thought: as startups become cheaper than ever to create and run, perhaps we will see more of them following the model of craigslist - staying small (in the financial sense) and private and concentrating on what users want, instead of aiming for explosive growth, trying to become as profitable as possible and then either going public or getting acquired. While the latter is more lucrative, I suspect the former might be more intellectually and spiritually satisfying.

Parsing expression grammar grammar

Friday, August 15th, 2008

I have found that a fun thing to do is make up grammars for computer languages - figure out what syntax rules work well and what is ambiguous (to both humans and computers - it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that - a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);


// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/"
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting comments), use 
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];


// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;

The rules are as follows:

Rule1 | Rule2 - prioritized alternative
Rule1 Rule2 - sequence
Rule* - Kleene star (zero or more)
Rule+ - Rule Rule* (one or more)
!Rule - does not match Rule
&Rule - matches Rule but is not consumed
(Rule) - order of operations
Rule1-Rule2 - matches Rule1 but not Rule2
Rule1/Rule2 - a sequence of strings matching Rule1 separated by strings matching Rule2 - left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2 - a sequence of strings matching Rule1 separated by strings matching Rule2 - right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2 - matches any single character between Char1 and Char2 inclusive

Having a single grammar for both Parser and Lexer is nice in some respects but does introduce some additional complications. Some strings (those I've called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I've called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). Similarly with productions - those created with ":=" optionally match _ at the end.

The root production has no name.

The "/" and "\" delimiters makes it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<" | ">" | "<=" | ">=")
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";

It always comes down to economics

Thursday, August 14th, 2008

What do Bruce Schneier, Mark Chu-Carroll and Paul Graham have in common? They're all bloggers (though Graham prefers to describe himself as an essayist) who seem to have gradually transitioned towards writing about economics, starting from their more specialized subjects (cryptography, mathematics and Lisp respectively).

I guess this is because economics is a very powerful subject - if you want to change the world you have to make it worth people's while to change it, which means you have to set up the right economic conditions. Conversely, if you know (or think you know) how the world is likely to change in the next few years, you can use economic principles to figure out what to place bets on (invest in).

Deleting from the database

Wednesday, August 13th, 2008

Continuing on from yesterday's post - a commenter asked what happens if you want to delete something from the database irrevocably. I was going to reply with a comment but my reply got kind of long and it's an important concept so I decided to make a post out of it.

I have thought about this a bit and I think there are two reasons why you would want to delete something irrevocably. One is to save disk space (if the thing you want to delete is very big and you're sure you won't need it again). Current storage capacities are sufficiently high that this is a much less common problem than it used to be (especially for programmers - I suppose astronomers and particle physicists tend to have larger chunks of data, though).

The other reason is if the data should never have been checked into the version control system in the first place - for example if you accidentally checked your tax return into your company's source control repository.

Either way, any reasonable VCS should be able to cope with such requirements. To prevent abuse, this would probably have to be done by the owner or administrator of the system storing the data. I know at Microsoft there was a process for this (I never had to use it myself, but I saw an embarrassing bug report which had had its edit history erased so I know it can be done).

This is one area where distributed version control has a bit of a disadvantage - you have to contact everyone who has downloaded a copy of the repository and persuade them to erase the unwanted data and pass on the message to everybody who might have downloaded it from them and so on. For a popular project, this might be as impossible as unpublishing something that's been published on the internet.

As for the technical details of obliterating from the accumulate/cache database in particular, it's easy to do (as long as you can remove data from the accumulate table) except for any changes which happen on top of the undesirable change. These either have to be reparented or deleted. If there have been a lot of changes on top of the unwanted one, reparenting might be very labour-intensive and deleting them might be unacceptable - particularly if those changes are to the very data you're trying to remove.
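To make that concrete, here's a rough Python sketch of an obliterate operation over the dict-based accumulate table I sketch in "Deconstructing the database" below. The field names (parent, key, value) are my own invention, and the "derived work" judgement obviously can't be automated:

def obliterate(accumulate, cache, bad_timestamp):
    """Erase one change from history, reparenting later changes built on it."""
    bad = accumulate.pop(bad_timestamp)   # needs the right to edit the table
    for delta in accumulate.values():
        if delta["parent"] == bad_timestamp:
            # Reparent children of the obliterated change onto its parent.
            # If a child's value is a derived work of the removed data, it
            # should be obliterated (recursively) instead of reparented.
            delta["parent"] = bad["parent"]
    cache.clear()   # crude but safe: the cache regenerates itself on demand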

Deconstructing the database

Tuesday, August 12th, 2008

This post elaborates on some database organization ideas that I mentioned in passing in Foundry.

The usual interface presented to a database is one based on "query/update" - the database contains some set of information which can be queried and updates can be made when this information needs to be changed. At any time there is a single "most up to date" state of the database consisting of the data after all "update" operations known so far have been performed.

The usefulness of this format cannot be denied but there is another possible way of organizing things - I call it "accumulate/cache" (I haven't read much in the way of database theory literature so this might already exist by some other name). In such a database, you (at some fundamental level) never change things, only add new things. "Update" operations consist of sending a "delta" to the database saying what needs to change, what it needs to change to and the timestamp of the change that is being used as the "parent" of this change. These deltas are then themselves timestamped and dumped onto the end of one big file (or table, if you're implementing this on top of a traditional relational database, though because of the limited nature of the operations on this table, writing special purpose code for this could probably improve efficiency).

To extract information from this database, you need a timestamp as well as the key of the piece of information you want to look up. You can look at the database as it was at any point in history just by supplying the appropriate timestamp. Thus you get what is essentially revision control for free (except for the fact that this kind of database probably uses much more space than the traditional sort - lack of disk space in previous generations may be why this sort of database isn't the way things are usually done).

Given a timestamp and a key, the algorithm for looking up a piece of information is quite simple: first look in the cache and see if that timestamp/key combination is there. If it is, you're done - just return that value. If it's not, look up that timestamp in the accumulation table. If the delta at that timestamp changes the value corresponding to your key, add the result of the query to the cache and you're done. Otherwise pick the parent timestamp of that change and repeat.
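Here is a minimal sketch of that lookup (together with the append-only "update" operation described above) in Python. The dict-based table, the integer clock and the record layout are simplifications of mine, not a serious implementation:

accumulate = {}   # timestamp -> {"parent": timestamp, "key": k, "value": v}
cache = {}        # (timestamp, key) -> value
clock = 0         # stands in for real timestamps

def update(parent, key, value):
    """Append a delta to the accumulation table; nothing is ever overwritten."""
    global clock
    clock += 1
    accumulate[clock] = {"parent": parent, "key": key, "value": value}
    return clock                        # the timestamp of the new change

def lookup(timestamp, key):
    """Value of key as of timestamp, walking the parent chain and caching."""
    start = timestamp
    while timestamp != 0:               # 0 is the empty database
        if (timestamp, key) in cache:
            value = cache[(timestamp, key)]
            break
        delta = accumulate[timestamp]
        if delta["key"] == key:
            value = delta["value"]
            break
        timestamp = delta["parent"]     # this change didn't touch our key
    else:
        raise KeyError(key)             # key never written in this history
    cache[(start, key)] = value         # short-circuit future walks
    return value

# Every version of the database remains queryable:
t1 = update(0, "x", 1)
t2 = update(t1, "x", 2)
assert lookup(t1, "x") == 1 and lookup(t2, "x") == 2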

The caching is necessary to avoid bad asymptotic behavior as the size of the database grows. I think (but haven't proved) that for a database of size N it should be possible to get back to O(log N) amortized performance with an O(N log N) sized cache. The amount of cache used can be tuned by throwing away the least recently accessed entries depending on the database's access patterns (patterns which look back at history a lot will probably require larger caches than patterns which only look at recent timestamps).

Only the accumulation table needs to be backed up as the cache table is automatically reconstructed as necessary when queries are made. The fact that the accumulation table is only ever appended to simplifies this.

The next piece of infrastructure needed on top of this is a "head" mechanism. The last timestamp in the accumulation table is the latest physical change made but this might not be the latest logical change. We would like the ability to have "atomic" operations made up of several changes. This can be done without any additional locks (so scales very well). The "head" value is just a timestamp filed under some well-known key. To get the most recent "head" just get the value of "head" corresponding to the most recent timestamp, and do all your other queries with the head value timestamp.

To perform an atomic update, you need to make your changes and then merge them with the latest head. This is trivial if the current head is the parent of your first change. Otherwise something else has written to the database in the meantime and a merge will need to be done, creating a new "parent of first change". Eventually (as long as the time between commits is long compared to the time it takes to do a commit) you'll get to the trivial case, and the only thing that needs to be done atomically is the trivial "compare parent to the value of head and swap with the new head if they're equal" (I think most CPUs can do this atomically with a single instruction, but in any case it is one of the simpler lock-free algorithms).
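Continuing the sketch above, the commit loop might look something like this. Here compare_and_swap stands in for the single atomic instruction, head is a plain variable rather than a value filed under a well-known key (to keep things short), and merge is whatever application-specific merge logic is needed:

head = 0   # timestamp of the latest logical state of the database

def compare_and_swap(expected, new):
    """Pretend-atomic: advance head only if nobody else has moved it."""
    global head
    if head == expected:
        head = new
        return True
    return False

def commit(make_changes, merge):
    """Apply make_changes(parent) -> tip timestamp atomically on top of head."""
    parent = head
    tip = make_changes(parent)          # appends deltas via update()
    while not compare_and_swap(parent, tip):
        # Someone else committed first: merge our changes onto the new head
        # and retry. As long as commits are quick compared to the gaps
        # between them, we reach the trivial case eventually.
        parent = head
        tip = merge(parent, tip)
    return tip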

Some databases already work like this behind the scenes to some extent (log replaying etc) but I think that making this part of the interface simplifies the atomicity logic considerably, not to mention the usefulness of being able to query versions of the database from any point in history.

One complication occurs if it becomes necessary to remove from the database some information which should never have been placed there (e.g. some copyrighted work). Since you want to obliterate history in this case, an update needs to be made on the accumulation table. The cache (or the portion of it from after that point) must then be regenerated. One also needs to decide if subsequent updates to the copyrighted value are "derived works" or not, and either obliterate or leave alone the updates to that value.

One limiting factor of this architecture is access times - even with the fastest current disks one can only traverse a chain at maybe 500 links per second. It will only be a few years before we're all using solid state flash disks, though, and that will dramatically speed up access times.

Edit 14th July 2013:

Since writing this I came across an interesting article and talk on the same subject.