Archive for May, 2007

Wrong number

Sunday, May 20th, 2007

Phone rings.

Me: Hello?
Random woman at the other end: It's raining and it's cold so come home right now.
Me: I'm sorry, I think you must have the wrong number.
RW: Just come home right now.
Me: I... am home.
RW: Ok. Bye.

Tainted files

Monday, May 14th, 2007

A lot of work has gone into making Windows in general and Internet Explorer specifically more secure over the past few years. One of my favorite Windows security features is that when you download a file with IE and save it to disk, it is marked with some extra metadata saying "hey, this file is not to be trusted". Then, when you later forget where the file came from and try to run it, a big scary warning pops up saying "you might not want to run this unless you're sure it's safe". Unlike some security features, it doesn't prevent you from doing anything - it's just a perfectly sensible tainting system.
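
(For the curious: as far as I know the mark is just an NTFS alternate data stream called Zone.Identifier attached to the downloaded file, and you can read or write it with ordinary file APIs. A minimal sketch that applies the mark by hand - Windows/NTFS only, and the file name is made up:)

    #include <fstream>

    int main()
    {
        // Attach the "Internet zone" taint mark to a hypothetical downloaded file.
        // The Zone.Identifier stream is where IE's metadata lives; ZoneId=3 means
        // "came from the Internet".
        std::ofstream mark("downloaded.exe:Zone.Identifier");
        mark << "[ZoneTransfer]\r\nZoneId=3\r\n";
        return 0;
    }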

It's a shame that this technique hasn't been generalized to files that come from other untrusted sources like CDs, floppy disks, USB keys, etc. A lot of attacks could be mitigated that way.

Cases

Sunday, May 13th, 2007

C++ is hard to parse. One reason for this is that it's difficult to tell what is supposed to be a variable and what is supposed to be the name of a user-defined type. This means that information from later stages of parsing has to be fed back into earlier stages, making the whole thing a big tangled mess.
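
(The classic example is a line like "a * b;", which means completely different things depending on what "a" turns out to be. A tiny, contrived illustration with made-up names - both functions compile, and only the comments differ:)

    // The same token sequence is a declaration in one function and an expression
    // in the other; the parser can't tell which without knowing what 'a' names.
    void asDeclaration()
    {
        struct a { };       // here 'a' is a type...
        a * b;              // ...so this declares 'b' as a pointer to 'a'
        (void)b;
    }

    void asExpression()
    {
        int a = 6, b = 7;   // here 'a' is a variable...
        a * b;              // ...so this multiplies 'a' by 'b' and discards the result
    }

    int main() { asDeclaration(); asExpression(); return 0; }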

It's not just compilers that have problems either - the problematic constructs can also be difficult for humans to understand (especially if taken out of context).

I think C++ would be much easier to parse (for both humans and machines) if a very simple rule were added: if a name starts with a capital letter, it is the name of a type, and if it starts with a lower-case letter it is the name of a variable. Sort of a 1-bit Hungarian notation (except that it has meaning to the compiler).

This is already a de facto standard in many programs (often by virtue of being a consequence of an even stricter scheme). Though of course if it were added to C++ now it would break many many programs...

A nice side effect of this is that the syntax for types could be made much more regular, whilst still retaining the "definition mirrors use" paradigm. Given that we found a capital letter at the start of an identifier, we know we are parsing a type name. So if we come across an open parenthesis ("(") we can parse that as "a function returning the type to the left of the parenthesis". So for example we can name the type "array of N pointers to functions returning pointers to functions returning pointers to characters" just by reading it backwards: Char*()*()*[N]. This is much easier than the normal C++ declaration "char*(*(*a[N])())()", which is very hard to understand and which also buries the name of the variable you are declaring in the middle of the type name rather than putting it on the right where it belongs. SPECS solves the same problem but at the cost of a less familiar syntax and losing "definition mirrors use".
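
(To show just how buried that declaration is, here is the same type unpacked with typedefs - the names are invented and N is picked arbitrarily; the proposed Char*()*()*[N] spelling says the same thing in one readable line:)

    const int N = 4;

    // The declaration quoted above: 'a' is an array of N pointers to functions
    // returning pointers to functions returning pointers to char.
    char *(*(*a[N])())();

    // The same type built up with typedefs, innermost part first.
    typedef char *PtrToChar;            // pointer to char
    typedef PtrToChar (*InnerFn)();     // pointer to function returning pointer to char
    typedef InnerFn (*OuterFn)();       // pointer to function returning the above
    OuterFn b[N];                       // array of N of those - the same type as 'a'

    int main()
    {
        b[0] = a[0];    // assigning compiles, proving the two declarations agree
        return 0;
    }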

The true meaning of Reenigne

Saturday, May 12th, 2007

The time has come for me to tell the story of how I came to own the domain reenigne.org, and hence the reason why this blog is called what it is called.

The story I tell on the site is that I reverse-engineered Digger, hence "reverse engineer" or "reenigne". That's only part of the story though. My experience doing this manual reverse-engineering project (and a couple of others) convinced me that building a tool to aid in decompilation would be an interesting project. I decided to call this tool Reenigne and bought the domain name for the project. I set up my personal website there as a temporary measure but when I got a job with Microsoft, Reenigne (the program) got put on the back burner.

Decompilation (or, more generally, analysis of programs) is an interesting problem in general. An important result in theoretical computer science (the undecidability of the halting problem) says that it's impossible to write a program that even determines if a given input program halts or not, let alone analyses it to the extent that Reenigne would have to in order to be useful.

But that theorem only applies to infinite computers. If you know in advance the maximum memory that the program can use, you can simulate a machine running the program and detect if it gets into a looping state. That's not tremendously useful in practice, since keeping track of all the states the machine could be in could require many orders of magnitude more data than today's computers have, even for the simplest programs. Does it matter for real-world programs? Well maybe not for early 80s games...
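
(To make that concrete, here is a toy version of the argument, assuming the machine's entire state fits in 16 bits. The step and halting functions are invented stand-ins; a real analyser would be stepping a CPU emulator with an astronomically larger state space, which is exactly the problem:)

    #include <cstdint>
    #include <iostream>
    #include <unordered_set>

    // One step of a toy deterministic "program" whose whole state is 16 bits.
    uint16_t step(uint16_t state) { return static_cast<uint16_t>(state * 3 + 7); }

    // The toy program's halting condition.
    bool halted(uint16_t state) { return state == 0; }

    int main()
    {
        uint16_t state = 1;                  // initial state
        std::unordered_set<uint16_t> seen;   // every state visited so far
        while (!halted(state)) {
            if (!seen.insert(state).second) {    // a state repeated: provably loops
                std::cout << "loops forever\n";
                return 0;
            }
            state = step(state);
        }
        std::cout << "halts\n";
        return 0;
    }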

Another advantage we have over the Turing machines in the halting problem is an oracle - a human operator. If Reenigne gets stuck analyzing a particular program it can ask for human help. A human may be able to spot paths of execution or quickly prove things that the computer working alone cannot do. Together (the idea goes) Reenigne and a human operator form an unbeatable team that can reverse-engineer any real world program to source code quickly and (relatively) easily (assuming some knowledge of what the program does and how it might work). The human inputs to Reenigne can in principle be recorded to form a script which can be played back to enable anyone who has a copy of the same binary program to obtain the source code. The source code itself could not be distributed (it's copyrighted) but as far as I can tell the Reenigne scripts could be arbitrarily distributed - they're not a derivative work of the binary program, they are more like a commentary on it (I even thought about having the program save these scripts with the extension ".commentary" to hammer this point home extra hard).

Assuming that .commentary files for any piece of software are easily available, you can obtain source code for any binary you have - effectively making any program open in the same way that GPL software is (although with the minor restriction that you cannot legally redistribute the program itself). As I'm sure you can imagine, such a state of affairs would scare the bejeezus out of Microsoft. That's not the reason I haven't written it though - it's just lack of time.

The game doesn't end there though - for any possible Reenigne it's possible to write a program in such a way as to maximize the difficulty of reverse-engineering it with that particular Reenigne (by twisting up the logic to make it hard for a human to see what's going on, and by causing the program to have to do things like prove Fermat's Last Theorem just to find all the code in amongst random distracting junk). This would make programs slower and could never eliminate the threat of reverse-engineering completely (as I said, it's always possible in theory to simulate the program on a bigger computer, so any roadblocks would just pose engineering problems, not theoretical problems). But given the lengths to which some software misguidedly goes to keep its secrets from the owner of a computer whilst spilling those same secrets wide open to the CPU, I have no doubt that if a useful Reenigne were to become available, all these tricks would be tried.

We would then enter a period of "arms racing" between the Reenigne-creators (which, I am assuming, would be more than just me by then) and those trying to keep their algorithms and content decryption keys secret. They add a Reenigne-defeating trick, then a new version of Reenigne appears which understands and works around that trick, and so on.

Given the propensity of the US government to pass laws like the DMCA (aka the "Snake Oil protection act") and arrest computer programmers for doing their jobs, I suspect it would be unwise for me to pursue Reenigne seriously whilst living in the US (at least while the political climate is still so pro-"intellectual property").

Compiler compilers

Friday, May 11th, 2007

I think it would be cool if a language had a parser generator built in, so that you could write BNF grammars right in the source code and the compiler would be able to turn these into parsing functions (this would be particularly easy if the grammars were interpreted as Parsing Expression Grammars). What would be even better is if you could write text in the languages these grammars describe and have the compiler compile it right into the program alongside the code generated from the language itself.
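
(As a rough sketch of what the compiler might generate from such an inline grammar, here is a hand-written recursive-descent parser for a two-rule PEG; the grammar and names are invented purely for illustration:)

    #include <cctype>
    #include <iostream>
    #include <string>

    // Hand-written equivalent of what could be generated from the PEG
    //     Expr <- Term ('+' Term)*
    //     Term <- Digit+
    struct Parser {
        const std::string& text;
        size_t pos;

        bool digit(int& value) {
            if (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
                value = text[pos++] - '0';
                return true;
            }
            return false;
        }
        bool term(int& value) {               // Term <- Digit+
            int d;
            if (!digit(d)) return false;
            value = d;
            while (digit(d)) value = value * 10 + d;
            return true;
        }
        bool expr(int& value) {               // Expr <- Term ('+' Term)*
            if (!term(value)) return false;
            while (pos < text.size() && text[pos] == '+') {
                ++pos;
                int rhs;
                if (!term(rhs)) return false;
                value += rhs;
            }
            return true;
        }
    };

    int main()
    {
        std::string input = "12+34+5";
        Parser p{input, 0};
        int value;
        if (p.expr(value) && p.pos == input.size())
            std::cout << input << " = " << value << "\n";   // prints 12+34+5 = 51
        return 0;
    }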

For example, if you were writing an assembler, you might define a little language for specifying mnemonics and how they (and their arguments) translate into opcodes. Then you would add text in this language describing the architecture that your assembler will target. Rather than this parser running every time you ran your assembler, the output of the parser could be embedded right into the program, so the result is highly efficient.
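
(In other words, the mnemonic-definition language would ultimately compile down to something like a table baked into the assembler binary. The entries below are real one-byte x86 encodings, but the table is just a toy stand-in for what such generated data might look like:)

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Imagine this table being generated at build time from the architecture
    // description, rather than parsed from it on every run of the assembler.
    const std::unordered_map<std::string, uint8_t> opcodes = {
        {"nop", 0x90},
        {"ret", 0xC3},
        {"hlt", 0xF4},
    };

    int main()
    {
        for (const std::string& mnemonic : {"nop", "ret", "hlt"})
            std::cout << mnemonic << " -> 0x" << std::hex
                      << static_cast<int>(opcodes.at(mnemonic)) << "\n";
        return 0;
    }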

An efficient code generator that could be called at runtime would also be useful. This would make applications like formula-driven fractal generators, graphing programs, electrical circuit simulators and numerical analysis packages fast and easy to write. Combined with the parser generators, it would also make it very simple to write a compiler for this hypothetical language - just define the syntax and the semantics, plug in the code generator and there's your compiler. It would be almost like having a compiled homoiconic language, except with a proper built-in syntax (unlike Lisp). Unfortunately it would compete with itself so is unlikely ever to be written as a commercial compiler.
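
(For a flavour of what the runtime code generator would be doing under the hood, here is a minimal sketch - Linux/x86-64 only, purely illustrative - that writes a few instruction bytes into an executable buffer and calls them as a function. The hypothetical language feature would generate the bytes from your formula instead of having them hard-coded:)

    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <sys/mman.h>

    int main()
    {
        // mov eax, edi ; add eax, eax ; ret   -- returns its argument doubled.
        const uint8_t code[] = {0x89, 0xF8, 0x01, 0xC0, 0xC3};

        void* buffer = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buffer == MAP_FAILED) return 1;
        std::memcpy(buffer, code, sizeof(code));

        auto doubler = reinterpret_cast<int (*)(int)>(buffer);
        std::cout << doubler(21) << "\n";   // prints 42

        munmap(buffer, 4096);
        return 0;
    }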

High density mouse pointer

Wednesday, May 9th, 2007

Many techniques have been invented for making the mouse pointer more visible. One is a little animation of circles radiating out from it when you press a "locate" button, another is just making it bigger. Yet another is adding "mouse trails" showing the position the pointer was at over the last few frames (though this has the disconcerting effect of making your pointer appear to be a snake). One which I think has been inadequately explored is making it "high density". Normally when you move the mouse, the operating system erases the pointer from the old location and plots it at the new location (doing nothing in between), so if you move the pointer around quickly in a circle the pointer seems to appear at discrete locations.

I think it would be better if, in any given frame, the operating system plotted the pointer at the position it was at in the last frame, the position it is at in the current frame and everywhere in between. This would give the impression of a "streak" as you moved the pointer, or a solid circle if you moved it in a rapid circle, as if the pointer is plotted much more often than once per frame - more "densely" in other words. It would be kind of like mouse trails done properly.
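
(A minimal sketch of the interpolation involved - drawPointer here is a made-up stand-in for whatever routine the OS uses to plot the cursor image; it just prints coordinates so the example runs:)

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>

    struct Point { int x, y; };

    // Stand-in for the OS cursor-drawing routine.
    void drawPointer(Point p)
    {
        std::cout << "(" << p.x << "," << p.y << ") ";
    }

    // Draw the pointer at every pixel along the segment from last frame's
    // position to this frame's position, instead of only at the endpoint.
    void drawDensePointer(Point previous, Point current)
    {
        int dx = current.x - previous.x;
        int dy = current.y - previous.y;
        int steps = std::max({std::abs(dx), std::abs(dy), 1});  // one draw per pixel of travel
        for (int i = 0; i <= steps; ++i)
            drawPointer({previous.x + dx * i / steps, previous.y + dy * i / steps});
    }

    int main()
    {
        drawDensePointer({0, 0}, {10, 4});   // a streak instead of two isolated pointers
        std::cout << "\n";
        return 0;
    }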

Recursive subdivision

Tuesday, May 8th, 2007

Recently I was experimenting with drawing some Mandelbrot sets. Rather than scanning across the window pixel by pixel and doing one escape-time calculation per pixel, I generated a sequence of complex numbers within the window by repeatedly adding a complex number with irrational real and imaginary parts (the tangent of the argument of which was also irrational) and wrapping the real and imaginary parts into the window. This scatters points across the window evenly in such a way that no point is ever hit twice (within the bounds of precision) and the set of sample points gradually becomes more dense. Thus the image sort of "rezzes in" and is gradually refined. Anti-aliasing is sort of built in - as the distance between samples gradually decreases, the effective resolution gradually increases. The program never completely stops, but you can tell when it's done because the picture looks perfectly anti-aliased and stops changing. I didn't build this in, but the generated points could potentially be redrawn in a larger window to do a sort of (limited) real-time zoom (a la XaoS but hopefully with less artifacting - though it might take a lot of memory to store all that data).
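
(The point generator is only a few lines. Something like this - the step constants here are just illustrative irrationals, not necessarily the ones I used:)

    #include <cmath>
    #include <iostream>

    int main()
    {
        const double stepX = std::sqrt(2.0) - 1.0;   // irrational step, real part
        const double stepY = std::sqrt(3.0) - 1.0;   // irrational step, imaginary part

        double x = 0.0, y = 0.0;
        for (int i = 0; i < 10; ++i) {
            // Scaled into the window, (x, y) is the next complex number to iterate.
            std::cout << x << ", " << y << "\n";
            x += stepX; if (x >= 1.0) x -= 1.0;      // wrap back into [0, 1)
            y += stepY; if (y >= 1.0) y -= 1.0;
        }
        return 0;
    }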

This is kind of neat but I think we can do much better:

  • The program doesn't take advantage of the natural structure of the Mandelbrot set (large "lakes" of high iteration count, which have detail only at the edges)
  • The iteration limit is fixed (it would look wrong if some pixels had different iteration limits) - it would be nice if the iteration limit could be gradually increased (for points near the boundary) at the same time as resolution was gradually increased.
  • There is no way of reusing the calculation for one point to help with a nearby point (Synchronous Orbit Iteration, which can really speed things up for very deep zooms)

I think all these things are possible with the right data structure. My idea is to form a sort of 2-dimensional linked list. The image is divided into rectangles. Each rectangle has iteration data (corresponding to its center) and 8 pointers to other rectangles, two on each edge. The two pointers may point to the same rectangle or to different rectangles. Suppose both right-pointers of rectangle A point to rectangle B. Then either the two left-pointers of rectangle B can point back to rectangle A, or only one of them does. Thus, each rectangle is bordered on each side by either two rectangles, one rectangle or half a rectangle. And by a process analogous to linked-list insertion, any rectangle can be subdivided horizontally (if it doesn't have half rectangles on either of the horizontal edges) and/or vertically (if it doesn't have half rectangles on either of the vertical edges).
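
(A hedged sketch of what such a rectangle node might look like - the names are invented, and the interesting bookkeeping of splitting a rectangle and re-pointing its neighbours is only hinted at in the comments:)

    #include <complex>

    struct Rect {
        std::complex<double> c;      // the point in the complex plane this rectangle samples
        std::complex<double> z;      // current orbit value (copied on split, giving the SOI-style reuse)
        int iterations = 0;          // iterations performed so far

        // Two neighbour pointers per edge. If the two pointers on an edge name
        // different rectangles, this one borders two smaller neighbours there.
        // If they name the same rectangle, that neighbour either points back with
        // both of its own pointers (same size) or with only one of them (this
        // rectangle borders half of a larger one on that edge).
        Rect* left[2]  = {nullptr, nullptr};
        Rect* right[2] = {nullptr, nullptr};
        Rect* up[2]    = {nullptr, nullptr};
        Rect* down[2]  = {nullptr, nullptr};
    };

    // Does this rectangle border half of a larger rectangle on its right edge?
    // While this is true, the split that would subdivide the right edge is
    // forbidden, since the larger neighbour would then need three pointers back.
    bool halfRectOnRight(const Rect* r)
    {
        const Rect* n = r->right[0];
        return n != nullptr && n == r->right[1] &&
               !(n->left[0] == r && n->left[1] == r);
    }

    int main() { return 0; }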

Thus at any point we can increase resolution (by splitting rectangles) or iteration count (by iterating on existing rectangles). If a rectangle is bounded by lake on all sides it is probably also full of lake and we can avoid subdividing or iterating, meaning that we do less work (or defer work) in the boring areas. We also have a natural way of doing SOI (copying the iteration data when we split a rectangle).

We can also do real-time zooming, by noticing that the relationship between rectangle dimension and pixel dimension need not be fixed, and (if necessary) throwing away or paging out rectangles that are outside the window.

There are some fiddly parts, like "how do you divide your priorities between iterating and subdividing" and "what do you do at the edges", which are the reason why I haven't explored this any further, but it's an interesting project.

Foundry

Monday, May 7th, 2007

It would be really awesome if there were a source control/versioning system that integrated with a text editor so that every change you made was persisted as a new version - you'd never even need to save, let alone check in, your changes. Infinite-level undo would always be available, even across power failures.

This version-control system would have to be multi-headed so that your changes didn't break the version of the product that other people were using or working on until they were ready and tested.

Furthermore, I think this version-control system would be extra awesome if (in the background, or on a remote machine) it rebuilt your program and ran tests on it every time you made a change, so that you were notified of bugs as soon as possible after they occurred.

Obviously there wouldn't be time to build after every keystroke, so you'd have to assign a priority to the builds - perhaps the "Save" command could be repurposed to give the current state higher priority, states you mark as "this should be good" would be higher priority still, and states marked as "this has been reviewed" higher still. If a bug does turn up, the system can go back and do a binary search on earlier versions to track down exactly when it was introduced.
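
(The binary search step is simple enough to sketch - testPasses here is a hypothetical callback that rebuilds the given state and runs the failing test:)

    #include <functional>
    #include <iostream>

    // Returns the first failing version, given that knownGood passes and
    // knownBad fails; needs only O(log n) builds and test runs.
    int findBreakingChange(int knownGood, int knownBad,
                           const std::function<bool(int)>& testPasses)
    {
        while (knownBad - knownGood > 1) {
            int mid = knownGood + (knownBad - knownGood) / 2;
            if (testPasses(mid))
                knownGood = mid;     // still good here: bug introduced later
            else
                knownBad = mid;      // already bad here: bug introduced earlier
        }
        return knownBad;
    }

    int main()
    {
        // Toy stand-in: pretend the bug appeared in version 73 of 100.
        auto testPasses = [](int version) { return version < 73; };
        std::cout << "First bad version: "
                  << findBreakingChange(0, 100, testPasses) << "\n";   // prints 73
        return 0;
    }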

There could be a database of all the changes, and another database holding all of the intermediates:

  • particular versions of particular files
  • built executable files
  • intermediate build files

This should allow incremental building and testing to happen pretty quickly (especially if the system knows which changes can affect which executables and which testcases), as long as "rebuild the world" files like compilers and global header files aren't modified.

The system could keep track of which files affect which build steps by watching which files are accessed during a build and keeping this information in the "intermediates" database too.

To eliminate false-positive and false-negative test results due to a dirty machine state, a refinement of the system could run all the tests on a virtual machine, the state of which is reset after each test. This could be done by storing diffs from one virtual machine image to another in the intermediates database. The system could then either build VM images from these diffs (and store them as intermediates) before the tests start or have the VM software choose the right bits dynamically as particular virtual disk sectors are requested (whichever is faster - probably some combination of the two techniques).

No competition

Saturday, May 5th, 2007

I've never really liked competing with others. As a child I would often refuse to play party games even at my own birthday party - I preferred to just sit out and watch instead. I suspect that this is because I actually have a fiercely competitive nature, and don't like the feelings that this nature inspires in me (the feeling that I must be better than others, regardless of their feelings, lest I be marked the "loser").

For the same reason I've never really liked sports (playing or watching), and I suspect I would do better at work if, on being told that I will be ranked against my peers, my natural inclination were not to think "well, I just won't play that game then - I'll just sit it out and do my own thing".

During one electronics class at school, I was helping one of the other students understand something that he was having trouble with. Another student advised me that I should not help him because we were to be graded on a curve, and helping one student implicitly hurts all the others. I don't want to live in a world where nobody helps anybody else - life isn't a zero sum game.

I think competition is overrated as a motivator for human accomplishment anyway. The great works of art of the world weren't created to prove their creators superior to all the other artists, and I think most of the accomplishments in academia happen in spite of the great competition for funding (and "publish or perish") rather than because of it.

I also suspect that the software industry could have achieved much more were it not for the duplication of effort caused by having competing companies solving essentially the same problems (particularly because of the exponential increase in complexity caused by having to interoperate). Different ideas should be allowed to compete on their own merits rather than on the merits of the companies that sell them.

Having a free market with competition to provide the best prices/best customer service/least environmental harm seems like a good idea in theory but from the individual customer's point of view, their only power is to take their business elsewhere. So my choice is between the supermarket that's close, the one that treats their employees well, the one that's cheap or the one that has the chocolate muffins I like. And (other than writing a letter that is likely to be ignored) I don't really have a way to tell the other two supermarkets why I'm not choosing them. The system only works on the aggregate level - individual consumers with requirements different from the profitable herd are basically screwed.

What's the answer? I'm not sure. Clearly some forms of competition are necessary (since communism didn't work out so well) and some people do great things precisely because of competitive pressures. But I think I would like to see a shift in policy towards rewarding cooperation and "absolute performance" and away from rewarding "performance relative to competition". Unfortunately that's rather more difficult to set up than a free market - in some disciplines (like computer programming) absolute performance is extremely difficult to measure (almost any metric you choose can be gamed). Also, if different factors become important (for example if we as a species suddenly decide that environmentalism is important) we all have to agree to change the metrics to take this into account, whereas in a free market we just have to have enough consumers decide "environmentalism is important to me - I will choose the environmentally friendly product even though it is more expensive".

Interesting blog

Friday, May 4th, 2007

Scott Aaronson's lecture series Quantum Computing since Democritus is brilliant - fascinating and hilarious. If I had been taught by this guy at university I might still be in academia. A quote:
Oracles were apparently first studied by Turing, in his 1938 PhD thesis. Obviously, anyone who could write a whole thesis about these fictitious entities would have to be an extremely pure theorist, someone who wouldn't be caught dead doing anything relevant. This was certainly true in Turing's case -- indeed he spent the years after his PhD, from 1939 to 1943, studying certain abstruse symmetry transformations on a 26-letter alphabet.
The sheer range of material is just stunning too, and there are lots of in-jokes for those with a passing familiarity with the subjects covered. Plus really smart people like Greg Kuperberg and Bram Cohen read his blog. (And he uses the same hosting provider as me.)