Archive for the ‘language’ Category

Lispy composable compiler

Friday, July 4th, 2008

When I finally write the compiler for my super-language, I want to make it nice and modular - i.e. make it very easy to compile different languages, target different architectures and include different optimizations.

In order to achieve this I will need to have some kind of intermediate form in which pieces of code in various states of compilation to be transmitted from one module to another.

This form should be as simple as possible but should also make it possible to express all the complexities that occur in real code. In short, it will need to be a tree structure.

While I'm not convinced that the Lisp idea of "code without syntax" is a great one (I think every available visual/syntactical clue makes it easier to understand unfamiliar code) I have to admit that the lisp data structures (SExps and Lists) are perfect for this role. I wouldn't expect people to be reading large quantities of this code but it's handy to be able to print it out for debugging purposes.

A Symbol (what lispers call a SExp, S-Expression or Symbolic Expression) is either an Integer, a String, an Atom, a List or a Vector. Integers are what you'd expect them to be. Strings are used to hold things like variable names. Atoms are displayed as strings but treated as unique, opaque integers internally. All you do with an Atom is compare it to another Atom and see if they are the same. See Stringy Identifiers for another perspective on these. This is not quite the same as the lisp usage of the word, where integers are also atoms.

A List (delimited with parentheses) is effectively an object. It consists one or more Symbols, the first of which must be an Atom. It has a type, which is represented by its Atom and the number and types of the following Symbols.

A Vector [delimited with square brackets] is effectively an array of length possibly not known until run-time. This can be used for a sequence of statements (for example) or for the bytes in the compiled program before it is output to disk. Lisp doesn't distinguish between lists and vectors but I think it is useful to keep them separate.

It is not a coincidence that this is quite similar to the RTL of GCC. Many of the same problems are being solved, and I was learning about RTL when I came up with this. However, this structure is a little more general-purpose and a little simpler. In particular it can also be used for front-ends.

These expressions will generally be manipulated by pattern matchers - that is, we search through the program for an object matching some characteristics, perform some transformation on it and then substitute the result for the original object.

For example, here is a part of an x86 assembler written in a hypothetical language which has support for these Symbols built in:

    with(x) {
        match (`jmp` d:Int):
            if (isSByte(d))
                becomes (`db` [0xeb d]);
            else
                becomes (`db` [0xe9 bytesFromDWord(d)]);
        ...
    }

This matches a List which consists of the `jmp` atom followed by an integer (which we will call d). We pass that integer to a function and (depending on the result) we emit one of two possible forms of the x86 JMP instruction. As "bytesFromDWord" is known to be a function and not an Symbol, it is evaluated and the result placed into the vector rather than two Symbols (bytesFromDWord and (d)) being created and placed into the vector.

This works just as well for the front end. The "unless" statement that I have written about before can be broken down using this code:

    match (`unless` condition): becomes (`if` (`!` condition));

And we should have this optimization as well, in case condition is itself a negated condition:

    match (`!` (`!` x)): becomes (x);

Quite a lot of the compiler can be written very easily and tersely this way. There are some complications to do with the order in which transformations are applied, though - we would rather avoid having to specify that explicitly and let the computer figure it out. But the order might depend in quite a complicated way on the context. One way to deal with this might be to keep the pre-transformed Symbol around and continue to match against it even after it has been replaced. Another might be for transformations to transform other transformations as they are applied to the Symbols. I haven't quite figured this bit out yet. This seems like one of those problems that might be technically intractable but always works out to be fast enough in practice.

With some cleverness, it ought to be possible to write a very good compiler this way. Great speed is possible as no string handling is done after the initial parsing phase and symbol table lookups (e.g. we don't write out assembler instructions and rely on a separate assembler to assemble them, though we could still easily obtain assembly code if necessary by having a separate output module.) Many speed improvements are possible (e.g. by performing some transformations when the compiler is compiled).

But the really great thing about this scheme is that code for the compiler is completely self-describing - it's exactly the code one would hope to be able to write.

Having a set format for code also enables the scenario I wrote about in Compiler compilers.

Finally, this structure works just as well for disassemblers and decompilers as it does for compilers and assemblers. Some of the algorithms will be different most of the transformations probably cannot be reversed directly but many of the same patterns show up.

Plug-ins should not be in the same address space as the main process

Friday, June 27th, 2008

After some thought I have come to the conclusion that a software architecture which allows third-party code to run in the same process as the main program is a bad idea.

The first problem with such an architecture is reliability - how do you protect yourself against bugs in the third-party code? The short answer is that you can't - the plugin has the ability to stomp all over your process's invariants. The best you can do is terminate the process when such a situation is detected.

The second problem is that a compiler can't reason about the third-party code (it might not even exist at compile time). This means that there are all kinds of checks and optimizations that cannot be performed (like some of the stack tricks I mentioned a while ago).

The third problem is that one cannot tell what code the plug-in will rely on - if it sticks to documented interfaces that's okay, but (either deliberately or accidentally) a plug-in might rely on undocumented behavior which causes the plug-in to break if the program is updated. This forces the program to have ugly kludges to keep old plug-ins working.

If plug-ins are run as separate processes communicating with the main program via well-defined IPC protocols, all these problems are solved. The down-side is that these protocols are likely to be a little more difficult to code against and (depending on the efficiency of the operating system's IPC implementation) may also be significantly slower. The speed problem seems unlikely to be insurmountable though - current OSes can play back HD video which involves the uncompressed video stream crossing the user/kernel boundary - few applications are likely to need IPC bandwidth greater than that.

Compile-time symbolic computations

Saturday, May 31st, 2008

A compiler ought to be able to perform compuations at compile time. Unlike the generated code, these computations don't have to be blazingly fast (since they only happen at compile time), and don't have to conform to any particular machine's architecture (since the language should be the same for different architectures anyway) so some nice things can be done.

Arbirary-precision integer arithmetic is a nice easy one and one that is done by many scripting languages already. This allows you to write expressions such as:

const int a = 1000000000000/1000;

and have a initialized to one billion even on 32-bit machines.

Rational arithmetic is also easy and useful. This allows one to write:

const int a = (1000/3)*3;

and have a initialized to exactly 1000 even on machines lacking any sort of floating point facility.

Then there are closed forms such as:

const int a = sqrt(1000)^2;  // a==1000, not 961

One could even make a compiler understand pi as a symbolic constant of type "Real" and evaluate it to the appropriate number of significant figures for the expression and type that is being initialized. So:

const int a = truncate_cast<int>(pi*10000);

would initialize a to 31416.

Once these things are in place, a compiler can perform quite sophisticated mathematical transformations to allow programmers to write what they really mean and still obtain optimal code. Even to the level of Maple/Mathematica if the compiler writers are sophisticated enough.

Compile time metalanguages

Friday, May 30th, 2008

The C programming language actually contains two different languages - C itself (the code for which runs at run-time) and the preprocessor language which runs at compile time and has completely different syntax. It's also not very powerful - often programmers resort to writing programs which generate C code as output instead.

It would be really nice to be able to make both these approaches unnecesary, and have a compile time language that uses the same syntax as the run-time language. A compiler ought to be able to detect parts of the program which run on constant data, run them during compilation and emit the output directly into the generated binary.

Replacing #ifdefs is easy - just use the normal "if" and ensure that your compiler will optimize away false clauses for constant expressions. GCC does this and indeed the GNU coding standards recommend this.

Replacing code that performs complex transformations on data is more complicated, but not intractably so as long as the compiler can reliably determine that all the functions used in these transformations are pure.

Lexical scoping and closures

Thursday, May 29th, 2008

A closure is a just a piece of code with some context - data on which it works. Such things are common in the functional programming community but were left out of the C language because they're actually quite difficult to implement properly.

The C++ language has sufficiently powerful constructs that one can create closures, but the syntax is very unwieldy. One must write:

class Totalizer
{
    int _total;
public:
    Totalizer() : _total(0) { }
    void operator(int x) { _total += x; }
    int total() const { return _total; }
};
 
int total(Vector<int> x)
{
    Totalizer t;
    for_each(x.begin(), x.end(), t);
    return t.total();
}

where one would really like to write something like:

int total(Vector<int> x)
{
    int t;
    for_each (x, lambda(int y) {t += y;} );
    return t;
}

Here the code is the "{t += y;}" bit and the context is the "t" variable, which is local to the "total" function.

C# 3.0 provides functionality like this.

What's really going on here is that an object is being created which encapulates some subset of the local variables (the "t" variable in this case) and which is passed to the for_each function (just like in the C++ version). Closures and objects are really just different syntaxes for the same thing.

If we want to return a closure from a function we can do so, but we must then allocate the object in such a way that the "t" variable will still be valid when the function exits. This is where things get really fiddly. This is one of those things that is easier in a language with garbage collection (you can just allocate the closure on the heap and the GC will take care of cleaning it up when it's finished with) but with care I think it should be possible to do it in a non-GC language as well.

Stringy identifiers

Wednesday, May 28th, 2008

In most computer languages, identifiers can only contain alphanumeric characters (plus, often, the underscore '_') and can't start with a number. There are very good reasons for this, but there are occasions when it would be desirable to be able to use arbitrary strings as identifiers. For example, suppose that we have a language that has a parser generator built in. Suppose also that we give it a parsing rule like this:

WhileStatement := ("while" | "until") "(" Expression ")" Statement

This is really just another way of writing a parsing class:

class WhileStatement { ... }

We would like the members of this class to be automatically generated by the parser generator in a predictable way. With string-identifiers, the generated class might look something like this:

  class WhileStatement
  {
    Enum { `"while"`, `"until"` } `"while" | "until"`;
    Expression expression;
    Statement statement;
  }

i.e. the name of the member variable which holds the piece of data about whether this is a "while" loop or an "until" loop is called `"while" | "until"`. The type of this member is

Enum { `"while"`, `"until"` }

which means that it can take only the values `"while"` and `"until"`.

This class can be used as follows:

  WhileStatement whileStatement = WhileStatement::Parse(file);
  if (whileStatement.`"while" | "until"` == `"while"`)
    ...

This reads very naturally, and avoids the need for anyone (human or compiler) to make up any variable names. It's good to avoid making the programmer make up names for things when not absolutely necessary.

Stack measuring

Wednesday, May 14th, 2008

We generally do what we can to make programs reliable, but there is one particular cause of software failure which seems to have been surprisingly under-attacked. Probably because it is fairly rare in real-world situations. That failure is stack overflow.

Many real-world programs have recursive algorithms where the recursion depth depends on user input and is unchecked, and where the activation records are allocated on the stack. Current computer systems do not recover well from stack overflow (you generally get an access violation which the program cannot recover from because it is in an unknown state). This means that it is possible to supply input which crashes the program.

The compiler for my ideal language will detect when a program uses a potentially unbound recursion and allocate the activation records for such functions on the heap instead of the stack. This means that calling such a function can potentially throw an out of memory exception, which is much easier to recover from than a stack overflow (since we still have a stack to work with, and because the state of the program is well-defined - as far as the calling function is concerned the called function threw an exception and as far as the called function is concerned it never got called.)

A nice side effect of this is that now every function with an empty set exception specification has a well defined maximum stack size, which means that when you create a thread you know exactly how much stack it needs. This means stack guard pages are no longer necessary (stack can be treated just like any other area of memory) and potentially means that threads that perform small tasks can be created with very little overhead, possibly making thread pools unnecessary.

A programmer can track the maximum stack size of every thread in their application. If a change causes the stack size to increase dramatically this may be due to calling a very high-level function from a very low-level function, which may indicate a bug. Similarly, adding recursion without realizing it also indicates a possible bug, so it would be nice if the compiler could also alert the programmer to that.

Unity language build system

Tuesday, May 13th, 2008

When I finally get around to building my ideal language, it will have a linker and build system built in. It seems very silly to write your code in one language, your makefile in another, your configure script in a third etc.

The compiler will be invoked with the name of one source file on the command line, and that source file includes all the other required source files by means of an "include" statement which works similarly to the "#include" of C and C++, but never includes any given actual file more than once.

The program can also specify the name of the output file to generate, which can be a binary (.exe or .dll for Windows, or equivalent on other platforms) or intermediate file to be used with a C or C++ linker (.obj or .lib, or equivalent). Similarly, object and library files can be included just as source files can.

On a first pass through the program, the compiler will enumerate the complete set of input files and check their timestamps. If no input files are more recent than the output file, compilation will stop early.

Cases

Sunday, May 13th, 2007

C++ is hard to parse. One reason for this is because it's difficult to tell what is supposed to be a variable and what is supposed to be the name of a user defined type. This means that information from later stages of parsing has to be fed back into earlier stages, making the whole thing a big tangled mess.

It's not just compilers that have problems either - the problematic constructs can also be difficult for humans to understand (especially if taken out of context).

I think C++ would be much easier to parse (for both humans and machines) if a very simple rule were added - if a name starts with a capital letter, it is the name of a type, and if it starts with a lower-case letter it is the name of a variable. Sort of a 1-bit hungarian notation (except that it has meaning to the compiler).

This is already a de facto standard in many programs (often by virtue of being a consequence of an even stricter scheme). Though of course if it were added to C++ now it would break many many programs...

A nice side effect of this is that the syntax for types could be made much more regular, whilst still retaining the "definition mirrors use" paradigm. Given that we found a captial letter at the start of an identifier, we know we are parsing a type name. So if we come across an open parenthesis ("(") we can parse that as "a function returning the type to the left of the parenthesis. So for example we can name the type "array of N pointers to functions returning pointers to functions returning pointers to characters" just by reading it backwards: Char*()*()*[N]. This is much easier than the normal C++ declaration "char*(*(*a[N])())()" which is very hard to understand and which also buries the name of the variable you are declaring in the middle of the type name rather than putting it on the right where it belongs. SPECS solves the same problem but at the cost of a less familiar syntax and losing "definition mirrors use".

Compiler compilers

Friday, May 11th, 2007

I think it would be cool if a language had a parser generator built in so that you could write BNF grammars right in the source code and the compiler would be able to turn these into parsing functions (this would also be really easy if the grammars were interpreted as Parsing Expression Grammars). What would be even better is if you could write text in the languages these grammars describe and have the compiler compile them right into the program alongside the code generated from the language itself.

For example, if you were writing an assembler, you might define a language for defining mnemonics and how they (and their arguments) translate into opcodes. Then you would add text in this language describing the architecture that your assembler will target. Rather than this parser running every time you ran your assembler, the output of the parser could be embedded right into the program so the result is highly efficient.

An efficient code generator that could be called at runtime would also be useful. This would make applications like formula-driven fractal generators, graphing programs, electrical circuit simulators and numerical analysis packages fast and easy to write. Combined with the parser generators, it would also make it very simple to write a compiler for this hypothetical language - just define the syntax and the semantics, plug in the code generator and there's your compiler. It would be almost like having a compiled homoiconic language, except with a proper built-in syntax (unlike Lisp). Unfortunately it would compete with itself so is unlikely ever to be written as a commercial compiler.