Archive for the ‘language’ Category

Language optimized for refactoring

Friday, October 24th, 2008

One property of computer languages that is important but often seems to be overlooked is how easy it is to refactor programs written in them.

The one example that springs immediately to mind is renaming a class. In C++ this is a bit more difficult than in many languages because the constructors and destructors have the same name as the class, so you have to go and change all of those too. PHP wins here for calling them __construct and __destruct respectively.

If you are in the school of thought that has C++ method definitions in a separate file (e.g. .cpp) to class declarations (.h), you have to go and change things in two different files (even if you’re just adding a method that nobody calls yet). If that class implements a COM interface defined by a .idl file then there’s yet another thing you need to change.

Python’s syntactically-significant whitespace is another winner here because if (for example) you put another statement in an “if” clause that currently only has one statement, you don’t have to add braces.

I’m sure there are many other, deeper examples.

Once you go OOP, there’s no going back

Thursday, October 23rd, 2008

Object Oriented Programming is at least as much a state of mind as a set of programming language facilities. When I learnt C++ it was a bit difficult to get used to writing object-oriented programs but now that I’ve been doing it for many years I can’t get used to thinking about my programs any other way.

I was writing some PHP code recently and (not knowing about PHP classes) started writing it in a procedural fashion. After a while I noticed that many of the functions I was writing started to fall naturally into classes (with a first parameter that gave the function context). So it was only natural to re-write it in object-oriented style once I figured out how to do so.

In the process of doing so, I found lots of bugs in my original code (which I had thought was rather nifty). Many functions became much simpler. I also found it was much easier to do various optimizations that would have been very difficult to do without classes (such as minimizing the number of database queries). My code file did become somewhat bigger, but I attribute this to the extra indentation most lines have, and the fact that PHP requires you to write “$this->” everywhere.

A while ago I also tried writing a C program from scratch, for the first time in a very long time. I found myself using an object-oriented style and implementing vtables as structs.

PHP could be more secure

Monday, October 20th, 2008

Given that PHP is designed to be used to write applications that run on web servers, you’d think it would have been designed rather more with security in mind.

In particular, PHP’s dynamic typing seems to be a source of security weaknesses. Dynamic typing has advantages in rapid development and code malleability but is not particularly helpful for writing secure code – security is greatly helped by being able to restrict each variable to a specific set of values and having the compiler enforce this.

Similarly with the SQL API – because the interface is all just strings instead of strongly typed objects, SQL injection vulnerabilities become all too easy to write.
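
To make the "just strings" problem concrete, here is a minimal sketch. It uses Python's standard sqlite3 module as a stand-in for PHP's SQL API, and the table, the function names and the hostile input are all made up for illustration. The first function pastes the input into the SQL text; the second passes it as a separate parameter, so it is never parsed as SQL.

```python
import sqlite3

# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a-secret"), ("bob", "b-secret")])

def find_unsafe(name):
    # The query is built by concatenation, so the input becomes part of the SQL text.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_safe(name):
    # The input is bound to a placeholder and is treated purely as data.
    return conn.execute("SELECT secret FROM users WHERE name = ?", (name,)).fetchall()

hostile = "nobody' OR '1'='1"
unsafe_rows = find_unsafe(hostile)  # the WHERE clause is now true for every row
safe_rows = find_safe(hostile)      # no user has this literal name
```

With the concatenated query the hostile input returns every secret in the table, while the parameterized query returns nothing.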

Variable scope is another one – because there are no variable declarations it’s not obvious where variables are introduced, so one could be using variables declared earlier without realizing it (this is why register_globals changed from default-on, to default-off, to deprecated, to removed).

Then there are ill-conceived features like magic quotes, and missing features like cryptographically secure random number generation.

A well-designed language for web development would be secure by default when doing the most obvious thing – you shouldn’t have to go out of your way to learn what all the security pitfalls are and write code to explicitly address each of them (and update your code when the next such pitfall is discovered).

JavaScript vs PHP

Monday, October 13th, 2008

In order to implement Tet4 I learnt two new languages – JavaScript (or JScript, or ECMAScript – the language has a bit of an identity crisis) and PHP. Why PHP? It’s installed on my web hosting server and seems to have a huge community of people writing code in it and pre-written scripts. It may not be the ideal language for writing web server apps, but it does seem to be the most well-supported.

JavaScript seems to be a very clean, pretty language. The whole closure thing seemed a bit weird at first but once I understood that “class” is spelled “function” and “public” is spelled “this.” I got to rather liking it. I especially like how each scope has access to the variables from all the outer scopes – that saves a lot of messing about. It’s very well integrated with the browser – manipulating the DOM feels very natural and not tacked on.

PHP on the other hand is a bit of a mess. It is as if its designers had a little spinner with markings “C, C++, Perl” which they spun each day to decide which language’s features to copy that day. If JavaScript was sent by God, surely PHP was sent by the devil.

W3Schools has been an excellent reference for learning all this.

I have to say though that automatically promoting integers to double-precision floating point numbers on overflow is weird. On IE7, computing the value of 1111111111*1111111111 gives 1234567900987654400 (you can easily see this is wrong because it’s even). This caused a rather hard-to-debug problem with my random number generator (which assumed that when multiplying two 32-bit integers together, at least the low 32 bits of the result should be correct). If you’re going to automatically promote numbers, at least have the decency to use a multiple-precision integer library – there are lots around.
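
The arithmetic is easy to reproduce outside the browser. The sketch below uses Python, whose float is the same IEEE 754 double that JavaScript uses for its numbers and whose int is arbitrary precision, so both answers can be put side by side. The helper mul_low32 is a hypothetical name for one common workaround, not the fix actually used in Tet4: it splits the operands into 16-bit halves so that every intermediate value stays far below 2^53 and would survive double arithmetic.

```python
a = 1111111111

exact = a * a                          # arbitrary-precision integer product
as_double = int(float(a) * float(a))   # the same product rounded to a double

MASK32 = 0xFFFFFFFF
low_exact = exact & MASK32             # low 32 bits a 32-bit RNG would expect
low_double = as_double & MASK32        # low 32 bits after rounding to a double

def mul_low32(x, y):
    """Low 32 bits of x*y for 32-bit x and y, using only small intermediates."""
    x_hi, x_lo = x >> 16, x & 0xFFFF
    y_hi, y_lo = y >> 16, y & 0xFFFF
    # x_hi*y_hi is scaled by 2^32, so it cannot affect the low 32 bits.
    cross = (x_hi * y_lo + x_lo * y_hi) & 0xFFFF
    return ((cross << 16) + x_lo * y_lo) & MASK32
```

Here exact is 1234567900987654321 and as_double is 1234567900987654400, the value reported above, so the low 32 bits disagree. mul_low32(a, a) agrees with low_exact.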

Static scoping improved

Thursday, September 11th, 2008

Many programming languages have a facility (usually called “static”) to allow the programmer to declare a variable which is visible only to some particular object but has storage at the program’s scope – i.e. its value is the same for all instances of that object and when it changes for one it changes for all the other instances too.

One programming language feature I’ve never seen (but which I think would be useful) is a generalization of this – the ability to declare a variable which is only visible in a particular object but whose scope is the (lexical) parent object. I call this “outer”. For top-level objects, this would be the same as static but for nested classes the scope would be that of the outer class.

One could even use the “outer” keyword multiple times to put the variable in any particular level in the object nesting tree. This doesn’t violate encapsulation, since members can still only be declared inside their classes.

If you have “outer” instead of “static” (and maybe a few other more minor tweaks) any program can be turned into an isolated object inside another program – i.e. you can easily turn a program into a multi-instance version of that program with all the instances running in the same process.
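
No language I know of has “outer”, so the following is only an emulation in Python, with hypothetical class names. A class attribute plays the role of “static” (one slot per process), and an attribute on the enclosing object plays the role of “outer” (one slot per parent instance). The emulation loses the visibility half of the idea: here the variable has to be declared in the parent class, whereas with “outer” it would be declared inside the nested class.

```python
class Counter:
    # "static": a single storage slot shared by every Counter in the process.
    total = 0

    def __init__(self, program):
        self.program = program   # the lexical parent, passed explicitly here

    def bump(self):
        Counter.total += 1               # static-style storage
        self.program.counter_total += 1  # "outer"-style storage

class Program:
    def __init__(self):
        # What Counter would declare as "outer total" if the keyword existed.
        self.counter_total = 0
        self.a = Counter(self)
        self.b = Counter(self)

# Two instances of the same "program" living in one process.
first, second = Program(), Program()
first.a.bump()
first.b.bump()
second.a.bump()
```

After these calls Counter.total has been bumped three times across both programs, while each Program sees only the bumps made by its own counters, which is what makes the multi-instance version possible.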

String storage

Sunday, August 17th, 2008

Most applications store strings as a consecutive sequence of characters. Sometimes when the string is copied the characters are copied too. Sometimes strings are reference counted or garbage collected to minimize this copying, but copies are made when concatenating and performing other “string building” operations (otherwise the characters would no longer be consecutive in memory).

An alternative that might work better (especially for something like a compiler) would be to do the concatenation lazily. Actual character data comes from just a few places (the input files which are kept in memory in their entirety, static character data, and program argument data). There are two subtypes of string – one consists of a pointer to the first character and an integer recording the number of characters in the string. The other subtype consists of a vector of strings which are to be concatenated together. Integers (and maybe also formatting information) could be kept in other subtypes. The resulting tree-like data structure has a lot in common with the one I described in Lispy composable compiler.

I’m not sure if this actually saves much (if anything) in terms of memory space or speed over the usual methods (I suppose it depends on how long the average basic string chunk is), but it does have at least one potential advantage – vectors (especially if they grow by doubling) will have many fewer possible lengths than strings, so memory fragmentation may be reduced. I think it’s also kind of neat (especially if you have such data structures lying around anyway).
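
A minimal sketch of the two subtypes, in Python with hypothetical class names. Chunk is the pointer-plus-length subtype and Concat is the vector-of-strings subtype. Python slices copy their characters, so this only illustrates the shape of the data structure, not the memory behaviour a C or C++ version would have.

```python
class Chunk:
    """A run of consecutive characters inside a buffer that is kept alive elsewhere."""
    def __init__(self, buffer, start, length):
        self.buffer, self.start, self.length = buffer, start, length

    def __len__(self):
        return self.length

    def pieces(self):
        yield self.buffer[self.start:self.start + self.length]

class Concat:
    """A vector of strings to be concatenated; no characters are copied on creation."""
    def __init__(self, parts):
        self.parts = list(parts)

    def __len__(self):
        return sum(len(part) for part in self.parts)

    def pieces(self):
        for part in self.parts:
            yield from part.pieces()

def flatten(string):
    """Produce the consecutive characters only when they are finally needed."""
    return "".join(string.pieces())

source = "int main() { return 0; }"      # stands in for an input file held in memory
name = Chunk(source, 4, 4)               # refers to "main" inside the source
message = Concat([Chunk("call ", 0, 5), name, Chunk("();", 0, 3)])
doubled = Concat([message, message])     # the tree shares its subtrees
```

Building message and doubled touches no character data at all; only flatten walks the tree.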

Parsing expression grammar grammar

Friday, August 15th, 2008

I have found that a fun thing to do is make up grammars for computer languages – figure out what syntax rules work well and what is ambiguous (to both humans and computers – it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that – a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);

// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/";
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting comments), use
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];

// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;

The rules are as follows:

Rule1 | Rule2   prioritized alternative
Rule1 Rule2     sequence
[Rule]          optional
Rule*           Kleene star
Rule+           equivalent to Rule Rule*
!Rule           does not match Rule
&Rule           matches Rule but is not consumed
(Rule)          order of operations
Rule1-Rule2     matches Rule1 but not Rule2
Rule1/Rule2     a sequence of strings matching Rule1 separated by strings matching Rule2 – left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2     a sequence of strings matching Rule1 separated by strings matching Rule2 – right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2    matches a character between the character in Char1 and the character in Char2

Having a single grammar for both Parser and Lexer is nice in some respects but does introduce some additional complications. Some strings (those I’ve called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I’ve called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). Similarly with productions – those created with “:=” optionally match _ at the end.

The root production has no name.

The “/” and “\” delimiters make it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<" | ">" | "<=" | ">=")
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";
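
To show what the two separator operators expand to, here is a small sketch written as parser combinators in Python. Everything in it is an assumption made for illustration (the function names, the tuple-shaped parse trees, the use of regular expressions for tokens); it is not the parser generator described above. A parser takes the text and a position and returns a (tree, new position) pair, or None on failure.

```python
import re

def token(pattern):
    """Match a regular expression, then skip trailing whitespace (like a StringLiteral)."""
    regex = re.compile("(" + pattern + r")\s*")
    def parse(text, pos):
        match = regex.match(text, pos)
        return (match.group(1), match.end()) if match else None
    return parse

def left_separated(item, separator):
    # item / separator, i.e. item (separator item)*, building a left-leaning tree.
    def parse(text, pos):
        result = item(text, pos)
        if result is None:
            return None
        tree, pos = result
        while True:
            sep = separator(text, pos)
            rhs = item(text, sep[1]) if sep else None
            if rhs is None:
                return tree, pos      # a dangling separator is left unconsumed
            tree, pos = (sep[0], tree, rhs[0]), rhs[1]
    return parse

def right_separated(item, separator):
    # item \ separator, i.e. X := item [separator X], building a right-leaning tree.
    def parse(text, pos):
        result = item(text, pos)
        if result is None:
            return None
        lhs, pos = result
        sep = separator(text, pos)
        rest = parse(text, sep[1]) if sep else None
        if rest is None:
            return lhs, pos
        return (sep[0], lhs, rest[0]), rest[1]
    return parse

number = token(r"[0-9]+")
product = left_separated(number, token(r"[*/%]"))   # tighter-binding level first
total = left_separated(product, token(r"[+-]"))
assignment = right_separated(token(r"[a-z]+"), token(r"="))

left_tree, _ = total("1 - 2 * 3 - 4", 0)
right_tree, _ = assignment("a = b = c", 0)
```

The first parse groups as (1 - (2 * 3)) - 4 and the second as a = (b = c). Each precedence level in the chain above is one more nested left_separated call.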

English as a language for programming

Friday, August 8th, 2008

Programmers write programs in computer languages but the comments and identifiers (which are important, but not meaningful to the computer) are written in a human language.

Usually this human language is English, but not always – I have occasionally run across pieces of source code in French, German and Hebrew. I guess it makes sense for a programmer to write code in their first language if they are not expecting to collaborate with someone who doesn’t speak that language (or if that piece of code is very specific to that language – like a natural language parser).

On the other hand, it seems kind of short-sighted to write a program in anything other than English these days. There can’t be many programmers who don’t speak some amount of English (since most of the technical information they need to read is written in English), and it seems likely that all but the most obscure hobby programs will eventually be examined or modified by someone who doesn’t speak the first language of the original author (if that language isn’t English).

There are other advantages to standardizing on English – a common vocabulary can be developed for particular programming constructs which makes programs easier to understand for those who are not familiar with their internal workings. The aim is, of course, that any programmer should be able to understand and work on any program.

That there is a particular subset of the English language that is used by programmers is already evident to some extent – I think it will be interesting in the next few years and decades to see how this subset solidifies into a sub-language in its own right.

I should point out that I’m not advocating putting up legal or arbitrary technical barriers to prevent programs being written in other languages – more that it might be useful to have tools which can help out with programming tasks for programs written in English.

Having said all that, I think that in years to come a higher proportion of programming will be done to solve particular one-off problems rather than to create lasting programs – there’s no reason why these throw-away programs shouldn’t be in languages other than English. Tool support for this can be very minimal, though – perhaps just treating the UTF-8 bytes 0x80-0xbf and 0xc2-0xf4 as alphabetic characters and the sequence 0xef, 0xbb, 0xbf as whitespace.
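
A rough sketch of how little a lexer needs for this, in Python with hypothetical function names. It works on raw bytes and never decodes anything to classify it. One detail worth noticing is that the three byte-order-mark bytes all fall inside the “alphabetic” ranges, so the mark has to be recognised as a sequence before the per-byte rule is applied.

```python
BOM = b"\xef\xbb\xbf"   # the UTF-8 encoding of U+FEFF

def is_alphabetic(byte):
    """ASCII letters and underscore, plus every byte that can appear in a multi-byte UTF-8 sequence."""
    return (ord("a") <= byte <= ord("z") or ord("A") <= byte <= ord("Z")
            or byte == ord("_")
            or 0x80 <= byte <= 0xBF      # continuation bytes
            or 0xC2 <= byte <= 0xF4)     # lead bytes

def identifiers(source):
    """Return the identifiers found in a UTF-8 encoded byte string."""
    source = source.replace(BOM, b" ")   # treat the byte-order mark as whitespace
    names, start = [], None
    for i, byte in enumerate(source + b" "):   # the sentinel ends a trailing identifier
        is_digit = ord("0") <= byte <= ord("9")
        in_name = is_alphabetic(byte) or (start is not None and is_digit)
        if in_name and start is None:
            start = i
        elif not in_name and start is not None:
            names.append(source[start:i].decode("utf-8"))
            start = None
    return names

names = identifiers("\ufeffgröße = länge2 + 1".encode("utf-8"))
```

Here names comes out as the two identifiers größe and länge2, with the leading byte-order mark and the trailing number ignored.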

Concrete data type

Friday, July 25th, 2008

A handy thing to have around (and something I want to put into the Unity language) is a data type for representing physical quantities. This keeps track of a real number plus powers of metres, seconds, kilograms, amps, Kelvins (and maybe others, including user-defined ones). Two values of type Concrete can always be multiplied or divided but can only be added or subtracted if their dimensions match, and can only be converted to other types if they are dimensionless.

Common units of measurement and physical constants can be given as constants. Because the units are tracked automatically you can do things like:

(2*metre+5*inch)/metre

and get the right answer.

Usually the compiler should be able to check the dimensions at compile time and elide them like other type information, or give a compile error for expressions like:

(2*metre+5)/metre
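
As a sketch of the idea, here is a run-time version in Python. The class layout and method names are assumptions, and the checks happen when the expression is evaluated rather than at compile time as they ideally would in Unity. Dimensions are stored as a tuple of powers of metres, seconds, kilograms, amps and kelvins.

```python
class Concrete:
    """A real number plus integer powers of (metre, second, kilogram, amp, kelvin)."""
    def __init__(self, value, dims=(0, 0, 0, 0, 0)):
        self.value, self.dims = value, tuple(dims)

    @staticmethod
    def _lift(other):
        # Plain numbers are treated as dimensionless quantities.
        return other if isinstance(other, Concrete) else Concrete(other)

    def __mul__(self, other):
        other = Concrete._lift(other)
        return Concrete(self.value * other.value,
                        [a + b for a, b in zip(self.dims, other.dims)])
    __rmul__ = __mul__   # multiplication commutes, so 2*metre works too

    def __truediv__(self, other):
        other = Concrete._lift(other)
        return Concrete(self.value / other.value,
                        [a - b for a, b in zip(self.dims, other.dims)])

    def __add__(self, other):
        other = Concrete._lift(other)
        if self.dims != other.dims:
            raise TypeError("cannot add quantities with different dimensions")
        return Concrete(self.value + other.value, self.dims)

    def __float__(self):
        if any(self.dims):
            raise TypeError("only dimensionless quantities convert to plain numbers")
        return float(self.value)

metre = Concrete(1.0, (1, 0, 0, 0, 0))
inch = 0.0254 * metre   # an inch is exactly 0.0254 metres

ratio = float((2 * metre + 5 * inch) / metre)   # dimensionless, so conversion is allowed
```

The first expression above evaluates to about 2.127, while (2 * metre + 5) / metre raises a TypeError because a length is being added to a pure number.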

Along similar lines, the log() function could be considered to return a value of type Concrete with some special non-zero dimension. Then you can (and indeed must) specify which base the logarithm should be taken to by dividing by another logarithm (e.g. log(x)/log(e), log(x)/log(10) or log(x)/log(2)). This syntax is rather more verbose than the usual method of having separate functions for common bases (log(), lg(), ln() etc.) but I find that this is more than made up for by the fact that one doesn’t have to remember which function corresponds to which base – it’s self-describing.

Another useful type for scientific/engineering work would be a value with confidence interval (e.g. 5±1 meaning “distributed normally with a mean of 5 and a standard deviation of 1”). There are well-defined rules for doing arithmetic with these. A generalization of this to other distribution functions might also be useful.
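
A minimal sketch of the simplest of those rules, in Python with a hypothetical class name. It assumes the values are independent and normally distributed, in which case sums and differences are again normal, the means add or subtract, and the standard deviations combine in quadrature. Products and quotients are not exactly normal and are left out.

```python
import math

class Uncertain:
    """A value written mean ± sd, assumed normal and independent of other values."""
    def __init__(self, mean, sd):
        self.mean, self.sd = mean, sd

    def __add__(self, other):
        # Variances add, so standard deviations combine as sqrt(a^2 + b^2).
        return Uncertain(self.mean + other.mean, math.hypot(self.sd, other.sd))

    def __sub__(self, other):
        # The spread still grows when subtracting.
        return Uncertain(self.mean - other.mean, math.hypot(self.sd, other.sd))

    def scale(self, factor):
        # Multiplying by an exact constant scales the mean and the spread.
        return Uncertain(factor * self.mean, abs(factor) * self.sd)

total = Uncertain(5, 3) + Uncertain(10, 4)
difference = Uncertain(5, 3) - Uncertain(10, 4)
halved = Uncertain(5, 1).scale(-0.5)
```

Here total is 15±5 and difference is -5±5: the uncertainty is the same in both cases, and smaller than the 7 that naive addition of the two spreads would give.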

My first made-up language

Wednesday, July 23rd, 2008

Years ago (more than a dozen), I tried to write (in C) an interpreter/compiler for a dialect of BASIC that I made up. It was based on BBC BASIC and the BBC’s graphics facilities (which I was very familiar with at the time) but it would have run on the PC and had some additional features:

  • Operators and control structures from C
  • More string and array processing facilities
  • More graphics commands
  • Interactive editor and debugger
  • Commands for accessing PC facilities (interrupts, calling native code etc.)
  • Built-in help
  • Facilities for storing chunks of code as a library
  • Complex numbers
  • Self-modification and introspection facilities

The last of these is what I want to write about today.

As a child I used many of the 8-bit BASIC variants which had line numbers and it always irked me that there were some things you could do in immediate mode that you couldn’t do from a running program, such as editing the program. Why was it that typing:

PRINT "HELLO"

printed “HELLO” immediately and:

10 PRINT "HELLO"

printed “HELLO” when the program was run but

10 10 PRINT "HELLO"

didn’t create a program that replaced itself with the program ‘10 PRINT “HELLO”’ when it was run? While doing so didn’t seem terribly useful, it seemed to me that an interpreter would be just as easy (if not easier) to write with this facility as without it, and that it was an unnatural omission.

Along similar lines, my dialect had an “INTERPRET” command which took a string and ran it as BASIC code and a “PROGRAM$” array which contained the lines of the program.
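
To show why this falls out naturally, here is a toy sketch in Python. It is not the dialect described above: it understands only PRINT with a quoted string, RUN, and numbered lines, and all the names are made up. The point is that immediate mode and the running program go through the same interpret method, which plays the role of INTERPRET, while the program dictionary plays the role of PROGRAM$.

```python
import re

class TinyBasic:
    def __init__(self):
        self.program = {}   # line number -> statement text
        self.output = []    # what PRINT has produced so far

    def interpret(self, text):
        """Handle one line identically in immediate mode and inside a running program."""
        text = text.strip()
        numbered = re.match(r"(\d+)\s*(.*)", text)
        if numbered:
            # A leading number means "store the rest at that line", even mid-run.
            self.program[int(numbered.group(1))] = numbered.group(2)
        elif text.startswith("PRINT "):
            self.output.append(text[len("PRINT "):].strip().strip('"'))
        elif text == "RUN":
            self.run()
        else:
            raise SyntaxError(text)

    def run(self):
        # Iterate over a snapshot of the line numbers so lines may rewrite the program.
        for number in sorted(self.program):
            if number in self.program:
                self.interpret(self.program[number])

basic = TinyBasic()
basic.interpret('PRINT "HELLO"')         # immediate mode: prints straight away
basic.interpret('10 10 PRINT "HELLO"')   # stores the text 10 PRINT "HELLO" at line 10
basic.run()                              # line 10 replaces itself with PRINT "HELLO"
basic.run()                              # the rewritten program now prints
```

After the first run nothing new has been printed but the program has become a single line 10 containing PRINT "HELLO"; the second run prints it.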

I got some way into writing a parser for it but as I didn’t know how to write a parser back then I got horribly tangled trying to write code for each possible combination of current state and next piece of input.

The similarities between this and the Unity language that I’m in the (very slow and gradual) process of writing haven’t escaped me.
