Parsing expression grammar grammar

I have found that a fun thing to do is make up grammars for computer languages - figure out what syntax rules work well and what is ambiguous (to both humans and computers - it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that - a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);


// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/";
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting) comments, use:
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];


// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;
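The recursive MultilineComment production is the one part of the character-level grammar that a regular expression cannot express, so here is a rough Python sketch of how it behaves. The function name match_multiline_comment is made up for illustration, and the sketch assumes any character is allowed inside a comment (the grammar restricts Character to newlines and printable ASCII) and ignores the whitespace that the closing "*/" may consume.

```python
def match_multiline_comment(text, pos=0):
    """Return the index just past a nesting /* ... */ comment at pos, or None.

    Mirrors: `/*` ( MultilineComment | !`*/` Character )* "*/"
    Assumption: any character counts as Character.
    """
    if not text.startswith("/*", pos):
        return None
    pos += 2
    while True:
        nested = match_multiline_comment(text, pos)  # first alternative
        if nested is not None:
            pos = nested
        elif pos < len(text) and not text.startswith("*/", pos):
            pos += 1                                 # second alternative
        else:
            break                                    # neither alternative matched
    # The repetition is over; the comment must now be closed.
    return pos + 2 if text.startswith("*/", pos) else None
```

With this, "/* a /* b */ c */" is one comment, while "/* /* */" is rejected because the outer comment is never closed.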

The rules are as follows:

Rule1 | Rule2	prioritized alternative
Rule1 Rule2	sequence
Rule*	Kleene star
Rule+	Rule Rule*
!Rule	succeeds only if Rule does not match here; consumes no input
&Rule	succeeds only if Rule matches here; consumes no input
(Rule)	order of operations
Rule1-Rule2	matches Rule1 but not Rule2
Rule1/Rule2	a sequence of strings matching Rule1 separated by strings matching Rule2 - left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2	a sequence of strings matching Rule1 separated by strings matching Rule2 - right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2	matches any single character from Char1 to Char2, inclusive
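To make the table concrete, here is a minimal Python sketch of the operators as recognizer combinators. Everything in it is an assumption made for illustration rather than the planned implementation: a parser is modelled as a function taking (text, pos) and returning the new position or None, the helper names (lit, seq, alt, star, sep_left and so on) are invented, and Rule1-Rule2 is read as "!Rule2 Rule1". There is no memoization, so this is plain recursive descent rather than a packrat parser.

```python
def lit(s):
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def char_range(lo, hi):                      # Char1..Char2 (inclusive)
    def parse(text, pos):
        return pos + 1 if pos < len(text) and lo <= text[pos] <= hi else None
    return parse

def seq(*parsers):                           # Rule1 Rule2
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def alt(*parsers):                           # Rule1 | Rule2: first success wins
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def star(p):                                 # Rule*
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:    # stop on failure or no progress
                return pos
            pos = end
    return parse

def plus(p):                                 # Rule+  ==  Rule Rule*
    return seq(p, star(p))

def opt(p):                                  # [Rule]
    return alt(p, lambda text, pos: pos)

def not_(p):                                 # !Rule: consumes nothing
    return lambda text, pos: pos if p(text, pos) is None else None

def and_(p):                                 # &Rule: consumes nothing
    return not_(not_(p))

def diff(p, q):                              # Rule1-Rule2, read as !Rule2 Rule1
    return seq(not_(q), p)

def sep_left(y, z):                          # Y/Z  ==  Y (Z Y)*
    return seq(y, star(seq(z, y)))

def sep_right(y, z):                         # Y\Z  ==  X := Y [Z X]
    def x(text, pos):
        return seq(y, opt(seq(z, x)))(text, pos)
    return x

alpha = alt(char_range("a", "z"), char_range("A", "Z"), lit("_"))
identifier = seq(alpha, star(alt(alpha, char_range("0", "9"))))
```

Since these recognizers only return positions, sep_left and sep_right accept exactly the same strings; the associativity only shows up once a parse tree is built.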

Having a single grammar for both parser and lexer is nice in some respects but does introduce some additional complications. Some strings (those I've called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I've called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). The same goes for productions: those created with ":=" optionally match _ at the end, while those created with "=" do not.
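The difference between the two kinds of literal can be sketched in Python as follows. The names character_literal, string_literal and skip_underscore are hypothetical, and Space is simplified here to spaces and line breaks (comments are left out), so treat this as an illustration of the idea rather than the real matcher.

```python
import re

_SPACE = re.compile(r"[ \r\n]*")        # simplified Space: no comments
_ALNUM = re.compile(r"[A-Za-z0-9_]")    # AlphanumericCharacter

def character_literal(s):
    """Backtick literal: match s exactly and consume nothing after it."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def skip_underscore(text, pos):
    """Optionally match `_` (!AlphanumericCharacter [Space]).

    If the next character is alphanumeric, `_` fails and, because it is
    optional, the position is simply left where it was.
    """
    if _ALNUM.match(text, pos):
        return pos
    return _SPACE.match(text, pos).end()

def string_literal(s):
    """Double-quoted literal: match s, then optionally match `_`."""
    def parse(text, pos):
        if not text.startswith(s, pos):
            return None
        return skip_underscore(text, pos + len(s))
    return parse
```

For the input "..  x", character_literal("..") stops right after the dots, whereas string_literal("..") also swallows the two spaces and stops at the x.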

The root production has no name.

The "/" and "\" delimiters make it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<=" | ">=" | "<" | ">")  // longer operators first, since | is prioritized
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";
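Each "/" in that production adds one precedence level, with the tightest-binding operators listed first, and the Y (Z Y)* expansion gives left associativity. Here is a small Python sketch of the same layering that builds a tree so the associativity is visible. It is only an approximation under stated assumptions: LEVELS, tokenize, parse_level and parse are invented names, CastExpression is replaced by a bare identifier or number, and a longest-match regular expression tokenizer stands in for the PEG literals.

```python
import re

# One tuple per "/" level, tightest-binding first (same order as the production).
LEVELS = [
    (".*", "->*"),
    ("*", "/", "%"),
    ("+", "-"),
    ("<<", ">>"),
    ("<", ">", "<=", ">="),
    ("==", "!="),
    ("&",), ("^",), ("|",), ("&&",), ("||",),
]

# Multi-character operators come first so that the longest one wins.
TOKEN = re.compile(r"\s*(->\*|\.\*|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%<>&^|]|\w+)")

def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_level(tokens, pos, level):
    """Parse Y/Z at one level as Y (Z Y)*, folding into a left-leaning tree."""
    if level < 0:
        # Stand-in for CastExpression: a single identifier or number.
        if pos >= len(tokens) or not re.fullmatch(r"\w+", tokens[pos]):
            raise SyntaxError("operand expected")
        return tokens[pos], pos + 1
    left, pos = parse_level(tokens, pos, level - 1)
    while pos < len(tokens) and tokens[pos] in LEVELS[level]:
        op = tokens[pos]
        right, pos = parse_level(tokens, pos + 1, level - 1)
        left = (op, left, right)             # left-associative fold
    return left, pos

def parse(src):
    tokens = tokenize(src)
    tree, pos = parse_level(tokens, 0, len(LEVELS) - 1)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

Here parse("a - b - c") groups as (a - b) - c, and parse("a + b * c") puts the multiplication below the addition, which is what the ordering of the "/" levels is meant to achieve.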

This entry was posted on Friday, August 15th, 2008 at 4:00 pm and is filed under language.

Reenigne blog
