Strings and localization in ALFE

PHP is a pretty bad and ugly language in many ways, but it does have some advantages - because it's so ubiquitous for web development, a lot of effort has been put into getting it to run fast and including many useful libraries. Another advantage is its string manipulation, which is what I want to talk about today.

PHP has a nice feature whereby if you have a variable called $foo (all variables in PHP start with the "$" sigil) and you want to output a string containing that value, you can just write:

echo "The current foo value is $foo units.";

In other words, you don't have to close the string to insert a variable into it. If this feature didn't exist you'd have to write:

echo "The current foo value is " . $foo . " units.";

as in C++:

cout << "The current foo value is " << foo << " units.";

which doesn't seem like much more typing, but when you're writing web pages (which are full of these inserts) it really adds up, and the interpolated form is much more pleasant to use. I was skeptical at first, but after having written some PHP code I want to steal this feature for my own language. Being a statically typed language, ALFE also inserts .toString() calls when the inserted variable is not of String type.

Inserting more complex expressions is also possible, though I think PHP's syntax for this is a bit confusing:

"<div class='logged_in_as'>Logged in as: {$user->link()}</div>"

Much better to have a single insertion character and allow what follows to be either a variable name or a parenthesized expression:

"<div class='logged_in_as'>Logged in as: $(user.link())</div>"

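To make the proposed rule concrete, here is a minimal sketch of how a lexer might split a string into literal text and inserts. It is written in Python purely for illustration (ALFE itself isn't available to run), the name split_inserts is made up, and it assumes that a parenthesized insert ends at the matching close parenthesis. It does not try to cope with parentheses inside string literals nested in an insert.

```python
import re

# A variable name that may follow a bare "$" (an assumption, not ALFE's real grammar).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def split_inserts(s):
    """Split s into ("text", ...) and ("insert", ...) pieces."""
    parts = []
    literal = ""
    i = 0
    while i < len(s):
        if s[i] == "$" and s.startswith("(", i + 1):
            # Parenthesized insert: scan to the matching ")", counting nesting.
            depth = 0
            j = i + 1
            while j < len(s):
                if s[j] == "(":
                    depth += 1
                elif s[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j == len(s):
                raise ValueError("unterminated $( insert")
            source, end = s[i + 2:j], j + 1
        elif s[i] == "$" and NAME.match(s, i + 1):
            # Bare insert: "$" followed directly by a variable name.
            m = NAME.match(s, i + 1)
            source, end = m.group(0), m.end()
        else:
            # Ordinary character (including a "$" that starts no insert).
            literal += s[i]
            i += 1
            continue
        if literal:
            parts.append(("text", literal))
            literal = ""
        parts.append(("insert", source))
        i = end
    if literal:
        parts.append(("text", literal))
    return parts
```

Given the PHP-style example from earlier, split_inserts("The current foo value is $foo units.") yields the text before the insert, the insert foo, and the text " units.". A compiler would then turn each insert into an expression and concatenate the pieces.
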
The trouble with including literal strings like this in your program, though, is that sooner or later you're going to want to translate them into other (human) languages. Current GNU C/C++ best practices for internationalizable software suggest enclosing UI strings in a gettext macro (_("foo")), which marks the string for extraction into a message catalog and, at run time, looks up the translation using the original string as the key - then localization is just a matter of supplying a catalog for each language. At Microsoft we avoided putting UI strings in source code altogether, instead putting them in an .rc file and making up a macro name to refer to the string's index. That was painful, because to add a string we had to change three files - the source file (relating code to macro name), the header file (relating macro name to index) and the resource file (relating macro name to English text). I suppose there might have been economic forces resisting change to this system - Microsoft software is translated into so many languages that adding a UI string is quite expensive, because you have to pay to have it translated many times. The same goes for changing a UI string.

Both these methods also constrain how you need to phrase your UI strings. The canonical example is that you can't write:

printf(_("%i file%s copied"), n, n==1 ? "" : "s");

because the rules for forming plurals vary from language to language. Translating would involve changing the code as well as the contents of the string. Thus such pieces of UI usually get phrased as:

printf(_("Files copied: %i"), n);

instead.

I want to include a localization system right in the ALFE compiler which solves all these problems. The high level aims:

  • It should be easy to add a string with inserts - as easy as it is in PHP. Modifying strings should be similarly easy - no editing multiple files, no _(), no making up names for the strings or finding the next available ID number.
  • The system should keep track of strings between compiles so that they can be consistently associated with their translations.
  • Translators should be able to permute, duplicate, add, remove and modify inserts as appropriate for the target language.
  • Software designers should not have to compromise their designs in the original language to make it possible to translate them into other languages - in other words there is no need to explicitly internationalize strings - it happens automatically (of course, there are non-string internationalization issues which are not in scope here, such as staying far, far away from maps).

To accomplish this, I think the best way is to have a tool (or perhaps a special mode of the compiler) which modifies the source files it processes. This is generally considered a bad thing but in this case I think it's the best solution. The one and only change it makes is to add an identifying number to the start of each string. The insert system gives a handy way to do this, so it might change:

"<div class='logged_in_as'>Logged in as: $(user.link())</div>"

to:

"$(/*12345*/)<div class='logged_in_as'>Logged in as: $(user.link())</div>"

If you're a programmer changing this string, you now have a choice - you can delete the "$(/*12345*/)" part, causing the translators to consider this a brand new string to be translated from scratch, or you can leave it alone, causing the translators to consider it a modification rather than a deletion and addition. The IDs are only ever added by this tool - no programmer should ever have to try to figure out which is the next available number by hand. It might be a good idea for the version control system to be made aware of this system so that if two strings are added with the same number on different branches and the branches are then merged, the merge system knows how to change one of the IDs automatically (everywhere that it appears). Similar empty inserts with comments can be used to leave notes for translators, such as telling them the context in which a particular string is used.
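
As a rough illustration of the numbering pass, here is a minimal sketch in Python. It assumes the tool already has the contents of each string literal in hand (finding them in the source is the lexer's job), and the name tag_strings is made up. It picks new IDs by counting up from the largest one it can see, which is only an assumption: a real tool would probably also want to remember retired IDs so that they are never reused.

```python
import re

# Matches an ID insert such as "$(/*12345*/)" or the dummy "$(/**/)" at the start.
ID_PREFIX = re.compile(r"^\$\(/\*(\d*)\*/\)")

def tag_strings(literals):
    """Return the literals with an ID insert prepended to each untagged one."""
    used = set()
    for s in literals:
        m = ID_PREFIX.match(s)
        if m and m.group(1):
            used.add(int(m.group(1)))
    next_id = max(used, default=0) + 1
    tagged = []
    for s in literals:
        if ID_PREFIX.match(s):
            # Already numbered, or carrying the dummy ID: leave it alone.
            tagged.append(s)
        else:
            tagged.append("$(/*%d*/)%s" % (next_id, s))
            next_id += 1
    return tagged
```

Running the pass twice gives the same result as running it once, which matters for a tool that rewrites source files on every build.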

For non-UI strings, the programmer and/or translator can change the ID to a dummy so that it's clear to all concerned that the string does not need to be translated:

"$(/**/)<html>"

The interface for translators should be something like this. The tool creates one file (the translations file) for each generated binary for each language the software is translated into. This file contains one line per string in the original source, and that line contains the original string (including ID and comments), the translated string, and any notes from the translator. The compiler then reads this file back in to generate a compiled locale file for each language/binary combination. The translator can either edit the translations file directly or use some kind of tool, but either way it just gets checked back into the version control system.
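
Keeping a translations file in step with the source mostly comes down to comparing IDs and original text. Here is a minimal Python sketch of that comparison, assuming the file has already been parsed into a dictionary from ID to (original, translated, notes). The names string_id and compare and the dictionary layout are made up for illustration, not a proposed file format.

```python
import re

# Matches an ID insert such as "$(/*12345*/)" or the dummy "$(/**/)" at the start.
ID_PREFIX = re.compile(r"^\$\(/\*(\d*)\*/\)")

def string_id(s):
    """Return the numeric ID of a tagged string, or None if it has none."""
    m = ID_PREFIX.match(s)
    return int(m.group(1)) if m and m.group(1) else None

def compare(source_strings, translations):
    """Report which source strings are new or changed for the translator.

    translations maps ID -> (original, translated, notes).
    """
    report = {"added": [], "changed": []}
    for s in source_strings:
        sid = string_id(s)
        if sid is None:
            continue  # dummy ID "$(/**/)" or untagged: nothing to translate
        if sid not in translations:
            report["added"].append(sid)
        elif translations[sid][0] != s:
            # Same ID, different text: a modification, not a new string.
            report["changed"].append(sid)
    return report
```

Because a string is matched by its ID rather than its text, a programmer who edits the wording but keeps the ID shows up under "changed", and the translator still has the old translation to start from.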

The tool also checks for differences between the translations file and the source, and tells the translator which strings have been added and changed (including where the string appears in the original source, so the translator can look up the context). Because the translations file contains the inserts as source (the variable name or expression after the $), the translator is free to change this code to make sense for other languages. To handle the plural case, the string might be written as:

print("$(/*120*/)$n file$(n == 1 ? "" : "s") copied");

and the translator can then change anything inside the outermost double-quotes to make the string make sense for the target language. That might even involve calling out to external code to handle complicated cases:

"$(/*120*/)$n file$(n == 1 ? "" : "s") copied", "$n $(locale.plural(n, "plik")) kopiowane"

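To give a feel for what such a helper has to do, here is a minimal Python sketch of the noun selection for Polish, which uses one form for 1, another for numbers ending in 2 to 4 (except 12 to 14), and a third for everything else. The locale.plural call above takes only the singular noun and I haven't said how it would find the other forms, so this hypothetical plural_pl simply takes all three as arguments.

```python
def plural_pl(n, one, few, many):
    """Pick the Polish noun form that goes with the whole number n."""
    if n == 1:
        return one
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return few
    return many

# Print the noun form chosen for a handful of counts.
for n in (0, 1, 2, 5, 12, 22):
    print(n, plural_pl(n, "plik", "pliki", "plików"))
```

No two-way test like n == 1 ? "" : "s" can express this, which is why the translator needs to be able to replace the code in the insert and not just the text around it.
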
An important difference here is that the compiled locale file can now contain code as well as data. This may introduce testing difficulties, but it's necessary to meet the "no compromise to end user experience" goal. If necessary, site policy could dictate the software internationalization rules we usually use now, or some middle ground like ensuring all locale methods are pure functions returning in bounded time, so they can't adversely affect other parts of the software.
