Archive for October, 2011

Removing outliers

Monday, October 17th, 2011

If you've got a sequence of data that you want to do some statistical analysis on, but you know that some of it is bad, how do you remove the bad data? You could just remove the top 5% and bottom 5% of values, for example, but maybe you don't have any bad data (or maybe you have lots) and that would adverse affect your measurement of the standard deviation.

Suppose that you know that your dataset is supposed to follow a normal distribution (lots do). Then you could remove outliers by measuring the skew and kurtosis (3rd and 4th moments), and just repeatedly remove the sample furthest from the mean until these measurements look correct. This algorithm is guaranteed to terminate since if you only have 2 samples the skew and kurtosis will be 0. You've still got a parameter or two to tune though (how much skew and kurtosis you'll tolerate).

If the world were rational

Sunday, October 16th, 2011

Sometimes I wonder what the world would be like if it were possible to get everybody to agree on true things.

Suppose that, whenever two people find that they disagree about something, they invoke the Argulator algorithm to resolve the dispute. The Argulator algorithm is trivial for statements which empirically true or false - you just go and do the experiment to find out who is right. For other statements, each side enumerates each argument for their position as a set of statements, then figure out which statements in each of their opponent's arguments they disagree with and recursively invoke the Argulator algorithm on each of those.

The Argulator algorithm isn't guaranteed to terminate, though - you might have two positions which are perfectly internally consistent and consistent with the observable universe, but which still disagree with each other. For example, two people might disagree about the purpose of government - one might argue that its purpose is to provide a safety net or basic minimum standard of living for the least well off members of society. Another might argue that its purpose is to further the cause of human achievement (maximize rate of total long-term economic output) by any means necessary. Without agreement on what the end goal is, the chances of agreeing of a course of action are pretty slim.

Beats of two colour carriers

Saturday, October 15th, 2011

Okay, I tried the second oscillator and the effect of the color adjust capacitor is definitely frequency. Based on the beat patterns it looks like the range of the capacitor is maybe 300ppm or so and (with very careful adjustment) it's possible to get the beat frequency down to a few ppm. SMPTE 170M specifies a tolerance of about 3ppm, but frequency differences of less than 30ppm or so are unlikely to be noticeable even with a TV which doesn't use the received color burst frequency.

Effect of the color adjust variable capacitor

Friday, October 14th, 2011

For a long time, I thought the effect of the "color adjust" variable capacitor on the XT motherboard was to change the duty cycle of the 14MHz clock signal going into the CGA card. This would in turn change the hue of the green and magenta chroma colours (and all the artifact colours) while leaving the yellow, blue, red and cyan artifact colours alone.

However, having now tried it out, this does not seem to be the case at all:

The hues of all the chroma colours are changed. I don't really understand why, since the yellow and color burst signals are the same, so there's no way to introduce a phase shift between them (and hence no way to change the hue of that colour).

Looking at information about crystal oscillator circuits, a variable capacitor like that is usually included in order to fine-tune the frequency of the circuit. Quartz crystal frequency sources have a very small damping factor (high Q value) which means there's a very narrow range of frequencies in which they like to oscillate, but the external circuitry can influence the output frequency a little bit (as we've seen before).

But what we're seeing on the screen there doesn't look like the effect of changing the frequency either. With a frequency that isn't quite right, I'd expect to see hue changes on the right-hand side of the screen but not the left.

That makes me think that the TV must be using both the phase and frequency of the detected color burst for its reference. I'm not sure why it would do that - it seems like it would be more trouble than it's worth. Maybe it was too difficult to make an oscillator that retained the right frequency with sufficient accuracy that the right hand side of the screen never exhibited a hue drift with respect to the left over the entire lifetime of the TV and in all the conditions in which it might operate.

And that still doesn't explain why the hue of the yellow bar changes (the effect is the same even if I put it at on the left-hand side of the screen). Perhaps it's to do with the TV expecting 227.5 color carrier cycles per scanline and receiving 228.

I guess I can find out for sure whether the effect of the capacitor is frequency or not by using a second 3.57MHz oscillator connected to the TV and observing how the beat pattern changes as the capacitor is turned.

Masking out the sync signals

Thursday, October 13th, 2011

In the course of the CGA debugging I talked about yesterday, I found it useful to be able to accurately visualize signals with a period much smaller than a frame. The setup I showed in yesterday's video was rather less than ideal for that since the TV adjusts the DC offset of the incoming signal so that the blanking period is at the normal (black) level.

I realized I could get around this by taking the "display enable" signal from the CGA card and ANDing it with the signal I wanted to visualize. Then a logic "1" signal would give me a white screen, signals with frequencies that were integer multiples of the horizontal sync frequency would give me vertical lines and so on.

When I tried it out I got a bit of a surprise - strange interference patterns were appearing on the TV that I hadn't seen without the AND gate. A moment's thought told me why this was - the AND gate was acting as an amplifier - picking up whatever electromagnetic radiation was flying around and turning it into 1s and 0s.

Here's some video of the effect.

This was much more interesting than actually debugging the CGA card, so I started poking around at all kinds of signals and visualizing them. Unfortunately I forgot where the -12V (or possibly +12V) signal on the bus was and touched it, frying my one remaining 74HC08. I finished the debugging with a 74LS08 which doesn't show such nice interference patterns, then wired up two NAND gates from a 74HC00 as an AND gate so that you could get to see the effect.

Hardware debugging with no oscilloscope

Wednesday, October 12th, 2011

I bought a CGA card for my XT, but it stopped working after a few weeks. These things are quite fixable - computers weren't made with surface mount technology back then, and most of the chips on the CGA card in particular are standard 74 series logic chips which are still made (the exceptions being RAM, character ROM and 6845 CRTC).

I knew the fault wasn't the RAM (since I had seen it happen - the picture flickered in and out for a moment before disappearing completely, meaning that the RAM contents were still intact while it was "out") and I knew it wasn't the CRTC because the sync pulses and cursor were still being generated. That left the character ROM and surrounding circuitry.

The problem was that I didn't (and don't) have an oscilloscope to debug it with. I knew what the signals on each pin were supposed to look like (more or less) - I just needed some way to visualize 1MHz+ signals. Well, the ideal device was right there - the TV. It's actually even more ideal than an oscilloscope since (given the horizontal and vertical sync pulses from the CGA card) it would display signals in their corresponding places on the screen.

Unfortunately, it wasn't the shift register at all, but using the same technique I did track down and fix the problem, a broken NOR gate which was masking off the character clock pulse sent to the shift register. Luckily I had a replacement for that one too. It's a 74HC part which (I've since learned) is not supposed to be electrically compatible with the 74LS chips used in the CGA (the voltage thresholds are different) but it seems to be working anyway.

Strings and localization in ALFE

Tuesday, October 11th, 2011

PHP is a pretty bad and ugly language in many ways, but it does have some advantages - because it's so ubiquitous for web development, a lot of effort has been put into getting it to run fast and including many useful libraries. Another advantage is its string manipulation, which is what I want to talk about today.

PHP has a nice feature whereby if you have a variable called $foo (all scalars in php start with the "$" sigil) and you want to output a string containing that value, you can just write:

echo "The current foo value is $foo units.";

In other words, you don't have to close the string to insert a variable into it. If this feature didn't exist you'd have to write:

echo "The current foo value is " . $foo . " units.";

as in C++:

cout << "The current foo value is " << foo << " units.";

which doesn't seem like much more typing, but when you're writing web pages (which are full of these inserts) it really adds up and makes the result much more pleasant to use. I was skeptical at first but after having written some PHP code I want to steal this feature for my own language. Being a statically typed language, UnityALFE also inserts .toString() calls when the inserted variable is not of String type.

Inserting more complex expressions is also possible. Though I think PHP's syntax for this is a bit confusing:

"<div class='logged_in_as'>Logged in as: {$user->link()}</div>"

Much better to have a single insertion character and allow what follows to be either a variable name or a parenthesized expression:

"<div class='logged_in_as'>Logged in as: $(user.link())</div>"

The trouble with including literal strings like this in your program, though, is that sooner or later you're going to want to translate it into other (human) languages. Current GNU C/C++ best practices for internationalizable software suggest enclosing UI strings in a gettext macro (_("foo")) which adds that string to a table and replaces the string itself with an index into that table - then localization is just a matter of replacing the table. At Microsoft we avoided putting UI strings in source code altogether, instead putting them in an .rc file and making up a macro name to refer to the string's index. That was painful, because to add a string we had to change three files - the source file (relating code to macro name), the header file (relating macro name to index) and the resource file (relating macro name to English text). I suppose there might have been economic forces resisting change to this system - Microsoft software is translated into so many languages that adding a UI string is quite expensive - you have to pay to have it translated a lot of times. Similarly with changing a UI string.

Both these methods also constrain how you need to phrase your UI strings. The canonical example is that you can't write:

printf(_("%i file%s copied"), n, n==1 ? "" : "s");

because the rules for localization vary from language to language. Translating would involve changing the code as well as the contents of the string. Thus such pieces of UI usually get phrased as:

printf(_("Files copied: %i"), n);

instead.

I want to include a localization system right in the UnityALFE compiler which solves all these problems. The high level aims:

  • It should be easy to add a string with inserts - as easy as it is in PHP. Modifying strings should be similarly easy - no editing multiple files, no _(), no making up names for the strings or finding the next available ID number.
  • The system should keep track of strings between compiles so that they can be consistently associated with their translations.
  • Translators should be able to permute, duplicate, add, remove and modify inserts as appropriate for the target language.
  • Software designers should not have to compromise their designs in the original language to make it possible to translate them into other languages - in other words there is no need to explicitly internationalize strings - it happens automatically (of course, there are non-string internationalization issues which are not in scope here, such as staying far, far away from maps.

To accomplish this, I think the best way is to have a tool (or perhaps a special mode of the compiler) which modifies the source files it processes. This is generally considered a bad thing but in this case I think it's the best solution. The one and only change it makes is to add an identifying number to the start of each string. The insert system gives a handy way to do this, so it might change:

"<div class='logged_in_as'>Logged in as: $(user.link())</div>"

to:

"$(/*12345*/)<div class='logged_in_as'>Logged in as: $(user.link())</div>"

If you're a programmer changing this string, you now have a choice - you can delete the "$(/*12345*/)" part, causing the translators to consider this a brand new string to be translated from scratch, or you can leave it alone, causing the translators to consider it a modification rather than a deletion and addition. The IDs are only ever added by this tool - no programmer should ever have to try to figure out which is the next available number by hand. It might be a good idea for the version control system to be made aware of this system so that if two strings are added with the same number on different branches and the branches are then merged, the merge system knows how to change one of the IDs automatically (everywhere that it appears). Similar empty inserts with comments can be used to leave notes for translators, such as telling them the context in which a particular string is used.

For non-UI strings, the programmer and/or translator can change the ID to a dummy so that it's clear to all concerned that the string does not need to be translated:

"$(/**/)<html>"

The interface for translators should be something like this. The tool creates one file (the translations file) for each generated binary for each language the software is translated into. This file contains one line per string in the original source, and that line contains the original string (including ID and comments), the translated string, and any notes from the translator. The compiler then reads this generated file back in to generate translation files for each language/binary combination. The translator can either edit this file directly or use some kind of tool, but either way it just gets checked back into the version control system.

The tool also checks for differences between the translations file and the source, and tells the translator which strings have been added and changed (including where the string appears in the original source, so the translator can look up the context). Because the translation file contains the inserts as source (the variable name or expression after the $) the translator is free to change this code to make sense for other languages. To handle the plural case, the string might be written as:

print("$(/*120*/)$n file$(n == 1 ? "" : "s") copied");

and the translator can then change anything inside the outermost double-quotes to make the string make sense for the target language. That might even involve calling out to external code to handle complicated cases:

"$(/*120*/)$n file$(n == 1 ? "" : "s") copied", "$n $(locale.plural(n, "plik")) kopiowane"

An important difference here is that the compiled locale file can now contain code as well as data. This may introduce testing difficulties, but it's necessary to provide this "no compromise to end user experience" goal. If necessary, site policy could dictate the software internationalization rules we usually use now, or some middle ground like ensuring all locale methods are pure functions returning in bounded time, so they can't adversely affect other parts of the software.

Templates and Kinds in ALFE

Monday, October 10th, 2011

Programmers work with values like 4 and "hydroxylic". We also generally work with variables like count and acid. Each variable holds precisely one value. But in many languages, not every variable can hold any value. In a statically typed language, count might be able to hold 4 and acid might be able to hold "hydroxylic", but not the other way around. That quality of variables and values that determines which objects can fit into which boxes is called type. So count might have a type of Integer and acid might have a type of String. Then if you have a function called square which multiplies its argument by itself and returns the result, it makes sense to say square(count) but not to say square(acid) - the latter is an error which is detected and flagged as quickly as possible (when the program is compiled, if not when it's typed in in the first place). On this blog, as a notational convention, types (and more generally type constructors, see below) start with a capital letter and variables and values don't. I think this is a useful convention, and some languages (including the language I am writing) enforce it.

A type constructor is a generalization of a type. It's either a type itself (which can be used to declare variables) or a template type constructor, a "function" of sorts which, when given another type constructor, yields a third type constructor. C++, C# and Java programmers are familiar with these "functions" - they're called templates or generics. A good example of a template type constructor is Array. You give it a type like String and you get back Array<String> (i.e. an array of strings). You can also have Array<Array<String>> if you're feeling adventurous. A type constructor can have multiple arguments, so you might have Pair<Integer, String> which (through the magic of something called currying) is the same thing as Pair<Integer><String>.

What if you have a template type constructor with a parameter that is not a type but is itself a template type constructor? Well, you can do that too - say you have a type constructor called FooCollection and you pass it Array (note, not Array<Integer> or Array<String>, which are types, just plain Array) and you get back FooCollection<Array> which might be implemented in terms of Array<Foo>. Similarly, FooCollection<List> might be implemented in terms of List<Foo>.

So if the type of the argument of square is Integer, how do we talk about the argument of FooCollection. We can't say "the type of the argument of FooCollection" because what it accepts isn't a type - a type constructor can't have a type, type is a kind of type constructor. And in fact that is exactly the word we use instead: "the kind of the argument of FooCollection is <>." Other examples of kinds are (nothing, i.e. the empty string) which is the kind of types like Integer and String (no angle brackets follow them), <><> aka <,> (which is the kind of Pair - i.e. you need to give it two types to get back a type) and <<>> (which is the kind of FooCollection - i.e. you need to give it a type constructor of kind <> to get back a type). The wikipedia article on kinds uses a different notation further removed from the C++-inspired one I use here, but the two are equivalent and the transformation from one to the other is purely mechanical.

Incidentally, you can refer to kinds in C++ as well - instead of the empty string for the kind of types, C++ uses "class" (a rather confusing overload of the word - distinct from "a structured type with default accessibility private") or "typename" which is a bit better. Instead of <>, C++ uses "template<typename>", instead of <><>, C++ uses "template<typename, typename>" (C++ has no currying) and instead of <<>> C++ uses "template<template<typename>>" (a so-called "template template parameter" - one of the more obscure areas of C++).

So we've gone from values to types to kinds, what comes next in the sequence? ("Sort?" that already has another meaning in computer science though.) Suppose you had a function which accepts a kind and returns another kind. My angle brackets here have that effect - they take a list of kinds and return a kind which is the kind of a type constructor that accepts type constructors of the listed kinds and returns a type. What a confusing sentence! Fortunately nobody to my knowledge has found a non-trivial use for anything one level up from kind. I say "fortunately" not just because it's really confusing but because we've run out of notation! If you start with a lower case letter to signify a value argument and an upper case letter to signify a type constructor argument, how do you signify a kind argument? Also, if we'd need to use different brackets (I almost wrote "a different kind of brackets" which would further confuse things - writing about this stuff is hard!) to distinguish kind functions from value functions and type constructor functions, and all the ASCII brackets already have other meanings in my language.

Anyway, this is the sort of thing that you bump your head against when you write compilers for languages which support templates. Fortunately, if you're just writing normal applications in this language you'll probably never need to be concerned with it. Until you do. At which point just doing the most obvious thing should work fine without any fuss. I think it's fun to think about though.

Ideas for a reddit killer

Sunday, October 9th, 2011

I like reddit and have made it part of my daily internet schedule. I had to give it up for a while because it was too much of a time sink but then I discovered the Reddit Enhancement Suite which allows you to hide a set of subreddits from r/all, and which makes the thing much more manageable.

There are lots of things I would like to do differently, though, if I was writing a reddit-like website. One thing I don't really like is the subreddit system - rather than one big website it's more like thousands of little websites (of various sizes) which you can combine into a single feed. But it's hard for somebody submitting a link to know which sub-reddit(s) to submit it to - if you submit it to a large one it'll probably get lost in the noise, if you submit it to a small one very few people will see it, and if you submit it to too many you might get accused of spamming (not to mention having your upvotes split between submissions). Then there are massive flamewars about whether a particular post is appropriate for a particular subreddit. Readers have a similar problem - which subreddits should you subscribe to to see all the news you want to see without having to wade through too much rubbish and duplicate posts?

I'd rather have a system of tags instead of categories - people can apply multiple categories to a single post and you could set up complex rules about what you want to see ("I want to see all posts in category X unless they're also in category Y and aren't made by user Z").

Another problem is "how do I get things that I've made that I'm proud of onto reddit?" Being an honest user, I want to avoid submitting my blog posts myself ("It's not strictly forbidden to submit a link to a site that you own or otherwise benefit from in some way, but you should sort of consider yourself on thin ice.") except for things like r/somethingimade. I had a "submit this site to reddit" button here for a while but to my knowledge nobody ever pressed it. I want to contribute to reddit and make it better for people who like to read the sort of things I like to read (which are also the sort of things I like to write), but since most of the things I read on the internet I find there and most of the others have already been submitted there, there is little opportunity for that (meanwhile, the large chunks of the site are flooded with useless memes and duplicates). So I'd like a reddit-killer that allows and encourages people to submit their own stuff - maybe even have the ability to submit an RSS feed so that new entries on a blog are submitted automatically.

That would mean a lot more posts in general, which means you'd need a more scalable way than reddit's "new" queue to allow people to find things that they're interested in. I envisage a "people with similar tastes to you also liked" system a la Amazon or Netflix (though Amazon's recommendation system is almost completely useless - no, the fact that I bought a bicycle wheel doesn't mean I'm interested in buying lots of slightly different kinds of bicycle wheel). So I'd want to automatically "like" or "upvote" my own posts, then they'll be shown to people who have liked my posts in the past, they will then upvote causing it to be shown to people similar to them and so on until it's traversed the internet or been downvoted into oblivion by the users most similar to me. Comments made in reply to posts or other comments could be handled the same way - I'd see interesting comments on my feed interspersed with the posts rather than having to go and check interesting posts/comments for interesting comments individually.

Yet another difference I'd like is to do with timeliness. Reddit is all about the newest, hottest thing that's just happened - posts more than a day or two old don't tend to stick around (unless you visit a quiet subreddit specifically). But I'm not not interested in the best new thing, I'm just interested in the best new thing I haven't seen yet. I'd rather see an old-but-good post I haven't yet seen than the best of the new stuff if the older one is better still. Now, this does have a danger of making the site an even bigger time-sink than Reddit is (since the pool of all stuff ever is bigger than the pool of stuff that's new in the last couple of days), so there would need to be some way to make it managable. Once you start downvoting a certain percentage of things it shows you, that probably means its time to stop showing you stuff. Also, the ability to say "don't show me anything new between the hours of 9am and 5pm Monday to Friday" (for example) would probably be a useful feature.

If you know your friend is on this site and has an interest in a particular sort of post, you might want to make sure he sees a particular post by recommending it to him - then the site will boost the score of that link for that friend (to an extent determined by his measure of your reputation). If the site becomes sufficiently popular then sharing links this way is far superior to emailing them to your friends or posting them on your facebook page, for both recommender and recommendee. With text posts and privacy controls, it could supplant these things altogether - i.e. you could read your email through the site and it would rank the unread items in your inbox alongside other things recommended for you. In other words, it becomes your "what have you got to show me" interface to the web just as Google is your "I'd like to search for things about" interface and Wikipedia is your "I'd like to find out the basic facts about" interface.

A particularly time-consuming part of reddit is watching videos and listening to audio, but these do have the advantage that they can be done in the background. Perhaps this site could take advantage of that by having a video/audio feed separately from the text feed, which plays in the background while you're reading the text feed and doing other things.

A debugging story from my Microsoft days

Saturday, October 8th, 2011

My debugging by thinking story yesterday reminded me of another tricky debugging problem I encountered while I was working at Microsoft.

We had a bunch of bugs that had come in through "Watson", the system that receives automatic bug reports from users' machines (the "This program has encountered a problem. Send error report to Microsoft?" dialog) and categorizes them in an attempt to find common causes. Yes, people really do look at those reports and act on them. We had a mandate to fix all the Watson reports with more than some number (10 or so) of occurrences. Watson reports with too few crashes aren't worth looking at (they're usually due to some hardware fault on one particular user's machine or something similar).

Most of these bugs had been dealt with, but there was one particular one that was rather tricky - at least one other developer had tried to figure it out, given up on it and thrown it back into the pile for someone else to look at. Somehow it ended up on my plate. It wasn't a huge number of crashes, but it was too many to ignore (maybe hundreds - the top Watson reports had thousands).

Developers don't (or didn't, the situation might have changed since I left) usually like looking at Watson bugs because they contain very little information - usually just a call stack and a list of the modules (DLLs) loaded into the process along with their versions. The call stack is the important bit for developers - they can see what piece of code the program was executing when it died, what piece of code called that code and so on. Often it includes useful variable values as well (function arguments and local variables). Most Watson bugs are treated as just failures to validate parameters: "This function should never be called with a NULL pointer argument but given this Watson report we can see that somebody did just that. We can't tell how it got into that state but we can add a NULL check and either throw an exception or return an error code here so that there's a chance the user will see a useful error message instead of crashing the entire program."

This particular bug could not be dismissed so easily - it was a double-release of some reference-counted COM pointer, and the nasty thing about these is that where they are detected bears absolutely no relation to the piece of code that actually did the double-release. So adding a check at the point where the program was crashing would not have helped at all. Sometimes you can fix these by just looking at all the pieces of code that use the object in question, but in this case that was out of the question - this was a very common object used in dozens if not hundreds of disparate places.

After banging my head against it for a few hours, I stepped back and thought "what information do we have to work with here?" The call stack wasn't much help, so on a whim I turned to look for clues in the module list. As it turned out, the critical piece of information I needed was right there, hiding in plain sight. At the time (I'm not sure if it's still true) there were two (slightly different) implementations of the managed project system in Visual Studio - one for Visual Basic and the other for C#. As it turned out, one thing that all the occurrences of this bug had in common was that the C# project system was loaded in all of them, and the VB project system was loaded in none of them (or it might have been the other way around, I don't recall). That fingered something C#-specific as the culprit.

As it turned out, there was precisely one place where the object in question was used by the C# project system but not the VB project system. Close inspection of this particular function showed that there was a very bizarre set of circumstances in which maybe-just-maybe the object could get double-released if everything lined up just right and something else had gotten into a bad state first. We had no idea how to trigger that particular set of circumstances so no way to be sure if this was the right fix, but making that change was enough to close the bug. I left Microsoft before finding out if it really was the right fix or not (something you'd only be able to tell by trying to cross reference the set of Watson reports from the more recent version of VS and seeing whether or not there was an equivalent report for the later version - and even that wouldn't be 100% reliable since something else could have changed to make the bug go away).

Still, the lack of definitive proof didn't stop me from feeling rather proud of myself for figuring that one out.