Archive for July, 2010

Bailout thoughts

Thursday, July 15th, 2010

In 2008, the US government spent a large amount of money bailing out financial organizations troubled as a result of the subprime mortgage crisis. The fiscal conservative in me thinks that maybe socializing losses isn't such a good idea, and that doing so will cause the invisible hand to engineer boom and bust cycles to siphon large amounts of public money towards private financial institutions again in the future.

On the other hand, the consensus seems to be that it has worked, things are looking up and it cost less than expected, so maybe it was the right thing to do.

I remember reading at the time about the disastrous consequences that were predicted if TARP wasn't passed - banks would fail, making it very difficult for ordinary businesses (who didn't cause the crisis) to get credit to grow and/or continue operating.

The obvious answer is to let the banks fail and to bail out these innocent businesses instead - lending to them directly instead of lending to the banks so the banks can lend to the businesses. But figuring out which businesses are good ones to lend money to is a rather complicated and difficult process in itself - one that the government isn't really set up to do. It makes perfect sense for the government to outsource that work to private institutions (i.e. banks) who were doing it anyway (and not doing altogether too bad of a job at that side of it).

On the other hand, how can we stop this happening again in the future? I think the answer is to tie bailout money to the enactment of regulations designed to stop this happening again. In particular, for the current economic crisis it seems like the conflict of interest between credit rating agencies and the issuers of securities was a major cause, and I suspect that well-placed regulations there could prevent similar crises.

The fundamental differences between Windows and Linux

Wednesday, July 14th, 2010

When I started to get deeper into Linux programming, I found it fascinating to learn about the design decisions that were made differently in Windows and Linux, and the history between them.

One example in particular is how dynamic linking is done. Windows dynamic link libraries (DLLs) are compiled and linked to load at a particular address in memory. When a DLL is loaded, the loader inserts the addresses of its exported functions into the import tables of the modules that use it. A DLL can be loaded at a different address (in the case that its preferred address is already in use), but then the loader must relocate it, which is an expensive operation and prevents the DLL's code from being shared with other processes. Once loaded, however, code in a DLL is no slower than any other code on the system (calls from one module to another are slightly slower than calls within a module though, since there is a level of indirection involved).

With Linux, on the other hand, the equivalent (shared object files) are compiled to use position-independent code, so they can be loaded anywhere in memory without relocation. This improves process startup speed at the expense of runtime speed - because absolute addresses cannot be used, in situations where they would otherwise be used the load address must be found and added in.

Another way Linux improves startup speed at the expense of runtime speed is lazy binding of function calls. In a Linux process, a call into a shared library is not normally resolved until the first time it is made. Function pointers initially point to the resolver, and when a function is resolved that pointer is replaced with the address of the resolved function. This way, the loader doesn't spend any time resolving functions that are never called.
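To illustrate the mechanism (this is just a toy model of the idea, not the actual glibc PLT/GOT machinery):

#include <cstdio>

// Toy model of lazy binding: the slot for a library function initially points
// at a resolver. The first call goes through the resolver, which looks up the
// real function, patches the slot and forwards the call; later calls go
// straight to the target.

static void GreetResolver();
static void (*greetSlot)() = GreetResolver;   // initially points at the resolver

static void RealGreet() { std::puts("hello from the library"); }

static void GreetResolver()
{
    greetSlot = RealGreet;   // resolve once and patch the slot
    greetSlot();             // forward this first call
}

int main()
{
    greetSlot();   // first call: resolved lazily
    greetSlot();   // second call: goes directly to RealGreet
    return 0;
}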

It makes perfect sense that Linux should sacrifice runtime speed for startup speed given the history of Unix. The first versions of Unix had no multithreading - each thread of execution (process) had its own memory space. So you needed to be able to start processes as quickly as possible. With the fork() system call, a new process could be created by duplicating an existing process, an operation which just involved copying some kernel structures (the program's data pages could be made copy-on-write). Because process startup was (relatively) light, and because of Unix's philosophy that a program's responsibilities should be as limited as possible (and that complex systems should be made out of a number of small programs), processes tend to proliferate on operating systems modelled on Unix to a much greater extent than Windows.
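A minimal sketch of that fork() idiom (error handling omitted):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    pid_t child = fork();   // duplicate this process; data pages become copy-on-write
    if (child == 0) {
        // Child: runs with a logically separate copy of the address space.
        std::printf("child, pid %d\n", (int)getpid());
        _exit(0);
    }
    // Parent: wait for the child to finish.
    waitpid(child, nullptr, 0);
    std::printf("parent, pid %d\n", (int)getpid());
    return 0;
}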

However, this does mean that a program developed with Windows in mind (as a single monolithic process) will tend to run faster on Windows than on Linux, and a program developed with Linux in mind (as many small cooperating processes continually being created and destroyed) will tend to run faster on Linux than on Windows.

Another way in which Linux and Windows differ is how they deal with low memory situations. On Linux, a system called the "OOM killer" (Out Of Memory killer) comes into play. The assumption is that if a machine is running too low on memory, some process or other has gone haywire and is using it all. The OOM killer tries to figure out which process that is (based on which processes are using a lot of memory, and which critical system processes are trusted not to go haywire) and terminates it. Unfortunately it doesn't always seem to make the right choice, and I have seen Linux machines become unstable after they run out of memory and the OOM killer kills the wrong thing.
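As an aside, a process can influence the OOM killer's choice of victim. A sketch, assuming a kernel that exposes /proc/&lt;pid&gt;/oom_score_adj (older kernels have a similar /proc/&lt;pid&gt;/oom_adj); lowering the score normally requires root or CAP_SYS_RESOURCE:

#include <cstdio>

int main()
{
    // Ask the OOM killer to leave this process alone; -1000 exempts it
    // entirely on kernels with oom_score_adj (given sufficient privileges).
    if (std::FILE* f = std::fopen("/proc/self/oom_score_adj", "w")) {
        std::fputs("-1000", f);
        std::fclose(f);
    }
    // ... the rest of the (hopefully well-behaved) program ...
    return 0;
}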

Windows has no OOM killer - it will just keep swapping memory to disk and back until you get bored and kill the offending process yourself or reboot the machine. It's very easy to bring a Windows machine to its knees this way - just allocate more virtual address space than there is physical RAM and cycle through it, modifying each page as rapidly as possible. Everything else quickly gets swapped out, meaning that even bringing up the task manager to kill the program takes forever.
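The offending program is only a few lines. A sketch (the 8GB figure is just a stand-in for "more than physical RAM"; don't run this on a machine you care about):

#include <cstddef>
#include <vector>

int main()
{
    // Allocate more virtual memory than there is physical RAM...
    const std::size_t size = 8ull * 1024 * 1024 * 1024;   // 8GB, adjust to taste
    std::vector<char> buffer(size);
    // ...then keep dirtying every page so nothing can stay resident for long.
    for (;;)
        for (std::size_t i = 0; i < size; i += 4096)
            ++buffer[i];
}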

User code in kernel mode

Monday, July 12th, 2010

Most modern operating systems have a great deal of complexity in the interface between user mode and kernel mode - 1100 syscalls in Linux 2.4.17 and some 400-odd in Windows Vista. This has implications for security (since all those calls need to be hardened against possible attack) and also for how difficult it is to learn to program a particular platform (albeit indirectly, since the syscall interface is not usually used directly by application programs).

It would be much better if the system call interface (and the programming API on top of that) were as minimal as possible whilst still securely multiplexing hardware and abstracting away hardware differences.

The reason it isn't is that doing so would make an operating system extremely slow. It takes some 1000-1500 cycles to switch between user mode and kernel mode. So you can't do a syscall for each sample in a realtime-generated waveform, for example. There are many other ways that an operating system could be simplified if not for this cost. TCP, for example, is generally implemented in the kernel despite the fact that it is technically possible to implement it in userspace on top of a raw IP packet API. The reason is that network performance (especially server performance) would be greatly impacted by all the extra ring switching that would be necessary whenever a packet is sent or received.
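That cost is easy to measure. A rough sketch for Linux, using syscall(SYS_getpid) to force a genuine kernel round trip (the exact figure varies with CPU and kernel):

#include <sys/syscall.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int iterations = 1000000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        syscall(SYS_getpid);   // a trivial syscall: the cost is almost all mode switching
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
    std::printf("%.0f ns per user/kernel round trip\n", ns);
    return 0;
}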

A much simpler system API would be possible if user programs could run code in kernel mode - that way they could avoid the ring switch and all the other associated overhead. This was the way things used to be done, back in the days of DOS. Of course, back then every application you used needed to support every piece of hardware you wanted to use it with, an application crash would take down the entire system and there was no such thing as multi-user. We certainly don't want to go back to those bad old days.

Microsoft's Singularity OS (as I've mentioned before) solves this problem by requiring that all code is statically verifiable, and then runs it all in ring 0. Given how much performance enthusiasts like me like to write highly optimized but completely unverifiable code, I think it will be a while before memory protection becomes a thing of the past.

But maybe there's a middle path involving both approaches - unverified code using the MMU and verified code running in the kernel. A user application could make a system call to upload a piece of code to the kernel (perhaps a sound synthesis engine or a custom TCP implementation) and the kernel could statically verify and compile that code.
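Purely to make the idea concrete, a hypothetical sketch of what such an interface might look like - these names are invented for illustration and the verifiable code format is left unspecified:

#include <cstddef>
#include <cstdint>

// Hypothetical syscalls - nothing like this exists under these names.

// Upload a blob of verifiable intermediate code. The kernel statically checks
// it (memory safety, bounded resource use, ...) and compiles it to run in
// ring 0. Returns a handle, or -1 if verification fails.
extern "C" int sys_upload_verified(const uint8_t* code, std::size_t length);

// Attach the uploaded code as an in-kernel handler - for example a custom TCP
// implementation run per received packet, or a synthesis engine run per audio
// block - so no user/kernel transition is needed per event.
extern "C" int sys_attach_handler(int handle, int eventSource);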

With a suitably smart compiler, parts of the kernel could also be left in an intermediate form so that when the user code is compiled, the kernel implementations could be inlined and security checks moved outside of loops for even more speed.

Another thing that such a system would make practical is allowing applications to subvert the system calls of less trusted applications.

Optimizing by taking advantage of undefined behaviour

Sunday, July 11th, 2010

Undefined behavior doesn't sound like something you'd ever deliberately introduce into a program's codebase. Programmers generally prefer their programs' behaviors to be fully defined - nasal demons are rarely pleasant things.

However, there are sometimes good reasons for adding a statement to your program with undefined behavior. In doing so, you are essentially telling the compiler "I don't care what happens if control flow reaches this point", which allows the compiler to assume that control flow will never get to that point if doing so will allow it to better optimize the parts of the program that you do care about. So it can be an optimization hint.

A good way to use this hint is to put it in the failure path of assert statements. In checked/debug builds, assert works as normal (does the test and terminates the application with a suitable message if the test fails) but in release/retail builds assert checks the test and does something undefined if the test fails. The compiler can then skip generating the code to actually do the test (since it's allowed to assume that it will never fail) and can also assume that the same condition is false in the subsequent code, which (if it's sufficiently smart) will allow it to generate better code.

GCC has a built-in function for just this purpose: the __builtin_unreachable() function is specified to have undefined behavior if it is ever called, and is actually used in just this way. I think this is really clever, in a slightly twisted way.
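A sketch of such an assert, assuming GCC (MSVC's __assume(0) intrinsic plays a similar role):

#include <cassert>

#ifdef NDEBUG
// Release build: the test can be optimized away entirely, and the compiler is
// free to assume the condition holds in the code that follows.
#define ASSERT(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
// Debug build: a normal assert.
#define ASSERT(cond) assert(cond)
#endif

int Divide(int a, int b)
{
    ASSERT(b != 0);   // in release builds this tells the compiler b is nonzero
    return a / b;
}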

What happened to DirectSound?

Saturday, July 10th, 2010

Some years ago, I was tinkering about with some sound synthesis code in Windows. This was in the pre-Vista days, and the officially recommended way of doing sound output was using DirectSound. My application was interactive, so I wanted to make it as low latency as possible without having too much overhead. I set it up with a buffer size of 40ms (3,528 bytes). I then set up IDirectSoundNotify with two pointers, one at the middle of the buffer and one at the end. Whenever one was triggered I would fill the other half of the buffer. This all worked great.
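In case it's useful, the notification setup looked roughly like this (a sketch from memory, error handling omitted; it assumes the secondary buffer was created with the DSBCAPS_CTRLPOSITIONNOTIFY flag):

#include <windows.h>
#include <dsound.h>

// Two auto-reset events: one fires at the middle of the buffer, one at the
// end; whichever fires, the other half of the buffer gets refilled.
void SetUpNotifications(IDirectSoundBuffer8* buffer, DWORD bufferBytes, HANDLE events[2])
{
    events[0] = CreateEvent(NULL, FALSE, FALSE, NULL);
    events[1] = CreateEvent(NULL, FALSE, FALSE, NULL);

    DSBPOSITIONNOTIFY positions[2];
    positions[0].dwOffset = bufferBytes/2 - 1;
    positions[0].hEventNotify = events[0];
    positions[1].dwOffset = bufferBytes - 1;
    positions[1].hEventNotify = events[1];

    IDirectSoundNotify8* notify = NULL;
    buffer->QueryInterface(IID_IDirectSoundNotify8, (void**)&notify);
    notify->SetNotificationPositions(2, positions);
    notify->Release();
}

// Playback loop: wait for either event, then lock and refill the half of the
// buffer that just finished playing.
// DWORD which = WaitForMultipleObjects(2, events, FALSE, INFINITE) - WAIT_OBJECT_0;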

At least, it did until I came back to the code some years later, after having upgraded the OS on my laptop to Windows Vista (the hardware hadn't changed). Suddenly this code which worked great before sounded horrible - the sound was all choppy as if only half of the buffers were being played. What happened?

After some experimentation, I discovered that the jittering happened with a buffer of less than 7,056 bytes (80ms) but not with larger buffers. Armed with this evidence and a bit of pondering, I have a good theory about what happened.

The Windows audio system was drastically rewritten in Windows Vista and DirectSound was re-implemented - instead of a thin layer over the driver, it became a high-level API implemented on top of the Windows Audio Session API (WASAPI). In doing so, it lost some performance (it's no longer hardware accelerated) and, it seems, suffered an increase in latency - the DirectSound implementation uses a 40ms buffer (I think). That's all very well, but there's a bug in the implementation - IDirectSoundNotify fails to trigger if the positions are more closely spaced than this.

The preferred API for this kind of thing is now XAudio2 which is actually a little bit nicer (the code to do the same thing is slightly shorter) and works on both XP and Vista (unlike WASAPI). I can't really fault Microsoft too much since apparently this particular use of IDirectSoundNotify is rather unusual (or they would have made it work) but still it's annoying that DirectSound went from being the recommended API to buggy and (practically if not technically) deprecated in a single Windows version. Still, I understand that the world of Linux audio is even worse (albeit getting slowly better).

I wonder why audio APIs seem to go through so much churn relative to graphics (given that audio is that much less complicated, and that the hardware isn't really changing much any more). I sometimes wish all the audio APIs were as simple as a "get the next sample" callback API but I guess this is too CPU intensive, or at least was when the APIs were designed.

Website for making websites

Friday, July 9th, 2010

This is an idea I came up with when I was working on Argulator. Making websites like this involves a lot of very fiddly work - keeping the server-side and client-side bits synchronized, implementing login/logout stuff and so on. It would be great if there was a site (a "MetaSite" if you will) that let you create data-driven web applications whilst writing a minimal amount of code - sort of like how WordPress lets you make a blog whilst writing a minimal amount of HTML and CSS.

MetaSite would have, built in, the ability to create user accounts, login, logout, change password and so on. The users thus created can then create their own sites and interact with sites built by others. Some sites might require email verification and/or a captcha solve before they will allow the user to do certain things - MetaSite would take responsibility for that and once these tasks are done they don't need to be redone for each application.

There would be an "admin hierarchy" in that an admin of a site can appoint other users as admins with special powers (moderation and so on) who can then further delegate their powers. Upon withdrawing power from an admin, that power is then withdrawn from the closure of admins they've delegated to.

Users would be given tools to allow them to specify which sites can access the information stored by which other sites. One such "built in" site might be a repository of personal information.

Another useful "built in" site would be messaging - a virtual inbox which sites can use to notify their users when events of interest happen.

Yet another useful "built in" would be a shopping cart application which lets applications act as online shops. So if you've written a site and you decide to add a feature which you want to charge users for, it's as simple as just ticking a box. Since payment is centralized with MetaSite, it would be possible to do micropayments (making a single credit card charge to buy "tokens" which can be spent on any MetaSite site).

So far, nothing too unusual - most of this has already been done. Where MetaSite really comes into its own is its rapid application development tools. If you want to create a blogging service, a photo hosting service, an Argulator clone, a wiki or whatever else, it's just a matter of defining what the "objects" that are stored in your system are and how users can interact with them. MetaSite would have various widgets built in, so if you define one field to be a date and say that users can edit this date, all the code for a calendar control widget would be automatically instantiated. All the defaults would be secure, reasonably attractive and AJAXified so that the site is nice to use. When developers do need to resort to code it would be written in a sandboxed language (not just uploading raw PHP to the site, that would be a security nightmare). This language would have intrinsics which abstract out all the web-specific stuff and allow developers to just concentrate on their application domain.

This is the big difference between MetaSite and Facebook - if you want to create a Facebook application you need to have your own web server and you need to write the server-side code to run on it. MetaSite would have a very shallow learning curve - making a new site should be as easy as starting a blog.

MetaSite applications would be limited in the amount of storage space and CPU time they can use, so that they don't adversely affect other sites. One way MetaSite could make money would be to sell extra CPU time and storage space to sites that need it. MetaSite could also make it easy for site admins to add advertising to their sites to monetize them and/or pay for such extras. Another value added feature might be the ability to run a MetaSite site with a custom domain name, or even on your own server.

Everything on MetaSite would be customizable - objects, data editing widgets, even entire applications. In fact, the usual way of creating one of these would be to take an existing thing that's similar to the thing that you're trying to make, and modify it. These modifications could then be available to authors of other sites as starting points for their own things (in fact, I think this should be the default and the ability to keep your customizations proprietary should be a "value added" feature).

Ideally it would be the first stop for any web startup trying to get their site up and running as quickly as possible, but if nothing else it would be a way to take advantage of all those "I need a facebook clone and will pay up to $100" projects on vWorker (nee RentaCoder).

I've gone so far as to register MetaSite.org but I haven't done anything with it yet.

VC++ inline assembly hack

Thursday, July 8th, 2010

Here's a cute little hack I came up with when doing some fixed-point arithmetic stuff in Visual C++ a while ago. I wanted to make a templated C++ class with some inline assembly, the generation of which depends on a numeric template parameter. Doing the obvious thing doesn't work:

template<int N> class ArithmeticHelper
{
public:
    static Int5 MultiplyShiftRight(Int5 x, Int5 y)
    {
        __asm {
            mov eax, x
            imul y
            shrd eax, edx, N   // this is the part that doesn't work: the inline assembler doesn't know what N is
        }
    }
};

However, one can introduce numeric parameters into inline assembly code by pretending that they are the length of something, and using the "length" operator:

template<int N> class ArithmeticHelper
{
private:
    static const int nn[N];  // VC's inline asm doesn't allow "shl eax, N" directly, but this works...
public:
    static Int5 MultiplyShiftRight(Int5 x, Int5 y)
    {
        __asm {
            mov eax, x
            imul y
            shrd eax, edx, length nn   // "length nn" evaluates to N at assembly time
        }
        // The result is left in EAX, which VC++ uses as the return value
        // (it warns about the missing return statement, but this is intentional).
    }
};
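Instantiating the template then supplies the shift count - something like this, with 22 just an arbitrary example value and Int5 the fixed-point type used above:

Int5 product = ArithmeticHelper<22>::MultiplyShiftRight(x, y);   // computes (x*y) >> 22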

Most different colours

Wednesday, July 7th, 2010

For various visualization programming tasks, it's useful to have a set of colours that are as far apart as possible. I set out to write a program to generate such sets.

The first problem is that we must work in a perceptual colour space such as Lab rather than the usual sRGB, or our colours will be overly concentrated in the areas where sRGB over-represents.

Next, I set up a system of mutually repelling particles and let them arrange themselves (sliding along the sides and edges as required). This didn't work too well even just in sRGB space - the trouble is that I don't care about maximizing the distance to the most distant colours, only to the nearest ones. I thought about doing a Delaunay triangulation to find the nearest neighbours of each point, but it turns out that's overkill - all you need to do is repel each particle from its 6 nearest neighbours (and recalculate which particles those are after each frame). If you end up with the closest points all on one side, the particle will be repelled away until one of the closest points is on the other side.
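A sketch of one relaxation step under that scheme - here the "slide along the gamut boundary" part is simplified to clamping to a unit cube, which is just a stand-in for the real sRGB-gamut-in-Lab handling:

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct Particle { double x, y, z; };

// One relaxation step: repel each particle from its 6 nearest neighbours
// (recomputed every step), then push it back inside the gamut.
void Relax(std::vector<Particle>& ps, double strength)
{
    std::vector<Particle> next = ps;
    for (size_t i = 0; i < ps.size(); ++i) {
        // Distances to every other particle, sorted so the 6 nearest come first.
        std::vector<std::pair<double, size_t>> d;
        for (size_t j = 0; j < ps.size(); ++j) {
            if (j == i)
                continue;
            double dx = ps[i].x - ps[j].x, dy = ps[i].y - ps[j].y, dz = ps[i].z - ps[j].z;
            d.push_back(std::make_pair(dx*dx + dy*dy + dz*dz, j));
        }
        size_t k = std::min<size_t>(6, d.size());
        std::partial_sort(d.begin(), d.begin() + k, d.end());
        // Inverse-square repulsion from those nearest neighbours only.
        for (size_t n = 0; n < k; ++n) {
            const Particle& q = ps[d[n].second];
            double dx = ps[i].x - q.x, dy = ps[i].y - q.y, dz = ps[i].z - q.z;
            double r2 = dx*dx + dy*dy + dz*dz + 1e-12;
            double f = strength / (r2 * std::sqrt(r2));
            next[i].x += f * dx;
            next[i].y += f * dy;
            next[i].z += f * dz;
        }
        // Clamp back into the gamut (stand-in for sliding along its boundary).
        next[i].x = std::min(1.0, std::max(0.0, next[i].x));
        next[i].y = std::min(1.0, std::max(0.0, next[i].y));
        next[i].z = std::min(1.0, std::max(0.0, next[i].z));
    }
    ps = next;
}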

Even with this fix, I was still getting strange results. After some head-scratching, I realized that it was due to the "kinks" in some of the edges of the sRGB gamut in Lab space.

The particles tend to get "stuck" in these kinks.

I'm not quite sure what to do about this. Perhaps I can find optimal sets in a rectilinear gamut and then gradually morph this into sRGB while continuing to let the particles repel.

Arbitrary precision Mandelbrot sets

Tuesday, July 6th, 2010

I added arbitrary precision arithmetic to my Mandelbrot zoomer. Surprisingly, the difficult part wasn't the arbitrary precision arithmetic itself, it was deciding when and how to switch from double-precision to arbitrary precision arithmetic. Because my program reuses data computed at one zoom level to speed up the computation of more deeply zoomed images, some images contained pixels computed using both methods. The trouble is, the two methods don't give the same results. In fact, after some experimentation I discovered that the problem was even worse than that: even using just the double-precision code, you get different results depending on the number of digits you use. So the point that was represented in my program by 0xffa615ce + 0x00000002i (using 10.22 bit fixed point) escaped after 548 iterations but the point 0xffa615ce00000000 + 0x0000000200000000i (using 10.54 bit fixed point) escaped after 384 iterations. The problem is that after every multiplication we shift the result right by the number of fractional bits, which performs a rounding operation. The images generated look reasonable but are actually only approximations to the true images that would be calculated if infinite precision were employed.
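To make the rounding issue concrete, here's roughly what the two multiply routines look like (a sketch assuming a compiler with a 128-bit integer type, such as GCC/Clang's __int128); the shift after each multiply is where precision is lost, and it's lost differently at the two widths:

#include <cstdint>

// 10.22 fixed point: 32-bit values with 22 fractional bits.
int32_t Multiply22(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 22);    // discards the low 22 bits
}

// 10.54 fixed point: 64-bit values with 54 fractional bits.
int64_t Multiply54(int64_t a, int64_t b)
{
    return (int64_t)(((__int128)a * b) >> 54);   // discards the low 54 bits
}

// The same mathematical point represented at the two widths therefore drifts
// differently under repeated z = z*z + c iterations, and can escape after a
// different number of iterations.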

Having determined this, I realized it would be necessary to throw away all the points computed with lower precision once we started using a higher precision. This isn't as bad as it sounds, since (when zooming in by a factor of 2) the recycled points only account for 1/4 of the computed points but if we just threw them all out at once it would result in a visible artifact when passing through the precision boundary - the entire window would go black and we'd restart calculation from scratch. I wanted to avoid this if possible, so I started thinking about convoluted schemes involving restarting a point if its 3 sibling points were of higher precision. Then I realized that I could just recycle the points but keep their colours when splitting blocks to a new precision boundary, avoiding the glitch. There are still some artifacts while the image is being computed, but they are less jarring.

Finally there was the problem of interfacing with double-precision floating-point numbers. The Mandelbrot set is contained entirely within the circle |c|<2 and in this range a double has 52 bits of fractional precision. But if we replace the 10.22 and 10.54 bit fixed point numbers with doubles, we'll have a region of blockiness where we need 53 or 54 bits but only have 52. Rather than having two sorts of 64-bit precision numbers, I decided to simplify things and have 12 integer bits in my fixed point numbers. The program is very slow when it switches into arbitrary precision mode - it's barely optimized at all. The 96-bit precision code currently has a theoretical maximum speed about 92 times slower than the SSE double-precision code (2 million iterations per second vs 190 million). It could be significantly faster with some hand assembler tuning, though - I have a 96-bit squaring routine that should speed it up by an order of magnitude or so. All of the multiplications in the Mandelbrot inner loop can be replaced with squarings, since [latex]2xy = (x+y)^2 - x^2 - y^2[/latex]. Squaring is a bit faster than multiplication for arbitrary precision integers, since you only need to do [latex]\displaystyle \frac{n(n+1)}{2}[/latex] digit multiplications instead of [latex]n^2[/latex]. Given that we are calculating [latex]x^2[/latex] and [latex]y^2[/latex] anyway and the subtractions are very fast, this should be a win overall.
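A sketch of the inner loop written with squarings only (shown with doubles for clarity; the saving only matters once the squarings are arbitrary-precision operations):

// One z = z*z + c step using three squarings and no general multiplications:
//   re(z*z) = x^2 - y^2
//   im(z*z) = 2xy = (x+y)^2 - x^2 - y^2
void MandelbrotStep(double& x, double& y, double cx, double cy)
{
    double x2 = x * x;
    double y2 = y * y;
    double s2 = (x + y) * (x + y);
    double newX = x2 - y2 + cx;
    double newY = s2 - x2 - y2 + cy;
    x = newX;
    y = newY;
}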

Waller's Torus: a four-dimensional musical instrument

Monday, July 5th, 2010

Consider the major and minor chords in Western music. There are 24 of them altogether (with the normal twelve-tone equal temperament tuning system usually used): one major chord and one minor chord for each of the 12 notes in the chromatic scale (A-G and 5 accidentals).

Each of these 24 chords consists of 3 notes. Pick one of the notes in the chord and throw it away, leaving two notes. Given these two notes, one can always find two chords (one major and one minor) which use those two notes. So given any major chord one can find three minor chords which share two notes with it, and given any minor chord one can find three major chords which share two notes with it.

That suggests that the 24 chords form a sort of network or graph, with vertices corresponding to chords and edges corresponding to pairs of notes. In three dimensions this network can be arranged on the surface of a torus (Waller's Torus). You can arrange the vertices in such a way that all the edges are the same length, but some vertices will have different sets of angles between edges than others.
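To make the structure concrete, here's a sketch that enumerates the graph, assuming pitch classes 0-11 with a major triad {n, n+4, n+7} and a minor triad {n, n+3, n+7} (mod 12); two chords are joined by an edge when they share exactly two notes:

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <string>
#include <vector>

int main()
{
    // The 24 major and minor triads as sets of pitch classes (0 = A here,
    // but the labelling is arbitrary).
    const char* noteNames[12] = { "A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#" };
    std::vector<std::set<int>> chords;
    std::vector<std::string> names;
    for (int n = 0; n < 12; ++n) {
        chords.push_back({ n, (n + 4) % 12, (n + 7) % 12 });   // major triad
        names.push_back(std::string(noteNames[n]) + " major");
        chords.push_back({ n, (n + 3) % 12, (n + 7) % 12 });   // minor triad
        names.push_back(std::string(noteNames[n]) + " minor");
    }
    // Print each chord's neighbours: the chords sharing exactly two notes.
    // Every vertex comes out with degree 3, as described above.
    for (size_t i = 0; i < chords.size(); ++i) {
        std::printf("%s:", names[i].c_str());
        for (size_t j = 0; j < chords.size(); ++j) {
            if (i == j)
                continue;
            std::vector<int> shared;
            std::set_intersection(chords[i].begin(), chords[i].end(),
                                  chords[j].begin(), chords[j].end(),
                                  std::back_inserter(shared));
            if (shared.size() == 2)
                std::printf(" %s", names[j].c_str());
        }
        std::printf("\n");
    }
    return 0;
}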

In 4 dimensions, the symmetries of the network are more apparent - the vertices can be arranged such that each edge is the same length and all the vertices have the same set of angles between edges. The resulting object isn't a polychoron because only three edges meet at each vertex, but it isn't a polyhedron either because it does extend into all 4 dimensions. The analogous figure in 3D might be something like a star shape - if you draw it in 2D the inner points will be closer to each other than the outer points, but in 3D you can arrange the points on the surface of a cylinder (sort of like the points one might find at the top of a paper party hat from a Christmas cracker). Here's what it looks like:

Using this figure, one could generate a nonlinear mapping from points in a 4-dimensional space to chords. Not just major and minor chords either - other chords are represented by points other than the vertices. At each vertex you can move in the direction of one of the edges to change the pitch of one of the notes. Since there are only three degrees of freedom in a chord (one for the pitch of each note) there is also a direction at each point in 4-space which leaves the chord unchanged as you move along it.