Language optimized for refactoring

October 24th, 2008

One property of computer languages that is important but often seems to be overlooked is how easy it is to refactor programs written in them.

The one example that springs immediately to mind is renaming a class. In C++ this is a bit more difficult than in many languages because the constructors and destructors have the same name as the class, so you have to go and change all of those too. PHP wins here for calling them __construct and __destruct respectively.
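
Python takes the same approach as PHP here: the constructor is always spelled `__init__`, so renaming a class touches only the `class` line and the call sites. A minimal sketch (the `Account` class is hypothetical, purely for illustration):

```python
class Account:
    """Renaming this class does not require renaming any of its methods."""

    def __init__(self, owner):
        # The constructor name is fixed by the language, not tied to the class name.
        self.owner = owner

    def describe(self):
        # type(self).__name__ picks up whatever the class is currently called.
        return f"{type(self).__name__} owned by {self.owner}"
```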

If you are in the school of thought that has C++ method definitions in a separate file (e.g. .cpp) to class declarations (.h), you have to go and change things in two different files (even if you’re just adding a method that nobody calls yet). If that class implements a COM interface defined by a .idl file then there’s yet another thing you need to change.

Python’s syntactically-significant whitespace is another winner here because if (for example) you put another statement in an “if” clause that currently only has one statement, you don’t have to add braces.

I’m sure there are many other, deeper examples.

Once you go OOP, there’s no going back

October 23rd, 2008

Object Oriented Programming is at least as much a state of mind as a set of programming language facilities. When I learnt C++ it was a bit difficult to get used to writing object-oriented programs but now that I’ve been doing it for many years I can’t get used to thinking about my programs any other way.

I was writing some PHP code recently and (not knowing about PHP classes) started writing it in a procedural fashion. After a while I noticed that many of the functions I was writing started to fall naturally into classes (with a first parameter that gave the function context). So it was only natural to re-write it in object-oriented style once I figured out how to do so.
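
The pattern is easy to see in miniature. The sketch below is in Python rather than PHP, and the shopping cart is a made-up example, but it shows how functions that all take the same context as their first parameter turn into methods almost mechanically:

```python
# Procedural style: every function takes the same "context" as its first parameter.
def cart_add(cart, name, price):
    cart["items"].append((name, price))


def cart_total(cart):
    return sum(price for _, price in cart["items"])


# Object-oriented style: the shared first parameter becomes `self`.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)
```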

In the process of doing so, I found lots of bugs in my original code (which I had thought was rather nifty). Many functions became much simpler. I also found it was much easier to do various optimizations that would have been very difficult to do without classes (such as minimizing the number of database queries). My code file did become somewhat bigger, but I attribute this to the extra indentation most lines have, and the fact that PHP requires you to write “$this->” everywhere.
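
As an illustration of the kind of optimization that gets easier, here is a rough Python sketch (not the original PHP code) of an object that remembers what it has already looked up. `UserDirectory` and `fetch_user` are hypothetical names; `fetch_user` stands in for whatever function actually runs the database query.

```python
class UserDirectory:
    """Wraps a query function and makes sure each user is fetched only once."""

    def __init__(self, fetch_user):
        # fetch_user is a hypothetical callable standing in for a database query.
        self._fetch_user = fetch_user
        self._cache = {}

    def get(self, user_id):
        # Only hit the "database" the first time a given id is requested.
        if user_id not in self._cache:
            self._cache[user_id] = self._fetch_user(user_id)
        return self._cache[user_id]
```

With free functions the cache would have to live in a global or be threaded through every call; here it is just another piece of the object's state.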

A while ago I also wrote a C program from scratch, for the first time in a very long time. I found myself using an object-oriented style and implementing vtables as structs.

JavaScript exchange site

October 22nd, 2008

Back in the 80s, most home computers used to boot into a dialect of BASIC. This made it very obvious how to start to learn to program - just type things in and try things out to see what works.

Modern computers are much richer in many ways but do have the disadvantage that it’s less obvious how to start programming. One could even be forgiven for assuming that the typical off-the-shelf Windows Vista machine doesn’t even come with a built-in programming language. Actually there are 3 (at least) - the Windows command shell language, VBScript and JScript (JavaScript). The Windows command shell language (the descendant of the MS-DOS batch language) is ugly, badly documented and almost impossible to debug so let’s skip that one. Between VBScript and JScript, the latter is better to learn because it’s cross-platform and VBScript is Windows only. There are two ways (at least) to run JScript in Windows - one is through the Windows Script Host (wscript.exe or cscript.exe) and the other is through the web browser. The latter is a graphically rich, interactive and familiar environment so I think that’s the way to go.

JavaScript is a much nicer language than the 8-bit BASIC dialects from the 80s but it’s still not very discoverable. The tutorials and reference guides are all out there but you have to have a text editor open in one window, one browser window for your program and at least one other browser window containing your reading material. I think that this is a problem that could be solved with a website.

I’d like to see a site which does for JavaScript what computers booting into a BASIC interpreter did for BASIC - a one-stop shop for (at least beginner-level) JavaScript development. It would allow you to type JavaScript code right into a web page and see its output right there on the page immediately (perhaps with separate divs within the page for the JavaScript code, the program’s output and tutorials).

The code editor might have syntax highlighting, intellisense, a built-in debugger - whatever can be provided to make programs as easy as possible to develop.

Once you’ve written some code you can save it on the website and access it from anywhere. You can also share it with friends. If one person defines an object someone else can use that object in their programs. In this way, a rich ecosystem of scripts can develop.

Another possible refinement would be for the web server itself to provide some abilities that scripts can use. Perhaps just storing a small amount of data per script per user so that scripts can do some persistent stuff, or perhaps allowing some server-side JavaScript as well as the client-side scripts, to enable the writing of rich AJAX web applications.

TODO-list management website

October 21st, 2008

I’m a big user of TODO lists. I generally keep a text editor open with at least one todo.txt file (either general or project-specific).

It would be nice to have a website to manage these lists of tasks and use them to help manage time and generate schedules. The schedules should be quite informal - each item should fall into one of three categories - tasks that should take less than a day, tasks that will probably take more than a day (and should be further broken down to get an accurate schedule) and tasks that have not yet been placed into one of the previous two buckets (more details on this costing algorithm).

The site should also have the ability to suggest the next task and allow the user to create dependencies between tasks (e.g. A must be completed before B can be started).
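
A minimal sketch of the "suggest the next task" part, in Python. It assumes the dependencies are stored as a mapping from each task to the set of tasks that must be finished first; the function name and data layout are illustrative, not a design for the actual site.

```python
def next_tasks(dependencies, done):
    """Return the tasks that are not done and whose prerequisites are all done.

    dependencies maps each task to the set of tasks that must be completed first.
    done is the set of tasks already completed.
    """
    return sorted(
        task
        for task, prereqs in dependencies.items()
        if task not in done and prereqs <= done
    )


# A must be completed before B, and both before C.
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
}
```

Calling `next_tasks(deps, {"A"})` suggests only `B`, since `C` is still blocked.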

PHP could be more secure

October 20th, 2008

Given that PHP is designed to be used to write applications that run on web servers, you’d think it would have been designed rather more with security in mind.

In particular, PHP’s dynamic typing seems to be a source of security weaknesses. Dynamic typing has advantages in rapid development and code malleability but is not particularly helpful for writing secure code - security is greatly helped by being able to restrict each variable to a specific set of values and having the compiler enforce this.

Similarly with the SQL API - because the interface is all just strings instead of strongly typed objects, SQL injection vulnerabilities become all too easy to write.
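
The difference is easiest to see side by side. This sketch uses Python’s standard `sqlite3` module rather than PHP (the table and function names are made up); the point is only the contrast between pasting input into the query string and passing it as a separate, typed parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")


def find_user_unsafe(name):
    # Vulnerable: the input is pasted straight into the SQL text.
    return conn.execute(f"SELECT name FROM users WHERE name = '{name}'").fetchall()


def find_user_safe(name):
    # The ? placeholder keeps the input as data; it is never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


malicious = "x' OR '1'='1"
```

With the string-building version, `malicious` turns the WHERE clause into something that matches every row; with the placeholder version it is just an odd-looking name that matches nothing.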

Variable scope is another one - because there are no variable declarations it’s not obvious where variables are introduced, so one could be using variables declared earlier without realizing it (this is why register_globals changed from default-on, to default-off, to deprecated to removed).

Then there are ill-conceived features like magic quotes, and missing features like cryptographically secure random number generation.
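
For comparison, this is roughly what having the secure option be the obvious one looks like. The sketch uses Python’s standard `secrets` module, not anything from PHP, and `new_session_token` is a hypothetical helper name.

```python
import secrets


def new_session_token():
    # token_hex(16) returns 32 hex characters drawn from the OS's secure random source.
    return secrets.token_hex(16)
```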

A well-designed language for web development would be secure by default when doing the most obvious thing - one shouldn’t have to go out of one’s way to learn what all the security pitfalls are and have to write to explicitly address each of them (and update your code when the next such pitfall is discovered).

New technologies from new physics

October 19th, 2008

Almost every fundamental new discovery in physics so far has yielded great advances in technology. The exception seems to be general relativity - probably because gravity is such a weak force, it’s difficult to make consumer items out of it.

I like to wonder what new technologies we could hope (in our wildest dreams) to obtain with a complete theory of physics. It might take a while, because so far we don’t know of any practical way of even getting experimental evidence for a grand unified theory, let alone making technology from those experimental results.

One possibility is new particles. Many promising theories predict various new particles. Unfortunately most particles other than the ones that make us up tend to be very short-lived and therefore don’t yield any new materials. But if we do find a new long-lived particle (and it doesn’t cause a phase transition that swallows us all up) there is a possibility of new materials heavier, lighter, stronger or with better information storage abilities than the ones we have.

Another possibility is gravitational engineering. Particularly if we can find a way to violate the weak energy condition, we might be able to build stable, traversable wormholes, time machines and other such time/space abominations.

Even more far-fetched (but also possible) would be more ways to manipulate matter and energy, as in The Trigger and Ed stories.

You can’t learn something until you already almost know it

October 18th, 2008

This is one of those ideas that seem completely obvious when you first hear of it, but once you’ve been made aware of it you keep noticing it again and again.

When learning something, you have to have a frame of reference in which to place the new piece of knowledge, or you can’t understand it. This is why trying to teach can sometimes be a very frustrating experience - you might think that something is completely obvious and can’t understand why your student cannot understand it, but that’s because your student doesn’t yet have the scaffolding required to hold up that understanding, scaffolding that you’re taking for granted. Whenever you are frustrated by someone’s lack of understanding, try to imagine what their scaffolding looks like and give them the next piece from the set of pieces that are missing.

This also sometimes sets the pace at which you can learn something completely new and unfamiliar - there are lots of pieces of scaffolding missing and you need to take each one and internalize it before you can understand the next. Since it isn’t always obvious what the “next” piece should be, sometimes you have to read the whole textbook to get each piece. The problem isn’t memorizing lots of facts (though that helps); it’s slotting each piece into the framework.

If you’ve read the information about the next piece but haven’t yet internalized it, sleeping on it can help. When you dream your mind is playing a kind of tetris, sorting things out and slotting things into gaps so that it all fits together.

This theory also explains why young children want to have the same books read to them over and over again - they start off knowing nothing (not even how to learn) so they seek out familiar patterns. In the context of that repetition, a new piece of scaffolding will occasionally drop into place. When that happens, there is a satisfying “Ah ha!” feeling associated with it. We have somehow evolved a mechanism to recognize this event and derive pleasure from it in order to give us a drive for learning.

Dual time

October 17th, 2008

The way we measure time is very complicated and difficult to get right, with all those time zones each with different daylight savings time rules. Perhaps we should rethink the whole thing.

The root of the problem is that there are two contradictory things we use time for - one is coordinating between people and the other is telling us when it will be dark. Time zones worked fine until we started collaborating globally, across time zones. And daylight savings time is a hack to avoid sunrise being too early or too late in the day.

Perhaps instead all our clocks should show two times, “global time” (i.e. UTC) and “local time” (i.e. the time such that the sun will rise at 6am in the place where the clock is). GPS could be used to make the local time clock adjust itself so that this was always true. One could also get something like our current timezones by having “standard time points” on the Earth’s surface - one would tell one’s clock to pretend it was at one of these points in order that all the clocks in a particular region agree (useful for things like television broadcasts).

“Global time” would be used for things like coordinating international teleconferences and “Local time” would be used to tell the farmer when to get up and milk the cows.
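
A very rough sketch of the “local time” calculation, in Python. It assumes that something else (a GPS fix plus a sunrise calculation, say) supplies the UTC instant of the most recent sunrise at the clock’s position; `local_clock` and `sunrise_utc` are names invented for this example.

```python
from datetime import datetime, timedelta, timezone

DAY = timedelta(days=1)


def local_clock(utc_now, sunrise_utc):
    """Reading of the "local time" clock, as time elapsed since local midnight.

    sunrise_utc is the UTC instant of the most recent sunrise at the clock's
    position (assumed to be supplied by something else).
    """
    # By definition the clock reads 06:00 at sunrise, so count on from there.
    return (timedelta(hours=6) + (utc_now - sunrise_utc)) % DAY
```

This simply counts ordinary hours on from sunrise. A real clock would have to stretch or shrink the day slightly so that it also reads exactly 06:00 at the next sunrise.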

I think having days that were a couple of minutes shorter in the spring and a couple of minutes longer in the autumn would not be particularly confusing (we probably wouldn’t even notice apart from the fact that the difference between global time and local time goes up and down with the passing of the seasons).

Marginal cost of a vote

October 16th, 2008

Suppose you are in charge of a large political campaign (like, say, the ones for Obama and McCain that are going on at the moment here in the US). You have a certain amount of money to spend and want to spend that money in a way that will make as many people vote for your candidate as possible. As always with such things, there are bound to be more things that you can conceive of doing than there is available money, so you have to choose only the things that meet a certain “expected number of voters swung per dollar” threshold. I wonder what that threshold is? I.e. if you give $100 to a political campaign, how many extra votes does that buy?

This isn’t the same as the total number of votes cast for a candidate divided by the total amount that campaign spent, because some of those voters would have voted for that candidate anyway without the campaign spending any money. I’m interested in the marginal cost of a vote.

I’m sure the figure varies from day to day (if the other candidate makes a big gaffe, you can probably exploit that to swing a lot of voters relatively cheaply) and from state to state (votes in swing states are more valuable than votes in safe states, so it’s worth spending more to swing them). I expect it also varies depending on how much the other campaign spends (since it costs money to undo their work). I’m sure the political campaigns do calculations to figure this stuff out - it would be interesting to see their statistics.

Algorithm for finding “hot” records in a database

October 15th, 2008

Suppose you have a database, and (as often happens with databases) records change from time to time. Suppose also that you’d like to maintain a list of the “hottest” records, that is the ones which have been changing a lot lately.

The first thing you have to determine is whether you want to put the emphasis more on “a lot” or “lately” - i.e. you need to have a characteristic time $t_c$ such that $n$ changes made $t_c$ ago are equivalent to $n/e$ changes now. This time determines how quickly changes “decay” into irrelevance. Depending on your application, this might be a day or so.

The next thing you might try is to keep a table of all the changes made, along with a time for each. Then you can just weight the change times according to how long ago they are and add them up. That’s going to be a big table and an expensive operation, though.
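
In Python, that straightforward version might look something like this (a sketch, assuming timestamps are plain numbers in the same units as the characteristic time; the function name is made up):

```python
import math


def activity_from_log(change_times, now, tc):
    """Sum one exponentially decayed weight per logged change.

    A change made just now counts as 1; a change made tc ago counts as 1/e.
    """
    return sum(math.exp(-(now - t) / tc) for t in change_times)
```

Every call walks the whole change log for the record, which is what makes it expensive.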

A clever trick is to use a running average and “last changed” timestamp in each row of the original table. The running average starts off at 0. Each time the row is modified, calculate the number of characteristic times since the last change $N = (t_{now}-t_{last})/t_c$, update the average by multiplying it by $e^{-N}$ and adding 1, and then update the old “last changed” timestamp to $t_{now}$ for the next change.
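
A minimal sketch of that update step in Python. The `Row` class stands in for a database row and its field names are invented for the example; timestamps are assumed to be plain numbers in the same units as the characteristic time.

```python
import math
from dataclasses import dataclass


@dataclass
class Row:
    activity: float = 0.0      # the running average, starts off at 0
    last_changed: float = 0.0  # timestamp of the most recent change


def record_change(row, now, tc):
    """Decay the old running average, count this change, remember when it happened."""
    n = (now - row.last_changed) / tc  # characteristic times since the last change
    row.activity = row.activity * math.exp(-n) + 1.0
    row.last_changed = now
```

The first change always leaves the activity at exactly 1, whatever the initial timestamp was, because the old value being decayed is 0.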

To show that this works, suppose the running average was $a = 1 + e^{-N_1} + e^{-N_1-N_2} + e^{-N_1-N_2-N_3} + \dots$ (one term for each change, weighted by how long ago it happened). When we update the running average it becomes $1 + e^{-N}(1 + e^{-N_1} + e^{-N_1-N_2} + e^{-N_1-N_2-N_3} + \dots) = 1 + e^{-N} + e^{-N-N_1} + e^{-N-N_1-N_2} + \dots$, which is just what we want.

That isn’t quite the end of the story though, because the running averages in the table are not directly comparable to each other - if a record had a burst of activity a long time ago but then hasn’t been touched since, it will have a similar stored value to a record which had a similar burst of activity which has only just ended. To compute the “current” value of the running average we need to multiply $a$ by the $e^{-N}$ corresponding to the time since it was last updated (without adding one this time, since we haven’t added another unit of activity). This requires looking at all the records in the table though, which will be faster than the table of changes approach but might still be rather slow for a big database.
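
Sketched in Python, with rows represented as hypothetical `(record_id, activity, last_changed)` tuples, the full-table version might look like this:

```python
import math


def current_activity(activity, last_changed, now, tc):
    """Decay a stored running average to the present without adding a new change."""
    return activity * math.exp(-(now - last_changed) / tc)


def hottest(rows, now, tc, limit=10):
    """Rank every row by its current activity; rows are (record_id, activity, last_changed)."""
    scored = [(current_activity(a, t, now, tc), rid) for rid, a, t in rows]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:limit]]
```

Two rows with the same stored value now rank differently: the one changed more recently comes out ahead.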

If we only care about the top (10, say) “hottest” records, we can speed it up by caching the results in a small table, and noting that scaling all the activity values by the same factor doesn’t affect the ordering of the list. Suppose we have a singleton value $t_{update}$ which is the time we last updated the small table and $a_{10}$ which is the activity of the 10th hottest record the last time it was changed. Whenever we change a record, take the new activity value $a$, multiply it by $e^{N}$ (note no minus sign here) where $N=(t_{now}-t_{update})/t_c$ and compare it to $a_{10}$. If it’s larger the new record should be inserted into the “top ten” table and the old 10th hottest record shuffled out (if the new record wasn’t already in the table) - think of a high score table for a game. When this happens, set $t_{update}=t_{now}$, multiply all the activity values in the small table by $e^{-N}$ and update $a_{10}$ with the new value. Then when you need to display the hottest records just display this table.
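
Here is one way the small table might be sketched in Python. The `HotList` class and its method names are invented for the example, the table is an in-memory dict rather than a database table, and the threshold is recomputed from the table each time instead of being stored as a separate value.

```python
import math


class HotList:
    """Small cache of the hottest records; activities are stored as of t_update."""

    def __init__(self, tc, size=10):
        self.tc = tc
        self.size = size
        self.t_update = 0.0
        self.table = {}  # record id -> activity, normalized to t_update

    def record_changed(self, record_id, activity, now):
        """Call after a record's running average has been updated to `activity`."""
        n = (now - self.t_update) / self.tc
        scaled = activity * math.exp(n)  # note: no minus sign
        full = len(self.table) >= self.size
        threshold = min(self.table.values()) if full else 0.0
        if record_id not in self.table and full and scaled <= threshold:
            return  # not hot enough to enter the table
        # Re-normalize the cached values to `now`, then insert or update.
        decay = math.exp(-n)
        self.table = {rid: a * decay for rid, a in self.table.items()}
        self.table[record_id] = activity
        self.t_update = now
        if len(self.table) > self.size:
            coldest = min(self.table, key=self.table.get)
            del self.table[coldest]

    def hottest(self):
        return sorted(self.table, key=self.table.get, reverse=True)
```

One caveat with following the recipe literally: if the small table goes untouched for a very long time, the scaled value can overflow a float. Comparing the unscaled activity against the threshold multiplied by $e^{-N}$ is equivalent and avoids that.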

There is one more complication which comes about from deleting records. If you delete a record it probably shouldn’t appear in the “hottest” records list, even if it was updated quite recently. But if you delete a record from the small table when it is deleted from the big table, you will only have 9 records in the small table and you’d have to go searching through the entire big table to find the new 10th record.

If records don’t get deleted from your database too often, a simple workaround to this problem is to keep maybe 20 records instead of 10 in the small table so that there are plenty of “substitutes” around, and only display the top 10 of them.

The algorithms used by Digg, Reddit, StackOverflow etc. are a little more complicated than this because the records of those sites also have a “rating” which is factored in (higher rated records are considered “hotter”) but which can change with time. There might be a way to deal with this by scaling the running average according to the rating and updating the hot records table when the rating changes.