Archive for June, 2001

Non-hierarchical filing systems

Thursday, June 21st, 2001

Traditionally, hierarchical systems have been used to store information - a particular fact might be in a particular paragraph on a particular piece of paper, which is stored in a particular folder in a particular drawer of a particular filing cabinet in a particular office on a particular floor of a particular building in a particular city of a particular country of the world.

This is also the manner in which computers tend to store information. On a Windows system there is the "My computer" icon, under which there are a number of drives, each of which can contain files and subfolders. On a Unix system there is "/" (root) under which there are subdirectories (on my Debian GNU/Linux installation these include "/bin", "/dev", "/home", "/lib", "/lost+found", "/tmp", "/usr", "/var"). Again these contain files and subdirectories which can also contain files and subdirectories. A particular piece of information (file) is located by specifying an ordered list of folders - for example, this document might be found at "/home/andrew/work/website/andrew/computer/hierarchy.html" on a Unix system, or "C:\Work\Website\Andrew\computer\hierarchy.html" on a Windows system.

This is how things have always been done, and it is generally taken for granted that this is how it will always be done. That doesn't mean it's the best way, however. Modern systems are breaking down the boundaries of their hierarchies. For some time, Unix systems have had a facility called "symlinking", whereby an entry can be created in a directory which is neither a file nor a subdirectory but a placeholder leading to another file somewhere else in the hierarchy. This can be likened to an entry in a dictionary for a synonym, or a piece of paper in a filing cabinet saying "this document is in folder x in cabinet y". Thus you can effectively have two copies of the document without using precious space on a second copy (plus, if the document is edited, you don't get out-of-date versions cropping up).
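As a small illustration of the symlink idea, here is a sketch in Python using the standard `os.symlink` call. The directory and file names (`by_artist`, `by_genre`, `song.txt`) are made up for the example, and it assumes a system where unprivileged users can create symlinks (Unix-like systems generally allow this).

```python
import os
import tempfile

def demo_symlink():
    """Create one real file, link to it from a second folder, edit via the link."""
    with tempfile.TemporaryDirectory() as root:
        by_artist = os.path.join(root, "by_artist")  # hypothetical folder names
        by_genre = os.path.join(root, "by_genre")
        os.mkdir(by_artist)
        os.mkdir(by_genre)

        original = os.path.join(by_artist, "song.txt")
        with open(original, "w") as f:
            f.write("version 1")

        # The link is the "piece of paper" saying where the document really is
        link = os.path.join(by_genre, "song.txt")
        os.symlink(original, link)

        # Writing through the link changes the single real file
        with open(link, "w") as f:
            f.write("version 2")

        with open(original) as f:
            content = f.read()
        return os.path.islink(link), content

is_link, content = demo_symlink()
```

After the edit through the link, reading the original gives the new text, so there is only ever one version of the document.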

Windows, too, is catching up. The "Single Instance Store" is more of a space-saving device than an organizational tool, though - if you have two identical copies of a file on your hard disk, it stores them both in the same physical place on the disk. If you change one of the two files, it creates a copy and modifies that rather than modifying both files.

The trouble is that in any given hierarchy, there is more than one place you might want to put a given file. Take the typical MP3 collection, for example. Do you classify by artist, or genre, or name of song, or stuff them all in together? At the moment this is left up to personal preference, but that's not necessarily the best way of doing things. I organize my MP3s by artist, except for songs by artists of whom I have only one song, which I put all together in my MP3 "root". This has its advantages (I can see at a glance all of the artists and a good fraction of the song titles; the root folder isn't too unwieldy; and I don't always have to go into a subfolder to see what songs I have by a given artist) and its disadvantages (I have to do a search to find out the artist if I only know the song name; there's no concept of genre; because Windows lists all the folders before any of the files, I occasionally find I have duplicates; and if I get another song by an artist I previously had only one song by, the earlier song gets moved, which means it doesn't get played until I rebuild the playlist).

What is needed is a new filing system, one in which the searching, sorting and linking operations needed to overcome the disadvantages in any hierarchical system are made fundamental operations of the system - one in which the needs of the users rather than technical considerations are put first.

What I envision is a system whereby a file is located by one or more keywords in much the same way as you use a search engine to locate a document on the world wide web. So you could enter "music Frank_Sinatra" to get a list of music files by Frank Sinatra or "music Mack_The_Knife" to get a list of recordings of that song (possibly by various different artists). The key point is that the system doesn't distinguish between "music/Frank_Sinatra/Mack_The_Knife" and "music/Mack_The_Knife/Frank_Sinatra". Or even between either of those and "Frank_Sinatra/Mack_The_Knife/music" - it's completely commutative.
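The commutative lookup described above can be sketched by treating both a file's keywords and a query as unordered sets. This is only a toy model: the `KeywordStore` class, the `parse_query` helper and the file names are invented for illustration, and a real filing system would need a proper index rather than a linear scan.

```python
def parse_query(query):
    """Split a query on spaces or slashes into an unordered set of keywords."""
    return frozenset(query.replace("/", " ").split())

class KeywordStore:
    """Toy store: every file is identified by a set of keywords, not a path."""

    def __init__(self):
        self._files = {}  # file name -> frozenset of keywords

    def add(self, name, keywords):
        self._files[name] = frozenset(keywords)

    def find(self, query):
        """Return names of files carrying every keyword in the query."""
        wanted = parse_query(query)
        return sorted(n for n, kw in self._files.items() if wanted <= kw)

store = KeywordStore()
# Hypothetical file names and a placeholder second artist
store.add("sinatra_mack.mp3", {"music", "Frank_Sinatra", "Mack_The_Knife"})
store.add("other_mack.mp3", {"music", "Another_Artist", "Mack_The_Knife"})

a = store.find("music/Frank_Sinatra/Mack_The_Knife")
b = store.find("Frank_Sinatra/Mack_The_Knife/music")
recordings = store.find("music Mack_The_Knife")
```

Because the query is turned into a set before matching, `a` and `b` are the same result whatever order the keywords are written in, while `recordings` lists both versions of the song.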

This system is much more reliable than a search engine, though, because a search engine indexes all words in a document, whereas this system just indexes "filenames" which are a lot less haphazard. A "filename" on this system might look something like this:
Media: Recorded music
Format: MP3
Artist: Frank Sinatra
Title: Mack The Knife
Size: 3,703,497 bytes
Length: 4:24
Bitrate: 112Kbps
Last modified: 15th January 2000, 02:10
Comments: ...

And so on, with any other information relevant to that file. This might seem a bit unwieldy for a filename, but remember that all this information is stored by the computer anyway (usually twice, in the case of an MP3 file, because of the ID3 tag) and with a bit of careful programming it should be possible to consider this the "filename" with minimal extra overhead.

Obviously not all fields are relevant to all files (for example, "Bitrate" would have no meaning for a text file, but "Written by" would). It is important that exactly which fields are present is flexible, and that new fields can be added that weren't necessarily even thought of when the filing system was designed.
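One way to picture such flexible "filenames" is as a plain mapping from field names to values, where each file carries only the fields that make sense for it. The sketch below is an assumption about how this might look, not a design for the real thing; the `find` function and the "Rating" field are hypothetical.

```python
def find(files, criteria):
    """Return the files whose fields match every (field, value) pair given."""
    return [
        f for f in files
        if all(f.get(field) == value for field, value in criteria.items())
    ]

files = [
    {"Media": "Recorded music", "Format": "MP3", "Artist": "Frank Sinatra",
     "Title": "Mack The Knife", "Bitrate": "112Kbps"},
    # A text file has no "Bitrate" field but does have "Written by"
    {"Media": "Text", "Title": "Non-hierarchical filing systems",
     "Written by": "Andrew"},
]

# A field nobody planned for can simply be attached later (hypothetical field)
files[0]["Rating"] = 5

music = find(files, {"Media": "Recorded music", "Artist": "Frank Sinatra"})
by_author = find(files, {"Written by": "Andrew"})
rated = find(files, {"Rating": 5})
```

A file that lacks a field never matches a query on that field, so adding new fields does not disturb existing files.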

So operations which would previously have been classed as searches become as routine as looking at the contents of a particular directory is today. You could even make it work the same way by ordering the fields in some way. For example, "media" might be a fundamental class, equivalent to a directory in the root directory. Upon opening "media" the computer would show the different types of media files you have on your computer: "recorded music, musical score, photographs, drawings, movies, animations, web pages". Upon opening "recorded music" you might then be faced with a list of artists, and so on, just like it works today. But you aren't limited to that order. Upon opening "media" you could also choose to go directly to "artist" and see various different types of file associated with a particular artist.
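Browsing in any order can be sketched as follows: given the choices made so far, list the distinct values of whichever field the user wants to open next. The `browse` function and the sample records (apart from the Sinatra recording) are illustrative assumptions only.

```python
def browse(files, chosen, next_field):
    """List the distinct values of next_field among files matching the choices so far."""
    values = {
        f[next_field]
        for f in files
        if next_field in f
        and all(f.get(field) == value for field, value in chosen.items())
    }
    return sorted(values)

files = [
    {"Media": "Recorded music", "Artist": "Frank Sinatra", "Title": "Mack The Knife"},
    {"Media": "Photographs", "Artist": "Frank Sinatra", "Title": "portrait"},  # hypothetical
    {"Media": "Web pages", "Title": "hierarchy.html"},
]

top_level = browse(files, {}, "Media")                            # like opening "media"
artists = browse(files, {"Media": "Recorded music"}, "Artist")   # then "recorded music"
media_for_artist = browse(files, {"Artist": "Frank Sinatra"}, "Media")  # artist first instead
```

The same function serves both the familiar "media, then artist" order and the reverse, because no field is privileged over another.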

It is probably best left up to the user to configure which items appear by default in the "root directory" - for example someone who did lots of things with music on their computer might choose to put a link to "music" there whilst someone who just had a few music files might just leave it in "media". The system provides an infrastructure whereby people can organize their files and make them easy to search and browse through, but is also very flexible.

Where the system really comes into its own is if it is linked to other computers. You could search for media by "Frank Sinatra" not just on your own hard disk, but in *the entire world*. You could see what someone's been doing lately by searching for files they have written (and made public) and sorting them in order of date. The possibilities are almost limitless.

Of course, there's still the problem of designing user interfaces to access this information easily. It's taken decades to perfect an interface to access files in the current, hierarchical system (in my opinion, the "Explorer" in Microsoft Windows 95 is the best and friendliest file management interface yet designed, although it still has its flaws) but hopefully what has been learnt from that will make it possible to implement an interface for the new system much more quickly.

If you're interested in this subject, there is now a mailing list devoted to discussing it. Join by clicking here.

Legalese hall of shame

Saturday, June 2nd, 2001

Don't you just hate software licenses? Not only are they appalling examples of use of the English language, but more often than not they are downright hostile as well as being just plain difficult to read. This is particularly annoying when you "have to read and agree to this license" before installing the software - quite often you feel like you have to hire a lawyer before continuing the installation procedure. Let's face it, most of us just don't bother: we assume the license doesn't say anything too nasty and agree to it without reading it. I hope this page will start a meme which changes this deplorable state of affairs. I intend to list on this page the top ten best and the top ten worst software licenses, according to the scoring system outlined below.

I haven't scored any licenses yet, but if you'd like to do one, please send the name of the license, the company which created it, the name of the software product it covers and the score to me at andrew@reenigne.org.

Legalese hall of shame - scoring system

  • 1 point for each occurrence of any of the following legal jargon words or phrases:
    • notwithstanding
    • limitation
    • limited
    • including, but not limited to
    • may not
    • must not
    • authorized [or authorised]
    • entity
    • without prejudice
    • void
    • exclusive [or exclusively]
    • inclusive [or inclusively]
  • 1 point for each word in CAPITAL LETTERS, not including acronyms. As we all know, words in capital letters really *really* have to be obeyed, much like the difference between a "dare" and a "double dare".
  • 2 points for each term "defined" by the license
    • +2 extra points if the definition of the term is completely obvious to a non-lawyer without the definition
      • +2 extra points if you have to think to make sure that the definition means what you think it means.
    • +5 extra points if the term isn't used anywhere else in the license
  • 5 points for each misspelt word or misused punctuation mark.
  • 5 points for each use of the passive voice (a grammar checker will help here).
  • 5 points for each sentence longer than 50 words.
  • 10 points for each sentence which does not make sense in the language the license is written in.
  • 10 points for each of the following rights the license tries to take away from you:
    • the right not to have your email address used for unsolicited commercial email
    • the right to rent something
    • the right to lease something
    • the right to lend something
    • the right to borrow something
    • the right to time-shift or space-shift something
    • the right to archive something
    • the right to resell something
  • 10 (ten) points for each time a number is used in both numeric and longhand versions, e.g. "90 (ninety)" or "thirty (30)" (like that's supposed to make it clearer, or "more legal" or something).
  • 10 points for each of the following:
    • a premium rate telephone number
    • a telephone number in a different country to the country you are in, or in which you bought the software
    • a telephone number but no email address
    • a postal address but no email address
  • 10 points if you are told that by doing something you agree to the terms of the license (this always reminds me of childhood games involving writing an insult on a piece of paper and writing after it "if destroyed true").
  • 20 points if you paid for something, and the license claims it doesn't have to work.
  • 20 points if the license tries to take away your right to reverse-engineer anything.
    • +20 extra points if there's no "except to the extent that such activity is expressly permitted" (or equivalent) clause.
  • 20 points if the license has more than one language on the same page.
  • 50 points if the license is longer (in bytes) than the content it protects.
  • 50 points if you found a way to access the content without even seeing the license, let alone "agreeing" to it (implicitly or explicitly).
  • 50 points if the license agreement itself is explicitly copyrighted.
  • -10 points for each attempt at humour
    • -10 extra points if it's actually funny
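
A few of the mechanical rules above lend themselves to automation. The sketch below covers only three of them (jargon phrases, words in capital letters and sentences over 50 words); everything else needs a human reader. It rests on several assumptions that are not part of the scoring system: overlapping phrases are counted separately (so "including, but not limited to" also scores for "limited"), a capitalised word is any run of two or more capital letters, acronyms must be supplied by the caller, and sentences are split naively on ".", "!" and "?". All function names are made up for the example.

```python
import re

# Regular expressions for the jargon list; bracketed variants are folded in
JARGON_PATTERNS = [
    r"notwithstanding", r"limitation", r"limited",
    r"including, but not limited to", r"may not", r"must not",
    r"authori[sz]ed", r"entity", r"without prejudice", r"void",
    r"exclusive(?:ly)?", r"inclusive(?:ly)?",
]

def jargon_points(text):
    """1 point per whole-word occurrence of each jargon phrase (case-insensitive)."""
    return sum(
        len(re.findall(r"\b" + p + r"\b", text, flags=re.IGNORECASE))
        for p in JARGON_PATTERNS
    )

def capitals_points(text, acronyms=frozenset()):
    """1 point per all-capitals word that is not in the caller's acronym set."""
    words = re.findall(r"\b[A-Z]{2,}\b", text)
    return sum(1 for w in words if w not in acronyms)

def long_sentence_points(text, limit=50):
    """5 points per sentence with more than `limit` words."""
    sentences = re.split(r"[.!?]+", text)
    return 5 * sum(1 for s in sentences if len(s.split()) > limit)

def partial_score(text, acronyms=frozenset()):
    """Sum of the three automated rules only, not the full score."""
    return (jargon_points(text)
            + capitals_points(text, acronyms)
            + long_sentence_points(text))

sample = "You MAY NOT copy the SOFTWARE, including, but not limited to, the CD."
score = partial_score(sample, acronyms={"CD"})
```

For the sample sentence the jargon rule finds "may not", "limited" and "including, but not limited to", and the capitals rule finds MAY, NOT and SOFTWARE, with CD excluded as an acronym.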