String storage

Most applications store strings as a consecutive sequence of characters. Sometimes when the string is copied the characters are copied too. Sometimes strings are reference counted or garbage collected so to minimize this copying, but copies are made when concatenating and performing other "string building" operations (otherwise the characters would no longer be consecutive in memory).

An alternative that might work better (especially for something like a compiler) would be to do the concatenation lazily. Actual character data comes from just a few places (the input files which are kept in memory in their entirety, static character data, and program argument data). There are two subtypes of string - one consists of a pointer to the first character and an integer recording the number of characters in the string. The other subtype consists of a vector of strings which are to be concatenated together. Integers (and maybe also formatting information) could be kept in other subtypes. The resulting tree-like data structure has a lot in common with the one I described in Lispy composable compiler.

I'm not sure if this actually saves much (if anything) in terms of memory space or speed over the usual methods (I suppose it depends on how long the average basic string chunk is), but it does have at least one potential advantage - Vectors (especially if they grow by doubling) will have many fewer possible lengths than strings, so memory fragmentation may be reduced. I think it's also kind of neat (especially if you have such data structures lying around anyway).

Leave a Reply