Starchive

Starchive is a concept, supported by some simple tools to make it a reality. Some of these tools exist now ("sdx version" and "sdx update"), others are starting to take shape. And although a lot of what I'm about to describe is driven by technology, and starkits in particular, this is not intended as Yet Another Plan To Create An Archive Of Everything.

Because, first of all, Starchive is not tied to any particular location or server. The key concept of Starchive is to treat collections as just that, and put the decision of where and how to store such collections aside. This is similar to the internet - you start drawing networks and nodes, until the essence sinks in that one does not have to draw network topologies or IP address "classes" at all for many practical purposes: IT'S A CLOUD!

ASSOCIATIVE FILE STORAGE

Starchive dispenses with the notion of a structured archive. It's a collection of data, the building blocks are called files, and it maintains any number of them, as well as any number of revisions of them. Starchive doesn't even care about names for files.

The term "collection" is going to be used a lot. A public site can act as a big repository, and hence huge collection. A private server can act as limited-purpose and/or limited-access repository, say for corporate intranet use. A hard disk can hold a collection, as cache for what was obtained from other collections, or as storage bin for changes which have not been made public.

But even a starkit itself is really nothing but an organized collection. Some starkits are simple "grab-and-run" programs. Others are more extensive, and include a set of standard Tcl packages they need to be self-contained and work as is. Yet others, such as kitten.kit, are really meant to be used as collections which provide a packaged set of functionality.

One way to look at this is that a starkit is a selection of code which was picked to work together, both in the choice of what went in and in which exact versions were combined.

Add to that the fact that a starkit is nothing more or less than a directory-tree-in-a-file, and one could say that a developer's development area is also nothing but a personally-picked collection. Plus of course the newly developed code, docs, data, test suites, etc. Whether wrapped as starkit or unwrapped, that's merely a representation detail now that Tcl has VFS.

What Starchive does is take this model to a logical next step. There are structured, hand-picked, developer-controlled sets of files, which tend to live on a local hard disk or file server, and there are... starchives.

A COLLECTION OF COLLECTIONS

A starchive accepts starkits. You can send one to it, you have to give it a name, you have to have some identity (even if it's just an IP number), and you probably want to include comments and other meta information. It could be totally ad hoc, or it could be standardized by pointing to a file inside the submitted starkit, containing metainfo such as TIP55 and Cantcl are starting to define. Starchive does not care.

All Starchive does, is to accept collections and return them on demand. With multiple starchives, one could imagine that there will be a way to treat some starchive sites as private, while others act as public repositories. Access control and authentication will no doubt be used to varying degrees, depending on their purpose.

To simplify this mental model completely: think of starchives as a way to collect and return starkits on demand. All you need is a name, and a version ID (as calculated by "sdx version").

Ok, but how does it work? Will it scale (technically and organizationally)? Isn't this terribly inefficient, awkward, error prone, brittle, over-simplified?

To brush one set of objections aside: starchives can be extremely efficient, both in speed and in storage. Maintaining several Gb with a million files is easy nowadays. So is tracking tens of thousands of submitted starkits.

Another point is that while starkits-in/starkits-out is the basic model, nothing prevents Starchives from accepting zip's and tar/gz's, nor from producing them on demand. Again, internally this is a cloud - what comes out will be generated on demand (and cached, if performance makes that necessary).

NAMING CONVENTIONS

One important design aspect will be naming conventions. This can and will evolve. The current "sdarchive" is about to become the first incarnation of a very simple starchive. It has a flat namespace. This may change when multi-level names are introduced, e.g. "app-rendering/fractal". Such a change can happen on the fly - all current starkits can simply be re-submitted to starchive under the new name. Old names can continue to function, if that is desired. Or not. Starchives as a foundation do not care. One starchive may be simplistic/flat, while another introduces strict namespace categorization.

Please take a minute to reflect on this approach. An application is (can be) a starkit. It's a collection. It has a version ID at any point in time. It can be submitted to a starchive. From then on, the application ceases to be "physical", it's no longer a starkit, in fact - it's a <name,version-id> tuple, living inside one or more specific starchives. That starchive may well be only on the local disk, so all this does is move files around. But the essence is that such an application becomes detached from the hierarchical directory trees we all are so used to. It's a collection, carefully composed and tested, and it ends up with what is essentially a fingerprint.
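The idea of a collection ending up with a fingerprint can be sketched in a few lines. This is only an illustration, written in Python for brevity: it does not reproduce how "sdx version" actually calculates its version ID, and the function name collection_fingerprint and the sample files are made up for the example.

```python
import hashlib

def collection_fingerprint(files):
    """Return one hex digest for a whole collection.

    `files` maps a relative path to the file's bytes. Paths are visited
    in sorted order so the result does not depend on dict ordering.
    Hypothetical scheme, NOT the algorithm used by "sdx version".
    """
    digest = hashlib.md5()
    for path in sorted(files):
        content_sig = hashlib.md5(files[path]).hexdigest()
        digest.update(f"{path}\0{content_sig}\n".encode("utf-8"))
    return digest.hexdigest()

# A tiny made-up application, and its <name,version-id> tuple.
app = {"main.tcl": b"puts hello\n", "lib/util.tcl": b"proc noop {} {}\n"}
identity = ("fractal", collection_fingerprint(app))
```

Changing a single byte in any file, or renaming a file, gives a different fingerprint, so the tuple pins down exactly one state of the collection.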

Just as with CVS, directory trees change from being considered the "original" data to being a checked out replica. In many deployment situations, the checked out replica can be a starkit which never gets unwrapped at all. In other scenarios, unwrapping is needed, leading to a different but equivalent application representation. In the case of Tcl and VFS and starkits, either one can be launched - a major convenience for development and testing.

STARCHIVE IMPLEMENTATION

By now, you may be wondering about performance and disk usage again. Ok, here's why this is not an issue: starkits submitted to a starchive are not stored as is. Instead, a catalog is created, with all file contents replaced by an MD5 signature. That (tiny) catalog is sent to the starchive, which returns information about which files are not in the Starchive already. Submission then takes place by sending the catalog and *only* the missing files. On access, all files are inserted in the generated starkit - on the fly.
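The exchange described above (send the catalog, learn what is missing, send only that) might look roughly like this. It is an in-memory sketch in Python under stated assumptions: the class Starchive, its methods, and client_submit are hypothetical names, there is no network layer, and the catalog is a plain path-to-MD5 mapping rather than the real catalog format.

```python
import hashlib

def md5(data):
    return hashlib.md5(data).hexdigest()

class Starchive:
    """Hypothetical in-memory stand-in for a starchive server."""

    def __init__(self):
        self.blobs = {}     # MD5 signature -> file contents
        self.catalogs = {}  # (name, version_id) -> {path: signature}

    def missing(self, catalog):
        """Step 1: report which signatures in the catalog are not stored yet."""
        return {sig for sig in catalog.values() if sig not in self.blobs}

    def submit(self, name, version_id, catalog, payload):
        """Step 2: accept the catalog plus only the missing file contents."""
        for sig, data in payload.items():
            if md5(data) != sig:
                raise ValueError("payload does not match its signature")
            self.blobs[sig] = data
        if self.missing(catalog):
            raise ValueError("catalog refers to files that were never sent")
        self.catalogs[(name, version_id)] = dict(catalog)

    def fetch(self, name, version_id):
        """On access, re-insert all file contents on the fly."""
        catalog = self.catalogs[(name, version_id)]
        return {path: self.blobs[sig] for path, sig in catalog.items()}

def client_submit(server, name, version_id, files):
    """Submit a collection; return how many file bodies had to be sent."""
    catalog = {path: md5(data) for path, data in files.items()}
    wanted = server.missing(catalog)
    payload = {sig: files[path] for path, sig in catalog.items() if sig in wanted}
    server.submit(name, version_id, catalog, payload)
    return len(payload)

server = Starchive()
v1 = {"main.tcl": b"puts one\n", "lib/bwidget.tcl": b"# widget code\n"}
v2 = {"main.tcl": b"puts two\n", "lib/bwidget.tcl": b"# widget code\n"}
sent_first = client_submit(server, "demo", "id-1", v1)   # both files are new
sent_second = client_submit(server, "demo", "id-2", v2)  # only main.tcl changed
```

The second submission transfers a single file body, which is why the "fetch, change, re-submit" cycle stays cheap even for large collections.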

So what Starchive stores is each file version once. It is a (potentially huge) associative cache. This means that file space is used very efficiently. Submitting different starkits which all contain, say, BWidgets will cause only one copy of BWidgets to be stored. It also means that the sequence "fetch, unwrap, make a change, wrap, re-submit" will be near-instant. Just as CVS is, but with a different granularity and storage model.

Submitting N different versions of a Tcl package such as BWidgets is fine. Only one copy of each unique file version is stored by Starchive.
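To make the storage argument concrete, here is a small Python sketch of an associative store keyed by MD5. The function name store_all and the file contents are invented for the example; they merely stand in for two versions of a package that differ in one file.

```python
import hashlib

def store_all(collections):
    """Store every file of every collection once, keyed by its MD5 signature."""
    store = {}
    for files in collections:
        for data in files.values():
            store.setdefault(hashlib.md5(data).hexdigest(), data)
    return store

# Two made-up versions of a package; only tree.tcl differs between them.
package_v1 = {"button.tcl": b"v1 button", "tree.tcl": b"v1 tree", "pkgIndex.tcl": b"index"}
package_v2 = {"button.tcl": b"v1 button", "tree.tcl": b"v2 tree", "pkgIndex.tcl": b"index"}

store = store_all([package_v1, package_v2])
files_submitted = len(package_v1) + len(package_v2)  # 6 files went in
unique_stored = len(store)                           # 4 distinct contents are kept
```

Storage grows with the number of distinct file versions, not with the number of collections that contain them.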

COLLECTIONS AS METAKIT CATALOGS

The one missing link is the "catalog" files mentioned above. These are starkits with the data ripped out and MD5 signatures filled in instead. Each such catalog represents exactly one specific "collection". It has all the info needed to know exactly which version of which file is needed where in a starkit to faithfully reconstruct it. These catalogs are themselves also files, stored by Starchive.
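As a rough picture of what such a catalog holds, the Python sketch below strips the data out of a directory tree and rebuilds the tree again from the catalog plus a signature-to-contents store. The entry layout (path, size, MD5) and the names make_catalog and rebuild are assumptions for illustration; the real catalogs are Metakit datafiles whose exact layout is not shown here.

```python
import hashlib

def make_catalog(tree, prefix=""):
    """Flatten a nested tree (dict = directory, bytes = file).

    Returns (entries, blobs): entries is a list of (path, size, md5)
    tuples, blobs maps each md5 to the file contents. Empty directories
    are not recorded in this simplified sketch.
    """
    entries, blobs = [], {}
    for name in sorted(tree):
        node = tree[name]
        path = prefix + name
        if isinstance(node, dict):
            sub_entries, sub_blobs = make_catalog(node, path + "/")
            entries.extend(sub_entries)
            blobs.update(sub_blobs)
        else:
            sig = hashlib.md5(node).hexdigest()
            entries.append((path, len(node), sig))
            blobs[sig] = node
    return entries, blobs

def rebuild(entries, blobs):
    """Reconstruct the nested tree from catalog entries and stored contents."""
    tree = {}
    for path, size, sig in entries:
        *dirs, leaf = path.split("/")
        node = tree
        for directory in dirs:
            node = node.setdefault(directory, {})
        data = blobs[sig]
        if len(data) != size:
            raise ValueError(f"size mismatch for {path}")
        node[leaf] = data
    return tree

kit = {"main.tcl": b"package require app\n",
       "lib": {"app": {"pkgIndex.tcl": b"# index\n", "app.tcl": b"# code\n"}}}
entries, blobs = make_catalog(kit)
restored = rebuild(entries, blobs)
```

The catalog itself stays tiny regardless of how large the files are, since it carries only paths, sizes, and signatures.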

The format of catalog files, which is in fact the data sent over-the-wire, i.e. the protocol used between Starchive servers and clients, is compatible with starkits. It's also compatible with the Catfish disk catalog utility, of which tens of thousands of copies have been downloaded over the years. The exact details and differences will need to be documented. There is pure-Tcl code to read Metakit datafiles ("readkit"), although it is not quite up to snuff yet. Its existence means that none of this is tied to any particular implementation (or even to Tcl, for that matter).

The difference between catalogs and all other files in Starchives (apart from their well-defined structure), is that catalogs are associated with a name and additional meta information. One of the key tasks of Starchive, is to be able to browse, select, summarize, and make an inventory of all the catalogs submitted to it at some time or other.

Starchives only grow in size, they don't ever drop or lose anything. The only way to get from a big starchive to a new leaner one, is to declare the original one "closed", create a new one, populate it with a subset, and retire the original completely - if needed.

Again, starchives are not a *specific* site, or a *specific* structure or naming convention. These will have to evolve on top of what has been described. It may well be that one or more "major" or "central" public archives are set up. For now, the only plan is to take sdarchive as prototype and turn it into a starchive.

Ah... but it already is. The new "sdx update" function is essentially a first crude gateway to seeing collections as starchives. Even though the current server does not do all that md5 magic, it really does not matter. The sdarchive is a spot with collections, today. It'll soon be a new spot with the same and more collections - at which point the real first starchive will be a reality.

THE ESSENCE OF VERSIONING

This has become a long story on what "can" and "will" be done. The goal is not to coerce anyone to go along. It's a statement of direction, no more.

But there is more to be said on why this approach is being taken.

Most archive initiatives, in fact most "release" initiatives, take the perspective of constructing a product, moving towards the ultimate goal of a final release, and ultimately a sequence of updates and releases beyond that.

This is the "push" model. It defines a goal. It invents a version number, and it focuses on that deliverable. Once the target exists, it is *the* result which is going to be described, marketed, sold, introduced, etc.

In a vague hand-waving sense, this is a way of sealing everything including the kitchen sink. The "product" is self contained, it comes with a name, a release tag ("Myapp 1.0"), and with documentation in the box, so to speak.

But with all due respect for standard best practices, I think it's fundamentally wrong. It tries to contain everything into a single world.

As Goedel has shown, one cannot "prove a system with itself". One needs stronger logic than what is inside to say something about the whole.

I think the same holds for software, in a very deep sense. You create something, and then you have to step *out* of that world, rise above it, summarize what it is, what it does, and ... what the whole is going to end up being called.

Which is why putting something as innocent-looking as a version number *inside* a product is trouble. Oh sure, we go through it anyway - 0.99, 1.0a1, 1.0b1, 1.0rc1, 1.0 - but it's just not going to work. For that same reason, putting a *final* tag on something inside that same thing is never going to work. I have always resisted version numbering, because the moment I pick one, something comes along invalidating it. Maybe minor, maybe just a README change to add a last-minute observation, but it breaks the whole thing.

If you step back, and look how Linux distro's, RPM's, stickers-on-the-box, and more all attempt to get inside-yet-outside the whole, it becomes a logical step to leave this for what it is and drop it altogether.

That is why "sdx version <starkit>" is a lot more than just a gimmick.

PLANNED EVOLUTION

And that's where Starchive is fundamentally different from the archives I have seen so far. A Starchive is not centered around goals, but around snapshots. It does not force you to work towards a release, and then push that into the repository as being the real thing, it encourages you to give it snapshots. With whatever frequency is convenient. When collections are extracted, they will usually be the most recent versions (unless a downgrade is needed, which Starchives support trivially). Getting the very latest is made convenient and quick, encouraging "pull" modes of distribution. Nothing new, really - CVS works the same way.
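The snapshot-oriented "pull" model amounts to keeping, per name, the list of version IDs in the order they arrived. A minimal Python sketch, with the class name SnapshotIndex and the sample IDs being purely hypothetical:

```python
class SnapshotIndex:
    """Tracks which version IDs were submitted under each name, oldest first."""

    def __init__(self):
        self.history = {}  # name -> list of version IDs in submission order

    def submit(self, name, version_id):
        versions = self.history.setdefault(name, [])
        if version_id not in versions:  # re-submitting a known snapshot changes nothing
            versions.append(version_id)

    def latest(self, name):
        """What a plain "pull" would normally return."""
        return self.history[name][-1]

    def versions(self, name):
        """All known snapshots, so a downgrade is just picking an older ID."""
        return list(self.history[name])

index = SnapshotIndex()
for snapshot in ["id-a", "id-b", "id-b", "id-c"]:
    index.submit("fractal", snapshot)

newest = index.latest("fractal")        # "id-c"
downgrade = index.versions("fractal")[0]  # "id-a"
```

Nothing in this bookkeeping distinguishes a "release" from any other snapshot; that distinction lives in the names people choose to submit under.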

What Starchive does is accommodate a work style of distribution based on a flow of releases. That does not mean everyone has to live with untested bleeding-edge code - this is simply a matter of naming: submit changes as two separately-named collections, one whenever convenient, the other after carefully managed tests & QA steps.

In fact, what Starchive does is *become* the distribution channel. Anything can be a starchive, including a small and focused repository for just release management to clients. They can be told to fetch something from A, without ever seeing that A is internally an elaborate version-managing Starchive. The power of packaging on the fly. Keep in mind that even a single starkit can be seen as a Starchive - albeit a single-purpose single-revision one.

Having said all this, I'd like to qualify the use of version ID's a bit more: for us mortals, simple release numbering schemes are definitely convenient. This is not a proposal to get rid of those. It's just that we have to start making a distinction between approximate update names, and precise ones used as basis for support and upgrade decisions. Build numbers won't cut it - with scripting, there are no clear-cut builds.

And lastly, a comment about an aspect which contrasts quite dramatically with systems such as CVS: Starchives are stateless. You can fetch a specific starkit from one, and submit it to another (if authorized, of course). Neither Starchive cares. You may be submitting something which is a 100% replica of something it already contains - no problem. Evidently, it would be nice to have tools for statistics, and to report the various cases of overlap. That also holds for multiple Starchives - some may be mirrors, others subsets, etc - tools will need to be developed to manage this in the large.

Starchives are about flow. Pulling collections. Anything. Not just Tcl.

March 2003 - This page started by describing Starchive as a concept. Well, about one hundred lines of Tcl have now turned it into a reality - see Starchive implementation.