My girlfriend always calls me a revisionist…
It surprises and saddens me that I’m still so enamored with the version control offered by many of the web apps we use every day. It’s tragic that in this day and age I have to jump through so many hoops to get but a sliver of that ability on my desktop with a VCS. Even then I’m sacrificing a lot of the subtleties captured in, say, Google Docs.
In my proposal of a social file system, I hinted at a simple revision primitive that could add some wiki magic to an otherwise normal set of objects, but I really ought to clarify my thoughts on the design. I haven’t given too much thought to what these simple revision capabilities would enable long term, but I suspect it’s pretty huge. So here’s a sketch of how I think this VersionProxy could work.
First, a little background. The ultimate goal is to represent higher-order objects other than plain old inodes (the files and folders we all know and kinda love), but before that I need to describe the base type that can be all of these things. To my mind, Resource is a more semantic name for this base class than inode, because these objects can be a lot more than just inodes.
So in this base Resource class there is a default hashing behavior (take an SHA hash of the contents attribute). Any time the contents of a Resource change, the modified properties of that Resource (modified_at, modified_by) get cycled and the hash gets updated. If the Resource is a collection (the contents attribute is list-like, containing child Resources), the hash is an SHA of the children’s concatenated hashes. When this change event fires, the parent (if it exists) should have its modified properties cycled as well. This cascades all change metadata upstream to the root Resource.
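To make the cascade concrete, here is a minimal sketch of that default behavior. Everything in it is an assumption for illustration: the method names (compute_hash, set_contents, add_child, _changed), the choice of SHA-1 as the "SHA hash", and the use of a plain Python list for collections. It is not the actual model code.

```python
import hashlib
from datetime import datetime, timezone


class Resource:
    """Hypothetical base class: a leaf holds bytes, a collection holds child Resources."""

    def __init__(self, name, contents=b"", parent=None):
        self.name = name
        self.parent = parent
        self.modified_at = None
        self.modified_by = None
        self._contents = contents
        self.hash = self.compute_hash()

    def compute_hash(self):
        if isinstance(self._contents, list):
            # Collection: SHA of the children's concatenated hashes.
            joined = "".join(child.hash for child in self._contents)
            return hashlib.sha1(joined.encode("ascii")).hexdigest()
        # Leaf: SHA of the raw contents (SHA-1 is an assumption here).
        return hashlib.sha1(self._contents).hexdigest()

    def set_contents(self, contents, user):
        self._contents = contents
        self._changed(user)

    def add_child(self, child, user):
        child.parent = self
        self._contents.append(child)
        self._changed(user)

    def _changed(self, user):
        # Cycle the modified properties, refresh the hash, then notify the parent.
        self.modified_at = datetime.now(timezone.utc)
        self.modified_by = user
        self.hash = self.compute_hash()
        if self.parent is not None:
            self.parent._changed(user)


root = Resource("root", contents=[])
docs = Resource("docs", contents=[])
root.add_child(docs, "alice")
note = Resource("note.txt", b"hello")
docs.add_child(note, "alice")

before = root.hash
note.set_contents(b"hello world", "bob")
print(root.hash != before, root.modified_by)
```

Editing the leaf changes the hash and modified_by of every ancestor up to the root, which is the upstream cascade described above.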
All Resources inherit this default behavior, but of course the hashing properties can be overridden, and will often need to be. What constitutes a modification can vary amongst object types. For a plain old file, “title” is just metadata, but for something like a blog post or wiki page, a change to its title is a noteworthy change to the content of the object itself.
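As a rough illustration of overriding what counts as a modification, the sketch below gives the base class a hook that subclasses can replace. The hook name hash_material and the WikiPage class are hypothetical, and the base class is deliberately stripped down to just what the example needs.

```python
import hashlib


class Resource:
    """Hypothetical minimal base: only `contents` counts as a change."""

    def __init__(self, title, contents):
        self.title = title
        self.contents = contents

    def hash_material(self):
        # Default: the title is just metadata and is left out of the hash.
        return self.contents

    @property
    def hash(self):
        return hashlib.sha1(self.hash_material().encode("utf-8")).hexdigest()


class WikiPage(Resource):
    """For a wiki page, the title is part of the content being versioned."""

    def hash_material(self):
        return self.title + "\n" + self.contents
```

Renaming a plain Resource leaves its hash alone, while renaming a WikiPage produces a new hash and therefore, once versioned, a new revision.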
When an object gets associated with a VersionProxy, these modification properties become quite significant. A change in hash means a new version of the object. This should be completely transparent to the caller. When calling a version-proxied object, what gets returned is the current version of the object, instrumented (as SQLAlchemy does) to intercept committed changes to the object, though otherwise identical to it.
Because this means a lot of retired objects will be hanging around, some hacking on SQLAlchemy’s facilities for fetching an object will be required, probably by chaining a filter onto the query object that filters out all inactive versionables. New methods will also have to be written to accommodate querying through these stored versions of an object.
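Here is one way the transparency could look in plain Python, without any SQLAlchemy machinery. This is only a sketch under my own assumptions: the Note class, the explicit commit() call and the versions() accessor are all made up for illustration, and a real version would hook into the ORM's commit rather than ask the caller to trigger it.

```python
import copy
import hashlib


class Note:
    """Hypothetical versionable object with a content hash."""

    def __init__(self, contents):
        self.contents = contents

    @property
    def hash(self):
        return hashlib.sha1(self.contents.encode("utf-8")).hexdigest()


class VersionProxy:
    """Forwards attribute access to the current version; commit() snapshots on hash change."""

    def __init__(self, obj):
        # Bypass our own __setattr__ for the proxy's internal state.
        object.__setattr__(self, "_current", obj)
        object.__setattr__(self, "_versions", [copy.deepcopy(obj)])

    def __getattr__(self, name):
        # Only called when normal lookup fails, so reads fall through to the object.
        return getattr(self._current, name)

    def __setattr__(self, name, value):
        # Writes go to the wrapped object, never to the proxy itself.
        setattr(self._current, name, value)

    def commit(self):
        # A changed hash means a new version; an unchanged hash is a no-op.
        if self._current.hash != self._versions[-1].hash:
            self._versions.append(copy.deepcopy(self._current))
        return len(self._versions)

    def versions(self):
        return list(self._versions)
```

The caller reads and writes `proxy.contents` exactly as it would on a bare Note. Only commit() and versions() reveal that there is history underneath.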
Versioning one object, then, is relatively straightforward. It gets a bit less so when considering versioning a collection — or an object representing a collection, like a Folder. A Folder whose contents are tracked recursively with every update to any child can be called by another name: a VCS. And there’s no reason to reinvent the wheel here — many of the problems and corner cases associated with this approach have been rehashed over and over through the years. We are throwing one particular twist into the mix, though — we’re not just versioning plain old inodes but higher-order objects. But this fact can be neatly encapsulated, hidden away as an implementation detail.
Just like when versioning a single object, when versioning a collection it is critical to know what constitutes a change, complicated further by the requirement that the collection be aware of all child changes. One solution to this is to simply version-proxy all child objects recursively. When a change to one or more children is committed, each changed child’s hash is recalculated, followed by a call to its parent container to do the same, and so on until the root is updated.
It gets more complex when trying to consider version numbers. Say we have a folder structure like:
A
|
----B1
----B2
    |
    ----C1
When a change is made to B1, this leads to a new hash and modified date, and to a call to parent A to do the same. That leaves us with two versions each of B1 and A; let’s call them B1r1 and B1r2, and Ar1 and Ar2, respectively. Now suppose we make another change, this time to two nodes: B1 and C1. This leaves us with three B1s, three As, two B2s and two C1s. It should be guaranteed algorithmically that committing multiple changes at the same time only leaves us with one new version of the root.
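One way to get that guarantee is to collect the changed nodes and all of their ancestors into a set first, and only then bump each one once. The sketch below assumes a bare-bones Node with a revision counter and a free-standing commit() function. Both are hypothetical stand-ins for the real proxied objects.

```python
class Node:
    """Hypothetical tree node that only tracks its parent and a revision counter."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.revision = 1


def commit(changed):
    """Bump each changed node and each of its ancestors exactly once."""
    touched = set()
    for node in changed:
        # Walk upward; stop early if this ancestor chain was already collected.
        while node is not None and node not in touched:
            touched.add(node)
            node = node.parent
    for node in touched:
        node.revision += 1
    return touched


a = Node("A")
b1 = Node("B1", a)
b2 = Node("B2", a)
c1 = Node("C1", b2)

commit([b1])        # first change: B1 only
commit([b1, c1])    # second change: B1 and C1 together
print(a.revision, b1.revision, b2.revision, c1.revision)  # prints "3 3 2 2"
```

The second commit touches A through two different paths but still produces only one new version of it, matching the counts above.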
As mentioned before, operating on revision-proxied objects is transparent through the standard methods — proxied and non-proxied objects look the same. So there needs to be a VersionProxy-specific API to gain access to the different versions, as well as other niceties having to do with revisions (e.g. a diff tool, perhaps specific to each object type, a patch tool and other VCS features).
Another nice thing about having something like VersionProxy above your objects is that data retention features are built right in. Kill all old versions over one year old? No problem. One common data retention strategy, logarithmic history purging, could be made even more effective by having access to class-specific diff data, allowing a programmatic assessment to determine things like major/minor changes, lending more intelligence for selecting purge points.
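For the curious, here is a small sketch of one possible reading of logarithmic purging: keep the newest version inside each age bucket, where the buckets double in width (under a day, 1 to 2 days, 2 to 4, 4 to 8, and so on). The function name and the bucket boundaries are my own assumptions, and it ignores the class-specific diff hints entirely.

```python
import math


def purge_logarithmically(ages_in_days):
    """Return the ages to keep: the newest version in each doubling age bucket."""
    keep = {}
    for age in sorted(ages_in_days):
        # Bucket 0 is "younger than a day"; bucket n covers [2**(n-1), 2**n) days.
        bucket = 0 if age < 1 else math.floor(math.log2(age)) + 1
        # Ages arrive youngest first, so the first one seen in a bucket is the newest.
        keep.setdefault(bucket, age)
    return sorted(keep.values())


print(purge_logarithmically([0, 1, 2, 3, 4, 5, 6, 7, 8, 16, 20, 30]))
# prints [0, 1, 2, 4, 8, 16]
```

Recent history stays dense and old history thins out. A smarter version could use the diff data to pick a "major change" within each bucket instead of simply the newest one.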
Obviously tagging would be possible too, in both the “web2.0” sense and the SVN sense. A version number is no more than a special tag of a particular release number.
There’s so much more here, but I’ve already rambled on for pages. Thoughts? Ideas? If you’ve read this far, let me know what you think. I promise to push what code I have into the Google Code repo. It’s not going to be much more than a specific SQLAlchemy model broken out into modules. There are some interesting microformat output filters just for good measure. I’m thinking about writing a few hooks into particular APIs (Google’s Contact API, for instance), but I have to get the data model right first.
May 17th, 2008 at 10:08 pm
I think that if you keep deltas of everything and do the logarithmic history purging (with some ability to hint at “good points”), it would be really cool. I’ve been reading up on a lot of things lately, and one of them, which only comes up in rare circumstances with programming, would come into play here if you think of your whole system as a VCS: the problem of the “Tangled Working Copy”.
Once I read about the Tangled Working Copy problem I started looking at git, but I really don’t need it… I’m happy with Subversion for now. I have precisely one project using it: I’m putting Forth/2, an old project for OS/2, up on SourceForge under the GPL.
I’m interested in having an object system instead of a file system, so it’s really good to hear about your work.
One idea I had a long time ago which you might find useful is the idea of an interactive file structure explorer… back in the days of MS-DOS I occasionally had need to see inside data files from various applications outside my control. I wrote a program to help me interactively guess the contents of fixed length records. My next step, which I never needed, would have been to write a mini language for specifying the structures of more complex documents. You might want to do something like that to make objects with methods out of JPEG and media files, and even things like the odd XLS file that someone sends to you.
Keep up the good work!
–Mike–