27 January 2006

Version Control Systems

I've had a fair amount of experience using CVS, but I haven't used it for a couple of years. Recently I have been investigating version control systems again. One scenario I wanted to investigate was the following. Assuming a tree of files like this:

A/
  B/
    foo.cpp
    foobar.cpp
C/
  D/
    bar.cpp
    barbaz.cpp

From this, we want to be able to check out the following to a sandbox directory with a single checkout command:

F/
  foo.cpp
  bar.cpp

If you're a CVS or svn user, go along with me for a moment and assume that this is a valid thing to want to do.

Unfortunately, it seems to be impossible.

What about CVS modules?

Well, I would summarize what modules can do this way: upon checkout, modules can collapse existing directory hierarchy, add directory hierarchy that does not exist in the repository, or place directories in places where they don't exist to start with. But they fall short of allowing you to arbitrarily rearrange the tree, and they won't rearrange files. This is probably in part because of the way metadata is stored on disk for CVS sandboxes: it goes into your directories, even though CVS does not treat directories in general as first-class citizens.

Here is what you can do with modules:

Alias modules will simply allow you to use a different name for an existing module or path. When you check out, all intermediate directories will be created. For example:

r5rv_alias_dirs -a r5rv/a/b r5rv/c/d

Will give you:

r5rv/
     a/
       b/
         foo.cpp
         foobar.cpp
     c/
       d/
         bar.cpp
         barbaz.cpp

Regular modules let you label all or some of the files in a directory with a module name. For example, this will get only the files indicated and write them in a directory with the name of the module, generating no intermediate directories in the output:

r5rv_foo r5rv/a/b foo.cpp

This will give you:

r5rv_foo/
         foo.cpp

Leaving off the filename will get you all the files in the specified subdirectory:

r5rv_b r5rv/a/b

giving you:

r5rv_b/
       foo.cpp
       foobar.cpp

Note that for regular modules you can't put multiple directories on the same line, or CVS gets confused.

If you want to put specific files or the contents of specific directories together, you have to use ampersand modules. But: when CVS writes out ampersand modules, it adds the module names to the directory structure! So, given:

r5rv_bar r5rv/c/d/ bar.cpp
r5rv_foo_bar &r5rv_foo &r5rv_bar

Checking out r5rv_foo_bar will give you:

r5rv_foo_bar/
             r5rv_foo/
                      foo.cpp
             r5rv_bar/
                      bar.cpp

Not quite what we want.

You can exclude directories when defining an alias module. For example:

r5rv_no_c -a !r5rv/c r5rv

will give you only:

r5rv/
     a/
       b/

Note that if you already have r5rv checked out from the module r5rv_alias_dirs above, CVS will not remove the contents of c/, so it would be best to remove your local sandbox version of r5rv before checking out this new module.

So can we achieve an output dir like:

r5rv_proj/
          f/
            foo.cpp
            bar.cpp

I don't think CVS can do this directly, but we come close, if we're willing to accept intermediate directories around our individual files. Using as a model the r5rv_foo_bar module above, we can change it to rename the working directory to something other than the module name:

r5rv_foo_bar_renamed -d r5rv_proj &r5rv_foo &r5rv_bar

We get:

r5rv_proj/
          r5rv_bar/
                   bar.cpp
          r5rv_foo/
                   foo.cpp

What if we wanted to fix up this hierarchy with a post-checkout script? Well, assuming we can guarantee the script will run on the client, if we have a script that will move our files, we can specify that we want a script to run by defining a module like this:

r5rv_post -o combine.sh -d r5rv_proj &r5rv_foo &r5rv_bar

This gives us what we want, but there is one problem: we have not properly preserved the contents of the CVS metadata directories. This means that CVS no longer knows what directory foo.cpp and bar.cpp belong to, and will be generating all kinds of errors when, for example, doing a status check or update. There are some workarounds that can be put into place using .cvsignore files, and I suppose that if I was feeling ambitious I could write some scripts that would actually run sed on the metadata files in the CVS subdirectories, but it should be obvious that this solution would be ugly and brittle.
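The post does not show combine.sh itself, so here is a minimal sketch of what such a script might look like. The function name combine, the target directory name f, and the way the project directory is passed in are all assumptions for illustration; how CVS actually invokes the -o program on a given setup should be checked separately. Note that it moves only the two source files, which is exactly why the CVS metadata ends up out of step with the files.

```shell
#!/bin/sh
# combine.sh: hypothetical post-checkout helper (not from the original post).
# Moves foo.cpp and bar.cpp out of the per-module directories created by the
# ampersand modules and into a single f/ directory.

combine() {
    proj=$1                                  # checkout directory, e.g. r5rv_proj
    mkdir -p "$proj/f" || return 1
    mv "$proj/r5rv_foo/foo.cpp" "$proj/f/foo.cpp" || return 1
    mv "$proj/r5rv_bar/bar.cpp" "$proj/f/bar.cpp" || return 1
    # The CVS/ metadata directories stay behind in r5rv_foo/ and r5rv_bar/;
    # f/ gets none of its own, so cvs status and cvs update lose track
    # of where the two files belong.
}

# Do the work only when a project directory is given as the first argument,
# so the file can also be sourced without side effects.
if [ -n "${1:-}" ]; then
    combine "$1"
fi
```

Run by hand, this would be something like sh combine.sh r5rv_proj from the directory that contains the checkout.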

So what about Subversion?

In Subversion, when you commit your initial structure, you can do something like this:

original:

a/
  b/
     foo.cpp
     foobar.cpp
c/
  d/
     bar.cpp
     barbaz.cpp

Let's say you want to be able to mix-and-match these into your output. You can check them in to svn as separate projects. Easily done:

svn import ./a file:///cygdrive/e/repo-svn/r5rv-a -m "initial"
svn import ./c file:///cygdrive/e/repo-svn/r5rv-c -m "initial"

Now you want to create a directory that will contain both of them upon checkout:

mkdir r5rv-proj
svn import ./r5rv-proj file:///cygdrive/e/repo-svn/r5rv-proj -m "initial"

Then over in your workspace, you check out that empty directory: the last parameter is the name to give it in your workspace:

svn checkout file:///cygdrive/e/repo-svn/r5rv-proj r5rv-proj

Now you have a working copy of an empty directory (something that is impossible in CVS!). You want to add an svn:externals property to it. This consists of a set of svn URLs for things that should be retrieved here when the directory is checked out. You can do this with svn propset, but since you want one local directory name and one svn URL per line, you can't really do it directly on the command line. Instead you can do something like:

echo "r5rv-a file:///cygdrive/e/repo-svn/r5rv-a" >> props
echo "r5rv-c file:///cygdrive/e/repo-svn/r5rv-c" >> props

Check that it looks right:

$ cat props
r5rv-a file:///cygdrive/e/repo-svn/r5rv-a
r5rv-c file:///cygdrive/e/repo-svn/r5rv-c

then set the externals property using -F, which means read from a file, like so:

$ svn propset svn:externals -F props r5rv-proj/
property 'svn:externals' set on 'r5rv-proj'

You can confirm the property has been set:

$ svn propget svn:externals r5rv-proj
r5rv-a file:///cygdrive/e/repo-svn/r5rv-a
r5rv-c file:///cygdrive/e/repo-svn/r5rv-c

Clean up:

$ rm props
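Because the echo lines above append with >>, a props file left over from an earlier attempt would end up with duplicate entries. A small sketch of an alternative, assuming the same file name and URLs, writes the whole file in one step with a here-document:

```shell
# Sketch only: the file name "props" and the URLs are the ones used above.
# "cat >" truncates the file first, so re-running this never leaves
# duplicate lines behind the way repeated ">>" appends can.
cat > props <<'EOF'
r5rv-a file:///cygdrive/e/repo-svn/r5rv-a
r5rv-c file:///cygdrive/e/repo-svn/r5rv-c
EOF

# Show the result: one "local-dir URL" pair per line.
cat props
```

The file can then be handed to svn propset with -F exactly as before.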

OK, now commit:

svn commit r5rv-proj/ -m "added external property"

And update:

svn update r5rv-proj/

You see something like this:

Fetching external item into 'r5rv-proj/r5rv-a'
A    r5rv-proj/r5rv-a/b
A    r5rv-proj/r5rv-a/b/foo.cpp
Updated external to revision 4.

Fetching external item into 'r5rv-proj/r5rv-c'
A    r5rv-proj/r5rv-c/d
A    r5rv-proj/r5rv-c/d/bar.cpp
Updated external to revision 4.

That's a lot of work! But the model is actually simpler than elaborate use of entries in the modules file. In addition, under svn, a non-privileged user can make this change (that is, a user with ordinary write access, rather than administrator access). And more importantly, these changes are all versioned; under CVS, if you want to use those module definitions, you have to make sure that the paths specified in the modules don't change, or the modules file will be out of sync with the repository structure. This is a general problem with CVS, which is not really designed to cope well with restructuring the repository.

Now your local project looks like this:

r5rv-proj/
          r5rv-a/
                 b/
                   foo.cpp
          r5rv-c/
                 d/
                   bar.cpp

Which is pretty close to what we want, taking into account the fact that neither CVS nor svn seems to be able to write individual files from different repository locations to the same output directory in the working copy.

However, there are a couple of things to keep in mind about the svn solution. First, your external links have to be complete svn URLs. That is, they can't be relative URLs to directories in the same repository [see note 1].

Worse, it seems that since I'm using svn URLs that start with svn+ssh, the URL I put into the svn:externals property has to include my username! [see note 2]. This means that the external URL will only work when I do the update, not when my co-workers do it. What a mess.

It appears that svn ignores externals when you perform a commit. That is, it assumes you are using an external piece of code that you don't own, and don't want to (or don't have permission to) change. However, you can specify that the external reference should either remain stuck on a particular version, or should refer to the head; if you choose the latter, then when you do an update, your working copy will be updated if the external tree has changed.
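To make the pinned-versus-head choice concrete, here is a sketch of a props file in which one external is fixed at a revision and the other follows the head. It assumes the externals format where a -rN option sits between the local directory name and the URL; revision 4 is only borrowed from the update output shown earlier and stands in for whatever revision you actually want.

```shell
# Sketch: pin r5rv-a to a fixed revision and leave r5rv-c on the head.
# "-r4" is an illustrative revision number, not a recommendation.
cat > props <<'EOF'
r5rv-a -r4 file:///cygdrive/e/repo-svn/r5rv-a
r5rv-c     file:///cygdrive/e/repo-svn/r5rv-c
EOF

# Review the file before feeding it to: svn propset svn:externals -F props r5rv-proj/
cat props
```

With a definition like this, an update should move r5rv-c forward when its tree changes while r5rv-a stays at the pinned revision until the property is edited and committed again.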

Anyway, there it is: externals are slightly more flexible than CVS, but not arbitrarily flexible, and with some critical usability issues. Neither of the tools really will let you assemble a checkout out of arbitrary directories and files.

I am going with svn and hoping not to look back, but the ability to define completely arbitrary modules would have been a nice fit to our build process.

From the point of view of programming languages, or even a DSL for version control, this would seem like a perfect example of failed Greenspunning. Why shouldn't the modules system in CVS let you specify arbitrary paths, or even use regular expression pattern matching, to specify the members of a retrievable module? Why shouldn't a tool like svn let you use a DSL to specify what actually happens in the checkout process, with primitives that correspond to the domain entities and proper handling of metadata?

Without this, it seems to me, there is still room in the version control space for a better tool. But I guess there is always room for a better tool; the question is "how flexible is flexible enough?"

NOTE 1: Apparently this Subversion feature (relative external links) has been under discussion for a long time. See the bug report. It looks like it might appear in svn 1.5, but I'm not holding my breath.

NOTE 2: There seems to be a way to configure resource files so that command-line svn clients will let you use external svn+ssh URLs that don't specify a username. However, at my workplace we are all using the TortoiseSVN client, and I'm not aware of any way to make that client work with these URLs without specifying the username. Also, just as a matter of design, it seems like a bad idea to force your repository to contain hard-coded references to its own URL, since these will break if the entire repository is moved.
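One possible form of the per-user configuration mentioned in this note, for clients that tunnel through OpenSSH, is a Host entry that supplies the login name, so the svn+ssh URL itself can omit it. This is a sketch under assumptions: svn.example.com and alice are hypothetical placeholders, and the stanza is written to a scratch file here rather than to a real ~/.ssh/config.

```shell
# Sketch only: svn.example.com and alice are hypothetical placeholders.
# Each user would merge a stanza like this into their own ~/.ssh/config,
# using their own login name.
cat > ssh_config.example <<'EOF'
Host svn.example.com
    User alice
EOF

# Show the stanza that would be merged into the user's ssh client config.
cat ssh_config.example
```

With such an entry in place, an external of the form svn+ssh://svn.example.com/... carries no username, so the same property value can serve every user whose client reads that configuration.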
