Subversion 1.5 WebDAV Write-Thru Proxies

October 16, 2007 C. Michael Pilato

Yesterday at the Pre-SubConf Subversion Workshop in Munich, I presented a short bit about a new Subversion 1.5 feature — WebDAV write-thru proxy support. This feature allows for a Subversion server deployment based around Apache 2.2.x and mod_proxy in which there is a single master server and associated repository, and one or more slave servers which handle read operations while passing through (proxying) write operations to that single master server. In this deployment scenario, each slave server has its own copy of the master repository which is kept in sync by some process, typically one driven by hook scripts on the master server itself.

Why might someone want to use such an arrangement? Well, users are far more likely to be doing read operations (such as checkouts, updates, status checks, log history requests, diff calculations, etc.) than write operations (such as commit, revision property changes, and lock/unlock requests). You might wish to employ several servers to share the burden of handling those read requests (a load-balancing scenario). Or perhaps you have a worldwide organization (such as CollabNet’s) with offices in, say, the United States, Eastern Europe, and India. If you deploy a single centralized server across the whole organization, you will almost certainly wind up favoring some users over others in terms of the performance of the network links between them and the Subversion server. WebDAV write-thru proxies would allow you to keep geographically local slave servers for each of your global regions, universally optimizing the performance of all of your users’ read requests.

For the purposes of my presentation, I whipped up a simple WebDAV write-thru proxy scenario with a single slave server, using svnsync as the mechanism for propagating changes from the master server to the slave server. The following describes how you can do the same.

First, you’ll need to have Apache HTTP Server on your master and slave servers, and on the slaves it must be version 2.2.0 or better with mod_proxy enabled. Now, you’ll need to configure your master server to expose your repository. Assuming your Subversion repository is located at /opt/svn/project, you might do so by adding a block like this to its httpd.conf file:

<Location /svn/project>
   DAV svn
   SVNPath /opt/svn/project
</Location>

Your slave servers will each eventually have full replicas of the master server’s repository (which we’ll take care of in a bit), but at this point will need at least some httpd.conf magic to expose their replica (which, again, we’ll assume lives at /opt/svn/project):

<Location /svn/project>
   DAV svn
   SVNPath /opt/svn/project
   SVNMasterURI http://IP-ADDR-OF-MASTER/svn/project/
</Location>

Notice that SVNMasterURI directive — that’s the bit that’s new to Subversion 1.5. It tells the slave server to proxy write-type operations to the master machine, and provides the URL at which the master machine’s repository can be found.

Now, there’s a second bit of httpd.conf configury we need on each slave. Remember that we’re planning to use svnsync to push changes from the master to the slave servers. Well, svnsync does so using regular commits, and if we’re proxying commits back to the master server, that won’t work out so well. So we add another Location block which exposes the server’s repository, but which does not proxy write operations through, and which allows commits only from the master server’s IP address:

<Location /svn/project-proxy-sync>
   DAV svn
   SVNPath /opt/svn/project
   Order deny,allow
   Deny from all
   Allow from IP-ADDR-OF-MASTER
</Location>

Okay, let’s talk about the actual repositories. As I mentioned, each slave server needs a replica of the master repository. And for our purposes, that replica needs to be initialized as a read-only mirror of the master repository by svnsync. (See this blog post for some basic information about using svnsync.) You’ll run svnsync init from the master server, and for the sake of performance (which is critical in this scenario), you’ll do so using file:/// access to the master. The cautious among you will simply use svnadmin create (or the equivalent in some third-party Subversion tool) to create a new repository on each slave, create for it a permissive pre-revprop-change hook, then use svnsync init http://IP-ADDR-OF-SLAVE/svn/project-proxy-sync file:///opt/svn/project (notice we’re syncing via our special sync URL), and finally use svnsync sync http://IP-ADDR-OF-SLAVE/svn/project-proxy-sync to copy each revision from the master to the slave. Those of you who are more adventurous will find every way possible to shortcut this process, from creating just one slave’s svnsync-ready mirror repository and literally copying that to every other slave, to even optimizing out that sync job cost by hand-editing the revision 0 properties on literal copies of the master repository. I’ll leave such trickery as an exercise to the reader.
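The cautious route can be sketched in shell roughly as follows. This is a minimal sketch rather than a tested deployment script: the helper name make_permissive_revprop_hook is hypothetical, the repository path and URLs are the placeholders used throughout this post, and the svnadmin and svnsync invocations are left as comments because they run on different machines.

```shell
#!/bin/sh
# Hypothetical helper: install a pre-revprop-change hook that allows every
# revision property change, which svnsync needs on the mirror repository.
make_permissive_revprop_hook() {
    repo_dir=$1
    hook="$repo_dir/hooks/pre-revprop-change"
    printf '#!/bin/sh\nexit 0\n' > "$hook" &&
    chmod +x "$hook"
}

# On each slave (repository location assumed from this post):
#   svnadmin create /opt/svn/project
#   make_permissive_revprop_hook /opt/svn/project
#
# Then, on the master:
#   svnsync init http://IP-ADDR-OF-SLAVE/svn/project-proxy-sync file:///opt/svn/project
#   svnsync sync http://IP-ADDR-OF-SLAVE/svn/project-proxy-sync
```

The hook is deliberately wide open; in this setup the only unproxied write path into the slave repository is the sync URL, which the Location block above restricts to the master’s IP address.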

Let’s see where we are. We have servers with Apache HTTP Server configured. We have repositories in place on each machine, where the slave repositories are svnsync-ready mirrors of the master. Now we need the automation bits which keep those mirrors in sync. We do this using the master repository’s hook subsystem. If you plan to allow revision property changes, you’ll need a permissive pre-revprop-change hook script on the master repository, and then also a post-revprop-change hook which tells svnsync to re-copy revision properties for a given revision when one of that revision’s properties is changed:

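#!/bin/sh
# Arguments: $1 = repository path, $2 = revision whose property changed.
REVISION="$2"
# Launch (backgrounded) sync jobs for each slave server. Output is redirected
# so that the hook can return without waiting on the background jobs.
svnsync copy-revprops http://IP-ADDR-OF-SLAVE1/svn/project-proxy-sync "${REVISION}" >/dev/null 2>&1 &
svnsync copy-revprops http://IP-ADDR-OF-SLAVE2/svn/project-proxy-sync "${REVISION}" >/dev/null 2>&1 &
svnsync copy-revprops http://IP-ADDR-OF-SLAVE3/svn/project-proxy-sync "${REVISION}" >/dev/null 2>&1 &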

You’ll also need a post-commit hook to transfer new revisions in full to the slaves:

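#!/bin/sh
# Launch (backgrounded) sync jobs for each slave server. Output is redirected
# so that the hook can return without waiting on the background jobs.
svnsync sync http://IP-ADDR-OF-SLAVE1/svn/project-proxy-sync >/dev/null 2>&1 &
svnsync sync http://IP-ADDR-OF-SLAVE2/svn/project-proxy-sync >/dev/null 2>&1 &
svnsync sync http://IP-ADDR-OF-SLAVE3/svn/project-proxy-sync >/dev/null 2>&1 &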

Why are we backgrounding our svnsync processes? As I mentioned, performance is critical here. Every second that our slaves remain out of sync with the master is a second in which the user who performed the commit or revision property change might try to then perform a read operation (like svn update) against that revision. A user who does so will get an error indicating that the revision doesn’t exist on the server, which, while not fatal, is at the very least quite confusing.
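If you want to see how far a slave lags behind, one option is to compare revision numbers from the master. The following is only a sketch under some assumptions: the helper names revision_from_info and slave_in_sync are hypothetical, the URL is the placeholder used in this post, and the parsing assumes svn info prints its English "Revision:" line.

```shell
#!/bin/sh
# Hypothetical helper: read `svn info` output on stdin and print the value
# of its "Revision:" line.
revision_from_info() {
    sed -n 's/^Revision: //p'
}

# Hypothetical check: succeed only if the slave's revision has caught up
# with the master's.
slave_in_sync() {
    master_rev=$1
    slave_rev=$2
    [ "$slave_rev" -ge "$master_rev" ]
}

# Example use on the master (placeholder URL from this post):
#   master_rev=$(svnlook youngest /opt/svn/project)
#   slave_rev=$(svn info http://IP-ADDR-OF-SLAVE1/svn/project | revision_from_info)
#   slave_in_sync "$master_rev" "$slave_rev" || echo "SLAVE1 is behind" >&2
```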

At this point, we could go live with our servers and things would mostly work. Users would check out working copies from one of the available slave servers. When they performed read operations against that repository, the server would field the requests from its replica of the master repository. When they did commits or changed revision properties, their slave server would hand off to the master server, which would do the real work and then propagate those changes back out to all the slaves. But there are some additional things you may want to configure before going live.

First, we never added any authentication or authorization stuff for our users. Interestingly, you’ll need the authentication stuff to match across all servers (and need to rig up a way to keep those in sync), but you need only the read-authorization stuff on the slaves, and both the read- and write-authorization stuff on the master. (It’s probably easiest just to keep all that stuff in sync across all the servers.)

Secondly, we didn’t do anything to handle lock/unlock client requests. To do so properly requires implementing post-lock and post-unlock hook scripts on the master which in turn perform the lock/unlock operations on each slave as the user doing the locking/unlocking. This is complicated work. Fortunately, if you choose to omit it, lock enforcement in your deployment scenario should still work. It’s just that lock queries (asking, "What’s locked, and by whom?") will always turn up empty.

Finally, we didn’t do anything to handle the problems that might occur if the link between the master and a slave server should go down at the wrong time. If the link drops during some client commit operation, that’s okay: the commit will never finish on the master server, and the user will hear back from his/her slave server that something went wrong. At worst, the commit will complete on the master but the user’s client will never know about it. (This same thing can happen in a single-server setup if the link falls down while the server is trying to respond to the final commit MERGE request.) If the link drops during the svnsync phase after a commit, that slave server will continue to work, but might be out of sync until the next commit. You could implement a cron job on the master that occasionally syncs all the slave repositories to minimize that out-of-sync period. What about svnsync failing after a revision property change? That’s more complex: you may need to implement some wrapper around that process that can reliably track success and failure and provide a retry mechanism. That’s true also of something failing while trying to propagate lock/unlock status to the slave servers.
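As a starting point for such a wrapper, here is a minimal retry sketch. It is an illustration only: the retry function is a hypothetical helper, the attempt counts and delays are arbitrary, and it retries within a single run without recording failures anywhere persistent, so it falls short of the reliable tracking described above.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to max_tries times, sleeping
# delay seconds between attempts; returns 0 on the first success.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example use in the post-revprop-change hook (placeholder URL from this
# post, REVISION as set in that hook):
#   retry 5 30 svnsync copy-revprops \
#       http://IP-ADDR-OF-SLAVE1/svn/project-proxy-sync "$REVISION" \
#       >/dev/null 2>&1 &
#
# Example catch-up job to run from cron on the master:
#   retry 3 60 svnsync sync http://IP-ADDR-OF-SLAVE1/svn/project-proxy-sync
```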

As you can see, the state of the art is currently not such that you flip a switch and suddenly wind up with a one-master-to-many-slaves repository replication deployment scenario. It’s tricky business, fraught with opportunities to make mistakes and to leave edge-cases uncovered. But this new feature in Subversion 1.5 provides the fundamental requirements if you’re willing to see the complexities through to completion.
