Mirror Management: How Often Should I Sync?

May 19, 2008 C. Michael Pilato

A CollabNet customer submitted the following question to us recently:

We set up mirrors of several Subversion repositories which CollabNet hosts for us. I have a question regarding the frequency of synchronization. How often should it be done? Should we synchronize often with a smaller number of changes, or only once in a while with larger amount?

As it turns out, I’d been considering this very question myself recently. The dynamics of mirror management in Subversion are interesting, so fielding this question gave me an opportunity to render some of my recent musings on the matter as text. And as there is nothing particularly unique about this customer’s Subversion deployment scenario, you the reader get to benefit from the generality and (now) publicity of the response that I offered to the inquirer.

My response was as follows:

This is an interesting question, and one I’ve been chewing on myself
lately. The answer I provide may not be satisfactory if you’re looking
for a simple "You need to sync every X minutes" type of response. The
answer is instead tied somewhat to the balance between your intended
purposes for the mirrors and the level of complexity you’re willing to endure while maintaining them.

Let’s start by examining the naive approach to synchronization, where
you fire off svnsync every so many minutes. Depending on how you wish
to use the mirrors, the exact number of minutes may vary. For a simple
nightly backup job, 60 x 24 = 1440 minutes works fine. But for a mirror
perhaps used by developers trying to stay atop the state of a rapidly
changing codebase, that’s not often enough. You might need to sync
every twenty minutes. Or five. Or one. You don’t want to poll the
original repository so often that you affect that server’s performance,
of course. (Launching denial-of-service (DoS) attacks against yourself
is not considered wise.) But the cost of attempting to sync an already
up-to-date mirror isn’t all that great.
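
As a rough illustration of that naive approach, here is a minimal Python sketch of a fixed-interval sync loop. The function names and the mirror URL are hypothetical, and in practice a cron entry would likely replace the loop; the only thing taken from Subversion itself is the svnsync sync subcommand and its --non-interactive option.

```python
import subprocess
import time


def sync_command(mirror_url):
    """Build the svnsync invocation that pulls new revisions into a mirror."""
    return ["svnsync", "sync", "--non-interactive", mirror_url]


def poll_mirror(mirror_url, interval_minutes, rounds,
                run=subprocess.call, sleep=time.sleep):
    """Run `svnsync sync` `rounds` times, pausing between attempts.

    `run` and `sleep` are injectable so the loop can be exercised
    without touching a real repository. Returns one exit code per attempt.
    """
    exit_codes = []
    for attempt in range(rounds):
        exit_codes.append(run(sync_command(mirror_url)))
        if attempt < rounds - 1:
            # Wait before the next poll; no wait after the final attempt.
            sleep(interval_minutes * 60)
    return exit_codes
```

Calling poll_mirror("file:///srv/mirrors/project", 20, 3) would attempt three syncs, twenty minutes apart, against that (made-up) mirror location.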

Now, if versioned changes were the only bits of data maintained by
this synchronization task, the choice of how often to run svnsync sync
would be just as simple as the above. Unfortunately, there are the
unversioned revision properties to pay attention to as well.
Because you can change a revision property at any time, and because
Subversion doesn’t record when you did so, complete synchronization of
Subversion repositories gets complicated. Say you have 100 revisions in
your master repository, and you’ve just caught your mirror up to date,
too. Later, one of the developers changes the log message for revision
50. svnsync will never realize that this change happened. Future
svnsync sync invocations will of course continue to pull down new
revisions that have been added (r101 and later), but the log message
for r50 in the original and the mirror will not be the same. The
svnsync copy-revprops subcommand is the tool for remedying this
discrepancy, but something has to tell that subcommand to run, and
against which revisions to do its thing.
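
To make the r50 example concrete, here is a small sketch of a helper that builds the svnsync copy-revprops command line for a single revision or a range. The helper name and the mirror URL are my own inventions for illustration, and the REV:REV2 range form assumes a Subversion release that accepts ranges for this subcommand (1.5 or later, as far as I know).

```python
def copy_revprops_command(mirror_url, first_rev, last_rev=None):
    """Build an `svnsync copy-revprops` invocation.

    With only `first_rev`, the command targets that single revision;
    with `last_rev` as well, it targets the range first_rev:last_rev.
    """
    if first_rev < 0:
        raise ValueError("revision numbers are non-negative")
    if last_rev is None or last_rev == first_rev:
        rev_arg = str(first_rev)
    else:
        if last_rev < first_rev:
            raise ValueError("last_rev must not precede first_rev")
        rev_arg = f"{first_rev}:{last_rev}"
    return ["svnsync", "copy-revprops", "--non-interactive",
            mirror_url, rev_arg]
```

For the scenario above, copy_revprops_command("file:///srv/mirrors/project", 50) yields the command that would refresh the mirror's revision properties for r50; passing the result to subprocess.call would actually run it.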

So the revision property synchronization angle on this adds
complexity. Most of the time, developers quickly realize mistakes
made in log messages, and fix them relatively soon after the commit
completes. As long as they make the fix before your sync job pulls down
that revision, all is well in the mirror. So that makes an argument for
doing synchronization less often (to allow time for post-facto log
message touch-ups). But how long is long enough? What about those cases
where somebody changes log messages on revisions committed months ago?
These questions can’t be answered without again looking to the purpose
of the mirrors. In your situation, does it matter if the revision
properties are out of sync so long as the core file/directory versioned
data is up to date? Maybe not. Maybe it’s okay if the revision
properties deviate for longer periods of time. Maybe in your situation,
a revision sync every ten minutes plus a nightly revision property sync
for all revisions in the repository is just what the sysadmin ordered.

(As promised, I’ve probably raised more questions than provided answers here.)

In my opinion, the best approach is a multi-faceted one, a
combination of real-time event-based triggering of sync actions and
scheduled just-in-case full synchronization jobs.

The first part of this is the part that, barring communication
errors between the master repository server and the servers housing the
mirrors, keeps those mirrors as up-to-date as possible. Ideally here,
your primary repository is able to push changes to your mirror(s), or
at least push notifications of changes to them. For example, your
primary repository might have post-commit and post-revprop-change hooks
that run svnsync to update the mirrors directly. Or, if that’s not
possible for reasons of firewalls and security and such, then perhaps
those post-commit and post-revprop-change hooks at least send email
notifications of changes, and the mirror machines have some automated
way of noticing those mails and triggering the relevant sync tasks. A
commit mail translates to running svnsync sync; a propchange mail to
running svnsync copy-revprops for the revision whose property was
changed.
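
The mapping from notification to sync action can be sketched in a few lines of Python. This is only an illustration under assumptions: the event names "commit" and "revprop-change", the function name, and the mirror URL are hypothetical, and how the event is detected (a hook script, a parsed email, something else) is left out entirely. The revision number itself is readily available, since Subversion passes it to both the post-commit and post-revprop-change hooks.

```python
def command_for_event(mirror_url, kind, revision=None):
    """Translate a change notification into the svnsync command to run.

    kind == "commit"          -> pull down any new revisions
    kind == "revprop-change"  -> re-copy properties for `revision`
    """
    if kind == "commit":
        return ["svnsync", "sync", "--non-interactive", mirror_url]
    if kind == "revprop-change":
        if revision is None:
            raise ValueError("a revprop-change event needs a revision")
        return ["svnsync", "copy-revprops", "--non-interactive",
                mirror_url, str(revision)]
    raise ValueError(f"unknown event kind: {kind!r}")
```

A hook or mail handler on the mirror side could then hand the returned list to subprocess.call, one call per notification received.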

The second facet covers the what-if cases. What if the mirror
machines didn’t get some of those email notifications? What if the sync
jobs themselves suffered network outages? To address this, you might
want to have some kind of regular scheduled task that attempts
svnsync sync (usually finding nothing to sync, because the event-based
sync
triggers are working just fine), and also does svnsync copy-revprops
across ranges of revisions (usually rewriting the mirror’s revision
properties with the values they already had, for the same reason). Of
course the thing to avoid is any given svnsync job taking so long as
to cause contention with other svnsync jobs operating against the same
repository.
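
One way to approach both the catch-all job and the contention concern is sketched below: a function that runs the just-in-case sync followed by a revision property refresh over a range, but skips the whole run if a lock file shows that an earlier run is still in progress. The lock file is purely a local convention of this sketch, not a Subversion feature, and the function name, paths, and revision range are hypothetical. The range form of copy-revprops again assumes a Subversion release that accepts it.

```python
import os
import subprocess


def run_catch_up(mirror_url, first_rev, last_rev, lock_path,
                 run=subprocess.call):
    """Run `svnsync sync`, then `svnsync copy-revprops first_rev:last_rev`.

    Returns None without running anything if `lock_path` already exists
    (another catch-up job is presumably still running). Otherwise returns
    the exit codes of the two commands, in order.
    """
    try:
        # O_EXCL makes creation fail if the lock file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    try:
        commands = [
            ["svnsync", "sync", "--non-interactive", mirror_url],
            ["svnsync", "copy-revprops", "--non-interactive",
             mirror_url, f"{first_rev}:{last_rev}"],
        ]
        return [run(cmd) for cmd in commands]
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A nightly scheduled task might call run_catch_up("file:///srv/mirrors/project", 0, 100, "/var/run/project-mirror.lock"). Note that a job killed mid-run would leave the lock file behind, so a real deployment would want some way to detect and clear stale locks.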

While not outright instructive, I hope this has been informative
enough for you to decide which implementation works best for you.
