Configuring TeamForge Git Integration with Git LFS using AWS S3

September 20, 2018 Eryk Szymanski

Welcome to the third blog in the series on how to configure TeamForge with Gerrit replication enabled.
In the first blog post, I explained the problem of Git LFS data in the context of replication and proposed a solution based on AWS S3 Bucket Cross Region Replication (CRR). In the second blog post, we went through the process of setting up CRR between two regions.

Now it is time to configure our Gerrit servers to make use of this setup. Next, we will discuss the migration of existing data from the file system to the S3 bucket. Finally, we will go through the required configuration on the client side, which will let us use the new setup. The starting point is that we already have two S3 buckets with Cross Region Replication enabled between them.

Our setup

Here is the setup I use:

  • The Gerrit master and the S3 bucket for the master (gerrit-lfs-master) are both located in us-west-2
  • The Gerrit slave and the S3 bucket for the slave (gerrit-lfs-replica) are both located in ap-south-1

As you remember, we have configured two different users: one with access to the master and the other with read-only access to the replica. We don’t need write access to the replica, as the data will be populated there by the CRR mechanism.
Our users were created in order to obtain the appropriate accessKey and secretKey for both master and replica. You can find this data in the two download.csv files (one for the master user and the other for the replica user), which were obtained as part of the process described in the previous blog post.

Moving LFS data from FS to S3 bucket

Before we begin with the configuration changes, we have to consider what to do with existing LFS data. The default LFS configuration of the TeamForge Git/Gerrit Integration stores all Git LFS data on the local filesystem of the server. Of course, in the case of a fresh install there is no Git LFS data on the server yet, so this step can be omitted: there is nothing to migrate, and one can simply switch the configuration to the S3 bucket and start using it. But what if Git LFS data is already there?

Well, there are two options here:

  1. Migrate all the data in one run.
  2. Migrate the data step-by-step, each repository separately.

What to choose?

What are the advantages and disadvantages of these options, and when should you use them?

The first option is very straightforward. We migrate all the LFS data from the local file system of the server to the S3 bucket in one run. As a consequence, all the LFS data will be stored in the S3 bucket. Note that this also means that all the LFS data from all repositories will be replicated using AWS S3 Cross Region Replication, including Git LFS data from Git repositories that are not replicated at all. Obviously, in some scenarios that might not be desirable, especially when there are many Git repositories containing huge amounts of LFS data that should not be replicated.

So let’s talk about the second option. Here we need to re-configure Git LFS and migrate the content one by one, repository after repository. Repositories that are not replicated should not be migrated; those will keep using the FS backend. Only the Git repositories that use Git LFS and are replicated should use the S3 bucket. That way, we get the best of both worlds: replicated repositories will have their LFS data replicated, but the non-replicated data stays on the filesystem. The disadvantage is that we need to configure everything manually, on a per-repository basis. Also, enabling or disabling replication later will be problematic, as it involves data migration. How to perform the migration? For each repository, we need to fetch all the LFS data from the server, then switch the LFS configuration for this repository to use the S3 bucket, and push all the data back to the server, which will store it in the S3 bucket (see the sketch below). I would not recommend doing this unless necessary.
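For the record, here is a rough sketch of what the per-repository migration could look like with standard git-lfs commands. This assumes that origin points at the Gerrit master and that the server-side LFS backend for this repository is switched between the two steps:

# with the old fs backend still active, fetch every LFS object for all refs
$ git lfs fetch --all origin
# ...switch the server-side LFS configuration for this repository to s3...
# then push all local LFS objects back; the server now stores them in the bucket
$ git lfs push --all origin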

Migrating all the data in one run

To keep things simple, let’s assume that we are interested in replicating all repositories. In other words, we want to migrate all the data in one run. The easiest way to achieve this is to mount the S3 bucket on the local filesystem and move the data from the local Git LFS storage into the master S3 bucket. We will need root access to the master Gerrit machine in order to perform this operation. So, how do we achieve that?

Install and configure s3fs

The first step is to install s3fs. You can find instructions on how to do this for CentOS and RHEL in the s3fs-fuse project documentation.

Once you have s3fs installed, it is time to prepare the access permissions. We need the access key and the secret key separated by a colon (“:“), stored in a local ~/.passwd-s3fs file. The easiest way to prepare this file is to use the following command:

$ echo ACCESS_KEY:SECRET_KEY > ~/.passwd-s3fs

Please replace ACCESS_KEY and SECRET_KEY with your actual key values. The next step is to apply the proper permissions:

$ chmod 600 ~/.passwd-s3fs

After that we are ready to mount our master bucket.

Mounting the master bucket
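The mount point directory has to exist before we can mount anything on it; if it is not there yet, create it first (this post uses /opt/s3bucket):

$ sudo mkdir -p /opt/s3bucket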

We will use the following command:

$ s3fs gerrit-lfs-master /opt/s3bucket -o passwd_file=~/.passwd-s3fs

After executing this command, we can verify that the filesystem is mounted by executing the following:

$ mount | grep s3fs
s3fs on /opt/s3bucket type fuse.s3fs (rw,nosuid,nodev,relatime,…)

The output line displaying the mount info indicates that we have successfully mounted the AWS S3 bucket under the expected directory. Now that we have the filesystem mounted, we are ready to copy the data to the bucket. As everything is available locally, moving the Git LFS data is nothing more than copying the files from the LFS directory on the filesystem to the S3 bucket directory, which in our case is /opt/s3bucket.

The LFS data

Where is the Git LFS data located on the server filesystem? The directory in question is specified in the ~gerrit/etc/lfs.config file under the fs.directory option. For a standard TeamForge Git Integration setup this is:

/opt/collabnet/teamforge/var/scm/gerrit/lfs
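Since lfs.config follows the usual git-config syntax, you can also read the configured value directly on the server:

$ git config -f ~gerrit/etc/lfs.config fs.directory
/opt/collabnet/teamforge/var/scm/gerrit/lfs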

We now have everything we need to proceed. To make sure that our Git LFS data stays consistent while we copy it to the S3 bucket, it is now time to stop the Gerrit service on both master and slave. After those services are stopped, we are ready to copy the data.

The problem

But there is one glitch: Git LFS files are stored differently on the local server filesystem and in the S3 bucket. In both cases the file name corresponds to the sha256 of the original file. But on the local server filesystem, the LFS objects are structured in a tree. This tree has two intermediate nodes: the first directory is named after the first two characters of the sha256, and the second-level directory after the next two characters. The file itself is located under the second directory. For example, a file 61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f will be stored under the 61/b4/ directory. On the file system it looks like this:

61/b4/61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f

In the S3 bucket the hierarchy is flat, with no subdirectories. All we need is just the file name in the form of the sha256:

61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f
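You can convince yourself that the name really is the content hash by running sha256sum on any of the stored objects; the printed hash should match the file name:

$ sha256sum 61/b4/61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f
61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f  61/b4/61b4092eab8e3c783ba38a5eaef68467eb4f672d039937ce327364a518b81e8f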

The solution

Fortunately, the standard find command can help us with that. We want to search through all the subdirectories of the main Git LFS directory. First, we change into the Git LFS data location:

$ cd /opt/collabnet/teamforge/var/scm/gerrit/lfs

All the files found underneath this directory need to be copied to the S3 bucket directory /opt/s3bucket/. This can be achieved by executing the following command:

$ find . -type f -execdir cp -i {} /opt/s3bucket/ \;

We search the Git LFS directory for all files (but not directories, hence the -type f option) and copy them to the destination directory, using a simple cp command in combination with the -execdir option. According to the find manual page:

The -execdir primary is identical to the -exec primary with the exception that utility will be executed from the directory that holds the current file. The filename substituted for the string "{}" is not qualified.

That way we get the result we are interested in with a single Unix command. By the way, depending on the size of your LFS data, this command might take some time. Also, please note that I am being a bit paranoid here, as I use the cp command with the -i option. This option causes the command to write a prompt to the standard error output before copying a file that would overwrite an existing file. Of course, that should not happen, and it would be a clear indication that something went wrong.
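A quick sanity check after the copy is to compare the object counts on both sides. Assuming the bucket contains nothing but the migrated LFS objects, the two numbers should match:

$ find . -type f | wc -l
$ ls /opt/s3bucket/ | wc -l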

Clean up after moving the data

After the data is copied over, we can safely unmount the S3 bucket using the umount command:

$ umount /opt/s3bucket

That’s all: we have just migrated our data successfully and can proceed with the re-configuration of our Gerrit instances.
Note that after a successful migration, the LFS data on the server filesystem is no longer needed, and one can safely move the content of the Git LFS directory to a backup location. Of course, I recommend doing this only after successfully verifying that the new setup with the S3 bucket is configured and working properly.
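For example, something along these lines (the backup location here is just an illustration, pick whatever fits your environment):

$ mkdir -p /backup/gerrit-lfs
$ mv /opt/collabnet/teamforge/var/scm/gerrit/lfs/* /backup/gerrit-lfs/

So, now it is time to reconfigure the master.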

Changing Gerrit LFS backend configuration from filesystem to S3 Bucket

We will need to modify the lfs.config files on both master and replica.

Modifying lfs.config on the master

We will modify the lfs.config file that is located in the ~gerrit/etc directory on the master. We change storage.backend from fs to s3 and set up all the required entries in the s3 section: region, bucket, accessKey and secretKey. Note that since we are no longer using the fs backend, we can leave the options related to it unchanged; that might simplify the process of reverting our changes in case we want to switch back. After our modification, the relevant file fragment will look like this:

[storage]
        backend = s3
[s3]
        region = us-west-2
        bucket = gerrit-lfs-master
        accessKey = <master-access-key>
        secretKey = <master-secret-key>

Modifying lfs.config on the replica

In a similar way, we modify the lfs.config file on the replica machine, so that it points to gerrit-lfs-replica bucket:

[storage]
        backend = s3
[s3]
        region = ap-south-1
        bucket = gerrit-lfs-replica
        accessKey = <replica-access-key>
        secretKey = <replica-secret-key>

In both cases we need to put the corresponding accessKey and secretKey for master and replica respectively, in place of the placeholders marked with angle brackets above. And that’s all, we are done. It is time to start both the master and the replica servers.
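If you want to double-check the keys at this point, the AWS CLI can list the master bucket directly (this assumes the CLI is installed and configured with the same master access key pair):

$ aws s3 ls s3://gerrit-lfs-master

It should list the objects we migrated earlier.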

After both master and replica Gerrit servers are successfully started, we can have a look at the client configuration.

Git LFS Configuration on the client

The first thing to do is to make sure that we have reasonably new git and git-lfs client software installed. I don’t have any hard requirements here, but for the record: I used git version 2.18.0 and git-lfs version 2.4.2 to test my setup.
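You can quickly check what is installed with:

$ git --version
$ git lfs version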

Now we configure git-lfs on the client to access the appropriate server. There are two cases here.

Accessing master

If the client is geographically close to the Gerrit master server, we want to fetch from and push to the master. In this case everything is already set up and no additional configuration changes are necessary: everything will work out of the box.

Accessing replica

Things get more interesting if you want to clone from the replica. In this case we need to tell the Git LFS client extension to fetch the Git LFS data from the replica and push it to the master. The required modifications have to be applied to the local git config file, .git/config, of each project that uses git-lfs. Alternatively, one could set it globally, for all projects, with the help of the --global option of the git config command.
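For a single repository, the corresponding commands, executed inside the working copy, would look like this (using the hostnames and the lfs_test project from this post):

$ git config lfs.url https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test.git/info/lfs
$ git config lfs.pushurl https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs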

Accessing replica using SSH

Let’s assume that someone wants to clone a replicated repository using SSH. Let’s have a look at the clone command available in the TeamForge UI:

git clone -c 'lfs.url=https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs' ssh://admin@gerritreplica.aws.collab.net:29418/lfs_test && cd "lfs_test" && git config user.name "TeamForge Administrator" && git config user.email "root@gerritmaster.aws.collab.net" && git config url."ssh://gerritreplica.aws.collab.net:29418".insteadOf "ssh://gerritmaster.aws.collab.net:29418" && git config url."ssh://admin@gerritreplica.aws.collab.net:29418".insteadOf "ssh://admin@gerritmaster.aws.collab.net:29418" && git config url."ssh://admin@gerritmaster.aws.collab.net:29418".pushInsteadOf "ssh://admin@gerritreplica.aws.collab.net:29418" && scp -P 29418 admin@gerritreplica.aws.collab.net:hooks/commit-msg .git/hooks/

Please have a look at the lfs.url option at the beginning of the command. It tells the Git LFS extension on the client to talk to the Gerrit master server for both push and fetch operations. While this works without problems, it is not what we want: it will not use any of the data that is replicated to the S3 bucket in the replica location. How do we fix this? We want to set lfs.url to point to the replica and lfs.pushurl to point to the master.

After the modification the clone command will look like this:

git clone -c 'lfs.pushurl=https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs' -c 'lfs.url=https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test.git/info/lfs' ssh://admin@gerritreplica.aws.collab.net:29418/lfs_test && cd "lfs_test" && git config user.name "TeamForge Administrator" && git config user.email "root@gerritmaster.aws.collab.net" && git config url."ssh://gerritreplica.aws.collab.net:29418".insteadOf "ssh://gerritmaster.aws.collab.net:29418" && git config url."ssh://admin@gerritreplica.aws.collab.net:29418".insteadOf "ssh://admin@gerritmaster.aws.collab.net:29418" && git config url."ssh://admin@gerritmaster.aws.collab.net:29418".pushInsteadOf "ssh://admin@gerritreplica.aws.collab.net:29418" && scp -P 29418 admin@gerritreplica.aws.collab.net:hooks/commit-msg .git/hooks/

After those modifications, our fetch and push operations will work as expected: data will be pushed to the master and fetched from the replica, for both Git and Git LFS content.

Accessing replica using HTTPS

Now, what about the HTTPS protocol? Please have a look at the already modified command (the modified part is the pair of lfs.pushurl and lfs.url options at the beginning):

git clone -c 'lfs.pushurl=https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs' -c 'lfs.url=https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test.git/info/lfs' https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test && cd "lfs_test" && git config user.name "TeamForge Administrator" && git config user.email "root@gerritmaster.aws.collab.net" && git config url."https://gerritreplica.aws.collab.net/gerrit".insteadOf "https://gerritmaster.aws.collab.net/gerrit" && git config url."https://admin@gerritreplica.aws.collab.net/gerrit".insteadOf "https://admin@gerritmaster.aws.collab.net/gerrit" && git config url."https://admin@gerritmaster.aws.collab.net/gerrit".pushInsteadOf "https://admin@gerritreplica.aws.collab.net/gerrit" && curl -o .git/hooks/commit-msg https://admin@gerritreplica.aws.collab.net/gerrit/tools/hooks/commit-msg && chmod +x .git/hooks/commit-msg

As you can see, these options look exactly the same as in the case of the SSH protocol. That’s correct: the entries are identical for both protocols. This is because regardless of the protocol used by the git clone command, Git LFS will always use HTTPS.

With the command presented above we are able to clone over HTTPS. However, that’s not everything: pushing Git LFS files will still not work. Why? Well, this is a bit tricky. The problem is in one of the lines that contain the insteadOf keyword, which tells Git which server it should push to and fetch from. One of those lines teaches git to use the replica instead of the master server. And because of this line, our push to the master will actually end up at the replica for Git LFS files. Obviously, this will not work.

To overcome this problem, we need to add an extra option to the .git/config file:

[url "https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs"]
insteadOf = https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs

It teaches git to use gerritmaster instead of gerritmaster when referencing the Git LFS URL. While this looks strange, it is actually what we want: the sole purpose of this entry is to shield the Git LFS URL from the other substitution. The trick works because insteadOf is always applied to the longest matching prefix, so our Git LFS URL will keep pointing to the master as intended.
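If you prefer not to edit .git/config by hand, the same entry can be added with a single git config call:

$ git config url."https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs".insteadOf https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs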

Verifying the setup

We can verify the setup by looking into the .git/config file within the cloned lfs_test repository. It should contain the following lines:

[lfs]
  url = https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test.git/info/lfs
  pushurl = https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs

Again, this will be the same for clones over SSH and HTTPS. As mentioned before, Git LFS always uses HTTPS, regardless of the protocol used by git clone. And again, in the case of the HTTPS protocol one more option is needed:

[url "https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs"]
insteadOf = https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs
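In addition to inspecting .git/config, git lfs env prints the endpoints the LFS client has actually resolved, which makes it easy to spot a substitution gone wrong. In our setup the relevant lines should look roughly like this (the exact output format depends on your git-lfs version):

$ git lfs env | grep Endpoint
Endpoint=https://admin@gerritreplica.aws.collab.net/gerrit/lfs_test.git/info/lfs (auth=basic)
Endpoint (push)=https://admin@gerritmaster.aws.collab.net/gerrit/lfs_test.git/info/lfs (auth=basic)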

That’s all

We have configured our replication properly, so that it works for both Git and Git LFS data. The good thing about this solution is that, once configured, it works transparently with no further effort. The data will be pushed to the master and fetched from the replica as expected, and AWS S3 Cross Region Replication will automatically replicate all the Git LFS data from the master to the replica.

Obviously, the not-so-nice part of this solution is that it requires modifications on the client side every time we clone from the replica server. Although the configuration is a bit cumbersome, I think the advantages of having Git LFS data replicated between different geographical locations easily make it worth the effort. What’s your view? Do you have any experience with Git LFS replication? Do you have any comments, advice or suggestions on this topic? Please let me know what you think.

About the Author

Eryk Szymanski

Eryk is CollabNet’s Development Manager leading Git and Gerrit related development efforts. He has over 20 years of engineering and management experience ranging from start-ups to medium-size enterprises. Eryk holds a Master’s degree in Computer Science and is a Certified Scrum Master.
