You probably saw the news last week that researchers at Google had found a scenario where they were able to break the SHA1 algorithm by creating two PDF files with differing content that produced the same hash. If you are following this story, you may have also seen that the WebKit Subversion repository had problems after a user committed these example files so that they could be used in test cases for SHA1 collisions.
This post will try to explain the problem this caused and what prevention and remediation options are available if you run into it. The quick summary, if you do not want to read the entire post, is that the problem is really not that bad. You are not going to run into it in normal usage, and if you do, there are ways to resolve it. There will also likely be future updates to Subversion that avoid it entirely, so if you regularly update your server and client when new releases come out, you are probably safe doing nothing and just waiting for an update.
First off, let’s describe the problem and what the expected behavior would be if there were no problems. The problem happens when two files that have differing content but the same SHA1 content hash are committed to a repository. They do not have to be committed in the same transaction. In the case of WebKit, a developer committed the two example files that Google produced, but there are now websites that will generate other examples. However, it is important to note that the files generated by all of these sites share a common byte sequence with the original files, so it is easy to detect them and prevent the commit with a hook script, which I will discuss later.
Back to the problem: the expected behavior is that these two files could be added to a repository like any other files, and you could get them back out later. Bytes in, bytes out, just like any other file you version, and that is certainly what the WebKit team was expecting here.
The problem is caused by a feature that was added in Subversion 1.6 called “Representation Sharing”. When Subversion stores the data for a commit, the bytes for each file are eventually written to disk as a representation. For a newly added file this is essentially the content of the file, but a representation can also be the binary delta that is stored for an update to an existing file. In any case, before the data is written to disk, Subversion calculates the SHA1 of the bytes and then consults a small SQLite database to see if it has stored these bytes before. If it has, then instead of storing the bytes again, it just stores a pointer to the data it already has. The idea, obviously, is to save disk space. Think of a repository that has the same Java JAR file stored 100 times in different projects. This feature causes it to store only a single copy of that file, which saves a decent amount of disk space.
Unfortunately, when two “different” files have the same SHA1, the first file is stored to disk but the second file is effectively lost, because the only thing stored for it is a pointer to the first file.
The problem manifests later, when someone tries to fetch the second file during a checkout, update, etc. The file content in the repository points to the first file, which has different content, but other metadata was also stored in the repository, including the size and a different checksum for the full file data. When the server sends the data to the client, it also sends the expected file size and checksum for the fully reassembled file. When the client receives all of this data, the checksum does not match, so it throws an error to let you know.
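To make the failure concrete, here is a minimal Python sketch of a store that deduplicates by content hash, in the spirit of representation sharing. It is not Subversion's actual code: `RepStore` and `weak_hash` are hypothetical names, and because a real SHA1 collision needs specially crafted files, the sketch swaps in a deliberately weak hash (the length of the data) so that a collision is trivial to produce.

```python
from typing import Callable, Dict


class RepStore:
    """Toy store that shares representations by hash (hypothetical, not SVN code)."""

    def __init__(self, hash_fn: Callable[[bytes], str]) -> None:
        self.hash_fn = hash_fn
        self.reps: Dict[str, bytes] = {}   # hash -> bytes written "to disk"
        self.files: Dict[str, str] = {}    # path -> hash, i.e. the "pointer"

    def commit(self, path: str, data: bytes) -> None:
        key = self.hash_fn(data)
        # Write the bytes only if this hash has never been seen before;
        # otherwise just point at the representation already stored.
        self.reps.setdefault(key, data)
        self.files[path] = key

    def fetch(self, path: str) -> bytes:
        return self.reps[self.files[path]]


def weak_hash(data: bytes) -> str:
    # Deliberately weak stand-in for SHA1: inputs of equal length collide.
    return str(len(data))


store = RepStore(weak_hash)
store.commit("a.pdf", b"file one")
store.commit("b.pdf", b"file two")   # same length, so the same "hash"

print(store.fetch("b.pdf"))          # prints b'file one'
```

The second commit succeeds without complaint, yet fetching `b.pdf` returns the bytes of `a.pdf`, which is the loss described above.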
svn: E200014: Checksum mismatch for 'shattered-2.pdf':
Obviously SHA1 is not being used for these internal checksums (it is MD5) or else they would have matched here as well. I believe if the file sizes were different then that would also have triggered an error here even if these checksums matched.
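The client-side check can be sketched roughly as follows. This is an illustration under assumptions, not the Subversion client's code: `verify_fulltext` and `ChecksumMismatch` are hypothetical names, and the size check reflects my belief stated above rather than confirmed behavior.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when received data does not match the recorded metadata."""


def verify_fulltext(path: str, data: bytes, expected_md5: str, expected_size: int) -> None:
    # Assumed check: the reassembled file must have the recorded size.
    if len(data) != expected_size:
        raise ChecksumMismatch(f"Size mismatch for '{path}'")
    # The full-text checksum is MD5, so a SHA1 collision does not fool it.
    actual = hashlib.md5(data).hexdigest()
    if actual != expected_md5:
        raise ChecksumMismatch(
            f"Checksum mismatch for '{path}': expected {expected_md5}, actual {actual}"
        )


# Metadata recorded for the second file, but the bytes of the first arrive.
first, second = b"content of file 1", b"content of file 2"
try:
    verify_fulltext("second.pdf", first, hashlib.md5(second).hexdigest(), len(second))
    outcome = "ok"
except ChecksumMismatch as exc:
    outcome = str(exc)
print(outcome)
```

Because the metadata belongs to the second file while the bytes belong to the first, the MD5 comparison fails even though the sizes agree.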
The solution to this problem in Subversion itself will be to fix the representation sharing feature so that it recognizes this scenario as two different files and does not try to share the content. This also leads us to the first preventative option available: disable this feature for your repository. The only cost is that you will not benefit from the disk space savings it provides. You can disable the feature at any time (prior to the problem happening). Disabling it has no impact on existing data in the repository that already uses it. Only future commits are affected, in that they will not bother looking for the space savings. To disable this feature, edit the “db/fsfs.conf” file in the repository's folder on the server. This is a plain text file with an INI-style format, and you would add this:
[rep-sharing]
enable-rep-sharing = false
Normally, the file will already contain this section and it will contain a lot of comments about the feature with the setting commented out so that it uses the default setting. Remember though this feature was added in Subversion 1.6, so if you created your repository with an earlier version and never updated it, then you already do NOT have this feature and do not need to do anything.
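If you manage many repositories, a small script can report which ones still have the feature turned on. The following is a sketch, assuming the setting lives in a section named `[rep-sharing]` as in a stock fsfs.conf, and that an absent or commented-out setting means the default (enabled) applies. The function name and sample text are my own, and the script cannot tell whether a repository predates 1.6 and therefore lacks the feature altogether.

```python
import configparser

# Made-up sample resembling the relevant part of db/fsfs.conf.
SAMPLE_FSFS_CONF = """\
[rep-sharing]
### Comment lines like this one are ignored by the parser.
# enable-rep-sharing = true
enable-rep-sharing = false
"""


def rep_sharing_enabled(conf_text: str) -> bool:
    """Return True unless the config text explicitly disables rep-sharing."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # A missing section or option falls back to the assumed default: enabled.
    return parser.getboolean("rep-sharing", "enable-rep-sharing", fallback=True)


print(rep_sharing_enabled(SAMPLE_FSFS_CONF))   # prints False
```

In practice you would read the text from each repository's db/fsfs.conf instead of using the sample string.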
Another option exists to prevent the problem, and that is to install a pre-commit hook to prevent committing these files. The hook script works by examining every file in the commit and looking for a “signature byte sequence” that all of the currently known examples will have in common. In other words, no matter what the filename is, or what site you have used to generate a PDF that demonstrates this problem, it will share the same “signature” as the original Google example. There will likely be new examples someday so this will not work forever. Personally, I would not recommend this hook script for most people. The cost of running it will be pretty high and slow down all commits, particularly really large commits of lots of files. As you will see in the rest of the article, even if you wind up having a repository run into this problem because someone commits these files, there are solutions available. So assuming you have control over who commits to your repositories, I would rather just educate my users and deal with the problem if it happens. Or just disable the representation sharing feature which would be a better solution that will work for all files, including in the future.
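For illustration, the general shape of such a hook might look like the sketch below. It is not the actual hook script. `SIGNATURE` is a hypothetical placeholder rather than the real byte sequence, and the function names are mine. The only external pieces relied on are `svnlook changed` and `svnlook cat` with the `-t` transaction option, plus the fact that a pre-commit hook receives the repository path and transaction name as its arguments.

```python
import subprocess
import sys
from typing import List

# HYPOTHETICAL placeholder. Replace with the real byte sequence shared by the
# known collision files; this value is NOT the actual signature.
SIGNATURE = b"EXAMPLE-COLLISION-SIGNATURE"


def contains_signature(data: bytes, signature: bytes = SIGNATURE) -> bool:
    return signature in data


def changed_files(svnlook_changed_output: str) -> List[str]:
    """Return paths of added or content-modified files from `svnlook changed`."""
    paths = []
    for line in svnlook_changed_output.splitlines():
        status, path = line[:4].strip(), line[4:]
        # Skip deletions, property-only changes ("_U") and directories.
        if status and status[0] in "AU" and not path.endswith("/"):
            paths.append(path)
    return paths


def main(argv: List[str]) -> int:
    repos, txn = argv[1], argv[2]
    changed = subprocess.run(
        ["svnlook", "changed", "-t", txn, repos],
        check=True, capture_output=True, text=True,
    ).stdout
    for path in changed_files(changed):
        # Reads the whole file into memory: simple, but costly on big commits.
        data = subprocess.run(
            ["svnlook", "cat", "-t", txn, repos, path],
            check=True, capture_output=True,
        ).stdout
        if contains_signature(data):
            sys.stderr.write(f"Rejected: '{path}' matches a known collision signature\n")
            return 1
    return 0

# When installed as hooks/pre-commit, finish with: sys.exit(main(sys.argv))
```

Every added or modified file is read in full on every commit, which is where the performance cost mentioned above comes from.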
So these first two options talk about how to PREVENT the problem. Now let’s talk about what to do once you have the problem. To do that, let’s first talk about the scope of the problem. The only impact is that you cannot check out or update the problematic file. So if you have a repository with lots of projects in it, only the projects or folders containing these files are affected.
One solution is just to delete the second file. This will resolve this problem for normal SVN client usage, but it will not work for tools like svnsync or git-svn which try to replay every transaction in the repository. A normal SVN checkout/update is only requesting the HEAD revision. If you delete the files they no longer exist in the HEAD revision so this will work fine. But tools like svnsync fetch every revision one at a time and replay them. This will run into an error on the revision where the file was committed and not be able to proceed.
A second solution would be to remove the revision. To do that, you use svnadmin dump to dump your repository up to, but not including, the revision that introduced the problem, and then load this dump file into a new repository. This can be hard to do if you have a really big repository, where it will create too much downtime. Also, if more commits have been made after the problematic revision, it is a bit more difficult to dump and load all of the subsequent revisions.
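A rough sketch of that dump and load step, driven from Python, is shown below. It assumes the dump should stop at the last revision before the problematic one, and it does not attempt to carry over any later revisions. The function names and paths are hypothetical. The `svnadmin create`, `dump` and `load` subcommands and the `-r` and `--quiet` options are real.

```python
import subprocess
from typing import List


def dump_command(old_repo: str, bad_rev: int) -> List[str]:
    # Dump revision 0 through the last revision before the problematic one.
    return ["svnadmin", "dump", "--quiet", "-r", f"0:{bad_rev - 1}", old_repo]


def load_command(new_repo: str) -> List[str]:
    return ["svnadmin", "load", "--quiet", new_repo]


def rebuild(old_repo: str, new_repo: str, bad_rev: int) -> None:
    """Create new_repo and stream the truncated history of old_repo into it."""
    subprocess.run(["svnadmin", "create", new_repo], check=True)
    dump = subprocess.Popen(dump_command(old_repo, bad_rev), stdout=subprocess.PIPE)
    try:
        # Pipe the dump stream straight into the load, avoiding a temp file.
        subprocess.run(load_command(new_repo), stdin=dump.stdout, check=True)
    finally:
        dump.stdout.close()
    if dump.wait() != 0:
        raise RuntimeError("svnadmin dump failed")
```

Calling `rebuild("/srv/svn/old", "/srv/svn/new", 100)` would, under these assumptions, produce a new repository containing revisions 0 through 99.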
Another option is the one the WebKit repository used, and that is to create a Subversion permission rule (authz) that blocks access to the file(s). This will work with tools like svnsync and git-svn, as the server will not even try sending them the bad file. You could use this option to permanently fix the problem by adding the authz rule and then using svnsync to transfer the entire history to a new repository. The repository can be active while all this happens, and you can then do a quick cutover to the new repository at a convenient time. So even if it takes weeks to sync and schedule the cutover, this gives you both a short term and a long term solution to the problem. As a side note: if you ever replace a repository with a new version using the same name and you serve the repository via Apache, make sure you restart Apache after you do this. Subversion has file caches in memory that will give you weird errors if you do not do this.
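As a sketch of what such a rule can look like, the helper below renders authz stanzas that deny everyone access to a list of paths. The helper name and the example path are hypothetical, and this is not necessarily the rule the repository above used. It relies only on the standard path-based authz syntax, where a `[/path]` section followed by `* =` grants no access to any user.

```python
from typing import Iterable


def deny_rules(repo_paths: Iterable[str]) -> str:
    """Render authz stanzas that block all access to the given paths."""
    stanzas = []
    for path in repo_paths:
        # "* =" with an empty right-hand side means no access for anyone.
        stanzas.append(f"[{path}]\n* =\n")
    return "\n".join(stanzas)


# Hypothetical location of the problematic file.
print(deny_rules(["/trunk/tests/shattered-2.pdf"]))
```

The output would be appended to the authz file the server already uses. How it interacts with other rules in that file depends on your existing configuration.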
To summarize so far, there are options to prevent the problem. If the disk space savings offered by representation sharing are not your main concern, then disabling the feature is the best solution and will absolutely prevent problems in your repository. If you need the disk savings, a hook script exists to prevent the problem. Considering there will likely be an SVN update in the future that prevents the problem, you might also safely assume you will never have it and instead focus on remediation should you happen to run into it before it is fixed. If that happens, deleting the file from HEAD will give quick relief in many situations, but an authz rule can provide both short and long term relief and is likely the best option to resolve the problem.
Finally, there is still another problem scenario to be aware of and that is on the SVN client. Suppose you decide to turn off representation sharing in the repository. When this happens, users can commit files with duplicate hashes and they will be stored correctly in the repository. However, the SVN client since the 1.7 release has had a working copy format that stores the pristine copy of each file using its SHA1 identifier. When a user checks out these files, their working copy will only store a single copy of the file. So you will have two identical files in your working copy and will not get back the files you expected. If you are trying to run tests using these files, they obviously will not work or test what you are expecting. It is possible there will be a future release of Subversion that will provide options to handle this, but today it would not work if you tried to have a single working copy with two files that have the same SHA1. You will not see any errors, but you will not have the files you expected to have. This scenario was known when the feature was designed and at the time it was decided that it would only impact security researchers, such as the ones in this scenario, and that it was not going to be the priority to try and solve it for them. If the intent for the user is security research then there is a simple solution to this problem. Store your research files in a zip file and version the zip file. Have your test cases extract the files from the zip file when running the tests and the problem is solved.
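The zip workaround can be sketched as follows. The helper names are hypothetical, and the file contents are short stand-ins rather than real colliding PDFs. The point is only that the archive is versioned as a single file and the individual files are extracted at test time.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict, List


def pack(files: Dict[str, bytes], archive: Path) -> None:
    """Write the research files into one zip archive (the file you version)."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)


def unpack(archive: Path, dest: Path) -> List[Path]:
    """Extract the archive during test setup and return the extracted paths."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return [dest / name for name in zf.namelist()]


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "collision-samples.zip"
    pack({"shattered-1.pdf": b"stand-in A", "shattered-2.pdf": b"stand-in B"}, archive)
    extracted = unpack(archive, Path(tmp) / "work")
    contents = [p.read_bytes() for p in extracted]

print(contents)   # prints [b'stand-in A', b'stand-in B']
```

Since the working copy only ever sees the archive, its SHA1-keyed pristine store holds one entry for the zip and never has to tell the two inner files apart.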
These interpretations of the problem and its severity are obviously my own. I have tried to explain the problem, how to prevent it, and how to resolve it as accurately as possible, and to make it clear where I am giving my opinion. If you have follow-up questions or are interested in how this will be fixed in future versions of Subversion, my preference would be to have those conversations on the mailing lists of the Subversion community at Apache. Those mailing lists are the canonical source of information on this problem and any approaches to resolving it. All discussion about this problem should happen there.
2017-02-28: Hyrum Wright posted an article on some of the history behind the representation sharing feature.