Deduplicate git object database (git repos)

This task will require understanding git internals. If you don't know what git objects are, you can read https://git-scm.com/book/en/v2/Git-Internals-Git-Objects.
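
As a quick refresher on why deduplication is possible at all: every git object is content-addressed, so its id is the SHA-1 of a short header plus the raw content, and identical objects in different repos always get the same id. A minimal Python illustration (not part of the required deliverable):

# Compute a blob's object id the way git does: SHA-1 of "<type> <size>\0<content>".
# Identical content in different repos yields the same id, which is what makes
# deduplicating objects across repositories possible.
import hashlib
import zlib

content = b"hello world\n"
store = b"blob %d\x00" % len(content) + content

print(hashlib.sha1(store).hexdigest())  # matches `git hash-object` on the same content
loose = zlib.compress(store)            # loose objects are stored zlib-compressed on disk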

I have a large number of bare git repositories (potentially 10,000+) and I need to deduplicate across them. The input will be a number of bare git repos. I want a program which combines the repos into a single database file, or splits a database file back into bare git repos.
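
Roughly, the combine direction amounts to pooling objects by id and keeping a per-repo manifest so the original layout can be rebuilt. Here is a rough Python sketch covering loose objects only (the pool layout and manifest format are invented for illustration; packfiles and the bitwise-identical round trip are where the real work is):

# Sketch: pool loose objects from many bare repos into one directory,
# deduplicating by object id, and record which repo had which objects.
import json
import os
import shutil
import sys

def combine(repo_dirs, pool_dir, manifest_path):
    os.makedirs(pool_dir, exist_ok=True)
    manifest = {}
    for repo in repo_dirs:
        oids = []
        obj_root = os.path.join(repo, "objects")
        for fan in os.listdir(obj_root):
            fan_dir = os.path.join(obj_root, fan)
            if len(fan) != 2 or not os.path.isdir(fan_dir):
                continue  # skip objects/pack and objects/info in this sketch
            for name in os.listdir(fan_dir):
                oid = fan + name
                dest = os.path.join(pool_dir, oid)
                if not os.path.exists(dest):  # store each unique object once
                    shutil.copy2(os.path.join(fan_dir, name), dest)
                oids.append(oid)
        manifest[repo] = oids
    with open(manifest_path, "w") as f:
        json.dump(manifest, f)

if __name__ == "__main__":
    combine(sys.argv[1:], "pool", "manifest.json")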

Requirements

  • Must split back into files that are bitwise-identical to the original inputs. I have a large (2 million repo) set I'm testing with, so I'll turn up any missed cases.
  • Needs to preserve ALL information--including objects that aren't reachable from commits (see the enumeration sketch after this list)
  • Needs to preserve ALL information--including stuff in bare repos that isn't a git object. It is okay to just bundle up all the other files, although bonus points for deduplicating these or leaving out the default versions.
  • Must deduplicate git objects (download a few forks of something on github if you want to check)
  • Must run on Linux, preferably as a command-line script/program
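
On the "ALL information" requirements: unreachable objects still have to come along, and so do the plain files in a bare repo (HEAD, config, packed-refs, refs/, hooks/, and so on). A Python sketch of how both could be enumerated, using standard git plumbing (git cat-file --batch-all-objects lists every object in the database, packed or loose, reachable or not):

# List every object in a bare repo plus the non-object files that also need preserving.
import os
import subprocess
import sys

def all_objects(repo):
    out = subprocess.run(
        ["git", "--git-dir", repo, "cat-file",
         "--batch-check", "--batch-all-objects"],
        capture_output=True, text=True, check=True).stdout
    # each line looks like: "<oid> <type> <size>"
    return [line.split()[0] for line in out.splitlines()]

def non_object_files(repo):
    for root, _dirs, files in os.walk(repo):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), repo)
            if not rel.startswith("objects" + os.sep):
                yield rel  # HEAD, config, packed-refs, refs/..., hooks/..., etc.

if __name__ == "__main__":
    repo = sys.argv[1]
    print(len(all_objects(repo)), "objects in", repo)
    for path in sorted(non_object_files(repo)):
        print("non-object file:", path)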

For delivery, just post open-source anywhere.

You can use this program: https://github.com/robinst/git-merge-repos
Hasan Bayat 1 month ago
How do you want objects that are packed into packfiles to be dealt with? What about deltified ones?
CyteBode 1 month ago
CyteBode--the original packfiles need to be recreated exactly, but I don't care how. I want a bitwise-identical set of bare repo files, not bitwise-identical objects. The solution should deduplicate things in packfiles, too. If there are two different delta encodings of the same object... hmm, I guess you'll end up needing to store which objects each delta is between so you can recreate all but one of them? A solution that keeps all the deltas is better than no solution, but I don't know whether packfiles tend to pick the same deltas, and I'm worried about space there.
vanceza 1 month ago
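
For anyone probing the delta question: git verify-pack -v lists every object in a pack along with its delta base, so you can check whether two forks' packfiles actually chose the same deltas. A small Python sketch (the seven-field lines are verify-pack's output format for deltified objects; the repo path is a placeholder):

# Print which objects in a bare repo's packfiles are stored as deltas, and against what.
import glob
import subprocess
import sys

def delta_pairs(repo):
    pairs = []
    for idx in glob.glob(repo + "/objects/pack/*.idx"):
        out = subprocess.run(["git", "verify-pack", "-v", idx],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # deltified entries: oid type size size-in-pack offset depth base-oid
            if len(fields) == 7:
                pairs.append((fields[0], fields[6]))
    return pairs

if __name__ == "__main__":
    for oid, base in delta_pairs(sys.argv[1]):
        print(oid, "is a delta against", base)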
Hasan--that is a vaguely related thing but it's not a solution at all. It can't recreate the original. Also, it doesn't preserve all objects.
vanceza 1 month ago
How are things in terms of storage space? You mentioned having 2 million repositories. If their sizes average 1MB, that would amount to a whopping 2TB of storage. How much free space do you have to work with? Also, can the original repositories be modified and/or deleted?
CyteBode 1 month ago
Plenty of swap space available--at least 3X the original size. Don't worry about deleting/modifying the originals--I'll be running this on a copy and deleting the copy immediately after running it. Also, if it helps, the machine running this has around 32GB of RAM available.
vanceza 1 month ago
Okay, and one last question: Have you tried using some general-purpose deduplication software such as exDupe beforehand? I feel like despite the lack of domain-specific knowledge regarding git objects, it's likely to be more efficient than anything a code monkey could come up with for fifty bucks.
CyteBode 1 month ago
No, I only tried compressors that look a fixed distance back. exDupe/rzip-type tools are a good suggestion--feel free to add it as a solution. If it runs and reduces size, I'll accept (unless someone submits a solution that meets the original requirements and beats it).
vanceza 1 month ago
Right, conventional compression formats are pretty horrible at deduplicating files when they're spread far apart. exDupe works much better in that respect.
CyteBode 1 month ago
awarded to CyteBode


2 Solutions


hi,

I know this isn't what the specs ask for, but it's a nice thing to try.

It duplicates everything in repos. (written in PHP)

link: https://github.com/myclabs/DeepCopy

This isn't even for git repos. It's for deep-cloning PHP objects.
vanceza 1 month ago
Winning solution

There's a very good tool called eXdupe that accomplishes deduplication and compression for general use. Since loose git objects exist as files already, this would do the job nicely. It's also really fast.

Not only does it efficiently deduplicate bitwise-identical files, it also partially deduplicates files that are slightly different but otherwise share the same data. As such, it may help with packfiles, at least to some extent.

It's available for Linux, although only for 64-bit. You mentioned having 32GB of RAM on your machine, so you likely have a 64-bit OS, unless you're using PAE on 32-bit. In that case, it might be possible to compile eXdupe from source, but I haven't tried.

Download this file: http://www.quicklz.com/exdupe/exdupe-050-linux-x64.tar, untar it (tar -xvf exdupe-050-linux-x64.tar), and run it as such:

./exdupe /path/to/all/your/repos/ database.full

And then for decompression:

./exdupe -R database.full /path/to/where/you/want/the/repos/

It also does differential backup, as shown in the first example on the homepage.

Edit: You should also take a look at the -gn switch, which allows setting the amount of RAM used by the hash table. The recommendation is to use 1GB of RAM per TB of data.
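
For the dataset discussed in the comments (roughly 2TB if 2 million repos average 1MB each), that guideline works out to about 2GB for the hash table. Assuming the n in -gn is gigabytes (an assumption based on the description above, not the tool's documentation), the invocation would look something like:

./exdupe -g2 /path/to/all/your/repos/ database.full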

Ran it on a 100K repo set--it knocks 45% off the size and 50% is the theoretical limit here. Works great!
vanceza 1 month ago