Git knapsack; dealing with commit size constraints
01 Jul 2023
Many commercial git servers have limitations on the size of files, commits and pushes to repositories. Typically they enforce file size limits of around 25-100 MB.
When I tried to add an HLS-segmented video to a GitHub Pages website, I hit this limit. None of my files exceeded GitHub's maximum file size, but combining them all in a single commit and push did exceed the limit. The solution is clear: the files need to be added using multiple commits and pushes. (In many cases, Git LFS is the better solution.)
python3 -m pip install git-knapsack
git knapsack
git knapsack is a simple script that goes over all untracked files and packs them into commits until the limit (currently 30 MB) is reached. It then pushes the changes and continues with the next batch.
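The packing step can be sketched on its own, separate from any git calls. The following is a minimal illustration and not the code that ships in the package: the function name `pack_into_batches` and the `(name, size)` input format are assumptions made for this example. It keeps the same naive approach of taking files in the order they arrive.

```python
from typing import Iterable, Iterator, List, Tuple


def pack_into_batches(
    files: Iterable[Tuple[str, int]], limit: int
) -> Iterator[List[str]]:
    """Group (name, size) pairs into batches whose total size stays within limit.

    Files are taken in the order given (no sorting, no optimal packing).
    A single file larger than the limit still ends up in a batch of its own,
    because splitting files is out of scope here.
    """
    batch: List[str] = []
    batch_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the limit.
        if batch and batch_size + size > limit:
            yield batch
            batch, batch_size = [], 0
        batch.append(name)
        batch_size += size
    # Emit whatever is left over after the last file.
    if batch:
        yield batch


if __name__ == "__main__":
    # Hypothetical file names and sizes, with a limit of 30 units.
    example = [("a.ts", 20), ("b.ts", 15), ("c.ts", 10), ("d.ts", 40)]
    for group in pack_into_batches(example, limit=30):
        print(group)
```

Each yielded batch would correspond to one add, commit and push cycle.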
Here’s my initial version. There’s a lot to improve, but it gets the job done. Pull requests are always welcome in the bartbroere/git-knapsack repository.
"""
`git_knapsack.py`
Knapsack untracked files in a git repository in many commits, so that each commit does not exceed 30 MB.
Currently the knapsacking algorithm is extremely naive, and it does not expose any custom git features.
Effectively, this script performs git add, git commit and git push.
Note that it also will commit and push any untracked file.
If you run git status before this command and see anything you don't want committed,
either delete it or add it to the .gitignore file.
If any single file exceeds the git server's file limit or commit size limit, this script will not be able to help you.
The dependencies of this script are gitpython and tqdm.
"""
import os
from git import Repo
from tqdm import tqdm
repository = Repo(os.path.curdir)
untracked_files = repository.untracked_files
commit_size = 0
untracked_file_batch = []
for untracked_file in tqdm(untracked_files):
    current_file_size = os.stat(untracked_file).st_size
    # Keep commits below 30 MB; never commit an empty batch
    if untracked_file_batch and commit_size + current_file_size > 1024 ** 2 * 30:
        repository.index.add(untracked_file_batch)
        repository.index.commit("Knapsack into multiple commits")
        # For many hosts, pushing after each commit is required.
        # Not only the commit and file size can be limited,
        # but often also the size of a push over HTTPS has a size limit
        origin = repository.remote('origin')
        origin.push()
        untracked_file_batch = [untracked_file]  # reset the batch
        commit_size = current_file_size  # reset the commit size
    else:
        untracked_file_batch.append(untracked_file)
        commit_size += current_file_size
# Clean up any files left in the queue (skipped if there is nothing to commit)
if untracked_file_batch:
    repository.index.add(untracked_file_batch)
    repository.index.commit("Knapsack into multiple commits")
    origin = repository.remote('origin')
    origin.push()
If this script is installed as an executable named git-knapsack in a directory that is in the PATH variable, it is available system-wide as git knapsack.
This works because git, when it does not recognize a subcommand (like knapsack), looks for an executable named git-<subcommand>.
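To check whether git would be able to find such a subcommand, you can look up the executable name yourself. The sketch below relies on the `git-<name>` naming convention; the helper name `find_git_subcommand` is made up for this example. It only inspects PATH, so it says nothing about other places git itself might look.

```python
import shutil
from typing import Optional


def find_git_subcommand(name: str, path: Optional[str] = None) -> Optional[str]:
    """Return the location of the `git-<name>` executable, or None if not found.

    `path` is an optional search path string; when omitted, the PATH
    environment variable is used.
    """
    return shutil.which(f"git-{name}", path=path)


if __name__ == "__main__":
    location = find_git_subcommand("knapsack")
    print(location or "git-knapsack was not found on PATH")
```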
This is just a first draft. Some of the things to improve are:
- accepting command line arguments
- forwarding these arguments to add, commit and push respectively
- adding a command line argument to make the size configurable
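As a rough idea of what the configurable size could look like, here is a sketch using argparse from the standard library. None of these options exist in git-knapsack today: the flag names `--max-commit-size-mb`, `--message` and `--remote`, and the helper functions, are assumptions made for illustration. The defaults mirror the values hard-coded in the script above.

```python
import argparse
from typing import List, Optional


def parse_arguments(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hypothetical command line options for git knapsack."""
    parser = argparse.ArgumentParser(
        prog="git knapsack",
        description="Commit and push untracked files in size-limited batches.",
    )
    # Hypothetical flag: maximum total size of the files in one commit.
    parser.add_argument("--max-commit-size-mb", type=float, default=30.0)
    # Hypothetical flag: commit message, to be passed on to the commit step.
    parser.add_argument("--message", default="Knapsack into multiple commits")
    # Hypothetical flag: name of the remote, to be passed on to the push step.
    parser.add_argument("--remote", default="origin")
    arguments = parser.parse_args(argv)
    if arguments.max_commit_size_mb <= 0:
        parser.error("--max-commit-size-mb must be positive")
    return arguments


def size_limit_in_bytes(arguments: argparse.Namespace) -> int:
    """Convert the megabyte limit to bytes (1 MB = 1024 ** 2 bytes, as in the script)."""
    return int(arguments.max_commit_size_mb * 1024 ** 2)


if __name__ == "__main__":
    options = parse_arguments()
    print(size_limit_in_bytes(options), options.message, options.remote)
```

The byte limit would then replace the hard-coded `1024 ** 2 * 30` in the loop, and the message and remote would be passed on to the commit and push calls.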