git-annex

Set up two repositories

On computer one "ingegerdsdator"

mkdir annex
cd annex
git init
git-annex init "ingegerdsdator"
mv ../stuff-I-want-to-git-annex-to-manage .
git-annex add .

On computer two "hans-vita"

git clone ssh://ingegerdsdator/home/hans/annex ~/annex
cd ~/annex
git-annex init hans-vita
git remote add ingegerdsdator ssh://ingegerdsdator/home/hans/annex

The last line is needed because without it, ingegerdsdator is only known to hans-vita as "origin".

From now on it is assumed the working directory is ~/annex.

On computer one "ingegerdsdator"

git remote add hans-vita ssh://hans-vita/home/hans/annex
git-annex sync

Add a directory with some files to one of the repositories

At one computer, say ingegerdsdator, do

mkdir -p text/
cp -av ../text/weblog text
git-annex add .
git commit -a -m added
git-annex sync

Get those files to the other repository

At the other computer, hans-vita, do

git-annex sync
git-annex get text/*

Try it the other way around

Add a set of files at the other computer, hans-vita, by

mkdir text/code
cp ../text/code/GitAnnex.muse text/code/
git-annex add .
git-annex sync

And to get it to ingegerdsdator, do

git-annex sync
git-annex get .

Modify an annexed file

At computer A do

git-annex unlock text/code/GitAnnex.muse
## 1. Edit the file (which in this case happened to be the source for the page
## you are reading right now)
## 2. Save the new version
##
## Since locking and unlocking is a bit tedious, save, but do not commit until
## you know you won't edit again in a while. Delaying the commit is not harmful.
##
## 3. Commit when you're done (and not before).
git-annex add text/code/GitAnnex.muse
git-annex sync

Delete an annexed file

Deleting an annexed file has two possible meanings:

  1. Remove the contents of the file from this particular repository (e.g. to temporarily save space in this particular repository). When syncing, the file will not be deleted from other repositories.
  2. Delete the record of the file from git (e.g. because you will never need that file again). When syncing, the file will be deleted from all other repositories.

To do 1. issue:

git-annex drop path/to/the.file

When syncing, the other repositories will learn that this repository no longer holds a copy of that file.

To do 2, and get the file removed from all repositories:

git rm path/to/the.file
git commit -a -m "removed annoying file"
git-annex sync

When syncing, the other repositories will also delete the file.

Pull changes

To get the changes that were made at computer A to the file GitAnnex.muse (or actually to all modified files) propagated to computer B, at computer B do:

git-annex sync
git-annex get .

Resolving conflicts from concurrent changes

If an annexed file is modified in two repositories, a conflict will arise when these repositories are synced. Git-annex handles this by

  1. creating two files, each with a unique postfix.
  2. removing the original file.

Below is an excerpt of the output from git-annex when the file foo.R has been modified differently in two repos.

merge synced/master
Merge made by the 'recursive' strategy.
 projekt/foo.variant-655a.R                           |    1 +
 projekt/{foo.R => foo.variant-adde.R}                |    0
 2 files changed, 1 addition(+)
 create mode 120000 projekt/foo.variant-655a.R
 rename projekt/{foo.R => foo.variant-adde.R} (100%)
ok

In order to resolve the conflict do the following:

git-annex get foo*.R
git-annex unlock foo*.R
mv foo.variant-adde.R foo.R
rm foo.variant-655a.R
git-annex add foo.R
git-annex sync
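
Before resolving anything, it can help to see which variant files a sync has left behind. The following is a minimal sketch, not part of git-annex: list_variants is a hypothetical helper name, and the "*.variant-*" pattern is simply taken from the file names in the excerpt above.

```shell
#!/bin/sh
# Sketch: list conflict variants left behind by a sync.
# list_variants is a hypothetical helper; the "*.variant-*" pattern is an
# assumption based on the file names shown in the merge output.
list_variants() {
    # $1: directory to search (defaults to the current directory).
    # The .git directory is pruned so only working-tree files are listed.
    find "${1:-.}" -path '*/.git' -prune -o -name '*.variant-*' -print | sort
}
```

Run from ~/annex as "list_variants ." to get one path per line; an empty result suggests there is nothing left to resolve.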

git-annex as a backup system

git-annex can be used as a backup system. From within an annex, use:

git-annex --auto get .

to get content which is not backed up satisfactorily. You can define what "satisfactorily" means by editing .gitattributes.

However, you must explicitly add .gitattributes with the --force option, since git-annex ignores dot-files by default.

.gitattributes is inherited by subfolders, which is awesome. In the top-most .gitattributes, which resides directly in the annex - not in the directory .git - I define how different file types should be backed up. In sub-folders I can get a certain number of copies of all files in these directories (and their sub-dirs), regardless of file type. File types are defined by suffix in the name.

In order to automatically get .gitattributes copied to new repositories, I set annex.numcopies to a really high number (99).

Here is my top-most .gitattributes:

.gitattributes annex.numcopies=99
*.Rnw annex.numcopies=99
*.muse annex.numcopies=99
*.tex annex.numcopies=99
*.pl annex.numcopies=99
*.pm annex.numcopies=99
*.R annex.numcopies=99
*.sql annex.numcopies=99
*.sh annex.numcopies=99
*.mbox annex.numcopies=2
*.odt annex.numcopies=2
*.ods annex.numcopies=2
*.png annex.numcopies=1
*.mp3 annex.numcopies=1
*.mp4 annex.numcopies=1
*.wav annex.numcopies=1
*.pdf annex.numcopies=1

In some sub-folders I have a .gitattributes like this (notice the first line, which catches important files that cannot be matched by their suffix).

* annex.numcopies=2
.gitattributes annex.numcopies=99
*.Rnw annex.numcopies=99
*.muse annex.numcopies=99
*.tex annex.numcopies=99
*.pl annex.numcopies=99
*.pm annex.numcopies=99
*.R annex.numcopies=99
*.sql annex.numcopies=99
*.sh annex.numcopies=99
*.mbox annex.numcopies=2
*.odt annex.numcopies=2
*.ods annex.numcopies=2
*.png annex.numcopies=1
*.mp3 annex.numcopies=1
*.mp4 annex.numcopies=1
*.wav annex.numcopies=1
*.pdf annex.numcopies=1
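
With several .gitattributes files in play, it is not always obvious which value ends up applying to a given path. As far as I understand, annex.numcopies is resolved like any other git attribute, so git's own check-attr command can be asked. The sketch below assumes that; numcopies_for is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: ask git which annex.numcopies value applies to a path.
# numcopies_for is a hypothetical helper name; git check-attr is standard git.
numcopies_for() {
    # $1: path relative to the current directory, inside a git repository.
    # git prints "<path>: annex.numcopies: <value>"; keep only the value.
    # A path that no pattern matches is reported as "unspecified".
    git check-attr annex.numcopies -- "$1" | sed 's/^.*: annex\.numcopies: //'
}
```

For example, "numcopies_for text/code/GitAnnex.muse" run from ~/annex prints the value that the matching *.muse line assigns.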

Add a repository at an untrusted host

There are providers giving away free shell accounts with storage. While I would not trust such providers enough to put my files on their hosts unencrypted, you can securely use such shell accounts with encrypted files, and with LUKS all files will appear unencrypted to git-annex. The only thing you need to take care of is not to use the "ssh remote", since that implies that the untrusted host will not only see the contents of the files but also get your credentials for your own box. (The first step in setting up an ssh remote involves ssh:ing from the remote to your own box, a big no-no with untrusted hosts.)

What you need is explained here: free-secure-online-backup.

Populating a new repository with CONTENT

files in the annex

git-annex sync provides a new repository with information on what files exist, and where, but to actually get content in an automated way, you need two things:

  1. .gitattributes with definitions of how many copies of each file you want.
  2. the contents of all .gitattributes files copied to the new repo.

The latter is done by the following:

git-annex --force get *.gitattributes

Now, the new repository knows what files it is expected to keep a copy of, and it will get the right content (including dot-files) when you issue:

git-annex --force --auto get .

Global syncing of contents in a decentralised network of repos

To push contents of files according to the principles defined in *.gitattributes, I use the following snippet, in a command I call global-sync.sh. It parses the list of repositories in .git/config, and for each host mentioned there it logs in over ssh, syncs the records, and then gets the contents with --auto get.

So, when you have run global-sync.sh, you know that content that is supposed to be available at all repos, actually is there even if the content has changed recently.

## global-sync.sh
git-annex sync

for host in $(grep ssh ~/annex/.git/config | cut -d ":" -f2 | cut -d "/" -f3); do
    ssh "$host" "cd annex; git-annex sync; git-annex --auto get ."
done
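
The grep and cut pipeline above relies on every remote URL having the plain form ssh://host/path. A slightly more tolerant way to pull the host names out of the config might look like the sketch below. list_ssh_hosts is a hypothetical helper name, and handling of a user@ prefix or a :port suffix is an assumption about what other remotes could look like, not something my own config needs.

```shell
#!/bin/sh
# Sketch: print the host part of every ssh:// remote URL in a git config file.
# list_ssh_hosts is a hypothetical helper name.
list_ssh_hosts() {
    # $1: path to a git config file, e.g. ~/annex/.git/config
    # First sed: keep only "url = ssh://..." lines and cut out what sits
    # between "ssh://" and the next slash.
    # Second sed: drop an optional user@ prefix and an optional :port suffix.
    sed -n 's|^[[:space:]]*url[[:space:]]*=[[:space:]]*ssh://\([^/]*\)/.*$|\1|p' "$1" |
        sed 's|^.*@||; s|:[0-9]*$||' |
        sort -u
}

# Possible use inside global-sync.sh (commented out, since it needs the hosts):
# for host in $(list_ssh_hosts ~/annex/.git/config); do
#     ssh "$host" "cd annex; git-annex sync; git-annex --auto get ."
# done
```

Remotes that are not reached over ssh://, such as a path to a USB stick, are skipped.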

directory structure outside annex

This is slightly off-topic, but relevant for the problem: "When I get a completely new $HOME, what do I need to do to get everything working as in my other $HOME:s?"

If you want .dot-directories containing symlinks pointing to files managed by git-annex, then you need to create these too on each new repository.

Some collections of files are not suitable for use with git-annex, and if you have a directory structure to impose order on these files, that directory structure needs to be created too (I use dropbox for some directories under $HOME, but outside the annex).

symlinks pointing into the annex

Files like .emacs need to be in $HOME rather than in $HOME/annex, which can easily be solved by letting $HOME/.emacs be a symlink to $HOME/annex/.emacs. git-annex will cater for $HOME/annex/.emacs, but the symlink must be put in $HOME by some other mechanism. I keep a tar-archive of symlinks pointing into $HOME/annex, so all symlinks can be recreated in one command.

You may, of course, manage symlinks.tbz with git-annex. Doing so will ease the distribution of updated versions of symlinks.tbz.

To find and archive the symlinks:

#!/bin/sh

## dot-files in $HOME can be symlinked into ~/annex. The symlinks themselves
## are not backed up by git-annex. This script backs up the symlinks.

## catch symlinks pointing to $HOME/annex or "annex"

tar -jcvf ~/annex/symlinks.tbz $(find "$HOME" -xdev ! -iname "annex*" -type l \( -lname "$HOME/annex*" -o -lname "annex*" \))

To use them, that is, to unpack them in a new $HOME, get symlinks.tbz into ~/annex in the new $HOME and issue:

tar -C / -jxvf ~/annex/symlinks.tbz
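
After unpacking, a link such as $HOME/.emacs only works once the content of the annexed file it points to has actually been fetched. A quick way to spot links that still dangle is sketched below; dangling_links is a hypothetical helper name, and it only looks at entries directly under the given directory.

```shell
#!/bin/sh
# Sketch: list symlinks directly under a directory whose targets do not exist.
# dangling_links is a hypothetical helper name.
dangling_links() {
    # $1: directory to inspect (defaults to $HOME).
    # The first glob covers dot-files, the second the remaining entries.
    for link in "${1:-$HOME}"/.[!.]* "${1:-$HOME}"/*; do
        # -L: the entry is a symlink; ! -e: following it leads nowhere.
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Any path it prints is a candidate for "git-annex get" on the corresponding file in ~/annex.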

handling filesystem errors

I got fsck errors from a USB stick where I have an annex. git-annex fsck reported that everything was OK. git fsck reported dangling blobs, but that is harmless, I think. git-annex unused resulted in:

git-annex unused
unused . (checking for unused data...) (checking master...) (checking Ekbrands_data/master...) (checking Ekbrands_data/synced/master...)
  Some corrupted files have been preserved by fsck, just in case:
    NUMBER  KEY
    1       SHA256-s8005--8c04a49afbdcd054036db8d3c3d884bd2da41d810cc15bb2797b824ff24a84dc

  To remove unwanted data: git-annex dropunused NUMBER

Last modified: October 17, 2019