.. _afs_long-term-archive:

Long-term AFS Archives with bup
===============================

Overview
--------

On Chicago, we run a program called bup for all our long-term archives. It
uses git as a content-addressable block store, allowing for efficient storage
of slowly-changing contents like homedirs and service backups. Or at least,
that's the theory.

In any case, the basic automation pipeline is something like this:

* On a relatively frequent schedule, AFS volumes are released to Chicago.

* On a less frequent schedule, these volumes are dumped (with ``vos dump``),
  processed, and archived using ``bup split``.

The automation is overseen by AFS BOS.

So where are the backups?
-------------------------

The bup store is in /z/bup/ on chicago. The ${BUP_DIR} environment variable
is set to this in a few scripts on chicago that need it.

Looking at or restoring an archive file
---------------------------------------

Because bup uses git, git's tools may be used to explore a little bit.

* ``git branch -a`` will show you the names of all the archives stored
  internally to the bup store.

* ``git log`` will show you the timestamps at which archives were taken.

* git's so-called "approxidate" framework can be of some assistance;
  ``git log`` ``--since`` and ``--until`` will look at the date stamps of the
  commits and can help narrow your search. While the syntax for dated refs,
  ``${BRANCH}@{${DATE_STRING}}``, is present, note that it uses the *reflog*
  timestamp rather than the commit timestamp. Many early snapshots are
  therefore understood incorrectly.

* ``git verify-pack`` and ``git fsck`` can be used to sanity-check the
  backing store, if that is ever needed. We should try to be good about
  keeping the older packs off-line as well as on-.

However, ultimately, what you care about, I assume, is the output of a
``bup join`` command.

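
The git-based exploration above can be wrapped in a couple of small shell
helpers. This is only a sketch: the function names ``list_archives`` and
``snapshots_between`` are made up for illustration, ``BUP_DIR`` is assumed to
fall back to the store location given earlier, and ``git for-each-ref`` is
used in place of ``git branch -a`` purely because its output is easier to
script against.

```shell
# Hypothetical helpers for poking at the bup store with plain git.
# Assumption: BUP_DIR points at the bup store; defaults to /z/bup.
BUP_DIR="${BUP_DIR:-/z/bup}"

# Print the name of every archive (branch) in the store, one per line.
list_archives() {
    git --git-dir="$BUP_DIR" for-each-ref --format='%(refname:short)' refs/heads
}

# Print "<hash> <commit-unix-time>" for each snapshot of BRANCH whose commit
# date falls between SINCE and UNTIL (any approxidate string git accepts).
# Usage: snapshots_between BRANCH SINCE UNTIL
snapshots_between() {
    git --git-dir="$BUP_DIR" log --since="$2" --until="$3" \
        --pretty=tformat:'%H %ct' "$1"
}
```

For instance, ``snapshots_between ${BRANCH} '2 weeks ago' now`` should print
the recent snapshots of one archive; any hash it prints can then be handed to
``bup join``. Because this goes through ``--since`` and ``--until``, it
filters on commit timestamps and so avoids the reflog pitfall noted above.
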

Restoring from archive without nuking an existing volume
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a user has asked for an emergency restore but does not want their home
directory clobbered, consider creating, mounting, and restoring a new volume
for them. Something like ::

  vos create chicago.acm.jhu.edu viceps recover.$USER
  fs mkm ~$USER/acmsys/recover recover.$USER.readonly
  bup -d ${BUP_DIR} join ${GIT_REF_NAME_OR_HASH} | \
    vos restore chicago.acm.jhu.edu viceps recover.$USER -readonly

And then ``vos remove`` the volume when the user has gotten their files back.

Inserting a dump into the archive
---------------------------------

file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-to-bup.sh (on
Chicago, ``~root/bin/dump-to-bup.sh``) knows how to push a volume into the
archive. The simplest thing, if you want to run it by hand, is to run::

  echo ${VOLUMENAME} ${VOLUMENAME} | dump-to-bup.sh

Yes, the name should be given twice: ``dump-to-bup`` expects to be reading
from the output of ``vos listvol``, but the name will be resolved if an
integer ID is not provided in the second column.

file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/dump-all-to-bup.sh
(on Chicago, ``~root/bin/dump-all-to-bup.sh``) knows how to loop over the
entire archive partition and invoke ``dump-to-bup`` appropriately.

Repacking Bup Packs
-------------------

Nine out of ten, or more, of the packs we write are very, very tiny: users'
home directories do not change that quickly. As such, left to its own
devices, bup will fill up the ``${BUP_DIR}/objects/pack`` directory with
many, many tiny files, which is, of course, detrimental to performance. On
the other hand, full repacks of huge repositories take forever, so... we
compromise by repacking all "small" pack files together at the end of every
night's dump. Since we have on the order of thousands of volumes, this will
not create a huge number of files that need to be dealt with.
See file:///afs/acm.jhu.edu/group/admins.pub/scripts/chicago/bup-tools.sh for
details. Eventually, that approach might be a little heavy-handed (asking bup
to rebuild all its data structures), but for the moment, those steps are
entirely dominated by the git repack itself.

.. note::

   We have also found it hugely temporally beneficial to disable git's delta
   coding. It makes our archives a little bigger than they might otherwise
   be, but it means that our repacks do not take hours on end. This is
   accomplished by::

     git config --local pack.window 0
     git config --local pack.depth 1

Mirroring Some Or All Of The Bup Archive
----------------------------------------

Of course, one answer is just to rsync the entirety of the bup archive
somewhere else. Our repacking game above means there will be a little slop,
but hopefully not too much -- once things are committed to "big" packs, they
won't ever move again.

You can also use git and maintain your own local pack structure. If you want
to access things having done this, you'll need to have bup recreate its midx
and bloom files, possibly, but that's straightforward.

Creating a git repository and creating a ``remote`` section with something
like ::

  [remote "chicago"]
      url = root@chicago.acm.jhu.edu:/z/bup
      fetch = +refs/heads/root/*:refs/remotes/chicago/root/*
      fetch = +refs/heads/service/*:refs/remotes/chicago/service/*
      fetch = +refs/heads/group/readonly:refs/remotes/chicago/group/readonly
      fetch = +refs/heads/group/acm-museum.readonly:refs/remotes/chicago/group/acm-museum.readonly
      fetch = +refs/heads/group/admins.readonly:refs/remotes/chicago/group/admins.readonly
      fetch = +refs/heads/group/admins.pub.readonly:refs/remotes/chicago/group/admins.pub.readonly
      fetch = +refs/heads/group/officers.readonly:refs/remotes/chicago/group/officers.readonly
      fetch = +refs/heads/group/officers.pub.readonly:refs/remotes/chicago/group/officers.pub.readonly
      fetch = +refs/heads/mirror/readonly:refs/remotes/chicago/mirror/readonly
      fetch = +refs/heads/user/readonly:refs/remotes/chicago/user/readonly

will allow you to selectively archive parts of the system. Isn't that neat?
You can always add another ``fetch =`` line and run ``git fetch chicago`` to
bring more things over.

Extracting Every Revision Of A Volume
-------------------------------------

Basically a loop around the restoration game above. You can extract the hash
and time of every revision of a branch with::

  GIT_DIR=/z/bup git log --pretty=tformat:'%H %ct' ${BRANCH} > hashtimes

for example. Then maybe something like ::

  vos create chicago.acm.jhu.edu viceps ${TMP_VOL_NAME}
  fs mkm v ${TMP_VOL_NAME}
  exec 3