CASA Data Repository
Using the CASA Data Repository
The CASA team switched to Git LFS for maintaining the CASA Data Repository when source code version control moved from Subversion to Git. Git LFS manages large binary files by storing the actual files outside of Git and checking checksum-based stubs into Git as proxies for them. This system works, but it is prone to accidental check-in of the actual binary data. Extra care is therefore crucial when committing to the CASA Data Repository: the only way to correct the error of pushing binary files committed directly to Git (instead of through LFS) to the Atlassian Bitbucket server is to recreate the entire Data Repository from scratch.
Please follow the steps in "Check Before Committing" below before making commits in your local Data Repository clone.
Git Setup
No specific changes to your Git setup are required, but some changes make checking out the data repository much more convenient. Without them, the default credential caching timeout causes Git LFS to prompt for credentials frequently as files are downloaded.
OSX
For OSX, setting the OSX keychain as the credential source allows the LFS checkout to proceed without prompting for passwords. This can be done with:
-bash-4.1$ git config --global credential.helper osxkeychain
Linux
Linux has no general facility for password management, so the easiest solution is to increase the credential cache timeout:
-bash-4.1$ git config --global credential.helper cache
-bash-4.1$ git config --global credential.helper 'cache --timeout=3600'
Git LFS Setup
Git LFS is distributed as an add-on to Git, so before you begin to use Git LFS, ensure that it is actually installed. If you run the following command and see this output:
-bash-4.1$ git help lfs
No manual entry for gitlfs
-bash-4.1$
This means that Git LFS is not installed on your system; contact your local system administrator. Most CASA Linux developers should get Git LFS as part of the installation of 'casa-toolset-2', which includes Git LFS.
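The installation check can also be scripted. The following is a minimal sketch, not part of the CASA tooling: the function name check_git_lfs is hypothetical, and it relies only on "git lfs version" exiting with a non-zero status when the LFS extension is missing.

```shell
# Hypothetical helper: report whether the Git LFS client is available.
# "git lfs version" fails when git (or the git-lfs extension) is missing.
check_git_lfs() {
    if git lfs version >/dev/null 2>&1; then
        echo "git-lfs found: $(git lfs version)"
    else
        echo "git-lfs not found; contact your system administrator" >&2
        return 1
    fi
}
```

Calling check_git_lfs at the start of a setup script lets the script stop early instead of failing partway through a clone.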
Setup Your Git LFS Environment
LFS is switched on or off by each Git user, not by something committed to the repository. For this reason, you should add LFS to your Git environment for every login that you will use to commit changes to the CASA Data Repository. It is best to set LFS as a global option so that you do not need to initialize LFS each time you clone the data repository. You can do this by running the following commands at the bash command line:
git config --global filter.lfs.required true
git config --global filter.lfs.clean "git-lfs clean -- %f"
git config --global filter.lfs.smudge "git-lfs smudge -- %f"
git config --global filter.lfs.process "git-lfs filter-process"
It is also possible to set up Git LFS on a per-repository basis.
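A per-repository setup could look like the sketch below. It applies the same four filter settings shown above, but with "--local" so that they are written to the .git/config of a single clone. The function name setup_lfs_local is hypothetical. Running "git lfs install --local" inside the clone is intended to achieve the same result and additionally installs the LFS Git hooks.

```shell
# Hypothetical helper: enable the LFS filters for one clone only.
# $1: path to an existing Git clone (for example ./casa-data)
setup_lfs_local() {
    git -C "$1" config --local filter.lfs.required true
    git -C "$1" config --local filter.lfs.clean "git-lfs clean -- %f"
    git -C "$1" config --local filter.lfs.smudge "git-lfs smudge -- %f"
    git -C "$1" config --local filter.lfs.process "git-lfs filter-process"
}

# Usage (commented out so that defining the helper has no side effects):
#   setup_lfs_local casa-data
#   git -C casa-data config --local --get-regexp '^filter\.lfs\.'
```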
Checking Out the Data Repository
The Data Repository is very large. The actual data content is 73GB, but a regular checkout (in Subversion or Git) requires a disk footprint of 153GB. Therefore the best way to start using the CASA Data Repository is to begin with a limited clone:
git clone --no-checkout https://<username>@open-bitbucket.nrao.edu/scm/casa/casa-data.git
Replace "<username>" with your username. This will clone the Git repository itself but will not fetch the large data files. From this starting point, you could:
- checkout the minimal data repository that is distributed with each binary distribution of CASA
- checkout the entire data repository
These are described in the next two subsections. An alternative to this more typical clone of the Data Repository is to clone only the LFS stubs for a look under the hood of LFS. This is described in the third subsection.
Distro Data Repository
The distro data repository is the minimal subset of the CASA Data Repository which is required for CASA to function properly at runtime. It can be retrieved (after doing the "no checkout" clone command above) with:
cd casa-data
git show HEAD:distro | bash
The CASA distro Data Repository checked out in this way requires around 1.5GB of disk space. The sparse checkout of the distro data repository actually modifies the cloned state so that only a subset of the entire repository is used. You can observe how this is done with:
-bash-4.2$ git show HEAD:distro | head -16
##
## this file is intended to be used by piping its contents into bash in a
## git clone that has been cloned with --no-checkout, see README.md at:
##
## https://open-bitbucket.nrao.edu/projects/CASA/repos/casa-data/browse
##
git config core.sparseCheckout true
cat > .git/info/sparse-checkout <<'EOF'
ephemerides/*
geodetic/*
gui/*
demo/Images/*
demo/calibrater/*
demo/NGC5921.fits
demo/3C273XC1.fits
-bash-4.2$
You can use this information to tailor your personal repository to include those portions of the data repository which are pertinent to the tests you care about. For example, to add the unittest directory:
git clone --no-checkout https://<username>@open-bitbucket.nrao.edu/scm/casa/casa-data.git casa-distro
cd casa-distro
git show HEAD:distro | bash
echo 'regression/unittest/*' >> .git/info/sparse-checkout
git checkout
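When several directories are added over time, appending with "echo ... >>" can leave duplicate lines in the sparse-checkout file. The sketch below shows one way to append a pattern only if it is not already present. The function name add_sparse_pattern is hypothetical; the file path and pattern in the usage comment are the ones from the example above.

```shell
# Hypothetical helper: add_sparse_pattern FILE PATTERN
# Append PATTERN to FILE unless that exact line is already there.
add_sparse_pattern() {
    # -x: match whole lines, -F: treat the pattern as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage, from the root of a sparse clone:
#   add_sparse_pattern .git/info/sparse-checkout 'regression/unittest/*'
#   git checkout
```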
Entire Repository
The entire repository can be checked out by making the limited clone described above and then checking out master:
git clone --no-checkout https://<username>@open-bitbucket.nrao.edu/scm/casa/casa-data.git
cd casa-data
git checkout master
This checkout will likely take a long time and consume about 153GB of disk space.
Checkout LFS Internals
You may wish to have a look at the LFS internals. Typically you won't, but this is the only way to confidently check whether any binary files have crept into our LFS-based binary data repository. One way to do this is:
git -c "filter.lfs.smudge=cat" clone https://open-bitbucket.nrao.edu/scm/casa/casa-data.git
Ignore the error message that this command reports.
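In a clone made this way, every LFS-managed file should be a small text stub whose first line is the Git LFS pointer header. The sketch below looks for files that are not. It rests on two assumptions: that LFS-managed paths carry the "filter=lfs" attribute in .gitattributes, and that file names contain no unusual characters. The function names is_lfs_pointer and find_non_pointers are hypothetical, and the loop starts one git process per file, so it will be slow on a repository of this size.

```shell
# Hypothetical helper: succeed if FILE starts with the LFS pointer header.
is_lfs_pointer() {
    header='version https://git-lfs.github.com/spec/v1'
    # Read only the first 42 bytes (the header length); drop NUL bytes so
    # that binary files do not trigger shell warnings.
    [ "$(head -c 42 "$1" 2>/dev/null | tr -d '\000')" = "$header" ]
}

# Hypothetical helper: list tracked files that are marked filter=lfs but
# whose working-tree content is not an LFS pointer.
# $1: path to a clone made with filter.lfs.smudge=cat
find_non_pointers() {
    git -C "$1" ls-files | while IFS= read -r f; do
        case $(git -C "$1" check-attr filter -- "$f") in
            *": filter: lfs") is_lfs_pointer "$1/$f" || printf '%s\n' "$f" ;;
        esac
    done
}

# Usage:
#   find_non_pointers casa-data
```

Any path printed by find_non_pointers is a candidate for binary data that was committed directly to Git and deserves a closer look.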
Committing Changes
Changes can be committed from either the distro (sparse) clone or a complete repository clone. However, if you are using a CASA Data Repository clone that you made previously, remember to run "git pull" before beginning to make changes.
To do this, just copy the new files (or replacement files) into place, and then add them as normal from the root of your Git clone. For example:
cd casa-data
cp demo/3DDAT.fits gui
However, it is important to check to ensure that the change registers as expected as we go through the commit. At this point, Git will see the new file:
-bash-4.2$ git status -s
?? gui/3DDAT.fits
-bash-4.2$
but LFS will not:
-bash-4.2$ git lfs status --porcelain
-bash-4.2$
Next add the new file from the root of your data repository clone:
-bash-4.2$ git add gui/3DDAT.fits
-bash-4.2$
At this point, both Git and Git LFS should recognize the new file as staged for commit:
-bash-4.2$ git status -s
A gui/3DDAT.fits
-bash-4.2$
-bash-4.2$ git lfs status --porcelain
A gui/3DDAT.fits 10137600
-bash-4.2$
If you do not see your changes reflected in the output from "git lfs status", do not commit your changes. Committing files reported by "git status" but not reported by "git lfs status" will result in binary data being committed directly to Git (as binary files) instead of through Git LFS.
With our changes visible to both Git and Git LFS, it is safe to commit them:
-bash-4.2$ git commit -m 'changes which should not be pushed'
[master 93cc524] changes which should not be pushed
1 file changed, 3 insertions(+)
create mode 100644 gui/3DDAT.fits
-bash-4.2$
The "changes which should not be pushed" message simply refers to the fact that we've just committed a bogus file to our local repository which we do not want pushed to the Bitbucket repository shared by all CASA users. With a normal commit to the CASA Data Repository, with files which should be shared, it would now be safe to push these files up to the server.
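The "3 insertions(+)" in the commit summary above is consistent with Git having stored a three-line LFS pointer (version, oid and size lines) rather than the roughly 10MB file itself. As an extra check before pushing, the committed blob can be inspected directly. This is a sketch only: the function name committed_as_pointer is hypothetical, and it must be run from inside the clone.

```shell
# Hypothetical helper: committed_as_pointer PATH [REVISION]
# Succeed if the blob stored for PATH at REVISION (default HEAD) begins
# with the Git LFS pointer header. "git cat-file -p" prints the blob as
# stored in Git, without applying the LFS smudge filter.
committed_as_pointer() {
    git cat-file -p "${2:-HEAD}:$1" | head -n 1 |
        grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# Usage, from the root of the clone after committing:
#   committed_as_pointer gui/3DDAT.fits && echo "stored via LFS"
```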
When deleting files from the data repository, the deletions will not be listed in the "git lfs status --porcelain" output. This is because the large binary files are not deleted when files are removed; they are still required when checking out older revisions of the data repository.
Check Before Committing
It is very important to check the status of your data repository clone before committing changed files to your local repository. Failure to do this (even on a non-master branch) could lead to the need to reconstitute the CASA Data Repository on the server from scratch.
This step is simple. As described in the "Committing Changes" section, all you need to do is compare the output of:
git status -s
and
git lfs status --porcelain
to ensure that each reports knowledge of the files that are about to be committed. In our example above, the interaction looked like:
-bash-4.2$ git status -s
A gui/3DDAT.fits
-bash-4.2$
-bash-4.2$ git lfs status --porcelain
A gui/3DDAT.fits 10137600
-bash-4.2$
When deleting files from the data repository, the deletions will not be listed in the "git lfs status --porcelain" output.
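The comparison can be automated along the following lines. This is a sketch under stated assumptions, not an official CASA tool: the function name missing_from_lfs is hypothetical, the LFS output is assumed to have the three-column "status path size" layout shown above, and paths are assumed to contain no whitespace. Staged deletions and untracked files are skipped. Any file that is intentionally kept outside LFS (for example a plain text file) will also be reported and needs a manual decision.

```shell
# Hypothetical helper: missing_from_lfs GIT_STATUS_FILE LFS_STATUS_FILE
# Print paths staged as additions or modifications according to
# "git status -s" that do not appear in "git lfs status --porcelain".
missing_from_lfs() {
    awk '
        # LFS porcelain lines: remember the path (second column).
        FILENAME == ARGV[1] { seen[$2] = 1; next }
        # Short status lines: "XY path", where X is the staged status.
        {
            x = substr($0, 1, 1)
            if (x == " " || x == "?" || x == "D") next
            path = substr($0, 4)
            if (!(path in seen)) print path
        }
    ' "$2" "$1"
}

# Usage, from the root of the clone:
#   git status -s > /tmp/git-status.txt
#   git lfs status --porcelain > /tmp/lfs-status.txt
#   missing_from_lfs /tmp/git-status.txt /tmp/lfs-status.txt
```

If the function prints nothing, every staged addition or modification is known to Git LFS. If it prints any path, stop and investigate before committing.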
Further Reading