0. Installation
- Linux
sudo apt-get install datalad
- HPC (with MiniConda installed)
conda install -c conda-forge datalad
- macOS
brew install git-annex
python3 -m pip install [--user] datalad~=0.12
- Windows 10 (with MiniConda installed)
conda install -c conda-forge git
(install git-annex into MiniConda Library directory)
pip install datalad~=0.12
1. Setting Up Datasets
- create a dataset
text2git
configuration option allows text files to be stored in git instead of git-annex, preventing them from being locked- use
-f/--force
to convert existing directory into dataset/sub-dataset
datalad create [-d $super_dataset] [-f/--force] --description "..." \
-c config_option $target_dir
- check and commit updates
datalad status
datalad save [-d $dataset] -m "..." [--to-git] [--version-tag $tag] $file_names
- add and commit content from URL
datalad download-url $link --dataset $target_dataset -m "..." -O $output_name
- clone another dataset into an existing dataset & download some content physically
- note that sub-datasets have their separate commit histories
datalad clone -d $existing_dataset $git_link $target_dir
datalad get $file
- after cloning a dataset, check its sub-datasets & get the files in its sub-dataset (and sub-sub-datasets below) without downloading their content
datalad subdatasets
datalad get -n -r [--recursion-limit $limit] $sub_dataset
- add a container image (requires
datalad_container
extension)
datalad containers-add $image_name --url $image_path
2. Managing Provenance
- automatically track all provenance from a command run
- commit is saved to the dataset of the current directory (a commit is only made if there is actual change)
- input files are downloaded & output files unlocked before command is run (replaced by
{inputs}
,{inputs[0]}
,outputs}
, etc. in$command
)
datalad run -m "..." -i/--input $input_files -o/--output $output_files "$command"
- run a command ignoring unstaged changes
datalad run -m "..." --explicit "$command"
- rerun a previously run command with provenance tracking
- also only commits if there is actual change
datalad rerun $commit_number
- rerun all commands between a tag and the current state
datalad rerun --since $tag
- manually unlock a file
datalad unlock $file_name
- check the differences between commits (or current state)
(modified files) datalad diff [-f/--from $commit_curr] -t/--to $commit_to_compare
(modified content) git diff $commit_curr $commit_to_compare
- run a command in a container image (requires
datalad_container
extension)
datalad containers-run -m "..." --containers-name $image_name \
[--input $input_files] [--output $output_files] "$command"
3. Maintaining the Dataset
- only commit already tracked files
datalad save -m "..." -u/--updated
- merge changes from the origin
(git pull equivalent) datalad update --merge [-d/--dataset $orig_dataset] [-r]
(git fetch equivalent) datalad update [-d/--dataset $orig_dataset] [-r]
- check the current size & actual size of the dataset (or sub-dataset depending on the current directory)
datalad status --annex all
- find all files whose content are not locally available
(in super-dataset) git annex find --not --in=here
(in sub-datasets) git submodule foreach --quiet --recursive \
'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'
- check the list of container (requires
datalad_container
extension)
datalad containers-list
- remove annexed data completely (retrievable if any remote copy exists)
datalad drop [-r/--recursive] [--nocheck] $files
- drop unused annexed data
git annex unused
(one item) git annex dropunused $item_number
(multiple item) git annex dropunused $item_start-$item_stop
(all) git annex dropunused all
- remove a dataset and all its subdatasets (when outside the dataset)
datalad remove -r --nocheck -d $super_dataset
4. Managing sub-datasets & Siblings
- check changes in a sub-dataset
(files in the sub-datset) datalad status $sub_dataset/
(the sub-dataset itself) datalad status $sub_dataset
(all) datalad status --recursive
- similarly, save changes from a sub-dataset
(to the sub-datset) datalad save -d $sub_dataset -m "..."
(to the super-dataset) datalad save -d $super_dataset -m "..."
(all) datalad save $super_dataset --recursive
- remove a sub-dataset
(can do datalad get again) datalad uninstall $sub_dataset
(completely gone) datalad remove -m "..." -d $dataset $sub_dataset
- add a clone of the dataset as sibling/remote
datalad siblings add -d $this_dataset -s/--name $name --url $sibling_dataset
- check the list of siblings and remove one
datalad siblings
datalad siblings remove -s/--name $name
- merge changes from a sibling
datalad update --merge -s/--name $name
- if git shows error “commits don’t follow merge-base” for a sub-dataset
git reset HEAD $sub_dataset
5. Dataset Configuration
- add/modify/remove a configuration in
.git/config
of the current dataset
git config --local --add $section.[$sub_section.]$variable "$value"
git config --local --replace-all $section.[$sub_section.]$variable "$value"
git config --local --unset $section.[$sub_section.]$variable
- prevent a specific file from being annexed in the current dataset
(add to .gitattribute) $file_name annex.largefiles=nothing
- modify configurations in
.gitmodules
or.datalad/config
git config -f/--file=$config_file --replace-all $section.[$sub_section.]$variable "$value"
- apply a configuration procedure after dataset is created
datalad run-procedure -d $dataset $procedure_name
- add a result hook to the dataset
- possible
type
:file
,dataset
,symlink
,directory
- possible
action
:install
,get
,drop
,status
, etc. - possible
status
:ok
,notneeded
,impossible
,error
- substitutions for
$command
:"{dsarg}"
fordataset
,"{path}"
forpath
, etc.
- possible
git config --local --add datalad.result-hook.$hook_name.call-json '$command'
git config --local --add datalad.result-hook.$hook_name.match-json \
'{"type": "$type","action":"$action","status":"$status"}'
6. Sharing Datasets
- publish dataset to Github/Gitlab
datalad create-sibling-github/gitlab -d $dataset $repo_name
datalad push --to github/gitlab
- also push the tags to Github/Gitlab
git push github/gitlab --tags
(or)
git config --local remote.github/gitlab.push 'refs/tags/*'
- share annexed data through Github Large File Storage (LFS)
datalad create-sibling-github $repo_name
git annex initremote github-lfs type=git-lfs url=$repo_link encryption=none embedcreds=no
datalad push --to=github
- publish dataset to GIN
datalad siblings add -d $dataset --name gin --url $gin_link datalad push --to gin
- add sub-datasets to GIN
datalad subdatasets --contains $sub_dataset --set-property url $git_link
- push/fetch annexed data content (make sure remote is set with ssh-url instead of http-url)
git annex sync --content
Other useful git commands
- edit the commit message after committing
git commit --amend
- edit the last
N
commit messages (only do this if you know very well what you are doing)
git rebase -i HEAD~N
- check commit history
(last commit in detail) git log -p -n 1
(all commits each in a line) git log --oneline
(sub-dataset commits) git -C $sub_dataset log --oneline
(of a specific file) git log -- $file_name
- locate copies of an annexed file
git annex whereis $file_name
- commit a file name change
git mv $old_file $new_file
git commit -m "..."
- stop annexing a file (track with git instead)
git annex unannex $file_name
- soft reset (keeping work done)
(with annexed content) datalad unlock $file_name
git reset $commit
- reverting commit(s)
git revert [--no-commit] $commit [$newer_commit ...]
- check file content at an older commit
git cat-file --textconv $commit:$file_name
- apply configuration from a
.gitignore
file globally (use absolute path)
git config --global core.excludesfile $gitignore_file
- check size of annexed file before downloading
git annex info --fast $file_name
Other useful console commands
- view folder structure of current directory
tree -d
- find the type for a file
file --mime-type $file_name