The DVC Guide: Data Version Control For All Your Data Science Projects
Become familiar with data versioning just like code versioning
As data scientists, we experiment with different versions of code, models, and data. Additionally, we even use version control system like Git to manage our code, track versions, move forward and backward in time, and share our code with our teams.
The versioning of code is important because it helps reproduce software on a much larger scale. The versioning of data is important is because it helps develop machine learning models with similar metrics at any given time by any developer in your team or organization.
Therefore, it is crucial to version your models as well as data. But veteran software engineers will know that using Git for storing large files is a big no.
Not only is Git inefficient with larger files, it is also not a standardized environment for storing large data files. Most data is stored in AWS S3 buckets, Google Cloud Storage, or any institutional remote storage server.
So how do we version data? Enter DVC.
Introducing DVC
DVCĀ is a system for Data Version Control that works hand in hand with Git to track our data files. It even has a similar syntax like Git so itās quite easy to learn.
Letās take a look at some of the great data versioning features of DVC in this article. But first, lets make a new project folder and a virtual environment and install it as a Python package:
$ pip install "dvc[all]"
or if you useĀ Pipenv:
$ pipenv shell
$ pipenv install "dvc[all]"
You should see an output like so:
Now, lets initialize a git repository. You should see the following output:
Perfect! We can now go ahead into adding our data into DVC.
Adding + committing data intoĀ DVC
I have one data file in my projectās data folder like so:
To run a size check from the terminal, use:
$ ls -lh data
You'll see the following output as the data file is displayed as 5.2 MB.
We can now add this data file to DVC. Run:
$ dvc add data/train_shakespeare.txt
Youāll see the following output, prompting us to run the git add command:
We will now run the git add command:
$ git add data/train_shakespeare.txt.dvc data/.gitignore
Now that weāve added our newĀ .dvcĀ file to our git tracking, we can go ahead and commit it to our git:
$ git commit -m "added data."
Setting up a remote storage for ourĀ data
We can simply utilize Google Drive for storing our versioned datasets and in this tutorial weāre going to do exactly that.
Letās create a new folder in our Google drive and look at its URL:
https://drive.google.com/drive/u/0/folders/cVtFRMoZKxe5iNMd-K_T50Ie
Highlighted in bold is the ID of the folder that we want to copy to our terminal so that DVC can track our data in that newly created Drive folder.
Letās do that:
$ dvc remote add -d storage gdrive://cVtFRMoZKxe5iNMd-K_T50Ie
Time to commit our changes to git:
$ git commit .dvc/config -m "Configured remote storage."
Perfect! Now we can push our data to our remote storage.
$ dvc push
Itāll ask for an authentication code or simply take you to perform authentication in your browser, simply follow the instructions and youāll be good to go.
Pulling remoteĀ data
If you or your colleagues want to access the remotely stored data, it can be done with theĀ pullĀ command.
But first, letās delete the data and its cache stored locally so that we can pull it from remote:
$ rm -f data/train_shakespeare.txt
$ rm -rf .dvc/cache
Now, pull:
$ dvc pull
Youāll see the following output on pulling the file:
As you can see, once dvc is tracking your data file, pulling it from remote storage is a breeze.
Tracking a different version ofĀ data
Imagine if we want to track a new version of the same data file, we can easily add it to dvc and subsequently, to git again:
$ dvc add data/train_shakespeare.txt
$ git add data/train_shakespeare.txt.dvc
Now, youāll see a new version of theĀ .dvc file is ready to be committed to our git:
Commit the file.
Now, we can push our latest dataset to remote storage:
$ dvc push
Looking at our Google drive, we can see that we have two versions of our data stored:
Returning to a different datasetĀ version
With DVC, it become easy to go back in time to an older version of a dataset.
If we look at the git log of our project so far, we see that we have commited twoĀ .dvcĀ file versions to git:
Therefore, we must go back to our previous version of theĀ .dvcĀ file, as that is the one git is tracking.
First, simply do a Git checkout to an older commit, like so:
$ git checkout HEAT^1 data/train_shakespeare.txt.dvc
Second, do a checkout of dvc:
$ dvc checkout
Youāll see the following output. We have now restored our data file to its previous version!
Additionally, if you want to keep these dataset changes, simply commit it to git again:
$ git commit data/train_shakespeare.txt.dvc -m "reverted data changes."
Perfect! Till now, you have learned most of the fundamental data versioning features of DVC. Great job!
Final wordsā¦
DVC provides us with a massive helping hand in versioning datasets for our data science projects, and after this article, I hope you have some useful knowledge about getting started with it.
Practising on some sample projects and exploring the DVC documentation will be your best bet to advance your skill with this amazing tool.
If you liked this article, feel free to share it with a friend:
Connect with me onĀ Twitter.
Iāll see you next time :)