Git for Data Science – A Guide For Data Scientists

Since its creation, Git has been widely used by software developers to track changes to code and other files.

In the past few years, the adoption of Git in data science has grown considerably as well. Knowledge of Git is now a listed requirement in many data science job postings.

What is Git?

Git is a distributed version control system for tracking changes in source code during software development.
— Wikipedia

Git for Data Science

Git is one of the most commonly used command-line tools for version control in software development. Through version control, developers can record, track and save changes to files over time so that they can quickly revert to previous versions of the files whenever needed.

Popular platforms such as GitHub and GitLab use Git as their underlying backbone so that developers can work together on the same project at the same time. Many open-source projects, including the Linux kernel (for which Git was originally created), rely on Git so that thousands of people can work on a single project despite being in different geographical locations.

Why should a data scientist learn Git for data science?

Today, many data scientists with academic backgrounds work as AI engineers building production systems at startups, which puts Git in the spotlight as a near-ideal tool for version control.

Here are a few major reasons why a modern-day data scientist should learn Git:

  • For collaborating on code in data science teams of any size.
  • For building a personal project portfolio on an online repository hosting platform such as GitHub or GitLab.
  • For tracking changes to code as well as other files.
  • For contributing to and learning from open-source data science projects.

Fortunately, learning Git will be a piece of cake if you are able to remember the hundreds of algorithms you study in data science.

Basic Git commands you should know as a data scientist

Before you start learning the basic commands of Git, it is a good idea to first install it on your system. Here is a quick tutorial for doing just that: How to Install Git on Linux, Mac or Windows.

Once you have finished installing Git, go through the following commands, which are commonly used for creating and working with your project folders, known as repositories.

a. git init

This command is the first step in creating a repository. It turns a local directory into an empty Git repository, ready to have files added to it.

# inside the directory
$ git init

b. git add

This command stages files from the local directory so that they are included in the next commit to the Git repository.

# To add a single file or a directory
$ git add <file name / directory name>

# To add all files/directories in the current directory
$ git add .

c. git status

This command is used to check the current status of the repository. It shows details such as the current branch, which changes are staged for the next commit, and which files are modified or untracked.

$ git status

d. git commit

This command is used to record the staged changes in the local repository. Each commit has a unique ID, and it is recommended to add a commit message to each commit explaining the changes made, for future reference.

$ git commit -m "<Your commit message>"
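The four commands above can be tried end to end in a throwaway directory. This is a minimal sketch: the file name `analysis.py` and the name and email are placeholders, and the two `git config` lines (covered in section i) are only there so that the commit has an author.

```shell
# Work in a temporary directory so no existing project is touched
repo="$(mktemp -d)"
cd "$repo"

git init                                    # a. create an empty repository here

# Placeholder identity, stored only in this repository's settings
git config user.name "Example User"
git config user.email "user@example.com"

echo "print('hello')" > analysis.py         # hypothetical file to track
git add analysis.py                         # b. stage the file
git status --short                          # c. staged new files are marked with "A"
git commit -m "Add first analysis script"   # d. record the staged snapshot

git log --oneline                           # one line per commit, newest first
```

After the commit, `git status` reports a clean working tree because every change has been recorded.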

e. git clone

Platforms such as GitHub and GitLab host remote repositories that hold all the files of a project along with its history. This command creates a local working copy of an existing remote repository. It is equivalent to initializing a local repository and then downloading all files and the full history from the remote.

$ git clone <url of remote repository>

f. git remote

A remote repository can be linked to the local copy of the repository through this command.

# Add a remote repository
$ git remote add <remote_name> <remote_URL>

# List remote repositories
$ git remote -v

g. git push

This command is used to push (upload) the committed changes to the remote repository.

$ git push

h. git pull

This command is used to pull (fetch) the new changes from the remote repository and merge them into the current branch.

$ git pull

i. git config

This command is used for configuring settings such as the name and email address that are recorded in your commits.

$ git config <setting> <value>

# Running git config globally (applies to every repository for the current user)
$ git config --global user.email "<Your email address>"
$ git config --global user.name "<Your name>"

# Running git config on the current repository settings only
$ git config user.email "<Your email address>"
$ git config user.name "<Your name>"

j. git branch

This command is used to list branches (the current one is marked in the output), create a new branch and delete a branch.

# Create a new branch 
$ git branch <branch_name> 

# List all local and remote branches 
$ git branch -a 

# Delete a branch 
$ git branch -d <branch_name>

k. git checkout

This command is used to switch between branches while working. It can also create a new branch and switch to it in one step.

# Switching to an existing branch
$ git checkout <branch_name>

# Creating a new branch and switching to it
$ git checkout -b <new_branch_name>

l. git merge

This command combines the changes from different branches together.

# Merge changes from a branch into the current branch 
$ git merge <branch_name>
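The branch, checkout and merge commands usually appear together. The following is a rough sketch of one such cycle in a temporary repository; the branch name `experiment` and the file `model.txt` are made up for illustration, and the starting branch name is read from Git because it may be `master` or `main` depending on configuration.

```shell
# Set up a temporary repository with one commit
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"

echo "baseline" > model.txt                 # hypothetical file
git add model.txt
git commit -m "Add baseline model notes"

# Remember the starting branch name (depends on the local Git configuration)
base_branch="$(git symbolic-ref --short HEAD)"

git checkout -b experiment                  # create a branch and switch to it
echo "tuned" >> model.txt
git commit -am "Try tuned parameters"       # -a stages tracked files that changed

git checkout "$base_branch"                 # switch back to the starting branch
git merge experiment                        # bring the experiment's commit in
git branch -d experiment                    # delete the branch once it is merged
```

Because the starting branch had no new commits of its own, this particular merge is a fast-forward and produces no merge commit.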

Here is a basic workflow showing how some of the most frequently used commands (add, commit, push and pull) fit together in a Git project.

[Figure: Git for Data Science workflow]
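The add, commit, push and pull cycle can also be tried without a hosting account. In this sketch a bare repository on the local disk stands in for a GitHub or GitLab remote, which is an assumption made purely so the example runs offline; the directory names `laptop` and `teammate` and the file `notes.md` are hypothetical.

```shell
work="$(mktemp -d)"

# A bare repository plays the role of the hosted remote in this demo
git init --bare "$work/remote.git"

# First working copy: clone, add, commit, push
git clone "$work/remote.git" "$work/laptop"
cd "$work/laptop"
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"
echo "first experiment" > notes.md
git add notes.md
git commit -m "Add first experiment notes"
git push -u origin HEAD                     # -u remembers the remote branch for later pushes

# Second working copy, for example a teammate's machine
git clone "$work/remote.git" "$work/teammate"

# A further change is committed and pushed from the first copy
echo "second experiment" >> notes.md
git commit -am "Add second experiment notes"
git push

# The second copy pulls to receive the new commit
cd "$work/teammate"
git pull
git log --oneline
```

With a real hosting platform, the only difference is that the clone URL points to the platform instead of a local path.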

You can pick up more advanced commands as the need for them arises while using Git.

Avoid adding and pushing large data files using Git

While using a remote repository, it is important to note that Git handles large binary files poorly and hosting platforms typically enforce a hard limit on file size. If you accidentally commit a large file, such as training data consisting of images or videos, it can cause problems, and removing it from the history afterwards can be time-consuming.

It is therefore good practice to use a .gitignore file, which tells Git to leave the untracked files and folders listed in it out of staging and commits. The .gitignore file is usually kept at the root directory of the repository you are working in.
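One way to catch such files early is to look for them before committing. The sketch below is only an illustration: the 1 MiB threshold and the file names are assumptions chosen so the demo runs quickly, and a real threshold should follow your hosting platform's limit.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

# Simulate a small script and an oversized data file (2 MiB of zeros)
echo "print('train')" > train.py
dd if=/dev/zero of=images.bin bs=1024 count=2048 2>/dev/null

# List files above the threshold, skipping Git's own .git directory
large_files="$(find . -type f -size +1024k ! -path './.git/*')"
echo "$large_files"

# If such a file was staged by mistake but not yet committed,
# remove it from the staging area while keeping it on disk
git add .
git rm --cached images.bin
git status --short
```

Unstaging before the commit is cheap; once a large file is part of the history, removing it requires rewriting that history.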

Here is an example of what the content of the .gitignore file looks like:

# Ignore Mac system files
.DS_Store

# Ignore data_files folder
data_files/

# Ignore all the text files
*.txt

# Ignore files related to environment variables
.env

Notice how you can ignore folders as well as files based on their extensions. The .gitignore file certainly comes in handy when you need to ignore your data files.
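To check what a .gitignore file actually excludes, `git check-ignore` reports which rule matches a given path. The following sketch assumes four typical patterns (`.DS_Store`, `data_files/`, `*.txt` and `.env`) and a few made-up file names.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

# Write an illustrative .gitignore with one pattern per line
printf '%s\n' '.DS_Store' 'data_files/' '*.txt' '.env' > .gitignore

# Create some hypothetical project files
mkdir data_files
touch data_files/train.csv notes.txt .env train.py

# Show which .gitignore line matches each ignored path
git check-ignore -v data_files/train.csv notes.txt .env

# Only files that are not ignored show up as untracked
git status --short
```

In this setup, `git status` lists only `.gitignore` and `train.py`, so a later `git add .` would stage just those two files.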

In Conclusion

Although the concept of version control is not new, the field of data science is gradually adopting a sharing culture in which open-source projects are hosted online for others to learn from and collaborate on. Moreover, the growing use of data science systems in production has made version control, and Git in particular, more important than ever.

What do you think about this? Let us know in the comments.
