Git and GitHub for Data Scientists
In this tutorial, we will focus on the basics of Git and the version control platform GitHub.
Most data scientists come from heterogeneous backgrounds such as physics and mathematics, and many come from research and academia. To develop any data science project or product, data scientists need software engineering best practices so that building and maintaining the product goes smoothly. Git is one of the skills every software engineer needs to manage a code base efficiently. In this tutorial, we focus on Git and the GitHub platform as tools for efficient software engineering practice.
Git is a version control system that tracks every modification to your code. It is the technology that performs the tracking and merging of changes in source code. GitHub is a widely used web-based hosting platform built on Git, and it is also home to many open-source projects that you can browse. Other platforms similar to GitHub are available, such as GitLab and Bitbucket.
Modern software development is not possible without a version control system, because maintaining multiple folders and subfolders for different states of a project is very difficult. It also carries the risk of losing important code. A main reason for Git’s popularity is its branching concept. Branching allows you to create multiple versions of your work and track them in a structured manner. Each branch is like a parallel world: changes made in one branch do not affect the other branches until you merge them together.
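In this tutorial, we are going to cover the following topics: GitHub terminology, creating a new repo, committing to the repo, pushing code to a remote repo, pulling code from a remote repo, creating a new branch, switching branches, merging branches, and Git do’s and don’ts.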
GitHub Terminology
- Repository: A folder that contains the project files and the revision history of each file. A repository can be of two types: local and remote.
- Cloning: Create a copy of the remote repo on your local machine.
- Commit: When you commit a change, you save the changes made to your files in the repository. These changes are recorded only on your local machine; to make them permanent on the remote repository, you need to push the code to it.
- Push: The push command transfers the changes in your local repository to a remote repository.
- Pull: The pull command transfers the changes from the remote repository to your local repository.
- Staging Area: An intermediate place between the working directory and the local Git repository. Here you can review the changes you have made before committing them to the local repository.
- Branching: It allows you to work on new features or functionality. It will help you experiment with new features or ideas. You can create multiple branches in parallel. After finalizing the feature, you can easily merge all the changes in the code to the main/master branch.
- Merging: It combines the changes from one branch into another, typically bringing the final changes of a feature branch into the main/master branch.
- Pull Request: It helps in reviewing & approving your changes before merging to the main/master branch.
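To make the working directory, staging area, and local repository from the list above more concrete, here is a minimal sketch that follows a single file through all three. The scratch directory, the file contents, and the "Demo User" identity are assumptions made only for this illustration.

```shell
set -eu
# Throwaway repository so the sketch does not touch a real project (assumed location).
work_dir="$(mktemp -d)"
git init -q "$work_dir/demo"
cd "$work_dir/demo"
git config user.name "Demo User"           # placeholder identity for the commit
git config user.email "demo@example.com"

echo 'print("hello")' > app.py             # working directory: the file is untracked
before_add="$(git status --short)"

git add app.py                             # staging area: the file is queued for the next commit
after_add="$(git status --short)"

git commit -q -m "Add application file"    # local repository: the snapshot is recorded
after_commit="$(git status --short)"       # empty once nothing is left to commit

printf '%s\n' "$before_add" "$after_add" "[$after_commit]"
```

In the short status format, `??` marks an untracked file and `A` marks a file that has been added to the staging area.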
Creating a New Repo
You can create a new repository on GitHub by clicking the + sign in the upper-right corner of any page and selecting the New repository option from the drop-down menu.
We can also create a Git repository from the local command line. First, create a folder for the project.
mkdir NLPProject
Change into the project directory that you created in the last step.
cd NLPProject
Now, initialize the project folder with Git so that Git can manage its version control.
git init
Now, the project folder contains a hidden folder named .git, where Git stores the repository data.
After this, you can create and save the Python files of your project and add them to the local Git repository by using the git add and git commit commands.
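If you want to confirm that the initialization worked, one possible check is sketched below. It repeats the steps above inside a scratch directory (an assumption of this example, so that nothing real is touched) and then asks Git whether the current folder is inside a repository.

```shell
set -eu
work_dir="$(mktemp -d)"        # scratch location, assumed for this sketch
cd "$work_dir"

mkdir NLPProject
cd NLPProject
git init -q

ls -a                           # the listing now includes the hidden .git folder
inside="$(git rev-parse --is-inside-work-tree)"   # prints "true" inside a repository
echo "$inside"
```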
Committing to the Repo
Let’s commit your code to the local repository. First, we add the files to Git by using git add. After that, we check the status using git status, and finally we commit the code using git commit.
git add app.py
Let’s check the status using git status.
git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file:   app.py
Let’s commit the code using git commit. Here, we use the -m argument to specify the message string that explains what you are committing to the repo.
git commit -m "Adding application file"
Here, we made our first contribution to the project.
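To see the commit that was just recorded, git log can be used. The sketch below rebuilds the same single-commit state in a scratch repository (the location, the file contents, and the "Demo User" identity are assumptions) and then reads the history back.

```shell
set -eu
work_dir="$(mktemp -d)"                      # scratch location, assumed for this sketch
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git config user.name "Demo User"             # placeholder identity
git config user.email "demo@example.com"

echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"

git --no-pager log --oneline                 # one line per commit: short hash and message
commit_count="$(git rev-list --count HEAD)"  # number of commits reachable from HEAD
last_message="$(git log -1 --format=%s)"     # subject line of the latest commit
echo "$commit_count commit(s), latest: $last_message"
```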
Push Code to a Remote Repo
We can push the code to the remote repository using the git push command.
git push [<options>] [<repository> [<refspec>...]]
Let’s push the local code to the remote master branch.
git push origin master
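Pushing needs a remote named origin to exist first. The following sketch shows one way to try the whole sequence offline: it uses a local bare repository as a stand-in for the GitHub repository, which is an assumption of this example. With GitHub, the path given to git remote add would be the repository URL instead.

```shell
set -eu
work_dir="$(mktemp -d)"                          # scratch location, assumed for this sketch
git init -q --bare "$work_dir/remote.git"        # stand-in for the GitHub repository

git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git symbolic-ref HEAD refs/heads/master          # name the first branch master, as in the tutorial
git config user.name "Demo User"                 # placeholder identity
git config user.email "demo@example.com"

echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"

git remote add origin "$work_dir/remote.git"     # register the remote under the name origin
git push -q origin master                        # upload the local master branch

local_head="$(git rev-parse HEAD)"
remote_head="$(git ls-remote origin refs/heads/master | cut -f1)"
echo "local:  $local_head"
echo "remote: $remote_head"
```

After a successful push, the two hashes printed at the end should be identical, because the remote master branch now points at the same commit as the local one.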
Pull Code from Remote Repo
We can pull the code from a remote branch into the local branch. The git pull command is used to bring changes (commits) from a remote repository into the local repository.
git pull [<options>] [<repository> [<refspec>...]]
git pull origin master
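A pull only has a visible effect when the remote contains commits that the local repository does not. The sketch below simulates that situation under a few assumptions: a local bare repository plays the role of GitHub, and a hypothetical teammate clone adds a file named utils.py and pushes it before we pull.

```shell
set -eu
work_dir="$(mktemp -d)"                          # scratch location, assumed for this sketch
git init -q --bare "$work_dir/remote.git"        # stand-in for the GitHub repository

# Local repository with one pushed commit.
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"
git remote add origin "$work_dir/remote.git"
git push -q origin master

# A hypothetical teammate clones the remote, adds a file, and pushes it.
git clone -q -b master "$work_dir/remote.git" "$work_dir/teammate"
cd "$work_dir/teammate"
git config user.name "Teammate"
git config user.email "teammate@example.com"
echo '# helper functions' > utils.py
git add utils.py
git commit -q -m "Add utils module"
git push -q origin master

# Back in the first repository, the new commit arrives with git pull.
cd "$work_dir/NLPProject"
git pull -q origin master
ls
commit_count="$(git rev-list --count HEAD)"
echo "commits after pull: $commit_count"
```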
Creating a New Branch
We can create a new branch using the git branch command. The git branch command also offers options to list, rename, and delete branches.
git branch <new-branch>
git branch sentiment_analysis_demo
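The list, rename, and delete options mentioned above could be exercised as in the following sketch. The scratch repository and the branch names scratch_idea and topic_modeling are hypothetical and exist only for this illustration.

```shell
set -eu
work_dir="$(mktemp -d)"                       # scratch location, assumed for this sketch
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"    # a branch needs at least one commit to point at

git branch sentiment_analysis_demo            # create (does not switch to it)
git branch scratch_idea                       # hypothetical second branch
git --no-pager branch                         # list; the current branch is marked with *

git branch -m scratch_idea topic_modeling     # rename a branch
git branch -d topic_modeling                  # delete; -d refuses if the branch is unmerged

branches="$(git for-each-ref --format='%(refname:short)' refs/heads/)"
echo "$branches"
```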
Switching Branch
In Git, branching lets us experiment with new features in parallel. We can switch between those branches using the git checkout command. Let’s check out the sentiment_analysis_demo branch.
git checkout sentiment_analysis_demo
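As a small sketch of switching in practice, the example below moves between branches and prints the current branch name after each step. It also uses git checkout -b, which creates a branch and switches to it in one command. The scratch repository and the branch name topic_modeling_demo are assumptions for illustration.

```shell
set -eu
work_dir="$(mktemp -d)"                       # scratch location, assumed for this sketch
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"

git branch sentiment_analysis_demo
git checkout -q sentiment_analysis_demo       # switch to an existing branch
on_demo="$(git rev-parse --abbrev-ref HEAD)"  # name of the branch HEAD points at

git checkout -q -b topic_modeling_demo        # hypothetical branch: create and switch in one step
on_new="$(git rev-parse --abbrev-ref HEAD)"

git checkout -q master                        # return to master
on_master="$(git rev-parse --abbrev-ref HEAD)"

printf '%s\n' "$on_demo" "$on_new" "$on_master"
```

Git 2.23 and later also provide the git switch command for changing branches; git checkout is kept here to match the rest of the tutorial.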
Merging Branches
Merging branches is a slightly complicated operation. Instead of merging my branch directly into the master branch, I first merge master into my branch (sentiment_analysis_demo), because if there are any conflicts, I can resolve them in the branch itself and the master branch remains clean. Once the branch is up to date, check out the master branch and merge your new branch into it.
git checkout master
git merge sentiment_analysis_demo
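The two-step approach described above can be sketched end to end as follows. The scratch repository and the files sentiment.py and notes.txt are assumptions for illustration. The changes here do not conflict, so both merges complete without manual resolution.

```shell
set -eu
work_dir="$(mktemp -d)"                        # scratch location, assumed for this sketch
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'print("hello")' > app.py
git add app.py
git commit -q -m "Adding application file"

# Work on the feature branch.
git checkout -q -b sentiment_analysis_demo
echo '# sentiment model' > sentiment.py
git add sentiment.py
git commit -q -m "Add sentiment analysis demo"

# Meanwhile, master receives an unrelated commit.
git checkout -q master
echo 'project notes' > notes.txt
git add notes.txt
git commit -q -m "Add project notes"

# Step 1: bring master into the feature branch; conflicts would be resolved here.
git checkout -q sentiment_analysis_demo
git merge -q --no-edit master

# Step 2: merge the up-to-date feature branch into master.
git checkout -q master
git merge -q sentiment_analysis_demo

master_tip="$(git rev-parse master)"
feature_tip="$(git rev-parse sentiment_analysis_demo)"
ls
```

Because the feature branch already contains everything from master after step 1, the merge in step 2 is a fast-forward, and both branches end up pointing at the same commit.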
Git Do’s
- Always prefer to create a new branch for new experimentation.
- Commit and push your changes regularly.
- Always commit changes with appropriate comments.
- Always keep your branch updated with the latest development branch.
- Always prefer to create a pull request for merging changes from one branch to another.
Git Don’ts
- Don’t push secret information such as usernames, passwords, secret keys, and API tokens.
- Don’t directly commit the code to the master or development branch.
- Don’t commit large files or data files to Git.
- Don’t force push unless you are confident about what it will overwrite.
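One common way to follow the first and third points above is a .gitignore file, which tells Git to leave matching files untracked. The sketch below is only an illustration: the patterns, the .env file with its fake token, and the data folder are assumptions, and a real project would choose its own patterns.

```shell
set -eu
work_dir="$(mktemp -d)"                        # scratch location, assumed for this sketch
git init -q "$work_dir/NLPProject"
cd "$work_dir/NLPProject"

# Hypothetical ignore rules: one for secrets, two for data files.
cat > .gitignore <<'EOF'
.env
data/
*.csv
EOF

echo 'API_TOKEN=not-a-real-token' > .env       # placeholder secret, never a real one
mkdir data
echo 'id,label' > data/train.csv               # placeholder data file
echo 'print("hello")' > app.py

git add .                                      # ignored files are skipped automatically
staged="$(git diff --cached --name-only)"      # names of the files that were staged
echo "$staged"

git check-ignore -v .env data/train.csv        # shows which rule ignores each path
```

Only .gitignore and app.py are staged. Note that .gitignore does not remove a file that was already committed, so it is best added before the first commit.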
Summary
In this tutorial, we covered the concepts behind Git and the GitHub version control platform. We focused on how to create a GitHub repo, commit code, push code, pull code, create a new branch, and merge branches. We also discussed the Git do’s and don’ts.