
Git and GitHub for Data Scientists

In this tutorial, we will focus on the basics of Git and the version control platform GitHub.

Most data scientists come from heterogeneous backgrounds such as physics and mathematics, and many come from research and academia. To build and maintain a data science project or product smoothly, data scientists need software engineering best practices. Git is one of the skills every software engineer needs to manage a code base efficiently. In this tutorial, we focus on Git and the GitHub platform as tools for efficient software engineering.

Git is a version control system that tracks every modification to your code; it is the technology that performs the tracking and merging of changes in source code. GitHub is a widely used web-based hosting platform built on Git, and it is also a good place to explore open-source projects. Other platforms similar to GitHub are available, such as GitLab, while tools such as Sourcetree provide a desktop client for working with Git repositories.

Modern software development is not practical without a version control system: maintaining multiple folders and subfolders that hold different states of the code is very difficult, and it risks losing important code. A main reason for Git's popularity is its branching model. Branching lets you create multiple versions of your work and track them in a structured manner. Each branch is like a parallel world: changes made in one branch do not affect the other branches until you merge them.

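In this tutorial, we are going to cover the following topics: GitHub terminology, creating a new repo, committing to the repo, pushing and pulling code, creating and switching branches, merging branches, and Git dos and don'ts.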

GitHub Terminology

  • Repository: A folder that contains the project files and the revision history of each file. A repository can be of two types: local and remote.
  • Cloning: Create a copy of the remote repo on your local machine.
  • Commit: When you commit a change, you save the changes made to your files in the local repository. To make these changes permanent on the remote repository, you need to push them.
  • Push: The push command transfers the commits in your local repository to a remote repository.
  • Pull: The pull command transfers the changes from the remote repository to your local repository.
  • Staging Area: An intermediate place between the working directory and the local Git repository. It lets you review the changes you have made before committing them to the local repository.
  • Branching: It allows you to work on new features or experiment with ideas in isolation. You can create multiple branches in parallel. After finalizing a feature, you can merge its changes into the main/master branch.
  • Merging: It integrates the commits of one branch into another, typically bringing a finished branch into the main/master branch.
  • Pull Request: It lets others review and approve your changes before they are merged into the main/master branch.
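
To make the terms above concrete, here is a minimal sketch that moves one file through the working directory, the staging area, and a commit. The scratch directory, the file name notes.py, and the user identity are assumptions made for illustration only.

```shell
# Sketch only: the scratch directory, file name, and identity are hypothetical.
set -e
workdir=$(mktemp -d)
cd "$workdir"

git init -q                              # repository: creates the hidden .git folder
git config user.name "Example User"      # identity recorded in each commit
git config user.email "user@example.com"

echo 'print("hello")' > notes.py         # a change in the working directory
untracked=$(git status --short)          # "??" marks a file Git is not tracking yet

git add notes.py                         # staging area: the change is queued for the next commit
staged=$(git status --short)             # "A" marks a newly added, staged file

git commit -q -m "Add notes script"      # commit: the snapshot is saved in the local repository
commit_count=$(git rev-list --count HEAD)

echo "$untracked"
echo "$staged"
echo "commits: $commit_count"
```

Nothing in this sketch touches a remote repository; pushing and pulling are separate steps.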

Creating a New Repo

You can create a new repository by clicking the + sign in the upper-right corner of any GitHub page and selecting the New repository option from the drop-down menu.


We can also create a git repository from the local command line.

mkdir NLPProject

Change to the project directory that you created in the last step:

cd NLPProject

Now, initialize the project folder with Git so that Git can manage version control.

git init

Now the project folder contains a hidden subfolder named .git, where Git stores the repository data.

After this, you can create and save the Python files of your project and add them to the local Git repository using the git add and git commit commands.
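
The steps above can be run end to end as a quick check. This is a sketch that assumes a scratch location created with mktemp, so it does not touch any existing project; only the folder name NLPProject comes from the tutorial.

```shell
# Sketch only: the scratch location is an assumption; NLPProject is the folder used above.
set -e
cd "$(mktemp -d)"

mkdir NLPProject
cd NLPProject
git init -q

ls -a                                            # the listing includes the hidden .git folder
inside=$(git rev-parse --is-inside-work-tree)    # prints "true" inside a Git working tree
echo "inside a Git repository: $inside"
```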

Committing to the Repo

Let’s commit your code to the local repository. First, we add the files to the staging area using git add. After that, we check the status using git status, and finally we commit the code using git commit.

git add app.py

Let’s check the status using git status.

git status
On branch master

No commits yet

Changes to be committed:
     (use "git rm --cached <file>..." to unstage)
        new file:   app.py

Let’s commit the code using git commit. Here, we use the -m option to specify the commit message, a short string that explains what you are committing to the repo.

git commit -m "Adding application file"

Here, we made our first contribution to the project.
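
After a commit, it helps to confirm what was recorded. The sketch below repeats the add and commit steps in a scratch repository and then inspects the result with git log, git ls-files, and git status. The scratch directory, the content of app.py, and the user identity are assumptions for illustration.

```shell
# Sketch only: scratch repository, file content, and identity are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo 'print("sentiment demo")' > app.py
git add app.py
git commit -q -m "Adding application file"

last_message=$(git log -1 --pretty=%s)   # subject line of the newest commit
tracked=$(git ls-files)                  # files recorded in the repository
pending=$(git status --short)            # empty when nothing is left to commit

git log --oneline                        # one line per commit: short hash and message
```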

Push Code to a Remote Repo

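We can push the commits from the local repository to a remote repository using the git push command. This assumes a remote named origin is already configured, which is the case when you cloned the repo from GitHub.

git push [<options>] [<repository> [<refspec>...]]

Let’s push the local code to the remote master branch.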

git push origin master
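
To try a push without a GitHub account, a local bare repository can stand in for the remote. The sketch below is an illustration under that assumption: remote.git is a hypothetical stand-in for the GitHub repository, and on GitHub the argument to git remote add would be the repository URL instead of a local path.

```shell
# Sketch only: a local bare repository plays the role of the GitHub remote.
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q --bare remote.git                 # hypothetical stand-in for the remote repository
git init -q NLPProject
cd NLPProject
git symbolic-ref HEAD refs/heads/master       # name the first branch master, whatever the local default is

echo 'print("sentiment demo")' > app.py
git add app.py
git commit -q -m "Adding application file"

git remote add origin "$sandbox/remote.git"   # register the remote under the name origin
git push -q origin master                     # send the local master branch to origin

local_head=$(git rev-parse master)
remote_head=$(git --git-dir="$sandbox/remote.git" rev-parse refs/heads/master)
echo "local:  $local_head"
echo "remote: $remote_head"
```

After the push, both repositories point at the same commit, which is what the two printed hashes show.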

Pull Code from Remote Repo

We can pull code from a remote branch into the local branch. The git pull command fetches the changes (commits) from a remote repository and merges them into the local repository.

git pull [<options>] [<repository> [<refspec>...]]
git pull origin master
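
A pull is easiest to see with two working copies of the same remote. The sketch below assumes a local bare repository as the remote and two hypothetical working copies named author and teammate; none of these names come from the tutorial.

```shell
# Sketch only: remote.git, author, and teammate are hypothetical names.
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master   # make master the remote's default branch

# First working copy: create the project and push it.
git init -q author
cd author
git symbolic-ref HEAD refs/heads/master
echo 'print("v1")' > app.py
git add app.py
git commit -q -m "Adding application file"
git remote add origin "$sandbox/remote.git"
git push -q origin master

# Second working copy: clone the remote.
cd "$sandbox"
git clone -q remote.git teammate

# The first copy publishes a new commit.
cd "$sandbox/author"
echo 'print("v2")' > app.py
git commit -q -am "Update application file"
git push -q origin master

# The second copy pulls it.
cd "$sandbox/teammate"
git pull -q origin master
pulled=$(cat app.py)
echo "$pulled"
```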

Creating a New Branch

We can create a new branch using the git branch command. The git branch command also offers options to list, rename, and delete branches.

git branch <new-branch>
git branch sentiment_analysis_demo
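
The list, rename, and delete options mentioned above can be sketched as follows. The scratch repository, the empty initial commit, and the new name sentiment_demo are assumptions for illustration.

```shell
# Sketch only: scratch repository; sentiment_demo is a hypothetical new name.
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master            # name the first branch master, whatever the local default is
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"    # a branch needs a commit to point at

git branch sentiment_analysis_demo                       # create
git branch                                               # list; the asterisk marks the current branch
git branch -m sentiment_analysis_demo sentiment_demo     # rename
after_rename=$(git branch)
git branch -d sentiment_demo                             # delete; -d refuses if the branch has unmerged work
remaining=$(git branch)
echo "$remaining"
```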

Switching Branch

In Git, branching lets us experiment with new features in parallel. We can switch between branches using the git checkout command. Let’s check out the sentiment_analysis_demo branch:

git checkout sentiment_analysis_demo
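
The sketch below illustrates why branches behave like parallel worlds: a file committed on one branch is not visible after switching back to master. It uses git checkout -b, which creates a branch and switches to it in one step. The scratch repository and the file experiment.py are assumptions for illustration.

```shell
# Sketch only: scratch repository; experiment.py is a hypothetical file.
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master            # name the first branch master, whatever the local default is
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

git checkout -q -b sentiment_analysis_demo         # create the branch and switch to it
echo 'print("experiment")' > experiment.py
git add experiment.py
git commit -q -m "Add experiment"
on_branch=$(git symbolic-ref --short HEAD)         # name of the branch currently checked out

git checkout -q master                             # switch back
if [ -e experiment.py ]; then visible_on_master=yes; else visible_on_master=no; fi

echo "committed on: $on_branch"
echo "visible on master: $visible_on_master"
```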

Merging Branches

Merging branches can be a complicated operation when conflicts arise. Instead of merging my branch straight into the master branch, I first merge master into my branch (sentiment_analysis_demo), because if there are any conflicts, I can resolve them in the branch itself and the master branch remains clean. Once the branch is up to date, check out the master branch and merge your branch into it.

git checkout master

git merge sentiment_analysis_demo
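
Here is a sketch of the full two-step flow described above, under the assumption that both branches have moved on since they split. The scratch repository and the files README.txt and sentiment.py are hypothetical, and this example has no conflicts; a real conflict would have to be resolved by hand in step 1.

```shell
# Sketch only: scratch repository and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master       # name the first branch master, whatever the local default is

echo "project notes" > README.txt
git add README.txt
git commit -q -m "Initial commit"

git checkout -q -b sentiment_analysis_demo    # work on the feature branch
echo 'print("sentiment")' > sentiment.py
git add sentiment.py
git commit -q -m "Add sentiment script"

git checkout -q master                        # meanwhile, master moves on
echo "more notes" >> README.txt
git commit -q -am "Update README"

# Step 1: bring master into the feature branch; conflicts would surface here.
git checkout -q sentiment_analysis_demo
git merge -q --no-edit master                 # --no-edit accepts the default merge message

# Step 2: merge the up-to-date feature branch into master.
git checkout -q master
git merge -q sentiment_analysis_demo

git log --oneline
```

Because the feature branch already contains everything on master after step 1, step 2 simply moves master forward to the same commit.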

Git Do’s

  • Always prefer to create a new branch for new experiments.
  • Commit and push your changes regularly.
  • Always commit changes with meaningful commit messages.
  • Always keep your branch up to date with the latest development branch.
  • Always prefer to create a pull request for merging changes from one branch to another.

Git Don’ts

  • Don’t push secret information such as usernames, passwords, secret keys, and API tokens.
  • Don’t commit code directly to the master or development branch.
  • Don’t commit large files or data files to Git.
  • Don’t force push unless you are confident about what it will overwrite.
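
One common way to follow the first and third points is a .gitignore file, which tells Git to skip matching paths. The sketch below is an illustration only: the patterns, the .env file with a placeholder token, and the data folder are assumptions, not part of the tutorial's project.

```shell
# Sketch only: patterns, file names, and the token value are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q

printf '%s\n' '.env' 'data/' '*.csv' > .gitignore   # paths Git should skip
echo 'API_TOKEN=not-a-real-token' > .env            # a secrets file that must stay local
mkdir data
echo 'a,b' > data/train.csv                         # a data file
echo 'print("app")' > app.py

git add .                                           # stages everything except ignored paths
staged=$(git ls-files)                              # files now in the staging area
echo "$staged"
```

Note that .gitignore only affects files Git is not tracking yet; a secret that was already committed stays in the history.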

Summary

In this tutorial, we covered the basics of Git and the GitHub version control platform. We focused on how to create a GitHub repo, commit code, push code, pull code, create a new branch, and merge branches. We also discussed the Git dos and don’ts.

Avinash Navlani
