Learn about Logistic Regression, its basic properties, and how it works, and build a machine learning model for a real-world application in Python.
Classification techniques are an important part of machine learning and data mining applications. Approximately 70% of problems in Data Science are classification problems. Many classification methods are available, but logistic regression is a very common and useful method for solving binary classification problems. Another category of classification is multinomial classification, which handles problems where more than two classes are present in the target variable. For example, the IRIS dataset is a famous example of multiclass classification. Other examples include classifying articles, blogs, or documents into categories.
Logistic Regression can be used for various classification problems, such as spam detection, diabetes prediction, predicting whether a given customer will purchase a particular product or churn to a competitor, and predicting whether a user will click on a given advertisement link.
Logistic Regression is one of the simplest and most commonly used Machine Learning algorithms for two-class classification. It is easy to implement and can be used as a baseline for any binary classification problem. Its fundamental concepts are also helpful in deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.
In this tutorial, you will learn the following things in Logistic Regression:
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous, meaning there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability that an event occurs.
It can be seen as an extension of linear regression to a categorical target variable: the log of the odds, rather than the target itself, is modeled as a linear function of the inputs. Logistic Regression predicts the probability of occurrence of a binary event using a logit function.
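Linear Regression Equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Where y is the dependent variable, x1, x2, … and xn are explanatory variables, and β0, β1, …, βn are the coefficients to be estimated.
Sigmoid Function:
$$p = \frac{1}{1 + e^{-y}}$$
Apply Sigmoid function on linear regression:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$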
Properties of Logistic Regression:
Linear regression gives you a continuous output, but logistic regression gives a discrete output. Examples of continuous output are house prices and stock prices. Examples of discrete output are predicting whether a patient has cancer or whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.
MLE is a likelihood-maximization method, while OLS is a distance-minimizing approximation method. Maximizing the likelihood function determines the parameters that are most likely to produce the observed data. From a statistical point of view, MLE treats the parameters of an assumed distribution (for a normal distribution, the mean and variance) as unknowns and chooses the values under which the observed data are most probable. The fitted parameters can then be used to make predictions for new data.
Ordinary Least Squares estimates are computed by fitting a regression line to the given data points such that the sum of the squared deviations (least square error) is minimized. Both can be used to estimate the parameters of a linear regression model. MLE assumes a joint probability mass function, while OLS doesn’t require any stochastic assumptions for minimizing distance.
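To make the maximum likelihood idea concrete, here is a minimal sketch that fits a one-feature logistic regression by climbing the log-likelihood with plain gradient ascent. The toy data, the function names, and the choice of gradient ascent are all assumptions made for illustration; this is not how scikit-learn fits its models internally.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the labels ys given inputs xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic_mle(xs, ys, lr=0.1, n_iter=2000):
    """Gradient ascent on the log-likelihood (hypothetical helper)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the log-likelihood: sum of (y - p) times each input
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        g0 = sum(errors)
        g1 = sum(e * x for e, x in zip(errors, xs))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Made-up, deliberately overlapping data so the estimate stays finite
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic_mle(xs, ys)
print("log-likelihood at zero coefficients:", log_likelihood(0.0, 0.0, xs, ys))
print("log-likelihood after fitting:       ", log_likelihood(b0, b1, xs, ys))
```

The fitted coefficients give the observed labels a higher likelihood than the starting point, which is all that "maximum likelihood" asks for; no squared distance is minimized along the way.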
The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that can take any real-valued number and map it into a value between 0 and 1. As the input goes to positive infinity, the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. For example, if the output is 0.75, we can say in terms of probability that there is a 75 percent chance that the patient will suffer from cancer.
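The behaviour described above can be sketched in a few lines of standard-library Python. The helper names (sigmoid, logit, classify) are illustrative, and assigning an output of exactly 0.5 to class 1 is an assumption, since the text only covers values above or below 0.5.

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log of the odds p / (1 - p); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 label (ties go to class 1 by assumption)."""
    return 1 if probability >= threshold else 0

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-6.0, 0.0, 6.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  sigmoid={p:.4f}  class={classify(p)}")

# A probability of 0.75 corresponds to odds of 3 to 1
print("log-odds of 0.75:", logit(0.75))
```

Because logit and sigmoid are inverses, a linear model of the log-odds and a sigmoid applied to a linear model are two views of the same thing.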
Types of Logistic Regression:
Let’s build a diabetes prediction model.
Here, you are going to predict diabetes using a Logistic Regression Classifier.
Let’s first load the required Pima Indian Diabetes dataset using pandas’ read_csv() function. You can download data from the following link:
#import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
Here, you need to divide the given columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables).
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let’s split the dataset using the function train_test_split(). You need to pass three parameters: features, target, and test set size. Additionally, you can set random_state so that the random selection of records is reproducible.
# Split X and y into training and testing sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
Here, the dataset is split into two parts in a ratio of 75:25, meaning 75% of the data will be used for model training and 25% for model testing.
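If you want to see what a seeded 75:25 split amounts to, the following standard-library sketch shuffles row indices with a fixed seed and cuts them in two. It is only an illustration of the idea, not scikit-learn's implementation, and it will not reproduce the exact rows that train_test_split() selects. The helper name shuffled_split is hypothetical, and the row count of 768 is an assumption inferred from the 192 test rows that appear in the confusion matrix later in this tutorial.

```python
import math
import random

def shuffled_split(n_rows, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed makes the shuffle repeatable
    n_test = math.ceil(n_rows * test_size)    # size of the held-out part
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(768)
print(len(train_idx), len(test_idx))          # 576 192

# The same seed always yields the same split
assert shuffled_split(768) == (train_idx, test_idx)
```

Fixing the seed is what makes your results repeatable from run to run; changing it gives a different but equally valid split.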
First, import the Logistic Regression module and create a Logistic Regression classifier object using the LogisticRegression() function.
Then, fit your model on the train set using fit() and perform prediction on the test set using predict().
# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
# predict
y_pred=logreg.predict(X_test)
A confusion matrix is a table that is used to evaluate the performance of a classification model. You can also use it to visualize the performance of an algorithm. The fundamental idea of a confusion matrix is that the numbers of correct and incorrect predictions are summed up class-wise.
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Output: array([[119, 11], [ 26, 36]])
Here, you can see the confusion matrix in the form of an array object. The dimension of this matrix is 2*2 because this model is a binary classifier with two classes, 0 and 1. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions. In the output, 119 and 36 are correct predictions, and 26 and 11 are incorrect predictions.
Let’s visualize the results of the model in the form of a confusion matrix using matplotlib and seaborn.
Here, you will visualize the confusion matrix using Heatmap.
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# name of classes
class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Let’s evaluate the model using model evaluation metrics such as accuracy, precision, and recall.
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Output: Accuracy: 0.8072916666666666 Precision: 0.7659574468085106 Recall: 0.5806451612903226
Well, you got a classification rate of about 81%, which is considered good accuracy.
Precision: Precision is about being precise, i.e., how precise your model is. In other words, when a model makes a positive prediction, how often is it correct? In your prediction case, when your Logistic Regression model predicts that a patient is going to suffer from diabetes, that patient actually has diabetes about 77% of the time.
Recall: Of the patients in the test set who actually have diabetes, your Logistic Regression model is able to identify about 58%.
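These three scores can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout for a binary problem, with rows as actual classes and columns as predicted classes, so the matrix reads [[TN, FP], [FN, TP]]; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted
cnf_matrix = [[119, 11],
              [26, 36]]

(tn, fp), (fn, tp) = cnf_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Seeing the formulas side by side makes the gap between the scores easier to read: the model misses 26 of the 62 diabetic patients, which is why recall is much lower than accuracy.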
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
The AUC score for this case is 0.86. An AUC score of 1 represents a perfect classifier, and 0.5 represents a worthless classifier.
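One way to build intuition for these numbers is the pairwise reading of AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses made-up labels and scores, and the function name auc_by_pairs is hypothetical; it is a slow but transparent alternative to metrics.roc_auc_score, not a replacement for it.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative only)
y_true = [0, 0, 1, 1]
print(auc_by_pairs(y_true, [0.1, 0.4, 0.35, 0.8]))   # 0.75: one pair out of four is misordered
print(auc_by_pairs(y_true, [0.1, 0.2, 0.7, 0.9]))    # 1.0: every positive outranks every negative
print(auc_by_pairs(y_true, [0.5, 0.5, 0.5, 0.5]))    # 0.5: scores carry no information
```

Read this way, an AUC of 0.86 means that a randomly chosen diabetic patient gets a higher predicted probability than a randomly chosen non-diabetic patient about 86% of the time.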
Because of its simple and efficient nature, logistic regression doesn’t require high computation power, is easy to implement and easy to interpret, and is used widely by data analysts and scientists. It also doesn’t require scaling of features. Logistic regression provides a probability score for observations.
Logistic regression is not able to handle a large number of categorical features/variables. It is vulnerable to overfitting. It also can’t solve non-linear problems directly, which is why it requires a transformation of non-linear features. Logistic regression will not perform well with independent variables that are not correlated with the target variable or that are very similar or correlated to each other.
In this tutorial, you covered a lot of details about Logistic Regression. You have learned what logistic regression is, how to build a model, how to visualize results, and some of the theoretical background. You also covered some basic concepts such as the sigmoid function, maximum likelihood, the confusion matrix, and the ROC curve.
Hopefully, you can now utilize the Logistic Regression technique to analyze your own datasets. Thanks for reading this tutorial!
For more such tutorials, projects, and courses visit DataCamp
Originally published at https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
Reach out to me on Linkedin: https://www.linkedin.com/in/avinash-navlani/