XGBoost is one of the most popular boosting algorithms. It is well known to arrive at better solutions as compared to other Machine Learning Algorithms, for both classification and regression tasks.
XGBoost or Extreme Gradient Boosting is an open-source library. Its original codebase is in C++, but the library is combined with Python interface. It helps us achieve a comparatively high-performance implementation of gradient boosted decision trees, can do parallel computations and is easy to implement.
The Boosting technique trains models serial-wise, with every new learning model trained to correct the errors made by its predecessors. In Gradient Boosting, we try to optimize the loss function of the previous learning model by adding a new adaptive model that combines weak learning models. This reduces the loss function. The technique, thus, controls the learning rate and reduces variance by introducing randomness. This performance can be improved even further by using XGBoost.
The objective function of the XGBoost algorithm is a sum of a specific loss function evaluated over all the predictions and the sum of regularization term for all predictors, as:
Let’s get started with Python’s XGBoost library. The first step is to install it. This can be easily done with pip command:
pip install xgboost |
Now, we need a dataset to work with XGBoost. For this article, we will use the wine dataset from sklearn.datasets.load_wine. This is done as:
from sklearn.datasets import load_wine wine = load_wine() x = wine.data y = wine.target |
This is a simple multi-class classification dataset for wine recognition. It has three classes and the number of instances is equal to 178. It has 13 attributes:
Now, let’s split the data into training data testing data. This is done as:
from sklearn.model_selection import train_test_split x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2) |
Next, we need to create the Xgboost specific DMatrix data format. This is done as:
import xgboost as xgb dtrain = xgb.DMatrix(x_train, label=y_train) dtest = xgb.DMatrix(x_test, label=y_test) |
Now to get the XGBoost working, we need to set the parameters for xgboost algorithm. Let’s set some parameters for our model, as:
param = { ‘objective’: ‘multi:softprob’, ‘eta’: 0.4, ‘max_depth’: 3, ‘gamma’: 1.0, ‘num_class’: 3 } |
objective is the loss function used, while num_class is the number of classes in the dataset. In our case, it is equal to 3. max_depth is the maximum depth of the decision tree. eta is the learning rate, which is the step size shrinkage. It is used in the update to prevent overfitting. gamma is the minimum loss reduction that is required to make a further partition on a leaf node of the decision tree.
Now, let’s train our model using XGBoost:
model = xgb.train(param, dtrain, 20) |
20 is the number of training iterations.
The model can be tested as:
import numpy as np y_pred = model.predict(dtest) predicts = np.asarray([np.argmax(val) for val in y_pred]) |
Let’s look at the accuracy of our predictions:
from sklearn.metrics import accuracy_score acc = accuracy_score(y_test, predicts) print(“Accuracy = {}”.format(acc)) |
Output:
Accuracy = 0.9444444444444444 |
This is a quite good value of accuracy for this dataset.
In this article, we looked at XGBoost Algorithm using Python. In the next article, we will focus on Feature Scaling: MinMax, Standard, and Robust Scaler.
In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…
Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…
In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…
Analyze employee churn, Why employees are leaving the company, and How to predict, who will…
Airflow operators are core components of any workflow defined in airflow. The operator represents a…
Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…