3 Ways to Speed Up and Improve Your XGBoost Models
Introduction
Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.
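To make that mechanism concrete, below is a minimal, didactic sketch of the boosting idea for the squared-error case, built with plain scikit-learn decision trees rather than XGBoost's actual (regularized, second-order) implementation: each new tree is fit to the residuals of the ensemble built so far.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(trees, X, learning_rate=0.1, base=0.0):
    # Sum the base prediction and the scaled contribution of every tree
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1):
    base = y.mean()  # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        # Each new tree learns to correct the current ensemble's errors
        residuals = y - boosted_predict(trees, X, learning_rate, base)
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X, residuals))
    return trees, base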
In a recent article, we explored why and how to interpret predictions made by XGBoost models (note that we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time illustrating three strategies to speed it up and improve its performance.
Initial Setup
To illustrate the three strategies to improve and speed up XGBoost models, we will use an employee dataset with demographic and financial attributes describing employees. It is publicly available in this repository.
The following code loads the dataset, removes instances containing missing values, identifies income as the target attribute we want to predict, and separates it from the features.
import pandas as pd

url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/main/employees_dataset_with_missing.csv"
df = pd.read_csv(url).dropna()

X = df.drop(columns=["income"])
y = df["income"]
1. Early Stopping with Clean Data
Although early stopping is popular with complex neural networks, many practitioners don’t consider applying it to ensemble approaches like XGBoost, even though it can strike a great balance between efficiency and accuracy. Early stopping interrupts the iterative training process once the model’s performance on a validation set stabilizes and further improvements become marginal. Not only does this save training costs for large ensembles trained on vast datasets, but it also helps reduce the risk of overfitting.
This example first imports the necessary libraries and preprocesses the data to better suit XGBoost, namely by one-hot encoding categorical features (if any) and downcasting numerical ones for additional efficiency. It then partitions the dataset into training and validation sets.
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# One-hot encode categorical features and downcast numeric columns to float32
X_enc = pd.get_dummies(X, drop_first=True, dtype="uint8")
num_cols = X_enc.select_dtypes(include=["float64", "int64"]).columns
X_enc[num_cols] = X_enc[num_cols].astype("float32")

X_train, X_val, y_train, y_val = train_test_split(
    X_enc, y, test_size=0.2, random_state=42
)
Next, the XGBoost model is trained and tested. The key trick here is the early_stopping_rounds optional argument when initializing our model: its value is the number of consecutive training rounds without improvement on the validation metric after which training stops.
model = XGBRegressor(
    tree_method="hist",
    n_estimators=5000,
    learning_rate=0.01,
    eval_metric="rmse",
    early_stopping_rounds=50,  # stop after 50 rounds without RMSE improvement
    random_state=42,
    n_jobs=-1
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

y_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"Validation RMSE: {rmse:.4f}")
print(f"Best iteration (early-stopped): {model.best_iteration}")
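As a quick follow-up, you can inspect the per-round validation metric that XGBoost recorded during training to see exactly where the error plateaued. A short sketch using the fitted model above, assuming a single eval_set entry (which XGBoost names validation_0):

# Retrieve the per-round validation RMSE recorded during training
history = model.evals_result()
rmse_per_round = history["validation_0"]["rmse"]
print(f"Rounds actually trained: {len(rmse_per_round)}")
print(f"RMSE at best iteration: {rmse_per_round[model.best_iteration]:.4f}")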
2. Native Categorical Handling
The second strategy is suitable for datasets containing categorical attributes. Since our employee dataset doesn’t have any, we will first simulate one, education_level, by binning the existing attribute describing years of education:
bins = [0, 12, 16, float("inf")]  # Assuming <12 years is low, 12-16 is medium, >16 is high
labels = ["low", "medium", "high"]

X["education_level"] = pd.cut(X["education_years"], bins=bins, labels=labels, right=False)
display(X.head(50))  # display() is available in notebooks; use print() in plain scripts
The key to this strategy is processing categorical features more efficiently during training. Once more, there’s a critical, lesser-known argument that enables this in the XGBoost model constructor: enable_categorical=True. This way, we avoid traditional one-hot encoding, which, with multiple categorical features of several categories each, can easily blow up dimensionality. A big win for efficiency here! Additionally, native categorical handling transparently learns useful category groupings, such as splitting one category against the rest, rather than necessarily treating every category in isolation.
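To see the dimensionality effect on our own data, here is a quick illustrative check, assuming X already contains the education_level column created above:

# Compare feature counts with and without one-hot encoding
print("Columns before one-hot encoding:", X.shape[1])
print("Columns after one-hot encoding:", pd.get_dummies(X, drop_first=True).shape[1])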
Incorporating this strategy in our code is extremely simple:
from sklearn.metrics import mean_absolute_error

# Convert object/categorical columns to the pandas 'category' dtype,
# which XGBoost's native handling expects
for col in X.select_dtypes(include=["object", "category"]).columns:
    X[col] = X[col].astype("category")

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    tree_method="hist",
    enable_categorical=True,  # native categorical handling
    learning_rate=0.01,
    early_stopping_rounds=30,
    n_estimators=500
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

y_pred = model.predict(X_val)
print("Validation MAE:", mean_absolute_error(y_val, y_pred))
3. Hyperparameter Tuning with GPU Acceleration
The third strategy may sound obvious since it is hardware-related, but its remarkable value for otherwise time-consuming processes like hyperparameter tuning is worth highlighting. By setting device="cuda" and switching the runtime type to GPU (in a notebook environment like Google Colab, this takes just one click), you can speed up an XGBoost fine-tuning workflow like this:
from sklearn.model_selection import GridSearchCV

base_model = XGBRegressor(
    tree_method="hist",
    device="cuda",  # Key for GPU acceleration
    enable_categorical=True,
    eval_metric="rmse",
    early_stopping_rounds=20,
    random_state=42
)

# Hyperparameter grid to search over
param_grid = {
    "max_depth": [4, 6],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.01, 0.05]
}

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    verbose=1,
    n_jobs=-1  # with a single GPU, n_jobs=1 may avoid device contention
)

grid_search.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Take the best model found
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_val)

# Evaluate it
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Validation RMSE: {rmse:.4f}")
print(f"Best iteration (early-stopped): {getattr(best_model, 'best_iteration', 'N/A')}")
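If the grid grows beyond a handful of combinations, a randomized search over the same space can cut tuning time further by evaluating only a sample of them. A minimal sketch, reusing base_model and param_grid from above:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=base_model,
    param_distributions=param_grid,
    n_iter=4,  # evaluate only 4 of the 16 possible combinations
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42
)
random_search.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best hyperparameters: {random_search.best_params_}")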
Wrapping Up
This article showcased three hands-on examples of improving XGBoost models, with a particular focus on efficiency in different parts of the modeling process. Specifically, we learned how to apply early stopping so training halts once the validation error stabilizes, how to natively handle categorical features without (sometimes burdensome) one-hot encoding, and lastly, how to accelerate otherwise costly processes like model fine-tuning through GPU usage.