3 Ways to Speed Up and Improve Your XGBoost Models
Introduction
Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.
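To make that mechanism concrete, below is a minimal, didactic sketch of the boosting idea for the squared-error case, built with plain scikit-learn decision trees rather than XGBoost's actual (regularized, second-order) implementation: each new tree is fit to the residuals of the ensemble built so far.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(trees, X, learning_rate=0.1, base=0.0):
    # Sum the base prediction and the scaled contribution of every tree
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1):
    base = y.mean()  # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        # Each new tree learns to correct the current ensemble's errors
        residuals = y - boosted_predict(trees, X, learning_rate, base)
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X, residuals))
    return trees, base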
In a recent article, we explored why and how to interpret predictions made by XGBoost models (note that we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time illustrating three strategies to speed it up and improve its performance.
Initial Setup
To illustrate the three strategies to improve and speed up XGBoost models, we will use an employee dataset with demographic and financial attributes describing employees. It is publicly available in this repository.
The following code loads the dataset, removes instances containing missing values, identifies income as the target attribute we want to predict, and separates it from the features.
import pandas as pd

url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/main/employees_dataset_with_missing.csv"
df = pd.read_csv(url).dropna()

X = df.drop(columns=["income"])
y = df["income"]
1. Early Stopping with Clean Data
Although early stopping is popular with complex neural networks, many practitioners don’t consider applying it to ensemble approaches like XGBoost, even though it can strike a great balance between efficiency and accuracy. Early stopping interrupts the iterative training process once the model’s performance on a validation set stabilizes and further improvements become marginal. Not only does this save training costs for large ensembles trained on vast datasets, but it also helps reduce the risk of overfitting.
This example first imports the necessary libraries and preprocesses the data to better suit XGBoost, namely by one-hot encoding categorical features (if any) and downcasting numerical ones for additional efficiency. It then partitions the dataset into training and validation sets.
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# One-hot encode categorical features and downcast numeric columns to float32
X_enc = pd.get_dummies(X, drop_first=True, dtype="uint8")
num_cols = X_enc.select_dtypes(include=["float64", "int64"]).columns
X_enc[num_cols] = X_enc[num_cols].astype("float32")

X_train, X_val, y_train, y_val = train_test_split(
    X_enc, y, test_size=0.2, random_state=42
)
Next, the XGBoost model is trained and tested. The key trick here is the early_stopping_rounds optional argument when initializing our model: its value is the number of consecutive training rounds without improvement on the validation metric after which training stops.
model = XGBRegressor(
    tree_method="hist",
    n_estimators=5000,
    learning_rate=0.01,
    eval_metric="rmse",
    early_stopping_rounds=50,  # stop after 50 rounds without RMSE improvement
    random_state=42,
    n_jobs=-1
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

y_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"Validation RMSE: {rmse:.4f}")
print(f"Best iteration (early-stopped): {model.best_iteration}")
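As a quick follow-up, you can inspect the per-round validation metric that XGBoost recorded during training to see exactly where the error plateaued. A short sketch using the fitted model above, assuming a single eval_set entry (which XGBoost names validation_0):

# Retrieve the per-round validation RMSE recorded during training
history = model.evals_result()
rmse_per_round = history["validation_0"]["rmse"]
print(f"Rounds actually trained: {len(rmse_per_round)}")
print(f"RMSE at best iteration: {rmse_per_round[model.best_iteration]:.4f}")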
2. Native Categorical Handling
The second strategy is suitable for datasets containing categorical attributes. Since our employee dataset doesn’t have any, we will first simulate one, education_level, by binning the existing attribute describing years of education:
bins = [0, 12, 16, float("inf")]  # Assuming <12 years is low, 12-16 is medium, >16 is high
labels = ["low", "medium", "high"]

X["education_level"] = pd.cut(X["education_years"], bins=bins, labels=labels, right=False)
display(X.head(50))  # display() is available in notebooks; use print() in plain scripts
The key to this strategy is processing categorical features more efficiently during training. Once more, there’s a critical, lesser-known argument that enables this in the XGBoost model constructor: enable_categorical=True. This way, we avoid traditional one-hot encoding, which, with multiple categorical features of several categories each, can easily blow up dimensionality. A big win for efficiency here! Additionally, native categorical handling transparently learns useful category groupings, such as splitting one category against the rest, rather than necessarily treating every category in isolation.
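To see the dimensionality effect on our own data, here is a quick illustrative check, assuming X already contains the education_level column created above:

# Compare feature counts with and without one-hot encoding
print("Columns before one-hot encoding:", X.shape[1])
print("Columns after one-hot encoding:", pd.get_dummies(X, drop_first=True).shape[1])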
Incorporating this strategy in our code is extremely simple:
from sklearn.metrics import mean_absolute_error

# Convert object/categorical columns to the pandas 'category' dtype,
# which XGBoost's native handling expects
for col in X.select_dtypes(include=["object", "category"]).columns:
    X[col] = X[col].astype("category")

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    tree_method="hist",
    enable_categorical=True,  # native categorical handling
    learning_rate=0.01,
    early_stopping_rounds=30,
    n_estimators=500
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

y_pred = model.predict(X_val)
print("Validation MAE:", mean_absolute_error(y_val, y_pred))
3. Hyperparameter Tuning with GPU Acceleration
The third strategy may sound obvious since it is hardware-related, but its remarkable value for otherwise time-consuming processes like hyperparameter tuning is worth highlighting. By setting device="cuda" and switching the runtime type to GPU (in a notebook environment like Google Colab, this takes just one click), you can speed up an XGBoost fine-tuning workflow like this:
from sklearn.model_selection import GridSearchCV

base_model = XGBRegressor(
    tree_method="hist",
    device="cuda",  # Key for GPU acceleration
    enable_categorical=True,
    eval_metric="rmse",
    early_stopping_rounds=20,
    random_state=42
)

# Hyperparameter grid to search over
param_grid = {
    "max_depth": [4, 6],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.01, 0.05]
}

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    verbose=1,
    n_jobs=-1  # with a single GPU, n_jobs=1 may avoid device contention
)

grid_search.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Take the best model found
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_val)

# Evaluate it
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Validation RMSE: {rmse:.4f}")
print(f"Best iteration (early-stopped): {getattr(best_model, 'best_iteration', 'N/A')}")
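If the grid grows beyond a handful of combinations, a randomized search over the same space can cut tuning time further by evaluating only a sample of them. A minimal sketch, reusing base_model and param_grid from above:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=base_model,
    param_distributions=param_grid,
    n_iter=4,  # evaluate only 4 of the 16 possible combinations
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42
)
random_search.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best hyperparameters: {random_search.best_params_}")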
Wrapping Up
This article showcased three hands-on examples of improving XGBoost models, with a particular focus on efficiency in different parts of the modeling process. Specifically, we learned how to apply early stopping so training halts once the validation error stabilizes, how to natively handle categorical features without (sometimes burdensome) one-hot encoding, and lastly, how to accelerate otherwise costly processes like model fine-tuning through GPU usage.