
How to Diagnose Why Your Regression Model Fails
Introduction
In regression models, failure occurs when the model produces inaccurate predictions — that is, when error metrics like MAE or RMSE are high — or when the model, once deployed, fails to generalize well to new data that differs from the examples it was trained or tested on. While model failure typically shows up in one or both of these forms, the root causes can be more diverse and subtle.
This article explores some common reasons why regression models may underperform and outlines how to detect these issues. It is also accompanied by practical code excerpts using XGBoost — a robust and highly tunable ensemble-based regression model. Despite its popularity and power, XGBoost can also fail if not trained or evaluated properly!
Diagnostic Points for a Regression Model
First, let’s uncover some common causes for failure in regression models, describing each one and recommending how to diagnose them.
1. Underfitting
When the training data used to build the model is insufficient in quantity, quality, or relevant information to predict target labels, the resulting model is too simple and fails to provide accurate predictions, even on examples that are similar to those used for training. This common problem, known as underfitting, is easy to diagnose: it appears when the error on both the training and test sets is high.

Visualization of underfitting
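A quick way to check for underfitting is to compare the error on the training and test sets. Below is a minimal sketch using synthetic data (substitute your own train/test splits); the deliberately shallow configuration is only there to provoke the effect:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A deliberately simple model (few depth-1 trees) that is likely to underfit
model = xgb.XGBRegressor(n_estimators=5, max_depth=1, random_state=42)
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Underfitting signature: both errors are high and close to each other
print(f"Train RMSE: {train_rmse:.2f} | Test RMSE: {test_rmse:.2f}")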
2. Overfitting
The opposite problem to underfitting is, as its name suggests, overfitting. It happens when the model learns or ‘memorizes’ the training data too well, fitting it excessively, as shown below. When overfitting happens, a model that seems to perform extraordinarily well on training examples performs much worse on future, unseen data. Therefore, a low training error and a high test error are often a strong indicator that the model overfits the training data. In sum, memorizing the training examples rather than learning the general patterns and input-output relationships that are key to making reliable predictions results in a model that ‘gets lost’ as soon as it receives an input data example that is even slightly different from anything it has seen before.

Visualization of overfitting
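The same train-versus-test comparison exposes overfitting, this time as a large gap between the two errors. A minimal sketch on synthetic data, with hyperparameter values chosen only to encourage memorization:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# A small, noisy synthetic dataset makes memorization easy
X, y = make_regression(n_samples=200, n_features=20, noise=25.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Deep trees, many rounds, and a high learning rate: prone to memorizing noise
model = xgb.XGBRegressor(n_estimators=500, max_depth=10, learning_rate=0.3, random_state=42)
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Overfitting signature: near-zero training error, much higher test error
print(f"Train RMSE: {train_rmse:.2f} | Test RMSE: {test_rmse:.2f}")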
3. Data Leakage
Data leakage occurs when a machine learning model uses information during training that would not be available at inference time to predict the target variable. Similar to overfitting, leakage can make a model appear highly accurate during validation, but once deployed, its real-world performance often drops significantly.
However, unlike overfitting, the issue stems from a mismatch in data availability between training and deployment: for example, when future or target-derived features are inadvertently included during training, such as a feature recording the action taken on a house listing after its price was flagged as suspicious.
Data leakage is typically diagnosed by noticing unrealistically low validation error, which may indicate that the model had access to information it shouldn’t have: information that will not be available when making predictions in production.
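To see this symptom in code, here is a sketch that fabricates a hypothetical leaky feature (essentially a noisy copy of the target) and shows the implausibly low held-out error that should raise suspicion:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Fabricated leaky feature: the target itself plus a little noise
leaky_col = y + np.random.RandomState(42).normal(0, 0.01, size=len(y))
X_leaky = np.column_stack([X, leaky_col])

X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, test_size=0.2, random_state=42)
model = xgb.XGBRegressor(random_state=42).fit(X_train, y_train)

# A held-out RMSE that is "too good to be true" is a red flag for leakage
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE with leaked feature: {rmse:.4f}")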
4. Noisy or Irrelevant Features
It is common in datasets with a large number of features that some of them are uninformative or even misleading for predicting the target value. For instance, in a regression model that estimates the price of a house based on its attributes, attributes like age or square footage are relevant, whereas others, like the paint color of the facade, may be irrelevant to the price prediction. Calculating feature importance and using interpretability methods like SHAP can help determine whether some features in the dataset have little or no influence and should be removed to simplify your model without incurring a loss of accuracy.
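A minimal sketch of the feature-importance route, using XGBoost's built-in importance scores on synthetic data where only some features are informative (a SHAP analysis via the shap library would follow a similar pattern):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_regression

# Synthetic data: only 4 of the 10 features actually drive the target
X, y = make_regression(n_samples=1000, n_features=10, n_informative=4, noise=10.0, random_state=42)

model = xgb.XGBRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features by the model's importance scores; near-zero scores
# mark candidates for removal
importances = pd.Series(
    model.feature_importances_,
    index=[f"feature_{i}" for i in range(X.shape[1])],
)
print(importances.sort_values(ascending=False))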
5. Poor Data Preprocessing
Missing values, numerical attributes with disparate scales, and raw categorical features should be properly preprocessed according to their estimated relevance to the predictive problem at hand. Neglecting important preprocessing steps like scaling, missing-value imputation, and categorical encoding can negatively affect model performance. These issues are diagnosed through data inspection and profiling methods like summary statistics, correlation analysis, or heatmaps that reveal missing values.
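A few pandas one-liners cover most of this inspection. A sketch on a small hypothetical DataFrame (replace it with your own data):

import pandas as pd

# Hypothetical toy DataFrame with missing values and disparate scales
df = pd.DataFrame({
    "age": [12, 35, None, 8],
    "sqft": [900, 2400, 1500, None],
    "price": [210.0, 580.0, 330.0, 190.0],
})

print(df.isna().sum())              # count missing values per column
print(df.describe())                # summary statistics reveal disparate scales
print(df.corr(numeric_only=True))   # correlation analysis among numeric features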
6. Wrong Hyperparameters
Models like XGBoost that require us to set several hyperparameters before training assume either a strong level of expertise to define the right hyperparameter setting or (more frequently) a hyperparameter tuning process using a validation scheme like cross-validation to find the best configuration. Setting wrong values for hyperparameters like the learning rate or the depth of the decision trees leads to underperforming models. Comparing your specific hyperparameter setting against a default model is a quick way to verify whether your configuration is appropriate, as sketched below.
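A hedged sketch of this sanity check, cross-validating a candidate configuration against XGBoost's defaults on synthetic data; if the defaults win comfortably, the candidate setting is probably off:

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# A (deliberately questionable) candidate configuration vs. the library defaults
custom = xgb.XGBRegressor(n_estimators=10, max_depth=1, learning_rate=0.5, random_state=42)
default = xgb.XGBRegressor(random_state=42)

for name, model in [("custom", custom), ("default", default)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name} RMSE: {-scores.mean():.2f}")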
7. Insufficient Data
Last but not least, having too few data examples to learn a reliable predictive pattern or generalize to future data is a problem in itself — even though it can often be part of the reason for other issues like underfitting or overfitting. Data volume is especially critical when using more complex models, which typically cannot learn effectively from a small number of labeled examples.
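A learning curve is the standard diagnostic here: if the validation error is still falling as the training set grows, more data would likely help. A minimal sketch using scikit-learn's learning_curve on synthetic data:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Evaluate the model at increasing fractions of the training data
sizes, train_scores, val_scores = learning_curve(
    xgb.XGBRegressor(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error",
)

# If validation RMSE keeps improving with more examples, data volume is a bottleneck
for n, score in zip(sizes, -val_scores.mean(axis=1)):
    print(f"{n:4d} training examples -> validation RMSE {score:.2f}")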
Practical Example: Predicting House Prices with XGBoost
We will revisit some of the insights from the above discussion through the following example that trains regression models to predict house prices, using the publicly available California Housing dataset, namely the version available in the scikit-learn library.
This code imports the necessary modules, loads the dataset, separates the predictor features from the target, and splits the examples into training and test sets:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np
import pandas as pd

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We’ll first train an XGBoost model whose hyperparameters have been configured at random (or intentionally poorly), using it later as our baseline model for comparisons:
base_model = xgb.XGBRegressor(
    n_estimators=10,
    max_depth=1,
    learning_rate=0.5,
    random_state=42
)

base_model.fit(X_train, y_train)
y_pred_base = base_model.predict(X_test)
print("Baseline model RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_base)))
If you are fairly familiar with decision trees, you may have already raised an eyebrow at the setting of max_depth=1. This is intentional: for illustrative purposes, we need a model that looks problematic so that we have something to diagnose.
The resulting error (RMSE) is roughly 0.763: pretty high, if we take into account that the target variable (median house value) is expressed in hundreds of thousands of US dollars.
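To confirm that this is underfitting rather than some other problem, we can also compute the training error; if it is similarly high, the model is simply too weak. A quick check reusing the objects defined above:

# Training error close to the (high) test error confirms underfitting
y_pred_train = base_model.predict(X_train)
print("Baseline model training RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))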
Now let's train a model with a more carefully planned hyperparameter setting, better aligned with the nature and complexity of the dataset (if you are unsure how to choose these values, there are search techniques designed to help: check this article and this one):
good_model = xgb.XGBRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

good_model.fit(X_train, y_train)
y_pred_good = good_model.predict(X_test)
print("Good model RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_good)))
This model's RMSE drops dramatically to roughly 0.453. While likely still subject to further improvement, this is a drastic step forward with respect to the baseline model.
In this brief example, we compared RMSEs to diagnose performance issues, but other methods to diagnose model failures could include:

- Comparing training and test errors to detect underfitting or overfitting
- Plotting residuals to reveal systematic prediction errors
- Inspecting feature importance or SHAP values to identify uninformative features
- Using cross-validation or learning curves to assess stability and whether more data would help
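For instance, a quick residual analysis of the tuned model, reusing the objects from the example above, might look like this; random scatter around zero is what we want, whereas curvature or fanning signals systematic error:

import matplotlib.pyplot as plt

# Residuals of the tuned model: should scatter randomly around zero
residuals = y_test - y_pred_good
plt.scatter(y_pred_good, residuals, alpha=0.3)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted house value")
plt.ylabel("Residual")
plt.title("Residual plot: look for patterns, curvature, or fanning")
plt.show()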
Final Thoughts
This article examined several common reasons why regression models in machine learning may fail to perform well, from data quality issues to poorly defined model configurations. The discussion placed particular focus on ways to diagnose these diverse root causes for underperforming regression models, followed by an example of training and comparing two XGBoost regressors as a way to identify potential issues in one of them.