Why Decision Trees Fail (and How to Fix Them)

By Josh
November 29, 2025
In AI, Analytics and Automation


In this article, you will learn why decision trees sometimes fail in practice and how to correct the most common issues with simple, effective techniques.

Topics we will cover include:

  • How to spot and reduce overfitting in decision trees.
  • How to recognize and fix underfitting by tuning model capacity.
  • How noisy or redundant features mislead trees and how feature selection helps.

Let’s not waste any more time.


Decision tree-based models for predictive machine learning tasks like classification and regression offer clear advantages, such as their ability to capture nonlinear relationships among features and an intuitive interpretability that makes decisions easy to trace. However, they are not perfect and can fail, especially when trained on datasets of moderate to high complexity, where issues like overfitting, underfitting, or sensitivity to noisy features typically arise.

In this article, we examine three common reasons why a trained decision tree model may fail, and we outline simple yet effective strategies to cope with these issues. The discussion is accompanied by Python examples ready for you to try yourself.

1. Overfitting: Memorizing the Data Rather Than Learning from It

Scikit-learn's simplicity and intuitiveness in building machine learning models can be tempting, and one may think that simply training a model “by default” should yield satisfactory results. However, a common problem in many machine learning models is overfitting: the model learns the training data so thoroughly that it nearly memorizes every single example it has been exposed to. As a result, as soon as the trained model is exposed to new, unseen examples, it struggles to figure out what the correct output prediction should be.

This example trains a decision tree on the popular, publicly available California Housing dataset, a common dataset of intermediate complexity and size used for regression tasks: predicting the median house price in a California district from demographic features and the average house characteristics in that district.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Loading the dataset and splitting it into training and test sets
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Building a tree without specifying maximum depth
overfit_tree = DecisionTreeRegressor(random_state=42)
overfit_tree.fit(X_train, y_train)

print("Train RMSE:", np.sqrt(mean_squared_error(y_train, overfit_tree.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, overfit_tree.predict(X_test))))

Note that we trained a decision tree-based regressor without specifying any hyperparameters, including constraints on the shape and size of the tree. Yes, that will have consequences, namely a drastic gap between the nearly zero error (notice the scientific notation e-16 below) on the training examples and the much higher error on the test set. This is a clear sign of overfitting.

Output:

Train RMSE: 3.013481908235909e-16
Test RMSE: 0.7269954649985176

To address overfitting, a frequent strategy is regularization, which consists of reducing the model's complexity. While for other models this entails a somewhat intricate mathematical approach, for decision trees in scikit-learn it is as simple as constraining aspects like the maximum depth the tree can grow to, or the minimum number of samples a leaf node must contain: both hyperparameters are designed to keep the tree from growing out of control.

pruned_tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=20, random_state=42)
pruned_tree.fit(X_train, y_train)

print("Train RMSE:", np.sqrt(mean_squared_error(y_train, pruned_tree.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pruned_tree.predict(X_test))))

Output:

Train RMSE: 0.6617348643931361
Test RMSE: 0.6940789988854102

Overall, the second tree is preferred over the first, even though the error in the training set increased. The key lies in the error on the test data, which is normally a better indicator of how the model might behave in the real world, and this error has indeed decreased relative to the first tree.
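
If you would rather not hand-pick values such as max_depth=6 and min_samples_leaf=20, cross-validation can choose them from the training data alone. Below is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid is only an illustrative starting point, not a tuned recommendation.

from sklearn.model_selection import GridSearchCV

# Illustrative sketch: letting cross-validation pick the pruning hyperparameters.
# The grid values are arbitrary starting points, not tuned recommendations.
param_grid = {"max_depth": [4, 6, 8, 10], "min_samples_leaf": [1, 5, 20, 50]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, search.best_estimator_.predict(X_test))))

Choosing hyperparameters on cross-validated training folds, rather than on the test set, keeps the test error an honest estimate of real-world performance.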

2. Underfitting: The Tree Is Too Simple to Work Well

At the opposite end of the spectrum from overfitting lies underfitting: the model has learned so little from the training data that its performance falls below expectations even on that very data.

While overfit trees are normally overgrown and deep, underfitting is usually associated with shallow tree structures.
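
A quick way to check which regime a fitted tree falls into is to inspect its structure directly: scikit-learn estimators expose get_depth() and get_n_leaves() once fitted. A minimal sketch, reusing the X_train and y_train split from the California Housing example above:

# Minimal sketch: comparing the structure of an unconstrained tree and a very shallow one.
# Reuses X_train / y_train from the California Housing example in the previous section.
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
stub_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)

for name, model in [("unconstrained", deep_tree), ("max_depth=2", stub_tree)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")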

One way to address underfitting is to carefully increase the model complexity, taking care not to make it overly complex and run into the previously explained overfitting problem. Here’s an example (try it yourself in a Colab notebook or similar to see results):


from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

wine = fetch_openml(name="wine-quality-red", version=1, as_frame=True)
X, y = wine.data, wine.target.astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A tree that is too shallow (depth of 2) is likely prone to underfitting
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=42)
shallow_tree.fit(X_train, y_train)

print("Train RMSE:", np.sqrt(mean_squared_error(y_train, shallow_tree.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, shallow_tree.predict(X_test))))

And a version that reduces the error and alleviates underfitting:

better_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
better_tree.fit(X_train, y_train)

print("Train RMSE:", np.sqrt(mean_squared_error(y_train, better_tree.predict(X_train))))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, better_tree.predict(X_test))))
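
To find a reasonable middle ground between the two extremes, a simple option is to sweep max_depth over a small range and watch where the test error stops improving. A rough sketch, reusing the wine-quality split from above (the range of depths is illustrative):

# Rough sketch: sweeping tree depth to see where the test error stops improving.
# Reuses X_train, X_test, y_train, y_test from the wine-quality example above.
for depth in [2, 3, 4, 5, 7, 10, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
    print(f"max_depth={depth}: train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")

Train RMSE keeps dropping as depth grows, while test RMSE typically bottoms out at a moderate depth; that elbow marks the capacity you want.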

3. Misleading Training Features: Inducing Distraction

Decision trees can also be very sensitive to features that are irrelevant or redundant alongside the existing ones. This is a matter of the “signal-to-noise ratio”: the more signal (information valuable for predictions) and the less noise your data contains, the better the model performs. Imagine a tourist who got lost around Kyoto Station and asks for directions to Kiyomizu-dera Temple, located several kilometres away. Given instructions like “take bus EX101, get off at Gojozaka, and walk up the street leading uphill,” she will probably reach the destination easily; but if she is told to walk all the way there, through dozens of turns and street names, she might end up lost again. This is a metaphor for the signal-to-noise ratio in models like decision trees: a handful of informative features are easy to follow, while a flood of noisy ones leads the model astray.

Careful, strategic feature selection is typically the way to address this issue. The slightly more elaborate example below compares a baseline tree model, the same model after artificial noise is intentionally added to the dataset to simulate poor-quality training data, and a final model that applies feature selection to recover performance.


from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
import numpy as np, pandas as pd, matplotlib.pyplot as plt

adult = fetch_openml("adult", version=2, as_frame=True)
X, y = adult.data, (adult.target == ">50K").astype(int)
cat, num = X.select_dtypes("category").columns, X.select_dtypes(exclude="category").columns
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=42)

def make_preprocessor(df):
    return ColumnTransformer([
        ("num", "passthrough", df.select_dtypes(exclude="category").columns),
        ("cat", OneHotEncoder(handle_unknown="ignore"), df.select_dtypes("category").columns)
    ])

# Baseline model
base = Pipeline([
    ("prep", make_preprocessor(X)),
    ("clf", DecisionTreeClassifier(max_depth=None, random_state=42))
]).fit(Xtr, ytr)
print("Baseline acc:", round(accuracy_score(yte, base.predict(Xte)), 3))

# Adding 300 noisy features to emulate a poorly performing model due to being trained on noise
rng = np.random.RandomState(42)
noise = pd.DataFrame(rng.normal(size=(len(X), 300)), index=X.index, columns=[f"noise_{i}" for i in range(300)])
X_noisy = pd.concat([X, noise], axis=1)

Xtr, Xte, ytr, yte = train_test_split(X_noisy, y, stratify=y, random_state=42)
noisy = Pipeline([
    ("prep", make_preprocessor(X_noisy)),
    ("clf", DecisionTreeClassifier(max_depth=None, random_state=42))
]).fit(Xtr, ytr)
print("With noise acc:", round(accuracy_score(yte, noisy.predict(Xte)), 3))

# Our fix: applying feature selection with SelectKBest() in a pipeline
sel = Pipeline([
    ("prep", make_preprocessor(X_noisy)),
    ("select", SelectKBest(mutual_info_classif, k=20)),
    ("clf", DecisionTreeClassifier(max_depth=None, random_state=42))
]).fit(Xtr, ytr)
print("After selection acc:", round(accuracy_score(yte, sel.predict(Xte)), 3))

# Plotting feature importance
importances = noisy.named_steps["clf"].feature_importances_
names = noisy.named_steps["prep"].get_feature_names_out()
pd.Series(importances, index=names).nlargest(20).plot(kind="barh")
plt.title("Top 20 Feature Importances (Noisy Model)")
plt.gca().invert_yaxis()
plt.show()

If everything went well, the model built after feature selection should yield the best results. Try playing with the k used for feature selection (set to 20 in the example) and see if you can further improve the last model's performance.
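
One way to avoid guessing k is to tune it with cross-validation on the pipeline itself; in scikit-learn, parameters of a named pipeline step are addressed as select__k. A minimal sketch building on the sel pipeline above (the candidate values are illustrative, and mutual information scoring can make this search slow):

from sklearn.model_selection import GridSearchCV

# Illustrative sketch: tuning the number of selected features via cross-validation.
# Reuses the `sel` pipeline and the noisy Xtr / Xte / ytr / yte split defined above.
grid = GridSearchCV(sel, {"select__k": [10, 20, 40, 80]}, scoring="accuracy", cv=3, n_jobs=-1)
grid.fit(Xtr, ytr)

print("Best k:", grid.best_params_["select__k"])
print("Tuned accuracy:", round(accuracy_score(yte, grid.best_estimator_.predict(Xte)), 3))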

Conclusion

In this article, we explored and illustrated three common issues that may lead trained decision tree models to behave poorly: overfitting, underfitting, and misleading (noisy or irrelevant) features. We also showed simple yet effective strategies for navigating these problems.


