Decision Trees Aren’t Just for Tabular Data

Image by Editor | ChatGPT

Introduction

Versatile, interpretable, and effective across a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, used for both classification and regression tasks. They remain widely used today, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes their versatility even further: they can accommodate data in diverse formats, beyond fully structured, tabular data. This article examines this facet of decision trees from both a theoretical and a practical angle.

Quick Overview of Decision Trees

Decision trees are supervised learning models for predictive tasks, namely classification and regression. They are trained on a set of labeled examples, i.e. data examples with known prediction outputs: for instance, the attributes of a set of collected animal specimens along with the species each one belongs to. The tree is built step by step as the training data is recursively partitioned into subsets, with each split chosen to make the resulting subsets as homogeneous as possible in their class (or numerical label). Once trained, the model has learned a hierarchical set of decision rules applied to data attributes, visually represented as a tree (see image below).

Overview of a decision tree for penguin species classification
Image by Author
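
To make the partitioning idea concrete, here is a minimal sketch of how a single split might be chosen on one numeric attribute. It assumes Gini impurity as the homogeneity measure (one common choice, and the default criterion in scikit-learn, though the article does not prescribe one), and the flipper lengths and species labels are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure subset, higher for mixed classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted impurity)."""
    best_threshold, best_score = None, float("inf")
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each side's impurity by the share of examples it receives
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical flipper lengths (mm) and species, made up for illustration
flipper_mm = [181, 186, 190, 195, 210, 215, 220, 230]
species = ["Adelie"] * 4 + ["Gentoo"] * 4

threshold, impurity = best_split(flipper_mm, species)
print(threshold, impurity)  # 202.5 0.0
```

A full tree learner would repeat this search over every attribute and then recurse into each resulting subset until a stopping rule (such as a maximum depth) is met.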

Predicting the label of a new example consists of checking these rules or conditions from top to bottom, eventually reaching a “leaf node” that holds a class or value prediction, depending on whether the problem entails classification or regression.
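
The top-to-bottom traversal can be sketched with a tree written by hand as nested dictionaries. The attribute names, thresholds, and leaf labels below are illustrative assumptions, not values learned from real data.

```python
# A hand-written toy tree; the features, thresholds, and labels are illustrative only
tree = {
    "feature": "flipper_length_mm", "threshold": 206.0,
    "left": {
        "feature": "bill_length_mm", "threshold": 43.0,
        "left": {"prediction": "Adelie"},
        "right": {"prediction": "Chinstrap"},
    },
    "right": {"prediction": "Gentoo"},
}

def predict(node, example):
    """Follow the rules from the root down until a leaf node is reached."""
    while "prediction" not in node:
        branch = "left" if example[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]

example = {"flipper_length_mm": 195.0, "bill_length_mm": 48.5}
print(predict(tree, example))  # Chinstrap
```

For a regression tree the traversal would be identical; only the leaves would hold numbers instead of class names.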

Beyond Tabular Data in Decision Trees

Structured or tabular data, organized into instances (rows) described by numerical and categorical attributes (columns), is the typical data format digested by most classical machine learning models, including decision trees. However, decision trees can also accommodate datasets, or parts of them, that aren’t strictly tabular.

Common examples of non-tabular data include text, images, and time series. Through the application of suitable preprocessing techniques, these data formats can be converted into a more structured form. For instance, a text sequence like a customer review of a product can be turned into structured features through feature extraction or embeddings, and those features can then be used as inputs to a decision tree classifier that predicts the positive or negative sentiment behind the review.
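
Time series can be handled in a similar spirit. The sketch below is one possible approach, not the only one: each series is collapsed into a fixed-length row of summary statistics, which a tree can then consume like any other tabular data. The data is synthetic, and the choice of statistics in the hypothetical `summarize` helper is an assumption made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def summarize(series):
    """Collapse one series into a fixed-length row of summary statistics."""
    slope = np.polyfit(np.arange(len(series)), series, deg=1)[0]
    return [series.mean(), series.std(), series.min(), series.max(), slope]

# Synthetic data: class 0 is flat noise, class 1 is noise around an upward trend
flat = [rng.normal(0.0, 1.0, size=50) for _ in range(40)]
trending = [rng.normal(0.0, 1.0, size=50) + 0.2 * np.arange(50) for _ in range(40)]

X = np.array([summarize(s) for s in flat + trending])  # one row per series
y = np.array([0] * 40 + [1] * 40)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(X.shape)          # (80, 5)
print(clf.score(X, y))  # accuracy on the training data itself, not a held-out estimate
```

The two synthetic classes are deliberately easy to separate; on real series the useful statistics depend heavily on the domain.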

Another strategy for using decision trees in predictive tasks that involve partly unstructured data (for example, product data that combines tabular attributes with high-resolution images of the product) is a hybrid solution that combines a deep learning model with a decision tree. For instance, a convolutional neural network (CNN) can be trained to extract features from an image in a structured format, inferring attributes like size, shape, and colors. Those image-based features are then passed to a tree-based model like a random forest to calculate predictions, e.g. estimated product sales.
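
The overall shape of such a hybrid can be sketched without a deep learning framework. In the sketch below, two fixed edge filters followed by ReLU and max pooling stand in for a trained CNN (a deliberate simplification, not a real network), scikit-learn's built-in 8x8 digits images stand in for product photos, and the extra tabular column is random noise included only to show the concatenation step. The task is classification rather than the sales regression mentioned above.

```python
import numpy as np
from scipy.signal import correlate2d
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a trained CNN: two fixed 3x3 edge filters, ReLU, and 2x2 max pooling
FILTERS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),  # vertical edges
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),  # horizontal edges
]

def extract_features(image):
    """Turn one 8x8 image into a flat vector of 18 pooled filter responses."""
    features = []
    for kernel in FILTERS:
        # "valid" correlation of an 8x8 image with a 3x3 kernel gives a 6x6 map
        response = np.maximum(correlate2d(image, kernel, mode="valid"), 0.0)
        # Non-overlapping 2x2 max pooling reduces the 6x6 map to 3x3
        pooled = response.reshape(3, 2, 3, 2).max(axis=(1, 3))
        features.append(pooled.ravel())
    return np.concatenate(features)

digits = load_digits()
X_img = np.array([extract_features(img) for img in digits.images])

# Hypothetical tabular attribute (random noise, purely to show the concatenation)
rng = np.random.default_rng(0)
X_tab = rng.normal(size=(len(X_img), 1))
X = np.hstack([X_img, X_tab])

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.2, random_state=42
)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(X.shape, forest.score(X_test, y_test))
```

In a realistic setup, `extract_features` would be replaced by the output of a trained network's feature layer, while the concatenation and the tree-based model would stay essentially the same.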

In research circles, there have been efforts to adapt decision tree-based models to digest non-tabular data like graphs and hierarchical data directly, although their mainstream application is still rare.

Practical Example

To wrap up with a bit of practical flavor, we will illustrate how to train a decision tree-based model on a dataset that combines purely tabular and text data.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.sparse import hstack

# Load the dataset and drop rows with missing values in the columns we use
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/customer_support_dataset.csv"
df = pd.read_csv(url)
df = df.dropna(subset=['prior_tickets', 'account_age_days', 'text', 'label'])

# Turn the free-text column into a sparse TF-IDF matrix (unigrams and bigrams)
text_vec = TfidfVectorizer(max_features=1000, ngram_range=(1, 2), stop_words='english')
X_text = text_vec.fit_transform(df['text'])

# Stack the text features side by side with the two numeric columns
X_num = df[['prior_tickets', 'account_age_days']].values
X = hstack([X_text, X_num])

# Encode the string labels as integers
le = LabelEncoder()
y = le.fit_transform(df['label'])

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a depth-limited decision tree and predict on the test set
clf = DecisionTreeClassifier(max_depth=6, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Map integer predictions back to the original label names for reporting
y_test_labels = le.inverse_transform(y_test)
y_pred_labels = le.inverse_transform(y_pred)
print(classification_report(y_test_labels, y_pred_labels, zero_division=0))

# Plot the confusion matrix as a heatmap
cm = confusion_matrix(y_test_labels, y_pred_labels, labels=le.classes_)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=le.classes_, yticklabels=le.classes_, cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In essence, this code does the following:

uses a dataset that contains three predictor attributes describing customer support tickets
one of them is text, which needs to be preprocessed before being input to a decision tree model
a TF-IDF vectorizer is employed to obtain a vector representation of each text
afterwards, this new feature is merged with the other features to train a decision tree classifier and evaluate it on a test set
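
The same steps could also be wrapped in a scikit-learn pipeline, as in the sketch below. It reuses the column names from the code above, but the rows are hypothetical and invented so that the example runs without downloading anything; the `df_toy` and `new_ticket` names are likewise introduced here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows that mimic the columns used above; not taken from the real dataset
df_toy = pd.DataFrame({
    "text": [
        "I was charged twice this month",
        "The app crashes when I log in",
        "Please refund my last payment",
        "Login page shows an error",
        "Wrong amount on my invoice",
        "Cannot reset my password",
    ],
    "prior_tickets": [0, 3, 1, 5, 0, 2],
    "account_age_days": [30, 400, 120, 800, 15, 365],
    "label": ["billing", "technical", "billing", "technical", "billing", "technical"],
})

preprocess = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer
    ("text", TfidfVectorizer(stop_words="english"), "text"),
    # Trees do not need feature scaling, so numeric columns pass through unchanged
    ("num", "passthrough", ["prior_tickets", "account_age_days"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=6, random_state=42)),
])
model.fit(df_toy.drop(columns="label"), df_toy["label"])

new_ticket = pd.DataFrame({
    "text": ["I need a refund for a double charge"],
    "prior_tickets": [1],
    "account_age_days": [60],
})
print(model.predict(new_ticket))
```

One reason to prefer this layout is that the vectorizer is fitted only on the rows passed to `fit`, whereas the code above fits it on the full dataset before splitting, so the vocabulary and IDF weights also reflect the test rows.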

You may execute this code to train the model and be disappointed by its performance (roughly as many correct predictions as incorrect ones). This is expected: the dataset is small, with just 100 instances, and learning from text representations typically requires many more.

Conclusion

This article discussed how decision trees and tree-based machine learning models like random forests can accommodate data that is not strictly tabular. From text to images to time series, the data can be preprocessed into structured features, or the tree can be combined with another model, to handle inputs that many would at first glance consider impossible for a tree to digest.