Making Sense of Text with Decision Trees

In this article, you will learn how to:

Build a decision tree classifier for spam email detection that analyzes text data.
Incorporate text data modeling techniques like TF-IDF and embeddings for training your decision tree.
Evaluate and compare classification results against other text classifiers, like Naive Bayes, using Scikit-learn.

Introduction

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, often based on structured, tabular data. However, when combined with the right tools, decision trees also become powerful predictive tools for unstructured data, such as text or images, and even time series data.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Building Decision Trees for Text Classification

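The following hands-on tutorial uses the publicly available SMS Spam Collection dataset from the UCI repository: a collection of text-label pairs in which each message is labeled as spam or ham (“ham” is a colloquial term for non-spam messages). The messages are SMS texts rather than emails, but they serve as a convenient stand-in for the spam email problem, and the rest of this article refers to them as emails.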

The following code requests, decompresses, and loads the dataset via its public repository URL into a Pandas DataFrame object named df:

import pandas as pd
import requests
import zipfile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
r = requests.get(url)
open("smsspamcollection.zip", "wb").write(r.content)

with zipfile.ZipFile("smsspamcollection.zip", "r") as z:
    with z.open("SMSSpamCollection") as f:
        df = pd.read_csv(f, sep="\t", names=["label", "text"])

df.head()

As a quick first check, let’s view the count of spam versus ham emails:

df["label"].value_counts()

There are 4,825 ham emails (about 87%) and 747 spam emails (about 13%). This indicates we are dealing with a class-imbalanced dataset. Keep this in mind, as a simple metric like accuracy won’t be the best standalone measure for evaluation.
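
To see why accuracy alone can mislead here, consider a rough back-of-the-envelope sketch: a hypothetical "classifier" that labels every message as ham. It uses only the class counts reported above; the variable names are illustrative.

```python
# Class counts reported above for the dataset.
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# A trivial "classifier" that labels every message as ham.
accuracy = n_ham / total   # every ham is correct, every spam is wrong
spam_recall = 0 / n_spam   # no spam message is ever caught

print(f"accuracy:    {accuracy:.3f}")   # 0.866
print(f"spam recall: {spam_recall:.3f}")  # 0.000
```

Such a model looks respectable on accuracy while being useless as a spam filter, which is why the per-class precision and recall matter in the evaluations that follow.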

Next, we split the dataset (both input texts and labels) into training and test subsets. Due to the class imbalance, we will use stratified sampling to maintain the same class proportions in both subsets, so that the model trains on a representative mix and the test metrics reflect the true spam-to-ham ratio.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)
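
As a quick sanity check on what stratification does, the sketch below uses a small synthetic label set (invented for illustration, with roughly the same imbalance) and confirms that the spam share is preserved in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 860 ham and 140 spam placeholder messages.
labels = pd.Series(["ham"] * 860 + ["spam"] * 140)
texts = pd.Series([f"message {i}" for i in range(len(labels))])

_, _, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of each class in the training and test subsets.
print(y_tr.value_counts(normalize=True).round(2).to_dict())
print(y_te.value_counts(normalize=True).round(2).to_dict())
```

Without `stratify`, the spam share in a random 20% sample could drift by chance, which matters more the rarer the minority class is.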

Now, we are ready to train our first decision tree model. A key aspect here is encoding the text data into a structured format that decision trees can handle. One common approach is TF-IDF vectorization. TF-IDF maps each text into a sparse numerical vector, where each dimension (feature) represents a term from the existing vocabulary, weighted by its TF-IDF score.

Scikit-learn’s Pipeline class provides an elegant way to chain these steps. We’ll create a pipeline that first applies TF-IDF vectorization using TfidfVectorizer and then trains a DecisionTreeClassifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

tfidf_tree = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(random_state=42))
])

tfidf_tree.fit(X_train, y_train)
y_pred = tfidf_tree.predict(X_test)

print("MODEL 1. Decision Tree + TF-IDF:")
print(classification_report(y_test, y_pred))

Results:

MODEL 1. Decision Tree + TF-IDF:
              precision    recall  f1-score   support

         ham       0.97      0.99      0.98       966
        spam       0.91      0.83      0.87       149

    accuracy                           0.97      1115
   macro avg       0.94      0.91      0.92      1115
weighted avg       0.97      0.97      0.97      1115

The results look good at first glance, but the overall accuracy and the weighted averages are inflated by the dominant ham class. If catching all spam is critical, we should pay special attention to the recall for the spam class, which is only 0.83 in this case: roughly one in six spam messages slips through. Spam precision is higher at 0.91, meaning relatively few ham emails are incorrectly marked as spam. This is the priority if we want to avoid important messages being sent to the spam folder.
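
For readers who want to see how precision and recall for the spam class are derived, the following sketch computes both from a confusion matrix. The labels here are hypothetical and hand-made, not the tutorial's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, invented for illustration only.
y_true = ["ham"] * 8 + ["spam"] * 4
y_hat = ["ham"] * 6 + ["spam"] * 2 + ["spam"] * 3 + ["ham"] * 1

# Rows are true classes, columns are predicted classes, in the order given.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["ham", "spam"]).ravel()

precision = tp / (tp + fp)  # of everything flagged as spam, how much is spam
recall = tp / (tp + fn)     # of all actual spam, how much was flagged

print(f"precision={precision:.2f}, recall={recall:.2f}")

# The same values via scikit-learn's helpers.
print(precision_score(y_true, y_hat, pos_label="spam"))
print(recall_score(y_true, y_hat, pos_label="spam"))
```

False positives (ham sent to the spam folder) lower precision, while false negatives (spam reaching the inbox) lower recall, so the two metrics capture different kinds of mistakes.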

Our second decision tree will use an alternative approach for representing text: embeddings. Embeddings are vector representations of words or sentences such that similar texts are associated with vectors close together in space, capturing semantic meaning and contextual relationships beyond mere word counts.

A simple way to generate embeddings for our text is to use pretrained models like GloVe. We can map each word in an email to its corresponding dense GloVe vector and then represent the entire email by averaging these word vectors. This results in a compact, dense numerical representation for each message.

The following code implements this process. It defines a text_to_embedding() function, applies it to the training and test sets, and then trains and evaluates a new decision tree.

import numpy as np

# Downloading GloVe embeddings
!wget -q http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip -d glove.6B

# Load embeddings into a dictionary
embeddings_index = {}
with open("glove.6B/glove.6B.50d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = coefs

def text_to_embedding(texts):
    vectors = []
    for text in texts:
        words = text.lower().split()
        word_vecs = [embeddings_index[w] for w in words if w in embeddings_index]
        if word_vecs:
            vectors.append(np.mean(word_vecs, axis=0))
        else:
            vectors.append(np.zeros(50))
    return np.array(vectors)

X_train_emb = text_to_embedding(X_train)
X_test_emb = text_to_embedding(X_test)

tree_emb = DecisionTreeClassifier(random_state=42)
tree_emb.fit(X_train_emb, y_train)
y_pred_emb = tree_emb.predict(X_test_emb)

print("MODEL 2. Decision Tree + Embeddings")
print(classification_report(y_test, y_pred_emb))

Results:

MODEL 2. Decision Tree + Embeddings
              precision    recall  f1-score   support

         ham       0.95      0.95      0.95       966
        spam       0.66      0.69      0.68       149

    accuracy                           0.91      1115
   macro avg       0.81      0.82      0.81      1115
weighted avg       0.91      0.91      0.91      1115

Unfortunately, this simple averaging approach can cause significant information loss, sometimes called representation loss: word order is discarded, and a few telling words are diluted by all the others in the message. Decision trees also tend to work better with sparse, high-signal features like those from TF-IDF, where individual word-level features can act as strong discriminators (e.g. classifying an email as spam based on the presence of words like “free” or “million”). Together, these two factors largely explain the drop in performance compared to the TF-IDF model.
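
One facet of this information loss is easy to demonstrate. The sketch below uses a tiny made-up embedding table (three-dimensional vectors standing in for real GloVe vectors) and a hypothetical helper, average_embedding(), to show that two messages with different meanings collapse to the same averaged vector.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for pretrained embeddings.
toy_embeddings = {
    "not": np.array([1.0, 0.0, 0.0]),
    "spam": np.array([0.0, 1.0, 0.0]),
    "just": np.array([0.0, 0.0, 1.0]),
    "ham": np.array([1.0, 1.0, 0.0]),
}

def average_embedding(text, table, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    vecs = [table[w] for w in text.lower().split() if w in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

a = average_embedding("not spam just ham", toy_embeddings)
b = average_embedding("not ham just spam", toy_embeddings)

# Same words in a different order give an identical representation.
print(np.allclose(a, b))  # True
```

Because averaging is order-independent, the tree receives exactly the same input for both messages and cannot tell them apart.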

Comparison with a Naive Bayes Text Classifier

Finally, let’s compare our results with another popular text classification model: the Naive Bayes classifier. While not tree-based, it works well with TF-IDF features. The process is very similar to our first model:

from sklearn.naive_bayes import MultinomialNB

nb_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

print("BASELINE. Naive Bayes + TF-IDF")
print(classification_report(y_test, y_pred_nb))

Results:

BASELINE. Naive Bayes + TF-IDF
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.70      0.83       149

    accuracy                           0.96      1115
   macro avg       0.98      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115

Comparing our first decision tree model (MODEL 1) with this Naive Bayes model, we see little difference in how they classify ham emails. For the spam class, the Naive Bayes model achieves perfect precision (1.00), meaning every email it identifies as spam is indeed spam. However, it performs worse on recall (0.70), missing about 30% of the actual spam messages in the test data. If recall is our most critical performance indicator, we would lean towards the first decision tree model combined with TF-IDF. We could then try to optimize it further, for instance, through hyperparameter tuning or by using more training data.
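
As one possible starting point for such tuning, the sketch below runs a small grid search over the TF-IDF plus decision tree pipeline, scoring candidates by spam recall. The corpus is a tiny invented one so the example runs on its own, and the parameter grid is an assumption chosen for illustration rather than a recommended configuration; in practice, X_train and y_train from earlier would take its place, along with more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny invented corpus so the sketch is self-contained.
texts = [
    "win a free prize now", "free cash claim now",
    "urgent free offer call now", "claim your free prize today",
    "see you at lunch", "are you home yet", "call me after work",
    "thanks for the lift", "running late sorry", "did you get my message",
    "lets meet tomorrow", "happy birthday mate",
]
labels = ["spam"] * 4 + ["ham"] * 8

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Illustrative grid: tree depth and optional class re-weighting.
param_grid = {
    "clf__max_depth": [3, None],
    "clf__class_weight": [None, "balanced"],
}

# Score each candidate by recall on the spam class.
spam_recall = make_scorer(recall_score, pos_label="spam")

search = GridSearchCV(
    pipe,
    param_grid,
    scoring=spam_recall,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Optimizing for recall alone can push precision down, so it is worth re-checking the full classification report on the held-out test set after tuning.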

Wrapping Up

This article demonstrated how to train decision tree models for text data, tackling spam email classification using common text representation approaches like TF-IDF and vector embeddings.
