Document Clustering with LLM Embeddings in Scikit-learn

By Josh | February 14, 2026 | AI, Analytics and Automation


In this article, you will learn how to cluster a collection of text documents using large language model embeddings and standard clustering algorithms in scikit-learn.

Topics we will cover include:

  • Why LLM-based embeddings are well suited for document clustering.
  • How to generate embeddings from raw text using a pre-trained sentence transformer.
  • How to apply and compare k-means and DBSCAN for clustering embedded documents.

Let’s get straight to the point.

Document Clustering with LLM Embeddings in Scikit-learn (image by Editor)

Introduction

Imagine that you suddenly obtain a large collection of unclassified documents and are tasked with grouping them by topic. There are traditional clustering methods for text, based on TF-IDF and Word2Vec, that can address this problem, but they suffer from important limitations:


  • TF-IDF only counts words in a text and relies on similarity based on word frequencies, ignoring the underlying meaning. A sentence like “the tree is big” has an identical representation whether it refers to a natural tree or a decision tree classifier used in machine learning.
  • Word2Vec captures relationships between individual words to form embeddings (numerical vector representations), but it does not explicitly model full context across longer text sequences.

Meanwhile, modern embeddings generated by large language models, such as sentence transformer models, are in most cases superior. They capture contextual semantics — for example, distinguishing natural trees from decision trees — and encode overall, document-level meaning. Moreover, these embeddings are produced by models pre-trained on millions of texts, meaning they already contain a substantial amount of general language knowledge.
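To make the contrast concrete, here is a minimal sketch (the example sentences are our own, and it uses the same all-MiniLM-L6-v2 model introduced later in this tutorial) that compares cosine similarity under TF-IDF with cosine similarity under sentence-transformer embeddings for two very different uses of the word "tree". Exact values will vary, but the embedding model typically separates the two senses far more clearly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

sentences = [
    "The tree in the garden is big and full of leaves.",          # a natural tree
    "The decision tree is big and overfits the training data.",   # a machine learning model
]

# TF-IDF: similarity is driven by shared surface words ("tree", "big", ...)
tfidf = TfidfVectorizer().fit_transform(sentences)
print("TF-IDF cosine similarity:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Sentence-transformer embeddings: similarity reflects meaning in context
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(sentences)
print("Embedding cosine similarity:", cosine_similarity([emb[0]], [emb[1]])[0, 0])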

This article follows up on a previous tutorial, where we learned how to convert raw text into large language model embeddings that can be used as features for downstream machine learning tasks. Here, we focus specifically on using embeddings from a collection of documents for clustering based on similarity, with the goal of identifying common topics among documents in the same cluster.

Step-by-Step Guide

Let’s walk through the full process using Python.

Depending on your development environment or notebook configuration, you may need to pip install some of the libraries imported below. Assuming they are already available, we start by importing the required modules and classes, including KMeans, scikit-learn’s implementation of the k-means clustering algorithm:
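If any of them are missing, a typical installation command (package names assumed from the imports below; adjust to your environment) would be:

pip install sentence-transformers scikit-learn pandas numpy matplotlib seaborn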

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Configurations for clearer visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

Next, we load the dataset. We will use a BBC News dataset containing articles labeled by topic, with a public version available from a Google-hosted dataset repository:

url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
df = pd.read_csv(url)

print(f"Dataset loaded: {len(df)} documents")
print(f"Categories: {df['category'].unique()}\n")
print(df['category'].value_counts())

Here, we only display information about the categories to get a sense of the ground-truth topics assigned to each document. The dataset contains 2,225 documents in the version used at the time of writing.

At this point, we are ready for the two main steps of the workflow: generating embeddings from raw text and clustering those embeddings.

Generating Embeddings with a Pre-Trained Model

Libraries such as sentence_transformers make it straightforward to use a pre-trained model for tasks like generating embeddings from text. The workflow consists of loading a suitable model — such as all-MiniLM-L6-v2, a lightweight model trained to produce 384-dimensional embeddings — and running inference over the dataset to convert each document into a numerical vector that captures its overall semantics.

We start by loading the model:

# Load embeddings model (downloaded automatically on first use)
print("Loading embeddings model...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# This model converts text into a 384-dimensional vector
print(f"Model loaded. Embedding dimension: {model.get_sentence_embedding_dimension()}")

Next, we generate embeddings for all documents:

# Convert all documents into embedding vectors
print("Generating embeddings (this may take a few minutes)...")

texts = df['text'].tolist()
embeddings = model.encode(
    texts,
    show_progress_bar=True,
    batch_size=32  # Batch processing for efficiency
)

print(f"Embeddings generated: matrix size is {embeddings.shape}")
print(f"   → Each document is now represented by {embeddings.shape[1]} numeric values")

Recall that an embedding is a high-dimensional numerical vector. Documents that are semantically similar are expected to have embeddings that are close to each other in this vector space.
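As a quick sanity check (an optional sketch, not part of the original workflow), you can compare a few documents directly with cosine similarity; documents on the same topic usually score higher against each other than against unrelated ones, though exact values will vary:

from sklearn.metrics.pairwise import cosine_similarity

# Compare the first document's embedding with the next five
sims = cosine_similarity(embeddings[:1], embeddings[1:6])[0]
for i, s in enumerate(sims, start=1):
    print(f"doc 0 ({df['category'].iloc[0]}) vs doc {i} ({df['category'].iloc[i]}): {s:.3f}")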

Clustering Document Embeddings with K-Means

Applying the k-means clustering algorithm with scikit-learn is straightforward. We pass in the embedding matrix and specify the number of clusters to find. While this number must be chosen in advance for k-means, we can leverage prior knowledge of the dataset’s ground-truth categories in this example. In other settings, techniques such as the elbow method can help guide this choice.
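If the number of topics were unknown, a minimal elbow-method sketch like the following (an illustrative addition that reuses the KMeans, matplotlib, and embeddings objects already set up above) could help narrow down k by plotting the k-means inertia for a range of candidate values and looking for where the curve flattens:

# Compute k-means inertia for a range of candidate cluster counts
inertias = []
k_values = list(range(2, 11))
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(embeddings)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method for choosing k")
plt.show()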

The following code applies k-means and evaluates the results using several metrics, including the Adjusted Rand Index (ARI). ARI is a permutation-invariant metric that compares the cluster assignments with the true category labels. Values closer to 1 indicate stronger agreement with the ground truth.
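To illustrate permutation invariance with a tiny hypothetical example: relabeling the clusters (swapping cluster IDs) leaves the ARI unchanged.

from sklearn.metrics import adjusted_rand_score

true_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster IDs
print(adjusted_rand_score(true_toy, pred_toy))  # 1.0: perfect agreement despite relabeling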

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(embeddings)

# Evaluation against ground-truth categories
le = LabelEncoder()
true_labels = le.fit_transform(df['category'])

print("   K-Means Results:")
print(f"   Silhouette Score: {silhouette_score(embeddings, kmeans_labels):.3f}")
print(f"   Adjusted Rand Index: {adjusted_rand_score(true_labels, kmeans_labels):.3f}")
print(f"   Distribution: {pd.Series(kmeans_labels).value_counts().sort_index().tolist()}")

Example output:

   K-Means Results:
   Silhouette Score: 0.066
   Adjusted Rand Index: 0.899
   Distribution: [376, 414, 517, 497, 421]
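Beyond aggregate scores, it can help to see which ground-truth topics dominate each cluster. The following optional snippet (not part of the original write-up) cross-tabulates categories against k-means cluster IDs; exact counts will vary between runs:

# Rows: true categories, columns: k-means cluster IDs
print(pd.crosstab(df['category'], pd.Series(kmeans_labels, name='cluster')))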

Clustering Document Embeddings with DBSCAN

As an alternative, we can apply DBSCAN, a density-based clustering algorithm that automatically infers the number of clusters based on point density. Instead of specifying the number of clusters, DBSCAN requires parameters such as eps (the neighborhood radius) and min_samples:

from sklearn.cluster import DBSCAN

# DBSCAN often works better with cosine distance for text embeddings
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='cosine')
dbscan_labels = dbscan.fit_predict(embeddings)

# Count clusters (-1 indicates noise points)
n_clusters_found = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

print("\nDBSCAN Results:")
print(f"   Clusters found: {n_clusters_found}")
print(f"   Noise documents: {n_noise}")
print(f"   Silhouette Score: {silhouette_score(embeddings[dbscan_labels != -1], dbscan_labels[dbscan_labels != -1]):.3f}")
print(f"   Adjusted Rand Index: {adjusted_rand_score(true_labels, dbscan_labels):.3f}")
print(f"   Distribution: {pd.Series(dbscan_labels).value_counts().sort_index().to_dict()}")

DBSCAN is highly sensitive to its hyperparameters, so achieving good results often requires careful tuning using systematic search strategies.
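One simple, illustrative way to tune DBSCAN is a small grid search over eps and min_samples, scoring each configuration by the silhouette of the non-noise points (a heuristic choice; the parameter grid below is only a starting point):

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

best = None
for eps in [0.2, 0.3, 0.4, 0.5]:
    for min_samples in [3, 5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine').fit_predict(embeddings)
        mask = labels != -1
        # A silhouette score needs at least two clusters among the non-noise points
        if mask.sum() > 1 and len(set(labels[mask])) > 1:
            score = silhouette_score(embeddings[mask], labels[mask])
            if best is None or score > best[0]:
                best = (score, eps, min_samples)

if best is not None:
    print(f"Best silhouette {best[0]:.3f} at eps={best[1]}, min_samples={best[2]}")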

Once reasonable parameters have been identified, it can be informative to visually compare the clustering results. The following code projects the embeddings into two dimensions using principal component analysis (PCA) and plots the true categories alongside the k-means and DBSCAN cluster assignments:


# Reduce embeddings to 2D for visualization
pca = PCA(n_components=2, random_state=42)
embeddings_2d = pca.fit_transform(embeddings)

# Create comparative visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: True categories
category_colors = {cat: i for i, cat in enumerate(df['category'].unique())}
color_map = df['category'].map(category_colors)

axes[0].scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=color_map,
    cmap='Set2',
    alpha=0.6,
    s=30
)
axes[0].set_title('True Categories', fontsize=12, fontweight='bold')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# Plot 2: K-Means
scatter2 = axes[1].scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=kmeans_labels,
    cmap='viridis',
    alpha=0.6,
    s=30
)
axes[1].set_title(f'K-Means (k={n_clusters})', fontsize=12, fontweight='bold')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
plt.colorbar(scatter2, ax=axes[1])

# Plot 3: DBSCAN
scatter3 = axes[2].scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=dbscan_labels,
    cmap='plasma',
    alpha=0.6,
    s=30
)
axes[2].set_title(f'DBSCAN ({n_clusters_found} clusters + noise)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('PC1')
axes[2].set_ylabel('PC2')
plt.colorbar(scatter3, ax=axes[2])

plt.tight_layout()
plt.show()

Clustering results: true categories vs. k-means vs. DBSCAN (2D PCA projection)

With the default DBSCAN settings, k-means typically performs much better on this dataset. There are two main reasons for this:

  • DBSCAN suffers from the curse of dimensionality, and 384-dimensional embeddings can be challenging for density-based methods (one common mitigation is sketched after this list).
  • K-means performs well when clusters are relatively well separated, which is the case for the BBC News dataset due to the clear topical structure of the documents.
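As a rough illustration of the first point (an optional sketch, not part of the original workflow), projecting the embeddings down to a few dozen dimensions with PCA before running DBSCAN can sometimes make the density structure easier to detect; whether it helps here depends on the data, and the eps value below is arbitrary and would need tuning:

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Project the 384-dimensional embeddings down to 50 dimensions first
embeddings_50d = PCA(n_components=50, random_state=42).fit_transform(embeddings)

labels_50d = DBSCAN(eps=0.2, min_samples=5, metric='cosine').fit_predict(embeddings_50d)
n_found = len(set(labels_50d)) - (1 if -1 in labels_50d else 0)
print(f"Clusters found on the 50-dimensional projection: {n_found}")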

Wrapping Up

In this article, we demonstrated how to cluster a collection of text documents using embedding representations generated by pre-trained large language models. After transforming raw text into numerical vectors, we applied traditional clustering techniques — k-means and DBSCAN — to group semantically similar documents and evaluate their performance against known topic labels.



