
Why and When to Use Sentence Embeddings Over Word Embeddings
Image by Editor | ChatGPT
Introduction
Choosing the right text representation is a critical first step in any natural language processing (NLP) project. While both word and sentence embeddings transform text into numerical vectors, they operate at different scopes and are suited for different tasks. The key distinction is whether your goal is semantic or syntactic analysis.
Sentence embeddings are the better choice when you need to understand the overall, compositional meaning of a piece of text. In contrast, word embeddings are superior for token-level tasks that require analyzing individual words and their linguistic features. Research shows that for tasks like semantic similarity, sentence embeddings can outperform aggregated word embeddings by a significant margin.
This article will explore the architectural differences, performance benchmarks, and specific use cases for both sentence and word embeddings to help you decide which is right for your next project.
Word Embeddings: Focusing on the Token Level
Word embeddings represent individual words as dense vectors in a high-dimensional space. In this space, the distance and direction between vectors correspond to the semantic relationships between the words themselves.
There are two main types of word embeddings:
- Static embeddings: Traditional models like Word2Vec and GloVe assign a single, fixed vector to each word, regardless of its context.
- Contextual embeddings: Modern models like BERT generate dynamic vectors for words based on the surrounding text in a sentence.
The primary limitation of word embeddings arises when you need to represent an entire sentence. Simple aggregation methods, such as averaging the vectors of all words in a sentence, can dilute the overall meaning. For example, averaging the vectors for a sentence like “The orchestra performance was excellent, but the wind section struggled somewhat at times” would likely result in a neutral representation, losing the distinct positive and negative sentiments.
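As a rough illustration of that dilution effect, here is a minimal sketch using static GloVe vectors through gensim. This assumes the gensim library is installed and the glove-wiki-gigaword-50 vectors can be downloaded; the sentences and the exact similarity values are illustrative only.

import numpy as np
import gensim.downloader as api

# Small static GloVe vectors (50 dimensions) loaded via gensim's downloader
wv = api.load("glove-wiki-gigaword-50")

# Distance in the vector space tracks word-level similarity
print(wv.most_similar("excellent", topn=3))

def mean_pool(sentence: str) -> np.ndarray:
    # Naive aggregation: average the static vectors of in-vocabulary words
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.mean(vecs, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

mixed = mean_pool("the orchestra performance was excellent but the wind section struggled")
praise = mean_pool("the concert was wonderful")
complaint = mean_pool("the concert was disappointing")

# The averaged vector of the mixed-sentiment sentence tends to sit between the
# clearly positive and clearly negative sentences, blurring both signals
print(f"mixed vs praise:    {cosine(mixed, praise):.3f}")
print(f"mixed vs complaint: {cosine(mixed, complaint):.3f}")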
Sentence Embeddings: Capturing Holistic Meaning
Sentence embeddings are designed to encode an entire sentence or text passage into a single, dense vector that captures its complete semantic meaning.
Transformer-based architectures, such as Sentence-BERT (SBERT), use specialized training techniques like siamese networks. This ensures that sentences with similar meanings are located close to each other in the vector space. Other powerful models include the Universal Sentence Encoder (USE), which creates 512-dimensional vectors optimized for semantic similarity. These models eliminate the need to write custom aggregation logic, simplifying the workflow for sentence-level tasks.
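For a sense of how that siamese-style training is wired up, here is a minimal sketch using the sentence-transformers fit API. The sentence pairs and similarity labels below are invented purely for illustration, and a real fine-tuning run would use far more data.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy sentence pairs with target similarity labels in [0, 1] (illustrative only)
train_examples = [
    InputExample(texts=["The concert was great", "The performance was excellent"], label=0.9),
    InputExample(texts=["The concert was great", "What is the capital of France?"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Both sentences in a pair are encoded by the same shared-weight network,
# and the loss nudges their cosine similarity toward the label
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)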
Embeddings Implementations
Let’s look at some implementations of embeddings, starting with contextual word embeddings. Make sure you have the torch and transformers libraries installed, which you can do with this command: pip install torch transformers. We will use the bert-base-uncased model.
import torch
from transformers import AutoTokenizer, AutoModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
bert_model_name = 'bert-base-uncased'
tok = AutoTokenizer.from_pretrained(bert_model_name)
bert = AutoModel.from_pretrained(bert_model_name).to(device).eval()

def get_bert_token_vectors(text: str):
    """
    Returns:
        tokens: list[str] without [CLS]/[SEP]
        vecs: torch.Tensor [T, hidden] contextual vectors
    """
    enc = tok(text, return_tensors='pt', add_special_tokens=True)
    with torch.no_grad():
        out = bert(**{k: v.to(device) for k, v in enc.items()})
    last_hidden = out.last_hidden_state.squeeze(0)
    ids = enc['input_ids'].squeeze(0)
    toks = tok.convert_ids_to_tokens(ids)
    keep = [i for i, t in enumerate(toks) if t not in ('[CLS]', '[SEP]')]
    toks = [toks[i] for i in keep]
    vecs = last_hidden[keep]
    return toks, vecs

# Example usage
toks, vecs = get_bert_token_vectors(
    "The orchestra performance was excellent, but the wind section struggled somewhat at times."
)
print("Word embeddings created.")
print(f"Tokens:\n{toks}")
print(f"Vectors:\n{vecs}")
If all goes well, here’s your output:
Word embeddings created.
Tokens:
['the', 'orchestra', 'performance', 'was', 'excellent', ',', 'but', 'the', 'wind', 'section', 'struggled', 'somewhat', 'at', 'times', '.']
Vectors:
tensor([[-0.6060, -0.5800, -1.4568,  ..., -0.0840,  0.6643,  0.0956],
        [-0.1886,  0.1606, -0.5778,  ..., -0.5084,  0.0512,  0.8313],
        [-0.2355, -0.2043, -0.6308,  ..., -0.0757, -0.0426, -0.2797],
        ...,
        [-1.3497, -0.3643, -0.0450,  ...,  0.2607, -0.2120,  0.5365],
        [-1.3596, -0.0966, -0.2539,  ...,  0.0997,  0.2397,  0.1411],
        [ 0.6540,  0.1123, -0.3358,  ...,  0.3188, -0.5841, -0.2140]])
Remember: Contextual models like BERT produce different vectors for the same word depending on surrounding text, which is superior for token-level tasks (NER/POS) that care mostly about local context.
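A quick way to see this context sensitivity is to reuse the get_bert_token_vectors helper defined above; the example sentences here are our own.

import torch.nn.functional as F

# The same surface word "bank" in two different contexts
toks_fin, vecs_fin = get_bert_token_vectors("She deposited the check at the bank.")
toks_riv, vecs_riv = get_bert_token_vectors("They had a picnic on the river bank.")

v_fin = vecs_fin[toks_fin.index("bank")]
v_riv = vecs_riv[toks_riv.index("bank")]

# With a contextual model the two vectors differ; a static embedding would
# assign an identical vector to both occurrences
print(f"cos(bank_financial, bank_river) = {F.cosine_similarity(v_fin, v_riv, dim=0).item():.3f}")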
Now let’s look at sentence embeddings, using the all-MiniLM-L6-v2 model. Make sure you install the sentence-transformers library with this command: pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer  #, util

device = 'cuda' if torch.cuda.is_available() else 'cpu'
sbert_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
sbert = SentenceTransformer(sbert_model_name)

def encode_sentences(sentences, normalize: bool = True):
    """
    Returns:
        embeddings: np.ndarray [N, 384] (MiniLM-L6-v2), optionally L2-normalized
    """
    return sbert.encode(sentences, normalize_embeddings=normalize)

# Example usage
sent_vecs = encode_sentences(
    [
        "The orchestra performance was excellent.",
        "The woodwinds were uneven at times.",
        "What is the capital of France?",
    ]
)
print("Sentence embeddings created.")
print(f"Vectors:\n{sent_vecs}")
And the output:
Sentence embeddings created.
Vectors:
[[-0.00495016  0.03691019 -0.01169722 ...  0.07122676 -0.03177164  0.01284262]
 [ 0.03054073  0.03126326  0.08442244 ... -0.00503035 -0.12718299  0.08703844]
 [ 0.08204817  0.03605553 -0.00389288 ...  0.0492044   0.08929186 -0.01112777]]
Remember: Models like all-MiniLM-L6-v2 (fast, 384-dim) or multi-qa-MiniLM-L6-cos-v1 work well for semantic search, clustering, and RAG. Sentence vectors are single fixed-size representations, making them optimal for fast comparison at scale.
We can put this all together and run some useful experiments.
import torch.nn.functional as F
from sentence_transformers import util

def cosine_matrix(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    A = F.normalize(A, dim=1)
    B = F.normalize(B, dim=1)
    return A @ B.T

# Sample texts (two related + one unrelated)
A = "The orchestra performance was excellent, but the wind section struggled somewhat at times."
B = "Overall the concert was great, though the woodwinds were uneven in places."
C = "What is the capital of France?"

# Token-level comparison
toks_a, vecs_a = get_bert_token_vectors(A)
toks_b, vecs_b = get_bert_token_vectors(B)
sim_mat = cosine_matrix(vecs_a, vecs_b)

# Summarize token alignment: mean over per-token max similarities
token_alignment_score = float(sim_mat.max(dim=1).values.mean())

# Show a few top token pairs
def top_token_pairs(toks_a, toks_b, sim_mat, k=8):
    skip = {",", ".", "!", "?", ":", ";", "(", ")", "-", "—"}
    pairs = []
    for i in range(sim_mat.size(0)):
        for j in range(sim_mat.size(1)):
            ta, tb = toks_a[i], toks_b[j]
            if ta in skip or tb in skip:
                continue
            if len(ta.strip("#")) < 2 or len(tb.strip("#")) < 2:
                continue
            pairs.append((float(sim_mat[i, j]), ta, tb, i, j))
    pairs.sort(reverse=True, key=lambda x: x[0])
    return pairs[:k]

print("\nToken-level (BERT):")
print(f"Tokens A ({len(toks_a)}): {toks_a}")
print(f"Tokens B ({len(toks_b)}): {toks_b}")
print(f"Pairwise sim matrix shape: {tuple(sim_mat.shape)}")
print("Top token↔token similarities:")
for s, ta, tb, i, j in top_token_pairs(toks_a, toks_b, sim_mat, k=8):
    print(f"  {ta:>12s} (A[{i:>2}]) ↔ {tb:<12s} (B[{j:>2}]): cos={s:.3f}")
print(f"Token-alignment summary score: {token_alignment_score:.3f}")

# Mean-pooled BERT sentence vectors (baseline, not a true sentence model)
mpA = F.normalize(vecs_a.mean(dim=0), dim=0)
mpB = F.normalize(vecs_b.mean(dim=0), dim=0)
mpC = F.normalize(get_bert_token_vectors(C)[1].mean(dim=0), dim=0)
print(f"Mean-pooled BERT sentence cosine A ↔ B: {float(torch.dot(mpA, mpB)):.3f}")
print(f"Mean-pooled BERT sentence cosine A ↔ C: {float(torch.dot(mpA, mpC)):.3f}")

# Sentence-level comparison
embs = encode_sentences([A, B, C], normalize=True)
cos_ab = float(util.cos_sim(embs[0], embs[1]))
cos_ac = float(util.cos_sim(embs[0], embs[2]))

print("\nSentence-level (SBERT):")
print(f"SBERT cosine A ↔ B: {cos_ab:.3f}")
print(f"SBERT cosine A ↔ C: {cos_ac:.3f}")

# Simple retrieval example
query = "Review of a concert where the winds were inconsistent"
q_emb = encode_sentences([query], normalize=True)
scores = util.cos_sim(q_emb, embs).squeeze(0).tolist()
best_idx = int(max(range(len(scores)), key=lambda i: scores[i]))
print("\nRetrieval demo:")
for i, s in enumerate(scores):
    label = ["A", "B", "C"][i]
    print(f"score={s:.3f} | {label} | {[A, B, C][i]}")
print(f"\nBest match: index {best_idx} → {['A', 'B', 'C'][best_idx]}")
Here’s a breakdown of what’s going on in the above code:
- Function cosine_matrix: L2-normalizes the rows of the token vectors for A and B and returns the full cosine similarity matrix via a dot product; the resulting shape is [len(A_tokens), len(B_tokens)]
- Function top_token_pairs: Filters out punctuation and very short subwords, collects (similarity, tokenA, tokenB, i, j) tuples across the matrix, sorts by similarity, and returns the top k for human-friendly inspection
- We create two semantically related sentences (A, B) and one unrelated sentence (C) to contrast behavior at both the token and sentence levels
- We compute all pairwise token similarities between A and B using get_bert_token_vectors
- Token alignment summary: For each token in A, finds its best match in B (row-wise max), then averages these maxima
- Mean-pooled BERT sentence baseline: Collapses the token vectors into a single vector by averaging, then compares with cosine; not a true sentence embedding, just a cheap baseline to contrast with SBERT
- Sentence-level comparison (SBERT): Computes SBERT cosine similarities; the related pair (A ↔ B) should score high and the unrelated pair (A ↔ C) low
- Simple retrieval example: Encodes a query and scores it against the [A, B, C] sentence embeddings, prints per-candidate scores and the best match index/string, and demonstrates practical retrieval using sentence embeddings
- The output shows the tokens, the similarity matrix shape, the top token ↔ token pairs, and the alignment score
- Finally, it demonstrates which words/subwords align (e.g. “excellent” ↔ “great”, “wind” ↔ “woodwinds”)
And here is our output:
Token-level (BERT):
Tokens A (15): ['the', 'orchestra', 'performance', 'was', 'excellent', ',', 'but', 'the', 'wind', 'section', 'struggled', 'somewhat', 'at', 'times', '.']
Tokens B (16): ['overall', 'the', 'concert', 'was', 'great', ',', 'though', 'the', 'wood', '##wind', '##s', 'were', 'uneven', 'in', 'places', '.']
Pairwise sim matrix shape: (15, 16)
Top token↔token similarities:
           but (A[ 6]) ↔ though       (B[ 6]): cos=0.838
           the (A[ 7]) ↔ the          (B[ 7]): cos=0.807
           was (A[ 3]) ↔ was          (B[ 3]): cos=0.801
     excellent (A[ 4]) ↔ great        (B[ 4]): cos=0.795
           the (A[ 0]) ↔ the          (B[ 7]): cos=0.742
           the (A[ 0]) ↔ the          (B[ 1]): cos=0.738
         times (A[13]) ↔ places       (B[14]): cos=0.728
           was (A[ 3]) ↔ were         (B[11]): cos=0.717
Token-alignment summary score: 0.746
Mean-pooled BERT sentence cosine A ↔ B: 0.876
Mean-pooled BERT sentence cosine A ↔ C: 0.482

Sentence-level (SBERT):
SBERT cosine A ↔ B: 0.661
SBERT cosine A ↔ C: -0.001

Retrieval demo:
score=0.635 | A | The orchestra performance was excellent, but the wind section struggled somewhat at times.
score=0.688 | B | Overall the concert was great, though the woodwinds were uneven in places.
score=-0.058 | C | What is the capital of France?

Best match: index 1 → B
The token-level view shows strong local alignments (e.g. excellent ↔ great, but ↔ though), yielding a solid overall alignment score of 0.746 across a 15×16 similarity grid. While mean-pooled BERT rates A ↔ B very high (0.876), it still gives a relatively high score to the unrelated A ↔ C (0.482), whereas SBERT cleanly separates them (A ↔ B = 0.661 vs. A ↔ C ≈ 0), reflecting better sentence-level semantics. In a retrieval setting, the query about inconsistent winds correctly selects sentence B as the best match, indicating SBERT’s practical advantage for sentence search.
Performance and Efficiency
Modern benchmarks consistently show the superiority of sentence embeddings for semantic tasks. On the Massive Text Embedding Benchmark (MTEB), which evaluates models across 131 tasks of 9 types in 20 domains, sentence embedding models like SBERT consistently outperform aggregated word embeddings in semantic textual similarity.
By using a dedicated sentence embedding model like SBERT, large-scale pairwise sentence comparison can be completed in a fraction of the time it would take to feed every sentence pair through a BERT-style cross-encoder, even an optimized one. This is because sentence embeddings produce a single fixed-size vector per sentence, making similarity computations extremely fast. Think about it intuitively: to compare n sentences against each other, SBERT needs only O(n) encoder forward passes followed by cheap vector comparisons, while a cross-encoder must run a full forward pass for each of the O(n²) sentence pairs.
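Here is a back-of-the-envelope sketch of those counts; it is plain arithmetic with no models involved, and the value of n is an arbitrary example.

# Rough count of transformer forward passes needed to compare n sentences pairwise
n = 10_000  # arbitrary example size

sbert_passes = n                          # bi-encoder: encode each sentence once
cross_encoder_passes = n * (n - 1) // 2   # cross-encoder: one pass per sentence pair

print(f"SBERT encoder passes:    {sbert_passes:,}")
print(f"Cross-encoder passes:    {cross_encoder_passes:,}")
# The remaining SBERT work is ~n^2/2 cosine similarities on small fixed-size
# vectors, which is orders of magnitude cheaper than a transformer forward pass.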
When to Use Sentence Embeddings
The best embedding strategy depends entirely on your specific application. As already stated, sentence embeddings excel in tasks that require understanding the holistic meaning of text.
- Semantic search and information retrieval: They power search systems that find results based on meaning, not just keywords. For instance, a query like “How do I fix a flat tire?” can successfully retrieve a document titled “Steps to repair a punctured bicycle wheel” (a minimal code sketch follows this list).
- Retrieval-augmented generation (RAG) systems: RAG systems rely on sentence embeddings to find and retrieve relevant document chunks from a vector database to provide context for a large language model, ensuring more accurate and grounded responses.
- Text classification and sentiment analysis: By capturing the compositional meaning of a sentence, these embeddings are effective for tasks like document-level sentiment analysis.
- Question answering systems: They can match a user’s question to the most semantically similar answer in a knowledge base, even if the wording is completely different.
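To make the semantic search bullet concrete, here is a minimal sketch reusing the sentence-transformers setup from earlier; the corpus documents are invented for illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "Steps to repair a punctured bicycle wheel",
    "A beginner's guide to baking sourdough bread",
    "A short history of the Tour de France",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode("How do I fix a flat tire?", convert_to_tensor=True, normalize_embeddings=True)

# Rank corpus entries by cosine similarity to the query and keep the top 2
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")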
When to Use Word Embeddings
Word embeddings remain the superior choice for tasks requiring fine-grained, token-level analysis.
- Named entity recognition (NER): Identifying specific entities like names, places, or organizations requires analysis at the individual word level; a short pipeline sketch follows this list.
- Part-of-speech (POS) tagging and syntactic analysis: Tasks that analyze the grammatical structure of a sentence, such as syntactic parsing or morphological analysis, rely on the token-level semantics provided by word embeddings.
- Cross-lingual applications: Multilingual word embeddings create a shared vector space where words with the same meaning in different languages are positioned closely, enabling tasks like zero-shot classification across languages.
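As a small illustration of such a token-level task, here is a minimal NER sketch using the Hugging Face transformers pipeline. It relies on the pipeline’s default English NER checkpoint, and the example sentence is invented.

from transformers import pipeline

# Token-classification pipeline with its default English NER model;
# aggregation_strategy="simple" merges subword pieces into whole entities
ner = pipeline("token-classification", aggregation_strategy="simple")

for ent in ner("Maria Schmidt conducted the Berlin Philharmonic in Vienna last night."):
    print(f"{ent['entity_group']:>4s}  {ent['word']:<22s}  score={ent['score']:.2f}")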
Wrapping Up
The decision to use sentence or word embeddings hinges on the fundamental goal of your NLP task. If you need to capture the holistic, compositional meaning of text for applications like semantic search, clustering, or RAG, sentence embeddings offer superior performance and efficiency. If your task requires a deep dive into the grammatical structure and relationships of individual words, as in NER or POS tagging, word embeddings provide the necessary granularity. By understanding this core distinction, you can select the right tool to build more effective and accurate NLP models.
| Feature | Word Embeddings | Sentence Embeddings |
|---|---|---|
| Scope | Individual words (tokens) | Entire sentences or text passages |
| Primary Use | Syntactic analysis, token-level tasks | Semantic analysis, understanding overall meaning |
| Best For | NER, POS Tagging, Cross-Lingual Mapping | Semantic Search, Classification, Clustering, RAG |
| Limitation | Difficult to aggregate for sentence meaning without information loss | Not suitable for tasks requiring analysis of individual word relationships |