Implementing Hybrid Semantic-Lexical Search in RAG

In this article, you will learn how to implement a hybrid search strategy for RAG systems by combining BM25 lexical search with semantic search, fused together using Reciprocal Rank Fusion.

Topics we will cover include:

Why hybrid search outperforms either lexical or semantic search alone in retrieval-augmented generation systems.
How to implement BM25 lexical search and dense vector semantic search as independent retrieval engines in Python.
How to merge both rankings using Reciprocal Rank Fusion (RRF) to produce a final, balanced retrieval result.

Let’s get straight to it.

Introduction

Implementing hybrid search strategies is a critical step in building modern RAG (Retrieval-Augmented Generation) systems, especially when shifting from prototype to production-ready solutions.

Semantic search, powered by dense vectors (embeddings, which are numerical representations of text), is very good at capturing meaning, synonyms, and context. However, lexical, keyword-based search with approaches like BM25 covers a blind spot that semantic search tends to neglect: exact matches on the specific terms that appear in the query. Combining the two is therefore a practical way to make your RAG system’s retrieval mechanism more robust.

Let’s explore how to implement such a hybrid search strategy through a gentle coding example, guiding you through every step of the process!

Note: If you are unfamiliar with RAG systems, you may find the “Understanding RAG” article series remarkably insightful for getting the most out of this read. In particular, I recommend acquiring an understanding of vector databases first through this article.

Step-by-Step Implementation

The first step is to ensure all the necessary external Python libraries are installed, in particular these three:

!pip install rank_bm25 sentence-transformers requests


rank_bm25: an implementation of the BM25 lexical search algorithm for information retrieval (BM stands for “Best Matching”).
sentence-transformers: provides pre-trained language models for generating text embeddings. In a real setting, you may already have your own vector database containing many document embeddings and not need this, but we will use it here to simulate the construction of a toy vector database and illustrate hybrid search on it.
requests: used to fetch the raw dataset package from a public GitHub datasets repository prepared for this example.

With these ingredients at hand, we start by loading the dataset and storing the raw texts in a list (we do so because it is a small dataset).

import requests
import zipfile
import io
import os

# Downloading and extracting the dataset from the compressed file
url = "https://github.com/gakudo-ai/open-datasets/raw/refs/heads/main/asia_documents.zip"
response = requests.get(url)
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("asia_data")

# Loading documents and getting their filenames
documents = []
doc_names = []
for file in os.listdir("asia_data"):
    if file.endswith(".txt"):
        with open(f"asia_data/{file}", "r", encoding="utf-8") as f:
            documents.append(f.read())
        doc_names.append(file)

print(f"Loaded {len(documents)} documents for the knowledge base.")
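One detail worth knowing about the loading code: os.listdir() returns entries in arbitrary order, so the index assigned to each document may differ between machines or runs. If stable indices matter to you, a minimal sketch of an alternative loader could look like the following. The function name load_text_documents is a hypothetical helper introduced here for illustration, not part of the original code.

```python
from pathlib import Path


def load_text_documents(folder):
    """Read every .txt file in `folder`, sorted by filename for stable indices."""
    names, texts = [], []
    # Sorting the paths makes the document order reproducible across runs
    for path in sorted(Path(folder).glob("*.txt")):
        names.append(path.name)
        texts.append(path.read_text(encoding="utf-8"))
    return names, texts
```

Under these assumptions, calling doc_names, documents = load_text_documents("asia_data") would fill the same two lists used in the rest of the walkthrough.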

The hybrid search process is divided into three stages: two of them take place in parallel, or independently from each other. The third is where the fusion of both approaches happens, using a merging method called Reciprocal Rank Fusion (RRF).

Let’s cover lexical search with BM25 first:

from rank_bm25 import BM25Okapi

# BM25 requires that each text is tokenized as a (sub)list of words
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

def search_bm25(query, top_k=3):
    tokenized_query = query.lower().split()

    # Getting scores (lexical relevance to the query) for all documents
    scores = bm25.get_scores(tokenized_query)

    # Ranking documents by score
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked_indices[:top_k], scores
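A caveat about the tokenization used here: lowercasing and splitting on whitespace leaves punctuation attached to words, so a query token such as "paddies?" will not match "paddies" in a document. The snippet below is a minimal sketch of one possible alternative based on a regular expression. The name simple_tokenize is hypothetical, and this is only one of many reasonable tokenization choices.

```python
import re

_WORD_RE = re.compile(r"\w+")


def simple_tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, and underscores."""
    return _WORD_RE.findall(text.lower())


query = "Which nation is best known for rice fields and paddies?"
print(query.lower().split()[-1])   # whitespace split keeps the question mark
print(simple_tokenize(query)[-1])  # regex tokenizer drops it
```

If you try this, apply the same tokenizer to both the corpus and the query, since BM25 only counts tokens that match exactly.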

The lexical search process has been encapsulated in a function called search_bm25(). This function takes two input arguments: a string containing the user’s query to the RAG system, and the number of top results to retrieve. The rank_bm25 library provides a get_scores() method that computes a lexical relevance score for each document, treated as a collection of tokens. We then rank documents by decreasing score and return the indices of the top-k documents together with the full list of scores.
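To get a feel for what a BM25 relevance score is made of, here is a minimal sketch of a textbook BM25 formula in plain Python. It is not the rank_bm25 implementation: the function name bm25_scores, the toy corpus, and the parameter values k1=1.5 and b=0.75 are assumptions made for illustration, and the library handles inverse document frequency somewhat differently, so exact numbers will not match.

```python
import math
from collections import Counter


def bm25_scores(tokenized_corpus, tokenized_query, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in tokenized_query:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "rice paddies cover the river delta".split(),
    "ancient temples and busy night markets".split(),
    "terraced rice fields climb the hillsides".split(),
]
print(bm25_scores(toy_corpus, "rice paddies".split()))
```

In this toy corpus the first document contains both query terms, the third contains only one, and the second contains none, so their scores come out in that order, with the second document scoring zero.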

Meanwhile, the semantic search engine first uses a sentence transformer model to obtain embedding vectors for the texts and the user query, then applies a vector similarity metric like cosine similarity to rank texts by semantic relevance and retrieve the k most relevant ones:

from sentence_transformers import SentenceTransformer, util
import torch

# Loading the pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Pre-compute embeddings for our corpus (our "Vector DB")
# You do not need this step if you already have an external vector database:
# you may read and import your document vectors instead
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def search_semantic(query, top_k=3):
    # Embedding the user's query into a vector
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Calculating cosine similarity between the query and all documents
    cosine_scores = util.cos_sim(query_embedding, doc_embeddings)[0]

    # Ranking documents by similarity
    ranked_indices = torch.argsort(cosine_scores, descending=True).tolist()
    return ranked_indices[:top_k], cosine_scores.tolist()
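If cosine similarity is new to you, the following minimal sketch computes it by hand on toy three-dimensional vectors. The vectors and document names are made up for illustration. Real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, different length
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank toy documents from most to least similar to the query
ranking = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranking)
```

Note that doc_a ranks first even though it is twice as long as the query vector: cosine similarity depends only on direction, not on magnitude.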

Time to put it all together. The two scores calculated for each document cannot simply be added, because they operate on very different numeric scales: BM25 scores have no fixed upper bound, while cosine similarities fall between -1 and 1. Instead, we perform the fusion based on ranks rather than raw similarity or relevance scores. For this, RRF is a widely used standard for fusing ranking information: each document receives the sum of the reciprocals of its (offset) rank in every list, so documents that appear in high positions across both lists are rewarded. The underlying logic is somewhat similar to that of the harmonic mean operator in statistics.
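Written out, the RRF score of a document d is computed as follows, where R is the set of rankings being fused (here, BM25 and semantic), rank_r(d) is the 1-based position of d in ranking r, and k is a smoothing constant. The three worked cases are hypothetical rank combinations, using k = 60 as in the implementation in this article.

```latex
\begin{aligned}
\mathrm{RRF}(d) &= \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60 \\
\text{1st and 1st:} &\quad \tfrac{1}{61} + \tfrac{1}{61} \approx 0.0328 \\
\text{3rd and 3rd:} &\quad \tfrac{1}{63} + \tfrac{1}{63} \approx 0.0317 \\
\text{1st and 9th:} &\quad \tfrac{1}{61} + \tfrac{1}{69} \approx 0.0309
\end{aligned}
```

A document ranked third by both engines ends up ahead of one ranked first by one engine and ninth by the other, which is the sense in which RRF rewards agreement between the two lists.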

The overarching hybrid search process is implemented as follows:

def hybrid_search(query, top_k=3):
    # 1. Obtaining the two standalone search rankings
    bm25_ranks, _ = search_bm25(query, top_k=len(documents))
    semantic_ranks, _ = search_semantic(query, top_k=len(documents))

    # 2. Applying RRF formula: RRF_score = 1 / (k + rank)
    rrf_scores = {i: 0.0 for i in range(len(documents))}
    k_constant = 60  # The value of 60 is a standard academic convention

    # Adding RRF scores from BM25
    for rank, doc_idx in enumerate(bm25_ranks):
        rrf_scores[doc_idx] += 1.0 / (k_constant + rank + 1)

    # Adding RRF scores from semantic search
    for rank, doc_idx in enumerate(semantic_ranks):
        rrf_scores[doc_idx] += 1.0 / (k_constant + rank + 1)

    # 3. Sorting documents by their final fused RRF score
    final_ranked_indices = sorted(rrf_scores.keys(), key=lambda idx: rrf_scores[idx], reverse=True)
    return final_ranked_indices[:top_k], rrf_scores
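The same idea can be written as a standalone helper that fuses any number of ranked lists, including truncated lists in which some documents are missing. This is a minimal sketch rather than a replacement for the function in this article. The name reciprocal_rank_fusion and the toy document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Positions are 1-based, so the top item contributes 1 / (k + 1)
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    ordered = sorted(fused, key=fused.get, reverse=True)
    return ordered, dict(fused)


# Toy top-3 lists: doc_b and doc_d each appear in only one of them
lexical = ["doc_b", "doc_a", "doc_c"]
semantic = ["doc_c", "doc_a", "doc_d"]

order, scores = reciprocal_rank_fusion([lexical, semantic])
print(order)
```

Documents that appear in both lists (doc_c and doc_a) rise above those found by only one engine, even though doc_b was the top lexical hit. A document absent from a list simply receives no contribution from it.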

Now it’s time to try it all out. Let’s formulate a user query and see what results we get.

query = "Which nation is best known for rice fields and paddies?"
print(f"--- Query: '{query}' ---")

# Testing Semantic (good at understanding aspects like "nation-wise nuances" and conceptual titles)
print("\nTop Semantic Results:")
sem_indices, _ = search_semantic(query)
for idx in sem_indices:
    print(f"- {doc_names[idx]}")

# Testing BM25 (good at finding exact keyword-based matches like "rice", "field", "paddy")
print("\nTop BM25 Results:")
bm25_indices, _ = search_bm25(query)
for idx in bm25_indices:
    print(f"- {doc_names[idx]}")

# Testing Hybrid (balances both)
print("\nTop Hybrid (RRF) Results:")
hybrid_indices, _ = hybrid_search(query)
for idx in hybrid_indices:
    print(f"- {doc_names[idx]}")

The results are not excellent compared to a production RAG system, but bear in mind we tested this on a tiny, nine-document dataset. With that context, the outcome is quite reasonable.

--- Query: 'Which nation is best known for rice fields and paddies?' ---

Top Semantic Results:
- Vietnam.txt
- South_Korea.txt
- Thailand.txt

Top BM25 Results:
- Indonesia.txt
- Japan.txt
- Philippines.txt

Top Hybrid (RRF) Results:
- Vietnam.txt
- Thailand.txt
- Indonesia.txt

Try modifying the query and replacing it with others related to temples, beaches, mountains, or anything else that comes to mind when thinking about Asian destinations. Can you find a scenario in which the semantic results and the BM25 results are highly consistent with each other?
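To make "consistent" measurable while you experiment, one option is to compute the overlap between the two top-k lists. The sketch below uses Jaccard overlap, which is our choice for illustration rather than something prescribed by RRF. The function name top_k_overlap is hypothetical, and the filenames come from the output shown earlier.

```python
def top_k_overlap(ranking_a, ranking_b, k=3):
    """Jaccard overlap between the top-k items of two rankings (0.0 to 1.0)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


semantic_top = ["Vietnam.txt", "South_Korea.txt", "Thailand.txt"]
bm25_top = ["Indonesia.txt", "Japan.txt", "Philippines.txt"]
hybrid_top = ["Vietnam.txt", "Thailand.txt", "Indonesia.txt"]

print(top_k_overlap(semantic_top, bm25_top))    # the two engines share no documents
print(top_k_overlap(semantic_top, hybrid_top))  # two shared out of four distinct
```

For the rice query, the two engines do not agree on a single document in their top three, so a query where this value approaches 1.0 would be the kind of scenario the exercise asks for.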

Wrapping Up

This article guided you through implementing a hybrid search mechanism for the retrieval stage of RAG systems. Choosing not to rely solely on semantic search is an important consideration when scaling RAG solutions to production environments.
