Saturday, September 6, 2025
mGrowTech
How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis

by Josh
September 5, 2025
in AI, Analytics and Automation


In this tutorial, we build a complete end-to-end Natural Language Processing (NLP) pipeline with Gensim and supporting libraries, designed to run in Google Colab. It combines several core NLP techniques: preprocessing, topic modeling with Latent Dirichlet Allocation (LDA), word embeddings with Word2Vec, TF-IDF-based similarity analysis, and semantic search. Beyond training and evaluating these models, the pipeline includes visualizations, further topic analysis, and a document classification workflow. Together, these statistical and machine learning methods give a compact framework for understanding and experimenting with text data.

!pip install --upgrade scipy==1.11.4
!pip install gensim==4.3.2 nltk wordcloud matplotlib seaborn pandas numpy scikit-learn
!pip install --upgrade setuptools


print("Please restart runtime after installation!")
print("Go to Runtime > Restart runtime, then run the next cell")


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')


from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, TfidfModel, CoherenceModel
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces,
    strip_numeric, remove_stopwords, strip_short,
)


import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

We install pinned versions of SciPy and Gensim, along with NLTK and the visualization libraries, so that the packages are compatible with one another. After restarting the runtime, we import the modules needed for preprocessing, modeling, and analysis. We also download the NLTK resources for tokenization and stopword handling, which completes the environment setup for our NLP pipeline.
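Since the setup depends on pinned versions, it can help to confirm what is actually installed once the runtime has restarted. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Pins taken from the install cell above.
expected = {"gensim": "4.3.2", "scipy": "1.11.4"}
for name, found in installed_versions(expected).items():
    status = "ok" if found == expected[name] else "MISMATCH"
    print(f"{name}: expected {expected[name]}, found {found} [{status}]")
```

A mismatch here usually means the runtime was not restarted after installation, so the previously loaded versions are still in use.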

class AdvancedGensimPipeline:
   def __init__(self):
       self.dictionary = None
       self.corpus = None
       self.lda_model = None
       self.word2vec_model = None
       self.tfidf_model = None
       self.similarity_index = None
       self.processed_docs = None
      
   def create_sample_corpus(self):
       """Create a diverse sample corpus for demonstration"""
       documents = [
           "Data science combines statistics, programming, and domain expertise to extract insights",
           "Big data analytics helps organizations make data-driven decisions at scale",
           "Cloud computing provides scalable infrastructure for modern applications and services",
           "Cybersecurity protects digital systems from threats and unauthorized access attempts",
           "Software engineering practices ensure reliable and maintainable code development",
           "Database management systems store and organize large amounts of structured information",
           "Python programming language is widely used for data analysis and machine learning",
           "Statistical modeling helps identify patterns and relationships in complex datasets",
           "Cross-validation techniques ensure robust model performance evaluation and selection",
           "Recommendation systems suggest relevant items based on user preferences and behavior",
           "Text mining extracts valuable insights from unstructured textual data sources",
           "Image classification assigns predefined categories to visual content automatically",
           "Reinforcement learning trains agents through interaction with dynamic environments"
       ]
       return documents
  
   def preprocess_documents(self, documents):
       """Advanced document preprocessing using Gensim filters"""
       print("Preprocessing documents...")
      
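       # Lowercase first: Gensim's stopword list is lowercase, so capitalized
       # stopwords such as "The" would otherwise slip through remove_stopwords.
       CUSTOM_FILTERS = [
           lambda x: x.lower(), strip_tags, strip_punctuation,
           strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
       ]
      
       # Build the NLTK stopword set once instead of once per document.
       stop_words = set(stopwords.words('english'))
      
       processed_docs = []
       for doc in documents:
           processed = preprocess_string(doc, CUSTOM_FILTERS)
           processed = [word for word in processed if word not in stop_words and len(word) > 2]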
          
           processed_docs.append(processed)
      
       self.processed_docs = processed_docs
       print(f"Processed {len(processed_docs)} documents")
       return processed_docs
  
   def create_dictionary_and_corpus(self):
       """Create Gensim dictionary and corpus"""
       print("Creating dictionary and corpus...")
      
       self.dictionary = corpora.Dictionary(self.processed_docs)
      
       self.dictionary.filter_extremes(no_below=2, no_above=0.8)
      
       self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]
      
       print(f"Dictionary size: {len(self.dictionary)}")
       print(f"Corpus size: {len(self.corpus)}")
      
   def train_word2vec_model(self):
       """Train Word2Vec model for word embeddings"""
       print("Training Word2Vec model...")
      
       self.word2vec_model = Word2Vec(
           sentences=self.processed_docs,
           vector_size=100,
           window=5,
           min_count=2,
           workers=4,
           epochs=50
       )
      
       print("Word2Vec model trained successfully")
      
   def analyze_word_similarities(self):
       """Analyze word similarities using Word2Vec"""
       print("\n=== Word2Vec Similarity Analysis ===")
      
       test_words = ['machine', 'data', 'learning', 'computer']
      
       for word in test_words:
           if word in self.word2vec_model.wv:
               similar_words = self.word2vec_model.wv.most_similar(word, topn=3)
               print(f"Words similar to '{word}': {similar_words}")
      
       # most_similar raises KeyError for out-of-vocabulary words, so check first.
       analogy_words = ['machine', 'computer', 'data']
       if all(w in self.word2vec_model.wv for w in analogy_words):
           analogy = self.word2vec_model.wv.most_similar(
               positive=['computer', 'data'],
               negative=['machine'],
               topn=1
           )
           print(f"Analogy result: {analogy}")
       else:
           print("Not enough vocabulary for complex analogies")
  
   def train_lda_model(self, num_topics=5):
       """Train LDA topic model"""
       print(f"Training LDA model with {num_topics} topics...")
      
       self.lda_model = LdaModel(
           corpus=self.corpus,
           id2word=self.dictionary,
           num_topics=num_topics,
           random_state=42,
           passes=10,
           alpha="auto",
           per_word_topics=True,
           eval_every=None
       )
      
       print("LDA model trained successfully")


   def evaluate_topic_coherence(self):
       """Evaluate topic model coherence"""
       print("Evaluating topic coherence...")
      
       coherence_model = CoherenceModel(
           model=self.lda_model,
           texts=self.processed_docs,
           dictionary=self.dictionary,
           coherence="c_v"
       )
      
       coherence_score = coherence_model.get_coherence()
       print(f"Topic Coherence Score: {coherence_score:.4f}")
       return coherence_score
  
   def display_topics(self):
       """Display discovered topics"""
       print("\n=== Discovered Topics ===")
      
       topics = self.lda_model.print_topics(num_words=8)
       for idx, topic in enumerate(topics):
           print(f"Topic {idx}: {topic[1]}")
  
   def create_tfidf_model(self):
       """Create TF-IDF model for document similarity"""
       print("Creating TF-IDF model...")
      
       self.tfidf_model = TfidfModel(self.corpus)
       corpus_tfidf = self.tfidf_model[self.corpus]
      
       self.similarity_index = similarities.MatrixSimilarity(corpus_tfidf)
      
       print("TF-IDF model and similarity index created")
  
   def find_similar_documents(self, query_doc_idx=0):
       """Find documents similar to a query document"""
       print("\n=== Document Similarity Analysis ===")
      
       query_doc_tfidf = self.tfidf_model[self.corpus[query_doc_idx]]
      
       similarities_scores = self.similarity_index[query_doc_tfidf]
      
       sorted_similarities = sorted(enumerate(similarities_scores), key=lambda x: x[1], reverse=True)
      
       print(f"Documents most similar to document {query_doc_idx}:")
       for doc_idx, similarity in sorted_similarities[:5]:
           print(f"Doc {doc_idx}: {similarity:.4f}")
  
   def visualize_topics(self):
       """Create visualizations for topic analysis"""
       print("Creating topic visualizations...")
      
       doc_topic_matrix = []
       for doc_bow in self.corpus:
           doc_topics = dict(self.lda_model.get_document_topics(doc_bow, minimum_probability=0))
           topic_vec = [doc_topics.get(i, 0) for i in range(self.lda_model.num_topics)]
           doc_topic_matrix.append(topic_vec)
      
       doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f'Topic_{i}' for i in range(self.lda_model.num_topics)])
      
       plt.figure(figsize=(12, 8))
       sns.heatmap(doc_topic_df.T, annot=True, cmap='Blues', fmt=".2f")
       plt.title('Document-Topic Distribution Heatmap')
       plt.xlabel('Documents')
       plt.ylabel('Topics')
       plt.tight_layout()
       plt.show()
      
       fig, axes = plt.subplots(2, 3, figsize=(15, 10))
       axes = axes.flatten()
      
       for topic_id in range(min(6, self.lda_model.num_topics)):
           topic_words = dict(self.lda_model.show_topic(topic_id, topn=20))
          
           wordcloud = WordCloud(
               width=300, height=200,
               background_color="white",
               colormap='viridis'
           ).generate_from_frequencies(topic_words)
          
           axes[topic_id].imshow(wordcloud, interpolation='bilinear')
           axes[topic_id].set_title(f'Topic {topic_id}')
           axes[topic_id].axis('off')
      
       for i in range(self.lda_model.num_topics, 6):
           axes[i].axis('off')
          
       plt.tight_layout()
       plt.show()
  
   def advanced_topic_analysis(self):
       """Perform advanced topic analysis"""
       print("\n=== Advanced Topic Analysis ===")
      
       topic_distributions = []
       for i, doc_bow in enumerate(self.corpus):
           doc_topics = self.lda_model.get_document_topics(doc_bow)
           dominant_topic = max(doc_topics, key=lambda x: x[1]) if doc_topics else (0, 0)
           topic_distributions.append({
               'doc_id': i,
               'dominant_topic': dominant_topic[0],
               'topic_probability': dominant_topic[1]
           })
      
       topic_df = pd.DataFrame(topic_distributions)
      
       plt.figure(figsize=(10, 6))
       topic_counts = topic_df['dominant_topic'].value_counts().sort_index()
       plt.bar(range(len(topic_counts)), topic_counts.values)
       plt.xlabel('Topic ID')
       plt.ylabel('Number of Documents')
       plt.title('Distribution of Dominant Topics Across Documents')
       plt.xticks(range(len(topic_counts)), [f'Topic {i}' for i in topic_counts.index])
       plt.show()
      
       return topic_df
  
   def document_classification_demo(self, new_document):
       """Classify a new document using trained models"""
       print("\n=== Document Classification Demo ===")
       print(f"Classifying: '{new_document[:50]}...'")
      
       # Lowercase first so Gensim's lowercase stopword list also catches capitalized words.
       processed_new = preprocess_string(new_document, [
           lambda x: x.lower(), strip_tags, strip_punctuation,
           strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
       ])
      
       new_doc_bow = self.dictionary.doc2bow(processed_new)
      
       doc_topics = self.lda_model.get_document_topics(new_doc_bow)
      
       print("Topic probabilities:")
       for topic_id, prob in doc_topics:
           print(f"  Topic {topic_id}: {prob:.4f}")
      
       new_doc_tfidf = self.tfidf_model[new_doc_bow]
       similarities_scores = self.similarity_index[new_doc_tfidf]
       most_similar = np.argmax(similarities_scores)
      
       print(f"Most similar document: {most_similar} (similarity: {similarities_scores[most_similar]:.4f})")
      
       return doc_topics, most_similar
  
   def run_complete_pipeline(self):
       """Execute the complete NLP pipeline"""
       print("=== Advanced Gensim NLP Pipeline ===\n")
      
       raw_documents = self.create_sample_corpus()
       self.preprocess_documents(raw_documents)
      
       self.create_dictionary_and_corpus()
      
       self.train_word2vec_model()
       self.train_lda_model(num_topics=5)
       self.create_tfidf_model()
      
       self.analyze_word_similarities()
       coherence_score = self.evaluate_topic_coherence()
       self.display_topics()
      
       self.visualize_topics()
       topic_df = self.advanced_topic_analysis()
      
       self.find_similar_documents(query_doc_idx=0)
      
       new_doc = "Deep neural networks are powerful machine learning models for pattern recognition"
       self.document_classification_demo(new_doc)
      
       return {
           'coherence_score': coherence_score,
           'topic_distributions': topic_df,
           'models': {
               'lda': self.lda_model,
               'word2vec': self.word2vec_model,
               'tfidf': self.tfidf_model
           }
       }

We define the AdvancedGensimPipeline class as a modular framework that handles every stage of text analysis in one place. It creates a sample corpus, preprocesses it, and builds the dictionary and bag-of-words corpus. We then train Word2Vec for embeddings, LDA for topic modeling, and TF-IDF for similarity, followed by coherence evaluation, visualization, and classification of new documents. Because filter_extremes is called with no_below=2, only terms that appear in at least two documents are kept, so on a small sample corpus the vocabulary is small and some documents may map to an empty bag-of-words. The class brings the whole workflow, from raw text to insights, into a single reusable pipeline.
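To make the dictionary and corpus step concrete, here is a minimal pure-Python sketch of what a bag-of-words encoding with document-frequency filtering does conceptually. It is not Gensim's implementation: the helpers `build_vocab` and `to_bow` are hypothetical, the toy documents are made up for illustration, and the thresholds only approximate the `no_below` / `no_above` idea behind `filter_extremes`.

```python
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.8):
    """Map each kept token to an integer id.

    A token is kept if it occurs in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Toy tokenized documents (illustrative only).
docs = [
    ["data", "science", "insights"],
    ["data", "analytics", "data"],
    ["cloud", "computing"],
]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)  # {'data': 0}
print(bows)   # [[(0, 1)], [(0, 2)], []]
```

Only "data" survives the filter because it is the only token found in two documents, and the third document ends up with an empty bag-of-words. The same effect can occur in the pipeline when the corpus is small, which is worth keeping in mind when reading the topic and similarity results.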

def compare_topic_models(pipeline, topic_range=(3, 5, 7, 10)):
   print("\n=== Topic Model Comparison ===")
  
   coherence_scores = []
   perplexity_scores = []
  
   for num_topics in topic_range:
       lda_temp = LdaModel(
           corpus=pipeline.corpus,
           id2word=pipeline.dictionary,
           num_topics=num_topics,
           random_state=42,
           passes=10,
           alpha="auto"
       )
      
       coherence_model = CoherenceModel(
           model=lda_temp,
           texts=pipeline.processed_docs,
           dictionary=pipeline.dictionary,
           coherence="c_v"
       )
       coherence = coherence_model.get_coherence()
       coherence_scores.append(coherence)
      
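       # log_perplexity returns the per-word likelihood bound (a log-scale value,
       # usually negative), not perplexity itself; perplexity is 2 ** (-bound).
       perplexity = lda_temp.log_perplexity(pipeline.corpus)
       perplexity_scores.append(perplexity)
      
       print(f"Topics: {num_topics}, Coherence: {coherence:.4f}, Log perplexity bound: {perplexity:.4f}")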
  
   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
  
   ax1.plot(topic_range, coherence_scores, 'bo-')
   ax1.set_xlabel('Number of Topics')
   ax1.set_ylabel('Coherence Score')
   ax1.set_title('Model Coherence vs Number of Topics')
   ax1.grid(True)
  
   ax2.plot(topic_range, perplexity_scores, 'ro-')
   ax2.set_xlabel('Number of Topics')
   ax2.set_ylabel('Per-word Likelihood Bound (log perplexity)')
   ax2.set_title('Log Perplexity Bound vs Number of Topics')
   ax2.grid(True)
  
   plt.tight_layout()
   plt.show()
  
   return coherence_scores, perplexity_scores

The compare_topic_models function trains LDA models with different numbers of topics and compares them. For each topic count in the given range, we compute the c_v coherence score (a proxy for topic interpretability) and Gensim's log_perplexity value (a measure of model fit). Note that log_perplexity returns a per-word likelihood bound rather than perplexity itself; perplexity is 2 raised to the negative of that bound, so a higher bound indicates a better fit. Both series are shown as line plots, which helps us choose a balanced number of topics for our dataset.
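As a small illustration of how the two returned lists might be used afterwards, the sketch below converts a per-word bound into a perplexity value and picks the topic count with the highest coherence. The helper names `bound_to_perplexity` and `best_num_topics` are hypothetical, and the numbers are made-up placeholders rather than results from this pipeline.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) to perplexity."""
    return 2.0 ** (-per_word_bound)


def best_num_topics(topic_range, coherence_scores):
    """Return the topic count whose model has the highest coherence score."""
    best_idx = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return topic_range[best_idx]


# Hypothetical values, for illustration only (not results from this pipeline).
topic_range = [3, 5, 7, 10]
coherence = [0.41, 0.47, 0.44, 0.39]
bounds = [-3.0, -3.5, -4.0, -5.0]

print(best_num_topics(topic_range, coherence))
print([round(bound_to_perplexity(b), 2) for b in bounds])
```

Choosing purely by maximum coherence is only one possible rule; on a small corpus the scores can be noisy, so inspecting the topics themselves remains worthwhile.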

def semantic_search_engine(pipeline, query, top_k=5):
   """Implement semantic search using trained models"""
   print(f"\n=== Semantic Search: '{query}' ===")
  
   # Lowercase first so Gensim's lowercase stopword list also catches capitalized words.
   processed_query = preprocess_string(query, [
       lambda x: x.lower(), strip_tags, strip_punctuation,
       strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
   ])
  
   query_bow = pipeline.dictionary.doc2bow(processed_query)
   query_tfidf = pipeline.tfidf_model[query_bow]
  
   similarities_scores = pipeline.similarity_index[query_tfidf]
  
   top_indices = np.argsort(similarities_scores)[::-1][:top_k]
  
   print("Top matching documents:")
   for i, idx in enumerate(top_indices):
       score = similarities_scores[idx]
       print(f"{i+1}. Document {idx} (Score: {score:.4f})")
       print(f"   Content: {' '.join(pipeline.processed_docs[idx][:10])}...")
  
   return top_indices, similarities_scores[top_indices]

The semantic_search_engine function adds a search layer to the pipeline: it preprocesses the query, converts it to a bag-of-words vector and then to a TF-IDF representation, and compares it against all documents using the similarity index. It returns the indices of the top matches along with their cosine similarity scores. Because matching is based on TF-IDF term overlap, query words that are missing from the dictionary contribute nothing to the score. This makes the pipeline usable for practical information retrieval over the indexed documents.
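The ranking idea behind this search can be sketched without Gensim. The example below is a simplified pure-Python version that weights terms by count times log2(N / document frequency), normalizes each vector to unit length, and ranks documents by cosine similarity. The function names and toy documents are hypothetical, and the weighting is an assumption meant to resemble common TF-IDF defaults rather than a reproduction of `TfidfModel` or `MatrixSimilarity`.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """TF-IDF weight the known tokens and scale the vector to unit length."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items() if idf[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs):
    """Return (vectors, idf): unit-length TF-IDF dicts and the idf table."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    idf = {t: math.log2(len(docs) / df) for t, df in doc_freq.items()}
    return [weigh(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def search(query_tokens, docs, top_k=2):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    q = weigh(query_tokens, idf)
    scored = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


# Toy tokenized documents (illustrative only).
docs = [
    ["python", "data", "analysis"],
    ["cloud", "computing", "services"],
    ["text", "mining", "data"],
]
print(search(["python", "data"], docs))
```

A query made up entirely of unknown words produces a zero vector, so every document scores 0 and the returned order carries no information. Checking for an empty query vector before ranking is a reasonable safeguard in a search layer like this one.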

if __name__ == "__main__":
   pipeline = AdvancedGensimPipeline()
   results = pipeline.run_complete_pipeline()
  
   print("\n" + "="*60)
   coherence_scores, perplexity_scores = compare_topic_models(pipeline)
  
   print("\n" + "="*60)
   search_results = semantic_search_engine(
       pipeline,
       "artificial intelligence neural networks deep learning"
   )
  
   print("\n" + "="*60)
   print("Pipeline completed successfully!")
   print(f"Final coherence score: {results['coherence_score']:.4f}")
   print(f"Vocabulary size: {len(pipeline.dictionary)}")
   print(f"Word2Vec model size: {pipeline.word2vec_model.wv.vector_size} dimensions")
  
   print("\nModels trained and ready for use!")
   print("Access models via: pipeline.lda_model, pipeline.word2vec_model, pipeline.tfidf_model")

This main block ties everything together into a complete, executable pipeline. We initialize the AdvancedGensimPipeline, run the full workflow, and then compare topic models with different numbers of topics. Next, we test the search function with a query about artificial intelligence and deep learning. Finally, we print summary metrics, including the coherence score, vocabulary size, and Word2Vec embedding dimensions, and confirm that the trained models are available on the pipeline object for further use.

In conclusion, we have a modular workflow that covers the main stages of text analysis, from cleaning and preprocessing raw documents to discovering topics, visualizing results, comparing models, and performing search. Word2Vec embeddings, TF-IDF similarity, and coherence evaluation make the pipeline versatile, while the visualizations and classification demo make its results easier to interpret. Learners, researchers, and practitioners can adapt this framework to their own corpora and use it as a foundation for further NLP experimentation and text analytics.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views.


