• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Thursday, March 12, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

Josh by Josh
November 21, 2025
in Al, Analytics and Automation
0
An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows


In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step-by-step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease. Check out the FULL CODES here.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial. Check out the FULL CODES here.

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs. This gives us a reliable and reproducible generation layer for the rest of the pipeline. Check out the FULL CODES here.

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior. Check out the FULL CODES here.

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database. Check out the FULL CODES here.

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly. Check out the FULL CODES here.

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert multiple question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later. Check out the FULL CODES here.

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and select two metrics—Equals and LevenshteinRatio—to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine. Check out the FULL CODES here.

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard. Check out the FULL CODES here.

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.


Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.



Source_link

READ ALSO

How to Design a Streaming Decision Agent with Partial Reasoning, Online Replanning, and Reactive Mid-Execution Adaptation in Dynamic Environments

3 Questions: On the future of AI and the mathematical and physical sciences | MIT News

Related Posts

Al, Analytics and Automation

How to Design a Streaming Decision Agent with Partial Reasoning, Online Replanning, and Reactive Mid-Execution Adaptation in Dynamic Environments

March 12, 2026
3 Questions: On the future of AI and the mathematical and physical sciences | MIT News
Al, Analytics and Automation

3 Questions: On the future of AI and the mathematical and physical sciences | MIT News

March 12, 2026
NVIDIA Releases Nemotron 3 Super: A 120B Parameter Open-Source Hybrid Mamba-Attention MoE Model Delivering 5x Higher Throughput for Agentic AI
Al, Analytics and Automation

NVIDIA Releases Nemotron 3 Super: A 120B Parameter Open-Source Hybrid Mamba-Attention MoE Model Delivering 5x Higher Throughput for Agentic AI

March 11, 2026
A better method for planning complex visual tasks | MIT News
Al, Analytics and Automation

A better method for planning complex visual tasks | MIT News

March 11, 2026
Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space
Al, Analytics and Automation

Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space

March 11, 2026
AI Is Learning From the News. Now Publishers Want to Get Paid
Al, Analytics and Automation

AI Is Learning From the News. Now Publishers Want to Get Paid

March 11, 2026
Next Post
The best gaming headsets for 2025

The best gaming headsets for 2025

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Google announced the next step in its nuclear energy plans 

Google announced the next step in its nuclear energy plans 

August 20, 2025

EDITOR'S PICK

X’s New Grok Feature Aims to Outduel TikTok and Reels

X’s New Grok Feature Aims to Outduel TikTok and Reels

August 6, 2025
10 Best Email and SMS Marketing Software for Businesses in 2025

10 Best Email and SMS Marketing Software for Businesses in 2025

September 15, 2025
As China’s 996 culture spreads, South Korea’s tech sector grapples with 52-hour limit

As China’s 996 culture spreads, South Korea’s tech sector grapples with 52-hour limit

October 23, 2025

Top Google Searches (November 2025)

January 6, 2026

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How to Design a Streaming Decision Agent with Partial Reasoning, Online Replanning, and Reactive Mid-Execution Adaptation in Dynamic Environments
  • Use Cases, Cost & Implementation
  • How to Plan a Milestone Celebration in Roswell, GA: Top Venues and Unique Touches for 2026
  • Google announces Google Play paid games updates
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions