• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Tuesday, March 10, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

Josh by Josh
November 21, 2025
in Al, Analytics and Automation
0
An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows


In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step-by-step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease. Check out theĀ FULL CODES here.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial. Check out theĀ FULL CODES here.

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs. This gives us a reliable and reproducible generation layer for the rest of the pipeline. Check out theĀ FULL CODES here.

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior. Check out theĀ FULL CODES here.

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database. Check out theĀ FULL CODES here.

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly. Check out theĀ FULL CODES here.

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert multiple question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later. Check out theĀ FULL CODES here.

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and select two metrics—Equals and LevenshteinRatio—to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine. Check out theĀ FULL CODES here.

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard. Check out theĀ FULL CODES here.

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.


Check out theĀ FULL CODES here.Ā Feel free to check out ourĀ GitHub Page for Tutorials, Codes and Notebooks.Ā Also,Ā feel free to follow us onĀ TwitterĀ and don’t forget to join ourĀ 100k+ ML SubRedditĀ and Subscribe toĀ our Newsletter. Wait! are you on telegram?Ā now you can join us on telegram as well.


Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

šŸ™Œ Follow MARKTECHPOST: Add us as a preferred source on Google.



Source_link

READ ALSO

Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs

VirtuaLover Image Generator Pricing & Features Overview

Related Posts

Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs
Al, Analytics and Automation

Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs

March 10, 2026
VirtuaLover Image Generator Pricing & Features Overview
Al, Analytics and Automation

VirtuaLover Image Generator Pricing & Features Overview

March 9, 2026
Al, Analytics and Automation

The ‘Bayesian’ Upgrade: Why Google AI’s New Teaching Method is the Key to LLM Reasoning

March 9, 2026
Pricing Breakdown and Core Feature Overview
Al, Analytics and Automation

Pricing Breakdown and Core Feature Overview

March 9, 2026
Improving AI models’ ability to explain their predictions | MIT News
Al, Analytics and Automation

Improving AI models’ ability to explain their predictions | MIT News

March 9, 2026
Beyond Accuracy: Quantifying the Production Fragility Caused by Excessive, Redundant, and Low-Signal Features in Regression
Al, Analytics and Automation

Beyond Accuracy: Quantifying the Production Fragility Caused by Excessive, Redundant, and Low-Signal Features in Regression

March 9, 2026
Next Post
The best gaming headsets for 2025

The best gaming headsets for 2025

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Google announced the next step in its nuclear energy plansĀ 

Google announced the next step in its nuclear energy plansĀ 

August 20, 2025

EDITOR'S PICK

Why Vetting and Choosing a Technical PR Agency is, and Should be, Difficult.

Why Vetting and Choosing a Technical PR Agency is, and Should be, Difficult.

December 17, 2025
Gemini on Android lets you assign multi-step tasks

Gemini on Android lets you assign multi-step tasks

February 26, 2026
What Type Lawyer Makes The Most Money?

What Type Lawyer Makes The Most Money?

November 27, 2025
Apple AirPods as hearing aids: how gadgets become assistive tech

Apple AirPods as hearing aids: how gadgets become assistive tech

September 21, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • Mobile Gaming in Taiwan: What You Should Know March 2025 (Updated)
  • Restaurant PR Playbook: Build Buzz, Launch Strong, Sustain Success
  • Why Your Home Needs Professional Network Setup
  • Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions