• About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
Wednesday, March 11, 2026
mGrowTech
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions
No Result
View All Result
mGrowTech
No Result
View All Result
Home Al, Analytics and Automation

Getting Started with MLFlow for LLM Evaluation

Josh by Josh
June 28, 2025
in Al, Analytics and Automation
0
Getting Started with MLFlow for LLM Evaluation


MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it’s traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow has recently introduced support for evaluating Large Language Models (LLMs).

In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM—in our case, Google’s Gemini model—on a set of fact-based prompts. We’ll generate responses to fact-based prompts using Gemini and assess their quality using a variety of metrics supported directly by MLflow.

Setting up the dependencies

For this tutorial, we’ll be using both the OpenAI and Gemini APIs. MLflow’s built-in generative AI evaluation metrics currently rely on OpenAI models (e.g., GPT-4) to act as judges for metrics like answer similarity or faithfulness, so an OpenAI API key is required. You can obtain:

Installing the libraries

pip install mlflow openai pandas google-genai

Setting the OpenAI and Google API Keys as environment variable

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')

Preparing Evaluation Data and Fetching Outputs from Gemini

import mlflow
import openai
import os
import pandas as pd
from google import genai

Creating the evaluation data

In this step, we define a small evaluation dataset containing factual prompts along with their correct ground truth answers. These prompts span topics such as science, health, web development, and programming. This structured format allows us to objectively compare the Gemini-generated responses against known correct answers using various evaluation metrics in MLflow.

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)

eval_data

Getting Gemini Responses

This code block defines a helper function gemini_completion() that sends a prompt to the Gemini 1.5 Flash model using the Google Generative AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model’s predictions, storing them in a new “predictions” column. These predictions will later be evaluated against the ground truth answers

client = genai.Client()
def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data

Evaluating Gemini Outputs with MLflow

In this step, we initiate an MLflow run to evaluate the responses generated by the Gemini model against a set of factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measuring semantic similarity between the model’s output and the ground truth), exact_match (checking for word-for-word matches), latency (tracking response generation time), and token_count (logging the number of output tokens).

It’s important to note that the answer_similarity metric internally uses OpenAI’s GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to assess LLM outputs without relying on custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
          mlflow.metrics.genai.answer_similarity(),
          mlflow.metrics.exact_match(),
          mlflow.metrics.latency(),
          mlflow.metrics.token_count()
      ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)

To view the detailed results of our evaluation, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This allows us to inspect individual prompts, Gemini-generated predictions, ground truth answers, and the associated metric scores without truncation, which is especially helpful in notebook environments like Colab or Jupyter.

results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results

Check out the Codes here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.



Source_link

READ ALSO

Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space

AI Is Learning From the News. Now Publishers Want to Get Paid

Related Posts

Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space
Al, Analytics and Automation

Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space

March 11, 2026
AI Is Learning From the News. Now Publishers Want to Get Paid
Al, Analytics and Automation

AI Is Learning From the News. Now Publishers Want to Get Paid

March 11, 2026
3 Questions: Building predictive models to characterize tumor progression | MIT News
Al, Analytics and Automation

3 Questions: Building predictive models to characterize tumor progression | MIT News

March 10, 2026
Al, Analytics and Automation

How to Build a Risk-Aware AI Agent with Internal Critic, Self-Consistency Reasoning, and Uncertainty Estimation for Reliable Decision-Making

March 10, 2026
marvn.ai and the rise of vertical AI search engines
Al, Analytics and Automation

marvn.ai and the rise of vertical AI search engines

March 10, 2026
Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs
Al, Analytics and Automation

Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs

March 10, 2026
Next Post
What is the job market like for new college graduates?

What is the job market like for new college graduates?

POPULAR NEWS

Trump ends trade talks with Canada over a digital services tax

Trump ends trade talks with Canada over a digital services tax

June 28, 2025
Communication Effectiveness Skills For Business Leaders

Communication Effectiveness Skills For Business Leaders

June 10, 2025
15 Trending Songs on TikTok in 2025 (+ How to Use Them)

15 Trending Songs on TikTok in 2025 (+ How to Use Them)

June 18, 2025
App Development Cost in Singapore: Pricing Breakdown & Insights

App Development Cost in Singapore: Pricing Breakdown & Insights

June 22, 2025
Google announced the next step in its nuclear energy plans 

Google announced the next step in its nuclear energy plans 

August 20, 2025

EDITOR'S PICK

Anthropic vs. OpenAI vs. the Pentagon: the AI safety fight shaping our future

Anthropic vs. OpenAI vs. the Pentagon: the AI safety fight shaping our future

March 9, 2026
Fitbit’s personal health coach is expanding to more people in public preview.

Fitbit’s personal health coach is expanding to more people in public preview.

February 10, 2026
Evaluating Perplexity on Language Models

Evaluating Perplexity on Language Models

December 29, 2025
15 Best International SEO Companies To Grow Your Business

15 Best International SEO Companies To Grow Your Business

September 6, 2025

About

We bring you the best Premium WordPress Themes that perfect for news, magazine, personal blog, etc. Check our landing page for details.

Follow us

Categories

  • Account Based Marketing
  • Ad Management
  • Al, Analytics and Automation
  • Brand Management
  • Channel Marketing
  • Digital Marketing
  • Direct Marketing
  • Event Management
  • Google Marketing
  • Marketing Attribution and Consulting
  • Marketing Automation
  • Mobile Marketing
  • PR Solutions
  • Social Media Management
  • Technology And Software
  • Uncategorized

Recent Posts

  • How to Defeat a Dangerous Threat in Demacia Rising in League of Legends
  • A Certified Sleep Coach Shares the Sleep Week Deals She’s Adding to Cart (2026)
  • Google AI Introduces Gemini Embedding 2: A Multimodal Embedding Model that Lets Your Bring Text, Images, Video, Audio, and Docs into the Embedding Space
  • How AI Is Reshaping The Consumer–Brand Relationship
  • About Us
  • Disclaimer
  • Contact Us
  • Privacy Policy
No Result
View All Result
  • Technology And Software
    • Account Based Marketing
    • Channel Marketing
    • Marketing Automation
      • Al, Analytics and Automation
      • Ad Management
  • Digital Marketing
    • Social Media Management
    • Google Marketing
  • Direct Marketing
    • Brand Management
    • Marketing Attribution and Consulting
  • Mobile Marketing
  • Event Management
  • PR Solutions