mGrowTech

A Hands-On Guide to Testing Agents with RAGAs and G-Eval

By Josh
April 17, 2026
in AI, Analytics and Automation


In this article, you will learn how to evaluate large language model applications using RAGAs and G-Eval-based frameworks in a practical, hands-on workflow.

Topics we will cover include:

  • How to use RAGAs to measure faithfulness and answer relevancy in retrieval-augmented systems.
  • How to structure evaluation datasets and integrate them into a testing pipeline.
  • How to apply G-Eval via DeepEval to assess qualitative aspects like coherence.

Let’s get started.


Introduction

RAGAs (Retrieval-Augmented Generation Assessment) is an open-source evaluation framework that replaces subjective "vibe checks" with a systematic, LLM-as-judge approach to quantifying the quality of RAG pipelines. It assesses core RAG properties such as faithfulness (how well the answer is grounded in the retrieved context) and answer relevancy. RAGAs has also evolved to support not only RAG architectures but also agent-based applications, where methodologies like G-Eval play a role in defining custom, interpretable evaluation criteria.


This article presents a hands-on guide to understanding how to test large language model and agent-based applications using both RAGAs and frameworks based on G-Eval. Concretely, we will leverage DeepEval, which integrates multiple evaluation metrics into a unified testing sandbox.

If you are unfamiliar with evaluation frameworks like RAGAs, consider reviewing this related article first.

Step-by-Step Guide

This example is designed to work both in a standalone Python IDE and in a Google Colab notebook. You may need to pip install some libraries along the way to resolve potential ModuleNotFoundError issues, which occur when attempting to import modules that are not installed in your environment.
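For instance, the following commands install the packages used throughout this guide (package names as published on PyPI at the time of writing):

```shell
# Install the evaluation frameworks and their common dependencies
pip install ragas deepeval datasets openai
```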

We begin by defining a function that takes a user query as input and interacts with an LLM API (such as OpenAI) to generate a response. This is a simplified agent that encapsulates a basic input-response workflow.

import openai

def simple_agent(query):
    # NOTE: this is a 'mock' agent loop
    # In a real scenario, you would use a system prompt to define tool usage
    prompt = f"You are a helpful assistant. Answer the user query: {query}"

    # Example using OpenAI (this can be swapped for Gemini or another provider)
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

In a more realistic production setting, the agent defined above would include additional capabilities such as reasoning, planning, and tool execution. However, since the focus here is on evaluation, we intentionally keep the implementation simple.
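To make that idea concrete, here is a minimal sketch of what a fuller agent loop with tool routing might look like. The tool registry and routing logic below are illustrative inventions for this sketch only; they are not part of RAGAs, DeepEval, or the OpenAI SDK.

```python
# A toy agent loop with mock tool routing (illustrative only; all names
# are made up for this sketch and not part of any specific framework).

def get_time(_query):
    # A stand-in tool; a real one would call a clock or an API
    return "It is 12:00 UTC."

def calculator(query):
    # Extremely naive: evaluate the arithmetic expression after "calc:"
    expression = query.split("calc:", 1)[1].strip()
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"time": get_time, "calc:": calculator}

def toy_agent(query):
    # Plan: route to a tool if a keyword matches, else fall back to a canned reply
    for keyword, tool in TOOLS.items():
        if keyword in query.lower():
            return tool(query)
    return f"(LLM answer for: {query})"

print(toy_agent("calc: 2 + 3"))      # prints 5
print(toy_agent("What time is it?")) # routes to the time tool
```

A production agent would replace the keyword routing with LLM-driven tool selection, but the loop structure (receive query, decide, act, respond) is the same shape the evaluation code below exercises.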

Next, we introduce RAGAs. The following code demonstrates how to evaluate a question-answering scenario using the faithfulness metric, which measures how well the generated answer aligns with the provided context.

from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset

# Defining a simple testing dataset for a question-answering scenario
data = {
    "question": ["What is the capital of Japan?"],
    "answer": ["Tokyo is the capital."],
    "contexts": [["Japan is a country in Asia. Its capital is Tokyo."]]
}

# RAGAs expects a Hugging Face Dataset, so convert the dictionary first
dataset = Dataset.from_dict(data)

# Running RAGAs evaluation
result = evaluate(dataset, metrics=[faithfulness])

Note that you may need sufficient API quota (e.g., OpenAI or Gemini) to run these examples, which typically requires a paid account.
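Conceptually, faithfulness is the fraction of claims in the answer that are supported by the retrieved context. The toy computation below illustrates that intuition with a hand-written claim list and naive substring matching; the real RAGAs metric uses an LLM judge to extract and verify claims, so treat this purely as an intuition-building sketch.

```python
# Toy illustration of the faithfulness idea: supported claims / total claims.
# The real RAGAs metric decomposes the answer into claims with an LLM judge;
# here we fake that step with a hand-written claim list and substring checks.

def toy_faithfulness(claims, context):
    supported = sum(1 for claim in claims if claim.lower() in context.lower())
    return supported / len(claims)

context = "Japan is a country in Asia. Its capital is Tokyo."
claims = ["capital is Tokyo", "Japan is a country in Asia"]

print(toy_faithfulness(claims, context))  # 1.0: both claims appear in the context
```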

Below is a more elaborate example that incorporates an additional metric for answer relevancy and uses a structured dataset.

test_cases = [
    {
        "question": "How do I reset my password?",
        "answer": "Go to settings and click 'forgot password'. An email will be sent.",
        "contexts": ["Users can reset passwords via the Settings > Security menu."],
        "ground_truth": "Navigate to Settings, then Security, and select Forgot Password."
    }
]

Ensure that your API key is configured before proceeding. First, we demonstrate evaluation without wrapping the logic in an agent:

import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

# IMPORTANT: Replace "YOUR_API_KEY" with your actual API key
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Convert list to Hugging Face Dataset (required by RAGAs)
dataset = Dataset.from_list(test_cases)

# Run evaluation
ragas_results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

print(f"RAGAs Faithfulness Score: {ragas_results['faithfulness']}")

To simulate an agent-based workflow, we can encapsulate the evaluation logic into a reusable function:


import os
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

def evaluate_ragas_agent(test_cases, openai_api_key="YOUR_API_KEY"):
    """Simulates a simple AI agent that performs RAGAs evaluation."""
    os.environ["OPENAI_API_KEY"] = openai_api_key

    # Convert test cases into a Dataset object
    dataset = Dataset.from_list(test_cases)

    # Run evaluation
    ragas_results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

    return ragas_results

The Hugging Face Dataset object is an Apache Arrow-backed, column-oriented structure that frameworks like RAGAs use to represent and batch evaluation samples efficiently.
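As a rough mental model, Dataset.from_list turns a list of row dictionaries into a column-oriented table. The pure-Python sketch below mimics that transposition; it is an illustration of the data layout, not the datasets library's actual implementation.

```python
# Mimic the row-to-column transposition that Dataset.from_list performs.
# Illustrative only; the real Hugging Face Dataset is backed by Apache Arrow.

def rows_to_columns(rows):
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [
    {"question": "How do I reset my password?", "answer": "Use 'forgot password'."},
    {"question": "How do I log out?", "answer": "Click your avatar, then 'Log out'."},
]

table = rows_to_columns(rows)
print(list(table.keys()))      # ['question', 'answer']
print(len(table["question"]))  # 2
```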

The following code demonstrates how to call the evaluation function:

my_openai_key = "YOUR_API_KEY"  # Replace with your actual API key

if 'test_cases' in globals():
    evaluation_output = evaluate_ragas_agent(test_cases, openai_api_key=my_openai_key)
    print("RAGAs Evaluation Results:")
    print(evaluation_output)
else:
    print("Please define the 'test_cases' variable first. Example:")
    print("test_cases = [{ 'question': '…', 'answer': '…', 'contexts': […], 'ground_truth': '…' }]")

We now introduce DeepEval, which acts as a qualitative evaluation layer using a reasoning-and-scoring approach. This is particularly useful for assessing attributes such as coherence, clarity, and professionalism.


from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# STEP 1: Define a custom evaluation metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Determine if the answer is easy to follow and logically structured.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7  # Pass/fail threshold
)

# STEP 2: Create a test case
case = LLMTestCase(
    input=test_cases[0]["question"],
    actual_output=test_cases[0]["answer"]
)

# STEP 3: Run evaluation
coherence_metric.measure(case)
print(f"G-Eval Score: {coherence_metric.score}")
print(f"Reasoning: {coherence_metric.reason}")

A quick recap of the key steps:

  • Define a custom metric using natural language criteria and a threshold between 0 and 1.
  • Create an LLMTestCase using your test data.
  • Execute evaluation using the measure method.
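Under the hood, the G-Eval paper proposes probability-weighted scoring: rather than taking the judge's single rating, the final score is the expected value over the judge's probability distribution across the rating scale. The numeric sketch below uses made-up probabilities to show the arithmetic; in practice the distribution comes from the judge LLM's token probabilities.

```python
# Simplified sketch of G-Eval's probability-weighted scoring.
# The probabilities below are invented for illustration; in practice they
# come from the judge LLM's token probabilities over the rating scale.

def expected_score(score_probs):
    # Expected value over the rating distribution: sum of score * probability
    return sum(score * prob for score, prob in score_probs.items())

# Hypothetical judge distribution over a 1-5 scale
score_probs = {1: 0.0, 2: 0.05, 3: 0.15, 4: 0.5, 5: 0.3}

raw = expected_score(score_probs)  # ~4.05 on the 1-5 scale
normalized = (raw - 1) / 4         # rescaled to 0-1 for thresholding
print(round(raw, 2))
```

This weighting smooths out the coarse integer scale, which is why G-Eval scores correlate better with human judgments than a single sampled rating.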

Summary

This article demonstrated how to evaluate large language model and retrieval-augmented applications using RAGAs and G-Eval-based frameworks. By combining structured metrics (faithfulness and relevancy) with qualitative evaluation (coherence), you can build a more comprehensive and reliable evaluation pipeline for modern AI systems.



