Announcing User Simulation in ADK Evaluation

Agents are inherently conversational. Users ask follow-up questions, refine earlier requests, and supply additional information as the conversation unfolds. Manually scripting tests for such multi-turn conversations, however, is brittle and time-consuming. You write dozens of user_input and expected_output pairs, only for them to break with the slightest change in your agent’s behavior, turning test maintenance into a frustrating chore.

Today, we’re excited to announce a new feature in the Agent Development Kit (ADK) that helps address this problem: User Simulation. This new feature allows you to move away from testing a rigid implementation path and instead evaluate your agent’s ability to actually achieve a user’s intent.

What is the User Simulator?

At its core, the User Simulator is an LLM-powered user prompt generator. You provide it with a high-level goal, and it dynamically generates the user side of a conversation to pursue that goal. In this first release it is integrated directly into the ADK evaluation framework: it’s not a separate service but a tool you run locally, allowing for a fast, iterative “inner loop” workflow.

How It Works

1. Defining a Conversation Scenario

Instead of a rigid turn-by-turn script, you provide a ConversationScenario. This is a simple JSON object with two key parts:

starting_prompt: A fixed, initial prompt to begin the conversation.
conversation_plan: A natural language guideline that tells the simulator its objective.

Here’s an example evaluation set for an agent with tools to roll dice and check for prime numbers:

{
  "scenarios": [
    {
      "starting_prompt": "What can you do for me?",
      "conversation_plan": "Ask the agent to roll a 20-sided die. After you get the result, ask the agent to check if it is prime."
    },
    {
      "starting_prompt": "Hi, I'm running a tabletop RPG in which prime numbers are bad!",
      "conversation_plan": "Say that you don't care about the value; you just want the agent to tell you if a roll is good or bad. Once the agent agrees, ask it to roll a d6. Finally, ask the agent to do the same with 2 d20."
    }
  ]
}
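
Before handing a scenarios file to the evaluation, it can help to check that every entry carries both fields described above. The following is a minimal sketch using only the Python standard library; `validate_scenarios` is a hypothetical helper written for this post, not part of the ADK, and the rule that both fields must be non-empty strings is an assumption based on the example above.

```python
import json

# The two fields each scenario carries, as described in this post.
REQUIRED_KEYS = ("starting_prompt", "conversation_plan")


def validate_scenarios(raw: str) -> list:
    """Parse a scenarios document and check every entry has both fields.

    Hypothetical helper for illustration; not an ADK function.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    scenarios = data.get("scenarios")
    if not isinstance(scenarios, list) or not scenarios:
        raise ValueError("expected a non-empty 'scenarios' list")
    for index, scenario in enumerate(scenarios):
        if not isinstance(scenario, dict):
            raise ValueError(f"scenario {index}: expected a JSON object")
        for key in REQUIRED_KEYS:
            value = scenario.get(key)
            # Assumption: both fields must be non-empty strings.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"scenario {index}: '{key}' must be a non-empty string")
    return scenarios


EXAMPLE = json.dumps({
    "scenarios": [
        {
            "starting_prompt": "What can you do for me?",
            "conversation_plan": (
                "Ask the agent to roll a 20-sided die. After you get the "
                "result, ask the agent to check if it is prime."
            ),
        }
    ]
})

if __name__ == "__main__":
    print(len(validate_scenarios(EXAMPLE)), "scenario(s) look well-formed")
```

A check like this catches a misspelled key before any model calls are made.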


When you run the evaluation, the simulator will handle the back-and-forth conversation dynamically until the conversation_plan is fulfilled. Here is what the generated conversation for the first scenario above might look like (reformatted for clarity):

[USER]: What can you do for me?
[AGENT]: I can roll dice and check if numbers are prime. How can I help?
[USER]: Please roll a 20-sided die for me.
[AGENT]: Of course. The result is 17.
[USER]: Thanks. Can you check if 17 is a prime number?
[AGENT]: Yes, 17 is a prime number.
[USER]: </finished>
--------------------
EVALUATION RESULT: COMPLETED


Notice how the conversation_plan defines a sequence of goals. It doesn’t specify the user’s exact prompts or the agent’s exact expected responses. It only cares about the outcome: getting a dice roll, then getting a prime number check on that result. This makes the test resilient to minor changes in your agent’s conversational style or internal logic.
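
To make the back-and-forth concrete, here is a toy version of the loop in plain Python. It is a sketch of the idea only and does not reflect how the ADK implements it: the LLM-backed simulator is replaced by a scripted stand-in, the agent by canned replies, and `run_simulation`, `scripted_user`, and `stub_agent` are hypothetical names. The `</finished>` marker is taken from the sample transcript above, and the turn cap anticipates the turn budget covered in the next section.

```python
from typing import Callable, List, Tuple

STOP_SIGNAL = "</finished>"  # marker seen in the sample transcript above

Turn = Tuple[str, str]  # (speaker, text)


def run_simulation(
    starting_prompt: str,
    next_user_message: Callable[[List[Turn]], str],
    agent_reply: Callable[[str], str],
    max_allowed_invocations: int,
) -> Tuple[List[Turn], bool]:
    """Alternate user and agent turns until the user side signals completion
    or the turn budget runs out. Returns the transcript and a completed flag."""
    transcript: List[Turn] = []
    user_message = starting_prompt
    for _ in range(max_allowed_invocations):
        transcript.append(("USER", user_message))
        transcript.append(("AGENT", agent_reply(user_message)))
        user_message = next_user_message(transcript)
        if user_message == STOP_SIGNAL:
            return transcript, True
    return transcript, False


def scripted_user(transcript: List[Turn]) -> str:
    """Stand-in for the LLM: follows a two-step plan, then stops."""
    plan = [
        "Please roll a 20-sided die for me.",
        "Can you check if 17 is a prime number?",
    ]
    user_turns = sum(1 for speaker, _ in transcript if speaker == "USER")
    # user_turns includes the starting prompt, so the next plan step
    # to send is plan[user_turns - 1].
    return plan[user_turns - 1] if user_turns <= len(plan) else STOP_SIGNAL


def stub_agent(message: str) -> str:
    """Canned agent replies, purely for illustration."""
    lowered = message.lower()
    if "roll" in lowered:
        return "The result is 17."
    if "prime" in lowered:
        return "Yes, 17 is a prime number."
    return "I can roll dice and check if numbers are prime."


if __name__ == "__main__":
    turns, completed = run_simulation(
        "What can you do for me?", scripted_user, stub_agent, 20
    )
    for speaker, text in turns:
        print(f"[{speaker}]: {text}")
    print("completed:", completed)
```

In the real feature the next user message comes from a model that reads the conversation_plan and the conversation so far, which is what lets the same scenario survive changes in the agent's wording.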

2. Configuring the Simulation

You have direct control over the simulator’s behavior by providing an EvalConfig file. This allows you to fine-tune the simulation for your specific testing needs.

Here are the key parameters you can configure:

Model: Specify which model backs the user simulator (e.g., gemini-2.5-flash).
Model Configuration: Specify options for the model, such as thinking behavior.
Turn Budget: Set the maximum number of user-agent interactions (max_allowed_invocations) before the conversation is terminated, preventing infinite loops.

Custom Behavior: In addition to the above parameters, you can override the default system prompt to change the simulator’s persona. This allows you to test how your agent handles different types of users, such as a confused user or a more demanding one. We plan to add persona configuration support via the EvalConfig soon.

Here is an example of a configuration file with an evaluation criterion and a configuration for the user simulator:

{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5,
      "evaluate_intermediate_nl_responses": true
    }
  },
  "user_simulator_config": {
    "model": "gemini-2.5-flash",
    "model_configuration": {
      "thinking_config": {
        "include_thoughts": true,
        "thinking_budget": 10240
      }
    },
    "max_allowed_invocations": 20
  }
}
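
If you want to vary the turn budget or the simulator model between runs, one option is to generate this file from a small script instead of editing it by hand. The sketch below mirrors the keys of the example above exactly; `build_eval_config` is a hypothetical helper, and the check that the budget is at least 1 is our own assumption, not a documented ADK rule.

```python
import json


def build_eval_config(
    model: str,
    max_allowed_invocations: int,
    thinking_budget: int = 10240,
) -> dict:
    """Return a config dict shaped like the example above.

    Hypothetical helper for illustration; not an ADK function.
    """
    # Assumption: a budget below 1 would leave no room for any exchange.
    if max_allowed_invocations < 1:
        raise ValueError("max_allowed_invocations must be at least 1")
    return {
        "criteria": {
            "hallucinations_v1": {
                "threshold": 0.5,
                "evaluate_intermediate_nl_responses": True,
            }
        },
        "user_simulator_config": {
            "model": model,
            "model_configuration": {
                "thinking_config": {
                    "include_thoughts": True,
                    "thinking_budget": thinking_budget,
                }
            },
            "max_allowed_invocations": max_allowed_invocations,
        },
    }


if __name__ == "__main__":
    # Print the config as JSON; redirect to a file to use it in a run.
    print(json.dumps(build_eval_config("gemini-2.5-flash", 20), indent=2))
```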


3. Running the Evaluation

With your scenarios and configuration in place, you can run the evaluation to get a detailed breakdown of the interaction.

*********************************************************************
Eval Run Summary
eval_set_with_scenarios:
  Tests passed: 2
  Tests failed: 0
********************************************************************
Eval Set Id: eval_set_with_scenarios
Eval Id: 83903cba
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: hallucinations_v1, Status: PASSED, Score: 1.0, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+----------------------+------------------------------------------+-------------------+
|    | prompt               | actual_response                          | hallucinations_v1 |
+====+======================+==========================================+===================+
|  0 | What can you do for  | I can roll dice of different sizes and   | Status: PASSED,   |
|    | me?                  | check if numbers are prime. For example, | Score: 1.0        |
|    |                      | I can roll an 8-sided die and tell you   |                   |
|    |                      | the result. Also, I can check if a       |                   |
|    |                      | number (or a list of numbers) is prime.  |                   |
+----+----------------------+------------------------------------------+-------------------+
|  1 | Please roll a        | I rolled a 20-sided die, and the result  | Status: PASSED,   |
|    | 20-sided die for me. | is 16.                                   | Score: 1.0        |
+----+----------------------+------------------------------------------+-------------------+
|  2 | Can you check if 16  | No, 16 is not a prime number.            | Status: PASSED,   |
|    | is a prime number?   |                                          | Score: 1.0        |
+----+----------------------+------------------------------------------+-------------------+

********************************************************************
Eval Set Id: eval_set_with_scenarios
Eval Id: b7944d3f
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: hallucinations_v1, Status: PASSED, Score: 0.875, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+---------------------------+------------------------------------------+-------------------+
|    | prompt                    | actual_response                          | hallucinations_v1 |
+====+===========================+==========================================+===================+
|  0 | Hi, I'm running a         | Understood! I can roll dice and check if | Status: PASSED,   |
|    | tabletop RPG in which     | the results are prime numbers for you.   | Score: 1.0        |
|    | prime numbers are bad!    | Let me know when you need me to roll a   |                   |
|    |                           | die and check for primes.                |                   |
+----+---------------------------+------------------------------------------+-------------------+
|  1 | I don't really care about | Alright, I can roll dice for you and     | Status: PASSED,   |
|    | the actual number; I just | tell you whether the result is a prime   | Score: 1.0        |
|    | need you to tell me if    | number (bad) or not (good). Just tell me |                   |
|    | the roll is good or bad.  | how many sides the die should have.      |                   |
+----+---------------------------+------------------------------------------+-------------------+
|  2 | Great, please roll a d6.  | The result is 6, which is not a prime    | Status: PASSED,   |
|    |                           | number. That's a good roll!              | Score: 1.0        |
+----+---------------------------+------------------------------------------+-------------------+
|  3 | Okay, now please roll 2   | Okay, so 19 is prime (bad) and 6 is not  | Status: PASSED,   |
|    | d20.                      | prime (good). One good, one bad.         | Score: 1.0        |
+----+---------------------------+------------------------------------------+-------------------+
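
If you capture this report in a CI job, the per-metric summary lines are regular enough to parse. The sketch below assumes the `Metric: ..., Status: ..., Score: ..., Threshold: ...` layout shown in the sample output above; that layout is read off this post rather than a documented contract, so it may change between releases, and `parse_metric_lines` is a hypothetical helper.

```python
import re

# Pattern for the summary lines shown in the sample report above.
METRIC_LINE = re.compile(
    r"Metric: (?P<metric>\w+), Status: (?P<status>\w+), "
    r"Score: (?P<score>\d+(?:\.\d+)?), Threshold: (?P<threshold>\d+(?:\.\d+)?)"
)


def parse_metric_lines(report: str) -> list:
    """Extract one dict per 'Metric: ...' summary line in the report text."""
    results = []
    for match in METRIC_LINE.finditer(report):
        results.append({
            "metric": match.group("metric"),
            "status": match.group("status"),
            "score": float(match.group("score")),
            "threshold": float(match.group("threshold")),
        })
    return results


SAMPLE = """\
Metric: hallucinations_v1, Status: PASSED, Score: 1.0, Threshold: 0.5
Metric: hallucinations_v1, Status: PASSED, Score: 0.875, Threshold: 0.5
"""

if __name__ == "__main__":
    for row in parse_metric_lines(SAMPLE):
        print(row["metric"], row["status"], row["score"], row["threshold"])
```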


What This Means for Developers

This initial release of User Simulation is focused on solving the immediate toil of creating and maintaining multi-turn tests. It helps you:

Dramatically reduce test creation time: Stop writing complex, turn-by-turn scripts and instead define simple, high-level goals.
Build more resilient tests: By focusing on intent over a specific conversational path, your tests won’t break every time you refactor a prompt.
Create a reliable regression suite: Quickly generate a wide range of test cases to build a safety net that catches regressions before they reach production.

We believe that robust, goal-oriented simulation is a fundamental capability for building reliable and trustworthy AI agents. This feature is the foundational first step in our broader vision to deliver a comprehensive set of simulation capabilities for the entire agent lifecycle. On behalf of the core team who brought this feature to life — Ankur Sharma, Keyur Joshi, Pierre Thodoroff, Sebastian Caldas, and Xiaowei Li — we’re excited to see what you build and welcome your feedback as you start using this feature.

Ready to get started? Dive into the ADK documentation and Colab tutorial and start exploring the User Simulation feature today.
