
Google AI Research Proposes Vantage: An LLM-Based Protocol for Measuring Collaboration, Creativity, and Critical Thinking

By Josh · April 14, 2026 · AI, Analytics and Automation


Standardized tests can tell you whether a student knows calculus or can parse a passage of text. What they cannot reliably tell you is whether that student can resolve a disagreement with a teammate, generate genuinely original ideas under pressure, or critically dismantle a flawed argument. These are the so-called durable skills — collaboration, creativity, and critical thinking — and for decades they have resisted rigorous, scalable measurement. New research from Google Research proposes a technically novel solution called Vantage: orchestrated large language models that both simulate authentic group interaction and score the results with accuracy rivaling that of human expert raters.

https://services.google.com/fh/files/misc/toward_scalable_measurement_of_durable_skills.pdf

The Core Problem: Ecological Validity vs. Psychometric Rigor

To understand why this is technically interesting, it helps to understand the measurement paradox the research team was trying to crack. Measuring durable skills effectively requires two conflicting properties. On one hand, the assessment needs ecological validity — it should feel like a real-world scenario, because that is precisely the context in which these skills are exercised. On the other hand, it needs psychometric rigor: standardized conditions, reproducibility, and controllable stimuli so that scores are comparable across test-takers.

Previous large-scale efforts, like the PISA 2015 Collaborative Problem Solving assessment, tried to solve this by having subjects interact with scripted simulated teammates via multiple-choice questions. That guarantees control but sacrifices authenticity. Human-to-human assessments do the opposite. LLMs, the research team argues, are uniquely positioned to satisfy both requirements simultaneously — they can produce naturalistic, open-ended conversational interactions while still being steered programmatically toward specific assessment goals.

The Executive LLM: A Coordination Layer Over AI Agents

The most technically distinctive contribution of this research is the Executive LLM architecture. Rather than spawning multiple independent LLM agents — one per AI teammate — the system uses a single LLM to generate responses for all AI participants in the conversation. This matters for two reasons.

First, it enables coordination. The Executive LLM has access to the same pedagogical rubric that will later be used to evaluate the human participant. It uses this rubric not just passively, but actively — steering the conversation toward scenarios that elicit evidence of specific skills. For example, if the target dimension is Conflict Resolution, the Executive LLM may instruct one of its AI personas to introduce a disagreement and sustain it until the human participant demonstrates (or fails to demonstrate) a conflict-resolution strategy. This is functionally analogous to how a computerized adaptive test (CAT) dynamically adjusts item difficulty based on a test-taker’s running performance — except here, the ‘items’ are turns in a live conversation.

Second, the Independent Agents baseline (separate LLMs with no coordination) proved demonstrably weaker. Without steering, conversations simply might not produce the right evidence — if team members naturally agree, there is no conflict to resolve, and the assessment learns nothing about that sub-skill.

Gemini 2.5 Pro was used as the model underlying the Executive LLM for the main collaboration experiments, while Gemini 3 powered the creativity and critical thinking modules.
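As a concrete, purely illustrative sketch of this coordination pattern: one model call plans the next utterance for every AI persona at once, with the rubric and target skill dimension injected into the prompt so the conversation can be steered. The function names, prompt wording, and JSON reply format below are our assumptions, not details from the paper; `llm` stands in for any prompt-to-text model call.

```python
# Hypothetical sketch of an Executive-LLM turn: a SINGLE model call produces
# the next utterance for ALL AI personas, steered by the assessment rubric.
# Prompt text, names, and reply format are illustrative assumptions.
import json

def executive_turn(history, personas, rubric, target_dim, llm):
    """One coordinated call; `llm` is any prompt -> JSON-string callable."""
    prompt = (
        "You control ALL AI teammates in a group task.\n"
        f"Target skill: {target_dim}. Rubric: {rubric[target_dim]}\n"
        "Steer the conversation so the human can show evidence of this skill,"
        " e.g. have one persona introduce and sustain a disagreement.\n"
        f"Personas: {', '.join(personas)}\n"
        f"Transcript so far:\n{history}\n"
        'Reply as JSON: {"<persona>": "<next utterance>", ...}'
    )
    return json.loads(llm(prompt))
```

Because a single call sees every persona plus the rubric, the "introduce a disagreement" move can be assigned to exactly one persona while the others hold back — the coordination that independent per-persona agents cannot guarantee.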

What the Experiments Actually Show

The research team recruited 188 participants aged 18–25, all native English speakers based in the United States, via the Prolific platform. Each participant generated two conversations, yielding 373 transcripts after three were filtered out due to technical issues. All participants worked through collaborative tasks — either designing a science experiment or holding a structured debate — with a group of AI personas, for 30 minutes per conversation.

Two sub-skills of collaboration were evaluated: Conflict Resolution (CR) and Project Management (PM). Conversations were rated both by two human pedagogical raters from New York University and by an AI Evaluator (Gemini 3.0), which scored each participant turn 20 times. A turn was declared NA if any one of the 20 predictions returned NA; otherwise, the final label was the most frequent non-NA level among the 20 runs. A regression model — linear for scores, logistic for NA decisions — was then trained on these turn-level labels to produce a conversation-level score, with performance evaluated using leave-one-out cross-validation.
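The turn-level aggregation rule just described is simple enough to state in a few lines. This is a sketch of the rule as the text describes it; the function name is ours, and the paper does not specify a tie-break for equally frequent labels.

```python
# Turn-level label aggregation as described: each turn is scored 20 times;
# the turn is NA if ANY run returns NA, otherwise the label is the most
# frequent level (ties broken by first occurrence -- unspecified in the paper).
from collections import Counter

def aggregate_turn(predictions):
    """predictions: the 20 per-run labels for one turn; None encodes NA."""
    if any(p is None for p in predictions):
        return None  # a single NA prediction declares the whole turn NA
    return Counter(predictions).most_common(1)[0][0]
```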

The main results are compelling on multiple fronts. Turn-level and conversation-level evidence rates for skill-relevant behavior were significantly higher in the Executive LLM conditions than in the Independent Agents condition across both sub-skills. Conversation-level information rates reached 92.4% for Project Management and 85% for Conflict Resolution when the skill-matched Executive LLM was used. Notably, simply telling participants to focus on a skill had no significant effect on evidence rates (all p > 0.6), confirming that the steering must come from the AI side.

On scoring accuracy, inter-rater agreement between the AI Evaluator and human experts — measured with Cohen’s Kappa — was comparable to inter-human agreement, which was moderate (κ = 0.45–0.64) across both skills and both scoring tasks.
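For reference, Cohen's Kappa is raw percent agreement corrected for the agreement two raters would reach by chance given their label frequencies. A minimal standard-library implementation (ours, not the study's code):

```python
# Minimal Cohen's Kappa: chance-corrected agreement between two raters.
# Undefined (division by zero) when chance agreement is exactly 1.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[l] * cb[l] for l in set(rater_a) | set(rater_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

On this scale, the 0.45–0.64 range reported for the human raters corresponds to conventional "moderate" agreement.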


Simulation as a Development Sandbox

One practically useful finding for ML engineers building similar systems is the validation of LLM-based simulation as a stand-in for human subjects during protocol development. The research team used Gemini to simulate human participants at known skill levels (1–4 on each rubric dimension), then measured recovery error — the mean absolute difference between the ground-truth level and the autorater’s inferred level. The Executive LLM produced significantly lower recovery error than Independent Agents for both CR and PM. Qualitative patterns in the simulated data closely matched those from real human conversations, suggesting that rubric-based simulation can de-risk assessment design before expensive human data collection.
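The recovery-error metric itself is just a mean absolute error over rubric levels: simulate participants at known levels, let the autorater infer a level, and average the absolute differences. A minimal sketch (our naming):

```python
# Recovery error: mean absolute difference between the ground-truth skill
# level each simulated participant was prompted to exhibit and the level
# the autorater inferred from the resulting conversation.

def recovery_error(true_levels, inferred_levels):
    pairs = list(zip(true_levels, inferred_levels))
    return sum(abs(t - i) for t, i in pairs) / len(pairs)
```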

Evidence Rates Extend Across Creativity and Critical Thinking

For creativity and critical thinking, initial evidence rates were evaluated using simulated subjects. The results show the Executive LLM outperforming Independent Agents across all 8 dimensions tested — all six creativity dimensions (Fluidity, Originality, Quality, Building on Ideas, Elaborating, and Selecting) and both critical thinking dimensions (Interpret and Analyze; Evaluate and Judge) — with all differences statistically significant. The research team noted that human rating collection for these two skills is ongoing and results will be shared in future work, but the simulation results suggest the Executive LLM approach generalizes beyond collaboration.

Creativity Scoring at 0.88 Pearson Correlation

In a separate partnership with OpenMic, an institution building AI-powered durable skills assessment tools, the research team evaluated their Gemini-based creativity autorater on complex multimedia tasks completed by 280 high school students. The tasks involved designing a news segment based on a short story, including producing character interview questions. Critically, 100 submissions were used first to refine the Gemini prompt and the expert pedagogical rubrics, while the remaining 180 held-out submissions were used for the final accuracy evaluation. Rubric-based scoring by OpenMic experts and the autorater agreed at Cohen’s Kappa = 0.66 (good agreement) at the item level. More strikingly, when overall submission scores were compared, the Pearson correlation between autorater and human expert totals was 0.88 — a level of agreement that is difficult to achieve even between human raters on subjective creative tasks.
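Pearson's r, the headline statistic here, measures linear association between the autorater's totals and the human experts' totals. A plain reference implementation (ours):

```python
# Plain Pearson correlation: covariance of the two score vectors divided by
# the product of their standard deviations; r = 1 is perfect linear agreement.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Note that r = 0.88 on overall totals can coexist with only κ = 0.66 at the item level: per-item disagreements partially cancel when items are summed into a submission-level score.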

Closing the Feedback Loop

Beyond scoring, Vantage surfaces results to users through a quantitative skills map showing competency levels across all skills and sub-skills, with the option to drill down into specific excerpts from the conversation that substantiate each numeric score. This makes the evidence for the assessment transparent and actionable — a meaningful design consideration for anyone building similar evaluation pipelines where interpretability of automated scores matters.

Key Takeaways

  • A single ‘Executive LLM’ outperforms multiple independent agents for skill assessment: Rather than running one LLM per AI teammate, Google’s Vantage uses a single coordinating LLM that generates responses for all AI participants. This allows it to actively steer conversations using a pedagogical rubric — introducing conflicts, pushing back on ideas, or creating planning bottlenecks — to draw out observable evidence of specific skills that might never surface naturally.
  • LLM-based scoring is now on par with human expert raters: The AI Evaluator’s agreement with human raters was comparable to the agreement between two human experts themselves, who only reached moderate Cohen’s Kappa (0.45–0.64) even after multiple calibration rounds. This positions automated LLM scoring as a genuinely scalable alternative to expensive human annotation for complex, open-ended conversational tasks.
  • Telling users to focus on a skill does nothing — the steering has to come from the AI side: Participants who were explicitly instructed to pay attention to conflict resolution or project management showed no statistically significant improvement in evidence rates (all p > 0.6) compared to those given no instructions. Only the Executive LLM’s active steering produced measurably richer assessment data.
  • LLM simulation can serve as a low-cost sandbox before running studies with real humans: By simulating participants at known skill levels and measuring how accurately the system recovered those levels, the research team validated their assessment protocol without burning through expensive human subject budgets. Simulated and real conversation patterns were qualitatively similar, making this a practical approach for iterating on rubrics and prompts early in development.
  • AI creativity scoring achieved 0.88 Pearson correlation with human experts on real student work: In a real-world test with 180 held-out high school student submissions, a Gemini-based autorater matched human expert scores at a Pearson correlation of 0.88 on overall creativity assessment — demonstrating that automated scoring of complex, subjective, multimedia tasks is not just theoretically possible but empirically validated.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.



