How the bench is run, and what to read into it.
A fixed task, a fixed model, and eight framework adapters around them. Everything below documents the choices that make the numbers comparable — and the ones that limit how far they generalize.
What each framework is asked to do
A small recruiting task picked because it exercises tool-calling without rewarding model knowledge.
The task isolates framework behavior, not model intelligence. Given a job_id drawn from a fixed set of 10, each adapter must return its top 3 candidates from a 50-row dataset, scored 0–100 with a short justification.
Four deterministic tools are exposed to the agent:
search_candidates(query, filters?)— free-text plus filtersget_candidate_profile(candidate_id)— full profilescore_match(candidate_id, job_id)— returns a breakdown, not an aggregate, so the model still has to reasonlist_jobs()— lightweight job summary
Output is strict JSON. A run is invalid if the schema is wrong, an ID is hallucinated, a candidate is duplicated, or a justification exceeds 60 words.
The same model for every framework
One endpoint, one config, one set of sampling params — chosen for portability across all eight adapters.
Every framework calls gemini-2.5-flash through Google AI Studio's OpenAI-compatible endpoint at generativelanguage.googleapis.com/v1beta/openai/, with temperature=0 and MAX_STEPS=25.
Gemini 3.1 Pro Preview was the original target; it was replaced because its thought_signature round-tripping fails on five of eight frameworks under the OpenAI-compat layer (see The thinking-model trap below). Flash is also the model most SaaS teams actually deploy — a working bench on a production model beats a broken bench on a flashier preview.
How outputs are evaluated
Hard checks first, then a model-based judge on the runs that survive — invalid runs never reach the rubric.
- Programmatic validation — JSON shape, candidate IDs present in the dataset, justification length cap, no duplicates. A run that fails any of these is discarded before judging.
- LLM-as-judge — Gemini scores each surviving run on four criteria (relevance, score coherence, justification quality, format) and the criteria are aggregated into a /20 score.
Honest limits readers should hold
Four places where the numbers underdetermine the conclusion. Worth keeping in view while reading the leaderboard.
tool_calls read as zero until its step_callback was wired in by hand; if a future SDK release changes that callback shape, the field will revert to null rather than silently mislead.gemini-2.5-flash silently consumes part of the max_tokens budget on internal reasoning even though it is documented as a non-thinking model. The first judge run failed on every valid output (220 of 220): with max_tokens=512 the response was systematically truncated to ~20 tokens (finish_reason=length), producing valid-looking but incomplete JSON that broke at a key boundary. Bumping the budget to 4096 fixed it. If this bench were running on a credit-card account instead of an enterprise quota, the silent truncation would have looked like a budget save right up until someone tried to read the scores.Why gemini-2.5-flash, not 3.x
The thought_signature round-trip is the single largest cross-framework gotcha this bench surfaced — and the reason for the model choice above.
Gemini 3.x and 2.5-Pro are thinking models: they generate internal reasoning before each function call. Google's API attaches an opaque thought_signature to every function call returned by these models and expects that signature back when the agent re-injects the conversation history at the next turn.
Frameworks that pass the response message through verbatim (baseline-python, baseline-typescript, Mastra, Vercel AI SDK) preserve the signature. Frameworks that rebuild messages into a "clean" provider-agnostic shape (LangGraph, PydanticAI) silently strip it. Google then rejects the next request with 400: Function call is missing a thought_signature in functionCall parts.
gemini-3.1-pro-preview— breaks 5/8 frameworksgemini-3-flash-preview— breaks 2/8 (LangGraph, PydanticAI)gemini-2.5-flash— works on 8/8 (not a thinking model, so there is no signature to preserve)
The bug doesn't surface in single-framework quickstarts. It only emerges when rebuilt messages meet a thinking model — exactly the shape most production teams arrive at by accident, having picked the framework first and the model second.
This is bigger than Gemini
Every thinking-model vendor exposes a version of this bug. The failure mode is what differs.
thought_signature is a Gemini-specific token, but the underlying class of bug — frameworks normalizing away vendor-specific reasoning artifacts — is universal across thinking models. The artifact and the symptom change; the root cause does not.
| Vendor / model | Reasoning artifact | Failure when stripped |
|---|---|---|
| Gemini 2.5 / 3.x | thought_signature | HTTP 400 |
| Claude (extended thinking) | signature on thinking blocks | HTTP 400 |
| OpenAI o1 / o3 / GPT-5 thinking | previous_response_id / reasoning.encrypted_content | Re-thinks · extra tokens |
| DeepSeek R1, Qwen QwQ | Inline <think>...</think> tags | Reasoning truncated |
Anthropic and Google fail loudly: HTTP 400, you find out immediately. OpenAI and the open-weight thinking models fail quietly: extra tokens, longer responses, no error to catch. The first kind shows up in CI; the second only shows up on the credit-card statement.
How each framework should actually handle thinking models
Vendor-native paths preserve thinking artifacts; the OpenAI-compat path used here for uniformity does not.
For transport parity, every framework in this bench is routed through Google's OpenAI-compatible endpoint. That single choice is what surfaces the thought_signature bug — it is not a verdict on the frameworks themselves. Each one ships a vendor-native path that handles thinking models correctly.
| Framework | Native Gemini path (for thinking models) | OpenAI-compat path (used by this bench) |
|---|---|---|
| LangGraph | langchain-google-genai | langchain-openai + base_url |
| PydanticAI | Agent('google-gla:gemini-3...') | OpenAIChatModel + base_url |
| CrewAI | LLM(model='gemini/gemini-3...') | LLM(model='openai/...', api_base=...) |
| Google ADK | LlmAgent(model='gemini-3...') | LlmAgent(model=LiteLlm(...)) |
| Mastra | @ai-sdk/google | @ai-sdk/openai-compatible |
| Vercel AI SDK | @ai-sdk/google | @ai-sdk/openai-compatible |
gemini-2.5-flash) keeps that trade-off from penalizing any single framework.What the headline numbers hide
Three findings from the 240-run dataset that deserve to be read alongside the leaderboard, not after it.
Aggregated metrics flatten the shape of each framework's behavior. The three patterns below are the ones most likely to mislead a reader skimming p50 latency and success rate alone.
ToolLoopAgent handles context differently from the other adapters — fewer tokens replayed each step, not a different model. Worth confirming the behavior matches expectations before reading the cost number as a free win.