Methodology

How the bench is run, and what to read into it.

A fixed task, a fixed model, and eight framework adapters around them. Everything below documents the choices that make the numbers comparable — and the ones that limit how far they generalize.

Task

What each framework is asked to do

A small recruiting task picked because it exercises tool-calling without rewarding model knowledge.

The task isolates framework behavior, not model intelligence. Given a job_id drawn from a fixed set of 10, each adapter must return its top 3 candidates from a 50-row dataset, scored 0–100 with a short justification.

Four deterministic tools are exposed to the agent:

search_candidates(query, filters?) — free-text plus filters
get_candidate_profile(candidate_id) — full profile
score_match(candidate_id, job_id) — returns a breakdown, not an aggregate, so the model still has to reason
list_jobs() — lightweight job summary

Output is strict JSON. A run is invalid if the schema is wrong, an ID is hallucinated, a candidate is duplicated, or a justification exceeds 60 words.
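The four invalidation rules above can be sketched as a single validator. This is a minimal illustration, not the bench's actual checker: the output shape (a `candidates` list with `candidate_id`, `score`, and `justification` fields) and the function name are assumptions, since the exact schema is not given here.

```python
import json

MAX_JUSTIFICATION_WORDS = 60  # cap stated in the task rules
TOP_K = 3                     # each run must return its top 3 candidates


def validate_run(raw_output: str, known_ids: set) -> list:
    """Return the list of rule violations; an empty list means the run is valid."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    # Assumed output shape (hypothetical field names):
    # {"candidates": [{"candidate_id": str, "score": number, "justification": str}]}
    candidates = data.get("candidates") if isinstance(data, dict) else None
    if not isinstance(candidates, list) or len(candidates) != TOP_K:
        return [f"schema: expected a 'candidates' list of length {TOP_K}"]

    errors = []
    seen = set()
    for i, item in enumerate(candidates):
        if not isinstance(item, dict):
            errors.append(f"schema: candidates[{i}] is not an object")
            continue
        cid = item.get("candidate_id")
        score = item.get("score")
        text = item.get("justification")
        if not isinstance(cid, str) or not isinstance(text, str):
            errors.append(f"schema: candidates[{i}] is missing a string field")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0 <= score <= 100:
            errors.append(f"schema: candidates[{i}].score is not a number in 0-100")
        if cid not in known_ids:
            errors.append(f"hallucinated id: {cid}")
        if cid in seen:
            errors.append(f"duplicate id: {cid}")
        seen.add(cid)
        if len(text.split()) > MAX_JUSTIFICATION_WORDS:
            errors.append(f"justification too long: candidates[{i}]")
    return errors
```

Returning every violation rather than stopping at the first makes it easier to see which rule a given adapter trips most often.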

Model

The same model for every framework

One endpoint, one config, one set of sampling params — chosen for portability across all eight adapters.

Every framework calls gemini-2.5-flash through Google AI Studio's OpenAI-compatible endpoint at generativelanguage.googleapis.com/v1beta/openai/, with temperature=0 and MAX_STEPS=25.
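A rough sketch of how those shared settings might be pinned in one place and how a step cap can be enforced. `BenchConfig`, `request_kwargs`, and `run_agent` are hypothetical names for illustration; only the model name, endpoint, temperature, and step limit come from the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchConfig:
    """Settings shared by every adapter (values taken from the bench description)."""
    model: str = "gemini-2.5-flash"
    base_url: str = "https://generativelanguage.googleapis.com/v1beta/openai/"
    temperature: float = 0.0
    max_steps: int = 25

    def request_kwargs(self) -> dict:
        # Sampling parameters to pass on every chat-completion request.
        return {"model": self.model, "temperature": self.temperature}


def run_agent(step: Callable[[int], bool], config: BenchConfig) -> int:
    """Call `step` until it reports a final answer or the step cap is reached.

    `step` is a hypothetical callable that performs one model/tool turn and
    returns True once the agent has produced its final output.
    Returns the number of steps used.
    """
    for n in range(1, config.max_steps + 1):
        if step(n):
            return n
    raise RuntimeError(f"no final answer after {config.max_steps} steps")
```

With the OpenAI Python SDK, the same values would typically be supplied as `OpenAI(api_key=..., base_url=config.base_url)` and `client.chat.completions.create(messages=..., **config.request_kwargs())`.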

Gemini 3.1 Pro Preview was the original target; it was replaced because its thought_signature round-tripping fails on five of eight frameworks under the OpenAI-compat layer (see The thinking-model trap below). Flash is also the model most SaaS teams actually deploy — a working bench on a production model beats a broken bench on a flashier preview.

Scoring

How outputs are evaluated

Hard checks first, then a model-based judge on the runs that survive — invalid runs never reach the rubric.

Programmatic validation — JSON shape, candidate IDs present in the dataset, justification length cap, no duplicates. A run that fails any of these is discarded before judging.
LLM-as-judge — Gemini scores each surviving run on four criteria (relevance, score coherence, justification quality, format) and the criteria are aggregated into a /20 score.
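One plausible way to get from four criteria to a /20 score is to grade each criterion out of 5 and sum them. That per-criterion scale is an assumption made here for illustration; the bench's actual rubric weights are not stated. The criterion keys are likewise hypothetical spellings of the four names above.

```python
CRITERIA = ("relevance", "score_coherence", "justification_quality", "format")
MAX_PER_CRITERION = 5  # assumption: 4 criteria x 5 points = 20


def aggregate(scores: dict) -> int:
    """Sum the four per-criterion scores into a single /20 score."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for c in CRITERIA:
        if not 0 <= scores[c] <= MAX_PER_CRITERION:
            raise ValueError(f"{c} must be between 0 and {MAX_PER_CRITERION}")
    return sum(scores[c] for c in CRITERIA)
```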

Caveats

Honest limits readers should hold

Four places where the numbers underdetermine the conclusion. Worth keeping in view while reading the leaderboard.

Self-judging bias

Gemini judges Gemini outputs, and self-judging is reported to over-rate a model's own work by roughly 15–20%. Because every framework here runs the same model, that inflation should apply about equally to all of them, so the /20 scores remain useful for ordering frameworks but not for cross-vendor comparison. Swapping in GPT-5 or Claude as judge in a future run would quantify the delta.

Per-framework metric availability

Not every SDK reports every metric the same way. CrewAI's tool_calls read as zero until its step_callback was wired in by hand. If a future SDK release changes that callback's shape, the field will revert to null rather than report a misleading zero.

Sample size

The headline run is 30 trials per framework. That's enough to stabilize p50; p95 confidence intervals widen meaningfully below 100 trials, so single-decimal differences shouldn't drive a decision.
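To see how loose a p95 is at this sample size, a percentile bootstrap is one option. The sketch below uses only the standard library; the nearest-rank percentile definition and the function names are choices made here, not necessarily what the bench uses.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_ci(values, q=95, resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the percentile each time.
    stats = sorted(
        percentile(rng.choices(values, k=n), q) for _ in range(resamples)
    )
    lo = stats[int(alpha / 2 * resamples)]
    hi = stats[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```

With 30 trials, the nearest-rank p95 is the 29th ordered value, so it is set entirely by the two slowest runs. Calling `bootstrap_ci(latencies)` on a framework's 30 latencies and comparing the interval width against the gap between two frameworks is a quick check on whether that gap means anything.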

The hidden max_tokens tax on Gemini 2.5 Flash

On the OpenAI-compatible endpoint, gemini-2.5-flash silently consumes part of the max_tokens budget on internal reasoning even though it is documented as a non-thinking model. The first judge run failed on every valid output (220 of 220): with max_tokens=512 the response was systematically truncated to ~20 tokens (finish_reason=length), producing valid-looking but incomplete JSON that broke at a key boundary. Bumping the budget to 4096 fixed it. If this bench were running on a credit-card account instead of an enterprise quota, the silent truncation would have looked like a budget save right up until someone tried to read the scores.
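A guard along these lines would have surfaced the truncation on the first response instead of at scoring time. It is a sketch that assumes the judge reply is available as a plain dict shaped like one element of `choices` in an OpenAI-style chat completion; `TruncatedResponse` and `parse_judge_reply` are hypothetical names.

```python
import json


class TruncatedResponse(Exception):
    """Raised when the model stopped because it hit the token budget."""


def parse_judge_reply(choice: dict) -> dict:
    """Parse the judge's JSON, refusing any reply cut off by max_tokens."""
    finish_reason = choice.get("finish_reason")
    content = (choice.get("message") or {}).get("content") or ""
    if finish_reason == "length":
        # Fail loudly: a truncated reply can look like JSON until it is parsed.
        raise TruncatedResponse(
            f"reply cut off after {len(content)} characters; raise max_tokens"
        )
    return json.loads(content)
```

Checking `finish_reason` before parsing keeps a budget problem from being misread as a malformed-output problem.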

The thinking-model trap

Why `gemini-2.5-flash`, not 3.x

The thought_signature round-trip is the single largest cross-framework gotcha this bench surfaced — and the reason for the model choice above.

Gemini 3.x and 2.5 Pro are thinking models: they generate internal reasoning before each function call. Google's API attaches an opaque thought_signature to every function call these models return, and expects that signature back when the agent re-sends the conversation history on the next turn.

Frameworks that pass the response message through verbatim (baseline-python, baseline-typescript, Mastra, Vercel AI SDK) preserve the signature. Frameworks that rebuild messages into a "clean" provider-agnostic shape (LangGraph, PydanticAI) silently strip it. Google then rejects the next request with 400: Function call is missing a thought_signature in functionCall parts.
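The difference between the two behaviors fits in a few lines. The sketch below works on plain dicts; the location of the signature (nested under `extra_content` on each tool call) is an assumed layout for illustration and should be checked against Google's OpenAI-compatibility documentation. The function names are hypothetical.

```python
import copy


def passthrough(assistant_message: dict) -> dict:
    """Re-send the assistant message exactly as received."""
    return copy.deepcopy(assistant_message)


def rebuild(assistant_message: dict) -> dict:
    """Normalize into a provider-agnostic shape, keeping only standard fields."""
    return {
        "role": assistant_message["role"],
        "content": assistant_message.get("content"),
        "tool_calls": [
            {
                "id": call["id"],
                "type": "function",
                "function": {
                    "name": call["function"]["name"],
                    "arguments": call["function"]["arguments"],
                },
            }
            for call in assistant_message.get("tool_calls", [])
        ],
    }


def signatures(message: dict) -> list:
    """Collect any thought signatures still attached to the tool calls."""
    found = []
    for call in message.get("tool_calls", []):
        # Assumed layout: extra_content.google.thought_signature
        sig = call.get("extra_content", {}).get("google", {}).get("thought_signature")
        if sig is not None:
            found.append(sig)
    return found
```

Nothing in `rebuild` is wrong by OpenAI's schema; it simply has no slot for a field it does not know about, which is why the loss is silent.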

Empirical breakage rates from this bench

gemini-3.1-pro-preview — breaks 5/8 frameworks
gemini-3-flash-preview — breaks 2/8 (LangGraph, PydanticAI)
gemini-2.5-flash — works on 8/8 (not a thinking model, so there is no signature to preserve)

The bug doesn't surface in single-framework quickstarts. It only emerges when rebuilt messages meet a thinking model — exactly the shape most production teams arrive at by accident, having picked the framework first and the model second.

Cross-vendor

This is bigger than Gemini

Every thinking-model vendor exposes a version of this bug. The failure mode is what differs.

thought_signature is a Gemini-specific token, but the underlying class of bug — frameworks normalizing away vendor-specific reasoning artifacts — is universal across thinking models. The artifact and the symptom change; the root cause does not.

Vendor / model	Reasoning artifact	Failure when stripped
Gemini 2.5 Pro / 3.x	`thought_signature`	HTTP 400
Claude (extended thinking)	`signature` on `thinking` blocks	HTTP 400
OpenAI o1 / o3 / GPT-5 thinking	`previous_response_id` / `reasoning.encrypted_content`	Re-thinks · extra tokens
DeepSeek R1, Qwen QwQ	Inline `<think>...</think>` tags	Reasoning truncated

Anthropic and Google fail loudly: HTTP 400, you find out immediately. OpenAI and the open-weight thinking models fail quietly: extra tokens, longer responses, no error to catch. The first kind shows up in CI; the second only shows up on the credit-card statement.

Per-framework

How each framework should actually handle thinking models

Vendor-native paths preserve thinking artifacts; the OpenAI-compat path used here for uniformity does not guarantee it.

For transport parity, every framework in this bench is routed through Google's OpenAI-compatible endpoint. That single choice is what surfaces the thought_signature bug — it is not a verdict on the frameworks themselves. Each one ships a vendor-native path that handles thinking models correctly.

Framework	Native Gemini path (for thinking models)	OpenAI-compat path (used by this bench)
LangGraph	`langchain-google-genai`	`langchain-openai + base_url`
PydanticAI	`Agent('google-gla:gemini-3...')`	`OpenAIChatModel + base_url`
CrewAI	`LLM(model='gemini/gemini-3...')`	`LLM(model='openai/...', api_base=...)`
Google ADK	`LlmAgent(model='gemini-3...')`	`LlmAgent(model=LiteLlm(...))`
Mastra	`@ai-sdk/google`	`@ai-sdk/openai-compatible`
Vercel AI SDK	`@ai-sdk/google`	`@ai-sdk/openai-compatible`

The trade-off this bench made explicitly

Transport uniformity (one SDK family, one endpoint) at the cost of vendor-specific features. Picking a non-thinking model (gemini-2.5-flash) keeps that trade-off from penalizing any single framework.

The trade-off your team likely makes implicitly

Routing through a gateway like OpenRouter or LiteLLM-as-proxy "for simplicity" silently drops vendor-specific features. If the framework is locked in (LangGraph, PydanticAI) and thinking models are required, the vendor-native path becomes mandatory — which means more SDKs in the dependency graph and more drift between framework adapters.

Other caveats

What the headline numbers hide

Three findings from the 240-run dataset that deserve to be read alongside the leaderboard, not after it.

Aggregated metrics flatten the shape of each framework's behavior. The three patterns below are the ones most likely to mislead a reader skimming p50 latency and success rate alone.

Vercel AI SDK is the cost outlier — by design

At 1,605 mean input tokens and $0.0060 per run, the Vercel AI SDK runs at roughly a third of the bench median ($0.018). Its ToolLoopAgent handles context differently from the other adapters — fewer tokens replayed each step, not a different model. Worth confirming the behavior matches expectations before reading the cost number as a free win.

CrewAI re-injects its DSL on every step

CrewAI sits at the opposite extreme: 42,785 mean input tokens and $0.1072 per run, roughly 6× the bench median ($0.018). The DSL re-serializes the agent and task configuration into the prompt on every step, which is invisible from the leaderboard but dominates the cost column.
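A toy model shows why per-step replay dominates input cost. It assumes every step re-sends a fixed prefix plus all history accumulated so far; the function and its parameters are illustrative, and no figure produced by it is a measurement from this bench.

```python
def total_input_tokens(steps: int, fixed_prefix: int, added_per_step: int) -> int:
    """Input tokens billed across a run when each step replays prefix + history."""
    total = 0
    history = 0
    for _ in range(steps):
        total += fixed_prefix + history  # what this step sends
        history += added_per_step        # tool call and result appended afterwards
    return total
```

The prefix term grows linearly with the step count and the history term grows quadratically, so a large re-serialized prefix or a long tool loop both show up in the input-token column long before they show up in latency.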

Google ADK has no client-side step timeout

ADK's p95 latency is 471.8s — one or two trials stalled for nearly eight minutes against a p50 of 19.9s. The event loop has no client-side cap, so a slow tool call or a stuck step blocks until the upstream gives up. In production, that translates to request handlers held open well beyond any reasonable SLO unless the host enforces its own timeout.
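A host-side cap can be as small as the sketch below, which wraps one agent run in `asyncio.wait_for`. `run_with_timeout` and `make_run` are hypothetical names; the real adapter entry point would go in place of `make_run`.

```python
import asyncio


async def run_with_timeout(make_run, timeout_s: float):
    """Await one agent run, giving up after `timeout_s` seconds.

    `make_run` is a zero-argument callable returning a coroutine
    (a stand-in for the adapter's entry point). Returns None on timeout.
    """
    try:
        return await asyncio.wait_for(make_run(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return None
```

This only interrupts code that yields to the event loop. A tool that blocks synchronously would need its own timeout or would have to run in a separate thread or process.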