Findings

Four things the leaderboard hides.

Aggregated p50 and success-rate columns flatten the more interesting shape of the dataset. These four sections walk through patterns the headline numbers obscure — token cost spread, the real efficiency frontier, p95 stalls, and which job categories break which frameworks.

Finding 1 · Token cost

The same task consumes up to 27× more input tokens depending on the framework.

crewai averages 42,785 input tokens per run. vercel-ai-sdk averages 1,605. Same model, same prompt, same tools.

The variance lives in how each framework rebuilds the prompt at every step of the tool-calling loop. A framework that re-injects its own scaffolding (agent backstory, task description, ReAct preamble, verbose tool descriptions) on each turn pays for that choice every step.

Why crewai runs at 27× the lean baseline

CrewAI's DSL re-serializes the agent persona and the task specification into the prompt on every step. Twelve steps × ~3,000 tokens of framework boilerplate ≈ 36,000 tokens of repeated scaffolding, on top of the ~6,000 tokens of actual conversation state. The cost shows up in the bill, not the leaderboard.
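The arithmetic above can be checked with a short sketch. The linear cost model (a fixed amount of scaffolding re-sent on every step, plus the conversation state) is a simplifying assumption made here for illustration, and the function name `input_tokens` is hypothetical. Only the step count, the per-step scaffolding, the conversation size, and the two measured averages come from the text.

```python
def input_tokens(steps: int, scaffold_per_step: int, conversation_tokens: int) -> int:
    """Total input tokens when the same scaffolding is re-sent on every step."""
    return steps * scaffold_per_step + conversation_tokens


# Round figures quoted in the paragraph above.
estimate = input_tokens(steps=12, scaffold_per_step=3_000, conversation_tokens=6_000)

# Measured per-run averages quoted in Finding 1.
crewai_avg = 42_785
vercel_avg = 1_605

ratio = crewai_avg / vercel_avg
scaffold_share = (12 * 3_000) / estimate  # fraction of the estimate that is boilerplate

print(estimate)                  # 42000
print(round(ratio, 1))           # 26.7
print(round(scaffold_share, 2))  # 0.86
```

The estimate of 42,000 lands close to the measured 42,785, and under this model roughly six of every seven input tokens are repeated scaffolding.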

Why vercel-ai-sdk runs at 1×

Vercel's ToolLoopAgent keeps a single message thread, ships compact JSON-schema tool descriptions, and trusts the model to manage its own internal reasoning. No "Thought / Action / Observation" preamble, no per-step persona reset. The model gets the conversation as-is and replies.

Finding 2 · Efficiency frontier

vercel-ai-sdk delivers 20× the NDCG@3 per dollar of crewai.

Cost and quality are each misleading on their own. NDCG@3 per dollar is the metric a buyer should care about, and it crowns a different framework than either column does individually.

The efficiency frontier — NDCG@3 per dollar — surfaces which framework gives the best retrieval quality for the money spent. vercel-ai-sdk leads on quality-per-dollar, while crewai sits at the bottom of the frontier.
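A minimal sketch of how an NDCG@3-per-dollar figure could be computed. It assumes the common linear-gain, log2-discount definition of NDCG and builds the ideal ranking from the same relevance list; the benchmark's exact variant is not stated in the text, so treat this as one plausible reading. The relevance labels and the cost below are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 3) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def ndcg_per_dollar(ndcg: float, cost_usd: float) -> float:
    """Efficiency axis: retrieval quality bought per dollar spent."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return ndcg / cost_usd


# Hypothetical run: relevance of the top three results, in ranked order.
ranked_relevances = [1, 0, 1]
quality = ndcg_at_k(ranked_relevances, k=3)
efficiency = ndcg_per_dollar(quality, cost_usd=0.25)  # hypothetical cost in dollars
print(f"NDCG@3={quality:.3f}, NDCG@3 per dollar={efficiency:.2f}")
```

Dividing by cost means a framework can rank near the top on raw NDCG@3 and still fall to the bottom of the frontier if each run is expensive.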

The Pareto-dominated framework

crewai sits at 5.5804 NDCG/$, the bottom of the efficiency axis. It costs more than the average framework without a commensurate NDCG@3 gain. Its token spend doesn't translate to better retrieval quality; it's pure framework overhead billed as if it were capability.

Finding 3 · Outliers

google-adk's p95 latency is 24× its p50.

google-adk takes 19.9s on a typical trial (p50) and 471.8s at its 95th percentile. One trial reached 752s, over twelve minutes on a task that usually takes twenty seconds.

Quick refresher: p50 and p95

Sort the latencies of all trials from fastest to slowest. The p50 (median) is the middle value: half the trials are faster, half are slower. It describes what a typical user experiences. The p95 is the value below which 95% of trials fall — meaning 1 trial in 20 is slower. It describes the worst case that real users will still hit regularly.

A framework with a good p50 but a bad p95 looks fast on average and intermittently freezes. A framework with both numbers close together is predictable. The gap between them, how far p95 sits from p50, is a more actionable signal than either column alone.
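The definitions above can be sketched with a nearest-rank percentile. This is one of several common percentile conventions; the leaderboard's exact method is not stated, so the choice here is an assumption. The latency list is hypothetical and only shaped to resemble a mostly fast framework with two stalls.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Smallest value such that at least pct% of the values are <= it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 18 ordinary trials and 2 stalls.
latencies = [15, 16, 17, 18, 19, 19, 20, 20, 20, 20,
             21, 21, 22, 22, 23, 24, 25, 26, 470, 750]

p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)

print(p50)        # 20
print(p95)        # 470
print(p95 / p50)  # 23.5
```

With 20 trials, the p95 is the 19th value in sorted order, so a single stall does not move it, but two stalls do.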

The strip plot above shows every valid trial as a dot. Most frameworks cluster tightly between 10s and 50s. Two — Google ADK and PydanticAI — have outliers past the 100-second mark, drawn in red. These aren't "the framework is slow"; they're "one trial stuck for ten minutes against a task that the same framework handles in twenty seconds the other 28 times".

Why this matters in production

A user-facing agent product needs a client-side step timeout. A framework that doesn't expose one — or that swallows the timeout configuration — turns a latent upstream stall into a stuck request. The p95 column on the leaderboard tells you to set the timeout; the p50 column lies about how often you'll need it.
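A minimal sketch of a client-side step timeout, assuming the agent step can be awaited as a coroutine. The names `call_with_timeout`, `fast_step`, and `stalled_step` are hypothetical stand-ins and do not correspond to any framework's API; the only real library call is `asyncio.wait_for` from the Python standard library.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def call_with_timeout(step: Callable[[], Awaitable[str]],
                            timeout_s: float) -> Optional[str]:
    """Run one agent step; return None if it takes longer than timeout_s."""
    try:
        return await asyncio.wait_for(step(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the stalled step before raising.
        return None


async def fast_step() -> str:
    await asyncio.sleep(0.01)  # stands in for a normal model or tool call
    return "ok"


async def stalled_step() -> str:
    await asyncio.sleep(10)  # stands in for an upstream stall
    return "late"


async def main() -> tuple:
    fast = await call_with_timeout(fast_step, timeout_s=0.5)
    stalled = await call_with_timeout(stalled_step, timeout_s=0.05)
    return fast, stalled


results = asyncio.run(main())
print(results)  # ('ok', None)
```

Returning `None` on timeout is a design choice for the sketch; a production caller might retry, surface an error to the user, or fall back to a cheaper path instead.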

Finding 4 · Job-level breakdown

Failures cluster on specific job × framework pairs.

The success-rate column averages over ten job categories. Some frameworks are uniformly reliable; others fail consistently on specific shapes of input.

Framework	#001 Senior Backend	#002 Mid Backend	#003 Senior FE	#004 Senior Full-Stack	#005 ML Engineer	#006 DevOps / SRE	#007 Senior iOS	#008 Product Designer	#009 Staff FS	#010 Junior Backend
baseline-python	3/3	1/3	1/3	3/3	2/3	2/3	2/3	3/3	3/3	3/3
baseline-typescript	3/3	3/3	3/3	3/3	2/3	3/3	3/3	3/3	3/3	3/3
crewai	3/3	2/3	2/3	2/3	3/3	3/3	3/3	3/3	2/3	3/3
google-adk	3/3	2/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3
langgraph	3/3	3/3	3/3	3/3	1/3	3/3	3/3	3/3	3/3	2/3
mastra	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3
pydantic-ai	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	2/3
vercel-ai-sdk	3/3	3/3	3/3	3/3	3/3	0/3	3/3	3/3	3/3	3/3

Legend: valid trials per cell, from 3/3 (green) through 2/3 and 1/3 to 0/3 (red).

Each cell is a framework × job combination, three trials deep. Green is full pass, red is full fail. The pattern surfaces framework-specific weaknesses that the aggregate success rate obscures.

Patterns worth reading the table for

Vercel AI SDK scores 0/3 on the DevOps / SRE job (#006), the only red cell in its row, despite 100% on every other job. A specific class of input (long location string + many remote-related keywords) defeats its lean context handling.
Baseline-python flakes on jobs 002, 003, 005, 006, and 007 — without framework discipline, the manual loop hits its step ceiling on tasks that demand more exploration.
job-001 (the simple Senior Python Backend) is 100% across the board — use it as your "smoke test" job before committing to a 30-trial run.
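The patterns above can be recovered mechanically from the table. The sketch below copies the table's cells as strings and lists, per framework, the jobs that fell short of a full pass, plus the jobs every framework passed. The helper names (`parse_cell`, `weak_pairs`, `smoke_test_jobs`) are hypothetical.

```python
from typing import Dict, List, Tuple

JOBS = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]

# Cells copied from the table above, one string per framework row.
RESULTS: Dict[str, str] = {
    "baseline-python":     "3/3 1/3 1/3 3/3 2/3 2/3 2/3 3/3 3/3 3/3",
    "baseline-typescript": "3/3 3/3 3/3 3/3 2/3 3/3 3/3 3/3 3/3 3/3",
    "crewai":              "3/3 2/3 2/3 2/3 3/3 3/3 3/3 3/3 2/3 3/3",
    "google-adk":          "3/3 2/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3",
    "langgraph":           "3/3 3/3 3/3 3/3 1/3 3/3 3/3 3/3 3/3 2/3",
    "mastra":              "3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3",
    "pydantic-ai":         "3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3 3/3 2/3",
    "vercel-ai-sdk":       "3/3 3/3 3/3 3/3 3/3 0/3 3/3 3/3 3/3 3/3",
}


def parse_cell(cell: str) -> Tuple[int, int]:
    """Turn '2/3' into (2, 3)."""
    passed, total = cell.split("/")
    return int(passed), int(total)


def weak_pairs(results: Dict[str, str], jobs: List[str]) -> Dict[str, List[str]]:
    """Map each framework to the jobs where it did not pass every trial."""
    weak: Dict[str, List[str]] = {}
    for framework, row in results.items():
        failing = []
        for job, cell in zip(jobs, row.split()):
            passed, total = parse_cell(cell)
            if passed < total:
                failing.append(job)
        if failing:
            weak[framework] = failing
    return weak


def smoke_test_jobs(results: Dict[str, str], jobs: List[str]) -> List[str]:
    """Jobs that every framework passed on every trial."""
    rows = [row.split() for row in results.values()]
    return [job for i, job in enumerate(jobs)
            if all(parse_cell(r[i])[0] == parse_cell(r[i])[1] for r in rows)]


weak = weak_pairs(RESULTS, JOBS)
for framework, failing in weak.items():
    print(f"{framework}: {', '.join(failing)}")
print("smoke-test candidates:", smoke_test_jobs(RESULTS, JOBS))
```

Frameworks with a clean row, such as mastra, are omitted from the weak-pair listing, which keeps the output focused on the cells worth investigating.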