Four things the leaderboard hides.
Aggregated p50 and success-rate columns flatten the more interesting shape of the dataset. These four sections walk through patterns the headline numbers obscure — token cost spread, the real efficiency frontier, p95 stalls, and which job categories break which frameworks.
The same task runs on 27× more input tokens depending on the framework.
crewai averages 42,785 input tokens per run. vercel-ai-sdk averages 1,605. Same model, same prompt, same tools.
The variance lives in how each framework rebuilds the prompt at every step of the tool-calling loop. A framework that re-injects its own scaffolding (agent backstory, task description, ReAct preamble, verbose tool descriptions) on each turn pays for that choice every step.
CrewAI's DSL re-serializes the agent persona and the task specification into the prompt on every step. Twelve steps × ~3,000 tokens of framework boilerplate ≈ 36,000 tokens of repeated scaffolding, on top of the ~6,000 tokens of actual conversation state. The cost shows up in the bill, not the leaderboard.
Vercel's ToolLoopAgent keeps a single message thread, ships compact JSON-schema tool descriptions, and trusts the model to manage its own internal reasoning. No "Thought / Action / Observation" preamble, no per-step persona reset. The model gets the conversation as-is and replies.
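The CrewAI arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a measurement: the function name and its parameters are hypothetical, and the step count and per-step scaffold size are the rough figures quoted in the text.

```python
def estimated_input_tokens(steps: int, scaffold_per_step: int, conversation: int) -> int:
    """Rough input-token total when scaffolding is re-sent on every step.

    Hypothetical helper: models only the 'steps x boilerplate + conversation'
    estimate from the text, not any framework's real prompt builder.
    """
    return steps * scaffold_per_step + conversation


# Figures quoted in the text: 12 steps, ~3,000 scaffold tokens, ~6,000 conversation tokens.
crewai_estimate = estimated_input_tokens(steps=12, scaffold_per_step=3_000, conversation=6_000)
print(crewai_estimate)  # 42000, close to the measured 42,785 average

# Ratio of the two measured averages from the text.
ratio = 42_785 / 1_605
print(round(ratio, 1))  # 26.7, which rounds to the headline 27x
```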
vercel-ai-sdk delivers 20× the NDCG@3 per dollar of crewai.
Taken separately, cost and quality are misleading axes. NDCG@3 divided by cost is the metric a buyer should care about, and it crowns a different framework than either column does individually.
The efficiency frontier — NDCG@3 per dollar — surfaces which framework gives the best retrieval quality for the money spent. vercel-ai-sdk leads on quality-per-dollar, while crewai sits at the bottom of the frontier.
crewai sits at 5.5804 NDCG/$, the framework that loses on the efficiency axis: its cost is higher than average without a commensurate NDCG@3 gain. Its token spend doesn't translate to better retrieval quality; it's pure framework overhead billed as if it were capability.
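For readers who want to reproduce the metric, here is a minimal sketch of NDCG@3 per dollar. It assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1) and that the ideal ordering is built from the same relevance list; the benchmark may use a different variant. The function names and the example inputs are hypothetical, not values from the dataset.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # enumerate() is 0-based, so rank 1 gets a discount of log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 3) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def ndcg_per_dollar(ndcg: float, cost_usd: float) -> float:
    """Quality per dollar: the efficiency axis discussed above."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return ndcg / cost_usd


# Made-up relevance grades for one ranked result list (illustration only).
score = ndcg_at_k([3, 2, 1], k=3)
print(score)  # 1.0, because the list is already in ideal order
```

Dividing by cost rewards a framework only when extra spend buys extra ranking quality, which is why heavy scaffolding with no quality gain sinks on this axis.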
google-adk's p95 latency is 24× its p50.
google-adk runs 19.9s on a typical trial and 471.8s on its 95th percentile. One trial reached 752s — over twelve minutes on a task that usually takes twenty seconds.
Sort the latencies of all trials from fastest to slowest. The p50 (median) is the middle value: half the trials are faster, half are slower. It describes what a typical user experiences. The p95 is the value below which 95% of trials fall, meaning 1 trial in 20 is slower. It describes the slow tail that real users will still hit regularly, not the absolute worst case.
A framework with a good p50 but a bad p95 looks fast on average and intermittently freezes. A framework with both numbers close together is predictable. The gap between them, how far p95 is from p50, is a more actionable signal than either column alone.
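The sort-and-pick procedure described above can be sketched as follows. This uses the nearest-rank definition of a percentile, which is an assumption: the leaderboard may interpolate between neighbouring values instead. The helper names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer arithmetic first, so the rank is exact for whole-number percentiles.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(latencies: Sequence[float]) -> float:
    """How many times slower the p95 trial is than the p50 trial."""
    return percentile(latencies, 95) / percentile(latencies, 50)


# The two google-adk figures quoted in the text give the headline ratio directly.
print(round(471.8 / 19.9, 1))  # 23.7, reported as 24x
```

A ratio near 1 means the framework is predictable; a large ratio means a fast median is hiding intermittent stalls.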
The strip plot above shows every valid trial as a dot. Most frameworks cluster tightly between 10s and 50s. Two — Google ADK and PydanticAI — have outliers past the 100-second mark, drawn in red. These aren't "the framework is slow"; they're "one trial stuck for ten minutes against a task that the same framework handles in twenty seconds the other 28 times".
A user-facing agent product needs a client-side step timeout. A framework that doesn't expose one — or that swallows the timeout configuration — turns a latent upstream stall into a stuck request. The p95 column on the leaderboard tells you to set the timeout; the p50 column lies about how often you'll need it.
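One way to enforce a client-side step timeout yourself, when the framework does not expose one, is to wrap each step in `asyncio.wait_for`. This is a minimal sketch under assumptions: `run_step_with_timeout` and `StepTimeout` are hypothetical names, and the approach only works if the framework's step is an awaitable that responds to cancellation. A blocking synchronous call would need a different mechanism.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class StepTimeout(Exception):
    """Raised when a single agent step exceeds its time budget (hypothetical)."""


async def run_step_with_timeout(step: Callable[[], Awaitable[T]], timeout_s: float) -> T:
    """Await one step, cancelling it if it runs longer than timeout_s seconds."""
    try:
        return await asyncio.wait_for(step(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        # Surface a domain-specific error so the caller can retry or fail fast.
        raise StepTimeout(f"step exceeded {timeout_s}s") from exc


async def _stalled_step() -> str:
    # Stand-in for an upstream call that hangs; not a real framework API.
    await asyncio.sleep(5)
    return "never returned in time"


if __name__ == "__main__":
    try:
        asyncio.run(run_step_with_timeout(_stalled_step, timeout_s=0.05))
    except StepTimeout as err:
        print(f"timed out: {err}")
```

The p95 figure is a reasonable starting point for choosing `timeout_s`: set it somewhat above the typical step time and well below the observed stall durations.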
Failures cluster on specific job × framework pairs.
The success-rate column averages over ten job categories. Some frameworks are uniformly reliable; others fail consistently on specific shapes of input.
| Framework | #001 Senior Backend | #002 Mid Backend | #003 Senior FE | #004 Senior Full-Stack | #005 ML Engineer | #006 DevOps / SRE | #007 Senior iOS | #008 Product Designer | #009 Staff FS | #010 Junior Backend |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline-python | 3/3 | 1/3 | 1/3 | 3/3 | 2/3 | 2/3 | 2/3 | 3/3 | 3/3 | 3/3 |
| baseline-typescript | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| crewai | 3/3 | 2/3 | 2/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 |
| google-adk | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| langgraph | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 |
| mastra | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| pydantic-ai | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 |
| vercel-ai-sdk | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 |
Each cell is a framework × job combination, three trials deep. Green is full pass, red is full fail. The pattern surfaces framework-specific weaknesses that the aggregate success rate obscures.
- Vercel AI SDK fails 0/3 on the DevOps / SRE job — the only red cell in its row, despite 100% on everything else. A specific class of input (long location string + many remote-related keywords) defeats its lean context handling.
- Baseline-python flakes on jobs 002, 003, 005, 006, and 007 — without framework discipline, the manual loop hits its step ceiling on tasks that demand more exploration.
- job-001 (the simple Senior Python Backend) is 100% across the board — use it as your "smoke test" job before committing to a 30-trial run.
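The same reading of the grid can be done programmatically. The sketch below transcribes the pass counts from the table above (out of three trials per cell) and picks out full-fail cells, flaky cells, and jobs that every framework passes. The variable and function names are hypothetical; only the numbers come from the table.

```python
JOBS = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]
TRIALS = 3  # trials per framework x job cell

# Pass counts per job, transcribed from the table above.
PASSES = {
    "baseline-python":     [3, 1, 1, 3, 2, 2, 2, 3, 3, 3],
    "baseline-typescript": [3, 3, 3, 3, 2, 3, 3, 3, 3, 3],
    "crewai":              [3, 2, 2, 2, 3, 3, 3, 3, 2, 3],
    "google-adk":          [3, 2, 3, 3, 3, 3, 3, 3, 3, 3],
    "langgraph":           [3, 3, 3, 3, 1, 3, 3, 3, 3, 2],
    "mastra":              [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "pydantic-ai":         [3, 3, 3, 3, 3, 3, 3, 3, 3, 2],
    "vercel-ai-sdk":       [3, 3, 3, 3, 3, 0, 3, 3, 3, 3],
}


def cells_where(predicate):
    """Return (framework, job) pairs whose pass count satisfies the predicate."""
    return [
        (framework, job)
        for framework, row in PASSES.items()
        for job, passed in zip(JOBS, row)
        if predicate(passed)
    ]


full_fail = cells_where(lambda passed: passed == 0)
flaky = cells_where(lambda passed: 0 < passed < TRIALS)

# Jobs that every framework passes on every trial: candidates for a smoke test.
smoke_jobs = [
    job for i, job in enumerate(JOBS)
    if all(row[i] == TRIALS for row in PASSES.values())
]

print(full_fail)   # [('vercel-ai-sdk', '006')]
print(smoke_jobs)  # ['001', '008']
```

Keeping full failures separate from flaky cells matters: a 0/3 cell points to a reproducible input-shape problem, while a 1/3 or 2/3 cell may be step-ceiling or sampling noise that needs more trials to confirm.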