Open benchmark · 8 frameworks

Same model. Same task. Which framework wins?

A controlled comparison of 8 LLM agent frameworks running the same candidate-job matching task on gemini-2.5-flash with the same 4 tools. Only the framework varies.

Frameworks tested: 8
Trials per framework: 30 (240 runs total)
Model: gemini-2.5-flash
Generated: May 11, 2026, 12:00 AM UTC

Leaderboard

Frameworks ranked by NDCG@3

NDCG@3 is the standard information-retrieval ranking score (0–1, higher = better): how well the agent's top-3 picks match the rule-based gold top-3. Hit@1 is the % of trials where the agent's #1 pick is at least "Relevant". Frameworks with no valid runs sort last. Pricing: $2/M input · $12/M output tokens.
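As a rough illustration of the two metrics defined above, here is a minimal Python sketch. It assumes graded relevance labels mapped to integers, linear gains, and a log2 position discount; the benchmark's actual label scale and gain function are not given here, so the `relevance` mapping, the grade values, and the function names are all hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: linear gains, log2 position discount (assumed)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k=3):
    """NDCG@k: DCG of the agent's top-k divided by the best achievable DCG.

    ranked_ids: the agent's picks, best first.
    relevance:  hypothetical mapping of item id -> integer grade from the gold rules.
    """
    gains = [relevance.get(item, 0) for item in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def hit_at_1(ranked_ids, relevance, threshold=1):
    """True if the #1 pick has at least the 'Relevant' grade (assumed to be 1)."""
    return bool(ranked_ids) and relevance.get(ranked_ids[0], 0) >= threshold

# Hypothetical grades: 2 = highly relevant, 1 = relevant, 0 = not relevant.
relevance = {"job-a": 2, "job-b": 1, "job-c": 1, "job-d": 0}
perfect = ndcg_at_k(["job-a", "job-b", "job-c"], relevance)
worse = ndcg_at_k(["job-d", "job-b", "job-c"], relevance)
```

A ranking that reproduces the gold order scores 1.0, while putting an irrelevant item first lowers the score and fails Hit@1. Per-framework figures would then be the mean over valid trials.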

#  Lang  Framework            Valid  NDCG@3  Hit@1   p50 (s)  p95 (s)  Tokens (in / out)  Tools  Cost / run  JustifQ /5
1  Py    pydantic-ai          29/30  0.857   100.0%  16.2     31.5     6,149 / 480        8.4    $0.0181     3.41
2  Py    langgraph            27/30  0.823   92.6%   17.1     25.9     5,167 / 502        8.8    $0.0164     3.70
3  TS    vercel-ai-sdk        27/30  0.662   77.8%   21.2     28.4     1,605 / 228        9.1    $0.0060     2.89
4  Py    google-adk           29/30  0.621   72.4%   19.9     471.8    6,128 / 510        9.3    $0.0184     3.41
5  TS    mastra               30/30  0.610   73.3%   21.5     31.9     6,154 / 548        11.2   $0.0189     3.37
6  Py    crewai               26/30  0.598   69.2%   18.7     31.6     42,785 / 1,806     11.8   $0.1072     3.27
7  TS    baseline-typescript  29/30  0.589   69.0%   20.5     32.7     5,897 / 495        9.2    $0.0177     3.00
8  Py    baseline-python      23/30  0.570   65.2%   22.1     54.8     7,027 / 515        9.7    $0.0202     3.09

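
The cost column follows from the stated pricing ($2/M input, $12/M output tokens) applied to the token counts. A minimal sketch of that arithmetic, with hypothetical constant and function names:

```python
# Pricing as stated for this benchmark: USD per million tokens.
INPUT_USD_PER_M = 2.0
OUTPUT_USD_PER_M = 12.0

def cost_per_run(input_tokens, output_tokens):
    """Cost of one run in USD from its input and output token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Mean token counts from the pydantic-ai and crewai rows.
pydantic_ai_cost = round(cost_per_run(6149, 480), 4)   # 0.0181
crewai_cost = round(cost_per_run(42785, 1806), 4)      # 0.1072
```

Because the leaderboard shows rounded mean token counts, recomputing a row this way can differ from the reported cost in the last digit (vercel-ai-sdk comes out at about $0.0059 against a reported $0.0060).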
Trade-offs

Cost, quality, and token mix

Pareto frontier (quality vs cost) and stacked input/output tokens per valid run.

Cost vs NDCG@3

Upper-left is better: higher NDCG@3 per dollar spent.

[Chart legend: baseline-python, baseline-typescript, crewai, google-adk, langgraph, mastra, pydantic-ai, vercel-ai-sdk]
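
To make the Pareto frontier concrete, the sketch below recomputes it from the leaderboard's cost-per-run and NDCG@3 figures. A framework is treated as dominated when another one is at least as cheap and at least as accurate, and strictly better on one of the two. The function name and data layout are illustrative, not taken from the benchmark's code.

```python
# (cost per run in USD, NDCG@3), transcribed from the leaderboard.
RESULTS = {
    "pydantic-ai": (0.0181, 0.857),
    "langgraph": (0.0164, 0.823),
    "vercel-ai-sdk": (0.0060, 0.662),
    "google-adk": (0.0184, 0.621),
    "mastra": (0.0189, 0.610),
    "crewai": (0.1072, 0.598),
    "baseline-typescript": (0.0177, 0.589),
    "baseline-python": (0.0202, 0.570),
}

def pareto_frontier(results):
    """Return the names of non-dominated frameworks, cheapest first."""
    frontier = []
    for name, (cost, score) in results.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, (c, s) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: results[n][0])

frontier = pareto_frontier(RESULTS)
# ['vercel-ai-sdk', 'langgraph', 'pydantic-ai']
```

On these numbers, three frameworks sit on the frontier: vercel-ai-sdk as the cheapest option, pydantic-ai as the most accurate, and langgraph in between. Every other entry is both more expensive and less accurate than langgraph.
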
Tokens (input / output, stacked)

Mean per valid run.