Open benchmark · 8 frameworks
Same model. Same task. Which framework wins?
A controlled comparison of 8 LLM agent frameworks running the same candidate-job matching task on gemini-2.5-flash with the same 4 tools. Only the framework varies.
Frameworks tested: 8
Trials per framework: 30 (240 runs total)
Model: gemini-2.5-flash
Generated: May 11, 2026, 12:00 AM UTC
Leaderboard
Frameworks ranked by NDCG@3
NDCG@3 is the standard information-retrieval ranking score (0–1, higher is better): it measures how well the agent's top-3 picks match the rule-based gold top-3. Hit@1 is the percentage of valid trials in which the agent's #1 pick is rated at least "Relevant". Frameworks with no valid runs sort last. Cost per run is priced at $2/M input tokens and $12/M output tokens.
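As a rough illustration of the two quality metrics described above, here is a minimal sketch in Python. The page does not state the exact gain function or the numeric relevance grades, so the linear gain, the grade values, the job IDs, and the `relevant_grade` threshold below are all assumptions made for the example, not the benchmark's actual scoring code.

```python
import math

def dcg_at_k(relevances, k=3):
    """Discounted cumulative gain over the first k grades (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_ids, gold_relevance, k=3):
    """DCG of the agent's ranking divided by the DCG of the ideal ranking."""
    gains = [gold_relevance.get(item, 0) for item in ranked_ids]
    ideal = sorted(gold_relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def hit_at_1(ranked_ids, gold_relevance, relevant_grade=2):
    """True if the top pick meets the (hypothetical) 'Relevant' grade."""
    return bool(ranked_ids) and gold_relevance.get(ranked_ids[0], 0) >= relevant_grade

# Hypothetical gold grades for one trial: 3 = best match, 0 = not relevant.
gold = {"job_a": 3, "job_b": 2, "job_c": 1, "job_d": 0}

perfect = ndcg_at_k(["job_a", "job_b", "job_c"], gold)   # ideal order
reversed_top3 = ndcg_at_k(["job_c", "job_b", "job_a"], gold)  # same jobs, worst order
```

A ranking that reproduces the gold order scores 1.0; the same three jobs in reverse order score lower, because high-grade items placed further down are discounted by the logarithm of their rank.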
| # | Lang | Framework | Valid | NDCG@3 | Hit@1 | p50 (s) | p95 (s) | Tokens (in / out) | Tools | Cost / run | JustifQ /5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Py | pydantic-ai | 29/30 | 0.857 | 100.0% | 16.2 | 31.5 | 6,149 / 480 | 8.4 | $0.0181 | 3.41 |
| 2 | Py | langgraph | 27/30 | 0.823 | 92.6% | 17.1 | 25.9 | 5,167 / 502 | 8.8 | $0.0164 | 3.70 |
| 3 | TS | vercel-ai-sdk | 27/30 | 0.662 | 77.8% | 21.2 | 28.4 | 1,605 / 228 | 9.1 | $0.0060 | 2.89 |
| 4 | Py | google-adk | 29/30 | 0.621 | 72.4% | 19.9 | 471.8 | 6,128 / 510 | 9.3 | $0.0184 | 3.41 |
| 5 | TS | mastra | 30/30 | 0.610 | 73.3% | 21.5 | 31.9 | 6,154 / 548 | 11.2 | $0.0189 | 3.37 |
| 6 | Py | crewai | 26/30 | 0.598 | 69.2% | 18.7 | 31.6 | 42,785 / 1,806 | 11.8 | $0.1072 | 3.27 |
| 7 | TS | baseline-typescript | 29/30 | 0.589 | 69.0% | 20.5 | 32.7 | 5,897 / 495 | 9.2 | $0.0177 | 3.00 |
| 8 | Py | baseline-python | 23/30 | 0.570 | 65.2% | 22.1 | 54.8 | 7,027 / 515 | 9.7 | $0.0202 | 3.09 |
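The Cost / run column appears to follow directly from the token counts and the stated pricing ($2/M input, $12/M output). The sketch below reproduces two of the table's figures under that assumption; the function and constant names are illustrative, and the page does not say whether cost is computed per run and then averaged or from the averaged token counts, so small rounding differences are possible.

```python
# Pricing as stated on the page, in USD per million tokens.
INPUT_USD_PER_M = 2.0
OUTPUT_USD_PER_M = 12.0

def cost_per_run(input_tokens, output_tokens):
    """Estimated USD cost of one run from its input and output token counts."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

pydantic_ai_cost = cost_per_run(6_149, 480)   # table lists $0.0181
crewai_cost = cost_per_run(42_785, 1_806)     # table lists $0.1072

print(f"{pydantic_ai_cost:.4f}")  # 0.0181
print(f"{crewai_cost:.4f}")       # 0.1072
```

The crewai figure shows where its cost comes from: most of the roughly 6x gap to the other frameworks is input tokens, not output.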
Trade-offs
Cost, quality, and token mix
Pareto frontier (quality vs cost) and stacked input/output tokens per valid run.
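For readers without the chart, a quality-versus-cost Pareto frontier can be derived from the leaderboard's NDCG@3 and Cost / run columns. The sketch below is one common way to compute it (sort by cost, keep each point that improves on the best quality seen so far); it is an assumption about the method, not necessarily how the page's chart was generated.

```python
# (framework, cost per run in USD, NDCG@3), copied from the leaderboard.
RESULTS = [
    ("pydantic-ai", 0.0181, 0.857),
    ("langgraph", 0.0164, 0.823),
    ("vercel-ai-sdk", 0.0060, 0.662),
    ("google-adk", 0.0184, 0.621),
    ("mastra", 0.0189, 0.610),
    ("crewai", 0.1072, 0.598),
    ("baseline-typescript", 0.0177, 0.589),
    ("baseline-python", 0.0202, 0.570),
]

def pareto_frontier(points):
    """Return names of points not dominated on (lower cost, higher quality)."""
    frontier = []
    best_quality = float("-inf")
    # Cheapest first; on a cost tie, the higher-quality point comes first.
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier

frontier = pareto_frontier(RESULTS)
print(frontier)  # ['vercel-ai-sdk', 'langgraph', 'pydantic-ai']
```

On these numbers, every other framework is both more expensive and lower-scoring than langgraph, so only three points remain on the frontier.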