Open benchmark · 8 frameworks

Same model. Same task. Which framework wins?

A controlled comparison of 8 LLM agent frameworks running the same candidate-job matching task on gemini-2.5-flash with the same 4 tools. Only the framework varies.

Frameworks tested

8

Trials per framework

30

240 runs total

Model

gemini-2.5-flash

Generated

May 11, 2026, 12:00 AM UTC

Leaderboard

Frameworks ranked by NDCG@3

NDCG@3 is the standard information-retrieval ranking score (0–1, higher is better): it measures how well the agent's top-3 picks match the rule-based gold top-3. Hit@1 is the percentage of valid trials in which the agent's #1 pick is rated at least "Relevant". Frameworks with no valid runs sort last. Pricing: $2/M input tokens · $12/M output tokens.
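As a rough illustration of the two quality metrics, the sketch below computes NDCG@3 and Hit@1 for a single trial. The numeric relevance grades (0 to 3, with 2 standing in for "Relevant"), the linear gain, and the function names are assumptions made for this example; the benchmark's actual labels and gain function are not specified here.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(predicted_ids, relevance, k=3):
    """NDCG@k for one trial.

    predicted_ids: the agent's ranked picks, best first.
    relevance: hypothetical gold grades per item id (0 = not relevant ... 3 = best).
    """
    gains = [relevance.get(item, 0) for item in predicted_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def hit_at_1(predicted_ids, relevance, threshold=2):
    """True if the top pick's grade reaches the assumed "Relevant" threshold."""
    return bool(predicted_ids) and relevance.get(predicted_ids[0], 0) >= threshold

# Hypothetical gold grades for four candidate jobs.
gold = {"job-a": 3, "job-b": 2, "job-c": 1, "job-d": 0}
perfect = ndcg_at_k(["job-a", "job-b", "job-c"], gold)
reversed_score = ndcg_at_k(["job-d", "job-c", "job-b"], gold)
```

A perfect top-3 scores 1.0, and any ordering that pushes low-grade items to the top scores strictly less. The leaderboard values would then be the mean over each framework's valid trials.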

#	Framework	Lang	Valid	NDCG@3	Hit@1	p50 (s)	p95 (s)	Tokens (in / out)	Tools	Cost / run	JustifQ /5
1	pydantic-ai	Py	29/30	0.857	100.0%	16.2	31.5	6,149 / 480	8.4	$0.0181	3.41
2	langgraph	Py	27/30	0.823	92.6%	17.1	25.9	5,167 / 502	8.8	$0.0164	3.70
3	vercel-ai-sdk	TS	27/30	0.662	77.8%	21.2	28.4	1,605 / 228	9.1	$0.0060	2.89
4	google-adk	Py	29/30	0.621	72.4%	19.9	471.8	6,128 / 510	9.3	$0.0184	3.41
5	mastra	TS	30/30	0.610	73.3%	21.5	31.9	6,154 / 548	11.2	$0.0189	3.37
6	crewai	Py	26/30	0.598	69.2%	18.7	31.6	42,785 / 1,806	11.8	$0.1072	3.27
7	baseline-typescript	TS	29/30	0.589	69.0%	20.5	32.7	5,897 / 495	9.2	$0.0177	3.00
8	baseline-python	Py	23/30	0.570	65.2%	22.1	54.8	7,027 / 515	9.7	$0.0202	3.09
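The Cost / run column appears to follow directly from the mean token counts and the listed pricing ($2/M input, $12/M output). A minimal sketch of that arithmetic, with hypothetical function and constant names:

```python
# Pricing as stated on this page, in USD per million tokens.
INPUT_USD_PER_M = 2.0
OUTPUT_USD_PER_M = 12.0

def cost_per_run(input_tokens, output_tokens):
    """Estimated USD cost of one run from its input and output token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Mean token counts taken from the leaderboard rows.
pydantic_ai_cost = round(cost_per_run(6_149, 480), 4)
langgraph_cost = round(cost_per_run(5_167, 502), 4)
crewai_cost = round(cost_per_run(42_785, 1_806), 4)
```

Recomputing from the rounded means can differ from the table in the last digit for some rows, presumably because the page averages per-run costs rather than costing the averaged token counts.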

Trade-offs

Cost, quality, and token mix

Pareto frontier (quality vs cost) and stacked input/output tokens per valid run.

Cost vs NDCG@3

Upper-left is better: higher NDCG@3 per dollar spent.

Legend: baseline-python, baseline-typescript, crewai, google-adk, langgraph, mastra, pydantic-ai, vercel-ai-sdk
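One way to read the cost-versus-quality chart is to compute the Pareto frontier directly from the leaderboard numbers: a framework is on the frontier if no other framework is both cheaper and better on NDCG@3. The sketch below is an illustrative calculation using the Cost / run and NDCG@3 columns, not the page's own plotting code; the names are hypothetical.

```python
# (cost per run in USD, NDCG@3) copied from the leaderboard table.
RESULTS = {
    "pydantic-ai": (0.0181, 0.857),
    "langgraph": (0.0164, 0.823),
    "vercel-ai-sdk": (0.0060, 0.662),
    "google-adk": (0.0184, 0.621),
    "mastra": (0.0189, 0.610),
    "crewai": (0.1072, 0.598),
    "baseline-typescript": (0.0177, 0.589),
    "baseline-python": (0.0202, 0.570),
}

def pareto_frontier(results):
    """Return frameworks not dominated on (lower cost, higher NDCG@3), cheapest first."""
    # Sort by cost ascending; break cost ties by putting the higher score first.
    ordered = sorted(results.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    frontier = []
    best_score = float("-inf")
    for name, (_cost, score) in ordered:
        # A point is kept only if it beats every cheaper point on quality.
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier

frontier = pareto_frontier(RESULTS)
print(frontier)  # ['vercel-ai-sdk', 'langgraph', 'pydantic-ai']
```

On these numbers, the other five frameworks are each dominated by at least one cheaper, higher-scoring option.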

Tokens (input / output, stacked)

Mean per valid run.