baseline-python
Python · Hand-rolled tool loop
Model: gemini-2.5-flash · Generated: May 11, 2026, 12:00 AM UTC

NDCG@3: 0.570 · 23 scored · 23/30 valid
Hit@1: 65.2% · JustifQ 3.09/5
Latency p50: 22.1s · p95 54.8s
Mean tokens: 7,542 · 7,027 in · 515 out
Cost / run: $0.0202 · 9.7 avg tool calls
All metrics
count_total: 30
count_valid: 23
success_rate: 76.7%
latency_p50: 22.138s
latency_p95: 54.805s
latency_mean: 32.216s
latency_max: 221.598s
mean_input_tokens: 7,027
mean_output_tokens: 515
mean_tool_calls: 9.65
estimated_cost_usd_per_run: $0.020240
mean_ndcg_at_3: 0.570
hit_at_1_rate: 65.2%
mean_precision_at_3: 0.348
mean_recall_at_3: 0.507
n_scored: 23
mean_justification_quality: 3.09/5
mean_judge_score: 13.65/20
judge_n: 23
hit_step_limit_rate: 0.0%
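
Two of the figures above can be cross-checked from the others. The sketch below assumes that success_rate is count_valid divided by count_total and that the "Mean tokens" card is the sum of mean input and output tokens. The report does not state either definition, but both reproduce the displayed values.

```python
# Values copied from the metrics listed above.
count_total = 30
count_valid = 23
mean_input_tokens = 7027
mean_output_tokens = 515

# Assumed definitions (not stated in the report itself).
success_rate = count_valid / count_total
mean_total_tokens = mean_input_tokens + mean_output_tokens

print(f"success_rate = {success_rate:.1%}")
print(f"mean tokens  = {mean_total_tokens:,}")
```

Seven of the 30 runs were not valid, so the quality metrics (NDCG@3, Hit@1, precision, recall) are averages over the 23 scored runs only.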