Batch Benchmarking¶
Benchmark multiple models and generate a ranked leaderboard.
Run a Benchmark¶
```python
from aime_loc import LOC

loc = LOC()
results = loc.benchmark([
    "meta-llama/Llama-4-Scout",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "Qwen/Qwen3.5-35B-A3B",
    "google/gemma-3-12b-it",
])
```
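The ranking step behind a call like this can be sketched in plain Python. This is a sketch only, not the library's implementation: the scores below are invented for illustration (the top entry mirrors the sample leaderboard output later on this page), and in practice `loc.benchmark()` computes them by running each model.

```python
# Invented per-model aggregate scores (TC %); real values come from loc.benchmark().
scores = {
    "meta-llama/Llama-4-Scout": 15.37,
    "google/gemma-3-12b-it": 11.02,
    "mistralai/Mistral-Small-24B-Instruct-2501": 9.48,
}

# Rank models from highest to lowest score, assigning 1-based ranks.
leaderboard = [
    (rank, model, score)
    for rank, (model, score) in enumerate(
        sorted(scores.items(), key=lambda kv: kv[1], reverse=True), start=1
    )
]

for rank, model, score in leaderboard:
    print(f"{rank}. {model}: {score:.2f}")
```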
View Results¶
```python
# Print leaderboard table
results.leaderboard_table()
# | Rank | Model | Size | TC % | Best Function |
# |:---:|-------|------|:---:|:---:|
# | 1 | Llama-4-Scout | 70B | 15.37 | Emotion |
# ...

# Heatmap across all models and functions
results.heatmap()
```
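As a rough sketch of how a markdown table like the sample output above can be assembled (the column names follow the sample; the rows, including the second model's values, are invented):

```python
# Invented leaderboard rows: (rank, model, size, TC %, best function).
rows = [
    (1, "Llama-4-Scout", "70B", 15.37, "Emotion"),
    (2, "gemma-3-12b-it", "12B", 11.02, "Emotion"),
]

header = "| Rank | Model | Size | TC % | Best Function |"
divider = "|:---:|-------|------|:---:|:---:|"

# Render each row into a markdown table line.
lines = [header, divider] + [
    f"| {rank} | {model} | {size} | {tc:.2f} | {best} |"
    for rank, model, size, tc, best in rows
]
table = "\n".join(lines)
print(table)
```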
Access Individual Profiles¶
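The accessor API for per-model profiles is not documented on this page. As a guess only, result objects in similar benchmarking libraries are indexable by model name; the stand-in below uses a plain dict to illustrate that pattern, and both the indexing behavior and the profile fields are assumptions, not the library's confirmed API.

```python
# Stand-in for the object returned by loc.benchmark(); indexing by model name
# and the profile fields shown are assumptions for illustration only.
results = {
    "meta-llama/Llama-4-Scout": {"tc_percent": 15.37, "best_function": "Emotion"},
}

profile = results["meta-llama/Llama-4-Scout"]
print(profile["tc_percent"], profile["best_function"])
```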
Public Leaderboard¶
Access the pre-computed public leaderboard.