# Benchmarking
The `agex.bench` module supports empirical, data-driven agent improvement. Systematic benchmarking is central to building performant agents, especially when engineering primers, where small changes in wording can lead to significant differences in behavior across LLM providers and models.
The vision for this module is to support a robust practice of "primer engineering," enabling users to:
- A/B test primers to find the most effective instructions.
- Develop LLM-specific primers, as the optimal guidance for a GPT model may differ from that of a Gemini or Claude model.
- Detect regressions in agent behavior as the framework or underlying models evolve.
- Contribute to a community-driven understanding of what makes agents effective.
By providing the tools for rigorous evaluation, `agex.bench` aims to make agent development a little less magical.
The initial examples in the `benchmarks/` directory serve as a starting point, and community contributions to help build out a comprehensive suite are highly encouraged.
## Core Concepts

### Trial-Based Evaluation

A `Trial` represents a single test case with:

- Parameters: Input arguments to the task function
- Judge function: Evaluates the actual result
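For example, a minimal trial pairs one task input with one check on the result:

```python
from agex.bench import Trial, params

# A single test case: input arguments plus a judge for the output.
trial = Trial(
    params=params("Calculate 2 + 2"),
    judge=lambda actual: actual == 4,
)
```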
### Judge Functions

Judge functions take the actual result from a task and return a new result for aggregation:

- Pass/Fail: Return `bool` for success rate metrics
- Numeric: Return `float` for average scores
- Custom: Return any type with a matching aggregator

Note: Judge functions can be agent task functions themselves, enabling "agent-as-judge" evaluation patterns where one agent evaluates another's output.
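A sketch of the three styles (the `grader_agent.grade` reference is a hypothetical agent task, not part of `agex.bench`):

```python
# Pass/fail: a bool judge feeds success-rate metrics.
def is_correct(actual) -> bool:
    return actual == 42

# Numeric: a float judge feeds averaged scores.
def overlap_score(actual: str) -> float:
    expected = {"fox", "dog"}
    return len(expected & set(actual.lower().split())) / len(expected)

# Agent-as-judge: any agent task with a compatible signature can serve
# as a judge, so one agent can grade another's output, e.g.:
# judge=grader_agent.grade  # hypothetical grading task
```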
### Metrics
All benchmarks automatically collect:
- Completion rate: Successful vs errored trials
- Performance: Average actions taken and time per trial
- Judge-specific: Pass rates, scores, etc.
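These shared metrics live on the `Stats` base class (see the API reference below) and can be read off any benchmark result:

```python
# `results` is the dict returned by any benchmark_* call.
for task, stats in results.items():
    print(f"Completed: {stats.completed_trials}/{stats.total_trials}")
    print(f"Errored: {stats.errored_trials}")
    print(f"Actions per trial: {stats.actions_per_trial:.1f}")
    print(f"Time per trial: {stats.time_per_trial:.2f}s")
```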
## Quick Start

### Simple Pass/Fail Benchmark
```python
from agex.bench import Trial, benchmark_pass_fail, params

# Define test cases
trials = [
    Trial(
        params=params("Calculate 2 + 2"),
        judge=lambda actual: actual == 4,
    ),
    Trial(
        params=params("Calculate 10 * 5"),
        judge=lambda actual: actual == 50,
    ),
]

# Run benchmark
results = benchmark_pass_fail(
    tasks=[my_agent.solve_math],
    trials=trials,
    max_concurrency=5,
)

# View results
for task, stats in results.items():
    print(f"Pass rate: {stats.pass_rate:.2%}")
    print(f"Average time: {stats.time_per_trial:.2f}s")
```
### Numeric Scoring
```python
from agex.bench import Trial, benchmark_numeric, params

def similarity_scorer(expected_text):
    """Judge factory: build a judge that scores word overlap with the expected text."""
    def judge(actual_text):
        # Simple word-overlap metric
        expected_words = set(expected_text.lower().split())
        actual_words = set(actual_text.lower().split())
        if not expected_words:
            return 1.0 if not actual_words else 0.0
        overlap = expected_words & actual_words
        return len(overlap) / len(expected_words)
    return judge

trials = [
    Trial(
        params=params("Summarize: The quick brown fox jumps over the lazy dog."),
        judge=similarity_scorer("A fox jumps over a dog"),
    ),
]

results = benchmark_numeric(
    tasks=[summarizer_agent.summarize],
    trials=trials,
)

for task, stats in results.items():
    print(f"Average score: {stats.mean_score:.2f}")
    print(f"Score range: {stats.min_score:.2f} - {stats.max_score:.2f}")
```
## API Reference

### Core Functions

#### `benchmark_pass_fail`
```python
benchmark_pass_fail(
    tasks: list[Callable[..., T]],
    trials: list[Trial[T, bool]],
    max_concurrency: int = 1,
) -> dict[Callable[..., T], PassFailStats]
```
Benchmark for pass/fail evaluation with boolean judge functions.
| Parameter | Type | Description |
|---|---|---|
| `tasks` | `list[Callable]` | Task functions to benchmark |
| `trials` | `list[Trial]` | Test cases with boolean judges |
| `max_concurrency` | `int` | Maximum concurrent executions |
#### `benchmark_numeric`
```python
benchmark_numeric(
    tasks: list[Callable[..., T]],
    trials: list[Trial[T, float]],
    max_concurrency: int = 1,
) -> dict[Callable[..., T], NumericStats]
```
Benchmark for numeric evaluation with score-based judge functions.
#### `benchmark_generic`
```python
benchmark_generic(
    tasks: list[Callable[..., T]],
    trials: list[Trial[T, U]],
    agg: Callable[[list[U], Stats], Stats],
    max_concurrency: int = 1,
) -> dict[Callable[..., T], Stats]
```
Generic benchmark with custom aggregation logic. See Custom Aggregators under Advanced Usage for a worked example.
### Data Types

#### `Trial[T, U]`
```python
@dataclass
class Trial[T, U]:
    params: Params           # Input parameters
    judge: Callable[[T], U]  # Judge function
```
#### `Params`
```python
@dataclass
class Params:
    args: tuple[Any, ...]   # Positional arguments
    kwargs: dict[str, Any]  # Keyword arguments

# Convenience constructor
def params(*args, **kwargs) -> Params: ...
```
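For instance, assuming the constructor stores arguments verbatim (the `verbose` keyword here is purely illustrative), `params` captures the call that will later be applied to each task:

```python
from agex.bench import params

p = params("What's the revenue?", verbose=True)
assert p.args == ("What's the revenue?",)
assert p.kwargs == {"verbose": True}
```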
#### `PassFailStats`
```python
@dataclass
class PassFailStats(Stats):
    pass_count: int  # Successful trials
    fail_count: int  # Failed trials

    @property
    def pass_rate(self) -> float: ...  # Success percentage
```
#### `NumericStats`
```python
@dataclass
class NumericStats(Stats):
    mean_score: float   # Average score
    min_score: float    # Minimum score
    max_score: float    # Maximum score
    total_score: float  # Sum of all scores
```
#### `Stats` (Base Class)
```python
@dataclass
class Stats:
    total_trials: int         # Total test cases
    completed_trials: int     # Successfully completed
    errored_trials: int       # Failed with exceptions
    actions_per_trial: float  # Average LLM calls per trial
    time_per_trial: float     # Average execution time per trial
```
## Advanced Usage

### Custom Aggregators
```python
from dataclasses import dataclass

from agex.bench import benchmark_generic, Stats

@dataclass
class CustomStats(Stats):
    word_count: int
    avg_length: float

def custom_aggregator(results: list[str], event_stats: Stats) -> CustomStats:
    """Custom aggregation for string results."""
    return CustomStats(
        **event_stats.__dict__,
        word_count=sum(len(r.split()) for r in results),
        avg_length=sum(len(r) for r in results) / len(results),
    )

results = benchmark_generic(
    tasks=[text_agent.generate],
    trials=trials,
    agg=custom_aggregator,
)
```
### Multi-Agent Comparison
```python
from agex import Agent

# Compare different agent configurations
tasks = [
    Agent(primer="You are concise.").task(my_task_fn),
    Agent(primer="You are detailed.").task(my_task_fn),
    Agent(primer="You are creative.").task(my_task_fn),
]

results = benchmark_pass_fail(
    tasks=tasks,
    trials=problem_trials,
    max_concurrency=3,
)

# Analyze which primer works best
for task, stats in results.items():
    agent_name = task.__self__.name
    print(f"{agent_name}: {stats.pass_rate:.2%} pass rate")
```
### State and Context Testing
```python
from agex import Versioned
from agex.bench import Trial, params

def create_trials_with_state():
    """Generate trials that test stateful interactions."""
    base_state = Versioned({"context": "financial_analysis"})
    return [
        Trial(
            params=params("What's the revenue?", state=base_state),
            judge=lambda actual: "revenue_data" in actual,
        ),
        Trial(
            params=params("Calculate the growth rate", state=base_state),
            judge=lambda actual: "growth_calculation" in actual,
        ),
    ]
```
### Stateful Benchmarks and Concurrency
When designing benchmarks that test stateful interactions (i.e., multiple trials that share the same `Versioned` state object), you must use `max_concurrency=1` (the default).

Using a `max_concurrency` greater than 1 for stateful benchmarks will lead to race conditions and unpredictable results, as concurrent trials will attempt to read from and write to the same state object simultaneously. For stateless trials, concurrency is safe and recommended for performance.
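A sketch of both modes, reusing the examples above (`analyst_agent.answer` is a hypothetical stateful task):

```python
# Stateful: trials share one Versioned object, so run them serially.
stateful_results = benchmark_pass_fail(
    tasks=[analyst_agent.answer],        # hypothetical stateful task
    trials=create_trials_with_state(),   # from the example above
    max_concurrency=1,                   # the default; required here
)

# Stateless: trials are independent, so concurrency is safe and faster.
stateless_results = benchmark_pass_fail(
    tasks=[my_agent.solve_math],
    trials=trials,
    max_concurrency=8,
)
```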
## Example: Complete Benchmark
See `benchmarks/funcy_bench.py` for a complete example that tests function generation capabilities:
"""
Benchmark for examples/funcy.py - Function Generation
Tests agent's ability to generate working Python functions.
"""
def equivalent(expected_fn):
def judge(actual_fn):
test_inputs = range(8)
return all(expected_fn(x) == actual_fn(x) for x in test_inputs)
return judge
def main():
trials = [
Trial(
params=params("a function that checks if a number is even"),
judge=equivalent(lambda x: x % 2 == 0),
),
# ... more trials
]
results = benchmark_pass_fail(
tasks=[fn_builder],
trials=trials,
max_concurrency=5,
)
# Print detailed results...
if __name__ == "__main__":
main()
Run benchmarks with:
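```bash
# Assuming the script's standard __main__ entry point:
python benchmarks/funcy_bench.py
```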