eval_framework.metrics.aggregators package¶
Submodules¶
eval_framework.metrics.aggregators.aggregators module¶
- class eval_framework.metrics.aggregators.aggregators.Aggregator(*args, **kwargs)[source]¶
Bases: Protocol

Base class for metric aggregators.

An aggregator collapses multiple evaluation rows for the same problem (i.e. prompt) into a single score per problem. The input DataFrame has one row per (problem, attempt) pair; the output has one row per problem with a new value.

- Parameters:
  - response_df – DataFrame where each row is one evaluation attempt. Must contain a value column (the per-attempt score) and all identifier_columns.
  - identifier_columns – Columns that uniquely identify a problem (e.g. ["prompt"]). Rows sharing the same identifier are different attempts at the same problem.
- Returns:
  DataFrame with one row per unique problem and a value column holding the aggregated score. All non-identifier, non-value columns are preserved (typically via "first").
Example input (identifier_columns=["prompt"], 3 attempts per problem):

| prompt         | value | subject |
|----------------|-------|---------|
| “What is 2+2?” | 1.0   | algebra |
| “What is 2+2?” | 1.0   | algebra |
| “What is 2+2?” | 0.0   | algebra |
| “Solve x^2=4”  | 0.0   | algebra |
| “Solve x^2=4”  | 1.0   | algebra |
| “Solve x^2=4”  | 0.0   | algebra |
- name: str¶
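The interface described above can be sketched as a `typing.Protocol`. This is an illustration only: the exact call signature in eval_framework is assumed, and `IdentifierMax` is a hypothetical custom aggregator written to show the contract (aggregate `value`, carry other columns through via "first"), not part of the package.

```python
from typing import Protocol

import pandas as pd


class AggregatorProtocol(Protocol):
    """Sketch of the Aggregator interface described above (signature assumed)."""

    name: str

    def __call__(
        self, response_df: pd.DataFrame, identifier_columns: list[str]
    ) -> pd.DataFrame: ...


class IdentifierMax:
    """Hypothetical aggregator: keep the best attempt per problem."""

    name = "identifier_max"

    def __call__(self, response_df, identifier_columns):
        # Aggregate `value` with max; preserve other columns via "first".
        spec = {
            col: ("max" if col == "value" else "first")
            for col in response_df.columns
            if col not in identifier_columns
        }
        return response_df.groupby(identifier_columns, as_index=False).agg(spec)
```

Any object with a matching `name` and `__call__` would satisfy the protocol structurally; no explicit subclassing is required.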
- class eval_framework.metrics.aggregators.aggregators.IdentifierMean[source]¶
Bases: Aggregator

Computes the arithmetic mean of value across attempts per problem.

Example (continuing from the Aggregator docstring example):
“What is 2+2?”: mean(1.0, 1.0, 0.0) = 0.667
“Solve x^2=4”: mean(0.0, 1.0, 0.0) = 0.333

Output:

| prompt         | value | subject |
|----------------|-------|---------|
| “What is 2+2?” | 0.667 | algebra |
| “Solve x^2=4”  | 0.333 | algebra |
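The per-problem means above can be reproduced with a plain pandas groupby. This is a sketch of the described behavior, not the package's actual implementation:

```python
import pandas as pd

# Input from the Aggregator docstring example: 3 attempts per problem.
df = pd.DataFrame({
    "prompt": ["What is 2+2?"] * 3 + ["Solve x^2=4"] * 3,
    "value": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "subject": ["algebra"] * 6,
})

# Mean of `value` per problem; the non-value column is preserved via "first".
out = df.groupby("prompt", as_index=False).agg({"value": "mean", "subject": "first"})
means = out.set_index("prompt")["value"]
print(round(means["What is 2+2?"], 3))  # -> 0.667
print(round(means["Solve x^2=4"], 3))   # -> 0.333
```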
- class eval_framework.metrics.aggregators.aggregators.Identity[source]¶
Bases: object

No-op aggregator that returns the input unchanged.

Use for metrics where each row is already a final score and no cross-attempt aggregation is needed (e.g. when num_samples=1).
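An aggregator of this shape is essentially a one-liner; a minimal sketch (the real class may differ in detail):

```python
class Identity:
    """Return the evaluation rows unchanged (sketch of the described no-op)."""

    name = "identity"

    def __call__(self, response_df, identifier_columns):
        # Each row is already a final score; nothing to collapse.
        return response_df
```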
- class eval_framework.metrics.aggregators.aggregators.PassAtK(k=1)[source]¶
Bases: Aggregator

Computes pass@k: the probability that at least one of k random attempts is correct.

Groups rows by identifier_columns, counts correct (c = sum(value)) and total (n = count(value)) attempts per problem, then applies the closed-form estimator.

Expects value to be binary (0 or 1). For k=1 this is equivalent to the mean.

- Example (k=2, continuing from the Aggregator docstring example):
“What is 2+2?”: n=3, c=2, k=2 -> 1.0 (only one wrong sample, so any 2 picks include a correct one)
“Solve x^2=4”: n=3, c=1, k=2 -> 0.667 (as computed by closed_form_passatk)

Output:

| prompt         | value | subject |
|----------------|-------|---------|
| “What is 2+2?” | 1.000 | algebra |
| “Solve x^2=4”  | 0.667 | algebra |
- Parameters:
k (int)
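The group-then-estimate pipeline described above can be sketched with pandas and `math.comb`. The estimator is inlined here as a local helper following the documented formula; the real class may wire things differently:

```python
from math import comb

import pandas as pd


def passatk(n: int, c: int, k: int) -> float:
    # Closed-form pass@k: 1 - C(n-c, k) / C(n, k); trivially 1 when n - c < k.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Input from the Aggregator docstring example.
df = pd.DataFrame({
    "prompt": ["What is 2+2?"] * 3 + ["Solve x^2=4"] * 3,
    "value": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "subject": ["algebra"] * 6,
})

k = 2
# Count total (n) and correct (c) attempts per problem.
grouped = df.groupby("prompt", as_index=False).agg(
    n=("value", "count"), c=("value", "sum"), subject=("subject", "first")
)
grouped["value"] = [
    passatk(int(n), int(c), k) for n, c in zip(grouped["n"], grouped["c"])
]
```

With k=2 this reproduces the 1.000 and 0.667 values from the example output.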
- eval_framework.metrics.aggregators.aggregators.closed_form_passatk(n, c, k)[source]¶
Closed-form pass@k estimator (see HumanEval paper).
pass@k = 1 - C(n-c, k) / C(n, k)
Given n total samples with c correct, this is the probability that at least one of k randomly chosen samples is correct. The ratio C(n-c,k)/C(n,k) is the chance all k picks are wrong; subtracting from 1 gives success probability. When n-c < k there aren’t enough wrong samples to fill k slots, so the result is trivially 1.
- Return type:
  float
- Parameters:
n (int)
c (int)
k (int)
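Assuming the documented formula, the estimator reduces to a few lines with `math.comb`; a sketch with the edge cases spot-checked (n - c < k forces 1.0, and k=1 recovers the mean c/n):

```python
from math import comb


def closed_form_passatk(n: int, c: int, k: int) -> float:
    """Sketch of the documented formula: pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k wrong samples: any draw of k must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(closed_form_passatk(3, 1, 2))   # 1 - C(2,2)/C(3,2) = 1 - 1/3 ≈ 0.667
print(closed_form_passatk(3, 2, 2))   # n - c = 1 < k, so 1.0
print(closed_form_passatk(10, 4, 1))  # k=1 equals the mean: 4/10 = 0.4
```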