eval_framework.metrics.aggregators package

Submodules

eval_framework.metrics.aggregators.aggregators module

class eval_framework.metrics.aggregators.aggregators.Aggregator(*args, **kwargs)[source]

Bases: Protocol

Base class for metric aggregators.

An aggregator collapses multiple evaluation rows for the same problem (i.e. prompt) into a single score per problem. The input DataFrame has one row per (problem, attempt) pair; the output has one row per problem with a new value.

Parameters:
  • response_df – DataFrame where each row is one evaluation attempt. Must contain a value column (the per-attempt score) and all identifier_columns.

  • identifier_columns – Columns that uniquely identify a problem (e.g. ["prompt"]). Rows sharing the same identifier are different attempts at the same problem.

Returns:

DataFrame with one row per unique problem and a value column holding the aggregated score. All non-identifier, non-value columns are preserved (typically via "first").

Example input (identifier_columns=["prompt"], 3 attempts per problem):

| prompt         | value | subject |
|----------------|-------|---------|
| "What is 2+2?" | 1.0   | algebra |
| "What is 2+2?" | 1.0   | algebra |
| "What is 2+2?" | 0.0   | algebra |
| "Solve x^2=4"  | 0.0   | algebra |
| "Solve x^2=4"  | 1.0   | algebra |
| "Solve x^2=4"  | 0.0   | algebra |

name: str
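
Any callable with the right signature can satisfy the protocol. Below is a minimal sketch of a hypothetical custom aggregator (the class name IdentifierMax and the exact call signature are assumptions inferred from the parameters documented above, not part of the library):

```python
import pandas as pd

class IdentifierMax:
    """Hypothetical aggregator: keeps the best (max) value per problem.

    Mirrors the documented contract: one input row per (problem, attempt),
    one output row per problem; non-identifier columns kept via "first".
    """

    name: str = "identifier_max"

    def __call__(self, response_df: pd.DataFrame, identifier_columns: list[str]) -> pd.DataFrame:
        other_columns = [
            c for c in response_df.columns if c not in identifier_columns + ["value"]
        ]
        agg_spec = {"value": "max", **{c: "first" for c in other_columns}}
        return response_df.groupby(identifier_columns, as_index=False).agg(agg_spec)
```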
class eval_framework.metrics.aggregators.aggregators.IdentifierMean[source]

Bases: Aggregator

Computes the arithmetic mean of value across attempts per problem.

Example (continuing from the Aggregator docstring example):

"What is 2+2?": mean(1.0, 1.0, 0.0) = 0.667
"Solve x^2=4": mean(0.0, 1.0, 0.0) = 0.333

Output:

| prompt         | value | subject |
|----------------|-------|---------|
| "What is 2+2?" | 0.667 | algebra |
| "Solve x^2=4"  | 0.333 | algebra |
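
The documented behaviour can be reproduced with plain pandas. This is a sketch of what the example describes, not the library's internal implementation:

```python
import pandas as pd

# The running example: 3 attempts per problem, binary per-attempt scores.
df = pd.DataFrame({
    "prompt": ["What is 2+2?"] * 3 + ["Solve x^2=4"] * 3,
    "value": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "subject": ["algebra"] * 6,
})

# Mean of "value" per problem; other columns preserved via "first".
out = df.groupby(["prompt"], as_index=False).agg({"value": "mean", "subject": "first"})
print(out.round(3))
```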

class eval_framework.metrics.aggregators.aggregators.Identity[source]

Bases: object

No-op aggregator: returns the input unchanged.

Use for metrics where each row is already a final score and no cross-attempt aggregation is needed (e.g. when num_samples=1).

class eval_framework.metrics.aggregators.aggregators.PassAtK(k=1)[source]

Bases: Aggregator

Computes pass@k: the probability that at least one of k random attempts is correct.

Groups rows by identifier_columns, counts correct (c = sum(value)) and total (n = count(value)) attempts per problem, then applies the closed-form estimator.

Expects value to be binary (0 or 1). For k=1 this is equivalent to the mean.

Example (k=2, continuing from the Aggregator docstring example):

"What is 2+2?": n=3, c=2, k=2 -> 1.0 (guaranteed correct pick)
"Solve x^2=4": n=3, c=1, k=2 -> 0.667 (as computed by closed_form_passatk)

Output:

| prompt         | value | subject |
|----------------|-------|---------|
| "What is 2+2?" | 1.000 | algebra |
| "Solve x^2=4"  | 0.667 | algebra |

Parameters:

k (int)
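
The grouping step described above (count attempts, count correct, apply the estimator) can be sketched with a pandas groupby. The helper passatk below is an illustration of the documented logic, not the library's code:

```python
from math import comb

import pandas as pd

def passatk(values: pd.Series, k: int) -> float:
    """pass@k over one problem's binary per-attempt scores."""
    n, c = len(values), int(values.sum())
    if n - c < k:
        # Not enough wrong samples to fill k slots: some pick must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The running example, k=2.
df = pd.DataFrame({
    "prompt": ["What is 2+2?"] * 3 + ["Solve x^2=4"] * 3,
    "value": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})
out = df.groupby("prompt", as_index=False)["value"].agg(lambda v: passatk(v, k=2))
```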

eval_framework.metrics.aggregators.aggregators.closed_form_passatk(n, c, k)[source]

Closed-form pass@k estimator (see HumanEval paper).

pass@k = 1 - C(n-c, k) / C(n, k)

Given n total samples with c correct, this is the probability that at least one of k randomly chosen samples is correct. The ratio C(n-c,k)/C(n,k) is the chance all k picks are wrong; subtracting from 1 gives success probability. When n-c < k there aren’t enough wrong samples to fill k slots, so the result is trivially 1.

Return type:

float

Parameters:
  • n (int)

  • c (int)

  • k (int)
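
A direct transcription of the formula above, using Python's exact integer binomial coefficients (assumed equivalent to the library's closed_form_passatk; the n - c < k guard handles the trivially-1 case described in the docstring):

```python
from math import comb

def closed_form_passatk(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k) for n samples with c correct."""
    if n - c < k:
        # Fewer than k wrong samples: any k picks must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# From the PassAtK example: n=3, c=1, k=2 -> 1 - C(2,2)/C(3,2) = 1 - 1/3
closed_form_passatk(3, 1, 2)
```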

Module contents