eval_framework package

Submodules

eval_framework.base_config module

class eval_framework.base_config.BaseConfig(**data)[source]

Bases: BaseModel

as_dict()[source]
Return type:

dict[str, Any]

classmethod from_yaml(yml_filename)[source]
Return type:

BaseConfig

Parameters:

yml_filename (str | Path)

model_config: ClassVar[ConfigDict] = {'extra': 'forbid', 'frozen': True, 'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to ConfigDict.

save(out_file)[source]
Return type:

None

Parameters:

out_file (Path)
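
A minimal usage sketch. The MyEvalConfig subclass, its fields, and the YAML paths are illustrative assumptions; only BaseConfig, from_yaml(), as_dict(), and save() come from this module:

   from pathlib import Path

   from eval_framework.base_config import BaseConfig

   class MyEvalConfig(BaseConfig):  # hypothetical subclass for illustration
       model_name: str
       num_samples: int = 100

   # from_yaml() expects a YAML file whose keys match the model fields;
   # 'extra': 'forbid' makes unknown keys a validation error, and
   # 'frozen': True makes the loaded config immutable after creation.
   config = MyEvalConfig.from_yaml("my_eval.yaml")  # hypothetical path
   print(config.as_dict())                 # plain dict[str, Any] view
   config.save(Path("my_eval_copy.yaml"))  # write the config back to disk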

eval_framework.evaluation_generator module

class eval_framework.evaluation_generator.EvaluationGenerator(config, result_processor)[source]

Bases: object

run_eval()[source]

Runs evaluation using saved completions.

Return type:

list[Result]

eval_framework.exceptions module

exception eval_framework.exceptions.LogicError[source]

Bases: Exception

eval_framework.logger module

eval_framework.main module

eval_framework.main.main(llm, config, should_preempt_callable=None, trial_id=None, *args, resource_cleanup=False, verbosity=1)[source]

Runs the entire evaluation process: response generation and evaluation.

Return type:

list[Result]

Parameters:
  • llm (BaseLLM)

  • config (EvalConfig)

  • should_preempt_callable (Callable[[], bool] | None)

  • trial_id (int | None)

  • args (Any)

  • resource_cleanup (bool)

  • verbosity (int)
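
A hedged invocation sketch; my_llm and my_config stand in for a concrete BaseLLM and EvalConfig constructed elsewhere, and the one-hour budget is illustrative:

   import time

   from eval_framework.main import main

   my_llm = ...     # a concrete BaseLLM implementation, constructed elsewhere
   my_config = ...  # an EvalConfig for the tasks being evaluated

   deadline = time.monotonic() + 3600  # illustrative one-hour budget

   results = main(
       my_llm,
       my_config,
       should_preempt_callable=lambda: time.monotonic() > deadline,
       trial_id=0,
       resource_cleanup=True,
       verbosity=1,
   )
   print(f"collected {len(results)} results")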

eval_framework.response_generator module

class eval_framework.response_generator.ResponseGenerator(llm, config, result_processor)[source]

Bases: object

generate(should_preempt_callable)[source]

Generates responses and saves them along with metadata.

Return type:

tuple[list[Completion | Loglikelihood], bool]

Returns:

a tuple (responses, preempted): the list of generated responses and a flag indicating whether the process was preempted

Parameters:

should_preempt_callable (Callable[[], bool]) – function that checks whether the run should be preempted
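
The preempt hook is a zero-argument callable returning bool. A minimal sketch of a time-budget hook (the budget value and the commented-out generator construction are illustrative):

   import time

   def make_time_budget_preempt(budget_seconds: float):
       """Return a callable that reports True once the budget is exhausted."""
       deadline = time.monotonic() + budget_seconds

       def should_preempt() -> bool:
           return time.monotonic() > deadline

       return should_preempt

   # generator = ResponseGenerator(llm, config, result_processor)  # built elsewhere
   # responses, preempted = generator.generate(make_time_budget_preempt(1800.0))
   # if preempted:
   #     ...  # responses so far were saved; the run can be resumed later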

eval_framework.response_generator.map_language_to_value(language)[source]
Return type:

str | dict[str, str] | dict[str, tuple[str, str]] | None

Parameters:

language (Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None)

eval_framework.response_generator.repeat_samples(samples, repeats)[source]

Flatten repeats into a single stream of samples.

After expansion, original sample indices no longer point to the same samples. The original index can be recovered as original_index = expanded_index // repeats.

Return type:

Iterable[Sample]

Parameters:
  • samples (Iterable[Sample])

  • repeats (int)
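
A worked example of the index arithmetic, assuming repeats are expanded contiguously (which is what the recovery formula implies):

   # With repeats=3, samples [s0, s1] expand to [s0, s0, s0, s1, s1, s1].
   repeats = 3
   for expanded_index in range(6):
       original_index = expanded_index // repeats
       print(expanded_index, "->", original_index)
   # 0 -> 0, 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 1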

eval_framework.run module

eval_framework.run.parse_args()[source]
Return type:

Namespace

eval_framework.run.run()[source]
Return type:

None

eval_framework.run.run_with_kwargs(kwargs)[source]
Return type:

None

Parameters:

kwargs (dict)

eval_framework.run_direct module

eval_framework.suite module

class eval_framework.suite.MetricSource(**data)[source]

Bases: BaseModel

A single (child, metric) pair used as an input to a SuiteAggregate. See the examples folder for how these are used.

Parameters:
  • child (str)

  • metric (str)

child: str
metric: str
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to ConfigDict.

class eval_framework.suite.SuiteAggregate(**data)[source]

Bases: BaseModel

Model to aggregate results from a suite of tasks.

Parameters:
  • name (str)

  • sources (list[MetricSource])

  • method (str | Callable[[list[float]], float])

method: str | Callable[[list[float]], float]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to ConfigDict.

name: str
sources: list[MetricSource]
classmethod validate_method(v)[source]
Return type:

str | Callable

Parameters:

v (str | Callable)
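
A construction sketch; the child and metric names are made up, and a plain callable is passed for method because the exact set of accepted strings is not documented here:

   from eval_framework.suite import MetricSource, SuiteAggregate

   # Aggregate the same metric from two child tasks (names are illustrative).
   agg = SuiteAggregate(
       name="avg_accuracy",
       sources=[
           MetricSource(child="task_a", metric="accuracy"),
           MetricSource(child="task_b", metric="accuracy"),
       ],
       method=lambda values: sum(values) / len(values),  # Callable[[list[float]], float]
   )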

class eval_framework.suite.SuiteResult(**data)[source]

Bases: BaseModel

Parameters:
  • name (str)

  • task_results (dict[str, Self])

  • aggregates (dict[str, float | None])

aggregates: dict[str, float | None]
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to ConfigDict.

name: str
task_results: dict[str, Self]
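
Because task_results maps names to nested SuiteResult instances, a suite's result forms a tree. An illustrative instance (all names and numbers are made up):

   from eval_framework.suite import SuiteResult

   result = SuiteResult(
       name="my_suite",
       task_results={
           "task_a": SuiteResult(name="task_a", task_results={},
                                 aggregates={"accuracy": 0.81}),
           "task_b": SuiteResult(name="task_b", task_results={},
                                 aggregates={"accuracy": None}),  # no valid sources
       },
       aggregates={"avg_accuracy": 0.81},
   )
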
class eval_framework.suite.TaskSuite(**data)[source]

Bases: BaseModel

Parameters:
  • name (str | None)

  • tasks (Annotated[str | list[str | Self], BeforeValidator(func=~eval_framework.suite.parse_strings_to_task_or_suite, json_schema_input_type=PydanticUndefined)])

  • aggregates (list[SuiteAggregate])

  • temperature (float | None)

  • top_p (float | None)

  • top_k (int | None)

  • extra_llm_args (dict[str, Any])

  • num_samples (int | None)

  • num_fewshot (int | None)

  • max_tokens (int | None)

  • repeats (int | None)

  • batch_size (int | None)

  • task_subjects (list[str] | None)

  • hf_revision (str | None)

aggregates: list[SuiteAggregate]
batch_size: int | None
extra_llm_args: dict[str, Any]
get_hyperparam_overrides()[source]

Return hyperparam fields that were explicitly set in the suite definition.

Return type:

dict[str, Any]

hf_revision: str | None
property is_leaf: bool
classmethod load(path)[source]
Return type:

Self

Parameters:

path (Path | str)

classmethod load_from_py(path)[source]
Return type:

Self

Parameters:

path (Path | str)

classmethod load_from_yaml(path)[source]
Return type:

Self

Parameters:

path (Path)

max_tokens: int | None
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to ConfigDict.

name: str | None
num_fewshot: int | None
num_samples: int | None
repeats: int | None
property task_name: str

The registered task name. Only valid for leaf tasks.

task_subjects: list[str] | None
tasks: Annotated[str | list[str | Self], BeforeValidator(func=parse_strings_to_task_or_suite, json_schema_input_type=PydanticUndefined)]
temperature: float | None
top_k: int | None
top_p: float | None
validate_suite()[source]
Return type:

Self
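
A loading sketch; the path is hypothetical, and load() is assumed to dispatch to load_from_py() or load_from_yaml() based on the file it is given:

   from eval_framework.suite import TaskSuite

   suite = TaskSuite.load("suites/my_suite.yaml")  # hypothetical path; Path | str

   if suite.is_leaf:
       print("single task:", suite.task_name)  # task_name is only valid for leaves
   else:
       print("composite suite:", suite.name)

   # Hyperparameters explicitly set in the suite definition, e.g. temperature:
   print(suite.get_hyperparam_overrides())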

eval_framework.suite.compute_aggregates(aggregates, child_results)[source]

Compute suite-level stats from explicitly named (child, metric) sources.

For each SuiteAggregate, the value from each MetricSource is looked up by child name and exact metric key. Sources whose child is missing or whose metric is None or NaN are silently skipped. If no sources yield a valid value the aggregate is None.

Return type:

dict[str, float | None]
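
A sketch of the documented lookup rules against plain dicts (illustrative data; the real child_results structure is not shown in this signature):

   import math

   # child name -> {metric key: value}
   child_metrics = {
       "task_a": {"accuracy": 0.8},
       "task_b": {"accuracy": float("nan")},  # NaN: silently skipped
       # "task_c" missing entirely: also skipped
   }

   sources = [("task_a", "accuracy"), ("task_b", "accuracy"), ("task_c", "accuracy")]
   values = []
   for child, metric in sources:
       value = child_metrics.get(child, {}).get(metric)
       if value is not None and not math.isnan(value):
           values.append(value)

   # None when no source yields a valid value, otherwise the aggregate:
   aggregate = sum(values) / len(values) if values else None
   print(aggregate)  # 0.8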

eval_framework.suite.parse_strings_to_task_or_suite(v)[source]

Expand bare strings in a list to leaf-suite dicts. Pydantic validates them into TaskSuite.

Return type:

str | list

Parameters:

v (str | list)
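
Illustratively, a bare string in the list becomes a dict that Pydantic then validates into a leaf TaskSuite; the task names are made up, and the exact dict shape produced ({"tasks": name}) is an assumption:

   from eval_framework.suite import parse_strings_to_task_or_suite

   expanded = parse_strings_to_task_or_suite(["mmlu", {"name": "sub", "tasks": ["arc"]}])
   # assumed result: [{"tasks": "mmlu"}, {"name": "sub", "tasks": ["arc"]}]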

eval_framework.suite.resolve_to_evalconfig_kwargs(leaf, resolved_defaults, cli_kwargs)[source]

Build the kwargs dict expected by run_with_kwargs() for a single leaf task.

Merges CLI kwargs as the base, overlays resolved suite defaults, and routes temperature/top_p/extra_llm_args into the llm_args dict.

Return type:

dict

Parameters:
  • leaf (TaskSuite)

  • resolved_defaults (dict[str, Any])

  • cli_kwargs (dict[str, Any])
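
A worked illustration of the documented merge order; all keys and values are made up, only the precedence and the llm_args routing follow the description above:

   cli_kwargs = {"output_dir": "out/", "num_samples": 10}
   resolved_defaults = {"num_samples": 100, "temperature": 0.2}

   # CLI kwargs form the base, suite defaults overlay them, and
   # temperature/top_p/extra_llm_args are routed into llm_args:
   # {
   #     "output_dir": "out/",
   #     "num_samples": 100,   # suite default overlays the CLI base
   #     "llm_args": {"temperature": 0.2},
   # }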

eval_framework.suite.run_suite(suite, cli_kwargs, parent_defaults=None, root_suite_name=None)[source]

Recursively run all tasks in a suite and compute aggregates bottom-up using post-order traversal.

For a leaf suite: runs the single task via _run_single_task and returns the aggregated results directly. For a composite suite: recurses into each child, collects results, then computes this suite’s aggregates.

Return type:

SuiteResult

Parameters:
  • suite (TaskSuite)

  • cli_kwargs (dict[str, Any])

  • parent_defaults (dict[str, Any] | None)

  • root_suite_name (str | None)
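
An end-to-end sketch tying the suite API together (paths and kwargs are illustrative):

   from pathlib import Path

   from eval_framework.suite import TaskSuite, run_suite, save_suite_results

   suite = TaskSuite.load("suites/my_suite.yaml")                # hypothetical path
   result = run_suite(suite, cli_kwargs={"output_dir": "out/"})  # illustrative kwargs

   # result.aggregates is a dict[str, float | None], which is exactly what
   # save_suite_results() persists:
   save_suite_results(Path("out/"), result.aggregates)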

eval_framework.suite.save_suite_results(output_dir, results)[source]
Return type:

None

Parameters:
  • output_dir (Path)

  • results (dict[str, float | None])

Module contents