eval_framework.metrics.completion package¶

Submodules¶

eval_framework.metrics.completion.accuracy_completion module¶

class eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Accuracy Completion'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.aidanbench module¶

class eval_framework.metrics.completion.aidanbench.AidanBenchMetric[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'AidanBench'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.bleu module¶

class eval_framework.metrics.completion.bleu.BLEU[source]¶

Bases: BaseMetric[Completion]

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence. It counts the n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram (unigram) is a single token and a bigram is a pair of adjacent words. The comparison is made regardless of word order. Source: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/ Paper: https://www.aclweb.org/anthology/P02-1040/

NAME: str = 'BLEU'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)
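
The n-gram matching described for BLEU can be illustrated with a minimal sketch of clipped n-gram precision. This is not the implementation behind the BLEU class, which is not shown here; the function names, whitespace tokenization, and clipping by reference counts are assumptions made for illustration. Full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision_sketch(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Hypothetical helper: each candidate n-gram is credited at most as often
    as it appears in the reference (Counter intersection takes the minimum).
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum((cand & ref).values())
    return matched / total
```

For instance, the candidate "the the the" scored against the reference "the cat" gets a unigram precision of 1/3, because the reference contains "the" only once.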

class eval_framework.metrics.completion.bleu.LINEWISE_BLEU[source]¶

Bases: BaseMetric[Completion]

Maximum Line-level BLEU score.

NAME: str = 'Linewise BLEU'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.bleu.ResponseToOriginalBLEU[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Response to Original BLEU'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.chrf module¶

class eval_framework.metrics.completion.chrf.CHRF[source]¶

Bases: BaseMetric[Completion]

chrF++ is a tool for automatic evaluation of machine translation output based on character n-gram precision and recall enhanced with word n-grams. Source: https://github.com/m-popovic/chrF Paper: https://www.aclweb.org/anthology/W15-3049.pdf

NAME: str = 'chrF'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.chrf.LINEWISE_CHRF[source]¶

Bases: BaseMetric[Completion]

Maximum Line-level chrF++ (Character n-gram F-score) score. Paper: https://aclanthology.org/W15-3049/

NAME: str = 'Linewise chrF'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.code_assertion module¶

class eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Code Completion Accuracy'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.code_execution_pass_at_one module¶

class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionBaseContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
extra_data (Any)

benchmark_timeout: int¶

code_prompt: str¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

package_downloads: dict[str, str | None]¶

run_env: str¶

test_code: str¶

class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'code-execution-pass@1'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOneContext(**data)[source]¶

Bases: CodeExecutionBaseContext

Parameters:

run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
snippet_merge_fn (str)
output_parse_fn (str)
extra_data (Any)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output_parse_fn: str¶

snippet_merge_fn: str¶

class eval_framework.metrics.completion.code_execution_pass_at_one.RealtimeCodeExectionContext(**data)[source]¶

Bases: CodeExecutionBaseContext

Parameters:

run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
snippet_merge_fn (Callable[[str, str], str])
output_parse_fn (Callable[[str], ExecutionResult])
extra_data (Any)

classmethod from_context(context)[source]¶

Return type:: Self
Parameters:: context (CodeExecutionPassAtOneContext)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output_parse_fn: Callable[[str], ExecutionResult]¶

snippet_merge_fn: Callable[[str, str], str]¶

eval_framework.metrics.completion.code_execution_pass_at_one.estimate_pass_at_k(n, c, k)[source]¶

Estimates pass@k for a single problem.

Parameters:

n (int) – Total number of generated samples.
c (int) – Number of correct samples.
k (int) – Number of attempts or samples considered.

Return type:

float

Returns:

The pass@k value.
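
The formula used by estimate_pass_at_k is not shown on this page. A common choice for these three inputs is the unbiased estimator 1 - C(n - c, k) / C(n, k) from the Codex paper (Chen et al., 2021). The sketch below implements that estimator under a hypothetical name; treating it as equivalent to the framework's function is an assumption.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    Hypothetical stand-in, not the framework's estimate_pass_at_k.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain success rate c / n.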

eval_framework.metrics.completion.comet module¶

class eval_framework.metrics.completion.comet.COMET[source]¶

Bases: BaseMetric[Completion]

COMET is a neural, multilingual framework for evaluating machine translation quality. It leverages cross-lingual pretrained language models to achieve state-of-the-art correlation with human judgments. Note: this requires a Hugging Face token with access to the model: https://huggingface.co/Unbabel/XCOMET-XL Source: https://github.com/Unbabel/COMET Paper: https://arxiv.org/abs/2009.09025

NAME: str = 'COMET'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.concordance_index module¶

class eval_framework.metrics.completion.concordance_index.ConcordanceIndex[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'ConcordanceIndex'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.concordance_index.calculate_concordance_index(ground_truth, completion)[source]¶

Return type:

float

Parameters:

ground_truth (str)
completion (str)

eval_framework.metrics.completion.csv_format module¶

class eval_framework.metrics.completion.csv_format.CSVFormat[source]¶

Bases: BaseMetric[Completion]

KEYS: list[str] | None = ['has_csv', 'is_separator_respected', 'is_column_count_respected']¶

NAME: str = 'CSV Format'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.csv_format.CSVFormatEvaluation(**data)[source]¶

Bases: BaseModel

Parameters:

implicit (bool)
has_csv (bool)
is_separator_respected (bool)
is_column_count_respected (bool)

has_csv: bool¶

implicit: bool¶

is_column_count_respected: bool¶

is_separator_respected: bool¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

eval_framework.metrics.completion.csv_format.evaluate_csv_format(response)[source]¶

Return type:: CSVFormatEvaluation
Parameters:: response (Completion)

eval_framework.metrics.completion.csv_format.extract_csv_from_text(text, min_rows=2, min_columns=2)[source]¶

Return type:

tuple[list[str] | None, str | None]

Parameters:

text (str)
min_rows (int)
min_columns (int)

eval_framework.metrics.completion.cwe_accuracy module¶

class eval_framework.metrics.completion.cwe_accuracy.CWEAccuracy[source]¶

Bases: BaseMetric[Completion]

Metric for Common Word Extraction tasks

NAME: str = 'CWEAccuracy'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.drop_completion module¶

DROP completion metrics: F1 and exact match.

class eval_framework.metrics.completion.drop_completion.DropF1ExactMatch[source]¶

Bases: BaseMetric[Completion]

DROP F1 and exact match. Requires DropMetricContext with answer_tuples.

KEYS: list[str] | None = ['f1', 'exact_match']¶

NAME: str = 'DROP F1 / Exact Match'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.drop_completion.DropMetricContext(**data)[source]¶

Bases: BaseMetricContext

Context for DROP completion metrics. answer_tuples: list of gold answers (each a list of strings).

Parameters:

answer_tuples (list[list[str]])
extra_data (Any)

answer_tuples: list[list[str]]¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

eval_framework.metrics.completion.exponential_similarity module¶

class eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'ExponentialSimilarity'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.exponential_similarity.calculate_exponential_similarity(p_true, p_pred)[source]¶

Compute the exponential similarity (SpaceDigest version) between the gold percentage and predicted value.

Parameters:

p_true (float) – The gold/reference percentage.
p_pred (float) – The predicted scalar.

The docstring also lists d (float; base of the exponent, default 2) and c (float; coefficient in the exponent, default 10), which are not parameters of the signature shown above.

Return type:

float

Returns:

Similarity score between 0 and 1.
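
The docstring mentions a base d and a coefficient c in the exponent but does not spell out the formula. One form consistent with that description is d ** (-c * |p_true - p_pred|), sketched below. The function name, the exact formula, and the assumption that both inputs are fractions in [0, 1] rather than values in [0, 100] are illustrative guesses, not taken from the framework.

```python
def exponential_similarity_sketch(
    p_true: float, p_pred: float, d: float = 2.0, c: float = 10.0
) -> float:
    """Return d ** (-c * |p_true - p_pred|).

    Hypothetical formula: equals 1.0 for a perfect prediction and decays
    toward 0 as the absolute error grows. Inputs are assumed to be fractions.
    """
    return d ** (-c * abs(p_true - p_pred))
```

Under these assumptions, an absolute error of 0.1 halves the score when d = 2 and c = 10.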

eval_framework.metrics.completion.f1 module¶

class eval_framework.metrics.completion.f1.F1[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'F1'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.f1.calculate_f1(ref_tokens, hyp_tokens)[source]¶

Calculate F1 score between two texts based on token overlap.

Return type:

float

Parameters:

ref_tokens (list[Any])
hyp_tokens (list[Any])
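
Token-overlap F1 is usually computed from the multiset intersection of the two token lists. The sketch below shows that standard computation under a hypothetical name; how calculate_f1 handles empty inputs or tokenization is not documented here, so returning 0.0 for empty lists is an assumption.

```python
from collections import Counter
from typing import Any


def token_f1_sketch(ref_tokens: list[Any], hyp_tokens: list[Any]) -> float:
    """Harmonic mean of token precision and recall (tokens must be hashable)."""
    if not ref_tokens or not hyp_tokens:
        return 0.0  # assumed convention for empty inputs
    # Counter intersection keeps each token at its minimum count in both lists.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For a reference ["a", "b", "c"] and a hypothesis ["a", "d"], precision is 1/2, recall is 1/3, and the F1 score is 0.4.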

eval_framework.metrics.completion.format_checker module¶

class eval_framework.metrics.completion.format_checker.CheckJsonFormat[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'JSON Format'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.format_checker.CheckPostScriptFormat[source]¶

Bases: BaseMetric[Completion]

This metric is honestly not that great. In the original IFEval implementation, it just checks whether the text contains the string (P.)P.S. or variants thereof, such as p. s. It doesn’t check for parsing.

NAME: str = 'Postscript Format'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.grid_difference module¶

class eval_framework.metrics.completion.grid_difference.GridDifference[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'grid_difference'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

calculate_score(output_ground_truth_difference_count, input_ground_truth_difference_count)[source]¶

Return type:

float

Parameters:

output_ground_truth_difference_count (int)
input_ground_truth_difference_count (int)

count_differences(character_list_1, character_list_2)[source]¶

Return type:

int

Parameters:

character_list_1 (list[str])
character_list_2 (list[str])

extract_grid_from_prompt(prompt)[source]¶

Return type:: str
Parameters:: prompt (str)

eval_framework.metrics.completion.ifeval module¶

class eval_framework.metrics.completion.ifeval.IFEvalMetric[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'IFEval'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.ifeval.IFEvalMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

key (int)
instruction_id_list (list[str])
prompt (str)
additional_kwargs (list[dict[str, Any]])
extra_data (Any)

additional_kwargs: list[dict[str, Any]]¶

instruction_id_list: list[str]¶

key: int¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

prompt: str¶

eval_framework.metrics.completion.json_format module¶

class eval_framework.metrics.completion.json_format.JsonFormat[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'JSON Format'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.json_format.JsonFormatEvaluation(**data)[source]¶

Bases: BaseModel

Parameters:

is_just_json (bool)
is_valid_json (bool)
fulfills_schema (bool | None)
exact_match (bool | None)
json_parsing_error (str | None)
schema_validation_error (str | None)

exact_match: bool | None¶

fulfills_schema: bool | None¶

is_just_json: bool¶

is_valid_json: bool¶

json_parsing_error: str | None¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

schema_validation_error: str | None¶

eval_framework.metrics.completion.json_format.get_json_object(text)[source]¶

Extract the first valid JSON object or array from text.

This function handles nested brackets properly by using a bracket counting approach to find complete JSON structures, rather than using regex which can incorrectly match outer brackets containing non-JSON content.

Return type:: str
Parameters:: text (str)
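
The bracket-counting idea can be sketched as follows. This is an illustration under a hypothetical name, not the body of get_json_object: it tracks a stack of expected closing brackets, ignores brackets inside string literals, and validates each balanced span with json.loads. Returning an empty string when nothing is found is an assumption.

```python
import json


def first_json_sketch(text: str) -> str:
    """Return the first balanced {...} or [...] span that parses as JSON, else ''."""
    closers = {"{": "}", "[": "]"}
    for start, ch in enumerate(text):
        if ch not in closers:
            continue
        stack: list[str] = []  # closing brackets still expected
        in_string = False
        escaped = False
        for end in range(start, len(text)):
            cur = text[end]
            if in_string:
                # Brackets inside a string literal do not count.
                if escaped:
                    escaped = False
                elif cur == "\\":
                    escaped = True
                elif cur == '"':
                    in_string = False
                continue
            if cur == '"':
                in_string = True
            elif cur in closers:
                stack.append(closers[cur])
            elif cur in "}]":
                if not stack or stack.pop() != cur:
                    break  # mismatched bracket: try the next opening bracket
                if not stack:
                    candidate = text[start:end + 1]
                    try:
                        json.loads(candidate)
                    except json.JSONDecodeError:
                        break  # balanced but not JSON, e.g. "[see below]"
                    return candidate
    return ""
```

A span such as "[see below]" is balanced but fails to parse, so the scan moves on to the next opening bracket instead of returning it.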

eval_framework.metrics.completion.json_format.remove_comments(text, comment_indicator='//')[source]¶

Return type:

str

Parameters:

text (str)
comment_indicator (str)

eval_framework.metrics.completion.language_checker module¶

class eval_framework.metrics.completion.language_checker.GermanCompletionChecker[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'German Completion Check'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.language_checker.LanguageChecker[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Language Check'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.language_checker.LanguageConsistencyChecker[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Language Consistency'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Language Consistency Raw'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.length_control module¶

class eval_framework.metrics.completion.length_control.LengthControl(tolerance=0.16666666666666666)[source]¶

Bases: BaseMetric[Completion]

Parameters:: tolerance (float)

NAME: str = 'length_control'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.length_control.LengthRequirementType(*values)[source]¶

Bases: Enum

MAX = 'maximum'¶

MIN = 'minimum'¶

TARGET = 'target'¶

class eval_framework.metrics.completion.length_control.LengthRequirementUnit(*values)[source]¶

Bases: Enum

PARAGRAPHS = 'paragraphs'¶

SENTENCES = 'sentences'¶

WORDS = 'words'¶

eval_framework.metrics.completion.math_minerva_completion module¶

Minerva-style MATH completion metric: exact_match and exact_match_flex.

class eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletion(use_cot=True, cot_style='minerva', relaxed=False)[source]¶

Bases: BaseMetric[Completion]

Minerva MATH: reports Exact Match and Exact Match (Flex). Uses raw_completion to extract multiple candidate answers: the primary candidate is scored for exact_match, and all candidates are checked with both Minerva and Hendrycks equivalence for exact_match_flex.

Parameters:

use_cot (bool)
cot_style (str)
relaxed (bool)

NAME: str = 'Math Minerva Completion'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletionRelaxed(use_cot=True, cot_style='minerva', relaxed=True)[source]¶

Bases: MathMinervaCompletion

MathMinervaCompletion with relaxed=True by default (flexible final-answer matching).

Parameters:

use_cot (bool)
cot_style (str)
relaxed (bool)

eval_framework.metrics.completion.math_reasoning_completion module¶

class eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Math Reasoning Completion (symbolic)'¶

REMOVED_EXPRESSIONS_FORMAT = ['\\text{s}', '\\text{.}', '\\text{\ns}', '\\text{}^2', '\\text{}^3', '\\text{\n}', '\\text{}', '\\mathrm{th}', '^\\circ', '^{\\circ}', '\\;', ',\\!', '{,}', '"', '\\dots']¶

REMOVED_EXPRESSIONS_UNITS = ['square', 'ways', 'integers', 'dollars', 'mph', 'inches', 'ft', 'hours', 'km', 'units', '\\ldots', 'sue', 'points', 'feet', 'minutes', 'digits', 'cents', 'degrees', 'cm', 'gm', 'pounds', 'meters', 'meals', 'edges', 'students', 'childrentickets', 'multiples']¶

SUBSTITUTIONS = [('\\ban\\b(?!\\w)', ''), ('\\ba\\b(?!\\w)', ''), ('\\.\\$', '$'), ('\\\\\\$', ''), ('\\\\ ', ''), ('\\s+', ''), ('\\\\mbox', 'text'), (',\\\\text\\{and\\}', ','), ('\\\\text\\{and\\}', ','), ('\\\\text\\{m\\}', '\\text{}')]¶

calculate(response)[source]¶

Calculate the accuracy of the completion.

Performs several verification and simplification steps to check whether the completion is correct.

The completion may be either a LaTeX or a plain-string response, which sympy parses, factors, and simplifies.

Parameters:: response (Completion) – Completion object
Return type:: list[MetricResult]
Returns:: list of MetricResult

check_for_equation(final_answer)[source]¶

Check whether the final answer is an equation and split it into its left-hand side and right-hand side.

Parameters:: final_answer (str) – the expression to evaluate
Return type:: list
Returns:: list of left-hand side and right-hand side of the equation

normalize_expression(final_answer)[source]¶

Normalize a LaTeX expression.

Parameters:: final_answer (str) – raw LaTeX expression
Return type:: str
Returns:: normalized LaTeX expression

NOTE: The logic was changed because the earlier substitution replaced characters at arbitrary positions in the string; for example, it turned “infty” into “iny” by removing “ft”.

eval_framework.metrics.completion.math_reasoning_completion.timeout_handler(signum, frame)[source]¶

Return type:

None

Parameters:

signum (Any)
frame (Any)

eval_framework.metrics.completion.minerva_math_utils module¶

Minerva-style MATH answer extraction and equivalence (Lewkowycz et al. 2022).

eval_framework.metrics.completion.minerva_math_utils.extract_answers(raw_answer, use_cot=True, cot_style='minerva', relaxed=False)[source]¶

Extract multiple candidate answers from model output (for exact_match and exact_match_flex). Returns a list of normalized strings; the first is the primary candidate for exact_match. When relaxed=True, final-answer string matching is more lenient (whitespace/case).

Return type:

list[str]

Parameters:

raw_answer (str)
use_cot (bool)
cot_style (str)
relaxed (bool)

eval_framework.metrics.completion.minerva_math_utils.get_unnormalized_answer(text, relaxed=False)[source]¶

Extract the answer from the Minerva format ‘Final Answer: The final answer is … I hope it is correct.’

When relaxed=False, the pattern matches lm-evaluation-harness (lm_eval.tasks.minerva_math.utils) for parity: exact capitalization, no flexible whitespace. When relaxed=True, it accepts any capitalization of:

“Final Answer: The answer is ” / “Final Answer: The final answer is ”
“The Final Answer: The answer is ” / “The Final Answer: The final answer is ”

with flexible whitespace. No suffix is required, but “I hope it is correct.” is stripped when present.

Return type:

str

Parameters:

text (str)
relaxed (bool)

eval_framework.metrics.completion.minerva_math_utils.is_equiv_hendrycks(str1, str2)[source]¶

String equality after Hendrycks strip_string.

Return type:

bool

Parameters:

str1 (str | None)
str2 (str | None)

eval_framework.metrics.completion.minerva_math_utils.is_equiv_minerva(x1, x2, timeout_seconds=5)[source]¶

Sympy-based equivalence (Minerva).

Return type:

bool

Parameters:

x1 (str)
x2 (str)
timeout_seconds (int)

eval_framework.metrics.completion.minerva_math_utils.last_boxed_only_string(string)[source]¶

Extract the last \boxed{…} or \fbox{…} from string.

Return type:: str | None
Parameters:: string (str)
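
Extracting the last boxed expression requires brace matching, because the content may itself contain braces (as in a fraction). The sketch below uses a hypothetical name and assumes that the returned string keeps its \boxed{...} wrapper and that an unclosed box yields None; neither detail is confirmed by this page.

```python
from typing import Optional


def last_boxed_sketch(string: str) -> Optional[str]:
    """Return the last \\boxed{...} or \\fbox{...} span, wrapper included, or None."""
    start = max(string.rfind("\\boxed{"), string.rfind("\\fbox{"))
    if start < 0:
        return None
    depth = 0
    # Walk from the opening brace and stop when it is closed again.
    for i in range(string.index("{", start), len(string)):
        if string[i] == "{":
            depth += 1
        elif string[i] == "}":
            depth -= 1
            if depth == 0:
                return string[start:i + 1]
    return None  # the braces never balanced
```

Counting depth rather than searching for the first closing brace is what lets nested arguments such as \frac{1}{2} survive intact.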

eval_framework.metrics.completion.minerva_math_utils.normalize_final_answer(final_answer)[source]¶

Normalize a final answer (appendix D of Lewkowycz et al. 2022).

Return type:: str
Parameters:: final_answer (str)

eval_framework.metrics.completion.minerva_math_utils.normalized_gold_from_solution(solution)[source]¶

Extract and normalize the gold answer from a solution string (last \boxed{…}).

Return type:: str | None
Parameters:: solution (str)

eval_framework.metrics.completion.minerva_math_utils.remove_boxed(s)[source]¶

Remove \boxed{ or \boxed from content.

Return type:: str
Parameters:: s (str)

eval_framework.metrics.completion.minerva_math_utils.strip_string_hendrycks(string)[source]¶

Hendrycks-style string normalization for string equivalence.

Return type:: str
Parameters:: string (str)

eval_framework.metrics.completion.niah_accuracy module¶

class eval_framework.metrics.completion.niah_accuracy.NIAHAccuracy[source]¶

Bases: BaseMetric[Completion]

Metric for Needle in a Haystack tasks

NAME: str = 'NIAHAccuracy'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.niah_accuracy.clean_text(text)[source]¶

Clean text by removing spaces and normalizing

Return type:: str
Parameters:: text (str)

eval_framework.metrics.completion.placeholder_checker module¶

class eval_framework.metrics.completion.placeholder_checker.PlaceholderChecker[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Placeholder Check'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.placeholder_checker.PlaceholderCheckerMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

num_placeholders (int)
extra_data (Any)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_placeholders: int¶

eval_framework.metrics.completion.repetition module¶

class eval_framework.metrics.completion.repetition.WordRepetition(window_size=128, min_repetitions=1)[source]¶

Bases: BaseMetric[Completion]

Word Repetition Metric

This metric checks for repeated word sequences in the completion text, given a window size and a repetition threshold. window_size is the number of consecutive words that make up a sequence, and min_repetitions is the minimum number of repetitions that triggers the metric. The metric returns 0.0 if no repetitions are found and 1.0 if a sufficient number of repetitions are found. For example, if the completion contains a two-word sequence that repeats once (such as “hello world hello world”), the metric triggers with a window size of 2 and min_repetitions set to 1.

Parameters:

window_size (int)
min_repetitions (int)

HIGHER_IS_BETTER: Final[bool] = False¶

NAME: str = 'WordRepetition'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)
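
The behavior described for WordRepetition can be sketched with a sliding window over whitespace-separated words. The function name, the whitespace tokenization, and the choice to count overlapping windows are assumptions; the actual calculate method may differ.

```python
from collections import Counter


def word_repetition_sketch(
    text: str, window_size: int = 128, min_repetitions: int = 1
) -> float:
    """Return 1.0 if some window of words repeats often enough, else 0.0."""
    words = text.split()
    # Count every contiguous run of window_size words (windows may overlap).
    windows = Counter(
        tuple(words[i:i + window_size])
        for i in range(len(words) - window_size + 1)
    )
    # A window seen m times has repeated m - 1 times.
    most_seen = max(windows.values(), default=0)
    return 1.0 if most_seen - 1 >= min_repetitions else 0.0
```

Because a score of 1.0 signals a detected repetition, lower values are the desirable outcome, which matches HIGHER_IS_BETTER = False on the class.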

eval_framework.metrics.completion.rouge_1 module¶

class eval_framework.metrics.completion.rouge_1.ROUGE_1[source]¶

Bases: BaseMetric[Completion]

ROUGE-1

NAME: str = 'ROUGE-1'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.rouge_2 module¶

class eval_framework.metrics.completion.rouge_2.ROUGE_2[source]¶

Bases: BaseMetric[Completion]

ROUGE-2

NAME: str = 'ROUGE-2'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.rouge_geometric_mean module¶

class eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN[source]¶

Bases: BaseMetric[Completion]

ROUGE Geometric Mean

NAME: str = 'ROUGE-Geometric-Mean'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.rouge_l module¶

class eval_framework.metrics.completion.rouge_l.ROUGE_L[source]¶

Bases: BaseMetric[Completion]

ROUGE-L

NAME: str = 'ROUGE-L'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.struct_eval_metrics module¶

class eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric[source]¶

Bases: StructMetric

NAME: str = 'RenderableStructMetric'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

output_type (str)
keywords (list[str])
extra_data (Any)

keywords: list[str]¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output_type: str¶

class eval_framework.metrics.completion.struct_eval_metrics.StructMetric[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'StructMetric'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.struct_eval_metrics.StructMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

output_type (str)
paths (list[str])
extra_data (Any)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output_type: str¶

paths: list[str]¶

eval_framework.metrics.completion.struct_eval_metrics.is_valid_html(html)[source]¶

Return type:: bool
Parameters:: html (str)

eval_framework.metrics.completion.struct_eval_metrics.path_exists(data, path)[source]¶

Check if a path exists in a structured data object.

Parameters:

data (Any) – The structured data to check
path (str) – The path to check (dot notation)

Return type:

bool

Returns:

True if path exists, False otherwise

eval_framework.metrics.completion.struct_eval_metrics.tokenize_path(path)[source]¶

Tokenize a dot-notation path, handling back-ticks and array indices.

Parameters:: path (str) – The path string (e.g. “users.0.name” or “users[0].name”)
Return type:: list[str]
Returns:: List of path tokens
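
The two path helpers can be illustrated together. The sketch below uses hypothetical names and one regular expression to tokenize a path, then walks dictionaries and lists token by token. Treating back-ticks as quotes around keys that contain dots is an assumption about their meaning, as is the handling of out-of-range indices.

```python
import re
from typing import Any

# One alternative per token kind: `quoted key`, [index], or a bare segment.
_TOKEN = re.compile(r"`([^`]*)`|\[(\d+)\]|([^.\[\]`]+)")


def tokenize_path_sketch(path: str) -> list[str]:
    """Split a path such as "users[0].name" into ["users", "0", "name"]."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in _TOKEN.finditer(path)
    ]


def path_exists_sketch(data: Any, path: str) -> bool:
    """Follow each token through nested dicts and lists."""
    current = data
    for token in tokenize_path_sketch(path):
        if isinstance(current, dict) and token in current:
            current = current[token]
        elif (
            isinstance(current, list)
            and token.isdecimal()
            and int(token) < len(current)
        ):
            current = current[int(token)]
        else:
            return False
    return True
```

Both "users.0.name" and "users[0].name" produce the same tokens here, so either spelling resolves to the same element.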

eval_framework.metrics.completion.ter module¶

class eval_framework.metrics.completion.ter.LINEWISE_TER[source]¶

Bases: BaseMetric[Completion]

Minimum Line-level TER (Translation Edit Rate) score.

NAME: str = 'Linewise TER'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.ter.TER[source]¶

Bases: BaseMetric[Completion]

Translation Error Rate is an error metric for machine translation that measures the number of edits required to change a system output into one of the references. Source: http://www.cs.umd.edu/~snover/tercom/ Paper: http://mt-archive.info/AMTA-2006-Snover.pdf

NAME: str = 'TER'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

eval_framework.metrics.completion.text_counter module¶

class eval_framework.metrics.completion.text_counter.ParagraphCounter[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Paragraph Count'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.text_counter.ParagraphCounterMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

comparison (str)
paragraph_count (int)
extra_data (Any)

comparison: str¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

paragraph_count: int¶

class eval_framework.metrics.completion.text_counter.ResponseToOriginalLengthRatio[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Response to Original Length Ratio'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.text_counter.SentenceCounter[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Sentence Count'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.text_counter.SentenceCounterMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

comparison (str)
sentence_count (int)
extra_data (Any)

comparison: str¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

sentence_count: int¶

class eval_framework.metrics.completion.text_counter.WordCounter[source]¶

Bases: BaseMetric[Completion]

NAME: str = 'Word Count'¶

calculate(response)[source]¶

Return type:: list[MetricResult]
Parameters:: response (Completion)

class eval_framework.metrics.completion.text_counter.WordCounterMetricContext(**data)[source]¶

Bases: BaseMetricContext

Parameters:

comparison (str)
word_count (int)
extra_data (Any)

comparison: str¶

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

word_count: int¶

Module contents¶