eval_framework.metrics.completion package¶
Submodules¶
eval_framework.metrics.completion.accuracy_completion module¶
- class eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Accuracy Completion'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.aidanbench module¶
- class eval_framework.metrics.completion.aidanbench.AidanBenchMetric[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'AidanBench'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.bleu module¶
- class eval_framework.metrics.completion.bleu.BLEU[source]¶
Bases:
BaseMetric[Completion]
The Bilingual Evaluation Understudy score (BLEU) is a metric for evaluating a generated sentence against a reference sentence. It counts n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram (unigram) is a single token and a bigram is a pair of adjacent words. The comparison is made regardless of word order.
Source: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
Paper: https://www.aclweb.org/anthology/P02-1040/
- NAME: str = 'BLEU'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
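The n-gram matching described for BLEU can be illustrated with a minimal sketch of clipped n-gram precision. This is not the implementation behind the BLEU class (the page does not show it); the function name `ngram_precision` and the whitespace tokenization are assumptions, and a full BLEU score additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of candidate against reference (hypothetical helper)."""
    cand_tokens = candidate.split()  # assumption: plain whitespace tokenization
    ref_tokens = reference.split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


unigram = ngram_precision("the cat sat on the mat", "the cat is on the mat", 1)
bigram = ngram_precision("the cat sat on the mat", "the cat is on the mat", 2)
```

Here five of the six candidate unigrams and three of the five candidate bigrams also occur in the reference.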
- class eval_framework.metrics.completion.bleu.LINEWISE_BLEU[source]¶
Bases:
BaseMetric[Completion]
Maximum line-level BLEU score.
- NAME: str = 'Linewise BLEU'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.bleu.ResponseToOriginalBLEU[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Response to Original BLEU'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.chrf module¶
- class eval_framework.metrics.completion.chrf.CHRF[source]¶
Bases:
BaseMetric[Completion]
chrF++ is a tool for automatic evaluation of machine translation output based on character n-gram precision and recall, enhanced with word n-grams.
Source: https://github.com/m-popovic/chrF
Paper: https://www.aclweb.org/anthology/W15-3049.pdf
- NAME: str = 'chrF'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.chrf.LINEWISE_CHRF[source]¶
Bases:
BaseMetric[Completion]
Maximum line-level chrF++ (character n-gram F-score) score.
Paper: https://aclanthology.org/W15-3049/
- NAME: str = 'Linewise chrF'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.code_assertion module¶
- class eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Code Completion Accuracy'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.code_execution_pass_at_one module¶
- class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionBaseContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
extra_data (Any)
- benchmark_timeout: int¶
- code_prompt: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- package_downloads: dict[str, str | None]¶
- run_env: str¶
- test_code: str¶
- class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'code-execution-pass@1'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOneContext(**data)[source]¶
Bases:
CodeExecutionBaseContext- Parameters:
run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
snippet_merge_fn (str)
output_parse_fn (str)
extra_data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output_parse_fn: str¶
- snippet_merge_fn: str¶
- class eval_framework.metrics.completion.code_execution_pass_at_one.RealtimeCodeExectionContext(**data)[source]¶
Bases:
CodeExecutionBaseContext- Parameters:
run_env (str)
code_prompt (str)
test_code (str)
benchmark_timeout (int)
package_downloads (dict[str, str | None])
snippet_merge_fn (Callable[[str, str], str])
output_parse_fn (Callable[[str], ExecutionResult])
extra_data (Any)
- classmethod from_context(context)[source]¶
- Return type:
Self- Parameters:
context (CodeExecutionPassAtOneContext)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output_parse_fn: Callable[[str], ExecutionResult]¶
- snippet_merge_fn: Callable[[str, str], str]¶
- eval_framework.metrics.completion.code_execution_pass_at_one.estimate_pass_at_k(n, c, k)[source]¶
Estimates pass@k for a single problem.
Parameters:
n (int): Total number of generated samples.
c (int): Number of correct samples.
k (int): Number of attempts or samples considered.
Returns:
float: The pass@k value.
- Return type:
float- Parameters:
n (int)
c (int)
k (int)
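As an illustration of what a pass@k estimate for one problem can look like, here is a minimal sketch of the widely used unbiased estimator 1 - C(n - c, k) / C(n, k). Whether estimate_pass_at_k uses exactly this formula is an assumption, since this page only documents the signature; the name `pass_at_k_sketch` is hypothetical.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct (hypothetical helper)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    # Complement of the chance that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


single_draw = pass_at_k_sketch(n=10, c=3, k=1)
two_draws = pass_at_k_sketch(n=4, c=2, k=2)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n.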
eval_framework.metrics.completion.comet module¶
- class eval_framework.metrics.completion.comet.COMET[source]¶
Bases:
BaseMetric[Completion]
COMET is a neural, multilingual framework for evaluating machine translation quality. It leverages cross-lingual pretrained language models to achieve state-of-the-art correlation with human judgments.
Note: this requires a Hugging Face token with access to the model: https://huggingface.co/Unbabel/XCOMET-XL
Source: https://github.com/Unbabel/COMET
Paper: https://arxiv.org/abs/2009.09025
- NAME: str = 'COMET'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.concordance_index module¶
- class eval_framework.metrics.completion.concordance_index.ConcordanceIndex[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'ConcordanceIndex'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.csv_format module¶
- class eval_framework.metrics.completion.csv_format.CSVFormat[source]¶
Bases:
BaseMetric[Completion]- KEYS: list[str] | None = ['has_csv', 'is_separator_respected', 'is_column_count_respected']¶
- NAME: str = 'CSV Format'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.csv_format.CSVFormatEvaluation(**data)[source]¶
Bases:
BaseModel- Parameters:
implicit (bool)
has_csv (bool)
is_separator_respected (bool)
is_column_count_respected (bool)
- has_csv: bool¶
- implicit: bool¶
- is_column_count_respected: bool¶
- is_separator_respected: bool¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
eval_framework.metrics.completion.cwe_accuracy module¶
- class eval_framework.metrics.completion.cwe_accuracy.CWEAccuracy[source]¶
Bases:
BaseMetric[Completion]
Metric for Common Word Extraction tasks.
- NAME: str = 'CWEAccuracy'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.drop_completion module¶
DROP completion metrics: F1 and exact match.
- class eval_framework.metrics.completion.drop_completion.DropF1ExactMatch[source]¶
Bases:
BaseMetric[Completion]
DROP F1 and exact match. Requires DropMetricContext with answer_tuples.
- KEYS: list[str] | None = ['f1', 'exact_match']¶
- NAME: str = 'DROP F1 / Exact Match'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.drop_completion.DropMetricContext(**data)[source]¶
Bases:
BaseMetricContext
Context for DROP completion metrics. answer_tuples: list of gold answers (each a list of strings).
- Parameters:
answer_tuples (list[list[str]])
extra_data (Any)
- answer_tuples: list[list[str]]¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
eval_framework.metrics.completion.exponential_similarity module¶
- class eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'ExponentialSimilarity'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- eval_framework.metrics.completion.exponential_similarity.calculate_exponential_similarity(p_true, p_pred)[source]¶
Compute the exponential similarity (SpaceDigest version) between the gold percentage and predicted value.
Parameters:
- p_true (float): The gold/reference percentage.
- p_pred (float): The predicted scalar.
- d (float): Base of the exponent. Default is 2.
- c (float): Coefficient in exponent. Default is 10.
Returns:
- float: Similarity score between 0 and 1.
- Return type:
float- Parameters:
p_true (float)
p_pred (float)
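A minimal sketch of an exponential similarity of this kind, assuming the form d^(-c * |p_true - p_pred|) with both values expressed as fractions in [0, 1]. The exact formula and input scale used by calculate_exponential_similarity are not shown on this page, so both are assumptions, as is the name `exponential_similarity_sketch`.

```python
def exponential_similarity_sketch(
    p_true: float, p_pred: float, d: float = 2.0, c: float = 10.0
) -> float:
    """Similarity in (0, 1] that decays exponentially with the absolute error (hypothetical helper)."""
    # Assumption: p_true and p_pred are fractions in [0, 1], not values in [0, 100].
    return d ** (-c * abs(p_true - p_pred))


exact = exponential_similarity_sketch(0.5, 0.5)
off_by_tenth = exponential_similarity_sketch(0.5, 0.6)
worst_case = exponential_similarity_sketch(0.0, 1.0)
```

Under these assumptions, every additional 0.1 of absolute error halves the score when d = 2 and c = 10.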
eval_framework.metrics.completion.f1 module¶
- class eval_framework.metrics.completion.f1.F1[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'F1'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.format_checker module¶
- class eval_framework.metrics.completion.format_checker.CheckJsonFormat[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'JSON Format'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.format_checker.CheckPostScriptFormat[source]¶
Bases:
BaseMetric[Completion]
This metric is honestly not that great. In the original IFEval implementation it just checks whether the text contains the string (P.)P.S. or variants thereof such as p. s. It doesn’t check for parsing.
- NAME: str = 'Postscript Format'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.grid_difference module¶
- class eval_framework.metrics.completion.grid_difference.GridDifference[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'grid_difference'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- calculate_score(output_ground_truth_difference_count, input_ground_truth_difference_count)[source]¶
- Return type:
float- Parameters:
output_ground_truth_difference_count (int)
input_ground_truth_difference_count (int)
eval_framework.metrics.completion.ifeval module¶
- class eval_framework.metrics.completion.ifeval.IFEvalMetric[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'IFEval'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.ifeval.IFEvalMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
key (int)
instruction_id_list (list[str])
prompt (str)
additional_kwargs (list[dict[str, Any]])
extra_data (Any)
- additional_kwargs: list[dict[str, Any]]¶
- instruction_id_list: list[str]¶
- key: int¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- prompt: str¶
eval_framework.metrics.completion.json_format module¶
- class eval_framework.metrics.completion.json_format.JsonFormat[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'JSON Format'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.json_format.JsonFormatEvaluation(**data)[source]¶
Bases:
BaseModel- Parameters:
is_just_json (bool)
is_valid_json (bool)
fulfills_schema (bool | None)
exact_match (bool | None)
json_parsing_error (str | None)
schema_validation_error (str | None)
- exact_match: bool | None¶
- fulfills_schema: bool | None¶
- is_just_json: bool¶
- is_valid_json: bool¶
- json_parsing_error: str | None¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- schema_validation_error: str | None¶
- eval_framework.metrics.completion.json_format.get_json_object(text)[source]¶
Extract the first valid JSON object or array from text.
This function handles nested brackets properly by using a bracket counting approach to find complete JSON structures, rather than using regex which can incorrectly match outer brackets containing non-JSON content.
- Return type:
str- Parameters:
text (str)
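The bracket counting approach can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the source of get_json_object: the name `first_json_block`, the handling of string literals, and the empty-string result when nothing is found are all hypothetical.

```python
import json


def first_json_block(text: str) -> str:
    """Return the first substring that parses as a JSON object or array, else '' (hypothetical helper)."""
    closers = {"{": "}", "[": "]"}
    for start, char in enumerate(text):
        if char not in closers:
            continue
        stack = []          # expected closing brackets, innermost last
        in_string = False   # brackets inside JSON strings must not be counted
        escaped = False
        for end in range(start, len(text)):
            cur = text[end]
            if in_string:
                if escaped:
                    escaped = False
                elif cur == "\\":
                    escaped = True
                elif cur == '"':
                    in_string = False
                continue
            if cur == '"':
                in_string = True
            elif cur in closers:
                stack.append(closers[cur])
            elif cur in "}]":
                if not stack or stack.pop() != cur:
                    break  # mismatched bracket: give up on this start position
                if not stack:
                    candidate = text[start:end + 1]
                    try:
                        json.loads(candidate)
                    except json.JSONDecodeError:
                        break  # balanced but not JSON, e.g. "[note]"
                    return candidate
    return ""


nested = first_json_block('Result: {"a": [1, 2, {"b": "}"}]} trailing text')
skipped = first_json_block('see [note] then {"k": 1}')
```

The second call shows why counting alone is not enough: `[note]` is balanced but is not JSON, so the sketch validates each balanced candidate with `json.loads` and moves on to the next opening bracket.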
eval_framework.metrics.completion.language_checker module¶
- class eval_framework.metrics.completion.language_checker.GermanCompletionChecker[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'German Completion Check'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.language_checker.LanguageChecker[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Language Check'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.language_checker.LanguageConsistencyChecker[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Language Consistency'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Language Consistency Raw'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.length_control module¶
- class eval_framework.metrics.completion.length_control.LengthControl(tolerance=0.16666666666666666)[source]¶
Bases:
BaseMetric[Completion]- Parameters:
tolerance (float)
- NAME: str = 'length_control'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.math_minerva_completion module¶
Minerva-style MATH completion metric: exact_match and exact_match_flex.
- class eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletion(use_cot=True, cot_style='minerva', relaxed=False)[source]¶
Bases:
BaseMetric[Completion]
Minerva MATH: reports Exact Match and Exact Match (Flex). Uses raw_completion to extract multiple candidates: the primary candidate is used for exact_match, while exact_match_flex checks all candidates with both Minerva and Hendrycks equivalence.
- Parameters:
use_cot (bool)
cot_style (str)
relaxed (bool)
- NAME: str = 'Math Minerva Completion'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletionRelaxed(use_cot=True, cot_style='minerva', relaxed=True)[source]¶
Bases:
MathMinervaCompletion
MathMinervaCompletion with relaxed=True by default (flexible final-answer matching).
- Parameters:
use_cot (bool)
cot_style (str)
relaxed (bool)
eval_framework.metrics.completion.math_reasoning_completion module¶
- class eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Math Reasoning Completion (symbolic)'¶
- REMOVED_EXPRESSIONS_FORMAT = ['\\text{s}', '\\text{.}', '\\text{\ns}', '\\text{}^2', '\\text{}^3', '\\text{\n}', '\\text{}', '\\mathrm{th}', '^\\circ', '^{\\circ}', '\\;', ',\\!', '{,}', '"', '\\dots']¶
- REMOVED_EXPRESSIONS_UNITS = ['square', 'ways', 'integers', 'dollars', 'mph', 'inches', 'ft', 'hours', 'km', 'units', '\\ldots', 'sue', 'points', 'feet', 'minutes', 'digits', 'cents', 'degrees', 'cm', 'gm', 'pounds', 'meters', 'meals', 'edges', 'students', 'childrentickets', 'multiples']¶
- SUBSTITUTIONS = [('\\ban\\b(?!\\w)', ''), ('\\ba\\b(?!\\w)', ''), ('\\.\\$', '$'), ('\\\\\\$', ''), ('\\\\ ', ''), ('\\s+', ''), ('\\\\mbox', 'text'), (',\\\\text\\{and\\}', ','), ('\\\\text\\{and\\}', ','), ('\\\\text\\{m\\}', '\\text{}')]¶
- calculate(response)[source]¶
Calculate the accuracy of the completion.
Performs several verification and simplification steps to ensure that the completion is correct. The completion may be either a LaTeX or a plain string response, which sympy will parse, factor, and simplify.
- Parameters:
response (Completion) – Completion object
- Return type:
list[MetricResult]
- Returns:
list of MetricResult
- check_for_equation(final_answer)[source]¶
Check if the final answer is an equation and split it into left-hand side and right-hand side.
- Parameters:
final_answer (str) – the expression to evaluate
- Return type:
list
- Returns:
list of left-hand side and right-hand side of the equation
- normalize_expression(final_answer)[source]¶
Normalize a LaTeX expression.
- Parameters:
final_answer (str) – raw LaTeX expression
- Return type:
str
- Returns:
normalized LaTeX expression
NOTE: The logic was changed because the previous substitution replaced characters anywhere in the string, e.g., it turned “infty” into “iny” by removing “ft”.
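The pitfall in the note can be reproduced with a short sketch. It contrasts a plain substring replacement with a letter-boundary-aware regular expression, which is one possible way to avoid the problem; the actual fix inside normalize_expression may differ. `UNITS` and both helper names are hypothetical.

```python
import re

UNITS = ["ft", "cm"]  # hypothetical subset of unit strings to strip


def strip_units_naive(expr: str) -> str:
    """Remove unit strings wherever they occur, even inside other words."""
    for unit in UNITS:
        expr = expr.replace(unit, "")
    return expr


def strip_units_bounded(expr: str) -> str:
    """Remove unit strings only when they are not surrounded by other letters."""
    for unit in UNITS:
        expr = re.sub(rf"(?<![A-Za-z]){re.escape(unit)}(?![A-Za-z])", "", expr)
    return expr


broken = strip_units_naive(r"\infty")      # the "ft" inside "infty" is removed
preserved = strip_units_bounded(r"\infty")  # left untouched
stripped = strip_units_bounded("12 ft")     # the standalone unit is removed
```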
eval_framework.metrics.completion.minerva_math_utils module¶
Minerva-style MATH answer extraction and equivalence (Lewkowycz et al. 2022).
- eval_framework.metrics.completion.minerva_math_utils.extract_answers(raw_answer, use_cot=True, cot_style='minerva', relaxed=False)[source]¶
Extract multiple candidate answers from model output (for exact_match and exact_match_flex). Returns list of normalized strings; first is primary for exact_match. When relaxed=True, final-answer string matching is more lenient (whitespace/case).
- Return type:
list[str]- Parameters:
raw_answer (str)
use_cot (bool)
cot_style (str)
relaxed (bool)
- eval_framework.metrics.completion.minerva_math_utils.get_unnormalized_answer(text, relaxed=False)[source]¶
Extract answer from Minerva ‘Final Answer: The final answer is … I hope it is correct.’
When relaxed=False, the pattern matches lm-evaluation-harness (lm_eval.tasks.minerva_math.utils) for parity: exact capitalization, no flexible whitespace. When relaxed=True, it accepts any capitalization of:
“Final Answer: The answer is ” / “Final Answer: The final answer is ”
“The Final Answer: The answer is ” / “The Final Answer: The final answer is ”
with flexible whitespace; no suffix is required, but “I hope it is correct.” is stripped when present.
- Return type:
str- Parameters:
text (str)
relaxed (bool)
- eval_framework.metrics.completion.minerva_math_utils.is_equiv_hendrycks(str1, str2)[source]¶
String equality after Hendrycks strip_string.
- Return type:
bool- Parameters:
str1 (str | None)
str2 (str | None)
- eval_framework.metrics.completion.minerva_math_utils.is_equiv_minerva(x1, x2, timeout_seconds=5)[source]¶
Sympy-based equivalence (Minerva).
- Return type:
bool- Parameters:
x1 (str)
x2 (str)
timeout_seconds (int)
- eval_framework.metrics.completion.minerva_math_utils.last_boxed_only_string(string)[source]¶
Extract the last \boxed{…} or \fbox{…} from string.
- Return type:
str|None- Parameters:
string (str)
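A minimal sketch of extracting the last boxed expression by brace counting, assuming the markers are the LaTeX commands \boxed and \fbox and that the whole command including its braces is returned. The real last_boxed_only_string may handle more variants; `last_boxed_sketch` is a hypothetical name.

```python
from typing import Optional


def last_boxed_sketch(text: str) -> Optional[str]:
    """Return the last \\boxed{...} or \\fbox{...} span in text, or None (hypothetical helper)."""
    start = max(text.rfind("\\boxed{"), text.rfind("\\fbox{"))
    if start < 0:
        return None
    depth = 0
    # Walk forward from the opening brace and stop when it is balanced again.
    for i in range(text.index("{", start), len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # the opening brace was never closed


answer = last_boxed_sketch(r"so \boxed{1} and finally \boxed{\frac{1}{2}}.")
missing = last_boxed_sketch("no box here")
```

Counting braces rather than stopping at the first `}` keeps nested arguments such as `\frac{1}{2}` intact.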
- eval_framework.metrics.completion.minerva_math_utils.normalize_final_answer(final_answer)[source]¶
Normalize a final answer (appendix D of Lewkowycz et al. 2022).
- Return type:
str- Parameters:
final_answer (str)
- eval_framework.metrics.completion.minerva_math_utils.normalized_gold_from_solution(solution)[source]¶
Extract and normalize the gold answer from a solution string (last \boxed{…}).
- Return type:
str|None- Parameters:
solution (str)
eval_framework.metrics.completion.niah_accuracy module¶
- class eval_framework.metrics.completion.niah_accuracy.NIAHAccuracy[source]¶
Bases:
BaseMetric[Completion]
Metric for Needle in a Haystack tasks.
- NAME: str = 'NIAHAccuracy'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.placeholder_checker module¶
- class eval_framework.metrics.completion.placeholder_checker.PlaceholderChecker[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Placeholder Check'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.placeholder_checker.PlaceholderCheckerMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
num_placeholders (int)
extra_data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- num_placeholders: int¶
eval_framework.metrics.completion.repetition module¶
- class eval_framework.metrics.completion.repetition.WordRepetition(window_size=128, min_repetitions=1)[source]¶
Bases:
BaseMetric[Completion]
Word Repetition Metric
This metric checks for repetitions of words in the completion text for a given window size and repetition threshold. The window size defines the consecutive word count to consider a repetition, and min_repetitions specifies the minimum repetition count that triggers the metric. This metric returns 0.0 if no repetitions are found, and 1.0 if a sufficient number of repetitions are found. For example, if the completion contains a two-word sequence that repeats once (such as “hello world hello world”), this metric would trigger with a window size of 2 and min_repetitions set to 1.
- Parameters:
window_size (int)
min_repetitions (int)
- HIGHER_IS_BETTER: Final[bool] = False¶
- NAME: str = 'WordRepetition'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
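The WordRepetition behaviour described above can be sketched as follows. The sketch assumes whitespace tokenization and counts a word sequence as repeated whenever the same window occurs again anywhere in the text; whether the class requires the repeats to be adjacent is not stated here, and `word_repetition_sketch` is a hypothetical name.

```python
from collections import Counter


def word_repetition_sketch(
    text: str, window_size: int = 128, min_repetitions: int = 1
) -> float:
    """Return 1.0 if some window_size-word sequence repeats at least min_repetitions times, else 0.0."""
    words = text.split()  # assumption: whitespace tokenization
    windows = Counter(
        tuple(words[i:i + window_size])
        for i in range(len(words) - window_size + 1)
    )
    # A sequence seen r + 1 times has been repeated r times.
    triggered = any(count - 1 >= min_repetitions for count in windows.values())
    return 1.0 if triggered else 0.0


hit = word_repetition_sketch("hello world hello world", window_size=2, min_repetitions=1)
miss = word_repetition_sketch("hello world goodbye world", window_size=2, min_repetitions=1)
```

Because HIGHER_IS_BETTER is False for this metric, a value of 1.0 signals an undesirable completion.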
eval_framework.metrics.completion.rouge_1 module¶
- class eval_framework.metrics.completion.rouge_1.ROUGE_1[source]¶
Bases:
BaseMetric[Completion]ROUGE-1
- NAME: str = 'ROUGE-1'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.rouge_2 module¶
- class eval_framework.metrics.completion.rouge_2.ROUGE_2[source]¶
Bases:
BaseMetric[Completion]ROUGE-2
- NAME: str = 'ROUGE-2'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.rouge_geometric_mean module¶
- class eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN[source]¶
Bases:
BaseMetric[Completion]ROUGE Geometric Mean
- NAME: str = 'ROUGE-Geometric-Mean'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.rouge_l module¶
- class eval_framework.metrics.completion.rouge_l.ROUGE_L[source]¶
Bases:
BaseMetric[Completion]ROUGE-L
- NAME: str = 'ROUGE-L'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.struct_eval_metrics module¶
- class eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric[source]¶
Bases:
StructMetric- NAME: str = 'RenderableStructMetric'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
output_type (str)
keywords (list[str])
extra_data (Any)
- keywords: list[str]¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output_type: str¶
- class eval_framework.metrics.completion.struct_eval_metrics.StructMetric[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'StructMetric'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.struct_eval_metrics.StructMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
output_type (str)
paths (list[str])
extra_data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output_type: str¶
- paths: list[str]¶
- eval_framework.metrics.completion.struct_eval_metrics.is_valid_html(html)[source]¶
- Return type:
bool- Parameters:
html (str)
- eval_framework.metrics.completion.struct_eval_metrics.path_exists(data, path)[source]¶
Check if a path exists in a structured data object.
- Parameters:
data (Any) – The structured data to check
path (str) – The path to check (dot notation)
- Return type:
bool
- Returns:
True if path exists, False otherwise
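A minimal sketch of a dot-notation path check over nested dictionaries and lists. Treating numeric segments as list indices is an assumption, and the real path_exists may support additional syntax; `path_exists_sketch` is a hypothetical name.

```python
from typing import Any


def path_exists_sketch(data: Any, path: str) -> bool:
    """Return True if every dot-separated segment of path can be followed in data."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            # Assumption: numeric segments index into lists.
            current = current[int(key)]
        else:
            return False
    return True


document = {"user": {"emails": ["a@example.com"], "name": "Ada"}}
found = path_exists_sketch(document, "user.emails.0")
absent = path_exists_sketch(document, "user.address.city")
```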
eval_framework.metrics.completion.ter module¶
- class eval_framework.metrics.completion.ter.LINEWISE_TER[source]¶
Bases:
BaseMetric[Completion]
Minimum line-level TER (Translation Edit Rate) score.
- NAME: str = 'Linewise TER'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.ter.TER[source]¶
Bases:
BaseMetric[Completion]
Translation Edit Rate (TER) is an error metric for machine translation that measures the number of edits required to change a system output into one of the references.
Source: http://www.cs.umd.edu/~snover/tercom/
Paper: http://mt-archive.info/AMTA-2006-Snover.pdf
- NAME: str = 'TER'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
eval_framework.metrics.completion.text_counter module¶
- class eval_framework.metrics.completion.text_counter.ParagraphCounter[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Paragraph Count'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.text_counter.ParagraphCounterMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
comparison (str)
paragraph_count (int)
extra_data (Any)
- comparison: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- paragraph_count: int¶
- class eval_framework.metrics.completion.text_counter.ResponseToOriginalLengthRatio[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Response to Original Length Ratio'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.text_counter.SentenceCounter[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Sentence Count'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.text_counter.SentenceCounterMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
comparison (str)
sentence_count (int)
extra_data (Any)
- comparison: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- sentence_count: int¶
- class eval_framework.metrics.completion.text_counter.WordCounter[source]¶
Bases:
BaseMetric[Completion]- NAME: str = 'Word Count'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]- Parameters:
response (Completion)
- class eval_framework.metrics.completion.text_counter.WordCounterMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
comparison (str)
word_count (int)
extra_data (Any)
- comparison: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- word_count: int¶