How to Add a New Benchmark to Eval Framework¶
This guide provides comprehensive instructions for adding new benchmarks to the eval-framework, including all possible configuration options and attributes.
Overview¶
The eval-framework supports two response types:
Completion Tasks - Generate text completions (e.g., math problems, code generation)
Loglikelihood Tasks - Multiple choice questions where the model ranks answer options
For detailed information about implementing each task type, please refer to:
Completion Task Guide - Comprehensive guide for text generation tasks
Loglikelihood Task Guide - Detailed guide for multiple choice tasks
Understanding the Base Task Structure¶
All benchmarks inherit from BaseTask[SubjectType] and must implement several required methods and class attributes.
Required Class Attributes¶
class YourBenchmark(BaseTask[str]):  # or BaseTask[Enum] for multiple subjects
    # === CORE CONFIGURATION ===
    NAME: str                        # Display name for the benchmark
    DATASET_PATH: str                # HuggingFace dataset path or local path
    SAMPLE_SPLIT: str                # Dataset split for evaluation samples
    FEWSHOT_SPLIT: str               # Dataset split for few-shot examples
    RESPONSE_TYPE: ResponseType      # COMPLETION or LOGLIKELIHOODS
    METRICS: list[type[BaseMetric]]  # List of metric classes to compute
    SUBJECTS: list[SubjectType]      # List of subjects/categories to evaluate

    # === OPTIONAL CONFIGURATION ===
    HF_REVISION: str | None = None   # Git revision for reproducibility
    PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None  # Words to protect from perturbation
    LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = None  # Language(s) tested
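As the class signature above hints, a benchmark that spans several subjects can parameterize BaseTask with an Enum instead of str and list the enum members in SUBJECTS. A minimal sketch of how the attributes might then look; the Continent enum and the dataset path are illustrative, not part of the framework:

from enum import Enum

from eval_framework.metrics.completion.accuracy_completion import AccuracyCompletion
from eval_framework.models.sample import ResponseType
from eval_framework.tasks.base import BaseTask


class Continent(Enum):
    # Hypothetical subject enum, for illustration only.
    EUROPE = "Europe"
    ASIA = "Asia"


class MultiSubjectBenchmark(BaseTask[Continent]):
    NAME = "MultiSubjectBenchmark"
    DATASET_PATH = "example/multi_subject"  # hypothetical dataset path
    SAMPLE_SPLIT = "test"
    FEWSHOT_SPLIT = "train"
    RESPONSE_TYPE = ResponseType.COMPLETION
    METRICS = [AccuracyCompletion]
    SUBJECTS = list(Continent)  # one entry per subject/category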
Required Methods to Implement¶
def _get_instruction_text(self, item: dict[str, Any]) -> str:
    """Generate the instruction/question text for a sample."""
    pass

def _get_ground_truth(self, item: dict[str, Any]) -> str | None | list[str]:
    """Extract the correct answer(s) from a dataset item."""
    pass
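For example, for a hypothetical dataset whose items look like {"question": ..., "answers": [...]}, these two methods could be implemented roughly as follows; returning a list from _get_ground_truth is a way to declare several acceptable answers (the field names are illustrative):

def _get_instruction_text(self, item: dict[str, Any]) -> str:
    """Build the prompt text from the raw dataset item."""
    return f"Question: {item['question']}"

def _get_ground_truth(self, item: dict[str, Any]) -> list[str]:
    """Return every acceptable answer for this item."""
    return [answer.strip() for answer in item["answers"]]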
Optional Methods to Override¶
def _get_initial_prompt_text(self, item: dict[str, Any]) -> str:
    """Text to prepend to the first message."""
    return ""

def _get_system_prompt_text(self, item: dict[str, Any]) -> str | None:
    """System message content."""
    return None

def _get_cue_text(self, item: dict[str, Any]) -> str:
    """Text to append as assistant cue (e.g., 'Answer:')."""
    return ""

def _get_possible_completions(self, item: dict[str, Any]) -> list[str] | None:
    """For loglikelihood tasks: list of answer choices."""
    return None

def _get_fewshot_target_text(self, item: dict[str, Any]) -> str:
    """Target text for few-shot examples."""
    target = self._get_ground_truth(item)
    assert target is not None and isinstance(target, str)
    return target

def _get_context(self, item: dict[str, Any]) -> BaseMetricContext | list[BaseMetricContext] | None:
    """Additional parameters for evaluation metrics."""
    return None

def _sample_fewshot_examples(self, item: dict[str, Any]) -> list[dict]:
    """Custom few-shot sampling logic."""
    # Default implementation samples randomly from FEWSHOT_SPLIT
    pass

def _create_samples(self, item: dict[str, Any], index: int, subject: str) -> list[Sample]:
    """Create one or more samples from a dataset item."""
    # Default creates single sample - override for multi-sample items
    pass

def post_process_generated_completion(self, completion_text: str, sample: Sample | None = None) -> str:
    """Post-process model completions (e.g., extract final answer)."""
    return completion_text
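Two of these hooks are overridden particularly often. The sketch below shows what that might look like for a multiple-choice item with a hypothetical choices field, and for a completion task whose prompts ask the model to put its final answer after a '####' marker (both the field name and the marker are illustrative):

def _get_possible_completions(self, item: dict[str, Any]) -> list[str] | None:
    """Offer one candidate completion per answer choice for loglikelihood ranking."""
    # A leading space is a common convention so each choice reads naturally after the cue text.
    return [f" {choice}" for choice in item["choices"]]

def post_process_generated_completion(self, completion_text: str, sample: Sample | None = None) -> str:
    """Keep only the text after the last '####' marker, if one is present."""
    if "####" in completion_text:
        return completion_text.rsplit("####", 1)[-1].strip()
    return completion_text.strip()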
Response Types, Metrics, and Configuration Attributes Reference¶
This section provides a complete reference for all configurations available when creating benchmarks.
Response Types¶
The response type determines how your model interacts with the task and what type of output is expected.
from eval_framework.models.sample import ResponseType
# For text generation tasks
RESPONSE_TYPE = ResponseType.COMPLETION
# For multiple choice tasks
RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
All Available Metrics¶
Metrics define how your task’s outputs are evaluated and scored. Choose metrics that align with your response type and evaluation goals.
Completion Metrics¶
These metrics work with generated text outputs from COMPLETION tasks:
# Accuracy metrics
from eval_framework.metrics.completion.accuracy_completion import AccuracyCompletion
from eval_framework.metrics.completion.math_reasoning_completion import MathReasoningCompletion
from eval_framework.metrics.completion.cwe_accuracy import CWEAccuracy
# Translation and similarity metrics
from eval_framework.metrics.completion.bleu import BLEU
from eval_framework.metrics.completion.chrf import CHRF
from eval_framework.metrics.completion.ter import TER
from eval_framework.metrics.completion.rouge_1 import ROUGE_1
from eval_framework.metrics.completion.rouge_2 import ROUGE_2
from eval_framework.metrics.completion.rouge_l import ROUGE_L
from eval_framework.metrics.completion.rouge_geometric_mean import ROUGE_GEOMETRIC_MEAN
from eval_framework.metrics.completion.f1 import F1
# Code evaluation metrics
from eval_framework.metrics.completion.code_assertion import CodeCompletionAssertion
from eval_framework.metrics.completion.code_execution_pass_at_one import CodeExecutionPassAtOne
# Format validation metrics
from eval_framework.metrics.completion.json_format import JsonFormat
from eval_framework.metrics.completion.csv_format import CSVFormat
from eval_framework.metrics.completion.format_checker import CheckJsonFormat
from eval_framework.metrics.completion.format_checker import CheckPostScriptFormat
# Specialized metrics
from eval_framework.metrics.completion.ifeval import IFEvalMetric
from eval_framework.metrics.completion.language_checker import LanguageChecker
from eval_framework.metrics.completion.length_control import LengthControl
from eval_framework.metrics.completion.niah_accuracy import NIAHAccuracy
from eval_framework.metrics.completion.text_counter import WordCounter
from eval_framework.metrics.completion.text_counter import ParagraphCounter
from eval_framework.metrics.completion.text_counter import ResponseToOriginalLengthRatio
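Several completion metrics can be combined on one task by listing them together in METRICS; each listed metric class is computed for the task's samples. A sketch with a hypothetical task name:

from eval_framework.metrics.completion.bleu import BLEU
from eval_framework.metrics.completion.rouge_l import ROUGE_L
from eval_framework.tasks.base import BaseTask


class SummarizationBenchmark(BaseTask[str]):  # hypothetical task, for illustration only
    ...
    METRICS = [ROUGE_L, BLEU]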
Loglikelihood Metrics¶
These metrics work with probability rankings from LOGLIKELIHOODS tasks:
# Standard accuracy metrics
from eval_framework.metrics.loglikelihood.accuracy_loglikelihood import AccuracyLoglikelihood
from eval_framework.metrics.loglikelihood.accuracy_loglikelihood import AccuracyNormLoglikelihood
# Probability metrics
from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMass
from eval_framework.metrics.loglikelihood.probability_mass import ProbabilityMassNorm
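These metrics pair with RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS and a _get_possible_completions override that supplies the answer choices. A sketch with a hypothetical task name:

from eval_framework.metrics.loglikelihood.accuracy_loglikelihood import (
    AccuracyLoglikelihood,
    AccuracyNormLoglikelihood,
)
from eval_framework.models.sample import ResponseType
from eval_framework.tasks.base import BaseTask


class MultipleChoiceBenchmark(BaseTask[str]):  # hypothetical task, for illustration only
    ...
    RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS
    METRICS = [AccuracyLoglikelihood, AccuracyNormLoglikelihood]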
LLM Judge Metrics¶
These metrics use another LLM to evaluate generated outputs, useful for complex or subjective tasks:
from eval_framework.metrics.llm.llm_judge_chatbot_style import LLMJudgeChatbotStyle
# Classifies whether a text generation model's response follows a chatbot-style format by evaluating characteristics like friendly introductions, verbose language, follow-up questions, and conversational fluff, returning a boolean classification with reasoning. (English and German)
from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy
# Evaluates whether a text generation model's response matches the target response by assessing if the generation addresses the question correctly and provides the right information, returning a boolean classification. (English)
from eval_framework.metrics.llm.llm_judge_conciseness import LLMJudgeConciseness
# Assesses if the model's response is concise by evaluating whether it is brief and to the point without unnecessary details or elaboration, returning a boolean classification with reasoning. (English and German)
from eval_framework.metrics.llm.llm_judge_contains_names import LLMJudgeAvoidsNames
# Checks whether the model's generated text contains names of individuals, returning a boolean classification. (English, French and German)
from eval_framework.metrics.llm.llm_judge_format_correctness import LLMJudgeFormatCorrectness
# Evaluates the correctness and quality of models' outputs by checking if they comply with format specifications stated in the prompt, returning a detailed assessment with format correctness status (1 for correct, 0 for incorrect) and reasoning. (English)
from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction
# Evaluates how well the response from a text generation model fulfills the requirements of a given user instruction on a fixed set of criteria, where the user cannot ask follow-up questions to the model. (English, Finnish and German)
from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePair
# Evaluates the quality of two AI assistant responses by acting as an impartial judge, comparing factors like helpfulness, relevance, accuracy, and depth to determine which assistant provides a better answer (A wins, B wins, or tie). (English, Finnish and German)
from eval_framework.metrics.llm.llm_judge_mtbench_single import MTBenchJudgeSingle
# Evaluates a single AI assistant response on a 1-10 scale, considering factors like helpfulness, relevance, accuracy, depth, creativity, and detail level to provide an objective quality rating. (English, Finnish and German)
from eval_framework.metrics.llm.llm_judge_refusal import LLMJudgeRefusal
# Assesses whether the assistant's response is a refusal to answer the question. (English)
from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql
# Evaluates SQL query quality against specified requirements, rating efficiency and accuracy on an A-F scale and providing detailed reasoning for the grade assigned. (English and German)
from eval_framework.metrics.llm.llm_judge_world_knowledge import LLMJudgeWorldKnowledge
# Evaluates whether a summary contains information that goes beyond the reference text (also known as "world knowledge"), returning a boolean classification with detailed reasoning for the assessment. (English, French and German)
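Because judge metrics score generated text, they are used with COMPLETION tasks and can sit alongside string-based metrics in the same METRICS list. A sketch with a hypothetical task name (how the judge model itself is supplied depends on your evaluation setup):

from eval_framework.metrics.completion.accuracy_completion import AccuracyCompletion
from eval_framework.metrics.llm.llm_judge_completion_accuracy import LLMJudgeCompletionAccuracy
from eval_framework.tasks.base import BaseTask


class OpenEndedQABenchmark(BaseTask[str]):  # hypothetical task, for illustration only
    ...
    METRICS = [AccuracyCompletion, LLMJudgeCompletionAccuracy]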
Implementation Examples and Patterns¶
Practical Example: GeographyQATask¶
A worked example of creating a simple geography question-answering benchmark.
Dataset Structure: Each item looks like {"country": "Germany", "capital": "Berlin"}
from typing import Any
from eval_framework.tasks.base import BaseTask
from eval_framework.models.sample import ResponseType
from eval_framework.metrics.completion.accuracy_completion import AccuracyCompletion
class GeographyQATask(BaseTask[str]):
    # Required class attributes
    NAME = "GeographyQA"
    DATASET_PATH = "example/geography_qa"
    SAMPLE_SPLIT = "test"
    FEWSHOT_SPLIT = "train"
    RESPONSE_TYPE = ResponseType.COMPLETION
    METRICS = [AccuracyCompletion]
    SUBJECTS = ["Europe", "Asia"]

    def _get_instruction_text(self, item: dict[str, Any]) -> str:
        """Format the question from the dataset item."""
        return f"Q: What's the capital of {item['country']}?"

    def _get_ground_truth(self, item: dict[str, Any]) -> str:
        """Extract the correct answer from the dataset item."""
        return f"A: {item['capital']}."

    def _get_system_prompt_text(self, item: dict[str, Any]) -> str:
        """Provide context about the task."""
        return "Answer the geography questions accurately."

    def _get_cue_text(self, item: dict[str, Any]) -> str:
        """Start the model's response with 'A:'."""
        return "A:"

    def _sample_fewshot_examples(self, item: dict[str, Any]) -> list[dict]:
        """Sample random examples from the training split."""
        return self.rnd.sample(self.dataset[self.FEWSHOT_SPLIT], self.num_fewshot)
Add to Task Registry¶
Add a registration call for your new benchmark to register_all_tasks in src/eval_framework/tasks/task_names.py:
register_lazy_task("eval_framework.tasks.benchmarks.geographyqa.GeographyQATask")
The task will now be available through get_task("GeographyQA").
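For example, assuming get_task lives in the same eval_framework.tasks.task_names module as register_all_tasks:

from eval_framework.tasks.task_names import get_task

geography_qa = get_task("GeographyQA")  # looks up the lazily registered benchmark by name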
Testing Your Benchmark¶
All tasks automatically go through formatting tests to ensure proper prompt generation. However, if your benchmark has specific functionality that needs testing, create a dedicated test file.
Automatic Formatting Tests¶
All benchmarks are automatically tested for proper prompt formatting across different chat templates. No additional setup required.
Custom Task Tests (Optional)¶
If your benchmark has specific logic that needs testing, create a test file in tests/tasks/ to test it.
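A minimal sketch of such a test, assuming the prompt-formatting helpers of GeographyQATask can be exercised on a plain dict; __new__ is used here purely to skip __init__ so the test does not need to load the dataset:

# tests/tasks/test_geography_qa.py (illustrative)
from eval_framework.tasks.benchmarks.geographyqa import GeographyQATask


def test_geography_qa_formatting() -> None:
    # Bypass __init__ so no dataset download is required for this unit test.
    task = GeographyQATask.__new__(GeographyQATask)
    item = {"country": "Germany", "capital": "Berlin"}

    assert task._get_instruction_text(item) == "Q: What's the capital of Germany?"
    assert task._get_ground_truth(item) == "A: Berlin."
    assert task._get_cue_text(item) == "A:"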
Benchmark Examples by Task Type¶
Study these existing benchmarks in the codebase for more complex patterns:
Simple Classification Tasks¶
ARC (src/eval_framework/tasks/arc.py): Multiple choice with loglikelihoods
MMLU (src/eval_framework/tasks/mmlu.py): Multi-subject classification with enum subjects
Reasoning Tasks¶
GSM8K (src/eval_framework/tasks/gsm8k.py): Math reasoning with answer extraction patterns
Code Generation¶
HumanEval (src/eval_framework/tasks/human_eval.py): Code completion with execution validation
MBPP (src/eval_framework/tasks/mbpp.py): Code generation with comprehensive test validation
Long Context Tasks¶
InfiniteBench (src/eval_framework/tasks/infinite_bench_tasks.py): Long context reasoning tasks
Custom Format Tasks¶
IFEval (src/eval_framework/tasks/ifeval.py): Instruction following with format validation
JSON/CSV Tasks: Custom format validation examples