# Overview Dataloading
To evaluate models on benchmarks, we define custom tasks that inherit from `BaseTask` to handle dataset loading and formatting. The framework supports two main evaluation types: completion tasks (text generation) and loglikelihood tasks (multiple-choice ranking).
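The two modes map onto the `ResponseType` values that each task declares (using the import path shown in the task examples below):

```python
from eval_framework.models.sample import ResponseType

ResponseType.COMPLETION      # free-form text generation, compared against a ground-truth answer
ResponseType.LOGLIKELIHOODS  # score a fixed list of answer choices and rank them
```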
## Core Data Types
The framework uses different data types based on the evaluation approach:
```python
from pydantic import BaseModel

from eval_framework.shared.types import Completion, Loglikelihood, RawCompletion, RawLoglikelihood
from template_formatting.formatter import Message, Role

# For completion tasks (text generation)
class Completion(BaseModel):
    completion_text: str  # Generated text from the model
    # Additional fields based on actual implementation

# For loglikelihood tasks (multiple choice)
class Loglikelihood(BaseModel):
    loglikelihoods: list[float]  # Log-probability scores for each choice
    # Additional fields based on actual implementation

# Raw response types from LLMs
class RawCompletion(BaseModel):
    text: str  # Raw generated text
    # Additional fields based on actual implementation

class RawLoglikelihood(BaseModel):
    loglikelihoods: list[float]  # Raw log-probability scores
    # Additional fields based on actual implementation
```
## Message Structure
Each prompt is structured as a sequence of messages using the template formatting system:
```python
from pydantic import BaseModel

from template_formatting.formatter import Message, Role

class Message(BaseModel):
    role: Role    # SYSTEM, USER, or ASSISTANT
    content: str  # Message content
    # Additional fields based on actual formatter implementation
```
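For example, a single user turn is constructed like this (a minimal sketch; only the `role` and `content` fields shown above are assumed):

```python
question_turn = Message(role=Role.USER, content="Question: What is the capital of France?")
```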
## Task Implementation Pattern

Custom tasks inherit from `BaseTask` and implement specific methods based on their evaluation type:
### For Completion Tasks
```python
from eval_framework.tasks.base import BaseTask
from eval_framework.models.sample import ResponseType

class MyCompletionTask(BaseTask[str]):
    NAME = "My Task"
    DATASET_PATH = "dataset_name"
    RESPONSE_TYPE = ResponseType.COMPLETION

    def _get_instruction_text(self, item: dict) -> str:
        """Format the question/instruction."""
        return f"Question: {item['question']}"

    def _get_ground_truth(self, item: dict) -> str:
        """Return the expected answer."""
        return item['answer']
```
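To make the mapping concrete, here is a hypothetical dataset item and what the two methods above return for it (the keys `question` and `answer` simply match the accessors used in the sketch):

```python
item = {"question": "What is 2 + 2?", "answer": "4"}

# _get_instruction_text(item) -> "Question: What is 2 + 2?"
# _get_ground_truth(item)     -> "4"
```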
### For Loglikelihood Tasks
```python
class MyLoglikelihoodTask(BaseTask[str]):
    NAME = "My Task"
    DATASET_PATH = "dataset_name"
    RESPONSE_TYPE = ResponseType.LOGLIKELIHOODS

    def _get_instruction_text(self, item: dict) -> str:
        """Format the question without choices."""
        return f"Question: {item['question']}"

    def _get_ground_truth(self, item: dict) -> str:
        """Return the correct answer choice."""
        return item['choices'][item['answer_idx']]

    def _get_possible_completions(self, item: dict) -> list[str]:
        """Return all answer choices for ranking."""
        return item['choices']
```
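With a similarly hypothetical item, the loglikelihood task exposes the question text, the correct choice, and the full list of choices to be ranked:

```python
item = {
    "question": "What is the capital of France?",
    "choices": ["Paris", "Madrid", "Rome", "Berlin"],
    "answer_idx": 0,
}

# _get_instruction_text(item)     -> "Question: What is the capital of France?"
# _get_ground_truth(item)         -> "Paris"
# _get_possible_completions(item) -> ["Paris", "Madrid", "Rome", "Berlin"]
```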
## Few-Shot Example Construction
The framework automatically constructs few-shot prompts using these methods:
```python
# Example prompt construction for a two-shot scenario
def construct_prompt(self, item: dict) -> list[Message]:
    messages = []

    # 1. System prompt (optional)
    if system_prompt := self._get_system_prompt_text(item):
        messages.append(Message(role=Role.SYSTEM, content=system_prompt))

    # 2. Few-shot examples
    fewshot_examples = self._sample_fewshot_examples(item)
    for example in fewshot_examples:
        # User instruction
        messages.append(Message(
            role=Role.USER,
            content=self._get_instruction_text(example)
        ))
        # Assistant response
        messages.append(Message(
            role=Role.ASSISTANT,
            content=self._get_fewshot_target_text(example)
        ))

    # 3. Actual instruction
    messages.append(Message(
        role=Role.USER,
        content=self._get_instruction_text(item)
    ))

    # 4. Response cue (optional)
    if cue := self._get_cue_text(item):
        messages.append(Message(role=Role.ASSISTANT, content=cue))

    return messages
```
## Example: Geography Quiz
Here’s what the constructed prompt for a two-shot geography quiz looks like:
```python
messages = [
    Message(role=Role.SYSTEM, content="Answer geography questions accurately."),
    Message(role=Role.USER, content="Question: What is the capital of Germany?"),
    Message(role=Role.ASSISTANT, content="Answer: Berlin"),
    Message(role=Role.USER, content="Question: What is the capital of France?"),
    Message(role=Role.ASSISTANT, content="Answer: Paris"),
    Message(role=Role.USER, content="Question: What is the capital of Italy?"),
    Message(role=Role.ASSISTANT, content="Answer:"),
]
```
For completion tasks, the model generates the answer text freely, e.g. `" Rome"`.

For loglikelihood tasks, the model scores a fixed set of candidate continuations, e.g. `[" Rome", " Madrid", " Athens", " Vienna"]`, and the highest-scoring option is taken as the prediction.
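The ranking step itself amounts to picking the choice with the highest log-probability. A minimal sketch of that selection (not the framework's actual scoring code; the scores below are made up):

```python
choices = [" Rome", " Madrid", " Athens", " Vienna"]
loglikelihoods = [-0.7, -4.2, -5.1, -4.8]  # hypothetical per-choice scores from the model

# The predicted answer is the choice with the highest loglikelihood
predicted = choices[loglikelihoods.index(max(loglikelihoods))]
print(predicted)  # " Rome"
```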