How to Evaluate HuggingFace Models with Eval Framework

This guide shows you how to evaluate any HuggingFace model using the eval-framework, from simple setup to advanced configurations.

Quick Start

Here’s a minimal working example that evaluates a HuggingFace model:

from functools import partial
from pathlib import Path

from eval_framework.llm.huggingface import HFLLM
from eval_framework.main import main
from eval_framework.tasks.eval_config import EvalConfig
from template_formatting.formatter import HFFormatter

# Define your model
class MyHuggingFaceModel(HFLLM):
    LLM_NAME = "context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16"
    DEFAULT_FORMATTER = partial(HFFormatter, "context-labs/meta-llama-Llama-3.2-3B-Instruct-FP16")

if __name__ == "__main__":
    # Initialize your model
    llm = MyHuggingFaceModel()

    # Configure evaluation
    config = EvalConfig(
        task_name="ARC",
        num_fewshot=3,
        num_samples=100,
        output_dir=Path("./eval_results"),
        llm_class=MyHuggingFaceModel,
    )

    # Run evaluation
    results = main(llm=llm, config=config)
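
Save this as a standalone script (for example evaluate_model.py) and run it with Python. When it finishes, the results are written to the output_dir given in the config (./eval_results in this example).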

Understanding the Components

1. Model Definition

The HFLLM base class provides the foundation for HuggingFace model integration:

class MyModel(HFLLM):
    LLM_NAME = "model-name-on-huggingface"
    DEFAULT_FORMATTER = partial(HFFormatter, "model-name-on-huggingface")

    def __init__(self, formatter=None):
        # Set custom attributes before calling super().__init__
        super().__init__(formatter=formatter)

        # Additional model configuration can be done here
        # Note: model and tokenizer are already loaded in super().__init__

2. Formatter Selection

The formatter determines how prompts are structured for your model. Choose based on your model type:

Concat Formatter (Base Models):

from template_formatting.formatter import ConcatFormatter

class BaseModel(HFLLM):
    LLM_NAME = "meta-llama/Llama-3.2-3B"
    DEFAULT_FORMATTER = ConcatFormatter

Simple concatenation formatter for base models without chat templates.
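
Conceptually, concatenation formatting just joins the few-shot examples and the question into one flat prompt string. A rough illustration of the idea (not the actual ConcatFormatter implementation):

fewshot_examples = [
    "Q: What is 2 + 2?\nA: 4",
    "Q: What is the capital of France?\nA: Paris",
]
question = "Q: What gas do plants absorb from the air?\nA:"

# Base models receive one flat prompt with no chat-template tokens.
prompt = "\n\n".join(fewshot_examples + [question])
print(prompt)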

Llama3 Formatter:

from template_formatting.formatter import Llama3Formatter

class Llama3Model(HFLLM):
    LLM_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
    DEFAULT_FORMATTER = Llama3Formatter

Specialized formatter for Llama 3 models with their specific chat template.

Mistral Formatter:

from template_formatting.mistral_formatter import MistralFormatter

class MistralModel(HFLLM):
    LLM_NAME = "mistralai/Mistral-7B-Instruct-v0.1"
    DEFAULT_FORMATTER = MistralFormatter

Automatic HF Formatter:

from template_formatting.formatter import HFFormatter
from functools import partial

class ChatModel(HFLLM):
    LLM_NAME = "meta-llama/Llama-3.2-3B-Instruct"
    DEFAULT_FORMATTER = partial(HFFormatter, "meta-llama/Llama-3.2-3B-Instruct")

Automatically detects and uses the model’s chat template from HuggingFace.
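
If you want to see what a chat template actually produces, you can render it directly with the transformers tokenizer. This snippet only illustrates the template that HFFormatter builds on; it is not part of eval-framework:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the template as text and append the assistant header so the model
# knows a reply should come next.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)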

Step-by-Step Implementation

Step 1: Choose Your Model

Pick any HuggingFace model. Here are examples for different model types:

Large Language Models:

class Llama3_8B(HFLLM):
    LLM_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
    DEFAULT_FORMATTER = Llama3Formatter

class Mistral7B(HFLLM):
    LLM_NAME = "mistralai/Mistral-7B-Instruct-v0.1"
    DEFAULT_FORMATTER = MistralFormatter

class Qwen2_7B(HFLLM):
    LLM_NAME = "Qwen/Qwen2-7B-Instruct"
    DEFAULT_FORMATTER = partial(HFFormatter, "Qwen/Qwen2-7B-Instruct")

Small Models:

class SmolLM(HFLLM):
    LLM_NAME = "HuggingFaceTB/SmolLM-1.7B-Instruct"
    DEFAULT_FORMATTER = partial(HFFormatter, "HuggingFaceTB/SmolLM-1.7B-Instruct")

class TinyLlama(HFLLM):
    LLM_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    DEFAULT_FORMATTER = partial(HFFormatter, "TinyLlama/TinyLlama-1.1B-Chat-v1.0")

Step 2: Set Up Evaluation Configuration

Configure the evaluation parameters:

from pathlib import Path
from eval_framework.tasks.eval_config import EvalConfig

config = EvalConfig(
    # Core settings
    task_name="MMLU",                    # Benchmark to run
    num_fewshot=5,                       # Number of examples in prompt
    num_samples=100,                     # How many questions to evaluate
    output_dir=Path("./eval_results"),        # Where to save results
    llm_class=YourModelClass,            # Your model class

    # Optional settings
    task_subjects=["astronomy"],             # Specific subjects (if applicable)
    batch_size=8,                        # Batch processing size
)
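
Before launching a full run, it can be worth doing a quick smoke test with the same fields scaled down. A sketch that reuses only the parameters shown above (the exact values are illustrative):

smoke_test_config = EvalConfig(
    task_name="MMLU",
    num_fewshot=1,                       # Keep prompts short
    num_samples=5,                       # Just enough to verify the pipeline runs
    output_dir=Path("./eval_results_smoke"),
    llm_class=YourModelClass,
)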

Step 3: Run Evaluation

Execute the evaluation:

from eval_framework.main import main

if __name__ == "__main__":
    # Initialize model
    llm = YourModelClass()

    # Run evaluation
    results = main(llm=llm, config=config)

    # Results are automatically saved to output_dir
    print(f"Evaluation completed! Results saved to {config.output_dir}")