Understanding Evaluation Results

This guide explains how to read and interpret the results generated by the eval-framework after running evaluations on language models.

Result Directory Structure

When you run an evaluation, results are organized in a hierarchical directory structure:

eval_framework_results/
└── {model_name}/
    └── v{version}_{task_name}/
        └── {parameters}_{config_hash}/
            ├── aggregated_results.json
            ├── metadata.json
            ├── output.jsonl
            └── results.jsonl

Directory Components

  • {model_name}: The name of the evaluated model (e.g., Llama31_8B_HF)

  • v{version}_{task_name}: Framework version and task name (e.g., v0.1.96_ARC)

  • {parameters}_{config_hash}: Evaluation parameters and configuration hash (e.g., fewshot_3__samples_100_e7f98)
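
The sketch below walks this tree and lists every run directory it finds. It is a minimal example, assuming the layout above and a root folder named eval_framework_results in the current working directory; adjust the path to your setup.

from pathlib import Path

# Minimal sketch: walk the results tree described above and list each run directory.
# The root path is an assumption; point it at your own eval_framework_results folder.
root = Path("eval_framework_results")

for run_dir in sorted(p for p in root.glob("*/*/*") if p.is_dir()):
    model_name, version_task, params_hash = run_dir.parts[-3:]
    print(f"model={model_name}  run={version_task}  config={params_hash}")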

Individual Result Files

1. results.jsonl - Detailed Metric Results

This file contains one JSON object per line, each representing a metric calculation for a single evaluation sample.

Structure:

{
  "id": 0,
  "subject": "ARC-Easy",
  "num_fewshot": 3,
  "llm_name": "MyHuggingFaceModel",
  "task_name": "ARC",
  "metric_class_name": "AccuracyLoglikelihood",
  "metric_name": "Accuracy Loglikelihood",
  "key": null,
  "value": 0.0,
  "higher_is_better": true,
  "prompt": "Question: Which is the function of the gallbladder?\nAnswer:",
  "response": " store bile",
  "llm_judge_prompt": null,
  "llm_judge_response": null,
  "code_execution_trace": null,
  "error": null
}

Key Fields:

  • id: Unique identifier for this evaluation sample

  • subject: Task subset/category (e.g., “ARC-Easy”, “ARC-Challenge”)

  • metric_name: The specific metric being measured

  • value: The metric score (0.0 = incorrect, 1.0 = correct for binary metrics)

  • higher_is_better: Whether higher values indicate better performance

  • prompt: The input given to the model

  • response: The model’s output

  • error: Any errors encountered during evaluation
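
Because each line is a self-contained JSON object, results.jsonl is easy to post-process. The following sketch recomputes per-subject averages for each metric, skipping samples that recorded an error; the results_path is a placeholder for the run directory you want to inspect.

import json
from collections import defaultdict
from pathlib import Path

# Minimal sketch: recompute per-subject averages from results.jsonl.
# The path is a placeholder; use the run directory you want to inspect.
results_path = Path("path/to/run_dir/results.jsonl")

scores = defaultdict(list)
with results_path.open() as f:
    for line in f:
        record = json.loads(line)
        if record["error"] is None:  # only include samples that ran cleanly
            scores[(record["metric_name"], record["subject"])].append(record["value"])

for (metric, subject), values in sorted(scores.items()):
    print(f"{metric} - {subject}: {sum(values) / len(values):.3f}")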

2. output.jsonl - Raw Model Responses

Contains the raw model completions before metric calculation.

Structure:

{
  "id": 0,
  "subject": "math",
  "ground_truth": "4",
  "prompt": "What is 2+2?",
  "completion": "4",
  "raw_completion": "4",
  "messages": [...],
  "error": null
}
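
As a quick sanity check, you can compare the recorded completions against the ground truth directly. The sketch below uses an exact string match, which is only a rough spot-check and not the framework's scoring logic; the output_path is a placeholder.

import json
from pathlib import Path

# Minimal sketch: spot-check raw completions against the ground truth.
# Exact string matching is only a rough check, not the framework's scoring logic.
output_path = Path("path/to/run_dir/output.jsonl")

with output_path.open() as f:
    for line in f:
        record = json.loads(line)
        match = str(record["completion"]).strip() == str(record["ground_truth"]).strip()
        print(f"id={record['id']} subject={record['subject']} exact_match={match}")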

Aggregated Results

aggregated_results.json - Summary Statistics

This file provides high-level performance summaries across all evaluation samples.

Structure:

{
  "ErrorFreeRatio Accuracy Loglikelihood": 1.0,
  "Average Accuracy Loglikelihood": 0.215,
  "Average Accuracy Loglikelihood - ARC-Challenge": 0.14,
  "Average Accuracy Loglikelihood - ARC-Easy": 0.29,
  "ErrorFreeRatio Bytes": 1.0,
  "Average Bytes": 814.045
}

Metric Types:

  • ErrorFreeRatio: Fraction of samples that completed without errors (1.0 = 100%)

  • Average: Mean score across all samples

  • Subject-specific: Metrics broken down by task subset (e.g., “ARC-Challenge” vs “ARC-Easy”)
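
Since aggregated_results.json is a flat JSON object mapping metric names to scores, reading it takes only a few lines. The sketch below prints the summary for a single run; the path is a placeholder.

import json
from pathlib import Path

# Minimal sketch: print the aggregated summary for one run.
# The path is a placeholder for the run directory you want to read.
aggregated_path = Path("path/to/run_dir/aggregated_results.json")

summary = json.loads(aggregated_path.read_text())
for name, value in summary.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")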

Metadata

metadata.json - Evaluation Configuration

Contains complete information about how the evaluation was conducted.

Key Fields:

  • task_name: The benchmark task that was evaluated

  • num_fewshot: Number of few-shot examples included in each prompt

  • num_samples: Number of test samples evaluated

  • llm_name: Name of the evaluated model

  • metrics: List of metrics computed

  • primary_metrics: Main metrics for this task

  • start_time/end_time: Timestamps marking when the evaluation started and finished

  • eval_framework_version: Framework version used
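
To review how a run was configured, you can print these fields directly from metadata.json. The sketch below assumes the field names listed above and uses a placeholder path.

import json
from pathlib import Path

# Minimal sketch: print the configuration fields listed above from metadata.json.
# Field names follow this guide; the path is a placeholder.
metadata_path = Path("path/to/run_dir/metadata.json")

metadata = json.loads(metadata_path.read_text())
for field in ("task_name", "num_fewshot", "num_samples", "llm_name",
              "metrics", "primary_metrics", "start_time", "end_time",
              "eval_framework_version"):
    print(f"{field}: {metadata.get(field)}")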