Understanding Evaluation Results

This guide explains how to read and interpret the results generated by the eval-framework after running evaluations on language models.

Result Directory Structure

When you run an evaluation, results are organized in a hierarchical directory structure:

eval_framework_results/
└── {model_name}/
    └── v{version}_{task_name}/
        └── {parameters}_{config_hash}/
            ├── aggregated_results.json
            ├── metadata.json
            ├── output.jsonl
            └── results.jsonl

Directory Components

  • {model_name}: The name of the evaluated model (e.g., Llama31_8B_HF)

  • v{version}_{task_name}: Framework version and task name (e.g., v0.1.96_ARC)

  • {parameters}_{config_hash}: Evaluation parameters and configuration hash (e.g., fewshot_3__samples_100_e7f98)
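
The sketch below walks this tree and lists every run directory it finds. It is a minimal example, assuming the layout above and a root folder named eval_framework_results in the current working directory; adjust the path to your setup.

from pathlib import Path

# Minimal sketch: walk the results tree described above and list each run directory.
# The root path is an assumption; point it at your own eval_framework_results folder.
root = Path("eval_framework_results")

for run_dir in sorted(p for p in root.glob("*/*/*") if p.is_dir()):
    model_name, version_task, params_hash = run_dir.parts[-3:]
    print(f"model={model_name}  run={version_task}  config={params_hash}")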

Individual Result Files

1. results.jsonl - Detailed Metric Results

This file contains one JSON object per line, each representing a metric calculation for a single evaluation sample.

Structure:

{
  "id": 0,
  "subject": "ARC-Easy",
  "num_fewshot": 3,
  "llm_name": "MyHuggingFaceModel",
  "task_name": "ARC",
  "metric_class_name": "AccuracyLoglikelihood",
  "metric_name": "Accuracy Loglikelihood",
  "key": null,
  "value": 0.0,
  "higher_is_better": true,
  "prompt": "Question: Which is the function of the gallbladder?\nAnswer:",
  "response": " store bile",
  "llm_judge_prompt": null,
  "llm_judge_response": null,
  "code_execution_trace": null,
  "error": null
}

Key Fields:

  • id: Unique identifier for this evaluation sample

  • subject: Task subset/category (e.g., “ARC-Easy”, “ARC-Challenge”)

  • metric_name: The specific metric being measured

  • value: The metric score (0.0 = incorrect, 1.0 = correct for binary metrics)

  • higher_is_better: Whether higher values indicate better performance

  • prompt: The input given to the model

  • response: The model’s output

  • error: Any errors encountered during evaluation
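
Because each line is a self-contained JSON object, results.jsonl is easy to post-process. The following sketch recomputes per-subject averages for each metric, skipping samples that recorded an error; the results_path is a placeholder for the run directory you want to inspect.

import json
from collections import defaultdict
from pathlib import Path

# Minimal sketch: recompute per-subject averages from results.jsonl.
# The path is a placeholder; use the run directory you want to inspect.
results_path = Path("path/to/run_dir/results.jsonl")

scores = defaultdict(list)
with results_path.open() as f:
    for line in f:
        record = json.loads(line)
        if record["error"] is None:  # only include samples that ran cleanly
            scores[(record["metric_name"], record["subject"])].append(record["value"])

for (metric, subject), values in sorted(scores.items()):
    print(f"{metric} - {subject}: {sum(values) / len(values):.3f}")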

2. output.jsonl - Raw Model Responses

Contains the raw model completions before metric calculation.

Structure:

{
  "id": 0,
  "subject": "math",
  "ground_truth": "4",
  "prompt": "What is 2+2?",
  "completion": "4",
  "raw_completion": "4",
  "messages": [...],
  "error": null
}
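
As a quick sanity check, you can compare the recorded completions against the ground truth directly. The sketch below uses an exact string match, which is only a rough spot-check and not the framework's scoring logic; the output_path is a placeholder.

import json
from pathlib import Path

# Minimal sketch: spot-check raw completions against the ground truth.
# Exact string matching is only a rough check, not the framework's scoring logic.
output_path = Path("path/to/run_dir/output.jsonl")

with output_path.open() as f:
    for line in f:
        record = json.loads(line)
        match = str(record["completion"]).strip() == str(record["ground_truth"]).strip()
        print(f"id={record['id']} subject={record['subject']} exact_match={match}")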

Aggregated Results

aggregated_results.json - Summary Statistics

This file provides high-level performance summaries across all evaluation samples.

Structure:

{
  "ErrorFreeRatio Accuracy Loglikelihood": 1.0,
  "Average Accuracy Loglikelihood": 0.215,
  "Average Accuracy Loglikelihood - ARC-Challenge": 0.14,
  "Average Accuracy Loglikelihood - ARC-Easy": 0.29,
  "ErrorFreeRatio Bytes": 1.0,
  "Average Bytes": 814.045
}

Metric Types:

  • ErrorFreeRatio: Fraction of samples that completed without errors (1.0 = 100%)

  • Average: Mean score across all samples

  • Subject-specific: Metrics broken down by task subset (e.g., “ARC-Challenge” vs “ARC-Easy”)
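
Since aggregated_results.json is a flat JSON object mapping metric names to scores, reading it takes only a few lines. The sketch below prints the summary for a single run; the path is a placeholder.

import json
from pathlib import Path

# Minimal sketch: print the aggregated summary for one run.
# The path is a placeholder for the run directory you want to read.
aggregated_path = Path("path/to/run_dir/aggregated_results.json")

summary = json.loads(aggregated_path.read_text())
for name, value in summary.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")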

Metadata

metadata.json - Evaluation Configuration

Contains complete information about how the evaluation was conducted.

Key Fields:

  • task_name: The benchmark task that was evaluated

  • num_fewshot: Number of few-shot examples included in each prompt

  • num_samples: Number of test samples evaluated

  • llm_name: Name of the evaluated model

  • metrics: List of metrics computed

  • primary_metrics: Main metrics for this task

  • start_time/end_time: Timestamps marking when the evaluation started and finished

  • eval_framework_version: Framework version used
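
To review how a run was configured, you can print these fields directly from metadata.json. The sketch below assumes the field names listed above and uses a placeholder path.

import json
from pathlib import Path

# Minimal sketch: print the configuration fields listed above from metadata.json.
# Field names follow this guide; the path is a placeholder.
metadata_path = Path("path/to/run_dir/metadata.json")

metadata = json.loads(metadata_path.read_text())
for field in ("task_name", "num_fewshot", "num_samples", "llm_name",
              "metrics", "primary_metrics", "start_time", "end_time",
              "eval_framework_version"):
    print(f"{field}: {metadata.get(field)}")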