Understanding Evaluation Results
This guide explains how to read and interpret the results generated by the eval-framework after running evaluations on language models.
Result Directory Structure
When you run an evaluation, results are organized in a hierarchical directory structure:
```
eval_framework_results/
└── {model_name}/
    └── v{version}_{task_name}/
        └── {parameters}_{config_hash}/
            ├── aggregated_results.json
            ├── metadata.json
            ├── output.jsonl
            └── results.jsonl
```
Directory Components

- `{model_name}`: The name of the evaluated model (e.g., `Llama31_8B_HF`)
- `v{version}_{task_name}`: Framework version and task name (e.g., `v0.1.96_ARC`)
- `{parameters}_{config_hash}`: Evaluation parameters and configuration hash (e.g., `fewshot_3__samples_100_e7f98`)
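Given this layout, the run directories for a model can be discovered with a little path globbing. A minimal sketch, assuming the default `eval_framework_results` base directory; the model name here is a hypothetical example:

```python
from pathlib import Path

# Base directory from the layout above; the model name is a placeholder example.
base_dir = Path("eval_framework_results")
model_name = "Llama31_8B_HF"

# Each run lives under {model_name}/v{version}_{task_name}/{parameters}_{config_hash}/
for run_dir in sorted(base_dir.glob(f"{model_name}/*/*")):
    if run_dir.is_dir():
        print(run_dir)
```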
Individual Result Files
1. results.jsonl - Detailed Metric Results
This file contains one JSON object per line, each representing a metric calculation for a single evaluation sample.
Structure:
```json
{
  "id": 0,
  "subject": "ARC-Easy",
  "num_fewshot": 3,
  "llm_name": "MyHuggingFaceModel",
  "task_name": "ARC",
  "metric_class_name": "AccuracyLoglikelihood",
  "metric_name": "Accuracy Loglikelihood",
  "key": null,
  "value": 0.0,
  "higher_is_better": true,
  "prompt": "Question: Which is the function of the gallbladder?\nAnswer:",
  "response": " store bile",
  "llm_judge_prompt": null,
  "llm_judge_response": null,
  "code_execution_trace": null,
  "error": null
}
```
Key Fields:
- `id`: Unique identifier for this evaluation sample
- `subject`: Task subset/category (e.g., "ARC-Easy", "ARC-Challenge")
- `metric_name`: The specific metric being measured
- `value`: The metric score (0.0 = incorrect, 1.0 = correct for binary metrics)
- `higher_is_better`: Whether higher values indicate better performance
- `prompt`: The input given to the model
- `response`: The model's output
- `error`: Any errors encountered during evaluation
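Because each line is an independent JSON object, `results.jsonl` is straightforward to post-process. A minimal sketch (the file path is a placeholder) that recomputes per-subject averages from the fields above:

```python
import json
from collections import defaultdict

scores = defaultdict(list)
with open("results.jsonl") as f:  # placeholder path; point at a real run directory
    for line in f:
        record = json.loads(line)
        if record["error"] is None:  # skip samples that failed to evaluate
            scores[(record["metric_name"], record["subject"])].append(record["value"])

for (metric, subject), values in sorted(scores.items()):
    print(f"{metric} / {subject}: {sum(values) / len(values):.3f} (n={len(values)})")
```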
2. output.jsonl - Raw Model Responses
Contains the raw model completions before metric calculation.
Structure:
```json
{
  "id": 0,
  "subject": "math",
  "ground_truth": "4",
  "prompt": "What is 2+2?",
  "completion": "4",
  "raw_completion": "4",
  "messages": [...],
  "error": null
}
```
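This file makes it easy to audit what the model actually produced. For example, a minimal sketch (placeholder path; the exact comparison logic will depend on the task) that lists samples whose extracted completion disagrees with the ground truth:

```python
import json

with open("output.jsonl") as f:  # placeholder path
    for line in f:
        sample = json.loads(line)
        if sample["error"] is not None:
            print(f"sample {sample['id']}: error -> {sample['error']}")
        elif sample["completion"] != sample["ground_truth"]:
            print(f"sample {sample['id']}: got {sample['completion']!r}, "
                  f"expected {sample['ground_truth']!r}")
```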
Aggregated Results
aggregated_results.json - Summary Statistics
This file provides high-level performance summaries across all evaluation samples.
Structure:
```json
{
  "ErrorFreeRatio Accuracy Loglikelihood": 1.0,
  "Average Accuracy Loglikelihood": 0.215,
  "Average Accuracy Loglikelihood - ARC-Challenge": 0.14,
  "Average Accuracy Loglikelihood - ARC-Easy": 0.29,
  "ErrorFreeRatio Bytes": 1.0,
  "Average Bytes": 814.045
}
```
Metric Types:
- `ErrorFreeRatio`: Percentage of samples that completed without errors (1.0 = 100%)
- `Average`: Mean score across all samples
- Subject-specific: Metrics broken down by task subset (e.g., "ARC-Challenge" vs "ARC-Easy")
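Because the file is a flat name-to-score mapping, comparing runs amounts to loading and diffing two JSON files. A minimal sketch (placeholder path) that prints just the averaged metrics:

```python
import json

with open("aggregated_results.json") as f:  # placeholder path
    summary = json.load(f)

# Keep only the mean scores; ErrorFreeRatio entries are filtered out here.
for name, value in sorted(summary.items()):
    if name.startswith("Average"):
        print(f"{name}: {value:.3f}")
```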
Metadata
metadata.json - Evaluation Configuration
Contains complete information about how the evaluation was conducted.
Key Fields:
- `task_name`: The benchmark task that was evaluated
- `num_fewshot`: Number of examples provided to the model
- `num_samples`: Number of test samples evaluated
- `llm_name`: Name of the evaluated model
- `metrics`: List of metrics computed
- `primary_metrics`: Main metrics for this task
- `start_time`/`end_time`: Evaluation timing
- `eval_framework_version`: Framework version used
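These fields are useful for sanity-checking a run before trusting its scores. A minimal sketch, assuming `start_time` and `end_time` are ISO 8601 strings (the exact timestamp format may vary between framework versions):

```python
import json
from datetime import datetime

with open("metadata.json") as f:  # placeholder path
    meta = json.load(f)

print(f"{meta['task_name']} on {meta['llm_name']}: "
      f"{meta['num_samples']} samples, {meta['num_fewshot']}-shot")

# Assumption: timestamps are ISO 8601; adjust the parsing if your version differs.
start = datetime.fromisoformat(meta["start_time"])
end = datetime.fromisoformat(meta["end_time"])
print(f"Wall-clock duration: {end - start}")
```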