Included Benchmark Tasks
The framework currently covers a wide range of pre-training and post-training benchmarks for completion and loglikelihood tasks, as well as benchmarks that use LLM-as-a-judge evaluation methods. The suggested few-shot counts are taken from other leaderboards and the literature.
Additional task documentation can be generated with the script utils/generate-task-docs.py, as described in installation.md. The generated documentation is then available in docs/tasks.
Completion
| Task | Capability | Benchmarks | Long Context |
|---|---|---|---|
| Logical Reasoning | Math | | |
| Logical Reasoning | Programming | | |
| Logical Reasoning | Puzzle | | |
| Output Control | Structure | | |
| Text Distillation | Aggregation | | |
| Text Distillation | Classification | | |
| Text Distillation | Closed QA | | |
| Text Distillation | Extraction | | |
| Text Distillation | QA | | |
| Text Transformation | Translation | | |
Loglikelihoods
| Task | Capability | Benchmarks | Long Context |
|---|---|---|---|
| Output Control | Bias | | |
| Text Distillation | Classification | | |
| Text Distillation | QA | | |
| Text Generation | Open QA | | |
| Logical Reasoning | Closed QA | | |
| Logical Reasoning | Programming | | |
| Logical Reasoning | Reasoning | | |
Long-Context
| Task Name | Tag | Task | Capability | Domain | Common Few-Shot Counts | Avg #Words | Language |
|---|---|---|---|---|---|---|---|
| Babilong | | Text Generation, Long Context | Completion, Long Context | ? | not supported | 22003 | en |
| InfiniteBench_CodeDebug | | Logical Reasoning | Programming | ? | not supported | 127761 | en |
| InfiniteBench_CodeRun | | Logical Reasoning | Programming | ? | not supported | 34851 | en |
| InfiniteBench_EnDia | | Text Distillation | Closed QA | ? | not supported | 73240 | en |
| InfiniteBench_EnMC | | Text Distillation | Closed QA | ? | not supported | 139966 | en |
| InfiniteBench_EnQA | | Text Distillation | Closed QA | ? | not supported | 149442 | en |
| InfiniteBench_MathFind | | Logical Reasoning | Math | ? | not supported | 30017 | en |
| InfiniteBench_RetrieveKV2 | | Text Distillation | Extraction | ? | not supported | 5010 | en |
| InfiniteBench_RetrieveNumber | | Text Distillation | Extraction | ? | not supported | 99199 | en |
| InfiniteBench_RetrievePassKey1 | | Text Distillation | Extraction | ? | not supported | 99196 | en |
| QuALITY | | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
| ZeroSCROLLS GovReport | | Text Distillation | QA | Government | not supported | 7273 | en |
| ZeroSCROLLS SQuALITY | | Text Distillation | QB-Summ? | Literature | not supported | 4971 | en |
| ZeroSCROLLS Qasper | | Text Distillation | QA | Science | not supported | 3531 | en |
| ZeroSCROLLS NarrativeQA | | Text Distillation | QA | Literature, Film | not supported | 49384 | en |
| ZeroSCROLLS QuALITY | | Text Distillation | QA | Literature, Misc | not supported | 4248 | en |
| ZeroSCROLLS MuSiQue | | Text Distillation | QA | Wikipedia | not supported | 1749 | en |
| ZeroSCROLLS SpaceDigest | | Text Distillation | Aggregation | Reviews | not supported | 5481 | en |
Languages
Languages in Likelihood tasks: ENG (39), DEU (7), FRA (4), FIN (2), NLD (2), ITA (1), POL (1), RUS (1), SPA (1), SWE (1), UKR (1)
Languages in Completion tasks: ENG (20), DEU (5), FRA (5), ARB (1), FIN (1), ITA (1), POR (1), SPA (1) and 44 languages in INCLUDE.
Languages in both types of tasks: ENG (59), DEU (12), FRA (9), FIN (3), ITA (2), NLD (2), SPA (2), ARB (1), POL (1), POR (1), RUS (1), SWE (1), UKR (1) and 44 languages in INCLUDE.
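The combined tally is the per-language sum of the two lists above. A minimal sketch for recomputing it, assuming the counts are copied by hand from the Likelihood and Completion lines (the variable names are illustrative, and the 44 INCLUDE languages are left out because they are not enumerated here):

```python
from collections import Counter

# Counts copied from the "Likelihood tasks" line.
likelihood = Counter({
    "ENG": 39, "DEU": 7, "FRA": 4, "FIN": 2, "NLD": 2, "ITA": 1,
    "POL": 1, "RUS": 1, "SPA": 1, "SWE": 1, "UKR": 1,
})

# Counts copied from the "Completion tasks" line (INCLUDE not enumerated).
completion = Counter({
    "ENG": 20, "DEU": 5, "FRA": 5, "ARB": 1, "FIN": 1, "ITA": 1,
    "POR": 1, "SPA": 1,
})

# Counter addition sums the counts key by key.
combined = likelihood + completion

# Print by descending count, ties broken alphabetically.
for lang, count in sorted(combined.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{lang} ({count})")
```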
Metrics
| Metrics Type | Metrics |
|---|---|
| Completion Metrics | Accuracy |
| | Bleu |
| | Chrf |
| | Ter |
| | F1 |
| | Rouge 1 |
| | Rouge 2 |
| | Rouge-L |
| | Code Assertion |
| | Language Checker |
| | Length Checker |
| | Math Reasoning |
| | Placeholder Checker |
| | Text Counter |
| | CSV Format |
| | JSON Format |
| | Postscript Format |
| | Custom IFEval Checker |
| | Custom CWE Checker |
| | Custom NIAH Checker |
| | Custom Grid Comparison Checker |
| | Repetition Checker |
| Loglikelihood Metrics | Accuracy Loglikelihood |
| | Normalized Accuracy Loglikelihood |
| | Probability Mass |
| LLM Metrics | Chatbot Style Judge |
| | Completion Accuracy Judge |
| | Conciseness Judge |
| | Contains Names Judge |
| | Instruction Judge |
| | SQL Format |
| | World Knowledge Judge |
| Efficiency Metrics | Bytes per Sequence Position |
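To make the three loglikelihood metrics above more concrete, the following is a minimal sketch of how such metrics are commonly computed for a single multiple-choice item. It is not taken from the framework's code: the function names are hypothetical, the normalization by UTF-8 byte length is an assumption (other conventions normalize by token or character count), and "probability mass" is assumed here to mean the softmax share assigned to the correct choice.

```python
import math

def accuracy_loglikelihood(loglikelihoods, gold_index):
    """1.0 if the gold choice has the highest summed log-likelihood, else 0.0."""
    best = max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])
    return float(best == gold_index)

def normalized_accuracy_loglikelihood(loglikelihoods, choices, gold_index):
    """Same as above, but each score is divided by the choice's length.

    Assumption: length is measured in UTF-8 bytes, which offsets the
    penalty that longer completions receive from summed log-likelihoods.
    """
    scores = [
        ll / max(len(choice.encode("utf-8")), 1)
        for ll, choice in zip(loglikelihoods, choices)
    ]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return float(best == gold_index)

def probability_mass(loglikelihoods, gold_index):
    """Share of probability on the gold choice after a softmax over choices.

    Assumption: this is one plausible reading of "Probability Mass".
    """
    shift = max(loglikelihoods)  # subtract the max for numerical stability
    exps = [math.exp(ll - shift) for ll in loglikelihoods]
    return exps[gold_index] / sum(exps)

# Hypothetical item: the longer gold answer loses on raw log-likelihood
# but wins once scores are length-normalized.
choices = ["yes", "absolutely not"]
loglikelihoods = [-4.0, -7.0]
gold_index = 1

acc = accuracy_loglikelihood(loglikelihoods, gold_index)
acc_norm = normalized_accuracy_loglikelihood(loglikelihoods, choices, gold_index)
mass = probability_mass(loglikelihoods, gold_index)
print(acc, acc_norm, round(mass, 4))
```

The example item is chosen so that the plain and normalized accuracies disagree, which is the situation where reporting both is informative.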