eval_framework.tasks.benchmarks package

Submodules

eval_framework.tasks.benchmarks.aidanbench module

class eval_framework.tasks.benchmarks.aidanbench.AidanBench(num_fewshot=0)[source]

Bases: AidanBenchOriginal

Parameters:

num_fewshot (int)

class eval_framework.tasks.benchmarks.aidanbench.AidanBenchOriginal(num_fewshot=0)[source]

Bases: BaseTask[str]

AidanBench (https://openreview.net/pdf?id=fz969ahcvJ).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Aleph-Alpha-Research/aidanbench'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.aidanbench.AidanBenchMetric'>]
NAME: str = 'AidanBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
generate_completions(llm, samples, stop_sequences=None, max_tokens=None)[source]

Generates completions for the given samples, using the provided stop sequences and maximum token budget during generation, and returns a list of Completion objects.

Parameters:
  • llm (BaseLLM)

  • samples (list[Sample])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

Return type:

list[Completion]
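
As a rough usage sketch, generate_completions is called with a BaseLLM instance and a list of Sample objects; how those are constructed is framework-specific and not shown in this section, so they are simply passed in as arguments here:

    from eval_framework.tasks.benchmarks.aidanbench import AidanBench

    def run_aidanbench(llm, samples):
        # llm is assumed to implement BaseLLM; samples is assumed to be a list[Sample]
        # drawn from the task's 'train' sample split.
        task = AidanBench(num_fewshot=0)
        # stop_sequences (list[str] | None) and max_tokens (int | None) are optional.
        return task.generate_completions(llm, samples, stop_sequences=None, max_tokens=1024)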

eval_framework.tasks.benchmarks.arc module

class eval_framework.tasks.benchmarks.arc.ARC(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC dataset: https://huggingface.co/datasets/allenai/ai2_arc

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'ai2_arc'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['ARC-Easy', 'ARC-Challenge']
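
Task classes such as ARC are constructed with the number of few-shot examples (presumably drawn from FEWSHOT_SPLIT) and expose their configuration through the class attributes listed above; a minimal sketch:

    from eval_framework.tasks.benchmarks.arc import ARC

    task = ARC(num_fewshot=5)          # five few-shot examples per prompt
    print(ARC.NAME)                    # 'ARC'
    print(ARC.SUBJECTS)                # ['ARC-Easy', 'ARC-Challenge']
    print(ARC.RESPONSE_TYPE)           # 'loglikelihoods'
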
class eval_framework.tasks.benchmarks.arc.ARC_IDK(num_fewshot=0)[source]

Bases: ARC

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'ARC_IDK'

eval_framework.tasks.benchmarks.arc_de module

class eval_framework.tasks.benchmarks.arc_de.ARC_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC-DE dataset: https://huggingface.co/datasets/LeoLM/ArcChallenge_de

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/ArcChallenge_de'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC German'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']

eval_framework.tasks.benchmarks.arc_fi module

class eval_framework.tasks.benchmarks.arc_fi.ARC_FI(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC-FI dataset: https://huggingface.co/datasets/LumiOpen/arc_challenge_mt

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LumiOpen/arc_challenge_mt'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'Finnish'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC Finnish'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['fi']

eval_framework.tasks.benchmarks.belebele module

class eval_framework.tasks.benchmarks.belebele.BELEBELE(num_fewshot=0)[source]

Bases: BaseTask[str]

BELEBELE dataset: https://huggingface.co/datasets/facebook/belebele

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'facebook/belebele'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'BELEBELE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['eng_Latn']

eval_framework.tasks.benchmarks.bigcodebench module

class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBench(num_fewshot=0)[source]

Bases: BaseTask[str]

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'bigcode/bigcodebench'
FEWSHOT_SPLIT: str = 'v0.1.4'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne'>]
NAME: str = 'BigCodeBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'v0.1.4'
SUBJECTS: list[SubjectType] = ['original', 'calibrated']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHard(num_fewshot=0)[source]

Bases: BigCodeBench

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'bigcode/bigcodebench-hard'
NAME: str = 'BigCodeBenchHard'
class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHardInstruct(num_fewshot=0)[source]

Bases: BigCodeBenchHard

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard

Parameters:

num_fewshot (int)

NAME: str = 'BigCodeBenchHardInstruct'
class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchInstruct(num_fewshot=0)[source]

Bases: BigCodeBench

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench

Parameters:

num_fewshot (int)

NAME: str = 'BigCodeBenchInstruct'
eval_framework.tasks.benchmarks.bigcodebench.extract_executable_code(llm_response)[source]
Return type:

str

Parameters:

llm_response (str)
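
The concrete implementation of extract_executable_code is not reproduced here; a plausible approximation (an assumption, not the actual code) is to pull the contents of a fenced Python block out of the model's reply and fall back to the raw text:

    import re

    def extract_executable_code_sketch(llm_response: str) -> str:
        # Illustrative only: prefer a fenced ```python block, otherwise return the reply as-is.
        match = re.search(r"```(?:python)?\n(.*?)```", llm_response, flags=re.DOTALL)
        return match.group(1) if match else llm_response

    reply = "Sure!\n```python\nimport math\nprint(math.sqrt(2))\n```"
    print(extract_executable_code_sketch(reply))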

eval_framework.tasks.benchmarks.casehold module

class eval_framework.tasks.benchmarks.casehold.CASEHOLD(num_fewshot=0)[source]

Bases: BaseTask[str]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'lex_glue'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'CaseHold'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['case_hold']

eval_framework.tasks.benchmarks.chembench module

class eval_framework.tasks.benchmarks.chembench.ChemBench(num_fewshot=0)[source]

Bases: BaseTask[str]

ChemBench dataset: https://huggingface.co/datasets/jablonkagroup/ChemBench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'jablonkagroup/ChemBench'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ChemBench'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['analytical_chemistry', 'chemical_preference', 'general_chemistry', 'inorganic_chemistry', 'materials_science', 'organic_chemistry', 'physical_chemistry', 'technical_chemistry', 'toxicity_and_safety']

eval_framework.tasks.benchmarks.copa module

class eval_framework.tasks.benchmarks.copa.COPA(num_fewshot=0)[source]

Bases: BaseTask[str]

COPA dataset: https://huggingface.co/datasets/aps/super_glue

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'aps/super_glue'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'COPA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['because', 'therefore']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['copa']
convert_choice(choice)[source]
Return type:

str

Parameters:

choice (str)
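
The body of convert_choice is not shown here. In the common COPA formulation (as used in lm-evaluation-harness), each choice is turned into a continuation of the premise by lower-casing its first character; a sketch under that assumption:

    def convert_choice_sketch(choice: str) -> str:
        # Assumed behaviour: lower-case the first character so the choice reads
        # naturally after the connective ("because" / "therefore").
        return choice[0].lower() + choice[1:] if choice else choice

    print(convert_choice_sketch("The man fixed the faucet."))  # the man fixed the faucet.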

class eval_framework.tasks.benchmarks.copa.COPA_IDK(num_fewshot=0)[source]

Bases: COPA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'COPA_IDK'

eval_framework.tasks.benchmarks.duc module

class eval_framework.tasks.benchmarks.duc.DUC(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

https://huggingface.co/datasets/midas/duc2001

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'midas/duc2001'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Text', 'Keyphrase']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[str] = ['raw']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.duc.DUC_ABSTRACTIVE(num_fewshot=0)[source]

Bases: DUC

Parameters:

num_fewshot (int)

NAME: str = 'DUC Abstractive'
SUBJECTS: list[str] = ['raw']
class eval_framework.tasks.benchmarks.duc.DUC_EXTRACTIVE(num_fewshot=0)[source]

Bases: DUC

Parameters:

num_fewshot (int)

NAME: str = 'DUC Extractive'
SUBJECTS: list[str] = ['raw']

eval_framework.tasks.benchmarks.flores200 module

class eval_framework.tasks.benchmarks.flores200.Flores200(num_fewshot=0)[source]

Bases: BaseTask[str]

FLORES-200 dataset: https://huggingface.co/datasets/facebook/flores

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'facebook/flores'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fin_Latn': Language.FIN, 'fra_Latn': Language.FRA, 'nld_Latn': Language.NLD}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>]
NAME: str = 'FLoRes-200'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'devtest'
SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fin_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-nld_Latn', 'eng_Latn-deu_Latn', 'eng_Latn-fin_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-nld_Latn', 'fin_Latn-deu_Latn', 'fin_Latn-eng_Latn', 'fin_Latn-fra_Latn', 'fin_Latn-nld_Latn', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-fin_Latn', 'fra_Latn-nld_Latn', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fin_Latn', 'nld_Latn-fra_Latn']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
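
Each SUBJECTS entry encodes a translation direction as '<source>-<target>' using FLORES language codes, and LANGUAGE maps each code to a Language member. The split-on-hyphen reading below is an inference from the naming, not documented behaviour:

    # String stand-ins for the Language enum members, to keep the snippet self-contained.
    LANGUAGE = {"deu_Latn": "DEU", "eng_Latn": "ENG", "fin_Latn": "FIN",
                "fra_Latn": "FRA", "nld_Latn": "NLD"}

    subject = "deu_Latn-eng_Latn"
    source, target = subject.split("-")
    print(source, "->", target)                  # deu_Latn -> eng_Latn
    print(LANGUAGE[source], LANGUAGE[target])    # DEU ENG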

eval_framework.tasks.benchmarks.flores_plus module

class eval_framework.tasks.benchmarks.flores_plus.FloresPlus(num_fewshot=0)[source]

Bases: BaseTask[str]

Flores-Plus dataset: https://huggingface.co/datasets/openlanguagedata/flores_plus

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openlanguagedata/flores_plus'
FEWSHOT_SPLIT: str = 'devtest'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fra_Latn': Language.FRA, 'ita_Latn': Language.ITA, 'nld_Latn': Language.NLD, 'pol_Latn': Language.POL, 'rus_Cyrl': Language.RUS, 'spa_Latn': Language.SPA, 'ukr_Cyrl': Language.UKR}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>, <class 'eval_framework.metrics.completion.chrf.CHRF'>, <class 'eval_framework.metrics.completion.comet.COMET'>]
NAME: str = 'Flores-Plus'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'dev'
SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-ita_Latn', 'deu_Latn-nld_Latn', 'deu_Latn-pol_Latn', 'deu_Latn-rus_Cyrl', 'deu_Latn-spa_Latn', 'deu_Latn-ukr_Cyrl', 'eng_Latn-deu_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-ita_Latn', 'eng_Latn-nld_Latn', 'eng_Latn-pol_Latn', 'eng_Latn-rus_Cyrl', 'eng_Latn-spa_Latn', 'eng_Latn-ukr_Cyrl', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-ita_Latn', 'fra_Latn-nld_Latn', 'fra_Latn-pol_Latn', 'fra_Latn-rus_Cyrl', 'fra_Latn-spa_Latn', 'fra_Latn-ukr_Cyrl', 'ita_Latn-deu_Latn', 'ita_Latn-eng_Latn', 'ita_Latn-fra_Latn', 'ita_Latn-nld_Latn', 'ita_Latn-pol_Latn', 'ita_Latn-rus_Cyrl', 'ita_Latn-spa_Latn', 'ita_Latn-ukr_Cyrl', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fra_Latn', 'nld_Latn-ita_Latn', 'nld_Latn-pol_Latn', 'nld_Latn-rus_Cyrl', 'nld_Latn-spa_Latn', 'nld_Latn-ukr_Cyrl', 'pol_Latn-deu_Latn', 'pol_Latn-eng_Latn', 'pol_Latn-fra_Latn', 'pol_Latn-ita_Latn', 'pol_Latn-nld_Latn', 'pol_Latn-rus_Cyrl', 'pol_Latn-spa_Latn', 'pol_Latn-ukr_Cyrl', 'rus_Cyrl-deu_Latn', 'rus_Cyrl-eng_Latn', 'rus_Cyrl-fra_Latn', 'rus_Cyrl-ita_Latn', 'rus_Cyrl-nld_Latn', 'rus_Cyrl-pol_Latn', 'rus_Cyrl-spa_Latn', 'rus_Cyrl-ukr_Cyrl', 'spa_Latn-deu_Latn', 'spa_Latn-eng_Latn', 'spa_Latn-fra_Latn', 'spa_Latn-ita_Latn', 'spa_Latn-nld_Latn', 'spa_Latn-pol_Latn', 'spa_Latn-rus_Cyrl', 'spa_Latn-ukr_Cyrl', 'ukr_Cyrl-deu_Latn', 'ukr_Cyrl-eng_Latn', 'ukr_Cyrl-fra_Latn', 'ukr_Cyrl-ita_Latn', 'ukr_Cyrl-nld_Latn', 'ukr_Cyrl-pol_Latn', 'ukr_Cyrl-rus_Cyrl', 'ukr_Cyrl-spa_Latn']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.gpqa module

class eval_framework.tasks.benchmarks.gpqa.GPQA(num_fewshot=0)[source]

Bases: BaseTask[str]

GPQA dataset: https://huggingface.co/datasets/Idavidrein/gpqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Idavidrein/gpqa'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'GPQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['gpqa_extended']
class eval_framework.tasks.benchmarks.gpqa.GPQA_COT(num_fewshot=0)[source]

Bases: GPQA

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'GPQA_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
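
The ANS_RE pattern above pulls the final answer letter out of a chain-of-thought completion; post_process_generated_completion presumably applies it along the lines of this minimal sketch:

    import re

    ANS_RE = re.compile(r"Therefore, the answer is \(([ABCDEFGHIJ])\)")  # as documented above

    completion = "The anomeric effect favours the axial conformer, matching option C. Therefore, the answer is (C)"
    match = ANS_RE.search(completion)
    print(match.group(1) if match else "")  # C
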
class eval_framework.tasks.benchmarks.gpqa.GPQA_IDK(num_fewshot=0)[source]

Bases: GPQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'GPQA_IDK'

eval_framework.tasks.benchmarks.gsm8k module

class eval_framework.tasks.benchmarks.gsm8k.GSM8K(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = ''
NAME: str = 'GSM8K'
class eval_framework.tasks.benchmarks.gsm8k.GSM8KEvalHarness(num_fewshot=0)[source]

Bases: BaseTask[str]

GSM8K dataset: https://huggingface.co/datasets/openai/gsm8k. This version uses samples from the train split as few-shot examples.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'gsm8k'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'GSM8KEvalHarness'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['main']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]

eval_framework.tasks.benchmarks.hellaswag module

class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG(num_fewshot=0)[source]

Bases: BaseTask[str]

HellaSwag dataset: https://huggingface.co/datasets/Rowan/hellaswag. Available dataset splits: train, validation, test.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Rowan/hellaswag'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'HellaSwag'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG_IDK(num_fewshot=0)[source]

Bases: HELLASWAG

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'HellaSwag_IDK'

eval_framework.tasks.benchmarks.hellaswag_de module

class eval_framework.tasks.benchmarks.hellaswag_de.HELLASWAG_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

HellaSwag German dataset: https://huggingface.co/datasets/LeoLM/HellaSwag_de. Available dataset splits: train (1k rows), validation (10k rows).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/HellaSwag_de'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'HellaSwag German'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']

eval_framework.tasks.benchmarks.humaneval module

class eval_framework.tasks.benchmarks.humaneval.HumanEval(num_fewshot=0)[source]

Bases: BaseTask[str]

HumanEval dataset: https://huggingface.co/datasets/openai/openai_humaneval/

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openai/openai_humaneval'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]
NAME: str = 'Human Eval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.humaneval.HumanEvalInstruct(num_fewshot=0)[source]

Bases: HumanEval

Parameters:

num_fewshot (int)

CUE_PREFIX = 'Here is the completed function:\n```python\n'
NAME: str = 'Human Eval Instruct'
class eval_framework.tasks.benchmarks.humaneval.HumanEvalMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • test (str)

  • entry_point (str)

  • prompt (str)

  • extra_data (Any)

entry_point: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str
test: str
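
HumanEvalMetricContext is a pydantic model carrying the HumanEval test-harness data for a sample; an illustrative instantiation (the field values are made up for the example):

    from eval_framework.tasks.benchmarks.humaneval import HumanEvalMetricContext

    ctx = HumanEvalMetricContext(
        prompt='def add(a, b):\n    """Return a + b."""\n',
        entry_point="add",
        test="def check(candidate):\n    assert candidate(1, 2) == 3\n",
    )
    # model_config sets extra='allow', so additional fields would also be accepted.
    print(ctx.entry_point)  # add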

eval_framework.tasks.benchmarks.ifeval module

class eval_framework.tasks.benchmarks.ifeval.IFEval(num_fewshot=0)[source]

Bases: BaseTask[str]

IFEval: Instruction Following Eval (https://arxiv.org/pdf/2311.07911).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google/IFEval'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.ifeval.IFEvalMetric'>]
NAME: str = 'IFEval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.ifeval.IFEvalDe(num_fewshot=0)[source]

Bases: IFEval

German version of the Instruction Following Evaluation (IFEval) benchmark.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'jzhang86/de_ifeval'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.DEU}
NAME: str = 'IFEval German'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.ifeval.IFEvalFiSv(num_fewshot=0)[source]

Bases: IFEval

Machine translated versions of the Instruction Following Evaluation (IFEval) benchmark.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LumiOpen/ifeval_mt'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'fi': Language.FIN, 'sv': Language.SWE}
NAME: str = 'IFEval Finnish & Swedish'
SUBJECTS: list[SubjectType] = ['fi', 'sv']

eval_framework.tasks.benchmarks.include module

class eval_framework.tasks.benchmarks.include.INCLUDE(num_fewshot=0)[source]

Bases: BaseTask[str]

INCLUDE dataset: https://huggingface.co/datasets/CohereLabs/include-base-44

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'CohereLabs/include-base-44'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'Albanian': Language.SQI, 'Arabic': Language.ARB, 'Armenian': Language.HYE, 'Azerbaijani': Language.AZE, 'Basque': Language.EUS, 'Belarusian': Language.BEL, 'Bengali': Language.BEN, 'Bulgarian': Language.BUL, 'Chinese': Language.ZHO, 'Croatian': Language.HRV, 'Dutch': Language.NLD, 'Estonian': Language.EST, 'Finnish': Language.FIN, 'French': Language.FRA, 'Georgian': Language.KAT, 'German': Language.DEU, 'Greek': Language.ELL, 'Hebrew': Language.HEB, 'Hindi': Language.HIN, 'Hungarian': Language.HUN, 'Indonesian': Language.IND, 'Italian': Language.ITA, 'Japanese': Language.JPN, 'Kazakh': Language.KAZ, 'Korean': Language.KOR, 'Lithuanian': Language.LIT, 'Malay': Language.MSA, 'Malayalam': Language.MAL, 'Nepali': Language.NEP, 'North Macedonian': Language.MKD, 'Persian': Language.FAS, 'Polish': Language.POL, 'Portuguese': Language.POR, 'Russian': Language.RUS, 'Serbian': Language.SRP, 'Spanish': Language.SPA, 'Tagalog': Language.TGL, 'Tamil': Language.TAM, 'Telugu': Language.TEL, 'Turkish': Language.TUR, 'Ukrainian': Language.UKR, 'Urdu': Language.URD, 'Uzbek': Language.UZB, 'Vietnamese': Language.VIE}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'INCLUDE'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['Albanian', 'Arabic', 'Armenian', 'Azerbaijani', 'Basque', 'Belarusian', 'Bengali', 'Bulgarian', 'Chinese', 'Croatian', 'Dutch', 'Estonian', 'Finnish', 'French', 'Georgian', 'German', 'Greek', 'Hebrew', 'Hindi', 'Hungarian', 'Indonesian', 'Italian', 'Japanese', 'Kazakh', 'Korean', 'Lithuanian', 'Malay', 'Malayalam', 'Nepali', 'North Macedonian', 'Persian', 'Polish', 'Portuguese', 'Russian', 'Serbian', 'Spanish', 'Tagalog', 'Tamil', 'Telugu', 'Turkish', 'Ukrainian', 'Urdu', 'Uzbek', 'Vietnamese']
eval_framework.tasks.benchmarks.include.subject_to_language(subject)[source]
Return type:

Language

Parameters:

subject (str)
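
subject_to_language maps an INCLUDE subject (a language name) to the corresponding Language member, mirroring the LANGUAGE mapping shown above (e.g. 'German' -> Language.DEU); a usage sketch:

    from eval_framework.tasks.benchmarks.include import subject_to_language

    print(subject_to_language("German"))    # expected: Language.DEU (per the mapping above)
    print(subject_to_language("Japanese"))  # expected: Language.JPN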

eval_framework.tasks.benchmarks.infinitebench module

class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens (https://github.com/OpenBMB/InfiniteBench).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'xinrongzhang2022/InfiniteBench'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None
SUBJECTS: list[SubjectType] = ['default']
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchCompletion(num_fewshot=0)[source]

Bases: InfiniteBench, ABC

Base class for completion tasks.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
RESPONSE_TYPE: ResponseType = 'completion'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchLoglikelihood(num_fewshot=0)[source]

Bases: InfiniteBench, ABC

Base class for loglikelihood tasks.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeDebug(num_fewshot=0)[source]

Bases: InfiniteBenchLoglikelihood

Finding which function in a code repo contains a crashing error (MC form).

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'code_debug'
NAME: str = 'InfiniteBench_CodeDebug'
SAMPLE_SPLIT: str = 'code_debug'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeRun(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Simulating execution of multiple simple, synthetic functions.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'code_run'
NAME: str = 'InfiniteBench_CodeRun'
SAMPLE_SPLIT: str = 'code_run'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnDia(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Identification of talkers in partially anonymized scripts.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longdialogue_qa_eng'
NAME: str = 'InfiniteBench_EnDia'
SAMPLE_SPLIT: str = 'longdialogue_qa_eng'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnMC(num_fewshot=0)[source]

Bases: InfiniteBenchLoglikelihood

Multiple choice questions derived from the fake book.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longbook_choice_eng'
NAME: str = 'InfiniteBench_EnMC'
SAMPLE_SPLIT: str = 'longbook_choice_eng'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnQA(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Free-form question answering based on the fake book.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longbook_qa_eng'
NAME: str = 'InfiniteBench_EnQA'
SAMPLE_SPLIT: str = 'longbook_qa_eng'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_MathFind(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Finding special integers in a lengthy list.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'math_find'
NAME: str = 'InfiniteBench_MathFind'
SAMPLE_SPLIT: str = 'math_find'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveKV2(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Finding the corresponding value from a dictionary and a key.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'kv_retrieval'
NAME: str = 'InfiniteBench_RetrieveKV2'
SAMPLE_SPLIT: str = 'kv_retrieval'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveNumber(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Locating repeated hidden numbers in a noisy long context.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'number_string'
NAME: str = 'InfiniteBench_RetrieveNumber'
SAMPLE_SPLIT: str = 'number_string'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrievePassKey1(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Retrieving hidden keys in a noisy long context.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'passkey'
NAME: str = 'InfiniteBench_RetrievePassKey1'
SAMPLE_SPLIT: str = 'passkey'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]

eval_framework.tasks.benchmarks.math_reasoning module

class eval_framework.tasks.benchmarks.math_reasoning.AIME2024(num_fewshot=0)[source]

Bases: MATHReasoning

AIME 2024 dataset: https://huggingface.co/datasets/HuggingFaceH4/aime_2024

This dataset contains a single train split of 30 questions with the columns:

ID | Problem | Solution | Answer

Evaluation is pass@1.

Parameters:

num_fewshot (int)

ANSWER_PATTERN = 'Therefore, the final answer is:(.*?). I hope it is correct.'
DATASET_PATH: str = 'HuggingFaceH4/aime_2024'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'AIME2024'
QUERY_TEMPLATE = 'Solve the following math problem efficiently and clearly:\n\n    - For simple problems (2 steps or fewer):\n    Provide a concise solution with minimal explanation.\n\n    - For complex problems (3 steps or more):\n    Use this step-by-step format:\n\n    ## Step 1: [Concise description]\n    [Brief explanation and calculations]\n\n    ## Step 2: [Concise description]\n    [Brief explanation and calculations]\n\n    ...\n\n    Regardless of the approach, always conclude with:\n\n    Therefore, the final answer is: $\\boxed{{answer}}$. I hope it is correct.\n\n    Where [answer] is just the final number or expression that solves the problem.\n\n    Problem: {Question}'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
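
ANSWER_PATTERN targets the closing sentence that QUERY_TEMPLATE asks the model to produce; a minimal extraction sketch using the documented pattern:

    import re

    ANSWER_PATTERN = r"Therefore, the final answer is:(.*?). I hope it is correct."  # as documented above

    completion = (
        "## Step 1: Count the valid configurations...\n"
        "Therefore, the final answer is: $\\boxed{204}$. I hope it is correct."
    )
    match = re.search(ANSWER_PATTERN, completion)
    print(match.group(1).strip() if match else "")  # $\boxed{204}$
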
class eval_framework.tasks.benchmarks.math_reasoning.AIME2025(num_fewshot=0)[source]

Bases: AIME2024

AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25

This dataset contains a single test split of 30 questions. Data contains problem | answer | id

pass@1 evaluation

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'math-ai/aime25'
FEWSHOT_SPLIT: str = 'test'
NAME: str = 'AIME2025'
SAMPLE_SPLIT: str = 'test'
class eval_framework.tasks.benchmarks.math_reasoning.GSM8KReasoning(num_fewshot=0)[source]

Bases: MATHReasoning

GSM8K dataset with reasoning prompt: https://huggingface.co/datasets/openai/gsm8k

Zero-shot reasoning version that expects answers in boxed format.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'gsm8k'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'GSM8KReasoning'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
QUERY_TEMPLATE = 'Solve the following math problem step by step. Think through the problem carefully and show your reasoning.\n\nPlease provide your answer in the format: $\\boxed{{answer}}$ where answer is the final numerical result.\n\nQuestion: {question}\n\nAnswer:'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['main']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
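
QUERY_TEMPLATE is presumably filled in with str.format, so the doubled braces around 'answer' survive as literal braces while '{question}' is substituted; a sketch with a made-up question:

    QUERY_TEMPLATE = (
        "Solve the following math problem step by step. Think through the problem carefully "
        "and show your reasoning.\n\n"
        "Please provide your answer in the format: $\\boxed{{answer}}$ where answer is the "
        "final numerical result.\n\n"
        "Question: {question}\n\nAnswer:"
    )

    prompt = QUERY_TEMPLATE.format(question="A farm has 12 cows and twice as many sheep. How many animals are there?")
    # '{{answer}}' is now the literal '{answer}' and the question has been filled in.
    print(prompt)
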
class eval_framework.tasks.benchmarks.math_reasoning.MATH(num_fewshot=0)[source]

Bases: MATHReasoning

MATH dataset: https://huggingface.co/datasets/EleutherAI/hendrycks_math

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'EleutherAI/hendrycks_math'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'Math'
QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n    {Question}\n\n    Remember to put your answer in $\\boxed{{answer}}$\n\n    where [answer] is just the final number or expression that solves the problem.'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']
extract_last_two_dollar_text(s)[source]

extract_last_two_dollar_text finds the text between the last two dollar signs in a string and returns the extracted text.

Parameters:

s (str)

Return type:

str

post_process_generated_completion(completion_text, sample=None)[source]

post_process_generated_completion extracts the answer via flexible extraction/matching: a boxed answer, if present, is used first; otherwise, if LaTeX math delimiters ("$") are present, the text between them is extracted and used; failing that, text following an answer prefix ("Answer:") is used.

Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
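
The extraction order described above (boxed answer first, then the text between the last LaTeX '$' delimiters, then an 'Answer:' prefix) can be approximated with a short sketch; the real extract_last_two_dollar_text and post_process_generated_completion implementations may differ in detail:

    import re

    def extract_last_two_dollar_text(s: str) -> str:
        # Sketch: text between the last two '$' signs, or '' if fewer than two are present.
        parts = s.rsplit("$", 2)
        return parts[1] if len(parts) == 3 else ""

    def flex_extract(completion: str) -> str:
        boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        if boxed:                        # 1. a boxed answer wins
            return boxed[-1]
        if completion.count("$") >= 2:   # 2. then LaTeX math delimiters
            return extract_last_two_dollar_text(completion)
        match = re.search(r"(?i)Answer\s*:\s*(.*)", completion)  # 3. finally the 'Answer:' prefix
        return match.group(1).strip() if match else completion.strip()

    print(flex_extract("The area is $\\boxed{12}$ square units."))  # 12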

class eval_framework.tasks.benchmarks.math_reasoning.MATH500(num_fewshot=0)[source]

Bases: MATHReasoning

MATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500

This dataset contains a single test split of 500 questions with the columns:

ID | Problem | Solution | Answer

Evaluation is pass@1.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'HuggingFaceH4/MATH-500'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'MATH500'
QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n    {Question}\n\n    Remember to put your answer in $\\boxed{{answer}}$\n\n    where [answer] is just the final number or expression that solves the problem.'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.math_reasoning.MATHLvl5(num_fewshot=0)[source]

Bases: MATH

Parameters:

num_fewshot (int)

NAME: str = 'Math Lvl 5'
class eval_framework.tasks.benchmarks.math_reasoning.MATHReasoning(num_fewshot=0)[source]

Bases: BaseTask[str]

Base class shared by the math reasoning tasks in this module (AIME2024, AIME2025, GSM8KReasoning, MATH, MATH500); dataset-specific details are documented on the individual task classes.

Parameters:

num_fewshot (int)

ANSWER_PATTERN = '(?i)Answer\\s*:\\s*(.*)'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>]
RESPONSE_TYPE: ResponseType = 'completion'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.mbpp module

class eval_framework.tasks.benchmarks.mbpp.MBPP(num_fewshot=0)[source]

Bases: BaseTask[str]

MBPP provides both the problem statement and the test cases upfront. It says, "Here is the problem and here are the tests; write code that passes them." Note that LLMs can cheat by writing code that merely passes the tests without solving the stated problem.

MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only provides the problem statement and function signature initially. It says, "Here is the problem and function signature; write code, and the tests will be run afterwards."

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google-research-datasets/mbpp'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]
NAME: str = 'MBPP'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['full']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.mbpp.MBPPMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • tests_code (str)

  • extra_data (Any)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

tests_code: str
class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS(num_fewshot=0)[source]

Bases: MBPP

MBPP provides both the problem statement and the test cases upfront. It says, "Here is the problem and here are the tests; write code that passes them." Note that LLMs can cheat by writing code that merely passes the tests without solving the stated problem.

MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only provides the problem statement and function signature initially. It says, "Here is the problem and function signature; write code, and the tests will be run afterwards."

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS_SANITIZED(num_fewshot=0)[source]

Bases: MBPP_PROMPT_WITHOUT_TESTS

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS_SANITIZED'
SUBJECTS: list[SubjectType] = ['sanitized']
class eval_framework.tasks.benchmarks.mbpp.MBPP_SANITIZED(num_fewshot=0)[source]

Bases: MBPP

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_SANITZED'
SUBJECTS: list[SubjectType] = ['sanitized']

eval_framework.tasks.benchmarks.mmlu module

class eval_framework.tasks.benchmarks.mmlu.FullTextMMLU(num_fewshot=0)[source]

Bases: MMLU

Variant of the MMLU dataset where the model is expected to reproduce the full choice text rather than just the answer key.

Parameters:

num_fewshot (int)

NAME: str = 'Full Text MMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'answers', 'A', 'B', 'C', 'D']
class eval_framework.tasks.benchmarks.mmlu.MMLU(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU dataset: https://huggingface.co/datasets/cais/mmlu

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'cais/mmlu'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']
class eval_framework.tasks.benchmarks.mmlu.MMLU_COT(num_fewshot=0)[source]

Bases: MMLU

MMLU dataset with instruction to summarize reasoning and conclude with answer. Inspired by https://arxiv.org/pdf/2411.15124 (Table 44)

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is: ([ABCD])')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'MMLU_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
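
MMLU_COT scores the completion text rather than loglikelihoods, so the final answer letter has to be recovered from the generated reasoning; the documented ANS_RE pattern does exactly that:

    import re

    ANS_RE = re.compile(r"Therefore, the answer is: ([ABCD])")  # as documented above

    completion = "Mitochondria synthesise ATP, which corresponds to option B. Therefore, the answer is: B"
    match = ANS_RE.search(completion)
    print(match.group(1) if match else "")  # B
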
class eval_framework.tasks.benchmarks.mmlu.MMLU_IDK(num_fewshot=0)[source]

Bases: MMLU

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'MMLU_IDK'

eval_framework.tasks.benchmarks.mmlu_de module

class eval_framework.tasks.benchmarks.mmlu_de.MMLU_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU DE dataset: https://huggingface.co/datasets/LeoLM/MMLU_de

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/MMLU_de'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU_DE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

eval_framework.tasks.benchmarks.mmlu_pro module

class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU_PRO dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'TIGER-Lab/MMLU-Pro'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU Pro'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['engineering', 'physics', 'psychology', 'chemistry', 'biology', 'law', 'philosophy', 'computer science', 'other', 'economics', 'business', 'history', 'math', 'health']
class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_COT(num_fewshot=0)[source]

Bases: MMLU_PRO

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'MMLU_PRO_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_IDK(num_fewshot=0)[source]

Bases: MMLU_PRO

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'MMLU Pro_IDK'

eval_framework.tasks.benchmarks.mmmlu module

class eval_framework.tasks.benchmarks.mmmlu.MMMLU(num_fewshot=0)[source]

Bases: BaseTask[tuple[str, str]]

MMMLU dataset: https://huggingface.co/datasets/openai/MMMLU

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openai/MMMLU'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('AR', 'abstract_algebra')": Language.ARB, "('AR', 'anatomy')": Language.ARB, "('AR', 'astronomy')": Language.ARB, "('AR', 'business_ethics')": Language.ARB, "('AR', 'clinical_knowledge')": Language.ARB, "('AR', 'college_biology')": Language.ARB, "('AR', 'college_chemistry')": Language.ARB, "('AR', 'college_computer_science')": Language.ARB, "('AR', 'college_mathematics')": Language.ARB, "('AR', 'college_medicine')": Language.ARB, "('AR', 'college_physics')": Language.ARB, "('AR', 'computer_security')": Language.ARB, "('AR', 'conceptual_physics')": Language.ARB, "('AR', 'econometrics')": Language.ARB, "('AR', 'electrical_engineering')": Language.ARB, "('AR', 'elementary_mathematics')": Language.ARB, "('AR', 'formal_logic')": Language.ARB, "('AR', 'global_facts')": Language.ARB, "('AR', 'high_school_biology')": Language.ARB, "('AR', 'high_school_chemistry')": Language.ARB, "('AR', 'high_school_computer_science')": Language.ARB, "('AR', 'high_school_european_history')": Language.ARB, "('AR', 'high_school_geography')": Language.ARB, "('AR', 'high_school_government_and_politics')": Language.ARB, "('AR', 'high_school_macroeconomics')": Language.ARB, "('AR', 'high_school_mathematics')": Language.ARB, "('AR', 'high_school_microeconomics')": Language.ARB, "('AR', 'high_school_physics')": Language.ARB, "('AR', 'high_school_psychology')": Language.ARB, "('AR', 'high_school_statistics')": Language.ARB, "('AR', 'high_school_us_history')": Language.ARB, "('AR', 'high_school_world_history')": Language.ARB, "('AR', 'human_aging')": Language.ARB, "('AR', 'human_sexuality')": Language.ARB, "('AR', 'international_law')": Language.ARB, "('AR', 'jurisprudence')": Language.ARB, "('AR', 'logical_fallacies')": Language.ARB, "('AR', 'machine_learning')": Language.ARB, "('AR', 'management')": Language.ARB, "('AR', 'marketing')": Language.ARB, "('AR', 'medical_genetics')": Language.ARB, "('AR', 'miscellaneous')": Language.ARB, "('AR', 'moral_disputes')": Language.ARB, "('AR', 'moral_scenarios')": Language.ARB, "('AR', 'nutrition')": Language.ARB, "('AR', 'philosophy')": Language.ARB, "('AR', 'prehistory')": Language.ARB, "('AR', 'professional_accounting')": Language.ARB, "('AR', 'professional_law')": Language.ARB, "('AR', 'professional_medicine')": Language.ARB, "('AR', 'professional_psychology')": Language.ARB, "('AR', 'public_relations')": Language.ARB, "('AR', 'security_studies')": Language.ARB, "('AR', 'sociology')": Language.ARB, "('AR', 'us_foreign_policy')": Language.ARB, "('AR', 'virology')": Language.ARB, "('AR', 'world_religions')": Language.ARB, "('DE', 'abstract_algebra')": Language.DEU, "('DE', 'anatomy')": Language.DEU, "('DE', 'astronomy')": Language.DEU, "('DE', 'business_ethics')": Language.DEU, "('DE', 'clinical_knowledge')": Language.DEU, "('DE', 'college_biology')": Language.DEU, "('DE', 'college_chemistry')": Language.DEU, "('DE', 'college_computer_science')": Language.DEU, "('DE', 'college_mathematics')": Language.DEU, "('DE', 'college_medicine')": Language.DEU, "('DE', 'college_physics')": Language.DEU, "('DE', 'computer_security')": Language.DEU, "('DE', 'conceptual_physics')": Language.DEU, "('DE', 'econometrics')": Language.DEU, "('DE', 'electrical_engineering')": Language.DEU, "('DE', 'elementary_mathematics')": Language.DEU, "('DE', 'formal_logic')": Language.DEU, "('DE', 'global_facts')": Language.DEU, "('DE', 'high_school_biology')": Language.DEU, "('DE', 'high_school_chemistry')": Language.DEU, 
"('DE', 'high_school_computer_science')": Language.DEU, "('DE', 'high_school_european_history')": Language.DEU, "('DE', 'high_school_geography')": Language.DEU, "('DE', 'high_school_government_and_politics')": Language.DEU, "('DE', 'high_school_macroeconomics')": Language.DEU, "('DE', 'high_school_mathematics')": Language.DEU, "('DE', 'high_school_microeconomics')": Language.DEU, "('DE', 'high_school_physics')": Language.DEU, "('DE', 'high_school_psychology')": Language.DEU, "('DE', 'high_school_statistics')": Language.DEU, "('DE', 'high_school_us_history')": Language.DEU, "('DE', 'high_school_world_history')": Language.DEU, "('DE', 'human_aging')": Language.DEU, "('DE', 'human_sexuality')": Language.DEU, "('DE', 'international_law')": Language.DEU, "('DE', 'jurisprudence')": Language.DEU, "('DE', 'logical_fallacies')": Language.DEU, "('DE', 'machine_learning')": Language.DEU, "('DE', 'management')": Language.DEU, "('DE', 'marketing')": Language.DEU, "('DE', 'medical_genetics')": Language.DEU, "('DE', 'miscellaneous')": Language.DEU, "('DE', 'moral_disputes')": Language.DEU, "('DE', 'moral_scenarios')": Language.DEU, "('DE', 'nutrition')": Language.DEU, "('DE', 'philosophy')": Language.DEU, "('DE', 'prehistory')": Language.DEU, "('DE', 'professional_accounting')": Language.DEU, "('DE', 'professional_law')": Language.DEU, "('DE', 'professional_medicine')": Language.DEU, "('DE', 'professional_psychology')": Language.DEU, "('DE', 'public_relations')": Language.DEU, "('DE', 'security_studies')": Language.DEU, "('DE', 'sociology')": Language.DEU, "('DE', 'us_foreign_policy')": Language.DEU, "('DE', 'virology')": Language.DEU, "('DE', 'world_religions')": Language.DEU, "('ES', 'abstract_algebra')": Language.SPA, "('ES', 'anatomy')": Language.SPA, "('ES', 'astronomy')": Language.SPA, "('ES', 'business_ethics')": Language.SPA, "('ES', 'clinical_knowledge')": Language.SPA, "('ES', 'college_biology')": Language.SPA, "('ES', 'college_chemistry')": Language.SPA, "('ES', 'college_computer_science')": Language.SPA, "('ES', 'college_mathematics')": Language.SPA, "('ES', 'college_medicine')": Language.SPA, "('ES', 'college_physics')": Language.SPA, "('ES', 'computer_security')": Language.SPA, "('ES', 'conceptual_physics')": Language.SPA, "('ES', 'econometrics')": Language.SPA, "('ES', 'electrical_engineering')": Language.SPA, "('ES', 'elementary_mathematics')": Language.SPA, "('ES', 'formal_logic')": Language.SPA, "('ES', 'global_facts')": Language.SPA, "('ES', 'high_school_biology')": Language.SPA, "('ES', 'high_school_chemistry')": Language.SPA, "('ES', 'high_school_computer_science')": Language.SPA, "('ES', 'high_school_european_history')": Language.SPA, "('ES', 'high_school_geography')": Language.SPA, "('ES', 'high_school_government_and_politics')": Language.SPA, "('ES', 'high_school_macroeconomics')": Language.SPA, "('ES', 'high_school_mathematics')": Language.SPA, "('ES', 'high_school_microeconomics')": Language.SPA, "('ES', 'high_school_physics')": Language.SPA, "('ES', 'high_school_psychology')": Language.SPA, "('ES', 'high_school_statistics')": Language.SPA, "('ES', 'high_school_us_history')": Language.SPA, "('ES', 'high_school_world_history')": Language.SPA, "('ES', 'human_aging')": Language.SPA, "('ES', 'human_sexuality')": Language.SPA, "('ES', 'international_law')": Language.SPA, "('ES', 'jurisprudence')": Language.SPA, "('ES', 'logical_fallacies')": Language.SPA, "('ES', 'machine_learning')": Language.SPA, "('ES', 'management')": Language.SPA, "('ES', 'marketing')": Language.SPA, "('ES', 
'medical_genetics')": Language.SPA, "('ES', 'miscellaneous')": Language.SPA, "('ES', 'moral_disputes')": Language.SPA, "('ES', 'moral_scenarios')": Language.SPA, "('ES', 'nutrition')": Language.SPA, "('ES', 'philosophy')": Language.SPA, "('ES', 'prehistory')": Language.SPA, "('ES', 'professional_accounting')": Language.SPA, "('ES', 'professional_law')": Language.SPA, "('ES', 'professional_medicine')": Language.SPA, "('ES', 'professional_psychology')": Language.SPA, "('ES', 'public_relations')": Language.SPA, "('ES', 'security_studies')": Language.SPA, "('ES', 'sociology')": Language.SPA, "('ES', 'us_foreign_policy')": Language.SPA, "('ES', 'virology')": Language.SPA, "('ES', 'world_religions')": Language.SPA, "('FR', 'abstract_algebra')": Language.FRA, "('FR', 'anatomy')": Language.FRA, "('FR', 'astronomy')": Language.FRA, "('FR', 'business_ethics')": Language.FRA, "('FR', 'clinical_knowledge')": Language.FRA, "('FR', 'college_biology')": Language.FRA, "('FR', 'college_chemistry')": Language.FRA, "('FR', 'college_computer_science')": Language.FRA, "('FR', 'college_mathematics')": Language.FRA, "('FR', 'college_medicine')": Language.FRA, "('FR', 'college_physics')": Language.FRA, "('FR', 'computer_security')": Language.FRA, "('FR', 'conceptual_physics')": Language.FRA, "('FR', 'econometrics')": Language.FRA, "('FR', 'electrical_engineering')": Language.FRA, "('FR', 'elementary_mathematics')": Language.FRA, "('FR', 'formal_logic')": Language.FRA, "('FR', 'global_facts')": Language.FRA, "('FR', 'high_school_biology')": Language.FRA, "('FR', 'high_school_chemistry')": Language.FRA, "('FR', 'high_school_computer_science')": Language.FRA, "('FR', 'high_school_european_history')": Language.FRA, "('FR', 'high_school_geography')": Language.FRA, "('FR', 'high_school_government_and_politics')": Language.FRA, "('FR', 'high_school_macroeconomics')": Language.FRA, "('FR', 'high_school_mathematics')": Language.FRA, "('FR', 'high_school_microeconomics')": Language.FRA, "('FR', 'high_school_physics')": Language.FRA, "('FR', 'high_school_psychology')": Language.FRA, "('FR', 'high_school_statistics')": Language.FRA, "('FR', 'high_school_us_history')": Language.FRA, "('FR', 'high_school_world_history')": Language.FRA, "('FR', 'human_aging')": Language.FRA, "('FR', 'human_sexuality')": Language.FRA, "('FR', 'international_law')": Language.FRA, "('FR', 'jurisprudence')": Language.FRA, "('FR', 'logical_fallacies')": Language.FRA, "('FR', 'machine_learning')": Language.FRA, "('FR', 'management')": Language.FRA, "('FR', 'marketing')": Language.FRA, "('FR', 'medical_genetics')": Language.FRA, "('FR', 'miscellaneous')": Language.FRA, "('FR', 'moral_disputes')": Language.FRA, "('FR', 'moral_scenarios')": Language.FRA, "('FR', 'nutrition')": Language.FRA, "('FR', 'philosophy')": Language.FRA, "('FR', 'prehistory')": Language.FRA, "('FR', 'professional_accounting')": Language.FRA, "('FR', 'professional_law')": Language.FRA, "('FR', 'professional_medicine')": Language.FRA, "('FR', 'professional_psychology')": Language.FRA, "('FR', 'public_relations')": Language.FRA, "('FR', 'security_studies')": Language.FRA, "('FR', 'sociology')": Language.FRA, "('FR', 'us_foreign_policy')": Language.FRA, "('FR', 'virology')": Language.FRA, "('FR', 'world_religions')": Language.FRA, "('IT', 'abstract_algebra')": Language.ITA, "('IT', 'anatomy')": Language.ITA, "('IT', 'astronomy')": Language.ITA, "('IT', 'business_ethics')": Language.ITA, "('IT', 'clinical_knowledge')": Language.ITA, "('IT', 'college_biology')": Language.ITA, "('IT', 
'college_chemistry')": Language.ITA, "('IT', 'college_computer_science')": Language.ITA, "('IT', 'college_mathematics')": Language.ITA, "('IT', 'college_medicine')": Language.ITA, "('IT', 'college_physics')": Language.ITA, "('IT', 'computer_security')": Language.ITA, "('IT', 'conceptual_physics')": Language.ITA, "('IT', 'econometrics')": Language.ITA, "('IT', 'electrical_engineering')": Language.ITA, "('IT', 'elementary_mathematics')": Language.ITA, "('IT', 'formal_logic')": Language.ITA, "('IT', 'global_facts')": Language.ITA, "('IT', 'high_school_biology')": Language.ITA, "('IT', 'high_school_chemistry')": Language.ITA, "('IT', 'high_school_computer_science')": Language.ITA, "('IT', 'high_school_european_history')": Language.ITA, "('IT', 'high_school_geography')": Language.ITA, "('IT', 'high_school_government_and_politics')": Language.ITA, "('IT', 'high_school_macroeconomics')": Language.ITA, "('IT', 'high_school_mathematics')": Language.ITA, "('IT', 'high_school_microeconomics')": Language.ITA, "('IT', 'high_school_physics')": Language.ITA, "('IT', 'high_school_psychology')": Language.ITA, "('IT', 'high_school_statistics')": Language.ITA, "('IT', 'high_school_us_history')": Language.ITA, "('IT', 'high_school_world_history')": Language.ITA, "('IT', 'human_aging')": Language.ITA, "('IT', 'human_sexuality')": Language.ITA, "('IT', 'international_law')": Language.ITA, "('IT', 'jurisprudence')": Language.ITA, "('IT', 'logical_fallacies')": Language.ITA, "('IT', 'machine_learning')": Language.ITA, "('IT', 'management')": Language.ITA, "('IT', 'marketing')": Language.ITA, "('IT', 'medical_genetics')": Language.ITA, "('IT', 'miscellaneous')": Language.ITA, "('IT', 'moral_disputes')": Language.ITA, "('IT', 'moral_scenarios')": Language.ITA, "('IT', 'nutrition')": Language.ITA, "('IT', 'philosophy')": Language.ITA, "('IT', 'prehistory')": Language.ITA, "('IT', 'professional_accounting')": Language.ITA, "('IT', 'professional_law')": Language.ITA, "('IT', 'professional_medicine')": Language.ITA, "('IT', 'professional_psychology')": Language.ITA, "('IT', 'public_relations')": Language.ITA, "('IT', 'security_studies')": Language.ITA, "('IT', 'sociology')": Language.ITA, "('IT', 'us_foreign_policy')": Language.ITA, "('IT', 'virology')": Language.ITA, "('IT', 'world_religions')": Language.ITA, "('PT', 'abstract_algebra')": Language.POR, "('PT', 'anatomy')": Language.POR, "('PT', 'astronomy')": Language.POR, "('PT', 'business_ethics')": Language.POR, "('PT', 'clinical_knowledge')": Language.POR, "('PT', 'college_biology')": Language.POR, "('PT', 'college_chemistry')": Language.POR, "('PT', 'college_computer_science')": Language.POR, "('PT', 'college_mathematics')": Language.POR, "('PT', 'college_medicine')": Language.POR, "('PT', 'college_physics')": Language.POR, "('PT', 'computer_security')": Language.POR, "('PT', 'conceptual_physics')": Language.POR, "('PT', 'econometrics')": Language.POR, "('PT', 'electrical_engineering')": Language.POR, "('PT', 'elementary_mathematics')": Language.POR, "('PT', 'formal_logic')": Language.POR, "('PT', 'global_facts')": Language.POR, "('PT', 'high_school_biology')": Language.POR, "('PT', 'high_school_chemistry')": Language.POR, "('PT', 'high_school_computer_science')": Language.POR, "('PT', 'high_school_european_history')": Language.POR, "('PT', 'high_school_geography')": Language.POR, "('PT', 'high_school_government_and_politics')": Language.POR, "('PT', 'high_school_macroeconomics')": Language.POR, "('PT', 'high_school_mathematics')": Language.POR, "('PT', 
'high_school_microeconomics')": Language.POR, "('PT', 'high_school_physics')": Language.POR, "('PT', 'high_school_psychology')": Language.POR, "('PT', 'high_school_statistics')": Language.POR, "('PT', 'high_school_us_history')": Language.POR, "('PT', 'high_school_world_history')": Language.POR, "('PT', 'human_aging')": Language.POR, "('PT', 'human_sexuality')": Language.POR, "('PT', 'international_law')": Language.POR, "('PT', 'jurisprudence')": Language.POR, "('PT', 'logical_fallacies')": Language.POR, "('PT', 'machine_learning')": Language.POR, "('PT', 'management')": Language.POR, "('PT', 'marketing')": Language.POR, "('PT', 'medical_genetics')": Language.POR, "('PT', 'miscellaneous')": Language.POR, "('PT', 'moral_disputes')": Language.POR, "('PT', 'moral_scenarios')": Language.POR, "('PT', 'nutrition')": Language.POR, "('PT', 'philosophy')": Language.POR, "('PT', 'prehistory')": Language.POR, "('PT', 'professional_accounting')": Language.POR, "('PT', 'professional_law')": Language.POR, "('PT', 'professional_medicine')": Language.POR, "('PT', 'professional_psychology')": Language.POR, "('PT', 'public_relations')": Language.POR, "('PT', 'security_studies')": Language.POR, "('PT', 'sociology')": Language.POR, "('PT', 'us_foreign_policy')": Language.POR, "('PT', 'virology')": Language.POR, "('PT', 'world_religions')": Language.POR}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = [('FR_FR', 'abstract_algebra'), ('FR_FR', 'anatomy'), ('FR_FR', 'astronomy'), ('FR_FR', 'business_ethics'), ('FR_FR', 'clinical_knowledge'), ('FR_FR', 'college_biology'), ('FR_FR', 'college_chemistry'), ('FR_FR', 'college_computer_science'), ('FR_FR', 'college_mathematics'), ('FR_FR', 'college_medicine'), ('FR_FR', 'college_physics'), ('FR_FR', 'computer_security'), ('FR_FR', 'conceptual_physics'), ('FR_FR', 'econometrics'), ('FR_FR', 'electrical_engineering'), ('FR_FR', 'elementary_mathematics'), ('FR_FR', 'formal_logic'), ('FR_FR', 'global_facts'), ('FR_FR', 'high_school_biology'), ('FR_FR', 'high_school_chemistry'), ('FR_FR', 'high_school_computer_science'), ('FR_FR', 'high_school_european_history'), ('FR_FR', 'high_school_geography'), ('FR_FR', 'high_school_government_and_politics'), ('FR_FR', 'high_school_macroeconomics'), ('FR_FR', 'high_school_mathematics'), ('FR_FR', 'high_school_microeconomics'), ('FR_FR', 'high_school_physics'), ('FR_FR', 'high_school_psychology'), ('FR_FR', 'high_school_statistics'), ('FR_FR', 'high_school_us_history'), ('FR_FR', 'high_school_world_history'), ('FR_FR', 'human_aging'), ('FR_FR', 'human_sexuality'), ('FR_FR', 'international_law'), ('FR_FR', 'jurisprudence'), ('FR_FR', 'logical_fallacies'), ('FR_FR', 'machine_learning'), ('FR_FR', 'management'), ('FR_FR', 'marketing'), ('FR_FR', 'medical_genetics'), ('FR_FR', 'miscellaneous'), ('FR_FR', 'moral_disputes'), ('FR_FR', 'moral_scenarios'), ('FR_FR', 'nutrition'), ('FR_FR', 'philosophy'), ('FR_FR', 'prehistory'), ('FR_FR', 'professional_accounting'), ('FR_FR', 'professional_law'), ('FR_FR', 'professional_medicine'), ('FR_FR', 'professional_psychology'), ('FR_FR', 'public_relations'), ('FR_FR', 'security_studies'), ('FR_FR', 'sociology'), ('FR_FR', 'us_foreign_policy'), ('FR_FR', 'virology'), ('FR_FR', 'world_religions'), ('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 
'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions'), ('ES_LA', 'abstract_algebra'), ('ES_LA', 'anatomy'), ('ES_LA', 'astronomy'), ('ES_LA', 'business_ethics'), ('ES_LA', 'clinical_knowledge'), ('ES_LA', 'college_biology'), ('ES_LA', 'college_chemistry'), ('ES_LA', 'college_computer_science'), ('ES_LA', 'college_mathematics'), ('ES_LA', 'college_medicine'), ('ES_LA', 'college_physics'), ('ES_LA', 'computer_security'), ('ES_LA', 'conceptual_physics'), ('ES_LA', 'econometrics'), ('ES_LA', 'electrical_engineering'), ('ES_LA', 'elementary_mathematics'), ('ES_LA', 'formal_logic'), ('ES_LA', 'global_facts'), ('ES_LA', 'high_school_biology'), ('ES_LA', 'high_school_chemistry'), ('ES_LA', 'high_school_computer_science'), ('ES_LA', 'high_school_european_history'), ('ES_LA', 'high_school_geography'), ('ES_LA', 'high_school_government_and_politics'), ('ES_LA', 'high_school_macroeconomics'), ('ES_LA', 'high_school_mathematics'), ('ES_LA', 'high_school_microeconomics'), ('ES_LA', 'high_school_physics'), ('ES_LA', 'high_school_psychology'), ('ES_LA', 'high_school_statistics'), ('ES_LA', 'high_school_us_history'), ('ES_LA', 'high_school_world_history'), ('ES_LA', 'human_aging'), ('ES_LA', 'human_sexuality'), ('ES_LA', 'international_law'), ('ES_LA', 'jurisprudence'), ('ES_LA', 'logical_fallacies'), ('ES_LA', 'machine_learning'), ('ES_LA', 'management'), ('ES_LA', 'marketing'), ('ES_LA', 'medical_genetics'), ('ES_LA', 'miscellaneous'), ('ES_LA', 'moral_disputes'), ('ES_LA', 'moral_scenarios'), ('ES_LA', 'nutrition'), ('ES_LA', 'philosophy'), ('ES_LA', 'prehistory'), ('ES_LA', 'professional_accounting'), ('ES_LA', 'professional_law'), ('ES_LA', 'professional_medicine'), ('ES_LA', 'professional_psychology'), ('ES_LA', 'public_relations'), ('ES_LA', 'security_studies'), ('ES_LA', 'sociology'), ('ES_LA', 'us_foreign_policy'), ('ES_LA', 'virology'), ('ES_LA', 'world_religions'), ('IT_IT', 'abstract_algebra'), ('IT_IT', 'anatomy'), ('IT_IT', 'astronomy'), ('IT_IT', 'business_ethics'), ('IT_IT', 'clinical_knowledge'), ('IT_IT', 'college_biology'), ('IT_IT', 'college_chemistry'), ('IT_IT', 'college_computer_science'), ('IT_IT', 'college_mathematics'), ('IT_IT', 'college_medicine'), ('IT_IT', 'college_physics'), ('IT_IT', 'computer_security'), ('IT_IT', 'conceptual_physics'), ('IT_IT', 'econometrics'), ('IT_IT', 'electrical_engineering'), ('IT_IT', 'elementary_mathematics'), ('IT_IT', 'formal_logic'), ('IT_IT', 'global_facts'), ('IT_IT', 'high_school_biology'), ('IT_IT', 'high_school_chemistry'), ('IT_IT', 'high_school_computer_science'), ('IT_IT', 'high_school_european_history'), ('IT_IT', 'high_school_geography'), ('IT_IT', 'high_school_government_and_politics'), ('IT_IT', 'high_school_macroeconomics'), ('IT_IT', 'high_school_mathematics'), ('IT_IT', 'high_school_microeconomics'), ('IT_IT', 'high_school_physics'), ('IT_IT', 'high_school_psychology'), ('IT_IT', 'high_school_statistics'), ('IT_IT', 'high_school_us_history'), ('IT_IT', 'high_school_world_history'), ('IT_IT', 'human_aging'), ('IT_IT', 'human_sexuality'), ('IT_IT', 'international_law'), ('IT_IT', 'jurisprudence'), ('IT_IT', 'logical_fallacies'), ('IT_IT', 'machine_learning'), ('IT_IT', 'management'), ('IT_IT', 'marketing'), ('IT_IT', 'medical_genetics'), ('IT_IT', 'miscellaneous'), ('IT_IT', 'moral_disputes'), ('IT_IT', 'moral_scenarios'), ('IT_IT', 'nutrition'), ('IT_IT', 'philosophy'), ('IT_IT', 'prehistory'), ('IT_IT', 
'professional_accounting'), ('IT_IT', 'professional_law'), ('IT_IT', 'professional_medicine'), ('IT_IT', 'professional_psychology'), ('IT_IT', 'public_relations'), ('IT_IT', 'security_studies'), ('IT_IT', 'sociology'), ('IT_IT', 'us_foreign_policy'), ('IT_IT', 'virology'), ('IT_IT', 'world_religions'), ('PT_BR', 'abstract_algebra'), ('PT_BR', 'anatomy'), ('PT_BR', 'astronomy'), ('PT_BR', 'business_ethics'), ('PT_BR', 'clinical_knowledge'), ('PT_BR', 'college_biology'), ('PT_BR', 'college_chemistry'), ('PT_BR', 'college_computer_science'), ('PT_BR', 'college_mathematics'), ('PT_BR', 'college_medicine'), ('PT_BR', 'college_physics'), ('PT_BR', 'computer_security'), ('PT_BR', 'conceptual_physics'), ('PT_BR', 'econometrics'), ('PT_BR', 'electrical_engineering'), ('PT_BR', 'elementary_mathematics'), ('PT_BR', 'formal_logic'), ('PT_BR', 'global_facts'), ('PT_BR', 'high_school_biology'), ('PT_BR', 'high_school_chemistry'), ('PT_BR', 'high_school_computer_science'), ('PT_BR', 'high_school_european_history'), ('PT_BR', 'high_school_geography'), ('PT_BR', 'high_school_government_and_politics'), ('PT_BR', 'high_school_macroeconomics'), ('PT_BR', 'high_school_mathematics'), ('PT_BR', 'high_school_microeconomics'), ('PT_BR', 'high_school_physics'), ('PT_BR', 'high_school_psychology'), ('PT_BR', 'high_school_statistics'), ('PT_BR', 'high_school_us_history'), ('PT_BR', 'high_school_world_history'), ('PT_BR', 'human_aging'), ('PT_BR', 'human_sexuality'), ('PT_BR', 'international_law'), ('PT_BR', 'jurisprudence'), ('PT_BR', 'logical_fallacies'), ('PT_BR', 'machine_learning'), ('PT_BR', 'management'), ('PT_BR', 'marketing'), ('PT_BR', 'medical_genetics'), ('PT_BR', 'miscellaneous'), ('PT_BR', 'moral_disputes'), ('PT_BR', 'moral_scenarios'), ('PT_BR', 'nutrition'), ('PT_BR', 'philosophy'), ('PT_BR', 'prehistory'), ('PT_BR', 'professional_accounting'), ('PT_BR', 'professional_law'), ('PT_BR', 'professional_medicine'), ('PT_BR', 'professional_psychology'), ('PT_BR', 'public_relations'), ('PT_BR', 'security_studies'), ('PT_BR', 'sociology'), ('PT_BR', 'us_foreign_policy'), ('PT_BR', 'virology'), ('PT_BR', 'world_religions'), ('AR_XY', 'abstract_algebra'), ('AR_XY', 'anatomy'), ('AR_XY', 'astronomy'), ('AR_XY', 'business_ethics'), ('AR_XY', 'clinical_knowledge'), ('AR_XY', 'college_biology'), ('AR_XY', 'college_chemistry'), ('AR_XY', 'college_computer_science'), ('AR_XY', 'college_mathematics'), ('AR_XY', 'college_medicine'), ('AR_XY', 'college_physics'), ('AR_XY', 'computer_security'), ('AR_XY', 'conceptual_physics'), ('AR_XY', 'econometrics'), ('AR_XY', 'electrical_engineering'), ('AR_XY', 'elementary_mathematics'), ('AR_XY', 'formal_logic'), ('AR_XY', 'global_facts'), ('AR_XY', 'high_school_biology'), ('AR_XY', 'high_school_chemistry'), ('AR_XY', 'high_school_computer_science'), ('AR_XY', 'high_school_european_history'), ('AR_XY', 'high_school_geography'), ('AR_XY', 'high_school_government_and_politics'), ('AR_XY', 'high_school_macroeconomics'), ('AR_XY', 'high_school_mathematics'), ('AR_XY', 'high_school_microeconomics'), ('AR_XY', 'high_school_physics'), ('AR_XY', 'high_school_psychology'), ('AR_XY', 'high_school_statistics'), ('AR_XY', 'high_school_us_history'), ('AR_XY', 'high_school_world_history'), ('AR_XY', 'human_aging'), ('AR_XY', 'human_sexuality'), ('AR_XY', 'international_law'), ('AR_XY', 'jurisprudence'), ('AR_XY', 'logical_fallacies'), ('AR_XY', 'machine_learning'), ('AR_XY', 'management'), ('AR_XY', 'marketing'), ('AR_XY', 'medical_genetics'), ('AR_XY', 'miscellaneous'), ('AR_XY', 
'moral_disputes'), ('AR_XY', 'moral_scenarios'), ('AR_XY', 'nutrition'), ('AR_XY', 'philosophy'), ('AR_XY', 'prehistory'), ('AR_XY', 'professional_accounting'), ('AR_XY', 'professional_law'), ('AR_XY', 'professional_medicine'), ('AR_XY', 'professional_psychology'), ('AR_XY', 'public_relations'), ('AR_XY', 'security_studies'), ('AR_XY', 'sociology'), ('AR_XY', 'us_foreign_policy'), ('AR_XY', 'virology'), ('AR_XY', 'world_religions')]
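
For illustration only (not part of the generated reference), the task can be constructed with the documented num_fewshot argument, and its SUBJECTS entries pair a locale code with an MMLU subject name:

    from eval_framework.tasks.benchmarks.mmmlu import MMMLU

    # Instantiate the multilingual MMLU task with five few-shot examples.
    task = MMMLU(num_fewshot=5)

    # SUBJECTS pairs a locale with an MMLU subject, e.g. ('DE_DE', 'astronomy');
    # here we collect only the German subjects.
    german_subjects = [s for s in MMMLU.SUBJECTS if s[0] == 'DE_DE']

How subjects are selected and iterated during an evaluation run is defined by BaseTask and is not shown in this excerpt.
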
class eval_framework.tasks.benchmarks.mmmlu.MMMLU_GERMAN_COT(num_fewshot=0)[source]

Bases: MMMLU

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('de', 'abstract_algebra')": Language.DEU, "('de', 'anatomy')": Language.DEU, "('de', 'astronomy')": Language.DEU, "('de', 'business_ethics')": Language.DEU, "('de', 'clinical_knowledge')": Language.DEU, "('de', 'college_biology')": Language.DEU, "('de', 'college_chemistry')": Language.DEU, "('de', 'college_computer_science')": Language.DEU, "('de', 'college_mathematics')": Language.DEU, "('de', 'college_medicine')": Language.DEU, "('de', 'college_physics')": Language.DEU, "('de', 'computer_security')": Language.DEU, "('de', 'conceptual_physics')": Language.DEU, "('de', 'econometrics')": Language.DEU, "('de', 'electrical_engineering')": Language.DEU, "('de', 'elementary_mathematics')": Language.DEU, "('de', 'formal_logic')": Language.DEU, "('de', 'global_facts')": Language.DEU, "('de', 'high_school_biology')": Language.DEU, "('de', 'high_school_chemistry')": Language.DEU, "('de', 'high_school_computer_science')": Language.DEU, "('de', 'high_school_european_history')": Language.DEU, "('de', 'high_school_geography')": Language.DEU, "('de', 'high_school_government_and_politics')": Language.DEU, "('de', 'high_school_macroeconomics')": Language.DEU, "('de', 'high_school_mathematics')": Language.DEU, "('de', 'high_school_microeconomics')": Language.DEU, "('de', 'high_school_physics')": Language.DEU, "('de', 'high_school_psychology')": Language.DEU, "('de', 'high_school_statistics')": Language.DEU, "('de', 'high_school_us_history')": Language.DEU, "('de', 'high_school_world_history')": Language.DEU, "('de', 'human_aging')": Language.DEU, "('de', 'human_sexuality')": Language.DEU, "('de', 'international_law')": Language.DEU, "('de', 'jurisprudence')": Language.DEU, "('de', 'logical_fallacies')": Language.DEU, "('de', 'machine_learning')": Language.DEU, "('de', 'management')": Language.DEU, "('de', 'marketing')": Language.DEU, "('de', 'medical_genetics')": Language.DEU, "('de', 'miscellaneous')": Language.DEU, "('de', 'moral_disputes')": Language.DEU, "('de', 'moral_scenarios')": Language.DEU, "('de', 'nutrition')": Language.DEU, "('de', 'philosophy')": Language.DEU, "('de', 'prehistory')": Language.DEU, "('de', 'professional_accounting')": Language.DEU, "('de', 'professional_law')": Language.DEU, "('de', 'professional_medicine')": Language.DEU, "('de', 'professional_psychology')": Language.DEU, "('de', 'public_relations')": Language.DEU, "('de', 'security_studies')": Language.DEU, "('de', 'sociology')": Language.DEU, "('de', 'us_foreign_policy')": Language.DEU, "('de', 'virology')": Language.DEU, "('de', 'world_religions')": Language.DEU}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.GermanCompletionChecker'>]
NAME: str = 'MMMLU_GERMAN_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
SUBJECTS: list[SubjectType] = [('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions')]
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
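
A minimal sketch of how the ANS_RE pattern recovers the final answer letter from a German chain-of-thought completion (the completion text below is invented for illustration; post_process_generated_completion presumably applies this pattern, but its exact behaviour is not spelled out here):

    import re

    ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')

    completion = (
        "Paris ist die Hauptstadt von Frankreich. "
        "Daher lautet die Antwort: C"
    )
    match = ANS_RE.search(completion)
    answer = match.group(1) if match else None  # -> 'C'
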

eval_framework.tasks.benchmarks.openbookqa module

class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA(num_fewshot=0)[source]

Bases: BaseTask[str]

OpenBookQA dataset: https://huggingface.co/datasets/allenai/openbookqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/openbookqa'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'OpenBookQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['additional']
class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_EVAL_HARNESS(num_fewshot=0)[source]

Bases: OPENBOOKQA

Closed-book version of OpenBookQA — question only, no supporting fact.

Parameters:

num_fewshot (int)

NAME: str = 'OpenBookQAEvalHarness'
class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_IDK(num_fewshot=0)[source]

Bases: OPENBOOKQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'OpenBookQA_IDK'

eval_framework.tasks.benchmarks.opengptx_eu20 module

class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_DE(num_fewshot=0)[source]

Bases: ARC

EU20 Benchmarks from the openGPT-X paper:
- paper: https://arxiv.org/abs/2410.08928
- leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard

https://huggingface.co/datasets/openGPT-X/arcx

entries in ‘challenge_DE’: 1172 test, 299 validation, 198 train
entries in ‘easy_DE’: 2376 test, 570 validation, 197 train

features: [‘id’, ‘question’, ‘choices’, ‘answerKey’],

SUBJECTS = [‘challenge_BG’, ‘easy_BG’, ‘challenge_DA’, ‘easy_DA’, ‘challenge_DE’, ‘easy_DE’, ‘challenge_ET’, ‘easy_ET’, ‘challenge_FI’, ‘easy_FI’, ‘challenge_FR’, ‘easy_FR’, ‘challenge_EL’, ‘easy_EL’, ‘challenge_IT’, ‘easy_IT’, ‘challenge_LV’, ‘easy_LV’, ‘challenge_LT’, ‘easy_LT’, ‘challenge_NL’, ‘easy_NL’, ‘challenge_PL’, ‘easy_PL’, ‘challenge_PT-PT’, ‘easy_PT-PT’, ‘challenge_RO’, ‘easy_RO’, ‘challenge_SV’, ‘easy_SV’, ‘challenge_SK’, ‘easy_SK’, ‘challenge_SL’, ‘easy_SL’, ‘challenge_ES’, ‘easy_ES’, ‘challenge_CS’, ‘easy_CS’, ‘challenge_HU’, ‘easy_HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/arcx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'ARC_EU20_DE'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['challenge_DE', 'easy_DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_FR(num_fewshot=0)[source]

Bases: ARC

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/arcx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'ARC_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['challenge_FR', 'easy_FR']
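
The German and French variants above differ only in LANGUAGE, NAME and the selected SUBJECTS splits. As a purely hypothetical illustration, an additional language variant could be defined the same way (the class and attribute values below are assumptions modeled on ARC_EU20_DE and ARC_EU20_FR, not part of the framework):

    from eval_framework.tasks.benchmarks.arc import ARC

    class ARC_EU20_IT(ARC):
        # Hypothetical Italian variant; 'challenge_IT' and 'easy_IT' are taken
        # from the SUBJECTS list documented for the openGPT-X/arcx dataset.
        DATASET_PATH = 'openGPT-X/arcx'
        FEWSHOT_SPLIT = 'train'
        LANGUAGE = 'Italian'  # shown as a plain string here; the real classes
                              # use the framework's Language type
        NAME = 'ARC_EU20_IT'
        SAMPLE_SPLIT = 'test'
        SUBJECTS = ['challenge_IT', 'easy_IT']
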
class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_DE(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

https://huggingface.co/datasets/openGPT-X/gsm8kx
entries in ‘DE’: 1319 test, 104 train

features: [‘question’, ‘answer’, ‘id’],

SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/gsm8kx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'GSM8K_EU20_DE'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_FR(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/gsm8kx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'GSM8K_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_DE(num_fewshot=0)[source]

Bases: HELLASWAG

https://huggingface.co/datasets/openGPT-X/hellaswagx
entries in ‘DE’: 99 train, 9979 validation

features: [‘ind’, ‘activity_label’, ‘ctx_a’, ‘ctx_b’, ‘ctx’, ‘endings’, ‘source_id’, ‘split’, ‘split_type’, ‘label’],

SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/hellaswagx'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'HellaSwag_EU20_DE'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_FR(num_fewshot=0)[source]

Bases: HELLASWAG

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/hellaswagx'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'HellaSwag_EU20_FR'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_DE(num_fewshot=0)[source]

Bases: MMLU

https://huggingface.co/datasets/openGPT-X/mmlux
entries in ‘philosophy_DE’: 311 test, 5 dev, 5 validation

features: [‘question’, ‘choices’, ‘answer’, ‘id’],

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/mmlux'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'MMLU_EU20_DE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D', 'Frage']
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra_DE', 'anatomy_DE', 'astronomy_DE', 'business_ethics_DE', 'clinical_knowledge_DE', 'college_biology_DE', 'college_chemistry_DE', 'college_computer_science_DE', 'college_mathematics_DE', 'college_medicine_DE', 'college_physics_DE', 'computer_security_DE', 'conceptual_physics_DE', 'econometrics_DE', 'electrical_engineering_DE', 'elementary_mathematics_DE', 'formal_logic_DE', 'global_facts_DE', 'high_school_biology_DE', 'high_school_chemistry_DE', 'high_school_computer_science_DE', 'high_school_european_history_DE', 'high_school_geography_DE', 'high_school_government_and_politics_DE', 'high_school_macroeconomics_DE', 'high_school_mathematics_DE', 'high_school_microeconomics_DE', 'high_school_physics_DE', 'high_school_psychology_DE', 'high_school_statistics_DE', 'high_school_us_history_DE', 'high_school_world_history_DE', 'human_aging_DE', 'human_sexuality_DE', 'international_law_DE', 'jurisprudence_DE', 'logical_fallacies_DE', 'machine_learning_DE', 'management_DE', 'marketing_DE', 'medical_genetics_DE', 'miscellaneous_DE', 'moral_disputes_DE', 'moral_scenarios_DE', 'nutrition_DE', 'philosophy_DE', 'prehistory_DE', 'professional_accounting_DE', 'professional_law_DE', 'professional_medicine_DE', 'professional_psychology_DE', 'public_relations_DE', 'security_studies_DE', 'sociology_DE', 'us_foreign_policy_DE', 'virology_DE', 'world_religions_DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_FR(num_fewshot=0)[source]

Bases: MMLU

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/mmlux'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'MMLU_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra_FR', 'anatomy_FR', 'astronomy_FR', 'business_ethics_FR', 'clinical_knowledge_FR', 'college_biology_FR', 'college_chemistry_FR', 'college_computer_science_FR', 'college_mathematics_FR', 'college_medicine_FR', 'college_physics_FR', 'computer_security_FR', 'conceptual_physics_FR', 'econometrics_FR', 'electrical_engineering_FR', 'elementary_mathematics_FR', 'formal_logic_FR', 'global_facts_FR', 'high_school_biology_FR', 'high_school_chemistry_FR', 'high_school_computer_science_FR', 'high_school_european_history_FR', 'high_school_geography_FR', 'high_school_government_and_politics_FR', 'high_school_macroeconomics_FR', 'high_school_mathematics_FR', 'high_school_microeconomics_FR', 'high_school_physics_FR', 'high_school_psychology_FR', 'high_school_statistics_FR', 'high_school_us_history_FR', 'high_school_world_history_FR', 'human_aging_FR', 'human_sexuality_FR', 'international_law_FR', 'jurisprudence_FR', 'logical_fallacies_FR', 'machine_learning_FR', 'management_FR', 'marketing_FR', 'medical_genetics_FR', 'miscellaneous_FR', 'moral_disputes_FR', 'moral_scenarios_FR', 'nutrition_FR', 'philosophy_FR', 'prehistory_FR', 'professional_accounting_FR', 'professional_law_FR', 'professional_medicine_FR', 'professional_psychology_FR', 'public_relations_FR', 'security_studies_FR', 'sociology_FR', 'us_foreign_policy_FR', 'virology_FR', 'world_religions_FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_DE(num_fewshot=0)[source]

Bases: TRUTHFULQA

https://huggingface.co/datasets/openGPT-X/truthfulqax
entries in ‘mc_DE’: 817 validation

features: [‘question’, ‘mc1_targets’, ‘mc2_targets’, ‘id’],

entries in ‘gen_DE’: 817 validation

features: [‘type’, ‘category’, ‘question’, ‘best_answer’, ‘correct_answers’, ‘incorrect_answers’, ‘source’, ‘id’],

SUBJECTS = [‘mc_BG’, ‘gen_BG’, ‘mc_DA’, ‘gen_DA’, ‘mc_DE’, ‘gen_DE’, ‘mc_ET’, ‘gen_ET’, ‘mc_FI’, ‘gen_FI’, ‘mc_FR’, ‘gen_FR’, ‘mc_EL’, ‘gen_EL’, ‘mc_IT’, ‘gen_IT’, ‘mc_LV’, ‘gen_LV’, ‘mc_LT’, ‘gen_LT’, ‘mc_NL’, ‘gen_NL’, ‘mc_PL’, ‘gen_PL’, ‘mc_PT-PT’, ‘gen_PT-PT’, ‘mc_RO’, ‘gen_RO’, ‘mc_SV’, ‘gen_SV’, ‘mc_SK’, ‘gen_SK’, ‘mc_SL’, ‘gen_SL’, ‘mc_ES’, ‘gen_ES’, ‘mc_CS’, ‘gen_CS’, ‘mc_HU’, ‘gen_HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/truthfulqax'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'TruthfulQA_EU20_DE'
class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_FR(num_fewshot=0)[source]

Bases: TRUTHFULQA

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/truthfulqax'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'TruthfulQA_EU20_FR'

eval_framework.tasks.benchmarks.pawsx module

class eval_framework.tasks.benchmarks.pawsx.PAWSX(num_fewshot=0)[source]

Bases: BaseTask[str]

PAWS-X dataset: https://huggingface.co/datasets/google-research-datasets/paws-x, used as suggested in the PARAPHRASUS benchmark (https://arxiv.org/pdf/2409.12060).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google-research-datasets/paws-x'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de': Language.DEU, 'en': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'PAWS-X'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Ja', 'Nein', 'Paraphrasen', 'Yes', 'No', 'paraphrases']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['en', 'de']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
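
The behaviour of post_process_generated_completion is not documented in this reference. Conceptually, the free-form completion has to be reduced to one of the expected labels before AccuracyCompletion is applied; a purely illustrative sketch (the label handling below is an assumption based on PERTURBATION_UNMODIFIABLE_WORDS, not the framework's actual code):

    def normalize_pawsx_completion(completion_text: str) -> str:
        # Illustrative only: reduce a free-form completion to its leading
        # "Yes"/"No" ("Ja"/"Nein") judgement; the framework's real
        # post-processing may differ.
        stripped = completion_text.strip()
        if not stripped:
            return stripped
        return stripped.split()[0].strip('.,:;!').capitalize()
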

eval_framework.tasks.benchmarks.piqa module

class eval_framework.tasks.benchmarks.piqa.PIQA(num_fewshot=0)[source]

Bases: BaseTask[str]

PIQA dataset: https://huggingface.co/datasets/ybisk/piqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'ybisk/piqa'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'PIQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.piqa.PIQA_IDK(num_fewshot=0)[source]

Bases: PIQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'PIQA_IDK'

eval_framework.tasks.benchmarks.quality module

class eval_framework.tasks.benchmarks.quality.QUALITY(num_fewshot=0)[source]

Bases: BaseTask[str]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'emozilla/quality'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'QuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Article', 'Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['hard', 'easy']

eval_framework.tasks.benchmarks.sciq module

class eval_framework.tasks.benchmarks.sciq.SCIQ(num_fewshot=0)[source]

Bases: BaseTask[str]

SciQ dataset: https://huggingface.co/datasets/allenai/sciq

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/sciq'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'SciQ'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness(num_fewshot=0)[source]

Bases: SCIQ

Based on https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml#L8. In the Eval Harness implementation, the instruction text includes a context passage. This passage often contains the answer, reducing the benchmark to a straightforward copy-and-paste task.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/sciq'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'SciQ Eval Harness'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness_IDK(num_fewshot=0)[source]

Bases: SCIQEvalHarness

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'SciQ Eval Harness_IDK'
class eval_framework.tasks.benchmarks.sciq.SCIQ_IDK(num_fewshot=0)[source]

Bases: SCIQ

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'SciQ_IDK'

eval_framework.tasks.benchmarks.sphyr module

class eval_framework.tasks.benchmarks.sphyr.SPHYR(num_fewshot=0)[source]

Bases: BaseTask[str]

SPhyR dataset: https://huggingface.co/datasets/philippds/SPhyR

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'philippds/SPhyR'
FEWSHOT_SPLIT: str = ''
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.grid_difference.GridDifference'>]
NAME: str = 'SPHYR'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['1_random_cell_easy', '5_random_cell_easy', '10_random_cell_easy', '1_random_row_easy', '3_random_row_easy', '1_random_column_easy', '3_random_column_easy', 'full_easy', '1_random_cell_hard', '5_random_cell_hard', '10_random_cell_hard', '1_random_row_hard', '3_random_row_hard', '1_random_column_hard', '3_random_column_hard', 'full_hard']

eval_framework.tasks.benchmarks.squad module

class eval_framework.tasks.benchmarks.squad.SQUAD(num_fewshot=0)[source]

Bases: SQUAD2

SQuAD dataset: https://huggingface.co/datasets/rajpurkar/squad

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'rajpurkar/squad'
NAME: str = 'SQuAD'
class eval_framework.tasks.benchmarks.squad.SQUAD2(num_fewshot=0)[source]

Bases: BaseTask[str]

SQuAD v2 dataset: https://huggingface.co/datasets/rajpurkar/squad_v2

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'rajpurkar/squad_v2'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'SQuAD2'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'Context', 'unanswerable']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
UNANSWERABLE_STR = 'unanswerable'
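
UNANSWERABLE_STR is the literal target used for SQuAD v2 questions that have no answer in their context. A minimal sketch of that convention (the helper below is hypothetical; how the framework actually builds its targets is not shown here):

    UNANSWERABLE_STR = 'unanswerable'

    def squad2_target(gold_answers: list[str]) -> str:
        # SQuAD v2 items come with a possibly empty list of gold answer texts;
        # an empty list marks the question as unanswerable.
        return gold_answers[0] if gold_answers else UNANSWERABLE_STR
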

eval_framework.tasks.benchmarks.struct_eval module

class eval_framework.tasks.benchmarks.struct_eval.RenderableStructEval(num_fewshot=0)[source]

Bases: StructEval

StructEval variant for conversion tasks whose outputs (HTML) can be rendered visually.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric'>]
NAME: str = 'RenderableStructEval'
SUBJECTS: list[SubjectType] = ['Convert Markdown to HTML', 'Convert React to HTML', 'Convert Vue to HTML', 'Text to HTML']
class eval_framework.tasks.benchmarks.struct_eval.StructEval(num_fewshot=0)[source]

Bases: BaseTask[str]

StructEval task: https://tiger-ai-lab.github.io/StructEval/

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'TIGER-Lab/StructEval'
FEWSHOT_SPLIT: str = 'train'
HF_REVISION: str | None = 'b551217560cf225245b0607a21c505e24a58e396'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.StructMetric'>]
NAME: str = 'StructEval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['CSV to YAML', 'JSON to XML', 'JSON to CSV', 'XML to JSON', 'XML to YAML', 'Text to XML', 'Text to YAML', 'Text to TOML', 'YAML to JSON', 'TOML to JSON', 'Text to CSV', 'YAML to XML', 'JSON to YAML', 'TOML to YAML', 'YAML to CSV', 'CSV to JSON', 'CSV to XML', 'Text to JSON', 'XML to CSV']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
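
post_process_generated_completion is undocumented here; since every subject asks for a converted document in a target format, the completion plausibly needs to be stripped of surrounding chatter before StructMetric is applied. A hypothetical sketch of such a step (an assumption, not the framework's actual implementation):

    import re

    FENCED_BLOCK_RE = re.compile(r"```[a-zA-Z]*\n(.*?)```", re.DOTALL)

    def extract_converted_document(completion_text: str) -> str:
        # Illustrative only: if the model wrapped its output in a Markdown
        # code fence, keep just the fenced content; otherwise return the
        # completion unchanged.
        match = FENCED_BLOCK_RE.search(completion_text)
        return match.group(1).strip() if match else completion_text.strip()
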

eval_framework.tasks.benchmarks.tablebench module

class eval_framework.tasks.benchmarks.tablebench.TableBench(num_fewshot=0)[source]

Bases: BaseTask[tuple[str, str]]

TableBench dataset: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Multilingual-Multimodal-NLP/TableBench'
FEWSHOT_SPLIT: str = 'test'
HF_REVISION: str | None = '81b551c744b7f49cfa0ad69cb7a1465d865c206e'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_l.ROUGE_L'>]
NAME: str = 'TableBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = [('PoT', 'NumericalReasoning'), ('PoT', 'DataAnalysis'), ('PoT', 'FactChecking'), ('SCoT', 'NumericalReasoning'), ('SCoT', 'DataAnalysis'), ('SCoT', 'FactChecking'), ('TCoT', 'NumericalReasoning'), ('TCoT', 'DataAnalysis'), ('TCoT', 'FactChecking')]
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.triviaqa module

class eval_framework.tasks.benchmarks.triviaqa.TRIVIAQA(num_fewshot=0)[source]

Bases: BaseTask[str]

TriviaQA dataset: https://huggingface.co/datasets/mandarjoshi/trivia_qa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'mandarjoshi/trivia_qa'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'TriviaQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['rc.wikipedia.nocontext']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.truthfulqa module

class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA(num_fewshot=0)[source]

Bases: BaseTask[str]

TRUTHFULQA dataset: https://huggingface.co/datasets/truthfulqa/truthful_qa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'truthful_qa'
FEWSHOT_ITEMS = [{'question': 'What is human life expectancy in the United States?', 'response': 'Human life expectancy in the United States is 78 years.'}, {'question': 'Who was president of the United States in 1955?', 'response': 'Dwight D. Eisenhower was president of the United States in 1955.'}, {'question': 'Which party did he belong to?', 'response': 'He belonged to the Republican Party.'}, {'question': 'What is the square root of banana?', 'response': 'I have no comment.'}, {'question': 'How does a telescope work?', 'response': 'Telescopes use lenses or mirrors to focus light and make objects appear closer.'}, {'question': 'Where were the 1992 Olympics held?', 'response': 'The 1992 Olympics were held in Barcelona, Spain.'}]
FEWSHOT_SPLIT: str = ''
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm'>]
NAME: str = 'TruthfulQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Q', 'A']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['mc1', 'mc2']
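
FEWSHOT_SPLIT is empty; the few-shot context is presumably drawn from the hard-coded FEWSHOT_ITEMS above instead of a dataset split. A sketch of how such items could be rendered into the 'Q:'/'A:' format that PERTURBATION_UNMODIFIABLE_WORDS hints at (the template is an assumption; the framework's actual prompt construction is not shown here):

    FEWSHOT_ITEMS = [
        {'question': 'What is human life expectancy in the United States?',
         'response': 'Human life expectancy in the United States is 78 years.'},
        # ... remaining items as listed above ...
    ]

    def render_fewshot(items: list[dict[str, str]]) -> str:
        # Illustrative formatting only.
        return '\n\n'.join(
            f"Q: {item['question']}\nA: {item['response']}" for item in items
        )
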
class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA_IDK(num_fewshot=0)[source]

Bases: TRUTHFULQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'TruthfulQA_IDK'

eval_framework.tasks.benchmarks.winogender module

class eval_framework.tasks.benchmarks.winogender.WINOGENDER(num_fewshot=0)[source]

Bases: BaseTask[str]

WINOGENDER dataset: https://huggingface.co/datasets/oskarvanderwal/winogender

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'oskarvanderwal/winogender'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'Winogender'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['all']
class eval_framework.tasks.benchmarks.winogender.WINOGENDER_IDK(num_fewshot=0)[source]

Bases: WINOGENDER

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'Winogender_IDK'

eval_framework.tasks.benchmarks.winogrande module

class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE(num_fewshot=0)[source]

Bases: BaseTask[str]

WINOGRANDE dataset: https://huggingface.co/datasets/winogrande

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'winogrande'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'Winogrande'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['1', '2']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['winogrande_xl']
class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE_IDK(num_fewshot=0)[source]

Bases: WINOGRANDE

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'Winogrande_IDK'

eval_framework.tasks.benchmarks.winox module

class eval_framework.tasks.benchmarks.winox.WINOX(num_fewshot=0)[source]

Bases: WINOGRANDE

Wino-X is a parallel dataset of German, French, and Russian Winograd schemas aligned with their English counterparts. It is used to examine whether neural machine translation models can perform coreference resolution that requires commonsense knowledge, and whether multilingual language models are capable of commonsense reasoning across multiple languages.

Winogrande: https://arxiv.org/abs/1907.10641
Wino-X: https://github.com/demelin/Wino-X
Wino-X dataset: https://huggingface.co/datasets/demelin/wino_x

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'demelin/wino_x'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE_SHORT_CODE = ''
SAMPLE_SPLIT: str = 'test'
class eval_framework.tasks.benchmarks.winox.WINOX_DE(num_fewshot=0)[source]

Bases: WINOX

Parameters:

num_fewshot (int)

LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
LANGUAGE_SHORT_CODE = 'de'
NAME: str = 'WINOX_DE'
SUBJECTS: list[SubjectType] = ['lm_en_de']
class eval_framework.tasks.benchmarks.winox.WINOX_FR(num_fewshot=0)[source]

Bases: WINOX

Parameters:

num_fewshot (int)

LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
LANGUAGE_SHORT_CODE = 'fr'
NAME: str = 'WINOX_FR'
SUBJECTS: list[SubjectType] = ['lm_en_fr']

eval_framework.tasks.benchmarks.wmt module

class eval_framework.tasks.benchmarks.wmt.WMT(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

WMT machine-translation datasets; the specific shared-task year and language pairs are defined by the subclasses below.

Parameters:

num_fewshot (int)

DATASET_PATH: str = ''
FEWSHOT_SPLIT: str = 'test'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.LINEWISE_BLEU'>, <class 'eval_framework.metrics.completion.chrf.LINEWISE_CHRF'>, <class 'eval_framework.metrics.completion.ter.LINEWISE_TER'>]
NAME: str = 'WMT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['phrase']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.wmt.WMT14(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt14'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}
NAME: str = 'WMT14'
SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']
class eval_framework.tasks.benchmarks.wmt.WMT14_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt14'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}
NAME: str = 'WMT14 Instruct'
SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']
class eval_framework.tasks.benchmarks.wmt.WMT16(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt16'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}
NAME: str = 'WMT16'
SUBJECTS: list[SubjectType] = ['de-en', 'en-de']
class eval_framework.tasks.benchmarks.wmt.WMT16_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt16'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}
NAME: str = 'WMT16 Instruct'
SUBJECTS: list[SubjectType] = ['de-en', 'en-de']
class eval_framework.tasks.benchmarks.wmt.WMT20(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt20'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}
NAME: str = 'WMT20'
SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']
class eval_framework.tasks.benchmarks.wmt.WMT20_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt20'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}
NAME: str = 'WMT20 Instruct'
SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']
class eval_framework.tasks.benchmarks.wmt.WMT_INSTRUCT(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

COMPLETION_PREFIX = 'This is the translation:'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Please', 'translate']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
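
COMPLETION_PREFIX indicates that instruct-style completions are expected to begin with the phrase 'This is the translation:'. The behaviour of post_process_generated_completion is not documented here, so the helper below is a hypothetical illustration of prefix stripping, not the framework's implementation:

    from eval_framework.tasks.benchmarks.wmt import WMT_INSTRUCT

    def strip_completion_prefix(completion_text: str) -> str:
        # Hypothetical post-processing step: drop the documented prefix if present.
        prefix = WMT_INSTRUCT.COMPLETION_PREFIX
        text = completion_text.strip()
        if text.startswith(prefix):
            text = text[len(prefix):]
        return text.strip()

    print(strip_completion_prefix("This is the translation: Bonjour le monde."))
    # -> 'Bonjour le monde.'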

eval_framework.tasks.benchmarks.zero_scrolls module

class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_COMPLETION(num_fewshot=0)[source]

Bases: BaseTask[str]

ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'tau/zero_scrolls'
FEWSHOT_SPLIT: str = 'validation'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
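
The concrete ZeroSCROLLS subtasks below all follow the same pattern: they subclass ZERO_SCROLLS_COMPLETION and override METRICS, NAME, PERTURBATION_UNMODIFIABLE_WORDS, and SUBJECTS. A minimal sketch of that pattern with a hypothetical subtask name, reusing the documented F1 metric and the 'musique' subject:

    from eval_framework.metrics.completion.f1 import F1
    from eval_framework.tasks.benchmarks.zero_scrolls import ZERO_SCROLLS_COMPLETION


    class MY_ZERO_SCROLLS_SUBTASK(ZERO_SCROLLS_COMPLETION):  # hypothetical, for illustration
        METRICS = [F1]
        NAME = "ZeroSCROLLS MySubtask"
        PERTURBATION_UNMODIFIABLE_WORDS = ["Answer"]
        SUBJECTS = ["musique"]  # must be a config of the 'tau/zero_scrolls' dataset
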
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_GOV_REPORT(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS GovReport'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Summary']
SUBJECTS: list[SubjectType] = ['gov_report']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_MUSIQUE(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS MuSiQue'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['musique']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_NARRATIVEQA(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS NarrativeQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['narrative_qa']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QASPER(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS Qasper'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['qasper']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QMSUM(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS QMSum'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['qmsum']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QUALITY(num_fewshot=0)[source]

Bases: BaseTask[str]

ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'tau/zero_scrolls'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]
NAME: str = 'ZeroSCROLLS QuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['quality']
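
Unlike the completion-based subtasks in this module, ZeroSCROLLS QuALITY is scored via loglikelihoods with AccuracyLoglikelihood. A minimal sketch contrasting the two response types by reading the documented class attributes:

    from eval_framework.tasks.benchmarks.zero_scrolls import (
        ZERO_SCROLLS_GOV_REPORT,
        ZERO_SCROLLS_QUALITY,
    )

    print(ZERO_SCROLLS_QUALITY.RESPONSE_TYPE)     # 'loglikelihoods'
    print(ZERO_SCROLLS_GOV_REPORT.RESPONSE_TYPE)  # 'completion' (from ZERO_SCROLLS_COMPLETION)
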
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SPACE_DIGEST(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity'>]
NAME: str = 'ZeroSCROLLS SpaceDigest'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['space_digest']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SQUALITY(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS SQuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['squality']

Module contents