eval_framework.tasks.benchmarks package¶
Submodules¶
eval_framework.tasks.benchmarks.aidanbench module¶
- class eval_framework.tasks.benchmarks.aidanbench.AidanBench(num_fewshot=0)[source]¶
Bases: AidanBenchOriginal
- Parameters:
num_fewshot (int)
- class eval_framework.tasks.benchmarks.aidanbench.AidanBenchOriginal(num_fewshot=0)[source]¶
Bases: BaseTask[str]
AidanBench (https://openreview.net/pdf?id=fz969ahcvJ).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Aleph-Alpha-Research/aidanbench'¶
- FEWSHOT_SPLIT: str = 'train'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.aidanbench.AidanBenchMetric'>]¶
- NAME: str = 'AidanBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- generate_completions(llm, samples, stop_sequences=None, max_tokens=None)[source]¶
Generates completions for the given samples.
- Parameters:
samples – samples to generate completions for
stop_sequences (list[str] | None) – stop sequences to use in completion generation
max_tokens (int | None) – maximum tokens to use in completion generation
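A minimal usage sketch of the completion flow above, assuming an llm wrapper and a samples list compatible with the framework's completion interface (both are placeholders, not defined in this module):

    from eval_framework.tasks.benchmarks.aidanbench import AidanBench

    # Instantiate the task; num_fewshot defaults to 0.
    task = AidanBench(num_fewshot=0)

    # `llm` and `samples` are assumed inputs: a model wrapper and samples
    # drawn from the task's sample split ('train').
    # completions = task.generate_completions(
    #     llm, samples, stop_sequences=None, max_tokens=None
    # )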
eval_framework.tasks.benchmarks.arc module¶
- class eval_framework.tasks.benchmarks.arc.ARC(num_fewshot=0)[source]¶
Bases: BaseTask[str]
ARC dataset: https://huggingface.co/datasets/allenai/ai2_arc
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/ai2_arc'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'ARC'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['ARC-Easy', 'ARC-Challenge']¶
- class eval_framework.tasks.benchmarks.arc.ARC_IDK(num_fewshot=0)[source]¶
Bases: ARC
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'ARC_IDK'¶
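As a quick orientation to the class-attribute conventions repeated throughout this package, a sketch that only reads the attributes documented above (no assumptions beyond them):

    from eval_framework.tasks.benchmarks.arc import ARC, ARC_IDK

    task = ARC(num_fewshot=5)  # few-shot examples come from FEWSHOT_SPLIT ('train')
    print(ARC.DATASET_PATH)    # 'allenai/ai2_arc'
    print(ARC.SUBJECTS)        # ['ARC-Easy', 'ARC-Challenge']
    print(ARC.SAMPLE_SPLIT)    # 'test'

    # ARC_IDK shares the dataset but swaps in abstention-aware metrics.
    print([m.__name__ for m in ARC_IDK.METRICS])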
eval_framework.tasks.benchmarks.arc_de module¶
- class eval_framework.tasks.benchmarks.arc_de.ARC_DE(num_fewshot=0)[source]¶
Bases: BaseTask[str]
ARC-DE dataset: https://huggingface.co/datasets/LeoLM/ArcChallenge_de
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/ArcChallenge_de'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ARC German'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.arc_fi module¶
- class eval_framework.tasks.benchmarks.arc_fi.ARC_FI(num_fewshot=0)[source]¶
Bases: BaseTask[str]
ARC-FI dataset: https://huggingface.co/datasets/LumiOpen/arc_challenge_mt
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LumiOpen/arc_challenge_mt'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ARC Finnish'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['fi']¶
eval_framework.tasks.benchmarks.balancedcopa module¶
- class eval_framework.tasks.benchmarks.balancedcopa.BalancedCOPA(num_fewshot=0)[source]¶
Bases: COPA
Balanced-COPA dataset: https://huggingface.co/datasets/pkavumba/balanced-copa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'pkavumba/balanced-copa'¶
- HF_REVISION: str | None = '813bd03cd6e07d9bd8d7333896ad5d40abb95ea9'¶
- NAME: str = 'BalancedCOPA'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- eval_framework.tasks.benchmarks.balancedcopa.split_dataset_by_id_ranges(dataset, id_column, ranges)[source]¶
Split a dataset into two based on whether the id column falls within given ranges.
- Parameters:
dataset (Dataset) – The dataset to split.
id_column (str) – The name of the column containing the id values.
ranges (list[tuple[int, int]]) – A list of (low, high) tuples defining inclusive ranges. Rows whose id is within any of these ranges go into the first split.
- Return type:
tuple[Dataset, Dataset]
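A small worked example of the helper above, using a toy in-memory datasets.Dataset (the column values are illustrative):

    from datasets import Dataset
    from eval_framework.tasks.benchmarks.balancedcopa import split_dataset_by_id_ranges

    ds = Dataset.from_dict({'id': [1, 5, 10, 42], 'text': ['a', 'b', 'c', 'd']})

    # Rows whose id falls in [1, 5] or [40, 50] (inclusive) form the first split.
    in_ranges, rest = split_dataset_by_id_ranges(ds, 'id', [(1, 5), (40, 50)])
    print(in_ranges['id'])  # [1, 5, 42]
    print(rest['id'])       # [10]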
eval_framework.tasks.benchmarks.belebele module¶
- class eval_framework.tasks.benchmarks.belebele.BELEBELE(num_fewshot=0)[source]¶
Bases: BaseTask[str]
BELEBELE dataset: https://huggingface.co/datasets/facebook/belebele
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'facebook/belebele'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'BELEBELE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['eng_Latn']¶
eval_framework.tasks.benchmarks.bigcodebench module¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBench(num_fewshot=0)[source]¶
Bases: BaseTask[str]
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'bigcode/bigcodebench'¶
- FEWSHOT_SPLIT: str = 'v0.1.4'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne'>]¶
- NAME: str = 'BigCodeBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'v0.1.4'¶
- SUBJECTS: list[SubjectType] = ['original', 'calibrated']¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHard(num_fewshot=0)[source]¶
Bases: BigCodeBench
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'bigcode/bigcodebench-hard'¶
- NAME: str = 'BigCodeBenchHard'¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHardInstruct(num_fewshot=0)[source]¶
Bases: BigCodeBenchHard
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard
- Parameters:
num_fewshot (int)
- NAME: str = 'BigCodeBenchHardInstruct'¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchInstruct(num_fewshot=0)[source]¶
Bases: BigCodeBench
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench
- Parameters:
num_fewshot (int)
- NAME: str = 'BigCodeBenchInstruct'¶
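The four variants above differ only in dataset path and prompt style; a sketch that checks this using just the documented attributes:

    from eval_framework.tasks.benchmarks.bigcodebench import (
        BigCodeBench,
        BigCodeBenchHard,
        BigCodeBenchInstruct,
    )

    # The Hard subclass only overrides the dataset; the pass@1 metric and
    # the 'v0.1.4' splits are inherited from BigCodeBench.
    assert BigCodeBenchHard.DATASET_PATH == 'bigcode/bigcodebench-hard'
    assert BigCodeBenchHard.METRICS == BigCodeBench.METRICS

    task = BigCodeBenchInstruct(num_fewshot=0)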
eval_framework.tasks.benchmarks.casehold module¶
- class eval_framework.tasks.benchmarks.casehold.CASEHOLD(num_fewshot=0)[source]¶
Bases: BaseTask[str]
CASEHOLD dataset: https://huggingface.co/datasets/coastalcph/lex_glue
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'coastalcph/lex_glue'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'CaseHold'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['case_hold']¶
eval_framework.tasks.benchmarks.chembench module¶
- class eval_framework.tasks.benchmarks.chembench.ChemBench(num_fewshot=0)[source]¶
Bases: BaseTask[str]
ChemBench dataset: https://huggingface.co/datasets/jablonkagroup/ChemBench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'jablonkagroup/ChemBench'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ChemBench'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['analytical_chemistry', 'chemical_preference', 'general_chemistry', 'inorganic_chemistry', 'materials_science', 'organic_chemistry', 'physical_chemistry', 'technical_chemistry', 'toxicity_and_safety']¶
eval_framework.tasks.benchmarks.copa module¶
- class eval_framework.tasks.benchmarks.copa.COPA(num_fewshot=0)[source]¶
Bases: COPAEvalHarness
Unlike the original COPA task, this version uses the test split for evaluation and the validation split for few-shot examples. Previously, the test split labels were unavailable in the original dataset, but they are now accessible, allowing this configuration.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'validation'¶
- NAME: str = 'COPA'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.copa.COPAEvalHarness(num_fewshot=0)[source]¶
Bases: BaseTask[str]
COPA dataset: https://huggingface.co/datasets/aps/super_glue
This version uses samples from the validation split as evaluation examples (same as lm-eval-harness).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'aps/super_glue'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'COPAEvalHarness'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['because', 'therefore']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['copa']¶
- class eval_framework.tasks.benchmarks.copa.COPA_IDK(num_fewshot=0)[source]¶
Bases: COPA_IDKEvalHarness
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'validation'¶
- NAME: str = 'COPA_IDK'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.copa.COPA_IDKEvalHarness(num_fewshot=0)[source]¶
Bases: COPAEvalHarness
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'COPA_IDKEvalHarness'¶
- class eval_framework.tasks.benchmarks.copa.COPA_OLMES(num_fewshot=0)[source]¶
Bases: COPAEvalHarness
COPA multiple choice (OLMES/oe_eval style): prompt shows premise + connector and options with space-prefixed labels (" A.", " B."); loglikelihood over " A"/" B".
- Parameters:
num_fewshot (int)
- NAME: str = 'COPA_OLMES'¶
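The split conventions that distinguish the two base variants can be checked directly against the attributes documented above:

    from eval_framework.tasks.benchmarks.copa import COPA, COPAEvalHarness

    # COPA evaluates on the test split (labels now available) with validation
    # fewshot; COPAEvalHarness mirrors lm-eval-harness and evaluates on validation.
    print(COPA.SAMPLE_SPLIT, COPA.FEWSHOT_SPLIT)                        # test validation
    print(COPAEvalHarness.SAMPLE_SPLIT, COPAEvalHarness.FEWSHOT_SPLIT)  # validation test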
eval_framework.tasks.benchmarks.csqa module¶
- class eval_framework.tasks.benchmarks.csqa.CommonsenseQACloze(num_fewshot=0)[source]¶
Bases: BaseTask[str]
CommonsenseQA dataset: https://huggingface.co/datasets/tau/commonsense_qa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'tau/commonsense_qa'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'CommonsenseQACloze'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.csqa.CommonsenseQAFullTextCloze(num_fewshot=0)[source]¶
Bases: CommonsenseQACloze
CommonsenseQA cloze with full answer text as ground truth (not just the letter). Scores loglikelihood over the full correct choice text; includes bits-per-byte.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'CommonsenseQAFullTextCloze'¶
- class eval_framework.tasks.benchmarks.csqa.CommonsenseQAMC(num_fewshot=0)[source]¶
Bases: CommonsenseQACloze
Multiple-choice variant of CommonsenseQA where the model selects a letter (A-E).
- Parameters:
num_fewshot (int)
- NAME: str = 'CommonsenseQAMC'¶
- class eval_framework.tasks.benchmarks.csqa.CommonsenseQAMC_OLMES(num_fewshot=0)[source]¶
Bases: CommonsenseQAMC
CommonsenseQA MC with OLMES-style prompt: space before each label in the prompt (" A.", " B.", …).
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'CommonsenseQAMC_OLMES'¶
- SAMPLE_SPLIT: str = 'train'¶
eval_framework.tasks.benchmarks.drop module¶
- class eval_framework.tasks.benchmarks.drop.DropCloze(num_fewshot=0)[source]¶
Bases: BaseTask[str]
Cloze variant: loglikelihood ranking over full choice texts (allenai/drop-gen2mc).
Same dataset as DropMC; options not shown in prompt; model scores full text of each choice. Includes BitsPerByte on the correct choice.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/drop-gen2mc'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'DropCloze'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Passage']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.drop.DropCompletion(num_fewshot=0)[source]¶
Bases: BaseTask[str]
DROP completion benchmark (EleutherAI/drop): passage, question, model generates answer.
Uses DROP F1 and exact match. Stop at new paragraph or repeated prefixes.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'EleutherAI/drop'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.drop_completion.DropF1ExactMatch'>]¶
- NAME: str = 'DropCompletion'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Passage']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.drop.DropCompletion_OLMES(num_fewshot=0)[source]¶
Bases: DropCompletion
DropCompletion matching OLMES, using train split for fewshot and max tokens 100.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'DropCompletion_OLMES'¶
- class eval_framework.tasks.benchmarks.drop.DropMC(num_fewshot=0)[source]¶
Bases: BaseTask[str]
Multiple-choice variant using allenai/drop-gen2mc (passage_original, question_original, choices, answerKey).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/drop-gen2mc'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'DropMC'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Passage']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
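A sketch contrasting the three DROP variants above via their documented response types and datasets:

    from eval_framework.tasks.benchmarks.drop import DropCloze, DropCompletion, DropMC

    for cls in (DropCloze, DropCompletion, DropMC):
        print(cls.NAME, cls.RESPONSE_TYPE, cls.DATASET_PATH)
    # DropCloze       loglikelihoods  allenai/drop-gen2mc
    # DropCompletion  completion      EleutherAI/drop
    # DropMC          loglikelihoods  allenai/drop-gen2mc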
eval_framework.tasks.benchmarks.duc module¶
- class eval_framework.tasks.benchmarks.duc.DUC(num_fewshot=0)[source]¶
Bases: BaseTask[str], ABC
https://huggingface.co/datasets/midas/duc2001
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'midas/duc2001'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str = '77d6dedcbce421695a12f24c8802e8847a129d92'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Text', 'Keyphrase']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[str] = ['default']¶
eval_framework.tasks.benchmarks.flores200 module¶
- class eval_framework.tasks.benchmarks.flores200.Flores200(num_fewshot=0)[source]¶
Bases: BaseTask[str]
FLORES-200 dataset: https://huggingface.co/datasets/facebook/flores
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'facebook/flores'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- HF_REVISION: str | None = 'fd7d8f42fccb9dbc35830053a8c705a2627124ce'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fin_Latn': Language.FIN, 'fra_Latn': Language.FRA, 'nld_Latn': Language.NLD}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>]¶
- NAME: str = 'FLoRes-200'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'devtest'¶
- SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fin_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-nld_Latn', 'eng_Latn-deu_Latn', 'eng_Latn-fin_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-nld_Latn', 'fin_Latn-deu_Latn', 'fin_Latn-eng_Latn', 'fin_Latn-fra_Latn', 'fin_Latn-nld_Latn', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-fin_Latn', 'fra_Latn-nld_Latn', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fin_Latn', 'nld_Latn-fra_Latn']¶
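SUBJECTS entries encode translation directions as 'source-target' pairs of FLORES language codes; a sketch that enumerates the directions out of English (plain string handling over the documented attribute):

    from eval_framework.tasks.benchmarks.flores200 import Flores200

    for subject in Flores200.SUBJECTS:
        src, tgt = subject.split('-')
        if src == 'eng_Latn':
            print(f'translate {src} -> {tgt}')  # deu_Latn, fin_Latn, fra_Latn, nld_Latn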
eval_framework.tasks.benchmarks.flores_plus module¶
- class eval_framework.tasks.benchmarks.flores_plus.FloresPlus(num_fewshot=0)[source]¶
Bases: BaseTask[str]
Flores-Plus dataset: https://huggingface.co/datasets/openlanguagedata/flores_plus
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openlanguagedata/flores_plus'¶
- FEWSHOT_SPLIT: str = 'devtest'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fra_Latn': Language.FRA, 'ita_Latn': Language.ITA, 'nld_Latn': Language.NLD, 'pol_Latn': Language.POL, 'rus_Cyrl': Language.RUS, 'spa_Latn': Language.SPA, 'ukr_Cyrl': Language.UKR}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>, <class 'eval_framework.metrics.completion.chrf.CHRF'>, <class 'eval_framework.metrics.completion.comet.COMET'>]¶
- NAME: str = 'Flores-Plus'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'dev'¶
- SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-ita_Latn', 'deu_Latn-nld_Latn', 'deu_Latn-pol_Latn', 'deu_Latn-rus_Cyrl', 'deu_Latn-spa_Latn', 'deu_Latn-ukr_Cyrl', 'eng_Latn-deu_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-ita_Latn', 'eng_Latn-nld_Latn', 'eng_Latn-pol_Latn', 'eng_Latn-rus_Cyrl', 'eng_Latn-spa_Latn', 'eng_Latn-ukr_Cyrl', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-ita_Latn', 'fra_Latn-nld_Latn', 'fra_Latn-pol_Latn', 'fra_Latn-rus_Cyrl', 'fra_Latn-spa_Latn', 'fra_Latn-ukr_Cyrl', 'ita_Latn-deu_Latn', 'ita_Latn-eng_Latn', 'ita_Latn-fra_Latn', 'ita_Latn-nld_Latn', 'ita_Latn-pol_Latn', 'ita_Latn-rus_Cyrl', 'ita_Latn-spa_Latn', 'ita_Latn-ukr_Cyrl', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fra_Latn', 'nld_Latn-ita_Latn', 'nld_Latn-pol_Latn', 'nld_Latn-rus_Cyrl', 'nld_Latn-spa_Latn', 'nld_Latn-ukr_Cyrl', 'pol_Latn-deu_Latn', 'pol_Latn-eng_Latn', 'pol_Latn-fra_Latn', 'pol_Latn-ita_Latn', 'pol_Latn-nld_Latn', 'pol_Latn-rus_Cyrl', 'pol_Latn-spa_Latn', 'pol_Latn-ukr_Cyrl', 'rus_Cyrl-deu_Latn', 'rus_Cyrl-eng_Latn', 'rus_Cyrl-fra_Latn', 'rus_Cyrl-ita_Latn', 'rus_Cyrl-nld_Latn', 'rus_Cyrl-pol_Latn', 'rus_Cyrl-spa_Latn', 'rus_Cyrl-ukr_Cyrl', 'spa_Latn-deu_Latn', 'spa_Latn-eng_Latn', 'spa_Latn-fra_Latn', 'spa_Latn-ita_Latn', 'spa_Latn-nld_Latn', 'spa_Latn-pol_Latn', 'spa_Latn-rus_Cyrl', 'spa_Latn-ukr_Cyrl', 'ukr_Cyrl-deu_Latn', 'ukr_Cyrl-eng_Latn', 'ukr_Cyrl-fra_Latn', 'ukr_Cyrl-ita_Latn', 'ukr_Cyrl-nld_Latn', 'ukr_Cyrl-pol_Latn', 'ukr_Cyrl-rus_Cyrl', 'ukr_Cyrl-spa_Latn']¶
eval_framework.tasks.benchmarks.global_mmlu module¶
- class eval_framework.tasks.benchmarks.global_mmlu.GlobalMMLU(num_fewshot=0)[source]¶
Bases: BaseTask[tuple[str, str]]
MMLU dataset: https://huggingface.co/datasets/CohereLabs/Global-MMLU
Currently, we only support prompting in French, German, Spanish, Italian, Portuguese, and Arabic.
TODO: Adjust prompting for each language individually, e.g., for the South-East Asian languages available here: https://github.com/aisingapore/SEA-HELM/blob/main/seahelm_tasks/knowledge/global_mmlu/abstract_algebra/config.yaml
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'CohereLabs/Global-MMLU'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('ar', 'abstract_algebra')": Language.ARB, "('ar', 'anatomy')": Language.ARB, "('ar', 'astronomy')": Language.ARB, "('ar', 'business_ethics')": Language.ARB, "('ar', 'clinical_knowledge')": Language.ARB, "('ar', 'college_biology')": Language.ARB, "('ar', 'college_chemistry')": Language.ARB, "('ar', 'college_computer_science')": Language.ARB, "('ar', 'college_mathematics')": Language.ARB, "('ar', 'college_medicine')": Language.ARB, "('ar', 'college_physics')": Language.ARB, "('ar', 'computer_security')": Language.ARB, "('ar', 'conceptual_physics')": Language.ARB, "('ar', 'econometrics')": Language.ARB, "('ar', 'electrical_engineering')": Language.ARB, "('ar', 'elementary_mathematics')": Language.ARB, "('ar', 'formal_logic')": Language.ARB, "('ar', 'global_facts')": Language.ARB, "('ar', 'high_school_biology')": Language.ARB, "('ar', 'high_school_chemistry')": Language.ARB, "('ar', 'high_school_computer_science')": Language.ARB, "('ar', 'high_school_european_history')": Language.ARB, "('ar', 'high_school_geography')": Language.ARB, "('ar', 'high_school_government_and_politics')": Language.ARB, "('ar', 'high_school_macroeconomics')": Language.ARB, "('ar', 'high_school_mathematics')": Language.ARB, "('ar', 'high_school_microeconomics')": Language.ARB, "('ar', 'high_school_physics')": Language.ARB, "('ar', 'high_school_psychology')": Language.ARB, "('ar', 'high_school_statistics')": Language.ARB, "('ar', 'high_school_us_history')": Language.ARB, "('ar', 'high_school_world_history')": Language.ARB, "('ar', 'human_aging')": Language.ARB, "('ar', 'human_sexuality')": Language.ARB, "('ar', 'international_law')": Language.ARB, "('ar', 'jurisprudence')": Language.ARB, "('ar', 'logical_fallacies')": Language.ARB, "('ar', 'machine_learning')": Language.ARB, "('ar', 'management')": Language.ARB, "('ar', 'marketing')": Language.ARB, "('ar', 'medical_genetics')": Language.ARB, "('ar', 'miscellaneous')": Language.ARB, "('ar', 'moral_disputes')": Language.ARB, "('ar', 'moral_scenarios')": Language.ARB, "('ar', 'nutrition')": Language.ARB, "('ar', 'philosophy')": Language.ARB, "('ar', 'prehistory')": Language.ARB, "('ar', 'professional_accounting')": Language.ARB, "('ar', 'professional_law')": Language.ARB, "('ar', 'professional_medicine')": Language.ARB, "('ar', 'professional_psychology')": Language.ARB, "('ar', 'public_relations')": Language.ARB, "('ar', 'security_studies')": Language.ARB, "('ar', 'sociology')": Language.ARB, "('ar', 'us_foreign_policy')": Language.ARB, "('ar', 'virology')": Language.ARB, "('ar', 'world_religions')": Language.ARB, "('de', 'abstract_algebra')": Language.DEU, "('de', 'anatomy')": Language.DEU, "('de', 'astronomy')": Language.DEU, "('de', 'business_ethics')": Language.DEU, "('de', 'clinical_knowledge')": Language.DEU, "('de', 'college_biology')": Language.DEU, "('de', 'college_chemistry')": Language.DEU, "('de', 'college_computer_science')": Language.DEU, "('de', 'college_mathematics')": Language.DEU, "('de', 'college_medicine')": Language.DEU, "('de', 'college_physics')": Language.DEU, "('de', 'computer_security')": Language.DEU, "('de', 'conceptual_physics')": Language.DEU, "('de', 'econometrics')": Language.DEU, "('de', 'electrical_engineering')": Language.DEU, "('de', 'elementary_mathematics')": Language.DEU, "('de', 'formal_logic')": Language.DEU, "('de', 'global_facts')": Language.DEU, "('de', 'high_school_biology')": Language.DEU, "('de', 'high_school_chemistry')": 
Language.DEU, "('de', 'high_school_computer_science')": Language.DEU, "('de', 'high_school_european_history')": Language.DEU, "('de', 'high_school_geography')": Language.DEU, "('de', 'high_school_government_and_politics')": Language.DEU, "('de', 'high_school_macroeconomics')": Language.DEU, "('de', 'high_school_mathematics')": Language.DEU, "('de', 'high_school_microeconomics')": Language.DEU, "('de', 'high_school_physics')": Language.DEU, "('de', 'high_school_psychology')": Language.DEU, "('de', 'high_school_statistics')": Language.DEU, "('de', 'high_school_us_history')": Language.DEU, "('de', 'high_school_world_history')": Language.DEU, "('de', 'human_aging')": Language.DEU, "('de', 'human_sexuality')": Language.DEU, "('de', 'international_law')": Language.DEU, "('de', 'jurisprudence')": Language.DEU, "('de', 'logical_fallacies')": Language.DEU, "('de', 'machine_learning')": Language.DEU, "('de', 'management')": Language.DEU, "('de', 'marketing')": Language.DEU, "('de', 'medical_genetics')": Language.DEU, "('de', 'miscellaneous')": Language.DEU, "('de', 'moral_disputes')": Language.DEU, "('de', 'moral_scenarios')": Language.DEU, "('de', 'nutrition')": Language.DEU, "('de', 'philosophy')": Language.DEU, "('de', 'prehistory')": Language.DEU, "('de', 'professional_accounting')": Language.DEU, "('de', 'professional_law')": Language.DEU, "('de', 'professional_medicine')": Language.DEU, "('de', 'professional_psychology')": Language.DEU, "('de', 'public_relations')": Language.DEU, "('de', 'security_studies')": Language.DEU, "('de', 'sociology')": Language.DEU, "('de', 'us_foreign_policy')": Language.DEU, "('de', 'virology')": Language.DEU, "('de', 'world_religions')": Language.DEU, "('es', 'abstract_algebra')": Language.SPA, "('es', 'anatomy')": Language.SPA, "('es', 'astronomy')": Language.SPA, "('es', 'business_ethics')": Language.SPA, "('es', 'clinical_knowledge')": Language.SPA, "('es', 'college_biology')": Language.SPA, "('es', 'college_chemistry')": Language.SPA, "('es', 'college_computer_science')": Language.SPA, "('es', 'college_mathematics')": Language.SPA, "('es', 'college_medicine')": Language.SPA, "('es', 'college_physics')": Language.SPA, "('es', 'computer_security')": Language.SPA, "('es', 'conceptual_physics')": Language.SPA, "('es', 'econometrics')": Language.SPA, "('es', 'electrical_engineering')": Language.SPA, "('es', 'elementary_mathematics')": Language.SPA, "('es', 'formal_logic')": Language.SPA, "('es', 'global_facts')": Language.SPA, "('es', 'high_school_biology')": Language.SPA, "('es', 'high_school_chemistry')": Language.SPA, "('es', 'high_school_computer_science')": Language.SPA, "('es', 'high_school_european_history')": Language.SPA, "('es', 'high_school_geography')": Language.SPA, "('es', 'high_school_government_and_politics')": Language.SPA, "('es', 'high_school_macroeconomics')": Language.SPA, "('es', 'high_school_mathematics')": Language.SPA, "('es', 'high_school_microeconomics')": Language.SPA, "('es', 'high_school_physics')": Language.SPA, "('es', 'high_school_psychology')": Language.SPA, "('es', 'high_school_statistics')": Language.SPA, "('es', 'high_school_us_history')": Language.SPA, "('es', 'high_school_world_history')": Language.SPA, "('es', 'human_aging')": Language.SPA, "('es', 'human_sexuality')": Language.SPA, "('es', 'international_law')": Language.SPA, "('es', 'jurisprudence')": Language.SPA, "('es', 'logical_fallacies')": Language.SPA, "('es', 'machine_learning')": Language.SPA, "('es', 'management')": Language.SPA, "('es', 'marketing')": 
Language.SPA, "('es', 'medical_genetics')": Language.SPA, "('es', 'miscellaneous')": Language.SPA, "('es', 'moral_disputes')": Language.SPA, "('es', 'moral_scenarios')": Language.SPA, "('es', 'nutrition')": Language.SPA, "('es', 'philosophy')": Language.SPA, "('es', 'prehistory')": Language.SPA, "('es', 'professional_accounting')": Language.SPA, "('es', 'professional_law')": Language.SPA, "('es', 'professional_medicine')": Language.SPA, "('es', 'professional_psychology')": Language.SPA, "('es', 'public_relations')": Language.SPA, "('es', 'security_studies')": Language.SPA, "('es', 'sociology')": Language.SPA, "('es', 'us_foreign_policy')": Language.SPA, "('es', 'virology')": Language.SPA, "('es', 'world_religions')": Language.SPA, "('fr', 'abstract_algebra')": Language.FRA, "('fr', 'anatomy')": Language.FRA, "('fr', 'astronomy')": Language.FRA, "('fr', 'business_ethics')": Language.FRA, "('fr', 'clinical_knowledge')": Language.FRA, "('fr', 'college_biology')": Language.FRA, "('fr', 'college_chemistry')": Language.FRA, "('fr', 'college_computer_science')": Language.FRA, "('fr', 'college_mathematics')": Language.FRA, "('fr', 'college_medicine')": Language.FRA, "('fr', 'college_physics')": Language.FRA, "('fr', 'computer_security')": Language.FRA, "('fr', 'conceptual_physics')": Language.FRA, "('fr', 'econometrics')": Language.FRA, "('fr', 'electrical_engineering')": Language.FRA, "('fr', 'elementary_mathematics')": Language.FRA, "('fr', 'formal_logic')": Language.FRA, "('fr', 'global_facts')": Language.FRA, "('fr', 'high_school_biology')": Language.FRA, "('fr', 'high_school_chemistry')": Language.FRA, "('fr', 'high_school_computer_science')": Language.FRA, "('fr', 'high_school_european_history')": Language.FRA, "('fr', 'high_school_geography')": Language.FRA, "('fr', 'high_school_government_and_politics')": Language.FRA, "('fr', 'high_school_macroeconomics')": Language.FRA, "('fr', 'high_school_mathematics')": Language.FRA, "('fr', 'high_school_microeconomics')": Language.FRA, "('fr', 'high_school_physics')": Language.FRA, "('fr', 'high_school_psychology')": Language.FRA, "('fr', 'high_school_statistics')": Language.FRA, "('fr', 'high_school_us_history')": Language.FRA, "('fr', 'high_school_world_history')": Language.FRA, "('fr', 'human_aging')": Language.FRA, "('fr', 'human_sexuality')": Language.FRA, "('fr', 'international_law')": Language.FRA, "('fr', 'jurisprudence')": Language.FRA, "('fr', 'logical_fallacies')": Language.FRA, "('fr', 'machine_learning')": Language.FRA, "('fr', 'management')": Language.FRA, "('fr', 'marketing')": Language.FRA, "('fr', 'medical_genetics')": Language.FRA, "('fr', 'miscellaneous')": Language.FRA, "('fr', 'moral_disputes')": Language.FRA, "('fr', 'moral_scenarios')": Language.FRA, "('fr', 'nutrition')": Language.FRA, "('fr', 'philosophy')": Language.FRA, "('fr', 'prehistory')": Language.FRA, "('fr', 'professional_accounting')": Language.FRA, "('fr', 'professional_law')": Language.FRA, "('fr', 'professional_medicine')": Language.FRA, "('fr', 'professional_psychology')": Language.FRA, "('fr', 'public_relations')": Language.FRA, "('fr', 'security_studies')": Language.FRA, "('fr', 'sociology')": Language.FRA, "('fr', 'us_foreign_policy')": Language.FRA, "('fr', 'virology')": Language.FRA, "('fr', 'world_religions')": Language.FRA, "('it', 'abstract_algebra')": Language.ITA, "('it', 'anatomy')": Language.ITA, "('it', 'astronomy')": Language.ITA, "('it', 'business_ethics')": Language.ITA, "('it', 'clinical_knowledge')": Language.ITA, "('it', 'college_biology')": 
Language.ITA, "('it', 'college_chemistry')": Language.ITA, "('it', 'college_computer_science')": Language.ITA, "('it', 'college_mathematics')": Language.ITA, "('it', 'college_medicine')": Language.ITA, "('it', 'college_physics')": Language.ITA, "('it', 'computer_security')": Language.ITA, "('it', 'conceptual_physics')": Language.ITA, "('it', 'econometrics')": Language.ITA, "('it', 'electrical_engineering')": Language.ITA, "('it', 'elementary_mathematics')": Language.ITA, "('it', 'formal_logic')": Language.ITA, "('it', 'global_facts')": Language.ITA, "('it', 'high_school_biology')": Language.ITA, "('it', 'high_school_chemistry')": Language.ITA, "('it', 'high_school_computer_science')": Language.ITA, "('it', 'high_school_european_history')": Language.ITA, "('it', 'high_school_geography')": Language.ITA, "('it', 'high_school_government_and_politics')": Language.ITA, "('it', 'high_school_macroeconomics')": Language.ITA, "('it', 'high_school_mathematics')": Language.ITA, "('it', 'high_school_microeconomics')": Language.ITA, "('it', 'high_school_physics')": Language.ITA, "('it', 'high_school_psychology')": Language.ITA, "('it', 'high_school_statistics')": Language.ITA, "('it', 'high_school_us_history')": Language.ITA, "('it', 'high_school_world_history')": Language.ITA, "('it', 'human_aging')": Language.ITA, "('it', 'human_sexuality')": Language.ITA, "('it', 'international_law')": Language.ITA, "('it', 'jurisprudence')": Language.ITA, "('it', 'logical_fallacies')": Language.ITA, "('it', 'machine_learning')": Language.ITA, "('it', 'management')": Language.ITA, "('it', 'marketing')": Language.ITA, "('it', 'medical_genetics')": Language.ITA, "('it', 'miscellaneous')": Language.ITA, "('it', 'moral_disputes')": Language.ITA, "('it', 'moral_scenarios')": Language.ITA, "('it', 'nutrition')": Language.ITA, "('it', 'philosophy')": Language.ITA, "('it', 'prehistory')": Language.ITA, "('it', 'professional_accounting')": Language.ITA, "('it', 'professional_law')": Language.ITA, "('it', 'professional_medicine')": Language.ITA, "('it', 'professional_psychology')": Language.ITA, "('it', 'public_relations')": Language.ITA, "('it', 'security_studies')": Language.ITA, "('it', 'sociology')": Language.ITA, "('it', 'us_foreign_policy')": Language.ITA, "('it', 'virology')": Language.ITA, "('it', 'world_religions')": Language.ITA, "('pt', 'abstract_algebra')": Language.POR, "('pt', 'anatomy')": Language.POR, "('pt', 'astronomy')": Language.POR, "('pt', 'business_ethics')": Language.POR, "('pt', 'clinical_knowledge')": Language.POR, "('pt', 'college_biology')": Language.POR, "('pt', 'college_chemistry')": Language.POR, "('pt', 'college_computer_science')": Language.POR, "('pt', 'college_mathematics')": Language.POR, "('pt', 'college_medicine')": Language.POR, "('pt', 'college_physics')": Language.POR, "('pt', 'computer_security')": Language.POR, "('pt', 'conceptual_physics')": Language.POR, "('pt', 'econometrics')": Language.POR, "('pt', 'electrical_engineering')": Language.POR, "('pt', 'elementary_mathematics')": Language.POR, "('pt', 'formal_logic')": Language.POR, "('pt', 'global_facts')": Language.POR, "('pt', 'high_school_biology')": Language.POR, "('pt', 'high_school_chemistry')": Language.POR, "('pt', 'high_school_computer_science')": Language.POR, "('pt', 'high_school_european_history')": Language.POR, "('pt', 'high_school_geography')": Language.POR, "('pt', 'high_school_government_and_politics')": Language.POR, "('pt', 'high_school_macroeconomics')": Language.POR, "('pt', 'high_school_mathematics')": 
Language.POR, "('pt', 'high_school_microeconomics')": Language.POR, "('pt', 'high_school_physics')": Language.POR, "('pt', 'high_school_psychology')": Language.POR, "('pt', 'high_school_statistics')": Language.POR, "('pt', 'high_school_us_history')": Language.POR, "('pt', 'high_school_world_history')": Language.POR, "('pt', 'human_aging')": Language.POR, "('pt', 'human_sexuality')": Language.POR, "('pt', 'international_law')": Language.POR, "('pt', 'jurisprudence')": Language.POR, "('pt', 'logical_fallacies')": Language.POR, "('pt', 'machine_learning')": Language.POR, "('pt', 'management')": Language.POR, "('pt', 'marketing')": Language.POR, "('pt', 'medical_genetics')": Language.POR, "('pt', 'miscellaneous')": Language.POR, "('pt', 'moral_disputes')": Language.POR, "('pt', 'moral_scenarios')": Language.POR, "('pt', 'nutrition')": Language.POR, "('pt', 'philosophy')": Language.POR, "('pt', 'prehistory')": Language.POR, "('pt', 'professional_accounting')": Language.POR, "('pt', 'professional_law')": Language.POR, "('pt', 'professional_medicine')": Language.POR, "('pt', 'professional_psychology')": Language.POR, "('pt', 'public_relations')": Language.POR, "('pt', 'security_studies')": Language.POR, "('pt', 'sociology')": Language.POR, "('pt', 'us_foreign_policy')": Language.POR, "('pt', 'virology')": Language.POR, "('pt', 'world_religions')": Language.POR}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'GlobalMMLU'¶
- OPTION_KEYS = {'A': 'option_a', 'B': 'option_b', 'C': 'option_c', 'D': 'option_d'}¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = [('fr', 'abstract_algebra'), ('fr', 'anatomy'), ('fr', 'astronomy'), ('fr', 'business_ethics'), ('fr', 'clinical_knowledge'), ('fr', 'college_biology'), ('fr', 'college_chemistry'), ('fr', 'college_computer_science'), ('fr', 'college_mathematics'), ('fr', 'college_medicine'), ('fr', 'college_physics'), ('fr', 'computer_security'), ('fr', 'conceptual_physics'), ('fr', 'econometrics'), ('fr', 'electrical_engineering'), ('fr', 'elementary_mathematics'), ('fr', 'formal_logic'), ('fr', 'global_facts'), ('fr', 'high_school_biology'), ('fr', 'high_school_chemistry'), ('fr', 'high_school_computer_science'), ('fr', 'high_school_european_history'), ('fr', 'high_school_geography'), ('fr', 'high_school_government_and_politics'), ('fr', 'high_school_macroeconomics'), ('fr', 'high_school_mathematics'), ('fr', 'high_school_microeconomics'), ('fr', 'high_school_physics'), ('fr', 'high_school_psychology'), ('fr', 'high_school_statistics'), ('fr', 'high_school_us_history'), ('fr', 'high_school_world_history'), ('fr', 'human_aging'), ('fr', 'human_sexuality'), ('fr', 'international_law'), ('fr', 'jurisprudence'), ('fr', 'logical_fallacies'), ('fr', 'machine_learning'), ('fr', 'management'), ('fr', 'marketing'), ('fr', 'medical_genetics'), ('fr', 'miscellaneous'), ('fr', 'moral_disputes'), ('fr', 'moral_scenarios'), ('fr', 'nutrition'), ('fr', 'philosophy'), ('fr', 'prehistory'), ('fr', 'professional_accounting'), ('fr', 'professional_law'), ('fr', 'professional_medicine'), ('fr', 'professional_psychology'), ('fr', 'public_relations'), ('fr', 'security_studies'), ('fr', 'sociology'), ('fr', 'us_foreign_policy'), ('fr', 'virology'), ('fr', 'world_religions'), ('de', 'abstract_algebra'), ('de', 'anatomy'), ('de', 'astronomy'), ('de', 'business_ethics'), ('de', 'clinical_knowledge'), ('de', 'college_biology'), ('de', 'college_chemistry'), ('de', 'college_computer_science'), ('de', 'college_mathematics'), ('de', 'college_medicine'), ('de', 'college_physics'), ('de', 'computer_security'), ('de', 'conceptual_physics'), ('de', 'econometrics'), ('de', 'electrical_engineering'), ('de', 'elementary_mathematics'), ('de', 'formal_logic'), ('de', 'global_facts'), ('de', 'high_school_biology'), ('de', 'high_school_chemistry'), ('de', 'high_school_computer_science'), ('de', 'high_school_european_history'), ('de', 'high_school_geography'), ('de', 'high_school_government_and_politics'), ('de', 'high_school_macroeconomics'), ('de', 'high_school_mathematics'), ('de', 'high_school_microeconomics'), ('de', 'high_school_physics'), ('de', 'high_school_psychology'), ('de', 'high_school_statistics'), ('de', 'high_school_us_history'), ('de', 'high_school_world_history'), ('de', 'human_aging'), ('de', 'human_sexuality'), ('de', 'international_law'), ('de', 'jurisprudence'), ('de', 'logical_fallacies'), ('de', 'machine_learning'), ('de', 'management'), ('de', 'marketing'), ('de', 'medical_genetics'), ('de', 'miscellaneous'), ('de', 'moral_disputes'), ('de', 'moral_scenarios'), ('de', 'nutrition'), ('de', 'philosophy'), ('de', 'prehistory'), ('de', 'professional_accounting'), ('de', 'professional_law'), ('de', 'professional_medicine'), ('de', 'professional_psychology'), ('de', 'public_relations'), ('de', 'security_studies'), ('de', 'sociology'), ('de', 'us_foreign_policy'), ('de', 'virology'), ('de', 'world_religions'), ('es', 'abstract_algebra'), ('es', 'anatomy'), ('es', 'astronomy'), ('es', 'business_ethics'), ('es', 'clinical_knowledge'), ('es', 'college_biology'), ('es', 'college_chemistry'), ('es', 
'college_computer_science'), ('es', 'college_mathematics'), ('es', 'college_medicine'), ('es', 'college_physics'), ('es', 'computer_security'), ('es', 'conceptual_physics'), ('es', 'econometrics'), ('es', 'electrical_engineering'), ('es', 'elementary_mathematics'), ('es', 'formal_logic'), ('es', 'global_facts'), ('es', 'high_school_biology'), ('es', 'high_school_chemistry'), ('es', 'high_school_computer_science'), ('es', 'high_school_european_history'), ('es', 'high_school_geography'), ('es', 'high_school_government_and_politics'), ('es', 'high_school_macroeconomics'), ('es', 'high_school_mathematics'), ('es', 'high_school_microeconomics'), ('es', 'high_school_physics'), ('es', 'high_school_psychology'), ('es', 'high_school_statistics'), ('es', 'high_school_us_history'), ('es', 'high_school_world_history'), ('es', 'human_aging'), ('es', 'human_sexuality'), ('es', 'international_law'), ('es', 'jurisprudence'), ('es', 'logical_fallacies'), ('es', 'machine_learning'), ('es', 'management'), ('es', 'marketing'), ('es', 'medical_genetics'), ('es', 'miscellaneous'), ('es', 'moral_disputes'), ('es', 'moral_scenarios'), ('es', 'nutrition'), ('es', 'philosophy'), ('es', 'prehistory'), ('es', 'professional_accounting'), ('es', 'professional_law'), ('es', 'professional_medicine'), ('es', 'professional_psychology'), ('es', 'public_relations'), ('es', 'security_studies'), ('es', 'sociology'), ('es', 'us_foreign_policy'), ('es', 'virology'), ('es', 'world_religions'), ('it', 'abstract_algebra'), ('it', 'anatomy'), ('it', 'astronomy'), ('it', 'business_ethics'), ('it', 'clinical_knowledge'), ('it', 'college_biology'), ('it', 'college_chemistry'), ('it', 'college_computer_science'), ('it', 'college_mathematics'), ('it', 'college_medicine'), ('it', 'college_physics'), ('it', 'computer_security'), ('it', 'conceptual_physics'), ('it', 'econometrics'), ('it', 'electrical_engineering'), ('it', 'elementary_mathematics'), ('it', 'formal_logic'), ('it', 'global_facts'), ('it', 'high_school_biology'), ('it', 'high_school_chemistry'), ('it', 'high_school_computer_science'), ('it', 'high_school_european_history'), ('it', 'high_school_geography'), ('it', 'high_school_government_and_politics'), ('it', 'high_school_macroeconomics'), ('it', 'high_school_mathematics'), ('it', 'high_school_microeconomics'), ('it', 'high_school_physics'), ('it', 'high_school_psychology'), ('it', 'high_school_statistics'), ('it', 'high_school_us_history'), ('it', 'high_school_world_history'), ('it', 'human_aging'), ('it', 'human_sexuality'), ('it', 'international_law'), ('it', 'jurisprudence'), ('it', 'logical_fallacies'), ('it', 'machine_learning'), ('it', 'management'), ('it', 'marketing'), ('it', 'medical_genetics'), ('it', 'miscellaneous'), ('it', 'moral_disputes'), ('it', 'moral_scenarios'), ('it', 'nutrition'), ('it', 'philosophy'), ('it', 'prehistory'), ('it', 'professional_accounting'), ('it', 'professional_law'), ('it', 'professional_medicine'), ('it', 'professional_psychology'), ('it', 'public_relations'), ('it', 'security_studies'), ('it', 'sociology'), ('it', 'us_foreign_policy'), ('it', 'virology'), ('it', 'world_religions'), ('pt', 'abstract_algebra'), ('pt', 'anatomy'), ('pt', 'astronomy'), ('pt', 'business_ethics'), ('pt', 'clinical_knowledge'), ('pt', 'college_biology'), ('pt', 'college_chemistry'), ('pt', 'college_computer_science'), ('pt', 'college_mathematics'), ('pt', 'college_medicine'), ('pt', 'college_physics'), ('pt', 'computer_security'), ('pt', 'conceptual_physics'), ('pt', 'econometrics'), ('pt', 
'electrical_engineering'), ('pt', 'elementary_mathematics'), ('pt', 'formal_logic'), ('pt', 'global_facts'), ('pt', 'high_school_biology'), ('pt', 'high_school_chemistry'), ('pt', 'high_school_computer_science'), ('pt', 'high_school_european_history'), ('pt', 'high_school_geography'), ('pt', 'high_school_government_and_politics'), ('pt', 'high_school_macroeconomics'), ('pt', 'high_school_mathematics'), ('pt', 'high_school_microeconomics'), ('pt', 'high_school_physics'), ('pt', 'high_school_psychology'), ('pt', 'high_school_statistics'), ('pt', 'high_school_us_history'), ('pt', 'high_school_world_history'), ('pt', 'human_aging'), ('pt', 'human_sexuality'), ('pt', 'international_law'), ('pt', 'jurisprudence'), ('pt', 'logical_fallacies'), ('pt', 'machine_learning'), ('pt', 'management'), ('pt', 'marketing'), ('pt', 'medical_genetics'), ('pt', 'miscellaneous'), ('pt', 'moral_disputes'), ('pt', 'moral_scenarios'), ('pt', 'nutrition'), ('pt', 'philosophy'), ('pt', 'prehistory'), ('pt', 'professional_accounting'), ('pt', 'professional_law'), ('pt', 'professional_medicine'), ('pt', 'professional_psychology'), ('pt', 'public_relations'), ('pt', 'security_studies'), ('pt', 'sociology'), ('pt', 'us_foreign_policy'), ('pt', 'virology'), ('pt', 'world_religions'), ('ar', 'abstract_algebra'), ('ar', 'anatomy'), ('ar', 'astronomy'), ('ar', 'business_ethics'), ('ar', 'clinical_knowledge'), ('ar', 'college_biology'), ('ar', 'college_chemistry'), ('ar', 'college_computer_science'), ('ar', 'college_mathematics'), ('ar', 'college_medicine'), ('ar', 'college_physics'), ('ar', 'computer_security'), ('ar', 'conceptual_physics'), ('ar', 'econometrics'), ('ar', 'electrical_engineering'), ('ar', 'elementary_mathematics'), ('ar', 'formal_logic'), ('ar', 'global_facts'), ('ar', 'high_school_biology'), ('ar', 'high_school_chemistry'), ('ar', 'high_school_computer_science'), ('ar', 'high_school_european_history'), ('ar', 'high_school_geography'), ('ar', 'high_school_government_and_politics'), ('ar', 'high_school_macroeconomics'), ('ar', 'high_school_mathematics'), ('ar', 'high_school_microeconomics'), ('ar', 'high_school_physics'), ('ar', 'high_school_psychology'), ('ar', 'high_school_statistics'), ('ar', 'high_school_us_history'), ('ar', 'high_school_world_history'), ('ar', 'human_aging'), ('ar', 'human_sexuality'), ('ar', 'international_law'), ('ar', 'jurisprudence'), ('ar', 'logical_fallacies'), ('ar', 'machine_learning'), ('ar', 'management'), ('ar', 'marketing'), ('ar', 'medical_genetics'), ('ar', 'miscellaneous'), ('ar', 'moral_disputes'), ('ar', 'moral_scenarios'), ('ar', 'nutrition'), ('ar', 'philosophy'), ('ar', 'prehistory'), ('ar', 'professional_accounting'), ('ar', 'professional_law'), ('ar', 'professional_medicine'), ('ar', 'professional_psychology'), ('ar', 'public_relations'), ('ar', 'security_studies'), ('ar', 'sociology'), ('ar', 'us_foreign_policy'), ('ar', 'virology'), ('ar', 'world_religions')]¶
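Unlike most tasks in this package, GlobalMMLU keys its subjects as (language, topic) tuples; a sketch of narrowing the documented SUBJECTS list to one language:

    from eval_framework.tasks.benchmarks.global_mmlu import GlobalMMLU

    german_subjects = [s for s in GlobalMMLU.SUBJECTS if s[0] == 'de']
    print(len(german_subjects))  # 57, one per MMLU topic
    task = GlobalMMLU(num_fewshot=5)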
eval_framework.tasks.benchmarks.goldenswag module¶
- class eval_framework.tasks.benchmarks.goldenswag.GOLDENSWAG(num_fewshot=0)[source]¶
Bases: HELLASWAG
GoldenSwag dataset: https://huggingface.co/datasets/PleIAs/GoldenSwag
Available dataset splits: validation
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'PleIAs/GoldenSwag'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- NAME: str = 'GoldenSwag'¶
- SAMPLE_SPLIT: str = 'validation'¶
- class eval_framework.tasks.benchmarks.goldenswag.GOLDENSWAG_IDK(num_fewshot=0)[source]¶
Bases: GOLDENSWAG
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'GoldenSwag_IDK'¶
eval_framework.tasks.benchmarks.gpqa module¶
- class eval_framework.tasks.benchmarks.gpqa.GPQA(num_fewshot=0)[source]¶
Bases: BaseTask[str]
GPQA dataset: https://huggingface.co/datasets/Idavidrein/gpqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Idavidrein/gpqa'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'GPQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['gpqa_extended']¶
- class eval_framework.tasks.benchmarks.gpqa.GPQA_COT(num_fewshot=0)[source]¶
Bases: GPQA
- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'GPQA_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str
- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
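The documented ANS_RE pattern shows how GPQA_COT recovers the final letter from a chain-of-thought completion; a sketch of that extraction step (the completion string is illustrative):

    import re

    ANS_RE = re.compile(r'Therefore, the answer is \(([ABCDEFGHIJ])\)')

    completion = '...so only option C is consistent. Therefore, the answer is (C).'
    match = ANS_RE.search(completion)
    if match:
        print(match.group(1))  # 'C'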
- class eval_framework.tasks.benchmarks.gpqa.GPQA_IDK(num_fewshot=0)[source]¶
Bases: GPQA
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'GPQA_IDK'¶
- class eval_framework.tasks.benchmarks.gpqa.GPQA_OLMES(num_fewshot=0)[source]¶
Bases: GPQA
GPQA multiple choice (OLMES/oe_eval style): prompt shows options with space-prefixed labels (" A.", " B.", " C.", " D."); loglikelihood over " A"/" B"/" C"/" D".
- Parameters:
num_fewshot (int)
- NAME: str = 'GPQA_OLMES'¶
eval_framework.tasks.benchmarks.gsm8k module¶
- class eval_framework.tasks.benchmarks.gsm8k.GSM8K(num_fewshot=0)[source]¶
Bases: GSM8KEvalHarness
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = ''¶
- NAME: str = 'GSM8K'¶
- class eval_framework.tasks.benchmarks.gsm8k.GSM8KEvalHarness(num_fewshot=0)[source]¶
Bases: BaseTask[str]
GSM8K dataset: https://huggingface.co/datasets/openai/gsm8k
This version uses samples from the train split as fewshot examples.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/gsm8k'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = 'main'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'GSM8KEvalHarness'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['main']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str
- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
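GSM8K references end with a final numeric answer, so post-processing typically reduces a completion to its last number before the exact-match comparison; a hedged sketch of that idea (this regex is illustrative, not necessarily the framework's exact logic):

    import re

    def extract_final_number(text: str) -> str | None:
        # Illustrative helper: grab the last integer/decimal in the completion.
        matches = re.findall(r'-?[\d,]*\.?\d+', text)
        return matches[-1].replace(',', '') if matches else None

    print(extract_final_number('He pays 10 * 3 = 30. The answer is 30'))  # '30'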
eval_framework.tasks.benchmarks.hellaswag module¶
- class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG(num_fewshot=0)[source]¶
Bases: BaseTask[str]
HellaSwag dataset: https://huggingface.co/datasets/Rowan/hellaswag
Available dataset splits: train, validation, test
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Rowan/hellaswag'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'HellaSwag'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG_IDK(num_fewshot=0)[source]¶
Bases: HELLASWAG
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'HellaSwag_IDK'¶
eval_framework.tasks.benchmarks.hellaswag_de module¶
- class eval_framework.tasks.benchmarks.hellaswag_de.HELLASWAG_DE(num_fewshot=0)[source]¶
Bases: BaseTask[str]
HellaSwag dataset: https://huggingface.co/datasets/LeoLM/HellaSwag_de
Available dataset splits: train (1k rows), validation (10k rows)
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/HellaSwag_de'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'HellaSwag German'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.humaneval module¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEval(num_fewshot=0)[source]¶
Bases: BaseTask[str]
HumanEval dataset: https://huggingface.co/datasets/openai/openai_humaneval/
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/openai_humaneval'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]¶
- NAME: str = 'Human Eval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEvalBPB(num_fewshot=0)[source]¶
Bases: HumanEval
HumanEval variant that scores loglikelihood of the gold canonical solution. Reports bits-per-byte on the reference completion.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'Human Eval BPB'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEvalInstruct(num_fewshot=0)[source]¶
Bases: HumanEval
- Parameters:
num_fewshot (int)
- CUE_PREFIX = 'Here is the completed function:\n```python\n'¶
- NAME: str = 'Human Eval Instruct'¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEvalMetricContext(**data)[source]¶
Bases: BaseMetricContext
- Parameters:
test (str)
entry_point (str)
prompt (str)
extra_data (Any)
- entry_point: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- prompt: str¶
- test: str¶
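Because model_config sets extra='allow', a HumanEvalMetricContext accepts arbitrary additional fields alongside the declared ones. A minimal construction sketch (the field values are invented for illustration):

    from eval_framework.tasks.benchmarks.humaneval import HumanEvalMetricContext

    ctx = HumanEvalMetricContext(
        prompt="def add(a, b):\n    ...",  # problem prompt shown to the model
        entry_point="add",                 # function under test
        test="assert add(1, 2) == 3",      # assertion run against the completion
        extra_data=None,                   # declared as Any in the parameters above
        source="example",                  # undeclared field, permitted by extra='allow'
    )
    print(ctx.entry_point)  # -> 'add'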
- class eval_framework.tasks.benchmarks.humaneval.HumanEval_OLMES(num_fewshot=3)[source]¶
Bases:
HumanEvalHumanEval OLMES variant replicating codex_humaneval:3shot::olmo3:n32:v2 from oe_eval.
- Recommended EvalConfig settings for full replication (sketched below):
repeats: 32
llm_args: {sampling_params: {temperature: 0.6, top_p: 0.6}}
- Parameters:
num_fewshot (int)
- NAME: str = 'Human Eval OLMES'¶
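The replication settings above map onto an EvalConfig roughly as follows; this is a hedged sketch in plain-dict form, since the exact EvalConfig constructor fields are not reproduced in this entry:

    # Illustrative key names; only the repeats and sampling values come from the entry above.
    config_kwargs = dict(
        task_name="Human Eval OLMES",  # NAME of this task
        num_fewshot=3,                 # default for HumanEval_OLMES
        repeats=32,                    # n=32 samples per problem, as recommended
        llm_args={"sampling_params": {"temperature": 0.6, "top_p": 0.6}},
    )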
eval_framework.tasks.benchmarks.ifeval module¶
- class eval_framework.tasks.benchmarks.ifeval.IFEval(num_fewshot=0)[source]¶
Bases:
BaseTask[str]IFEval: Instruction Following Eval (https://arxiv.org/pdf/2311.07911).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google/IFEval'¶
- FEWSHOT_SPLIT: str = 'train'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.ifeval.IFEvalMetric'>]¶
- NAME: str = 'IFEval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.ifeval.IFEvalDe(num_fewshot=0)[source]¶
Bases:
IFEvalGerman version of the Instruction Following Evaluation (IFEval) benchmark.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'jzhang86/de_ifeval'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.DEU}¶
- NAME: str = 'IFEval German'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.ifeval.IFEvalFiSv(num_fewshot=0)[source]¶
Bases:
IFEvalMachine-translated versions of the Instruction Following Evaluation (IFEval) benchmark.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LumiOpen/ifeval_mt'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'fi': Language.FIN, 'sv': Language.SWE}¶
- NAME: str = 'IFEval Finnish & Swedish'¶
- SUBJECTS: list[SubjectType] = ['fi', 'sv']¶
eval_framework.tasks.benchmarks.include module¶
- class eval_framework.tasks.benchmarks.include.INCLUDE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]INCLUDE dataset: https://huggingface.co/datasets/CohereLabs/include-base-44
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'CohereLabs/include-base-44'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'Albanian': Language.SQI, 'Arabic': Language.ARB, 'Armenian': Language.HYE, 'Azerbaijani': Language.AZE, 'Basque': Language.EUS, 'Belarusian': Language.BEL, 'Bengali': Language.BEN, 'Bulgarian': Language.BUL, 'Chinese': Language.ZHO, 'Croatian': Language.HRV, 'Dutch': Language.NLD, 'Estonian': Language.EST, 'Finnish': Language.FIN, 'French': Language.FRA, 'Georgian': Language.KAT, 'German': Language.DEU, 'Greek': Language.ELL, 'Hebrew': Language.HEB, 'Hindi': Language.HIN, 'Hungarian': Language.HUN, 'Indonesian': Language.IND, 'Italian': Language.ITA, 'Japanese': Language.JPN, 'Kazakh': Language.KAZ, 'Korean': Language.KOR, 'Lithuanian': Language.LIT, 'Malay': Language.MSA, 'Malayalam': Language.MAL, 'Nepali': Language.NEP, 'North Macedonian': Language.MKD, 'Persian': Language.FAS, 'Polish': Language.POL, 'Portuguese': Language.POR, 'Russian': Language.RUS, 'Serbian': Language.SRP, 'Spanish': Language.SPA, 'Tagalog': Language.TGL, 'Tamil': Language.TAM, 'Telugu': Language.TEL, 'Turkish': Language.TUR, 'Ukrainian': Language.UKR, 'Urdu': Language.URD, 'Uzbek': Language.UZB, 'Vietnamese': Language.VIE}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'INCLUDE'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['Albanian', 'Arabic', 'Armenian', 'Azerbaijani', 'Basque', 'Belarusian', 'Bengali', 'Bulgarian', 'Chinese', 'Croatian', 'Dutch', 'Estonian', 'Finnish', 'French', 'Georgian', 'German', 'Greek', 'Hebrew', 'Hindi', 'Hungarian', 'Indonesian', 'Italian', 'Japanese', 'Kazakh', 'Korean', 'Lithuanian', 'Malay', 'Malayalam', 'Nepali', 'North Macedonian', 'Persian', 'Polish', 'Portuguese', 'Russian', 'Serbian', 'Spanish', 'Tagalog', 'Tamil', 'Telugu', 'Turkish', 'Ukrainian', 'Urdu', 'Uzbek', 'Vietnamese']¶
eval_framework.tasks.benchmarks.infinitebench module¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench(num_fewshot=0)[source]¶
Bases:
BaseTask[str],ABCInfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens https://github.com/OpenBMB/InfiniteBench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'xinrongzhang2022/InfiniteBench'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None¶
- SUBJECTS: list[SubjectType] = ['default']¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchCompletion(num_fewshot=0)[source]¶
Bases:
InfiniteBench,ABCBase class for completion tasks.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchLoglikelihood(num_fewshot=0)[source]¶
Bases:
InfiniteBench,ABCBase class for loglikelihood tasks.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeDebug(num_fewshot=0)[source]¶
Bases:
InfiniteBenchLoglikelihoodFinding which function in a code repository contains a crashing error (multiple-choice form).
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'code_debug'¶
- NAME: str = 'InfiniteBench_CodeDebug'¶
- SAMPLE_SPLIT: str = 'code_debug'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeRun(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionSimulating execution of multiple simple, synthetic functions.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'code_run'¶
- NAME: str = 'InfiniteBench_CodeRun'¶
- SAMPLE_SPLIT: str = 'code_run'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnDia(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionIdentification of speakers in partially anonymized scripts.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longdialogue_qa_eng'¶
- NAME: str = 'InfiniteBench_EnDia'¶
- SAMPLE_SPLIT: str = 'longdialogue_qa_eng'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnMC(num_fewshot=0)[source]¶
Bases:
InfiniteBenchLoglikelihoodMultiple-choice questions derived from the fabricated ("fake") book.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longbook_choice_eng'¶
- NAME: str = 'InfiniteBench_EnMC'¶
- SAMPLE_SPLIT: str = 'longbook_choice_eng'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnQA(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionFree-form question answering based on the fabricated ("fake") book.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longbook_qa_eng'¶
- NAME: str = 'InfiniteBench_EnQA'¶
- SAMPLE_SPLIT: str = 'longbook_qa_eng'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_MathFind(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionFinding special integers in a lengthy list.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'math_find'¶
- NAME: str = 'InfiniteBench_MathFind'¶
- SAMPLE_SPLIT: str = 'math_find'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveKV2(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionRetrieving the value corresponding to a given key from a large dictionary.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'kv_retrieval'¶
- NAME: str = 'InfiniteBench_RetrieveKV2'¶
- SAMPLE_SPLIT: str = 'kv_retrieval'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveNumber(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionLocating repeated hidden numbers in a noisy long context.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'number_string'¶
- NAME: str = 'InfiniteBench_RetrieveNumber'¶
- SAMPLE_SPLIT: str = 'number_string'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrievePassKey1(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletionRetrieving hidden keys in a noisy long context.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'passkey'¶
- NAME: str = 'InfiniteBench_RetrievePassKey1'¶
- SAMPLE_SPLIT: str = 'passkey'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
eval_framework.tasks.benchmarks.lab_bench module¶
- class eval_framework.tasks.benchmarks.lab_bench.LabBenchCloze(num_fewshot=0)[source]¶
Bases:
BaseTask[str]Lab-Bench (futurehouse/lab-bench): QA over scientific protocols; the cloze variant ranks the ideal answer against distractors by loglikelihood (see the ranking sketch after this entry).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'futurehouse/lab-bench'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'LabBenchCloze'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['CloningScenarios', 'DbQA', 'FigQA', 'LitQA2', 'ProtocolQA', 'SeqQA', 'SuppQA', 'TableQA']¶
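For loglikelihood tasks like this one, accuracy comes from ranking the gold ("ideal") answer against the distractors by model loglikelihood; the "Norm" metric variant additionally normalizes each score by the length of the choice, which can change the winner. A minimal sketch, assuming per-choice loglikelihoods are already available (byte-length normalization follows the common acc_norm convention and is an assumption here):

    def pick(loglikelihoods: dict[str, float], normalize: bool = False) -> str:
        """Return the choice with the highest (optionally length-normalized) loglikelihood."""
        def score(choice: str) -> float:
            ll = loglikelihoods[choice]
            return ll / len(choice.encode("utf-8")) if normalize else ll
        return max(loglikelihoods, key=score)

    lls = {"telomerase": -9.1, "DNA polymerase": -12.4, "ligase": -11.0}
    print(pick(lls))                  # -> 'telomerase' (AccuracyLoglikelihood-style)
    print(pick(lls, normalize=True))  # -> 'DNA polymerase' (normalization flips the ranking)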
- class eval_framework.tasks.benchmarks.lab_bench.LabBenchMC(num_fewshot=0)[source]¶
Bases:
LabBenchCloze- Parameters:
num_fewshot (int)
- NAME: str = 'LabBenchMC'¶
- class eval_framework.tasks.benchmarks.lab_bench.LabBenchMC_OLMES(num_fewshot=0)[source]¶
Bases:
LabBenchMCLabBenchMC with OLMES-style prompt: a space before each label in the prompt (" A.", " B.", …).
- Parameters:
num_fewshot (int)
- NAME: str = 'LabBenchMC_OLMES'¶
eval_framework.tasks.benchmarks.math_reasoning module¶
- class eval_framework.tasks.benchmarks.math_reasoning.AIME2024(num_fewshot=0)[source]¶
Bases:
MATHReasoningAIME 2024 dataset: https://huggingface.co/datasets/HuggingFaceH4/aime_2024
This dataset contains a single train split of 30 questions. Data columns: ID | Problem | Solution | Answer. Evaluated as pass@1; a prompt-formatting and answer-extraction sketch follows this entry.
- Parameters:
num_fewshot (int)
- ANSWER_PATTERN = 'Therefore, the final answer is:(.*?). I hope it is correct.'¶
- DATASET_PATH: str = 'HuggingFaceH4/aime_2024'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'AIME2024'¶
- QUERY_TEMPLATE = 'Solve the following math problem efficiently and clearly:\n\n - For simple problems (2 steps or fewer):\n Provide a concise solution with minimal explanation.\n\n - For complex problems (3 steps or more):\n Use this step-by-step format:\n\n ## Step 1: [Concise description]\n [Brief explanation and calculations]\n\n ## Step 2: [Concise description]\n [Brief explanation and calculations]\n\n ...\n\n Regardless of the approach, always conclude with:\n\n Therefore, the final answer is: $\\boxed{{answer}}$. I hope it is correct.\n\n Where [answer] is just the final number or expression that solves the problem.\n\n Problem: {Question}'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
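The QUERY_TEMPLATE and ANSWER_PATTERN above compose in the obvious way: the template is filled with the problem text, and the pattern pulls the concluding boxed expression back out of the completion. A minimal sketch using the literal class attributes from this entry (the completion string is invented):

    import re

    from eval_framework.tasks.benchmarks.math_reasoning import AIME2024

    prompt = AIME2024.QUERY_TEMPLATE.format(Question="What is 2 + 2?")
    completion = (
        "## Step 1: Add the numbers.\n2 + 2 = 4.\n\n"
        "Therefore, the final answer is: $\\boxed{4}$. I hope it is correct."
    )
    match = re.search(AIME2024.ANSWER_PATTERN, completion)
    if match is not None:
        print(match.group(1).strip())  # -> '$\boxed{4}$'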
- class eval_framework.tasks.benchmarks.math_reasoning.AIME2025(num_fewshot=0)[source]¶
Bases:
AIME2024AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25
This dataset contains a single test split of 30 questions. Data columns: problem | answer | id. Evaluated as pass@1.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'math-ai/aime25'¶
- FEWSHOT_SPLIT: str = 'test'¶
- NAME: str = 'AIME2025'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.math_reasoning.AIME2026(num_fewshot=0)[source]¶
Bases:
AIME2024AIME 2026 dataset: https://huggingface.co/datasets/math-ai/aime26
This dataset contains a single test split of 30 questions. Data columns: problem | answer | id. Evaluated as pass@1.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'math-ai/aime26'¶
- FEWSHOT_SPLIT: str = 'test'¶
- NAME: str = 'AIME2026'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.math_reasoning.GSM8KReasoning(num_fewshot=0)[source]¶
Bases:
MATHReasoningGSM8K dataset with reasoning prompt: https://huggingface.co/datasets/openai/gsm8k
Zero-shot reasoning version that expects answers in boxed format.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/gsm8k'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'GSM8KReasoning'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. Think through the problem carefully and show your reasoning.\n\nPlease provide your answer in the format: $\\boxed{{answer}}$ where answer is the final numerical result.\n\nQuestion: {question}\n\nAnswer:'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['main']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATH(num_fewshot=0)[source]¶
Bases:
MATHReasoningMATH dataset: https://huggingface.co/datasets/EleutherAI/hendrycks_math
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'EleutherAI/hendrycks_math'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'Math'¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n {Question}\n\n Remember to put your answer in $\\boxed{{answer}}$\n\n where [answer] is just the final number or expression that solves the problem.'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']¶
- extract_last_two_dollar_text(s)[source]¶
extract_last_two_dollar_text finds text between the last two dollar signs in a string :type s:
str:param s: the string to extract text from- Return type:
str- Returns:
the extracted text
- Parameters:
s (str)
- post_process_generated_completion(completion_text, sample=None)[source]¶
post_process_generated_completion extracts the final answer via flexible extraction/matching: a boxed answer, if present, is used first; otherwise, if the completion contains LaTeX math delimiters ("$"), the text between the last two dollar signs is extracted and used; failing both, an answer line ("Answer:") is used last. A sketch of this fallback chain follows this entry.
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
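A hedged sketch of the fallback chain described above; it illustrates the documented priority order (boxed answer, then the last $...$ span, then an "Answer:" line) rather than the framework's actual implementation:

    import re

    def flex_extract(text: str) -> str:
        """Illustrative fallback chain for final-answer extraction."""
        boxed = re.search(r"\\boxed\{([^{}]*)\}", text)  # simplified: no nested braces
        if boxed:
            return boxed.group(1)
        dollars = re.findall(r"\$([^$]+)\$", text)
        if dollars:
            return dollars[-1]  # text between the last two dollar signs
        answer = re.search(r"(?i)Answer\s*:\s*(.*)", text)  # MATHReasoning.ANSWER_PATTERN
        return answer.group(1).strip() if answer else text.strip()

    print(flex_extract("The roots are $x=1$ and $x=2$."))  # -> 'x=2'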
- class eval_framework.tasks.benchmarks.math_reasoning.MATH500(num_fewshot=0)[source]¶
Bases:
MATHReasoningMATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500
This dataset contains a single test split of 500 questions. Data columns: ID | Problem | Solution | Answer. Evaluated as pass@1.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'HuggingFaceH4/MATH-500'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'MATH500'¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n {Question}\n\n Remember to put your answer in $\\boxed{{answer}}$\n\n where [answer] is just the final number or expression that solves the problem.'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATH500Minerva(num_fewshot=0)[source]¶
Bases:
MATHMinervaMATH-500 with Minerva-style prompt and scoring (OLMES minerva_math_500 parity). Uses HuggingFaceH4/MATH-500 which has a single ‘default’ config (no subject splits).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'HuggingFaceH4/MATH-500'¶
- FEWSHOT_SPLIT: str = 'test'¶
- NAME: str = 'MATH500Minerva'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHLvl5(num_fewshot=0)[source]¶
Bases:
MATH- Parameters:
num_fewshot (int)
- NAME: str = 'Math Lvl 5'¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHMinerva(num_fewshot=0)[source]¶
Bases:
MATHMinervaEvalHarnessMATH with Minerva-style prompt and relaxed final-answer string matching. Same as MATHMinervaEvalHarness but allows flexible whitespace and case for variations of “(The )Final Answer: The (final )answer is …( I hope it is correct.)”, where the parenthesized parts are optional.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletionRelaxed'>]¶
- NAME: str = 'MATHMinerva'¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHMinervaBPB(num_fewshot=0)[source]¶
Bases:
MATHReasoningMATH (Hendrycks) with Minerva-style prompt, evaluated via loglikelihood of the gold answer string (bits-per-byte). Same prompt as MATHMinerva; scores P(normalized_gold_answer | prompt).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'EleutherAI/hendrycks_math'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'MATHMinervaBPB'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHMinervaEvalHarness(num_fewshot=0)[source]¶
Bases:
MATHReasoningMATH with Minerva-style prompt and scoring (lm-evaluation-harness / oe_eval parity). Uses strict final-answer string matching: “Final Answer: The final answer is … I hope it is correct.” Prompt: “Problem:\n” + problem + “\n\n” + “Solution:”. Gold: normalized_gold_from_solution(solution). Metrics: Exact Match, Exact Match (Flex) via MathMinervaCompletion.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'EleutherAI/hendrycks_math'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_minerva_completion.MathMinervaCompletion'>]¶
- NAME: str = 'MATHMinervaEvalHarness'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHReasoning(num_fewshot=0)[source]¶
Bases:
BaseTask[str]Base class for the math reasoning tasks in this module (AIME2024/2025/2026, GSM8KReasoning, MATH, MATH500 and the Minerva variants); defines the shared completion response type and the default “Answer:” extraction pattern.
- Parameters:
num_fewshot (int)
- ANSWER_PATTERN = '(?i)Answer\\s*:\\s*(.*)'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>]¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.mbpp module¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP(num_fewshot=0)[source]¶
Bases:
BaseTask[str]MBPP provides both the problem statement and the test cases upfront. It says, “Here’s the problem and here are the tests; write code that passes them.” Note that LLMs can cheat and write code that merely passes the tests without solving the given problem.
MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only gives you the problem statement and function signature initially. It says, “Here’s the problem and function signature; write code, then we’ll run tests later.”
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google-research-datasets/mbpp'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]¶
- NAME: str = 'MBPP'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['full']¶
- class eval_framework.tasks.benchmarks.mbpp.MBPPBPB(num_fewshot=0)[source]¶
Bases:
MBPPMBPP variant that scores loglikelihood of the gold reference code. Reports bits-per-byte on the reference solution.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'MBPP BPB'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- class eval_framework.tasks.benchmarks.mbpp.MBPPMetricContext(**data)[source]¶
Bases:
BaseMetricContext- Parameters:
tests_code (str)
extra_data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- tests_code: str¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP_OLMES(num_fewshot=3)[source]¶
Bases:
MBPPMBPP OLMES variant replicating oe_eval’s
mbpp:3shot::olmo3:n32:v2. Uses the EvalPlus prompt format with 3 hardcoded fewshot examples from the original MBPP “prompt” split (matching oe_eval’s ordering). Each prompt shows one test case (the first) instead of all.
Recommended EvalConfig settings for full replication (sketched below):
split: test
num_fewshot: 3 (hardcoded, prompt split)
metric: pass_at_1
temperature: 0.6
top_p: 0.6
repeats: 32
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'test'¶
- NAME: str = 'MBPP_OLMES'¶
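As with HumanEval_OLMES above, the recommended settings sketch out roughly as follows in plain-dict form; the exact EvalConfig constructor fields are not reproduced in this entry, so treat the key names as illustrative:

    # Illustrative key names; only the values come from the entry above.
    config_kwargs = dict(
        task_name="MBPP_OLMES",
        num_fewshot=3,  # hardcoded examples from the MBPP 'prompt' split
        repeats=32,     # 32 samples per problem, scored as pass@1
        llm_args={"sampling_params": {"temperature": 0.6, "top_p": 0.6}},
    )
    # Evaluated on the test split.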
- class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS(num_fewshot=0)[source]¶
Bases:
MBPPMBPP provides both the problem statement and the test cases upfront. It says, “Here’s the problem and here are the tests; write code that passes them.” Note that LLMs can cheat and write code that merely passes the tests without solving the given problem.
MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only gives you the problem statement and function signature initially. It says, “Here’s the problem and function signature; write code, then we’ll run tests later.”
- Parameters:
num_fewshot (int)
- NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS'¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS_SANITIZED(num_fewshot=0)[source]¶
Bases:
MBPP_PROMPT_WITHOUT_TESTS- Parameters:
num_fewshot (int)
- NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS_SANITIZED'¶
- SUBJECTS: list[SubjectType] = ['sanitized']¶
eval_framework.tasks.benchmarks.medqa module¶
MedQA (English): Open-domain medical question answering from medical exams.
- class eval_framework.tasks.benchmarks.medqa.MedQACloze(num_fewshot=0)[source]¶
Bases:
BaseTask[str]MedQA cloze (loglikelihood over choice text).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'davidheineman/medqa-en'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'MedQACloze'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.medqa.MedQAMC(num_fewshot=0)[source]¶
Bases:
MedQAClozeMedQA multiple choice (loglikelihood over A/B/C/D/…).
- Parameters:
num_fewshot (int)
- NAME: str = 'MedQAMC'¶
eval_framework.tasks.benchmarks.mmlu module¶
- class eval_framework.tasks.benchmarks.mmlu.FullTextMMLU(num_fewshot=0)[source]¶
Bases:
MMLUMMLU variant in which the model is expected to reproduce the full choice text rather than just the choice key.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'Full Text MMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'answers', 'A', 'B', 'C', 'D']¶
- class eval_framework.tasks.benchmarks.mmlu.MMLU(num_fewshot=0)[source]¶
Bases:
BaseTask[str]MMLU dataset: https://huggingface.co/datasets/cais/mmlu
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'cais/mmlu'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']¶
- class eval_framework.tasks.benchmarks.mmlu.MMLU_COT(num_fewshot=0)[source]¶
Bases:
MMLUMMLU dataset with an instruction to summarize reasoning and conclude with the answer; inspired by https://arxiv.org/pdf/2411.15124 (Table 44). An answer-extraction sketch follows this entry.
- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is: ([ABCD])')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'MMLU_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
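The extraction sketched here uses the compiled ANS_RE shown above to reduce a chain-of-thought completion to a single answer letter; the fallback behavior is an assumption, not taken from the source:

    import re

    ANS_RE = re.compile("Therefore, the answer is: ([ABCD])")

    def extract_letter(completion: str) -> str:
        match = ANS_RE.search(completion)
        # Assumed fallback: return the raw completion when no conclusion line is found.
        return match.group(1) if match else completion.strip()

    print(extract_letter("Mitochondria synthesize ATP. Therefore, the answer is: B"))  # -> 'B'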
- class eval_framework.tasks.benchmarks.mmlu.MMLU_IDK(num_fewshot=0)[source]¶
Bases:
MMLU- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'MMLU_IDK'¶
eval_framework.tasks.benchmarks.mmlu_de module¶
- class eval_framework.tasks.benchmarks.mmlu_de.MMLU_DE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]MMLU DE dataset: https://huggingface.co/datasets/LeoLM/MMLU_de
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/MMLU_de'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- HF_REVISION: str | None = '11433b408001dd26444c7e666cc536e0b8907ca5'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU_DE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']¶
eval_framework.tasks.benchmarks.mmlu_pro module¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO(num_fewshot=0)[source]¶
Bases:
BaseTask[str]MMLU_PRO dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'TIGER-Lab/MMLU-Pro'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU Pro'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['engineering', 'physics', 'psychology', 'chemistry', 'biology', 'law', 'philosophy', 'computer science', 'other', 'economics', 'business', 'history', 'math', 'health']¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_COT(num_fewshot=0)[source]¶
Bases:
MMLU_PRO- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'MMLU_PRO_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_IDK(num_fewshot=0)[source]¶
Bases:
MMLU_PRO- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'MMLU Pro_IDK'¶
eval_framework.tasks.benchmarks.mmmlu module¶
- class eval_framework.tasks.benchmarks.mmmlu.MMMLU(num_fewshot=0)[source]¶
Bases:
BaseTask[tuple[str,str]]MMMLU dataset: https://huggingface.co/datasets/openai/MMMLU
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/MMMLU'¶
- FEWSHOT_SPLIT: str = 'test'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('AR', 'abstract_algebra')": Language.ARB, "('AR', 'anatomy')": Language.ARB, "('AR', 'astronomy')": Language.ARB, "('AR', 'business_ethics')": Language.ARB, "('AR', 'clinical_knowledge')": Language.ARB, "('AR', 'college_biology')": Language.ARB, "('AR', 'college_chemistry')": Language.ARB, "('AR', 'college_computer_science')": Language.ARB, "('AR', 'college_mathematics')": Language.ARB, "('AR', 'college_medicine')": Language.ARB, "('AR', 'college_physics')": Language.ARB, "('AR', 'computer_security')": Language.ARB, "('AR', 'conceptual_physics')": Language.ARB, "('AR', 'econometrics')": Language.ARB, "('AR', 'electrical_engineering')": Language.ARB, "('AR', 'elementary_mathematics')": Language.ARB, "('AR', 'formal_logic')": Language.ARB, "('AR', 'global_facts')": Language.ARB, "('AR', 'high_school_biology')": Language.ARB, "('AR', 'high_school_chemistry')": Language.ARB, "('AR', 'high_school_computer_science')": Language.ARB, "('AR', 'high_school_european_history')": Language.ARB, "('AR', 'high_school_geography')": Language.ARB, "('AR', 'high_school_government_and_politics')": Language.ARB, "('AR', 'high_school_macroeconomics')": Language.ARB, "('AR', 'high_school_mathematics')": Language.ARB, "('AR', 'high_school_microeconomics')": Language.ARB, "('AR', 'high_school_physics')": Language.ARB, "('AR', 'high_school_psychology')": Language.ARB, "('AR', 'high_school_statistics')": Language.ARB, "('AR', 'high_school_us_history')": Language.ARB, "('AR', 'high_school_world_history')": Language.ARB, "('AR', 'human_aging')": Language.ARB, "('AR', 'human_sexuality')": Language.ARB, "('AR', 'international_law')": Language.ARB, "('AR', 'jurisprudence')": Language.ARB, "('AR', 'logical_fallacies')": Language.ARB, "('AR', 'machine_learning')": Language.ARB, "('AR', 'management')": Language.ARB, "('AR', 'marketing')": Language.ARB, "('AR', 'medical_genetics')": Language.ARB, "('AR', 'miscellaneous')": Language.ARB, "('AR', 'moral_disputes')": Language.ARB, "('AR', 'moral_scenarios')": Language.ARB, "('AR', 'nutrition')": Language.ARB, "('AR', 'philosophy')": Language.ARB, "('AR', 'prehistory')": Language.ARB, "('AR', 'professional_accounting')": Language.ARB, "('AR', 'professional_law')": Language.ARB, "('AR', 'professional_medicine')": Language.ARB, "('AR', 'professional_psychology')": Language.ARB, "('AR', 'public_relations')": Language.ARB, "('AR', 'security_studies')": Language.ARB, "('AR', 'sociology')": Language.ARB, "('AR', 'us_foreign_policy')": Language.ARB, "('AR', 'virology')": Language.ARB, "('AR', 'world_religions')": Language.ARB, "('DE', 'abstract_algebra')": Language.DEU, "('DE', 'anatomy')": Language.DEU, "('DE', 'astronomy')": Language.DEU, "('DE', 'business_ethics')": Language.DEU, "('DE', 'clinical_knowledge')": Language.DEU, "('DE', 'college_biology')": Language.DEU, "('DE', 'college_chemistry')": Language.DEU, "('DE', 'college_computer_science')": Language.DEU, "('DE', 'college_mathematics')": Language.DEU, "('DE', 'college_medicine')": Language.DEU, "('DE', 'college_physics')": Language.DEU, "('DE', 'computer_security')": Language.DEU, "('DE', 'conceptual_physics')": Language.DEU, "('DE', 'econometrics')": Language.DEU, "('DE', 'electrical_engineering')": Language.DEU, "('DE', 'elementary_mathematics')": Language.DEU, "('DE', 'formal_logic')": Language.DEU, "('DE', 'global_facts')": Language.DEU, "('DE', 'high_school_biology')": Language.DEU, "('DE', 'high_school_chemistry')": 
Language.DEU, "('DE', 'high_school_computer_science')": Language.DEU, "('DE', 'high_school_european_history')": Language.DEU, "('DE', 'high_school_geography')": Language.DEU, "('DE', 'high_school_government_and_politics')": Language.DEU, "('DE', 'high_school_macroeconomics')": Language.DEU, "('DE', 'high_school_mathematics')": Language.DEU, "('DE', 'high_school_microeconomics')": Language.DEU, "('DE', 'high_school_physics')": Language.DEU, "('DE', 'high_school_psychology')": Language.DEU, "('DE', 'high_school_statistics')": Language.DEU, "('DE', 'high_school_us_history')": Language.DEU, "('DE', 'high_school_world_history')": Language.DEU, "('DE', 'human_aging')": Language.DEU, "('DE', 'human_sexuality')": Language.DEU, "('DE', 'international_law')": Language.DEU, "('DE', 'jurisprudence')": Language.DEU, "('DE', 'logical_fallacies')": Language.DEU, "('DE', 'machine_learning')": Language.DEU, "('DE', 'management')": Language.DEU, "('DE', 'marketing')": Language.DEU, "('DE', 'medical_genetics')": Language.DEU, "('DE', 'miscellaneous')": Language.DEU, "('DE', 'moral_disputes')": Language.DEU, "('DE', 'moral_scenarios')": Language.DEU, "('DE', 'nutrition')": Language.DEU, "('DE', 'philosophy')": Language.DEU, "('DE', 'prehistory')": Language.DEU, "('DE', 'professional_accounting')": Language.DEU, "('DE', 'professional_law')": Language.DEU, "('DE', 'professional_medicine')": Language.DEU, "('DE', 'professional_psychology')": Language.DEU, "('DE', 'public_relations')": Language.DEU, "('DE', 'security_studies')": Language.DEU, "('DE', 'sociology')": Language.DEU, "('DE', 'us_foreign_policy')": Language.DEU, "('DE', 'virology')": Language.DEU, "('DE', 'world_religions')": Language.DEU, "('ES', 'abstract_algebra')": Language.SPA, "('ES', 'anatomy')": Language.SPA, "('ES', 'astronomy')": Language.SPA, "('ES', 'business_ethics')": Language.SPA, "('ES', 'clinical_knowledge')": Language.SPA, "('ES', 'college_biology')": Language.SPA, "('ES', 'college_chemistry')": Language.SPA, "('ES', 'college_computer_science')": Language.SPA, "('ES', 'college_mathematics')": Language.SPA, "('ES', 'college_medicine')": Language.SPA, "('ES', 'college_physics')": Language.SPA, "('ES', 'computer_security')": Language.SPA, "('ES', 'conceptual_physics')": Language.SPA, "('ES', 'econometrics')": Language.SPA, "('ES', 'electrical_engineering')": Language.SPA, "('ES', 'elementary_mathematics')": Language.SPA, "('ES', 'formal_logic')": Language.SPA, "('ES', 'global_facts')": Language.SPA, "('ES', 'high_school_biology')": Language.SPA, "('ES', 'high_school_chemistry')": Language.SPA, "('ES', 'high_school_computer_science')": Language.SPA, "('ES', 'high_school_european_history')": Language.SPA, "('ES', 'high_school_geography')": Language.SPA, "('ES', 'high_school_government_and_politics')": Language.SPA, "('ES', 'high_school_macroeconomics')": Language.SPA, "('ES', 'high_school_mathematics')": Language.SPA, "('ES', 'high_school_microeconomics')": Language.SPA, "('ES', 'high_school_physics')": Language.SPA, "('ES', 'high_school_psychology')": Language.SPA, "('ES', 'high_school_statistics')": Language.SPA, "('ES', 'high_school_us_history')": Language.SPA, "('ES', 'high_school_world_history')": Language.SPA, "('ES', 'human_aging')": Language.SPA, "('ES', 'human_sexuality')": Language.SPA, "('ES', 'international_law')": Language.SPA, "('ES', 'jurisprudence')": Language.SPA, "('ES', 'logical_fallacies')": Language.SPA, "('ES', 'machine_learning')": Language.SPA, "('ES', 'management')": Language.SPA, "('ES', 'marketing')": 
Language.SPA, "('ES', 'medical_genetics')": Language.SPA, "('ES', 'miscellaneous')": Language.SPA, "('ES', 'moral_disputes')": Language.SPA, "('ES', 'moral_scenarios')": Language.SPA, "('ES', 'nutrition')": Language.SPA, "('ES', 'philosophy')": Language.SPA, "('ES', 'prehistory')": Language.SPA, "('ES', 'professional_accounting')": Language.SPA, "('ES', 'professional_law')": Language.SPA, "('ES', 'professional_medicine')": Language.SPA, "('ES', 'professional_psychology')": Language.SPA, "('ES', 'public_relations')": Language.SPA, "('ES', 'security_studies')": Language.SPA, "('ES', 'sociology')": Language.SPA, "('ES', 'us_foreign_policy')": Language.SPA, "('ES', 'virology')": Language.SPA, "('ES', 'world_religions')": Language.SPA, "('FR', 'abstract_algebra')": Language.FRA, "('FR', 'anatomy')": Language.FRA, "('FR', 'astronomy')": Language.FRA, "('FR', 'business_ethics')": Language.FRA, "('FR', 'clinical_knowledge')": Language.FRA, "('FR', 'college_biology')": Language.FRA, "('FR', 'college_chemistry')": Language.FRA, "('FR', 'college_computer_science')": Language.FRA, "('FR', 'college_mathematics')": Language.FRA, "('FR', 'college_medicine')": Language.FRA, "('FR', 'college_physics')": Language.FRA, "('FR', 'computer_security')": Language.FRA, "('FR', 'conceptual_physics')": Language.FRA, "('FR', 'econometrics')": Language.FRA, "('FR', 'electrical_engineering')": Language.FRA, "('FR', 'elementary_mathematics')": Language.FRA, "('FR', 'formal_logic')": Language.FRA, "('FR', 'global_facts')": Language.FRA, "('FR', 'high_school_biology')": Language.FRA, "('FR', 'high_school_chemistry')": Language.FRA, "('FR', 'high_school_computer_science')": Language.FRA, "('FR', 'high_school_european_history')": Language.FRA, "('FR', 'high_school_geography')": Language.FRA, "('FR', 'high_school_government_and_politics')": Language.FRA, "('FR', 'high_school_macroeconomics')": Language.FRA, "('FR', 'high_school_mathematics')": Language.FRA, "('FR', 'high_school_microeconomics')": Language.FRA, "('FR', 'high_school_physics')": Language.FRA, "('FR', 'high_school_psychology')": Language.FRA, "('FR', 'high_school_statistics')": Language.FRA, "('FR', 'high_school_us_history')": Language.FRA, "('FR', 'high_school_world_history')": Language.FRA, "('FR', 'human_aging')": Language.FRA, "('FR', 'human_sexuality')": Language.FRA, "('FR', 'international_law')": Language.FRA, "('FR', 'jurisprudence')": Language.FRA, "('FR', 'logical_fallacies')": Language.FRA, "('FR', 'machine_learning')": Language.FRA, "('FR', 'management')": Language.FRA, "('FR', 'marketing')": Language.FRA, "('FR', 'medical_genetics')": Language.FRA, "('FR', 'miscellaneous')": Language.FRA, "('FR', 'moral_disputes')": Language.FRA, "('FR', 'moral_scenarios')": Language.FRA, "('FR', 'nutrition')": Language.FRA, "('FR', 'philosophy')": Language.FRA, "('FR', 'prehistory')": Language.FRA, "('FR', 'professional_accounting')": Language.FRA, "('FR', 'professional_law')": Language.FRA, "('FR', 'professional_medicine')": Language.FRA, "('FR', 'professional_psychology')": Language.FRA, "('FR', 'public_relations')": Language.FRA, "('FR', 'security_studies')": Language.FRA, "('FR', 'sociology')": Language.FRA, "('FR', 'us_foreign_policy')": Language.FRA, "('FR', 'virology')": Language.FRA, "('FR', 'world_religions')": Language.FRA, "('IT', 'abstract_algebra')": Language.ITA, "('IT', 'anatomy')": Language.ITA, "('IT', 'astronomy')": Language.ITA, "('IT', 'business_ethics')": Language.ITA, "('IT', 'clinical_knowledge')": Language.ITA, "('IT', 'college_biology')": 
Language.ITA, "('IT', 'college_chemistry')": Language.ITA, "('IT', 'college_computer_science')": Language.ITA, "('IT', 'college_mathematics')": Language.ITA, "('IT', 'college_medicine')": Language.ITA, "('IT', 'college_physics')": Language.ITA, "('IT', 'computer_security')": Language.ITA, "('IT', 'conceptual_physics')": Language.ITA, "('IT', 'econometrics')": Language.ITA, "('IT', 'electrical_engineering')": Language.ITA, "('IT', 'elementary_mathematics')": Language.ITA, "('IT', 'formal_logic')": Language.ITA, "('IT', 'global_facts')": Language.ITA, "('IT', 'high_school_biology')": Language.ITA, "('IT', 'high_school_chemistry')": Language.ITA, "('IT', 'high_school_computer_science')": Language.ITA, "('IT', 'high_school_european_history')": Language.ITA, "('IT', 'high_school_geography')": Language.ITA, "('IT', 'high_school_government_and_politics')": Language.ITA, "('IT', 'high_school_macroeconomics')": Language.ITA, "('IT', 'high_school_mathematics')": Language.ITA, "('IT', 'high_school_microeconomics')": Language.ITA, "('IT', 'high_school_physics')": Language.ITA, "('IT', 'high_school_psychology')": Language.ITA, "('IT', 'high_school_statistics')": Language.ITA, "('IT', 'high_school_us_history')": Language.ITA, "('IT', 'high_school_world_history')": Language.ITA, "('IT', 'human_aging')": Language.ITA, "('IT', 'human_sexuality')": Language.ITA, "('IT', 'international_law')": Language.ITA, "('IT', 'jurisprudence')": Language.ITA, "('IT', 'logical_fallacies')": Language.ITA, "('IT', 'machine_learning')": Language.ITA, "('IT', 'management')": Language.ITA, "('IT', 'marketing')": Language.ITA, "('IT', 'medical_genetics')": Language.ITA, "('IT', 'miscellaneous')": Language.ITA, "('IT', 'moral_disputes')": Language.ITA, "('IT', 'moral_scenarios')": Language.ITA, "('IT', 'nutrition')": Language.ITA, "('IT', 'philosophy')": Language.ITA, "('IT', 'prehistory')": Language.ITA, "('IT', 'professional_accounting')": Language.ITA, "('IT', 'professional_law')": Language.ITA, "('IT', 'professional_medicine')": Language.ITA, "('IT', 'professional_psychology')": Language.ITA, "('IT', 'public_relations')": Language.ITA, "('IT', 'security_studies')": Language.ITA, "('IT', 'sociology')": Language.ITA, "('IT', 'us_foreign_policy')": Language.ITA, "('IT', 'virology')": Language.ITA, "('IT', 'world_religions')": Language.ITA, "('PT', 'abstract_algebra')": Language.POR, "('PT', 'anatomy')": Language.POR, "('PT', 'astronomy')": Language.POR, "('PT', 'business_ethics')": Language.POR, "('PT', 'clinical_knowledge')": Language.POR, "('PT', 'college_biology')": Language.POR, "('PT', 'college_chemistry')": Language.POR, "('PT', 'college_computer_science')": Language.POR, "('PT', 'college_mathematics')": Language.POR, "('PT', 'college_medicine')": Language.POR, "('PT', 'college_physics')": Language.POR, "('PT', 'computer_security')": Language.POR, "('PT', 'conceptual_physics')": Language.POR, "('PT', 'econometrics')": Language.POR, "('PT', 'electrical_engineering')": Language.POR, "('PT', 'elementary_mathematics')": Language.POR, "('PT', 'formal_logic')": Language.POR, "('PT', 'global_facts')": Language.POR, "('PT', 'high_school_biology')": Language.POR, "('PT', 'high_school_chemistry')": Language.POR, "('PT', 'high_school_computer_science')": Language.POR, "('PT', 'high_school_european_history')": Language.POR, "('PT', 'high_school_geography')": Language.POR, "('PT', 'high_school_government_and_politics')": Language.POR, "('PT', 'high_school_macroeconomics')": Language.POR, "('PT', 'high_school_mathematics')": 
Language.POR, "('PT', 'high_school_microeconomics')": Language.POR, "('PT', 'high_school_physics')": Language.POR, "('PT', 'high_school_psychology')": Language.POR, "('PT', 'high_school_statistics')": Language.POR, "('PT', 'high_school_us_history')": Language.POR, "('PT', 'high_school_world_history')": Language.POR, "('PT', 'human_aging')": Language.POR, "('PT', 'human_sexuality')": Language.POR, "('PT', 'international_law')": Language.POR, "('PT', 'jurisprudence')": Language.POR, "('PT', 'logical_fallacies')": Language.POR, "('PT', 'machine_learning')": Language.POR, "('PT', 'management')": Language.POR, "('PT', 'marketing')": Language.POR, "('PT', 'medical_genetics')": Language.POR, "('PT', 'miscellaneous')": Language.POR, "('PT', 'moral_disputes')": Language.POR, "('PT', 'moral_scenarios')": Language.POR, "('PT', 'nutrition')": Language.POR, "('PT', 'philosophy')": Language.POR, "('PT', 'prehistory')": Language.POR, "('PT', 'professional_accounting')": Language.POR, "('PT', 'professional_law')": Language.POR, "('PT', 'professional_medicine')": Language.POR, "('PT', 'professional_psychology')": Language.POR, "('PT', 'public_relations')": Language.POR, "('PT', 'security_studies')": Language.POR, "('PT', 'sociology')": Language.POR, "('PT', 'us_foreign_policy')": Language.POR, "('PT', 'virology')": Language.POR, "('PT', 'world_religions')": Language.POR}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = [('FR_FR', 'abstract_algebra'), ('FR_FR', 'anatomy'), ('FR_FR', 'astronomy'), ('FR_FR', 'business_ethics'), ('FR_FR', 'clinical_knowledge'), ('FR_FR', 'college_biology'), ('FR_FR', 'college_chemistry'), ('FR_FR', 'college_computer_science'), ('FR_FR', 'college_mathematics'), ('FR_FR', 'college_medicine'), ('FR_FR', 'college_physics'), ('FR_FR', 'computer_security'), ('FR_FR', 'conceptual_physics'), ('FR_FR', 'econometrics'), ('FR_FR', 'electrical_engineering'), ('FR_FR', 'elementary_mathematics'), ('FR_FR', 'formal_logic'), ('FR_FR', 'global_facts'), ('FR_FR', 'high_school_biology'), ('FR_FR', 'high_school_chemistry'), ('FR_FR', 'high_school_computer_science'), ('FR_FR', 'high_school_european_history'), ('FR_FR', 'high_school_geography'), ('FR_FR', 'high_school_government_and_politics'), ('FR_FR', 'high_school_macroeconomics'), ('FR_FR', 'high_school_mathematics'), ('FR_FR', 'high_school_microeconomics'), ('FR_FR', 'high_school_physics'), ('FR_FR', 'high_school_psychology'), ('FR_FR', 'high_school_statistics'), ('FR_FR', 'high_school_us_history'), ('FR_FR', 'high_school_world_history'), ('FR_FR', 'human_aging'), ('FR_FR', 'human_sexuality'), ('FR_FR', 'international_law'), ('FR_FR', 'jurisprudence'), ('FR_FR', 'logical_fallacies'), ('FR_FR', 'machine_learning'), ('FR_FR', 'management'), ('FR_FR', 'marketing'), ('FR_FR', 'medical_genetics'), ('FR_FR', 'miscellaneous'), ('FR_FR', 'moral_disputes'), ('FR_FR', 'moral_scenarios'), ('FR_FR', 'nutrition'), ('FR_FR', 'philosophy'), ('FR_FR', 'prehistory'), ('FR_FR', 'professional_accounting'), ('FR_FR', 'professional_law'), ('FR_FR', 'professional_medicine'), ('FR_FR', 'professional_psychology'), ('FR_FR', 'public_relations'), ('FR_FR', 'security_studies'), ('FR_FR', 'sociology'), ('FR_FR', 'us_foreign_policy'), ('FR_FR', 'virology'), ('FR_FR', 'world_religions'), ('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 
'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions'), ('ES_LA', 'abstract_algebra'), ('ES_LA', 'anatomy'), ('ES_LA', 'astronomy'), ('ES_LA', 'business_ethics'), ('ES_LA', 'clinical_knowledge'), ('ES_LA', 'college_biology'), ('ES_LA', 'college_chemistry'), ('ES_LA', 'college_computer_science'), ('ES_LA', 'college_mathematics'), ('ES_LA', 'college_medicine'), ('ES_LA', 'college_physics'), ('ES_LA', 'computer_security'), ('ES_LA', 'conceptual_physics'), ('ES_LA', 'econometrics'), ('ES_LA', 'electrical_engineering'), ('ES_LA', 'elementary_mathematics'), ('ES_LA', 'formal_logic'), ('ES_LA', 'global_facts'), ('ES_LA', 'high_school_biology'), ('ES_LA', 'high_school_chemistry'), ('ES_LA', 'high_school_computer_science'), ('ES_LA', 'high_school_european_history'), ('ES_LA', 'high_school_geography'), ('ES_LA', 'high_school_government_and_politics'), ('ES_LA', 'high_school_macroeconomics'), ('ES_LA', 'high_school_mathematics'), ('ES_LA', 'high_school_microeconomics'), ('ES_LA', 'high_school_physics'), ('ES_LA', 'high_school_psychology'), ('ES_LA', 'high_school_statistics'), ('ES_LA', 'high_school_us_history'), ('ES_LA', 'high_school_world_history'), ('ES_LA', 'human_aging'), ('ES_LA', 'human_sexuality'), ('ES_LA', 'international_law'), ('ES_LA', 'jurisprudence'), ('ES_LA', 'logical_fallacies'), ('ES_LA', 'machine_learning'), ('ES_LA', 'management'), ('ES_LA', 'marketing'), ('ES_LA', 'medical_genetics'), ('ES_LA', 'miscellaneous'), ('ES_LA', 'moral_disputes'), ('ES_LA', 'moral_scenarios'), ('ES_LA', 'nutrition'), ('ES_LA', 'philosophy'), ('ES_LA', 'prehistory'), ('ES_LA', 'professional_accounting'), ('ES_LA', 'professional_law'), ('ES_LA', 'professional_medicine'), ('ES_LA', 'professional_psychology'), ('ES_LA', 'public_relations'), ('ES_LA', 'security_studies'), ('ES_LA', 'sociology'), ('ES_LA', 'us_foreign_policy'), ('ES_LA', 'virology'), ('ES_LA', 'world_religions'), ('IT_IT', 'abstract_algebra'), ('IT_IT', 'anatomy'), ('IT_IT', 'astronomy'), ('IT_IT', 'business_ethics'), ('IT_IT', 'clinical_knowledge'), ('IT_IT', 'college_biology'), ('IT_IT', 'college_chemistry'), ('IT_IT', 'college_computer_science'), ('IT_IT', 'college_mathematics'), ('IT_IT', 'college_medicine'), ('IT_IT', 'college_physics'), ('IT_IT', 'computer_security'), ('IT_IT', 'conceptual_physics'), ('IT_IT', 'econometrics'), ('IT_IT', 'electrical_engineering'), ('IT_IT', 'elementary_mathematics'), ('IT_IT', 'formal_logic'), ('IT_IT', 'global_facts'), ('IT_IT', 'high_school_biology'), ('IT_IT', 'high_school_chemistry'), ('IT_IT', 'high_school_computer_science'), ('IT_IT', 'high_school_european_history'), ('IT_IT', 'high_school_geography'), ('IT_IT', 'high_school_government_and_politics'), ('IT_IT', 'high_school_macroeconomics'), ('IT_IT', 'high_school_mathematics'), ('IT_IT', 'high_school_microeconomics'), ('IT_IT', 'high_school_physics'), ('IT_IT', 'high_school_psychology'), ('IT_IT', 'high_school_statistics'), ('IT_IT', 'high_school_us_history'), ('IT_IT', 'high_school_world_history'), ('IT_IT', 'human_aging'), ('IT_IT', 'human_sexuality'), ('IT_IT', 'international_law'), ('IT_IT', 'jurisprudence'), ('IT_IT', 'logical_fallacies'), ('IT_IT', 'machine_learning'), ('IT_IT', 'management'), ('IT_IT', 'marketing'), ('IT_IT', 'medical_genetics'), ('IT_IT', 'miscellaneous'), ('IT_IT', 'moral_disputes'), ('IT_IT', 'moral_scenarios'), ('IT_IT', 'nutrition'), ('IT_IT', 'philosophy'), ('IT_IT', 'prehistory'), ('IT_IT', 
'professional_accounting'), ('IT_IT', 'professional_law'), ('IT_IT', 'professional_medicine'), ('IT_IT', 'professional_psychology'), ('IT_IT', 'public_relations'), ('IT_IT', 'security_studies'), ('IT_IT', 'sociology'), ('IT_IT', 'us_foreign_policy'), ('IT_IT', 'virology'), ('IT_IT', 'world_religions'), ('PT_BR', 'abstract_algebra'), ('PT_BR', 'anatomy'), ('PT_BR', 'astronomy'), ('PT_BR', 'business_ethics'), ('PT_BR', 'clinical_knowledge'), ('PT_BR', 'college_biology'), ('PT_BR', 'college_chemistry'), ('PT_BR', 'college_computer_science'), ('PT_BR', 'college_mathematics'), ('PT_BR', 'college_medicine'), ('PT_BR', 'college_physics'), ('PT_BR', 'computer_security'), ('PT_BR', 'conceptual_physics'), ('PT_BR', 'econometrics'), ('PT_BR', 'electrical_engineering'), ('PT_BR', 'elementary_mathematics'), ('PT_BR', 'formal_logic'), ('PT_BR', 'global_facts'), ('PT_BR', 'high_school_biology'), ('PT_BR', 'high_school_chemistry'), ('PT_BR', 'high_school_computer_science'), ('PT_BR', 'high_school_european_history'), ('PT_BR', 'high_school_geography'), ('PT_BR', 'high_school_government_and_politics'), ('PT_BR', 'high_school_macroeconomics'), ('PT_BR', 'high_school_mathematics'), ('PT_BR', 'high_school_microeconomics'), ('PT_BR', 'high_school_physics'), ('PT_BR', 'high_school_psychology'), ('PT_BR', 'high_school_statistics'), ('PT_BR', 'high_school_us_history'), ('PT_BR', 'high_school_world_history'), ('PT_BR', 'human_aging'), ('PT_BR', 'human_sexuality'), ('PT_BR', 'international_law'), ('PT_BR', 'jurisprudence'), ('PT_BR', 'logical_fallacies'), ('PT_BR', 'machine_learning'), ('PT_BR', 'management'), ('PT_BR', 'marketing'), ('PT_BR', 'medical_genetics'), ('PT_BR', 'miscellaneous'), ('PT_BR', 'moral_disputes'), ('PT_BR', 'moral_scenarios'), ('PT_BR', 'nutrition'), ('PT_BR', 'philosophy'), ('PT_BR', 'prehistory'), ('PT_BR', 'professional_accounting'), ('PT_BR', 'professional_law'), ('PT_BR', 'professional_medicine'), ('PT_BR', 'professional_psychology'), ('PT_BR', 'public_relations'), ('PT_BR', 'security_studies'), ('PT_BR', 'sociology'), ('PT_BR', 'us_foreign_policy'), ('PT_BR', 'virology'), ('PT_BR', 'world_religions'), ('AR_XY', 'abstract_algebra'), ('AR_XY', 'anatomy'), ('AR_XY', 'astronomy'), ('AR_XY', 'business_ethics'), ('AR_XY', 'clinical_knowledge'), ('AR_XY', 'college_biology'), ('AR_XY', 'college_chemistry'), ('AR_XY', 'college_computer_science'), ('AR_XY', 'college_mathematics'), ('AR_XY', 'college_medicine'), ('AR_XY', 'college_physics'), ('AR_XY', 'computer_security'), ('AR_XY', 'conceptual_physics'), ('AR_XY', 'econometrics'), ('AR_XY', 'electrical_engineering'), ('AR_XY', 'elementary_mathematics'), ('AR_XY', 'formal_logic'), ('AR_XY', 'global_facts'), ('AR_XY', 'high_school_biology'), ('AR_XY', 'high_school_chemistry'), ('AR_XY', 'high_school_computer_science'), ('AR_XY', 'high_school_european_history'), ('AR_XY', 'high_school_geography'), ('AR_XY', 'high_school_government_and_politics'), ('AR_XY', 'high_school_macroeconomics'), ('AR_XY', 'high_school_mathematics'), ('AR_XY', 'high_school_microeconomics'), ('AR_XY', 'high_school_physics'), ('AR_XY', 'high_school_psychology'), ('AR_XY', 'high_school_statistics'), ('AR_XY', 'high_school_us_history'), ('AR_XY', 'high_school_world_history'), ('AR_XY', 'human_aging'), ('AR_XY', 'human_sexuality'), ('AR_XY', 'international_law'), ('AR_XY', 'jurisprudence'), ('AR_XY', 'logical_fallacies'), ('AR_XY', 'machine_learning'), ('AR_XY', 'management'), ('AR_XY', 'marketing'), ('AR_XY', 'medical_genetics'), ('AR_XY', 'miscellaneous'), ('AR_XY', 
'moral_disputes'), ('AR_XY', 'moral_scenarios'), ('AR_XY', 'nutrition'), ('AR_XY', 'philosophy'), ('AR_XY', 'prehistory'), ('AR_XY', 'professional_accounting'), ('AR_XY', 'professional_law'), ('AR_XY', 'professional_medicine'), ('AR_XY', 'professional_psychology'), ('AR_XY', 'public_relations'), ('AR_XY', 'security_studies'), ('AR_XY', 'sociology'), ('AR_XY', 'us_foreign_policy'), ('AR_XY', 'virology'), ('AR_XY', 'world_religions')]¶
- class eval_framework.tasks.benchmarks.mmmlu.MMMLU_GERMAN_COT(num_fewshot=0)[source]¶
Bases:
MMMLU- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('de', 'abstract_algebra')": Language.DEU, "('de', 'anatomy')": Language.DEU, "('de', 'astronomy')": Language.DEU, "('de', 'business_ethics')": Language.DEU, "('de', 'clinical_knowledge')": Language.DEU, "('de', 'college_biology')": Language.DEU, "('de', 'college_chemistry')": Language.DEU, "('de', 'college_computer_science')": Language.DEU, "('de', 'college_mathematics')": Language.DEU, "('de', 'college_medicine')": Language.DEU, "('de', 'college_physics')": Language.DEU, "('de', 'computer_security')": Language.DEU, "('de', 'conceptual_physics')": Language.DEU, "('de', 'econometrics')": Language.DEU, "('de', 'electrical_engineering')": Language.DEU, "('de', 'elementary_mathematics')": Language.DEU, "('de', 'formal_logic')": Language.DEU, "('de', 'global_facts')": Language.DEU, "('de', 'high_school_biology')": Language.DEU, "('de', 'high_school_chemistry')": Language.DEU, "('de', 'high_school_computer_science')": Language.DEU, "('de', 'high_school_european_history')": Language.DEU, "('de', 'high_school_geography')": Language.DEU, "('de', 'high_school_government_and_politics')": Language.DEU, "('de', 'high_school_macroeconomics')": Language.DEU, "('de', 'high_school_mathematics')": Language.DEU, "('de', 'high_school_microeconomics')": Language.DEU, "('de', 'high_school_physics')": Language.DEU, "('de', 'high_school_psychology')": Language.DEU, "('de', 'high_school_statistics')": Language.DEU, "('de', 'high_school_us_history')": Language.DEU, "('de', 'high_school_world_history')": Language.DEU, "('de', 'human_aging')": Language.DEU, "('de', 'human_sexuality')": Language.DEU, "('de', 'international_law')": Language.DEU, "('de', 'jurisprudence')": Language.DEU, "('de', 'logical_fallacies')": Language.DEU, "('de', 'machine_learning')": Language.DEU, "('de', 'management')": Language.DEU, "('de', 'marketing')": Language.DEU, "('de', 'medical_genetics')": Language.DEU, "('de', 'miscellaneous')": Language.DEU, "('de', 'moral_disputes')": Language.DEU, "('de', 'moral_scenarios')": Language.DEU, "('de', 'nutrition')": Language.DEU, "('de', 'philosophy')": Language.DEU, "('de', 'prehistory')": Language.DEU, "('de', 'professional_accounting')": Language.DEU, "('de', 'professional_law')": Language.DEU, "('de', 'professional_medicine')": Language.DEU, "('de', 'professional_psychology')": Language.DEU, "('de', 'public_relations')": Language.DEU, "('de', 'security_studies')": Language.DEU, "('de', 'sociology')": Language.DEU, "('de', 'us_foreign_policy')": Language.DEU, "('de', 'virology')": Language.DEU, "('de', 'world_religions')": Language.DEU}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.GermanCompletionChecker'>]¶
- NAME: str = 'MMMLU_GERMAN_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SUBJECTS: list[SubjectType] = [('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions')]¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
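The ANS_RE pattern above anchors answer extraction for the German chain-of-thought prompt. A minimal sketch of applying such a pattern to a completion; the helper name and fallback behaviour are illustrative, not the framework's actual post_process_generated_completion:

```python
import re

# Pattern reproduced from the ANS_RE attribute above.
ANS_RE = re.compile(r"Daher lautet die Antwort: ([ABCD])")

def extract_choice(completion_text: str) -> str | None:
    # Return the letter following the German answer marker
    # ("Therefore the answer is: X"), or None if the marker is absent.
    match = ANS_RE.search(completion_text)
    return match.group(1) if match else None

print(extract_choice("Die Ableitung ist 2x. Daher lautet die Antwort: B"))  # B
print(extract_choice("Keine Antwort gefunden"))                             # None
```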
eval_framework.tasks.benchmarks.naturalqs_open module¶
- class eval_framework.tasks.benchmarks.naturalqs_open.NaturalQsOpen(num_fewshot=0)[source]¶
Bases:
BaseTask[str]- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google-research-datasets/nq_open'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'NaturalQsOpen'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
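As with every task in this package, the class is constructed with a few-shot count and exposes its configuration through class attributes. A usage sketch, with attribute values taken from the listing above (the surrounding evaluation runner is not shown):

```python
from eval_framework.tasks.benchmarks.naturalqs_open import NaturalQsOpen

task = NaturalQsOpen(num_fewshot=5)
print(task.NAME)           # 'NaturalQsOpen'
print(task.DATASET_PATH)   # 'google-research-datasets/nq_open'
print(task.SAMPLE_SPLIT)   # 'validation'
print(task.FEWSHOT_SPLIT)  # 'train'
```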
- class eval_framework.tasks.benchmarks.naturalqs_open.NaturalQsOpenCloze(num_fewshot=0)[source]¶
Bases:
BaseTask[str]- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/nq-gen2mc'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'NaturalQsOpenCloze'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.naturalqs_open.NaturalQsOpenMC(num_fewshot=0)[source]¶
Bases:
NaturalQsOpenCloze- Parameters:
num_fewshot (int)
- NAME: str = 'NaturalQsOpenMC'¶
- class eval_framework.tasks.benchmarks.naturalqs_open.NaturalQsOpenMC_OLMES(num_fewshot=0)[source]¶
Bases:
NaturalQsOpenMCNaturalQsOpenMC with OLMES-style prompt: space before each label in the prompt (“ A.”, “ B.”, …). A rendering sketch of this convention follows this class entry.
- Parameters:
num_fewshot (int)
- NAME: str = 'NaturalQsOpenMC_OLMES'¶
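Several tasks below ship an _OLMES variant whose only change is this prompt convention: each option label is prefixed with a space, and loglikelihood is scored over the space-prefixed letters rather than the option texts. A hypothetical rendering (question, options, and the "Question:"/"Answer:" framing are invented for illustration):

```python
question = "What is the capital of France?"
options = ["Lyon", "Paris", "Marseille", "Nice"]
labels = ["A", "B", "C", "D"]

prompt_lines = [f"Question: {question}"]
for label, option in zip(labels, options):
    prompt_lines.append(f" {label}. {option}")   # note the leading space
prompt_lines.append("Answer:")
prompt = "\n".join(prompt_lines)

# Loglikelihood is scored over the space-prefixed labels, not the option text.
targets = [f" {label}" for label in labels]       # [' A', ' B', ' C', ' D']
```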
eval_framework.tasks.benchmarks.openbookqa module¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]OpenBookQA dataset: https://huggingface.co/datasets/allenai/openbookqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/openbookqa'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'OpenBookQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['additional']¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_EVAL_HARNESS(num_fewshot=0)[source]¶
Bases:
OPENBOOKQAClosed-book version of OpenBookQA — question only, no supporting fact.
- Parameters:
num_fewshot (int)
- NAME: str = 'OpenBookQAEvalHarness'¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_EVAL_HARNESS_OLMES(num_fewshot=0)[source]¶
Bases:
OPENBOOKQA_EVAL_HARNESSOpenBookQA Eval Harness with OLMES-style prompt: space before each label (“ A.”, “ B.”, …).
- Parameters:
num_fewshot (int)
- NAME: str = 'OpenBookQAEvalHarness_OLMES'¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_IDK(num_fewshot=0)[source]¶
Bases:
OPENBOOKQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'OpenBookQA_IDK'¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_OLMES(num_fewshot=0)[source]¶
Bases:
OPENBOOKQAOpenBookQA with OLMES-style prompt: space before each label in the prompt (“ A.”, “ B.”, …).
- Parameters:
num_fewshot (int)
- NAME: str = 'OpenBookQA_OLMES'¶
eval_framework.tasks.benchmarks.opengptx_eu20 module¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_DE(num_fewshot=0)[source]¶
Bases:
ARCEU20 Benchmarks from the openGPT-X paper:
- paper: https://arxiv.org/abs/2410.08928
- leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
- dataset: https://huggingface.co/datasets/openGPT-X/arcx
entries in ‘challenge_DE’: 1172 test, 299 validation, 198 train
entries in ‘easy_DE’: 2376 test, 570 validation, 197 train
features: [‘id’, ‘question’, ‘choices’, ‘answerKey’],
SUBJECTS = [‘challenge_BG’, ‘easy_BG’, ‘challenge_DA’, ‘easy_DA’, ‘challenge_DE’, ‘easy_DE’, ‘challenge_ET’, ‘easy_ET’, ‘challenge_FI’, ‘easy_FI’, ‘challenge_FR’, ‘easy_FR’, ‘challenge_EL’, ‘easy_EL’, ‘challenge_IT’, ‘easy_IT’, ‘challenge_LV’, ‘easy_LV’, ‘challenge_LT’, ‘easy_LT’, ‘challenge_NL’, ‘easy_NL’, ‘challenge_PL’, ‘easy_PL’, ‘challenge_PT-PT’, ‘easy_PT-PT’, ‘challenge_RO’, ‘easy_RO’, ‘challenge_SV’, ‘easy_SV’, ‘challenge_SK’, ‘easy_SK’, ‘challenge_SL’, ‘easy_SL’, ‘challenge_ES’, ‘easy_ES’, ‘challenge_CS’, ‘easy_CS’, ‘challenge_HU’, ‘easy_HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/arcx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = 'e4c31fa077b82832cc21e614832701603a8ad319'¶
- NAME: str = 'ARC_EU20_DE'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['challenge_DE', 'easy_DE']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_FR(num_fewshot=0)[source]¶
Bases:
ARC- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/arcx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = 'e4c31fa077b82832cc21e614832701603a8ad319'¶
- NAME: str = 'ARC_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['challenge_FR', 'easy_FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_DE(num_fewshot=0)[source]¶
Bases:
GSM8KEvalHarness- https://huggingface.co/datasets/openGPT-X/gsm8kx
- entries in ‘DE’: 1319 test, 104 train
features: [‘question’, ‘answer’, ‘id’],
SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/gsm8kx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = '3ed0f81d31a9013e05d16644aabcc36db50078a9'¶
- NAME: str = 'GSM8K_EU20_DE'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['DE']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_FR(num_fewshot=0)[source]¶
Bases:
GSM8KEvalHarness- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/gsm8kx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = '3ed0f81d31a9013e05d16644aabcc36db50078a9'¶
- NAME: str = 'GSM8K_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_DE(num_fewshot=0)[source]¶
Bases:
HELLASWAG- https://huggingface.co/datasets/openGPT-X/hellaswagx
- entries in ‘DE’: 99 train, 9979 validation
features: [‘ind’, ‘activity_label’, ‘ctx_a’, ‘ctx_b’, ‘ctx’, ‘endings’, ‘source_id’, ‘split’, ‘split_type’, ‘label’],
SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/hellaswagx'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- HF_REVISION: str | None = '7c30407f4f11fa4fada74bd4384ed0fe572ae8f2'¶
- NAME: str = 'HellaSwag_EU20_DE'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['DE']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_FR(num_fewshot=0)[source]¶
Bases:
HELLASWAG- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/hellaswagx'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- HF_REVISION: str | None = '7c30407f4f11fa4fada74bd4384ed0fe572ae8f2'¶
- NAME: str = 'HellaSwag_EU20_FR'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_DE(num_fewshot=0)[source]¶
Bases:
MMLU- https://huggingface.co/datasets/openGPT-X/mmlux
- entries in ‘philosophy_DE’: 311 test, 5 dev, 5 validation
features: [‘question’, ‘choices’, ‘answer’, ‘id’],
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/mmlux'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- HF_REVISION: str | None = '6412d5d5d03a7b31d02f4ba34b787c2e7939a800'¶
- NAME: str = 'MMLU_EU20_DE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D', 'Frage']¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra_DE', 'anatomy_DE', 'astronomy_DE', 'business_ethics_DE', 'clinical_knowledge_DE', 'college_biology_DE', 'college_chemistry_DE', 'college_computer_science_DE', 'college_mathematics_DE', 'college_medicine_DE', 'college_physics_DE', 'computer_security_DE', 'conceptual_physics_DE', 'econometrics_DE', 'electrical_engineering_DE', 'elementary_mathematics_DE', 'formal_logic_DE', 'global_facts_DE', 'high_school_biology_DE', 'high_school_chemistry_DE', 'high_school_computer_science_DE', 'high_school_european_history_DE', 'high_school_geography_DE', 'high_school_government_and_politics_DE', 'high_school_macroeconomics_DE', 'high_school_mathematics_DE', 'high_school_microeconomics_DE', 'high_school_physics_DE', 'high_school_psychology_DE', 'high_school_statistics_DE', 'high_school_us_history_DE', 'high_school_world_history_DE', 'human_aging_DE', 'human_sexuality_DE', 'international_law_DE', 'jurisprudence_DE', 'logical_fallacies_DE', 'machine_learning_DE', 'management_DE', 'marketing_DE', 'medical_genetics_DE', 'miscellaneous_DE', 'moral_disputes_DE', 'moral_scenarios_DE', 'nutrition_DE', 'philosophy_DE', 'prehistory_DE', 'professional_accounting_DE', 'professional_law_DE', 'professional_medicine_DE', 'professional_psychology_DE', 'public_relations_DE', 'security_studies_DE', 'sociology_DE', 'us_foreign_policy_DE', 'virology_DE', 'world_religions_DE']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_FR(num_fewshot=0)[source]¶
Bases:
MMLU- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/mmlux'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- HF_REVISION: str | None = '6412d5d5d03a7b31d02f4ba34b787c2e7939a800'¶
- NAME: str = 'MMLU_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra_FR', 'anatomy_FR', 'astronomy_FR', 'business_ethics_FR', 'clinical_knowledge_FR', 'college_biology_FR', 'college_chemistry_FR', 'college_computer_science_FR', 'college_mathematics_FR', 'college_medicine_FR', 'college_physics_FR', 'computer_security_FR', 'conceptual_physics_FR', 'econometrics_FR', 'electrical_engineering_FR', 'elementary_mathematics_FR', 'formal_logic_FR', 'global_facts_FR', 'high_school_biology_FR', 'high_school_chemistry_FR', 'high_school_computer_science_FR', 'high_school_european_history_FR', 'high_school_geography_FR', 'high_school_government_and_politics_FR', 'high_school_macroeconomics_FR', 'high_school_mathematics_FR', 'high_school_microeconomics_FR', 'high_school_physics_FR', 'high_school_psychology_FR', 'high_school_statistics_FR', 'high_school_us_history_FR', 'high_school_world_history_FR', 'human_aging_FR', 'human_sexuality_FR', 'international_law_FR', 'jurisprudence_FR', 'logical_fallacies_FR', 'machine_learning_FR', 'management_FR', 'marketing_FR', 'medical_genetics_FR', 'miscellaneous_FR', 'moral_disputes_FR', 'moral_scenarios_FR', 'nutrition_FR', 'philosophy_FR', 'prehistory_FR', 'professional_accounting_FR', 'professional_law_FR', 'professional_medicine_FR', 'professional_psychology_FR', 'public_relations_FR', 'security_studies_FR', 'sociology_FR', 'us_foreign_policy_FR', 'virology_FR', 'world_religions_FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_DE(num_fewshot=0)[source]¶
Bases:
TRUTHFULQA- https://huggingface.co/datasets/openGPT-X/truthfulqax
- entries in ‘mc_DE’: 817 validation
features: [‘question’, ‘mc1_targets’, ‘mc2_targets’, ‘id’],
- entries in ‘gen_DE’: 817 validation
features: [‘type’, ‘category’, ‘question’, ‘best_answer’, ‘correct_answers’, ‘incorrect_answers’, ‘source’, ‘id’],
SUBJECTS = [‘mc_BG’, ‘gen_BG’, ‘mc_DA’, ‘gen_DA’, ‘mc_DE’, ‘gen_DE’, ‘mc_ET’, ‘gen_ET’, ‘mc_FI’, ‘gen_FI’, ‘mc_FR’, ‘gen_FR’, ‘mc_EL’, ‘gen_EL’, ‘mc_IT’, ‘gen_IT’, ‘mc_LV’, ‘gen_LV’, ‘mc_LT’, ‘gen_LT’, ‘mc_NL’, ‘gen_NL’, ‘mc_PL’, ‘gen_PL’, ‘mc_PT-PT’, ‘gen_PT-PT’, ‘mc_RO’, ‘gen_RO’, ‘mc_SV’, ‘gen_SV’, ‘mc_SK’, ‘gen_SK’, ‘mc_SL’, ‘gen_SL’, ‘mc_ES’, ‘gen_ES’, ‘mc_CS’, ‘gen_CS’, ‘mc_HU’, ‘gen_HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/truthfulqax'¶
- HF_REVISION: str | None = 'cff042da87dfb8885c357cb1c83194fa6aaf1d49'¶
- NAME: str = 'TruthfulQA_EU20_DE'¶
eval_framework.tasks.benchmarks.pawsx module¶
- class eval_framework.tasks.benchmarks.pawsx.PAWSX(num_fewshot=0)[source]¶
Bases:
BaseTask[str]PAWS-X dataset: https://huggingface.co/datasets/google-research-datasets/paws-x, used as suggested in the PARAPHRASUS benchmark (https://arxiv.org/pdf/2409.12060).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google-research-datasets/paws-x'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de': Language.DEU, 'en': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'PAWS-X'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Ja', 'Nein', 'Paraphrasen', 'Yes', 'No', 'paraphrases']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['en', 'de']¶
eval_framework.tasks.benchmarks.piqa module¶
- class eval_framework.tasks.benchmarks.piqa.PIQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]PIQA dataset: https://huggingface.co/datasets/ybisk/piqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'ybisk/piqa'¶
- FEWSHOT_SPLIT: str = 'test'¶
- HF_REVISION: str | None = '6b3aceb3276e5ab7e51895d73151a718690af38c'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'PIQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.piqa.PIQA_IDK(num_fewshot=0)[source]¶
Bases:
PIQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'PIQA_IDK'¶
- class eval_framework.tasks.benchmarks.piqa.PIQA_OLMES(num_fewshot=0)[source]¶
Bases:
PIQAPIQA with OLMES-style prompt: options shown with space-prefixed labels (“ A.”, “ B.”); loglikelihood over “ A”/“ B”.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'PIQA_OLMES'¶
- SAMPLE_SPLIT: str = 'train'¶
eval_framework.tasks.benchmarks.quality module¶
- class eval_framework.tasks.benchmarks.quality.QUALITY(num_fewshot=0)[source]¶
Bases:
BaseTask[str]- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'emozilla/quality'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'QuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Article', 'Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['hard', 'easy']¶
eval_framework.tasks.benchmarks.sciq module¶
- class eval_framework.tasks.benchmarks.sciq.SCIQ(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SciQ dataset: https://huggingface.co/datasets/allenai/sciq
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/sciq'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'SciQ'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness(num_fewshot=0)[source]¶
Bases:
SCIQBased on https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml#L8 In the Eval Harness implementation, the instruction text includes a context passage. This passage often contains the answer, reducing the benchmark to a straightforward copy-and-paste task.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/sciq'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'SciQ Eval Harness'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness_IDK(num_fewshot=0)[source]¶
Bases:
SCIQEvalHarness- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'SciQ Eval Harness_IDK'¶
- class eval_framework.tasks.benchmarks.sciq.SCIQ_IDK(num_fewshot=0)[source]¶
Bases:
SCIQ- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'SciQ_IDK'¶
- class eval_framework.tasks.benchmarks.sciq.SCIQ_OLMES(num_fewshot=0)[source]¶
Bases:
SCIQSciQ with OLMES-style prompt: options shown with space-prefixed labels (“ A.”, “ B.”, “ C.”, “ D.”); loglikelihood over “ A”/“ B”/“ C”/“ D”. Answer choices are deterministically shuffled per example.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'SciQ_OLMES'¶
- SAMPLE_SPLIT: str = 'train'¶
eval_framework.tasks.benchmarks.sphyr module¶
- class eval_framework.tasks.benchmarks.sphyr.SPHYR(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SPhyR dataset: https://huggingface.co/datasets/philippds/SPhyR
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'philippds/SPhyR'¶
- FEWSHOT_SPLIT: str = ''¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.grid_difference.GridDifference'>]¶
- NAME: str = 'SPHYR'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['1_random_cell_easy', '5_random_cell_easy', '10_random_cell_easy', '1_random_row_easy', '3_random_row_easy', '1_random_column_easy', '3_random_column_easy', 'full_easy', '1_random_cell_hard', '5_random_cell_hard', '10_random_cell_hard', '1_random_row_hard', '3_random_row_hard', '1_random_column_hard', '3_random_column_hard', 'full_hard']¶
eval_framework.tasks.benchmarks.squad module¶
- class eval_framework.tasks.benchmarks.squad.SQUAD(num_fewshot=0)[source]¶
Bases:
SQUAD2SQuAD dataset: https://huggingface.co/datasets/rajpurkar/squad
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'rajpurkar/squad'¶
- NAME: str = 'SQuAD'¶
- class eval_framework.tasks.benchmarks.squad.SQUAD2(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SQuAD v2 dataset: https://huggingface.co/datasets/rajpurkar/squad_v2
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'rajpurkar/squad_v2'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'SQuAD2'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'Context', 'unanswerable']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- UNANSWERABLE_STR = 'unanswerable'¶
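The UNANSWERABLE_STR sentinel folds SQuAD v2's no-answer questions into ordinary string comparison for the accuracy and F1 metrics. A minimal sketch, assuming the standard SQuAD v2 record layout (this is not the framework's actual scoring code):

```python
UNANSWERABLE_STR = "unanswerable"

def reference_answers(record: dict) -> list[str]:
    # SQuAD v2 stores gold spans under answers["text"]; an empty list
    # marks an unanswerable question, which we map to the sentinel.
    answers = record["answers"]["text"]
    return answers if answers else [UNANSWERABLE_STR]

print(reference_answers({"answers": {"text": ["Denver Broncos"]}}))  # ['Denver Broncos']
print(reference_answers({"answers": {"text": []}}))                  # ['unanswerable']
```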
- class eval_framework.tasks.benchmarks.squad.SQUAD2BPB(num_fewshot=0)[source]¶
Bases:
SQUAD2SQuAD2 variant that scores loglikelihood of the gold answer text. Reports bits-per-byte on the reference answer (first acceptable answer).
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'SQuAD2 BPB'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
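Bits-per-byte is a length-normalised loglikelihood: the model's negative log-probability of the reference text, converted to bits and divided by the reference's UTF-8 byte count. A worked sketch of that standard definition (assumed here; it should correspond to the BitsPerByteLoglikelihood metric referenced above):

```python
import math

def bits_per_byte(loglikelihood_nats: float, reference: str) -> float:
    # Convert nats to bits (divide by ln 2) and normalise by byte length.
    n_bytes = len(reference.encode("utf-8"))
    return -loglikelihood_nats / (math.log(2) * n_bytes)

# A 12-byte reference scored at -12 nats comes out at ~1.44 bits per byte.
print(bits_per_byte(-12.0, "unanswerable"))
```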
eval_framework.tasks.benchmarks.struct_eval module¶
- class eval_framework.tasks.benchmarks.struct_eval.RenderableStructEval(num_fewshot=0)[source]¶
Bases:
StructEvalRenderable StructEval variant for outputs that can be rendered visually.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric'>]¶
- NAME: str = 'RenderableStructEval'¶
- SUBJECTS: list[SubjectType] = ['Convert Markdown to HTML', 'Convert React to HTML', 'Convert Vue to HTML', 'Text to HTML']¶
- class eval_framework.tasks.benchmarks.struct_eval.StructEval(num_fewshot=0)[source]¶
Bases:
BaseTask[str]StructEval task: https://tiger-ai-lab.github.io/StructEval/
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'TIGER-Lab/StructEval'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = 'b551217560cf225245b0607a21c505e24a58e396'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.StructMetric'>]¶
- NAME: str = 'StructEval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['CSV to YAML', 'JSON to XML', 'JSON to CSV', 'XML to JSON', 'XML to YAML', 'Text to XML', 'Text to YAML', 'Text to TOML', 'YAML to JSON', 'TOML to JSON', 'Text to CSV', 'YAML to XML', 'JSON to YAML', 'TOML to YAML', 'YAML to CSV', 'CSV to JSON', 'CSV to XML', 'Text to JSON', 'XML to CSV']¶
eval_framework.tasks.benchmarks.tablebench module¶
- class eval_framework.tasks.benchmarks.tablebench.TableBench(num_fewshot=0)[source]¶
Bases:
BaseTask[tuple[str,str]]TableBench dataset: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Multilingual-Multimodal-NLP/TableBench'¶
- FEWSHOT_SPLIT: str = 'test'¶
- HF_REVISION: str | None = '81b551c744b7f49cfa0ad69cb7a1465d865c206e'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_l.ROUGE_L'>]¶
- NAME: str = 'TableBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = [('PoT', 'NumericalReasoning'), ('PoT', 'DataAnalysis'), ('PoT', 'FactChecking'), ('SCoT', 'NumericalReasoning'), ('SCoT', 'DataAnalysis'), ('SCoT', 'FactChecking'), ('TCoT', 'NumericalReasoning'), ('TCoT', 'DataAnalysis'), ('TCoT', 'FactChecking')]¶
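TableBench is one of the tasks whose SUBJECTS are tuples rather than plain strings (hence BaseTask[tuple[str, str]]), pairing an instruction style with a question category. A trivial iteration sketch; the acronym expansions come from the TableBench paper, not this framework, so treat them as assumptions:

```python
SUBJECTS = [
    ("PoT", "NumericalReasoning"),   # PoT: program-of-thought (per the paper)
    ("SCoT", "DataAnalysis"),        # SCoT: symbolic chain-of-thought
    ("TCoT", "FactChecking"),        # TCoT: textual chain-of-thought
]

for instruction_style, category in SUBJECTS:
    # Each subject selects both how the model is prompted and which
    # slice of questions it is evaluated on.
    print(f"{category} questions, {instruction_style} prompting")
```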
eval_framework.tasks.benchmarks.triviaqa module¶
- class eval_framework.tasks.benchmarks.triviaqa.TRIVIAQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]Trivia QA dataset: https://huggingface.co/datasets/mandarjoshi/trivia_qa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'mandarjoshi/trivia_qa'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'TriviaQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['rc.wikipedia.nocontext']¶
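The single subject string matches a Hugging Face configuration of the dataset (reading comprehension, Wikipedia subset, questions without context). A sketch of loading that slice directly with the datasets library, which is presumably what the task does internally:

```python
from datasets import load_dataset

ds = load_dataset(
    "mandarjoshi/trivia_qa",     # DATASET_PATH above
    "rc.wikipedia.nocontext",    # the task's only subject
    split="validation",          # SAMPLE_SPLIT above
)
print(ds[0]["question"])
```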
eval_framework.tasks.benchmarks.truthfulqa module¶
- class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]TRUTHFULQA dataset: https://huggingface.co/datasets/truthfulqa/truthful_qa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'truthfulqa/truthful_qa'¶
- FEWSHOT_ITEMS = [{'question': 'What is human life expectancy in the United States?', 'response': 'Human life expectancy in the United States is 78 years.'}, {'question': 'Who was president of the United States in 1955?', 'response': 'Dwight D. Eisenhower was president of the United States in 1955.'}, {'question': 'Which party did he belong to?', 'response': 'He belonged to the Republican Party.'}, {'question': 'What is the square root of banana?', 'response': 'I have no comment.'}, {'question': 'How does a telescope work?', 'response': 'Telescopes use lenses or mirrors to focus light and make objects appear closer.'}, {'question': 'Where were the 1992 Olympics held?', 'response': 'The 1992 Olympics were held in Barcelona, Spain.'}]¶
- FEWSHOT_SPLIT: str = ''¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm'>]¶
- NAME: str = 'TruthfulQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Q', 'A']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['mc1', 'mc2']¶
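TruthfulQA uses a fixed primer instead of drawing few-shot examples from a split (FEWSHOT_SPLIT is empty). A minimal sketch of rendering FEWSHOT_ITEMS as a prompt prefix; the "Q:"/"A:" labels are an assumption consistent with PERTURBATION_UNMODIFIABLE_WORDS = ['Q', 'A'] above:

```python
# Two of the six primer pairs from FEWSHOT_ITEMS, for brevity.
FEWSHOT_ITEMS = [
    {"question": "What is human life expectancy in the United States?",
     "response": "Human life expectancy in the United States is 78 years."},
    {"question": "What is the square root of banana?",
     "response": "I have no comment."},
]

# Join each primer pair as "Q: ...\nA: ..." blocks (formatting assumed).
prefix = "\n\n".join(f"Q: {item['question']}\nA: {item['response']}"
                     for item in FEWSHOT_ITEMS)
print(prefix)
```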
- class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA_IDK(num_fewshot=0)[source]¶
Bases:
TRUTHFULQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'TruthfulQA_IDK'¶
- class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA_OLMES(num_fewshot=0)[source]¶
Bases:
TRUTHFULQATruthfulQA multiple choice (OLMES/oe_eval style): prompt shows question and options with space-prefixed labels (“ A.”, “ B.”, …); loglikelihood over “ A”/“ B”, etc.
- Parameters:
num_fewshot (int)
- NAME: str = 'TruthfulQA_OLMES'¶
eval_framework.tasks.benchmarks.winogender module¶
- class eval_framework.tasks.benchmarks.winogender.WINOGENDER(num_fewshot=0)[source]¶
Bases:
BaseTask[str]WINOGENDER dataset: https://huggingface.co/datasets/oskarvanderwal/winogender
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'oskarvanderwal/winogender'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'Winogender'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['all']¶
- class eval_framework.tasks.benchmarks.winogender.WINOGENDER_IDK(num_fewshot=0)[source]¶
Bases:
WINOGENDER- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'Winogender_IDK'¶
eval_framework.tasks.benchmarks.winogrande module¶
- class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]WINOGRANDE dataset: https://huggingface.co/datasets/allenai/winogrande
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/winogrande'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood'>]¶
- NAME: str = 'Winogrande'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['1', '2']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['winogrande_xl']¶
- class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE_IDK(num_fewshot=0)[source]¶
Bases:
WINOGRANDE- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'Winogrande_IDK'¶
- class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE_OLMES(num_fewshot=0)[source]¶
Bases:
WINOGRANDEWinogrande with OLMES-style prompt: options shown with space-prefixed labels (“ A.”, “ B.”); loglikelihood over “ A”/“ B”.
- Parameters:
num_fewshot (int)
- NAME: str = 'Winogrande_OLMES'¶
eval_framework.tasks.benchmarks.winox module¶
- class eval_framework.tasks.benchmarks.winox.WINOX(num_fewshot=0)[source]¶
Bases:
WINOGRANDEWino-X is a parallel dataset of German, French, and Russian Winograd schemas aligned with their English counterparts. It is used to examine whether neural machine translation models can perform coreference resolution that requires commonsense knowledge, and whether multilingual language models are capable of commonsense reasoning across multiple languages.
Winogrande: https://arxiv.org/abs/1907.10641
Wino-X (code): https://github.com/demelin/Wino-X
Wino-X (dataset): https://huggingface.co/datasets/demelin/wino_x
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'demelin/wino_x'¶
- FEWSHOT_SPLIT: str = 'test'¶
- HF_REVISION: str | None = '7d82697fd52ac8b03e62aadfddc61077320f21e7'¶
- LANGUAGE_SHORT_CODE = ''¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.winox.WINOX_DE(num_fewshot=0)[source]¶
Bases:
WINOX- Parameters:
num_fewshot (int)
- LANGUAGE_SHORT_CODE = 'de'¶
- NAME: str = 'WINOX_DE'¶
- SUBJECTS: list[SubjectType] = ['lm_en_de']¶
eval_framework.tasks.benchmarks.wmt module¶
- class eval_framework.tasks.benchmarks.wmt.WMT(num_fewshot=0)[source]¶
Bases:
BaseTask[str],ABCWMT machine-translation base task (abstract); concrete subclasses such as WMT14, WMT16, and WMT20 below set DATASET_PATH and the language pairs.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = ''¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.LINEWISE_BLEU'>, <class 'eval_framework.metrics.completion.chrf.LINEWISE_CHRF'>, <class 'eval_framework.metrics.completion.ter.LINEWISE_TER'>]¶
- NAME: str = 'WMT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['phrase']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.wmt.WMT14(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt14'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}¶
- NAME: str = 'WMT14'¶
- SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']¶
- class eval_framework.tasks.benchmarks.wmt.WMT14_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt14'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}¶
- NAME: str = 'WMT14 Instruct'¶
- SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']¶
- class eval_framework.tasks.benchmarks.wmt.WMT16(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt16'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}¶
- NAME: str = 'WMT16'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'en-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT16_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt16'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}¶
- NAME: str = 'WMT16 Instruct'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'en-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT20(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt20'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}¶
- NAME: str = 'WMT20'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT20_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt20'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}¶
- NAME: str = 'WMT20 Instruct'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- COMPLETION_PREFIX = 'This is the translation:'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Please', 'translate']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
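COMPLETION_PREFIX lets instruct-tuned models announce the translation before producing it; post_process_generated_completion then has to remove that scaffolding before the line-wise BLEU/chrF/TER metrics see the text. A sketch of the stripping step (illustrative; the real method may handle more cases):

```python
COMPLETION_PREFIX = "This is the translation:"

def strip_prefix(completion_text: str) -> str:
    # Drop the announced prefix, if present, so only the translation
    # itself reaches the metrics.
    text = completion_text.strip()
    if text.startswith(COMPLETION_PREFIX):
        text = text[len(COMPLETION_PREFIX):].lstrip()
    return text

print(strip_prefix("This is the translation: Guten Morgen!"))  # 'Guten Morgen!'
```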
eval_framework.tasks.benchmarks.zero_scrolls module¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_COMPLETION(num_fewshot=0)[source]¶
Bases:
BaseTask[str]ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'tau/zero_scrolls'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- HF_REVISION: str | None = 'dc63b23022752816989b0666a366c0b0195ccc4b'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_GOV_REPORT(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS GovReport'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Summary']¶
- SUBJECTS: list[SubjectType] = ['gov_report']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_MUSIQUE(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS MuSiQue'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['musique']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_NARRATIVEQA(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS NarrativeQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['narrative_qa']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QASPER(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS Qasper'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['qasper']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QMSUM(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS QMSum'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['qmsum']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QUALITY(num_fewshot=0)[source]¶
Bases:
BaseTask[str]ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'tau/zero_scrolls'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- HF_REVISION: str | None = 'dc63b23022752816989b0666a366c0b0195ccc4b'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]¶
- NAME: str = 'ZeroSCROLLS QuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['quality']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SPACE_DIGEST(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity'>]¶
- NAME: str = 'ZeroSCROLLS SpaceDigest'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['space_digest']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SQUALITY(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS SQuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['squality']¶