eval_framework.tasks.benchmarks package

Submodules

eval_framework.tasks.benchmarks.aidanbench module

class eval_framework.tasks.benchmarks.aidanbench.AidanBench(num_fewshot=0)[source]

Bases: AidanBenchOriginal

Parameters:

num_fewshot (int)

class eval_framework.tasks.benchmarks.aidanbench.AidanBenchOriginal(num_fewshot=0)[source]

Bases: BaseTask[str]

AidanBench (https://openreview.net/pdf?id=fz969ahcvJ).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Aleph-Alpha-Research/aidanbench'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.aidanbench.AidanBenchMetric'>]
NAME: str = 'AidanBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
generate_completions(llm, samples, stop_sequences=None, max_tokens=None)[source]

Generates completions for the given samples, using the provided stop sequences and maximum token budget during generation, and returns a list of Completion objects.

Parameters:
  • llm (BaseLLM)

  • samples (list[Sample])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

Return type:

list[Completion]
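
As a rough usage sketch, generate_completions is called with a BaseLLM instance and a list of Sample objects; how those are constructed is framework-specific and not shown in this section, so they are simply passed in as arguments here:

    from eval_framework.tasks.benchmarks.aidanbench import AidanBench

    def run_aidanbench(llm, samples):
        # llm is assumed to implement BaseLLM; samples is assumed to be a list[Sample]
        # drawn from the task's 'train' sample split.
        task = AidanBench(num_fewshot=0)
        # stop_sequences (list[str] | None) and max_tokens (int | None) are optional.
        return task.generate_completions(llm, samples, stop_sequences=None, max_tokens=1024)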

eval_framework.tasks.benchmarks.arc module

class eval_framework.tasks.benchmarks.arc.ARC(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC dataset: https://huggingface.co/datasets/allenai/ai2_arc

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'ai2_arc'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['ARC-Easy', 'ARC-Challenge']
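
Task classes such as ARC are constructed with the number of few-shot examples (presumably drawn from FEWSHOT_SPLIT) and expose their configuration through the class attributes listed above; a minimal sketch:

    from eval_framework.tasks.benchmarks.arc import ARC

    task = ARC(num_fewshot=5)          # five few-shot examples per prompt
    print(ARC.NAME)                    # 'ARC'
    print(ARC.SUBJECTS)                # ['ARC-Easy', 'ARC-Challenge']
    print(ARC.RESPONSE_TYPE)           # 'loglikelihoods'
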
class eval_framework.tasks.benchmarks.arc.ARC_IDK(num_fewshot=0)[source]

Bases: ARC

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'ARC_IDK'

eval_framework.tasks.benchmarks.arc_de module

class eval_framework.tasks.benchmarks.arc_de.ARC_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC-DE dataset: https://huggingface.co/datasets/LeoLM/ArcChallenge_de

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/ArcChallenge_de'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC German'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']

eval_framework.tasks.benchmarks.arc_fi module

class eval_framework.tasks.benchmarks.arc_fi.ARC_FI(num_fewshot=0)[source]

Bases: BaseTask[str]

ARC-FI dataset: https://huggingface.co/datasets/LumiOpen/arc_challenge_mt

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LumiOpen/arc_challenge_mt'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'Finnish'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ARC Finnish'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['fi']

eval_framework.tasks.benchmarks.belebele module

class eval_framework.tasks.benchmarks.belebele.BELEBELE(num_fewshot=0)[source]

Bases: BaseTask[str]

BELEBELE dataset: https://huggingface.co/datasets/facebook/belebele

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'facebook/belebele'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'BELEBELE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['eng_Latn']

eval_framework.tasks.benchmarks.bigcodebench module

class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBench(num_fewshot=0)[source]

Bases: BaseTask[str]

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'bigcode/bigcodebench'
FEWSHOT_SPLIT: str = 'v0.1.4'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne'>]
NAME: str = 'BigCodeBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'v0.1.4'
SUBJECTS: list[SubjectType] = ['original', 'calibrated']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHard(num_fewshot=0)[source]

Bases: BigCodeBench

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'bigcode/bigcodebench-hard'
NAME: str = 'BigCodeBenchHard'
class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHardInstruct(num_fewshot=0)[source]

Bases: BigCodeBenchHard

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard

Parameters:

num_fewshot (int)

NAME: str = 'BigCodeBenchHardInstruct'
class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchInstruct(num_fewshot=0)[source]

Bases: BigCodeBench

BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench

Parameters:

num_fewshot (int)

NAME: str = 'BigCodeBenchInstruct'
eval_framework.tasks.benchmarks.bigcodebench.extract_executable_code(llm_response)[source]
Return type:

str

Parameters:

llm_response (str)
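
The concrete implementation of extract_executable_code is not reproduced here; a plausible approximation (an assumption, not the actual code) is to pull the contents of a fenced Python block out of the model's reply and fall back to the raw text:

    import re

    def extract_executable_code_sketch(llm_response: str) -> str:
        # Illustrative only: prefer a fenced ```python block, otherwise return the reply as-is.
        match = re.search(r"```(?:python)?\n(.*?)```", llm_response, flags=re.DOTALL)
        return match.group(1) if match else llm_response

    reply = "Sure!\n```python\nimport math\nprint(math.sqrt(2))\n```"
    print(extract_executable_code_sketch(reply))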

eval_framework.tasks.benchmarks.casehold module

class eval_framework.tasks.benchmarks.casehold.CASEHOLD(num_fewshot=0)[source]

Bases: BaseTask[str]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'lex_glue'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'CaseHold'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['case_hold']

eval_framework.tasks.benchmarks.chembench module

class eval_framework.tasks.benchmarks.chembench.ChemBench(num_fewshot=0)[source]

Bases: BaseTask[str]

ChemBench dataset: https://huggingface.co/datasets/jablonkagroup/ChemBench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'jablonkagroup/ChemBench'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'ChemBench'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['analytical_chemistry', 'chemical_preference', 'general_chemistry', 'inorganic_chemistry', 'materials_science', 'organic_chemistry', 'physical_chemistry', 'technical_chemistry', 'toxicity_and_safety']

eval_framework.tasks.benchmarks.copa module

class eval_framework.tasks.benchmarks.copa.COPA(num_fewshot=0)[source]

Bases: BaseTask[str]

COPA dataset: https://huggingface.co/datasets/aps/super_glue

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'aps/super_glue'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'COPA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['because', 'therefore']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['copa']
convert_choice(choice)[source]
Return type:

str

Parameters:

choice (str)
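
The body of convert_choice is not shown here. In the common COPA formulation (as used in lm-evaluation-harness), each choice is turned into a continuation of the premise by lower-casing its first character; a sketch under that assumption:

    def convert_choice_sketch(choice: str) -> str:
        # Assumed behaviour: lower-case the first character so the choice reads
        # naturally after the connective ("because" / "therefore").
        return choice[0].lower() + choice[1:] if choice else choice

    print(convert_choice_sketch("The man fixed the faucet."))  # the man fixed the faucet.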

class eval_framework.tasks.benchmarks.copa.COPA_IDK(num_fewshot=0)[source]

Bases: COPA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'COPA_IDK'

eval_framework.tasks.benchmarks.duc module

class eval_framework.tasks.benchmarks.duc.DUC(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

https://huggingface.co/datasets/midas/duc2001

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'midas/duc2001'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Text', 'Keyphrase']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[str] = ['raw']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.duc.DUC_ABSTRACTIVE(num_fewshot=0)[source]

Bases: DUC

Parameters:

num_fewshot (int)

NAME: str = 'DUC Abstractive'
SUBJECTS: list[str] = ['raw']
class eval_framework.tasks.benchmarks.duc.DUC_EXTRACTIVE(num_fewshot=0)[source]

Bases: DUC

Parameters:

num_fewshot (int)

NAME: str = 'DUC Extractive'
SUBJECTS: list[str] = ['raw']

eval_framework.tasks.benchmarks.flores200 module

class eval_framework.tasks.benchmarks.flores200.Flores200(num_fewshot=0)[source]

Bases: BaseTask[str]

FLORES-200 dataset: https://huggingface.co/datasets/facebook/flores

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'facebook/flores'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fin_Latn': Language.FIN, 'fra_Latn': Language.FRA, 'nld_Latn': Language.NLD}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>]
NAME: str = 'FLoRes-200'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'devtest'
SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fin_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-nld_Latn', 'eng_Latn-deu_Latn', 'eng_Latn-fin_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-nld_Latn', 'fin_Latn-deu_Latn', 'fin_Latn-eng_Latn', 'fin_Latn-fra_Latn', 'fin_Latn-nld_Latn', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-fin_Latn', 'fra_Latn-nld_Latn', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fin_Latn', 'nld_Latn-fra_Latn']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
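
Each SUBJECTS entry encodes a translation direction as '<source>-<target>' using FLORES language codes, and LANGUAGE maps each code to a Language member. The split-on-hyphen reading below is an inference from the naming, not documented behaviour:

    # String stand-ins for the Language enum members, to keep the snippet self-contained.
    LANGUAGE = {"deu_Latn": "DEU", "eng_Latn": "ENG", "fin_Latn": "FIN",
                "fra_Latn": "FRA", "nld_Latn": "NLD"}

    subject = "deu_Latn-eng_Latn"
    source, target = subject.split("-")
    print(source, "->", target)                  # deu_Latn -> eng_Latn
    print(LANGUAGE[source], LANGUAGE[target])    # DEU ENG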

eval_framework.tasks.benchmarks.flores_plus module

class eval_framework.tasks.benchmarks.flores_plus.FloresPlus(num_fewshot=0)[source]

Bases: BaseTask[str]

Flores-Plus dataset: https://huggingface.co/datasets/openlanguagedata/flores_plus

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openlanguagedata/flores_plus'
FEWSHOT_SPLIT: str = 'devtest'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fra_Latn': Language.FRA, 'ita_Latn': Language.ITA, 'nld_Latn': Language.NLD, 'pol_Latn': Language.POL, 'rus_Cyrl': Language.RUS, 'spa_Latn': Language.SPA, 'ukr_Cyrl': Language.UKR}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>, <class 'eval_framework.metrics.completion.chrf.CHRF'>, <class 'eval_framework.metrics.completion.comet.COMET'>]
NAME: str = 'Flores-Plus'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'dev'
SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-ita_Latn', 'deu_Latn-nld_Latn', 'deu_Latn-pol_Latn', 'deu_Latn-rus_Cyrl', 'deu_Latn-spa_Latn', 'deu_Latn-ukr_Cyrl', 'eng_Latn-deu_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-ita_Latn', 'eng_Latn-nld_Latn', 'eng_Latn-pol_Latn', 'eng_Latn-rus_Cyrl', 'eng_Latn-spa_Latn', 'eng_Latn-ukr_Cyrl', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-ita_Latn', 'fra_Latn-nld_Latn', 'fra_Latn-pol_Latn', 'fra_Latn-rus_Cyrl', 'fra_Latn-spa_Latn', 'fra_Latn-ukr_Cyrl', 'ita_Latn-deu_Latn', 'ita_Latn-eng_Latn', 'ita_Latn-fra_Latn', 'ita_Latn-nld_Latn', 'ita_Latn-pol_Latn', 'ita_Latn-rus_Cyrl', 'ita_Latn-spa_Latn', 'ita_Latn-ukr_Cyrl', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fra_Latn', 'nld_Latn-ita_Latn', 'nld_Latn-pol_Latn', 'nld_Latn-rus_Cyrl', 'nld_Latn-spa_Latn', 'nld_Latn-ukr_Cyrl', 'pol_Latn-deu_Latn', 'pol_Latn-eng_Latn', 'pol_Latn-fra_Latn', 'pol_Latn-ita_Latn', 'pol_Latn-nld_Latn', 'pol_Latn-rus_Cyrl', 'pol_Latn-spa_Latn', 'pol_Latn-ukr_Cyrl', 'rus_Cyrl-deu_Latn', 'rus_Cyrl-eng_Latn', 'rus_Cyrl-fra_Latn', 'rus_Cyrl-ita_Latn', 'rus_Cyrl-nld_Latn', 'rus_Cyrl-pol_Latn', 'rus_Cyrl-spa_Latn', 'rus_Cyrl-ukr_Cyrl', 'spa_Latn-deu_Latn', 'spa_Latn-eng_Latn', 'spa_Latn-fra_Latn', 'spa_Latn-ita_Latn', 'spa_Latn-nld_Latn', 'spa_Latn-pol_Latn', 'spa_Latn-rus_Cyrl', 'spa_Latn-ukr_Cyrl', 'ukr_Cyrl-deu_Latn', 'ukr_Cyrl-eng_Latn', 'ukr_Cyrl-fra_Latn', 'ukr_Cyrl-ita_Latn', 'ukr_Cyrl-nld_Latn', 'ukr_Cyrl-pol_Latn', 'ukr_Cyrl-rus_Cyrl', 'ukr_Cyrl-spa_Latn']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.gpqa module

class eval_framework.tasks.benchmarks.gpqa.GPQA(num_fewshot=0)[source]

Bases: BaseTask[str]

GPQA dataset: https://huggingface.co/datasets/Idavidrein/gpqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Idavidrein/gpqa'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'GPQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['gpqa_extended']
class eval_framework.tasks.benchmarks.gpqa.GPQA_COT(num_fewshot=0)[source]

Bases: GPQA

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'GPQA_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
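
The ANS_RE pattern above pulls the final answer letter out of a chain-of-thought completion; post_process_generated_completion presumably applies it along the lines of this minimal sketch:

    import re

    ANS_RE = re.compile(r"Therefore, the answer is \(([ABCDEFGHIJ])\)")  # as documented above

    completion = "The anomeric effect favours the axial conformer, matching option C. Therefore, the answer is (C)"
    match = ANS_RE.search(completion)
    print(match.group(1) if match else "")  # C
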
class eval_framework.tasks.benchmarks.gpqa.GPQA_IDK(num_fewshot=0)[source]

Bases: GPQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'GPQA_IDK'

eval_framework.tasks.benchmarks.gsm8k module

class eval_framework.tasks.benchmarks.gsm8k.GSM8K(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = ''
NAME: str = 'GSM8K'
class eval_framework.tasks.benchmarks.gsm8k.GSM8KEvalHarness(num_fewshot=0)[source]

Bases: BaseTask[str]

GSM8K dataset: https://huggingface.co/datasets/openai/gsm8k. This version uses samples from the train split as few-shot examples.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'gsm8k'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'GSM8KEvalHarness'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['main']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]

eval_framework.tasks.benchmarks.hellaswag module

class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG(num_fewshot=0)[source]

Bases: BaseTask[str]

HellaSwag dataset: https://huggingface.co/datasets/Rowan/hellaswag. Available dataset splits: train, validation, test.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Rowan/hellaswag'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'HellaSwag'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG_IDK(num_fewshot=0)[source]

Bases: HELLASWAG

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'HellaSwag_IDK'

eval_framework.tasks.benchmarks.hellaswag_de module

class eval_framework.tasks.benchmarks.hellaswag_de.HELLASWAG_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

HellaSwag German dataset: https://huggingface.co/datasets/LeoLM/HellaSwag_de. Available dataset splits: train (1k rows), validation (10k rows).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/HellaSwag_de'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'HellaSwag German'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']

eval_framework.tasks.benchmarks.humaneval module

class eval_framework.tasks.benchmarks.humaneval.HumanEval(num_fewshot=0)[source]

Bases: BaseTask[str]

HumanEval dataset: https://huggingface.co/datasets/openai/openai_humaneval/

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openai/openai_humaneval'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]
NAME: str = 'Human Eval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.humaneval.HumanEvalInstruct(num_fewshot=0)[source]

Bases: HumanEval

Parameters:

num_fewshot (int)

CUE_PREFIX = 'Here is the completed function:\n```python\n'
NAME: str = 'Human Eval Instruct'
class eval_framework.tasks.benchmarks.humaneval.HumanEvalMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • test (str)

  • entry_point (str)

  • prompt (str)

  • extra_data (Any)

entry_point: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str
test: str
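
HumanEvalMetricContext is a pydantic model carrying the HumanEval test-harness data for a sample; an illustrative instantiation (the field values are made up for the example):

    from eval_framework.tasks.benchmarks.humaneval import HumanEvalMetricContext

    ctx = HumanEvalMetricContext(
        prompt='def add(a, b):\n    """Return a + b."""\n',
        entry_point="add",
        test="def check(candidate):\n    assert candidate(1, 2) == 3\n",
    )
    # model_config sets extra='allow', so additional fields would also be accepted.
    print(ctx.entry_point)  # add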

eval_framework.tasks.benchmarks.ifeval module

class eval_framework.tasks.benchmarks.ifeval.IFEval(num_fewshot=0)[source]

Bases: BaseTask[str]

IFEval: Instruction Following Eval (https://arxiv.org/pdf/2311.07911).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google/IFEval'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.ifeval.IFEvalMetric'>]
NAME: str = 'IFEval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.ifeval.IFEvalDe(num_fewshot=0)[source]

Bases: IFEval

German version of the Instruction Following Evaluation (IFEval) benchmark.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'jzhang86/de_ifeval'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.DEU}
NAME: str = 'IFEval German'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.ifeval.IFEvalFiSv(num_fewshot=0)[source]

Bases: IFEval

Machine translated versions of the Instruction Following Evaluation (IFEval) benchmark.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LumiOpen/ifeval_mt'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'fi': Language.FIN, 'sv': Language.SWE}
NAME: str = 'IFEval Finnish & Swedish'
SUBJECTS: list[SubjectType] = ['fi', 'sv']

eval_framework.tasks.benchmarks.include module

class eval_framework.tasks.benchmarks.include.INCLUDE(num_fewshot=0)[source]

Bases: BaseTask[str]

INCLUDE dataset: https://huggingface.co/datasets/CohereLabs/include-base-44

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'CohereLabs/include-base-44'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'Albanian': Language.SQI, 'Arabic': Language.ARB, 'Armenian': Language.HYE, 'Azerbaijani': Language.AZE, 'Basque': Language.EUS, 'Belarusian': Language.BEL, 'Bengali': Language.BEN, 'Bulgarian': Language.BUL, 'Chinese': Language.ZHO, 'Croatian': Language.HRV, 'Dutch': Language.NLD, 'Estonian': Language.EST, 'Finnish': Language.FIN, 'French': Language.FRA, 'Georgian': Language.KAT, 'German': Language.DEU, 'Greek': Language.ELL, 'Hebrew': Language.HEB, 'Hindi': Language.HIN, 'Hungarian': Language.HUN, 'Indonesian': Language.IND, 'Italian': Language.ITA, 'Japanese': Language.JPN, 'Kazakh': Language.KAZ, 'Korean': Language.KOR, 'Lithuanian': Language.LIT, 'Malay': Language.MSA, 'Malayalam': Language.MAL, 'Nepali': Language.NEP, 'North Macedonian': Language.MKD, 'Persian': Language.FAS, 'Polish': Language.POL, 'Portuguese': Language.POR, 'Russian': Language.RUS, 'Serbian': Language.SRP, 'Spanish': Language.SPA, 'Tagalog': Language.TGL, 'Tamil': Language.TAM, 'Telugu': Language.TEL, 'Turkish': Language.TUR, 'Ukrainian': Language.UKR, 'Urdu': Language.URD, 'Uzbek': Language.UZB, 'Vietnamese': Language.VIE}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'INCLUDE'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['Albanian', 'Arabic', 'Armenian', 'Azerbaijani', 'Basque', 'Belarusian', 'Bengali', 'Bulgarian', 'Chinese', 'Croatian', 'Dutch', 'Estonian', 'Finnish', 'French', 'Georgian', 'German', 'Greek', 'Hebrew', 'Hindi', 'Hungarian', 'Indonesian', 'Italian', 'Japanese', 'Kazakh', 'Korean', 'Lithuanian', 'Malay', 'Malayalam', 'Nepali', 'North Macedonian', 'Persian', 'Polish', 'Portuguese', 'Russian', 'Serbian', 'Spanish', 'Tagalog', 'Tamil', 'Telugu', 'Turkish', 'Ukrainian', 'Urdu', 'Uzbek', 'Vietnamese']
eval_framework.tasks.benchmarks.include.subject_to_language(subject)[source]
Return type:

Language

Parameters:

subject (str)
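
subject_to_language maps an INCLUDE subject (a language name) to the corresponding Language member, mirroring the LANGUAGE mapping shown above (e.g. 'German' -> Language.DEU); a usage sketch:

    from eval_framework.tasks.benchmarks.include import subject_to_language

    print(subject_to_language("German"))    # expected: Language.DEU (per the mapping above)
    print(subject_to_language("Japanese"))  # expected: Language.JPN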

eval_framework.tasks.benchmarks.infinitebench module

class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens (https://github.com/OpenBMB/InfiniteBench).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'xinrongzhang2022/InfiniteBench'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None
SUBJECTS: list[SubjectType] = ['default']
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchCompletion(num_fewshot=0)[source]

Bases: InfiniteBench, ABC

Base class for completion tasks.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
RESPONSE_TYPE: ResponseType = 'completion'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchLoglikelihood(num_fewshot=0)[source]

Bases: InfiniteBench, ABC

Base class for loglikelihood tasks.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeDebug(num_fewshot=0)[source]

Bases: InfiniteBenchLoglikelihood

Finding which function in a code repo contains a crashing error (MC form).

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'code_debug'
NAME: str = 'InfiniteBench_CodeDebug'
SAMPLE_SPLIT: str = 'code_debug'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeRun(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Simulating execution of multiple simple, synthetic functions.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'code_run'
NAME: str = 'InfiniteBench_CodeRun'
SAMPLE_SPLIT: str = 'code_run'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnDia(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Identification of talkers in partially anonymized scripts.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longdialogue_qa_eng'
NAME: str = 'InfiniteBench_EnDia'
SAMPLE_SPLIT: str = 'longdialogue_qa_eng'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnMC(num_fewshot=0)[source]

Bases: InfiniteBenchLoglikelihood

Multiple choice questions derived from the fake book.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longbook_choice_eng'
NAME: str = 'InfiniteBench_EnMC'
SAMPLE_SPLIT: str = 'longbook_choice_eng'
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnQA(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Free-form question answering based on the fake book.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'longbook_qa_eng'
NAME: str = 'InfiniteBench_EnQA'
SAMPLE_SPLIT: str = 'longbook_qa_eng'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_MathFind(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Finding special integers in a lengthy list.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'math_find'
NAME: str = 'InfiniteBench_MathFind'
SAMPLE_SPLIT: str = 'math_find'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveKV2(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Finding the corresponding value from a dictionary and a key.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'kv_retrieval'
NAME: str = 'InfiniteBench_RetrieveKV2'
SAMPLE_SPLIT: str = 'kv_retrieval'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveNumber(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Locating repeated hidden numbers in a noisy long context.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'number_string'
NAME: str = 'InfiniteBench_RetrieveNumber'
SAMPLE_SPLIT: str = 'number_string'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrievePassKey1(num_fewshot=0)[source]

Bases: InfiniteBenchCompletion

Retrieving hidden keys in a noisy long context.

Parameters:

num_fewshot (int)

FEWSHOT_SPLIT: str = 'passkey'
NAME: str = 'InfiniteBench_RetrievePassKey1'
SAMPLE_SPLIT: str = 'passkey'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]

eval_framework.tasks.benchmarks.math_reasoning module

class eval_framework.tasks.benchmarks.math_reasoning.AIME2024(num_fewshot=0)[source]

Bases: MATHReasoning

AIME 2024 dataset: https://huggingface.co/datasets/HuggingFaceH4/aime_2024

This dataset contains a single train split of 30 questions with the columns:

ID | Problem | Solution | Answer

Evaluation is pass@1.

Parameters:

num_fewshot (int)

ANSWER_PATTERN = 'Therefore, the final answer is:(.*?). I hope it is correct.'
DATASET_PATH: str = 'HuggingFaceH4/aime_2024'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'AIME2024'
QUERY_TEMPLATE = 'Solve the following math problem efficiently and clearly:\n\n    - For simple problems (2 steps or fewer):\n    Provide a concise solution with minimal explanation.\n\n    - For complex problems (3 steps or more):\n    Use this step-by-step format:\n\n    ## Step 1: [Concise description]\n    [Brief explanation and calculations]\n\n    ## Step 2: [Concise description]\n    [Brief explanation and calculations]\n\n    ...\n\n    Regardless of the approach, always conclude with:\n\n    Therefore, the final answer is: $\\boxed{{answer}}$. I hope it is correct.\n\n    Where [answer] is just the final number or expression that solves the problem.\n\n    Problem: {Question}'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['no_subject']
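
ANSWER_PATTERN targets the closing sentence that QUERY_TEMPLATE asks the model to produce; a minimal extraction sketch using the documented pattern:

    import re

    ANSWER_PATTERN = r"Therefore, the final answer is:(.*?). I hope it is correct."  # as documented above

    completion = (
        "## Step 1: Count the valid configurations...\n"
        "Therefore, the final answer is: $\\boxed{204}$. I hope it is correct."
    )
    match = re.search(ANSWER_PATTERN, completion)
    print(match.group(1).strip() if match else "")  # $\boxed{204}$
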
class eval_framework.tasks.benchmarks.math_reasoning.AIME2025(num_fewshot=0)[source]

Bases: AIME2024

AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25

This dataset contains a single test split of 30 questions. Data contains problem | answer | id

pass@1 evaluation

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'math-ai/aime25'
FEWSHOT_SPLIT: str = 'test'
NAME: str = 'AIME2025'
SAMPLE_SPLIT: str = 'test'
class eval_framework.tasks.benchmarks.math_reasoning.GSM8KReasoning(num_fewshot=0)[source]

Bases: MATHReasoning

GSM8K dataset with reasoning prompt: https://huggingface.co/datasets/openai/gsm8k

Zero-shot reasoning version that expects answers in boxed format.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'gsm8k'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'GSM8KReasoning'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
QUERY_TEMPLATE = 'Solve the following math problem step by step. Think through the problem carefully and show your reasoning.\n\nPlease provide your answer in the format: $\\boxed{{answer}}$ where answer is the final numerical result.\n\nQuestion: {question}\n\nAnswer:'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['main']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
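
QUERY_TEMPLATE is presumably filled in with str.format, so the doubled braces around 'answer' survive as literal braces while '{question}' is substituted; a sketch with a made-up question:

    QUERY_TEMPLATE = (
        "Solve the following math problem step by step. Think through the problem carefully "
        "and show your reasoning.\n\n"
        "Please provide your answer in the format: $\\boxed{{answer}}$ where answer is the "
        "final numerical result.\n\n"
        "Question: {question}\n\nAnswer:"
    )

    prompt = QUERY_TEMPLATE.format(question="A farm has 12 cows and twice as many sheep. How many animals are there?")
    # '{{answer}}' is now the literal '{answer}' and the question has been filled in.
    print(prompt)
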
class eval_framework.tasks.benchmarks.math_reasoning.MATH(num_fewshot=0)[source]

Bases: MATHReasoning

MATH dataset: https://huggingface.co/datasets/EleutherAI/hendrycks_math

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'EleutherAI/hendrycks_math'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'Math'
QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n    {Question}\n\n    Remember to put your answer in $\\boxed{{answer}}$\n\n    where [answer] is just the final number or expression that solves the problem.'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']
extract_last_two_dollar_text(s)[source]

extract_last_two_dollar_text finds the text between the last two dollar signs in a string and returns the extracted text.

Parameters:

s (str)

Return type:

str

post_process_generated_completion(completion_text, sample=None)[source]

post_process_generated_completion extracts the answer via flexible extraction/matching: a boxed answer, if present, is used first; otherwise, if LaTeX math delimiters ("$") are present, the text between them is extracted and used; failing that, text following an answer prefix ("Answer:") is used.

Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
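
The extraction order described above (boxed answer first, then the text between the last LaTeX '$' delimiters, then an 'Answer:' prefix) can be approximated with a short sketch; the real extract_last_two_dollar_text and post_process_generated_completion implementations may differ in detail:

    import re

    def extract_last_two_dollar_text(s: str) -> str:
        # Sketch: text between the last two '$' signs, or '' if fewer than two are present.
        parts = s.rsplit("$", 2)
        return parts[1] if len(parts) == 3 else ""

    def flex_extract(completion: str) -> str:
        boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        if boxed:                        # 1. a boxed answer wins
            return boxed[-1]
        if completion.count("$") >= 2:   # 2. then LaTeX math delimiters
            return extract_last_two_dollar_text(completion)
        match = re.search(r"(?i)Answer\s*:\s*(.*)", completion)  # 3. finally the 'Answer:' prefix
        return match.group(1).strip() if match else completion.strip()

    print(flex_extract("The area is $\\boxed{12}$ square units."))  # 12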

class eval_framework.tasks.benchmarks.math_reasoning.MATH500(num_fewshot=0)[source]

Bases: MATHReasoning

MATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500

This dataset contains a single test split of 500 questions with the columns:

ID | Problem | Solution | Answer

Evaluation is pass@1.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'HuggingFaceH4/MATH-500'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]
NAME: str = 'MATH500'
QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n    {Question}\n\n    Remember to put your answer in $\\boxed{{answer}}$\n\n    where [answer] is just the final number or expression that solves the problem.'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.math_reasoning.MATHLvl5(num_fewshot=0)[source]

Bases: MATH

Parameters:

num_fewshot (int)

NAME: str = 'Math Lvl 5'
class eval_framework.tasks.benchmarks.math_reasoning.MATHReasoning(num_fewshot=0)[source]

Bases: BaseTask[str]

Base class shared by the math reasoning tasks in this module (AIME2024, AIME2025, GSM8KReasoning, MATH, MATH500); dataset-specific details are documented on the individual task classes.

Parameters:

num_fewshot (int)

ANSWER_PATTERN = '(?i)Answer\\s*:\\s*(.*)'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>]
RESPONSE_TYPE: ResponseType = 'completion'
SUBJECTS: list[SubjectType] = ['no_subject']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.mbpp module

class eval_framework.tasks.benchmarks.mbpp.MBPP(num_fewshot=0)[source]

Bases: BaseTask[str]

MBPP provides both the problem statement and the test cases upfront. It says, "Here is the problem and here are the tests; write code that passes them." Note that LLMs can cheat by writing code that merely passes the tests without solving the stated problem.

MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only provides the problem statement and function signature initially. It says, "Here is the problem and function signature; write code, and the tests will be run afterwards."

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google-research-datasets/mbpp'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]
NAME: str = 'MBPP'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['full']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.mbpp.MBPPMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • tests_code (str)

  • extra_data (Any)

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

tests_code: str
class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS(num_fewshot=0)[source]

Bases: MBPP

MBPP provides both the problem statement and the test cases upfront. It says, "Here is the problem and here are the tests; write code that passes them." Note that LLMs can cheat by writing code that merely passes the tests without solving the stated problem.

MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only provides the problem statement and function signature initially. It says, "Here is the problem and function signature; write code, and the tests will be run afterwards."

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS_SANITIZED(num_fewshot=0)[source]

Bases: MBPP_PROMPT_WITHOUT_TESTS

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS_SANITIZED'
SUBJECTS: list[SubjectType] = ['sanitized']
class eval_framework.tasks.benchmarks.mbpp.MBPP_SANITIZED(num_fewshot=0)[source]

Bases: MBPP

Parameters:

num_fewshot (int)

NAME: str = 'MBPP_SANITZED'
SUBJECTS: list[SubjectType] = ['sanitized']

eval_framework.tasks.benchmarks.mmlu module

class eval_framework.tasks.benchmarks.mmlu.FullTextMMLU(num_fewshot=0)[source]

Bases: MMLU

Variant of the MMLU dataset where the model is expected to reproduce the full choice text rather than just the answer key.

Parameters:

num_fewshot (int)

NAME: str = 'Full Text MMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'answers', 'A', 'B', 'C', 'D']
class eval_framework.tasks.benchmarks.mmlu.MMLU(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU dataset: https://huggingface.co/datasets/cais/mmlu

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'cais/mmlu'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']
class eval_framework.tasks.benchmarks.mmlu.MMLU_COT(num_fewshot=0)[source]

Bases: MMLU

MMLU dataset with instruction to summarize reasoning and conclude with answer. Inspired by https://arxiv.org/pdf/2411.15124 (Table 44)

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is: ([ABCD])')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'MMLU_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
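
MMLU_COT scores the completion text rather than loglikelihoods, so the final answer letter has to be recovered from the generated reasoning; the documented ANS_RE pattern does exactly that:

    import re

    ANS_RE = re.compile(r"Therefore, the answer is: ([ABCD])")  # as documented above

    completion = "Mitochondria synthesise ATP, which corresponds to option B. Therefore, the answer is: B"
    match = ANS_RE.search(completion)
    print(match.group(1) if match else "")  # B
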
class eval_framework.tasks.benchmarks.mmlu.MMLU_IDK(num_fewshot=0)[source]

Bases: MMLU

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'MMLU_IDK'

eval_framework.tasks.benchmarks.mmlu_de module

class eval_framework.tasks.benchmarks.mmlu_de.MMLU_DE(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU DE dataset: https://huggingface.co/datasets/LeoLM/MMLU_de

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'LeoLM/MMLU_de'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU_DE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

eval_framework.tasks.benchmarks.mmlu_pro module

class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO(num_fewshot=0)[source]

Bases: BaseTask[str]

MMLU_PRO dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'TIGER-Lab/MMLU-Pro'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMLU Pro'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['engineering', 'physics', 'psychology', 'chemistry', 'biology', 'law', 'philosophy', 'computer science', 'other', 'economics', 'business', 'history', 'math', 'health']
class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_COT(num_fewshot=0)[source]

Bases: MMLU_PRO

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'MMLU_PRO_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_IDK(num_fewshot=0)[source]

Bases: MMLU_PRO

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'MMLU Pro_IDK'

eval_framework.tasks.benchmarks.mmmlu module

class eval_framework.tasks.benchmarks.mmmlu.MMMLU(num_fewshot=0)[source]

Bases: BaseTask[tuple[str, str]]

MMMLU dataset: https://huggingface.co/datasets/openai/MMMLU

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openai/MMMLU'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('AR', 'abstract_algebra')": Language.ARB, "('AR', 'anatomy')": Language.ARB, "('AR', 'astronomy')": Language.ARB, "('AR', 'business_ethics')": Language.ARB, "('AR', 'clinical_knowledge')": Language.ARB, "('AR', 'college_biology')": Language.ARB, "('AR', 'college_chemistry')": Language.ARB, "('AR', 'college_computer_science')": Language.ARB, "('AR', 'college_mathematics')": Language.ARB, "('AR', 'college_medicine')": Language.ARB, "('AR', 'college_physics')": Language.ARB, "('AR', 'computer_security')": Language.ARB, "('AR', 'conceptual_physics')": Language.ARB, "('AR', 'econometrics')": Language.ARB, "('AR', 'electrical_engineering')": Language.ARB, "('AR', 'elementary_mathematics')": Language.ARB, "('AR', 'formal_logic')": Language.ARB, "('AR', 'global_facts')": Language.ARB, "('AR', 'high_school_biology')": Language.ARB, "('AR', 'high_school_chemistry')": Language.ARB, "('AR', 'high_school_computer_science')": Language.ARB, "('AR', 'high_school_european_history')": Language.ARB, "('AR', 'high_school_geography')": Language.ARB, "('AR', 'high_school_government_and_politics')": Language.ARB, "('AR', 'high_school_macroeconomics')": Language.ARB, "('AR', 'high_school_mathematics')": Language.ARB, "('AR', 'high_school_microeconomics')": Language.ARB, "('AR', 'high_school_physics')": Language.ARB, "('AR', 'high_school_psychology')": Language.ARB, "('AR', 'high_school_statistics')": Language.ARB, "('AR', 'high_school_us_history')": Language.ARB, "('AR', 'high_school_world_history')": Language.ARB, "('AR', 'human_aging')": Language.ARB, "('AR', 'human_sexuality')": Language.ARB, "('AR', 'international_law')": Language.ARB, "('AR', 'jurisprudence')": Language.ARB, "('AR', 'logical_fallacies')": Language.ARB, "('AR', 'machine_learning')": Language.ARB, "('AR', 'management')": Language.ARB, "('AR', 'marketing')": Language.ARB, "('AR', 'medical_genetics')": Language.ARB, "('AR', 'miscellaneous')": Language.ARB, "('AR', 'moral_disputes')": Language.ARB, "('AR', 'moral_scenarios')": Language.ARB, "('AR', 'nutrition')": Language.ARB, "('AR', 'philosophy')": Language.ARB, "('AR', 'prehistory')": Language.ARB, "('AR', 'professional_accounting')": Language.ARB, "('AR', 'professional_law')": Language.ARB, "('AR', 'professional_medicine')": Language.ARB, "('AR', 'professional_psychology')": Language.ARB, "('AR', 'public_relations')": Language.ARB, "('AR', 'security_studies')": Language.ARB, "('AR', 'sociology')": Language.ARB, "('AR', 'us_foreign_policy')": Language.ARB, "('AR', 'virology')": Language.ARB, "('AR', 'world_religions')": Language.ARB, "('DE', 'abstract_algebra')": Language.DEU, "('DE', 'anatomy')": Language.DEU, "('DE', 'astronomy')": Language.DEU, "('DE', 'business_ethics')": Language.DEU, "('DE', 'clinical_knowledge')": Language.DEU, "('DE', 'college_biology')": Language.DEU, "('DE', 'college_chemistry')": Language.DEU, "('DE', 'college_computer_science')": Language.DEU, "('DE', 'college_mathematics')": Language.DEU, "('DE', 'college_medicine')": Language.DEU, "('DE', 'college_physics')": Language.DEU, "('DE', 'computer_security')": Language.DEU, "('DE', 'conceptual_physics')": Language.DEU, "('DE', 'econometrics')": Language.DEU, "('DE', 'electrical_engineering')": Language.DEU, "('DE', 'elementary_mathematics')": Language.DEU, "('DE', 'formal_logic')": Language.DEU, "('DE', 'global_facts')": Language.DEU, "('DE', 'high_school_biology')": Language.DEU, "('DE', 'high_school_chemistry')": Language.DEU, 
"('DE', 'high_school_computer_science')": Language.DEU, "('DE', 'high_school_european_history')": Language.DEU, "('DE', 'high_school_geography')": Language.DEU, "('DE', 'high_school_government_and_politics')": Language.DEU, "('DE', 'high_school_macroeconomics')": Language.DEU, "('DE', 'high_school_mathematics')": Language.DEU, "('DE', 'high_school_microeconomics')": Language.DEU, "('DE', 'high_school_physics')": Language.DEU, "('DE', 'high_school_psychology')": Language.DEU, "('DE', 'high_school_statistics')": Language.DEU, "('DE', 'high_school_us_history')": Language.DEU, "('DE', 'high_school_world_history')": Language.DEU, "('DE', 'human_aging')": Language.DEU, "('DE', 'human_sexuality')": Language.DEU, "('DE', 'international_law')": Language.DEU, "('DE', 'jurisprudence')": Language.DEU, "('DE', 'logical_fallacies')": Language.DEU, "('DE', 'machine_learning')": Language.DEU, "('DE', 'management')": Language.DEU, "('DE', 'marketing')": Language.DEU, "('DE', 'medical_genetics')": Language.DEU, "('DE', 'miscellaneous')": Language.DEU, "('DE', 'moral_disputes')": Language.DEU, "('DE', 'moral_scenarios')": Language.DEU, "('DE', 'nutrition')": Language.DEU, "('DE', 'philosophy')": Language.DEU, "('DE', 'prehistory')": Language.DEU, "('DE', 'professional_accounting')": Language.DEU, "('DE', 'professional_law')": Language.DEU, "('DE', 'professional_medicine')": Language.DEU, "('DE', 'professional_psychology')": Language.DEU, "('DE', 'public_relations')": Language.DEU, "('DE', 'security_studies')": Language.DEU, "('DE', 'sociology')": Language.DEU, "('DE', 'us_foreign_policy')": Language.DEU, "('DE', 'virology')": Language.DEU, "('DE', 'world_religions')": Language.DEU, "('ES', 'abstract_algebra')": Language.SPA, "('ES', 'anatomy')": Language.SPA, "('ES', 'astronomy')": Language.SPA, "('ES', 'business_ethics')": Language.SPA, "('ES', 'clinical_knowledge')": Language.SPA, "('ES', 'college_biology')": Language.SPA, "('ES', 'college_chemistry')": Language.SPA, "('ES', 'college_computer_science')": Language.SPA, "('ES', 'college_mathematics')": Language.SPA, "('ES', 'college_medicine')": Language.SPA, "('ES', 'college_physics')": Language.SPA, "('ES', 'computer_security')": Language.SPA, "('ES', 'conceptual_physics')": Language.SPA, "('ES', 'econometrics')": Language.SPA, "('ES', 'electrical_engineering')": Language.SPA, "('ES', 'elementary_mathematics')": Language.SPA, "('ES', 'formal_logic')": Language.SPA, "('ES', 'global_facts')": Language.SPA, "('ES', 'high_school_biology')": Language.SPA, "('ES', 'high_school_chemistry')": Language.SPA, "('ES', 'high_school_computer_science')": Language.SPA, "('ES', 'high_school_european_history')": Language.SPA, "('ES', 'high_school_geography')": Language.SPA, "('ES', 'high_school_government_and_politics')": Language.SPA, "('ES', 'high_school_macroeconomics')": Language.SPA, "('ES', 'high_school_mathematics')": Language.SPA, "('ES', 'high_school_microeconomics')": Language.SPA, "('ES', 'high_school_physics')": Language.SPA, "('ES', 'high_school_psychology')": Language.SPA, "('ES', 'high_school_statistics')": Language.SPA, "('ES', 'high_school_us_history')": Language.SPA, "('ES', 'high_school_world_history')": Language.SPA, "('ES', 'human_aging')": Language.SPA, "('ES', 'human_sexuality')": Language.SPA, "('ES', 'international_law')": Language.SPA, "('ES', 'jurisprudence')": Language.SPA, "('ES', 'logical_fallacies')": Language.SPA, "('ES', 'machine_learning')": Language.SPA, "('ES', 'management')": Language.SPA, "('ES', 'marketing')": Language.SPA, "('ES', 
'medical_genetics')": Language.SPA, "('ES', 'miscellaneous')": Language.SPA, "('ES', 'moral_disputes')": Language.SPA, "('ES', 'moral_scenarios')": Language.SPA, "('ES', 'nutrition')": Language.SPA, "('ES', 'philosophy')": Language.SPA, "('ES', 'prehistory')": Language.SPA, "('ES', 'professional_accounting')": Language.SPA, "('ES', 'professional_law')": Language.SPA, "('ES', 'professional_medicine')": Language.SPA, "('ES', 'professional_psychology')": Language.SPA, "('ES', 'public_relations')": Language.SPA, "('ES', 'security_studies')": Language.SPA, "('ES', 'sociology')": Language.SPA, "('ES', 'us_foreign_policy')": Language.SPA, "('ES', 'virology')": Language.SPA, "('ES', 'world_religions')": Language.SPA, "('FR', 'abstract_algebra')": Language.FRA, "('FR', 'anatomy')": Language.FRA, "('FR', 'astronomy')": Language.FRA, "('FR', 'business_ethics')": Language.FRA, "('FR', 'clinical_knowledge')": Language.FRA, "('FR', 'college_biology')": Language.FRA, "('FR', 'college_chemistry')": Language.FRA, "('FR', 'college_computer_science')": Language.FRA, "('FR', 'college_mathematics')": Language.FRA, "('FR', 'college_medicine')": Language.FRA, "('FR', 'college_physics')": Language.FRA, "('FR', 'computer_security')": Language.FRA, "('FR', 'conceptual_physics')": Language.FRA, "('FR', 'econometrics')": Language.FRA, "('FR', 'electrical_engineering')": Language.FRA, "('FR', 'elementary_mathematics')": Language.FRA, "('FR', 'formal_logic')": Language.FRA, "('FR', 'global_facts')": Language.FRA, "('FR', 'high_school_biology')": Language.FRA, "('FR', 'high_school_chemistry')": Language.FRA, "('FR', 'high_school_computer_science')": Language.FRA, "('FR', 'high_school_european_history')": Language.FRA, "('FR', 'high_school_geography')": Language.FRA, "('FR', 'high_school_government_and_politics')": Language.FRA, "('FR', 'high_school_macroeconomics')": Language.FRA, "('FR', 'high_school_mathematics')": Language.FRA, "('FR', 'high_school_microeconomics')": Language.FRA, "('FR', 'high_school_physics')": Language.FRA, "('FR', 'high_school_psychology')": Language.FRA, "('FR', 'high_school_statistics')": Language.FRA, "('FR', 'high_school_us_history')": Language.FRA, "('FR', 'high_school_world_history')": Language.FRA, "('FR', 'human_aging')": Language.FRA, "('FR', 'human_sexuality')": Language.FRA, "('FR', 'international_law')": Language.FRA, "('FR', 'jurisprudence')": Language.FRA, "('FR', 'logical_fallacies')": Language.FRA, "('FR', 'machine_learning')": Language.FRA, "('FR', 'management')": Language.FRA, "('FR', 'marketing')": Language.FRA, "('FR', 'medical_genetics')": Language.FRA, "('FR', 'miscellaneous')": Language.FRA, "('FR', 'moral_disputes')": Language.FRA, "('FR', 'moral_scenarios')": Language.FRA, "('FR', 'nutrition')": Language.FRA, "('FR', 'philosophy')": Language.FRA, "('FR', 'prehistory')": Language.FRA, "('FR', 'professional_accounting')": Language.FRA, "('FR', 'professional_law')": Language.FRA, "('FR', 'professional_medicine')": Language.FRA, "('FR', 'professional_psychology')": Language.FRA, "('FR', 'public_relations')": Language.FRA, "('FR', 'security_studies')": Language.FRA, "('FR', 'sociology')": Language.FRA, "('FR', 'us_foreign_policy')": Language.FRA, "('FR', 'virology')": Language.FRA, "('FR', 'world_religions')": Language.FRA, "('IT', 'abstract_algebra')": Language.ITA, "('IT', 'anatomy')": Language.ITA, "('IT', 'astronomy')": Language.ITA, "('IT', 'business_ethics')": Language.ITA, "('IT', 'clinical_knowledge')": Language.ITA, "('IT', 'college_biology')": Language.ITA, "('IT', 
'college_chemistry')": Language.ITA, "('IT', 'college_computer_science')": Language.ITA, "('IT', 'college_mathematics')": Language.ITA, "('IT', 'college_medicine')": Language.ITA, "('IT', 'college_physics')": Language.ITA, "('IT', 'computer_security')": Language.ITA, "('IT', 'conceptual_physics')": Language.ITA, "('IT', 'econometrics')": Language.ITA, "('IT', 'electrical_engineering')": Language.ITA, "('IT', 'elementary_mathematics')": Language.ITA, "('IT', 'formal_logic')": Language.ITA, "('IT', 'global_facts')": Language.ITA, "('IT', 'high_school_biology')": Language.ITA, "('IT', 'high_school_chemistry')": Language.ITA, "('IT', 'high_school_computer_science')": Language.ITA, "('IT', 'high_school_european_history')": Language.ITA, "('IT', 'high_school_geography')": Language.ITA, "('IT', 'high_school_government_and_politics')": Language.ITA, "('IT', 'high_school_macroeconomics')": Language.ITA, "('IT', 'high_school_mathematics')": Language.ITA, "('IT', 'high_school_microeconomics')": Language.ITA, "('IT', 'high_school_physics')": Language.ITA, "('IT', 'high_school_psychology')": Language.ITA, "('IT', 'high_school_statistics')": Language.ITA, "('IT', 'high_school_us_history')": Language.ITA, "('IT', 'high_school_world_history')": Language.ITA, "('IT', 'human_aging')": Language.ITA, "('IT', 'human_sexuality')": Language.ITA, "('IT', 'international_law')": Language.ITA, "('IT', 'jurisprudence')": Language.ITA, "('IT', 'logical_fallacies')": Language.ITA, "('IT', 'machine_learning')": Language.ITA, "('IT', 'management')": Language.ITA, "('IT', 'marketing')": Language.ITA, "('IT', 'medical_genetics')": Language.ITA, "('IT', 'miscellaneous')": Language.ITA, "('IT', 'moral_disputes')": Language.ITA, "('IT', 'moral_scenarios')": Language.ITA, "('IT', 'nutrition')": Language.ITA, "('IT', 'philosophy')": Language.ITA, "('IT', 'prehistory')": Language.ITA, "('IT', 'professional_accounting')": Language.ITA, "('IT', 'professional_law')": Language.ITA, "('IT', 'professional_medicine')": Language.ITA, "('IT', 'professional_psychology')": Language.ITA, "('IT', 'public_relations')": Language.ITA, "('IT', 'security_studies')": Language.ITA, "('IT', 'sociology')": Language.ITA, "('IT', 'us_foreign_policy')": Language.ITA, "('IT', 'virology')": Language.ITA, "('IT', 'world_religions')": Language.ITA, "('PT', 'abstract_algebra')": Language.POR, "('PT', 'anatomy')": Language.POR, "('PT', 'astronomy')": Language.POR, "('PT', 'business_ethics')": Language.POR, "('PT', 'clinical_knowledge')": Language.POR, "('PT', 'college_biology')": Language.POR, "('PT', 'college_chemistry')": Language.POR, "('PT', 'college_computer_science')": Language.POR, "('PT', 'college_mathematics')": Language.POR, "('PT', 'college_medicine')": Language.POR, "('PT', 'college_physics')": Language.POR, "('PT', 'computer_security')": Language.POR, "('PT', 'conceptual_physics')": Language.POR, "('PT', 'econometrics')": Language.POR, "('PT', 'electrical_engineering')": Language.POR, "('PT', 'elementary_mathematics')": Language.POR, "('PT', 'formal_logic')": Language.POR, "('PT', 'global_facts')": Language.POR, "('PT', 'high_school_biology')": Language.POR, "('PT', 'high_school_chemistry')": Language.POR, "('PT', 'high_school_computer_science')": Language.POR, "('PT', 'high_school_european_history')": Language.POR, "('PT', 'high_school_geography')": Language.POR, "('PT', 'high_school_government_and_politics')": Language.POR, "('PT', 'high_school_macroeconomics')": Language.POR, "('PT', 'high_school_mathematics')": Language.POR, "('PT', 
'high_school_microeconomics')": Language.POR, "('PT', 'high_school_physics')": Language.POR, "('PT', 'high_school_psychology')": Language.POR, "('PT', 'high_school_statistics')": Language.POR, "('PT', 'high_school_us_history')": Language.POR, "('PT', 'high_school_world_history')": Language.POR, "('PT', 'human_aging')": Language.POR, "('PT', 'human_sexuality')": Language.POR, "('PT', 'international_law')": Language.POR, "('PT', 'jurisprudence')": Language.POR, "('PT', 'logical_fallacies')": Language.POR, "('PT', 'machine_learning')": Language.POR, "('PT', 'management')": Language.POR, "('PT', 'marketing')": Language.POR, "('PT', 'medical_genetics')": Language.POR, "('PT', 'miscellaneous')": Language.POR, "('PT', 'moral_disputes')": Language.POR, "('PT', 'moral_scenarios')": Language.POR, "('PT', 'nutrition')": Language.POR, "('PT', 'philosophy')": Language.POR, "('PT', 'prehistory')": Language.POR, "('PT', 'professional_accounting')": Language.POR, "('PT', 'professional_law')": Language.POR, "('PT', 'professional_medicine')": Language.POR, "('PT', 'professional_psychology')": Language.POR, "('PT', 'public_relations')": Language.POR, "('PT', 'security_studies')": Language.POR, "('PT', 'sociology')": Language.POR, "('PT', 'us_foreign_policy')": Language.POR, "('PT', 'virology')": Language.POR, "('PT', 'world_religions')": Language.POR}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'MMMLU'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = [('FR_FR', 'abstract_algebra'), ('FR_FR', 'anatomy'), ('FR_FR', 'astronomy'), ('FR_FR', 'business_ethics'), ('FR_FR', 'clinical_knowledge'), ('FR_FR', 'college_biology'), ('FR_FR', 'college_chemistry'), ('FR_FR', 'college_computer_science'), ('FR_FR', 'college_mathematics'), ('FR_FR', 'college_medicine'), ('FR_FR', 'college_physics'), ('FR_FR', 'computer_security'), ('FR_FR', 'conceptual_physics'), ('FR_FR', 'econometrics'), ('FR_FR', 'electrical_engineering'), ('FR_FR', 'elementary_mathematics'), ('FR_FR', 'formal_logic'), ('FR_FR', 'global_facts'), ('FR_FR', 'high_school_biology'), ('FR_FR', 'high_school_chemistry'), ('FR_FR', 'high_school_computer_science'), ('FR_FR', 'high_school_european_history'), ('FR_FR', 'high_school_geography'), ('FR_FR', 'high_school_government_and_politics'), ('FR_FR', 'high_school_macroeconomics'), ('FR_FR', 'high_school_mathematics'), ('FR_FR', 'high_school_microeconomics'), ('FR_FR', 'high_school_physics'), ('FR_FR', 'high_school_psychology'), ('FR_FR', 'high_school_statistics'), ('FR_FR', 'high_school_us_history'), ('FR_FR', 'high_school_world_history'), ('FR_FR', 'human_aging'), ('FR_FR', 'human_sexuality'), ('FR_FR', 'international_law'), ('FR_FR', 'jurisprudence'), ('FR_FR', 'logical_fallacies'), ('FR_FR', 'machine_learning'), ('FR_FR', 'management'), ('FR_FR', 'marketing'), ('FR_FR', 'medical_genetics'), ('FR_FR', 'miscellaneous'), ('FR_FR', 'moral_disputes'), ('FR_FR', 'moral_scenarios'), ('FR_FR', 'nutrition'), ('FR_FR', 'philosophy'), ('FR_FR', 'prehistory'), ('FR_FR', 'professional_accounting'), ('FR_FR', 'professional_law'), ('FR_FR', 'professional_medicine'), ('FR_FR', 'professional_psychology'), ('FR_FR', 'public_relations'), ('FR_FR', 'security_studies'), ('FR_FR', 'sociology'), ('FR_FR', 'us_foreign_policy'), ('FR_FR', 'virology'), ('FR_FR', 'world_religions'), ('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 
'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions'), ('ES_LA', 'abstract_algebra'), ('ES_LA', 'anatomy'), ('ES_LA', 'astronomy'), ('ES_LA', 'business_ethics'), ('ES_LA', 'clinical_knowledge'), ('ES_LA', 'college_biology'), ('ES_LA', 'college_chemistry'), ('ES_LA', 'college_computer_science'), ('ES_LA', 'college_mathematics'), ('ES_LA', 'college_medicine'), ('ES_LA', 'college_physics'), ('ES_LA', 'computer_security'), ('ES_LA', 'conceptual_physics'), ('ES_LA', 'econometrics'), ('ES_LA', 'electrical_engineering'), ('ES_LA', 'elementary_mathematics'), ('ES_LA', 'formal_logic'), ('ES_LA', 'global_facts'), ('ES_LA', 'high_school_biology'), ('ES_LA', 'high_school_chemistry'), ('ES_LA', 'high_school_computer_science'), ('ES_LA', 'high_school_european_history'), ('ES_LA', 'high_school_geography'), ('ES_LA', 'high_school_government_and_politics'), ('ES_LA', 'high_school_macroeconomics'), ('ES_LA', 'high_school_mathematics'), ('ES_LA', 'high_school_microeconomics'), ('ES_LA', 'high_school_physics'), ('ES_LA', 'high_school_psychology'), ('ES_LA', 'high_school_statistics'), ('ES_LA', 'high_school_us_history'), ('ES_LA', 'high_school_world_history'), ('ES_LA', 'human_aging'), ('ES_LA', 'human_sexuality'), ('ES_LA', 'international_law'), ('ES_LA', 'jurisprudence'), ('ES_LA', 'logical_fallacies'), ('ES_LA', 'machine_learning'), ('ES_LA', 'management'), ('ES_LA', 'marketing'), ('ES_LA', 'medical_genetics'), ('ES_LA', 'miscellaneous'), ('ES_LA', 'moral_disputes'), ('ES_LA', 'moral_scenarios'), ('ES_LA', 'nutrition'), ('ES_LA', 'philosophy'), ('ES_LA', 'prehistory'), ('ES_LA', 'professional_accounting'), ('ES_LA', 'professional_law'), ('ES_LA', 'professional_medicine'), ('ES_LA', 'professional_psychology'), ('ES_LA', 'public_relations'), ('ES_LA', 'security_studies'), ('ES_LA', 'sociology'), ('ES_LA', 'us_foreign_policy'), ('ES_LA', 'virology'), ('ES_LA', 'world_religions'), ('IT_IT', 'abstract_algebra'), ('IT_IT', 'anatomy'), ('IT_IT', 'astronomy'), ('IT_IT', 'business_ethics'), ('IT_IT', 'clinical_knowledge'), ('IT_IT', 'college_biology'), ('IT_IT', 'college_chemistry'), ('IT_IT', 'college_computer_science'), ('IT_IT', 'college_mathematics'), ('IT_IT', 'college_medicine'), ('IT_IT', 'college_physics'), ('IT_IT', 'computer_security'), ('IT_IT', 'conceptual_physics'), ('IT_IT', 'econometrics'), ('IT_IT', 'electrical_engineering'), ('IT_IT', 'elementary_mathematics'), ('IT_IT', 'formal_logic'), ('IT_IT', 'global_facts'), ('IT_IT', 'high_school_biology'), ('IT_IT', 'high_school_chemistry'), ('IT_IT', 'high_school_computer_science'), ('IT_IT', 'high_school_european_history'), ('IT_IT', 'high_school_geography'), ('IT_IT', 'high_school_government_and_politics'), ('IT_IT', 'high_school_macroeconomics'), ('IT_IT', 'high_school_mathematics'), ('IT_IT', 'high_school_microeconomics'), ('IT_IT', 'high_school_physics'), ('IT_IT', 'high_school_psychology'), ('IT_IT', 'high_school_statistics'), ('IT_IT', 'high_school_us_history'), ('IT_IT', 'high_school_world_history'), ('IT_IT', 'human_aging'), ('IT_IT', 'human_sexuality'), ('IT_IT', 'international_law'), ('IT_IT', 'jurisprudence'), ('IT_IT', 'logical_fallacies'), ('IT_IT', 'machine_learning'), ('IT_IT', 'management'), ('IT_IT', 'marketing'), ('IT_IT', 'medical_genetics'), ('IT_IT', 'miscellaneous'), ('IT_IT', 'moral_disputes'), ('IT_IT', 'moral_scenarios'), ('IT_IT', 'nutrition'), ('IT_IT', 'philosophy'), ('IT_IT', 'prehistory'), ('IT_IT', 
'professional_accounting'), ('IT_IT', 'professional_law'), ('IT_IT', 'professional_medicine'), ('IT_IT', 'professional_psychology'), ('IT_IT', 'public_relations'), ('IT_IT', 'security_studies'), ('IT_IT', 'sociology'), ('IT_IT', 'us_foreign_policy'), ('IT_IT', 'virology'), ('IT_IT', 'world_religions'), ('PT_BR', 'abstract_algebra'), ('PT_BR', 'anatomy'), ('PT_BR', 'astronomy'), ('PT_BR', 'business_ethics'), ('PT_BR', 'clinical_knowledge'), ('PT_BR', 'college_biology'), ('PT_BR', 'college_chemistry'), ('PT_BR', 'college_computer_science'), ('PT_BR', 'college_mathematics'), ('PT_BR', 'college_medicine'), ('PT_BR', 'college_physics'), ('PT_BR', 'computer_security'), ('PT_BR', 'conceptual_physics'), ('PT_BR', 'econometrics'), ('PT_BR', 'electrical_engineering'), ('PT_BR', 'elementary_mathematics'), ('PT_BR', 'formal_logic'), ('PT_BR', 'global_facts'), ('PT_BR', 'high_school_biology'), ('PT_BR', 'high_school_chemistry'), ('PT_BR', 'high_school_computer_science'), ('PT_BR', 'high_school_european_history'), ('PT_BR', 'high_school_geography'), ('PT_BR', 'high_school_government_and_politics'), ('PT_BR', 'high_school_macroeconomics'), ('PT_BR', 'high_school_mathematics'), ('PT_BR', 'high_school_microeconomics'), ('PT_BR', 'high_school_physics'), ('PT_BR', 'high_school_psychology'), ('PT_BR', 'high_school_statistics'), ('PT_BR', 'high_school_us_history'), ('PT_BR', 'high_school_world_history'), ('PT_BR', 'human_aging'), ('PT_BR', 'human_sexuality'), ('PT_BR', 'international_law'), ('PT_BR', 'jurisprudence'), ('PT_BR', 'logical_fallacies'), ('PT_BR', 'machine_learning'), ('PT_BR', 'management'), ('PT_BR', 'marketing'), ('PT_BR', 'medical_genetics'), ('PT_BR', 'miscellaneous'), ('PT_BR', 'moral_disputes'), ('PT_BR', 'moral_scenarios'), ('PT_BR', 'nutrition'), ('PT_BR', 'philosophy'), ('PT_BR', 'prehistory'), ('PT_BR', 'professional_accounting'), ('PT_BR', 'professional_law'), ('PT_BR', 'professional_medicine'), ('PT_BR', 'professional_psychology'), ('PT_BR', 'public_relations'), ('PT_BR', 'security_studies'), ('PT_BR', 'sociology'), ('PT_BR', 'us_foreign_policy'), ('PT_BR', 'virology'), ('PT_BR', 'world_religions'), ('AR_XY', 'abstract_algebra'), ('AR_XY', 'anatomy'), ('AR_XY', 'astronomy'), ('AR_XY', 'business_ethics'), ('AR_XY', 'clinical_knowledge'), ('AR_XY', 'college_biology'), ('AR_XY', 'college_chemistry'), ('AR_XY', 'college_computer_science'), ('AR_XY', 'college_mathematics'), ('AR_XY', 'college_medicine'), ('AR_XY', 'college_physics'), ('AR_XY', 'computer_security'), ('AR_XY', 'conceptual_physics'), ('AR_XY', 'econometrics'), ('AR_XY', 'electrical_engineering'), ('AR_XY', 'elementary_mathematics'), ('AR_XY', 'formal_logic'), ('AR_XY', 'global_facts'), ('AR_XY', 'high_school_biology'), ('AR_XY', 'high_school_chemistry'), ('AR_XY', 'high_school_computer_science'), ('AR_XY', 'high_school_european_history'), ('AR_XY', 'high_school_geography'), ('AR_XY', 'high_school_government_and_politics'), ('AR_XY', 'high_school_macroeconomics'), ('AR_XY', 'high_school_mathematics'), ('AR_XY', 'high_school_microeconomics'), ('AR_XY', 'high_school_physics'), ('AR_XY', 'high_school_psychology'), ('AR_XY', 'high_school_statistics'), ('AR_XY', 'high_school_us_history'), ('AR_XY', 'high_school_world_history'), ('AR_XY', 'human_aging'), ('AR_XY', 'human_sexuality'), ('AR_XY', 'international_law'), ('AR_XY', 'jurisprudence'), ('AR_XY', 'logical_fallacies'), ('AR_XY', 'machine_learning'), ('AR_XY', 'management'), ('AR_XY', 'marketing'), ('AR_XY', 'medical_genetics'), ('AR_XY', 'miscellaneous'), ('AR_XY', 
'moral_disputes'), ('AR_XY', 'moral_scenarios'), ('AR_XY', 'nutrition'), ('AR_XY', 'philosophy'), ('AR_XY', 'prehistory'), ('AR_XY', 'professional_accounting'), ('AR_XY', 'professional_law'), ('AR_XY', 'professional_medicine'), ('AR_XY', 'professional_psychology'), ('AR_XY', 'public_relations'), ('AR_XY', 'security_studies'), ('AR_XY', 'sociology'), ('AR_XY', 'us_foreign_policy'), ('AR_XY', 'virology'), ('AR_XY', 'world_religions')]
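
For illustration only (not part of the generated reference), the task can be constructed with the documented num_fewshot argument, and its SUBJECTS entries pair a locale code with an MMLU subject name:

    from eval_framework.tasks.benchmarks.mmmlu import MMMLU

    # Instantiate the multilingual MMLU task with five few-shot examples.
    task = MMMLU(num_fewshot=5)

    # SUBJECTS pairs a locale with an MMLU subject, e.g. ('DE_DE', 'astronomy');
    # here we collect only the German subjects.
    german_subjects = [s for s in MMMLU.SUBJECTS if s[0] == 'DE_DE']

How subjects are selected and iterated during an evaluation run is defined by BaseTask and is not shown in this excerpt.
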
class eval_framework.tasks.benchmarks.mmmlu.MMMLU_GERMAN_COT(num_fewshot=0)[source]

Bases: MMMLU

Parameters:

num_fewshot (int)

ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('de', 'abstract_algebra')": Language.DEU, "('de', 'anatomy')": Language.DEU, "('de', 'astronomy')": Language.DEU, "('de', 'business_ethics')": Language.DEU, "('de', 'clinical_knowledge')": Language.DEU, "('de', 'college_biology')": Language.DEU, "('de', 'college_chemistry')": Language.DEU, "('de', 'college_computer_science')": Language.DEU, "('de', 'college_mathematics')": Language.DEU, "('de', 'college_medicine')": Language.DEU, "('de', 'college_physics')": Language.DEU, "('de', 'computer_security')": Language.DEU, "('de', 'conceptual_physics')": Language.DEU, "('de', 'econometrics')": Language.DEU, "('de', 'electrical_engineering')": Language.DEU, "('de', 'elementary_mathematics')": Language.DEU, "('de', 'formal_logic')": Language.DEU, "('de', 'global_facts')": Language.DEU, "('de', 'high_school_biology')": Language.DEU, "('de', 'high_school_chemistry')": Language.DEU, "('de', 'high_school_computer_science')": Language.DEU, "('de', 'high_school_european_history')": Language.DEU, "('de', 'high_school_geography')": Language.DEU, "('de', 'high_school_government_and_politics')": Language.DEU, "('de', 'high_school_macroeconomics')": Language.DEU, "('de', 'high_school_mathematics')": Language.DEU, "('de', 'high_school_microeconomics')": Language.DEU, "('de', 'high_school_physics')": Language.DEU, "('de', 'high_school_psychology')": Language.DEU, "('de', 'high_school_statistics')": Language.DEU, "('de', 'high_school_us_history')": Language.DEU, "('de', 'high_school_world_history')": Language.DEU, "('de', 'human_aging')": Language.DEU, "('de', 'human_sexuality')": Language.DEU, "('de', 'international_law')": Language.DEU, "('de', 'jurisprudence')": Language.DEU, "('de', 'logical_fallacies')": Language.DEU, "('de', 'machine_learning')": Language.DEU, "('de', 'management')": Language.DEU, "('de', 'marketing')": Language.DEU, "('de', 'medical_genetics')": Language.DEU, "('de', 'miscellaneous')": Language.DEU, "('de', 'moral_disputes')": Language.DEU, "('de', 'moral_scenarios')": Language.DEU, "('de', 'nutrition')": Language.DEU, "('de', 'philosophy')": Language.DEU, "('de', 'prehistory')": Language.DEU, "('de', 'professional_accounting')": Language.DEU, "('de', 'professional_law')": Language.DEU, "('de', 'professional_medicine')": Language.DEU, "('de', 'professional_psychology')": Language.DEU, "('de', 'public_relations')": Language.DEU, "('de', 'security_studies')": Language.DEU, "('de', 'sociology')": Language.DEU, "('de', 'us_foreign_policy')": Language.DEU, "('de', 'virology')": Language.DEU, "('de', 'world_religions')": Language.DEU}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.GermanCompletionChecker'>]
NAME: str = 'MMMLU_GERMAN_COT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'Question', 'Answer', 'A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'completion'
SUBJECTS: list[SubjectType] = [('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions')]
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
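
A minimal sketch of how the ANS_RE pattern recovers the final answer letter from a German chain-of-thought completion (the completion text below is invented for illustration; post_process_generated_completion presumably applies this pattern, but its exact behaviour is not spelled out here):

    import re

    ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')

    completion = (
        "Paris ist die Hauptstadt von Frankreich. "
        "Daher lautet die Antwort: C"
    )
    match = ANS_RE.search(completion)
    answer = match.group(1) if match else None  # -> 'C'
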

eval_framework.tasks.benchmarks.openbookqa module

class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA(num_fewshot=0)[source]

Bases: BaseTask[str]

OpenBookQA dataset: https://huggingface.co/datasets/allenai/openbookqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/openbookqa'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'OpenBookQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['additional']
class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_EVAL_HARNESS(num_fewshot=0)[source]

Bases: OPENBOOKQA

Closed-book version of OpenBookQA — question only, no supporting fact.

Parameters:

num_fewshot (int)

NAME: str = 'OpenBookQAEvalHarness'
class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_IDK(num_fewshot=0)[source]

Bases: OPENBOOKQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'OpenBookQA_IDK'

eval_framework.tasks.benchmarks.opengptx_eu20 module

class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_DE(num_fewshot=0)[source]

Bases: ARC

EU20 Benchmarks from the openGPT-X paper:
- paper: https://arxiv.org/abs/2410.08928
- leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard

https://huggingface.co/datasets/openGPT-X/arcx

entries in ‘challenge_DE’: 1172 test, 299 validation, 198 train
entries in ‘easy_DE’: 2376 test, 570 validation, 197 train

features: [‘id’, ‘question’, ‘choices’, ‘answerKey’],

SUBJECTS = [‘challenge_BG’, ‘easy_BG’, ‘challenge_DA’, ‘easy_DA’, ‘challenge_DE’, ‘easy_DE’, ‘challenge_ET’, ‘easy_ET’, ‘challenge_FI’, ‘easy_FI’, ‘challenge_FR’, ‘easy_FR’, ‘challenge_EL’, ‘easy_EL’, ‘challenge_IT’, ‘easy_IT’, ‘challenge_LV’, ‘easy_LV’, ‘challenge_LT’, ‘easy_LT’, ‘challenge_NL’, ‘easy_NL’, ‘challenge_PL’, ‘easy_PL’, ‘challenge_PT-PT’, ‘easy_PT-PT’, ‘challenge_RO’, ‘easy_RO’, ‘challenge_SV’, ‘easy_SV’, ‘challenge_SK’, ‘easy_SK’, ‘challenge_SL’, ‘easy_SL’, ‘challenge_ES’, ‘easy_ES’, ‘challenge_CS’, ‘easy_CS’, ‘challenge_HU’, ‘easy_HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/arcx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'ARC_EU20_DE'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['challenge_DE', 'easy_DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_FR(num_fewshot=0)[source]

Bases: ARC

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/arcx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'ARC_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['challenge_FR', 'easy_FR']
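
The German and French variants above differ only in LANGUAGE, NAME and the selected SUBJECTS splits. As a purely hypothetical illustration, an additional language variant could be defined the same way (the class and attribute values below are assumptions modeled on ARC_EU20_DE and ARC_EU20_FR, not part of the framework):

    from eval_framework.tasks.benchmarks.arc import ARC

    class ARC_EU20_IT(ARC):
        # Hypothetical Italian variant; 'challenge_IT' and 'easy_IT' are taken
        # from the SUBJECTS list documented for the openGPT-X/arcx dataset.
        DATASET_PATH = 'openGPT-X/arcx'
        FEWSHOT_SPLIT = 'train'
        LANGUAGE = 'Italian'  # shown as a plain string here; the real classes
                              # use the framework's Language type
        NAME = 'ARC_EU20_IT'
        SAMPLE_SPLIT = 'test'
        SUBJECTS = ['challenge_IT', 'easy_IT']
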
class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_DE(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

https://huggingface.co/datasets/openGPT-X/gsm8kx
entries in ‘DE’: 1319 test, 104 train

features: [‘question’, ‘answer’, ‘id’],

SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/gsm8kx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'GSM8K_EU20_DE'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_FR(num_fewshot=0)[source]

Bases: GSM8KEvalHarness

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/gsm8kx'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'GSM8K_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_DE(num_fewshot=0)[source]

Bases: HELLASWAG

https://huggingface.co/datasets/openGPT-X/hellaswagx
entries in ‘DE’: 99 train, 9979 validation

features: [‘ind’, ‘activity_label’, ‘ctx_a’, ‘ctx_b’, ‘ctx’, ‘endings’, ‘source_id’, ‘split’, ‘split_type’, ‘label’],

SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/hellaswagx'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'HellaSwag_EU20_DE'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_FR(num_fewshot=0)[source]

Bases: HELLASWAG

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/hellaswagx'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'HellaSwag_EU20_FR'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_DE(num_fewshot=0)[source]

Bases: MMLU

https://huggingface.co/datasets/openGPT-X/mmlux
entries in ‘philosophy_DE’: 311 test, 5 dev, 5 validation

features: [‘question’, ‘choices’, ‘answer’, ‘id’],

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/mmlux'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'MMLU_EU20_DE'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D', 'Frage']
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra_DE', 'anatomy_DE', 'astronomy_DE', 'business_ethics_DE', 'clinical_knowledge_DE', 'college_biology_DE', 'college_chemistry_DE', 'college_computer_science_DE', 'college_mathematics_DE', 'college_medicine_DE', 'college_physics_DE', 'computer_security_DE', 'conceptual_physics_DE', 'econometrics_DE', 'electrical_engineering_DE', 'elementary_mathematics_DE', 'formal_logic_DE', 'global_facts_DE', 'high_school_biology_DE', 'high_school_chemistry_DE', 'high_school_computer_science_DE', 'high_school_european_history_DE', 'high_school_geography_DE', 'high_school_government_and_politics_DE', 'high_school_macroeconomics_DE', 'high_school_mathematics_DE', 'high_school_microeconomics_DE', 'high_school_physics_DE', 'high_school_psychology_DE', 'high_school_statistics_DE', 'high_school_us_history_DE', 'high_school_world_history_DE', 'human_aging_DE', 'human_sexuality_DE', 'international_law_DE', 'jurisprudence_DE', 'logical_fallacies_DE', 'machine_learning_DE', 'management_DE', 'marketing_DE', 'medical_genetics_DE', 'miscellaneous_DE', 'moral_disputes_DE', 'moral_scenarios_DE', 'nutrition_DE', 'philosophy_DE', 'prehistory_DE', 'professional_accounting_DE', 'professional_law_DE', 'professional_medicine_DE', 'professional_psychology_DE', 'public_relations_DE', 'security_studies_DE', 'sociology_DE', 'us_foreign_policy_DE', 'virology_DE', 'world_religions_DE']
class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_FR(num_fewshot=0)[source]

Bases: MMLU

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/mmlux'
FEWSHOT_SPLIT: str = 'dev'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'MMLU_EU20_FR'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['abstract_algebra_FR', 'anatomy_FR', 'astronomy_FR', 'business_ethics_FR', 'clinical_knowledge_FR', 'college_biology_FR', 'college_chemistry_FR', 'college_computer_science_FR', 'college_mathematics_FR', 'college_medicine_FR', 'college_physics_FR', 'computer_security_FR', 'conceptual_physics_FR', 'econometrics_FR', 'electrical_engineering_FR', 'elementary_mathematics_FR', 'formal_logic_FR', 'global_facts_FR', 'high_school_biology_FR', 'high_school_chemistry_FR', 'high_school_computer_science_FR', 'high_school_european_history_FR', 'high_school_geography_FR', 'high_school_government_and_politics_FR', 'high_school_macroeconomics_FR', 'high_school_mathematics_FR', 'high_school_microeconomics_FR', 'high_school_physics_FR', 'high_school_psychology_FR', 'high_school_statistics_FR', 'high_school_us_history_FR', 'high_school_world_history_FR', 'human_aging_FR', 'human_sexuality_FR', 'international_law_FR', 'jurisprudence_FR', 'logical_fallacies_FR', 'machine_learning_FR', 'management_FR', 'marketing_FR', 'medical_genetics_FR', 'miscellaneous_FR', 'moral_disputes_FR', 'moral_scenarios_FR', 'nutrition_FR', 'philosophy_FR', 'prehistory_FR', 'professional_accounting_FR', 'professional_law_FR', 'professional_medicine_FR', 'professional_psychology_FR', 'public_relations_FR', 'security_studies_FR', 'sociology_FR', 'us_foreign_policy_FR', 'virology_FR', 'world_religions_FR']
class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_DE(num_fewshot=0)[source]

Bases: TRUTHFULQA

https://huggingface.co/datasets/openGPT-X/truthfulqax
entries in ‘mc_DE’: 817 validation

features: [‘question’, ‘mc1_targets’, ‘mc2_targets’, ‘id’],

entries in ‘gen_DE’: 817 validation

features: [‘type’, ‘category’, ‘question’, ‘best_answer’, ‘correct_answers’, ‘incorrect_answers’, ‘source’, ‘id’],

SUBJECTS = [‘mc_BG’, ‘gen_BG’, ‘mc_DA’, ‘gen_DA’, ‘mc_DE’, ‘gen_DE’, ‘mc_ET’, ‘gen_ET’, ‘mc_FI’, ‘gen_FI’, ‘mc_FR’, ‘gen_FR’, ‘mc_EL’, ‘gen_EL’, ‘mc_IT’, ‘gen_IT’, ‘mc_LV’, ‘gen_LV’, ‘mc_LT’, ‘gen_LT’, ‘mc_NL’, ‘gen_NL’, ‘mc_PL’, ‘gen_PL’, ‘mc_PT-PT’, ‘gen_PT-PT’, ‘mc_RO’, ‘gen_RO’, ‘mc_SV’, ‘gen_SV’, ‘mc_SK’, ‘gen_SK’, ‘mc_SL’, ‘gen_SL’, ‘mc_ES’, ‘gen_ES’, ‘mc_CS’, ‘gen_CS’, ‘mc_HU’, ‘gen_HU’]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/truthfulqax'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
NAME: str = 'TruthfulQA_EU20_DE'
class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_FR(num_fewshot=0)[source]

Bases: TRUTHFULQA

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'openGPT-X/truthfulqax'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
NAME: str = 'TruthfulQA_EU20_FR'

eval_framework.tasks.benchmarks.pawsx module

class eval_framework.tasks.benchmarks.pawsx.PAWSX(num_fewshot=0)[source]

Bases: BaseTask[str]

PAWS-X dataset: https://huggingface.co/datasets/google-research-datasets/paws-x, used as suggested in the PARAPHRASUS benchmark (https://arxiv.org/pdf/2409.12060).

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'google-research-datasets/paws-x'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de': Language.DEU, 'en': Language.ENG}
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]
NAME: str = 'PAWS-X'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Ja', 'Nein', 'Paraphrasen', 'Yes', 'No', 'paraphrases']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['en', 'de']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
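
The behaviour of post_process_generated_completion is not documented in this reference. Conceptually, the free-form completion has to be reduced to one of the expected labels before AccuracyCompletion is applied; a purely illustrative sketch (the label handling below is an assumption based on PERTURBATION_UNMODIFIABLE_WORDS, not the framework's actual code):

    def normalize_pawsx_completion(completion_text: str) -> str:
        # Illustrative only: reduce a free-form completion to its leading
        # "Yes"/"No" ("Ja"/"Nein") judgement; the framework's real
        # post-processing may differ.
        stripped = completion_text.strip()
        if not stripped:
            return stripped
        return stripped.split()[0].strip('.,:;!').capitalize()
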

eval_framework.tasks.benchmarks.piqa module

class eval_framework.tasks.benchmarks.piqa.PIQA(num_fewshot=0)[source]

Bases: BaseTask[str]

PIQA dataset: https://huggingface.co/datasets/ybisk/piqa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'ybisk/piqa'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'PIQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.piqa.PIQA_IDK(num_fewshot=0)[source]

Bases: PIQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'PIQA_IDK'

eval_framework.tasks.benchmarks.quality module

class eval_framework.tasks.benchmarks.quality.QUALITY(num_fewshot=0)[source]

Bases: BaseTask[str]

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'emozilla/quality'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'QuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Article', 'Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['hard', 'easy']

eval_framework.tasks.benchmarks.sciq module

class eval_framework.tasks.benchmarks.sciq.SCIQ(num_fewshot=0)[source]

Bases: BaseTask[str]

SciQ dataset: https://huggingface.co/datasets/allenai/sciq

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/sciq'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'SciQ'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness(num_fewshot=0)[source]

Bases: SCIQ

Based on https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml#L8. In the Eval Harness implementation, the instruction text includes a context passage. This passage often contains the answer, reducing the benchmark to a straightforward copy-and-paste task.

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'allenai/sciq'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'SciQ Eval Harness'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness_IDK(num_fewshot=0)[source]

Bases: SCIQEvalHarness

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'SciQ Eval Harness_IDK'
class eval_framework.tasks.benchmarks.sciq.SCIQ_IDK(num_fewshot=0)[source]

Bases: SCIQ

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'SciQ_IDK'

eval_framework.tasks.benchmarks.sphyr module

class eval_framework.tasks.benchmarks.sphyr.SPHYR(num_fewshot=0)[source]

Bases: BaseTask[str]

SPhyR dataset: https://huggingface.co/datasets/philippds/SPhyR

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'philippds/SPhyR'
FEWSHOT_SPLIT: str = ''
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.grid_difference.GridDifference'>]
NAME: str = 'SPHYR'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['1_random_cell_easy', '5_random_cell_easy', '10_random_cell_easy', '1_random_row_easy', '3_random_row_easy', '1_random_column_easy', '3_random_column_easy', 'full_easy', '1_random_cell_hard', '5_random_cell_hard', '10_random_cell_hard', '1_random_row_hard', '3_random_row_hard', '1_random_column_hard', '3_random_column_hard', 'full_hard']

eval_framework.tasks.benchmarks.squad module

class eval_framework.tasks.benchmarks.squad.SQUAD(num_fewshot=0)[source]

Bases: SQUAD2

SQuAD dataset: https://huggingface.co/datasets/rajpurkar/squad

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'rajpurkar/squad'
NAME: str = 'SQuAD'
class eval_framework.tasks.benchmarks.squad.SQUAD2(num_fewshot=0)[source]

Bases: BaseTask[str]

SQuAD v2 dataset: https://huggingface.co/datasets/rajpurkar/squad_v2

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'rajpurkar/squad_v2'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'SQuAD2'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'Context', 'unanswerable']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['no_subject']
UNANSWERABLE_STR = 'unanswerable'
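
UNANSWERABLE_STR is the literal target used for SQuAD v2 questions that have no answer in their context. A minimal sketch of that convention (the helper below is hypothetical; how the framework actually builds its targets is not shown here):

    UNANSWERABLE_STR = 'unanswerable'

    def squad2_target(gold_answers: list[str]) -> str:
        # SQuAD v2 items come with a possibly empty list of gold answer texts;
        # an empty list marks the question as unanswerable.
        return gold_answers[0] if gold_answers else UNANSWERABLE_STR
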

eval_framework.tasks.benchmarks.struct_eval module

class eval_framework.tasks.benchmarks.struct_eval.RenderableStructEval(num_fewshot=0)[source]

Bases: StructEval

StructEval variant for conversion tasks whose outputs (HTML) can be rendered visually.

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric'>]
NAME: str = 'RenderableStructEval'
SUBJECTS: list[SubjectType] = ['Convert Markdown to HTML', 'Convert React to HTML', 'Convert Vue to HTML', 'Text to HTML']
class eval_framework.tasks.benchmarks.struct_eval.StructEval(num_fewshot=0)[source]

Bases: BaseTask[str]

StructEval task: https://tiger-ai-lab.github.io/StructEval/

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'TIGER-Lab/StructEval'
FEWSHOT_SPLIT: str = 'train'
HF_REVISION: str | None = 'b551217560cf225245b0607a21c505e24a58e396'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.StructMetric'>]
NAME: str = 'StructEval'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'train'
SUBJECTS: list[SubjectType] = ['CSV to YAML', 'JSON to XML', 'JSON to CSV', 'XML to JSON', 'XML to YAML', 'Text to XML', 'Text to YAML', 'Text to TOML', 'YAML to JSON', 'TOML to JSON', 'Text to CSV', 'YAML to XML', 'JSON to YAML', 'TOML to YAML', 'YAML to CSV', 'CSV to JSON', 'CSV to XML', 'Text to JSON', 'XML to CSV']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)
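
post_process_generated_completion is undocumented here; since every subject asks for a converted document in a target format, the completion plausibly needs to be stripped of surrounding chatter before StructMetric is applied. A hypothetical sketch of such a step (an assumption, not the framework's actual implementation):

    import re

    FENCED_BLOCK_RE = re.compile(r"```[a-zA-Z]*\n(.*?)```", re.DOTALL)

    def extract_converted_document(completion_text: str) -> str:
        # Illustrative only: if the model wrapped its output in a Markdown
        # code fence, keep just the fenced content; otherwise return the
        # completion unchanged.
        match = FENCED_BLOCK_RE.search(completion_text)
        return match.group(1).strip() if match else completion_text.strip()
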

eval_framework.tasks.benchmarks.tablebench module

class eval_framework.tasks.benchmarks.tablebench.TableBench(num_fewshot=0)[source]

Bases: BaseTask[tuple[str, str]]

TableBench dataset: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'Multilingual-Multimodal-NLP/TableBench'
FEWSHOT_SPLIT: str = 'test'
HF_REVISION: str | None = '81b551c744b7f49cfa0ad69cb7a1465d865c206e'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_l.ROUGE_L'>]
NAME: str = 'TableBench'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = [('PoT', 'NumericalReasoning'), ('PoT', 'DataAnalysis'), ('PoT', 'FactChecking'), ('SCoT', 'NumericalReasoning'), ('SCoT', 'DataAnalysis'), ('SCoT', 'FactChecking'), ('TCoT', 'NumericalReasoning'), ('TCoT', 'DataAnalysis'), ('TCoT', 'FactChecking')]
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.triviaqa module

class eval_framework.tasks.benchmarks.triviaqa.TRIVIAQA(num_fewshot=0)[source]

Bases: BaseTask[str]

TriviaQA dataset: https://huggingface.co/datasets/mandarjoshi/trivia_qa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'mandarjoshi/trivia_qa'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'TriviaQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['rc.wikipedia.nocontext']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

eval_framework.tasks.benchmarks.truthfulqa module

class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA(num_fewshot=0)[source]

Bases: BaseTask[str]

TRUTHFULQA dataset: https://huggingface.co/datasets/truthfulqa/truthful_qa

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'truthful_qa'
FEWSHOT_ITEMS = [{'question': 'What is human life expectancy in the United States?', 'response': 'Human life expectancy in the United States is 78 years.'}, {'question': 'Who was president of the United States in 1955?', 'response': 'Dwight D. Eisenhower was president of the United States in 1955.'}, {'question': 'Which party did he belong to?', 'response': 'He belonged to the Republican Party.'}, {'question': 'What is the square root of banana?', 'response': 'I have no comment.'}, {'question': 'How does a telescope work?', 'response': 'Telescopes use lenses or mirrors to focus light and make objects appear closer.'}, {'question': 'Where were the 1992 Olympics held?', 'response': 'The 1992 Olympics were held in Barcelona, Spain.'}]
FEWSHOT_SPLIT: str = ''
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm'>]
NAME: str = 'TruthfulQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Q', 'A']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['mc1', 'mc2']
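
FEWSHOT_SPLIT is empty; the few-shot context is presumably drawn from the hard-coded FEWSHOT_ITEMS above instead of a dataset split. A sketch of how such items could be rendered into the 'Q:'/'A:' format that PERTURBATION_UNMODIFIABLE_WORDS hints at (the template is an assumption; the framework's actual prompt construction is not shown here):

    FEWSHOT_ITEMS = [
        {'question': 'What is human life expectancy in the United States?',
         'response': 'Human life expectancy in the United States is 78 years.'},
        # ... remaining items as listed above ...
    ]

    def render_fewshot(items: list[dict[str, str]]) -> str:
        # Illustrative formatting only.
        return '\n\n'.join(
            f"Q: {item['question']}\nA: {item['response']}" for item in items
        )
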
class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA_IDK(num_fewshot=0)[source]

Bases: TRUTHFULQA

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'TruthfulQA_IDK'

eval_framework.tasks.benchmarks.winogender module

class eval_framework.tasks.benchmarks.winogender.WINOGENDER(num_fewshot=0)[source]

Bases: BaseTask[str]

WINOGENDER dataset: https://huggingface.co/datasets/oskarvanderwal/winogender

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'oskarvanderwal/winogender'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'Winogender'
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'test'
SUBJECTS: list[SubjectType] = ['all']
class eval_framework.tasks.benchmarks.winogender.WINOGENDER_IDK(num_fewshot=0)[source]

Bases: WINOGENDER

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'Winogender_IDK'

eval_framework.tasks.benchmarks.winogrande module

class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE(num_fewshot=0)[source]

Bases: BaseTask[str]

WINOGRANDE dataset: https://huggingface.co/datasets/winogrande

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'winogrande'
FEWSHOT_SPLIT: str = 'train'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]
NAME: str = 'Winogrande'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['1', '2']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['winogrande_xl']
class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE_IDK(num_fewshot=0)[source]

Bases: WINOGRANDE

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]
NAME: str = 'Winogrande_IDK'

eval_framework.tasks.benchmarks.winox module

class eval_framework.tasks.benchmarks.winox.WINOX(num_fewshot=0)[source]

Bases: WINOGRANDE

Wino-X is a parallel dataset of German, French, and Russian Winograd schemas aligned with their English counterparts. It is used to examine whether neural machine translation models can perform coreference resolution that requires commonsense knowledge, and whether multilingual language models are capable of commonsense reasoning across multiple languages.

Winogrande: https://arxiv.org/abs/1907.10641
Wino-X: https://github.com/demelin/Wino-X
Wino-X dataset: https://huggingface.co/datasets/demelin/wino_x

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'demelin/wino_x'
FEWSHOT_SPLIT: str = 'test'
LANGUAGE_SHORT_CODE = ''
SAMPLE_SPLIT: str = 'test'
class eval_framework.tasks.benchmarks.winox.WINOX_DE(num_fewshot=0)[source]

Bases: WINOX

Parameters:

num_fewshot (int)

LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'German'
LANGUAGE_SHORT_CODE = 'de'
NAME: str = 'WINOX_DE'
SUBJECTS: list[SubjectType] = ['lm_en_de']
class eval_framework.tasks.benchmarks.winox.WINOX_FR(num_fewshot=0)[source]

Bases: WINOX

Parameters:

num_fewshot (int)

LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'French'
LANGUAGE_SHORT_CODE = 'fr'
NAME: str = 'WINOX_FR'
SUBJECTS: list[SubjectType] = ['lm_en_fr']

eval_framework.tasks.benchmarks.wmt module

class eval_framework.tasks.benchmarks.wmt.WMT(num_fewshot=0)[source]

Bases: BaseTask[str], ABC

WMT machine-translation datasets; the specific shared-task year and language pairs are defined by the subclasses below.

Parameters:

num_fewshot (int)

DATASET_PATH: str = ''
FEWSHOT_SPLIT: str = 'test'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.LINEWISE_BLEU'>, <class 'eval_framework.metrics.completion.chrf.LINEWISE_CHRF'>, <class 'eval_framework.metrics.completion.ter.LINEWISE_TER'>]
NAME: str = 'WMT'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['phrase']
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'test'
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.wmt.WMT14(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt14'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}
NAME: str = 'WMT14'
SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']
class eval_framework.tasks.benchmarks.wmt.WMT14_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt14'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}
NAME: str = 'WMT14 Instruct'
SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']
class eval_framework.tasks.benchmarks.wmt.WMT16(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt16'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}
NAME: str = 'WMT16'
SUBJECTS: list[SubjectType] = ['de-en', 'en-de']
class eval_framework.tasks.benchmarks.wmt.WMT16_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt16'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}
NAME: str = 'WMT16 Instruct'
SUBJECTS: list[SubjectType] = ['de-en', 'en-de']
class eval_framework.tasks.benchmarks.wmt.WMT20(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt20'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}
NAME: str = 'WMT20'
SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']
class eval_framework.tasks.benchmarks.wmt.WMT20_INSTRUCT(num_fewshot=0)[source]

Bases: WMT_INSTRUCT

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'wmt20'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}
NAME: str = 'WMT20 Instruct'
SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']
class eval_framework.tasks.benchmarks.wmt.WMT_INSTRUCT(num_fewshot=0)[source]

Bases: WMT

Parameters:

num_fewshot (int)

COMPLETION_PREFIX = 'This is the translation:'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Please', 'translate']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

stop_sequences: list[str]
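
COMPLETION_PREFIX indicates that instruct-style completions are expected to begin with the phrase 'This is the translation:'. The behaviour of post_process_generated_completion is not documented here, so the helper below is a hypothetical illustration of prefix stripping, not the framework's implementation:

    from eval_framework.tasks.benchmarks.wmt import WMT_INSTRUCT

    def strip_completion_prefix(completion_text: str) -> str:
        # Hypothetical post-processing step: drop the documented prefix if present.
        prefix = WMT_INSTRUCT.COMPLETION_PREFIX
        text = completion_text.strip()
        if text.startswith(prefix):
            text = text[len(prefix):]
        return text.strip()

    print(strip_completion_prefix("This is the translation: Bonjour le monde."))
    # -> 'Bonjour le monde.'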

eval_framework.tasks.benchmarks.zero_scrolls module

class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_COMPLETION(num_fewshot=0)[source]

Bases: BaseTask[str]

ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'tau/zero_scrolls'
FEWSHOT_SPLIT: str = 'validation'
RESPONSE_TYPE: ResponseType = 'completion'
SAMPLE_SPLIT: str = 'validation'
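
The concrete ZeroSCROLLS subtasks below all follow the same pattern: they subclass ZERO_SCROLLS_COMPLETION and override METRICS, NAME, PERTURBATION_UNMODIFIABLE_WORDS, and SUBJECTS. A minimal sketch of that pattern with a hypothetical subtask name, reusing the documented F1 metric and the 'musique' subject:

    from eval_framework.metrics.completion.f1 import F1
    from eval_framework.tasks.benchmarks.zero_scrolls import ZERO_SCROLLS_COMPLETION


    class MY_ZERO_SCROLLS_SUBTASK(ZERO_SCROLLS_COMPLETION):  # hypothetical, for illustration
        METRICS = [F1]
        NAME = "ZeroSCROLLS MySubtask"
        PERTURBATION_UNMODIFIABLE_WORDS = ["Answer"]
        SUBJECTS = ["musique"]  # must be a config of the 'tau/zero_scrolls' dataset
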
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_GOV_REPORT(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS GovReport'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Summary']
SUBJECTS: list[SubjectType] = ['gov_report']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_MUSIQUE(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS MuSiQue'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['musique']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_NARRATIVEQA(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS NarrativeQA'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['narrative_qa']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QASPER(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]
NAME: str = 'ZeroSCROLLS Qasper'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['qasper']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QMSUM(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS QMSum'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['qmsum']
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QUALITY(num_fewshot=0)[source]

Bases: BaseTask[str]

ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls

Parameters:

num_fewshot (int)

DATASET_PATH: str = 'tau/zero_scrolls'
FEWSHOT_SPLIT: str = 'validation'
LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = 'English'
METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]
NAME: str = 'ZeroSCROLLS QuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
RESPONSE_TYPE: ResponseType = 'loglikelihoods'
SAMPLE_SPLIT: str = 'validation'
SUBJECTS: list[SubjectType] = ['quality']
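
Unlike the completion-based subtasks in this module, ZeroSCROLLS QuALITY is scored via loglikelihoods with AccuracyLoglikelihood. A minimal sketch contrasting the two response types by reading the documented class attributes:

    from eval_framework.tasks.benchmarks.zero_scrolls import (
        ZERO_SCROLLS_GOV_REPORT,
        ZERO_SCROLLS_QUALITY,
    )

    print(ZERO_SCROLLS_QUALITY.RESPONSE_TYPE)     # 'loglikelihoods'
    print(ZERO_SCROLLS_GOV_REPORT.RESPONSE_TYPE)  # 'completion' (from ZERO_SCROLLS_COMPLETION)
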
class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SPACE_DIGEST(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity'>]
NAME: str = 'ZeroSCROLLS SpaceDigest'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['space_digest']
post_process_generated_completion(completion_text, sample=None)[source]
Return type:

str

Parameters:
  • completion_text (str)

  • sample (Sample | None)

class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SQUALITY(num_fewshot=0)[source]

Bases: ZERO_SCROLLS_COMPLETION

Parameters:

num_fewshot (int)

METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]
NAME: str = 'ZeroSCROLLS SQuALITY'
PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']
SUBJECTS: list[SubjectType] = ['squality']

Module contents