eval_framework.tasks.benchmarks package¶
Submodules¶
eval_framework.tasks.benchmarks.aidanbench module¶
- class eval_framework.tasks.benchmarks.aidanbench.AidanBench(num_fewshot=0)[source]¶
Bases:
AidanBenchOriginal
- Parameters:
num_fewshot (int)
- class eval_framework.tasks.benchmarks.aidanbench.AidanBenchOriginal(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
AidanBench (https://openreview.net/pdf?id=fz969ahcvJ).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Aleph-Alpha-Research/aidanbench'¶
- FEWSHOT_SPLIT: str = 'train'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.aidanbench.AidanBenchMetric'>]¶
- NAME: str = 'AidanBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- generate_completions(llm, samples, stop_sequences=None, max_tokens=None)[source]¶
Generates completions for the samples.
- Parameters:
samples: samples to generate completions for
stop_sequences (list[str] | None): stop sequences to use in completion generation
max_tokens (int | None): maximum tokens to use in completion generation
- Return type:
list[Completion]
- Returns:
the generated completions
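A minimal usage sketch for the class above, assuming only what is documented here (the constructor signature and the class attributes); the model and sample objects passed to generate_completions are framework-specific and are not shown:

from eval_framework.tasks.benchmarks.aidanbench import AidanBench

# Instantiate with the documented constructor signature.
task = AidanBench(num_fewshot=0)

# The documented class attributes are plain class-level constants.
print(task.NAME)           # 'AidanBench'
print(task.DATASET_PATH)   # 'Aleph-Alpha-Research/aidanbench'
print(task.RESPONSE_TYPE)  # 'completion'
print(task.SAMPLE_SPLIT)   # 'train'

# completions = task.generate_completions(llm, samples)  # llm/samples are framework objects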
eval_framework.tasks.benchmarks.arc module¶
- class eval_framework.tasks.benchmarks.arc.ARC(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
ARC dataset: https://huggingface.co/datasets/allenai/ai2_arc
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'ai2_arc'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ARC'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['ARC-Easy', 'ARC-Challenge']¶
- class eval_framework.tasks.benchmarks.arc.ARC_IDK(num_fewshot=0)[source]¶
Bases:
ARC
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'ARC_IDK'¶
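The IDK variants in this package all follow the pattern visible above: subclass the base task and extend METRICS with the confidence-aware scores while inheriting the data configuration. A sketch of that pattern (the subclass name is hypothetical; the metric import paths are taken from the METRICS repr above):

from eval_framework.tasks.benchmarks.arc import ARC
from eval_framework.metrics.loglikelihood.confidence_weighted_accuracy import ConfidenceWeightedAccuracy
from eval_framework.metrics.loglikelihood.dcs import DistributionalCorrectnessScore
from eval_framework.metrics.loglikelihood.ternary import TernaryScore

class MyIdkVariant(ARC):  # hypothetical subclass
    # Keep the base task's metrics and add the confidence-aware ones.
    METRICS = ARC.METRICS + [ConfidenceWeightedAccuracy, DistributionalCorrectnessScore, TernaryScore]
    NAME = "MyIdkVariant"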
eval_framework.tasks.benchmarks.arc_de module¶
- class eval_framework.tasks.benchmarks.arc_de.ARC_DE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
ARC-DE dataset: https://huggingface.co/datasets/LeoLM/ArcChallenge_de
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/ArcChallenge_de'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ARC German'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.arc_fi module¶
- class eval_framework.tasks.benchmarks.arc_fi.ARC_FI(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
ARC-FI dataset: https://huggingface.co/datasets/LumiOpen/arc_challenge_mt
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LumiOpen/arc_challenge_mt'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ARC Finnish'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D', 'E']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['fi']¶
eval_framework.tasks.benchmarks.belebele module¶
- class eval_framework.tasks.benchmarks.belebele.BELEBELE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
BELEBELE dataset: https://huggingface.co/datasets/facebook/belebele
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'facebook/belebele'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'BELEBELE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['eng_Latn']¶
eval_framework.tasks.benchmarks.bigcodebench module¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBench(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'bigcode/bigcodebench'¶
- FEWSHOT_SPLIT: str = 'v0.1.4'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_execution_pass_at_one.CodeExecutionPassAtOne'>]¶
- NAME: str = 'BigCodeBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'v0.1.4'¶
- SUBJECTS: list[SubjectType] = ['original', 'calibrated']¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHard(num_fewshot=0)[source]¶
Bases:
BigCodeBench
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'bigcode/bigcodebench-hard'¶
- NAME: str = 'BigCodeBenchHard'¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchHardInstruct(num_fewshot=0)[source]¶
Bases:
BigCodeBenchHard
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench-hard
- Parameters:
num_fewshot (int)
- NAME: str = 'BigCodeBenchHardInstruct'¶
- class eval_framework.tasks.benchmarks.bigcodebench.BigCodeBenchInstruct(num_fewshot=0)[source]¶
Bases:
BigCodeBench
BigCodeBench dataset: https://huggingface.co/datasets/bigcode/bigcodebench
- Parameters:
num_fewshot (int)
- NAME: str = 'BigCodeBenchInstruct'¶
eval_framework.tasks.benchmarks.casehold module¶
- class eval_framework.tasks.benchmarks.casehold.CASEHOLD(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'lex_glue'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'CaseHold'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['case_hold']¶
eval_framework.tasks.benchmarks.chembench module¶
- class eval_framework.tasks.benchmarks.chembench.ChemBench(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
ChemBench dataset: https://huggingface.co/datasets/jablonkagroup/ChemBench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'jablonkagroup/ChemBench'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'ChemBench'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['analytical_chemistry', 'chemical_preference', 'general_chemistry', 'inorganic_chemistry', 'materials_science', 'organic_chemistry', 'physical_chemistry', 'technical_chemistry', 'toxicity_and_safety']¶
eval_framework.tasks.benchmarks.copa module¶
- class eval_framework.tasks.benchmarks.copa.COPA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
COPA dataset: https://huggingface.co/datasets/aps/super_glue
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'aps/super_glue'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'COPA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['because', 'therefore']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['copa']¶
- class eval_framework.tasks.benchmarks.copa.COPA_IDK(num_fewshot=0)[source]¶
Bases:
COPA
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'COPA_IDK'¶
eval_framework.tasks.benchmarks.duc module¶
- class eval_framework.tasks.benchmarks.duc.DUC(num_fewshot=0)[source]¶
Bases:
BaseTask[str], ABC
https://huggingface.co/datasets/midas/duc2001
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'midas/duc2001'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Text', 'Keyphrase']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[str] = ['raw']¶
eval_framework.tasks.benchmarks.flores200 module¶
- class eval_framework.tasks.benchmarks.flores200.Flores200(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
FLORES-200 dataset: https://huggingface.co/datasets/facebook/flores
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'facebook/flores'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fin_Latn': Language.FIN, 'fra_Latn': Language.FRA, 'nld_Latn': Language.NLD}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>]¶
- NAME: str = 'FLoRes-200'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'devtest'¶
- SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fin_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-nld_Latn', 'eng_Latn-deu_Latn', 'eng_Latn-fin_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-nld_Latn', 'fin_Latn-deu_Latn', 'fin_Latn-eng_Latn', 'fin_Latn-fra_Latn', 'fin_Latn-nld_Latn', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-fin_Latn', 'fra_Latn-nld_Latn', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fin_Latn', 'nld_Latn-fra_Latn']¶
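The 20 SUBJECTS above are exactly the ordered pairs of the five configured languages; a short sketch reproducing the list:

from itertools import permutations

languages = ["deu_Latn", "eng_Latn", "fin_Latn", "fra_Latn", "nld_Latn"]
subjects = sorted(f"{src}-{tgt}" for src, tgt in permutations(languages, 2))
assert len(subjects) == 20  # matches the SUBJECTS list above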
eval_framework.tasks.benchmarks.flores_plus module¶
- class eval_framework.tasks.benchmarks.flores_plus.FloresPlus(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
Flores-Plus dataset: https://huggingface.co/datasets/openlanguagedata/flores_plus
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openlanguagedata/flores_plus'¶
- FEWSHOT_SPLIT: str = 'devtest'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'deu_Latn': Language.DEU, 'eng_Latn': Language.ENG, 'fra_Latn': Language.FRA, 'ita_Latn': Language.ITA, 'nld_Latn': Language.NLD, 'pol_Latn': Language.POL, 'rus_Cyrl': Language.RUS, 'spa_Latn': Language.SPA, 'ukr_Cyrl': Language.UKR}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.BLEU'>, <class 'eval_framework.metrics.completion.chrf.CHRF'>, <class 'eval_framework.metrics.completion.comet.COMET'>]¶
- NAME: str = 'Flores-Plus'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['sentence']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'dev'¶
- SUBJECTS: list[SubjectType] = ['deu_Latn-eng_Latn', 'deu_Latn-fra_Latn', 'deu_Latn-ita_Latn', 'deu_Latn-nld_Latn', 'deu_Latn-pol_Latn', 'deu_Latn-rus_Cyrl', 'deu_Latn-spa_Latn', 'deu_Latn-ukr_Cyrl', 'eng_Latn-deu_Latn', 'eng_Latn-fra_Latn', 'eng_Latn-ita_Latn', 'eng_Latn-nld_Latn', 'eng_Latn-pol_Latn', 'eng_Latn-rus_Cyrl', 'eng_Latn-spa_Latn', 'eng_Latn-ukr_Cyrl', 'fra_Latn-deu_Latn', 'fra_Latn-eng_Latn', 'fra_Latn-ita_Latn', 'fra_Latn-nld_Latn', 'fra_Latn-pol_Latn', 'fra_Latn-rus_Cyrl', 'fra_Latn-spa_Latn', 'fra_Latn-ukr_Cyrl', 'ita_Latn-deu_Latn', 'ita_Latn-eng_Latn', 'ita_Latn-fra_Latn', 'ita_Latn-nld_Latn', 'ita_Latn-pol_Latn', 'ita_Latn-rus_Cyrl', 'ita_Latn-spa_Latn', 'ita_Latn-ukr_Cyrl', 'nld_Latn-deu_Latn', 'nld_Latn-eng_Latn', 'nld_Latn-fra_Latn', 'nld_Latn-ita_Latn', 'nld_Latn-pol_Latn', 'nld_Latn-rus_Cyrl', 'nld_Latn-spa_Latn', 'nld_Latn-ukr_Cyrl', 'pol_Latn-deu_Latn', 'pol_Latn-eng_Latn', 'pol_Latn-fra_Latn', 'pol_Latn-ita_Latn', 'pol_Latn-nld_Latn', 'pol_Latn-rus_Cyrl', 'pol_Latn-spa_Latn', 'pol_Latn-ukr_Cyrl', 'rus_Cyrl-deu_Latn', 'rus_Cyrl-eng_Latn', 'rus_Cyrl-fra_Latn', 'rus_Cyrl-ita_Latn', 'rus_Cyrl-nld_Latn', 'rus_Cyrl-pol_Latn', 'rus_Cyrl-spa_Latn', 'rus_Cyrl-ukr_Cyrl', 'spa_Latn-deu_Latn', 'spa_Latn-eng_Latn', 'spa_Latn-fra_Latn', 'spa_Latn-ita_Latn', 'spa_Latn-nld_Latn', 'spa_Latn-pol_Latn', 'spa_Latn-rus_Cyrl', 'spa_Latn-ukr_Cyrl', 'ukr_Cyrl-deu_Latn', 'ukr_Cyrl-eng_Latn', 'ukr_Cyrl-fra_Latn', 'ukr_Cyrl-ita_Latn', 'ukr_Cyrl-nld_Latn', 'ukr_Cyrl-pol_Latn', 'ukr_Cyrl-rus_Cyrl', 'ukr_Cyrl-spa_Latn']¶
eval_framework.tasks.benchmarks.gpqa module¶
- class eval_framework.tasks.benchmarks.gpqa.GPQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
GPQA dataset: https://huggingface.co/datasets/Idavidrein/gpqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Idavidrein/gpqa'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'GPQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['gpqa_extended']¶
- class eval_framework.tasks.benchmarks.gpqa.GPQA_COT(num_fewshot=0)[source]¶
Bases:
GPQA
- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'GPQA_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
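ANS_RE above shows how GPQA_COT recovers the final answer letter from a chain-of-thought completion. A minimal sketch of that extraction (the empty-string fallback for non-matching completions is an assumption, not taken from the source):

import re

ANS_RE = re.compile(r"Therefore, the answer is \(([ABCDEFGHIJ])\)")  # as documented above

def extract_answer_letter(completion_text: str) -> str:
    match = ANS_RE.search(completion_text)
    return match.group(1) if match else ""  # fallback value is an assumption

print(extract_answer_letter("... Therefore, the answer is (B)."))  # B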
- class eval_framework.tasks.benchmarks.gpqa.GPQA_IDK(num_fewshot=0)[source]¶
Bases:
GPQA
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'GPQA_IDK'¶
eval_framework.tasks.benchmarks.gsm8k module¶
- class eval_framework.tasks.benchmarks.gsm8k.GSM8K(num_fewshot=0)[source]¶
Bases:
GSM8KEvalHarness
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = ''¶
- NAME: str = 'GSM8K'¶
- class eval_framework.tasks.benchmarks.gsm8k.GSM8KEvalHarness(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
GSM8K dataset: https://huggingface.co/datasets/openai/gsm8k
This version uses samples from the train split as few-shot examples.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'gsm8k'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'GSM8KEvalHarness'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['main']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
eval_framework.tasks.benchmarks.hellaswag module¶
- class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
Hellaswag dataset: https://huggingface.co/datasets/Rowan/hellaswag
Available dataset splits: train, validation, test
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Rowan/hellaswag'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'HellaSwag'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.hellaswag.HELLASWAG_IDK(num_fewshot=0)[source]¶
Bases:
HELLASWAG
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'HellaSwag_IDK'¶
eval_framework.tasks.benchmarks.hellaswag_de module¶
- class eval_framework.tasks.benchmarks.hellaswag_de.HELLASWAG_DE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
Hellaswag dataset: https://huggingface.co/datasets/LeoLM/HellaSwag_de
Available dataset splits: train (1k rows), validation (10k rows)
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/HellaSwag_de'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'HellaSwag German'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.humaneval module¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEval(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
HumanEval dataset: https://huggingface.co/datasets/openai/openai_humaneval/
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/openai_humaneval'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]¶
- NAME: str = 'Human Eval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.humaneval.HumanEvalInstruct(num_fewshot=0)[source]¶
Bases:
HumanEval
- Parameters:
num_fewshot (int)
- CUE_PREFIX = 'Here is the completed function:\n```python\n'¶
- NAME: str = 'Human Eval Instruct'¶
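CUE_PREFIX above opens a ```python fence, which suggests the model's completion is expected to continue the code and eventually close the fence. A sketch of recovering the code body under that assumption (the exact post-processing the framework applies is not documented here):

CUE_PREFIX = "Here is the completed function:\n```python\n"  # as documented above

def strip_closing_fence(completion_text: str) -> str:
    # Keep only the code before the first closing fence, if any (assumed post-processing).
    return completion_text.split("```", 1)[0]

raw = "    return sorted(xs)\n```\nHope this helps!"
print(strip_closing_fence(raw))  # '    return sorted(xs)\n'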
- class eval_framework.tasks.benchmarks.humaneval.HumanEvalMetricContext(**data)[source]¶
Bases:
BaseMetricContext
- Parameters:
test (str)
entry_point (str)
prompt (str)
extra_data (Any)
- entry_point: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- prompt: str¶
- test: str¶
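A construction sketch for the context above; the field values are illustrative, and model_config {'extra': 'allow'} means fields beyond the declared ones are accepted:

from eval_framework.tasks.benchmarks.humaneval import HumanEvalMetricContext

ctx = HumanEvalMetricContext(
    prompt="def add(a, b):\n",              # illustrative values
    test="assert add(1, 2) == 3",
    entry_point="add",
    extra_data=None,
    notes="undeclared fields pass through",  # permitted by extra='allow'
)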
eval_framework.tasks.benchmarks.ifeval module¶
- class eval_framework.tasks.benchmarks.ifeval.IFEval(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
IFEval: Instruction Following Eval (https://arxiv.org/pdf/2311.07911).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google/IFEval'¶
- FEWSHOT_SPLIT: str = 'train'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.ifeval.IFEvalMetric'>]¶
- NAME: str = 'IFEval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.ifeval.IFEvalDe(num_fewshot=0)[source]¶
Bases:
IFEval
German version of the Instruction Following Evaluation (IFEval) benchmark.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'jzhang86/de_ifeval'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'no_subject': Language.DEU}¶
- NAME: str = 'IFEval German'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.ifeval.IFEvalFiSv(num_fewshot=0)[source]¶
Bases:
IFEval
Machine-translated versions of the Instruction Following Evaluation (IFEval) benchmark.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LumiOpen/ifeval_mt'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'fi': Language.FIN, 'sv': Language.SWE}¶
- NAME: str = 'IFEval Finnish & Swedish'¶
- SUBJECTS: list[SubjectType] = ['fi', 'sv']¶
eval_framework.tasks.benchmarks.include module¶
- class eval_framework.tasks.benchmarks.include.INCLUDE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
INCLUDE dataset: https://huggingface.co/datasets/CohereLabs/include-base-44
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'CohereLabs/include-base-44'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'Albanian': Language.SQI, 'Arabic': Language.ARB, 'Armenian': Language.HYE, 'Azerbaijani': Language.AZE, 'Basque': Language.EUS, 'Belarusian': Language.BEL, 'Bengali': Language.BEN, 'Bulgarian': Language.BUL, 'Chinese': Language.ZHO, 'Croatian': Language.HRV, 'Dutch': Language.NLD, 'Estonian': Language.EST, 'Finnish': Language.FIN, 'French': Language.FRA, 'Georgian': Language.KAT, 'German': Language.DEU, 'Greek': Language.ELL, 'Hebrew': Language.HEB, 'Hindi': Language.HIN, 'Hungarian': Language.HUN, 'Indonesian': Language.IND, 'Italian': Language.ITA, 'Japanese': Language.JPN, 'Kazakh': Language.KAZ, 'Korean': Language.KOR, 'Lithuanian': Language.LIT, 'Malay': Language.MSA, 'Malayalam': Language.MAL, 'Nepali': Language.NEP, 'North Macedonian': Language.MKD, 'Persian': Language.FAS, 'Polish': Language.POL, 'Portuguese': Language.POR, 'Russian': Language.RUS, 'Serbian': Language.SRP, 'Spanish': Language.SPA, 'Tagalog': Language.TGL, 'Tamil': Language.TAM, 'Telugu': Language.TEL, 'Turkish': Language.TUR, 'Ukrainian': Language.UKR, 'Urdu': Language.URD, 'Uzbek': Language.UZB, 'Vietnamese': Language.VIE}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'INCLUDE'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['Albanian', 'Arabic', 'Armenian', 'Azerbaijani', 'Basque', 'Belarusian', 'Bengali', 'Bulgarian', 'Chinese', 'Croatian', 'Dutch', 'Estonian', 'Finnish', 'French', 'Georgian', 'German', 'Greek', 'Hebrew', 'Hindi', 'Hungarian', 'Indonesian', 'Italian', 'Japanese', 'Kazakh', 'Korean', 'Lithuanian', 'Malay', 'Malayalam', 'Nepali', 'North Macedonian', 'Persian', 'Polish', 'Portuguese', 'Russian', 'Serbian', 'Spanish', 'Tagalog', 'Tamil', 'Telugu', 'Turkish', 'Ukrainian', 'Urdu', 'Uzbek', 'Vietnamese']¶
eval_framework.tasks.benchmarks.infinitebench module¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench(num_fewshot=0)[source]¶
Bases:
BaseTask[str], ABC
InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens (https://github.com/OpenBMB/InfiniteBench)
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'xinrongzhang2022/InfiniteBench'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None¶
- SUBJECTS: list[SubjectType] = ['default']¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchCompletion(num_fewshot=0)[source]¶
Bases:
InfiniteBench, ABC
Base class for completion tasks.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBenchLoglikelihood(num_fewshot=0)[source]¶
Bases:
InfiniteBench, ABC
Base class for loglikelihood tasks.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
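Each InfiniteBench subtask below follows the same pattern: subclass one of these two base classes and point both splits at a single dataset configuration. A sketch with a hypothetical subtask name and split:

from eval_framework.tasks.benchmarks.infinitebench import InfiniteBenchCompletion

class InfiniteBench_Example(InfiniteBenchCompletion):  # hypothetical subtask
    FEWSHOT_SPLIT = "example_split"  # hypothetical dataset configuration
    SAMPLE_SPLIT = "example_split"
    NAME = "InfiniteBench_Example"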
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeDebug(num_fewshot=0)[source]¶
Bases:
InfiniteBenchLoglikelihood
Finding which function in a code repo contains a crashing error (multiple-choice form).
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'code_debug'¶
- NAME: str = 'InfiniteBench_CodeDebug'¶
- SAMPLE_SPLIT: str = 'code_debug'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_CodeRun(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Simulating execution of multiple simple, synthetic functions.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'code_run'¶
- NAME: str = 'InfiniteBench_CodeRun'¶
- SAMPLE_SPLIT: str = 'code_run'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnDia(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Identification of speakers in partially anonymized scripts.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longdialogue_qa_eng'¶
- NAME: str = 'InfiniteBench_EnDia'¶
- SAMPLE_SPLIT: str = 'longdialogue_qa_eng'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnMC(num_fewshot=0)[source]¶
Bases:
InfiniteBenchLoglikelihood
Multiple choice questions derived from the fake book.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longbook_choice_eng'¶
- NAME: str = 'InfiniteBench_EnMC'¶
- SAMPLE_SPLIT: str = 'longbook_choice_eng'¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_EnQA(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Free-form question answering based on the fake book.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'longbook_qa_eng'¶
- NAME: str = 'InfiniteBench_EnQA'¶
- SAMPLE_SPLIT: str = 'longbook_qa_eng'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_MathFind(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Finding special integers in a lengthy list.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'math_find'¶
- NAME: str = 'InfiniteBench_MathFind'¶
- SAMPLE_SPLIT: str = 'math_find'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveKV2(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Finding the corresponding value given a dictionary and a key.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'kv_retrieval'¶
- NAME: str = 'InfiniteBench_RetrieveKV2'¶
- SAMPLE_SPLIT: str = 'kv_retrieval'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrieveNumber(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Locating repeated hidden numbers in a noisy long context.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'number_string'¶
- NAME: str = 'InfiniteBench_RetrieveNumber'¶
- SAMPLE_SPLIT: str = 'number_string'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.infinitebench.InfiniteBench_RetrievePassKey1(num_fewshot=0)[source]¶
Bases:
InfiniteBenchCompletion
Retrieving hidden keys in a noisy long context.
- Parameters:
num_fewshot (int)
- FEWSHOT_SPLIT: str = 'passkey'¶
- NAME: str = 'InfiniteBench_RetrievePassKey1'¶
- SAMPLE_SPLIT: str = 'passkey'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
eval_framework.tasks.benchmarks.math_reasoning module¶
- class eval_framework.tasks.benchmarks.math_reasoning.AIME2024(num_fewshot=0)[source]¶
Bases:
MATHReasoning
AIME 2024 dataset: https://huggingface.co/datasets/HuggingFaceH4/aime_2024
This dataset contains a single train split of 30 questions. Data columns: ID | Problem | Solution | Answer.
Evaluated with pass@1.
- Parameters:
num_fewshot (int)
- ANSWER_PATTERN = 'Therefore, the final answer is:(.*?). I hope it is correct.'¶
- DATASET_PATH: str = 'HuggingFaceH4/aime_2024'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'AIME2024'¶
- QUERY_TEMPLATE = 'Solve the following math problem efficiently and clearly:\n\n - For simple problems (2 steps or fewer):\n Provide a concise solution with minimal explanation.\n\n - For complex problems (3 steps or more):\n Use this step-by-step format:\n\n ## Step 1: [Concise description]\n [Brief explanation and calculations]\n\n ## Step 2: [Concise description]\n [Brief explanation and calculations]\n\n ...\n\n Regardless of the approach, always conclude with:\n\n Therefore, the final answer is: $\\boxed{{answer}}$. I hope it is correct.\n\n Where [answer] is just the final number or expression that solves the problem.\n\n Problem: {Question}'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
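QUERY_TEMPLATE and ANSWER_PATTERN above work as a pair: the template is rendered with str.format (the doubled braces around {{answer}} survive formatting as literal braces), and the pattern then recovers the final answer span. A minimal round-trip sketch using only the documented attributes:

import re

from eval_framework.tasks.benchmarks.math_reasoning import AIME2024

prompt = AIME2024.QUERY_TEMPLATE.format(Question="What is 2 + 2?")
completion = "Therefore, the final answer is: $\\boxed{4}$. I hope it is correct."
match = re.search(AIME2024.ANSWER_PATTERN, completion)
print(match.group(1).strip() if match else "")  # $\boxed{4}$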
- class eval_framework.tasks.benchmarks.math_reasoning.AIME2025(num_fewshot=0)[source]¶
Bases:
AIME2024
AIME 2025 dataset: https://huggingface.co/datasets/math-ai/aime25
This dataset contains a single test split of 30 questions. Data columns: problem | answer | id.
Evaluated with pass@1.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'math-ai/aime25'¶
- FEWSHOT_SPLIT: str = 'test'¶
- NAME: str = 'AIME2025'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.math_reasoning.GSM8KReasoning(num_fewshot=0)[source]¶
Bases:
MATHReasoning
GSM8K dataset with reasoning prompt: https://huggingface.co/datasets/openai/gsm8k
Zero-shot reasoning version that expects answers in boxed format.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'gsm8k'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'GSM8KReasoning'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. Think through the problem carefully and show your reasoning.\n\nPlease provide your answer in the format: $\\boxed{{answer}}$ where answer is the final numerical result.\n\nQuestion: {question}\n\nAnswer:'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['main']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATH(num_fewshot=0)[source]¶
Bases:
MATHReasoning
MATH dataset: https://huggingface.co/datasets/EleutherAI/hendrycks_math
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'EleutherAI/hendrycks_math'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'Math'¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n {Question}\n\n Remember to put your answer in $\\boxed{{answer}}$\n\n where [answer] is just the final number or expression that solves the problem.'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['algebra', 'counting_and_probability', 'geometry', 'intermediate_algebra', 'number_theory', 'prealgebra', 'precalculus']¶
- extract_last_two_dollar_text(s)[source]¶
Finds the text between the last two dollar signs in a string.
- Parameters:
s (str): the string to extract text from
- Return type:
str
- Returns:
the extracted text
- post_process_generated_completion(completion_text, sample=None)[source]¶
Extracts the answer via flexible extraction/matching: a boxed answer, if present, is used first; if there is no boxed answer but there are LaTeX math delimiters (“$”), the text between the last two delimiters is extracted and used; failing both, an answer marker (“Answer:”) is used last. A sketch of this logic follows below.
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
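A sketch of the documented extraction order, assuming a simple boxed-answer regex (the framework's actual parsing may be more robust):

import re

def extract_last_two_dollar_text(s: str) -> str:
    # Text between the last two dollar signs, as documented above.
    parts = s.rsplit("$", 2)
    return parts[1] if len(parts) == 3 else ""

def post_process(completion_text: str) -> str:
    # 1. A boxed answer wins if present (this regex is an assumption).
    boxed = re.search(r"\\boxed\{([^{}]*)\}", completion_text)
    if boxed:
        return boxed.group(1)
    # 2. Otherwise fall back to the last $...$ span.
    dollar = extract_last_two_dollar_text(completion_text)
    if dollar:
        return dollar
    # 3. Finally, the 'Answer:' marker (same pattern as MATHReasoning.ANSWER_PATTERN below).
    answer = re.search(r"(?i)Answer\s*:\s*(.*)", completion_text)
    return answer.group(1).strip() if answer else ""

print(post_process("The result is $\\boxed{42}$."))  # 42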
- class eval_framework.tasks.benchmarks.math_reasoning.MATH500(num_fewshot=0)[source]¶
Bases:
MATHReasoning
MATH500 dataset: https://huggingface.co/datasets/HuggingFaceH4/MATH-500
This dataset contains a single test split of 500 questions. Data columns: ID | Problem | Solution | Answer.
Evaluated with pass@1.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'HuggingFaceH4/MATH-500'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>, <class 'eval_framework.metrics.completion.language_checker.LanguageRawConsistencyChecker'>]¶
- NAME: str = 'MATH500'¶
- QUERY_TEMPLATE = 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $ANSWER (without quotes) where $ANSWER is the answer to the problem.\n\n {Question}\n\n Remember to put your answer in $\\boxed{{answer}}$\n\n where [answer] is just the final number or expression that solves the problem.'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHLvl5(num_fewshot=0)[source]¶
Bases:
MATH
- Parameters:
num_fewshot (int)
- NAME: str = 'Math Lvl 5'¶
- class eval_framework.tasks.benchmarks.math_reasoning.MATHReasoning(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
AIME 2024 dataset: https://huggingface.co/datasets/HuggingFaceH4/aime_2024
This dataset contains a single train split of 30 questions. Data columns: ID | Problem | Solution | Answer.
Evaluated with pass@1.
- Parameters:
num_fewshot (int)
- ANSWER_PATTERN = '(?i)Answer\\s*:\\s*(.*)'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.math_reasoning_completion.MathReasoningCompletion'>]¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
eval_framework.tasks.benchmarks.mbpp module¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
MBPP provides both the problem statement and the test cases upfront. It says, “Here’s the problem and here are the tests; write code that passes them.” Note that LLMs can cheat and write code that merely passes the tests without solving the given problem.
MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only gives the problem statement and function signature initially. It says, “Here’s the problem and function signature; write code, then we’ll run tests later.” A prompt-format sketch follows this class entry.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google-research-datasets/mbpp'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.code_assertion.CodeCompletionAssertion'>]¶
- NAME: str = 'MBPP'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['full']¶
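A sketch contrasting the two prompt styles described above; the exact wording used by the framework is an assumption:

problem = "Write a function to add two numbers."
signature = "def add(a, b):"
tests = "assert add(1, 2) == 3"

# MBPP style: problem and tests shown upfront.
mbpp_prompt = f"{problem}\nYour code should pass these tests:\n{tests}\n"

# MBPP_PROMPT_WITHOUT_TESTS style: problem and signature only; tests run afterwards.
without_tests_prompt = f"{problem}\n{signature}\n"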
- class eval_framework.tasks.benchmarks.mbpp.MBPPMetricContext(**data)[source]¶
Bases:
BaseMetricContext
- Parameters:
tests_code (str)
extra_data (Any)
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- tests_code: str¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS(num_fewshot=0)[source]¶
Bases:
MBPP
MBPP provides both the problem statement and the test cases upfront. It says, “Here’s the problem and here are the tests; write code that passes them.” Note that LLMs can cheat and write code that merely passes the tests without solving the given problem.
MBPP_PROMPT_WITHOUT_TESTS, on the other hand, only gives the problem statement and function signature initially. It says, “Here’s the problem and function signature; write code, then we’ll run tests later.”
- Parameters:
num_fewshot (int)
- NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS'¶
- class eval_framework.tasks.benchmarks.mbpp.MBPP_PROMPT_WITHOUT_TESTS_SANITIZED(num_fewshot=0)[source]¶
Bases:
MBPP_PROMPT_WITHOUT_TESTS
- Parameters:
num_fewshot (int)
- NAME: str = 'MBPP_PROMPT_WITHOUT_TESTS_SANITIZED'¶
- SUBJECTS: list[SubjectType] = ['sanitized']¶
eval_framework.tasks.benchmarks.mmlu module¶
- class eval_framework.tasks.benchmarks.mmlu.FullTextMMLU(num_fewshot=0)[source]¶
Bases:
MMLU
MMLU dataset, but where the model is expected to replicate the choice text rather than just the key.
- Parameters:
num_fewshot (int)
- NAME: str = 'Full Text MMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'answers', 'A', 'B', 'C', 'D']¶
- class eval_framework.tasks.benchmarks.mmlu.MMLU(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
MMLU dataset: https://huggingface.co/datasets/cais/mmlu
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'cais/mmlu'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']¶
- class eval_framework.tasks.benchmarks.mmlu.MMLU_COT(num_fewshot=0)[source]¶
Bases:
MMLU
MMLU dataset with an instruction to summarize reasoning and conclude with the answer. Inspired by https://arxiv.org/pdf/2411.15124 (Table 44).
- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is: ([ABCD])')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'MMLU_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.mmlu.MMLU_IDK(num_fewshot=0)[source]¶
Bases:
MMLU
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'MMLU_IDK'¶
eval_framework.tasks.benchmarks.mmlu_de module¶
- class eval_framework.tasks.benchmarks.mmlu_de.MMLU_DE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
MMLU DE dataset: https://huggingface.co/datasets/LeoLM/MMLU_de
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'LeoLM/MMLU_de'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU_DE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']¶
eval_framework.tasks.benchmarks.mmlu_pro module¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO(num_fewshot=0)[source]¶
Bases:
BaseTask[str]
MMLU_PRO dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'TIGER-Lab/MMLU-Pro'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMLU Pro'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['engineering', 'physics', 'psychology', 'chemistry', 'biology', 'law', 'philosophy', 'computer science', 'other', 'economics', 'business', 'history', 'math', 'health']¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_COT(num_fewshot=0)[source]¶
Bases:
MMLU_PRO
- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Therefore, the answer is \\(([ABCDEFGHIJ])\\)')¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'MMLU_PRO_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Therefore', 'the', 'answer', 'is', 'ANSWER_LETTER', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
- class eval_framework.tasks.benchmarks.mmlu_pro.MMLU_PRO_IDK(num_fewshot=0)[source]¶
Bases:
MMLU_PRO
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'MMLU Pro_IDK'¶
eval_framework.tasks.benchmarks.mmmlu module¶
- class eval_framework.tasks.benchmarks.mmmlu.MMMLU(num_fewshot=0)[source]¶
Bases:
BaseTask[tuple[str, str]]
MMMLU dataset: https://huggingface.co/datasets/openai/MMMLU
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openai/MMMLU'¶
- FEWSHOT_SPLIT: str = 'test'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('AR', 'abstract_algebra')": Language.ARB, "('AR', 'anatomy')": Language.ARB, "('AR', 'astronomy')": Language.ARB, "('AR', 'business_ethics')": Language.ARB, "('AR', 'clinical_knowledge')": Language.ARB, "('AR', 'college_biology')": Language.ARB, "('AR', 'college_chemistry')": Language.ARB, "('AR', 'college_computer_science')": Language.ARB, "('AR', 'college_mathematics')": Language.ARB, "('AR', 'college_medicine')": Language.ARB, "('AR', 'college_physics')": Language.ARB, "('AR', 'computer_security')": Language.ARB, "('AR', 'conceptual_physics')": Language.ARB, "('AR', 'econometrics')": Language.ARB, "('AR', 'electrical_engineering')": Language.ARB, "('AR', 'elementary_mathematics')": Language.ARB, "('AR', 'formal_logic')": Language.ARB, "('AR', 'global_facts')": Language.ARB, "('AR', 'high_school_biology')": Language.ARB, "('AR', 'high_school_chemistry')": Language.ARB, "('AR', 'high_school_computer_science')": Language.ARB, "('AR', 'high_school_european_history')": Language.ARB, "('AR', 'high_school_geography')": Language.ARB, "('AR', 'high_school_government_and_politics')": Language.ARB, "('AR', 'high_school_macroeconomics')": Language.ARB, "('AR', 'high_school_mathematics')": Language.ARB, "('AR', 'high_school_microeconomics')": Language.ARB, "('AR', 'high_school_physics')": Language.ARB, "('AR', 'high_school_psychology')": Language.ARB, "('AR', 'high_school_statistics')": Language.ARB, "('AR', 'high_school_us_history')": Language.ARB, "('AR', 'high_school_world_history')": Language.ARB, "('AR', 'human_aging')": Language.ARB, "('AR', 'human_sexuality')": Language.ARB, "('AR', 'international_law')": Language.ARB, "('AR', 'jurisprudence')": Language.ARB, "('AR', 'logical_fallacies')": Language.ARB, "('AR', 'machine_learning')": Language.ARB, "('AR', 'management')": Language.ARB, "('AR', 'marketing')": Language.ARB, "('AR', 'medical_genetics')": Language.ARB, "('AR', 'miscellaneous')": Language.ARB, "('AR', 'moral_disputes')": Language.ARB, "('AR', 'moral_scenarios')": Language.ARB, "('AR', 'nutrition')": Language.ARB, "('AR', 'philosophy')": Language.ARB, "('AR', 'prehistory')": Language.ARB, "('AR', 'professional_accounting')": Language.ARB, "('AR', 'professional_law')": Language.ARB, "('AR', 'professional_medicine')": Language.ARB, "('AR', 'professional_psychology')": Language.ARB, "('AR', 'public_relations')": Language.ARB, "('AR', 'security_studies')": Language.ARB, "('AR', 'sociology')": Language.ARB, "('AR', 'us_foreign_policy')": Language.ARB, "('AR', 'virology')": Language.ARB, "('AR', 'world_religions')": Language.ARB, "('DE', 'abstract_algebra')": Language.DEU, "('DE', 'anatomy')": Language.DEU, "('DE', 'astronomy')": Language.DEU, "('DE', 'business_ethics')": Language.DEU, "('DE', 'clinical_knowledge')": Language.DEU, "('DE', 'college_biology')": Language.DEU, "('DE', 'college_chemistry')": Language.DEU, "('DE', 'college_computer_science')": Language.DEU, "('DE', 'college_mathematics')": Language.DEU, "('DE', 'college_medicine')": Language.DEU, "('DE', 'college_physics')": Language.DEU, "('DE', 'computer_security')": Language.DEU, "('DE', 'conceptual_physics')": Language.DEU, "('DE', 'econometrics')": Language.DEU, "('DE', 'electrical_engineering')": Language.DEU, "('DE', 'elementary_mathematics')": Language.DEU, "('DE', 'formal_logic')": Language.DEU, "('DE', 'global_facts')": Language.DEU, "('DE', 'high_school_biology')": Language.DEU, "('DE', 'high_school_chemistry')": 
Language.DEU, "('DE', 'high_school_computer_science')": Language.DEU, "('DE', 'high_school_european_history')": Language.DEU, "('DE', 'high_school_geography')": Language.DEU, "('DE', 'high_school_government_and_politics')": Language.DEU, "('DE', 'high_school_macroeconomics')": Language.DEU, "('DE', 'high_school_mathematics')": Language.DEU, "('DE', 'high_school_microeconomics')": Language.DEU, "('DE', 'high_school_physics')": Language.DEU, "('DE', 'high_school_psychology')": Language.DEU, "('DE', 'high_school_statistics')": Language.DEU, "('DE', 'high_school_us_history')": Language.DEU, "('DE', 'high_school_world_history')": Language.DEU, "('DE', 'human_aging')": Language.DEU, "('DE', 'human_sexuality')": Language.DEU, "('DE', 'international_law')": Language.DEU, "('DE', 'jurisprudence')": Language.DEU, "('DE', 'logical_fallacies')": Language.DEU, "('DE', 'machine_learning')": Language.DEU, "('DE', 'management')": Language.DEU, "('DE', 'marketing')": Language.DEU, "('DE', 'medical_genetics')": Language.DEU, "('DE', 'miscellaneous')": Language.DEU, "('DE', 'moral_disputes')": Language.DEU, "('DE', 'moral_scenarios')": Language.DEU, "('DE', 'nutrition')": Language.DEU, "('DE', 'philosophy')": Language.DEU, "('DE', 'prehistory')": Language.DEU, "('DE', 'professional_accounting')": Language.DEU, "('DE', 'professional_law')": Language.DEU, "('DE', 'professional_medicine')": Language.DEU, "('DE', 'professional_psychology')": Language.DEU, "('DE', 'public_relations')": Language.DEU, "('DE', 'security_studies')": Language.DEU, "('DE', 'sociology')": Language.DEU, "('DE', 'us_foreign_policy')": Language.DEU, "('DE', 'virology')": Language.DEU, "('DE', 'world_religions')": Language.DEU, "('ES', 'abstract_algebra')": Language.SPA, "('ES', 'anatomy')": Language.SPA, "('ES', 'astronomy')": Language.SPA, "('ES', 'business_ethics')": Language.SPA, "('ES', 'clinical_knowledge')": Language.SPA, "('ES', 'college_biology')": Language.SPA, "('ES', 'college_chemistry')": Language.SPA, "('ES', 'college_computer_science')": Language.SPA, "('ES', 'college_mathematics')": Language.SPA, "('ES', 'college_medicine')": Language.SPA, "('ES', 'college_physics')": Language.SPA, "('ES', 'computer_security')": Language.SPA, "('ES', 'conceptual_physics')": Language.SPA, "('ES', 'econometrics')": Language.SPA, "('ES', 'electrical_engineering')": Language.SPA, "('ES', 'elementary_mathematics')": Language.SPA, "('ES', 'formal_logic')": Language.SPA, "('ES', 'global_facts')": Language.SPA, "('ES', 'high_school_biology')": Language.SPA, "('ES', 'high_school_chemistry')": Language.SPA, "('ES', 'high_school_computer_science')": Language.SPA, "('ES', 'high_school_european_history')": Language.SPA, "('ES', 'high_school_geography')": Language.SPA, "('ES', 'high_school_government_and_politics')": Language.SPA, "('ES', 'high_school_macroeconomics')": Language.SPA, "('ES', 'high_school_mathematics')": Language.SPA, "('ES', 'high_school_microeconomics')": Language.SPA, "('ES', 'high_school_physics')": Language.SPA, "('ES', 'high_school_psychology')": Language.SPA, "('ES', 'high_school_statistics')": Language.SPA, "('ES', 'high_school_us_history')": Language.SPA, "('ES', 'high_school_world_history')": Language.SPA, "('ES', 'human_aging')": Language.SPA, "('ES', 'human_sexuality')": Language.SPA, "('ES', 'international_law')": Language.SPA, "('ES', 'jurisprudence')": Language.SPA, "('ES', 'logical_fallacies')": Language.SPA, "('ES', 'machine_learning')": Language.SPA, "('ES', 'management')": Language.SPA, "('ES', 'marketing')": 
Language.SPA, "('ES', 'medical_genetics')": Language.SPA, "('ES', 'miscellaneous')": Language.SPA, "('ES', 'moral_disputes')": Language.SPA, "('ES', 'moral_scenarios')": Language.SPA, "('ES', 'nutrition')": Language.SPA, "('ES', 'philosophy')": Language.SPA, "('ES', 'prehistory')": Language.SPA, "('ES', 'professional_accounting')": Language.SPA, "('ES', 'professional_law')": Language.SPA, "('ES', 'professional_medicine')": Language.SPA, "('ES', 'professional_psychology')": Language.SPA, "('ES', 'public_relations')": Language.SPA, "('ES', 'security_studies')": Language.SPA, "('ES', 'sociology')": Language.SPA, "('ES', 'us_foreign_policy')": Language.SPA, "('ES', 'virology')": Language.SPA, "('ES', 'world_religions')": Language.SPA, "('FR', 'abstract_algebra')": Language.FRA, "('FR', 'anatomy')": Language.FRA, "('FR', 'astronomy')": Language.FRA, "('FR', 'business_ethics')": Language.FRA, "('FR', 'clinical_knowledge')": Language.FRA, "('FR', 'college_biology')": Language.FRA, "('FR', 'college_chemistry')": Language.FRA, "('FR', 'college_computer_science')": Language.FRA, "('FR', 'college_mathematics')": Language.FRA, "('FR', 'college_medicine')": Language.FRA, "('FR', 'college_physics')": Language.FRA, "('FR', 'computer_security')": Language.FRA, "('FR', 'conceptual_physics')": Language.FRA, "('FR', 'econometrics')": Language.FRA, "('FR', 'electrical_engineering')": Language.FRA, "('FR', 'elementary_mathematics')": Language.FRA, "('FR', 'formal_logic')": Language.FRA, "('FR', 'global_facts')": Language.FRA, "('FR', 'high_school_biology')": Language.FRA, "('FR', 'high_school_chemistry')": Language.FRA, "('FR', 'high_school_computer_science')": Language.FRA, "('FR', 'high_school_european_history')": Language.FRA, "('FR', 'high_school_geography')": Language.FRA, "('FR', 'high_school_government_and_politics')": Language.FRA, "('FR', 'high_school_macroeconomics')": Language.FRA, "('FR', 'high_school_mathematics')": Language.FRA, "('FR', 'high_school_microeconomics')": Language.FRA, "('FR', 'high_school_physics')": Language.FRA, "('FR', 'high_school_psychology')": Language.FRA, "('FR', 'high_school_statistics')": Language.FRA, "('FR', 'high_school_us_history')": Language.FRA, "('FR', 'high_school_world_history')": Language.FRA, "('FR', 'human_aging')": Language.FRA, "('FR', 'human_sexuality')": Language.FRA, "('FR', 'international_law')": Language.FRA, "('FR', 'jurisprudence')": Language.FRA, "('FR', 'logical_fallacies')": Language.FRA, "('FR', 'machine_learning')": Language.FRA, "('FR', 'management')": Language.FRA, "('FR', 'marketing')": Language.FRA, "('FR', 'medical_genetics')": Language.FRA, "('FR', 'miscellaneous')": Language.FRA, "('FR', 'moral_disputes')": Language.FRA, "('FR', 'moral_scenarios')": Language.FRA, "('FR', 'nutrition')": Language.FRA, "('FR', 'philosophy')": Language.FRA, "('FR', 'prehistory')": Language.FRA, "('FR', 'professional_accounting')": Language.FRA, "('FR', 'professional_law')": Language.FRA, "('FR', 'professional_medicine')": Language.FRA, "('FR', 'professional_psychology')": Language.FRA, "('FR', 'public_relations')": Language.FRA, "('FR', 'security_studies')": Language.FRA, "('FR', 'sociology')": Language.FRA, "('FR', 'us_foreign_policy')": Language.FRA, "('FR', 'virology')": Language.FRA, "('FR', 'world_religions')": Language.FRA, "('IT', 'abstract_algebra')": Language.ITA, "('IT', 'anatomy')": Language.ITA, "('IT', 'astronomy')": Language.ITA, "('IT', 'business_ethics')": Language.ITA, "('IT', 'clinical_knowledge')": Language.ITA, "('IT', 'college_biology')": 
Language.ITA, "('IT', 'college_chemistry')": Language.ITA, "('IT', 'college_computer_science')": Language.ITA, "('IT', 'college_mathematics')": Language.ITA, "('IT', 'college_medicine')": Language.ITA, "('IT', 'college_physics')": Language.ITA, "('IT', 'computer_security')": Language.ITA, "('IT', 'conceptual_physics')": Language.ITA, "('IT', 'econometrics')": Language.ITA, "('IT', 'electrical_engineering')": Language.ITA, "('IT', 'elementary_mathematics')": Language.ITA, "('IT', 'formal_logic')": Language.ITA, "('IT', 'global_facts')": Language.ITA, "('IT', 'high_school_biology')": Language.ITA, "('IT', 'high_school_chemistry')": Language.ITA, "('IT', 'high_school_computer_science')": Language.ITA, "('IT', 'high_school_european_history')": Language.ITA, "('IT', 'high_school_geography')": Language.ITA, "('IT', 'high_school_government_and_politics')": Language.ITA, "('IT', 'high_school_macroeconomics')": Language.ITA, "('IT', 'high_school_mathematics')": Language.ITA, "('IT', 'high_school_microeconomics')": Language.ITA, "('IT', 'high_school_physics')": Language.ITA, "('IT', 'high_school_psychology')": Language.ITA, "('IT', 'high_school_statistics')": Language.ITA, "('IT', 'high_school_us_history')": Language.ITA, "('IT', 'high_school_world_history')": Language.ITA, "('IT', 'human_aging')": Language.ITA, "('IT', 'human_sexuality')": Language.ITA, "('IT', 'international_law')": Language.ITA, "('IT', 'jurisprudence')": Language.ITA, "('IT', 'logical_fallacies')": Language.ITA, "('IT', 'machine_learning')": Language.ITA, "('IT', 'management')": Language.ITA, "('IT', 'marketing')": Language.ITA, "('IT', 'medical_genetics')": Language.ITA, "('IT', 'miscellaneous')": Language.ITA, "('IT', 'moral_disputes')": Language.ITA, "('IT', 'moral_scenarios')": Language.ITA, "('IT', 'nutrition')": Language.ITA, "('IT', 'philosophy')": Language.ITA, "('IT', 'prehistory')": Language.ITA, "('IT', 'professional_accounting')": Language.ITA, "('IT', 'professional_law')": Language.ITA, "('IT', 'professional_medicine')": Language.ITA, "('IT', 'professional_psychology')": Language.ITA, "('IT', 'public_relations')": Language.ITA, "('IT', 'security_studies')": Language.ITA, "('IT', 'sociology')": Language.ITA, "('IT', 'us_foreign_policy')": Language.ITA, "('IT', 'virology')": Language.ITA, "('IT', 'world_religions')": Language.ITA, "('PT', 'abstract_algebra')": Language.POR, "('PT', 'anatomy')": Language.POR, "('PT', 'astronomy')": Language.POR, "('PT', 'business_ethics')": Language.POR, "('PT', 'clinical_knowledge')": Language.POR, "('PT', 'college_biology')": Language.POR, "('PT', 'college_chemistry')": Language.POR, "('PT', 'college_computer_science')": Language.POR, "('PT', 'college_mathematics')": Language.POR, "('PT', 'college_medicine')": Language.POR, "('PT', 'college_physics')": Language.POR, "('PT', 'computer_security')": Language.POR, "('PT', 'conceptual_physics')": Language.POR, "('PT', 'econometrics')": Language.POR, "('PT', 'electrical_engineering')": Language.POR, "('PT', 'elementary_mathematics')": Language.POR, "('PT', 'formal_logic')": Language.POR, "('PT', 'global_facts')": Language.POR, "('PT', 'high_school_biology')": Language.POR, "('PT', 'high_school_chemistry')": Language.POR, "('PT', 'high_school_computer_science')": Language.POR, "('PT', 'high_school_european_history')": Language.POR, "('PT', 'high_school_geography')": Language.POR, "('PT', 'high_school_government_and_politics')": Language.POR, "('PT', 'high_school_macroeconomics')": Language.POR, "('PT', 'high_school_mathematics')": 
Language.POR, "('PT', 'high_school_microeconomics')": Language.POR, "('PT', 'high_school_physics')": Language.POR, "('PT', 'high_school_psychology')": Language.POR, "('PT', 'high_school_statistics')": Language.POR, "('PT', 'high_school_us_history')": Language.POR, "('PT', 'high_school_world_history')": Language.POR, "('PT', 'human_aging')": Language.POR, "('PT', 'human_sexuality')": Language.POR, "('PT', 'international_law')": Language.POR, "('PT', 'jurisprudence')": Language.POR, "('PT', 'logical_fallacies')": Language.POR, "('PT', 'machine_learning')": Language.POR, "('PT', 'management')": Language.POR, "('PT', 'marketing')": Language.POR, "('PT', 'medical_genetics')": Language.POR, "('PT', 'miscellaneous')": Language.POR, "('PT', 'moral_disputes')": Language.POR, "('PT', 'moral_scenarios')": Language.POR, "('PT', 'nutrition')": Language.POR, "('PT', 'philosophy')": Language.POR, "('PT', 'prehistory')": Language.POR, "('PT', 'professional_accounting')": Language.POR, "('PT', 'professional_law')": Language.POR, "('PT', 'professional_medicine')": Language.POR, "('PT', 'professional_psychology')": Language.POR, "('PT', 'public_relations')": Language.POR, "('PT', 'security_studies')": Language.POR, "('PT', 'sociology')": Language.POR, "('PT', 'us_foreign_policy')": Language.POR, "('PT', 'virology')": Language.POR, "('PT', 'world_religions')": Language.POR}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'MMMLU'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = [('FR_FR', 'abstract_algebra'), ('FR_FR', 'anatomy'), ('FR_FR', 'astronomy'), ('FR_FR', 'business_ethics'), ('FR_FR', 'clinical_knowledge'), ('FR_FR', 'college_biology'), ('FR_FR', 'college_chemistry'), ('FR_FR', 'college_computer_science'), ('FR_FR', 'college_mathematics'), ('FR_FR', 'college_medicine'), ('FR_FR', 'college_physics'), ('FR_FR', 'computer_security'), ('FR_FR', 'conceptual_physics'), ('FR_FR', 'econometrics'), ('FR_FR', 'electrical_engineering'), ('FR_FR', 'elementary_mathematics'), ('FR_FR', 'formal_logic'), ('FR_FR', 'global_facts'), ('FR_FR', 'high_school_biology'), ('FR_FR', 'high_school_chemistry'), ('FR_FR', 'high_school_computer_science'), ('FR_FR', 'high_school_european_history'), ('FR_FR', 'high_school_geography'), ('FR_FR', 'high_school_government_and_politics'), ('FR_FR', 'high_school_macroeconomics'), ('FR_FR', 'high_school_mathematics'), ('FR_FR', 'high_school_microeconomics'), ('FR_FR', 'high_school_physics'), ('FR_FR', 'high_school_psychology'), ('FR_FR', 'high_school_statistics'), ('FR_FR', 'high_school_us_history'), ('FR_FR', 'high_school_world_history'), ('FR_FR', 'human_aging'), ('FR_FR', 'human_sexuality'), ('FR_FR', 'international_law'), ('FR_FR', 'jurisprudence'), ('FR_FR', 'logical_fallacies'), ('FR_FR', 'machine_learning'), ('FR_FR', 'management'), ('FR_FR', 'marketing'), ('FR_FR', 'medical_genetics'), ('FR_FR', 'miscellaneous'), ('FR_FR', 'moral_disputes'), ('FR_FR', 'moral_scenarios'), ('FR_FR', 'nutrition'), ('FR_FR', 'philosophy'), ('FR_FR', 'prehistory'), ('FR_FR', 'professional_accounting'), ('FR_FR', 'professional_law'), ('FR_FR', 'professional_medicine'), ('FR_FR', 'professional_psychology'), ('FR_FR', 'public_relations'), ('FR_FR', 'security_studies'), ('FR_FR', 'sociology'), ('FR_FR', 'us_foreign_policy'), ('FR_FR', 'virology'), ('FR_FR', 'world_religions'), ('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 
'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions'), ('ES_LA', 'abstract_algebra'), ('ES_LA', 'anatomy'), ('ES_LA', 'astronomy'), ('ES_LA', 'business_ethics'), ('ES_LA', 'clinical_knowledge'), ('ES_LA', 'college_biology'), ('ES_LA', 'college_chemistry'), ('ES_LA', 'college_computer_science'), ('ES_LA', 'college_mathematics'), ('ES_LA', 'college_medicine'), ('ES_LA', 'college_physics'), ('ES_LA', 'computer_security'), ('ES_LA', 'conceptual_physics'), ('ES_LA', 'econometrics'), ('ES_LA', 'electrical_engineering'), ('ES_LA', 'elementary_mathematics'), ('ES_LA', 'formal_logic'), ('ES_LA', 'global_facts'), ('ES_LA', 'high_school_biology'), ('ES_LA', 'high_school_chemistry'), ('ES_LA', 'high_school_computer_science'), ('ES_LA', 'high_school_european_history'), ('ES_LA', 'high_school_geography'), ('ES_LA', 'high_school_government_and_politics'), ('ES_LA', 'high_school_macroeconomics'), ('ES_LA', 'high_school_mathematics'), ('ES_LA', 'high_school_microeconomics'), ('ES_LA', 'high_school_physics'), ('ES_LA', 'high_school_psychology'), ('ES_LA', 'high_school_statistics'), ('ES_LA', 'high_school_us_history'), ('ES_LA', 'high_school_world_history'), ('ES_LA', 'human_aging'), ('ES_LA', 'human_sexuality'), ('ES_LA', 'international_law'), ('ES_LA', 'jurisprudence'), ('ES_LA', 'logical_fallacies'), ('ES_LA', 'machine_learning'), ('ES_LA', 'management'), ('ES_LA', 'marketing'), ('ES_LA', 'medical_genetics'), ('ES_LA', 'miscellaneous'), ('ES_LA', 'moral_disputes'), ('ES_LA', 'moral_scenarios'), ('ES_LA', 'nutrition'), ('ES_LA', 'philosophy'), ('ES_LA', 'prehistory'), ('ES_LA', 'professional_accounting'), ('ES_LA', 'professional_law'), ('ES_LA', 'professional_medicine'), ('ES_LA', 'professional_psychology'), ('ES_LA', 'public_relations'), ('ES_LA', 'security_studies'), ('ES_LA', 'sociology'), ('ES_LA', 'us_foreign_policy'), ('ES_LA', 'virology'), ('ES_LA', 'world_religions'), ('IT_IT', 'abstract_algebra'), ('IT_IT', 'anatomy'), ('IT_IT', 'astronomy'), ('IT_IT', 'business_ethics'), ('IT_IT', 'clinical_knowledge'), ('IT_IT', 'college_biology'), ('IT_IT', 'college_chemistry'), ('IT_IT', 'college_computer_science'), ('IT_IT', 'college_mathematics'), ('IT_IT', 'college_medicine'), ('IT_IT', 'college_physics'), ('IT_IT', 'computer_security'), ('IT_IT', 'conceptual_physics'), ('IT_IT', 'econometrics'), ('IT_IT', 'electrical_engineering'), ('IT_IT', 'elementary_mathematics'), ('IT_IT', 'formal_logic'), ('IT_IT', 'global_facts'), ('IT_IT', 'high_school_biology'), ('IT_IT', 'high_school_chemistry'), ('IT_IT', 'high_school_computer_science'), ('IT_IT', 'high_school_european_history'), ('IT_IT', 'high_school_geography'), ('IT_IT', 'high_school_government_and_politics'), ('IT_IT', 'high_school_macroeconomics'), ('IT_IT', 'high_school_mathematics'), ('IT_IT', 'high_school_microeconomics'), ('IT_IT', 'high_school_physics'), ('IT_IT', 'high_school_psychology'), ('IT_IT', 'high_school_statistics'), ('IT_IT', 'high_school_us_history'), ('IT_IT', 'high_school_world_history'), ('IT_IT', 'human_aging'), ('IT_IT', 'human_sexuality'), ('IT_IT', 'international_law'), ('IT_IT', 'jurisprudence'), ('IT_IT', 'logical_fallacies'), ('IT_IT', 'machine_learning'), ('IT_IT', 'management'), ('IT_IT', 'marketing'), ('IT_IT', 'medical_genetics'), ('IT_IT', 'miscellaneous'), ('IT_IT', 'moral_disputes'), ('IT_IT', 'moral_scenarios'), ('IT_IT', 'nutrition'), ('IT_IT', 'philosophy'), ('IT_IT', 'prehistory'), ('IT_IT', 
'professional_accounting'), ('IT_IT', 'professional_law'), ('IT_IT', 'professional_medicine'), ('IT_IT', 'professional_psychology'), ('IT_IT', 'public_relations'), ('IT_IT', 'security_studies'), ('IT_IT', 'sociology'), ('IT_IT', 'us_foreign_policy'), ('IT_IT', 'virology'), ('IT_IT', 'world_religions'), ('PT_BR', 'abstract_algebra'), ('PT_BR', 'anatomy'), ('PT_BR', 'astronomy'), ('PT_BR', 'business_ethics'), ('PT_BR', 'clinical_knowledge'), ('PT_BR', 'college_biology'), ('PT_BR', 'college_chemistry'), ('PT_BR', 'college_computer_science'), ('PT_BR', 'college_mathematics'), ('PT_BR', 'college_medicine'), ('PT_BR', 'college_physics'), ('PT_BR', 'computer_security'), ('PT_BR', 'conceptual_physics'), ('PT_BR', 'econometrics'), ('PT_BR', 'electrical_engineering'), ('PT_BR', 'elementary_mathematics'), ('PT_BR', 'formal_logic'), ('PT_BR', 'global_facts'), ('PT_BR', 'high_school_biology'), ('PT_BR', 'high_school_chemistry'), ('PT_BR', 'high_school_computer_science'), ('PT_BR', 'high_school_european_history'), ('PT_BR', 'high_school_geography'), ('PT_BR', 'high_school_government_and_politics'), ('PT_BR', 'high_school_macroeconomics'), ('PT_BR', 'high_school_mathematics'), ('PT_BR', 'high_school_microeconomics'), ('PT_BR', 'high_school_physics'), ('PT_BR', 'high_school_psychology'), ('PT_BR', 'high_school_statistics'), ('PT_BR', 'high_school_us_history'), ('PT_BR', 'high_school_world_history'), ('PT_BR', 'human_aging'), ('PT_BR', 'human_sexuality'), ('PT_BR', 'international_law'), ('PT_BR', 'jurisprudence'), ('PT_BR', 'logical_fallacies'), ('PT_BR', 'machine_learning'), ('PT_BR', 'management'), ('PT_BR', 'marketing'), ('PT_BR', 'medical_genetics'), ('PT_BR', 'miscellaneous'), ('PT_BR', 'moral_disputes'), ('PT_BR', 'moral_scenarios'), ('PT_BR', 'nutrition'), ('PT_BR', 'philosophy'), ('PT_BR', 'prehistory'), ('PT_BR', 'professional_accounting'), ('PT_BR', 'professional_law'), ('PT_BR', 'professional_medicine'), ('PT_BR', 'professional_psychology'), ('PT_BR', 'public_relations'), ('PT_BR', 'security_studies'), ('PT_BR', 'sociology'), ('PT_BR', 'us_foreign_policy'), ('PT_BR', 'virology'), ('PT_BR', 'world_religions'), ('AR_XY', 'abstract_algebra'), ('AR_XY', 'anatomy'), ('AR_XY', 'astronomy'), ('AR_XY', 'business_ethics'), ('AR_XY', 'clinical_knowledge'), ('AR_XY', 'college_biology'), ('AR_XY', 'college_chemistry'), ('AR_XY', 'college_computer_science'), ('AR_XY', 'college_mathematics'), ('AR_XY', 'college_medicine'), ('AR_XY', 'college_physics'), ('AR_XY', 'computer_security'), ('AR_XY', 'conceptual_physics'), ('AR_XY', 'econometrics'), ('AR_XY', 'electrical_engineering'), ('AR_XY', 'elementary_mathematics'), ('AR_XY', 'formal_logic'), ('AR_XY', 'global_facts'), ('AR_XY', 'high_school_biology'), ('AR_XY', 'high_school_chemistry'), ('AR_XY', 'high_school_computer_science'), ('AR_XY', 'high_school_european_history'), ('AR_XY', 'high_school_geography'), ('AR_XY', 'high_school_government_and_politics'), ('AR_XY', 'high_school_macroeconomics'), ('AR_XY', 'high_school_mathematics'), ('AR_XY', 'high_school_microeconomics'), ('AR_XY', 'high_school_physics'), ('AR_XY', 'high_school_psychology'), ('AR_XY', 'high_school_statistics'), ('AR_XY', 'high_school_us_history'), ('AR_XY', 'high_school_world_history'), ('AR_XY', 'human_aging'), ('AR_XY', 'human_sexuality'), ('AR_XY', 'international_law'), ('AR_XY', 'jurisprudence'), ('AR_XY', 'logical_fallacies'), ('AR_XY', 'machine_learning'), ('AR_XY', 'management'), ('AR_XY', 'marketing'), ('AR_XY', 'medical_genetics'), ('AR_XY', 'miscellaneous'), ('AR_XY', 
'moral_disputes'), ('AR_XY', 'moral_scenarios'), ('AR_XY', 'nutrition'), ('AR_XY', 'philosophy'), ('AR_XY', 'prehistory'), ('AR_XY', 'professional_accounting'), ('AR_XY', 'professional_law'), ('AR_XY', 'professional_medicine'), ('AR_XY', 'professional_psychology'), ('AR_XY', 'public_relations'), ('AR_XY', 'security_studies'), ('AR_XY', 'sociology'), ('AR_XY', 'us_foreign_policy'), ('AR_XY', 'virology'), ('AR_XY', 'world_religions')]¶
- class eval_framework.tasks.benchmarks.mmmlu.MMMLU_GERMAN_COT(num_fewshot=0)[source]¶
Bases:
MMMLU- Parameters:
num_fewshot (int)
- ANS_RE = re.compile('Daher lautet die Antwort: ([ABCD])')¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {"('de', 'abstract_algebra')": Language.DEU, "('de', 'anatomy')": Language.DEU, "('de', 'astronomy')": Language.DEU, "('de', 'business_ethics')": Language.DEU, "('de', 'clinical_knowledge')": Language.DEU, "('de', 'college_biology')": Language.DEU, "('de', 'college_chemistry')": Language.DEU, "('de', 'college_computer_science')": Language.DEU, "('de', 'college_mathematics')": Language.DEU, "('de', 'college_medicine')": Language.DEU, "('de', 'college_physics')": Language.DEU, "('de', 'computer_security')": Language.DEU, "('de', 'conceptual_physics')": Language.DEU, "('de', 'econometrics')": Language.DEU, "('de', 'electrical_engineering')": Language.DEU, "('de', 'elementary_mathematics')": Language.DEU, "('de', 'formal_logic')": Language.DEU, "('de', 'global_facts')": Language.DEU, "('de', 'high_school_biology')": Language.DEU, "('de', 'high_school_chemistry')": Language.DEU, "('de', 'high_school_computer_science')": Language.DEU, "('de', 'high_school_european_history')": Language.DEU, "('de', 'high_school_geography')": Language.DEU, "('de', 'high_school_government_and_politics')": Language.DEU, "('de', 'high_school_macroeconomics')": Language.DEU, "('de', 'high_school_mathematics')": Language.DEU, "('de', 'high_school_microeconomics')": Language.DEU, "('de', 'high_school_physics')": Language.DEU, "('de', 'high_school_psychology')": Language.DEU, "('de', 'high_school_statistics')": Language.DEU, "('de', 'high_school_us_history')": Language.DEU, "('de', 'high_school_world_history')": Language.DEU, "('de', 'human_aging')": Language.DEU, "('de', 'human_sexuality')": Language.DEU, "('de', 'international_law')": Language.DEU, "('de', 'jurisprudence')": Language.DEU, "('de', 'logical_fallacies')": Language.DEU, "('de', 'machine_learning')": Language.DEU, "('de', 'management')": Language.DEU, "('de', 'marketing')": Language.DEU, "('de', 'medical_genetics')": Language.DEU, "('de', 'miscellaneous')": Language.DEU, "('de', 'moral_disputes')": Language.DEU, "('de', 'moral_scenarios')": Language.DEU, "('de', 'nutrition')": Language.DEU, "('de', 'philosophy')": Language.DEU, "('de', 'prehistory')": Language.DEU, "('de', 'professional_accounting')": Language.DEU, "('de', 'professional_law')": Language.DEU, "('de', 'professional_medicine')": Language.DEU, "('de', 'professional_psychology')": Language.DEU, "('de', 'public_relations')": Language.DEU, "('de', 'security_studies')": Language.DEU, "('de', 'sociology')": Language.DEU, "('de', 'us_foreign_policy')": Language.DEU, "('de', 'virology')": Language.DEU, "('de', 'world_religions')": Language.DEU}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.language_checker.GermanCompletionChecker'>]¶
- NAME: str = 'MMMLU_GERMAN_COT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Frage', 'Question', 'Answer', 'A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SUBJECTS: list[SubjectType] = [('DE_DE', 'abstract_algebra'), ('DE_DE', 'anatomy'), ('DE_DE', 'astronomy'), ('DE_DE', 'business_ethics'), ('DE_DE', 'clinical_knowledge'), ('DE_DE', 'college_biology'), ('DE_DE', 'college_chemistry'), ('DE_DE', 'college_computer_science'), ('DE_DE', 'college_mathematics'), ('DE_DE', 'college_medicine'), ('DE_DE', 'college_physics'), ('DE_DE', 'computer_security'), ('DE_DE', 'conceptual_physics'), ('DE_DE', 'econometrics'), ('DE_DE', 'electrical_engineering'), ('DE_DE', 'elementary_mathematics'), ('DE_DE', 'formal_logic'), ('DE_DE', 'global_facts'), ('DE_DE', 'high_school_biology'), ('DE_DE', 'high_school_chemistry'), ('DE_DE', 'high_school_computer_science'), ('DE_DE', 'high_school_european_history'), ('DE_DE', 'high_school_geography'), ('DE_DE', 'high_school_government_and_politics'), ('DE_DE', 'high_school_macroeconomics'), ('DE_DE', 'high_school_mathematics'), ('DE_DE', 'high_school_microeconomics'), ('DE_DE', 'high_school_physics'), ('DE_DE', 'high_school_psychology'), ('DE_DE', 'high_school_statistics'), ('DE_DE', 'high_school_us_history'), ('DE_DE', 'high_school_world_history'), ('DE_DE', 'human_aging'), ('DE_DE', 'human_sexuality'), ('DE_DE', 'international_law'), ('DE_DE', 'jurisprudence'), ('DE_DE', 'logical_fallacies'), ('DE_DE', 'machine_learning'), ('DE_DE', 'management'), ('DE_DE', 'marketing'), ('DE_DE', 'medical_genetics'), ('DE_DE', 'miscellaneous'), ('DE_DE', 'moral_disputes'), ('DE_DE', 'moral_scenarios'), ('DE_DE', 'nutrition'), ('DE_DE', 'philosophy'), ('DE_DE', 'prehistory'), ('DE_DE', 'professional_accounting'), ('DE_DE', 'professional_law'), ('DE_DE', 'professional_medicine'), ('DE_DE', 'professional_psychology'), ('DE_DE', 'public_relations'), ('DE_DE', 'security_studies'), ('DE_DE', 'sociology'), ('DE_DE', 'us_foreign_policy'), ('DE_DE', 'virology'), ('DE_DE', 'world_religions')]¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
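For illustration, a minimal sketch of how the ANS_RE pattern above can pull the final answer letter out of a German chain-of-thought completion; the empty-string fallback is an assumption, and the task's actual post_process_generated_completion may behave differently:

    import re

    ANS_RE = re.compile(r"Daher lautet die Antwort: ([ABCD])")

    def extract_answer(completion_text: str) -> str:
        # Search the completion for the German answer marker and return
        # the letter; fall back to an empty string if it is missing.
        match = ANS_RE.search(completion_text)
        return match.group(1) if match else ""

    print(extract_answer("... Daher lautet die Antwort: B"))  # -> 'B'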
eval_framework.tasks.benchmarks.openbookqa module¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]OpenBookQA dataset: https://huggingface.co/datasets/allenai/openbookqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/openbookqa'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'OpenBookQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['A', 'B', 'C', 'D']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['additional']¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_EVAL_HARNESS(num_fewshot=0)[source]¶
Bases:
OPENBOOKQAClosed-book version of OpenBookQA: the question is shown without the supporting fact.
- Parameters:
num_fewshot (int)
- NAME: str = 'OpenBookQAEvalHarness'¶
- class eval_framework.tasks.benchmarks.openbookqa.OPENBOOKQA_IDK(num_fewshot=0)[source]¶
Bases:
OPENBOOKQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'OpenBookQA_IDK'¶
eval_framework.tasks.benchmarks.opengptx_eu20 module¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_DE(num_fewshot=0)[source]¶
Bases:
ARCEU20 benchmarks from the openGPT-X paper:
- paper: https://arxiv.org/abs/2410.08928
- leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
- dataset: https://huggingface.co/datasets/openGPT-X/arcx
entries in ‘challenge_DE': 1172 test, 299 validation, 198 train
entries in ‘easy_DE': 2376 test, 570 validation, 197 train
features: [‘id’, ‘question’, ‘choices’, ‘answerKey’],
SUBJECTS = [‘challenge_BG’, ‘easy_BG’, ‘challenge_DA’, ‘easy_DA’, ‘challenge_DE’, ‘easy_DE’, ‘challenge_ET’, ‘easy_ET’, ‘challenge_FI’, ‘easy_FI’, ‘challenge_FR’, ‘easy_FR’, ‘challenge_EL’, ‘easy_EL’, ‘challenge_IT’, ‘easy_IT’, ‘challenge_LV’, ‘easy_LV’, ‘challenge_LT’, ‘easy_LT’, ‘challenge_NL’, ‘easy_NL’, ‘challenge_PL’, ‘easy_PL’, ‘challenge_PT-PT’, ‘easy_PT-PT’, ‘challenge_RO’, ‘easy_RO’, ‘challenge_SV’, ‘easy_SV’, ‘challenge_SK’, ‘easy_SK’, ‘challenge_SL’, ‘easy_SL’, ‘challenge_ES’, ‘easy_ES’, ‘challenge_CS’, ‘easy_CS’, ‘challenge_HU’, ‘easy_HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/arcx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'ARC_EU20_DE'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['challenge_DE', 'easy_DE']¶
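The subjects listed above are Hugging Face configs whose names combine difficulty and language code. Assuming the standard datasets API, a subject split can be loaded along these lines:

    from datasets import load_dataset

    # Each subject such as 'challenge_DE' is a config of openGPT-X/arcx.
    ds = load_dataset("openGPT-X/arcx", "challenge_DE", split="test")

    # Features as documented above: 'id', 'question', 'choices', 'answerKey'.
    sample = ds[0]
    print(sample["question"], sample["choices"], sample["answerKey"])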
- class eval_framework.tasks.benchmarks.opengptx_eu20.ARC_EU20_FR(num_fewshot=0)[source]¶
Bases:
ARC- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/arcx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'ARC_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['challenge_FR', 'easy_FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_DE(num_fewshot=0)[source]¶
Bases:
GSM8KEvalHarness- https://huggingface.co/datasets/openGPT-X/gsm8kx
- entries in ‘DE’: 1319 test, 104 train
features: [‘question’, ‘answer’, ‘id’],
SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/gsm8kx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'GSM8K_EU20_DE'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['DE']¶
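GSM8K answer strings conventionally end with a '#### <number>' marker after the worked solution. Assuming the translated gsm8kx data keeps that convention, the final answer could be recovered roughly as follows (illustrative only, not the framework's actual parser):

    import re

    # Matches the trailing GSM8K answer marker, e.g. '#### 7' (assumed format).
    FINAL_ANSWER_RE = re.compile(r"####\s*(-?[\d.,]+)")

    def final_answer(answer_text: str) -> str | None:
        match = FINAL_ANSWER_RE.search(answer_text)
        return match.group(1) if match else None

    print(final_answer("3 + 4 = 7 Autos insgesamt.\n#### 7"))  # -> '7'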
- class eval_framework.tasks.benchmarks.opengptx_eu20.GSM8K_EU20_FR(num_fewshot=0)[source]¶
Bases:
GSM8KEvalHarness- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/gsm8kx'¶
- FEWSHOT_SPLIT: str = 'train'¶
- NAME: str = 'GSM8K_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_DE(num_fewshot=0)[source]¶
Bases:
HELLASWAG- https://huggingface.co/datasets/openGPT-X/hellaswagx
- entries in ‘DE’: 99 train, 9979 validation
features: [‘ind’, ‘activity_label’, ‘ctx_a’, ‘ctx_b’, ‘ctx’, ‘endings’, ‘source_id’, ‘split’, ‘split_type’, ‘label’],
SUBJECTS = [‘BG’, ‘DA’, ‘DE’, ‘ET’, ‘FI’, ‘FR’, ‘EL’, ‘IT’, ‘LV’, ‘LT’, ‘NL’, ‘PL’, ‘PT-PT’, ‘RO’, ‘SV’, ‘SK’, ‘SL’, ‘ES’, ‘CS’, ‘HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/hellaswagx'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- NAME: str = 'HellaSwag_EU20_DE'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['DE']¶
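Like the other loglikelihood tasks here, HellaSwag scoring asks the model for a log-probability per candidate ending and checks whether the argmax matches the label. A minimal sketch; the per-length normalization shown for the AccuracyNormLoglikelihood variant is a common convention and is assumed rather than taken from the framework:

    def pick_by_loglikelihood(endings: list[str], logprobs: list[float],
                              normalize: bool = False) -> int:
        # Optionally divide each log-probability by the ending's length so
        # longer endings are not penalized for containing more characters.
        scores = [lp / len(e) if normalize else lp
                  for e, lp in zip(endings, logprobs)]
        return max(range(len(scores)), key=scores.__getitem__)

    endings = ["geht nach Hause.", "fliegt zum Mond."]
    print(pick_by_loglikelihood(endings, [-12.3, -15.9]))  # -> 0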
- class eval_framework.tasks.benchmarks.opengptx_eu20.HELLASWAG_EU20_FR(num_fewshot=0)[source]¶
Bases:
HELLASWAG- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/hellaswagx'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- NAME: str = 'HellaSwag_EU20_FR'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_DE(num_fewshot=0)[source]¶
Bases:
MMLU- https://huggingface.co/datasets/openGPT-X/mmlux
- entries in ‘philosophy_DE’: 311 test, 5 dev, 5 validation
features: [‘question’, ‘choices’, ‘answer’, ‘id’],
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/mmlux'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- NAME: str = 'MMLU_EU20_DE'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'A', 'B', 'C', 'D', 'Frage']¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra_DE', 'anatomy_DE', 'astronomy_DE', 'business_ethics_DE', 'clinical_knowledge_DE', 'college_biology_DE', 'college_chemistry_DE', 'college_computer_science_DE', 'college_mathematics_DE', 'college_medicine_DE', 'college_physics_DE', 'computer_security_DE', 'conceptual_physics_DE', 'econometrics_DE', 'electrical_engineering_DE', 'elementary_mathematics_DE', 'formal_logic_DE', 'global_facts_DE', 'high_school_biology_DE', 'high_school_chemistry_DE', 'high_school_computer_science_DE', 'high_school_european_history_DE', 'high_school_geography_DE', 'high_school_government_and_politics_DE', 'high_school_macroeconomics_DE', 'high_school_mathematics_DE', 'high_school_microeconomics_DE', 'high_school_physics_DE', 'high_school_psychology_DE', 'high_school_statistics_DE', 'high_school_us_history_DE', 'high_school_world_history_DE', 'human_aging_DE', 'human_sexuality_DE', 'international_law_DE', 'jurisprudence_DE', 'logical_fallacies_DE', 'machine_learning_DE', 'management_DE', 'marketing_DE', 'medical_genetics_DE', 'miscellaneous_DE', 'moral_disputes_DE', 'moral_scenarios_DE', 'nutrition_DE', 'philosophy_DE', 'prehistory_DE', 'professional_accounting_DE', 'professional_law_DE', 'professional_medicine_DE', 'professional_psychology_DE', 'public_relations_DE', 'security_studies_DE', 'sociology_DE', 'us_foreign_policy_DE', 'virology_DE', 'world_religions_DE']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.MMLU_EU20_FR(num_fewshot=0)[source]¶
Bases:
MMLU- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/mmlux'¶
- FEWSHOT_SPLIT: str = 'dev'¶
- NAME: str = 'MMLU_EU20_FR'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['abstract_algebra_FR', 'anatomy_FR', 'astronomy_FR', 'business_ethics_FR', 'clinical_knowledge_FR', 'college_biology_FR', 'college_chemistry_FR', 'college_computer_science_FR', 'college_mathematics_FR', 'college_medicine_FR', 'college_physics_FR', 'computer_security_FR', 'conceptual_physics_FR', 'econometrics_FR', 'electrical_engineering_FR', 'elementary_mathematics_FR', 'formal_logic_FR', 'global_facts_FR', 'high_school_biology_FR', 'high_school_chemistry_FR', 'high_school_computer_science_FR', 'high_school_european_history_FR', 'high_school_geography_FR', 'high_school_government_and_politics_FR', 'high_school_macroeconomics_FR', 'high_school_mathematics_FR', 'high_school_microeconomics_FR', 'high_school_physics_FR', 'high_school_psychology_FR', 'high_school_statistics_FR', 'high_school_us_history_FR', 'high_school_world_history_FR', 'human_aging_FR', 'human_sexuality_FR', 'international_law_FR', 'jurisprudence_FR', 'logical_fallacies_FR', 'machine_learning_FR', 'management_FR', 'marketing_FR', 'medical_genetics_FR', 'miscellaneous_FR', 'moral_disputes_FR', 'moral_scenarios_FR', 'nutrition_FR', 'philosophy_FR', 'prehistory_FR', 'professional_accounting_FR', 'professional_law_FR', 'professional_medicine_FR', 'professional_psychology_FR', 'public_relations_FR', 'security_studies_FR', 'sociology_FR', 'us_foreign_policy_FR', 'virology_FR', 'world_religions_FR']¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_DE(num_fewshot=0)[source]¶
Bases:
TRUTHFULQA- https://huggingface.co/datasets/openGPT-X/truthfulqax
- entries in ‘mc_DE’: 817 validation
features: [‘question’, ‘mc1_targets’, ‘mc2_targets’, ‘id’],
- entries in ‘gen_DE’: 817 validation
features: [‘type’, ‘category’, ‘question’, ‘best_answer’, ‘correct_answers’, ‘incorrect_answers’, ‘source’, ‘id’],
SUBJECTS = [‘mc_BG’, ‘gen_BG’, ‘mc_DA’, ‘gen_DA’, ‘mc_DE’, ‘gen_DE’, ‘mc_ET’, ‘gen_ET’, ‘mc_FI’, ‘gen_FI’, ‘mc_FR’, ‘gen_FR’, ‘mc_EL’, ‘gen_EL’, ‘mc_IT’, ‘gen_IT’, ‘mc_LV’, ‘gen_LV’, ‘mc_LT’, ‘gen_LT’, ‘mc_NL’, ‘gen_NL’, ‘mc_PL’, ‘gen_PL’, ‘mc_PT-PT’, ‘gen_PT-PT’, ‘mc_RO’, ‘gen_RO’, ‘mc_SV’, ‘gen_SV’, ‘mc_SK’, ‘gen_SK’, ‘mc_SL’, ‘gen_SL’, ‘mc_ES’, ‘gen_ES’, ‘mc_CS’, ‘gen_CS’, ‘mc_HU’, ‘gen_HU’]
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/truthfulqax'¶
- NAME: str = 'TruthfulQA_EU20_DE'¶
- class eval_framework.tasks.benchmarks.opengptx_eu20.TRUTHFULQA_EU20_FR(num_fewshot=0)[source]¶
Bases:
TRUTHFULQA- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'openGPT-X/truthfulqax'¶
- NAME: str = 'TruthfulQA_EU20_FR'¶
eval_framework.tasks.benchmarks.pawsx module¶
- class eval_framework.tasks.benchmarks.pawsx.PAWSX(num_fewshot=0)[source]¶
Bases:
BaseTask[str]PAWS-X dataset: https://huggingface.co/datasets/google-research-datasets/paws-x, used as suggested in the PARAPHRASUS benchmark (https://arxiv.org/pdf/2409.12060).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'google-research-datasets/paws-x'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de': Language.DEU, 'en': Language.ENG}¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>]¶
- NAME: str = 'PAWS-X'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Ja', 'Nein', 'Paraphrasen', 'Yes', 'No', 'paraphrases']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['en', 'de']¶
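PAWS-X reduces each pair to a binary paraphrase judgment. A hypothetical prompt in the spirit of the unmodifiable words above ('Yes'/'No', 'Ja'/'Nein'); the framework's actual instruction text is not reproduced here:

    def pawsx_prompt(sentence1: str, sentence2: str, lang: str = "en") -> str:
        # Hypothetical template; the wording is an assumption, not the task's own.
        if lang == "de":
            return ("Sind die folgenden Sätze Paraphrasen?\n"
                    f"1: {sentence1}\n2: {sentence2}\nAntwort (Ja/Nein):")
        return ("Are the following sentences paraphrases?\n"
                f"1: {sentence1}\n2: {sentence2}\nAnswer (Yes/No):")

    print(pawsx_prompt("He went home.", "He returned to his house."))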
eval_framework.tasks.benchmarks.piqa module¶
- class eval_framework.tasks.benchmarks.piqa.PIQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]PIQA dataset: https://huggingface.co/datasets/ybisk/piqa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'ybisk/piqa'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'PIQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.piqa.PIQA_IDK(num_fewshot=0)[source]¶
Bases:
PIQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'PIQA_IDK'¶
eval_framework.tasks.benchmarks.quality module¶
- class eval_framework.tasks.benchmarks.quality.QUALITY(num_fewshot=0)[source]¶
Bases:
BaseTask[str]- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'emozilla/quality'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'QuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Article', 'Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['hard', 'easy']¶
eval_framework.tasks.benchmarks.sciq module¶
- class eval_framework.tasks.benchmarks.sciq.SCIQ(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SciQ dataset: https://huggingface.co/datasets/allenai/sciq
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/sciq'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'SciQ'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness(num_fewshot=0)[source]¶
Bases:
SCIQBased on the Eval Harness implementation (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml#L8), where the instruction text includes a context passage. This passage often contains the answer, reducing the benchmark to a straightforward copy-and-paste task.
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'allenai/sciq'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'SciQ Eval Harness'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
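To make the copy-and-paste concern above concrete, a hedged sketch of the two prompt variants; the field names follow the public allenai/sciq schema, and the templates are assumptions rather than the framework's own:

    def sciq_prompt(sample: dict, include_support: bool) -> str:
        # allenai/sciq samples carry a 'support' context passage in addition
        # to 'question' and the answer options.
        parts = []
        if include_support and sample.get("support"):
            # The Eval Harness variant prepends this passage, which often
            # already states the correct answer verbatim.
            parts.append(sample["support"])
        parts.append(f"Question: {sample['question']}")
        parts.append("Answer:")
        return "\n".join(parts)

    sample = {"support": "Water boils at 100 degrees Celsius at sea level.",
              "question": "At what temperature does water boil at sea level?"}
    print(sciq_prompt(sample, include_support=True))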
- class eval_framework.tasks.benchmarks.sciq.SCIQEvalHarness_IDK(num_fewshot=0)[source]¶
Bases:
SCIQEvalHarness- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'SciQ Eval Harness_IDK'¶
- class eval_framework.tasks.benchmarks.sciq.SCIQ_IDK(num_fewshot=0)[source]¶
Bases:
SCIQ- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'SciQ_IDK'¶
eval_framework.tasks.benchmarks.sphyr module¶
- class eval_framework.tasks.benchmarks.sphyr.SPHYR(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SPhyR dataset: https://huggingface.co/datasets/philippds/SPhyR
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'philippds/SPhyR'¶
- FEWSHOT_SPLIT: str = ''¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.grid_difference.GridDifference'>]¶
- NAME: str = 'SPHYR'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = None¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['1_random_cell_easy', '5_random_cell_easy', '10_random_cell_easy', '1_random_row_easy', '3_random_row_easy', '1_random_column_easy', '3_random_column_easy', 'full_easy', '1_random_cell_hard', '5_random_cell_hard', '10_random_cell_hard', '1_random_row_hard', '3_random_row_hard', '1_random_column_hard', '3_random_column_hard', 'full_hard']¶
eval_framework.tasks.benchmarks.squad module¶
- class eval_framework.tasks.benchmarks.squad.SQUAD(num_fewshot=0)[source]¶
Bases:
SQUAD2SQuAD dataset: https://huggingface.co/datasets/rajpurkar/squad
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'rajpurkar/squad'¶
- NAME: str = 'SQuAD'¶
- class eval_framework.tasks.benchmarks.squad.SQUAD2(num_fewshot=0)[source]¶
Bases:
BaseTask[str]SQuAD v2 dataset: https://huggingface.co/datasets/rajpurkar/squad_v2
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'rajpurkar/squad_v2'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'SQuAD2'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer', 'Context', 'unanswerable']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['no_subject']¶
- UNANSWERABLE_STR = 'unanswerable'¶
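A minimal token-level F1 in the SQuAD style, with unanswerable questions represented by the literal UNANSWERABLE_STR on both sides; a sketch only, since the framework's F1 metric may normalize text differently:

    def token_f1(prediction: str, gold: str) -> float:
        # SQuAD v2 convention: an unanswerable question scores 1.0 only if
        # the model also outputs 'unanswerable' (UNANSWERABLE_STR).
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        if not pred_tokens or not gold_tokens:
            return float(pred_tokens == gold_tokens)
        gold_counts: dict[str, int] = {}
        for t in gold_tokens:
            gold_counts[t] = gold_counts.get(t, 0) + 1
        common = 0
        for t in pred_tokens:
            if gold_counts.get(t, 0) > 0:
                gold_counts[t] -= 1
                common += 1
        if common == 0:
            return 0.0
        precision = common / len(pred_tokens)
        recall = common / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("unanswerable", "unanswerable"))  # -> 1.0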
eval_framework.tasks.benchmarks.struct_eval module¶
- class eval_framework.tasks.benchmarks.struct_eval.RenderableStructEval(num_fewshot=0)[source]¶
Bases:
StructEvalStructEval variant for tasks whose outputs can be rendered visually; all listed subjects convert to HTML.
- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.RenderableStructMetric'>]¶
- NAME: str = 'RenderableStructEval'¶
- SUBJECTS: list[SubjectType] = ['Convert Markdown to HTML', 'Convert React to HTML', 'Convert Vue to HTML', 'Text to HTML']¶
- class eval_framework.tasks.benchmarks.struct_eval.StructEval(num_fewshot=0)[source]¶
Bases:
BaseTask[str]StructEval task: https://tiger-ai-lab.github.io/StructEval/
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'TIGER-Lab/StructEval'¶
- FEWSHOT_SPLIT: str = 'train'¶
- HF_REVISION: str | None = 'b551217560cf225245b0607a21c505e24a58e396'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.struct_eval_metrics.StructMetric'>]¶
- NAME: str = 'StructEval'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'train'¶
- SUBJECTS: list[SubjectType] = ['CSV to YAML', 'JSON to XML', 'JSON to CSV', 'XML to JSON', 'XML to YAML', 'Text to XML', 'Text to YAML', 'Text to TOML', 'YAML to JSON', 'TOML to JSON', 'Text to CSV', 'YAML to XML', 'JSON to YAML', 'TOML to YAML', 'YAML to CSV', 'CSV to JSON', 'CSV to XML', 'Text to JSON', 'XML to CSV']¶
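As an illustration of what a 'CSV to JSON' subject asks the model to produce, a self-contained reference conversion using only the standard library; the actual StructMetric checker is not reproduced here:

    import csv
    import io
    import json

    def csv_to_json(csv_text: str) -> str:
        # Parse CSV rows into a list of dicts keyed by the header row.
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        return json.dumps(rows, indent=2)

    print(csv_to_json("name,score\nada,3\ngrace,5\n"))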
eval_framework.tasks.benchmarks.tablebench module¶
- class eval_framework.tasks.benchmarks.tablebench.TableBench(num_fewshot=0)[source]¶
Bases:
BaseTask[tuple[str,str]]TableBench dataset: https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'Multilingual-Multimodal-NLP/TableBench'¶
- FEWSHOT_SPLIT: str = 'test'¶
- HF_REVISION: str | None = '81b551c744b7f49cfa0ad69cb7a1465d865c206e'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_l.ROUGE_L'>]¶
- NAME: str = 'TableBench'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = [('PoT', 'NumericalReasoning'), ('PoT', 'DataAnalysis'), ('PoT', 'FactChecking'), ('SCoT', 'NumericalReasoning'), ('SCoT', 'DataAnalysis'), ('SCoT', 'FactChecking'), ('TCoT', 'NumericalReasoning'), ('TCoT', 'DataAnalysis'), ('TCoT', 'FactChecking')]¶
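ROUGE-L scores the longest common subsequence (LCS) between prediction and reference. A compact sketch of the F-measure on whitespace tokens; the framework's ROUGE_L metric may tokenize and aggregate differently:

    def rouge_l_f(prediction: str, reference: str) -> float:
        p, r = prediction.split(), reference.split()
        # Dynamic-programming table for the longest common subsequence.
        dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
        for i, pt in enumerate(p):
            for j, rt in enumerate(r):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if pt == rt
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        lcs = dp[-1][-1]
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(p), lcs / len(r)
        return 2 * precision * recall / (precision + recall)

    print(rouge_l_f("the cat sat", "the cat sat on the mat"))  # -> 0.666...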
eval_framework.tasks.benchmarks.triviaqa module¶
- class eval_framework.tasks.benchmarks.triviaqa.TRIVIAQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]Trivia QA dataset: https://huggingface.co/datasets/mandarjoshi/trivia_qa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'mandarjoshi/trivia_qa'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.accuracy_completion.AccuracyCompletion'>, <class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'TriviaQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Question', 'Answer']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['rc.wikipedia.nocontext']¶
eval_framework.tasks.benchmarks.truthfulqa module¶
- class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA(num_fewshot=0)[source]¶
Bases:
BaseTask[str]TRUTHFULQA dataset: https://huggingface.co/datasets/truthfulqa/truthful_qa
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'truthful_qa'¶
- FEWSHOT_ITEMS = [{'question': 'What is human life expectancy in the United States?', 'response': 'Human life expectancy in the United States is 78 years.'}, {'question': 'Who was president of the United States in 1955?', 'response': 'Dwight D. Eisenhower was president of the United States in 1955.'}, {'question': 'Which party did he belong to?', 'response': 'He belonged to the Republican Party.'}, {'question': 'What is the square root of banana?', 'response': 'I have no comment.'}, {'question': 'How does a telescope work?', 'response': 'Telescopes use lenses or mirrors to focus light and make objects appear closer.'}, {'question': 'Where were the 1992 Olympics held?', 'response': 'The 1992 Olympics were held in Barcelona, Spain.'}]¶
- FEWSHOT_SPLIT: str = ''¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass'>, <class 'eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm'>]¶
- NAME: str = 'TruthfulQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Q', 'A']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['mc1', 'mc2']¶
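FEWSHOT_ITEMS above is the fixed QA primer prepended to each question. Given the 'Q'/'A' markers in PERTURBATION_UNMODIFIABLE_WORDS, a plausible rendering looks like this; the exact template is an assumption:

    # Two of the six primer items from FEWSHOT_ITEMS above, for brevity.
    FEWSHOT_ITEMS = [
        {"question": "What is human life expectancy in the United States?",
         "response": "Human life expectancy in the United States is 78 years."},
        {"question": "What is the square root of banana?",
         "response": "I have no comment."},
    ]

    def build_prompt(items: list[dict], question: str) -> str:
        shots = "\n\n".join(f"Q: {it['question']}\nA: {it['response']}"
                            for it in items)
        return f"{shots}\n\nQ: {question}\nA:"

    print(build_prompt(FEWSHOT_ITEMS, "Where were the 1992 Olympics held?"))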
- class eval_framework.tasks.benchmarks.truthfulqa.TRUTHFULQA_IDK(num_fewshot=0)[source]¶
Bases:
TRUTHFULQA- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'TruthfulQA_IDK'¶
eval_framework.tasks.benchmarks.winogender module¶
- class eval_framework.tasks.benchmarks.winogender.WINOGENDER(num_fewshot=0)[source]¶
Bases:
BaseTask[str]WINOGENDER dataset: https://huggingface.co/datasets/oskarvanderwal/winogender
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'oskarvanderwal/winogender'¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'Winogender'¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'test'¶
- SUBJECTS: list[SubjectType] = ['all']¶
- class eval_framework.tasks.benchmarks.winogender.WINOGENDER_IDK(num_fewshot=0)[source]¶
Bases:
WINOGENDER- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'Winogender_IDK'¶
eval_framework.tasks.benchmarks.winogrande module¶
- class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE(num_fewshot=0)[source]¶
Bases:
BaseTask[str]WINOGRANDE dataset: https://huggingface.co/datasets/winogrande
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'winogrande'¶
- FEWSHOT_SPLIT: str = 'train'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>]¶
- NAME: str = 'Winogrande'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['1', '2']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['winogrande_xl']¶
- class eval_framework.tasks.benchmarks.winogrande.WINOGRANDE_IDK(num_fewshot=0)[source]¶
Bases:
WINOGRANDE- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood'>, <class 'eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy'>, <class 'eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore'>, <class 'eval_framework.metrics.loglikelihood.ternary.TernaryScore'>]¶
- NAME: str = 'Winogrande_IDK'¶
eval_framework.tasks.benchmarks.winox module¶
- class eval_framework.tasks.benchmarks.winox.WINOX(num_fewshot=0)[source]¶
Bases:
WINOGRANDEWino-X is a parallel dataset of German, French, and Russian Winograd schemas aligned with their English counterparts. It is used to examine whether neural machine translation models can perform coreference resolution that requires commonsense knowledge, and whether multilingual language models can perform commonsense reasoning across languages.
Winogrande: https://arxiv.org/abs/1907.10641
Wino-X: https://github.com/demelin/Wino-X
Wino-X dataset: https://huggingface.co/datasets/demelin/wino_x
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'demelin/wino_x'¶
- FEWSHOT_SPLIT: str = 'test'¶
- LANGUAGE_SHORT_CODE = ''¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.winox.WINOX_DE(num_fewshot=0)[source]¶
Bases:
WINOX- Parameters:
num_fewshot (int)
- LANGUAGE_SHORT_CODE = 'de'¶
- NAME: str = 'WINOX_DE'¶
- SUBJECTS: list[SubjectType] = ['lm_en_de']¶
eval_framework.tasks.benchmarks.wmt module¶
- class eval_framework.tasks.benchmarks.wmt.WMT(num_fewshot=0)[source]¶
Bases:
BaseTask[str],ABCWMT datasets: abstract base class for the WMT translation tasks (WMT14, WMT16, WMT20).
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = ''¶
- FEWSHOT_SPLIT: str = 'test'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.bleu.LINEWISE_BLEU'>, <class 'eval_framework.metrics.completion.chrf.LINEWISE_CHRF'>, <class 'eval_framework.metrics.completion.ter.LINEWISE_TER'>]¶
- NAME: str = 'WMT'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['phrase']¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'test'¶
- class eval_framework.tasks.benchmarks.wmt.WMT14(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt14'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}¶
- NAME: str = 'WMT14'¶
- SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']¶
- class eval_framework.tasks.benchmarks.wmt.WMT14_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt14'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'en-fr': (Language.ENG, Language.FRA), 'fr-en': (Language.FRA, Language.ENG)}¶
- NAME: str = 'WMT14 Instruct'¶
- SUBJECTS: list[SubjectType] = ['en-fr', 'fr-en']¶
- class eval_framework.tasks.benchmarks.wmt.WMT16(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt16'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}¶
- NAME: str = 'WMT16'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'en-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT16_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt16'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'en-de': (Language.ENG, Language.DEU)}¶
- NAME: str = 'WMT16 Instruct'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'en-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT20(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt20'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}¶
- NAME: str = 'WMT20'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT20_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT_INSTRUCT- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'wmt20'¶
- LANGUAGE: Language | dict[str, Language] | dict[str, tuple[Language, Language]] | None = {'de-en': (Language.DEU, Language.ENG), 'de-fr': (Language.DEU, Language.FRA), 'en-de': (Language.ENG, Language.DEU), 'fr-de': (Language.FRA, Language.DEU)}¶
- NAME: str = 'WMT20 Instruct'¶
- SUBJECTS: list[SubjectType] = ['de-en', 'de-fr', 'en-de', 'fr-de']¶
- class eval_framework.tasks.benchmarks.wmt.WMT_INSTRUCT(num_fewshot=0)[source]¶
Bases:
WMT- Parameters:
num_fewshot (int)
- COMPLETION_PREFIX = 'This is the translation:'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Please', 'translate']¶
- post_process_generated_completion(completion_text, sample=None)[source]¶
- Return type:
str- Parameters:
completion_text (str)
sample (Sample | None)
- stop_sequences: list[str]¶
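Given COMPLETION_PREFIX above, post_process_generated_completion plausibly strips the instructed prefix before the linewise BLEU/chrF/TER metrics are computed. A hedged sketch; the exact whitespace handling is an assumption:

    COMPLETION_PREFIX = "This is the translation:"

    def post_process(completion_text: str) -> str:
        # Drop the instructed prefix, if present, leaving only the
        # translation itself for scoring.
        text = completion_text.strip()
        if text.startswith(COMPLETION_PREFIX):
            text = text[len(COMPLETION_PREFIX):]
        return text.strip()

    print(post_process("This is the translation: Bonjour le monde."))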
eval_framework.tasks.benchmarks.zero_scrolls module¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_COMPLETION(num_fewshot=0)[source]¶
Bases:
BaseTask[str]ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'tau/zero_scrolls'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- RESPONSE_TYPE: ResponseType = 'completion'¶
- SAMPLE_SPLIT: str = 'validation'¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_GOV_REPORT(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS GovReport'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Summary']¶
- SUBJECTS: list[SubjectType] = ['gov_report']¶
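ROUGE_GEOMETRIC_MEAN presumably follows the ZeroSCROLLS convention of combining ROUGE-1, ROUGE-2, and ROUGE-L. A sketch of the aggregation step only, with the three F-measures assumed to be computed elsewhere:

    def rouge_geometric_mean(rouge1: float, rouge2: float, rouge_l: float) -> float:
        # Geometric mean of the three ROUGE F-measures.
        return (rouge1 * rouge2 * rouge_l) ** (1.0 / 3.0)

    print(rouge_geometric_mean(0.45, 0.21, 0.42))  # -> ~0.341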
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_MUSIQUE(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS MuSiQue'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['musique']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_NARRATIVEQA(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS NarrativeQA'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['narrative_qa']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QASPER(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.f1.F1'>]¶
- NAME: str = 'ZeroSCROLLS Qasper'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['qasper']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QMSUM(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS QMSum'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['qmsum']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_QUALITY(num_fewshot=0)[source]¶
Bases:
BaseTask[str]ZeroSCROLLS dataset: https://huggingface.co/datasets/tau/zero_scrolls
- Parameters:
num_fewshot (int)
- DATASET_PATH: str = 'tau/zero_scrolls'¶
- FEWSHOT_SPLIT: str = 'validation'¶
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood'>]¶
- NAME: str = 'ZeroSCROLLS QuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- RESPONSE_TYPE: ResponseType = 'loglikelihoods'¶
- SAMPLE_SPLIT: str = 'validation'¶
- SUBJECTS: list[SubjectType] = ['quality']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SPACE_DIGEST(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.exponential_similarity.ExponentialSimilarity'>]¶
- NAME: str = 'ZeroSCROLLS SpaceDigest'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['space_digest']¶
- class eval_framework.tasks.benchmarks.zero_scrolls.ZERO_SCROLLS_SQUALITY(num_fewshot=0)[source]¶
Bases:
ZERO_SCROLLS_COMPLETION- Parameters:
num_fewshot (int)
- METRICS: list[type[BaseMetric]] = [<class 'eval_framework.metrics.completion.rouge_geometric_mean.ROUGE_GEOMETRIC_MEAN'>]¶
- NAME: str = 'ZeroSCROLLS SQuALITY'¶
- PERTURBATION_UNMODIFIABLE_WORDS: list[str] | None = ['Answer']¶
- SUBJECTS: list[SubjectType] = ['squality']¶