Using the CLI

The Eval-Framework CLI provides a flexible interface for evaluating LLMs across a wide range of benchmarks. Whether you’re running evaluations locally or in a distributed environment, the CLI allows you to configure tasks, models, and metrics with ease.

Command Structure

uv run eval_framework [OPTIONS]

Required Arguments

--llm-name LLM_NAME Either a module path to a model, or the name of a model found in the file provided via the --models flag.

Execution Configuration

--models MODELS Path to the Python module file containing model classes.

--llm-args [LLM_ARGS ...] Arguments to pass to the LLM as key=value pairs.

--task-name TASK_NAME The name of the task to evaluate.

--output-dir OUTPUT_DIR The path for evaluation outputs.

--num-samples NUM_SAMPLES The number of samples per subject to evaluate.

--num-fewshot NUM_FEWSHOT The number of fewshot examples to use.

--max-tokens MAX_TOKENS The maximum number of tokens to generate for each sample. Overrides any task default value.

--batch-size BATCH_SIZE The number of samples to send to the LLM in parallel. Use 1 (the default) for sequential execution.

--task-suite PATH_TO_TASK_SUITE_FILE Path to a suite file (.yaml or .py) defining a set of tasks to run. Mutually exclusive with --task-name. Hyperparameter overrides and score aggregation methods can be configured within the suite file.
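As an illustration, the execution options above can be combined into a single invocation. The model, formatter, and numeric values below are placeholders, not recommendations:

```shell
# Hypothetical example: the model name, formatter, and values are placeholders.
uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="my-org/my-model" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --output-dir ./eval_results \
    --num-samples 20 \
    --num-fewshot 5 \
    --batch-size 8 \
    --max-tokens 256
```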

Task Configuration

--task-subjects TASK_SUBJECTS [TASK_SUBJECTS ...] The subjects of the task to evaluate. If empty, all subjects are evaluated. Tuple-valued subjects can be specified as comma-delimited strings, optionally using the wildcard * in any dimension of the tuple. Examples: "DE_DE, *" or "FR_FR, astronomy"

--hf-revision HF_REVISION A tag name, branch name, or commit hash for the task's Hugging Face dataset.
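For example, a run restricted to specific subjects with a pinned dataset revision might look like the sketch below (the subject names follow the MMLU examples used later; the model and revision values are placeholders):

```shell
# Hypothetical example: model name and revision are placeholders.
uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="my-org/my-model" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" "astronomy" \
    --hf-revision main \
    --output-dir ./eval_results
```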

Judge Models

--judge-models JUDGE_MODELS The path to the Python module file containing LLM judge model classes.

--judge-model-name JUDGE_MODEL_NAME The class derived from eval_framework.llm.base.BaseLLM found in the judge-models module to instantiate for LLM judge evaluation metrics.

--judge-model-args JUDGE_MODEL_ARGS Arguments to pass to the judge model as key=value pairs, analogous to --llm-args.
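A judge-based evaluation might be configured as follows. The judge module file, class name, and judge model name below are hypothetical placeholders:

```shell
# Hypothetical example: my_judges.py, MyJudgeLLM, and the model names are placeholders.
uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="my-org/my-model" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --output-dir ./eval_results \
    --judge-models my_judges.py \
    --judge-model-name MyJudgeLLM \
    --judge-model-args model_name="my-org/my-judge-model"
```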

Perturbations

--perturbation-type TYPE The type of perturbation to apply to task instructions. Note that this may not make sense for some prompts, for example those containing math or code.

--perturbation-probability PROBABILITY The probability of applying a perturbation to each word or character (between 0.0 and 1.0).

--perturbation-seed SEED Random seed controlling perturbations.
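Taken together, a perturbed run might look like the sketch below. The perturbation type is left as a placeholder, since the set of valid values is framework-specific and not listed here:

```shell
# Hypothetical example: replace <TYPE> with a supported perturbation type.
uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="my-org/my-model" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --output-dir ./eval_results \
    --perturbation-type <TYPE> \
    --perturbation-probability 0.1 \
    --perturbation-seed 42
```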

Logging & Tracking

--wandb-project WANDB_PROJECT The name of the Weights & Biases project to log runs to.

--wandb-entity WANDB_ENTITY The name of the Weights & Biases entity to log runs to. Defaults to the user’s default entity.

--wandb-run-id WANDB_RUN_ID The ID of an existing Weights & Biases run to resume. If not given, a new run is created. If the given run exists, it is resumed, but the Python command logged in Weights & Biases is overwritten.

--wandb-upload-results or --no-wandb-upload-results Whether to upload results as an artifact to Weights & Biases (default: True). Requires --wandb-project to be set.

--description DESCRIPTION Description of the run. This will be added to the metadata of the run to help with bookkeeping.

--verbosity VERBOSITY Set the logging verbosity level: 0=CRITICAL, 1=INFO, 2=DEBUG.
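As a sketch, the logging options above can be combined like this (the project name, entity, and description are placeholders):

```shell
# Hypothetical example: project and entity names are placeholders.
uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="my-org/my-model" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --output-dir ./eval_results \
    --wandb-project my-eval-project \
    --wandb-entity my-team \
    --description "5-shot MMLU baseline" \
    --verbosity 1
```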

Environment

--context {local,determined} The context in which the evaluation is run.

-h, --help Show help message and exit.

Running Hugging Face Models

You can run models directly from Hugging Face Hub using the HFLLM class:

uv run eval_framework \
    --llm-name 'eval_framework.llm.huggingface.HFLLM' \
    --llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" \
    --output-dir ./eval_results \
    --num-fewshot 5 \
    --num-samples 10

This approach allows you to evaluate any model available on Hugging Face by specifying the model_name and appropriate formatter_name in the --llm-args parameter.

Configuring Sampling Parameters for vLLM Models

vLLM models support configurable sampling parameters through the --llm-args parameter. You can specify individual sampling parameters using dot notation:

uv run eval_framework \
    --llm-name 'eval_framework.llm.models.Qwen3_0_6B_VLLM' \
    --llm-args sampling_params.temperature=0.7 sampling_params.top_p=0.95 sampling_params.max_tokens=150 \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" \
    --output-dir ./eval_results \
    --num-fewshot 5 \
    --num-samples 10

You can also combine sampling parameters with other model arguments:

uv run eval_framework \
    --llm-name 'eval_framework.llm.models.Qwen3_0_6B_VLLM' \
    --llm-args max_model_len=2048 sampling_params.temperature=0.8 sampling_params.top_p=0.9 \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" \
    --output-dir ./eval_results \
    --num-fewshot 5 \
    --num-samples 10

Running a Task Suite

Task suites let you run multiple tasks in one command and aggregate their scores. You can define a suite in a .yaml file:

# my_suite.yaml
name: my_suite
tasks:
  - tasks: MMLU
    num_samples: 100
    num_fewshot: 5
  - tasks: GSM8K
    num_samples: 50
    num_fewshot: 0
aggregates:
  - name: avg_accuracy
    metric: accuracy
    method: mean

Or equivalently in a .py file:

# my_suite.py
from eval_framework.suite import SuiteAggregate, TaskSuite

suite = TaskSuite(
    name="my_suite",
    tasks=[
        TaskSuite(tasks="MMLU", num_samples=100, num_fewshot=5),
        TaskSuite(tasks="GSM8K", num_samples=50, num_fewshot=0),
    ],
    aggregates=[
        SuiteAggregate(name="avg_accuracy", metric="accuracy", method="mean"),
    ],
)

Then pass the suite file to the CLI:

uv run eval_framework \
    --llm-name 'eval_framework.llm.models.Qwen3_0_6B_VLLM' \
    --task-suite my_suite.yaml \
    --output-dir ./eval_results