Using the CLI¶
The Eval-Framework CLI provides a flexible interface for evaluating LLMs across a wide range of benchmarks. Whether you’re running evaluations locally or in a distributed environment, the CLI allows you to configure tasks, models, and metrics with ease.
Command Structure¶
uv run eval_framework [OPTIONS]
Required Arguments¶
--llm-name LLM_NAME
Either a Python module path to a model class, or the name of a model class defined in the file provided via the --models flag.
Execution Configuration¶
--models MODELS
Path to the Python module file containing model classes.
--llm-args [LLM_ARGS ...]
Arguments to pass to the LLM as key=value pairs.
--task-name TASK_NAME
The name of the task to evaluate.
--output-dir OUTPUT_DIR
The path for evaluation outputs.
--num-samples NUM_SAMPLES
The number of samples per subject to evaluate.
--num-fewshot NUM_FEWSHOT
The number of few-shot examples to use.
--max-tokens MAX_TOKENS
The maximum number of tokens to generate for each sample. Overrides any task default value.
--batch-size BATCH_SIZE
The number of samples to send to the LLM in parallel. Use 1 (the default) for sequential execution. See the combined example below.
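Putting the execution flags together, a minimal sketch might look like the following; models.py, MyModel, and the temperature argument are placeholder names for your own model file, model class, and its constructor arguments, not part of the framework:
uv run eval_framework \
--models ./models.py \
--llm-name 'MyModel' \
--llm-args temperature=0.0 \
--task-name "MMLU" \
--output-dir ./eval_results \
--num-samples 10 \
--num-fewshot 5 \
--max-tokens 256 \
--batch-size 8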
Task Configuration¶
--task-subjects TASK_SUBJECTS [TASK_SUBJECTS ...]
The subjects of the task to evaluate. If empty, all subjects are evaluated. Tuple-valued subjects are specified as comma-delimited strings, with the wildcard * matching any value in a given dimension of the tuple (see the example at the end of this section).
Examples: "DE_DE, *" or "FR_FR, astronomy"
--hf-revision HF_REVISION
A tag name, branch name, or commit hash for the task's HF dataset.
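For example, to evaluate only selected MMLU subjects; the tuple form with wildcards, such as "DE_DE, *", works the same way for tasks whose subjects are tuples:
uv run eval_framework \
--llm-name 'eval_framework.llm.huggingface.HFLLM' \
--llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
--task-name "MMLU" \
--task-subjects "abstract_algebra" "astronomy" \
--output-dir ./eval_results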
Judge Models¶
--judge-models JUDGE_MODELS
The path to the Python module file containing LLM judge model classes.
--judge-model-name JUDGE_MODEL_NAME
The class derived from eval_framework.llm.base.BaseLLM, found in the --judge-models module, to instantiate for LLM-judge evaluation metrics.
--judge-model-args JUDGE_MODEL_ARGS
Arguments to pass to the judge model. See the example below.
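As a sketch, a run using an LLM judge might look like the following; judge_models.py, MyJudge, and the model_name judge argument are all placeholders for your own judge module, class, and configuration, and <task-with-judge-metrics> stands for a task that actually uses LLM-judge metrics:
uv run eval_framework \
--llm-name 'eval_framework.llm.huggingface.HFLLM' \
--llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
--task-name "<task-with-judge-metrics>" \
--judge-models ./judge_models.py \
--judge-model-name 'MyJudge' \
--judge-model-args model_name="gpt-4o" \
--output-dir ./eval_results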
Perturbations¶
--perturbation-type TYPE
The type of perturbation to apply to task instructions. Note that perturbations may not make sense for some prompts, for example those containing math or code.
--perturbation-probability PROBABILITY
The probability of applying a perturbation to each word or character (between 0.0 and 1.0).
--perturbation-seed SEED
Random seed controlling perturbations.
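For instance, a perturbed MMLU run could look like this, where <TYPE> is a placeholder for one of the supported perturbation types (the available values are not listed here):
uv run eval_framework \
--llm-name 'eval_framework.llm.huggingface.HFLLM' \
--llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
--task-name "MMLU" \
--task-subjects "abstract_algebra" \
--perturbation-type <TYPE> \
--perturbation-probability 0.1 \
--perturbation-seed 42 \
--output-dir ./eval_results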
Logging & Tracking¶
--wandb-project WANDB_PROJECT
The name of the Weights & Biases project to log runs to.
--wandb-entity WANDB_ENTITY
The name of the Weights & Biases entity to log runs to. Defaults to the user’s default entity.
--wandb-run-id WANDB_RUN_ID
The ID of an existing Weights & Biases run to resume. If not given, a new run is created. If given and the run exists, the run is continued, but the Python command logged in W&B is overwritten.
--wandb-upload-results or --no-wandb-upload-results
Whether to upload results as an artifact to Weights & Biases (default: True). Requires --wandb-project to be set.
--description DESCRIPTION
Description of the run. This will be added to the metadata of the run to help with bookkeeping.
--verbosity VERBOSITY
Set the logging verbosity level: 0=CRITICAL, 1=INFO, 2=DEBUG.
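For example, to log a run to Weights & Biases; my-eval-project and my-team are placeholder names for your own project and entity:
uv run eval_framework \
--llm-name 'eval_framework.llm.huggingface.HFLLM' \
--llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
--task-name "MMLU" \
--output-dir ./eval_results \
--wandb-project my-eval-project \
--wandb-entity my-team \
--description "Baseline MMLU run" \
--verbosity 1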
Environment¶
--context {local,determined}
The context in which the evaluation is run.
-h, --help
Show help message and exit.
Running Hugging Face Models¶
You can run models directly from Hugging Face Hub using the HFLLM class:
uv run eval_framework \
--llm-name 'eval_framework.llm.huggingface.HFLLM' \
--llm-args model_name="microsoft/DialoGPT-medium" formatter_name="Llama3Formatter" \
--task-name "MMLU" \
--task-subjects "abstract_algebra" \
--output-dir ./eval_results \
--num-fewshot 5 \
--num-samples 10
This approach lets you evaluate any model available on the Hugging Face Hub by specifying the model_name and an appropriate formatter_name in the --llm-args parameter.
Configuring Sampling Parameters for vLLM Models¶
vLLM models support configurable sampling parameters through the --llm-args parameter. You can specify individual sampling parameters using dot notation:
uv run eval_framework \
--llm-name 'eval_framework.llm.models.Qwen3_0_6B_VLLM' \
--llm-args sampling_params.temperature=0.7 sampling_params.top_p=0.95 sampling_params.max_tokens=150 \
--task-name "MMLU" \
--task-subjects "abstract_algebra" \
--output-dir ./eval_results \
--num-fewshot 5 \
--num-samples 10
You can also combine sampling parameters with other model arguments:
uv run eval_framework \
--llm-name 'eval_framework.llm.models.Qwen3_0_6B_VLLM' \
--llm-args max_model_len=2048 sampling_params.temperature=0.8 sampling_params.top_p=0.9 \
--task-name "MMLU" \
--task-subjects "abstract_algebra" \
--output-dir ./eval_results \
--num-fewshot 5 \
--num-samples 10