Weights and Biases Integration with Eval-Framework
Overview
The evaluation framework supports logging results to Weights and Biases (WandB) and loading registered model checkpoints.
Benefits and Enablement
Centralized eval tracking: Automatically log evaluation metrics
Centralized checkpoint storage: Discover and reference checkpoints from a central location
Collaboration: Share results and models with team members through WandB’s web interface
Registered Models and Results
The eval-framework can load models from your WandB Model Registry and can upload results as WandB artifacts.
This enables:
Version control: Track model checkpoints with versioning and aliases
Metadata management: Store model descriptions and additional metadata per version
Centralized discovery: Browse and search models
Lineage tracking: Maintain audit trails
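For illustration, a checkpoint registered in the Model Registry can be fetched with the plain wandb Python client roughly as in the sketch below; the entity, project, and model names are placeholders, and this shows the underlying mechanism rather than the eval-framework's own loading code:
import wandb

api = wandb.Api()
# Placeholder artifact path: "<entity>/<project>/<registered-model-name>:<alias>"
artifact = api.artifact("my-team/model-registry/Llama31_8B_Instruct_HF:latest", type="model")
checkpoint_dir = artifact.download()  # download the checkpoint files to a local directory
print(f"Checkpoint available at {checkpoint_dir}")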
Storage Location
We currently support two storage backends:
WandB Cloud: Default storage in WandB’s managed infrastructure
S3-backed artifacts: S3-compatible buckets (see the Environment Variables section for AWS configuration)
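To give a feel for what an S3-backed artifact means on the WandB side, the sketch below logs a reference artifact that points at files in an S3-compatible bucket instead of uploading them; the bucket path and names are placeholders, and the framework performs this step for you:
import wandb

run = wandb.init(project="my_wandb_project")
artifact = wandb.Artifact(name="my-checkpoint", type="model")
# Store only a reference to the S3 location; the files themselves stay in the bucket
artifact.add_reference("s3://my-bucket/checkpoints/my-checkpoint/")
run.log_artifact(artifact)
run.finish()
Access to the bucket is configured through the AWS_* variables described in the Environment Variables section.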
Evaluation Run Logging
This integration automatically:
Groups runs by checkpoint name: Organizes evaluation results by model checkpoint
Logs evaluation metrics and configuration: Records metrics and run settings for each evaluation
Records eval-framework version: Tracks the eval-framework version used during runs
References HuggingFace upload paths: Provides a link to full result upload locations when available
Maintains model lineage: Links eval runs to a particular model and model version
Experiment details:
Runs are grouped within projects by checkpoint name and version. Additional hierarchical groupings are available, including but not limited to the following (see the sketch after this list):
Language
Benchmark Task
Fewshot
Number of samples
Metric
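The sketch below shows WandB's generic run-grouping mechanism that this layout builds on; the project, group name, config keys, and metric value are placeholders rather than the framework's exact field names:
import wandb

# Group runs belonging to the same checkpoint name and version (assumed naming scheme)
run = wandb.init(
    project="my_wandb_project",
    group="Llama31_8B_Instruct_HF-v1",
    config={"task_name": "ARC", "num_fewshot": 3, "num_samples": 100, "language": "eng"},
)
run.log({"ARC/accuracy": 0.0})  # placeholder metric value
run.finish()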
Usage
WandB logging is disabled by default. To enable it, set up a valid WandB account and configure the required environment variables in your .env file:
# Weights & Biases configuration
WANDB_API_KEY="YOUR_WANDB_API_KEY_HERE"
Method 1: CLI Upload
Add the --wandb-project flag (and --wandb-entity if you are not using your default entity) to your CLI command:
uv run eval_framework \
  --context local \
  --models tests/conftest.py \
  --llm-name Llama31_8B_Instruct_HF \
  --task-name ARC \
  --num-fewshot 3 \
  --num-samples 100 \
  --output-dir "./test_outputs_folder" \
  --wandb-project "my_wandb_project"
Method 2: Determined Configuration
Add wandb_project (and wandb_entity if you are not using your default entity) as a hyperparameter in your Determined experiment config:
hyperparameters:
  experiment_name: "my_experiment"
  llm_name: "Llama31_8B_Instruct_HF"
  wandb_project: "my_wandb_project"
  task_args:
    - task_name: "ARC"
      num_fewshot: 3
      num_samples: 100
Environment Variables
Required:
WANDB_API_KEY: Your WandB API key from wandb.ai/authorize
Optional (for S3-backed artifacts):
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_ENDPOINT_URL
Note: AWS variables are only needed if using reference-backed artifacts. Direct WandB artifact storage doesn’t require them.
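For S3-backed artifacts, the corresponding section of your .env file could look like this (values are placeholders):
# S3-backed (reference) artifact storage
AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY"
AWS_ENDPOINT_URL="https://s3.your-provider.example"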
Custom:
WANDB_CACHE_SKIP: Whether to use the W&B cache when downloading model artifacts (defaults to False to avoid double storage).
WANDB_ARTIFACT_DIR: Directory where model artifacts will be downloaded (if not given, a temporary one will be used).
WANDB_ARTIFACT_WAIT_TIMEOUT_SEC: How long to wait for an artifact to become available on W&B if a corresponding "-local" version of the artifact is available.
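These can be set in the same .env file; the values below are illustrative assumptions, not documented defaults:
# Custom WandB-related settings (illustrative values)
WANDB_CACHE_SKIP=True
WANDB_ARTIFACT_DIR="/data/wandb_artifacts"
WANDB_ARTIFACT_WAIT_TIMEOUT_SEC=600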