Weights & Biases Integration with Eval-Framework

Overview

The evaluation framework supports logging results to Weights & Biases (WandB) and loading registered model checkpoints from the WandB Model Registry.

Benefits

  • Centralized eval tracking: Automatically log evaluation metrics

  • Centralized checkpoint storage: Discover and reference checkpoints from a central location

  • Collaboration: Share results and models with team members through WandB’s web interface

Registered Models and Results

The eval-framework can load models from your WandB Model Registry and upload results as WandB artifacts (a sketch of the underlying API calls follows the list below).

This enables:

  • Version control: Track model checkpoints with versioning and aliases

  • Metadata management: Store model descriptions and additional metadata per version

  • Centralized discovery: Browse and search models

  • Lineage tracking: Maintain audit trails
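
Under the hood, this maps onto standard WandB artifact operations. The following is a minimal, illustrative Python sketch using the public wandb API; the entity, registry collection name, and alias are placeholders rather than values used by the framework:

import wandb

# Fetch a registered checkpoint from the Model Registry (path is hypothetical).
api = wandb.Api()
artifact = api.artifact("my-entity/model-registry/my-checkpoint:v3")

checkpoint_dir = artifact.download()        # local directory containing the checkpoint files
print(artifact.version, artifact.metadata)  # version/alias and per-version metadata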

Storage Location

We currently support two storage backends (a brief sketch follows the list):

  • WandB Cloud: Default storage in WandB’s managed infrastructure

  • S3-backed artifacts: S3-compatible buckets (see the Environment Variables section for AWS configuration)
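
To illustrate the difference between the two backends, here is a minimal Python sketch using the public wandb API; the project, bucket, and file names are placeholders:

import wandb

run = wandb.init(project="my_wandb_project")

# WandB Cloud: file contents are uploaded into WandB's managed storage.
direct = wandb.Artifact("eval-results", type="results")
direct.add_file("results.json")
run.log_artifact(direct)

# S3-backed: the artifact stores only a reference to an object in your bucket;
# the AWS_* variables listed under Environment Variables must be configured.
referenced = wandb.Artifact("eval-results-s3", type="results")
referenced.add_reference("s3://my-bucket/eval/results.json")
run.log_artifact(referenced)

run.finish()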

Evaluation Run Logging

This integration automatically does the following (a sketch of the underlying logging call follows the list):

  • Groups runs by checkpoint name: Organizes evaluation results by model checkpoint

  • Logs evaluation metrics and configuration: Records metrics together with the run settings used to produce them

  • Records eval-framework version: Tracks the eval-framework version used during runs

  • References HuggingFace upload paths: Provides a link to full result upload locations when available

  • Maintains model lineage: Links eval runs to a particular model and model version
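
The exact schema is managed by the framework, but the logging boils down to a WandB run with a group, a config, and metrics. A minimal illustrative sketch follows; the group name, config keys, and metric name are assumptions, not the framework's exact schema:

import wandb

run = wandb.init(
    project="my_wandb_project",
    group="Llama31_8B_Instruct_HF-v3",      # runs grouped by checkpoint name and version
    config={
        "task_name": "ARC",
        "num_fewshot": 3,
        "num_samples": 100,
        "eval_framework_version": "x.y.z",  # placeholder version string
    },
)
run.log({"ARC/accuracy": 0.0})              # placeholder metric name and value
run.finish()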

Experiment details:

Runs are grouped within projects by checkpoint name and version. Additional hierarchical groupings are available, including but not limited to the following:

  • Language

  • Benchmark Task

  • Fewshot

  • Number of samples

  • Metric

Usage

WandB logging is disabled by default. To enable it, set up a valid WandB account and set the required WANDB_API_KEY environment variable in your .env file:

# Weights & Biases configuration
WANDB_API_KEY="YOUR_WANDB_API_KEY_HERE"

Method 1: CLI Upload

Add the --wandb-project flag (and --wandb-entity if you are not using the default entity) to your CLI command:

uv run eval_framework \
    --context local \
    --models tests/conftest.py \
    --llm-name Llama31_8B_Instruct_HF \
    --task-name ARC \
    --num-fewshot 3 \
    --num-samples 100 \
    --output-dir "./test_outputs_folder" \
    --wandb-project "my_wandb_project"

Method 2: Determined Configuration

Add wandb_project (and wandb_entity if you are not using the default entity) as hyperparameters in your Determined experiment config:

hyperparameters:
  experiment_name: "my_experiment"
  llm_name: "Llama31_8B_Instruct_HF"
  wandb_project: "my_wandb_project"
  task_args:
    - task_name: "ARC"
      num_fewshot: 3
      num_samples: 100

Environment Variables

Required:

  • WANDB_API_KEY

Optional (for S3-backed artifacts):

  • AWS_ACCESS_KEY_ID

  • AWS_SECRET_ACCESS_KEY

  • AWS_ENDPOINT_URL

Note: The AWS variables are only needed for S3-backed (reference) artifacts. Direct WandB artifact storage does not require them.

Custom (framework-specific; see the example .env after the list):

  • WANDB_CACHE_SKIP: Whether to use the W&B cache when downloading model artifacts (defaults to False to avoid double storage).

  • WANDB_ARTIFACT_DIR: Directory where model artifacts will be downloaded (if not given, a temporary one will be used).

  • WANDB_ARTIFACT_WAIT_TIMEOUT_SEC: How long to wait (in seconds) for an artifact to become available on W&B when a corresponding “-local” version of the artifact already exists.
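
For reference, a hypothetical .env combining the optional settings above might look as follows (all values are placeholders, not defaults):

# S3-backed artifact storage
AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY"
AWS_ENDPOINT_URL="https://s3.example.com"

# Framework-specific settings
WANDB_CACHE_SKIP=False
WANDB_ARTIFACT_DIR="./wandb_artifacts"
WANDB_ARTIFACT_WAIT_TIMEOUT_SEC=600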