Docker Guide

A guide to using eval-framework with Docker, for both Aleph Alpha (AA) users and external contributors.

Build from Repository

Latest Release:

git clone https://github.com/Aleph-Alpha-Research/eval-framework.git
cd eval-framework
docker build -t eval_framework:latest .

Specific Version:

git clone https://github.com/Aleph-Alpha-Research/eval-framework.git
cd eval-framework
git checkout v0.2.3
docker build -t eval_framework:v0.2.3 .

For Aleph Alpha Users

AA users have access to pre-built Docker images via the GitLab registry.

# Authenticate
docker login registry.gitlab.aleph-alpha.de

# Pull specific version
docker pull registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:v0.2.3

# Run evaluation
docker run --gpus all \
  -v "$(pwd)/results:/results" \
  registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:v0.2.3 \
  eval_framework --task-name MMLU --output-dir /results

Available Tags

Check the PyPI releases or the main README for current versions.

# Specific version
docker pull registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:v0.2.3

# Minor version
docker pull registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:v0.2

# Latest stable
docker pull registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:latest

# Specific branch
docker pull registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:feature-branch-name
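The tag variants above all share one registry path, so a small helper can keep that path in one place. This is a minimal sketch; the function name `image_ref` and the variable `AA_REGISTRY` are hypothetical and not part of eval-framework.

```shell
# Registry path copied from the pull commands above.
AA_REGISTRY="registry.gitlab.aleph-alpha.de/research/public-registry"

# Prints the full image reference for a tag; defaults to "latest".
# image_ref is a hypothetical helper, not an eval-framework command.
image_ref() {
  printf '%s/eval_framework:%s\n' "$AA_REGISTRY" "${1:-latest}"
}

# Usage (requires a prior docker login):
#   docker pull "$(image_ref v0.2)"
```

Keeping the tag as the only argument makes it easy to switch between a pinned version, a minor version, and a branch build.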

Prerequisites and Configuration

Required API Keys

Key	Required For
`HF_TOKEN`	Hugging Face model downloads, gated datasets
`WANDB_API_KEY`	Experiment tracking (optional)
`GL_REGISTRY_TOKEN`	Pulling AA Docker images

When are these needed?

HF_TOKEN: Only required when downloading gated models (e.g., Llama, Mistral) or gated or private datasets. Public models work without it.
WANDB_API_KEY: Optional. Only needed to log experiments to Weights & Biases.
GL_REGISTRY_TOKEN: Only needed by AA users pulling pre-built Docker images from the GitLab registry.

Note: For basic evaluations with public models, you can skip this section entirely.
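Before starting a long run, it can help to confirm which of these keys are present in the current shell. The following is a minimal sketch; `report_keys` is a hypothetical helper, and it deliberately prints only whether each key is set, never its value.

```shell
# Reports, for each key from the table above, whether it is set and non-empty.
# report_keys is a hypothetical helper, not part of eval-framework.
report_keys() {
  for name in HF_TOKEN WANDB_API_KEY GL_REGISTRY_TOKEN; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name: set"
    else
      echo "$name: not set"
    fi
  done
}

# Usage:
#   report_keys
```

A key reported as "not set" is only a problem if the run needs it, for example `HF_TOKEN` for gated models.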

Setting Up Keys

Export in shell:

export HF_TOKEN="hf_your_token_here"
export WANDB_API_KEY="your_wandb_key_here"

Pass at runtime:

docker run \
  -e HF_TOKEN="$HF_TOKEN" \
  -e WANDB_API_KEY="$WANDB_API_KEY" \
  eval_framework:v0.2.3 [command]

Using a .env file:

Create .env:

HF_TOKEN=hf_your_token_here
WANDB_API_KEY=your_wandb_key_here

Run with env file:

docker run --env-file .env eval_framework:v0.2.3 [command]
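One detail worth checking: `docker run --env-file` reads plain `VAR=value` lines and does not apply shell parsing, so quotes are kept as part of the value and an `export` prefix is not stripped. The sketch below flags such lines before a run; `lint_env_file` is a hypothetical helper, not a Docker or eval-framework command.

```shell
# Flags lines in an env file that a shell would accept but that
# `docker run --env-file` would read differently.
# lint_env_file is a hypothetical helper. Returns 1 if any line is flagged.
lint_env_file() {
  file="$1"
  status=0
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    case "$line" in
      ''|'#'*)
        # Blank lines and comments are fine.
        ;;
      'export '*)
        echo "line $n: remove the 'export' prefix"
        status=1
        ;;
      *=\"*|*=\'*)
        echo "line $n: quotes would become part of the value"
        status=1
        ;;
    esac
  done < "$file"
  return "$status"
}

# Usage:
#   lint_env_file .env && docker run --env-file .env eval_framework:v0.2.3 [command]
```

The example `.env` above passes this check because its values are unquoted.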

Running Evaluations

GPU Configuration

# All GPUs
docker run --gpus all eval_framework:v0.2.3 [command]

# Specific GPU (a bare number is read as a GPU count, so select by device index)
docker run --gpus device=0 eval_framework:v0.2.3 [command]

# Multiple GPUs
docker run --gpus '"device=0,1"' eval_framework:v0.2.3 [command]
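The nested quoting in the multi-GPU form is there so that Docker receives the double quotes and does not split the value at the comma. A small helper can build that value from a list of device indices. This is a sketch only; `gpus_arg` is a hypothetical name, not a Docker feature.

```shell
# Builds the value for --gpus from zero or more device indices.
# With no arguments it prints "all"; otherwise it prints the device list
# wrapped in literal double quotes, as in the multi-GPU example above.
# gpus_arg is a hypothetical helper.
gpus_arg() {
  if [ "$#" -eq 0 ]; then
    echo all
    return
  fi
  # Join the arguments with commas inside a subshell so IFS is not changed globally.
  ids="$(IFS=,; echo "$*")"
  printf '"device=%s"\n' "$ids"
}

# Usage:
#   docker run --gpus "$(gpus_arg 0 1)" eval_framework:v0.2.3 [command]
```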

Interactive Shell

docker run -it --gpus all \
  -v "$(pwd):/workspace" \
  eval_framework:v0.2.3 \
  /bin/bash

Determined AI Integration

Note: For details, see the Using Determined guide.

Add to experiment.yaml:

name: eval-framework-experiment

environment:
  image: registry.gitlab.aleph-alpha.de/research/public-registry/eval_framework:v0.2.3
  registry_auth:
    username: token
    password: $GL_REGISTRY_TOKEN
  environment_variables:
    - HF_TOKEN
    - WANDB_API_KEY

resources:
  slots_per_trial: 1

entrypoint: |
  eval_framework \
    --task-name MMLU \
    --output-dir /tmp/results \
    --num-fewshot 5

Submit:

export HF_TOKEN="hf_your_token_here"
export GL_REGISTRY_TOKEN="glpat_your_token_here"
det experiment create experiment.yaml .

Example Workflows

Single Evaluation

docker run --gpus all \
  -v "$(pwd)/results:/results" \
  eval_framework:v0.2.3 \
  eval_framework \
    --models /eval_framework/src/eval_framework/llm/models.py \
    --llm-name Smollm135MInstruct \
    --task-name MMLU \
    --task-subjects abstract_algebra \
    --output-dir /results \
    --num-fewshot 5 \
    --num-samples 10

Batch Evaluations

#!/bin/bash
# Run each task in its own container; results go to ./results/<task>
for task in MMLU HellaSwag ARC TriviaQA; do
  docker run --gpus all \
    -v "$(pwd)/results:/results" \
    eval_framework:v0.2.3 \
    eval_framework \
      --task-name "$task" \
      --output-dir "/results/$task" \
      --num-fewshot 5
done
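For longer task lists it can be useful to keep going when one task fails and to report the failures at the end. The sketch below wraps the same `docker run` call in a function; `run_task` and `run_batch` are hypothetical names, and the flags are copied from the loop above.

```shell
# Runs one task in its own container, using the same flags as the loop above.
# run_task is a hypothetical helper.
run_task() {
  docker run --gpus all \
    -v "$(pwd)/results:/results" \
    eval_framework:v0.2.3 \
    eval_framework \
      --task-name "$1" \
      --output-dir "/results/$1" \
      --num-fewshot 5
}

# Runs every task given as an argument, continues after a failure,
# and returns 1 if any task failed. run_batch is a hypothetical helper.
run_batch() {
  failed=""
  for task in "$@"; do
    if ! run_task "$task"; then
      failed="$failed $task"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed"
    return 1
  fi
  echo "all tasks succeeded"
}

# Usage:
#   run_batch MMLU HellaSwag ARC TriviaQA
```

Because each task runs in a separate container, a failure in one does not affect the results already written for the others.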