Testing
This repository contains a large and diverse test suite. To keep iteration fast, tests are split into fast PR tests, slow/advanced tests, and nightly workflows. Contributors should generally run only the fast tests locally unless reproducing a specific failure.
Test Tiers
1. Fast / PR Tests
Runs on: Every push to main, pull requests, and merge queue.
Runtime: ~20 minutes total
Purpose: Ensure code correctness for PRs without running the heaviest tests.
Includes:
Linting, pre-commit, type checks (2 min)
Tag setup and HuggingFace datasets cache pull (2 min)
Docker image build (5 min)
UV install dependency tests (1 min)
CPU tests excluding slow/external tests (3 min)
CPU slow tests (3–4 min)
Formatter hash tests (3 min)
GPU tests / optional extras (12 min)
Recommended local command: For most contributors, running the CPU fast tests is sufficient:
# Run the fast CPU subset of the PR tests (GPU, CPU-slow, external-API, and formatter-hash tests are deselected)
uv run --all-extras pytest -n auto --max-worker-restart=0 -v \
-m "not gpu and not cpu_slow and not external_api and not formatter_hash"
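The marker expression in the command above can also be composed programmatically, which makes it easier to re-enable a single tier when reproducing a failure. The following is a minimal sketch and not part of the repository: the helper names `build_marker_expression` and `build_pytest_command` are hypothetical, and only the marker names and pytest flags are taken from the recommended command.

```python
import shlex

# Marker names copied from the recommended local command.
EXCLUDED_BY_DEFAULT = ["gpu", "cpu_slow", "external_api", "formatter_hash"]


def build_marker_expression(excluded):
    """Return a pytest -m expression that deselects every marker in `excluded`."""
    return " and ".join(f"not {name}" for name in excluded)


def build_pytest_command(excluded):
    """Assemble the uv/pytest invocation as an argument list (hypothetical helper)."""
    cmd = ["uv", "run", "--all-extras", "pytest", "-n", "auto", "--max-worker-restart=0", "-v"]
    expression = build_marker_expression(excluded)
    if expression:
        # Only pass -m when there is something to deselect.
        cmd += ["-m", expression]
    return cmd


if __name__ == "__main__":
    # Print the default fast CPU command, shell-quoted for copy and paste.
    print(shlex.join(build_pytest_command(EXCLUDED_BY_DEFAULT)))
```

Dropping `cpu_slow` from the list, for example, yields a command that also runs the CPU slow tests.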
2. Advanced / GPU Tests
Runs on: PR workflow (test-docker-gpu)
Runtime: ~12 min
Purpose: Run GPU tests or all optional extras together. Typically only required if debugging GPU-specific issues.
Includes:
GPU tests excluding CPU-slow / external API / vllm (12 min)
Installation of optional extras (vllm, mistral)
Recommended local command (advanced users):
uv run --exact --all-extras pytest -v --noconftest tests/tests_eval_framework/installs/
⚠️ Warning: Running GPU/full extras locally may take significant time and requires a GPU.
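Because this tier needs a GPU, a quick pre-flight check can avoid starting a long run on a machine without one. The sketch below assumes an NVIDIA GPU whose driver installs the `nvidia-smi` tool; `gpu_probably_available` is a hypothetical helper, not something the repository provides.

```python
import shutil
import subprocess


def gpu_probably_available():
    """Heuristic: True if `nvidia-smi` is on PATH and exits successfully."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False
    try:
        result = subprocess.run(
            [executable],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if gpu_probably_available():
        print("GPU detected; the GPU tier can be run locally.")
    else:
        print("No NVIDIA GPU detected; rely on CI for the GPU tier.")
```

The check is only a heuristic: it says nothing about free GPU memory or whether the installed drivers match what the optional extras expect.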
3. Nightly Workflows
Nightly HuggingFace dataset cache rebuild
Runtime: ~20 min
Purpose: Rebuild the full dataset cache for CI and experiments
Command:
uv run --extra=comet --extra=openai python tests/tests_eval_framework/utils/update_datasets.py rebuild
Nightly Docker build cache
Runtime: ~30 sec
Purpose: Refresh Docker build cache for PR workflows
Nightly workflows are not expected to be run locally. They ensure CI has up-to-date datasets and Docker cache.
CI as Source of Truth
The authoritative definition of which tests belong to each tier is encoded in the GitHub workflows:
tests.yml → PR tests, CPU and GPU tests, linting
nightly_hf_cache_build.yml → full dataset cache rebuild
nightly_docker_cache_build.yml → Docker cache refresh
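Since the workflows are authoritative, it can help to read the pytest invocations straight from them rather than from this page. The sketch below is a convenience built on assumptions: it presumes the workflow files live in `.github/workflows/` (the standard GitHub Actions location) and simply lists lines that mention `pytest`. `find_pytest_lines` is a hypothetical helper.

```python
from pathlib import Path

# Assumed location; GitHub Actions reads workflows from .github/workflows/.
WORKFLOW_PATH = Path(".github/workflows/tests.yml")


def find_pytest_lines(text):
    """Return (line_number, stripped_line) pairs for lines that mention pytest."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if "pytest" in line
    ]


if __name__ == "__main__":
    if WORKFLOW_PATH.exists():
        for number, line in find_pytest_lines(WORKFLOW_PATH.read_text(encoding="utf-8")):
            print(f"{WORKFLOW_PATH}:{number}: {line}")
    else:
        print(f"{WORKFLOW_PATH} not found; run this from the repository root.")
```

Commands that are split across lines with a trailing backslash will show only the line containing `pytest`, so the marker expression may sit on the following line of the workflow file.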
Tips for Contributors
Run fast PR tests before pushing code.
Do not attempt to run the full suite unless reproducing a nightly/CI failure.
CI automatically runs GPU and slow tests on PRs; nightly workflows cover the rest.