Testing

This repository contains a large and diverse test suite. To keep iteration fast, tests are split into fast PR tests, slow/advanced tests, and nightly workflows. Contributors should generally run only the fast tests locally unless reproducing a specific failure.


Test Tiers

1. Fast / PR Tests

  • Runs on: Every push to main, pull requests, and merge queue.

  • Runtime: ~20 minutes total

  • Purpose: Ensure code correctness for PRs without running the heaviest tests.

  • Includes:

    • Linting, pre-commit, type checks (2 min)

    • Tag setup and HuggingFace datasets cache pull (2 min)

    • Docker image build (5 min)

    • UV install dependency tests (1 min)

    • CPU tests excluding slow/external tests (3 min)

    • CPU slow tests (3–4 min)

    • Formatter hash tests (3 min)

    • GPU tests / optional extras (12 min)

Recommended local command: For most contributors, running the CPU fast tests is sufficient:

# Run the tests that PR CI runs
uv run --all-extras pytest -n auto --max-worker-restart=0 -v \
    -m "not gpu and not cpu_slow and not external_api and not formatter_hash"

2. Advanced / GPU Tests

  • Runs on: PR workflow (test-docker-gpu)

  • Runtime: ~12 min

  • Purpose: Run GPU tests or all optional extras together. Typically only required if debugging GPU-specific issues.

  • Includes:

    • GPU tests excluding CPU-slow / external API / vllm (12 min)

    • Optional extras (vllm, mistral) installations

Recommended local command (advanced users):

uv run --exact --all-extras pytest -v --noconftest tests/tests_eval_framework/installs/

⚠️ Warning: Running GPU/full extras locally may take significant time and requires a GPU.


3. Nightly Workflows

  • Nightly HuggingFace dataset cache rebuild

    • Runtime: ~20 min

    • Purpose: Rebuild the full dataset cache for CI and experiments

    • Command: uv run --extra=comet --extra=openai python tests/tests_eval_framework/utils/update_datasets.py rebuild

  • Nightly Docker build cache

    • Runtime: ~30 sec

    • Purpose: Refresh Docker build cache for PR workflows

Nightly workflows are not expected to be run locally. They ensure CI has up-to-date datasets and Docker cache.


CI as Source of Truth

The authoritative definition of which tests belong to each tier is encoded in the GitHub workflows:

  • tests.yml → PR tests, CPU and GPU tests, linting

  • nightly_hf_cache_build.yml → full dataset cache rebuild

  • nightly_docker_cache_build.yml → Docker cache refresh


Tips for Contributors

  • Run fast PR tests before pushing code.

  • Do not attempt to run the full suite unless reproducing a nightly/CI failure.

  • CI automatically runs GPU and slow tests on PRs; nightly workflows cover the rest.