# Model Arguments
The Eval-Framework provides a set of model wrapper classes that standardize how LLMs are loaded, formatted, and used for evaluation. Each wrapper manages a specific model backend, such as Hugging Face, OpenAI, the Aleph Alpha API, or vLLM-based models.
The following sections describe the constructor arguments for each model class, highlighting configuration options, defaults, and their purpose. Understanding these arguments allows you to customize evaluation behavior, token limits, concurrency, and model-specific settings.
## HFLLM Class — Constructor Arguments
HFLLM is a high-level wrapper for Hugging Face causal language models within the evaluation framework.
It extends BaseHFLLM, managing model loading (from local checkpoints, HF Hub, or W&B), formatting, and text generation.
### HFLLM Constructor Argument Reference
| Argument | Type | Description | Default |
|---|---|---|---|
|  |  | Path to a local checkpoint directory or model weights. Used when loading from disk instead of the HF Hub. |  |
| `model_name` |  | Hugging Face model name. |  |
|  |  | Weights & Biases artifact name. |  |
|  |  | Explicit formatter instance used to convert chat messages into model prompts. If not provided, a default formatter is used. |  |
|  |  | Name of a formatter class to instantiate. |  |
|  |  | Keyword arguments for the formatter constructor (used together with the formatter class name). |  |
|  |  | Custom display/logging name for the checkpoint. If omitted, inferred from the model or artifact name. |  |
| `bytes_per_token` |  | Used to scale token generation limits based on model tokenizer density. Passed to the parent class. See Deep Dive: bytes_per_token. | `None` |
|  |  | Additional keyword arguments passed through to the underlying model loading. | — |
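
For orientation, a minimal construction sketch. Only `model_name` and `bytes_per_token` appear in the usage example later on this page; the import path is an assumption and may differ in your installation:

```python
# Minimal sketch: load a Hugging Face model through the HFLLM wrapper.
from eval_framework.models import HFLLM  # import path is an assumption

model = HFLLM(
    model_name="my-model",  # Hugging Face Hub identifier
    bytes_per_token=3.2,    # optional; see "Deep Dive: bytes_per_token" below
)
```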
## AlephAlphaAPIModel — Constructor Arguments
AlephAlphaAPIModel is a wrapper around the Aleph Alpha API, extending BaseLLM.
It handles formatter setup, request concurrency, retry behavior, and timeout management.
### AlephAlphaAPIModel Constructor Argument Reference
| Argument | Type | Description | Default |
|---|---|---|---|
|  |  | Explicit formatter instance used to convert chat messages into model prompts. If not provided, a default formatter is used. |  |
|  |  | Custom display or logging name for the model. If omitted, uses the class-level default name. |  |
|  |  | Maximum number of retry attempts for failed API requests (e.g. network errors, rate limits). |  |
|  |  | Maximum number of concurrent asynchronous API requests allowed. Controls throughput and parallelism. |  |
|  |  | Maximum number of seconds before an API request times out. |  |
|  |  | Maximum number of seconds to wait when the async request queue is full before giving up. |  |
| `bytes_per_token` |  | Used to scale token-based limits based on model tokenizer density. See Deep Dive: bytes_per_token. | `None` |
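
To make the concurrency arguments concrete, the sketch below shows the general mechanism such a wrapper manages: a semaphore caps the number of in-flight requests, `asyncio.wait_for` enforces a per-request timeout, and failed calls are retried with backoff. All names here are illustrative, not the wrapper's actual API.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 10  # cap on simultaneous API calls (illustrative)
REQUEST_TIMEOUT_S = 30.0      # per-request timeout in seconds (illustrative)
MAX_RETRIES = 3               # retry attempts for failed requests (illustrative)

semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def call_with_retries(send_request, payload):
    """Send one request with bounded concurrency, a timeout, and retries."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            async with semaphore:  # at most MAX_CONCURRENT_REQUESTS run at once
                return await asyncio.wait_for(send_request(payload), timeout=REQUEST_TIMEOUT_S)
        except (asyncio.TimeoutError, ConnectionError) as err:
            last_error = err
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("request failed after all retries") from last_error
```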
## OpenAIModel — Constructor Arguments
OpenAIModel is a wrapper for OpenAI’s API models (e.g., GPT-4, GPT-3.5) that integrates with the evaluation framework.
It manages model configuration, authentication, and request parameters for the OpenAI client.
### OpenAIModel Constructor Argument Reference
| Argument | Type | Description | Default |
|---|---|---|---|
|  |  | Name of the OpenAI model to use. |  |
|  |  | Explicit formatter instance used to convert chat messages into model prompts. If not provided, a default formatter is used. |  |
|  |  | Sampling temperature controlling output randomness; lower values make output more deterministic, higher values more varied. |  |
|  |  | OpenAI API key. If not provided, defaults to the `OPENAI_API_KEY` environment variable. |  |
|  |  | Optional OpenAI organization ID for multi-org API usage or billing separation. |  |
|  |  | Custom API base URL, e.g. for Azure OpenAI endpoints or local proxies. |  |
| `bytes_per_token` |  | Used to scale token-based limits based on model tokenizer density. See Deep Dive: bytes_per_token. | `None` |
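
The arguments above largely mirror settings of the official OpenAI Python client, which the wrapper configures internally. The following sketch uses the plain `openai` package directly to show where those values end up; it is not the wrapper's own interface:

```python
from openai import OpenAI

# The wrapper's API key / organization / base URL style settings map onto
# the standard OpenAI client configuration shown here.
client = OpenAI(
    api_key="sk-...",                      # or leave unset to use OPENAI_API_KEY
    organization="org-...",                # optional, for multi-org accounts
    base_url="https://api.openai.com/v1",  # override for Azure endpoints or proxies
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,  # low temperature for more deterministic evaluation outputs
)
print(response.choices[0].message.content)
```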
## BaseVLLMModel — Constructor Arguments
BaseVLLMModel defines the core initialization logic for all vLLM-backed models.
It manages GPU allocation, tokenizer setup, and internal sampling parameter normalization.
### BaseVLLMModel Constructor Argument Reference
| Argument | Type | Description | Default |
|---|---|---|---|
|  |  | Explicit formatter instance used to convert chat messages into model prompts. If not provided, a default formatter is used. |  |
|  |  | Maximum sequence length (token context limit). Used to configure the vLLM engine. |  |
|  |  | Number of GPUs for tensor-level parallel inference. |  |
|  |  | Fraction of total GPU memory reserved for model weights and KV cache. |  |
|  |  | Number of sequences processed per batch. |  |
|  |  | Local model path or checkpoint directory. |  |
|  |  | Human-readable identifier for the checkpoint. |  |
|  |  | Sampling configuration parameters. |  |
| `bytes_per_token` |  | Used to scale token-based limits based on model tokenizer density. See Deep Dive: bytes_per_token. | `None` |
|  |  | Any remaining parameters forwarded to the underlying vLLM engine initialization. | — |
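
Most of these arguments map directly onto parameters of the vLLM engine. As a reference point, here is how the equivalent settings look when using vLLM directly (values are placeholders; the wrapper handles this wiring for you):

```python
from vllm import LLM, SamplingParams

# These engine parameters correspond to the wrapper arguments described above.
llm = LLM(
    model="/path/to/checkpoint",   # local model path or HF identifier
    max_model_len=4096,            # maximum sequence length / context limit
    tensor_parallel_size=2,        # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.9,    # fraction of GPU memory for weights + KV cache
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
```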
## MistralVLLM — Constructor Arguments

MistralVLLM is a specialized subclass of VLLMModel (which itself extends BaseVLLMModel), designed to run Mistral Hugging Face models on the vLLM inference backend.
It provides flexible model loading (from local files, Hugging Face Hub, or Weights & Biases), GPU-efficient parallelism, and tunable sampling behavior.
### MistralVLLM Constructor Argument Reference
| Argument | Type | Description | Default |
|---|---|---|---|
|  |  | Hugging Face model identifier. |  |
|  |  | Weights & Biases artifact reference. |  |
|  |  | Name of a registered formatter class. |  |
|  |  | Keyword arguments passed to the formatter constructor. |  |
|  |  | Additional keyword arguments passed through to the parent class constructor. | — |
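
A hypothetical construction sketch for orientation only: the import path and every keyword name below are inferred from the argument descriptions in the table and are not confirmed against the real signature.

```python
# Hypothetical sketch -- keyword names are assumptions, not the documented API.
from eval_framework.models import MistralVLLM  # import path is an assumption

model = MistralVLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",  # HF Hub identifier (assumed kwarg)
    formatter_name="MistralFormatter",                # registered formatter class (assumed)
    formatter_kwargs={},                              # options for the formatter (assumed)
)
```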
## Deep Dive: bytes_per_token

### What it is

`bytes_per_token` is a scalar that adjusts generation limits (`max_tokens`) based on the model's tokenizer characteristics.
Different models tokenize text differently — some produce more tokens per byte, some fewer.
This parameter helps keep generation length consistent across models by normalizing the token budget.
### How it works internally

```python
if bytes_per_token is not None:
    bytes_per_token_scalar = 4.0 / bytes_per_token
else:
    bytes_per_token_scalar = 4.0 / BYTES_PER_TOKEN  # defaults to 4.0 / 4.0 = 1.0
```

- The constant `BYTES_PER_TOKEN = 4.0` is a heuristic from OpenAI's tokenizer documentation.
- The scalar is then applied when calculating generation limits:

```python
scaled_max_tokens = ceil(max_tokens * bytes_per_token_scalar)
```
This ensures that, for models with different token byte sizes, output lengths remain roughly comparable in bytes or visible characters.
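
Put together, the scaling step amounts to the small helper below (a self-contained sketch mirroring the snippet above; `scale_max_tokens` is an illustrative name, not the framework's API):

```python
from math import ceil

BYTES_PER_TOKEN = 4.0  # heuristic average bytes per token (OpenAI tokenizer rule of thumb)

def scale_max_tokens(max_tokens: int, bytes_per_token: float | None = None) -> int:
    """Scale a token budget so output length stays roughly constant in bytes."""
    scalar = 4.0 / (bytes_per_token if bytes_per_token is not None else BYTES_PER_TOKEN)
    return ceil(max_tokens * scalar)

print(scale_max_tokens(100))       # 100 -- default 4 bytes/token, no change
print(scale_max_tokens(100, 2.0))  # 200 -- denser tokenizer gets a larger budget
print(scale_max_tokens(100, 8.0))  # 50  -- sparser tokenizer gets a smaller budget
```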
### Why it matters

Without this scaling, a fixed `max_tokens` budget yields different byte lengths for different models: a model whose tokens cover fewer bytes produces shorter outputs (it needs more tokens for the same byte length), while a model with longer tokens produces longer outputs. `bytes_per_token` compensates for that, aligning models to a common byte-level generation length.
### Example calculations

| Scenario | `bytes_per_token` | Computed scalar (4.0 / bpt) | `max_tokens=100` → scaled | Effect |
|---|---|---|---|---|
| Default (no override) | `None` (uses 4.0) | 1.0 | 100 | No change. |
| Denser tokenizer (2 bytes/token) | 2.0 | 2.0 | 200 | Model can generate twice as many tokens. |
| Sparser tokenizer (8 bytes/token) | 8.0 | 0.5 | 50 | Model generates half as many tokens. |
### Recommended usage

- Leave unset (`None`) for most models; the default behavior is fine.
- If you empirically know your tokenizer's average bytes/token, pass it explicitly:
```python
model = HFLLM(model_name="my-model", bytes_per_token=3.2)
```
Common approximate values:

- GPT-family BPE: 3–4 bytes/token
- SentencePiece or WordPiece (smaller vocab): 2–3 bytes/token
- Character-level tokenizers: 1–2 bytes/token