eval_framework.llm package

Submodules

eval_framework.llm.aleph_alpha module

class eval_framework.llm.aleph_alpha.AlephAlphaAPIModel(formatter=None, checkpoint_name=None, temperature=None, max_retries=100, max_async_concurrent_requests=32, request_timeout_seconds=1805, queue_full_timeout_seconds=1805, bytes_per_token=None, token='dummy', base_url='dummy_endpoint')[source]

Bases: BaseLLM

Parameters:
  • formatter (BaseFormatter | None)

  • checkpoint_name (str | None)

  • temperature (float | None)

  • max_retries (int)

  • max_async_concurrent_requests (int)

  • request_timeout_seconds (int)

  • queue_full_timeout_seconds (int)

  • bytes_per_token (float | None)

  • token (str)

  • base_url (str)

BYTES_PER_TOKEN: float = 4.0
DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = None
LLM_NAME: str
generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1).

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • messages (list[Sequence[Message]])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)

logprobs(samples)[source]

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something prevents the expected completion of a task.

Return type:

list[RawLoglikelihood]

Parameters:

samples (list[Sample])

class eval_framework.llm.aleph_alpha.Llama31_8B_Instruct_API(formatter=None, checkpoint_name=None, temperature=None, max_retries=100, max_async_concurrent_requests=32, request_timeout_seconds=1805, queue_full_timeout_seconds=1805, bytes_per_token=None, token='dummy', base_url='dummy_endpoint')[source]

Bases: AlephAlphaAPIModel

Parameters:
  • formatter (BaseFormatter | None)

  • checkpoint_name (str | None)

  • temperature (float | None)

  • max_retries (int)

  • max_async_concurrent_requests (int)

  • request_timeout_seconds (int)

  • queue_full_timeout_seconds (int)

  • bytes_per_token (float | None)

  • token (str)

  • base_url (str)

DEFAULT_FORMATTER

alias of Llama3Formatter

LLM_NAME: str = 'llama-3.1-8b-instruct'
eval_framework.llm.aleph_alpha.safe_json_loads(s)[source]
Return type:

dict[str, str]

Parameters:

s (str)

eval_framework.llm.base module

class eval_framework.llm.base.BaseLLM[source]

Bases: ABC

generate(samples, stop_sequences=None, max_tokens=None, temperature=None)[source]

Generates a model response for each sample.

Uses generate_from_samples to generate responses if it is implemented; otherwise falls back to generate_from_messages (see the sketch after the parameter list below).

Return type:

list[RawCompletion]

Parameters:
  • samples (list[Sample])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)
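
The fallback can be pictured as follows; this is an illustrative sketch rather than the library's actual implementation, and the sample.messages attribute access is an assumption:

    # Sketch: dispatch to generate_from_samples only when a subclass overrides it.
    def generate(self, samples, stop_sequences=None, max_tokens=None, temperature=None):
        if type(self).generate_from_samples is not BaseLLM.generate_from_samples:
            return self.generate_from_samples(samples, stop_sequences, max_tokens, temperature)
        messages = [sample.messages for sample in samples]  # attribute name assumed
        return self.generate_from_messages(messages, stop_sequences, max_tokens, temperature)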

abstractmethod generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1); see the sketch after the parameter list below.

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • messages (list[Sequence[Message]])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)
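
As a concrete illustration of the stop-sequence handling described above, a minimal sketch (the stop token shown assumes an instruction-finetuned Llama 3.1; this is not the library's code):

    # Sketch: merge task-injected stop sequences with the model's own stop tokens.
    MODEL_STOP_TOKENS = ["<|eot_id|>"]  # assumption: instruction-finetuned Llama 3.1

    def merged_stop_sequences(task_stop_sequences):
        merged = list(task_stop_sequences or [])
        for token in MODEL_STOP_TOKENS:
            if token not in merged:
                merged.append(token)
        return merged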

generate_from_samples(samples, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1).

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • samples (list[Sample])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)

abstractmethod logprobs(samples)[source]

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something prevents the expected completion of a task.

Return type:

list[RawLoglikelihood]

Parameters:

samples (list[Sample])

property name: str

This property is used to name the results folder and identify the eval results. Override this property in the subclass with e.g. the checkpoint name or Hugging Face model name.

post_process_completion(completion, sample)[source]

Model-specific post-processing of generated completions.

Override this method to apply model-specific cleanup or transformations (e.g., removing specific artifacts such as reasoning traces, handling special tokens).

Parameters:
  • completion (str) – The raw completion string from the model

  • sample (Sample) – The sample that was used to generate the completion

Return type:

str

Returns:

The post-processed completion string
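
For example, a wrapper around a reasoning model might strip the reasoning trace (a hypothetical override; the </think> delimiter is an assumption about the model's output format):

    # Hypothetical override: drop a leading reasoning trace delimited by </think>.
    def post_process_completion(self, completion, sample):
        _, sep, rest = completion.partition("</think>")
        return rest.lstrip() if sep else completion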

eval_framework.llm.huggingface module

class eval_framework.llm.huggingface.BaseHFLLM(formatter=None, bytes_per_token=None)[source]

Bases: BaseLLM

Parameters:
  • formatter (BaseFormatter | None)

  • bytes_per_token (float | None)

BYTES_PER_TOKEN: float = 4.0
DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = None
LLM_NAME: str
SEQ_LENGTH: int | None = None
count_tokens(text, /)[source]

Count the number of tokens in a string.

Return type:

int

Parameters:

text (str)

generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1).

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • messages (list[Sequence[Message]])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)

logprobs(samples)[source]

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something prevents the expected completion of a task.

Return type:

list[RawLoglikelihood]

Parameters:

samples (list[Sample])

property seq_length: int | None
class eval_framework.llm.huggingface.HFLLM(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, bytes_per_token=None, **kwargs)[source]

Bases: BaseHFLLM

A class to create HFLLM instances from various model sources.

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

property name: str

This property is used to name the results folder and identify the eval results. Override this property in the subclass with e.g. the checkpoint name or Hugging Face model name.
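
A minimal usage sketch for HFLLM (the model name is an example, and passing the formatter by name this way is an assumption):

    from eval_framework.llm.huggingface import HFLLM

    llm = HFLLM(model_name="EleutherAI/pythia-410m", formatter_name="ConcatFormatter")
    print(llm.name)  # used to name the results folder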

class eval_framework.llm.huggingface.HFLLMRegistryModel(artifact_name, version='latest', formatter='', formatter_identifier='', **kwargs)[source]

Bases: HFLLM

A class to create HFLLM instances from models registered in the Wandb registry. Downloads the model artifacts from Wandb and creates a local HFLLM instance.

Parameters:
  • artifact_name (str)

  • version (str)

  • formatter (str)

  • formatter_identifier (str)

  • kwargs (Any)

class eval_framework.llm.huggingface.HFLLM_from_name(model_name, formatter='Llama3Formatter', **kwargs)[source]

Bases: HFLLM

A generic class to create HFLLM instances from a given model name.

Parameters:
  • model_name (str)

  • formatter (str)

  • kwargs (Any)

class eval_framework.llm.huggingface.Pythia410m(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, bytes_per_token=None, **kwargs)[source]

Bases: HFLLM

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

DEFAULT_FORMATTER

alias of ConcatFormatter

LLM_NAME: str = 'EleutherAI/pythia-410m'
class eval_framework.llm.huggingface.Qwen3_0_6B(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, bytes_per_token=None, **kwargs)[source]

Bases: HFLLM

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = functools.partial(<class 'template_formatting.formatter.HFFormatter'>, 'Qwen/Qwen3-0.6B', chat_template_kwargs={'enable_thinking': True})
Parameters:

chat_template_kwargs (dict[str, Any] | None)

Return type:

None

LLM_NAME: str = 'Qwen/Qwen3-0.6B'
class eval_framework.llm.huggingface.RepeatedTokenSequenceCriteria(tokenizer, completion_start_index)[source]

Bases: StoppingCriteria

Parameters:
  • tokenizer (Tokenizer)

  • completion_start_index (int)

class eval_framework.llm.huggingface.SmolLM135M(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, bytes_per_token=None, **kwargs)[source]

Bases: HFLLM

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

DEFAULT_FORMATTER

alias of ConcatFormatter

LLM_NAME: str = 'HuggingFaceTB/SmolLM-135M'
class eval_framework.llm.huggingface.Smollm135MInstruct(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, bytes_per_token=None, **kwargs)[source]

Bases: HFLLM

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = functools.partial(<class 'template_formatting.formatter.HFFormatter'>, 'HuggingFaceTB/SmolLM-135M-Instruct')
Parameters:

chat_template_kwargs (dict[str, Any] | None)

Return type:

None

LLM_NAME: str = 'HuggingFaceTB/SmolLM-135M-Instruct'
class eval_framework.llm.huggingface.StopSequenceCriteria(tokenizer, stop_sequences, prompt_token_count)[source]

Bases: StoppingCriteria

Parameters:
  • tokenizer (Tokenizer)

  • stop_sequences (list[str])

  • prompt_token_count (int)

eval_framework.llm.mistral module

class eval_framework.llm.mistral.MistralAdapter(target_mdl)[source]

Bases: VLLMTokenizerAPI[list[Message]]

Parameters:

target_mdl (str)

encode_formatted_struct(struct)[source]

Encode prompt to token IDs.

Return type:

TokenizedContainer

Parameters:

struct (list[Message])

encode_plain_text(text)[source]
Return type:

TokenizedContainer

Parameters:

text (str)

class eval_framework.llm.mistral.MistralVLLM(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, max_model_len=None, tensor_parallel_size=1, gpu_memory_utilization=0.9, batch_size=1, sampling_params=None, bytes_per_token=None, **kwargs)[source]

Bases: VLLMModel

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • max_model_len (int | None)

  • tensor_parallel_size (int)

  • gpu_memory_utilization (float)

  • batch_size (int)

  • sampling_params (SamplingParams | dict[str, Any] | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

property formatter_output_mode: Literal['string', 'list']

Determine the correct output mode for the formatter based on tokenizer type.

property tokenizer: VLLMTokenizerAPI

eval_framework.llm.models module

This is just a default model file with some small models for testing.

Please define your own model file externally and pass it to the eval-framework entrypoint to use it.

eval_framework.llm.openai module

class eval_framework.llm.openai.DeepseekModel(model_name=None, formatter=None, temperature=None, api_key=None, organization=None, base_url=None, tokenizer_name=None)[source]

Bases: OpenAIModel

General Deepseek model wrapper using the OpenAI-compatible API for the deepseek-chat and deepseek-reasoner models.

Uses the Deepseek API: https://api-docs.deepseek.com/quick_start/pricing

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • tokenizer_name (str | None)

class eval_framework.llm.openai.Deepseek_chat(model_name=None, formatter=None, temperature=None, api_key=None, organization=None, base_url=None, tokenizer_name=None)[source]

Bases: DeepseekModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • tokenizer_name (str | None)

LLM_NAME: str | None = 'deepseek-chat'
class eval_framework.llm.openai.Deepseek_chat_with_formatter(model_name=None, formatter=None, temperature=None, api_key=None, organization=None, base_url=None, tokenizer_name=None)[source]

Bases: DeepseekModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • tokenizer_name (str | None)

DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = functools.partial(<class 'template_formatting.formatter.HFFormatter'>, 'deepseek-ai/DeepSeek-V3.2-Exp')

The formatter produces prompts of the form:

<|begin▁of▁sentence|><|User|>Question: What color is the night sky? <|Assistant|></think>Answer:

Parameters:

chat_template_kwargs (dict[str, Any] | None)

Return type:

None

LLM_NAME: str | None = 'deepseek-chat'
class eval_framework.llm.openai.Deepseek_reasoner(model_name=None, formatter=None, temperature=None, api_key=None, organization=None, base_url=None, tokenizer_name=None)[source]

Bases: DeepseekModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • tokenizer_name (str | None)

LLM_NAME: str | None = 'deepseek-reasoner'
class eval_framework.llm.openai.OpenAIEmbeddingModel(model_name='text-embedding-3-large', formatter=None, api_key=None, organization=None, base_url=None)[source]

Bases: BaseLLM

Parameters:
  • model_name (str)

  • formatter (BaseFormatter | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

generate_embeddings(messages)[source]
Return type:

list[list[float]]

Parameters:

messages (list[Sequence[Message]])

generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1).

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • messages (list[Sequence[Message]])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)

logprobs(samples)[source]

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something prevents the expected completion of a task.

Return type:

list[RawLoglikelihood]

Parameters:

samples (list[Sample])

class eval_framework.llm.openai.OpenAIModel(model_name=None, formatter=None, temperature=None, api_key='', organization=None, base_url=None, bytes_per_token=None)[source]

Bases: BaseLLM

LLM wrapper for OpenAI API providing text/chat completions and log-probability evaluation output.

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • bytes_per_token (float | None)

BYTES_PER_TOKEN: float = 4.0
DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = None
LLM_NAME: str | None = None
generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

Generate completions for a list of message sequences concurrently.

Uses text completion API when a formatter is configured, otherwise uses chat completion API.

Parameters:
  • messages (list[Sequence[Message]]) – Sequence of messages.

  • stop_sequences (list[str] | None) – Optional list of stop sequences.

  • max_tokens (int | None) – Optional maximum number of tokens to generate.

  • temperature (float | None) – Sampling temperature.

Return type:

list[RawCompletion]

Returns:

List of RawCompletion objects containing prompts and completions.

logprobs(samples)[source]

Compute total log-probabilities for possible completions given each sample’s prompt.

Parameters:

samples (list[Sample]) – List of Sample objects, each with prompt messages and possible completions.

Return type:

list[RawLoglikelihood]

Returns:

List of RawLoglikelihood objects mapping each prompt and completion to its log-probability.

Note

Uses the OpenAI completions API with echo=True; chat logprobs are not supported.
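
A usage sketch for OpenAIModel (the model name is an example; constructing Message objects is framework-specific and omitted):

    import os

    from eval_framework.llm.openai import OpenAIModel

    llm = OpenAIModel(
        model_name="gpt-4o-mini-2024-07-18",
        api_key=os.environ["OPENAI_API_KEY"],
    )
    # llm.generate_from_messages(...) expects list[Sequence[Message]].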

class eval_framework.llm.openai.OpenAI_davinci_002(model_name=None, formatter=None, temperature=None, api_key='', organization=None, base_url=None, bytes_per_token=None)[source]

Bases: OpenAIModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • bytes_per_token (float | None)

DEFAULT_FORMATTER

alias of ConcatFormatter

LLM_NAME: str | None = 'davinci-002'
class eval_framework.llm.openai.OpenAI_gpt_4o_mini(model_name=None, formatter=None, temperature=None, api_key='', organization=None, base_url=None, bytes_per_token=None)[source]

Bases: OpenAIModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • bytes_per_token (float | None)

LLM_NAME: str | None = 'gpt-4o-mini-2024-07-18'
class eval_framework.llm.openai.OpenAI_gpt_4o_mini_with_ConcatFormatter(model_name=None, formatter=None, temperature=None, api_key='', organization=None, base_url=None, bytes_per_token=None)[source]

Bases: OpenAIModel

Parameters:
  • model_name (str | None)

  • formatter (BaseFormatter | None)

  • temperature (float | None)

  • api_key (str | None)

  • organization (str | None)

  • base_url (str | None)

  • bytes_per_token (float | None)

DEFAULT_FORMATTER

alias of ConcatFormatter

LLM_NAME: str | None = 'gpt-4o-mini-2024-07-18'

eval_framework.llm.vllm module

class eval_framework.llm.vllm.BaseVLLMModel(formatter=None, max_model_len=None, tensor_parallel_size=1, gpu_memory_utilization=0.9, batch_size=1, checkpoint_path=None, checkpoint_name=None, sampling_params=None, bytes_per_token=None, **kwargs)[source]

Bases: BaseLLM

Parameters:
  • formatter (BaseFormatter | None)

  • max_model_len (int | None)

  • tensor_parallel_size (int)

  • gpu_memory_utilization (float)

  • batch_size (int)

  • checkpoint_path (str | Path | None)

  • checkpoint_name (str | None)

  • sampling_params (SamplingParams | dict[str, Any] | None)

  • bytes_per_token (float | None)

  • kwargs (Any)

BYTES_PER_TOKEN: float = 4.0
DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = None
LLM_NAME: str
SEQ_LENGTH: int | None = None
build_redis_key_from_prompt_objs(prompt_objs, sampling_params)[source]

Build a Redis key from a list of prompt objects and sampling parameters. TokenizedContainers are not serializable, so we just pass the tokens and sampling params.

Return type:

Any

Parameters:
  • prompt_objs

  • sampling_params
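
A plausible sketch of such a cache key (the hashing scheme is an assumption, not the library's implementation):

    import hashlib
    import json

    def redis_key(tokens: list[int], sampling_params: dict) -> str:
        payload = json.dumps({"tokens": tokens, "params": sampling_params}, sort_keys=True)
        return "logprobs:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
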
count_tokens(text, /)[source]
Return type:

int

Parameters:

text (str)

property formatter_output_mode: Literal['string', 'list']
generate_from_messages(messages, stop_sequences=None, max_tokens=None, temperature=None)[source]

stop_sequences and max_tokens are injected by the task if they exist. They should be overridden or extended with the properties of the model. This includes, but is not limited to, the stop tokens of the evaluated checkpoint (e.g. <|eot_id|> for an instruction-finetuned Llama 3.1, <|endoftext|> for a pretrained Llama 3.1).

This function is expected to raise errors, which are caught and reported when running the eval. Please also make sure to raise an error in case of sequence-length issues. We expect an error to always be raised if something impedes the expected completion of a task.

Important! The completion is expected to be detokenized and must NOT contain special tokens.

Return type:

list[RawCompletion]

Parameters:
  • messages (list[Sequence[Message]])

  • stop_sequences (list[str] | None)

  • max_tokens (int | None)

  • temperature (float | None)

logprobs(samples)[source]

Batched version of logprobs for improved performance.

Return type:

list[RawLoglikelihood]

Parameters:

samples (list[Sample])

property max_seq_length: int

Returns the maximum sequence length for this model. Priority order:
  1. max_model_len parameter passed to __init__
  2. SEQ_LENGTH class attribute
  3. Model’s actual max_model_len from config
  4. Default fallback of 2048
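
The resolution order can be sketched as follows (illustrative only; the private attribute names are assumptions):

    def max_seq_length(self):
        if self._max_model_len is not None:  # 1. __init__ parameter
            return self._max_model_len
        if self.SEQ_LENGTH is not None:  # 2. class attribute
            return self.SEQ_LENGTH
        config_len = getattr(getattr(self, "_model_config", None), "max_model_len", None)  # 3.
        if config_len is not None:
            return config_len
        return 2048  # 4. default fallback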

property name: str

This property is used to name the results folder and identify the eval results. Override this property in the subclass with e.g. the checkpoint name or Hugging Face model name.

property seq_length: int | None

Kept for backward compatibility.

property tokenizer: VLLMTokenizerAPI
class eval_framework.llm.vllm.HFTokenizerProtocol(*args, **kwargs)[source]

Bases: Protocol

property chat_template: str | None

Chat template for the tokenizer.

decode(tokens)[source]

Decode token IDs to text.

Return type:

str

Parameters:

tokens (list[int])

encode(text, add_special_tokens=False)[source]

Encode text to token IDs.

Return type:

list[int]

Parameters:
  • text (str)

  • add_special_tokens (bool)
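
Because this is a structural Protocol, any object with matching members satisfies it; for instance, a Hugging Face tokenizer (using transformers here is an assumption about your environment):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
    ids = tok.encode("hello world", add_special_tokens=False)
    text = tok.decode(ids)
    template = tok.chat_template  # may be None for base models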

class eval_framework.llm.vllm.Qwen3_0_6B_VLLM(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, max_model_len=None, tensor_parallel_size=1, gpu_memory_utilization=0.9, batch_size=1, sampling_params=None, **kwargs)[source]

Bases: VLLMModel

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • max_model_len (int | None)

  • tensor_parallel_size (int)

  • gpu_memory_utilization (float)

  • batch_size (int)

  • sampling_params (SamplingParams | dict[str, Any] | None)

  • kwargs (Any)

DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = functools.partial(<class 'template_formatting.formatter.HFFormatter'>, 'Qwen/Qwen3-0.6B', chat_template_kwargs={'enable_thinking': True})
Parameters:

chat_template_kwargs (dict[str, Any] | None)

Return type:

None

LLM_NAME: str = 'Qwen/Qwen3-0.6B'
class eval_framework.llm.vllm.Qwen3_0_6B_VLLM_No_Thinking(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, max_model_len=None, tensor_parallel_size=1, gpu_memory_utilization=0.9, batch_size=1, sampling_params=None, **kwargs)[source]

Bases: VLLMModel

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • max_model_len (int | None)

  • tensor_parallel_size (int)

  • gpu_memory_utilization (float)

  • batch_size (int)

  • sampling_params (SamplingParams | dict[str, Any] | None)

  • kwargs (Any)

DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = functools.partial(<class 'template_formatting.formatter.HFFormatter'>, 'Qwen/Qwen3-0.6B', chat_template_kwargs={'enable_thinking': False})
Parameters:

chat_template_kwargs (dict[str, Any] | None)

Return type:

None

LLM_NAME: str = 'Qwen/Qwen3-0.6B'
class eval_framework.llm.vllm.TokenizedContainer(tokens, text)[source]

Bases: object

Container object to store tokens and formatted prompt

Parameters:
  • tokens (list[int])

  • text (str)

text: str
tokens: list[int]
class eval_framework.llm.vllm.VLLMModel(checkpoint_path=None, model_name=None, artifact_name=None, formatter=None, formatter_name=None, formatter_kwargs=None, checkpoint_name=None, max_model_len=None, tensor_parallel_size=1, gpu_memory_utilization=0.9, batch_size=1, sampling_params=None, **kwargs)[source]

Bases: BaseVLLMModel

A class to create VLLM instances from various model sources.

Parameters:
  • checkpoint_path (str | Path | None)

  • model_name (str | None)

  • artifact_name (str | None)

  • formatter (BaseFormatter | None)

  • formatter_name (str | None)

  • formatter_kwargs (dict[str, Any] | None)

  • checkpoint_name (str | None)

  • max_model_len (int | None)

  • tensor_parallel_size (int)

  • gpu_memory_utilization (float)

  • batch_size (int)

  • sampling_params (SamplingParams | dict[str, Any] | None)

  • kwargs (Any)
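
A minimal usage sketch for VLLMModel (the model name and argument values are examples):

    from eval_framework.llm.vllm import VLLMModel

    llm = VLLMModel(
        model_name="Qwen/Qwen3-0.6B",
        max_model_len=2048,
        gpu_memory_utilization=0.9,
    )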

class eval_framework.llm.vllm.VLLMRegistryModel(artifact_name, version='latest', formatter='', formatter_identifier='', **kwargs)[source]

Bases: VLLMModel

A class to create VLLM instances from models registered in the Wandb registry. Downloads the model artifacts from Wandb and creates a local VLLM instance.

Parameters:
  • artifact_name (str)

  • version (str)

  • formatter (str)

  • formatter_identifier (str)

  • kwargs (Any)

class eval_framework.llm.vllm.VLLMTokenizer(target_mdl)[source]

Bases: VLLMTokenizerAPI[str]

Parameters:

target_mdl (str | Path)

property chat_template: str | None
decode(tokens)[source]
Return type:

str

Parameters:

tokens (list[int])

encode_formatted_struct(struct)[source]

Encode prompt to token IDs.

Return type:

TokenizedContainer

Parameters:

struct (str)

encode_plain_text(text)[source]
Return type:

TokenizedContainer

Parameters:

text (str)

class eval_framework.llm.vllm.VLLMTokenizerAPI[source]

Bases: ABC, Generic

Protocol for the tokenizer interface that defines the required methods. Needed for type checking because of the vLLM tokenizer.

property chat_template: str | None
abstractmethod encode_formatted_struct(struct)[source]

Encode prompt to token IDs.

Return type:

TokenizedContainer

Parameters:

struct (prompt_type)

abstractmethod encode_plain_text(text)[source]
Return type:

TokenizedContainer

Parameters:

text (str)
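
A concrete implementation only needs the two encode methods; a minimal sketch for plain-string prompts (the tokenizer wiring is an assumption):

    class SimpleStringTokenizer(VLLMTokenizerAPI[str]):
        def __init__(self, hf_tokenizer):
            # Any object with encode(text, add_special_tokens=...) works here.
            self._tok = hf_tokenizer

        def encode_formatted_struct(self, struct: str) -> TokenizedContainer:
            tokens = self._tok.encode(struct, add_special_tokens=False)
            return TokenizedContainer(tokens=tokens, text=struct)

        def encode_plain_text(self, text: str) -> TokenizedContainer:
            tokens = self._tok.encode(text, add_special_tokens=False)
            return TokenizedContainer(tokens=tokens, text=text)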

Module contents