eval_framework.metrics.llm package

Submodules

eval_framework.metrics.llm.base module

class eval_framework.metrics.llm.base.BaseLLMJudgeMetric(llm_judge, randomize_order=False)[source]

Bases: BaseMetric[Completion]

Parameters:
  • llm_judge (BaseLLM)

  • randomize_order (bool)
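
All of the judge metrics in the submodules below follow the constructor and calculate() pattern defined by this base class. As a rough sketch, assuming an already configured BaseLLM judge and a Completion produced elsewhere in eval_framework (neither is constructed here), a concrete subclass such as LLMJudgeCoherence would be used like this:

```python
from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence


def score_coherence(judge_llm, completion):
    """Sketch: run one LLM-judge metric on a single completion.

    `judge_llm` is assumed to be any BaseLLM implementation and `completion`
    a Completion produced elsewhere in eval_framework; neither is built here.
    """
    metric = LLMJudgeCoherence(llm_judge=judge_llm)
    # calculate() returns a list of MetricResult objects for this completion.
    return metric.calculate(completion)
```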

eval_framework.metrics.llm.llm_judge_chatbot_style module

class eval_framework.metrics.llm.llm_judge_chatbot_style.LLMJudgeChatbotStyle(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'Chatbot Style'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_coherence module

class eval_framework.metrics.llm.llm_judge_coherence.LLMJudgeCoherence(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

KEYS: list[str] | None = ['coherence_score']
NAME: str = 'Coherence'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_completion_accuracy module

class eval_framework.metrics.llm.llm_judge_completion_accuracy.LLMJudgeCompletionAccuracy(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'Judge Completion Accuracy'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_conciseness module

class eval_framework.metrics.llm.llm_judge_conciseness.LLMJudgeConciseness(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'Conciseness'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_contains_names module

class eval_framework.metrics.llm.llm_judge_contains_names.LLMJudgeAvoidsNames(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'Avoids Names'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_format_correctness module

class eval_framework.metrics.llm.llm_judge_format_correctness.LLMJudgeFormatCorrectness(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'Format Correctness'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

class eval_framework.metrics.llm.llm_judge_format_correctness.LLMJudgeFormatCorrectnessContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • language (str)

  • extra_data (Any)

language: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.
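
A minimal construction sketch for this context, assuming pydantic v2 semantics for extra='allow'; the field values and the extra keyword below are made up for illustration:

```python
from eval_framework.metrics.llm.llm_judge_format_correctness import (
    LLMJudgeFormatCorrectnessContext,
)

# `language` is the only declared field; extra keyword arguments are kept
# because model_config sets extra='allow'. Values here are illustrative only.
context = LLMJudgeFormatCorrectnessContext(
    language="en",
    expected_format="markdown table",  # hypothetical extra field
)
print(context.language)     # "en"
print(context.model_extra)  # extra fields, e.g. {'expected_format': ...}
```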

eval_framework.metrics.llm.llm_judge_instruction module

class eval_framework.metrics.llm.llm_judge_instruction.LLMJudgeInstruction(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

KEYS: list[str] | None = ['quality', 'is_following_instruction', 'has_correct_grammar_and_spelling', 'is_context_consistent', 'is_not_repeating', 'is_trustworthy', 'is_safe']
NAME: str = 'Instruction Following'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)
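
Because this judge reports several aspects at once (see KEYS above), a typical call looks like the sketch below; how the returned MetricResult entries map onto the individual keys is an assumption, so the sketch only inspects the class attribute and the number of results:

```python
from eval_framework.metrics.llm.llm_judge_instruction import LLMJudgeInstruction


def judge_instruction_following(judge_llm, completion):
    """Sketch: score one completion on all instruction-following aspects."""
    metric = LLMJudgeInstruction(llm_judge=judge_llm)
    results = metric.calculate(completion)
    # KEYS lists the aspects the judge is asked about ('quality',
    # 'is_following_instruction', ...); the exact correspondence between
    # MetricResult entries and these keys is assumed, not documented here.
    print("aspects:", LLMJudgeInstruction.KEYS)
    print("results:", len(results))
    return results
```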

eval_framework.metrics.llm.llm_judge_mtbench_pair module

class eval_framework.metrics.llm.llm_judge_mtbench_pair.MTBenchJudgePair(llm_judge, randomize_order=False)[source]

Bases: BaseLLMJudgeMetric

Parameters:
  • llm_judge (BaseLLM)

  • randomize_order (bool)

NAME: str = 'pairwise_judgement'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

class eval_framework.metrics.llm.llm_judge_mtbench_pair.MTBenchJudgePairMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • category (str)

  • answer (list[str] | str)

  • reference (list[str] | str | None)

  • extra_data (Any)

answer: list[str] | str
category: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

reference: list[str] | str | None
class eval_framework.metrics.llm.llm_judge_mtbench_pair.PromptToJudge(**data)[source]

Bases: BaseModel

Parameters:
  • comparison_type (str)

  • prompt_text (str)

  • candidate_is_a (bool)

candidate_is_a: bool
comparison_type: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

prompt_text: str
eval_framework.metrics.llm.llm_judge_mtbench_pair.generate_pair_judge_prompts(response, randomize_order=False, seed=None)[source]

Generate pairwise judge prompts for comparing candidate vs reference completions.

Parameters:
  • response (Completion) – The completion response containing the candidate completion.

  • randomize_order (bool) – If True, randomly swap the order of A/B to eliminate position bias.

  • seed (int | None) – Optional random seed for reproducibility. If None and randomize_order is True, the response id is used as the seed, giving deterministic per-sample randomization.

Return type:

list[PromptToJudge]

Returns:

List of PromptToJudge objects with candidate_is_a indicating whether the candidate completion is in position A (True) or position B (False).
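
A short usage sketch, assuming `response` is a Completion whose metric context carries the baseline answers (see MTBenchJudgePairMetricContext above):

```python
from eval_framework.metrics.llm.llm_judge_mtbench_pair import (
    generate_pair_judge_prompts,
)


def build_pair_prompts(response):
    """Sketch: build position-debiased pairwise judge prompts."""
    # With randomize_order=True and seed=None the response id is used as the
    # seed, so the A/B assignment is stable per sample across reruns.
    prompts = generate_pair_judge_prompts(response, randomize_order=True)
    for prompt in prompts:
        position = "A" if prompt.candidate_is_a else "B"
        print(f"{prompt.comparison_type}: candidate is answer {position}")
    return prompts
```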

eval_framework.metrics.llm.llm_judge_mtbench_single module

class eval_framework.metrics.llm.llm_judge_mtbench_single.MTBenchJudgeSingle(llm_judge, randomize_order=False)[source]

Bases: BaseLLMJudgeMetric

Parameters:
  • llm_judge (BaseLLM)

  • randomize_order (bool)

NAME: str = 'single_judgement'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

class eval_framework.metrics.llm.llm_judge_mtbench_single.MTBenchJudgeSingleMetricContext(**data)[source]

Bases: BaseMetricContext

Parameters:
  • category (str)

  • reference (list[str] | str | None)

  • extra_data (Any)

category: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

reference: list[str] | str | None
class eval_framework.metrics.llm.llm_judge_mtbench_single.PromptToJudge(**data)[source]

Bases: BaseModel

Parameters:
  • comparison_type (str)

  • prompt_text (str)

comparison_type: str
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

prompt_text: str
eval_framework.metrics.llm.llm_judge_mtbench_single.generate_single_judge_prompts(response)[source]
Return type:

list[PromptToJudge]

Parameters:

response (Completion)
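
A minimal sketch of the single-answer variant, assuming `response` is a Completion from the evaluation run:

```python
from eval_framework.metrics.llm.llm_judge_mtbench_single import (
    generate_single_judge_prompts,
)


def build_single_prompts(response):
    """Sketch: build single-answer judge prompts for one Completion."""
    prompts = generate_single_judge_prompts(response)
    # Each PromptToJudge carries the prompt text plus a comparison_type label.
    return [(p.comparison_type, p.prompt_text) for p in prompts]
```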

eval_framework.metrics.llm.llm_judge_refusal module

class eval_framework.metrics.llm.llm_judge_refusal.LLMJudgeRefusal(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'refusal_classifier'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.llm_judge_sql module

class eval_framework.metrics.llm.llm_judge_sql.LLMJudgeSql(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'SQL Quality'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

connect_to_mysql()[source]
Return type:

PooledMySQLConnection | MySQLConnectionAbstract

connect_to_postgres()[source]
Return type:

connection

validate_query(dialect, create_db_statements, sql_query, db_schema)[source]
Return type:

SqlValidationResult

Parameters:
  • dialect (SqlDialects)

  • create_db_statements (str)

  • sql_query (str)

  • db_schema (str)

validate_query_mysql(create_db_statements, sql_query, db_schema)[source]
Return type:

SqlValidationResult

Parameters:
  • create_db_statements (str)

  • sql_query (str)

  • db_schema (str)

validate_query_postgres(create_db_statements, sql_query, db_schema)[source]
Return type:

SqlValidationResult

Parameters:
  • create_db_statements (str)

  • sql_query (str)

  • db_schema (str)

validate_query_sqlite(create_db_statements, sql_query, db_schema)[source]
Return type:

SqlValidationResult

Parameters:
  • create_db_statements (str)

  • sql_query (str)

  • db_schema (str)
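
As a sketch of the validation helpers, assuming the DDL, candidate query, and schema strings come from the task's metric context (see LLMJudgeSqlMetricContext below) and that `judge_llm` is a configured BaseLLM:

```python
from eval_framework.metrics.llm.llm_judge_sql import LLMJudgeSql


def check_sqlite_query(judge_llm, create_db_statements, sql_query, db_schema):
    """Sketch: validate a generated SQL query against SQLite."""
    metric = LLMJudgeSql(llm_judge=judge_llm)
    result = metric.validate_query_sqlite(create_db_statements, sql_query, db_schema)
    # SqlValidationResult (documented below) reports whether schema setup and
    # query execution succeeded, plus the fetched rows.
    if not result.success:
        print(result.schema_error or result.query_error)
    return result
```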

class eval_framework.metrics.llm.llm_judge_sql.LLMJudgeSqlMetricContext(**data)[source]

Bases: LanguageMetricContext

Parameters:
  • language (str)

  • dialect (str)

  • db_schema (str)

  • extra_data (Any)

db_schema: str
dialect: str
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

class eval_framework.metrics.llm.llm_judge_sql.SqlDialects(*values)[source]

Bases: Enum

mysql = 'mysql'
postgres = 'postgresql'
sqlite = 'sqlite'
standard_sql = 'standard_sql'
class eval_framework.metrics.llm.llm_judge_sql.SqlOutputComparison(**data)[source]

Bases: BaseModel

Parameters:
  • matches_results_count (bool)

  • matches_column_count (bool)

  • results_equal (bool)

matches_column_count: bool
matches_results_count: bool
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

results_equal: bool
class eval_framework.metrics.llm.llm_judge_sql.SqlValidationResult(**data)[source]

Bases: BaseModel

Parameters:
  • success (bool)

  • schema_error (str | None)

  • query_error (str | None)

  • results (list[Any])

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic's ConfigDict.

query_error: str | None
results: list[Any]
schema_error: str | None
success: bool
eval_framework.metrics.llm.llm_judge_sql.count_result_columns(result)[source]
Return type:

int

Parameters:

result (list[Any])

eval_framework.metrics.llm.llm_judge_sql.extract_query_from_completions(completion)[source]
Return type:

str | None

Parameters:

completion (str)

eval_framework.metrics.llm.llm_judge_sql.is_create_table_statement(statement)[source]
Return type:

bool

Parameters:

statement (str)

eval_framework.metrics.llm.llm_judge_sql.separate_statements(statements)[source]
Return type:

list[str]

Parameters:

statements (str)
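
The string helpers above can be combined as in the sketch below; the exact extraction heuristics of extract_query_from_completions are an assumption, so the sketch only checks for a None result. The completion text and DDL are made up for illustration:

```python
from eval_framework.metrics.llm.llm_judge_sql import (
    extract_query_from_completions,
    is_create_table_statement,
    separate_statements,
)

# Hypothetical model output; the extractor may return None if no query is found.
completion_text = "Here is the query:\n\nSELECT name FROM users WHERE active = 1;"
query = extract_query_from_completions(completion_text)
if query is None:
    print("no SQL query recognised in the completion")

ddl = "CREATE TABLE users (id INT, name TEXT); INSERT INTO users VALUES (1, 'a');"
for statement in separate_statements(ddl):
    print(is_create_table_statement(statement), statement)
```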

eval_framework.metrics.llm.llm_judge_world_knowledge module

class eval_framework.metrics.llm.llm_judge_world_knowledge.LLMJudgeWorldKnowledge(llm_judge)[source]

Bases: BaseLLMJudgeMetric

Parameters:

llm_judge (BaseLLM)

NAME: str = 'World Knowledge'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Completion)

eval_framework.metrics.llm.utils module

Utility functions for LLM-based metrics.

eval_framework.metrics.llm.utils.order_answers_for_comparison(candidate, reference, swap)[source]

Order candidate and reference answers for A/B comparison.

This function is used to mitigate position bias in LLM-as-judge evaluations by optionally swapping the order in which answers are presented.

Parameters:
  • candidate (str) – The candidate completion to evaluate.

  • reference (str) – The reference/baseline completion.

  • swap (bool) – If True, swap the order (reference becomes A, candidate becomes B).

Return type:

tuple[str, str]

Returns:

Tuple of (answer_a, answer_b) in the correct order.
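
A small sketch based on the parameter description above (with swap=False the candidate is assumed to stay in position A):

```python
from eval_framework.metrics.llm.utils import order_answers_for_comparison

candidate = "Answer from the model under evaluation."
reference = "Baseline answer."

# swap=False keeps the candidate as answer A; swap=True presents the
# reference first, which counteracts position bias when varied across samples.
answer_a, answer_b = order_answers_for_comparison(candidate, reference, swap=False)
assert (answer_a, answer_b) == (candidate, reference)

answer_a, answer_b = order_answers_for_comparison(candidate, reference, swap=True)
assert (answer_a, answer_b) == (reference, candidate)
```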

Module contents