eval_framework.metrics.llm package¶
Submodules¶
eval_framework.metrics.llm.base module¶
- class eval_framework.metrics.llm.base.BaseLLMJudgeMetric(llm_judge, randomize_order=False)[source]¶
Bases:
BaseMetric[Completion]
- Parameters:
llm_judge (BaseLLM)
randomize_order (bool)
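The concrete judge metrics below build on this constructor: they wrap a BaseLLM acting as the judge and score a Completion via calculate(). A minimal usage sketch, assuming an existing BaseLLM instance and a Completion produced by the framework (my_judge and completion are placeholders, not part of this API):

from eval_framework.metrics.llm.llm_judge_coherence import LLMJudgeCoherence

# my_judge: any BaseLLM implementation acting as the judge (placeholder).
# completion: a Completion object produced elsewhere by the framework (placeholder).
metric = LLMJudgeCoherence(llm_judge=my_judge)
results = metric.calculate(completion)  # list[MetricResult]
for result in results:
    print(result)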
eval_framework.metrics.llm.llm_judge_chatbot_style module¶
- class eval_framework.metrics.llm.llm_judge_chatbot_style.LLMJudgeChatbotStyle(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'Chatbot Style'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_coherence module¶
- class eval_framework.metrics.llm.llm_judge_coherence.LLMJudgeCoherence(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- KEYS: list[str] | None = ['coherence_score']¶
- NAME: str = 'Coherence'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_completion_accuracy module¶
- class eval_framework.metrics.llm.llm_judge_completion_accuracy.LLMJudgeCompletionAccuracy(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'Judge Completion Accuracy'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_conciseness module¶
- class eval_framework.metrics.llm.llm_judge_conciseness.LLMJudgeConciseness(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'Conciseness'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_contains_names module¶
- class eval_framework.metrics.llm.llm_judge_contains_names.LLMJudgeAvoidsNames(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'Avoids Names'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_format_correctness module¶
- class eval_framework.metrics.llm.llm_judge_format_correctness.LLMJudgeFormatCorrectness(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'Format Correctness'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
- class eval_framework.metrics.llm.llm_judge_format_correctness.LLMJudgeFormatCorrectnessContext(**data)[source]¶
Bases:
BaseMetricContext
- Parameters:
language (str)
extra_data (Any)
- language: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
eval_framework.metrics.llm.llm_judge_instruction module¶
- class eval_framework.metrics.llm.llm_judge_instruction.LLMJudgeInstruction(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- KEYS: list[str] | None = ['quality', 'is_following_instruction', 'has_correct_grammar_and_spelling', 'is_context_consistent', 'is_not_repeating', 'is_trustworthy', 'is_safe']¶
- NAME: str = 'Instruction Following'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_mtbench_pair module¶
- class eval_framework.metrics.llm.llm_judge_mtbench_pair.MTBenchJudgePair(llm_judge, randomize_order=False)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
randomize_order (bool)
- NAME: str = 'pairwise_judgement'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
- class eval_framework.metrics.llm.llm_judge_mtbench_pair.MTBenchJudgePairMetricContext(**data)[source]¶
Bases:
BaseMetricContext
- Parameters:
category (str)
answer (list[str] | str)
reference (list[str] | str | None)
extra_data (Any)
- answer: list[str] | str¶
- category: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- reference: list[str] | str | None¶
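A construction sketch for this context, with illustrative field values; because model_config sets extra='allow', undeclared fields are retained rather than rejected:

from eval_framework.metrics.llm.llm_judge_mtbench_pair import MTBenchJudgePairMetricContext

# Illustrative values; "writing" stands in for an MT-Bench category name.
context = MTBenchJudgePairMetricContext(
    category="writing",
    answer=["reference answer, turn 1", "reference answer, turn 2"],
    reference=None,
    extra_data=None,
    turns=2,  # undeclared field, kept on the model because extra='allow'
)
print(context.category, context.answer)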
- class eval_framework.metrics.llm.llm_judge_mtbench_pair.PromptToJudge(**data)[source]¶
Bases:
BaseModel
- Parameters:
comparison_type (str)
prompt_text (str)
candidate_is_a (bool)
- candidate_is_a: bool¶
- comparison_type: str¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- prompt_text: str¶
- eval_framework.metrics.llm.llm_judge_mtbench_pair.generate_pair_judge_prompts(response, randomize_order=False, seed=None)[source]¶
Generate pairwise judge prompts for comparing candidate vs reference completions.
- Parameters:
response (Completion) – The completion response containing the candidate completion.
randomize_order (bool) – If True, randomly swap the order of A/B to eliminate position bias.
seed (int | None) – Optional random seed for reproducibility. If None and randomize_order is True, uses the response id as seed for deterministic per-sample randomization.
- Return type:
list[PromptToJudge]
- Returns:
List of PromptToJudge objects with candidate_is_a indicating whether the candidate completion is in position A (True) or position B (False).
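A usage sketch, assuming a Completion whose metric context supplies the reference answer (completion below is a placeholder for an object produced by the framework):

from eval_framework.metrics.llm.llm_judge_mtbench_pair import generate_pair_judge_prompts

prompts = generate_pair_judge_prompts(completion, randomize_order=True, seed=42)
for prompt in prompts:
    # candidate_is_a records which position the candidate landed in, so the
    # judge's A/B verdict can be mapped back to candidate vs. reference.
    print(prompt.comparison_type, prompt.candidate_is_a)
    print(prompt.prompt_text)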
eval_framework.metrics.llm.llm_judge_mtbench_single module¶
- class eval_framework.metrics.llm.llm_judge_mtbench_single.MTBenchJudgeSingle(llm_judge, randomize_order=False)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
randomize_order (bool)
- NAME: str = 'single_judgement'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
- class eval_framework.metrics.llm.llm_judge_mtbench_single.MTBenchJudgeSingleMetricContext(**data)[source]¶
Bases:
BaseMetricContext
- Parameters:
category (str)
reference (list[str] | str | None)
extra_data (Any)
- category: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- reference: list[str] | str | None¶
- class eval_framework.metrics.llm.llm_judge_mtbench_single.PromptToJudge(**data)[source]¶
Bases:
BaseModel
- Parameters:
comparison_type (str)
prompt_text (str)
- comparison_type: str¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- prompt_text: str¶
- eval_framework.metrics.llm.llm_judge_mtbench_single.generate_single_judge_prompts(response)[source]¶
- Return type:
list[PromptToJudge]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_refusal module¶
- class eval_framework.metrics.llm.llm_judge_refusal.LLMJudgeRefusal(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'refusal_classifier'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.llm_judge_sql module¶
- class eval_framework.metrics.llm.llm_judge_sql.LLMJudgeSql(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'SQL Quality'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
- validate_query(dialect, create_db_statements, sql_query, db_schema)[source]¶
- Return type:
- Parameters:
dialect (SqlDialects)
create_db_statements (str)
sql_query (str)
db_schema (str)
- validate_query_mysql(create_db_statements, sql_query, db_schema)[source]¶
- Return type:
- Parameters:
create_db_statements (str)
sql_query (str)
db_schema (str)
- validate_query_postgres(create_db_statements, sql_query, db_schema)[source]¶
- Return type:
- Parameters:
create_db_statements (str)
sql_query (str)
db_schema (str)
- class eval_framework.metrics.llm.llm_judge_sql.LLMJudgeSqlMetricContext(**data)[source]¶
Bases:
LanguageMetricContext
- Parameters:
language (str)
dialect (str)
db_schema (str)
extra_data (Any)
- db_schema: str¶
- dialect: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class eval_framework.metrics.llm.llm_judge_sql.SqlDialects(*values)[source]¶
Bases:
Enum
- mysql = 'mysql'¶
- postgres = 'postgresql'¶
- sqlite = 'sqlite'¶
- standard_sql = 'standard_sql'¶
- class eval_framework.metrics.llm.llm_judge_sql.SqlOutputComparison(**data)[source]¶
Bases:
BaseModel
- Parameters:
matches_results_count (bool)
matches_column_count (bool)
results_equal (bool)
- matches_column_count: bool¶
- matches_results_count: bool¶
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- results_equal: bool¶
- class eval_framework.metrics.llm.llm_judge_sql.SqlValidationResult(**data)[source]¶
Bases:
BaseModel
- Parameters:
success (bool)
schema_error (str | None)
query_error (str | None)
results (list[Any])
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- query_error: str | None¶
- results: list[Any]¶
- schema_error: str | None¶
- success: bool¶
- eval_framework.metrics.llm.llm_judge_sql.count_result_columns(result)[source]¶
- Return type:
int
- Parameters:
result (list[Any])
- eval_framework.metrics.llm.llm_judge_sql.extract_query_from_completions(completion)[source]¶
- Return type:
str | None
- Parameters:
completion (str)
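A sketch of the module-level helpers, assuming a completion string that contains a SQL statement in free text (the string below is illustrative; the exact formats the extractor recognises are not specified here):

from eval_framework.metrics.llm.llm_judge_sql import SqlDialects, extract_query_from_completions

completion_text = "The requested query is: SELECT name FROM users WHERE id = 1;"
query = extract_query_from_completions(completion_text)  # str | None
if query is not None:
    print("extracted query:", query)

# Dialect values defined by SqlDialects:
for dialect in SqlDialects:
    print(dialect.name, dialect.value)

The validate_query, validate_query_mysql and validate_query_postgres methods on LLMJudgeSql take the CREATE statements, the candidate query and a textual schema; the SqlValidationResult and SqlOutputComparison models above describe validation success/errors and result-set comparisons.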
eval_framework.metrics.llm.llm_judge_world_knowledge module¶
- class eval_framework.metrics.llm.llm_judge_world_knowledge.LLMJudgeWorldKnowledge(llm_judge)[source]¶
Bases:
BaseLLMJudgeMetric
- Parameters:
llm_judge (BaseLLM)
- NAME: str = 'World Knowledge'¶
- calculate(response)[source]¶
- Return type:
list[MetricResult]
- Parameters:
response (Completion)
eval_framework.metrics.llm.utils module¶
Utility functions for LLM-based metrics.
- eval_framework.metrics.llm.utils.order_answers_for_comparison(candidate, reference, swap)[source]¶
Order candidate and reference answers for A/B comparison.
This function is used to mitigate position bias in LLM-as-judge evaluations by optionally swapping the order in which answers are presented.
- Parameters:
candidate (str) – The candidate completion to evaluate.
reference (str) – The reference/baseline completion.
swap (bool) – If True, swap the order (reference becomes A, candidate becomes B).
- Return type:
tuple[str, str]
- Returns:
Tuple of (answer_a, answer_b) in the correct order.
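A minimal sketch of the swap behaviour (the answer strings are illustrative):

from eval_framework.metrics.llm.utils import order_answers_for_comparison

candidate = "Answer produced by the model under evaluation."
reference = "Baseline answer to compare against."

# swap=False keeps the original order: candidate in position A, reference in B.
answer_a, answer_b = order_answers_for_comparison(candidate, reference, swap=False)

# swap=True presents the reference as A and the candidate as B, so that
# position bias can be balanced out across samples.
answer_a, answer_b = order_answers_for_comparison(candidate, reference, swap=True)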