eval_framework.metrics.loglikelihood package¶
Submodules¶
eval_framework.metrics.loglikelihood.accuracy_loglikelihood module¶
- class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood[source]¶
Bases: BaseMetric[Loglikelihood]
- NAME: str = 'Accuracy Loglikelihood'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
- class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood[source]¶
Bases: BaseMetric[Loglikelihood]
- NAME: str = 'Accuracy Normalized Loglikelihood'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
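The difference between the raw and the length-normalised accuracy variants can be sketched as follows. This is an illustration only: the normalisation here divides by character count, which is an assumption — the framework may normalise by tokens or bytes instead.

```python
def pick_by_loglikelihood(options: dict[str, float], len_normalised: bool = False) -> str:
    """Pick the option with the highest (optionally length-normalised) loglikelihood.

    `options` maps each completion string to its total loglikelihood.
    Normalisation divides by character length here (an illustrative assumption).
    """
    def score(item: tuple[str, float]) -> float:
        text, ll = item
        return ll / len(text) if len_normalised else ll
    return max(options.items(), key=score)[0]

# A short completion with the higher total loglikelihood vs. a longer one
# that is more likely per character:
options = {"yes": -2.0, "absolutely": -5.0}
```

Raw selection prefers `"yes"` (−2.0 > −5.0), while length normalisation prefers `"absolutely"` (−0.5 per char vs. −0.67), which is the typical reason the normalised variant exists for options of uneven length.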
- class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.PartialEvalAccuracy[source]¶
Bases: BaseMetric[Loglikelihood]
An accuracy metric for partial evaluation tasks, e.g. WinograndeCloze.
Here, for each item, we generate a pair of samples, one per option. We score the model’s completion for each option and use the score of the correct option to compute the overall accuracy.
NOTE: The current implementation assumes that samples arrive in pairs identifiable by consecutive ids (even followed by odd). This holds for the WinograndeCloze tasks, but other tasks using this metric may not satisfy it and would require a more general implementation (e.g. storing item_id in the Sample.context).
- NAME: str = 'Partial Evaluation Accuracy'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
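The pairing logic described in the note above might look roughly like this. It is an illustrative sketch using hypothetical `(id, loglikelihood, is_correct_option)` tuples, not the framework's actual Sample/Loglikelihood types:

```python
def partial_eval_accuracy(samples: list[tuple[int, float, bool]]) -> float:
    """Group samples into (even_id, odd_id) pairs and score each pair.

    A pair counts as correct when the correct option's completion has the
    higher loglikelihood. Sketch only; real samples carry more structure.
    """
    by_id = {s[0]: s for s in samples}
    scores = []
    for sid in sorted(by_id):
        if sid % 2 != 0:
            continue  # even id anchors the pair (sid, sid + 1)
        a, b = by_id[sid], by_id[sid + 1]
        correct, other = (a, b) if a[2] else (b, a)
        scores.append(1.0 if correct[1] > other[1] else 0.0)
    return sum(scores) / len(scores)
```

This makes the fragility mentioned in the note concrete: any re-ordering or filtering that breaks the consecutive-id convention silently mis-pairs the samples.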
eval_framework.metrics.loglikelihood.base module¶
- class eval_framework.metrics.loglikelihood.base.BaseLoglikelihoodMetric(*, len_normalised=True)[source]¶
Bases: BaseMetric[Loglikelihood]
Base class for metrics that operate on loglikelihood responses.
- Parameters:
len_normalised (bool)
eval_framework.metrics.loglikelihood.bits_per_byte module¶
- class eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood[source]¶
Bases: BaseMetric[Loglikelihood]
Bits-per-byte metric for loglikelihood responses.
This follows the Paloma definition: the negative log-likelihood of the answer, converted to base-2 (bits), divided by the number of UTF-8 bytes in the answer string.
- NAME: str = 'BitsPerByte'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
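The Paloma-style computation can be sketched as follows, assuming the loglikelihoods are in nats (the usual convention, though not stated here), so the conversion to bits divides by ln 2:

```python
import math

def bits_per_byte(answer: str, loglikelihood: float) -> float:
    """Negative loglikelihood (assumed to be in nats) converted to bits,
    divided by the UTF-8 byte length of the answer string."""
    n_bytes = len(answer.encode("utf-8"))
    return -loglikelihood / (math.log(2) * n_bytes)
```

For example, an answer of 4 bytes where each byte-equivalent step has probability 0.5 yields exactly 1.0 bits per byte.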
eval_framework.metrics.loglikelihood.confidence_weighted_accuracy module¶
- class eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy(*, len_normalised=True)[source]¶
Bases: BaseLoglikelihoodMetric
- Parameters:
len_normalised (bool)
- NAME: str = 'Confidence-weighted Accuracy'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
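One plausible reading of a confidence-weighted accuracy — an assumption, since the exact formula is not documented here — weights the correctness indicator by the model's normalised probability of its chosen option:

```python
import math

def confidence_weighted_accuracy(loglikelihoods: list[float], correct_idx: int) -> float:
    """Sketch (not the framework's confirmed formula): softmax the option
    loglikelihoods, then score the item by the probability of the predicted
    option when the prediction is correct, and 0 otherwise."""
    m = max(loglikelihoods)  # subtract max for numerical stability
    weights = [math.exp(ll - m) for ll in loglikelihoods]
    probs = [w / sum(weights) for w in weights]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return probs[pred] if pred == correct_idx else 0.0
```

Under this reading, a correct answer chosen with 70% probability scores 0.7 rather than 1.0, so confident correct predictions are rewarded more than marginal ones. The `len_normalised` flag would apply length normalisation to the loglikelihoods before the softmax.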
eval_framework.metrics.loglikelihood.dcs module¶
- class eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore(*, lc=1.0, lw=1.0, len_normalised=True)[source]¶
Bases: BaseLoglikelihoodMetric
Based on Burns (2025), Measuring Language Model Hallucinations Through Distributional Correctness.
- Parameters:
lc (float)
lw (float)
len_normalised (bool)
- NAME: str = 'Distributional Correctness Score'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
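Inferring only from the parameter names `lc` and `lw` — this is a guess at the metric's shape, not the formula from Burns (2025) — a distributional score of this kind rewards the probability mass placed on the correct option and penalises the mass placed on wrong ones:

```python
import math

def distributional_correctness_sketch(loglikelihoods: list[float],
                                      correct_idx: int,
                                      lc: float = 1.0,
                                      lw: float = 1.0) -> float:
    """Hypothetical sketch: normalise exp(loglikelihood) over the options,
    then weight correct mass by lc and wrong mass by -lw. Consult the
    Burns (2025) paper for the actual definition."""
    m = max(loglikelihoods)
    weights = [math.exp(ll - m) for ll in loglikelihoods]
    probs = [w / sum(weights) for w in weights]
    p_correct = probs[correct_idx]
    return lc * p_correct - lw * (1.0 - p_correct)
```

The point such a score captures, unlike plain accuracy, is that a model spreading 20% of its mass over wrong answers is graded worse than one that concentrates it on the correct answer, even when both pick the same argmax.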
eval_framework.metrics.loglikelihood.probability_mass module¶
- class eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass[source]¶
Bases: BaseMetric[Loglikelihood]
- NAME: str = 'Probability Mass'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
- class eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm[source]¶
Bases: BaseMetric[Loglikelihood]
- NAME: str = 'Probability Mass Normalized'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
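Both probability-mass variants reduce to normalising exp(loglikelihood) over the options and reading off the mass on the correct one. A sketch of the plain variant follows; presumably the "Normalized" class length-normalises the loglikelihoods before this step, which is an assumption here:

```python
import math

def probability_mass(loglikelihoods: list[float], correct_idx: int) -> float:
    """Fraction of the normalised probability assigned to the correct
    option (illustrative sketch)."""
    m = max(loglikelihoods)  # subtract max for numerical stability
    weights = [math.exp(ll - m) for ll in loglikelihoods]
    return weights[correct_idx] / sum(weights)
```

Unlike accuracy, this metric is continuous: it distinguishes a model that puts 0.9 on the correct option from one that puts 0.4 on it, even if both rank it first.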
eval_framework.metrics.loglikelihood.ternary module¶
- class eval_framework.metrics.loglikelihood.ternary.TernaryScore(*, lc=1.0, lw=1.0, len_normalised=True)[source]¶
Bases: BaseLoglikelihoodMetric
Based on Kalai et al. (2025), Why language models hallucinate. arXiv:2509.04664
- Parameters:
lc (float)
lw (float)
len_normalised (bool)
- NAME: str = 'Ternary Score'¶
- calculate(response)[source]¶
- Parameters:
response (Loglikelihood)
- Return type:
list[MetricResult]
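In the spirit of Kalai et al. (2025), who argue that binary grading incentivises guessing over abstaining, a ternary rule rewards correct answers, penalises wrong ones, and gives abstentions zero. The sketch below is a hypothetical reading based on the `lc`/`lw` parameter names; how the framework maps loglikelihood responses onto the three outcomes is left abstract:

```python
def ternary_score_sketch(outcome: str, lc: float = 1.0, lw: float = 1.0) -> float:
    """Hypothetical ternary grading rule: +lc for a correct answer,
    -lw for a wrong one, 0 for an abstention, so that guessing is not
    the dominant strategy. Not the framework's confirmed mapping."""
    return {"correct": lc, "wrong": -lw, "abstain": 0.0}[outcome]
```

With `lw > 0`, a model that abstains when unsure scores higher in expectation than one that guesses at random, which is the behaviour the paper argues evaluations should reward.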