eval_framework.metrics.loglikelihood package

Submodules

eval_framework.metrics.loglikelihood.accuracy_loglikelihood module

class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyLoglikelihood[source]

Bases: BaseMetric[Loglikelihood]

NAME: str = 'Accuracy Loglikelihood'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)

class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.AccuracyNormLoglikelihood[source]

Bases: BaseMetric[Loglikelihood]

NAME: str = 'Accuracy Normalized Loglikelihood'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
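The two classes above carry no docstrings, so the following is a hedged sketch of the conventional semantics (the framework's actual implementation may differ): plain loglikelihood accuracy picks the answer choice with the highest loglikelihood, while the normalized variant first divides each loglikelihood by the choice's length (here, UTF-8 bytes — an assumption) to reduce the bias towards shorter answers.

```python
def accuracy(loglikelihoods: dict[str, float], correct: str) -> float:
    """Plain loglikelihood accuracy: 1.0 if the correct choice has the
    highest loglikelihood, else 0.0."""
    predicted = max(loglikelihoods, key=loglikelihoods.get)
    return 1.0 if predicted == correct else 0.0


def accuracy_norm(loglikelihoods: dict[str, float], correct: str) -> float:
    """Length-normalized variant: divide each loglikelihood by the
    choice's byte length before taking the argmax."""
    normed = {c: ll / len(c.encode("utf-8")) for c, ll in loglikelihoods.items()}
    return accuracy(normed, correct)
```

Note how the two can disagree: a long choice with a moderately low total loglikelihood can win under normalization.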

class eval_framework.metrics.loglikelihood.accuracy_loglikelihood.PartialEvalAccuracy[source]

Bases: BaseMetric[Loglikelihood]

An accuracy metric for partial evaluation tasks, e.g. WinograndeCloze.

Here, for each item, we generate a pair of samples, one for each option. We then calculate the accuracy of the model's completion for each option and use the accuracy of the correct option to compute the overall accuracy.

NOTE: The current implementation relies on the assumption that samples come in pairs, identifiable by consecutive ids (even and odd). This holds for the WinograndeCloze tasks, but other tasks using this metric may violate the assumption and require a more general implementation (e.g. storing item_id in the Sample.context).

NAME: str = 'Partial Evaluation Accuracy'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
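The pairing assumption described in the note above can be sketched as follows (hypothetical function and argument names; the real metric operates on the framework's Loglikelihood responses): samples with consecutive ids are grouped into one item, and an item counts as correct when the sample for the correct option scores at least as high as its partner.

```python
def partial_eval_accuracy(samples: list[tuple[int, float, bool]]) -> float:
    """samples: (id, loglikelihood, is_correct_option) triples.
    Consecutive ids (2k, 2k+1) are assumed to form one item, mirroring
    the note about odd/even id pairs."""
    by_id = sorted(samples)
    n_items = len(by_id) // 2
    correct_items = 0
    for k in range(n_items):
        a, b = by_id[2 * k], by_id[2 * k + 1]
        # the item is correct if the higher-scoring sample of the pair
        # is the one marked as the correct option
        winner = a if a[1] >= b[1] else b
        if winner[2]:
            correct_items += 1
    return correct_items / n_items
```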

eval_framework.metrics.loglikelihood.base module

class eval_framework.metrics.loglikelihood.base.BaseLoglikelihoodMetric(*, len_normalised=True)[source]

Bases: BaseMetric[Loglikelihood]

Base class for metrics that operate on loglikelihood responses.

Parameters:

len_normalised (bool)
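A minimal sketch of what the len_normalised flag plausibly controls (an assumption — the framework's normalization may use tokens or characters rather than bytes):

```python
def normalise(loglikelihood: float, completion: str,
              len_normalised: bool = True) -> float:
    """Optionally divide a completion's loglikelihood by its length in
    UTF-8 bytes, so longer answers are not penalized merely for length."""
    if not len_normalised:
        return loglikelihood
    return loglikelihood / max(1, len(completion.encode("utf-8")))
```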

eval_framework.metrics.loglikelihood.bits_per_byte module

class eval_framework.metrics.loglikelihood.bits_per_byte.BitsPerByteLoglikelihood[source]

Bases: BaseMetric[Loglikelihood]

Bits-per-byte metric for loglikelihood responses.

This follows the Paloma definition: the negative log-likelihood of the answer divided by the number of UTF-8 bytes in the answer string.

NAME: str = 'BitsPerByte'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
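The Paloma definition quoted above can be written out directly. Assuming the loglikelihood arrives in nats (as Python's math.log would produce), it is converted to bits before dividing by the byte count:

```python
import math


def bits_per_byte(loglikelihood_nats: float, answer: str) -> float:
    """Paloma-style bits-per-byte: the negative loglikelihood of the
    answer, converted from nats to bits, divided by the answer's
    length in UTF-8 bytes."""
    n_bytes = len(answer.encode("utf-8"))
    return -loglikelihood_nats / (math.log(2) * n_bytes)
```

Lower is better: a model that assigns higher likelihood to the answer spends fewer bits per byte of text.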

eval_framework.metrics.loglikelihood.confidence_weighted_accuracy module

class eval_framework.metrics.loglikelihood.confidence_weighted_accuracy.ConfidenceWeightedAccuracy(*, len_normalised=True)[source]

Bases: BaseLoglikelihoodMetric

Parameters:

len_normalised (bool)

NAME: str = 'Confidence-weighted Accuracy'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
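This class has no docstring; one plausible reading of the name (a labeled assumption, not the framework's confirmed formula) is that a correct prediction is weighted by the softmax probability the model assigns to it, so a confidently correct answer scores higher than a barely correct one:

```python
import math


def confidence_weighted_accuracy(loglikelihoods: dict[str, float],
                                 correct: str) -> float:
    """Hypothetical sketch: score a correct prediction by the softmax
    probability placed on it, and a wrong prediction as 0.0."""
    predicted = max(loglikelihoods, key=loglikelihoods.get)
    if predicted != correct:
        return 0.0
    z = sum(math.exp(ll) for ll in loglikelihoods.values())
    return math.exp(loglikelihoods[predicted]) / z
```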

eval_framework.metrics.loglikelihood.dcs module

class eval_framework.metrics.loglikelihood.dcs.DistributionalCorrectnessScore(*, lc=1.0, lw=1.0, len_normalised=True)[source]

Bases: BaseLoglikelihoodMetric

Based on Burns (2025), "Measuring Language Model Hallucinations Through Distributional Correctness".

Parameters:
  • lc (float)

  • lw (float)

  • len_normalised (bool)

NAME: str = 'Distributional Correctness Score'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
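The lc and lw parameters are undocumented here. As a heavily hedged sketch (not the paper's exact formula — consult Burns (2025) for that), a distributional score of this general shape rewards the probability mass the model places on the correct option and penalizes the mass placed on wrong options, with lc and lw assumed to weight the two terms:

```python
import math


def distributional_correctness(loglikelihoods: dict[str, float], correct: str,
                               lc: float = 1.0, lw: float = 1.0) -> float:
    """Hypothetical sketch: reward softmax mass on the correct choice,
    penalize mass on the wrong choices, weighted by lc and lw."""
    z = sum(math.exp(ll) for ll in loglikelihoods.values())
    p_correct = math.exp(loglikelihoods[correct]) / z
    return lc * p_correct - lw * (1.0 - p_correct)
```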

eval_framework.metrics.loglikelihood.probability_mass module

class eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMass[source]

Bases: BaseMetric[Loglikelihood]

NAME: str = 'Probability Mass'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)

class eval_framework.metrics.loglikelihood.probability_mass.ProbabilityMassNorm[source]

Bases: BaseMetric[Loglikelihood]

NAME: str = 'Probability Mass Normalized'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
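Neither probability-mass class has a docstring. A hedged sketch of the conventional reading: the metric is the share of softmax probability mass on the correct choice, and the "Normalized" variant (an assumption) length-normalizes the loglikelihoods first, analogously to the accuracy pair above:

```python
import math


def probability_mass(loglikelihoods: dict[str, float], correct: str,
                     normalised: bool = False) -> float:
    """Sketch: softmax probability mass on the correct choice. With
    normalised=True, each loglikelihood is first divided by the choice's
    UTF-8 byte length (an assumption about the 'Norm' class)."""
    lls = {
        c: (ll / len(c.encode("utf-8")) if normalised else ll)
        for c, ll in loglikelihoods.items()
    }
    z = sum(math.exp(v) for v in lls.values())
    return math.exp(lls[correct]) / z
```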

eval_framework.metrics.loglikelihood.ternary module

class eval_framework.metrics.loglikelihood.ternary.TernaryScore(*, lc=1.0, lw=1.0, len_normalised=True)[source]

Bases: BaseLoglikelihoodMetric

Based on Kalai et al. (2025), "Why Language Models Hallucinate", arXiv:2509.04664.

Parameters:
  • lc (float)

  • lw (float)

  • len_normalised (bool)

NAME: str = 'Ternary Score'
calculate(response)[source]
Return type:

list[MetricResult]

Parameters:

response (Loglikelihood)
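Kalai et al. (2025) argue that binary grading rewards guessing, and that scoring should give abstention a neutral score while penalizing confident wrong answers. A hedged sketch of a ternary scheme in that spirit, with lc and lw assumed to weight the reward and penalty (hypothetical names, not the framework's confirmed formula):

```python
def ternary_score(is_correct: bool, abstained: bool,
                  lc: float = 1.0, lw: float = 1.0) -> float:
    """Hypothetical sketch: +lc for a correct answer, -lw for a wrong
    answer, 0 for abstaining ("I don't know"), so guessing under
    uncertainty is no longer a free option."""
    if abstained:
        return 0.0
    return lc if is_correct else -lw
```

Under this scheme a model that abstains on hard items outscores one that guesses and is often wrong.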

Module contents