Data Utilities

Response matrix and pairwise comparison data utilities.

class torch_measure.data.PairwiseComparisons(subject_a, subject_b, outcome, subject_ids, item_ids=None, item_contents=None, item_idx=None, subject_metadata=None, comparison_metadata=None)[source]

Pairwise comparison data (e.g., Chatbot Arena).

Each observation records subject_a vs subject_b with an outcome.

Parameters:
  • subject_a (torch.LongTensor) – Indices into subject_ids for the first subject in each comparison. Shape: (n_comparisons,).

  • subject_b (torch.LongTensor) – Indices into subject_ids for the second subject in each comparison. Shape: (n_comparisons,).

  • outcome (torch.Tensor) – Comparison outcome. 1.0 = subject_a wins, 0.0 = subject_b wins, 0.5 = tie. Shape: (n_comparisons,).

  • subject_ids (list[str]) – Unique subject identifiers (e.g., model names).

  • item_ids (list[str] | None) – Unique item/prompt identifiers (e.g., question IDs).

  • item_contents (list[str] | None) – Text content for each item (one per entry in item_ids).

  • item_idx (torch.LongTensor | None) – Per-comparison index into item_ids, shape (n_comparisons,). Maps each comparison to the item/prompt it was evaluated on.

  • subject_metadata (list[dict] | None) – Structured metadata per subject (one dict per entry in subject_ids).

  • comparison_metadata (list[dict] | None) – Structured metadata per comparison (one dict per row).
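As a rough illustration of the index encoding described above, the following sketch builds the three tensors and the subject_ids list from raw records. The record layout, the model names, and the "win"/"loss"/"tie" labels are hypothetical; only the 1.0 / 0.0 / 0.5 outcome encoding and the tensor shapes come from the parameter list.

```python
import torch

# Hypothetical raw records: (first model, second model, result for the first model).
records = [
    ("model-x", "model-y", "win"),
    ("model-y", "model-z", "tie"),
    ("model-z", "model-x", "loss"),
    ("model-x", "model-y", "loss"),
]

# Unique subject identifiers in a stable (sorted) order.
subject_ids = sorted({name for a, b, _ in records for name in (a, b)})
index_of = {name: i for i, name in enumerate(subject_ids)}

# Outcome encoding taken from the parameter description:
# 1.0 = subject_a wins, 0.0 = subject_b wins, 0.5 = tie.
score = {"win": 1.0, "loss": 0.0, "tie": 0.5}

# Each tensor has shape (n_comparisons,).
subject_a = torch.tensor([index_of[a] for a, _, _ in records], dtype=torch.long)
subject_b = torch.tensor([index_of[b] for _, b, _ in records], dtype=torch.long)
outcome = torch.tensor([score[r] for _, _, r in records], dtype=torch.float32)
```

Going by the signature above, these values would then be passed as PairwiseComparisons(subject_a, subject_b, outcome, subject_ids).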

property n_comparisons: int

Number of pairwise comparisons.

property n_subjects: int

Number of unique subjects.

property n_items: int

Number of unique items/prompts.

property shape: tuple[int, int]

(n_comparisons, n_subjects).

property density: float

Fraction of all possible unordered subject pairs that are observed.

Computed as n_comparisons / (n_subjects * (n_subjects - 1) / 2).
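The formula can be checked by hand with a small helper. This is a minimal sketch of the stated expression, not the library's code; the function name is hypothetical. Because the numerator counts every comparison, the value can exceed 1 when the same pair is compared more than once.

```python
def pairwise_density(n_comparisons: int, n_subjects: int) -> float:
    """Density as stated above: comparisons divided by the number of subject pairs."""
    # n_subjects * (n_subjects - 1) / 2 is the number of distinct pairs.
    # This sketch does not guard against n_subjects < 2 (division by zero).
    n_pairs = n_subjects * (n_subjects - 1) / 2
    return n_comparisons / n_pairs


half_covered = pairwise_density(3, 4)    # 3 comparisons over 6 possible pairs
repeated = pairwise_density(12, 4)       # repeated pairs push the value above 1
```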

win_rates()[source]

Per-subject overall win rate.

Returns:

Win rate for each subject, shape (n_subjects,). Ties count as 0.5 wins and 0.5 losses.

Return type:

torch.Tensor
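One plausible way to compute such win rates is sketched below in plain PyTorch. The source does not state the denominator, so this sketch assumes it is the number of comparisons each subject took part in; the function name is hypothetical.

```python
import torch


def win_rates_sketch(subject_a, subject_b, outcome, n_subjects):
    """Per-subject win rate, with a tie counted as half a win for each side."""
    wins = torch.zeros(n_subjects)
    games = torch.zeros(n_subjects)
    ones = torch.ones_like(outcome)

    # subject_a earns `outcome`; subject_b earns the complement.
    wins.index_add_(0, subject_a, outcome)
    wins.index_add_(0, subject_b, 1.0 - outcome)

    # Every comparison counts once for each participant.
    games.index_add_(0, subject_a, ones)
    games.index_add_(0, subject_b, ones)

    # Subjects with no comparisons yield 0/0 = NaN in this sketch.
    return wins / games


subject_a = torch.tensor([0, 1, 2, 0])
subject_b = torch.tensor([1, 2, 0, 1])
outcome = torch.tensor([1.0, 0.5, 0.0, 0.0])
rates = win_rates_sketch(subject_a, subject_b, outcome, n_subjects=3)
```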

to_win_matrix()[source]

Aggregate into a pairwise win-rate matrix.

Returns:

Square matrix of shape (n_subjects, n_subjects) where entry (i, j) is the win rate of subject i against subject j. Diagonal is NaN. Unobserved pairs are NaN.

Return type:

torch.Tensor
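The aggregation can be sketched with two accumulators, one for scores and one for counts. This is an illustration under the assumption that ties contribute 0.5 to both directions; the function name is hypothetical and the actual implementation may differ.

```python
import torch


def win_matrix_sketch(subject_a, subject_b, outcome, n_subjects):
    """Entry (i, j) is the mean score of subject i against subject j."""
    score = torch.zeros(n_subjects, n_subjects)
    count = torch.zeros(n_subjects, n_subjects)
    ones = torch.ones_like(outcome)

    # Each comparison is recorded from both sides: (a, b) and (b, a).
    score.index_put_((subject_a, subject_b), outcome, accumulate=True)
    score.index_put_((subject_b, subject_a), 1.0 - outcome, accumulate=True)
    count.index_put_((subject_a, subject_b), ones, accumulate=True)
    count.index_put_((subject_b, subject_a), ones, accumulate=True)

    # Cells with no comparisons (including the diagonal) become 0/0 = NaN.
    return score / count


subject_a = torch.tensor([0, 1, 2, 0])
subject_b = torch.tensor([1, 2, 0, 1])
outcome = torch.tensor([1.0, 0.5, 0.0, 0.0])
# Subject 3 never appears, so its row and column stay NaN.
win_matrix = win_matrix_sketch(subject_a, subject_b, outcome, n_subjects=4)
```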

to(device)[source]

Move tensors to a device.

Parameters:

device (device | str)

Return type:

PairwiseComparisons

classmethod from_dataframe(df, subject_a_col='model_a', subject_b_col='model_b', outcome_col='outcome')[source]

Create from a pandas DataFrame.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing at least the columns named by subject_a_col, subject_b_col, and outcome_col.

  • subject_a_col (str) – Column name for the first subject.

  • subject_b_col (str) – Column name for the second subject.

  • outcome_col (str) – Column name for the outcome (1.0 = a wins, 0.0 = b wins, 0.5 = tie).

Return type:

PairwiseComparisons

class torch_measure.data.ResponseMatrix(data, subject_ids=None, item_ids=None, item_contents=None, subject_metadata=None, info=None)[source]

A binary or continuous response matrix (subjects x items).

Parameters:
  • data (torch.Tensor) – Response matrix of shape (n_subjects, n_items). Values can be binary (0/1) for correct/incorrect responses, continuous in [0, 1] for probability responses, or NaN for missing data.

  • subject_ids (list[str] | None) – Optional identifiers for subjects (rows).

  • item_ids (list[str] | None) – Optional identifiers for items (columns).

  • item_contents (list[str] | None) – Optional text content for each item (e.g., question text).

  • subject_metadata (list[dict[str, str | int | float | bool | None]] | None) – Optional structured metadata for each subject (one dict per row). For HELM datasets, each dict has keys: org, model, param_count, is_instruct.

  • info (dict | None) – Optional dataset-level metadata (interpretation notes, paper URL, data source URL, license, etc.). Usually loaded from data/<benchmark>/info.yaml. Common keys include: description, testing_condition, paper_url, data_source_url, subject_type, item_type, license, citation, tags.

property n_rows: int

Number of subjects (rows).

property n_cols: int

Number of items (columns).

property n_subjects: int

Number of subjects (rows).

property n_items: int

Number of items (columns).

property shape: tuple[int, int]

Shape of the response matrix.

property observed_mask: Tensor

Boolean mask of observed (non-NaN) entries.

property density: float

Fraction of observed (non-missing) entries.

property subject_means: Tensor

Mean response per subject (ignoring NaN).

property item_means: Tensor

Mean response per item (ignoring NaN), i.e., item easiness/facility.
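The NaN-aware properties above correspond to a few standard PyTorch operations. The sketch below computes them directly on a small tensor; it mirrors the descriptions rather than the library's own code.

```python
import torch

# Two subjects, three items; one response is missing.
data = torch.tensor([
    [1.0, 0.0, float("nan")],
    [1.0, 1.0, 0.0],
])

# Observed entries are the non-NaN cells.
observed_mask = ~torch.isnan(data)

# Density: fraction of cells that are observed.
density = observed_mask.float().mean().item()

# Means that skip NaN, per row (subject) and per column (item).
subject_means = torch.nanmean(data, dim=1)
item_means = torch.nanmean(data, dim=0)
```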

to(device)[source]

Move response matrix to a device.

Parameters:

device (device | str)

Return type:

ResponseMatrix

binarize(threshold=0.5)[source]

Convert continuous responses to binary using a threshold.

Parameters:

threshold (float)

Return type:

ResponseMatrix

classmethod from_numpy(array, **kwargs)[source]

Create from a numpy array.

Return type:

ResponseMatrix

classmethod from_dataframe(df)[source]

Create from a pandas DataFrame.

Return type:

ResponseMatrix

classmethod from_long(data)[source]

Pivot a LongFormData into a wide ResponseMatrix.

When multiple trials or non-null test_condition values exist per (subject, item) cell, the response is averaged across those dimensions. The legacy load() path used to do this automatically; consumers who want polytomous / per-trial / multi-condition analysis should work with the LongFormData directly.

Parameters:

data (LongFormData) – The long-form dataset returned by torch_measure.datasets.load().

Returns:

Subject-by-item matrix with subjects rendered as their display_name (when the subjects registry is populated) and items keyed by item_id. item_contents carries the item content strings from the items registry.

Return type:

ResponseMatrix
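The averaging step can be pictured as a scatter-and-divide over (subject, item) cells. The sketch below uses plain index tensors as hypothetical stand-ins for long-form records, since the fields of LongFormData are not documented here. Marking empty cells as NaN is an assumption that matches the ResponseMatrix missing-data convention.

```python
import torch


def pivot_mean_sketch(subject_idx, item_idx, response, n_subjects, n_items):
    """Average repeated long-form records into a subjects x items matrix."""
    total = torch.zeros(n_subjects, n_items)
    count = torch.zeros(n_subjects, n_items)

    # Accumulate the response sum and the number of records per cell.
    total.index_put_((subject_idx, item_idx), response, accumulate=True)
    count.index_put_(
        (subject_idx, item_idx), torch.ones_like(response), accumulate=True
    )

    # Cells without any record become 0/0 = NaN (treated as missing).
    return total / count


# Subject 0 answered item 0 in two trials (1.0 and 0.0); cell (1, 1) has no record.
subject_idx = torch.tensor([0, 0, 0, 1])
item_idx = torch.tensor([0, 0, 1, 0])
response = torch.tensor([1.0, 0.0, 1.0, 0.0])
wide = pivot_mean_sketch(subject_idx, item_idx, response, n_subjects=2, n_items=2)
```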

torch_measure.data.random_mask(observed, train_frac=0.8)[source]

Randomly split observed entries into train/test masks.

Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries (n_subjects x n_items).

  • train_frac (float) – Fraction of observed entries to assign to training.

Returns:

train_mask, test_mask – Boolean masks for training and testing.

Return type:

tuple[torch.Tensor, torch.Tensor]
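A random split of this kind can be sketched by shuffling the flat indices of the observed cells. The rounding rule and the optional generator argument are assumptions of this sketch, not documented behavior of random_mask.

```python
import torch


def random_mask_sketch(observed, train_frac=0.8, generator=None):
    """Split observed cells into disjoint train/test boolean masks."""
    # Flat indices of the observed cells, in shuffled order.
    idx = observed.flatten().nonzero(as_tuple=True)[0]
    perm = idx[torch.randperm(idx.numel(), generator=generator)]

    # Assumed rounding rule: nearest integer number of training cells.
    n_train = int(round(train_frac * idx.numel()))

    train_flat = torch.zeros(observed.numel(), dtype=torch.bool)
    train_flat[perm[:n_train]] = True
    train_mask = train_flat.reshape(observed.shape)

    # Everything observed but not in train goes to test.
    test_mask = observed & ~train_mask
    return train_mask, test_mask


observed = torch.ones(4, 5, dtype=torch.bool)
observed[0, 0] = False  # one missing cell, 19 observed
gen = torch.Generator().manual_seed(0)
train_mask, test_mask = random_mask_sketch(observed, train_frac=0.8, generator=gen)
```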

torch_measure.data.l_mask(observed, row_frac=0.8, col_frac=0.8)[source]

L-shaped masking: fully observe a subset of rows AND columns for training.

The test set consists of the intersection of held-out rows and held-out columns. This tests transductive generalization (new subjects on new items).

Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries.

  • row_frac (float) – Fraction of rows to fully observe in training.

  • col_frac (float) – Fraction of columns to fully observe in training.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]
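The L shape can be sketched as the union of the selected rows and selected columns, with the remaining corner block held out. This follows the description above; the rounding and the generator argument are assumptions, and the function name is hypothetical.

```python
import torch


def l_mask_sketch(observed, row_frac=0.8, col_frac=0.8, generator=None):
    """Train on an L-shaped region; test on held-out rows x held-out columns."""
    n_rows, n_cols = observed.shape
    n_train_rows = int(round(row_frac * n_rows))
    n_train_cols = int(round(col_frac * n_cols))

    # Randomly pick the rows and columns that are fully observed in training.
    train_rows = torch.zeros(n_rows, dtype=torch.bool)
    train_rows[torch.randperm(n_rows, generator=generator)[:n_train_rows]] = True
    train_cols = torch.zeros(n_cols, dtype=torch.bool)
    train_cols[torch.randperm(n_cols, generator=generator)[:n_train_cols]] = True

    # A cell is in the L if its row OR its column was selected.
    in_l = train_rows[:, None] | train_cols[None, :]

    train_mask = observed & in_l
    # The complement of the L is exactly held-out rows AND held-out columns.
    test_mask = observed & ~in_l
    return train_mask, test_mask


observed = torch.ones(10, 10, dtype=torch.bool)
gen = torch.Generator().manual_seed(0)
train_mask, test_mask = l_mask_sketch(observed, generator=gen)
```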

torch_measure.data.row_mask(observed, train_frac=0.8, exposure_rate=0.3)[source]

Row-based masking: fully observe some rows, partially observe the rest.

Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries.

  • train_frac (float) – Fraction of rows to fully observe.

  • exposure_rate (float) – Fraction of entries to observe in held-out rows.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]
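A sketch of row-based masking follows. The source does not say what the test mask contains, so this sketch assumes it is the unexposed observed entries of the held-out rows. It also assumes exposure is drawn independently per cell, so exposure_rate holds only in expectation. The function name is hypothetical.

```python
import torch


def row_mask_sketch(observed, train_frac=0.8, exposure_rate=0.3, generator=None):
    """Fully observe some rows; expose a random fraction of the remaining rows."""
    n_rows = observed.shape[0]
    n_full = int(round(train_frac * n_rows))

    # Rows that are fully available for training.
    full_rows = torch.zeros(n_rows, dtype=torch.bool)
    full_rows[torch.randperm(n_rows, generator=generator)[:n_full]] = True

    # Independent per-cell draw; only matters for the held-out rows.
    exposed = torch.rand(observed.shape, generator=generator) < exposure_rate

    train_mask = observed & (full_rows[:, None] | exposed)
    # Assumption: the rest of the held-out rows is the test set.
    test_mask = observed & ~train_mask
    return train_mask, test_mask


observed = torch.ones(10, 6, dtype=torch.bool)
gen = torch.Generator().manual_seed(0)
train_mask, test_mask = row_mask_sketch(observed, generator=gen)
```

Under the same assumptions, the column-based variant would amount to applying this procedure to observed.T and transposing both results.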

torch_measure.data.col_mask(observed, train_frac=0.8, exposure_rate=0.3)[source]

Column-based masking: fully observe some columns, partially observe the rest.

Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries.

  • train_frac (float) – Fraction of columns to fully observe.

  • exposure_rate (float) – Fraction of entries to observe in held-out columns.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]

torch_measure.data.model_mask(observed, train_frac=0.8, exposure_rate=0.3)[source]

Model-based masking (alias for row_mask).

Fully observe train_frac of models, partially observe the rest.

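Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries.

  • train_frac (float) – Fraction of models (rows) to fully observe.

  • exposure_rate (float) – Fraction of entries to observe in held-out models (rows).

Return type:

tuple[Tensor, Tensor]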

torch_measure.data.item_mask(observed, train_frac=0.8, exposure_rate=0.3)[source]

Item-based masking (alias for col_mask).

Fully observe train_frac of items, partially observe the rest.

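Parameters:
  • observed (torch.Tensor) – Boolean mask of observed entries.

  • train_frac (float) – Fraction of items (columns) to fully observe.

  • exposure_rate (float) – Fraction of entries to observe in held-out items (columns).

Return type:

tuple[Tensor, Tensor]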

torch_measure.data.binarize(data, threshold=0.5)[source]

Convert continuous response matrix to binary.

Parameters:
  • data (torch.Tensor) – Response matrix with values in [0, 1] (may contain NaN).

  • threshold (float) – Values >= threshold become 1, otherwise 0.

Returns:

Binary response matrix (NaN preserved).

Return type:

torch.Tensor
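The thresholding rule with NaN preservation can be sketched in two lines of PyTorch. This mirrors the documented behavior (values >= threshold become 1, otherwise 0, NaN kept) and is not the library's own code; the function name is hypothetical.

```python
import torch


def binarize_sketch(data, threshold=0.5):
    """Threshold to 0/1 while leaving NaN (missing) entries untouched."""
    # NaN >= threshold evaluates to False, so NaN cells are 0 at this point.
    binary = (data >= threshold).to(data.dtype)
    # Put the NaN entries back where the input was missing.
    return torch.where(torch.isnan(data), data, binary)


scores = torch.tensor([[0.2, 0.5, float("nan"), 0.9]])
binary_scores = binarize_sketch(scores)
```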

torch_measure.data.normalize_rows(data)[source]

Normalize each row to zero mean and unit variance (ignoring NaN).

Parameters:

data (torch.Tensor) – Response matrix (may contain NaN).

Returns:

Row-normalized matrix (NaN preserved).

Return type:

torch.Tensor
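A NaN-aware row standardization might look like the sketch below. It assumes the population variance (dividing by the number of observed entries) and a small eps to avoid division by zero. Neither choice is confirmed by the source, and the function name is hypothetical.

```python
import torch


def normalize_rows_sketch(data, eps=1e-8):
    """Standardize each row using only its observed (non-NaN) entries."""
    # Row means over observed entries.
    mean = torch.nanmean(data, dim=1, keepdim=True)
    centered = data - mean  # NaN entries stay NaN

    # Assumed: population variance over observed entries.
    var = torch.nanmean(centered ** 2, dim=1, keepdim=True)

    # eps guards against constant rows (zero variance).
    return centered / torch.sqrt(var + eps)


data = torch.tensor([
    [0.0, 1.0, float("nan")],
    [1.0, 2.0, 3.0],
])
normalized = normalize_rows_sketch(data)
```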
