Data Utilities

Response matrix and pairwise comparison data utilities.

class torch_measure.data.PairwiseComparisons(subject_a, subject_b, outcome, subject_ids, item_ids=None, item_contents=None, item_idx=None, subject_metadata=None, comparison_metadata=None)

Pairwise comparison data (e.g., Chatbot Arena).

Each observation records subject_a vs subject_b with an outcome.

Parameters:

subject_a (torch.LongTensor) – Indices into subject_ids for the first subject in each comparison. Shape: (n_comparisons,).
subject_b (torch.LongTensor) – Indices into subject_ids for the second subject in each comparison. Shape: (n_comparisons,).
outcome (torch.Tensor) – Comparison outcome. 1.0 = subject_a wins, 0.0 = subject_b wins, 0.5 = tie. Shape: (n_comparisons,).
subject_ids (list[str]) – Unique subject identifiers (e.g., model names).
item_ids (list[str] | None) – Unique item/prompt identifiers (e.g., question IDs).
item_contents (list[str] | None) – Text content for each item (one per entry in item_ids).
item_idx (torch.LongTensor | None) – Per-comparison index into item_ids, shape (n_comparisons,). Maps each comparison to the item/prompt it was evaluated on.
subject_metadata (list[dict] | None) – Structured metadata per subject (one dict per entry in subject_ids).
comparison_metadata (list[dict] | None) – Structured metadata per comparison (one dict per row).

property n_comparisons: int

Number of pairwise comparisons.

property n_subjects: int

Number of unique subjects.

property n_items: int

Number of unique items/prompts.

property shape: tuple[int, int]

(n_comparisons, n_subjects).

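property density: float

Ratio of recorded comparisons to the number of possible unordered subject pairs.

Computed as n_comparisons / (n_subjects * (n_subjects - 1) / 2). As written, the numerator counts comparisons rather than distinct pairs, so the value can exceed 1 when the same pair is compared more than once.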
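As a quick check of the arithmetic in the density formula, the following sketch plugs in made-up counts (4 subjects, 3 comparisons). The numbers are illustrative only and do not come from any dataset.

```python
# Worked instance of the density formula with made-up counts.
n_subjects = 4
n_comparisons = 3

# Number of unordered subject pairs: 4 * 3 / 2 = 6
n_pairs = n_subjects * (n_subjects - 1) / 2

density = n_comparisons / n_pairs
print(density)  # 0.5
```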

win_rates()

Per-subject overall win rate.

Returns: Win rate for each subject, shape (n_subjects,). Ties count as 0.5 wins and 0.5 losses.
Return type: torch.Tensor
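The tie-handling rule can be illustrated with a small standalone sketch. This is not the library's implementation; `win_rates_sketch` is a hypothetical helper that assumes each comparison counts as one game for both subjects and that a subject with no games gets NaN.

```python
import torch


def win_rates_sketch(subject_a, subject_b, outcome, n_subjects):
    """Hypothetical helper: per-subject win rate, ties worth half a win each."""
    wins = torch.zeros(n_subjects, dtype=outcome.dtype)
    games = torch.zeros(n_subjects, dtype=outcome.dtype)
    ones = torch.ones_like(outcome)

    # outcome is the score of subject_a; subject_b receives the complement.
    wins.index_add_(0, subject_a, outcome)
    wins.index_add_(0, subject_b, 1.0 - outcome)

    # Every comparison is one game for each participant.
    games.index_add_(0, subject_a, ones)
    games.index_add_(0, subject_b, ones)

    # Assumption: subjects without any game end up as NaN (0 / 0).
    return wins / games


subject_a = torch.tensor([0, 0, 1])
subject_b = torch.tensor([1, 2, 2])
outcome = torch.tensor([1.0, 0.5, 0.0])  # 0 beats 1, 0 ties 2, 2 beats 1

rates = win_rates_sketch(subject_a, subject_b, outcome, n_subjects=3)
print(rates)  # tensor([0.7500, 0.0000, 0.7500])
```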

to_win_matrix()

Aggregate into a pairwise win-rate matrix.

Returns: Square matrix of shape (n_subjects, n_subjects) where entry (i, j) is the win rate of subject i against subject j. Diagonal is NaN. Unobserved pairs are NaN.
Return type: torch.Tensor
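A minimal sketch of this aggregation, assuming repeated comparisons of the same pair are averaged and that no subject is compared with itself. `win_matrix_sketch` is a hypothetical name, not part of torch_measure.

```python
import torch


def win_matrix_sketch(subject_a, subject_b, outcome, n_subjects):
    """Hypothetical helper: entry (i, j) is the mean score of i against j."""
    wins = torch.zeros(n_subjects, n_subjects, dtype=outcome.dtype)
    counts = torch.zeros(n_subjects, n_subjects, dtype=outcome.dtype)
    ones = torch.ones_like(outcome)

    # accumulate=True sums the contributions of repeated (i, j) pairs.
    wins.index_put_((subject_a, subject_b), outcome, accumulate=True)
    wins.index_put_((subject_b, subject_a), 1.0 - outcome, accumulate=True)
    counts.index_put_((subject_a, subject_b), ones, accumulate=True)
    counts.index_put_((subject_b, subject_a), ones, accumulate=True)

    # Cells that were never filled (diagonal, unobserved pairs) are 0 / 0 = NaN.
    return wins / counts


subject_a = torch.tensor([0, 0, 1])
subject_b = torch.tensor([1, 1, 2])
outcome = torch.tensor([1.0, 0.0, 0.5])  # 0 and 1 split two games, 1 ties 2

win_matrix = win_matrix_sketch(subject_a, subject_b, outcome, n_subjects=3)
print(win_matrix)
```

In this example the (0, 1) and (1, 0) entries are both 0.5, and the never-observed pair (0, 2) stays NaN.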

to(device)

Move tensors to a device.

Parameters: device (device | str)
Return type: PairwiseComparisons

classmethod from_dataframe(df, subject_a_col='model_a', subject_b_col='model_b', outcome_col='outcome')

Create from a pandas DataFrame.

Parameters:

df (pandas.DataFrame) – DataFrame containing at least the columns named by subject_a_col, subject_b_col, and outcome_col.
subject_a_col (str) – Column name for the first subject.
subject_b_col (str) – Column name for the second subject.
outcome_col (str) – Column name for the outcome (1.0 = a wins, 0.0 = b wins, 0.5 = tie).

Return type:

PairwiseComparisons
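To show how name columns relate to the index tensors the constructor expects, here is a rough sketch of the encoding step using plain dictionaries instead of a DataFrame. The model names are hypothetical, and sorting the names to build subject_ids is an assumption; the library may order subjects differently.

```python
import torch

# Hypothetical rows, shaped like records with the default column names.
rows = [
    {"model_a": "model-x", "model_b": "model-y", "outcome": 1.0},
    {"model_a": "model-z", "model_b": "model-x", "outcome": 0.5},
]

# Assumption: unique subject names are collected from both columns and sorted.
subject_ids = sorted({r["model_a"] for r in rows} | {r["model_b"] for r in rows})
name_to_index = {name: i for i, name in enumerate(subject_ids)}

# Each comparison stores indices into subject_ids rather than the names.
subject_a = torch.tensor([name_to_index[r["model_a"]] for r in rows], dtype=torch.long)
subject_b = torch.tensor([name_to_index[r["model_b"]] for r in rows], dtype=torch.long)
outcome = torch.tensor([r["outcome"] for r in rows])

print(subject_ids)         # ['model-x', 'model-y', 'model-z']
print(subject_a.tolist())  # [0, 2]
print(subject_b.tolist())  # [1, 0]
```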

class torch_measure.data.ResponseMatrix(data, subject_ids=None, item_ids=None, item_contents=None, subject_metadata=None, info=None)

A binary or continuous response matrix (subjects x items).

Parameters:

data (torch.Tensor) – Response matrix of shape (n_subjects, n_items). Values can be:
- Binary (0/1) for correct/incorrect responses
- Continuous [0, 1] for probability responses
- NaN for missing data
subject_ids (list[str] | None) – Optional identifiers for subjects (rows).
item_ids (list[str] | None) – Optional identifiers for items (columns).
item_contents (list[str] | None) – Optional text content for each item (e.g., question text).
subject_metadata (list[dict[str, str | int | float | bool | None]] | None) – Optional structured metadata for each subject (one dict per row). For HELM datasets, each dict has keys: org, model, param_count, is_instruct.
info (dict | None) – Optional dataset-level metadata (interpretation notes, paper URL, data source URL, license, etc.). Usually loaded from data/<benchmark>/info.yaml. Common keys include: description, testing_condition, paper_url, data_source_url, subject_type, item_type, license, citation, tags.

property n_rows: int

Number of subjects (rows).

property n_cols: int

Number of items (columns).

property n_subjects: int

Number of subjects (rows).

property n_items: int

Number of items (columns).

property shape: tuple[int, int]

Shape of the response matrix.

property observed_mask: Tensor

Boolean mask of observed (non-NaN) entries.

property density: float

Fraction of observed (non-missing) entries.

property subject_means: Tensor

Mean response per subject (ignoring NaN).

property item_means: Tensor

Mean response per item (ignoring NaN), i.e., item easiness/facility.
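The NaN-aware quantities described by these properties can be reproduced on a raw tensor. The following is a sketch of the arithmetic only, written without torch_measure, so the variable names simply mirror the property names.

```python
import torch

# Two subjects, three items; one response is missing.
data = torch.tensor([
    [1.0, 0.0, float("nan")],
    [1.0, 1.0, 0.0],
])

observed_mask = ~torch.isnan(data)              # True where a response exists
density = observed_mask.float().mean().item()   # 5 observed out of 6 cells

# Replace NaN with 0 so sums ignore missing cells, then divide by observed counts.
filled = torch.where(observed_mask, data, torch.zeros_like(data))
subject_means = filled.sum(dim=1) / observed_mask.sum(dim=1)
item_means = filled.sum(dim=0) / observed_mask.sum(dim=0)

print(subject_means)  # tensor([0.5000, 0.6667])
print(item_means)     # tensor([1.0000, 0.5000, 0.0000])
```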

to(device)

Move response matrix to a device.

Parameters: device (device | str)
Return type: ResponseMatrix

binarize(threshold=0.5)

Convert continuous responses to binary using a threshold.

Parameters: threshold (float)
Return type: ResponseMatrix

classmethod from_numpy(array, **kwargs)

Create from a numpy array.

Return type: ResponseMatrix

classmethod from_dataframe(df)

Create from a pandas DataFrame.

Return type: ResponseMatrix

classmethod from_long(data)

Pivot a LongFormData into a wide ResponseMatrix.

When a (subject, item) cell has multiple trials or several non-null test_condition values, the responses are averaged across those dimensions into a single value. The legacy load() path did this automatically; for polytomous, per-trial, or multi-condition analysis, work with the LongFormData directly.

Parameters: data (LongFormData) – The long-form dataset returned by torch_measure.datasets.load().
Returns: Subject-by-item matrix with subjects rendered as their display_name (when the subjects registry is populated) and items keyed by item_id. item_contents carries the item content strings from the items registry.
Return type: ResponseMatrix
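The averaging step of the pivot can be sketched with plain tuples. The record layout below is hypothetical (the fields of LongFormData are not described here), and leaving unobserved cells as NaN is an assumption based on the ResponseMatrix convention for missing data.

```python
from collections import defaultdict

import torch

# Hypothetical long-form records: (subject, item, response).
records = [
    ("s1", "q1", 1.0),
    ("s1", "q1", 0.0),  # second trial for the same (subject, item) cell
    ("s1", "q2", 1.0),
    ("s2", "q1", 0.0),
]

subject_ids = sorted({s for s, _, _ in records})
item_ids = sorted({i for _, i, _ in records})

# Group all responses that fall into the same cell.
cells = defaultdict(list)
for subject, item, response in records:
    cells[(subject, item)].append(response)

# Start from all-NaN so cells without any record stay missing.
wide = torch.full((len(subject_ids), len(item_ids)), float("nan"))
for (subject, item), values in cells.items():
    wide[subject_ids.index(subject), item_ids.index(item)] = sum(values) / len(values)

print(wide)
```

Here the two trials of s1 on q1 collapse to 0.5, and the (s2, q2) cell, which has no record, remains NaN.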

torch_measure.data.random_mask(observed, train_frac=0.8)

Randomly split observed entries into train/test masks.

Parameters:

observed (torch.Tensor) – Boolean mask of observed entries (n_subjects x n_items).
train_frac (float) – Fraction of observed entries to assign to training.

Returns:

train_mask, test_mask – Boolean masks for training and testing.

Return type:

tuple[torch.Tensor, torch.Tensor]
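One way such an entry-level split could work is sketched below. `random_mask_sketch` is a hypothetical stand-in, not the library function; the `generator` argument and the rounding of the training count are assumptions added for reproducibility.

```python
import torch


def random_mask_sketch(observed, train_frac=0.8, generator=None):
    """Hypothetical helper: split observed cells into disjoint train/test masks."""
    flat_idx = observed.flatten().nonzero(as_tuple=True)[0]  # positions of observed cells
    shuffled = flat_idx[torch.randperm(flat_idx.numel(), generator=generator)]
    n_train = int(round(train_frac * flat_idx.numel()))

    train = torch.zeros(observed.numel(), dtype=torch.bool)
    test = torch.zeros(observed.numel(), dtype=torch.bool)
    train[shuffled[:n_train]] = True
    test[shuffled[n_train:]] = True
    return train.view_as(observed), test.view_as(observed)


observed = torch.ones(5, 4, dtype=torch.bool)
observed[0, 0] = False  # one missing entry, never assigned to either mask

gen = torch.Generator().manual_seed(0)
train_mask, test_mask = random_mask_sketch(observed, train_frac=0.8, generator=gen)
print(train_mask.sum().item(), test_mask.sum().item())  # 15 4
```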

torch_measure.data.l_mask(observed, row_frac=0.8, col_frac=0.8)

L-shaped masking: fully observe a subset of rows AND columns for training.

The test set consists of the intersection of held-out rows and held-out columns. This tests transductive generalization (new subjects on new items).

Parameters:

observed (torch.Tensor) – Boolean mask of observed entries.
row_frac (float) – Fraction of rows to fully observe in training.
col_frac (float) – Fraction of columns to fully observe in training.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]
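The L-shaped layout can be sketched as follows, assuming that a cell is used for training when its row or its column is selected and for testing otherwise. `l_mask_sketch` and its `generator` argument are hypothetical, not the library's signature.

```python
import torch


def l_mask_sketch(observed, row_frac=0.8, col_frac=0.8, generator=None):
    """Hypothetical helper: train on selected rows and columns, test on the rest."""
    n_rows, n_cols = observed.shape
    n_train_rows = int(round(row_frac * n_rows))
    n_train_cols = int(round(col_frac * n_cols))

    train_rows = torch.zeros(n_rows, dtype=torch.bool)
    train_cols = torch.zeros(n_cols, dtype=torch.bool)
    train_rows[torch.randperm(n_rows, generator=generator)[:n_train_rows]] = True
    train_cols[torch.randperm(n_cols, generator=generator)[:n_train_cols]] = True

    # A cell is in the "L" if its row or its column was selected for training.
    in_l = train_rows[:, None] | train_cols[None, :]
    train_mask = observed & in_l
    test_mask = observed & ~in_l  # held-out rows intersected with held-out columns
    return train_mask, test_mask


observed = torch.ones(10, 5, dtype=torch.bool)
gen = torch.Generator().manual_seed(0)
train_mask, test_mask = l_mask_sketch(observed, generator=gen)
print(train_mask.sum().item(), test_mask.sum().item())  # 48 2
```

With 10 rows and 5 columns at the default fractions, 2 rows and 1 column are held out, so the test block contains 2 cells.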

torch_measure.data.row_mask(observed, train_frac=0.8, exposure_rate=0.3)

Row-based masking: fully observe some rows, partially observe the rest.

Parameters:

observed (torch.Tensor) – Boolean mask of observed entries.
train_frac (float) – Fraction of rows to fully observe.
exposure_rate (float) – Fraction of entries to observe in held-out rows.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]
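A rough sketch of row-based masking under two assumptions that the description does not confirm: exposed entries in held-out rows go to the training mask, and exposure is drawn independently per cell, so exposure_rate is only matched on average. `row_mask_sketch` and its `generator` argument are hypothetical.

```python
import torch


def row_mask_sketch(observed, train_frac=0.8, exposure_rate=0.3, generator=None):
    """Hypothetical helper: full rows for training, partial exposure elsewhere."""
    n_rows = observed.shape[0]
    n_full = int(round(train_frac * n_rows))

    full_rows = torch.zeros(n_rows, dtype=torch.bool)
    full_rows[torch.randperm(n_rows, generator=generator)[:n_full]] = True

    # Assumption: each cell is exposed independently with probability exposure_rate.
    exposed = torch.rand(observed.shape, generator=generator) < exposure_rate

    train_mask = observed & (full_rows[:, None] | exposed)
    test_mask = observed & ~train_mask  # unexposed cells of held-out rows
    return train_mask, test_mask


observed = torch.ones(10, 6, dtype=torch.bool)
gen = torch.Generator().manual_seed(0)
train_mask, test_mask = row_mask_sketch(observed, generator=gen)

# At least the 8 selected rows are fully available for training.
print(train_mask.all(dim=1).sum().item() >= 8)  # True
```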

torch_measure.data.col_mask(observed, train_frac=0.8, exposure_rate=0.3)

Column-based masking: fully observe some columns, partially observe the rest.

Parameters:

observed (torch.Tensor) – Boolean mask of observed entries.
train_frac (float) – Fraction of columns to fully observe.
exposure_rate (float) – Fraction of entries to observe in held-out columns.

Returns:

train_mask, test_mask

Return type:

tuple[torch.Tensor, torch.Tensor]

torch_measure.data.model_mask(observed, train_frac=0.8, exposure_rate=0.3)

Model-based masking (alias for row_mask).

Fully observe train_frac of models, partially observe the rest.

Parameters:

observed (Tensor)
train_frac (float)
exposure_rate (float)

Return type:

tuple[Tensor, Tensor]

torch_measure.data.item_mask(observed, train_frac=0.8, exposure_rate=0.3)

Item-based masking (alias for col_mask).

Fully observe train_frac of items, partially observe the rest.

Parameters:

observed (Tensor)
train_frac (float)
exposure_rate (float)

Return type:

tuple[Tensor, Tensor]

torch_measure.data.binarize(data, threshold=0.5)

Convert continuous response matrix to binary.

Parameters:

data (torch.Tensor) – Response matrix with values in [0, 1] (may contain NaN).
threshold (float) – Values >= threshold become 1, otherwise 0.

Returns:

Binary response matrix (NaN preserved).

Return type:

torch.Tensor
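The thresholding rule with NaN preserved can be written in a few lines. This is a sketch of the documented behavior under the hypothetical name `binarize_sketch`, not the library source.

```python
import torch


def binarize_sketch(data, threshold=0.5):
    """Hypothetical helper: values >= threshold become 1, others 0, NaN stays NaN."""
    binary = (data >= threshold).to(data.dtype)
    # NaN compares False against any threshold, so restore it explicitly.
    return torch.where(torch.isnan(data), data, binary)


x = torch.tensor([0.2, 0.5, 0.9, float("nan")])
out = binarize_sketch(x)
print(out)  # tensor([0., 1., 1., nan])
```

Note that a value exactly equal to the threshold maps to 1, matching the ">=" in the parameter description.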

torch_measure.data.normalize_rows(data)

Normalize each row to zero mean and unit variance (ignoring NaN).

Parameters: data (torch.Tensor) – Response matrix (may contain NaN).
Returns: Row-normalized matrix (NaN preserved).
Return type: torch.Tensor
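A minimal sketch of NaN-aware row standardization follows. The use of the population variance (dividing by the number of observed entries) and the small `eps` guard are assumptions; the description does not say which convention the library follows. `normalize_rows_sketch` is a hypothetical name.

```python
import torch


def normalize_rows_sketch(data, eps=1e-8):
    """Hypothetical helper: standardize each row over its observed entries."""
    mask = ~torch.isnan(data)
    n = mask.sum(dim=1, keepdim=True).clamp(min=1)  # observed count per row

    zeros = torch.zeros_like(data)
    mean = torch.where(mask, data, zeros).sum(dim=1, keepdim=True) / n
    centered = torch.where(mask, data - mean, zeros)

    # Assumption: population variance (divide by n), with eps against division by zero.
    var = (centered ** 2).sum(dim=1, keepdim=True) / n
    normalized = centered / torch.sqrt(var + eps)

    # Put NaN back where the input was missing.
    return torch.where(mask, normalized, data)


x = torch.tensor([
    [0.0, 1.0, float("nan")],
    [1.0, 2.0, 3.0],
])
out = normalize_rows_sketch(x)
print(out)
```

In the first row the observed values 0 and 1 map to approximately -1 and 1, while the missing third entry stays NaN.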
