Data Utilities
Response matrix and pairwise comparison data utilities.
- class torch_measure.data.PairwiseComparisons(subject_a, subject_b, outcome, subject_ids, item_ids=None, item_contents=None, item_idx=None, subject_metadata=None, comparison_metadata=None)
Pairwise comparison data (e.g., Chatbot Arena).
Each observation records subject_a vs subject_b with an outcome.
- Parameters:
subject_a (torch.LongTensor) – Indices into subject_ids for the first subject in each comparison. Shape: (n_comparisons,).
subject_b (torch.LongTensor) – Indices into subject_ids for the second subject in each comparison. Shape: (n_comparisons,).
outcome (torch.Tensor) – Comparison outcome: 1.0 = subject_a wins, 0.0 = subject_b wins, 0.5 = tie. Shape: (n_comparisons,).
subject_ids (list[str]) – Unique subject identifiers (e.g., model names).
item_ids (list[str] | None) – Unique item/prompt identifiers (e.g., question IDs).
item_contents (list[str] | None) – Text content for each item (one per entry in item_ids).
item_idx (torch.LongTensor | None) – Per-comparison index into item_ids, shape (n_comparisons,). Maps each comparison to the item/prompt it was evaluated on.
subject_metadata (list[dict] | None) – Structured metadata per subject (one dict per entry in subject_ids).
comparison_metadata (list[dict] | None) – Structured metadata per comparison (one dict per row).
- property density: float
Fraction of all possible unordered pairs that are observed.
Computed as n_comparisons / (n_subjects * (n_subjects - 1) / 2).
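The formula is easy to check by hand. The helper below is a hypothetical plain-Python sketch, not part of torch_measure; it only restates the arithmetic in the description and assumes n_comparisons counts every comparison row.

```python
def pair_density(n_comparisons: int, n_subjects: int) -> float:
    """Comparisons divided by the number of unordered subject pairs."""
    n_pairs = n_subjects * (n_subjects - 1) / 2
    return n_comparisons / n_pairs


# 4 subjects give 4 * 3 / 2 = 6 unordered pairs, so 3 comparisons cover half.
density = pair_density(n_comparisons=3, n_subjects=4)
print(density)  # 0.5
```

Under that assumption, repeated matchups between the same two subjects would push the value above 1.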
- win_rates()
Per-subject overall win rate.
- Returns:
Win rate for each subject, shape (n_subjects,). Ties count as 0.5 wins and 0.5 losses.
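To make the tie handling concrete, here is a plain-Python sketch of one way such a win rate could be computed from the three per-comparison sequences. The function is hypothetical and is not the library's implementation; returning NaN for a subject with no comparisons is an assumption.

```python
import math


def win_rates(subject_a, subject_b, outcome, n_subjects):
    """Average score per subject; outcome is 1.0, 0.0, or 0.5 from subject_a's side."""
    score = [0.0] * n_subjects  # accumulated wins (a tie adds 0.5 to each side)
    games = [0] * n_subjects    # number of comparisons each subject appears in
    for a, b, y in zip(subject_a, subject_b, outcome):
        score[a] += y
        score[b] += 1.0 - y
        games[a] += 1
        games[b] += 1
    # Assumption: subjects that never appear get NaN instead of raising.
    return [s / g if g else math.nan for s, g in zip(score, games)]


subject_ids = ["model-x", "model-y", "model-z"]  # illustrative names
subject_a = [0, 0, 1]
subject_b = [1, 2, 2]
outcome = [1.0, 0.5, 0.0]  # x beats y, x ties z, z beats y

rates = win_rates(subject_a, subject_b, outcome, len(subject_ids))
print(dict(zip(subject_ids, rates)))
```

Here model-x and model-z each collect 1.5 points from two comparisons, and model-y collects none.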
- to_win_matrix()
Aggregate into a pairwise win-rate matrix.
- Returns:
Square matrix of shape (n_subjects, n_subjects) where entry (i, j) is the win rate of subject i against subject j. Diagonal is NaN. Unobserved pairs are NaN.
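The aggregation described above can be sketched in plain Python as follows. This is an illustrative stand-in with a hypothetical function name, not the library's code; it assumes repeated matchups are averaged and that ties add 0.5 to both directions.

```python
import math


def win_matrix(subject_a, subject_b, outcome, n_subjects):
    """Entry (i, j) is the mean score of subject i against subject j."""
    wins = [[0.0] * n_subjects for _ in range(n_subjects)]
    games = [[0] * n_subjects for _ in range(n_subjects)]
    for a, b, y in zip(subject_a, subject_b, outcome):
        wins[a][b] += y
        wins[b][a] += 1.0 - y
        games[a][b] += 1
        games[b][a] += 1
    # Pairs with no comparisons (including the diagonal) become NaN.
    return [
        [wins[i][j] / games[i][j] if games[i][j] else math.nan
         for j in range(n_subjects)]
        for i in range(n_subjects)
    ]


# Subjects 0 and 1 meet four times; subject 2 is never compared.
matrix = win_matrix(
    subject_a=[0, 0, 1, 0],
    subject_b=[1, 1, 0, 1],
    outcome=[1.0, 1.0, 0.5, 0.0],
    n_subjects=3,
)
print(matrix[0][1], matrix[1][0])  # 0.625 0.375
```

For every observed pair, entries (i, j) and (j, i) sum to 1 under this convention.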
- class torch_measure.data.ResponseMatrix(data, subject_ids=None, item_ids=None, item_contents=None, subject_metadata=None, info=None)
A binary or continuous response matrix (subjects x items).
- Parameters:
data (torch.Tensor) – Response matrix of shape (n_subjects, n_items). Values can be:
  - Binary (0/1) for correct/incorrect responses
  - Continuous in [0, 1] for probability responses
  - NaN for missing data
subject_ids (list[str] | None) – Optional identifiers for subjects (rows).
item_ids (list[str] | None) – Optional identifiers for items (columns).
item_contents (list[str] | None) – Optional text content for each item (e.g., question text).
subject_metadata (list[dict[str, str | int | float | bool | None]] | None) – Optional structured metadata for each subject (one dict per row). For HELM datasets, each dict has keys: org, model, param_count, is_instruct.
info (dict | None) – Optional dataset-level metadata (interpretation notes, paper URL, data source URL, license, etc.). Usually loaded from data/<benchmark>/info.yaml. Common keys include: description, testing_condition, paper_url, data_source_url, subject_type, item_type, license, citation, tags.
- binarize(threshold=0.5)
Convert continuous responses to binary using a threshold.
- Parameters:
threshold (float)
- classmethod from_long(data)
Pivot a LongFormData into a wide ResponseMatrix.
When multiple trials or non-null test_condition values exist per (subject, item) cell, the response is averaged across those dimensions. The legacy load() path used to do this automatically; consumers who want polytomous / per-trial / multi-condition analysis should work with the LongFormData directly.
- Parameters:
data (LongFormData) – The long-form dataset returned by torch_measure.datasets.load().
- Returns:
Subject-by-item matrix with subjects rendered as their display_name (when the subjects registry is populated) and items keyed by item_id. item_contents carries the item content strings from the items registry.
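The averaging step of the pivot can be illustrated without the library. The sketch below uses plain (subject, item, response) tuples as a stand-in for long-form rows; the record layout, the function name, and the sorted ordering are all assumptions, and the real LongFormData schema may differ.

```python
import math
from collections import defaultdict


def pivot_mean(records):
    """Pivot (subject_id, item_id, response) rows into a wide matrix of cell means."""
    cells = defaultdict(list)
    for subject, item, response in records:
        cells[(subject, item)].append(response)
    subjects = sorted({s for s, _ in cells})
    items = sorted({i for _, i in cells})
    matrix = [
        [sum(cells[(s, i)]) / len(cells[(s, i)]) if (s, i) in cells else math.nan
         for i in items]
        for s in subjects
    ]
    return subjects, items, matrix


records = [
    ("m1", "q1", 1.0),  # two trials of m1 on q1 are averaged to 0.5
    ("m1", "q1", 0.0),
    ("m1", "q2", 1.0),
    ("m2", "q1", 0.0),  # m2 never answered q2, so that cell is NaN
]
subjects, items, matrix = pivot_mean(records)
print(subjects, items, matrix)
```

A cell averaged over mixed trials is no longer binary, which is one reason per-trial analysis is better done on the long-form data.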
- torch_measure.data.random_mask(observed, train_frac=0.8)
Randomly split observed entries into train/test masks.
- Parameters:
observed (torch.Tensor) – Boolean mask of observed entries (n_subjects x n_items).
train_frac (float) – Fraction of observed entries to assign to training.
- Returns:
train_mask, test_mask – Boolean masks for training and testing.
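As an illustration of an entry-level split, the following plain-Python sketch partitions the observed cells of a nested-list mask. It is a hypothetical reimplementation for explanation only; the seed argument and the rounding of the train count are assumptions, and the library operates on torch tensors.

```python
import random


def random_split(observed, train_frac=0.8, seed=0):
    """Assign each observed cell to exactly one of train or test."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(observed)
             for j, seen in enumerate(row) if seen]
    rng.shuffle(cells)
    n_train = round(train_frac * len(cells))
    train_cells = set(cells[:n_train])
    train = [[(i, j) in train_cells for j in range(len(row))]
             for i, row in enumerate(observed)]
    test = [[bool(seen) and not train[i][j] for j, seen in enumerate(row)]
            for i, row in enumerate(observed)]
    return train, test


observed = [
    [False, True, True, True],
    [True, True, False, True],
    [True, True, True, True],
]  # 10 observed cells, 2 unobserved
train, test = random_split(observed, train_frac=0.8, seed=0)
print(sum(map(sum, train)), sum(map(sum, test)))  # 8 2
```

Unobserved cells stay False in both masks, so the two masks together reproduce the observed mask.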
- torch_measure.data.l_mask(observed, row_frac=0.8, col_frac=0.8)
L-shaped masking: fully observe a subset of rows AND columns for training.
The test set consists of the intersection of held-out rows and held-out columns. This tests transductive generalization (new subjects on new items).
- Parameters:
observed (torch.Tensor) – Boolean mask of observed entries.
row_frac (float) – Fraction of rows to fully observe in training.
col_frac (float) – Fraction of columns to fully observe in training.
- Returns:
train_mask, test_mask
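The L shape is easier to see in code. The sketch below is a hypothetical plain-Python reading of the description: training keeps every observed cell in a selected row or a selected column, and testing keeps the observed cells in neither. The seed argument and the rounding of the counts are assumptions.

```python
import random


def l_split(observed, row_frac=0.8, col_frac=0.8, seed=0):
    """Train on selected rows and columns; test on the held-out corner block."""
    rng = random.Random(seed)
    n_rows, n_cols = len(observed), len(observed[0])
    train_rows = set(rng.sample(range(n_rows), round(row_frac * n_rows)))
    train_cols = set(rng.sample(range(n_cols), round(col_frac * n_cols)))
    train = [[bool(observed[i][j]) and (i in train_rows or j in train_cols)
              for j in range(n_cols)] for i in range(n_rows)]
    test = [[bool(observed[i][j]) and i not in train_rows and j not in train_cols
             for j in range(n_cols)] for i in range(n_rows)]
    return train, test


observed = [[True] * 5 for _ in range(5)]  # fully observed 5 x 5 matrix
train, test = l_split(observed, row_frac=0.8, col_frac=0.8, seed=0)
print(sum(map(sum, train)), sum(map(sum, test)))  # 24 1
```

With 4 of 5 rows and 4 of 5 columns selected, the test block is the single cell where the held-out row meets the held-out column.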
- torch_measure.data.row_mask(observed, train_frac=0.8, exposure_rate=0.3)
Row-based masking: fully observe some rows, partially observe the rest.
- Parameters:
observed (torch.Tensor) – Boolean mask of observed entries.
train_frac (float) – Fraction of rows to fully observe.
exposure_rate (float) – Fraction of entries to observe in held-out rows.
- Returns:
train_mask, test_mask
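One possible reading of this scheme, sketched in plain Python: a fraction of rows goes entirely to training, and in each remaining row every observed cell is exposed to training with probability exposure_rate, with the rest held out for testing. The function name, the seed argument, and the per-cell Bernoulli exposure are assumptions rather than the library's documented behavior.

```python
import random


def row_split(observed, train_frac=0.8, exposure_rate=0.3, seed=0):
    """Fully observed training rows plus partially exposed held-out rows."""
    rng = random.Random(seed)
    n_rows = len(observed)
    full_rows = set(rng.sample(range(n_rows), round(train_frac * n_rows)))
    train, test = [], []
    for i, row in enumerate(observed):
        train_row, test_row = [], []
        for seen in row:
            # Cells in fully observed rows always train; others train by chance.
            exposed = i in full_rows or rng.random() < exposure_rate
            train_row.append(bool(seen) and exposed)
            test_row.append(bool(seen) and not exposed)
        train.append(train_row)
        test.append(test_row)
    return train, test


observed = [[True] * 10 for _ in range(5)]
train, test = row_split(observed, train_frac=0.8, exposure_rate=0.3, seed=0)
print(sum(map(sum, train)), sum(map(sum, test)))
```

Under the same assumptions, the column-based variant would be this procedure applied to the transposed mask.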
- torch_measure.data.col_mask(observed, train_frac=0.8, exposure_rate=0.3)
Column-based masking: fully observe some columns, partially observe the rest.
- Parameters:
observed (torch.Tensor) – Boolean mask of observed entries.
train_frac (float) – Fraction of columns to fully observe.
exposure_rate (float) – Fraction of entries to observe in held-out columns.
- Returns:
train_mask, test_mask
- torch_measure.data.model_mask(observed, train_frac=0.8, exposure_rate=0.3)
Model-based masking (alias for row_mask).
Fully observe train_frac of models, partially observe the rest.
- torch_measure.data.item_mask(observed, train_frac=0.8, exposure_rate=0.3)
Item-based masking (alias for col_mask).
Fully observe train_frac of items, partially observe the rest.
- torch_measure.data.binarize(data, threshold=0.5)
Convert continuous response matrix to binary.
- Parameters:
data (torch.Tensor) – Response matrix with values in [0, 1] (may contain NaN).
threshold (float) – Values >= threshold become 1, otherwise 0.
- Returns:
Binary response matrix (NaN preserved).
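The thresholding rule, including the NaN pass-through, can be sketched on nested lists. This is a plain-Python illustration with a hypothetical helper, not the tensor implementation.

```python
import math


def binarize(rows, threshold=0.5):
    """Map values >= threshold to 1.0 and others to 0.0, leaving NaN untouched."""
    return [
        [x if math.isnan(x) else float(x >= threshold) for x in row]
        for row in rows
    ]


binary = binarize([[0.2, 0.5, math.nan, 0.9]])
print(binary)  # [[0.0, 1.0, nan, 1.0]]
```

The explicit NaN check matters because any comparison against NaN is false, which would otherwise turn missing entries into 0.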
- torch_measure.data.normalize_rows(data)
Normalize each row to zero mean and unit variance (ignoring NaN).
- Parameters:
data (torch.Tensor) – Response matrix (may contain NaN).
- Returns:
Row-normalized matrix (NaN preserved).
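A NaN-aware row standardization might look like the following plain-Python sketch. It is hypothetical: the use of population variance (dividing by the number of observed values) and the handling of constant or empty rows are assumptions, and the library's choices may differ.

```python
import math


def normalize_rows(rows):
    """Standardize each row using only its non-NaN entries."""
    out = []
    for row in rows:
        vals = [x for x in row if not math.isnan(x)]
        if not vals:
            out.append(list(row))  # nothing observed: the row stays all NaN
            continue
        mean = sum(vals) / len(vals)
        # Assumption: population standard deviation over observed entries.
        std = math.sqrt(sum((x - mean) ** 2 for x in vals) / len(vals))
        scale = std if std > 0 else 1.0  # constant rows map to zeros
        out.append([x if math.isnan(x) else (x - mean) / scale for x in row])
    return out


normalized = normalize_rows([[1.0, 0.0, math.nan, 1.0, 0.0]])
print(normalized)  # [[1.0, -1.0, nan, 1.0, -1.0]]
```

In the example the observed values have mean 0.5 and standard deviation 0.5, so each maps to plus or minus 1.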