Psychometric Metrics
Psychometric metrics for measurement analysis.
- torch_measure.metrics.tetrachoric_correlation(data, min_pairs=5)
Compute the tetrachoric correlation matrix for binary data.
- Uses the cosine-pi approximation:
r = cos(pi / (1 + sqrt(AD / BC)))
where A, B, C, D are the counts in the 2x2 contingency table for each pair of items.
- Parameters:
data (torch.Tensor) – Binary response matrix (n_subjects, n_items) with values 0, 1, or NaN.
min_pairs (int) – Minimum number of valid pairs required. Pairs with fewer observations get correlation 0.
- Returns:
Tetrachoric correlation matrix of shape (n_items, n_items).
- Return type:
torch.Tensor
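As a rough illustration of the cosine-pi formula for a single item pair, the plain-Python sketch below works directly from the four cell counts. The helper name, the cell labelling (A and D as the concordant cells, B and C as the discordant ones) and the handling of empty cells are assumptions made for the example, not documented library behaviour.

```python
import math


def cosine_pi_tetrachoric(a, b, c, d):
    """Cosine-pi approximation for one 2x2 table (hypothetical helper).

    Assumed labelling: a = both items 1, d = both items 0,
    b and c = the two discordant cells.
    """
    if b * c == 0:
        # AD / BC is unbounded here; the formula tends to cos(0) = 1 when AD > 0.
        # Returning 0.0 for a fully degenerate table is an assumption of this sketch.
        return 1.0 if a * d > 0 else 0.0
    ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(ratio)))


r_independent = cosine_pi_tetrachoric(10, 10, 10, 10)  # AD == BC
r_positive = cosine_pi_tetrachoric(40, 10, 10, 40)     # mostly concordant
r_negative = cosine_pi_tetrachoric(10, 40, 40, 10)     # mostly discordant
```

When AD equals BC the argument of the cosine is pi / 2, so the approximation returns (numerically) zero; swapping the concordant and discordant counts flips the sign.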
- torch_measure.metrics.point_biserial_correlation(continuous, binary)
Compute point-biserial correlation between continuous and binary variables.
- Parameters:
continuous (torch.Tensor) – Continuous variable (e.g., total score) of shape (N,).
binary (torch.Tensor) – Binary variable (e.g., item response) of shape (N,) or (N, M).
- Returns:
Correlation(s). Scalar if binary is 1D, shape (M,) if 2D.
- Return type:
- torch_measure.metrics.infit_statistics(predicted, observed, mask=None)
Compute Rasch infit (information-weighted) mean square statistics per item.
Infit is sensitive to unexpected responses near item difficulty. Values near 1.0 indicate good fit. Values > 1.3 indicate underfit (noise), values < 0.7 indicate overfit (Guttman pattern).
- Parameters:
predicted (torch.Tensor) – Predicted probabilities (n_subjects, n_items).
observed (torch.Tensor) – Observed binary responses (n_subjects, n_items).
mask (torch.Tensor | None) – Boolean mask of entries to include.
- Returns:
Infit statistics per item, shape (n_items,).
- Return type:
torch.Tensor
- torch_measure.metrics.outfit_statistics(predicted, observed, mask=None)
Compute Rasch outfit (unweighted) mean square statistics per item.
Outfit is sensitive to unexpected responses far from item difficulty.
- Parameters:
predicted (torch.Tensor) – Predicted probabilities (n_subjects, n_items).
observed (torch.Tensor) – Observed binary responses (n_subjects, n_items).
mask (torch.Tensor | None) – Boolean mask of entries to include.
- Returns:
Outfit statistics per item, shape (n_items,).
- Return type:
torch.Tensor
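The contrast between infit and outfit is easiest to see for a single item. The sketch below uses the conventional Rasch definitions (squared residuals, with the binomial variance p(1 - p) as weight or divisor); how the library treats masking or extreme probabilities is not stated here, so the helper is illustrative only.

```python
def infit_outfit(predicted, observed):
    """Return (infit, outfit) mean squares for one item (hypothetical helper)."""
    squared_residuals = [(x - p) ** 2 for p, x in zip(predicted, observed)]
    variances = [p * (1.0 - p) for p in predicted]
    # Infit: information-weighted, i.e. total squared residual over total variance.
    infit = sum(squared_residuals) / sum(variances)
    # Outfit: plain mean of the squared standardised residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    return infit, outfit


well_fitting = infit_outfit([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
surprising = infit_outfit([0.9, 0.5], [0, 1])  # a miss on an easy response
```

In the second call the unexpected failure at p = 0.9 inflates outfit more than infit, which is the "far from item difficulty" sensitivity described above.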
- torch_measure.metrics.item_total_correlation(data, mask=None)
Compute corrected item-total correlation for each item.
For each item, computes the Pearson correlation between the item responses and the total score excluding that item.
- Parameters:
data (torch.Tensor) – Binary response matrix (n_subjects, n_items).
mask (torch.Tensor | None) – Boolean mask.
- Returns:
Item-total correlations, shape (n_items,).
- Return type:
torch.Tensor
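The "corrected" part of the statistic, removing the item from its own total, can be sketched in a few lines of plain Python. The helper names and the list-of-rows input are assumptions for the example; complete data without a mask is assumed.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(data):
    """Correlate each item with the total score of the remaining items."""
    n_items = len(data[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in data]
        rest = [sum(row) - row[j] for row in data]  # total excluding item j
        result.append(pearson(item, rest))
    return result


# Four subjects, three items ordered from easiest to hardest.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
correlations = corrected_item_total(responses)
```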
- torch_measure.metrics.cronbach_alpha(data, mask=None)
Compute Cronbach’s alpha reliability coefficient.
- Parameters:
data (torch.Tensor) – Response matrix (n_subjects, n_items).
mask (torch.Tensor | None) – Boolean mask.
- Returns:
Cronbach’s alpha.
- Return type:
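For orientation, the textbook definition of alpha is k / (k - 1) * (1 - sum of item variances / variance of the total score). The sketch below applies that definition to complete data with the standard library; the helper name is hypothetical and the library's treatment of masked entries is not covered.

```python
import statistics


def cronbach_alpha_sketch(data):
    """data: list of subject rows, each a list of item scores (complete data)."""
    k = len(data[0])
    item_variances = [statistics.variance([row[j] for row in data]) for j in range(k)]
    total_variance = statistics.variance([sum(row) for row in data])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


alpha_parallel = cronbach_alpha_sketch([[1, 1], [2, 2], [3, 3]])           # identical items
alpha_unrelated = cronbach_alpha_sketch([[1, 0], [0, 1], [1, 1], [0, 0]])  # uncorrelated items
```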
- torch_measure.metrics.variance_components(response_matrix, subject_col='subject_id', item_col='item_id', trial_col='trial', response_col='response', method='moments')
Decompose Var(response) into subject, item, subject x item, and residual facets.
Henderson Method I (moments-based ANOVA estimator) on a person x item x replication crossed design. Negative variance estimates are clamped to 0. With one observation per cell, residual is unidentifiable.
- Parameters:
response_matrix (pandas.DataFrame) – Long-form responses with columns subject_col, item_col, trial_col, response_col.
subject_col (str) – Name of the subject column; defaults match the measurement-db long-form schema.
item_col (str) – Name of the item column; defaults match the measurement-db long-form schema.
trial_col (str) – Name of the trial (replication) column; defaults match the measurement-db long-form schema.
response_col (str) – Name of the response column; defaults match the measurement-db long-form schema.
method ({"moments"}) – Only "moments" is implemented in v1.
- Returns:
Keys: subject, item, subject_item, residual (variances, floats), n_subjects, n_items (ints), n_reps_harmonic (float; harmonic mean of cell counts), identifiable (dict[str, bool]), method (str).
- Return type:
dict
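For a fully balanced design the moments estimator reduces to the classical ANOVA solution based on expected mean squares. The sketch below shows that special case on nested lists; the input layout `x[p][i]` (a list of replicate scores per cell) and the helper name are assumptions for the example, and unbalanced data, which the library summarises through a harmonic mean of cell counts, is not handled.

```python
def anova_variance_components(x):
    """x[p][i] is a list of replicate scores; a balanced design is assumed."""
    n_p, n_i, n_r = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[p][i]) / n_r for i in range(n_i)] for p in range(n_p)]
    subj = [sum(cell[p]) / n_i for p in range(n_p)]
    item = [sum(cell[p][i] for p in range(n_p)) / n_p for i in range(n_i)]
    grand = sum(subj) / n_p

    # Mean squares for subjects, items, the interaction and the residual.
    ms_p = n_i * n_r * sum((m - grand) ** 2 for m in subj) / (n_p - 1)
    ms_i = n_p * n_r * sum((m - grand) ** 2 for m in item) / (n_i - 1)
    ms_pi = n_r * sum(
        (cell[p][i] - subj[p] - item[i] + grand) ** 2
        for p in range(n_p) for i in range(n_i)
    ) / ((n_p - 1) * (n_i - 1))
    if n_r > 1:
        ms_res = sum(
            (v - cell[p][i]) ** 2
            for p in range(n_p) for i in range(n_i) for v in x[p][i]
        ) / (n_p * n_i * (n_r - 1))
    else:
        # One observation per cell: residual is confounded with subject x item.
        ms_res = 0.0

    # Solve the expected-mean-square equations; clamp negatives to 0.
    return {
        "subject": max((ms_p - ms_pi) / (n_i * n_r), 0.0),
        "item": max((ms_i - ms_pi) / (n_p * n_r), 0.0),
        "subject_item": max((ms_pi - ms_res) / n_r, 0.0),
        "residual": max(ms_res, 0.0),
    }


# Purely additive toy data: subject effects 0, 2, 4 and item effects 0, 1, two replicates.
toy = [[[s + t] * 2 for t in (0.0, 1.0)] for s in (0.0, 2.0, 4.0)]
components = anova_variance_components(toy)
```

Because the toy data contain no interaction or noise, the subject and item components recover the sample variances of the effects (4 and 0.5) and the other two components are zero.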
- torch_measure.metrics.g_coefficient(variance_components, n_items, n_reps=1, type='absolute')
Brennan (2001) G-coefficient under a person x item x replication design.
Relative G uses ranking-only error (subject x item + residual); absolute G (Phi) also includes the item main effect.
- Parameters:
variance_components (dict) – Output of variance_components(), or any dict with keys subject, item, subject_item, residual.
n_items (int) – Number of items in the projected design (>= 1).
n_reps (int) – Replications per cell in the projected design (>= 1).
type ({"relative", "absolute"}) – Which G-coefficient to compute.
- Returns:
G-coefficient in [0, 1]. 0.0 if the denominator is numerically zero.
- Return type:
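A minimal sketch of the two coefficients, assuming the usual projection in which the subject x item component is divided by n_items and the residual by n_items * n_reps (the exact divisors are an assumption; the text above only names which components enter each error term). The function name and the `kind` argument are hypothetical.

```python
def g_coefficient_sketch(vc, n_items, n_reps=1, kind="absolute"):
    """Ratio of subject variance to subject variance plus projected error."""
    relative_error = vc["subject_item"] / n_items + vc["residual"] / (n_items * n_reps)
    error = relative_error
    if kind == "absolute":
        error += vc["item"] / n_items  # item main effect counts for absolute decisions
    denominator = vc["subject"] + error
    return vc["subject"] / denominator if denominator > 0 else 0.0


vc = {"subject": 4.0, "item": 2.0, "subject_item": 1.0, "residual": 1.0}
g_rel = g_coefficient_sketch(vc, n_items=2, kind="relative")
g_abs = g_coefficient_sketch(vc, n_items=2, kind="absolute")
g_rel_bigger_design = g_coefficient_sketch(vc, n_items=4, n_reps=2, kind="relative")
```

Absolute G can never exceed relative G for the same design, since its error term contains everything in the relative error plus the item component.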
- torch_measure.metrics.intraclass_correlation(variance_components, form='ICC3k', n_items=None)
Intraclass correlation coefficient from two-way variance components.
Subjects are targets, items are raters. ICC2/ICC3 are single-rater (absolute agreement / consistency); ICC2k/ICC3k average over k raters and equal the absolute / relative g_coefficient() at n_reps=1. One-way forms (ICC1) need a one-way model and are not supported here.
- Parameters:
variance_components (dict) – Output of variance_components(), or any dict with keys subject, item, subject_item, residual.
form ({"ICC2", "ICC3", "ICC2k", "ICC3k"}) – Which coefficient to compute. The k forms average over raters.
n_items (int | None) – Number of raters k for the k forms; defaults to variance_components["n_items"].
- Returns:
ICC in [0, 1]. 0.0 if the denominator is numerically zero.
- Return type:
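The four supported forms differ only in whether the item (rater) component enters the error and whether the error is divided by k. A hedged sketch, assuming the error term pools subject_item and residual as in the two-way model described above (the helper name is hypothetical):

```python
def icc_sketch(vc, form="ICC3k", n_items=None):
    """ICC from two-way variance components; subjects are targets, items are raters."""
    error = vc["subject_item"] + vc["residual"]
    if form.startswith("ICC2"):
        error += vc["item"]  # absolute agreement: rater main effect is error
    k = 1
    if form.endswith("k"):
        k = n_items if n_items is not None else vc["n_items"]
    denominator = vc["subject"] + error / k
    return vc["subject"] / denominator if denominator > 0 else 0.0


vc = {"subject": 4.0, "item": 2.0, "subject_item": 1.0, "residual": 1.0, "n_items": 2}
icc = {form: icc_sketch(vc, form) for form in ("ICC2", "ICC3", "ICC2k", "ICC3k")}
```

With these components the averaged forms come out at 2/3 (ICC2k) and 0.8 (ICC3k), and each averaged form is at least as large as its single-rater counterpart.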
- torch_measure.metrics.d_study(variance_components, n_items_grid, n_reps_grid)
Project G-coefficients and SEs over a (n_items, n_reps) design grid.
- Parameters:
variance_components (dict) – Output of variance_components().
n_items_grid (sequence of int) – Candidate design dimensions to project.
n_reps_grid (sequence of int) – Candidate design dimensions to project.
- Returns:
One row per (n_items, n_reps) cell with columns n_items, n_reps, g_relative, g_absolute, se_relative, se_absolute.
- Return type:
pandas.DataFrame
- torch_measure.metrics.bootstrap_variance_components(response_matrix, subject_col='subject_id', item_col='item_id', trial_col='trial', response_col='response', method='moments', n_boot=2000, ci=0.95, seed=None)
Nonparametric bootstrap CIs for variance components by resampling subjects.
Subjects are the exchangeable unit: each bootstrap draw samples n_subjects subjects with replacement, relabels duplicates so each draw is treated as a distinct unit, and re-fits variance_components(). Percentile CIs are reported. The full bootstrap distribution for each component is also returned so callers can derive CIs for any function of the components (e.g. g_coefficient(), intraclass_correlation()).
- Parameters:
response_matrix (pandas.DataFrame) – Long-form responses, same schema as variance_components().
subject_col (str) – Forwarded to variance_components().
item_col (str) – Forwarded to variance_components().
trial_col (str) – Forwarded to variance_components().
response_col (str) – Forwarded to variance_components().
method (str) – Forwarded to variance_components().
n_boot (int) – Number of bootstrap replicates (>= 1).
ci (float) – Confidence level in (0, 1).
seed (int | None) – Seed for numpy.random.default_rng.
- Returns:
Keys: subject, item, subject_item, residual (point estimates on the original sample), n_subjects, n_items, n_reps_harmonic, identifiable, method (as in variance_components()), plus ci (dict[str, tuple[float, float]] per component), samples (dict[str, numpy.ndarray] of length n_boot), n_boot (int), and ci_level (float).
- Return type:
dict
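The resampling scheme described above (draw subjects with replacement, relabel duplicates, re-fit, take percentiles) can be sketched without any dependencies. Everything here is illustrative: the row layout `(subject, item, response)`, the toy statistic and the helper names are assumptions, and `random.Random` stands in for `numpy.random.default_rng`.

```python
import random
import statistics


def subject_mean_variance(rows):
    """Toy statistic: variance of the per-subject mean response."""
    by_subject = {}
    for subject, _item, response in rows:
        by_subject.setdefault(subject, []).append(response)
    return statistics.pvariance([statistics.fmean(v) for v in by_subject.values()])


def bootstrap_subjects(rows, stat, n_boot=200, ci=0.95, seed=None):
    """Percentile bootstrap CI for stat(rows), resampling whole subjects."""
    rng = random.Random(seed)
    by_subject = {}
    for subject, item, response in rows:
        by_subject.setdefault(subject, []).append((item, response))
    ids = list(by_subject)
    samples = []
    for _ in range(n_boot):
        drawn = rng.choices(ids, k=len(ids))  # subjects with replacement
        # Relabel so a subject drawn twice counts as two distinct units.
        resampled = [
            (new_id, item, response)
            for new_id, old_id in enumerate(drawn)
            for item, response in by_subject[old_id]
        ]
        samples.append(stat(resampled))
    ordered = sorted(samples)
    lower = ordered[int((1 - ci) / 2 * (n_boot - 1))]
    upper = ordered[int((1 + ci) / 2 * (n_boot - 1))]
    return lower, upper, samples


rows = [(s, i, float(s >= i)) for s in range(5) for i in range(4)]
lower, upper, samples = bootstrap_subjects(rows, subject_mean_variance, seed=0)
```

Returning the raw `samples` alongside the interval mirrors the design described above: any derived quantity can be computed per replicate and summarised with its own percentiles.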
- torch_measure.metrics.mokken_scalability(data, mask=None)
Compute Mokken scalability coefficients.
Mokken scaling is a non-parametric IRT approach that tests whether items form a unidimensional scale. The H coefficient measures how well item pairs conform to the Guttman pattern.
H >= 0.5: strong scale
0.4 <= H < 0.5: medium scale
0.3 <= H < 0.4: weak scale
H < 0.3: not a scale
- Parameters:
data (torch.Tensor) – Binary response matrix (n_subjects, n_items).
mask (torch.Tensor | None) – Boolean mask.
- Returns:
Dictionary with:
- 'H': Overall scalability coefficient
- 'H_items': Per-item scalability coefficients, shape (n_items,)
- 'H_pairs': Pairwise scalability matrix, shape (n_items, n_items)
- Return type:
dict
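For binary items, the scale-level H is commonly written as the sum of observed pairwise covariances divided by the sum of the maximum covariances attainable given the item marginals. The sketch below computes only that overall coefficient on complete data; the helper name is hypothetical and it is not claimed to match the library's implementation.

```python
from itertools import combinations


def mokken_h_sketch(data):
    """Overall H for complete binary data (list of subject rows)."""
    n, k = len(data), len(data[0])
    p = [sum(row[j] for row in data) / n for j in range(k)]  # item popularities
    cov_sum = 0.0
    cov_max_sum = 0.0
    for i, j in combinations(range(k), 2):
        p_both = sum(row[i] * row[j] for row in data) / n
        cov_sum += p_both - p[i] * p[j]
        # Largest covariance possible with these marginals (no Guttman errors).
        cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
    return cov_sum / cov_max_sum


h_guttman = mokken_h_sketch([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
h_unrelated = mokken_h_sketch([[1, 1], [1, 0], [0, 1], [0, 0]])
```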
- torch_measure.metrics.expected_calibration_error(predicted, observed, mask=None, n_bins=15)
Compute Expected Calibration Error (ECE).
Measures how well predicted probabilities match observed frequencies. ECE = 0 means perfectly calibrated.
- Parameters:
predicted (torch.Tensor) – Predicted probabilities.
observed (torch.Tensor) – Observed binary outcomes.
mask (torch.Tensor | None) – Boolean mask of entries to evaluate.
n_bins (int) – Number of calibration bins.
- Returns:
ECE value in [0, 1].
- Return type:
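A minimal sketch of the usual ECE computation, assuming equal-width probability bins and a count-weighted average of the gap between mean prediction and observed frequency in each bin. The binning rule (right edge folded into the last bin) and the helper name are assumptions for the example.

```python
def ece_sketch(predicted, observed, n_bins=15):
    """Expected calibration error with equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(predicted)
    error = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(confidence - frequency)
    return error


ece_calibrated = ece_sketch([0.25] * 4, [1, 0, 0, 0])
ece_worst = ece_sketch([1.0, 1.0], [0, 0])
ece_mixed = ece_sketch([0.9, 0.9, 0.1, 0.1], [1, 0, 0, 0], n_bins=10)
```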
- torch_measure.metrics.brier_score(predicted, observed, mask=None)
Compute the Brier score (mean squared error of probabilities).
- Parameters:
predicted (torch.Tensor) – Predicted probabilities.
observed (torch.Tensor) – Observed binary outcomes.
mask (torch.Tensor | None) – Boolean mask.
- Returns:
Brier score in [0, 1]. Lower is better.
- Return type:
- torch_measure.metrics.differential_item_functioning(data, group, mask=None, method='mh')
Detect Differential Item Functioning (DIF).
DIF occurs when subjects of equal ability from different groups have different probabilities of answering an item correctly.
- Parameters:
data (torch.Tensor) – Binary response matrix (n_subjects, n_items).
group (torch.Tensor) – Group membership for each subject (n_subjects,). Binary (0/1).
mask (torch.Tensor | None) – Boolean mask.
method (str) – DIF detection method. Currently supports “mh” (Mantel-Haenszel).
- Returns:
Dictionary with:
- 'mh_statistic': Mantel-Haenszel chi-square per item, shape (n_items,)
- 'effect_size': MH odds ratio (Delta-MH) per item, shape (n_items,)
- 'flagged': Boolean mask of items flagged for DIF
- Return type:
dict
- torch_measure.metrics.ability_standard_errors(ability, difficulty, discrimination=None)
Compute standard errors for ability estimates.
SE(theta_i) = 1 / sqrt(sum_j I_j(theta_i)), where I_j is the Fisher information of item j evaluated at theta_i.
- Parameters:
ability (torch.Tensor) – Subject ability values, shape (N,).
difficulty (torch.Tensor) – Item difficulty values, shape (M,).
discrimination (torch.Tensor | None) – Item discrimination values, shape (M,). Defaults to 1 (Rasch).
- Returns:
Standard errors per subject, shape (N,).
- Return type:
torch.Tensor
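The formula can be checked by hand for one subject. The sketch assumes a logistic 2PL response function, P = sigmoid(a * (theta - b)), with item information a^2 * P * (1 - P); the link function and the helper name are assumptions for the example.

```python
import math


def ability_se_sketch(theta, difficulty, discrimination=None):
    """SE of one ability estimate from summed item information."""
    if discrimination is None:
        discrimination = [1.0] * len(difficulty)  # Rasch default
    information = 0.0
    for a, b in zip(discrimination, difficulty):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        information += a ** 2 * p * (1.0 - p)
    return 1.0 / math.sqrt(information)


se_rasch = ability_se_sketch(0.0, [0.0, 0.0, 0.0, 0.0])
se_2pl = ability_se_sketch(0.0, [0.0] * 4, discrimination=[2.0] * 4)
```

Four perfectly targeted Rasch items each contribute 0.25 units of information, giving a standard error of 1; doubling the discrimination quadruples the information and halves the standard error.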
- torch_measure.metrics.difficulty_standard_errors(ability, difficulty, response_matrix, discrimination=None, mask=None)
Compute standard errors for difficulty estimates.
SE(b_j) = 1 / sqrt(sum_i I_j(theta_i)) over observed subjects for each item, where I_j(theta_i) = a_j^2 * P_ij * Q_ij.
- Parameters:
ability (torch.Tensor) – Subject ability values, shape (N,).
difficulty (torch.Tensor) – Item difficulty values, shape (M,).
response_matrix (torch.Tensor) – Response matrix, shape (N, M). Used only for determining observed entries.
discrimination (torch.Tensor | None) – Item discrimination values, shape (M,). Defaults to 1 (Rasch).
mask (torch.Tensor | None) – Boolean mask of observed entries, shape (N, M). If None, all non-NaN entries are treated as observed.
- Returns:
Standard errors per item, shape (M,).
- Return type:
torch.Tensor
- torch_measure.metrics.discrimination_standard_errors(ability, difficulty, discrimination, response_matrix, mask=None)
Compute standard errors for discrimination estimates.
I(a_j) = sum_i (theta_i - b_j)^2 * P_ij * Q_ij over observed subjects. SE(a_j) = 1 / sqrt(I(a_j)).
- Parameters:
ability (torch.Tensor) – Subject ability values, shape (N,).
difficulty (torch.Tensor) – Item difficulty values, shape (M,).
discrimination (torch.Tensor) – Item discrimination values, shape (M,).
response_matrix (torch.Tensor) – Response matrix, shape (N, M). Used only for determining observed entries.
mask (torch.Tensor | None) – Boolean mask of observed entries, shape (N, M). If None, all non-NaN entries are treated as observed.
- Returns:
Standard errors per item, shape (M,).
- Return type:
torch.Tensor
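The two item-level formulas differ only in the weight applied to P * Q, and both sum over observed subjects only. A single-item sketch under the same logistic 2PL assumption, with a hypothetical helper name and a plain list of observed flags standing in for the mask:

```python
import math


def item_se_sketch(abilities, difficulty, discrimination=1.0, observed=None):
    """Return (SE of difficulty, SE of discrimination) for one item."""
    if observed is None:
        observed = [True] * len(abilities)
    info_b = 0.0
    info_a = 0.0
    for theta, seen in zip(abilities, observed):
        if not seen:
            continue  # unobserved entries contribute no information
        p = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
        pq = p * (1.0 - p)
        info_b += discrimination ** 2 * pq
        info_a += (theta - difficulty) ** 2 * pq
    # info_a is zero if every observed ability equals the difficulty;
    # this sketch does not guard against that case.
    return 1.0 / math.sqrt(info_b), 1.0 / math.sqrt(info_a)


se_b, se_a = item_se_sketch([-1.0, 1.0], 0.0)
se_b_steep, se_a_steep = item_se_sketch([-1.0, 1.0], 0.0, discrimination=2.0)
se_b_sparse, _ = item_se_sketch([-1.0, 1.0], 0.0, observed=[True, False])
```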
- torch_measure.metrics.strength_centrality(adjacency)
Node strength: sum of absolute edge weights.
The most widely used centrality measure in network psychometrics. A high-strength node has strong connections (in absolute value) with many other nodes.
- Parameters:
adjacency (torch.Tensor) – Symmetric edge-weight matrix (n_items, n_items), zero diagonal.
- Returns:
Strength per node, shape (n_items,).
- Return type:
torch.Tensor
- torch_measure.metrics.expected_influence(adjacency)
Expected influence: signed sum of edge weights.
Unlike strength, this is sensitive to the polarity of edges and can be negative for nodes connected primarily by negative edges. Proposed by Robinaugh et al. (2016) for signed networks (e.g., symptom networks).
- Parameters:
adjacency (torch.Tensor) – Symmetric edge-weight matrix (n_items, n_items), zero diagonal.
- Returns:
Expected influence per node, shape (n_items,).
- Return type:
torch.Tensor
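Strength and expected influence differ only in whether the absolute value is taken before summing. A small list-based sketch (helper names are hypothetical) makes the difference visible on a network with one negative edge:

```python
def strength_sketch(adjacency):
    """Sum of absolute edge weights per node."""
    return [sum(abs(w) for w in row) for row in adjacency]


def expected_influence_sketch(adjacency):
    """Signed sum of edge weights per node."""
    return [sum(row) for row in adjacency]


# Node 0 has one positive and one negative edge.
network = [
    [0.0, 0.5, -0.25],
    [0.5, 0.0, 0.0],
    [-0.25, 0.0, 0.0],
]
strength = strength_sketch(network)
influence = expected_influence_sketch(network)
```

Node 0 has the highest strength (0.75) but a lower expected influence (0.25), because its negative edge partly cancels the positive one.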
- torch_measure.metrics.closeness_centrality(adjacency)
Closeness centrality: normalised reciprocal of mean shortest-path distance.
Defined as (reachable − 1) / Σ dist(i, j) over all reachable j ≠ i, matching the Wasserman-Faust normalisation for possibly disconnected graphs.
- Parameters:
adjacency (torch.Tensor) – Symmetric edge-weight matrix (n_items, n_items), zero diagonal.
- Returns:
Closeness scores per node, shape (n_items,). Zero for isolated nodes.
- Return type:
torch.Tensor
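A sketch of the definition above using all-pairs shortest paths. Two points are assumptions rather than documented behaviour: edge weights are converted to distances as 1 / |weight| (a common convention in network psychometrics), and "reachable" is taken to count the node itself, so that reachable − 1 is the number of other nodes with a finite distance.

```python
import math


def closeness_sketch(adjacency):
    """Closeness per node; isolated nodes score 0."""
    n = len(adjacency)
    # Assumed conversion: stronger edges are shorter, missing edges are infinite.
    dist = [
        [0.0 if i == j else (1.0 / abs(adjacency[i][j]) if adjacency[i][j] != 0 else math.inf)
         for j in range(n)]
        for i in range(n)
    ]
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    scores = []
    for i in range(n):
        finite = [dist[i][j] for j in range(n) if j != i and math.isfinite(dist[i][j])]
        scores.append(len(finite) / sum(finite) if finite else 0.0)
    return scores


# Path 0 - 1 - 2 with unit weights, plus an isolated node 3.
path_plus_isolate = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
closeness = closeness_sketch(path_plus_isolate)
```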
- torch_measure.metrics.betweenness_centrality(adjacency)
Node betweenness centrality.
For each node v, counts the fraction of (s, t) pairs (s < t, s ≠ v, t ≠ v) for which v lies on a shortest path. A node on a shortest path satisfies dist(s, v) + dist(v, t) ≈ dist(s, t).
The result is normalised by (n−1)(n−2)/2, the total number of source–target pairs.
- Parameters:
adjacency (torch.Tensor) – Symmetric edge-weight matrix (n_items, n_items), zero diagonal.
- Returns:
Betweenness per node in [0, 1], shape (n_items,).
- Return type:
torch.Tensor
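The pair-counting rule above translates almost directly into code once shortest-path distances are available. As a sketch only: distances are assumed to be 1 / |weight|, unreachable pairs are skipped, and the helper names are hypothetical.

```python
import math


def shortest_paths(adjacency):
    """Floyd-Warshall distances, with edge length assumed to be 1 / |weight|."""
    n = len(adjacency)
    dist = [
        [0.0 if i == j else (1.0 / abs(adjacency[i][j]) if adjacency[i][j] != 0 else math.inf)
         for j in range(n)]
        for i in range(n)
    ]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def betweenness_sketch(adjacency):
    """Fraction of (s, t) pairs whose shortest path can pass through each node."""
    n = len(adjacency)
    dist = shortest_paths(adjacency)
    n_pairs = (n - 1) * (n - 2) / 2
    scores = []
    for v in range(n):
        count = 0
        for s in range(n):
            for t in range(s + 1, n):
                if v in (s, t) or not math.isfinite(dist[s][t]):
                    continue
                # v is on a shortest path if the detour through v costs nothing extra.
                if math.isclose(dist[s][v] + dist[v][t], dist[s][t]):
                    count += 1
        scores.append(count / n_pairs if n_pairs > 0 else 0.0)
    return scores


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
betweenness_path = betweenness_sketch(path)
betweenness_star = betweenness_sketch(star)
```

The explicit finiteness check matters because two infinite distances compare as "close", which would otherwise credit nodes for pairs that are not connected at all.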
Correlation
- torch_measure.metrics.tetrachoric_correlation
- torch_measure.metrics.point_biserial_correlation
Reliability
- torch_measure.metrics.infit_statistics
- torch_measure.metrics.outfit_statistics
- torch_measure.metrics.item_total_correlation
- torch_measure.metrics.cronbach_alpha
Generalizability
- torch_measure.metrics.variance_components
- torch_measure.metrics.g_coefficient
- torch_measure.metrics.intraclass_correlation
- torch_measure.metrics.d_study
- torch_measure.metrics.bootstrap_variance_components
Calibration
- torch_measure.metrics.expected_calibration_error
- torch_measure.metrics.brier_score
Scalability
- torch_measure.metrics.mokken_scalability