Base class for any model producing P(correct) over (subject, item) cells.
Subclasses implement predict(), which accepts a long-form query
(a dict of 1-D index tensors) and returns one probability per row.
forward() is a thin wrapper that delegates to predict(),
so model(query) works via nn.Module.__call__().
Each subclass declares the keys it consumes via expected_keys.
The default ("subject_idx", "item_idx") covers every IRT-style
model; condition-aware or trial-aware models extend it.
Note: the forward-pass logic is defined in forward(), but callers should
invoke the Module instance (model(query)) rather than forward() directly.
The instance call runs the registered hooks; a direct forward() call
silently skips them.
Abstract base for factor-based Item Response Theory models.
Specialises Predictor for models with explicit ability and
difficulty parameters that compose into a per-cell probability via
a logistic link. Subclasses implement predict() (inherited from
Predictor) by gathering parameters at the query indices and
applying the IRT formula — see _irt_probability().
For non-factor predictors (TabPFN-style, neural baselines), inherit
Predictor directly instead.
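The gather-then-link pattern described above can be sketched in a few lines. This is a minimal illustration, not the package's _irt_probability(): the name rasch_probability is hypothetical, plain Python lists stand in for tensors, and the Rasch formula sigmoid(theta - b) is used as the simplest case.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic link."""
    return 1.0 / (1.0 + math.exp(-x))

def rasch_probability(theta, b, query):
    """Hypothetical helper: gather ability and difficulty at the query rows,
    then apply sigmoid(theta - b) row by row."""
    return [
        sigmoid(theta[s] - b[i])
        for s, i in zip(query["subject_idx"], query["item_idx"])
    ]

theta = [0.0, 1.0]        # one ability per subject (made-up values)
b = [0.0, -1.0, 2.0]      # one difficulty per item (made-up values)
query = {"subject_idx": [0, 1, 1], "item_idx": [0, 0, 2]}
probs = rasch_probability(theta, b, query)   # one probability per query row
```

The output has one entry per query row, so the same function serves sparse queries and the full grid alike.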
data (LongFormData | torch.Tensor) – Either a LongFormData (canonical
long-form input — every observation is one row) or a wide-form
response tensor of shape (n_subjects, n_items). For wide-form,
missing entries may be encoded as NaN or -1.
mask (torch.Tensor | None) – Only used when data is a wide-form tensor — boolean mask of
entries to use for fitting. Inferred from NaN/-1 when None.
Ignored for long-form input (absent rows are absent observations).
method (str) – Fitting method: "mle", "em", "jml", or "svi".
max_epochs (int) – Maximum number of optimization epochs.
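As a rough illustration of how a wide-form matrix relates to long-form rows, the sketch below keeps one row per observed cell and infers the mask from NaN/-1 when none is given. The function name wide_to_long and the "response" key are hypothetical; the actual LongFormData fields are not shown in this reference.

```python
import math

def wide_to_long(wide, mask=None):
    """Hypothetical converter: one (subject, item, response) row per observed
    cell. With mask=None, a cell is observed unless it is NaN or -1."""
    rows = {"subject_idx": [], "item_idx": [], "response": []}
    for s, row in enumerate(wide):
        for i, value in enumerate(row):
            if mask is None:
                observed = not (math.isnan(value) or value == -1)
            else:
                observed = mask[s][i]
            if observed:
                rows["subject_idx"].append(s)
                rows["item_idx"].append(i)
                rows["response"].append(value)
    return rows

wide = [[1.0, float("nan"), 0.0],
        [-1.0, 1.0, 1.0]]
long_form = wide_to_long(wide)   # the NaN and -1 cells produce no rows
```

Missing cells simply have no row in the long form, which is why a mask is unnecessary for long-form input.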
data (LongFormData | torch.Tensor) – Long-form dataset (preferred) or wide-form response tensor of
shape (n_subjects, n_items). For wide-form, missing entries
may be encoded as NaN or -1.
mask (torch.Tensor | None) – Only used with wide-form input — boolean mask of entries to use.
Inferred from NaNs if None.
device (str or torch.device or None) – Device for the returned tensors. None uses the torch default.
Returns:
{"subject_idx": LongTensor(n_subjects * n_items,), "item_idx": LongTensor(n_subjects * n_items,)}. Row order is subject-major:
all of subject 0’s items first, then subject 1’s items, etc.
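The subject-major ordering can be made concrete with a small sketch. The function below is a hypothetical stand-in that uses plain lists rather than LongTensors; it only illustrates the row order described above.

```python
def cartesian_query_sketch(n_subjects, n_items):
    """Hypothetical stand-in: enumerate every (subject, item) cell in
    subject-major order."""
    subject_idx = [s for s in range(n_subjects) for _ in range(n_items)]
    item_idx = [i for _ in range(n_subjects) for i in range(n_items)]
    return {"subject_idx": subject_idx, "item_idx": item_idx}

query = cartesian_query_sketch(n_subjects=2, n_items=3)
# subject_idx: [0, 0, 0, 1, 1, 1]
# item_idx:    [0, 1, 2, 0, 1, 2]
```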
Predict over the full (n_subjects, n_items) Cartesian grid.
Convenience wrapper around cartesian_query() + model.predict,
reshaped back to an (n_subjects, n_items) matrix. Use this for
visualization, EM quadrature, and other callers that genuinely want
the dense view.
Parameters:
model (Predictor) – Any predictor with an (n_subjects, n_items) universe.
**extra_keys (torch.Tensor) – Additional query columns required by the model’s expected_keys
beyond subject_idx / item_idx. Each must be 1-D of length
n_subjects*n_items.
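Because the grid query is subject-major, folding the flat predictions back into a matrix is a plain row-wise reshape. The sketch below shows that step with lists; reshape_subject_major is a hypothetical name and the probabilities are made-up values.

```python
def reshape_subject_major(flat, n_subjects, n_items):
    """Fold a flat subject-major list back into n_subjects rows."""
    if len(flat) != n_subjects * n_items:
        raise ValueError("flat must have n_subjects * n_items entries")
    return [flat[s * n_items:(s + 1) * n_items] for s in range(n_subjects)]

flat_probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # made-up output for a 2 x 3 grid
dense = reshape_subject_major(flat_probs, n_subjects=2, n_items=3)
# dense[s][i] is the entry at flat position s * n_items + i
```

Any extra query column must follow the same flat ordering, which is why each one needs length n_subjects * n_items.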
Fit the Ising model via Maximum Pseudo-Likelihood Estimation.
Minimises the summed binary cross-entropy of each item given all
other items across all observed (subject, item) pairs.
Parameters:
data (LongFormData | torch.Tensor) – Long-form dataset (preferred) or wide-form binary response tensor
of shape (n_subjects, n_items). NaN or -1 marks missing.
mask (torch.Tensor | None) – Only used with wide-form input — boolean mask of observed
entries. Inferred from NaNs if None.
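The pseudo-likelihood objective can be sketched as follows. The parameterisation is an assumption, since it is not given here: each item's conditional logit is taken to be tau[j] + sum over k != j of W[j][k] * x_k with 0/1 responses, and unobserved neighbours are simply left out of the sum. The names ising_pseudo_nll, W, and tau are illustrative.

```python
import math

def ising_pseudo_nll(X, observed, W, tau):
    """Summed binary cross-entropy of each observed item given the others.
    Assumed form: logit_j = tau[j] + sum_{k != j} W[j][k] * x_k."""
    total = 0.0
    for x, obs in zip(X, observed):
        for j, x_j in enumerate(x):
            if not obs[j]:
                continue  # skip unobserved (subject, item) pairs
            logit = tau[j] + sum(
                W[j][k] * x[k] for k in range(len(x)) if k != j and obs[k]
            )
            p = 1.0 / (1.0 + math.exp(-logit))
            total -= x_j * math.log(p) + (1 - x_j) * math.log(1.0 - p)
    return total

X = [[1, 0], [1, 1]]
observed = [[True, True], [True, True]]
W = [[0.0, 0.0], [0.0, 0.0]]   # no interactions
tau = [0.0, 0.0]               # no thresholds
loss = ising_pseudo_nll(X, observed, W, tau)   # every cell contributes log(2)
```

With all parameters at zero each conditional probability is 0.5, so the loss reduces to the number of observed cells times log 2, a convenient sanity check.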
Gaussian Graphical Model for continuous response data.
Estimates a sparse precision matrix K via the GraphicalLasso objective,
optimised with Adam using a Cholesky parameterisation to ensure K remains
positive definite.
Parameters:
n_items (int) – Number of items (nodes in the network).
lam (float) – L1 regularisation strength on off-diagonal precision entries.
Larger values produce sparser networks.
Minimises
\[-\log\det K + \operatorname{tr}(SK) + \lambda \sum_{i \neq j} |K_{ij}|\]
with K constrained to be positive definite via a Cholesky parameterisation.
Parameters:
data (LongFormData | torch.Tensor) – Long-form dataset (preferred) or wide-form continuous response
tensor of shape (n_subjects, n_items). NaN or -1 marks missing.
mask (torch.Tensor | None) – Only used with wide-form input — boolean mask. Inferred from
NaNs if None.
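To make the objective concrete, the sketch below evaluates it for a small K built as L L^T from a lower-triangular L with a positive diagonal. It uses plain lists and no optimiser; how the package keeps the diagonal positive and runs Adam is not shown here, and ggm_objective is a hypothetical name.

```python
import math

def ggm_objective(L, S, lam):
    """Evaluate -logdet K + tr(S K) + lam * sum_{i != j} |K_ij| with
    K = L L^T, where L is lower triangular with a positive diagonal."""
    n = len(L)
    K = [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
         for i in range(n)]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # det K = prod(L_ii)^2
    trace = sum(S[i][j] * K[j][i] for i in range(n) for j in range(n))
    l1 = sum(abs(K[i][j]) for i in range(n) for j in range(n) if i != j)
    return -logdet + trace + lam * l1, K

L = [[1.0, 0.0], [0.5, 2.0]]
S = [[1.0, 0.0], [0.0, 1.0]]   # identity sample covariance, for illustration
value, K = ggm_objective(L, S, lam=0.1)
```

Writing K as L L^T means the log-determinant comes directly from the diagonal of L, and K is positive definite whenever that diagonal is strictly positive.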
Models the probability that subject a beats subject b as:
\[P(a > b) = \sigma(\theta_a - \theta_b)\]
Mathematically equivalent to Rasch, but the “item” axis is itself a
subject — so predict(query) consumes subject_idx (the A-side)
and item_idx (the B-side).
Parameters:
n_subjects (int) – Number of subjects (e.g., LLMs).
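A minimal sketch of the pairwise formula, assuming plain lists in place of tensors and a hypothetical function name. It also shows the symmetry that follows from the formula: swapping the two sides gives the complementary probability.

```python
import math

def bradley_terry_probability(theta, query):
    """P(a beats b) = sigmoid(theta_a - theta_b), with the A-side in
    subject_idx and the B-side in item_idx."""
    return [
        1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))
        for a, b in zip(query["subject_idx"], query["item_idx"])
    ]

theta = [0.0, 1.0, -0.5]   # made-up strengths
query = {"subject_idx": [0, 1, 1], "item_idx": [0, 0, 2]}
p_forward = bradley_terry_probability(theta, query)
swapped = {"subject_idx": query["item_idx"], "item_idx": query["subject_idx"]}
p_reverse = bradley_terry_probability(theta, swapped)
# p_forward[k] + p_reverse[k] equals 1 for every row, up to rounding
```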
Identical to Rasch in prediction: mu = sigmoid(theta - b).
Uses Beta NLL loss instead of Bernoulli NLL for fitting, allowing
continuous responses in (0, 1) such as empirical probabilities.
Parameters:
n_subjects (int) – Number of subjects (test-takers / models).
n_items (int) – Number of items (test questions / benchmark tasks).
phi (float) – Beta distribution precision parameter. Higher values mean
tighter concentration around the predicted mean. Default 10.0.
response_matrix (torch.Tensor) – Continuous response matrix with values in (0, 1),
shape (n_subjects, n_items). Values must be strictly
between 0 and 1 (exclusive).
mask (torch.Tensor | None) – Boolean mask of entries to use. If None, uses all non-NaN entries.
method (str) – Fitting method: "mle", "em", or "jml".
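The role of phi is easiest to see in a per-cell Beta negative log-likelihood. The sketch below assumes the common mean-precision parameterisation, Beta(mu * phi, (1 - mu) * phi); this reference does not state the exact form, so treat both the parameterisation and the name beta_nll as assumptions.

```python
import math

def beta_nll(y, mu, phi=10.0):
    """Negative log-density of y under Beta(mu * phi, (1 - mu) * phi).
    The mean-precision parameterisation is an assumption."""
    alpha, beta = mu * phi, (1.0 - mu) * phi
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_density = ((alpha - 1.0) * math.log(y)
                   + (beta - 1.0) * math.log(1.0 - y)
                   - log_norm)
    return -log_density

mu = 1.0 / (1.0 + math.exp(-(0.8 - 0.3)))   # Rasch mean for theta=0.8, b=0.3
near = beta_nll(y=mu, mu=mu, phi=10.0)      # response close to the predicted mean
far = beta_nll(y=0.05, mu=mu, phi=10.0)     # response far from the predicted mean
```

Responses of exactly 0 or 1 would make the logarithms diverge, which is consistent with the requirement that values lie strictly inside (0, 1).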
Identical to TwoPL in prediction: mu = sigmoid(a * (theta - b)).
Uses Beta NLL loss instead of Bernoulli NLL for fitting, allowing
continuous responses in (0, 1) such as empirical probabilities.
Parameters:
n_subjects (int) – Number of subjects (test-takers / models).
n_items (int) – Number of items (test questions / benchmark tasks).
phi (float) – Beta distribution precision parameter. Higher values mean
tighter concentration around the predicted mean. Default 10.0.
response_matrix (torch.Tensor) – Continuous response matrix with values in (0, 1),
shape (n_subjects, n_items). Values must be strictly
between 0 and 1 (exclusive).
mask (torch.Tensor | None) – Boolean mask of entries to use. If None, uses all non-NaN entries.
method (str) – Fitting method: "mle", "em", or "jml".
Instead of learning independent parameters for each item, this model
learns a mapping from item embeddings to item parameters (difficulty,
discrimination, guessing). This enables zero-shot prediction on new
items given their embeddings.
P(correct) = c + (1-c) * sigmoid(a * (theta - b))
where b, a, c = f(embedding) are predicted by a neural network.
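The response formula itself can be checked with scalar inputs. In this sketch the item parameters are passed in directly; in the model they would come from the network f(embedding), which is not reproduced here. The function name is hypothetical.

```python
import math

def three_pl_probability(theta, a, b, c):
    """P(correct) = c + (1 - c) * sigmoid(a * (theta - b))."""
    return c + (1.0 - c) * (1.0 / (1.0 + math.exp(-a * (theta - b))))

# At theta == b the sigmoid is 0.5, so P = c + (1 - c) / 2.
p_mid = three_pl_probability(theta=0.0, a=1.5, b=0.0, c=0.2)
# A very weak subject approaches the guessing floor c.
p_low = three_pl_probability(theta=-20.0, a=1.5, b=0.0, c=0.2)
```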
Per-cell TabPFN predictor for cold-item performance prediction.
Each (subject, item) cell becomes a training row with features
[item_features, subject_id] and label = response. subject_id
is appended as a categorical column at fit time. TabPFN does
in-context learning over the observed cells and predicts the held-out cells.
Unlike AmortizedIRT, this model does not factorize into
ability and difficulty – subject identity is just one categorical
feature alongside the item-side features. It inherits
Predictor directly (not IRTModel) because it has
no latent factor parameters. Use this when item features carry
per-task signal beyond what subject identity already encodes; on
homogeneous benchmarks (where they don’t) a row-mean baseline can
be hard to beat.
n_features (int) – Number of item features (item_features.shape[1]).
max_train (int, default 10000) – Maximum training rows passed to TabPFN. Larger contexts
materialize an N x N attention tensor that exceeds GPU memory
well below TabPFN’s pretraining limits (e.g. 47K rows OOMs an
H100 80 GB inside scaled_dot_product_attention). When the
observed-cell count exceeds max_train, a stratified random
subsample is taken and a UserWarning is issued. TabPFN’s design
point is <=10K samples regardless.
n_estimators (int, default 2) – Number of TabPFN ensemble members. More members improve calibration
at proportionally higher cost.
categorical_feature_indices (list[int] | None) – Indices into item_features of columns that should be
treated as categorical. The internally-appended subject_id
column is always added to this list automatically.
random_state (int, default 0) – Seed for the stratified subsample and TabPFN’s internal RNG.
device (str, default "cpu") – Device for TabPFN inference. Use "cuda" or "cuda:0" for GPU.
Notes
The response matrix is treated as binary – non-{0, 1} entries are
cast to int after masking out NaN and -1, matching the
convention of the other IRT models in this package.
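The max_train cap can be pictured as a seeded, proportional subsample. The sketch below stratifies on the binary response label, which is an assumption: the stratification variable is not named in this reference. The helper name and the warning text are also illustrative.

```python
import random
import warnings
from collections import defaultdict

def stratified_subsample(labels, max_train, random_state=0):
    """Return sorted row indices, at most max_train of them, keeping each
    label's share roughly proportional. Stratifying on the response label
    is an assumption."""
    n = len(labels)
    if n <= max_train:
        return list(range(n))
    warnings.warn(
        f"{n} observed cells exceed max_train={max_train}; subsampling.",
        UserWarning,
    )
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    keep = []
    for label in sorted(by_label):
        group = by_label[label]
        quota = (len(group) * max_train) // n   # floor keeps the total <= max_train
        keep.extend(rng.sample(group, quota))
    return sorted(keep)

labels = [0] * 30 + [1] * 10
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = stratified_subsample(labels, max_train=8, random_state=0)
```

Fixing random_state makes the chosen rows reproducible across runs, which matters when comparing models on the same subsample.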
Compute P(correct) at query rows for the given facet level(s).
Query must contain subject_idx and item_idx (1-D, length N).
Optionally facet_idx (1-D, length N or scalar). When omitted,
defaults to facet level 0 — matches the prior behavior where
fitting did not surface facet information.
Anchor a facet level to zero (e.g., English baseline).
Forces gamma[level_idx] = 0 and tau[:, level_idx] = 0 at both
fit and predict time. Also zeros delta[:, level_idx] (the subject
intercept under the reference facet is absorbed by ability).
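The zeroing step can be sketched with nested lists. The shapes are assumptions: gamma has one entry per facet level, tau is taken to be items by facet levels, and delta subjects by facet levels. The sketch shows only the zeroing; keeping the entries at zero during fitting would need extra handling that is not shown.

```python
def anchor_facet(gamma, tau, delta, level_idx):
    """Zero every parameter tied to the reference facet level, in place.
    Assumed shapes: gamma[n_facets], tau[n_items][n_facets],
    delta[n_subjects][n_facets]."""
    gamma[level_idx] = 0.0
    for row in tau:
        row[level_idx] = 0.0
    for row in delta:
        row[level_idx] = 0.0

gamma = [0.3, -0.2]
tau = [[0.1, 0.4], [0.2, -0.1], [0.0, 0.5]]
delta = [[0.7, 0.1], [-0.3, 0.2]]
anchor_facet(gamma, tau, delta, level_idx=0)   # level 0 becomes the baseline
```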
A constrained factor model with one general factor and multiple
group-specific factors. The general factor loads on all items,
while group factors load only on items in their cluster.
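The loading constraint can be written as a 0/1 mask over a loading matrix. The layout below (column 0 for the general factor, one column per group after it) and the function name are illustrative assumptions, not the package's internal representation.

```python
def bifactor_loading_mask(item_groups, n_groups):
    """Return an n_items x (1 + n_groups) 0/1 mask. Column 0 is the general
    factor (free for every item); column 1 + g is free only for items whose
    group is g."""
    mask = []
    for g in item_groups:
        row = [1] + [0] * n_groups
        row[1 + g] = 1
        mask.append(row)
    return mask

mask = bifactor_loading_mask(item_groups=[0, 0, 1, 2], n_groups=3)
# [[1, 1, 0, 0],
#  [1, 1, 0, 0],
#  [1, 0, 1, 0],
#  [1, 0, 0, 1]]
```

Multiplying a free loading matrix elementwise by such a mask is one simple way to keep the constrained entries at zero.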
Build a testlet mapping from hierarchical item identifiers.
Parameters:
item_ids (list[str]) – Item identifiers with testlet structure, e.g.
["task_1:0","task_1:1","task_2:0",...].
The prefix before separator identifies the testlet.
separator (str) – Delimiter between testlet name and sub-item index.
Returns:
testlet_map (torch.Tensor) – Integer tensor of shape (n_items,) mapping each item to
its testlet index.
testlet_names (list[str]) – Ordered list of unique testlet names (first-seen order).
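The mapping logic can be sketched in pure Python. This stand-in returns a plain list where the documented function returns an integer tensor. The default separator ":" and splitting at the first occurrence of the separator are assumptions drawn from the example identifiers above.

```python
def build_testlet_map_sketch(item_ids, separator=":"):
    """Map each item to the index of its testlet, numbering testlets in
    first-seen order."""
    name_to_index = {}
    testlet_map = []
    for item_id in item_ids:
        name = item_id.split(separator, 1)[0]   # prefix before the first separator
        if name not in name_to_index:
            name_to_index[name] = len(name_to_index)
        testlet_map.append(name_to_index[name])
    return testlet_map, list(name_to_index)

testlet_map, testlet_names = build_testlet_map_sketch(
    ["task_1:0", "task_1:1", "task_2:0", "task_1:2"]
)
# testlet_map   -> [0, 0, 1, 0]
# testlet_names -> ["task_1", "task_2"]
```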