This vignette shows how to specify and estimate a two-factor measurement system with factorana. The latent factors are identified from a set of noisy indicators (measurements) whose loadings the package estimates jointly with the factor variances.
The example below uses two latent factors (“cognitive” and “non-cognitive” skills) each measured by three continuous indicators, and is small enough (n = 300) to run quickly.
library(factorana)
set.seed(1)
n <- 300
# Latent factors (true values, unobserved in practice)
f_cog <- rnorm(n, mean = 0, sd = 1.0)
f_noncog <- rnorm(n, mean = 0, sd = 0.8)
# Cognitive indicators: loadings (1.0, 0.9, 0.7); error sd = 0.5
cog1 <- 0.0 + 1.0 * f_cog + rnorm(n, 0, 0.5)
cog2 <- 0.2 + 0.9 * f_cog + rnorm(n, 0, 0.5)
cog3 <- 0.1 + 0.7 * f_cog + rnorm(n, 0, 0.5)
# Non-cognitive indicators: loadings (1.0, 1.1, 0.8); error sd = 0.5
nc1 <- 0.0 + 1.0 * f_noncog + rnorm(n, 0, 0.5)
nc2 <- 0.1 + 1.1 * f_noncog + rnorm(n, 0, 0.5)
nc3 <- 0.0 + 0.8 * f_noncog + rnorm(n, 0, 0.5)
dat <- data.frame(
intercept = 1,
cog1 = cog1, cog2 = cog2, cog3 = cog3,
nc1 = nc1, nc2 = nc2, nc3 = nc3
)
head(dat)
#> intercept cog1 cog2 cog3 nc1 nc2 nc3
#> 1 1 -0.7969873 -1.1345097 -1.11703554 1.1399607 1.2434756 -0.0985743
#> 2 1 0.9348556 0.4624395 1.19013215 -1.3004950 -0.5309301 -0.6931244
#> 3 1 -0.5714748 -0.4198545 -1.41335484 2.0238605 1.7614149 2.3556515
#> 4 1 1.8663765 1.0763851 0.16363734 -0.7774106 0.5158946 0.4655233
#> 5 1 0.2611711 0.8220335 0.67947970 1.5927923 1.4158815 1.1502765
#> 6 1 -1.3888353 -1.0548717 -0.02060567 1.1187830 2.4446165 0.6413497Two independent latent factors. Loading normalizations are set on the component side: each factor has one indicator with loading fixed at 1 (to pin the scale) and two free loadings.
For each indicator we declare a linear equation, which factor(s) it
loads on, and any fixed loadings. loading_normalization
takes a vector of length n_factors:
1 = loading fixed at 1 (scale normalization)0 = loading fixed at 0 (indicator is unrelated to that
factor)NA_real_ = free parameter, to be estimated# Cognitive indicators: load on factor 1 only
mc_cog1 <- define_model_component(
name = "cog1", data = dat, outcome = "cog1", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(1, 0) # factor 1 loading = 1, factor 2 loading = 0
)
mc_cog2 <- define_model_component(
name = "cog2", data = dat, outcome = "cog2", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(NA_real_, 0) # factor 1 loading free, factor 2 loading = 0
)
mc_cog3 <- define_model_component(
name = "cog3", data = dat, outcome = "cog3", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(NA_real_, 0)
)
# Non-cognitive indicators: load on factor 2 only
mc_nc1 <- define_model_component(
name = "nc1", data = dat, outcome = "nc1", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(0, 1)
)
mc_nc2 <- define_model_component(
name = "nc2", data = dat, outcome = "nc2", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(0, NA_real_)
)
mc_nc3 <- define_model_component(
name = "nc3", data = dat, outcome = "nc3", factor = fm,
covariates = "intercept", model_type = "linear",
loading_normalization = c(0, NA_real_)
)Assemble the components into a system:
The estimator uses Gauss-Hermite quadrature to integrate over the
latent factors; we keep n_quad_points modest here for
speed.
# Tidy table of parameter estimates with standard errors
components_table(fit, digits = 3)
#>
#> Factor Model Results by Component
#> ========================================================================================================================
#>
#> Parameter factor cog1 cog2 cog3 nc1 nc2 nc3
#> ----------------------------------------------------------------------------------------------------------
#> Intercept 0.170** 0.353*** 0.264*** -0.030 0.089 0.025
#> (0.065) (0.057) (0.047) (0.059) (0.061) (0.049)
#> Sigma 0.608*** 0.535*** 0.483*** 0.545*** 0.580*** 0.531***
#> (0.037) (0.032) (0.027) (0.037) (0.043) (0.028)
#> factor_var_1 0.794***
#> (0.089)
#> factor_var_2 0.659***
#> (0.062)
#> cog2_loading_1 0.888***
#> (0.053)
#> cog3_loading_1 0.692***
#> (0.044)
#> nc2_loading_2 1.036***
#> (0.067)
#> nc3_loading_2 0.769***
#> (0.048)
#>
#> ----------------------------------------------------------------------------------------------------------
#> N 0 300 300 300 300 300 300
#> Log-likelihood: -2053.38
#>
#> Standard errors in parentheses
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1Factor variances are near the true values (1.0 for the cognitive factor and 0.64 = 0.8^2 for the non-cognitive factor); loadings recover the simulated values (cog2 ≈ 0.9, cog3 ≈ 0.7, nc2 ≈ 1.1, nc3 ≈ 0.8).
The standard errors shown by components_table() come
from the inverse observed information (model-based). For inference that
does not lean on the measurement model being exactly correct,
vcov_factorana() returns a sandwich (Huber-White) robust or
cluster-robust covariance matrix from the same fit, and
robust_se() is a convenience wrapper returning the standard
errors directly.
# Robust (Huber-White) standard errors, over the full parameter vector
se_robust <- robust_se(fit, dat, type = "robust")
# Compare model-based vs robust SEs for the free parameters
free <- fit$free_idx
data.frame(
parameter = names(fit$estimates)[free],
estimate = round(fit$estimates[free], 3),
se_model = round(fit$std_errors[free], 3),
se_robust = round(se_robust[free], 3),
row.names = NULL
)
#> parameter estimate se_model se_robust
#> 1 factor_var_1 0.794 0.089 0.128
#> 2 factor_var_2 0.659 0.062 0.061
#> 3 cog1_intercept 0.170 0.065 0.109
#> 4 cog1_sigma 0.608 0.037 0.038
#> 5 cog2_intercept 0.353 0.057 0.092
#> 6 cog2_loading_1 0.888 0.053 0.059
#> 7 cog2_sigma 0.535 0.032 0.032
#> 8 cog3_intercept 0.264 0.047 0.075
#> 9 cog3_loading_1 0.692 0.044 0.043
#> 10 cog3_sigma 0.483 0.027 0.030
#> 11 nc1_intercept -0.030 0.059 0.091
#> 12 nc1_sigma 0.545 0.037 0.038
#> 13 nc2_intercept 0.089 0.061 0.093
#> 14 nc2_loading_2 1.036 0.067 0.068
#> 15 nc2_sigma 0.580 0.043 0.052
#> 16 nc3_intercept 0.025 0.049 0.075
#> 17 nc3_loading_2 0.769 0.048 0.040
#> 18 nc3_sigma 0.531 0.028 0.030The full covariance matrix (for example to build Wald tests or linear combinations) is available directly:
For clustered data, pass a cluster id (a column name in the data or a
vector of length nrow(dat)). Each factorana row is one
independent unit whose indicators are integrated jointly over the latent
factors, so the factor already captures the within-row dependence;
clustering changes the standard errors only when a cluster groups
several rows (for example several pupils per school, siblings in a
household, or stacked person-wave rows in a long-format second
stage):
# cluster-robust SEs, clustering by a school id column in the data
se_cluster <- robust_se(fit, dat, type = "cluster", cluster = "school_id")The dynamic factor vignette shows a cluster-robust example on panel data. Note that for two-stage fits these standard errors treat the first-stage measurement parameters as known and so understate uncertainty; a both-stages bootstrap is the honest route there.
Posterior mean factor scores for each observation can be recovered from the estimated model:
fscores <- estimate_factorscores_rcpp(
fit, dat, control = ctrl, parallel = FALSE, verbose = FALSE
)
head(fscores[, c("obs_id", "factor_1", "factor_2",
"se_factor_1", "se_factor_2", "converged")])
#> Factor Score Estimates
#> ======================
#>
#> Observations: 6
#> Factors: 2
#> Converged: 6 (100.0%)
#>
#> Summary Statistics:
#> Factor 1: mean=-0.295, sd=0.944, mean_se=0.365
#> Factor 2: mean=0.702, sd=0.957, mean_se=0.340
#>
#> First 6 rows:
#> obs_id factor_1 factor_2 se_factor_1 se_factor_2 converged
#> 1 1 -1.2915708 0.7037167694 0.364723 0.3399574 TRUE
#> 2 2 0.5875902 -0.8011073548 0.364723 0.3399574 TRUE
#> 3 3 -1.0696673 1.8108452339 0.364723 0.3399574 TRUE
#> 4 4 0.7438436 -0.0003981072 0.364723 0.3399574 TRUE
#> 5 5 0.3346083 1.2404898768 0.364723 0.3399574 TRUE
#> 6 6 -1.0742353 1.2587562101 0.364723 0.3399574 TRUE
# Correlation of estimated factor scores with the true (unobserved) factors
cor(fscores$factor_1, f_cog)
#> [1] 0.936484
cor(fscores$factor_2, f_noncog)
#> [1] 0.9422921simulate_factor_model() generates new data from a fitted
(or fully specified) model: it resamples rows of data to
supply any covariates, draws fresh latent factors from the model’s
distribution, and draws each indicator from its model. The simulated
outcome columns are named after the components’ outcome variables, so
the result is directly re-estimable with the same model system (useful
for parametric-bootstrap or recovery checks).
sim <- simulate_factor_model(fit, dat, n = 300, seed = 1)
head(sim[, c("xid", "factor_1", "factor_2", "cog1", "nc1")])
#> xid factor_1 factor_2 cog1 nc1
#> 1 167 -0.5254634 -0.1929168 -0.04188089 0.07070051
#> 2 129 -1.5274478 -0.7523846 -0.93387196 -1.72889523
#> 3 299 -0.3750665 0.3338236 -0.18966667 0.99254912
#> 4 270 0.2763045 -0.1614299 -0.74141053 0.11932253
#> 5 187 1.5168179 -0.4525142 3.05303402 -1.04517954
#> 6 85 -0.3950109 -0.7932152 -0.33931644 -0.51478808With detail = TRUE (the default) the output also
carries, per component, the linear predictor
(<name>_Vobs), the shock
(<name>_eps), and (for discrete components) the
choice probabilities.
vignette("roy_model", package = "factorana") — an
applied example with a discrete sector-choice component and partially
observed potential outcomes.?define_model_component — supported model types
(linear, probit, logit, ordered probit) and options for factor
interactions and type intercepts.?estimate_model_rcpp — full list of estimator options,
including parallel estimation, checkpointing, and adaptive integration
for two-stage workflows.?simulate_factor_model — simulate data from a fitted or
specified model.?vcov_factorana — robust and cluster-robust standard
errors for a fitted model.