Dynamic Factor Model

Overview

An example of how to estimate the dynamics of a latent factor. Assume that indicators are observed at two time points and you want to estimate how the period-2 value depends on the period-1 value. You need two things:

A measurement system that is invariant across time (same loadings, thresholds / residual SDs across the two periods) so the factor scale is comparable, and
A structural equation linking the period-2 factor to the period-1 factor: f₂ = α + β·f₁ + ε.

factorana supports this with a two-stage workflow, and two helper functions make the setup a few lines:

define_dynamic_measurement() builds the Stage 1 measurement model: a 2-factor independent system (or k-factor, for k periods) with loadings and residual sigmas tied across periods via equality constraints, but measurement intercepts left period-specific.
build_dynamic_previous_stage() converts the Stage 1 estimation result into a previous_stage object for Stage 2, carrying the anchor period’s intercepts into every factor slot. This anchors the measurement level under E[f₁] = 0 and lets the observed mean shift between periods identify the structural intercept α.

This vignette walks through a simulated example where α, β, and σ²_ε are known, shows the recovery, and explains one subtlety of the workflow (why the anchor-period intercepts are the right ones to carry into Stage 2).

Simulate data

One latent construct, three linear indicators per period, two periods. The DGP:

\[ f_1 \sim N(0,\sigma^2_{f_1}), \qquad f_2 = \alpha + \beta\,f_1 + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2_\varepsilon) \]

\[ Y_{t,m} = \tau_m + \lambda_m\,f_t + u_{t,m}, \quad u_{t,m} \sim N(0,\sigma_m^2) \]

with measurement parameters τ_m, λ_m, σ_m shared across the two periods.

library(factorana)

set.seed(41)
n <- 1500

# Structural parameters (what we want to recover)
true_var_f1   <- 1.0
true_alpha    <- 0.4
true_beta     <- 0.6
true_sigma_e2 <- 0.5

# Shared measurement parameters
item_int   <- c(1.5, 1.0, 0.8)
item_load  <- c(1.0, 0.9, 1.1)   # first loading fixed to 1 in the model
item_sigma <- c(0.7, 0.75, 0.65)

f1  <- rnorm(n, 0, sqrt(true_var_f1))
eps <- rnorm(n, 0, sqrt(true_sigma_e2))
f2  <- true_alpha + true_beta * f1 + eps

gen_Y <- function(f, i) {
  item_int[i] + item_load[i] * f + rnorm(length(f), 0, item_sigma[i])
}

dat <- data.frame(
  intercept = 1,
  eval      = 1L,
  Y_t1_m1 = gen_Y(f1, 1), Y_t1_m2 = gen_Y(f1, 2), Y_t1_m3 = gen_Y(f1, 3),
  Y_t2_m1 = gen_Y(f2, 1), Y_t2_m2 = gen_Y(f2, 2), Y_t2_m3 = gen_Y(f2, 3)
)

The wide-format data has columns named <prefix><item>: Y_t1_m1, Y_t1_m2, Y_t1_m3 for wave 1, and Y_t2_m1, … for wave 2.

Stage 1: `define_dynamic_measurement()`

One function call builds the measurement model: a 2-factor independent system (one factor per period) with loadings and sigmas tied across periods, intercepts left free per period.

dyn <- define_dynamic_measurement(
  data                 = dat,
  items                = c("m1", "m2", "m3"),
  period_prefixes      = c("Y_t1_", "Y_t2_"),
  model_type           = "linear",
  evaluation_indicator = "eval"
)
# The wrapper generates 5 equality constraints: 2 for loadings (items
# m2, m3; item m1's loading is fixed to 1 on its factor slot) and 3
# for sigmas.
length(dyn$equality_constraints)
#> [1] 5

Estimate Stage 1 with estimate_model_rcpp() exactly as for any model system:

ctrl <- define_estimation_control(n_quad_points = 8, num_cores = 1)
result_stage1 <- estimate_model_rcpp(
  model_system = dyn$model_system,
  data         = dat,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage1$convergence
#> [1] 0

Inspect the intercepts: the wave-1 intercepts should match the DGP τ_m (because E[f₁] = 0 by convention). The wave-2 intercepts are shifted by λ_m · E[f₂], which is the mean-drift artefact we are about to discard.

est <- result_stage1$estimates
tab <- data.frame(
  m        = 1:3,
  DGP_tau  = item_int,
  wave_1   = round(c(est["Y_t1_m1_intercept"],
                     est["Y_t1_m2_intercept"],
                     est["Y_t1_m3_intercept"]), 3),
  wave_2   = round(c(est["Y_t2_m1_intercept"],
                     est["Y_t2_m2_intercept"],
                     est["Y_t2_m3_intercept"]), 3)
)
knitr::kable(tab, row.names = FALSE)

m	DGP_tau	wave_1	wave_2
1	1.5	1.488	1.905
2	1.0	1.000	1.332
3	0.8	0.778	1.234

Why we do not pool the intercepts

A natural alternative is to stack period-1 and period-2 rows into one long-format dataset and fit a single-factor CFA. The resulting intercepts would be the pooled means of Y across both periods, biased by λ_m · E[f₂]/2 (because the pooled mean averages the period-1 and period-2 factor means). Plugging those into Stage 2 under-estimates α by roughly E[f₂]/2.

The wrapper avoids that: Stage 1 estimates period-specific intercepts, then build_dynamic_previous_stage() carries only the anchor period’s intercepts into Stage 2.

Stage 2: `build_dynamic_previous_stage()` + `SE_linear`

dummy <- build_dynamic_previous_stage(
  dyn           = dyn,
  stage1_result = result_stage1,
  data          = dat,
  anchor_period = 1L
)

fm_stage2 <- define_factor_model(
  n_factors        = 2,
  n_types          = 1,
  factor_structure = "SE_linear"
)
ms_stage2 <- define_model_system(
  components     = list(),       # measurement components prepended from previous_stage
  factor         = fm_stage2,
  previous_stage = dummy
)
#> Two-stage SE estimation: Stage 1 (independent) -> Stage 2 (SE_linear)
#>   Measurement parameters (loadings, thresholds) will be fixed from Stage 1
#>   Factor distribution parameters will use Stage 2 structure
#>   Fixing 16 measurement parameters, ignoring 2 factor distribution parameters

init_s2 <- initialize_parameters(ms_stage2, dat, verbose = FALSE)
init_s2$init_params["factor_var_1"]    <- unname(dummy$estimates["factor_var_1"])
init_s2$init_params["se_intercept"]    <- 0.0
init_s2$init_params["se_linear_1"]     <- 0.5
init_s2$init_params["se_residual_var"] <- 0.5

result_stage2 <- estimate_model_rcpp(
  model_system = ms_stage2,
  data         = dat,
  init_params  = init_s2$init_params,
  control      = ctrl,
  optimizer    = "nlminb",
  parallel     = FALSE,
  verbose      = FALSE
)
result_stage2$convergence
#> [1] 0

Recovery

est <- result_stage2$estimates
se  <- result_stage2$std_errors
ps  <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")

tab <- data.frame(
  param = ps,
  true  = c(true_var_f1, true_alpha, true_beta, true_sigma_e2),
  est   = round(unname(est[ps]), 4),
  se    = round(unname(se[ps]),  4)
)
tab$z <- round((tab$est - tab$true) / tab$se, 2)
knitr::kable(tab, row.names = FALSE)

param	true	est	se	z
factor_var_1	1.0	0.9104	0.0325	-2.76
se_intercept	0.4	0.4036	0.0217	0.17
se_linear_1	0.6	0.5880	0.0236	-0.51
se_residual_var	0.5	0.5051	0.0261	0.20

se_intercept (the structural α) recovers essentially exactly, which is the whole point of the workflow. The slope β, residual variance σ²_ε, and input factor variance σ²_f₁ also recover within their standard errors.

Cluster-robust standard errors

The standard errors above are model-based (inverse observed information). When individuals are grouped (for example pupils within schools, or siblings within households) and that grouping induces dependence the model does not capture, vcov_factorana() provides cluster-robust standard errors. Pass cluster as the name of a grouping column in the data (or a vector of length nrow(dat)).

Here we attach a synthetic school id (50 schools) and cluster on it:

set.seed(7)
dat$school <- sample(1:50, n, replace = TRUE)

se_cluster <- robust_se(result_stage2, dat, type = "cluster", cluster = "school")

ps <- c("factor_var_1", "se_intercept", "se_linear_1", "se_residual_var")
data.frame(
  param      = ps,
  se_model   = round(unname(result_stage2$std_errors[ps]), 4),
  se_cluster = round(unname(se_cluster[ps]),                4),
  row.names  = NULL
)
#>             param se_model se_cluster
#> 1    factor_var_1   0.0325     0.0328
#> 2    se_intercept   0.0217     0.0212
#> 3     se_linear_1   0.0236     0.0256
#> 4 se_residual_var   0.0261     0.0302

Because the schools were assigned at random here, the cluster-robust and model-based standard errors are close; with genuine within-cluster dependence the cluster-robust values would typically be larger.

Two-stage caveat. These standard errors are computed on the Stage 2 fit and treat the Stage 1 measurement parameters (loadings, sigmas, intercepts) as known. They therefore understate the total uncertainty (the generated-regressor problem). For honest two-stage inference, resample clusters and re-run both stages in a bootstrap; a single-stage robust or cluster-robust covariance is exact only when the measurement system is not itself estimated.

Bootstrap for honest two-stage standard errors

A cluster bootstrap that re-runs both stages on each resample accounts for the first-stage uncertainty. bootstrap_factorana_multistage() drives this in one call: it resamples clusters, and for each replicate fits Stage 1 and then Stage 2 chained on that replicate’s own Stage 1 fit. Each (stage, sample) fit is saved to disk and skipped if it already exists, so the run is restartable and the per-sample work can be scattered across a compute cluster with bootstrap_fit_sample().

stage_builders <- list(
  # Stage 1: the measurement system (prev_fits is empty for the first stage)
  function(prev_fits, data) dyn$model_system,
  # Stage 2: SE_linear, chaining on THIS replicate's Stage 1 fit
  function(prev_fits, data) {
    prev <- build_dynamic_previous_stage(
      dyn = dyn, stage1_result = prev_fits[[1]], data = data, anchor_period = 1L)
    fm2 <- define_factor_model(n_factors = 2, factor_structure = "SE_linear")
    define_model_system(components = list(), factor = fm2, previous_stage = prev)
  }
)

boot <- bootstrap_factorana_multistage(
  stage_builders, dat, R = 500, cluster = "person",
  dir = "boot_run", control = ctrl,
  optimizer = "nlminb", parallel = FALSE, verbose = FALSE
)

boot$final$boot_se   # Stage 2 bootstrap standard errors (include Stage 1 uncertainty)
boot$final$ci        # percentile intervals

For a multi-node run, generate the samples once with generate_bootstrap_samples() and call bootstrap_fit_sample() from an array job (one task per sample, per stage), then summarise with collect_bootstrap().

Notes on generalisation

More than two periods. period_prefixes can have length 3, 4, or more. The wrapper builds a k-factor independent Stage 1 with all k · (n_items - 1) loadings and k · n_items sigmas tied across periods. For Stage 2 SE modelling, pick any two factors (e.g., periods t-1 and t) and build a 2-factor SE model with previous_stage.
Ordered probit measures. Pass model_type = "oprobit" and n_categories = K. The wrapper ties the K-1 cutpoints of each item across periods (in place of sigma).
Additional covariates. Pass covariates = c("intercept", "x1", ...). Covariate coefficients on items are estimated per-period in Stage 1 (not tied by this wrapper); the anchor-period values are carried into Stage 2.

Where to go next

?define_factor_model: factor_structure = "SE_quadratic" adds a quadratic term γ·f_1² to the structural equation (trap/threshold dynamics).
?define_model_system: the underlying equality_constraints argument, used by the wrapper.
vignette("measurement_system", package = "factorana"): basics of measurement-system estimation.
vignette("roy_model", package = "factorana"): a different structural setting (discrete sector choice plus continuous outcomes sharing a latent ability).