Package 'highMLR'


Title: Machine Learning Feature Selection for High Dimensional Survival Data
Version: 1.0.1
Date: 2026-05-23
Description: A unified, flexible framework for high dimensional feature selection in the presence of a survival outcome. Provides multiple machine learning approaches (Cox elastic net, random survival forest, accelerated oblique random survival forest, gradient-boosted Cox, stability selection, classical univariate Cox screening, pseudo-observation bridging to arbitrary regression learners, and Fine-Gray competing risks selection) under a single interface. Adds causal survival forest estimation of heterogeneous treatment effects on survival (experimental), conformal survival prediction with finite-sample coverage guarantees, and time-dependent 'SHAP' explanations via 'SurvSHAP(t)'. Methodology is based on regularised Cox regression (2011) <doi:10.18637/jss.v039.i05>, random survival forests (2008) <doi:10.1214/08-AOAS169>, oblique random survival forests (2024) <doi:10.1080/10618600.2023.2231048>, stability selection (2010) <doi:10.1111/j.1467-9868.2010.00740.x>, causal survival forests (2023) <doi:10.1111/rssb.12538>, time-dependent survival explanations (2023) <doi:10.1016/j.knosys.2022.110234>, conformal survival prediction (2023) <doi:10.1093/biomet/asad043>, the Fine-Gray model for competing risks (1999) <doi:10.1080/01621459.1999.10474144>, and pseudo-observation regression (2010) <doi:10.1177/0962280209105020>.
Depends: R (≥ 4.1.0)
Imports: survival, glmnet, ranger, aorsf, xgboost, stabs, survex, grf, prodlim, cmprsk, future, future.apply, tibble, ggplot2, rlang, stats, utils
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0), mice, riskRegression
License: GPL-3
Encoding: UTF-8
Language: en-GB
LazyData: true
LazyDataCompression: xz
RoxygenNote: 7.3.3
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Author: Atanu Bhattacharjee [aut, cre]
Maintainer: Atanu Bhattacharjee <atanustat@gmail.com>
Packaged: 2026-05-23 12:14:26 UTC; atanu
Repository: CRAN
Date/Publication: 2026-05-23 12:30:02 UTC

highMLR: Machine Learning Feature Selection for High Dimensional Survival Data

Description

A unified, flexible framework for high dimensional feature selection in the presence of a survival outcome. Provides multiple machine learning approaches under a single interface: Cox elastic net, random survival forest, accelerated oblique RSF, gradient-boosted Cox, stability selection, classical univariate Cox screening, pseudo-observation bridging to any regression learner, and Fine-Gray competing risks selection. Adds causal survival forest estimation of heterogeneous treatment effects, conformal survival prediction intervals, and time-dependent SHAP explanations via SurvSHAP(t).

Main functions

[highmlr()]

Main entry point. Fit one of eight ML methods.

[highmlr_compare()]

Compare multiple methods side by side.

[highmlr_stability()]

Stability selection wrapper.

[highmlr_explain()]

Time-dependent SHAP via SurvSHAP(t).

[highmlr_screen()]

Pre-screening for very high p.

[highmlr_report()]

Generate a Quarto/Rmd report.

[highmlr_causal()]

Causal survival forest (experimental).

[highmlr_conformal()]

Conformal prediction intervals.

Bundled datasets

[hnscc]

High dimensional head and neck cancer survival data.

[srdata]

High dimensional protein gene expression data.

Author(s)

Atanu Bhattacharjee <atanustat@gmail.com>


Coefficients from a highmlr_fit

Description

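Extracts the coefficients from a 'highmlr_fit' object, falling back to importance scores for methods that do not define coefficients.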

Usage

## S3 method for class 'highmlr_fit'
coef(object, ...)

Arguments

object

A 'highmlr_fit' object.

...

Unused.

Value

A named numeric vector of coefficients (where defined) or importance scores otherwise.
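
For Cox-type methods, coefficients are log hazard ratios, so exponentiating them gives hazard ratios. A minimal sketch with made-up values follows; the feature names are borrowed from the hnscc description, and the numbers are purely illustrative, not package output.

```r
# Hypothetical log hazard ratios, standing in for coef(fit) from a Cox-type fit
beta <- c(GJB1 = 0.40, HPN = -0.20, PROM1 = 0)

# Hazard ratio per one-unit increase in each feature
hazard_ratio <- exp(beta)
round(hazard_ratio, 3)
```

This transformation only makes sense when the vector holds coefficients. Importance scores from forest or boosting methods are not on the log-hazard scale and should not be exponentiated.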


Machine learning feature selection for high dimensional survival data

Description

Fits one of several survival ML methods and returns a unified 'highmlr_fit' object summarising the selected features, their importance/coefficients, and (optionally) out-of-sample performance.

Usage

highmlr(
  data,
  time,
  status,
  features = NULL,
  method = c("coxnet", "rsf", "aorsf", "xgboost", "stability", "univariate", "pseudo",
    "finegray"),
  engine = NULL,
  recipe = NULL,
  resampling = c("cv", "bootstrap", "holdout", "none"),
  folds = 5L,
  tune = FALSE,
  top_n = 50L,
  parallel = FALSE,
  seed = NULL,
  ...
)

Arguments

data

A data frame containing 'time', 'status', and the candidate features (or a superset). Rows with missing time/status are dropped.

time

Character scalar: name of the survival time column.

status

Character scalar: name of the event indicator column. For right-censored methods: 1 = event, 0 = censored. For Fine-Gray (method = "finegray"): 0 = censored, 1 = event of interest, 2+ = competing event(s).

features

Character vector of candidate feature column names. If 'NULL' (default), all columns except 'time' and 'status' are used.

method

One of '"coxnet"', '"rsf"', '"aorsf"', '"xgboost"', '"stability"', '"univariate"', '"pseudo"', '"finegray"'.

engine

Optional engine override.

recipe

Optional preprocessing recipe object (currently accepted for forward compatibility; not yet applied).

resampling

One of '"cv"', '"bootstrap"', '"holdout"', '"none"'.

folds

Integer, number of CV folds (default 5).

tune

Logical. If 'TRUE', run internal hyperparameter tuning (currently supported for '"coxnet"' only).

top_n

Integer. For ranking-based methods, keep this many top features (default 50).

parallel

Logical. Use future-based parallelism for the embarrassingly parallel parts.

seed

Optional integer for reproducibility.

...

Additional arguments passed to the method-specific fitter.
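
The two 'status' codings described above can be derived from a single cause-of-event variable. The sketch below uses base R only; the 'cause' labels are hypothetical, and treating competing events as censored for the right-censored methods is one common convention, not necessarily the one the package applies.

```r
# Hypothetical cause-of-event labels for five subjects
cause <- c("censored", "relapse", "death_other", "relapse", "censored")

# Fine-Gray coding: 0 = censored, 1 = event of interest, 2 = competing event
status_fg <- match(cause, c("censored", "relapse", "death_other")) - 1L

# Right-censored coding: 1 = event of interest, 0 = everything else
# (assumption: competing events are treated as censored)
status_rc <- as.integer(status_fg == 1L)

data.frame(cause, status_fg, status_rc)
```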

Value

An object of class 'highmlr_fit'. See [new_highmlr_fit()].

Examples


if (requireNamespace("glmnet", quietly = TRUE)) {
  data(hnscc)
  fit <- highmlr(hnscc, time = "OS", status = "Death",
                 method = "coxnet", resampling = "cv", folds = 5)
  print(fit)
}



Causal survival forest for heterogeneous treatment effects (experimental)

Description

Estimates patient-level conditional average treatment effects (CATEs) on a survival outcome using 'grf::causal_survival_forest'. Unlike the rest of 'highMLR', this function answers a different question: not "which features predict survival?" but "for which patients does treatment T extend (or shorten) survival, and which features modify that effect?".

Usage

highmlr_causal(
  data,
  time,
  status,
  treatment,
  covariates = NULL,
  horizon = NULL,
  num.trees = 2000L,
  target = c("RMST", "survival.probability"),
  honesty = TRUE,
  seed = NULL,
  ...
)

## S3 method for class 'highmlr_causal'
print(x, n = 10, ...)

## S3 method for class 'highmlr_causal'
plot(x, ...)

Arguments

data

A data frame.

time

Character: name of the survival time column.

status

Character: name of the event indicator (0/1).

treatment

Character: name of the binary treatment column (0 = control, 1 = treated). The column must have exactly two levels.

covariates

Character vector of covariate column names. If 'NULL', all columns other than 'time', 'status', and 'treatment' are used.

horizon

Numeric. The time horizon for the treatment effect: the truncation time for '"RMST"', or the time point at which the survival probability difference is evaluated for '"survival.probability"'. Defaults to the median observed time.

num.trees

Number of trees in the forest (default 2000).

target

One of '"RMST"' (restricted mean survival time difference up to 'horizon') or '"survival.probability"' (difference in survival probability at 'horizon').

honesty

Logical (default TRUE) – honest splitting per 'grf'.

seed

Optional integer seed.

...

Passed to 'grf::causal_survival_forest'.

x

A 'highmlr_causal' object.

n

Number of top covariates to print (default 10).

Value

An object of class 'highmlr_causal' containing the fitted forest, per-patient CATE estimates with standard errors, and covariate importance.

'print()' invisibly returns 'x'; 'plot()' returns a 'ggplot' object showing the distribution of estimated CATEs.
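
To make the two 'target' estimands concrete, the sketch below computes unadjusted arm-level differences in restricted mean survival time and in survival probability at a horizon from Kaplan-Meier curves. It uses base R only; the helper names ('km_curve', 'rmst', 'surv_at') and the simulated data are assumptions for illustration. It is not the 'grf' estimator and yields one population-level contrast, not per-patient CATEs.

```r
# Kaplan-Meier estimate: survival just after each distinct event time
km_curve <- function(time, status) {
  ev <- sort(unique(time[status == 1]))
  surv <- cumprod(vapply(ev, function(u) {
    1 - sum(time == u & status == 1) / sum(time >= u)
  }, numeric(1)))
  list(time = ev, surv = surv)
}

# Restricted mean survival time: area under the KM step curve on [0, horizon]
rmst <- function(time, status, horizon) {
  km <- km_curve(time, status)
  keep <- km$time < horizon
  grid <- c(0, km$time[keep], horizon)  # interval endpoints
  surv <- c(1, km$surv[keep])           # curve height on each interval
  sum(diff(grid) * surv)
}

# Survival probability at the horizon
surv_at <- function(time, status, horizon) {
  km <- km_curve(time, status)
  s <- km$surv[km$time <= horizon]
  if (length(s) == 0) 1 else s[length(s)]
}

# Simulated two-arm data (illustrative only)
set.seed(1)
n <- 400
arm <- rbinom(n, 1, 0.5)
event_time <- rexp(n, rate = ifelse(arm == 1, 0.5, 1))
cens_time <- rexp(n, rate = 0.2)
time <- pmin(event_time, cens_time)
status <- as.integer(event_time <= cens_time)
horizon <- median(time)  # mirrors the documented default

rmst_diff <- rmst(time[arm == 1], status[arm == 1], horizon) -
  rmst(time[arm == 0], status[arm == 0], horizon)
surv_diff <- surv_at(time[arm == 1], status[arm == 1], horizon) -
  surv_at(time[arm == 0], status[arm == 0], horizon)
c(RMST = rmst_diff, survival.probability = surv_diff)
```

The RMST difference is measured in time units gained up to the horizon, whereas the survival probability difference is a risk difference at a single time point, so the two targets can rank patients differently.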

Experimental

This function is marked experimental. The signature, defaults, and return shape may change in a future release. Use with care in published analyses, and report the package version.

Examples

## Not run: 
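set.seed(1)
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("V", 1:p)
W <- rbinom(n, 1, 0.5)                       # randomised binary treatment
# The event rate depends on treatment, and V1 modifies the treatment effect
event_time <- rexp(n, rate = exp(0.3 * W + 0.5 * X[, 1] * W))
cens_time  <- rexp(n, rate = 0.05)           # independent censoring
d <- data.frame(OS = pmin(event_time, cens_time),
                Death = as.integer(event_time <= cens_time),
                arm = W, X)
cf <- highmlr_causal(d, "OS", "Death", treatment = "arm",
                     covariates = paste0("V", 1:p))
print(cf); plot(cf)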

## End(Not run)


Compare multiple highMLR methods on the same data

Description

Runs several methods and returns a side-by-side comparison of selected features and performance.

Usage

highmlr_compare(
  data,
  time,
  status,
  features = NULL,
  methods = c("coxnet", "rsf", "univariate"),
  ...
)

Arguments

data, time, status, features

As in [highmlr()].

methods

Character vector of methods to compare.

...

Passed to each call to 'highmlr()'.

Value

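A list with two elements: 'fits' (a named list of 'highmlr_fit' objects, one per method) and 'summary' (a tibble with one row per method giving the method name, the number of selected features 'n_selected', and a key performance metric).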

Examples

## Not run: 
data(hnscc)
cmp <- highmlr_compare(hnscc, "OS", "Death",
                       methods = c("coxnet", "rsf", "univariate"))
cmp$summary

## End(Not run)


Conformal prediction intervals for survival times

Description

Computes calibrated lower prediction bounds on survival time for each new subject using a split-conformal procedure with inverse-probability-of-censoring weights (Candes, Lei and Ren, 2023). Under standard conformal (exchangeability) assumptions and a consistent censoring model, the true survival time lies above the returned bound with marginal probability approximately 1 - alpha.

Usage

highmlr_conformal(
  fit,
  new_data,
  calibration_data = NULL,
  alpha = 0.1,
  calibration_split = 0.3,
  time = NULL,
  status = NULL,
  seed = NULL
)

Arguments

fit

A highmlr_fit object whose predict() method returns a linear predictor or risk score.

new_data

Data frame on which to compute prediction intervals.

calibration_data

Data frame on which to compute conformity scores. If NULL, a random calibration_split fraction of new_data is held out for calibration and the rest is used as the test set (split-conformal).

alpha

Miscoverage level; default 0.1 (so 90 percent coverage).

calibration_split

Fraction of new_data to use for calibration when calibration_data is NULL. Default 0.3.

time

Name of the survival time column in calibration data. Defaults to the column used in fit.

status

Name of the event column in calibration data.

seed

Optional integer seed for the split.

Value

An object of class highmlr_conformal containing per-subject point predictions and lower confidence bounds for survival time.
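
The split-conformal idea behind the lower bound can be sketched in base R. This is a deliberately simplified illustration on uncensored simulated log survival times with a linear model: it omits the inverse-probability-of-censoring weights that the procedure described above relies on, and all object names are assumptions.

```r
set.seed(42)
n <- 3000
x <- rnorm(n)
log_time <- 1 + 0.5 * x + rnorm(n, sd = 0.5)  # uncensored, for illustration only
d <- data.frame(x = x, log_time = log_time)

train <- d[1:500, ]
rest <- d[-(1:500), ]

alpha <- 0.1
calibration_split <- 0.3
cal_idx <- sample(nrow(rest), size = floor(calibration_split * nrow(rest)))
cal <- rest[cal_idx, ]    # held out for conformity scores
test <- rest[-cal_idx, ]  # subjects that receive a lower bound

model <- lm(log_time ~ x, data = train)

# Conformity score: how far the prediction overshoots the observed value
scores <- predict(model, cal) - cal$log_time

# Finite-sample corrected (1 - alpha) quantile of the calibration scores
k <- ceiling((nrow(cal) + 1) * (1 - alpha))
q <- sort(scores)[k]

# Lower bound on the log scale, then back on the time scale
lower_log <- predict(model, test) - q
lower_time <- exp(lower_log)

# Empirical coverage: share of test subjects whose time exceeds the bound
coverage <- mean(test$log_time >= lower_log)
coverage
```

Subtracting a single calibrated margin from every prediction is what makes the guarantee marginal: it holds on average over subjects, not for each covariate profile separately.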

Examples

## Not run: 
# d_train and d_test are user-supplied training and test data frames
fit  <- highmlr(d_train, "OS", "Death", method = "coxnet")
intv <- highmlr_conformal(fit, new_data = d_test, alpha = 0.1)
print(intv)
plot(intv)

## End(Not run)


Time-dependent SHAP explanations for a highmlr_fit (SurvSHAP(t))

Description

Computes SurvSHAP(t) attributions (Krzyzinski et al., 2023) – SHAP values that vary with follow-up time – for the top features in a fitted 'highmlr_fit'. Returns the survex explainer, per-feature aggregated importance, and a plotting helper.

Usage

highmlr_explain(
  fit,
  new_data = NULL,
  top_n = 10L,
  times = NULL,
  method = c("survshap", "permutation", "break_down"),
  n_explain = 25L,
  seed = NULL,
  ...
)

## S3 method for class 'highmlr_explain'
print(x, n = 10, ...)

## S3 method for class 'highmlr_explain'
plot(x, top_n = 10, ...)

Arguments

fit

A 'highmlr_fit' object with a stored model.

new_data

Data on which to compute explanations.

top_n

Number of top features to explain (default 10).

times

Optional numeric vector of time points at which SHAP values are computed. Defaults to a 20-point grid spanning the observed time range.

method

Explanation method passed through to 'survex'. Default '"survshap"' (time-dependent SHAP). Other options: '"permutation"', '"break_down"'.

n_explain

How many test rows to compute SHAP for. Default 25 (SHAP is expensive; full-cohort computation is rarely needed).

seed

Optional integer for reproducibility of subsampling.

...

Passed to 'survex::model_survshap()' or 'survex::explain_survival()'.

x

A 'highmlr_explain' object.

n

Number of top features to print (default 10).

Value

A list with class 'highmlr_explain' containing:

* 'explainer' – the 'survex' explainer object
* 'survshap' – the time-dependent SHAP object (if applicable)
* 'top_features' – the top features table from the fit
* 'aggregated' – tibble of mean absolute SHAP per feature, averaged across time and explained rows
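
The aggregation described for 'aggregated' (mean absolute SHAP per feature, averaged over time points and explained rows) can be sketched in base R. The array layout (rows x times x features) and the values are assumptions for illustration and do not reflect the internal structure of the 'survex' object.

```r
# Hypothetical SHAP values: 2 explained rows x 3 time points x 2 features
shap <- array(0, dim = c(2, 3, 2),
              dimnames = list(NULL, NULL, c("A", "B")))
shap[, , "A"] <- -1             # constant negative attribution
shap[, , "B"] <- c(0.5, -0.5)   # sign flips between rows, same magnitude

# Mean absolute attribution per feature, over rows and time points
aggregated <- apply(abs(shap), 3, mean)
sort(aggregated, decreasing = TRUE)
```

Taking absolute values before averaging prevents attributions of opposite sign from cancelling, which is why feature B still receives a non-zero score here.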


Generate a Quarto/Rmd report skeleton for a highmlr_fit

Description

Writes a self-contained Rmd file that, when rendered, produces a standard biomarker report (selected features, hazard ratios where available, performance, forest plot).

Usage

highmlr_report(fit, file = "highmlr_report.Rmd", render = FALSE)

Arguments

fit

A 'highmlr_fit' object.

file

Output '.Rmd' path (default '"highmlr_report.Rmd"').

render

Logical: if 'TRUE', also render via 'rmarkdown::render()'.

Value

Invisibly, the path to the written file.


Pre-screen features when p is very large

Description

Lightweight filter before the main pipeline (e.g. to drop features with low variance or low marginal association).

Usage

highmlr_screen(
  data,
  time,
  status,
  features = NULL,
  filter = c("variance", "univariate_p", "none"),
  keep = 1000L
)

Arguments

data, time, status, features

As in [highmlr()].

filter

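One of '"variance"' (rank features by variance), '"univariate_p"' (rank features by marginal association with the outcome), or '"none"' (no filtering).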

keep

Integer, how many features to retain (default 1000).

Value

Character vector of retained feature names.
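
A variance filter of this kind can be sketched in a few lines of base R. The helper name 'screen_by_variance' and the toy data are hypothetical; the package's own filter may differ in details such as tie handling or scaling.

```r
# Keep the `keep` features with the largest sample variance
screen_by_variance <- function(data, features, keep) {
  v <- vapply(data[features], stats::var, numeric(1), na.rm = TRUE)
  names(sort(v, decreasing = TRUE))[seq_len(min(keep, length(v)))]
}

# Toy data with clearly ordered variances: c > b > a
toy <- data.frame(a = c(1, 1, 1, 2),
                  b = c(0, 10, 0, 10),
                  c = c(0, 100, 0, 100))
retained <- screen_by_variance(toy, features = c("a", "b", "c"), keep = 2)
retained
```

A variance filter never looks at the outcome, so it can be applied before resampling without leaking outcome information; a filter based on univariate p-values does use the outcome and is best kept inside the resampling loop.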

Examples

## Not run: 
data(srdata)
keep <- highmlr_screen(srdata, "OS", "event",
                       filter = "variance", keep = 500)
fit <- highmlr(srdata, "OS", "event", features = keep, method = "coxnet")

## End(Not run)


Post-hoc stability analysis of a fitted highmlr_fit

Description

Runs stability selection on the data used in 'fit', returning a selection frequency per feature.

Usage

highmlr_stability(fit, B = 100L, cutoff = 0.75, PFER = 1, ...)

Arguments

fit

A 'highmlr_fit' object (used only for the data / call).

B

Number of subsamples (default 100).

cutoff

Selection probability threshold (default 0.75).

PFER

Per-family error rate bound (default 1).

...

Passed to [fit_stability()].

Value

A new 'highmlr_fit' with 'method = "stability"'.
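
The notion of a selection frequency can be illustrated with a small subsampling loop in base R. This sketch uses a continuous outcome and a top-q correlation ranking as a stand-in base selector, purely for illustration; it is not the package's survival-based procedure, and all names are assumptions.

```r
set.seed(7)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * X[, "V1"] + 2 * X[, "V2"] + rnorm(n)  # only V1 and V2 carry signal

B <- 100        # number of subsamples
cutoff <- 0.75  # selection probability threshold
q <- 2          # features selected per subsample

hits <- setNames(numeric(p), colnames(X))
for (b in seq_len(B)) {
  idx <- sample(n, size = floor(n / 2))        # subsample half of the rows
  score <- abs(cor(X[idx, ], y[idx]))[, 1]     # stand-in base selector
  top <- names(sort(score, decreasing = TRUE))[seq_len(q)]
  hits[top] <- hits[top] + 1
}

selection_freq <- hits / B
stable <- names(selection_freq)[selection_freq >= cutoff]
stable

# Meinshausen-Buhlmann bound on the expected number of false selections
pfer_bound <- q^2 / ((2 * cutoff - 1) * p)
pfer_bound
```

The bound links the three quantities: for a fixed number of candidate features, lowering 'cutoff' or selecting more features per subsample loosens the error control.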


High dimensional head and neck cancer survival and gene expression data

Description

Survival and gene expression measurements for head and neck squamous cell carcinoma patients, used to demonstrate high-dimensional feature selection.

Usage

hnscc

Format

A data frame with 565 rows (one per patient) and 104 columns. The first five columns are the identifier and outcome variables: ID (patient identifier), Death (overall survival event indicator, 1 = death, 0 = censored), OS (overall survival time), PFS (progression-free survival time), and Prog (progression event indicator, 1 = progression, 0 = none). The remaining 99 columns are numeric gene expression features named by gene symbol (for example GJB1, HPN, PROM1).
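
Because 'features = NULL' in highmlr() uses every column except 'time' and 'status', the identifier and the progression columns would otherwise be treated as candidate features. The sketch below builds an explicit gene feature vector on a toy data frame that mimics the documented column layout, so it runs without the package; the toy values are simulated.

```r
# Toy stand-in with the same leading columns as described for hnscc
set.seed(1)
toy <- data.frame(ID = 1:6,
                  Death = rbinom(6, 1, 0.5), OS = rexp(6),
                  PFS = rexp(6), Prog = rbinom(6, 1, 0.5),
                  GJB1 = rnorm(6), HPN = rnorm(6), PROM1 = rnorm(6))

# Identifier and outcome columns that should not enter as features
outcome_cols <- c("ID", "Death", "OS", "PFS", "Prog")
gene_features <- setdiff(names(toy), outcome_cols)
gene_features

# With the real data, the same vector could be passed on, for example:
# highmlr(hnscc, time = "OS", status = "Death",
#         features = gene_features, method = "coxnet")
```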

Source

Bundled with the package since highMLR v0.1.1.


Constructor for highmlr_fit objects

Description

Internal constructor. Not exported. Use [highmlr()] to create fits.

Usage

new_highmlr_fit(
  selected,
  performance = NULL,
  model = NULL,
  method = NA_character_,
  call = NULL,
  data_summary = list(),
  meta = list()
)

Arguments

selected

A 'tibble' of selected features with columns 'feature', 'importance', and method-specific extras (e.g. 'coef', 'hazard_ratio', 'selection_freq').

performance

A named list of out-of-sample performance metrics (e.g. 'c_index', 'ibs').

model

The fitted underlying model object (parsnip fit, glmnet object, or list of these for stability selection).

method

Character scalar, the method used.

call

The matched call.

data_summary

A small list summarising the input data (n, p, events, censoring rate).

meta

Optional list for method-specific metadata.

Value

An object of class 'highmlr_fit'.


Plot method for highmlr_conformal objects

Description

Plots a highmlr_conformal object and returns the plot as a ggplot object.

Usage

## S3 method for class 'highmlr_conformal'
plot(x, ...)

Arguments

x

A highmlr_conformal object.

...

Unused.

Value

A ggplot object.


Forest / importance plot for a highmlr_fit

Description

Draws a forest or importance plot of the top features in a 'highmlr_fit' object.

Usage

## S3 method for class 'highmlr_fit'
plot(x, top_n = 20, ...)

Arguments

x

A 'highmlr_fit' object.

top_n

Number of top features to plot (default 20).

...

Unused.

Value

A 'ggplot' object.


Predict from a highmlr_fit

Description

Computes predictions for new data from a fitted 'highmlr_fit' object.

Usage

## S3 method for class 'highmlr_fit'
predict(object, new_data, type = c("linear_pred", "survival", "risk"), ...)

Arguments

object

A 'highmlr_fit' object.

new_data

A data frame containing the features used in fitting.

type

One of '"linear_pred"', '"survival"', or '"risk"'. Availability depends on the underlying model.

...

Passed to the underlying model's predict method.

Value

Predicted values (vector or tibble depending on 'type').


Print method for highmlr_conformal objects

Description

Prints a preview of a highmlr_conformal object.

Usage

## S3 method for class 'highmlr_conformal'
print(x, n = 10, ...)

Arguments

x

A highmlr_conformal object.

n

Number of rows to display in the preview table (default 10).

...

Unused.

Value

Invisibly returns x.


Print method for highmlr_fit

Description

Prints a 'highmlr_fit' object, showing its top features.

Usage

## S3 method for class 'highmlr_fit'
print(x, n = 10, ...)

Arguments

x

A 'highmlr_fit' object.

n

Number of top features to display (default 10).

...

Unused.

Value

Invisibly returns 'x'.


High dimensional protein gene expression survival data

Description

Protein expression measurements with a survival outcome, used to demonstrate high-dimensional feature selection.

Usage

srdata

Format

A data frame with 288 rows and 250 columns. The first four columns are the identifier and outcome variables: ID (subject identifier), Visit (visit number), OS (overall survival time), and event (survival event indicator, 1 = event, 0 = censored). The remaining 246 columns are numeric protein expression features named by protein or marker (for example C6kine, ActivinA, Adiponectin).

Source

Bundled with the package since highMLR v0.1.1.


Summary method for highmlr_fit

Description

Summarises a 'highmlr_fit' object, returning the full selected feature table and performance.

Usage

## S3 method for class 'highmlr_fit'
summary(object, ...)

Arguments

object

A 'highmlr_fit' object.

...

Unused.

Value

A list with the full selected feature table and performance.