Getting Started with CytoProfile

CytoProfile is an R package designed for end-to-end analysis and visualization of cytokine profiling data from immunological and clinical studies. It provides a structured workflow covering:

This vignette provides a guided, step-by-step tutorial for each function, including input data requirements, key parameters, and guidance on interpreting outputs.


Installation and Dependencies

Required Bioconductor package

CytoProfile depends on mixOmics, which must be installed from Bioconductor before installing CytoProfile:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("mixOmics")

Installing CytoProfile

# From CRAN (stable release)
install.packages("CytoProfile")

# From GitHub (development version)
# install.packages("devtools")
devtools::install_github("saraswatsh/CytoProfile", ref = "devel")

Package loading

CytoProfile imports all necessary packages internally, you do not need to call library() for its dependencies (e.g., ggplot2, dplyr, mixOmics). The only exception in this vignette is dplyr, which is used explicitly for data subsetting in the examples below and is not re-exported by CytoProfile.

library(CytoProfile)
library(dplyr)   # used only for filter() in the examples below

1. Data Loading and Format Requirements

Input data format

All CytoProfile functions expect a data frame (or coercible matrix) where:

CytoProfile ships with five example datasets (ExampleData1 through ExampleData5). Throughout this vignette we use ExampleData1, which contains 297 observations of 25 cytokines along with Group, Treatment, and Time columns.

data("ExampleData1")
data_df <- ExampleData1

# Inspect structure
dim(data_df)
#> [1] 297  28
names(data_df)[1:6]
#> [1] "Group"     "Treatment" "Time"      "IL-17F"    "GM-CSF"    "IFN-G"
head(data_df[, 1:5])
#>   Group Treatment Time IL-17F GM-CSF
#> 1   T2D  CD3/CD28   20   3.31   7.16
#> 2   T2D  CD3/CD28   20   0.38   0.87
#> 3   T2D  CD3/CD28   20   1.25   2.47
#> 4   T2D  CD3/CD28   20   2.24   3.52
#> 5   T2D  CD3/CD28   20   0.25   1.58
#> 6   T2D  CD3/CD28   20   0.96   1.87
table(data_df$Group)
#> 
#>     ND PreT2D    T2D 
#>     99     99     99
table(data_df$Treatment)
#> 
#>     CD3/CD28          LPS Unstimulated 
#>           99           99           99

Recommended sample size: Functions perform best with at least 10 samples per group. Machine learning functions (Random Forest, XGBoost) require larger sample sizes (\(\ge\) 20 per group) to produce reliable cross-validated performance estimates.


2. Exploratory Data Analysis

2.1 Boxplots - cyt_bp()

Purpose: Generate boxplots for each numeric variable, optionally grouped by one or more categorical variables.

Key data requirements: - At least one numeric column is required. - If group_by is specified, those columns must be character or factor.

Key parameters:

Parameter Description Default
data Data frame or matrix -
group_by Column name(s) for grouping NULL
scale Transformation: "none", "log2", "log10", "zscore", "custom" "none"
bin_size Max plots per page (ungrouped only) 25
y_lim y-axis limits, e.g. c(0, 20) NULL
output_file File path to save (e.g. "plots.pdf"). NULL displays interactively. NULL

When to use which scale: - "log2" or "log10": when cytokine concentrations span several orders of magnitude or show strong right-skew (common for raw immunoassay data). - "zscore": when comparing distributions on a standardized scale across cytokines with very different concentration ranges. - "none": when data are already transformed.

Interpreting the output: - Each boxplot shows the median (horizontal line), interquartile range (box), 1.5x IQR whiskers, and individual data points (jittered). - Overlapping distributions between groups suggest no strong group effect for that cytokine. - Outliers far beyond the whiskers warrant attention - consider whether they represent biological extremes or measurement artefacts.

# Ungrouped boxplots with log2 transformation
# All numeric columns are plotted; up to bin_size = 25 per page
cyt_bp(data_df[, -c(1:3)], output_file = NULL, scale = "log2")

# Grouped boxplots: one plot per cytokine, colored by Group
# Only passing Group and two cytokines for a concise display
cyt_bp(
  data_df[, c("Group", "IL-10", "CCL-20/MIP-3A")],
  group_by = "Group",
  scale = "zscore"
)

Note: cyt_bp() invisibly returns a named list of ggplot objects, so you can retrieve and further modify any individual plot: plots <- cyt_bp(...); plots[["IL-10"]] + ggplot2::ggtitle("Custom title").


2.2 Violin Plots - cyt_violin()

Purpose: Similar to cyt_bp() but shows the full data distribution shape using kernel density estimation, making it especially informative for bimodal or asymmetric cytokine distributions.

Key parameters (same as cyt_bp(), plus):

Parameter Description Default
boxplot_overlay Draw a narrow boxplot inside each violin to show median and IQR FALSE

Interpreting the output: - A wide violin belly indicates many observations at that value. - Bimodal violins (two wide regions) suggest subpopulations within a group - worth investigating with further stratification. - Setting boxplot_overlay = TRUE combines density estimation with quartile summaries in one panel.

# Ungrouped violin plots with z-score scaling
cyt_violin(data_df[, -c(1:3)], output_file = NULL, scale = "zscore")

# Grouped violin plots with boxplot overlay and log2 scaling
cyt_violin(
  data_df[, c("Group", "IL-10", "CCL-20/MIP-3A")],
  group_by = "Group",
  boxplot_overlay = TRUE,
  scale = "log2"
)


2.3 Skewness and Kurtosis - cyt_skku()

Purpose: Assess whether cytokine distributions are symmetric (skewness \(\approx\) 0) and whether they have tails heavier than a normal distribution (kurtosis > 0). This is important for deciding whether a log or other transformation is appropriate before statistical testing.

Key data requirements: - Numeric measurement columns only (exclude grouping columns via group_cols). - All values should be positive if log2 transformation will be applied downstream, since cyt_skku() uses log2 internally for comparison.

Key parameters:

Parameter Description Default
group_cols Character vector of grouping column names NULL
print_res_raw Return summary stats for raw data FALSE
print_res_log Return summary stats for log2-transformed data FALSE
output_file File path to save plots NULL

Interpreting the output: - Two histograms are produced: one for skewness, one for kurtosis, comparing raw data (red) and log2-transformed data (blue). - Skewness: Values far from 0 indicate asymmetry. Positive skewness (right tail) is common in raw cytokine data. After log2 transformation, distributions closer to 0 indicate better symmetry. - Kurtosis: Positive excess kurtosis indicates heavier-than-normal tails (more outliers expected). After log2 transformation, kurtosis values closer to 0 suggest the transformation improved normality. - If log2 transformation substantially reduces both skewness and kurtosis compared to raw values, it is recommended as a pre-processing step.

# Overall distributional assessment (no grouping)
cyt_skku(data_df[, -c(1:3)], output_file = NULL, group_cols = NULL)

# Grouped assessment by "Group"
cyt_skku(data_df[, -c(2:3)], output_file = NULL, group_cols = c("Group"))


2.4 Error Bar Plots - cyt_errbp()

Purpose: Visualize the central tendency and spread of each cytokine across groups, with optional statistical annotations (p-values and effect sizes) comparing each group to the first (baseline) group.

Key data requirements: - At least one numeric column and one grouping column. - The first factor level of group_col is used as the baseline for comparisons.

Key parameters:

Parameter Description Default
group_col Name of the grouping column -
stat Central tendency: "mean" or "median" "mean"
error Spread: "se" (standard error), "sd", "mad", "ci" (95% CI) "se"
scale Data transformation (see cyt_bp()) "none"
method Statistical test: "auto", "ttest", "wilcox" "auto"
p_lab Show p-value annotations TRUE
es_lab Show effect size annotations TRUE
class_symbol Use symbols (*, >>) rather than numeric values FALSE
p_adjust_method p-value correction method (e.g. "BH") NULL
output_file File path to save NULL

Interpreting the output: - Each facet panel corresponds to one cytokine. Bars show the central tendency; error bars show spread. - P-value labels above bars indicate statistical significance versus the baseline group. Values \(\le\) 0.05 are conventionally significant. - Effect size labels indicate the practical magnitude of the difference: larger absolute values denote greater differences. For t-test comparisons, this is Cohen’s d; for Wilcoxon, the rank-biserial correlation. - class_symbol = TRUE renders significance as stars (* to *****) and effect size as arrows (> to >>>>>), following conventions used in the accompanying publication.

# Basic error bar plot: default mean +/- SE
df_err <- ExampleData1[, c("Group", "CCL-20/MIP-3A", "IL-10")]
cyt_errbp(
  df_err,
  group_col = "Group",
  x_lab = "Group",
  y_lab = "Concentration"
)

# Mean +/- SD with log2 transformation, symbols, and t-test
cyt_errbp(
  df_err,
  group_col = "Group",
  stat = "mean",
  error = "sd",
  scale = "log2",
  class_symbol = TRUE,
  method = "ttest",
  x_lab = "Group",
  y_lab = "Concentration (log2)"
)

Tip: Use p_adjust_method = "BH" when analysing a large cytokine panel to control the false discovery rate across multiple comparisons.


3. Univariate Analysis

3.1 Two-group comparisons - cyt_univariate()

Purpose: Perform pairwise statistical tests between exactly two groups for each numeric cytokine column. For each categorical predictor with exactly two levels, either a Student’s t-test or Wilcoxon rank-sum test is applied to each numeric outcome.

Key data requirements: - At least one factor/character column with exactly two levels. - At least one numeric column. - Both groups must have \(\ge\) 2 observations with non-zero variance.

Automatic test selection (method = "auto"): When method = "auto", a Shapiro-Wilk normality test is applied to the pooled values for each cytokine within a comparison. If both groups pass normality (p > 0.05), a t-test is used; otherwise, the Wilcoxon rank-sum (Mann-Whitney U) test is used. This selection is done independently for each cytokine. If you prefer a consistent approach, set method = "ttest" or method = "wilcox" explicitly.

Key parameters:

Parameter Description Default
scale Transformation applied before testing NULL
method "auto", "ttest", or "wilcox" "auto"
format_output Return as tidy data frame (TRUE) or list of test objects (FALSE) FALSE
p_adjust_method Multiple testing correction (e.g. "BH") NULL

Interpreting the output: - With format_output = TRUE, a data frame is returned with columns: Outcome (cytokine), Categorical (grouping variable), Comparison (group A vs group B), Test (method used), Estimate, Statistic, and P_value. - P-values < 0.05 indicate a statistically significant difference between groups for that cytokine. - When testing many cytokines simultaneously, use p_adjust_method = "BH" to control the false discovery rate; the adjusted column P_adj will be appended.

data_uni <- ExampleData1[, -c(3)]
data_uni <- dplyr::filter(data_uni, Group != "ND", Treatment != "Unstimulated")

# Tidy output with log2 transformation and automatic test selection
cyt_univariate(
  data_uni[, c("Group", "Treatment", "IL-10", "CCL-20/MIP-3A")],
  scale    = "log2",
  method   = "auto",
  format_output = TRUE,
  p_adjust_method = "BH"
)
#>         Outcome Categorical      Comparison
#> 1         IL.10       Group   PreT2D vs T2D
#> 2 CCL.20.MIP.3A       Group   PreT2D vs T2D
#> 3         IL.10   Treatment CD3/CD28 vs LPS
#> 4 CCL.20.MIP.3A   Treatment CD3/CD28 vs LPS
#>                                                Test Estimate Statistic P_value
#> 1 Wilcoxon rank sum test with continuity correction   -0.956    1625.0   0.012
#> 2 Wilcoxon rank sum test with continuity correction   -1.575    1050.5   0.000
#> 3 Wilcoxon rank sum test with continuity correction    1.690    3091.0   0.000
#> 4 Wilcoxon rank sum test with continuity correction    0.488    2509.5   0.132
#>   P_adj
#> 1 0.016
#> 2 0.000
#> 3 0.000
#> 4 0.132

3.2 Multi-group comparisons - cyt_univariate_multi()

Purpose: Extend univariate testing to categorical predictors with more than two levels using either ANOVA (parametric) or Kruskal-Wallis (non-parametric) global tests, followed by pairwise post-hoc comparisons.

Key data requirements: - At least one factor/character column with more than two levels. - At least one numeric column. - ANOVA is restricted to predictors with \(\le\) 10 levels.

Key parameters:

Parameter Description Default
method "anova" (Tukey HSD post-hoc) or "kruskal" (pairwise Wilcoxon) "anova"
cat_vars Character vector of categorical column names to test NULL (auto-detect)
cont_vars Character vector of numeric column names to test NULL (auto-detect)
p_adjust_method Correction method for pairwise p-values "BH"
format_output Return as tidy data frame FALSE

Interpreting the output: - With format_output = TRUE, a data frame with columns Outcome, Categorical, Comparison, and P_adj is returned. - Each row represents one pairwise comparison for one cytokine. P_adj values < 0.05 indicate significant pairwise differences after multiple testing correction. - Use method = "kruskal" when cytokine data are non-normal or when sample sizes are small.

# Kruskal-Wallis with pairwise Wilcoxon for multi-group comparison
cyt_univariate_multi(
  ExampleData1[, c("Group", "IL-10", "CCL-20/MIP-3A")],
  method = "kruskal",
  format_output = TRUE
)
#>         Outcome Categorical Comparison  P_adj
#> 1         IL.10       Group  PreT2D-ND 0.2995
#> 2         IL.10       Group     T2D-ND 0.3181
#> 3         IL.10       Group T2D-PreT2D 0.0849
#> 4 CCL.20.MIP.3A       Group  PreT2D-ND 0.0518
#> 5 CCL.20.MIP.3A       Group     T2D-ND 0.0332
#> 6 CCL.20.MIP.3A       Group T2D-PreT2D 0.0001

4. Statistical Visualizations

4.1 Volcano Plot - cyt_volc()

Purpose: Visualize fold change versus statistical significance for all cytokines between two groups simultaneously, highlighting cytokines that are both biologically meaningful (large fold change) and statistically significant.

Key data requirements: - One grouping column and multiple numeric columns. - Both conditions specified must exist as levels of group_col.

Key parameters:

Parameter Description Default
group_col Grouping column name -
cond1, cond2 The two conditions to compare NULL (all pairwise)
fold_change_thresh Fold change threshold (original scale) 2
p_value_thresh p-value significance threshold 0.05
top_labels Number of top cytokines to label 10
method "ttest" or "wilcox" "ttest"
p_adjust_method Multiple testing correction NULL
add_effect Compute and return effect sizes FALSE

Interpreting the output: - x-axis: log2 fold change (cond2 / cond1). Positive values indicate higher expression in cond2; negative values in cond1. - y-axis: -log10(p-value). Higher values = more significant. - Dashed lines: Vertical lines mark the +/-log2 fold change threshold; the horizontal line marks the -log10 p-value threshold. - Red points: Cytokines exceeding both thresholds - these are the candidates most likely to be biologically relevant between the two conditions. - Labels: The top_labels cytokines with the most significant p-values are labelled.

data_volc <- ExampleData1[, -c(2:3)]

volc_plots <- cyt_volc(
  data_volc,
  group_col = "Group",
  cond1 = "T2D",
  cond2 = "ND",
  fold_change_thresh = 2.0,
  p_value_thresh = 0.05,
  top_labels = 15,
  method = "ttest"
)

# The function returns a named list of ggplot objects; print the comparison
volc_plots[["T2D vs ND"]]


4.2 Heatmap - cyt_heatmap()

Purpose: Visualize the expression matrix of all cytokines across all samples, with optional annotation and hierarchical clustering to reveal sample groupings and cytokine co-expression patterns.

Key data requirements: - A data frame containing numeric cytokine columns. Non-numeric columns are ignored. - The annotation column (if used) must have length equal to the number of rows in the numeric matrix.

Key parameters:

Parameter Description Default
scale Transformation: "log2", "log10", "row_zscore", "col_zscore", "zscore" NULL
annotation_col Column name or vector for row/column annotation NULL
annotation_side "auto", "row", or "col" "auto"
show_row_names Show sample row labels FALSE
show_col_names Show cytokine column labels TRUE
cluster_rows Cluster samples (rows) TRUE
cluster_cols Cluster cytokines (columns) TRUE
title Plot title, or a file path ending in .pdf/.png to save NULL

Interpreting the output: - The color scale represents cytokine expression after any transformation: blue = low, white = mid, red = high. - Row dendrogram (left) clusters samples by their cytokine expression profiles. Samples that cluster together share similar cytokine signatures. - Column dendrogram (top) clusters cytokines by co-expression. Cytokines close together in the column dendrogram are correlated across samples. - The annotation sidebar (if provided) color-codes samples by group, making it easy to see whether cluster structure aligns with biological groupings. - Hiding row names (show_row_names = FALSE) is recommended when there are many samples, as individual sample labels become unreadable and clutter the plot.

cyt_heatmap(
  data = data_df[, -c(2:3)],
  scale = "log2",
  annotation_col = "Group",
  annotation_side = "auto",
  show_row_names = FALSE,
  title = NULL
)


4.3 Dual-Flashlight Plot - cyt_dualflashplot()

Purpose: Simultaneously visualize effect size (SSMD - Strictly Standardized Mean Difference) and fold change for all cytokines between two groups. The dual-flashlight plot is particularly valuable for identifying cytokines with both a large effect size and substantial fold change, which are strong candidates for further investigation as biomarkers.

Why SSMD instead of p-values? Traditional p-values conflate effect size with sample size: a very large study can yield significant p-values for biologically trivial differences. SSMD quantifies effect size independently of sample size, making it suitable for prioritizing cytokines in panels of varying sizes.

Key data requirements: - A data frame with numeric cytokine columns and one grouping column. - group1 and group2 must be levels present in group_var.

Key parameters:

Parameter Description Default
group_var Grouping column name -
group1, group2 The two conditions to compare -
ssmd_thresh SSMD threshold for marking significance 1
log2fc_thresh log2 fold change threshold 1
top_labels Number of cytokines with largest SSMD
verbose Print the underlying statistics table FALSE

Interpreting the output: - x-axis: Average log2 fold change. Positive = higher in group1. - y-axis: SSMD. |SSMD| \(\ge\) 1 = strong effect; |SSMD| \(\ge\) 0.5 = moderate. - Point color: Effect strength category (strong = red, moderate = orange, weak = blue). - Point shape: Triangles indicate cytokines that exceed both the SSMD and log2FC thresholds simultaneously - these are the most compelling candidates. - Dashed lines: Mark the defined thresholds. - Cytokines in the upper-right and lower-left quadrants beyond the dashed lines are elevated in group1 or group2 respectively, with strong effect and substantial fold change.

data_dfp <- ExampleData1[, -c(2:3)]

dfp <- cyt_dualflashplot(
  data_dfp,
  group_var    = "Group",
  group1       = "T2D",
  group2       = "ND",
  ssmd_thresh  = 0.2,
  log2fc_thresh = 1,
  top_labels   = 10,
  verbose      = FALSE
)
dfp

# Inspect the underlying statistics
head(dfp$data[order(abs(dfp$data$ssmd), decreasing = TRUE), ], 10)
#> # A tibble: 10 × 11
#>    cytokine        mean_ND mean_PreT2D   mean_T2D variance_ND variance_PreT2D
#>    <chr>             <dbl>       <dbl>      <dbl>       <dbl>           <dbl>
#>  1 IL.6          4620.       5197.      8925.         2.86e+7         5.72e+7
#>  2 IL.27            0.0662      0.0834     0.106      6.18e-3         5.66e-3
#>  3 IL.12.P70       13.0        39.1       78.9        4.15e+2         2.56e+4
#>  4 IL.23            0.147       0.243      0.269      3.13e-2         9.37e-2
#>  5 CCL.20.MIP.3A  634.        404.       887.         6.72e+5         2.74e+5
#>  6 IL.2          9227.      10718.     16129.         2.60e+8         4.10e+8
#>  7 IL.17F           1.63        2.35       3.11       1.56e+1         3.37e+1
#>  8 IL.10          979.        836.      1366.         1.99e+6         1.19e+6
#>  9 IL.28A           0.0537      0.0710     0.0666     2.45e-3         5.10e-3
#> 10 IL.17A         352.        653.       615.         9.40e+5         2.88e+6
#> # ℹ 5 more variables: variance_T2D <dbl>, ssmd <dbl>, log2FC <dbl>,
#> #   Effect <fct>, Significant <lgl>

5. Multivariate Analysis

5.1 Principal Component Analysis - cyt_pca()

Purpose: Reduce the dimensionality of the cytokine data and visualize sample-level variation across groups. PCA identifies the axes (principal components) of greatest variance in the data without using group labels, making it an unsupervised method suitable for initial exploration.

Key data requirements: - At least one grouping column and multiple numeric cytokine columns. - A sufficient number of observations relative to the number of cytokines is recommended (at least n > p, ideally n > 3p).

Key parameters:

Parameter Description Default
group_col Primary grouping column -
group_col2 Second grouping column (stratifies analysis) NULL
colors Vector of colors for groups NULL (auto)
pch_values Vector of point shapes, one per group level required
comp_num Number of principal components to compute 2
ellipse Draw 95% confidence ellipses per group FALSE
scale Data transformation "none"
style Set to "3D" for a 3D scatter if comp_num = 3 NULL
output_file File path to save all plots -

Interpreting the output: - Individuals plot: Each point is one sample. Points close together have similar cytokine profiles. Clear separation between colored groups suggests a discriminating cytokine signature exists. - Scree plot: Shows the percentage of variance explained by each component (individual, blue) and cumulatively (dashed green). Choose the number of components that capture \(\ge\) 70-80% of variance. - Loadings plot: Shows which cytokines contribute most to each component. Long bars = large contribution. The sign indicates direction (positive loading = higher values push samples in the positive direction along that component). - Biplot: Overlays sample scores and cytokine loadings. Cytokine arrows pointing toward a group cluster indicate those cytokines are elevated in that group. - Correlation circle: Cytokines near the edge of the circle are well represented by the selected two components. Cytokines close together are positively correlated; those opposite are negatively correlated.

data_pca <- ExampleData1[, -c(3, 23)]
data_pca <- dplyr::filter(data_pca, Group != "ND" & Treatment != "Unstimulated")

pca_results <- cyt_pca(
  data_pca,
  output_file  = NULL,
  colors       = c("black", "red2"),
  scale        = "log2",
  comp_num     = 2,
  pch_values   = c(16, 4),
  group_col    = "Group",
  ellipse      = TRUE
)
pca_results$overall_indiv_plot
pca_results$scree_plot

pca_results$loadings$Comp1()
pca_results$correlation_circle()


5.2 Sparse PLS-DA - cyt_splsda()

Purpose: A supervised multivariate method that finds linear combinations of cytokines that maximally discriminate between groups, while simultaneously selecting the most discriminating cytokines (sparsity). Unlike PCA, sPLS-DA uses group labels to guide the component construction.

Key data requirements: - At least two groups in group_col. - Multiple numeric cytokine columns. - For reliable cross-validation, \(\ge\) 10 samples per group is recommended.

Key parameters:

Parameter Description Default
group_col Primary grouping column -
group_col2 Treatment/stratification column (runs analysis per level) NULL
var_num Number of cytokines selected per component required
comp_num Number of components 2
pch_values Point shapes, one per group required
cv_opt Cross-validation: "loocv" or "Mfold" NULL
fold_num Number of folds for Mfold CV 5
tune Auto-tune var_num and comp_num via CV FALSE
ellipse Draw 95% confidence ellipses FALSE
bg Draw prediction background FALSE
roc Plot ROC curves FALSE
scale Data transformation "none"

Interpreting the output: - Classification plot (individuals plot): Labelled with accuracy (% training samples correctly classified). Clear separation indicates the model successfully identified a discriminating cytokine signature. - Loadings plot: Shows which cytokines load most strongly on each component and in which group they are elevated (color of the bar). The contrib = "max" setting colors each bar by the group with the highest mean for that cytokine. - VIP scores plot: Variable Importance in Projection scores for each component. Cytokines with VIP > 1 are considered important discriminators and are used to fit a refined model. - VIP > 1 classification plot: A second model fitted using only VIP > 1 cytokines. Comparing accuracy between the full and VIP-filtered model shows whether the signature can be simplified. - Cross-validation error plot (if cv_opt is set): Error rate across components. The optimal number of components is where the error rate plateaus or reaches its minimum.

data_spls <- ExampleData1[, -c(3)]
data_spls <- dplyr::filter(data_spls, Group != "ND" & Treatment == "CD3/CD28")

spls_results <- cyt_splsda(
  data_spls,
  output_file  = NULL,
  colors       = c("black", "purple"),
  scale        = "log2",
  ellipse      = TRUE,
  var_num      = 25,
  cv_opt       = "loocv",
  comp_num     = 2,
  pch_values   = c(16, 4),
  group_col    = "Group",
  group_col2   = "Treatment",
  roc          = FALSE,
  verbose      = FALSE
)
spls_results$overall_indiv_plot
spls_results$vip_indiv_plot

spls_results$loadings$Comp1()
spls_results$vip_scores$Comp1()


5.3 MINT sPLS-DA - cyt_mint_splsda()

Purpose: An extension of sPLS-DA for datasets collected across multiple batches or studies. MINT (Multivariate INTegration) models a global biological signal while accounting for batch-specific variation, making it appropriate when data cannot be combined directly due to technical differences between collection sites or time points.

Key data requirements: - Same as cyt_splsda(), plus a batch_col column identifying which batch/study each sample belongs to. - At least two batches are required; each batch should contain samples from all groups if possible.

Key parameters: Same as cyt_splsda(), plus:

Parameter Description Default
batch_col Column identifying batch/study membership required

Interpreting the output: - Identical to cyt_splsda() outputs, with the addition of partial plots (one per batch), which show how well the global model fits within each individual batch. - The global individuals plot should show separation between groups even when batch effects are present - if groups overlap only in the global plot but separate in partial plots, the batch effect is dominating the signal.

data_mint <- ExampleData5[, -c(2, 4)]
data_mint <- dplyr::filter(data_mint, Group != "ND")

mint_results <- cyt_mint_splsda(
  data_mint,
  group_col  = "Group",
  batch_col  = "Batch",
  colors     = c("black", "purple"),
  ellipse    = TRUE,
  var_num    = 25,
  comp_num   = 2,
  scale      = "log2",
  verbose    = FALSE
)
mint_results$global_indiv_plot
mint_results$partial_indiv_plot

mint_results$global_loadings_plots$Comp1()


6. Machine Learning Models

Recommended sample size: For reliable performance estimates, machine learning methods require larger datasets. A minimum of 20 samples per class is recommended; cross-validation estimates become stable with \(\ge\) 50 samples per class.

6.1 XGBoost Classification - cyt_xgb()

Purpose: Train a gradient-boosted tree classifier on cytokine data for multi-class or binary group classification. XGBoost is well suited to high-dimensional cytokine panels and provides feature importance as a by-product.

Key data requirements: - One grouping column (the outcome) and multiple numeric cytokine predictors. - Grouping column will be encoded to numeric labels internally (0-indexed).

Key parameters:

Parameter Description Default
group_col Outcome column name -
train_fraction Proportion used for training 0.7
nrounds Number of boosting rounds 500
max_depth Maximum tree depth 6
learning_rate Step size shrinkage 0.1
cv Run k-fold cross-validation FALSE
nfold Number of CV folds 5
objective Loss function: "multi:softprob" or "binary:logistic" "multi:softprob"
eval_metric Evaluation metric during training "mlogloss"
plot_roc Plot ROC curve (binary only) FALSE
top_n_features Number of features shown in importance plot 10
scale Data transformation "none"

Note on Ckmeans.1d.dp: The xgb.ggplot.importance() function optionally uses the Ckmeans.1d.dp package for a clustered importance plot. If this package is not installed, cyt_xgb() will automatically fall back to a standard bar chart. Install it with install.packages("Ckmeans.1d.dp") for the enhanced plot.

Interpreting the output: - Confusion matrix: Rows = actual classes; columns = predicted classes. The diagonal shows correct classifications. Off-diagonal values are misclassifications. - Feature importance plot: Shows the top top_n_features cytokines ranked by Gain (the improvement in the loss function attributable to each feature). Cytokines with higher Gain are more important for classification. - Cross-validation accuracy (if cv = TRUE): Provides an estimate of out-of-sample performance less susceptible to overfitting than training accuracy alone. - ROC curve (binary only): Area Under the Curve (AUC) summarizes classifier performance across all thresholds. AUC = 1 is perfect; AUC = 0.5 is no better than chance.

data_ml <- data.frame(ExampleData1[, 1:3], log2(ExampleData1[, -c(1:3)]))
data_ml <- data_ml[, -c(2:3)]
data_ml <- dplyr::filter(data_ml, Group != "ND")

cyt_xgb(
  data          = data_ml,
  group_col     = "Group",
  nrounds       = 250,
  max_depth     = 4,
  learning_rate = 0.05,
  nfold         = 5,
  cv            = FALSE,
  objective     = "multi:softprob",
  eval_metric   = "mlogloss",
  top_n_features = 10,
  verbose       = 0,
  plot_roc      = FALSE,
  print_results = FALSE,
  seed          = 123
)


6.2 Random Forest Classification - cyt_rf()

Purpose: Train a random forest ensemble classifier on cytokine data. Random forests are robust to outliers and correlated predictors, and naturally provide variable importance through the mean decrease in Gini impurity.

Key parameters:

Parameter Description Default
group_col Outcome column name -
ntree Number of trees 500
mtry Variables tried at each split 5
train_fraction Training proportion 0.7
run_rfcv Run cross-validation for feature selection TRUE
k_folds Folds for rfcv 5
plot_roc Plot ROC curve (binary only) FALSE
cv Run caret k-fold CV FALSE
scale Data transformation "none"
verbose Print performance metrics FALSE

Interpreting the output: - Variable importance plot: Cytokines ordered by Mean Decrease in Gini. Higher values indicate greater discrimination ability. This complements sPLS-DA loadings as a non-linear importance measure. - Cross-validation error curve (if run_rfcv = TRUE): Error rate versus number of variables. The elbow of this curve (where error stabilizes) suggests the minimum number of cytokines needed for good classification. This is useful for building parsimonious panels. - ROC curve (binary only): Same interpretation as in cyt_xgb().

cyt_rf(
  data       = data_ml,
  group_col  = "Group",
  ntree      = 500,
  mtry       = 4,
  k_folds    = 5,
  run_rfcv   = TRUE,
  plot_roc   = FALSE,
  verbose    = FALSE,
  seed       = 123
)


7. Saving Plots and Exporting Results

Saving plots from individual functions

All plotting functions accept an output_file argument. The file extension determines the format:

# Save all boxplots to a multi-page PDF
cyt_bp(data_df[, -c(1:3)], output_file = "boxplots.pdf", scale = "log2")

# Save grouped violin plots to PNG
cyt_violin(data_df[, -c(3, 5:28)], group_by = "Group",
           output_file = "violins.png")

Supported formats: .pdf, .png, .tiff, .svg.

Saving plots from multivariate functions using cyt_export()

Multivariate functions return lists of ggplot objects and plot-generating closures. Use cyt_export() to save them in bulk:

# Save all sPLS-DA plots to a PDF
cyt_export(
  plots    = list(
    spls_results$overall_indiv_plot,
    spls_results$loadings$Comp1,
    spls_results$vip_indiv_plot
  ),
  filename = "splsda_results",
  format   = "pdf"
)

# Or pass output_file directly to the function
cyt_splsda(..., output_file = "splsda_results.pdf")

Modifying returned plots

All functions return ggplot objects invisibly, enabling further customization:

p <- cyt_bp(data_df[, -c(1:3)], scale = "log2")
# Add a custom title to the first page
p[["page1"]] + ggplot2::ggtitle("My Custom Title") +
  ggplot2::theme(plot.title = ggplot2::element_text(size = 16))

Session Information

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.2.0       CytoProfile_0.2.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1     timeDate_4052.112    farver_2.1.2        
#>  [4] S7_0.2.1             fastmap_1.2.0        pROC_1.19.0.1       
#>  [7] caret_7.0-1          digest_0.6.39        rpart_4.1.24        
#> [10] timechange_0.4.0     lifecycle_1.0.5      survival_3.8-6      
#> [13] magrittr_2.0.4       compiler_4.5.1       rlang_1.1.7         
#> [16] sass_0.4.10          tools_4.5.1          utf8_1.2.6          
#> [19] plot3D_1.4.2         igraph_2.2.2         yaml_2.3.12         
#> [22] data.table_1.18.2.1  knitr_1.51           labeling_0.4.3      
#> [25] rARPACK_0.11-0       xgboost_3.2.0.1      plyr_1.8.9          
#> [28] RColorBrewer_1.1-3   BiocParallel_1.42.2  withr_3.0.2         
#> [31] purrr_1.2.1          nnet_7.3-20          grid_4.5.1          
#> [34] stats4_4.5.1         Ckmeans.1d.dp_4.3.5  e1071_1.7-17        
#> [37] future_1.69.0        ggplot2_4.0.2        globals_0.19.0      
#> [40] scales_1.4.0         iterators_1.0.14     MASS_7.3-65         
#> [43] cli_3.6.5            ellipse_0.5.0        rmarkdown_2.30      
#> [46] generics_0.1.4       otel_0.2.0           RSpectra_0.16-2     
#> [49] future.apply_1.20.2  reshape2_1.4.5       cachem_1.1.0        
#> [52] proxy_0.4-29         stringr_1.6.0        splines_4.5.1       
#> [55] parallel_4.5.1       matrixStats_1.5.0    vctrs_0.7.1         
#> [58] misc3d_0.9-2         hardhat_1.4.2        Matrix_1.7-4        
#> [61] jsonlite_2.0.0       ggrepel_0.9.7        listenv_0.10.0      
#> [64] foreach_1.5.2        tidyr_1.3.2          gower_1.0.2         
#> [67] jquerylib_0.1.4      recipes_1.3.1        glue_1.8.0          
#> [70] parallelly_1.46.1    codetools_0.2-20     mixOmics_6.32.0     
#> [73] lubridate_1.9.5      stringi_1.8.7        gtable_0.3.6        
#> [76] tibble_3.3.1         pillar_1.11.1        htmltools_0.5.9     
#> [79] randomForest_4.7-1.2 ipred_0.9-15         lava_1.8.2          
#> [82] R6_2.6.1             Rdpack_2.6.6         tcltk_4.5.1         
#> [85] evaluate_1.0.5       lattice_0.22-9       rbibutils_2.4.1     
#> [88] pheatmap_1.0.13      corpcor_1.6.10       bslib_0.10.0        
#> [91] class_7.3-23         Rcpp_1.1.1           gridExtra_2.3       
#> [94] nlme_3.1-168         prodlim_2025.04.28   xfun_0.56           
#> [97] ModelMetrics_1.2.2.2 pkgconfig_2.0.3