The goal of this vignette is to explain how SOAKED (Same/Other/All K-fold cross-validation Extension Downsampling) can be used with ResamplingSameOtherSizesCV to determine why Same/Other/All models differ in test error.

Simulations

We begin by simulating some regression data according to two different patterns.

library(data.table)
#>
#>Attaching package: 'data.table'
#>
#>The following object is masked from 'package:base':
#>
#>    %notin%
#>
N <- 2400
abs.x <- 3*pi
set.seed(2)
grid.dt <- data.table(
  x=seq(-abs.x,abs.x, l=201),
  y=0)
x.vec <- runif(N, -abs.x, abs.x)
standard.deviation.vec <- c(
  easy=0.1,
  hard=1.7)

There are two standard deviation parameters for the simulation: easy (small noise) and hard (large noise). Below we simulate and plot the data.

reg.data.list <- list()
grid.signal.dt.list <- list()
sim_fun <- sin
for(difficulty in names(standard.deviation.vec)){
  standard.deviation <- standard.deviation.vec[[difficulty]]
  signal.vec <- sim_fun(x.vec)
  y <- signal.vec+rnorm(N,sd=standard.deviation)
  task.dt <- data.table(x=x.vec, y)
  reg.data.list[[difficulty]] <- data.table(difficulty, task.dt)
  grid.signal.dt.list[[difficulty]] <- data.table(
    difficulty,
    algorithm="ideal",
    x=grid.dt$x,
    y=sim_fun(grid.dt$x))
}
reg.data <- rbindlist(reg.data.list)
grid.signal.dt <- rbindlist(grid.signal.dt.list)
algo.colors <- c(
  featureless="blue",
  rpart="red",
  ideal="black")
if(require(ggplot2)){
  my_theme <- theme_bw(15)
  ggplot()+
    my_theme+
    theme(panel.spacing=grid::unit(1, "cm"))+
    geom_point(aes(
      x, y),
      fill="white",
      color="grey",
      data=reg.data)+
    geom_line(aes(
      x, y, color=algorithm),
      linewidth=2,
      data=grid.signal.dt)+
    scale_color_manual(values=algo.colors)+
    facet_grid(. ~ difficulty, labeller=label_both)
}
#>Loading required package: ggplot2

Above we see the simulated data, which represent regression problems in 1D. There is a panel for each difficulty level, with a grey dot for each training data point, and a black curve that represents the ideal prediction function (same in both difficulty levels).
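
Because the noise is additive with mean zero, the ideal prediction function has an expected RMSE equal to the noise standard deviation, which serves as a reference for the best test error any learner could reach. The following is a minimal base R sketch of that reasoning; it re-simulates data with the same parameters as above, and the names ending in `.sketch` (and `ideal.rmse`) are introduced here for illustration only.

```r
# Sketch: RMSE of the ideal predictor sin(x) under each noise level.
set.seed(2)
N.sketch <- 2400
x.sketch <- runif(N.sketch, -3*pi, 3*pi)
sd.sketch <- c(easy=0.1, hard=1.7)
ideal.rmse <- sapply(sd.sketch, function(noise.sd){
  y <- sin(x.sketch) + rnorm(N.sketch, sd=noise.sd)
  # The residual of the ideal predictor is exactly the simulated noise.
  sqrt(mean((y - sin(x.sketch))^2))
})
ideal.rmse
```

The two values should be close to 0.1 (easy) and 1.7 (hard), up to sampling variation.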

mlr3 benchmark

In this section, we define a benchmark using these simulated data. First, we create the SOAKED instance by setting sizes=0, and we use 10-fold CV.

SOAKED <- mlr3resampling::ResamplingSameOtherSizesCV$new()
SOAKED$param_set$values$sizes <- 0
SOAKED$param_set$values$folds <- 10
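
To make the idea behind these settings concrete, here is a base R sketch of how same/other/all train sets, plus downsampled copies, could be enumerated for each test subset and fold. It is an illustration under stated assumptions, not the mlr3resampling implementation: it assumes that fold ids are assigned within each subset, and that each train set larger than the smallest one is downsampled once, to that smallest size. All names (`subset.vec`, `fold.vec`, `split.df`, and so on) are hypothetical.

```r
# Sketch only: enumerate same/other/all train sets with down-sampling.
set.seed(1)
subset.vec <- rep(c("large", "large", "small"), length.out=120)
n.folds <- 10
# Assumption: fold ids are assigned separately within each subset.
fold.vec <- ave(
  seq_along(subset.vec), subset.vec,
  FUN=function(i)sample(rep(1:n.folds, length.out=length(i))))
split.list <- list()
for(test.subset in unique(subset.vec)){
  for(test.fold in 1:n.folds){
    is.train <- fold.vec != test.fold
    train.list <- list(
      same=which(is.train & subset.vec == test.subset),
      other=which(is.train & subset.vec != test.subset),
      all=which(is.train))
    min.size <- min(lengths(train.list))
    for(train.subsets in names(train.list)){
      full.size <- length(train.list[[train.subsets]])
      # Full size, plus one down-sampled size if this set is not the smallest.
      for(n.train in unique(c(full.size, min.size))){
        split.list[[length(split.list)+1]] <- data.frame(
          test.subset, test.fold, train.subsets, n.train)
      }
    }
  }
}
split.df <- do.call(rbind, split.list)
nrow(split.df)
```

Under these assumptions there are five train sets for each of the 2 test subsets and 10 test folds, so 100 train/test splits in total, which agrees with the number of iterations that the benchmark reports per task and learner later in this vignette.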

Next, we create and visualize two Tasks:

set.seed(1)
sim.meta.list <- list(
  different=rbind(
    reg.data[difficulty=="easy"][sample(.N, 400)],
    reg.data[difficulty=="hard"][sample(.N, 200)]
  )[, .(x,y,Subset=ifelse(difficulty=="easy", "large", "small"))],
  iid_easy=reg.data[
    difficulty=="easy"
  ][sample(.N, 120)][
  , Subset := rep(c("large","large","small"), l=.N)
  ][, .(x,y,Subset)])
d_task_list <- list()
gg_list <- list()
for(sim.name in names(sim.meta.list)){
  sim.i.dt <- sim.meta.list[[sim.name]]
  sub_task <- mlr3::TaskRegr$new(
    sim.name, sim.i.dt, target="y")
  sub_task$col_roles$subset <- "Subset"
  sub_task$col_roles$feature <- "x"
  d_task_list[[sim.name]] <- sub_task
  if(require("ggplot2")){
    gg_list[[sim.name]] <- ggplot()+
      my_theme+
      ggtitle(paste("Task:", sim.name))+
      geom_point(aes(
        x, y),
        shape=21,
        color="black",
        fill="white",
        data=sim.i.dt)+
      geom_line(aes(
        x, y, color=algorithm),
        data=grid.signal.dt)+
      scale_color_manual(values=algo.colors)+
      facet_grid(Subset~., labeller=label_both)
  }
}
gg_list
#>$different
#>
#>$iid_easy
#>

The figures above show the two Tasks. Each Task has two subsets: large and small.

The two subsets in different have different noise levels, so we expect to see significant differences when training using either the same or different numbers of samples.
The two subsets in iid_easy have the same noise level, so we expect to see significant test error differences only when training using different numbers of samples.
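
These expectations depend on the subset sizes, which can be read off the task definitions. The short sketch below (variable names are illustrative) computes the subset sizes and the approximate number of train rows per subset when 9 of 10 folds are used for training.

```r
# Sketch: subset sizes implied by the task definitions above.
different.sizes <- c(large=400, small=200)
iid.easy.sizes <- c(table(
  rep(c("large", "large", "small"), length.out=120)))
# With 10-fold CV, nine tenths of each subset is available for training.
train.size.mat <- rbind(
  different=different.sizes,
  iid_easy=iid.easy.sizes) * 9 / 10
train.size.mat
```

In both tasks the large subset has twice as many rows as the small subset, so a sample size effect is possible in both, whereas only the different task also has a difference in noise level.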

Below we create the benchmark grid.

reg.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
  mlr3::LearnerRegrFeatureless$new())
#>Loading required namespace: rpart
(reg.bench.grid <- mlr3::benchmark_grid(
  d_task_list,
  reg.learner.list,
  SOAKED))
#>        task          learner          resampling
#>      <char>           <char>              <char>
#>1: different       regr.rpart same_other_sizes_cv
#>2: different regr.featureless same_other_sizes_cv
#>3:  iid_easy       regr.rpart same_other_sizes_cv
#>4:  iid_easy regr.featureless same_other_sizes_cv

The benchmark includes both data sets and two learners: the rpart decision tree and the featureless baseline.
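
A featureless baseline ignores x and predicts a constant, so its test error indicates what can be achieved without learning any pattern. The base R sketch below mimics that idea on re-simulated easy data, assuming the constant is the mean of the training targets; it does not call mlr3, and the names are illustrative.

```r
# Sketch: a constant (featureless-style) predictor on the easy simulation.
set.seed(2)
x.sketch <- runif(2400, -3*pi, 3*pi)
y.sketch <- sin(x.sketch) + rnorm(2400, sd=0.1)
is.train <- rep(c(TRUE, FALSE), length.out=2400)
# Predict the training-set mean for every test point, ignoring x.
constant.pred <- mean(y.sketch[is.train])
baseline.rmse <- sqrt(mean((y.sketch[!is.train] - constant.pred)^2))
# Over whole periods, sin(x) has mean 0 and variance 1/2, so the
# standard deviation of y is sqrt(1/2 + noise variance).
c(baseline=baseline.rmse, theory=sqrt(0.5 + 0.1^2))
```

A learner such as rpart is useful on these data only to the extent that its error falls below this baseline.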

if(require(future))plan("multisession")
#>Loading required package: future
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
#>Loading required package: lgr
#>
#>Attaching package: 'lgr'
#>
#>The following object is masked from 'package:ggplot2':
#>
#>    Layout
#>
(reg.bench.result <- mlr3::benchmark(reg.bench.grid))
#>
#>── <BenchmarkResult> of 400 rows with 4 resampling run ─────────────────────────
#> nr   task_id       learner_id       resampling_id iters warnings errors
#>  1 different       regr.rpart same_other_sizes_cv   100        0      0
#>  2 different regr.featureless same_other_sizes_cv   100        0      0
#>  3  iid_easy       regr.rpart same_other_sizes_cv   100        0      0
#>  4  iid_easy regr.featureless same_other_sizes_cv   100        0      0
score_dt <- mlr3resampling::score(
  reg.bench.result, mlr3::msr("regr.rmse"))
plot(score_dt)+my_theme

The figure above shows one dot for every train/test split. There are five rows of data per panel because we train same/other/all models at full sample size, plus two downsampled models (all, and either same or other). Below we compute P-values using the full sample sizes.

plist <- mlr3resampling::pvalue(score_dt)
plot(plist)+my_theme

Above we see that for the rpart learner, there are significant differences between same and other/all. Comparing same and other, we consistently see better predictions (smaller error values) for models with more training data. There are two possible explanations:

sample size effect: there is no distributional difference between subsets. The predictions are more accurate because the learning algorithm has seen more relevant training data.
subset effect: there is a distributional difference between subsets that makes learning and prediction easier in the larger subset.

In the next section, we show how downsampling can be used to determine which of these interpretations is consistent with the data.

Downsample analysis

In this section, we use downsample analysis to determine whether the differences observed at full sample size are due to the different sample sizes, or to distributional differences between subsets.
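
The logic can be sketched in base R on a small iid simulation. This is an illustration under stated assumptions, not what `pvalue_downsample` does internally: a smoothing spline stands in for the learner (the benchmark uses rpart), the other train set is the whole large subset rather than a cross-validation subset of it, and P-values come from a paired t-test over folds. All names are hypothetical.

```r
# Sketch only: sample size effect versus down-sampling, in base R.
set.seed(3)
n.folds <- 10
sim_subset <- function(n){
  x <- runif(n, -3*pi, 3*pi)
  data.frame(x, y=sin(x) + rnorm(n, sd=0.1))
}
small.df <- sim_subset(40)
large.df <- sim_subset(80) # same distribution as small.df
fold.vec <- sample(rep(1:n.folds, length.out=nrow(small.df)))
rmse_after_training <- function(train.df, test.df){
  fit <- smooth.spline(train.df$x, train.df$y)
  pred <- predict(fit, test.df$x)$y
  sqrt(mean((test.df$y - pred)^2))
}
err.mat <- t(sapply(1:n.folds, function(test.fold){
  test.df <- small.df[fold.vec == test.fold, ]
  same.df <- small.df[fold.vec != test.fold, ]
  # Down-sample the other subset to the size of the same train set.
  other.down.df <- large.df[sample(nrow(large.df), nrow(same.df)), ]
  c(same=rmse_after_training(same.df, test.df),
    other.full=rmse_after_training(large.df, test.df),
    other.down=rmse_after_training(other.down.df, test.df))
}))
# Paired t-tests over folds: same versus other, at full and reduced size.
p.full <- t.test(
  err.mat[, "same"], err.mat[, "other.full"], paired=TRUE)$p.value
p.down <- t.test(
  err.mat[, "same"], err.mat[, "other.down"], paired=TRUE)$p.value
c(full=p.full, downsampled=p.down)
```

If the only difference between subsets is their size, one would expect any difference between same and other to shrink once the train sizes are equal, although the result for a single seed and only 10 folds can vary.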

iid easy task

In this simulation, we want to verify that SOAKED can detect that the two subsets are iid from the same distribution.

dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="iid_easy" & test.subset=="large"])
plot(dlist)+my_theme

In both test subsets (above and below), we see significant differences at full sample size (left) that disappear at the smallest sample size (right). This is a clear sample size effect (no distributional difference between subsets), as expected for the iid_easy task.

dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="iid_easy" & test.subset=="small"])
plot(dlist)+my_theme

different task

In this simulation, we want to verify that SOAKED can detect that the two subsets have a real distributional difference that makes it easier to learn using data from the larger subset.

dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="different" & test.subset=="large"])
plot(dlist)+my_theme

Both above and below (but especially above), we see significant differences at full sample size (left), that persist after downsampling (right), which is a clear indication of a distributional difference between subsets, as expected for the different task.

dlist <- mlr3resampling::pvalue_downsample(score_dt[
  algorithm=="rpart" & task_id=="different" & test.subset=="small"])
plot(dlist)+my_theme

Conclusion

We have shown how SOAKED (cross-validation with subsets and downsampling) can be used to determine if there are differences in learnable and predictable patterns between subsets.

Stop future background workers

This code is needed to avoid an R CMD check NOTE about detritus in the temp directory.

if(require(future))plan("sequential")