The goal of this vignette is to explain how SOAKED (Same/Other/All K-fold cross-validation Extension Downsampling) can be used with ResamplingSameOtherSizesCV to determine why there is a difference between Same/Other/All models.
We begin by simulating some regression data with two different noise levels.
library(data.table)
#>
#>Attaching package: 'data.table'
#>
#>The following object is masked from 'package:base':
#>
#> %notin%
#>
N <- 2400
abs.x <- 3*pi
set.seed(2)
grid.dt <- data.table(
x=seq(-abs.x, abs.x, length.out=201),
y=0)
x.vec <- runif(N, -abs.x, abs.x)
standard.deviation.vec <- c(
easy=0.1,
hard=1.7)
There are two standard deviation parameters for the simulation: easy (small noise) and hard (large noise). Below we simulate and plot the data.
reg.data.list <- list()
grid.signal.dt.list <- list()
sim_fun <- sin
for(difficulty in names(standard.deviation.vec)){
standard.deviation <- standard.deviation.vec[[difficulty]]
signal.vec <- sim_fun(x.vec)
y <- signal.vec+rnorm(N,sd=standard.deviation)
task.dt <- data.table(x=x.vec, y)
reg.data.list[[difficulty]] <- data.table(difficulty, task.dt)
grid.signal.dt.list[[difficulty]] <- data.table(
difficulty,
algorithm="ideal",
x=grid.dt$x,
y=sim_fun(grid.dt$x))
}
reg.data <- rbindlist(reg.data.list)
grid.signal.dt <- rbindlist(grid.signal.dt.list)
algo.colors <- c(
featureless="blue",
rpart="red",
ideal="black")
if(require(ggplot2)){
my_theme <- theme_bw(15)
ggplot()+
my_theme+
theme(panel.spacing=grid::unit(1, "cm"))+
geom_point(aes(
x, y),
fill="white",
color="grey",
data=reg.data)+
geom_line(aes(
x, y, color=algorithm),
linewidth=2,
data=grid.signal.dt)+
scale_color_manual(values=algo.colors)+
facet_grid(. ~ difficulty, labeller=label_both)
}
#>Loading required package: ggplot2
Above we see the simulated data, which represent regression problems in 1D. There is a panel for each difficulty level, with a grey dot for each training data point, and a black curve that represents the ideal prediction function (same in both difficulty levels).
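Because the ideal function equals the signal exactly, its root mean squared error on new data should be close to the noise standard deviation, which gives a rough lower bound for the errors of the learners used later. The following is a minimal base R sketch of that idea. It re-simulates data with the same settings instead of reusing the objects above, and the helper name `ideal_rmse` is made up for illustration.

```r
set.seed(2)
N <- 2400
abs.x <- 3*pi
standard.deviation.vec <- c(easy=0.1, hard=1.7)
# Hypothetical helper: RMSE of the ideal predictor sin(x) on fresh noisy data.
ideal_rmse <- function(standard.deviation){
  x <- runif(N, -abs.x, abs.x)
  y <- sin(x) + rnorm(N, sd=standard.deviation)
  sqrt(mean((y - sin(x))^2))
}
# One RMSE value per difficulty level, each close to its noise level.
(ideal.rmse.vec <- sapply(standard.deviation.vec, ideal_rmse))
```

No learner can be expected to do much better than these values on held-out data, so they are a useful reference when reading the error plots.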
In this section, we define a benchmark using these simulated data.
First, we create the SOAKED instance by setting sizes=0, and we use 10-fold CV.
SOAKED <- mlr3resampling::ResamplingSameOtherSizesCV$new()
SOAKED$param_set$values$sizes <- 0
SOAKED$param_set$values$folds <- 10
Next, we create and visualize two Tasks:
set.seed(1)
sim.meta.list <- list(
different=rbind(
reg.data[difficulty=="easy"][sample(.N, 400)],
reg.data[difficulty=="hard"][sample(.N, 200)]
)[, .(x,y,Subset=ifelse(difficulty=="easy", "large", "small"))],
iid_easy=reg.data[
difficulty=="easy"
][sample(.N, 120)][
, Subset := rep(c("large","large","small"), length.out=.N)
][, .(x,y,Subset)])
d_task_list <- list()
gg_list <- list()
for(sim.name in names(sim.meta.list)){
sim.i.dt <- sim.meta.list[[sim.name]]
sub_task <- mlr3::TaskRegr$new(
sim.name, sim.i.dt, target="y")
sub_task$col_roles$subset <- "Subset"
sub_task$col_roles$feature <- "x"
d_task_list[[sim.name]] <- sub_task
if(require("ggplot2")){
gg_list[[sim.name]] <- ggplot()+
my_theme+
ggtitle(paste("Task:", sim.name))+
geom_point(aes(
x, y),
shape=21,
color="black",
fill="white",
data=sim.i.dt)+
geom_line(aes(
x, y, color=algorithm),
data=grid.signal.dt)+
scale_color_manual(values=algo.colors)+
facet_grid(Subset~., labeller=label_both)
}
}
gg_list
#>$different
#>
#>$iid_easy
#>
The figures above show the two Tasks. Each Task has two subsets: large and small.
The subsets of the different task have different noise levels, so we expect to see significant test error differences whether we train using the same or different numbers of samples.
The subsets of the iid_easy task have the same noise level, so we expect to see significant test error differences only when training using different numbers of samples.
Below we create the benchmark grid.
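Before building the grid, it can help to work out how many training rows each kind of model gets. The base R sketch below does this arithmetic from the subset sizes used above (400 and 200 rows for different, 80 and 40 rows for iid_easy). It assumes that folds are assigned within each subset, so that every train set holds nine tenths of its subset, and that downsampling targets the smaller of the same and other train sets. Both are assumptions about the resampling, not something stated in this vignette.

```r
n.folds <- 10
sizes <- data.frame(
  task_id=rep(c("different", "iid_easy"), each=2),
  test.subset=rep(c("large", "small"), times=2),
  rows=c(400, 200, 80, 40))
# Train rows when using only the test subset's own data (9 of 10 folds).
sizes$same <- sizes$rows*(n.folds-1)/n.folds
# Train rows when using both subsets of the task.
sizes$all <- ave(sizes$same, sizes$task_id, FUN=sum)
# Train rows when using only the other subset.
sizes$other <- sizes$all - sizes$same
# Assumed downsampling target: the smaller of same and other.
sizes$smallest <- pmin(sizes$same, sizes$other)
sizes
```

Under these assumptions, the full-size models for a given test subset are trained on quite different numbers of rows, which is why a sample size effect is plausible in both tasks.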
reg.learner.list <- list(
if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
mlr3::LearnerRegrFeatureless$new())
#>Loading required namespace: rpart
(reg.bench.grid <- mlr3::benchmark_grid(
d_task_list,
reg.learner.list,
SOAKED))
#> task learner resampling
#> <char> <char> <char>
#>1: different regr.rpart same_other_sizes_cv
#>2: different regr.featureless same_other_sizes_cv
#>3: iid_easy regr.rpart same_other_sizes_cv
#>4: iid_easy regr.featureless same_other_sizes_cv
The benchmark includes both data sets, and two learners: rpart decision tree and featureless baseline.
if(require(future))plan("multisession")
#>Loading required package: future
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
#>Loading required package: lgr
#>
#>Attaching package: 'lgr'
#>
#>The following object is masked from 'package:ggplot2':
#>
#> Layout
#>
(reg.bench.result <- mlr3::benchmark(reg.bench.grid))
#>
#>── <BenchmarkResult> of 400 rows with 4 resampling run ─────────────────────────
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 different regr.rpart same_other_sizes_cv 100 0 0
#> 2 different regr.featureless same_other_sizes_cv 100 0 0
#> 3 iid_easy regr.rpart same_other_sizes_cv 100 0 0
#> 4 iid_easy regr.featureless same_other_sizes_cv 100 0 0
score_dt <- mlr3resampling::score(
reg.bench.result, mlr3::msr("regr.rmse"))
plot(score_dt)+my_theme
The figure above shows one dot for every train/test split. There are five rows of data per panel because, for each test subset, we train same/other/all models at full sample size, plus two downsampled models (all, and either same or other). Below we compute P-values using the full sample sizes.
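Before that, the five rows per panel and the 100 iterations per resampling run can be checked by enumerating the splits. This is a base R sketch only: the train set labels are made up for illustration and are not the labels used by mlr3resampling.

```r
# Labels below are hypothetical, chosen to mirror the description in the text.
train.set.vec <- c(
  "same (full)", "other (full)", "all (full)",
  "all (downsampled)", "same or other (downsampled)")
split.grid <- expand.grid(
  train.set=train.set.vec,
  test.fold=1:10,
  test.subset=c("large", "small"))
# Total number of train/test splits per task and learner.
nrow(split.grid)
# Number of train sets for each combination of test subset and test fold.
table(split.grid$test.subset, split.grid$test.fold)
```

With four task and learner combinations in the grid, this count is consistent with the 400 rows reported by the benchmark result.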
plist <- mlr3resampling::pvalue(score_dt)
plot(plist)+my_theme
Above we see that for the rpart learner, there are significant differences between same and other/all. Comparing same and other, we consistently see better predictions (smaller error values) for models with more training data. There are two possible explanations: the difference is due to the different sample sizes, or it is due to distributional differences between subsets.
In the next section, we show how downsampling can be used to determine which of these interpretations is consistent with the data.
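For intuition about P-values like the ones plotted above, one common way to compare two kinds of models across cross-validation folds is a paired t-test on the per-fold test errors. The base R sketch below shows that idea. Whether `mlr3resampling::pvalue` uses exactly this test is an assumption here, and the RMSE values are made up for illustration; they are not results from the benchmark.

```r
set.seed(3)
n.folds <- 10
# Made-up per-fold RMSE values (illustration only, not benchmark results).
# A shared fold effect makes the two vectors correlated, as in real CV.
fold.effect <- rnorm(n.folds, sd=0.05)
rmse.same <- 0.40 + fold.effect + rnorm(n.folds, sd=0.02)
rmse.other <- 0.30 + fold.effect + rnorm(n.folds, sd=0.02)
# Paired test: each fold contributes one difference, same minus other.
(test.result <- t.test(rmse.same, rmse.other, paired=TRUE))
test.result$p.value
```

Pairing by fold removes the variation that is shared between the two models on the same test fold, so a consistent difference can be detected even with only 10 folds.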
In this section, we use downsampling analysis to determine if the differences observed at full sample size are due to the different sample sizes, or to distributional differences between subsets.
In this simulation, we want to verify that SOAKED can detect that the two subsets of the iid_easy task are drawn iid from the same distribution.
dlist <- mlr3resampling::pvalue_downsample(score_dt[
algorithm=="rpart" & task_id=="iid_easy" & test.subset=="large"])
plot(dlist)+my_theme
In both test subsets (above and below), we see significant differences at full sample size (left) that disappear at the smallest sample size (right).
This is a clear sample size effect (no distributional difference between subsets), as expected for the iid_easy task.
dlist <- mlr3resampling::pvalue_downsample(score_dt[
algorithm=="rpart" & task_id=="iid_easy" & test.subset=="small"])
plot(dlist)+my_theme
In this simulation, we want to verify that SOAKED can detect that the two subsets of the different task have a real distributional difference that makes it easier to learn using data from the larger subset.
dlist <- mlr3resampling::pvalue_downsample(score_dt[
algorithm=="rpart" & task_id=="different" & test.subset=="large"])
plot(dlist)+my_theme
Both above and below (but especially above), we see significant differences at full sample size (left) that persist after downsampling (right). This is a clear indication of a distributional difference between subsets, as expected for the different task.
dlist <- mlr3resampling::pvalue_downsample(score_dt[
algorithm=="rpart" & task_id=="different" & test.subset=="small"])
plot(dlist)+my_theme
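The logic of the two downsampling analyses above can be reproduced outside of mlr3 with a small base R simulation. This is a sketch under stated assumptions, not the method used by the package: a binned-mean predictor stands in for rpart, the train sizes 72 and 36 are illustrative, and the helper names `bin_index`, `fit_bins`, `sim`, `test_rmse` and `one_rep` are hypothetical.

```r
set.seed(4)
abs.x <- 3*pi
breaks <- seq(-abs.x, abs.x, length.out=25)  # 24 equal-width bins
bin_index <- function(x) findInterval(x, breaks, all.inside=TRUE)
# Hypothetical stand-in learner: predict the mean of y within each bin of x.
fit_bins <- function(train){
  idx <- factor(bin_index(train$x), levels=seq_len(length(breaks)-1))
  bin.means <- as.numeric(tapply(train$y, idx, mean))
  bin.means[is.na(bin.means)] <- mean(train$y)  # empty bin: overall mean
  function(x) bin.means[bin_index(x)]
}
sim <- function(n, noise.sd){
  x <- runif(n, -abs.x, abs.x)
  data.frame(x, y=sin(x)+rnorm(n, sd=noise.sd))
}
test_rmse <- function(train, test){
  sqrt(mean((fit_bins(train)(test$x) - test$y)^2))
}
# One repetition: test on the large subset, train on same or other.
one_rep <- function(small.sd){
  large <- sim(72, 0.1)       # larger subset, always low noise
  small <- sim(36, small.sd)  # smaller subset, noise level varies
  test <- sim(200, 0.1)       # held-out data from the larger subset
  large.down <- large[sample(nrow(large), nrow(small)), ]
  c(same_full=test_rmse(large, test),
    same_down=test_rmse(large.down, test),
    other=test_rmse(small, test))
}
# Mean test RMSE over 50 repetitions, for each scenario.
mean.rmse <- rbind(
  iid=rowMeans(replicate(50, one_rep(small.sd=0.1))),
  different=rowMeans(replicate(50, one_rep(small.sd=1.7))))
mean.rmse
```

In the iid row, same_down and other are trained on equally sized samples from the same distribution, so any gap between same_full and other should be attributable to sample size. In the different row, a gap between same_down and other cannot be explained by sample size, because both train sets have the same number of rows.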
We have shown how SOAKED (cross-validation with subsets and downsampling) can be used to determine if there are differences in learnable and predictable patterns between subsets.
This code is needed to avoid an R CMD check NOTE about detritus in the temp directory.
if(require(future))plan("sequential")