---
vignette: >
  %\VignetteIndexEntry{2 Using subset with group and stratum}
  %\VignetteEngine{litedown::vignette}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
litedown::reactor(
  print=NA,
  collapse = TRUE,
  comment = "#>",
  fig.width=10,
  fig.height=6)
data.table::setDTthreads(1)
```

In `mlr3`, a data set is represented as a `Task` with meta-data called roles which define how columns should be used for learning and evaluation.
The goal of this vignette is to explain how to use the new `subset` role with previous `group` and `stratum` roles.

# Introduction: role definitions

Previous column roles in `mlr3` are:

* `stratum`: proportions should be respected when splitting.
* `group`: rows that must stay together when splitting.

New column role proposed in `mlr3resampling` is:

* `subset`: one value is test, same/other/all values are train.

# Example using AZtrees data

The `AZtrees` data represent an image segmentation problem.
Each row is a pixel, which we want to classify as tree or not (`stratum` role).
Data were labeled by drawing polygons on a map (`group` role).
Data are divided into either 3 or 4 regions (`subset` role).

```{r}
data(AZtrees, package="mlr3resampling")
library(data.table)
options(datatable.print.keys=FALSE, datatable.print.class=FALSE)
AZtrees[, .(region3, polygon, y)]
```

Above we see the three columns that are relevant for these roles in these data.
Below we count the number of rows for each polygon,

```{r}
dcast(AZtrees, polygon ~ y, length)[order(`Not tree`, -Tree)]
```

Above we see that there are several big polygons with no trees (hundreds of rows), whereas the polygons for presence of trees are smaller (max 45 rows).
Below we define a helper function which creates a `Task` and assigns roles based on arguments.

```{r}
trees <- function(..., cv=NULL, folds=3L){
  trees_task <- mlr3::as_task_classif(AZtrees, target="y")
  role_list <- list(...)
  for(role in names(role_list)){
    trees_task$col_roles[[role]] <- role_list[[role]]
  }
  if(is.null(cv))cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
  cv$param_set$values$folds <- folds
  cv$instantiate(trees_task)
  role_dt <- data.table(AZtrees)
  for(fold in 1:folds){
    set(role_dt, cv$test_set(fold), "fold", fold)
  }
  setkey(role_dt, polygon, fold)
  in_list <- list(
    rows=role_dt,
    groups=role_dt[, .(fold=paste(unique(fold), collapse=",")), by=.(y,polygon)])
  out_list <- list()
  for(unit in names(in_list)){
    in_dt <- in_list[[unit]]
    out_list[[unit]] <- rbind(in_dt[, table(y, fold)], TOTAL=in_dt[, table(fold)])
  }
  out_list
}
set.seed(1)
```

Above we set the seed to obtain reproducible results below (fold assignment is random).

## Cross-validation with no column roles

Below we show counts of rows and groups in each fold, when no column roles are set:

```{r}
trees()
```

We see in above that 

* in `rows`, there is about the same label distribution per fold (plus or minus 10 or 20 labels across folds), but not exactly the same (this can be achieved by setting `stratum` role). In `TOTAL` row we see the same number of rows per fold (plus or minus 1 because the total number of rows has a remainder when divided into the number of folds).
* in `groups`, rows from some polygons have been split into several different folds (`1,2` etc.), which is undesirable (and can be avoided by setting `group` role).

Below we do it again to emphasize the random variation that can occur (especially in label counts) when roles are not set:

```{r}
trees()
```

Below we compare with the previous `mlr3` code:

```{r}
trees(cv=mlr3::rsmp("cv"))
```

When no column roles are set, results are similar between previous `mlr3` and proposed `mlr3resampling`.

## Cross-validation using strata

Next, we illustrate the `stratum` column role.

```{r}
trees(stratum="y")
```

Above we see that

* in `rows`, both the label counts, and the `TOTAL`, are consistent across folds (plus or minus 1).
* in `groups`, there is large variation across folds, and some polygons have been split into different folds (column named `1,2` etc.).

Below we run the previous `mlr3` method for comparison.

```{r}
trees(stratum="y", cv=mlr3::rsmp("cv"))
```

Above we see the same patterns, except `rows` `TOTAL` is slightly less consistent (plus or minus 2 instead of 1).

## Cross-validation on polygons

Below we compute counts when setting `group` role to `polygon`:

```{r}
trees(group="polygon")
```

Above we see 

* in `rows`, there is substantial label distribution variation across folds (which can be avoided by setting `stratum` role). We see the `TOTAL` number of rows is equal across folds (again plus or minus 1).
* in `groups`, each polygon has been assigned to one fold (no `1,2` etc.), which is desirable. We see different numbers of `TOTAL` groups per fold, which is expected, because the algorithm attempts to equalize number or rows per fold, and there are different numbers or rows per group.

Below we do it again to emphasize the random variation that can occur (especially in label counts) in this case:

```{r}
trees(group="polygon")
```

Above we see `TOTAL` is the same, but label counts per fold have changed.

Next, we try regular CV from mlr3,

```{r}
trees(group="polygon", cv=mlr3::rsmp("cv"))
```

We see a very different pattern: `TOTAL=63` polygons per group, but there is large variation in the number of rows per fold.

Finally, we compare results for a larger number of folds.

```{r}
more.folds <- 5
trees(group="polygon", folds=more.folds)
trees(group="polygon", folds=more.folds, cv=mlr3::rsmp("cv"))
```

Above we see that `mlr3resampling` results in a fold with one polygon and one class (unusable), whereas `mlr3` yields both classes in every fold. This kind of failure only happens in data sets where the total number of rows is small, there is a very large group, a large number of folds, and `stratum` role not set (see fix below).

This section showed that the proposed `mlr3resampling` provides a new method for fold assignment using groups (total row counts equalized across folds), that is not the same as the previous method available in `mlr3` (group counts equalized across folds).

## Cross-validation on polygons with strata

Below we compute counts when setting `stratum` and `group` roles.

```{r}
trees(stratum="y", group="polygon")
```

Above we see

* in `rows`, the labels counts and `TOTAL` are consistent across folds (again plus or minus 1).
* in `groups`, each polygon has been assigned to one fold (no `1,2` etc.), which is desirable. We see different numbers of `TOTAL` groups per fold, which is expected, because the algorithm attempts to equalize number or rows per fold, and there are different numbers or rows per group.

Below we do it again to emphasize the random variation that does not occur in this case:

```{r}
trees(stratum="y", group="polygon")
```

Above we see consistent counts (no variation in counts across random seeds in these data).

Next, we try regular CV from mlr3,

```{r error=TRUE, purl=FALSE}
trees(stratum="y", group="polygon", cv=mlr3::rsmp("cv"))
```

Above we see an error, because `mlr3` does not implement the heuristic fold assignment algorithm for the case of groups and strata at the same time.

Finally, we compare results for a larger number of folds.

```{r}
trees(stratum="y", group="polygon", folds=more.folds)
```

We see above thet all folds have some data from both classes.
One fold is much larger than the others, because there is one group that is very large.

This comparison shows that the proposed `mlr3resampling` provides a new method for fold assignment using groups and strata, that was not previously available in `mlr3`.

## Cross-validation with subsets, strata, and groups

Below we define a task with all three roles:

```{r}
trees_task <- mlr3::as_task_classif(AZtrees, target="y")
trees_task$col_roles$subset <- "region3"
trees_task$col_roles$stratum <- c("region3","y")
trees_task$col_roles$group <- "polygon"
```

Note above that there are two strata variables: the subset variable `region3`, and the target variable `y`.
In general, both the target and subset variables should be used for `stratum`, so that folds are as balanced as possible.

Below we compute the row and group counts by label, fold, and subset.

```{r}
cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
cv$instantiate(trees_task)
role_dt <- data.table(AZtrees, cv$instance$fold.dt)
in_list <- list(
  rows=role_dt,
  groups=role_dt[, .(rows=.N), by=.(y, polygon, test.subset, fold)])
out_list <- list()
for(unit in names(in_list)){
  in_dt <- in_list[[unit]][, fsub := paste0(test.subset, fold)]
  out_list[[unit]] <- rbind(in_dt[, table(y, fsub)], TOTAL=in_dt[, table(fsub)])
}
out_list
```

Above we see that the label row counts in NE and NW are very similar across folds, but subset S is more variable, because there are so few polygons (two folds with only one negative polygon).

# Example using respiratory data

These are medical data,

* with a binary target `outcome` (values 0 or 1),
* person `id` used for group.

First we define a helper function,

```{r}
resp_counts <- function(Subset)if(requireNamespace("geepack")){
  data(respiratory, package="geepack")
  resp_dt <- data.table(respiratory)[, person := paste(center, id)]
  resp_task <- mlr3::as_task_classif(resp_dt, target="outcome")
  resp_task$col_roles$subset <- Subset
  resp_task$col_roles$stratum <- c(Subset, "outcome")
  resp_task$col_roles$group <- "person"
  cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
  cv$instantiate(resp_task)
  role_dt <- data.table(respiratory, cv$instance$fold.dt)[, subval := get(Subset)]
  in_list <- list(
    rows=role_dt,
    groups=role_dt[, .(rows=.N), by=.(outcome, id, fold, subval)])
  out_list <- list()
  for(unit in names(in_list)){
    in_dt <- in_list[[unit]][, fsub := paste0(subval, "_fold", fold)]
    out_list[[unit]] <- rbind(in_dt[, table(outcome, fsub)], TOTAL=in_dt[, table(fsub)])
  }
  out_list
}
```

## Are models consistent across sex?

To see if we can learn a model which can combine data across sex, or predict well on one sex after training on the other, we can use sex as `subset` role.

```{r}
resp_counts("sex")
```

The results above show row and group counts per label, sex, and fold.
We see that the code works well for generating balanced folds.

## Are models consistent across centers?

To see if we can learn a model which can combine data across centers, or predict well on one center after training on the other, we can use center as `subset` role.

```{r}
resp_counts("center")
```

The results above show row and group counts per label, center, and fold.
We see that the code works well for generating balanced folds.

# Conclusion

Overall these results show that the proposed resampling methods can handle complex cross-validation experiments, including previous roles (`group` and `stratum`) and a new `subset` role.