# (CAST) Repeated K-fold Nearest Neighbour Distance Matching

Source: `R/ResamplingRepeatedSpCVknndm.R`

`mlr_resamplings_repeated_spcv_knndm.Rd`

This function implements the kNNDM algorithm and returns the necessary indices to perform a k-fold NNDM CV for map validation.

## Details

kNNDM is a k-fold version of NNDM LOO CV for medium and large datasets. Briefly, the algorithm tries to find a k-fold configuration that minimises the integral of the absolute differences (the Wasserstein W statistic) between two functions: the empirical nearest neighbour distance distribution function between the test and training data during CV (Gj*), and the empirical nearest neighbour distance distribution function between the prediction and training points (Gij). It does so by clustering the training points' coordinates for different numbers of clusters ranging from k to N (the number of observations), merging them into k final folds, and selecting the configuration with the lowest W.
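To make the matching criterion concrete, here is a minimal base-R sketch (not the CAST implementation) that computes W for two samples of nearest neighbour distances as the integral of the absolute difference between their empirical distribution functions. The helper name `w_statistic` and the simulated distances are assumptions made for illustration only.

```r
# Hypothetical helper: integral of |F1(x) - F2(x)| over the pooled support.
w_statistic = function(d1, d2) {
  f1 = stats::ecdf(d1)
  f2 = stats::ecdf(d2)
  # Both ECDFs are right-continuous step functions that only change at
  # observed values, so the integral is an exact sum over the intervals
  # between consecutive sorted breakpoints.
  x = sort(unique(c(d1, d2)))
  left = x[-length(x)]
  sum(abs(f1(left) - f2(left)) * diff(x))
}

set.seed(1)
g_cv = rexp(200, rate = 1)     # stand-in for test-to-train distances during CV
g_pred = rexp(200, rate = 0.5) # stand-in for prediction-to-train distances

w_statistic(g_cv, g_pred) # positive: the two distributions differ
w_statistic(g_cv, g_cv)   # identical distributions give 0
```

A lower value means the distances encountered during CV resemble those encountered at prediction time more closely.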

Using a projected CRS in `knndm` has large computational advantages since fast nearest neighbour search can be done via the `FNN` package, while working with geographic coordinates requires computing the full spherical distance matrices. As a clustering algorithm, `kmeans` can only be used for projected CRS while `hierarchical` can work with both projected and geographical coordinates, though it requires calculating the full distance matrix of the training points even for a projected CRS.

To choose between clustering algorithms and the number of folds `k`, different `knndm` configurations can be run and compared; the configuration with the lowest W statistic offers the best match. W statistics are comparable between `knndm` runs as long as `tpoints` and `ppoints` or `modeldomain` stay the same.

Map validation with kNNDM should be performed using `CAST::global_validation`, i.e. by stacking all out-of-sample predictions and evaluating them all at once. The reasons are that 1) the resulting folds can be unbalanced and 2) the nearest neighbour functions are constructed and matched using all CV folds simultaneously.
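The effect of unbalanced folds can be seen in a small base-R sketch. It does not call `CAST::global_validation`; the data frame `oos` and its values are made up to contrast averaging per-fold scores with evaluating the stacked predictions once.

```r
# Hypothetical out-of-sample results from three unbalanced folds.
oos = data.frame(
  fold = rep(1:3, times = c(50, 10, 5)),
  truth = 1,
  response = c(rep(1, 45), rep(0, 5), rep(1, 5), rep(0, 5), rep(0, 5))
)
correct = oos$truth == oos$response

# Per-fold accuracy averaged across folds:
# the 5-observation fold weighs as much as the 50-observation fold.
per_fold = mean(tapply(correct, oos$fold, mean))

# Global evaluation: stack all predictions and score them once.
pooled = mean(correct)

c(per_fold = per_fold, pooled = pooled)
```

With these made-up numbers the fold-averaged accuracy is dominated by the two small folds, while the pooled accuracy weights every observation equally.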

If the training points are strongly clustered with respect to the prediction area and the resulting kNNDM configuration still shows signs of Gj* > Gij, there are several things to try. First, increase the `maxp` parameter; this may help to control for strong clustering (at the cost of unbalanced folds). Second, decrease the number of final folds `k`, which may help to obtain larger clusters.

The `modeldomain` is an sf polygon that defines the prediction area. The function takes a regular point sample (of size `samplesize`) from the spatial extent. Alternatively, use `ppoints` instead of `modeldomain` if you have already defined the prediction locations (e.g. raster pixel centroids). When using either `modeldomain` or `ppoints`, we advise plotting the study area polygon and the training/prediction points beforehand to ensure they are aligned.
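One way to carry out this alignment check is sketched below with `sf`. The square study area mirrors the one in the Examples section; the object names and sample sizes are assumptions for illustration.

```r
library(sf)

# Hypothetical square study area, built as in the Examples section
area = st_sfc(st_polygon(list(matrix(
  c(0, 0, 0, 100, 100, 100, 100, 0, 0, 0), ncol = 2, byrow = TRUE))))

set.seed(42)
train_pts = st_sample(area, 200, type = "random")
pred_pts = st_sample(area, 500, type = "regular") # approximately 500 points

# Visual check: polygon outline with both point sets overlaid
plot(area)
plot(pred_pts, add = TRUE, pch = 3, col = "grey50")
plot(train_pts, add = TRUE, pch = 16, col = "red")

# Programmatic check: every point should intersect the study area
pred_ok = all(st_intersects(pred_pts, area, sparse = FALSE))
train_ok = all(st_intersects(train_pts, area, sparse = FALSE))
c(pred_ok = pred_ok, train_ok = train_ok)
```

If either check fails, a mismatch in CRS or extent between the polygon and the points is a likely cause.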

## Parameters

`folds`

(`integer(1)`)

Number of folds.

`stratify`

If `TRUE`, stratify on the target column.

`repeats`

(`integer(1)`)

Number of repeats.

## References

Linnenbrink, J., Mila, C., Ludwig, M., Meyer, H. (2023). “kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation.” *EGUsphere*, **2023**, 1–16. doi:10.5194/egusphere-2023-1308, https://egusphere.copernicus.org/preprints/2023/egusphere-2023-1308/.

## Super class

`mlr3::Resampling` -> `ResamplingRepeatedSpCVKnndm`

## Active bindings

`iters`

`integer(1)`

Returns the number of resampling iterations, depending on the values stored in the `param_set`.

## Methods

### Method `new()`

Create a "K-fold Nearest Neighbour Distance Matching" resampling instance.

#### Usage

`ResamplingRepeatedSpCVKnndm$new(id = "repeated_spcv_knndm")`

### Method `folds()`

Translates iteration numbers to fold number.

#### Arguments

`iters`

`integer()`

Iteration number.

### Method `repeats()`

Translates iteration numbers to repetition number.

#### Arguments

`iters`

`integer()`

Iteration number.
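The two translation methods above can be illustrated with a small base-R sketch. It assumes that iterations are numbered consecutively within each repetition, as is the convention for repeated cross-validation in mlr3; the function names and the choice of 5 folds are hypothetical and are not taken from the package source.

```r
n_folds = 5L # assumed number of folds per repetition

# Iterations 1..n_folds belong to repetition 1, the next n_folds to repetition 2, ...
iter_to_fold = function(iters, folds = n_folds) ((iters - 1L) %% folds) + 1L
iter_to_repeat = function(iters, folds = n_folds) ((iters - 1L) %/% folds) + 1L

iters = 1:10
iter_to_fold(iters)   # fold index within each repetition
iter_to_repeat(iters) # repetition index
```

Under this assumption, iteration 7 with 5 folds corresponds to fold 2 of repetition 2.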

### Method `instantiate()`

Materializes fixed training and test splits for a given task.

#### Arguments

`task`

`Task`

A task to instantiate.

## Examples

```r
library(mlr3)
library(mlr3spatiotempcv) # provides rsmp("repeated_spcv_knndm")
library(mlr3spatial)
#>
#> Attaching package: ‘mlr3spatial’
#> The following objects are masked from ‘package:mlr3spatiotempcv’:
#>
#> TaskClassifST, TaskRegrST, as_task_classif_st,
#> as_task_classif_st.DataBackend, as_task_classif_st.TaskClassifST,
#> as_task_classif_st.data.frame, as_task_classif_st.sf,
#> as_task_regr_st, as_task_regr_st.DataBackend,
#> as_task_regr_st.TaskClassifST, as_task_regr_st.TaskRegrST,
#> as_task_regr_st.data.frame, as_task_regr_st.sf
set.seed(42)
simarea = list(matrix(c(0, 0, 0, 100, 100, 100, 100, 0, 0, 0), ncol = 2, byrow = TRUE))
simarea = sf::st_polygon(simarea)
train_points = sf::st_sample(simarea, 1000, type = "random")
train_points = sf::st_as_sf(train_points)
train_points$target = as.factor(sample(c("TRUE", "FALSE"), 1000, replace = TRUE))
pred_points = sf::st_sample(simarea, 1000, type = "regular")
task = mlr3spatial::as_task_classif_st(sf::st_as_sf(train_points), "target", positive = "TRUE")
cv_knndm = rsmp("repeated_spcv_knndm", ppoints = pred_points, repeats = 2)
cv_knndm$instantiate(task)
#> Warning: Missing CRS in training or prediction points. Assuming projected CRS.
#> Gij <= Gj; a random CV assignment is returned
#> Warning: Missing CRS in training or prediction points. Assuming projected CRS.
#> Gij <= Gj; a random CV assignment is returned
# Individual sets:
# cv_knndm$train_set(1)
# cv_knndm$test_set(1)
# check that no obs are in both sets
intersect(cv_knndm$train_set(1), cv_knndm$test_set(1)) # good!
#> integer(0)
# Internal storage:
# cv_knndm$instance # table
```