Spatial clustering cross-validation splits the data into V disjoint sets by clustering points based on their spatial coordinates. In each resample, the analysis set consists of V-1 of the folds/clusters, while the assessment set contains the remaining fold/cluster.

## Usage

spatial_clustering_cv(
  data,
  coords,
  v = 10,
  cluster_function = c("kmeans", "hclust"),
  buffer = NULL,
  ...
)

## Arguments

data

A data frame or an sf object (often from sf::read_sf() or sf::st_as_sf()), to split into folds.

coords

A vector of variable names, typically spatial coordinates, used to partition the data into disjoint sets via clustering. This argument is ignored (with a warning) if data is an sf object.

v

The number of partitions of the data set.

cluster_function

Which function should be used for clustering? Options are either "kmeans" (to use stats::kmeans()) or "hclust" (to use stats::hclust()). You can also provide your own function; see Details.

radius

Numeric: points within this distance of the initially-selected test points will be assigned to the assessment set. If NULL, no radius is applied.

buffer

Numeric: points within this distance of any point in the test set (after radius is applied) will be assigned to neither the analysis nor the assessment set. If NULL, no buffer is applied.

...

Extra arguments passed on to stats::kmeans() or stats::hclust().
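The effect of the buffer argument described above can be illustrated with a small base R sketch. This is not the package's implementation: the toy points, the chosen assessment rows, and the use of `<=` for "within this distance" are all assumptions made for illustration.

```r
# Five made-up points on a line; rows 4 and 5 play the role of the assessment set.
pts <- data.frame(x = c(0, 1, 2, 10, 11), y = 0)
assessment <- c(4L, 5L)
buffer <- 8.5  # hypothetical buffer distance, in the same units as x and y

# Pairwise Euclidean distances between all points.
d <- as.matrix(stats::dist(pts))

# TRUE for every point within `buffer` of at least one assessment point.
near_assessment <- apply(d[, assessment, drop = FALSE] <= buffer, 1, any)

# Points inside the buffer that are not themselves assessment points are dropped.
in_buffer <- near_assessment & !(seq_len(nrow(pts)) %in% assessment)

# Everything outside the buffer stays in the analysis set.
analysis <- which(!near_assessment)

which(in_buffer)
analysis
```

Here row 3 lies 8 units from row 4, so it falls inside the buffer and is excluded from both sets, leaving rows 1 and 2 for analysis.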

## Value

A tibble with classes spatial_clustering_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id. Resamples created from non-sf objects will not have the spatial_rset class.

## Details

Clusters are created either from the distances between observations (if data is an sf object) or by clustering the variables in the coords argument. Each cluster is used as a fold for cross-validation. Depending on how the data are distributed spatially, the folds may not contain equal numbers of observations.
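The idea of using clusters as folds can be sketched in base R. The snippet below is only an illustration under stated assumptions (random toy coordinates, stats::kmeans() applied directly to the coordinate columns); it is not the code that spatial_clustering_cv() runs internally.

```r
set.seed(42)

# Made-up coordinates standing in for the columns named in `coords`.
toy <- data.frame(x = runif(30), y = runif(30))
v <- 3

# Cluster the coordinates; each cluster label becomes a fold label.
fold_id <- stats::kmeans(toy[, c("x", "y")], centers = v)$cluster

# For fold k, the assessment set is cluster k and the analysis set is the rest.
folds <- lapply(seq_len(v), function(k) {
  list(
    analysis = which(fold_id != k),
    assessment = which(fold_id == k)
  )
})

# Fold sizes depend on the spatial layout, so they are usually unequal.
table(fold_id)
```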

You can optionally provide a custom function to cluster_function. The function must take three arguments:

• dists, a stats::dist() object with distances between data points

• v, a length-1 numeric for the number of folds to create

• ..., to pass any additional named arguments to your function

The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
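A custom function meeting these requirements might look like the sketch below. The name average_linkage_clusters and the choice of average-linkage hierarchical clustering are hypothetical; the function accepts ... as required but, in this sketch, does not forward it.

```r
# Hypothetical custom clustering function: average-linkage hierarchical
# clustering, cut into v groups. Takes the three required arguments.
average_linkage_clusters <- function(dists, v, ...) {
  tree <- stats::hclust(dists, method = "average")
  # One integer cluster label per observation, in the original row order.
  stats::cutree(tree, k = v)
}

# Stand-alone check on made-up coordinates.
set.seed(123)
toy_coords <- data.frame(x = runif(20), y = runif(20))
assignments <- average_linkage_clusters(stats::dist(toy_coords), v = 4)

length(assignments)  # one label per row of toy_coords
```

Such a function would then be supplied as cluster_function = average_linkage_clusters in the call to spatial_clustering_cv().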

## References

A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.

## Examples

data(Smithsonian, package = "modeldata")
spatial_clustering_cv(Smithsonian, coords = c(latitude, longitude), v = 5)
#> #  5-fold spatial cross-validation
#> # A tibble: 5 × 2
#>   splits          id
#>   <list>          <chr>
#> 1 <split [19/1]>  Fold1
#> 2 <split [15/5]>  Fold2
#> 3 <split [10/10]> Fold3
#> 4 <split [18/2]>  Fold4
#> 5 <split [18/2]>  Fold5

smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  # Set CRS to WGS84
  crs = 4326
)

# When providing sf objects, coords are inferred automatically
spatial_clustering_cv(smithsonian_sf, v = 5)
#> #  5-fold spatial cross-validation
#> # A tibble: 5 × 2
#>   splits         id
#>   <list>         <chr>
#> 1 <split [19/1]> Fold1
#> 2 <split [18/2]> Fold2
#> 3 <split [19/1]> Fold3
#> 4 <split [5/15]> Fold4
#> 5 <split [19/1]> Fold5