Spatial clustering cross-validation splits the data into V disjoint groups by clustering points based on their spatial coordinates. Each resample uses V-1 of the folds/clusters as the analysis set and holds out the remaining fold/cluster as the assessment set.
Usage
spatial_clustering_cv(
  data,
  v = 10,
  cluster_function = c("kmeans", "hclust"),
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1,
  distance_function = function(x) as.dist(sf::st_distance(x))
)
Arguments
- data
An sf object (often from sf::read_sf() or sf::st_as_sf()) to split into folds.
- v
The number of partitions of the data set.
- cluster_function
Which function should be used for clustering? Options are either "kmeans" (to use stats::kmeans()) or "hclust" (to use stats::hclust()). You can also provide your own function; see Details.
- radius
Numeric: points within this distance of the initially-selected test points will be assigned to the assessment set. If NULL, no radius is applied.
- buffer
Numeric: points within this distance of any point in the test set (after radius is applied) will be assigned to neither the analysis nor the assessment set. If NULL, no buffer is applied.
- ...
Extra arguments passed on to stats::kmeans() or stats::hclust().
- repeats
The number of times to repeat the clustered partitioning.
- distance_function
Which function should be used for distance calculations? Defaults to sf::st_distance(), with the output matrix converted to a stats::dist() object. You can also provide your own function; see Details.
Value
A tibble with classes spatial_clustering_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.
Resamples created from non-sf objects will not have the spatial_rset class.
Details
Clusters are created based on the distances between observations if data is an sf object. Each cluster is used as a fold for cross-validation. Depending on how the data are distributed spatially, there may not be an equal number of observations in each fold.
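To illustrate why fold sizes can differ, here is a minimal base R sketch of cluster-based fold assignment. It is not the package's internal code: the coordinates are simulated (an assumption for illustration), and stats::hclust() with stats::cutree() stands in for the clustering step.

```r
set.seed(42)

# Simulated coordinates (hypothetical data): a dense group of 30 points
# near the origin plus 10 points scattered further away.
coords <- rbind(
  cbind(rnorm(30, mean = 0, sd = 0.2), rnorm(30, mean = 0, sd = 0.2)),
  cbind(runif(10, min = 5, max = 10), runif(10, min = 5, max = 10))
)

# Pairwise distances between observations, then one cluster per fold.
dists <- stats::dist(coords)
folds <- stats::cutree(stats::hclust(dists), k = 3)

# Number of observations that would land in each fold's assessment set.
fold_sizes <- table(folds)
print(fold_sizes)
```

Because the clusters follow the spatial layout of the points rather than a target size, the counts per fold are generally unbalanced.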
You can optionally provide a custom function to distance_function. The function should take an sf object and return a stats::dist() object with distances between data points.
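As a sketch of the contract described above, the following hypothetical function (the name euclidean_distance is an illustration, not part of the package) computes planar Euclidean distances from point coordinates. It assumes point geometries and that straight-line distance in the coordinate units is meaningful, which is usually only true for projected coordinates.

```r
# Hypothetical custom distance function: takes an sf object of points and
# returns a stats::dist() object of planar Euclidean distances.
euclidean_distance <- function(x) {
  stats::dist(sf::st_coordinates(x))
}

# Small illustrative data set with no CRS (coordinates are made up).
pts <- sf::st_as_sf(
  data.frame(x = c(0, 3, 0), y = c(0, 4, 8)),
  coords = c("x", "y")
)

d <- euclidean_distance(pts)
print(d)
```

A function like this could then be supplied as distance_function = euclidean_distance in the call to spatial_clustering_cv().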
You can optionally provide a custom function to cluster_function. The function must take three arguments:
- dists, a stats::dist() object with distances between data points
- v, a length-1 numeric for the number of folds to create
- ..., to pass any additional named arguments to your function
The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
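A minimal sketch of a function that satisfies these requirements is shown below. The name custom_cluster and the choice of average-linkage hierarchical clustering are assumptions for illustration; the function is exercised here on simulated distances so that the block runs with base R alone.

```r
# Hypothetical custom clustering function with the required signature.
# It ignores `...` and returns one cluster assignment per observation.
custom_cluster <- function(dists, v, ...) {
  tree <- stats::hclust(dists, method = "average")
  stats::cutree(tree, k = v)
}

# Check the contract on simulated points (made-up data).
set.seed(1)
pts <- matrix(runif(40), ncol = 2)
assignments <- custom_cluster(stats::dist(pts), v = 4)

length(assignments)  # one assignment per row of `pts`
```

It could then be passed as cluster_function = custom_cluster when calling spatial_clustering_cv().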
Changes in spatialsample 0.3.0
As of spatialsample version 0.3.0, this function no longer accepts non-sf objects as arguments to data. In order to perform clustering with non-spatial data, consider using rsample::clustering_cv().
Also as of version 0.3.0, this function now calculates edge-to-edge distance for non-point geometries, in line with the rest of the package. Earlier versions relied upon between-centroid distances.
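The difference between the two distance definitions can be sketched with a custom distance function. The helper centroid_distance below is hypothetical, and whether it reproduces results from earlier versions exactly is an assumption that the text above does not confirm. The toy polygons have no CRS, so distances are planar.

```r
# Hypothetical helper: centroid-to-centroid distances as a dist object.
centroid_distance <- function(x) {
  centroids <- sf::st_centroid(sf::st_geometry(x))
  stats::as.dist(sf::st_distance(centroids))
}

# Build a unit square whose lower-left corner sits at (x0, 0).
unit_square <- function(x0) {
  sf::st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two unit squares separated by a gap of 1 unit (made-up geometry).
polys <- sf::st_sf(geometry = sf::st_sfc(unit_square(0), unit_square(2)))

# Edge-to-edge distance, as computed by the default distance_function.
edge_d <- as.numeric(stats::as.dist(sf::st_distance(polys)))

# Centroid-to-centroid distance from the hypothetical helper.
centroid_d <- as.numeric(centroid_distance(polys))
```

Here the nearest edges are 1 unit apart while the centroids are 2 units apart, so the two definitions can lead to different clusters for non-point geometries.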
References
A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.
Examples
data(Smithsonian, package = "modeldata")
smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  # Set CRS to WGS84
  crs = 4326
)
# When providing sf objects, coords are inferred automatically
spatial_clustering_cv(smithsonian_sf, v = 5)
#> # 5-fold spatial cross-validation
#> # A tibble: 5 × 2
#>   splits         id
#>   <list>         <chr>
#> 1 <split [19/1]> Fold1
#> 2 <split [16/4]> Fold2
#> 3 <split [9/11]> Fold3
#> 4 <split [18/2]> Fold4
#> 5 <split [18/2]> Fold5
# Can use hclust instead:
spatial_clustering_cv(smithsonian_sf, v = 5, cluster_function = "hclust")
#> # 5-fold spatial cross-validation
#> # A tibble: 5 × 2
#>   splits         id
#>   <list>         <chr>
#> 1 <split [19/1]> Fold1
#> 2 <split [4/16]> Fold2
#> 3 <split [19/1]> Fold3
#> 4 <split [19/1]> Fold4
#> 5 <split [19/1]> Fold5
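The individual splits can be inspected with rsample::analysis() and rsample::assessment(). The sketch below rebuilds the example data so that it is self-contained; the seed value is arbitrary and is set only because k-means starts from random centers, so fold sizes may differ from those printed above.

```r
library(spatialsample)

data(Smithsonian, package = "modeldata")
smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  crs = 4326
)

# Arbitrary seed so the k-means fold assignment is repeatable.
set.seed(123)
folds <- spatial_clustering_cv(smithsonian_sf, v = 5)

# Pull the analysis and assessment rows for the first fold.
first_split <- folds$splits[[1]]
analysis_rows <- rsample::analysis(first_split)
assessment_rows <- rsample::assessment(first_split)

nrow(analysis_rows)
nrow(assessment_rows)
```

With no radius or buffer applied, every observation falls in exactly one of the two sets for each split.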