Buffering • spatialsample

The goal of spatialsample is to provide functions and classes for spatial resampling to use with rsample. Keeping the data used to train a spatial model (what we call a training or analysis set) separate from the data used to evaluate that model (what we call a testing or assessment set) provides a more realistic view of how well it will perform when extrapolating to new locations.

When resampling spatial data, we often want to introduce some distance between analysis and assessment data. One of the most common ways to do this is to “buffer” the assessment set: any points within a given distance of the assessment data are removed from the analysis set, which enforces a minimum space between the two sets. This vignette walks through how to buffer your assessment folds with spatialsample, as well as some considerations about how those buffers are calculated.

To begin, let’s load spatialsample:

library(spatialsample)

Exclusion buffers

By default, most spatial cross-validation methods in spatialsample don’t automatically create buffer zones. Take for instance spatial_block_cv(), which creates a number of “blocks” in a grid and assigns each observation to a fold based on the block its centroid falls in:

set.seed(123)
blocks <- spatial_block_cv(boston_canopy, v = 5)

autoplot(blocks)

If we look at the individual folds, we can see that the assessment data directly borders the analysis data for each given fold:

purrr::walk(blocks$splits, function(x) print(autoplot(x)))

This is a downside of standard blocked cross-validation approaches: while they do introduce some spatial separation between the analysis and assessment sets for data in the middle of a block, data near the edges may not be separated at all.

Applying an exclusion buffer around each assessment fold lets us change that. To create these exclusion buffers while using any cross-validation function in spatialsample, we can use a standardized buffer argument:

set.seed(123)
blocks <- spatial_block_cv(boston_canopy, v = 5, buffer = 1500)

Now when we plot the folds separately, we can see that a strip of data around each assessment block has been assigned to neither the analysis nor the assessment set. Instead, it’s been removed entirely in order to provide some distance between the two sets:

purrr::walk(blocks$splits, function(x) print(autoplot(x)))
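The effect of the buffer can also be checked numerically rather than visually. The sketch below is a minimal illustration that counts the rows in each fold with and without a buffer; count_rows() is a hypothetical helper written for this example, and resetting the seed before each call is assumed to keep the block assignments comparable between the two resamples.

```r
library(spatialsample)

# Same resampling scheme, without and with an exclusion buffer
set.seed(123)
unbuffered <- spatial_block_cv(boston_canopy, v = 5)
set.seed(123)
buffered <- spatial_block_cv(boston_canopy, v = 5, buffer = 1500)

# Hypothetical helper: number of rows in the analysis and assessment sets
count_rows <- function(split) {
  c(
    analysis = nrow(rsample::analysis(split)),
    assessment = nrow(rsample::assessment(split))
  )
}

# One row per fold, one column per set
unbuffered_counts <- t(sapply(unbuffered$splits, count_rows))
buffered_counts <- t(sapply(buffered$splits, count_rows))

# Observations that fall in neither set are the ones the buffer removed
dropped <- nrow(boston_canopy) - rowSums(buffered_counts)
dropped
```

Any non-zero entry in dropped corresponds to observations that were excluded from that fold entirely.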

By default, buffer is assumed to be in the same units as your data, as determined by the data’s coordinate reference system. To apply a buffer in other units, use the units package to state explicitly which units the buffer is in.

For instance, boston_canopy uses units of US feet for distance. To specify a buffer in meters instead, we can use:

set.seed(123)
blocks <- spatial_block_cv(
  boston_canopy,
  v = 5,
  buffer = units::as_units(1500, "m")
)

purrr::walk(blocks$splits, function(x) print(autoplot(x)))
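To get a rough sense of how a metric buffer compares with the earlier unitless one, the conversion can be done by hand with the units package. This is only a sketch: it converts to international feet ("ft"), which is assumed to be close enough to the US feet used by the data for a ballpark comparison.

```r
# A 1500 m buffer, expressed explicitly in meters
buffer_m <- units::as_units(1500, "m")

# Convert to international feet for comparison with the unitless
# buffer = 1500 used earlier (assumption: "ft" approximates the
# data's US feet well enough for a rough check)
buffer_ft <- units::set_units(buffer_m, "ft", mode = "standard")
buffer_ft
```

The converted value is a little over three times 1500, so the buffer specified in meters is considerably wider than the unitless one.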

Note that, when you’re using non-point data, the distance between observations is calculated as the shortest distance between any points in two observations. For instance, buffers on polygon data will exclude data based on the edge-to-edge distance between observations, rather than centroid to centroid.
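The difference between the two distance definitions can be seen with a toy example. The sketch below uses sf directly on two made-up unit squares (not on boston_canopy); square() is a hypothetical helper for this illustration, and the geometries have no coordinate reference system, so distances are plain numbers.

```r
library(sf)

# Hypothetical helper: a unit square whose lower-left corner is (xmin, 0)
square <- function(xmin) {
  st_polygon(list(rbind(
    c(xmin, 0), c(xmin + 1, 0), c(xmin + 1, 1), c(xmin, 1), c(xmin, 0)
  )))
}

# Two squares separated by a gap of 2 along the x axis
squares <- st_sfc(square(0), square(3))

# Shortest distance between any parts of the two polygons
edge_distance <- as.numeric(st_distance(squares[1], squares[2]))

# Distance between the two centroids
centroid_distance <- as.numeric(
  st_distance(st_centroid(squares[1]), st_centroid(squares[2]))
)

edge_distance     # 2
centroid_distance # 3
```

With these squares, a distance threshold of 2.5 would catch the second square when measured edge-to-edge but not when measured centroid-to-centroid.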

One special case, however, is when buffer is set to 0. In this case, spatialsample won’t apply a buffer at all. Polygons that share an edge are technically within 0 distance of each other when distance is measured edge-to-edge, but we think that buffer = 0 should intuitively mean zero (that is, no) buffer. If you want the buffer to capture only adjacent polygons, set buffer to a tiny, non-zero value:

set.seed(123)
blocks <- spatial_block_cv(
  boston_canopy,
  v = 5,
  buffer = 2e-200
)

purrr::walk(blocks$splits, function(x) print(autoplot(x)))

Inclusion radii

In addition to exclusion buffers, spatialsample also provides a way to add an inclusion buffer (or as we call it, an “inclusion radius”) around your assessment set. Simply set the radius argument in any spatial cross-validation function to your desired distance, and any data within that inclusion radius will be added to the assessment set:

set.seed(123)
blocks <- spatial_block_cv(
  boston_canopy,
  v = 5,
  radius = 2e-200
)

purrr::walk(blocks$splits, function(x) print(autoplot(x)))

This argument is handled the same way as buffer, with the same caveats:

Unless units are specified explicitly, radius is assumed to be in the same units as your data’s coordinate reference system.
Distances are calculated between the closest parts of observations.
Values of zero do not apply a radius.

Both radius and buffer can be specified at the same time. This makes it possible to implement, for instance, leave-one-disc-out cross-validation using spatialsample:

set.seed(123)
blocks <- spatial_buffer_vfold_cv(
  boston_canopy,
  v = nrow(boston_canopy),
  radius = 1500,
  buffer = 1500
)
purrr::walk(blocks$splits, function(x) print(autoplot(x)))

When both radius and buffer are specified, spatialsample first applies the inclusion radius to the original randomly selected assessment set, adding any data within the radius to the assessment set. Next, the exclusion buffer is applied to all the points in the new (post-radius) assessment set, removing any data within the buffer from the analysis set.

Note that this means that buffer is not simply applying a “doughnut” around the circular “radius”, but is buffering each test point separately. See for instance the non-uniform buffer region that happens when there’s a gap in the data:

autoplot(blocks$splits[[12]])

This leaves more data in your analysis set for fitting the model, while still ensuring your assessment data is spatially removed.
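The two-step order described above can be sketched in base R for simple point data. This is an illustration only, not spatialsample’s implementation: apply_radius_then_buffer() is a hypothetical helper, and treating points exactly at the threshold distance as “within” it is an assumption of the sketch.

```r
# Hypothetical helper: radius first, then buffer, on point coordinates
apply_radius_then_buffer <- function(coords, seed_idx, radius, buffer) {
  d <- as.matrix(dist(coords))

  # Step 1: the inclusion radius pulls points near the seed(s)
  # into the assessment set
  in_radius <- apply(d[, seed_idx, drop = FALSE] <= radius, 1, any)
  assessment <- unname(which(in_radius))

  # Step 2: the exclusion buffer is measured from every point in the
  # post-radius assessment set, not just from the original seed(s)
  in_buffer <- apply(d[, assessment, drop = FALSE] <= buffer, 1, any)
  near_assessment <- unname(which(in_buffer))

  list(
    assessment = assessment,
    excluded = setdiff(near_assessment, assessment),
    analysis = setdiff(seq_len(nrow(coords)), near_assessment)
  )
}

# Five points on a line, with a gap between x = 0.5 and x = 1.8
coords <- cbind(x = c(0, 0.5, 1.8, 2.8, 4), y = 0)
result <- apply_radius_then_buffer(coords, seed_idx = 1, radius = 1, buffer = 2)

result$assessment # 1 2
result$excluded   # 3
result$analysis   # 4 5
```

The fourth point (x = 2.8) is within radius + buffer = 3 of the seed, so a single “doughnut” around the seed would have removed it. Because it is more than 2 away from every assessment point, it stays in the analysis set.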