
Repeated spatial cluster sampling splits the data into V groups using partitioning (kmeans) or hierarchical (hclust) clustering of some variables, typically spatial coordinates.

A resample of the analysis data works as in spatial_cluster_sample but with repeats. The number of resamples is equal to the number of folds (v) multiplied by repeats; resample sizes generally differ across folds and repeats.

Usage

repeated_spatial_cluster_sample(
  data = data,
  v = 10,
  repeats = 1,
  coords = c("X", "Y"),
  strata = NULL,
  breaks = 4,
  pool = 0.1,
  spatial = FALSE,
  clust_method = "kmeans",
  dist_clust = NULL,
  ...
)

Arguments

data

input data set; one of sp, sf, or data.frame with X and Y as variables

v

number of partitions of the data set or number of clusters

repeats

number of times the partitioning of the data set is repeated

coords

(vector) pair of coordinate variables, used when the data are aspatial (data.frame)

strata

(character) strata variable; default is NULL, because stratification based on class/strata does not yield good results

breaks

(integer) A single number giving the number of bins desired to stratify a numeric stratification variable

pool

(numeric) A proportion of data used to determine if a particular group is too small and should be pooled into another group. Default is 0.1, as in vfold_cv

spatial

(logical) whether the data set is spatial (sf or sp) or aspatial (data.frame)

clust_method

clustering method: either partitioning (kmeans, the default) or hierarchical (hclust)

dist_clust

the agglomeration method to be used for hierarchical clustering. This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC). dist_clust corresponds to the method argument of stats::hclust

...

currently not used
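
To illustrate how an agglomeration method such as the one named in dist_clust can turn coordinates into groups, here is a minimal sketch using only base R. It is not this function's internal code: the toy data, the object names (pts, tree, folds), and the use of Euclidean distances via stats::dist are assumptions made for illustration.

```r
# Hypothetical toy data: 100 random points with X/Y coordinates.
set.seed(1)
pts <- data.frame(X = runif(100), Y = runif(100))

v <- 4  # number of groups to cut the tree into

# Pairwise Euclidean distances between points (an assumption of this sketch).
d <- stats::dist(pts[, c("X", "Y")])

# Hierarchical clustering; "ward.D2" is one of the agglomeration methods
# accepted by the method argument of stats::hclust.
tree <- stats::hclust(d, method = "ward.D2")

# Cut the dendrogram into v groups; each point receives a group label 1..v.
folds <- stats::cutree(tree, k = v)

# Group sizes follow the spatial structure and are usually unequal.
print(table(folds))
```

Swapping "ward.D2" for another listed method (for example "single" or "average") changes how clusters are merged and therefore the shape and size of the resulting groups.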

Value

A tibble with classes spatial_cv, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and one or more identification variables. For a single repeat, there will be one column called id that has a character string with the fold identifier. For repeats, id is the repeat number and an additional column called id2 contains the fold information (within repeat).
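
The layout of the identification columns for repeated resampling can be sketched as follows. This only mimics the id/id2 structure described above with a plain data.frame; the label format ("Repeat1", "Fold1") is a hypothetical choice for illustration and may not match the strings the function actually produces.

```r
v <- 3        # folds per repeat
repeats <- 2  # number of repeats

# One row per (repeat, fold) combination; fold varies fastest.
ids <- expand.grid(fold = seq_len(v), rep = seq_len(repeats))

# Hypothetical labels: id holds the repeat, id2 the fold within that repeat.
layout <- data.frame(
  id  = sprintf("Repeat%d", ids$rep),
  id2 = sprintf("Fold%d", ids$fold)
)

# The number of rows equals v * repeats.
print(layout)
```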

Details

The variables in the coords argument (if the input data is a data.frame), or the coordinates extracted from sp or sf data, are used to cluster the data into disjoint sets. These clusters are used as the folds for cross-validation. Depending on how the data are distributed spatially, the folds may not contain equal numbers of observations. The function is similar to repeated cross-validation or v-fold cross-validation (vfold_cv), but for spatial data with clustering.
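
The idea of using coordinate clusters as folds can be sketched with base R alone. This is an illustrative approximation under stated assumptions, not the function's implementation: the toy data and the names pts, fold_ids, and splits are hypothetical, and the sketch assumes each cluster in turn serves as the held-out (assessment) set.

```r
# Hypothetical toy data: 200 random points with X/Y coordinates.
set.seed(42)
pts <- data.frame(X = runif(200), Y = runif(200))

v <- 5        # number of clusters, used as folds
repeats <- 2  # number of times the clustering is repeated

# Each repeat re-runs k-means from new random starts, so assignments can differ.
fold_ids <- lapply(seq_len(repeats), function(r) {
  stats::kmeans(pts[, c("X", "Y")], centers = v)$cluster
})

# For the first repeat: hold out one cluster at a time, keep the rest for analysis.
splits <- lapply(seq_len(v), function(k) {
  list(analysis   = which(fold_ids[[1]] != k),
       assessment = which(fold_ids[[1]] == k))
})

# Fold sizes follow the spatial clusters and are generally unequal.
fold_sizes <- table(fold_ids[[1]])
print(fold_sizes)
```

Because the folds are spatially contiguous groups rather than random subsets, nearby points tend to be held out together.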

References

A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.

Julia Silge (2021). spatialsample: Spatial Resampling Infrastructure. https://github.com/tidymodels/spatialsample, https://spatialsample.tidymodels.org.

Julia Silge, Fanny Chow, Max Kuhn and Hadley Wickham (2021). rsample: General Resampling Infrastructure. R package version 0.1.1. https://CRAN.R-project.org/package=rsample

Examples

if (FALSE) {
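# Load the example data set
data("landcover")

# Use 10 k-means clusters as folds, repeated 5 times.
# coords is NULL because the data are treated as spatial (spatial = TRUE).
rscv <- repeated_spatial_cluster_sample(
  data = landcover, coords = NULL, v = 10,
  repeats = 5, spatial = TRUE, clust_method = "kmeans",
  dist_clust = NULL, breaks = 4, pool = 0.1
)

# Print the resulting set of resamples
rscv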
}