Repeated Spatial Clustering of Point Data for Tidy Modeling
Source: R/repeated_spatial_cluster_sample.R
Repeated spatial cluster sampling splits the data into V groups using
partitioning (kmeans) or hierarchical (hclust) clustering of some variables,
typically spatial coordinates.
A resample of the analysis data works as in spatial_cluster_sample,
but with repeats.
The number of resamples is equal to v * repeats (folds times repeats);
resample sizes are not equal across folds and repeats.
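As a rough illustration of the idea (not the package's internal code), the sketch below clusters simulated coordinates with stats::kmeans once per repeat and treats each cluster as a fold. The data frame pts, the helper cluster_folds, and the choice to re-run the clustering for every repeat are assumptions made for this example.

```r
# Hypothetical point data; X and Y mimic the default `coords`
set.seed(42)
pts <- data.frame(X = runif(200), Y = runif(200))

v <- 5        # number of clusters (folds)
repeats <- 3  # number of repetitions

# Hypothetical helper: assign each row to one of v k-means clusters
cluster_folds <- function(data, coords, v) {
  stats::kmeans(data[, coords], centers = v)$cluster
}

# One clustering per repeat (an assumption); k-means starts from random
# centres, so the partition can differ between repeats
fold_ids <- lapply(seq_len(repeats), function(r) {
  cluster_folds(pts, c("X", "Y"), v)
})

# Each repeat contributes v folds, so the total is v * repeats
n_resamples <- sum(vapply(fold_ids, function(f) length(unique(f)), integer(1)))
n_resamples
```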
Usage
repeated_spatial_cluster_sample(
data = data,
v = 10,
repeats = 1,
coords = c("X", "Y"),
strata = NULL,
breaks = 4,
pool = 0.1,
spatial = FALSE,
clust_method = "kmeans",
dist_clust = NULL,
...
)
Arguments
- data
input data set; one of sp, sf, or a data.frame with X and Y as variables
- v
number of partitions of the data set or number of clusters
- repeats
number of repetitions of partition of data set
- coords
(vector) pair of coordinate column names, used if the data are aspatial (data.frame)
- strata
(character) stratification variable; the default is NULL, as the method does not yield good results when stratified by class/strata
- breaks
(integer) A single number giving the number of bins desired to stratify a numeric stratification variable
- pool
(numeric) A proportion of data used to determine if a particular group is too small and should be pooled into another group. Default is 0.1, as in vfold_cv
- spatial
(logical) if data set is spatial (when sf or sp) or aspatial (data.frame)
- clust_method
one of partitioning (default = kmeans) or one of hierarchical methods (hclust)
- dist_clust
the agglomeration method to be used. This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC). The dist_clust argument corresponds to the method argument of stats::hclust
- ...
currently not used
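For the hierarchical option, the relationship between dist_clust and stats::hclust can be sketched as follows. This is a minimal example on simulated points; the data frame pts is hypothetical, and cutting the tree into v groups with stats::cutree is an assumption about how clusters are derived, not a description of the package's code.

```r
# Hypothetical point data
set.seed(1)
pts <- data.frame(X = runif(100), Y = runif(100))

v <- 4
dist_clust <- "ward.D2"  # plays the role of `method` in stats::hclust

# Euclidean distances between points, then agglomerative clustering
d <- stats::dist(pts[, c("X", "Y")])
tree <- stats::hclust(d, method = dist_clust)

# Cut the dendrogram into v clusters (assumed step); each one is a fold
fold <- stats::cutree(tree, k = v)
table(fold)
```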
Value
A tibble with classes spatial_cv, rset, tbl_df, tbl, and data.frame.
The results include a column for the data split objects and one or more
identification variables.
For a single repeat, there will be one column called id that has a character
string with the fold identifier. For repeats, id is the repeat number and an
additional column called id2 contains the fold information (within repeat).
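The layout of the identification columns for repeats can be pictured with a small base R table. The label strings ("Repeat1", "Fold1") are hypothetical and only illustrate the id/id2 structure; the strings produced by the function may differ.

```r
v <- 3
repeats <- 2

# Hypothetical labels; id2 varies fastest, so folds are nested within repeats
ids <- expand.grid(
  id2 = sprintf("Fold%d", seq_len(v)),
  id = sprintf("Repeat%d", seq_len(repeats)),
  stringsAsFactors = FALSE
)[, c("id", "id2")]

ids
nrow(ids)  # one row per resample, v * repeats in total
```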
Details
The variables in the coords argument (if the input data is a data.frame), or
the coordinates extracted from sp or sf data, are used for clustering the data
into disjoint sets. These clusters are used as the folds for cross-validation.
Depending on how the data are distributed spatially, the folds may not contain
equal numbers of points.
The function is similar to repeated cross-validation or v-fold cross-validation
(vfold_cv), but for spatial data with clustering.
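How disjoint clusters translate into cross-validation folds can be sketched in base R. The example assumes that each cluster serves as the assessment set once while the remaining points form the analysis set, as in ordinary v-fold cross-validation; pts and splits are hypothetical names.

```r
# Hypothetical point data
set.seed(7)
pts <- data.frame(X = runif(60), Y = runif(60))
v <- 3

# Cluster membership doubles as fold membership
fold <- stats::kmeans(pts[, c("X", "Y")], centers = v)$cluster

# For fold k: assessment = rows in cluster k, analysis = all other rows
splits <- lapply(seq_len(v), function(k) {
  list(analysis = which(fold != k), assessment = which(fold == k))
})

# The assessment sets are disjoint and together cover every row
all_assessment <- sort(unlist(lapply(splits, `[[`, "assessment")))

# Fold sizes depend on the spatial distribution and are usually unequal
assessment_sizes <- vapply(splits, function(s) length(s$assessment), integer(1))
assessment_sizes
```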
References
A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.
Julia Silge (2021). spatialsample: Spatial Resampling Infrastructure. https://github.com/tidymodels/spatialsample, https://spatialsample.tidymodels.org.
Julia Silge, Fanny Chow, Max Kuhn and Hadley Wickham (2021). rsample: General Resampling Infrastructure. R package version 0.1.1. https://CRAN.R-project.org/package=rsample
Examples
if (FALSE) {
  data("landcover")
  # spatial = TRUE, so coords is left as NULL
  rscv <- repeated_spatial_cluster_sample(
    data = landcover, coords = NULL, v = 10, repeats = 5,
    spatial = TRUE, clust_method = "kmeans", dist_clust = NULL,
    breaks = 4, pool = 0.1
  )
  rscv
}