Spatial Clustering of Point Data for Machine Learning of Multiclass Classification Problems
Source:vignettes/spatial_cluster_sample.Rmd
spatial_cluster_sample.Rmd
Introduction
Ecological data usually have an inherent spatial structure: nearer things are more similar than distant things. This relationship can exist at any spatial scale, from local to global, and positive spatial autocorrelation appears as a nonzero covariance between spatially proximal observations or values.
This definition calls for a way to measure, from a sample of data values, the covariance between nearby points and to decide whether that covariance is consistent with a random spatial arrangement of values. Such non-randomness in the data is not problematic in itself; correlated data allow us to uncover patterns generated by the underlying process. In parametric statistical analysis, however, problems can arise: 1) non-randomness of errors; 2) pseudo-replication, if too many observations are made within the distance at which observations are spatially structured; and 3) because predictor (independent or explanatory) variables often share the same dependence structures, the resulting model can over-inflate its apparent accuracy by wholly or partially absorbing the residual structure.
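To make the first step concrete, a minimal sketch of such a test is shown below. It uses the spdep package on simulated point data (not the vignette's landcover data), and the neighbourhood definition (8 nearest neighbours) is an arbitrary choice.
library(spdep)

set.seed(42)
n  <- 200
xy <- cbind(x = runif(n), y = runif(n))
# a response with a weak spatial trend plus noise
z <- xy[, "x"] + xy[, "y"] + rnorm(n, sd = 0.5)

# neighbourhood: 8 nearest neighbours, row-standardised weights
nb <- knn2nb(knearneigh(xy, k = 8))
lw <- nb2listw(nb, style = "W")

# Moran's I; a small p-value suggests the values are not arranged randomly in space
moran.test(z, lw)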
Parametric models can, at least in theory, address dependence structures in the data (e.g., autoregressive or mixed models). In practice, however, model specification bias together with structural over-fitting can seriously impair diagnostic evaluation of the model. Furthermore, popular machine learning models (e.g., random forests) do not account for such spatial dependence structures.
What happens with non-parametric models?
In ideal model building (or machine learning), evaluation and validation should be performed with independent data. For example, model evaluation data should not come from within the same geographic extent from which the model was built (they should be spatially distinct). Two scenarios commonly arise: 1) no such independent data exist, because we tend to collect data only within the spatial extent of our study area (in classification problems with remote sensing data, for instance, sample data are often collected by on-screen, heads-up digitization), and we cannot guarantee that the relationships in the data hold elsewhere; or 2) the data set at hand does not meet the assumption of independence.
The current paradigm for evaluating the predictive error of machine learning models is cross-validation, usually with a hold-out set (saving some percentage of the total sample data). The key idea is that the training data are independent of the hold-out (validation) data. However, training points tend to cluster because of resource or time limitations. While it is possible to create independent data a priori, this is rarely done, and we cannot be certain of independence because spatial dependence can occur at any scale.
A hold-out set, however, does not necessarily remove the problem, because testing/validation data can be drawn from locations near the training data and share their dependence structure. Model selection would then favour overly complex models, and accuracy metrics would be too optimistic.
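For contrast, a minimal sketch of conventional, non-spatial k-fold cross-validation is shown below. It uses rsample::vfold_cv on made-up data and ignores the coordinates entirely, so nearby (spatially dependent) points can end up in both the analysis and assessment sets.
library(rsample)

set.seed(1)
toy <- data.frame(
  X = runif(100), Y = runif(100),                     # coordinates (ignored by the split)
  class = sample(c("a", "b", "c"), 100, replace = TRUE)
)
# plain random 10-fold cross-validation
random_folds <- vfold_cv(toy, v = 10)
random_folds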
How to deal with it
There are several ways to address spatial dependency in the data for parametric analysis, such as autoregressive (AR) models, generalized least squares (GLS), and mixed effects models. However, even after such remedial measures the problem can persist, which invites strategic blocking of observations. The stdcab package offers two approaches:
Clustering: Partitioning and Hierarchical clustering
Blocking: Splitting coordinate space into regular grids
Clustering
Clustering can be used for k-fold cross-validation or repeated k-fold cross-validation. Two options are available: 1) partitioning clustering (using coordinates) or 2) hierarchical clustering.
Partitioning clustering
kmeans clustering is one of the most widely used unsupervised multivariate analysis methods.
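Before turning to the package function, a minimal sketch of the underlying idea: running stats::kmeans on the point coordinates partitions the coordinate space into spatially compact clusters, which can then serve as folds. The data here are simulated, not the landcover data.
set.seed(1318)
xy <- cbind(X = runif(300), Y = runif(300))
# cluster the coordinates themselves into 10 spatial groups
km <- kmeans(xy, centers = 10)
# points per spatial cluster (candidate folds)
table(km$cluster)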
data("landcover")
# setting seeds
set.seed(1318)
rnd_fold <- spatial_cluster_sample(
data = landcover, coords = NULL, v = 10,
spatial = TRUE, clust_method = "kmeans"
)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
Now a data frame of training and testing data can be created from the list of splits.
library(rsample) # analysis()/assessment() extract the training/testing rows of a split
library(dplyr)   # also attaches the %>% pipe

# combine the analysis (training) and assessment (testing) rows of one split
gen_df <- function(split) {
  analysis(split) %>%
    dplyr::mutate(analysis = "Training") %>%
    dplyr::bind_rows(assessment(split) %>%
      dplyr::mutate(analysis = "Testing"))
}
def <- purrr::map_df(rnd_fold$splits, gen_df)
# short-cut to add fold information
vec <- paste0("Fold", 1:10)
# each fold contains all 1922 observations (training + testing)
fold <- rep(vec, each = 1922)
# attach the fold label
def$fold <- fold
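As an alternative to hard-coding 1922 rows per fold, the fold id stored in the resampling object can be carried along while mapping. A sketch, reusing gen_df() from above:
# carry the fold id from the resampling object instead of assuming
# every split has exactly 1922 rows
def2 <- purrr::map2_df(
  rnd_fold$splits, rnd_fold$id,
  ~ gen_df(.x) %>% dplyr::mutate(fold = .y)
)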
Visualize the training and testing data for each fold using the ggplot2 package.
# install any required packages that are not already available
# (install.packages() previously threw an error on a Debian system)
pkg <- c("ggplot2", "gganimate")
pkg_check <- lapply(pkg, FUN = function(p) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p, dependencies = TRUE, repos = "http://cran.us.r-project.org")
    library(p, character.only = TRUE)
  }
})
#> Loading required package: ggplot2
#> Loading required package: gganimate
# hide the coordinate axis text to avoid clutter
blank_xy <- function() {
theme(
axis.text.x = element_blank(),
# axis.ticks.x=element_blank(),
axis.text.y = element_blank(),
# axis.ticks.y=element_blank()
)
}
vis_fold <- ggplot(def, aes(X, Y, color = analysis)) +
geom_point(alpha = 0.7, size = 2) +
coord_fixed() +
theme_bw(base_size = 12) +
labs(color = "Training/Testing") +
scale_color_manual(values = c("purple", "blue")) +
xlab("Longitude (m)") +
ylab("Latitidue (m)") +
facet_wrap(facets = vars(fold), nrow = 5, scales = "fixed") +
blank_xy()
Hierarchical Clustering
spatial_cluster_sample supports hierarchical clustering via the stats::hclust function. Unlike kmeans, hierarchical clustering does not require the number of clusters k up front. This version does not provide visualizations of how many distinct clusters are present; the number of clusters and the number of repeats used in repeated_spatial_cluster_sample should be informed by the data. Other functionalities will be added in a future release.
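The call that produced the folds printed below is not shown in this vignette; the following is a sketch of what it might look like, where the clust_method value "hclust" is an assumption based on the kmeans example above rather than a documented argument value.
set.seed(1318)
hclus_fold <- spatial_cluster_sample(
  data = landcover, coords = NULL, v = 10,
  spatial = TRUE, clust_method = "hclust" # assumed argument value
)
hclus_fold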
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1668/254]> Fold01
#> 2 <split [1829/93]> Fold02
#> 3 <split [1749/173]> Fold03
#> 4 <split [1755/167]> Fold04
#> 5 <split [1717/205]> Fold05
#> 6 <split [1543/379]> Fold06
#> 7 <split [1722/200]> Fold07
#> 8 <split [1738/184]> Fold08
#> 9 <split [1826/96]> Fold09
#> 10 <split [1751/171]> Fold10
Visualize the training and testing data for each fold, animated over folds with gganimate.
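The hdef data frame used below is not constructed in the text; a sketch, reusing gen_df() and the hclus_fold object assumed above:
# build the per-fold training/testing data frame for the hierarchical folds
hdef <- purrr::map2_df(
  hclus_fold$splits, hclus_fold$id,
  ~ gen_df(.x) %>% dplyr::mutate(fold = .y)
)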
p_hclus <- ggplot(hdef, aes(X, Y, color = analysis)) +
geom_point(alpha = 0.7, size = 2) +
coord_fixed() +
theme_bw(base_size = 12) +
labs(color = "Train/Test") +
scale_color_manual(values = c("purple", "blue")) +
xlab("Longitude (m)") +
ylab("Latitidue (m)") +
# transition_states(id,state_length = 2)
labs(
title =
"Location {previous_state}"
) +
theme(plot.title = element_text(hjust = 0.5)) +
gganimate::transition_states(
states = fold,
transition_length = 4,
state_length = 4
)
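To render the animation, the plot object can be passed to gganimate::animate(); the frame count and speed below are arbitrary choices, not values from the vignette.
# render the animation; nframes and fps are arbitrary choices
gganimate::animate(p_hclus, nframes = 40, fps = 5)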
References
Silge, J., 2021. spatialsample: Spatial Resampling Infrastructure. https://github.com/tidymodels/spatialsample, https://spatialsample.tidymodels.org.
Legendre, P., 1993. Spatial autocorrelation: trouble or new paradigm? Ecology 74, 1659–1673.
Legendre, P., Dale, M.R.T., Fortin, M.-J., Gurevitch, J., Hohn, M., Myers, D., 2002. The consequences of spatial structure for the design and analysis of ecological field surveys. Ecography 25, 601–615.
Legendre, P., Fortin, M.-J., 1989. Spatial pattern and ecological analysis.
Miller, J., Franklin, J., Aspinall, R., 2007. Incorporating spatial dependence in predictive vegetation models. Ecol. Modell. 202, 225–242. https://doi.org/10.1016/j.ecolmodel.2006.12.012
Miller, J.R., Turner, M.G., Smithwick, E.A.H., Dent, C.L., Stanley, E.H., 2004. Spatial extrapolation: the science of predicting ecological patterns and processes. BioScience 54, 310–320.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 46, 234–240.
This is not an exhaustive list of references.