| Title: | A Bootstrap Augmented k-Means Algorithm for Fuzzy Partitions |
|---|---|
| Description: | Implementation of the bootkmeans algorithm, a bootstrap augmented k-means algorithm that returns probabilistic cluster assignments. From the paper by Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025), "A bootstrap augmented k-means algorithm for fuzzy partitions" (Submitted). |
| Authors: | Jesse S. Ghashti [aut, cre], Jeffrey L. Andrews [aut], John R.J. Thompson [aut], Joyce Epp [aut], Harkunwar S. Kochar [aut] |
| Maintainer: | Jesse S. Ghashti <[email protected]> |
| License: | GPL-2 |
| Version: | 1.0.0 |
| Built: | 2026-05-14 08:09:27 UTC |
| Source: | https://github.com/ghashti-j/bootkmeans |
Bootstrap augmented k-means algorithm for fuzzy partitions
Repeatedly bootstraps the rows of a data matrix, runs kmeans on each resample (with optional seeding for given centres), tracks per-observation allocations using squared Euclidean distance, and aggregates results into out-of-bag (OOB) fuzzy memberships, hard clusters, and averaged cluster centres. Iterations can stop adaptively using a serial-correlation test on the objective trace.
boot.kmeans(data = NULL, groups = NULL, iterations = 500, nstart = 1, export = FALSE, display = FALSE, pval = 0.05, itermax = 10, maxsamp = 1000, verbose = FALSE, returnall = FALSE)
data |
Numeric matrix or data frame of row observations and column variables. Required. |
groups |
Either an integer number of clusters |
iterations |
Initial number of bootstrap iterations to run before considering stopping ( |
nstart |
Passed to |
export |
Logical; if |
display |
Logical; if |
pval |
Significance threshold for adaptive stopping. When the Breusch–Godfrey test p-value on the last |
itermax |
Maximum number of iterations per |
maxsamp |
Upper bound on total iterations if adaptive stopping keeps extending ( |
verbose |
Logical; if |
returnall |
Logical; if |
Each iteration draws a bootstrap sample of rows, runs kmeans on the resample (first using either supplied centres or nstart random starts; subsequent iterations use the previous iteration's centres),
and computes squared Euclidean distances from every original observation to each current centre using mahalanobis with the identity
covariance. Observations are allocated to their nearest centre and these allocations are tracked across iterations.
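The distance and allocation step can be illustrated with base R alone. This is a minimal sketch rather than the package's internal code: the three starting centres are arbitrary rows of iris chosen for illustration, and the names d2, alloc, and objective are hypothetical. It relies on the fact that stats::mahalanobis returns squared distances, so an identity covariance gives squared Euclidean distance.

```r
x <- as.matrix(iris[, -5])
p <- ncol(x)

# Assumed centres for illustration: three arbitrary rows of the data
centres <- x[c(1, 51, 101), ]

# Squared Euclidean distance from every observation to each centre,
# obtained from mahalanobis() with the identity matrix as covariance
d2 <- sapply(seq_len(nrow(centres)), function(j) {
  stats::mahalanobis(x, center = centres[j, ], cov = diag(p))
})

# Direct computation for the first centre, as a cross-check
direct <- rowSums(sweep(x, 2, centres[1, ])^2)

# Allocate each observation to its nearest centre
alloc <- apply(d2, 1, which.min)

# Sum of per-observation minimum squared distances (the k-means objective)
objective <- sum(apply(d2, 1, min))
```

Here d2 has one row per observation and one column per centre, and d2[, 1] agrees with direct up to floating-point error.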
Out-of-bag (OOB) sets are the observations not included in a given bootstrap sample. For each observation, its OOB allocations across
the most recent iterations runs are tallied to produce a fuzzy membership matrix (U) and a hard label by maximum membership.
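The OOB tally can be sketched as follows. The allocation record here is random placeholder data, not the output of boot.kmeans, and the names alloc, oob, counts, U, and clusters are chosen for illustration only; the point is the bookkeeping that turns per-iteration allocations into row-normalized memberships.

```r
set.seed(1)
n <- 10   # observations
k <- 3    # clusters
B <- 200  # iterations in the tally window

# Placeholder record: alloc[i, b] is the cluster of observation i at iteration b
alloc <- matrix(sample(k, n * B, replace = TRUE), nrow = n)

# oob[i, b] is TRUE when observation i is absent from bootstrap sample b
oob <- sapply(seq_len(B), function(b) {
  !(seq_len(n) %in% sample(n, replace = TRUE))
})

# Count, per observation, how often each cluster was assigned while OOB
counts <- t(sapply(seq_len(n), function(i) {
  tabulate(alloc[i, oob[i, ]], nbins = k)
}))

# Row-normalize the counts into fuzzy memberships
U <- counts / rowSums(counts)

# Hard label by maximum membership
clusters <- max.col(U, ties.method = "first")
```

Each row of U sums to one. An observation that is never out of bag within the window would give a zero row sum, so a real implementation needs to handle that case.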
Convergence is assessed adaptively: on the trace of summed per-observation minimum squared distances (the k-means objective) over the most recent
iterations runs, a Breusch–Godfrey serial-correlation test (bgtest applied to a regression of the objective on
iteration index) is computed. If the p-value is below pval and iterations < maxsamp, one more iteration is added; otherwise the
loop terminates. Final centres are the elementwise mean of the centres over the last iterations runs.
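The stopping check and the final averaging can be sketched as below. This assumes that bgtest refers to lmtest::bgtest and uses a simulated objective trace and simulated centre matrices; sos, trace, extend, and centrelist are illustrative names, and the default test order is used because the order applied by the package is not stated here.

```r
set.seed(1)
iterations <- 50
pval <- 0.05

# Simulated objective trace: noise around a constant level (placeholder data)
sos <- 80 + rnorm(iterations)
trace <- data.frame(objective = sos, index = seq_len(iterations))

# Breusch-Godfrey test on the regression of the objective on iteration index;
# a small p-value signals serial correlation, so the loop would be extended
extend <- NA
if (requireNamespace("lmtest", quietly = TRUE)) {
  bg <- lmtest::bgtest(objective ~ index, data = trace)
  extend <- unname(bg$p.value < pval)
}

# Final centres: elementwise mean of the centre matrices in the window
centrelist <- replicate(iterations, matrix(rnorm(6), nrow = 3), simplify = FALSE)
centres <- Reduce(`+`, centrelist) / length(centrelist)
```

Elementwise averaging is only meaningful when cluster labels line up across iterations, which is what warm-starting each run from the previous iteration's centres is meant to encourage.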
An object of class "BSKMeans": a list with components
U |
|
clusters |
Integer vector of length |
centres |
|
p.value |
Final Breusch–Godfrey test p-value used for stopping. |
iterations |
Total number of iterations actually run. |
occurences |
|
size |
Number of clusters |
soslist |
Numeric vector of objective values by iteration. |
centrelist |
(If |
ooblist |
(If |
kmlist |
(If |
Jesse S. Ghashti [email protected] and Jeffrey L. Andrews [email protected]
Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented k-means algorithm for fuzzy partitions. Submitted.
Breusch, T.S. (1978). Testing for Autocorrelation in Dynamic Linear Models, Australian Economic Papers, 17, 334-355.
Godfrey, L.G. (1978). Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables, Econometrica, 46, 1293-1301.
compare.clusters, compare.tables, bootk.hardsoftvis, kmeans, bgtest
    set.seed(1)

    # basic usage
    x <- as.matrix(iris[, -5])
    fit <- boot.kmeans(data = x, groups = 3, iterations = 50, itermax = 20, verbose = TRUE)
    table(fit$clusters, iris$Species)

    # basic usage with initial cluster centres supplied
    centres.init <- x[sample(nrow(x), 3), ]
    fit2 <- boot.kmeans(data = x, groups = centres.init, iterations = 50)

    # plot objective trace
    plot(fit$soslist, type = "l", xlab = "Iteration", ylab = "Objective Function Value")
Plots the results of boot.kmeans, highlighting which observations are assigned with full certainty (hard) versus fractional out-of-bag membership (soft/fuzzy). Produces either a full scatterplot matrix using all variables or a 2D scatterplot of two chosen variables.
bootk.hardsoftvis(data = NULL, res, plotallvars = FALSE, var1 = NULL, var2 = NULL)
data |
Numeric data frame or matrix used for clustering in |
res |
Result list returned from |
plotallvars |
Logical; if |
var1 |
Integer column number for the x-axis variable when |
var2 |
Integer column number for the y-axis variable when |
Each observation is classified as hard if any entry of its membership row U[i,] is exactly 1, and soft otherwise.
These categories are mapped to colors: green for hard assignments and blue for soft/fuzzy ones. With plotallvars = TRUE, a scatterplot matrix of all variables is drawn.
With plotallvars = FALSE, only the two specified variables are plotted, with axis labels taken from the column names of data.
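The hard/soft rule described above can be sketched in a few lines. The membership matrix below is made up for illustration, and hard and cols are hypothetical names; this is not the plotting code of bootk.hardsoftvis itself.

```r
# Made-up membership matrix: rows are observations, columns are clusters
U <- rbind(
  c(1.00, 0.00, 0.00),
  c(0.70, 0.30, 0.00),
  c(0.00, 0.00, 1.00),
  c(0.50, 0.25, 0.25)
)

# An observation is hard if any entry of its membership row is exactly 1
hard <- apply(U, 1, function(u) any(u == 1))

# Colour mapping: green for hard, blue for soft/fuzzy
cols <- ifelse(hard, "green", "blue")
```

A vector such as cols could then be passed as the col argument of plot or pairs to reproduce a similar display by hand.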
No return value, called for side effects (produces a visualization of hard vs. soft cluster assignments
from boot.kmeans results).
Jesse S. Ghashti [email protected] and Jeffrey L. Andrews [email protected]
boot.kmeans, compare.clusters, bootk.hardsoftvis, kmeans, FKM
    set.seed(1)
    x <- as.matrix(iris[, -5])

    # run bootstrap kmeans
    res <- boot.kmeans(data = x, groups = 3, iterations = 20)

    # scatterplot matrix of all variables
    bootk.hardsoftvis(x, res, TRUE)

    # 2D scatterplot of variable 1 against variable 2
    bootk.hardsoftvis(x, res, plotallvars = FALSE, var1 = 1, var2 = 2)
k-means, bootstrap augmented k-means, and fuzzy c-means
Fits three clustering procedures on the same data: standard kmeans,
our bootstrap augmented k-means algorithm boot.kmeans, and (optionally) fuzzy c-means
from FKM. Returns the fitted objects, which can be passed to
compare.tables to compare confusion matrices side by side.
compare.clusters(data = NULL, groups = NULL, seed = 13462, nstart = 50, what = "all")
data |
Numeric matrix or data frame of row observations and column variables. Required. |
groups |
Number of clusters |
seed |
Optional integer random seed for reproducibility. |
nstart |
Number of random starts for initialization for all methods. |
what |
Character flag; if |
The function runs the following algorithms:
km: stats::kmeans(data, centers = groups, nstart = nstart).
bkm: boot.kmeans(data, groups, nstart = nstart, returnall = FALSE).
fkm (if what == "all"): fclust::FKM(data, k = groups, RS = nstart).
A named list with components:
km |
|
bkm |
|
fkm |
(Only if |
what |
Echo of the |
Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented k-means algorithm for fuzzy partitions. Submitted.
Bezdek, J.C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
Hartigan, J.A. and M.A. Wong (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108.
Ferraro, M.B., Giordani P. and A. Serafini (2019). fclust: An R Package for Fuzzy Clustering, The R Journal, 11.
boot.kmeans, compare.tables, bootk.hardsoftvis, kmeans, FKM
    set.seed(1)
    x <- as.matrix(iris[, -5])

    # compare all three methods
    res <- compare.clusters(x, groups = 3, nstart = 10, what = "all")

    # hard clusters from bootstrap kmeans
    table(res$bkm$clusters, iris$Species)

    # fuzzy memberships from fuzzy c-means
    head(res$fkm$U)

    # compare class labels (column 1 of clus holds the cluster index,
    # column 2 the associated membership degree)
    cbind(res$bkm$clusters[1:5], res$fkm$clus[1:5, 1], res$km$cluster[1:5])
Given the output of compare.clusters and a vector of true class labels,
prints confusion tables for: (i) hard k-means labels, (ii) the bootstrap augmented k-means
MAP out-of-bag labels, and (optionally) (iii) fuzzy c-means hard labels.
compare.tables(full.res = NULL, true.labs = NULL, verbose = TRUE)
full.res |
A list returned by |
true.labs |
A vector of true class labels. |
verbose |
Logical; if |
For k-means, hard labels are taken from full.res$km$cluster. For bootstrap k-means, labels
are taken from full.res$bkm$clusters. If full.res$what == "all", labels are also
taken from full.res$fkm$clus, which holds the hard cluster assignments from the fuzzy c-means
algorithm.
The function prints two or three contingency tables to the console: three if
compare.clusters was called with what = "all", and two otherwise.
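A single contingency table of this kind can be built with base R. The sketch below uses only stats::kmeans so that it runs without the package; it illustrates the type of table compared here and is not the code of compare.tables. The dimnames true and kmeans are labels chosen for readability.

```r
set.seed(1)
x <- as.matrix(iris[, -5])

# Hard k-means labels
km <- stats::kmeans(x, centers = 3, nstart = 10)

# Cross-tabulate true classes against cluster labels
tab <- table(true = iris$Species, kmeans = km$cluster)
print(tab)
```

Cluster numbers are arbitrary, so agreement shows up as one dominant cell per row rather than as a heavy diagonal.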
A list with components:
kmeans |
A contingency table comparing true labels to |
bootkmeans |
A contingency table comparing true labels to boot |
fuzzcmeans |
(Optional) A contingency table comparing true labels to fuzzy |
If verbose = TRUE, the tables are also printed to the console.
Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented k-means algorithm for fuzzy partitions. Submitted.
Bezdek, J.C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
Hartigan, J.A. and M.A. Wong (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108.
Ferraro, M.B., Giordani P. and A. Serafini (2019). fclust: An R Package for Fuzzy Clustering, The R Journal, 11.
boot.kmeans, compare.clusters, bootk.hardsoftvis, kmeans, FKM
    set.seed(1)
    x <- as.matrix(iris[, -5])

    # fit three methods (kmeans, bootstrap kmeans, fuzzy c-means)
    res <- compare.clusters(x, groups = 3, nstart = 10, what = "all")

    # compare contingency tables
    compare.tables(res, true.labs = iris$Species)
Computes fuzzy generalizations of the Adjusted Rand Index based on Frobenius inner products of membership matrices. These measures extend the Adjusted Rand Index to compare fuzzy partitions.
fari(a, b)
a |
An |
b |
An |
A single numeric value
fari |
The Frobenius Adjusted Rand index between |
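As a small illustration of the building block named in the description, the sketch below converts two hard label vectors into one-hot membership matrices and computes the Frobenius inner product of their pairwise co-membership matrices. It is not the full index: the chance adjustment defined in Andrews et al. (2022) is omitted, and one_hot is a hypothetical helper, not a function of the package.

```r
# Two hard partitions of five observations
labels_a <- c(1, 1, 2, 2, 3)
labels_b <- c(1, 1, 1, 2, 2)

# Hypothetical helper: one-hot membership matrix with k columns
one_hot <- function(labels, k = max(labels)) {
  diag(k)[labels, , drop = FALSE]
}

A <- one_hot(labels_a)
B <- one_hot(labels_b, k = 3)

# Co-membership matrices: entry (i, j) is 1 when i and j share a cluster
coA <- tcrossprod(A)
coB <- tcrossprod(B)

# Frobenius inner product: sum of elementwise products
frob <- sum(coA * coB)
frob
# → 7 (five diagonal entries plus the ordered pairs (1, 2) and (2, 1))
```

For fuzzy memberships the same products apply with fractional entries, which is what lets a Rand-type agreement measure carry over from hard to fuzzy partitions.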
Andrews, J.L., Browne, R. and C.D. Hvingelby (2022). On Assessments of Agreement Between Fuzzy Partitions. Journal of Classification, 39, 326–342.
J.L. Andrews, FARI (2013). GitHub repository, https://github.com/its-likeli-jeff/FARI
    set.seed(1)
    a <- matrix(runif(600), nrow = 200, ncol = 3)
    a <- a / rowSums(a)
    b <- matrix(runif(600), nrow = 200, ncol = 3)
    b <- b / rowSums(b)
    fari(a, b)