boothopach {hopach}    R Documentation

functions to perform non-parametric bootstrap resampling of hopach clustering results


Description

The function boothopach takes gene expression data and the corresponding hopach gene clustering output and performs non-parametric bootstrap resampling. The medoid genes (cluster profiles) from the original hopach clustering result are held fixed, and in each bootstrap resampled data set, each gene is assigned to the closest medoid. The proportion of bootstrap samples in which each gene appears in each cluster is an estimate of the gene's membership in that cluster. These membership probabilities can be viewed as a "fuzzy" clustering result. The function bootmedoids takes medoids and a distance function, rather than a hopach object, as input.


Usage

boothopach(data, hopachobj, I = 1000, hopachlabels = FALSE)

bootmedoids(data, medoids, d = "cosangle", I = 1000)


Arguments

data: data matrix, data frame or exprSet of gene expression measurements. Each column corresponds to an array, and each row corresponds to a gene. All values must be numeric. Missing values are ignored.
hopachobj: output of the hopach function.
I: number of bootstrap resampled data sets.
hopachlabels: indicator of whether to use the hopach cluster labels hopachobj$clustering$labels for the row names (TRUE) versus the numbers 0 to 'k-1', where 'k' is the number of clusters (FALSE).
medoids: row indices of data for the cluster medoids.
d: character string specifying the metric to be used for calculating dissimilarities between vectors. The currently available options are "cosangle" (cosine angle or uncentered correlation distance), "abscosangle" (absolute cosine angle or absolute uncentered correlation distance), "euclid" (Euclidean distance), "abseuclid" (absolute Euclidean distance), "cor" (correlation distance), and "abscor" (absolute correlation distance). Advanced users can write their own distance functions and add these.
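
As an illustration of the first two metrics, the sketch below computes cosine-angle and absolute cosine-angle dissimilarities in base R. It assumes the usual definition of one minus the uncentered correlation (and one minus its absolute value). The helper names cosangle_dist and abscosangle_dist are hypothetical and are not the package's internal functions.

```r
# Hypothetical helpers, not part of hopach: assumed definitions of the
# "cosangle" and "abscosangle" dissimilarities between two numeric vectors.
cosangle_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)          # drop pairs with a missing value
  x <- x[ok]; y <- y[ok]
  1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

abscosangle_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  x <- x[ok]; y <- y[ok]
  1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))
}

a <- c(1, 2, 3)
cosangle_dist(a, 2 * a)      # parallel vectors: distance 0
cosangle_dist(a, -a)         # opposite vectors: distance 2
abscosangle_dist(a, -a)      # sign is ignored: distance 0
```

Under this assumed definition the absolute variant treats perfectly anti-correlated profiles as identical, which is the practical difference between the two options.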


Details

The function boothopach requires only data and the corresponding output from the HOPACH clustering algorithm produced by the hopach function. The function bootmedoids is designed to work for any clustering result; the user inputs data, medoid row indices, and the distance metric. The supplied distance metrics are the same as for the distancematrix function. Each non-parametric bootstrap resampled data set is formed by resampling the 'n' columns of data with replacement 'n' times. For each bootstrap data set, the distance between each element and each of the medoid elements is computed using d, and every element is assigned (for that resampled data set) to the cluster whose medoid is closest. These bootstrap cluster assignments are tabulated over all I bootstrap data sets.
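
The procedure described above can be sketched in a few lines of base R. This is an illustrative re-implementation under stated assumptions, not the code used by boothopach or bootmedoids: the function name boot_membership and its dist_fun argument are hypothetical, Euclidean distance is used as the default in place of "cosangle" to keep the sketch self-contained, and missing values are not handled.

```r
# Sketch only: hypothetical base-R version of the bootstrap described above.
boot_membership <- function(data, medoids, I = 1000,
                            dist_fun = function(x, y) sqrt(sum((x - y)^2))) {
  data <- as.matrix(data)
  n <- ncol(data)                      # number of arrays (columns)
  k <- length(medoids)                 # one cluster per medoid
  counts <- matrix(0, nrow(data), k,
                   dimnames = list(rownames(data), as.character(0:(k - 1))))
  for (b in seq_len(I)) {
    cols <- sample(n, n, replace = TRUE)           # resample the n columns
    boot <- data[, cols, drop = FALSE]
    # distance from every row to each fixed medoid row, in the resampled data
    d <- sapply(medoids, function(m)
      apply(boot, 1, dist_fun, y = boot[m, ]))
    nearest <- max.col(-d, ties.method = "first")  # closest medoid per row
    idx <- cbind(seq_len(nrow(data)), nearest)
    counts[idx] <- counts[idx] + 1                 # tabulate assignments
  }
  counts / I                           # proportions over the I resamples
}

# Toy data: 25 rows from two well-separated groups, 3 columns
set.seed(1)
toy <- rbind(matrix(rnorm(30, 0, 0.5), 10, 3),
             matrix(rnorm(45, 5, 0.5), 15, 3))
probs <- boot_membership(toy, medoids = c(1, 11), I = 200)
rowSums(probs)                         # each row sums to 1
```

Because the medoids are held fixed and only the columns are resampled, a medoid row has distance zero to itself in every resample, so it is always assigned to its own cluster.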


Value

A matrix of bootstrap estimated cluster membership probabilities, which sum to 1 (over the clusters) for each element being clustered. This matrix has one row for each element being clustered and one column for each of the original clusters (one cluster for each medoid). The value in row 'j' and column 'i' is the proportion of the I bootstrap resampled data sets in which element 'j' appeared in cluster 'i' (i.e. was closest to medoid 'i').
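
One way to read such a matrix is to take, for each element, the cluster it fell into most often together with the share of resamples supporting that choice. The sketch below does this in base R on a small made-up matrix; the values and the object names probs, hard and confidence are hypothetical and are not produced by the package.

```r
# Hypothetical membership matrix for four elements and two clusters;
# the values are invented for illustration.
probs <- matrix(c(1.00, 0.00,
                  0.85, 0.15,
                  0.40, 0.60,
                  0.00, 1.00),
                ncol = 2, byrow = TRUE,
                dimnames = list(paste0("Var", 1:4), c("0", "1")))

stopifnot(all(abs(rowSums(probs) - 1) < 1e-8))   # rows sum to 1

# Most frequent cluster for each element (first cluster wins ties)
hard <- colnames(probs)[max.col(probs, ties.method = "first")]
# Proportion of resamples in which the element fell in that cluster
confidence <- apply(probs, 1, max)

data.frame(cluster = hard, confidence = confidence)
```

Elements with a confidence well below 1, such as the third row here, sit between clusters, which is the sense in which the result is a "fuzzy" clustering.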


Author(s)

Katherine S. Pollard and Mark J. van der Laan


References

van der Laan, M.J. and Pollard, K.S. (2003). A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117, 275-303.

Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

See Also

distancematrix, hopach


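Examples

#25 variables from two groups with 3 observations per variable
mydata<-rbind(cbind(rnorm(10,0,0.5),rnorm(10,0,0.5),rnorm(10,0,0.5)),cbind(rnorm(15,5,0.5),rnorm(15,5,0.5),rnorm(15,5,0.5)))
dimnames(mydata)<-list(paste("Var",1:25,sep=""),paste("Exp",1:3,sep=""))
mydist<-distancematrix(mydata,d="cosangle") #compute the distance matrix.

#clusters and final tree
clustresult<-hopach(mydata,dmat=mydist)

#bootstrap resampling
myobj<-boothopach(mydata,clustresult)
table(apply(myobj,1,sum)) # all 1
myobj[clustresult$clustering$medoids,] # identity matrix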

[Package hopach version 1.0 Index]