labelstomss {hopach}    R Documentation

Functions to compute silhouettes and split silhouettes

Description

Silhouettes measure how well an element belongs to its cluster, and the average silhouette measures the strength of cluster membership overall. The Median (or Mean) Split Silhouette (MSS) is a measure of cluster heterogeneity. Given a partitioning of elements into groups, the MSS algorithm considers each group separately and computes the split silhouette for that group, which evaluates evidence in favor of further splitting the group. If the median (or mean) split silhouette over all groups in the partition is low, the groups are homogeneous.
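
As a rough illustration of the silhouette idea, the sketch below computes per-element silhouettes in base R using the usual Kaufman and Rousseeuw form s = (b - a) / max(a, b), where a is the mean distance from an element to the other members of its own cluster and b is the smallest mean distance to any other cluster. The function name simple_silhouette is hypothetical and not part of hopach; the handling of singleton clusters (silhouette set to 0) is an assumption, and labelstosil may differ in such details.

```r
# Hypothetical helper (not part of hopach): per-element silhouettes
# from a full distance matrix `d` and a vector of cluster `labels`.
simple_silhouette <- function(labels, d) {
  d <- as.matrix(d)
  n <- length(labels)
  groups <- unique(labels)
  sil <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    own[i] <- FALSE                       # exclude the element itself
    if (!any(own) || length(groups) < 2) {
      sil[i] <- 0                         # assumed convention: singleton or single cluster
      next
    }
    a <- mean(d[i, own])                  # mean distance to own cluster
    others <- setdiff(groups, labels[i])
    # smallest mean distance to any other cluster
    b <- min(vapply(others, function(g) mean(d[i, labels == g]), numeric(1)))
    den <- max(a, b)
    sil[i] <- if (den > 0) (b - a) / den else 0
  }
  sil
}

# Two well-separated groups on a line
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
d <- as.matrix(dist(x))
labels <- c(1, 1, 1, 2, 2, 2)
sil <- simple_silhouette(labels, d)
mean(sil)  # average silhouette over all elements
```

For well-separated groups like these, every silhouette is close to 1, so the average silhouette is high.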

Usage

labelstomss(labels, dist, khigh = 9, within = "med", between = "med",
    hierarchical = TRUE)

labelstosil(labels, dist)

medstosil(medoids, dist)

msscheck(dist, kmax = 9, khigh = 9, within = "med", between = "med", 
    force = FALSE, echo = FALSE, graph = FALSE)

silcheck(data, kmax = 9, diss = FALSE, echo = FALSE, graph = FALSE)

Arguments

labels vector of cluster labels for each element in the set.
dist numeric distance matrix containing the pairwise distances between all elements. All values must be numeric and missing values are not allowed.
medoids a vector indicating the rows/cols of dist that are the cluster medoids, i.e. profiles (or centroids) for each cluster.
data a data matrix. Each column corresponds to an observation, and each row corresponds to a variable. In the gene expression context, observations are arrays and variables are genes. All values must be numeric. Missing values are ignored. In silcheck, data may also be a distance matrix or dissimilarity object if the argument diss=TRUE.
khigh integer between 1 and 9 specifying the maximum number of children for each cluster when computing MSS.
kmax integer between 1 and 9 specifying the maximum number of clusters to consider. Can be different from khigh, though typically these are the same value.
within character string indicating how to compute the split silhouette for each cluster. The available options are "med" (median over all elements in the cluster) or "mean" (mean over all elements in the cluster).
between character string indicating how to compute the MSS over all clusters. The available options are "med" (median over all clusters) or "mean" (mean over all clusters). Recommended to use the same value as within.
hierarchical logical indicating if 'labels' should be treated as encoding a hierarchical tree, e.g. from HOPACH.
force indicator of whether to require at least 2 clusters. If FALSE (default), a single cluster is also considered.
echo indicator of whether to print the selected number of clusters and corresponding MSS.
graph indicator of whether to generate a plot of MSS (or average silhouette in silcheck) versus number of clusters.
diss indicator of whether data is a dissimilarity matrix (or dissimilarity object), as in the pam function of the cluster package. If TRUE, data is treated as a dissimilarity matrix. If FALSE, data is treated as a data matrix (observations by variables).

Details

The Median (or Mean) Split Silhouette (MSS) criterion is defined in the technical report (paper107) listed in the References section. This criterion is based on the 'silhouette' criterion function proposed by Kaufman and Rousseeuw (1990). While average silhouette is a good global measure of cluster strength, MSS was developed to be more "aggressive" for finding small, homogeneous clusters in large data sets. MSS is a measure of average cluster homogeneity. The Median version is more robust than the Mean.
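
The MSS computation described above can be sketched as follows. This is a minimal illustration, not hopach's implementation: the function names split_silhouette and simple_mss are hypothetical, and the sketch assumes that each cluster is split into k = 2, ..., khigh children with pam, that the split silhouette of a cluster is the largest average silhouette over those k, and that clusters with fewer than 3 elements are skipped. It ignores the hierarchical argument.

```r
library(cluster)  # provides pam()

# Hypothetical helper (not hopach's code): split silhouette of one cluster,
# given the sub-matrix of distances among its members.
split_silhouette <- function(d_sub, khigh = 9, within = "med") {
  n <- nrow(d_sub)
  if (n < 3 || khigh < 2) return(NA_real_)  # assumed: too small to split
  avg <- if (within == "med") median else mean
  ks <- 2:min(khigh, n - 1)                 # pam needs k < n
  scores <- vapply(ks, function(k) {
    fit <- pam(as.dist(d_sub), k = k)
    avg(fit$silinfo$widths[, "sil_width"])  # average silhouette for this k
  }, numeric(1))
  max(scores)                               # strongest evidence for a split
}

# Hypothetical helper: summarize split silhouettes over all clusters.
simple_mss <- function(labels, d, khigh = 9, within = "med", between = "med") {
  d <- as.matrix(d)
  avg <- if (between == "med") median else mean
  ss <- vapply(unique(labels), function(g) {
    idx <- labels == g
    split_silhouette(d[idx, idx, drop = FALSE], khigh, within)
  }, numeric(1))
  avg(ss, na.rm = TRUE)
}

# Three tight groups on a line
x <- c(0, 0.1, 0.2, 0.3, 5, 5.1, 5.2, 5.3, 10, 10.1, 10.2, 10.3)
d <- as.matrix(dist(x))
mss_good <- simple_mss(rep(1:3, each = 4), d)  # labels match the groups
mss_one  <- simple_mss(rep(1, 12), d)          # everything in one cluster
```

Under these assumptions, lumping all three groups into one cluster gives a higher value than the labeling that matches the groups, because the single cluster can be split into well-separated children. This is consistent with lower MSS indicating more homogeneous clusters.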

Value

For labelstomss, the median (or mean or combination) split silhouette, depending on the values of within and between.

For medstosil and labelstosil, a list with first component the cluster label for each element and second component the silhouette for that element. The average silhouette is simply the mean of the second component.
For msscheck, a vector with first component the chosen number of clusters (minimizing MSS) and second component the corresponding MSS.
For silcheck, a vector with first component the chosen number of clusters (maximizing average silhouette) and second component the corresponding average silhouette.

Author(s)

Katherine S. Pollard <kpollard@soe.ucsc.edu> and Mark J. van der Laan <laan@stat.berkeley.edu>

References

http://www.bepress.com/ucbbiostat/paper107/

http://www.stat.berkeley.edu/~laan/Research/Research_subpages/Papers/jsmpaper.pdf

Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

See Also

pam, hopach, distancematrix

Examples


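library(hopach)
library(cluster) #provides pam

#two groups of rows: 10 centered at 0 and 15 centered at 5
mydata<-rbind(cbind(rnorm(10,0,0.5),rnorm(10,0,0.5),rnorm(10,0,0.5)),
              cbind(rnorm(15,5,0.5),rnorm(15,5,0.5),rnorm(15,5,0.5)))
mydist<-distancematrix(mydata,d="cosangle") #compute the distance matrix.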

#pam
result1<-pam(mydata,k=2)
result2<-pam(mydata,k=5)
labelstomss(result1$clustering,mydist,hierarchical=FALSE)
labelstomss(result2$clustering,mydist,hierarchical=FALSE)

#hopach
result3<-hopach(mydata,dmat=mydist)
labelstomss(result3$clustering$labels,mydist)
labelstomss(result3$clustering$labels,mydist,within="mean",between="mean")


[Package hopach version 1.0 Index]