Cluster analysis applied in more
disciplines to help find
by Julia E. Seaman and I. Elaine Allen
Cluster analysis is widely used in disciplines as diverse as marketing, genomics and climate change. The technique lends itself to large ("big") data sets and small, focused data sets, and it can handle qualitative variables, Likert scales and survey responses as well as quantitative variables.
Clustering is an unsupervised statistical technique and also may be called classification or segmentation. It’s defined as unsupervised because there is no targeted or dependent variable being modeled. Algorithms are applied to group the data based on some calculated measure of similarity. In general, the ordering applied to the groups of variables minimizes the variability within a cluster or group while maximizing the variability between groups.
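The trade-off between within-cluster and between-cluster variability can be made concrete with sums of squares. The following is a minimal sketch in plain Python; the two groups and their values are hypothetical and are not taken from the article's figures.

```python
from statistics import mean

def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a center."""
    return sum((v - center) ** 2 for v in values)

# Hypothetical one-dimensional measurements, already assigned to two clusters.
clusters = {"a": [1.0, 2.0, 3.0], "b": [10.0, 11.0, 12.0]}

all_values = [v for values in clusters.values() for v in values]
grand_mean = mean(all_values)

# Within-cluster variability: spread of each point around its own cluster mean.
within = sum(sum_of_squares(values, mean(values)) for values in clusters.values())

# Between-cluster variability: spread of the cluster means around the grand mean,
# weighted by cluster size.
between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in clusters.values()
)

total = sum_of_squares(all_values, grand_mean)
print(within, between, total)
```

For a fixed data set the total is constant and equals within plus between, so a grouping that lowers the within-cluster figure necessarily raises the between-cluster figure.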
An easy application of clustering in quality is the detection of anomalies in the data after clustering is completed: Any outliers should branch away from the main groupings. This is analogous to control charts, but no boundaries are created to be measured against. The grouping is based on the data and criteria in the clustering algorithm. Cluster analysis with large data sets provides a novel way to determine which groupings are important for more scrutiny or further analysis.
For example, clusters are a good way to identify fraud and suspicious charges in accounting and credit card transactions with data sets too large for classical statistical modeling. Clustering can be used to group transactions so that a different level of attention and effort can be applied to each cluster.1 It is possible to implement the analysis for a data stream and reconfigure the clusters as data are updated on the fly.
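One way to reconfigure clusters as data arrive is a sequential update, in which each new record moves its nearest cluster center slightly toward itself. The sketch below assumes this online, k-means-style rule; the article does not say which streaming method is used, and the starting centers and incoming points are hypothetical.

```python
import math

def nearest(centers, point):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], point))

def update(centers, counts, point):
    """Assign one incoming point and shift its center to the new running mean."""
    i = nearest(centers, point)
    counts[i] += 1
    centers[i] = [c + (p - c) / counts[i] for c, p in zip(centers[i], point)]
    return i

# Hypothetical starting state: two centers, each seeded by one earlier record.
centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

# Hypothetical records arriving one at a time.
stream = [(1.0, 1.0), (9.0, 11.0), (2.0, 0.0)]
labels = [update(centers, counts, point) for point in stream]
print(labels, centers)
```

Each center stays equal to the mean of the records assigned to it so far, so no earlier data need to be stored or revisited.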
Another common quality application is in microarray image processing and data acquisition of gene expression. The quality of clusters is ranked by the intensity of the signal-to-noise ratio (similar to the intra- versus inter-cluster variability) to filter promising genes for further analysis.2,3
Basic method, types
Because "similarity" is difficult to define and clustering can be subjective, it helps to start with some basic mathematical assumptions. The measure used to discriminate between clusters must exhibit the properties in Table 1.
Clustering usually takes two general forms:
- In partitional clustering, groups are created, but relationships between data points within the groups are not quantified. This type of clustering is similar to sorting your data based on similarity.
- Hierarchical clustering has two subgroups. Data points are either successively divided from one group down to the individual data point level (top-down approach, also called divisive) or successively combined into groups (bottom-up approach, also called agglomerative) using quantitative methods.
Graphical representations of the hierarchical method can give you an overview and allow you to quickly identify anomalies using a tree-like classification diagram called a dendrogram. You can look at the dendrogram to determine an appropriate number of clusters.
In Figure 1, there are two highly separated groups suggestive of two clusters. Unfortunately, things are rarely this clear. If you are using more than a few variables to determine your clusters, you will need multiple size, color and shape options to create a plot of your data.
If we add an outlier to this data set, it is obviously displayed in the dendrogram. Note there are now three groups in Figure 2, but the third group contains only one element: the outlier.
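The bottom-up approach and the way an outlier ends up alone in its own group can be sketched in a few lines. This is a minimal illustration that assumes nearest-neighbor (single-linkage) merging and Euclidean distance; the points are hypothetical and do not reproduce the data behind Figures 1 and 2.

```python
import math

def single_link(a, b, points):
    """Distance between two clusters: their closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (cluster, cluster, merge distance): the dendrogram's heights
    while len(clusters) > n_clusters:
        d, x, y = min(
            (single_link(clusters[x], clusters[y], points), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((clusters[x], clusters[y], d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges

# Two tight hypothetical groups plus one distant point acting as the outlier.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (30, 30)]
clusters, merges = agglomerate(points, 3)

# A cluster with a single member is a candidate outlier.
outliers = [points[c[0]] for c in clusters if len(c) == 1]
print(clusters, outliers)
```

The recorded merge distances play the role of branch heights in a dendrogram: the outlier is the point that would join the tree only at a much larger distance than any merge inside the two main groups.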
Dendrograms are especially useful for microarray data. The microarray plate contains thousands of DNA spots and is used to measure the expression of specific genes under varying experimental conditions. It is commonly used to genotype patients and probe biological functions of proteins.
The plate output shows a signal that indicates gene activity on a red to yellow to green quantitative spectrum. Clustering can be used to assess the quality of microarray signals and to analyze the results so you can group genes with similar response patterns. The dendrogram is a useful visualization for both these purposes.
For data collections performed with multiple replicates, such as microarray chips, clustering is a great method to get an overview of the performance quality. Samples that have similar profiles will cluster together, and any suspect replicate will split off on its own branch. Follow-up analysis of the suspect replicate is needed to confirm that it is, indeed, an outlier and not an effect of the chosen algorithm.
Clustering also may reveal the presence of any underlying confounders that affected the replicates. For example, the data may split early into two main groups that correspond to two different technicians or robotic systems.
Clustering is a very useful analysis tool. For microarray plates, experimenters use clustering to observe how genes responded to the tested conditions and what specific genes may be best for follow-up experiments. In addition to comparing one attribute list, clustering can be applied in multiple dimensions.
Often, microarray data may have multiple conditions (such as different drugs) that were tested. Clustering by gene and drug can show simultaneously what drugs caused similar changes over all the genes, and what genes changed similarly between all the drugs.
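Clustering in two dimensions amounts to grouping the rows of a data matrix and then grouping its columns. The sketch below assumes a small hypothetical expression matrix (genes as rows, drugs as columns) and a simple distance threshold in place of a full hierarchical method; the values, the threshold and the function name are illustrative only.

```python
import math

def group_similar(vectors, threshold):
    """Give the same label to vectors linked by distances below the threshold."""
    labels = list(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if math.dist(vectors[i], vectors[j]) < threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    return labels

# Hypothetical expression values: one row per gene, one column per drug.
expression = [
    [5.0, 5.1, 0.2],  # gene A
    [4.9, 5.0, 0.1],  # gene B
    [0.3, 0.2, 4.8],  # gene C
]

# Cluster genes by comparing rows.
gene_groups = group_similar(expression, threshold=1.0)

# Cluster drugs by comparing columns (the transposed matrix).
columns = [list(column) for column in zip(*expression)]
drug_groups = group_similar(columns, threshold=1.0)

print(gene_groups, drug_groups)
```

Here genes A and B share a label because they respond alike across the drugs, and the first two drugs share a label because they change the genes alike.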
Other data qualifiers, such as gene function for the microarray, may be applied to test whether clustering branches follow the descriptor patterns. Using clustering, data patterns are easily displayed and groupings of interest can be identified for follow-up investigations. Figure 3 shows the assay plate prior to clustering and an example of the results following two-dimensional clustering.
Partitional clustering is not as efficient as hierarchical clustering because each individual piece of data is classified in only one cluster, and the user must identify beforehand how many clusters are defined in the data. It does not identify anomalies in the data as quickly as hierarchical models, but it is useful for overlapping clusters in which membership in a cluster group is important and the subcluster divisions are not.
The most common algorithm used is k-means, in which k is the number of clusters. Its continued iteration and redefinition of clusters is not as efficient as the hierarchical clustering techniques. The algorithm uses five steps, starting with the analyst's choice of the number of clusters:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise, go back to step three.
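The k-means steps listed above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, a seeded random choice of starting centers, and a small hypothetical data set with two well-separated groups.

```python
import math
import random

def k_means(points, k, seed=0):
    """Plain k-means. Step 1 (choosing k) is left to the caller."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial centers
    labels = None
    while True:
        # Step 3: assign every object to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        changed = new_labels != labels
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
        # Step 5: exit once no object changed membership; otherwise repeat.
        if not changed:
            return labels, centers

# Hypothetical data: two tight groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

Which group receives label 0 depends on the random starting centers, so only the partition itself should be compared between runs.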
Given the first plot in Figure 4, it is clear that three clusters exist, but they overlap. A partitioning algorithm will use the steps (the other four plots in Figure 4) to identify the clusters by iteratively assigning points to the nearest cluster center and recalculating the centers at each iteration.
There are many different hierarchical algorithms for quantifying the distance between clusters. These can be divided into three kinds of distance measures:
- Those that separate clusters by the average distance between their centers.
- Those that separate clusters by nearest neighbor (lowest variability within a cluster).
- Those that separate clusters by the maximum variability between clusters.
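The difference between these families is easiest to see by computing several between-cluster distances for the same pair of clusters. The sketch below assumes they correspond to what is commonly called centroid, single (nearest-neighbor) and complete (farthest-neighbor) linkage; that mapping is an interpretation rather than something the text states, and the points are hypothetical.

```python
import math
from itertools import product

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def centroid_distance(a, b):
    """Distance between the two cluster centers."""
    return math.dist(centroid(a), centroid(b))

def nearest_neighbor_distance(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def farthest_neighbor_distance(a, b):
    """Distance between the most distant pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

# Two hypothetical clusters lying along a line.
left = [(0.0, 0.0), (2.0, 0.0)]
right = [(5.0, 0.0), (9.0, 0.0)]

print(centroid_distance(left, right))           # centers at x=1 and x=7
print(nearest_neighbor_distance(left, right))   # closest pair: x=2 and x=5
print(farthest_neighbor_distance(left, right))  # farthest pair: x=0 and x=9
```

The same two clusters are 6, 3 or 9 units apart depending on the measure, which is why the choice of algorithm can change which clusters merge first.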
All distance-measure algorithms are scalable and iterative. The different algorithms have their own strengths, and the choice depends on the type of input data and the desired results. The clustering algorithms are easily implemented within most statistical software (SPSS, Stata, SAS and R), and most documentation explains the differences between the available algorithms.
Clustering can be an easily applied technique to examine how well your data fit into similar groupings. The technique can identify outliers from specific clusters. Also, for each data point, cluster analysis can calculate a quantitative measure of how far from the center of a cluster (measured as a multivariate mean or other statistic of central tendency) the point falls.
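The distance of each point from its cluster center can be computed directly once cluster memberships are known. The sketch below assumes the center is the multivariate mean and the distance is Euclidean; the cluster names and points are hypothetical.

```python
import math

def centroid(points):
    """Multivariate mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

# Hypothetical clusters; the point (7, 7) sits far from the rest of cluster "a".
clusters = {
    "a": [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0), (7.0, 7.0)],
    "b": [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
}

# Distance of every point from the center of its own cluster.
distances = [
    (math.dist(point, centroid(points)), name, point)
    for name, points in clusters.items()
    for point in points
]

# The point farthest from its own center is the first one to examine as an outlier.
farthest_distance, farthest_cluster, farthest_point = max(distances)
print(farthest_cluster, farthest_point, farthest_distance)
```

Note that the extreme point also pulls the center of its cluster toward itself, so a median or another robust measure of central tendency may be preferable when outliers are expected.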
Measuring the overall variability within and between clusters indicates whether the clusters are cohesive. Cluster analysis can be applied to many types of data (categorical, ordinal and numeric) and has helped fields as diverse as marketing and biology.
1. Sutapat Thiprungsri, "Cluster Analysis for Anomaly Detection in Accounting Data," Collected Papers of the 19th Annual Strategic and Emerging Technologies Research Workshop, San Francisco, July 31, 2010.
2. Xujing Wang, Soumitra Ghosh and Sun-Wei Guo, "Quantitative Quality Control in Microarray Image Processing and Data Acquisition," Nucleic Acids Research, Vol. 29, No. 15, 2001.
3. James J. Chen, Huey-Miin Hsueh, Robert R. Delongchamp, Chein-Ju Lin and Chen-An Tsai, "Reproducibility of Microarray Data: A Further Analysis of Microarray Quality Control Data," BMC Bioinformatics, Vol. 8, No. 412, 2007.
Julia E. Seaman is a doctoral student in pharmacogenomics at the University of California-San Francisco, and a statistical consultant for the Babson Survey Research Group at Babson College in Wellesley, MA. She earned a bachelor’s degree in chemistry and mathematics from Pomona College in Claremont, CA.
I. Elaine Allen is professor of biostatistics at the University of California-San Francisco and emeritus professor of statistics at Babson College. She is also director of the Babson Survey Research Group. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.