Cluster analysis - a summary

 


Hierarchical, agglomerative cluster analysis is a family of methods for constructing dendrograms (phenograms) from pair-wise similarities or distances.  Here is a simplified algorithm.

 

A. Tabulate a matrix of taxa and their character states, suitably coded.

 

B. Construct a distance (or similarity) matrix.  This is a square matrix of dimension n (for n taxa).  The d matrix might contain "p-distances": the number of pair-wise differences divided by the number of comparisons made.  Transformations such as the Jukes-Cantor or Kimura two-parameter corrections could then be applied.

 

C. Execute the following set of instructions:

 

1. Select the smallest distance in the d matrix.

2. Link the corresponding pair at the level given by that distance and enter the link on the growing tree.

3. Replace the selected pair by a cluster in the matrix, which is reduced in size by 1.

4. Calculate the distances between the new cluster and the remaining members of the matrix following the selected formula (below).

5. Repeat from instruction 1 until the matrix has been reduced to a single cluster.
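Steps A to C can be sketched in Python as follows.  This is a minimal illustration under stated assumptions, not a reference implementation: the function names (p_distance, jukes_cantor, cluster), the four toy sequences, the skipping of gap characters, and the storage of the matrix as a dictionary keyed by unordered pairs are all choices made here for the example.  The update argument stands in for the cluster replacement formula; the built-in min gives single linkage and max gives complete linkage.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Number of differing sites divided by the number of sites compared."""
    # Assumption: sites with a gap ("-") in either sequence are not compared.
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in compared) / len(compared)

def jukes_cantor(p):
    """Jukes-Cantor transformation of a p-distance (only defined for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def cluster(names, dist, update=min):
    """Agglomerate until one cluster remains; return the joins in order.

    dist maps frozenset({x, y}) to a distance.  update combines the two old
    distances into the distance for the new cluster.
    """
    active = list(names)
    d = dict(dist)
    joins = []
    while len(active) > 1:
        # Step 1: select the smallest distance in the matrix.
        a, b = min(combinations(active, 2), key=lambda pair: d[frozenset(pair)])
        # Step 2: link the pair at that level on the growing tree.
        joins.append((a, b, d[frozenset((a, b))]))
        # Step 3: replace the pair by a cluster (matrix shrinks by 1).
        new = (a, b)
        active = [x for x in active if x not in (a, b)]
        # Step 4: distances between the new cluster and the remaining members.
        for x in active:
            d[frozenset((new, x))] = update(d[frozenset((a, x))],
                                            d[frozenset((b, x))])
        active.append(new)
    return joins  # Step 5 is the while loop itself.

# Step A: a toy character matrix (hypothetical aligned sequences).
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAGCTT"}
# Step B: the p-distance matrix.
dist = {frozenset((s, t)): p_distance(seqs[s], seqs[t])
        for s, t in combinations(seqs, 2)}
# Step C: single-linkage clustering.
joins = cluster(list(seqs), dist)
for a, b, level in joins:
    print(a, b, level)
```

To cluster on corrected distances instead, one could apply jukes_cantor to every value of dist before calling cluster.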

 

Cluster replacement formulae:

 

1. Single-linkage: the distance from the new cluster to each other matrix member is the smaller of its two members' distances to that member.  The method is "space-contracting".

2. Complete-linkage: the distance from the new cluster to each other matrix member is the larger of its two members' distances to that member.  The method is "space-dilating".

3. Average linkage:  

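a. UPGMA (unweighted pair-group method with arithmetic mean): the new distance is averaged over all members of the growing cluster, so every individual counts equally and the two merged members contribute in proportion to their sizes - one must keep track of the number of individuals previously added to clusters.  The method favours unbalanced clusters linked on their gravitational centre.

b. WPGMA (weighted pair-group method with arithmetic mean): the new distance is the simple average of the two distances, so each member of the pair counts equally regardless of how many individuals it contains.  The method favours spherical clusters linked on their geometric centre.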
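The four replacement formulae can be written as small Python functions.  This is only a sketch: the function and argument names are hypothetical.  d_ax and d_bx are the distances from the two merged members a and b to some other matrix member x, and n_a and n_b are the numbers of individuals already contained in a and b.

```python
def single_linkage(d_ax, d_bx, n_a, n_b):
    # Smaller of the two distances; cluster sizes are ignored.
    return min(d_ax, d_bx)

def complete_linkage(d_ax, d_bx, n_a, n_b):
    # Larger of the two distances; cluster sizes are ignored.
    return max(d_ax, d_bx)

def upgma(d_ax, d_bx, n_a, n_b):
    # Average over all individuals: each member is weighted by its size.
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)

def wpgma(d_ax, d_bx, n_a, n_b):
    # Simple average: each member of the pair counts equally.
    return (d_ax + d_bx) / 2

# Example: a holds 3 individuals and b holds 1.
args = (0.25, 0.75, 3, 1)
results = {f.__name__: f(*args)
           for f in (single_linkage, complete_linkage, upgma, wpgma)}
print(results)
```

With these numbers UPGMA is pulled towards the larger cluster a, while WPGMA lands midway between the two distances.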

 

These methods all form trees whose terminal nodes are equidistant from the root.  They therefore assume ultrametric data, which in turn assumes constant evolutionary rates across taxa.
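One way to check whether a distance matrix is ultrametric is the three-point condition: in every triple of taxa, the two largest of the three pair-wise distances must be equal.  The sketch below assumes a symmetric matrix stored as a list of lists; the function name is_ultrametric, the tolerance tol, and the two small matrices are illustrative choices, not part of the method described above.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Return True if every triple satisfies the three-point condition."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        # The two largest distances in the triple must be equal.
        if largest - middle > tol:
            return False
    return True

# Clock-like data: taxa 0 and 1 are equally distant from taxon 2.
clock = [[0, 2, 6],
         [2, 0, 6],
         [6, 6, 0]]

# Unequal rates: taxon 1 has diverged further from taxon 2 than taxon 0 has.
skewed = [[0, 2, 6],
          [2, 0, 8],
          [6, 8, 0]]

print(is_ultrametric(clock), is_ultrametric(skewed))
```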

 
