A Clustering Algorithm  

Clustering is the process by which discrete objects can be assigned to groups which have similar characteristics. The objects might be animals of different species and the purpose of the grouping might be to find those species with the most similarities. The objects might be responses of individuals to a survey, and the clusters might indicate the categories of response types. In the world of satellite-collected image data, billions of pixel points could be grouped into hundreds of clusters. This gross resolution of the data would allow a person in the field to analyze the land-use data with a personal computer rather than a supercomputer.

There are many algorithms used in clustering. The one presented here was described in the article "Finding Clusters: An Application of the Distance Concept" by Robert Clason in the April 1990 issue of The Mathematics Teacher.

Each object is represented by an ordered n-tuple of its attributes. Objects with similar characteristics must have individual attributes which are close in value. Since distance is a measure of "closeness", it can be used in clustering. For each cluster, there will be a single point which is designated as a hub. The placement of the hub within the cluster is determined by the algorithm. We will start with two-dimensional objects since they are the easiest to visualize. Suppose we have 8 points in the configuration below. Draw circles around the clusters that you see visually.

POINT X Y
1 1 1
2 1 2
3 2 1
4 8 1
5 9 1
6 6 6
7 5 5
8 9 9

Let's see what a computer model of clustering would give us.

This algorithm assumes a minimum of 2 clusters and a maximum of p clusters where p is the number of objects. The algorithm is as follows:

  1. From the set of objects (data points), choose any one to be the first hub.
  2. Find the point which is farthest from this hub, and make it the second hub.
  3. For each remaining object, find the distance from the object to each hub and assign the object to the hub to which it is closer.
  4. To determine whether it is necessary to find a third hub, do the following.
    1. Find the distance between the two hubs and divide by two. Call this value R.
    2. Calculate the distance from each object to its hub. If any distance is greater than R,
      it is necessary to find a third hub.
  5. To find the third hub, find the object which is farthest from its hub (1 or 2). So, if Point 8 belongs to Hub 1, look at the distance from Point 8 to Hub 1, but not Point 8 to Hub 2.
  6. Then, calculate the distance from each point to the new hub to see if that distance is less than the point's distance to its current hub. If so, reassign the point to Hub 3.
  7. To determine the stop value, R, find the average distance between the hubs and divide by two. Since there are three hubs, there will be three distances to average together (H1 - H2, H1 - H3, and H2 - H3).
  8. Once again, see if any distance from a point to its hub exceeds R.
  9. Continue this process until all points are within R units of their hub or until all points are themselves hubs.

Student Exercises

  1. Create a new set of data points and ask the students to determine the clusters.
  2. Have different students use different points as the initial hub and see if the clusters differ at the end. Let the students suggest configurations for which the initial hub selection might make a difference.
  3. Use data points in which the domain is 1..10 (the x-coordinate values), but the range is 1000...1000000 (the y-coordinate values). Are the coordinates weighted equally with respect to their importance in the distance formula? What effect does this have in the clustering?
  4. Use a standardization formula to adjust the points in problem 3 before running the clustering program. Compare the standardized versus non-standardized clusters.

Exercise Notes

Exercise 3:
The data might represent the unemployment rate and average housing costs in major U.S. cities. Since the differences in housing costs will be numbers in the 10000's and differences in the unemployment rates will be less than 10, the distance between points will be primarily determined by the housing costs. If unemployment is just as important as housing to the researcher, a method must be used to re-scale the data so that relative, rather than absolute, differences are used.

Exercise 4:
The following standardizing function is often used in clustering.

  1. Find the average of the unemployment rates and call it X.
  2. Do the same for the housing costs and call it Y.
  3. Then, calculate the standard deviation for each attribute, SX and SY. To do this for unemployment, sum the squares of the differences between each unemployment rate and the average unemployment rate, X. Then divide this sum by the number of data points and take the square root. Do the same process for housing costs.
  4. Now, replace each unemployment value with (original unemployment value minus X) divided by SX. Do likewise for each housing cost using (original housing costs minus Y) divided by SY.
  5. Use these new values to cluster the cities. The MatLab program would be useful here.

On to the Matlab Clustering Program.


Home | Contact | Site Map | Search