Clustering


Commercial Examples of Clustering


Example


ID  Name   Age  Balance ($)  Income  Eyes   Gender
 1  Amy     62            0  Medium  Brown  F
 2  Al      53        1,800  Medium  Green  M
 3  Betty   47       16,543  High    Brown  F
 4  Bob     32           45  Medium  Green  M
 5  Carla   21        2,300  High    Blue   F
 6  Carl    27        5,400  High    Brown  M
 7  Donna   50          165  Low     Blue   F
 8  Don     46            0  High    Blue   F
 9  Edna    27          500  Low     Blue   F
10  Ed      68        1,200  Low     Blue   M

How would you cluster this data?

Clustering Techniques


1. Partitioning Based
2. Hierarchy Based
3. Model Based

Partitioning the Space


[Figure: figures/c1.ps]

K-means Clustering


1. Determine the desired number of clusters
2. Randomly pick items to become the "seed" of each cluster
3. Assign each entry to the nearest cluster
4. Recalculate the centers of the clusters
5. Repeat steps 3 and 4 until the number of entries that change cluster falls below a threshold
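
The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function names, the squared Euclidean distance, the `max_moves` threshold, and the `max_iters` safety limit are all assumptions made for the example. The usage lines take the Age and Balance columns from the earlier customer table.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_moves=0, seed=0, max_iters=100):
    """Cluster points into k groups (step 1: k is chosen by the caller)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random seeds
    assignment = [None] * len(points)
    for _ in range(max_iters):
        moves = 0
        # Step 3: assign each entry to the nearest cluster center.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                moves += 1
        # Step 4: recalculate each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 5: stop once few enough entries changed cluster.
        if moves <= max_moves:
            break
    return assignment, centers

# (Age, Balance) pairs from the customer table.
customers = [(62, 0), (53, 1800), (47, 16543), (32, 45), (21, 2300),
             (27, 5400), (50, 165), (46, 0), (27, 500), (68, 1200)]
labels, centers = kmeans(customers, k=3)
print(labels)
print(centers)
```

Because Balance spans a much larger range than Age, the unscaled distance is dominated by Balance. Rescaling the attributes before clustering is a common remedy, but it is a design choice the slides do not specify.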

Hierarchical Clustering


How do we merge clusters?
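
One common answer to the merging question is agglomerative clustering: start with every item in its own cluster and repeatedly merge the two closest clusters. What "closest" means is the linkage criterion. The sketch below is an assumption-laden illustration, not the method of any particular system: the names `cluster_distance` and `agglomerate`, the three linkage options, and the stopping rule (a target number of clusters) are chosen for the example.

```python
import math

def cluster_distance(a, b, linkage):
    """Distance between two clusters under a given linkage criterion."""
    distances = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # closest pair of members
        return min(distances)
    if linkage == "complete":    # farthest pair of members
        return max(distances)
    if linkage == "average":     # mean over all pairs
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, target, linkage="single"):
    """Merge the two closest clusters until only `target` remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Ages from the customer table, as one-dimensional points.
ages = [(62,), (53,), (47,), (32,), (21,), (27,), (50,), (46,), (27,), (68,)]
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(ages, 3, linkage))
```

Single linkage tends to produce long chains, while complete linkage favors compact groups, so the three criteria can disagree on the same data.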


CobWeb


Bayesian Clustering: AutoClass


AutoClass


Pseudocode


For i in NumberOfClusters
   Randomly initialize i clusters
   Do
      Compute class likelihood vectors
      Compute normalized probabilities for each data point
      Update class model parameters
         Analyze new parameters that will maximize probabilities
         (For normal function, recalculate mean, variance, skewness, kurtosis)
   Until convergence (sum of classes' log marginal probability > threshold or
                      no change)
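
The inner Do/Until loop of the pseudocode has the same shape as expectation-maximization for a mixture model. The sketch below shows that loop for a one-dimensional mixture of normal distributions with a fixed number of classes. It is a simplified stand-in and not AutoClass itself: it omits the outer loop over the number of clusters and the Bayesian priors, it updates only the mean and variance (no skewness or kurtosis), and the names `em_mixture`, `tol`, `min_var`, and `max_cycles` are assumptions for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mixture(data, k, seed=0, tol=1e-6, min_var=1e-3, max_cycles=200):
    rng = random.Random(seed)
    means = rng.sample(data, k)              # randomly initialize k classes
    center = sum(data) / len(data)
    spread = sum((x - center) ** 2 for x in data) / len(data)
    variances = [max(spread, min_var)] * k
    weights = [1.0 / k] * k
    previous = -math.inf
    resp = []
    for _ in range(max_cycles):
        # Class likelihoods, then normalized probabilities per data point.
        resp, log_likelihood = [], 0.0
        for x in data:
            lik = [w * normal_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
            total = sum(lik)
            log_likelihood += math.log(total)
            resp.append([value / total for value in lik])
        # Update class parameters to maximize the expected likelihood.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids zero variance
        # Converged when the log likelihood stops improving.
        if log_likelihood - previous < tol:
            break
        previous = log_likelihood
    return weights, means, variances, resp

ages = [62.0, 53.0, 47.0, 32.0, 21.0, 27.0, 50.0, 46.0, 27.0, 68.0]
weights, means, variances, resp = em_mixture(ages, k=2)
print(weights, means, variances)
```

Unlike k-means, each data point receives a probability of membership in every class rather than a single hard assignment, which is what the "Cls Prob" columns in the AutoClass reports later in these slides express.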

Example


[Figure: figures/ac1.ps]

Sample Run: Auto Imports Database


imports-85c.hd2

num_db2_format_defs 2
number_of_attributes 26
separator_char  ','     ; Can also supply comment char and unknown token
0 discrete nominal "symboling" range 7
1 real scalar "normalized-loses" zero_point 0.0 rel_error 0.01
2 discrete nominal "make" range 22
3 discrete nominal "fuel-type" range 2
4 discrete nominal "aspiration" range 2
5 discrete nominal "num-of-doors" range 2
6 discrete nominal "body-style" range 5
7 discrete nominal "drive-wheels" range 3
8 discrete nominal "engine-location" range 2
9 real scalar "wheel-base" zero_point 0.0 rel_error 0.001
10 real scalar "length" zero_point 0.0 rel_error 0.001
11 real scalar "width" zero_point 0.0 rel_error 0.001
12 real scalar "height" zero_point 0.0 rel_error 0.001
13 real scalar "curb-weight" zero_point 0.0 rel_error 0.0002
14 discrete nominal "engine-type" range 7
15 discrete nominal "num-of-cylinders" range 7
16 real scalar "engine-size" zero_point 0.0 rel_error 0.01
17 discrete nominal "fuel-system" range 8
18 real scalar "bore" zero_point 0.0 rel_error 0.003
19 real scalar "stroke" zero_point 0.0 rel_error 0.003
20 real scalar "compression-ratio" zero_point 0.0 rel_error 0.003
21 real scalar "horse-power" zero_point 0.0 rel_error 0.01
22 real scalar "peak-rpm" zero_point 0.0 rel_error 0.02
23 real scalar "city-mpg" zero_point 0.0 rel_error 0.04
24 real scalar "highway-mpg" zero_point 0.0 rel_error 0.04
25 real scalar "price" zero_point 0.0 rel_error 0.001

Sample Run: Auto Imports Database


imports-85c.db2

3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,
   dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,
   dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,
   ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,
   109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,
   136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,
   mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,
   136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,
   136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,
   five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,
   five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,
   108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
...
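
To make the link between the .hd2 header and the .db2 records concrete, the sketch below parses one record using the attribute types listed in the header. AutoClass reads these files itself, so this is purely illustrative. The function name `parse_record` is hypothetical, treating '?' as the unknown token is an assumption based on the records shown, and each record is assumed to sit on one physical line (they are wrapped above only to fit the slide).

```python
# Attribute indices declared "real scalar" in imports-85c.hd2.
REAL_ATTRIBUTES = {1, 9, 10, 11, 12, 13, 16, 18, 19, 20, 21, 22, 23, 24, 25}
NUMBER_OF_ATTRIBUTES = 26
UNKNOWN_TOKEN = "?"   # assumed marker for missing values

def parse_record(line, separator=","):
    """Split one record; real attributes become floats, unknowns become None."""
    fields = [field.strip() for field in line.split(separator)]
    if len(fields) != NUMBER_OF_ATTRIBUTES:
        raise ValueError(f"expected {NUMBER_OF_ATTRIBUTES} fields, got {len(fields)}")
    record = []
    for index, field in enumerate(fields):
        if field == UNKNOWN_TOKEN:
            record.append(None)
        elif index in REAL_ATTRIBUTES:
            record.append(float(field))
        else:
            record.append(field)          # discrete nominal: keep the label
    return record

first = parse_record(
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495"
)
print(first[1], first[2], first[25])
```

Note that attribute 0 ("symboling") stays a string even though it looks numeric, because the header declares it discrete nominal.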

Sample Run: Auto Imports Database


imports-85c.model

model_index 0 4
ignore 0
single_normal_cm 1 18 19 21 22 25
single_normal_cn 9 10 11 12 13 16 20 23 24
single_multinomial default

imports-85c.s-params (abbreviated)

# start_j_list = 2, 3, 5, 7, 10, 15, 25
# min_report_period = 30
# max_duration = 0
# max_n_tries = 0
# n_save = 2
...

Run


imports-85c.log

AUTOCLASS C (version 2.5) STARTING at Mon Jun 26 16:30:39 1995

AUTOCLASS -SEARCH default parameters:
...

WELCOME TO AUTOCLASS.
  1) Each time I have finished a new 'trial', or attempt to find a good
     classification, I will print the number of classes that trial
     started and ended with, such as 9->7.
  2) If that trial results in a duplicate of a previous run, I will print
     print 'dup' first.
  3) If that trial results in a classification better than any previous,
     I will print 'best' first.
  4) If more than 30 seconds have passed since the last report, and a new
     classification has been found which is better than any previous ones,
     I will report on that classification and on the status of the search
     so far.
  5) This report will include an estimate of the time it will take to find
     another even better classification, and how much better that will be.
     In addition, I will estiamte a lower bound on how long it might take to
     find the very best classification, and how much better that might be.
  6) If you are warned about too much time in overhead, you may want to
     change the parameters n_save, min_save_period, min_report_period, or
     min_checkpoint_period.
  7) To quit searching, type a 'q', hit <return>, and wait.  Otherwise I'll
     go on until I complete trial number (12).
  8) If needed, every 30 minutes I will save the best 2 classifications
     so far to file:
     /home/tove/p/autoclass-c/sample/imports-85c.results-bin
     and a description of the search to file:
     /home/tove/p/autoclass-c/sample/imports-85c.search
  9) A record of this search will be printed to file:
     /home/tove/p/autoclass-c/sample/imports-85c.log

BEGINNING SEARCH at Mon Jun 26 16:30:40 1995

[j_in=2]  [cs-3: cycles 15] best2->2(1) [j_in=3]  [cs-3: cycles 49] best3->3(2) [j_in=5]  [cs-3: cycles 12] best5->5(3) [j_in=7]  [cs-3: cycles 11] best7->7(4) [j_in=10]  [cs-3: cycles 14] best10->10(5) [j_in=15]  [cs-3: cycles 28] 15->15(6) [j_in=25]  [cs-3: cycles 10] 25->22(7)

----------------  NEW BEST CLASSIFICATION FOUND on try 5  -------------
It has 10 CLASSES with WEIGHTS 32 30 28 24 21 21 20 11 10 8
PROBABILITY of both the data and the classification = exp(-16368.367)
(Also found 4 other better than last report.)

-----------  SEARCH STATUS as of Mon Jun 26 16:31:12 1995  -----------
It just took 32 seconds since beginning.
Estimate < 28 seconds to find a classification
  exp(61.7) [= 6.0e+26] times more probable.
Estimate >> 1 minute 6 seconds to find the very best classification,
 which may be exp(28.6) to exp(11764.5) times more probable.
Have seen 7 of the estimated > 21 possible classifications (based on 0
 duplicates do far).
Log-Normal fit to classifications probabilities has M(ean) -16598.5,
 S(igma) 154.9
Choosing initial n-classes randomly from a log_normal [M-S, M, M+S] =
 [2.9, 7.0, 16.9]
Overhead time is 3.0 % of total search time

[j_in=9]  [cs-3: cycles 10] 9->9(8) [j_in=3]  [cs-3: cycles 11] 3->3(9) [j_in=5]  [cs-3: cycles 48] 5->5(10) [j_in=3]  [cs-3: cycles 18] 3->3(11) [j_in=5]  [cs-3: cycles 35] 5->5(12)


ENDING SEARCH because max number of tries reached at Mon Jun 26 16:31:32 1995
  after a total of 12 tries over 53 seconds
A log of this search is in file:
 /home/tove/p/autoclass-c/sample/imports-85c.log
The search results are stored in file:
 /home/tove/p/autoclass-c/sample/imports-85c.results-bin
This search can be restarted by having "force_new_search_p = false" in file:
 /home/tove/p/autoclass-c/sample/imports-85c.s-params
 and reinvoking the "autoclass -search ..." form

------------------  SUMMARY OF 10 BEST RESULTS  ------------------
PROBABILITY: exp(-16368.367) N_CLASSES: 10 FOUND ON TRY:   5 *SAVED*
PROBABILITY: exp(-16477.345) N_CLASSES:  9 FOUND ON TRY:   8 *SAVED*
PROBABILITY: exp(-16537.556) N_CLASSES: 15 FOUND ON TRY:   6
PROBABILITY: exp(-16542.413) N_CLASSES:  7 FOUND ON TRY:   4
PROBABILITY: exp(-16590.504) N_CLASSES:  5 FOUND ON TRY:  10
PROBABILITY: exp(-16617.452) N_CLASSES:  5 FOUND ON TRY:   3
PROBABILITY: exp(-16632.595) N_CLASSES:  5 FOUND ON TRY:  12
PROBABILITY: exp(-16673.545) N_CLASSES: 22 FOUND ON TRY:   7
PROBABILITY: exp(-16759.053) N_CLASSES:  3 FOUND ON TRY:   2
PROBABILITY: exp(-16898.385) N_CLASSES:  3 FOUND ON TRY:   9
...
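
The probabilities in the summary are far too small to represent as ordinary floating-point numbers, which is why the log reports them as exp(...). Classifications are compared by subtracting log probabilities. The sketch below uses the top entries of the summary. The helper name `as_power_of_ten` is an assumption made for the example.

```python
import math

# (log probability, number of classes, try) from the summary above.
results = [(-16368.367, 10, 5), (-16477.345, 9, 8), (-16537.556, 15, 6)]

def as_power_of_ten(log_ratio):
    """Rewrite exp(log_ratio) as mantissa * 10**power without overflow."""
    exponent = log_ratio / math.log(10)
    power = math.floor(exponent)
    return 10 ** (exponent - power), power

# Evaluating the probability directly underflows to zero.
print(math.exp(results[0][0]))

best, runner_up = results[0], results[1]
log_ratio = best[0] - runner_up[0]
mantissa, power = as_power_of_ten(log_ratio)
print(f"best is exp({log_ratio:.3f}), about {mantissa:.1f}e+{power}, "
      f"times more probable than the runner-up")
```

The same conversion applied to the exp(61.7) figure in the search status report gives a value on the order of 1e+26, in line with the bracketed number printed there. Presumably the small difference comes from the exponent being shown rounded.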

Results


imports-85c.class-text-1

      CROSS REFERENCE: CLASS => CASE NUMBER MEMBERSHIP


      AutoClass CLASSIFICATION for the 205 cases in:
        /home/centauri/cook/projects/ac/sample/imports-85c.db2
        /home/centauri/cook/projects/ac/sample/imports-85c.hd2
      with log-A<X/H> (approximate marginal likelihood) = -16564.197
      from classification results file:
        /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
      and using models:
        /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0



                                 CLASS = 0



Case #   make            num-of-doors   body-style    (Cls  Prob)
--------------------------------------------------------------------------------

     5   audi            four           sedan               0.99
     7   audi            four           sedan               1.00
     8   audi            four           wagon               1.00
     9   audi            four           sedan               1.00
    10   audi            two            hatchback           1.00
    15   bmw             four           sedan               1.00
    16   bmw             four           sedan               1.00
    17   bmw             two            sedan               1.00
    18   bmw             four           sedan               1.00
    48   jaguar          four           sedan               1.00
    49   jaguar          four           sedan               1.00
    50   jaguar          two            sedan               1.00
    68   mercedes-benz   four           sedan               1.00
    69   mercedes-benz   four           wagon               1.00
    70   mercedes-benz   two            hardtop             1.00
    71   mercedes-benz   four           sedan               1.00
...


                                 CLASS = 1



Case #   make            num-of-doors   body-style    (Cls  Prob)
--------------------------------------------------------------------------------

     1   alfa-romero     two            convertible         1.00
     2   alfa-romero     two            convertible         1.00
     3   alfa-romero     two            hatchback           1.00
    11   bmw             two            sedan               1.00
    12   bmw             four           sedan               1.00
    13   bmw             two            sedan               1.00
    14   bmw             four           sedan               0.99
                                                        0   0.01
    30   dodge           two            hatchback           1.00
    47   isuzu           two            hatchback           1.00
    56   mazda           two            hatchback           1.00
    57   mazda           two            hatchback           1.00
    58   mazda           two            hatchback           1.00
    59   mazda           two            hatchback           1.00
    66   mazda           four           sedan               0.99
    76   mercury         two            hatchback           1.00
    83   mitsubishi      two            hatchback           1.00
    84   mitsubishi      two            hatchback           1.00
    85   mitsubishi      two            hatchback           1.00
   105   nissan          two            hatchback
...

                                 CLASS = 2



Case #   make            num-of-doors   body-style    (Cls  Prob)
--------------------------------------------------------------------------------

    19   chevrolet       two            hatchback           1.00
    20   chevrolet       two            hatchback           1.00
    21   chevrolet       four           sedan               1.00
    22   dodge           two            hatchback           1.00
    23   dodge           two            hatchback           1.00
    31   honda           two            hatchback           1.00
    32   honda           two            hatchback           1.00
    33   honda           two            hatchback           1.00
    34   honda           two            hatchback           1.00
    35   honda           two            hatchback           1.00
    36   honda           four           sedan               1.00
    37   honda           four           wagon               1.00
    45   isuzu           two            sedan               1.00
    46   isuzu           four           sedan               1.00
    51   mazda           two            hatchback           1.00
...

                                 CLASS = 9 (continued)



Case #   make            num-of-doors   body-style    (Cls  Prob)
--------------------------------------------------------------------------------

    81   mitsubishi      two            hatchback           1.00
    88   mitsubishi      four           sedan               1.00
    89   mitsubishi      four           sedan               1.00
   120   plymouth        two            hatchback           1.00
   190   volkswagen      two            convertible         0.99

Results


imports-85c.case-text-1

      CROSS REFERENCE: CASE NUMBER => MOST PROBABLE CLASS


      AutoClass CLASSIFICATION for the 205 cases in:
        /home/centauri/cook/projects/ac/sample/imports-85c.db2
        /home/centauri/cook/projects/ac/sample/imports-85c.hd2
      with log-A<X/H> (approximate marginal likelihood) = -16564.197
      from classification results file:
        /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
      and using models:
        /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0



     Case #  Class  Prob         Case #  Class  Prob         Case #  Class  Prob
--------------------------------------------------------------------------------
          1    1    1.00             47    1    0.99             93    2    1.00
          2    1    1.00             48    0    1.00             94    2    0.99
          3    1    1.00             49    0    1.00             95    2    1.00
          4    3    0.99             50    0    1.00             96    2    0.99
          5    0    0.99             51    2    0.99             97    2    0.99
          6    4    0.99             52    2    0.99             98    2    0.99
          7    0    1.00             53    2    0.99             99    2    0.99
          8    0    1.00             54    2    0.99            100    3    0.99
          9    0    1.00             55    2    0.99            101    3    0.99
         10    0    0.99             56    1    1.00            102    0    0.99
...

Results


imports-85c.influ-o-text-1

...
CLASSIFICATION HAS 10 POPULATED CLASSES:  (max global influence value = 7.063)

  We give below a heuristic measure of class strength: the approximate
  geometric mean probability for instances belonging to each class,
  computed from the class parameters and statistics.  This approximates
  the contribution made, by any one instance "belonging" to the class,
  to the log probability of the data set w.r.t. the classification.  It
  thus provides a heuristic measure of how strongly each class predicts
  "its" instances.

   Class     Log of class       Relative         Class     Normalized
    num        strength       class strength     weight    class weight

     0        -8.25e+01          1.64e-10          51         0.249
     1        -8.01e+01          1.69e-09          39         0.190
     2        -6.99e+01          4.89e-05          29         0.141
     3        -6.86e+01          1.75e-04          18         0.088
     4        -7.25e+01          3.58e-06          16         0.078
     5        -6.86e+01          1.68e-04          14         0.068
     6        -7.11e+01          1.43e-05          12         0.059
     7        -5.99e+01          1.00e+00           9         0.044
     8        -6.95e+01          7.20e-05           9         0.044
     9        -6.95e+01          6.73e-05           8         0.039
...
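
The two derived columns of the class table can be recomputed from the other two. The normalized class weight is the class weight divided by the 205 cases. The relative class strength appears to be exp(log strength minus the largest log strength), so that the strongest class scores 1. That reading is an inference from the table, not something the report states. Since the log strengths are printed to three significant digits, the recomputed relative strengths only approximate the printed ones.

```python
import math

# Columns copied from the class strength table above.
log_strength = [-82.5, -80.1, -69.9, -68.6, -72.5, -68.6, -71.1, -59.9, -69.5, -69.5]
weight = [51, 39, 29, 18, 16, 14, 12, 9, 9, 8]

total = sum(weight)                      # number of cases
normalized_weight = [round(w / total, 3) for w in weight]

strongest = max(log_strength)
relative_strength = [math.exp(s - strongest) for s in log_strength]

for number, (rel, norm) in enumerate(zip(relative_strength, normalized_weight)):
    print(f"{number:5d}  {rel:10.2e}  {norm:6.3f}")
```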

ORDERED LIST OF NORMALIZED ATTRIBUTE INFLUENCE VALUES SUMMED OVER ALL CLASSES:

  This gives a rough heuristic measure of relative influence of each
  attribute in differentiating the classes from the overall data set.
  Note that "influence values" are only computable with respect to the
  model terms.  When multiple attributes are modeled by a single
  dependent term (e.g. multi_normal_cn), the term influence value is
  distributed equally over the modeled attributes.

   num                        description                          I-*k

    38: Log compression-ratio                                      1.000
    36: Log curb-weight                                            0.607
    29: Log horse-power                                            0.604
     2: make                                                       0.589
    37: Log engine-size                                            0.582
    32: Log wheel-base                                             0.550
    28: Log stroke                                                 0.515
    33: Log length                                                 0.496
    31: Log price                                                  0.487
    34: Log width                                                  0.437
    17: fuel-system                                                0.414
    27: Log bore                                                   0.408
    26: Log normalized-loses                                       0.305
    35: Log height                                                 0.292
    39: Log city-mpg                                               0.222
     7: drive-wheels                                               0.209
    40: Log highway-mpg                                            0.191
    14: engine-type                                                0.160
     6: body-style                                                 0.130
     3: fuel-type                                                  0.121
     5: num-of-doors                                               0.106
    30: Log peak-rpm                                               0.106
    15: num-of-cylinders                                           0.089
     4: aspiration                                                 0.075
     8: engine-location                                            0.009
     0: symboling                                                  -----
     1: normalized-loses                                           -----
...

Applications: IRAS Data


IRAS Results


Application: DNA Intron Data


Results

Results after removing gene duplication

Application: LandSat Data


Results

Assessment


In the Spotlight