- Group similar items together
- Example: sorting laundry
- Similar items may have important attributes or functionality in common
- Group customers together with similar interests and spending patterns
- Form of *unsupervised learning*
- Cluster objects into classes using the following rule: *maximize intraclass similarity and minimize interclass similarity*
- Probability-based vs. distance-based clustering

- Claritas PRIZM system
- Equifax MicroVision system
- Group population by demographic information
- Used for marketing and sales
- Once clusters are formed, analyze for distinguishing features
| Name | Income | Age | Education | Vendor |
|------|--------|-----|-----------|--------|
| Blue Blood Estates | High | 35-54 | College | PRIZM |
| Shotguns and Pickup | Middle | 35-64 | High school | PRIZM |
| Southside City | Low | Mix | Grade school | MicroVision |
| Living off Land | Middle-Low | Families with children | Low | MicroVision |
| University USA | Very Low | Young-Mix | Medium-High | MicroVision |
| Sunset Years | Medium | Seniors | Medium | MicroVision |

- Another use: deviation detection (items that do not fit in any cluster)

| ID | Name  | Age | Balance ($) | Income | Eyes  | Gender |
|----|-------|-----|-------------|--------|-------|--------|
| 1  | Amy   | 62  | 0           | Medium | Brown | F |
| 2  | Al    | 53  | 1,800       | Medium | Green | M |
| 3  | Betty | 47  | 16,543      | High   | Brown | F |
| 4  | Bob   | 32  | 45          | Medium | Green | M |
| 5  | Carla | 21  | 2,300       | High   | Blue  | F |
| 6  | Carl  | 27  | 5,400       | High   | Brown | M |
| 7  | Donna | 50  | 165         | Low    | Blue  | F |
| 8  | Don   | 46  | 0           | High   | Blue  | F |
| 9  | Edna  | 27  | 500         | Low    | Blue  | F |
| 10 | Ed    | 68  | 1,200       | Low    | Blue  | M |

How would you cluster this data?

- Financial: (3,5,6,8), (1,2,4), (7,9,10)
- Romantic: (4,5,6,9), (7,8,10), (1,2,3)

1. Partitioning based
   - Enumerate partitions and score by some criterion
   - K-means
2. Hierarchy based
   - Create a hierarchical decomposition of the data
3. Model based
   - A model is hypothesized for each cluster
   - Find the models that best fit the data and each other
   - Bayesian classification (AutoClass), Cobweb

- Partitioning in *n*-dimensional space, where *n* is the number of features
- How is distance calculated?
  - Manhattan distance
  - Euclidean distance
  - Can be customized for each type of feature
- Manhattan distance between entries 5 and 6 (age + balance + income + eyes + gender, counting 1 for each mismatched categorical value): 6 + 3,100 + 0 + 1 + 1 = 3,108
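A sketch of this mixed-feature distance in Python: numeric attributes contribute their absolute difference, while categorical attributes contribute 0 on a match and 1 on a mismatch. The function name and the equal categorical weighting are illustrative assumptions, chosen to be consistent with the 3,108 example.

```python
def mixed_manhattan(a, b):
    """Manhattan-style distance over mixed features: numeric fields
    contribute |x - y|; categorical fields contribute 0 if equal,
    1 otherwise (an illustrative weighting)."""
    total = 0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)):
            total += abs(x - y)
        else:
            total += 0 if x == y else 1
    return total

# Entries 5 (Carla) and 6 (Carl): (age, balance, income, eyes, gender)
carla = (21, 2300, "High", "Blue", "F")
carl = (27, 5400, "High", "Brown", "M")
print(mixed_manhattan(carla, carl))  # → 3108
```

In practice each dimension would also carry a weight, since otherwise a large-scale feature like balance dominates the distance.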

- Dimensions can be weighted separately
- Distance usually calculated from center of mass of the cluster to a point
- How many clusters?

1. Determine the desired number of clusters
2. Randomly pick items to become the "seed" of each cluster
3. Assign each entry to the *nearest* cluster
4. Recalculate the centers of the clusters
5. Repeat steps 3 and 4 until the number of moves falls below a threshold
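The five steps above can be sketched in plain Python. This version uses squared Euclidean distance and stops when no center moves, a simplification of the move-count threshold in step 5; names and parameters are illustrative.

```python
import random

def kmeans(points, k, rounds=100, seed=0):
    """Minimal k-means sketch over tuples of numeric features."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random seed items
    clusters = [[] for _ in range(k)]
    for _ in range(rounds):
        # step 3: assign each entry to the nearest cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # step 4: recalculate centers (center of mass of each cluster)
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # step 5: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
print(sorted(centers))  # → [(0.0, 0.5), (10.0, 10.5)]
```

Because the seeds are random, different runs can converge to different partitions; real uses typically restart several times and keep the best result.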

- Create hierarchy of clusters from small to big
- Can choose desired number of clusters after seeing results
- Agglomerative algorithm
  - Start with as many clusters as items
  - Iteratively merge the closest clusters to form the next level
  - Stop with a single cluster
  - More popular method

- Divisive algorithm
  - Start with one cluster
  - Iteratively split until each cluster contains one item (or a cluster-size threshold is reached)
  - More expensive method

- Single-link method
  - Merge clusters whose nearest records are the closest
  - Can create long, chained clusters
- Complete-link method
  - Merge clusters whose farthest records are the closest
  - Creates very compact clusters
- Group-average-link method: merge clusters whose average locations are closest
- Ward's method: merge clusters so as to minimize the total distance between all records
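A sketch of the agglomerative procedure with the single-link rule, over one-dimensional values for brevity (the function name and stopping parameter are illustrative):

```python
def single_link(points, target_k):
    """Agglomerative clustering: start with one cluster per item and
    repeatedly merge the pair of clusters whose nearest members are
    closest, until target_k clusters remain."""
    clusters = [[p] for p in points]

    def cluster_dist(a, b):
        # single-link: distance between the nearest records of a and b
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: cluster_dist(clusters[ij[0]],
                                               clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

print(single_link([1, 2, 10, 11, 12], 2))
```

Replacing `min` with `max` inside `cluster_dist` gives the complete-link rule; because intermediate merge levels are all visited, the desired number of clusters can be chosen after inspecting the hierarchy.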

- Builds a tree of probability-based concept descriptions
- Builds the tree incrementally as observations are processed
- With each observation
  - Adds the new data to an existing node, or
  - Adds a new node to the tree
  - Takes whichever action produces the best partition
- Split and merge operations improve the chosen node or its children
- Discovered concepts maximize the number of features that can be predicted

- Fit one of a set of possible probabilistic models to the data
- Select theory that best fits the data
- Produce class descriptions that maximize likelihood of data
- Classifications provide best representations of observed data
- Representations calculate probability that each observation is in a given class
- Same mechanism can calculate probability that a new observation is in each class

- Databases
- Infrared Astronomical Satellite (IRAS) (77 classes, some relevant unknown patterns)
- DNA Intron (3 classes of patterns in protein donor/acceptor sites)
- LandSat (93 classes corresponding to image features such as road pixels)
- Database of all USA airports

- Instead of splitting data into clusters, search for clusters that predict characteristics of all observations
- Classes (clusters) provide probabilities for all attribute values
- One data point has a probability of membership in all classes
- Searching all possible classifications takes too long
- Instead, search a model space
- A model is defined by V (continuous parameters) and T (discrete parameters)
- Assume cases are independent
- Class definitions can overlap
- For any set of class assignments, calculate maximally likely values for parameters in V
- A number of models can be used for each type of attribute
- Example: Gaussian Normal
  - Model a location attribute
  - Location is a real-valued number
  - A Gaussian Normal distribution gives the probabilities for the attribute value
  - The likelihood of a value is calculated by integrating the density function over a limited range centered at the point value
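For a Normal model, that integral over a limited range can be computed from the Gaussian CDF (via the error function). This is a minimal sketch; the function name and interval convention are illustrative assumptions, not AutoClass's API.

```python
import math

def gaussian_interval_prob(x, mean, sigma, half_width):
    """Probability mass of a Normal(mean, sigma) over the interval
    [x - half_width, x + half_width], i.e. the integral of the density
    over a limited range centered at the point value x."""
    def cdf(t):
        return 0.5 * (1 + math.erf((t - mean) / (sigma * math.sqrt(2))))
    return cdf(x + half_width) - cdf(x - half_width)

# ~68.3% of a standard Normal lies within one sigma of the mean
print(round(gaussian_interval_prob(0.0, 0.0, 1.0, 1.0), 4))  # → 0.6827
```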

For i in NumberOfClusters:
    Randomly initialize i clusters
    Do:
        Compute class likelihood vectors
        Compute normalized probabilities for each data point
        Update class model parameters
        Analyze new parameters that will maximize probabilities
            (for a normal function, recalculate mean, variance, skewness, kurtosis)
    Until convergence (sum of classes' log marginal probability > threshold, or no change)
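The inner loop above is the EM algorithm. A runnable sketch for a one-dimensional mixture of Gaussians follows; it models only means and variances (not skewness or kurtosis), replaces the convergence test with a fixed iteration count, and initializes means deterministically rather than randomly. Names are illustrative, not AutoClass's.

```python
import math

def em_mixture_1d(data, k=2, iters=50):
    """EM sketch for a 1-D mixture of k Gaussians."""
    # initialize k clusters by spreading means across the sorted data
    srt = sorted(data)
    means = [srt[i * (len(srt) - 1) // max(k - 1, 1)] for i in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k

    def pdf(x, m, v):
        return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E step: class likelihood vectors, normalized per data point
        resp = []
        for x in data:
            lik = [w * pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
            total = sum(lik)
            resp.append([l / total for l in lik])
        # M step: update class model parameters to maximize probabilities
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            variances[c] = max(sum(r[c] * (x - means[c]) ** 2
                                   for r, x in zip(resp, data)) / n_c, 1e-6)
            weights[c] = n_c / len(data)
    return means, variances, weights
```

On well-separated data the means converge to the two group centers, and each point's normalized responsibility vector is exactly the soft class membership AutoClass reports.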

imports-85c.hd2

num_db2_format_defs 2
number_of_attributes 26
separator_char ','
; Can also supply comment char and unknown token
0 discrete nominal "symboling" range 7
1 real scalar "normalized-loses" zero_point 0.0 rel_error 0.01
2 discrete nominal "make" range 22
3 discrete nominal "fuel-type" range 2
4 discrete nominal "aspiration" range 2
5 discrete nominal "num-of-doors" range 2
6 discrete nominal "body-style" range 5
7 discrete nominal "drive-wheels" range 3
8 discrete nominal "engine-location" range 2
9 real scalar "wheel-base" zero_point 0.0 rel_error 0.001
10 real scalar "length" zero_point 0.0 rel_error 0.001
11 real scalar "width" zero_point 0.0 rel_error 0.001
12 real scalar "height" zero_point 0.0 rel_error 0.001
13 real scalar "curb-weight" zero_point 0.0 rel_error 0.0002
14 discrete nominal "engine-type" range 7
15 discrete nominal "num-of-cylinders" range 7
16 real scalar "engine-size" zero_point 0.0 rel_error 0.01
17 discrete nominal "fuel-system" range 8
18 real scalar "bore" zero_point 0.0 rel_error 0.003
19 real scalar "stroke" zero_point 0.0 rel_error 0.003
20 real scalar "compression-ratio" zero_point 0.0 rel_error 0.003
21 real scalar "horse-power" zero_point 0.0 rel_error 0.01
22 real scalar "peak-rpm" zero_point 0.0 rel_error 0.02
23 real scalar "city-mpg" zero_point 0.0 rel_error 0.04
24 real scalar "highway-mpg" zero_point 0.0 rel_error 0.04
25 real scalar "price" zero_point 0.0 rel_error 0.001

imports-85c.db2

3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
...

imports-85c.model

model_index 0 4
ignore 0
single_normal_cm 1 18 19 21 22 25
single_normal_cn 9 10 11 12 13 16 20 23 24
single_multinomial default

imports-85c.s-params (abbreviated)

# start_j_list = 2, 3, 5, 7, 10, 15, 25
# min_report_period = 30
# max_duration = 0
# max_n_tries = 0
# n_save = 2
...

imports-85c.log

AUTOCLASS C (version 2.5) STARTING at Mon Jun 26 16:30:39 1995
AUTOCLASS -SEARCH default parameters: ...

WELCOME TO AUTOCLASS.
1) Each time I have finished a new 'trial', or attempt to find a good
   classification, I will print the number of classes that trial started
   and ended with, such as 9->7.
2) If that trial results in a duplicate of a previous run, I will print
   'dup' first.
3) If that trial results in a classification better than any previous,
   I will print 'best' first.
4) If more than 30 seconds have passed since the last report, and a new
   classification has been found which is better than any previous ones,
   I will report on that classification and on the status of the search
   so far.
5) This report will include an estimate of the time it will take to find
   another even better classification, and how much better that will be.
   In addition, I will estimate a lower bound on how long it might take
   to find the very best classification, and how much better that might be.
6) If you are warned about too much time in overhead, you may want to
   change the parameters n_save, min_save_period, min_report_period, or
   min_checkpoint_period.
7) To quit searching, type a 'q', hit <return>, and wait. Otherwise I'll
   go on until I complete trial number (12).
8) If needed, every 30 minutes I will save the best 2 classifications so
   far to file: /home/tove/p/autoclass-c/sample/imports-85c.results-bin
   and a description of the search to file:
   /home/tove/p/autoclass-c/sample/imports-85c.search
9) A record of this search will be printed to file:
   /home/tove/p/autoclass-c/sample/imports-85c.log

BEGINNING SEARCH at Mon Jun 26 16:30:40 1995

[j_in=2] [cs-3: cycles 15] best2->2(1)
[j_in=3] [cs-3: cycles 49] best3->3(2)
[j_in=5] [cs-3: cycles 12] best5->5(3)
[j_in=7] [cs-3: cycles 11] best7->7(4)
[j_in=10] [cs-3: cycles 14] best10->10(5)
[j_in=15] [cs-3: cycles 28] 15->15(6)
[j_in=25] [cs-3: cycles 10] 25->22(7)

---------------- NEW BEST CLASSIFICATION FOUND on try 5 -------------
It has 10 CLASSES with WEIGHTS 32 30 28 24 21 21 20 11 10 8
PROBABILITY of both the data and the classification = exp(-16368.367)
(Also found 4 other better than last report.)

----------- SEARCH STATUS as of Mon Jun 26 16:31:12 1995 -----------
It just took 32 seconds since beginning.
Estimate < 28 seconds to find a classification exp(61.7) [= 6.0e+26]
times more probable.
Estimate >> 1 minute 6 seconds to find the very best classification,
which may be exp(28.6) to exp(11764.5) times more probable.
Have seen 7 of the estimated > 21 possible classifications (based on
0 duplicates so far).
Log-Normal fit to classification probabilities has M(ean) -16598.5, S(igma) 154.9
Choosing initial n-classes randomly from a log_normal [M-S, M, M+S] = [2.9, 7.0, 16.9]
Overhead time is 3.0 % of total search time

[j_in=9] [cs-3: cycles 10] 9->9(8)
[j_in=3] [cs-3: cycles 11] 3->3(9)
[j_in=5] [cs-3: cycles 48] 5->5(10)
[j_in=3] [cs-3: cycles 18] 3->3(11)
[j_in=5] [cs-3: cycles 35] 5->5(12)

ENDING SEARCH because max number of tries reached at Mon Jun 26 16:31:32 1995
after a total of 12 tries over 53 seconds

A log of this search is in file:
  /home/tove/p/autoclass-c/sample/imports-85c.log
The search results are stored in file:
  /home/tove/p/autoclass-c/sample/imports-85c.results-bin
This search can be restarted by having "force_new_search_p = false" in file:
  /home/tove/p/autoclass-c/sample/imports-85c.s-params
and reinvoking the "autoclass -search ..." form

------------------ SUMMARY OF 10 BEST RESULTS ------------------
PROBABILITY: exp(-16368.367)  N_CLASSES: 10  FOUND ON TRY:  5  *SAVED*
PROBABILITY: exp(-16477.345)  N_CLASSES:  9  FOUND ON TRY:  8  *SAVED*
PROBABILITY: exp(-16537.556)  N_CLASSES: 15  FOUND ON TRY:  6
PROBABILITY: exp(-16542.413)  N_CLASSES:  7  FOUND ON TRY:  4
PROBABILITY: exp(-16590.504)  N_CLASSES:  5  FOUND ON TRY: 10
PROBABILITY: exp(-16617.452)  N_CLASSES:  5  FOUND ON TRY:  3
PROBABILITY: exp(-16632.595)  N_CLASSES:  5  FOUND ON TRY: 12
PROBABILITY: exp(-16673.545)  N_CLASSES: 22  FOUND ON TRY:  7
PROBABILITY: exp(-16759.053)  N_CLASSES:  3  FOUND ON TRY:  2
PROBABILITY: exp(-16898.385)  N_CLASSES:  3  FOUND ON TRY:  9
...

imports-85c.class-text-1

CROSS REFERENCE: CLASS => CASE NUMBER MEMBERSHIP

AutoClass CLASSIFICATION for the 205 cases in:
  /home/centauri/cook/projects/ac/sample/imports-85c.db2
  /home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
  /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
  /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0

CLASS = 0

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
  5     audi           four          sedan         0.99
  7     audi           four          sedan         1.00
  8     audi           four          wagon         1.00
  9     audi           four          sedan         1.00
 10     audi           two           hatchback     1.00
 15     bmw            four          sedan         1.00
 16     bmw            four          sedan         1.00
 17     bmw            two           sedan         1.00
 18     bmw            four          sedan         1.00
 48     jaguar         four          sedan         1.00
 49     jaguar         four          sedan         1.00
 50     jaguar         two           sedan         1.00
 68     mercedes-benz  four          sedan         1.00
 69     mercedes-benz  four          wagon         1.00
 70     mercedes-benz  two           hardtop       1.00
 71     mercedes-benz  four          sedan         1.00
...

CLASS = 1

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
  1     alfa-romero    two           convertible   1.00
  2     alfa-romero    two           convertible   1.00
  3     alfa-romero    two           hatchback     1.00
 11     bmw            two           sedan         1.00
 12     bmw            four          sedan         1.00
 13     bmw            two           sedan         1.00
 14     bmw            four          sedan         0.99   0  0.01
 30     dodge          two           hatchback     1.00
 47     isuzu          two           hatchback     1.00
 56     mazda          two           hatchback     1.00
 57     mazda          two           hatchback     1.00
 58     mazda          two           hatchback     1.00
 59     mazda          two           hatchback     1.00
 66     mazda          four          sedan         0.99
 76     mercury        two           hatchback     1.00
 83     mitsubishi     two           hatchback     1.00
 84     mitsubishi     two           hatchback     1.00
 85     mitsubishi     two           hatchback     1.00
105     nissan         two           hatchback     ...

CLASS = 2

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
 19     chevrolet      two           hatchback     1.00
 20     chevrolet      two           hatchback     1.00
 21     chevrolet      four          sedan         1.00
 22     dodge          two           hatchback     1.00
 23     dodge          two           hatchback     1.00
 31     honda          two           hatchback     1.00
 32     honda          two           hatchback     1.00
 33     honda          two           hatchback     1.00
 34     honda          two           hatchback     1.00
 35     honda          two           hatchback     1.00
 36     honda          four          sedan         1.00
 37     honda          four          wagon         1.00
 45     isuzu          two           sedan         1.00
 46     isuzu          four          sedan         1.00
 51     mazda          two           hatchback     1.00
...

CLASS = 9 (continued)

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
 81     mitsubishi     two           hatchback     1.00
 88     mitsubishi     four          sedan         1.00
 89     mitsubishi     four          sedan         1.00
120     plymouth       two           hatchback     1.00
190     volkswagen     two           convertible   0.99

imports-85c.case-text-1

CROSS REFERENCE: CASE NUMBER => MOST PROBABLE CLASS

AutoClass CLASSIFICATION for the 205 cases in:
  /home/centauri/cook/projects/ac/sample/imports-85c.db2
  /home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
  /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
  /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0

Case #  Class  Prob    Case #  Class  Prob    Case #  Class  Prob
--------------------------------------------------------------------------------
  1     1      1.00     47     1      0.99     93     2      1.00
  2     1      1.00     48     0      1.00     94     2      0.99
  3     1      1.00     49     0      1.00     95     2      1.00
  4     3      0.99     50     0      1.00     96     2      0.99
  5     0      0.99     51     2      0.99     97     2      0.99
  6     4      0.99     52     2      0.99     98     2      0.99
  7     0      1.00     53     2      0.99     99     2      0.99
  8     0      1.00     54     2      0.99    100     3      0.99
  9     0      1.00     55     2      0.99    101     3      0.99
 10     0      0.99     56     1      1.00    102     0      0.99
...

imports-85c.influ-o-text-1

...
CLASSIFICATION HAS 10 POPULATED CLASSES:  (max global influence value = 7.063)

We give below a heuristic measure of class strength: the approximate
geometric mean probability for instances belonging to each class, computed
from the class parameters and statistics. This approximates the contribution
made, by any one instance "belonging" to the class, to the log probability
of the data set w.r.t. the classification. It thus provides a heuristic
measure of how strongly each class predicts "its" instances.

Class  Log of class  Relative        Class   Normalized
num    strength      class strength  weight  class weight
0      -8.25e+01     1.64e-10        51      0.249
1      -8.01e+01     1.69e-09        39      0.190
2      -6.99e+01     4.89e-05        29      0.141
3      -6.86e+01     1.75e-04        18      0.088
4      -7.25e+01     3.58e-06        16      0.078
5      -6.86e+01     1.68e-04        14      0.068
6      -7.11e+01     1.43e-05        12      0.059
7      -5.99e+01     1.00e+00         9      0.044
8      -6.95e+01     7.20e-05         9      0.044
9      -6.95e+01     6.73e-05         8      0.039
...

ORDERED LIST OF NORMALIZED ATTRIBUTE INFLUENCE VALUES SUMMED OVER ALL CLASSES:

This gives a rough heuristic measure of relative influence of each attribute
in differentiating the classes from the overall data set. Note that
"influence values" are only computable with respect to the model terms. When
multiple attributes are modeled by a single dependent term (e.g.
multi_normal_cn), the term influence value is distributed equally over the
modeled attributes.

num  description            I-*k
38:  Log compression-ratio  1.000
36:  Log curb-weight        0.607
29:  Log horse-power        0.604
 2:  make                   0.589
37:  Log engine-size        0.582
32:  Log wheel-base         0.550
28:  Log stroke             0.515
33:  Log length             0.496
31:  Log price              0.487
34:  Log width              0.437
17:  fuel-system            0.414
27:  Log bore               0.408
26:  Log normalized-loses   0.305
35:  Log height             0.292
39:  Log city-mpg           0.222
 7:  drive-wheels           0.209
40:  Log highway-mpg        0.191
14:  engine-type            0.160
 6:  body-style             0.130
 3:  fuel-type              0.121
 5:  num-of-doors           0.106
30:  Log peak-rpm           0.106
15:  num-of-cylinders       0.089
 4:  aspiration             0.075
 8:  engine-location        0.009
 0:  symboling              -----
 1:  normalized-loses       -----
...

- 5,425 mean spectra of IRAS point sources
- Each spectrum consists of 100 "blue" and 100 "red" channels
- Spectra cover a wide range of intensities
- Treat each channel as independent normally distributed single real value
- Many difficulties in interacting with scientists
  - Scientists released pre-processed data
  - Pre-processing removed some interesting data
  - Method of pre-processing not initially revealed
  - Reference point was changed from Vega to Tau partway through collection, and the change was not mentioned

- Generated 77 classes
- Significantly different classification than with human analysis
- AutoClass found many subtle distinctions between spectra that superficially look similar (not previously known)
- Example: two subgroups of stars distinguished (not previously known to be different)
- Analyzed classes containing known carbon stars, thereby tripling number of known (or suspected) carbon stars
- Revealed blackbody stars with significant IR excess (dust surrounding star)

- Database of 3,000 donor and acceptor sites from human DNA
- Coding DNA (exons) is interspersed with noncoding introns, which are spliced out of the messenger RNA
- The beginning of a splice point is the donor site; the end is the acceptor site
- Intron length (between donor and acceptor) can vary from 80 to thousands of bases
- The donor database consists of an ordered list of the 10 bases before each splice site and the first 40 bases of the intron
- Bases are A (adenine), C (cytosine), G (guanine), and T (thymine)
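As a toy illustration of how sequence classes like "C-rich" can be characterized, per-position base frequencies can be tallied over aligned sequences. The function and the sequences below are made up for illustration, not real splice-site data.

```python
def position_base_frequencies(sequences):
    """For equal-length aligned sequences over the alphabet ACGT,
    return a list (one entry per position) of dicts mapping each
    base to its frequency at that position."""
    length = len(sequences[0])
    freqs = []
    for i in range(length):
        column = [s[i] for s in sequences]
        freqs.append({b: column.count(b) / len(column) for b in "ACGT"})
    return freqs

# toy "donor site" fragments, all the same length
freqs = position_base_frequencies(["CCGTA", "CCGTC", "CAGTC"])
print(freqs[0])  # → {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0}
```

A class where one base dominates most positions (as C does here) is exactly the kind of pattern reported for the first donor/acceptor class.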

Results

- The first run generated many classes with one unique base sequence per class
- There are many duplicated splice sites in human DNA
- Analysis showed many duplicates appear in sequence in the same gene
- When they occur in different genes, they are usually the result of gene duplication

Results after removing gene duplication

- Found three classes
- In the first class, every position was dominated by C (C-rich)
- The other two classes were TA-rich and G-rich
- Class of donor site correlated with class of acceptor site
  - If the donor site is C-rich, the acceptor site and the entire intron are also C-rich
  - A similar pattern was observed for all classes
- If one intron is rich in a particular base, there is a high probability that neighboring introns will be rich in the same base

- Analyze 1024x1024 array of satellite image pixels
- Each pixel records seven spectral intensity values
- 1,000,000 cases
- Big enough to need parallel algorithm
- Parallel AutoClass, C AutoClass developed with UTA

Results

- Discovered 93 classes
- Classes were used to discover meta-classes
- Classes corresponded to roads, rivers, valley bottoms, valley edges, and fields of crops

- Clustering is a good first tool when classes are unknown
- Results can be used as-is or used to determine classes
- Adding structure information may yield better results