| Algorithm | Representation | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| Association Rule | Propositional if-then rules | Discover correlations | Simple, no oracle | Slow, limited representation, no prediction |
| Decision Tree | Disjunction of propositional conjunctions | Prediction | Robust prediction, symbolic rules | Overfit, limited representation |
| Neural Network | Nonlinear numeric f(x) | Prediction | Numeric, noisy data | Overfit, slow, difficult to interpret output |
| Naive Bayes | Estimated probabilities | Prediction | Noisy data, provable answer | Not effective in practice |
| Belief Network | Network & CPTs | Prediction, variable influence | Prior info, probability distribution | Slow, hard to learn structure |
| Nearest Neighbor | Instances | Prediction | No training, no fixed bias, no loss of info | Slow query, redundant attributes, no bias |
| Clustering | Clusters | Discover correlations, discover variable influence, find similar instances | No oracle, find similar features | Difficult to interpret output |

- Image-Based Data Mining
- Job posting on Monday
- Analyze images for stroke, aging, trauma

- Newswire Data mining
- Reuters Newswire (1987)
- 22,173 documents
- 135 keywords (countries, topics, people, organizations, stock exchanges)
- Background knowledge: 1995 CIA World Factbook (member of, land boundaries, natural resources, export commodities, export partners, industries, etc.)
- Iran, Nicaragua, USA → Reagan (6/1.0)
- Iran, USA → Reagan (18/0.692)
- gold, copper → Canada (5/0.625)
- gold, copper → USA (12/0.571)
- gold, copper → Switzerland (5/1.0)
- gold, copper → Belgium (5/1.0)

- Churning (customer turnover) in telecommunications industry
- Cost of churn is approximately $400 per new subscriber
- Prediction (will this customer churn and when)
- Understanding (why do particular customers churn)
- Act (reduce churn rate by offering incentives)
- account length in days, international plan, voice mail, number of messages, length of day time calls, length of evening calls, length of night calls, length of international calls, customer service calls, churn status
- Clusters
- International Users (no voice mail, high international usage)
- Internet Users (no voice mail, high day, evening, night usage)
- Busy Workers (no voice mail, low day and evening usage, high customer service calls)
- Long term customers, long term voice mail customers, new customer

- Convert data to desired format
- Transform data
- Normalization
- Smoothing
- Data Reduction
- Feature discretization
- Sampling
- Feature selection

- Feature composition
- Often requires human assistance to find best set of transformations

- Dealing with Missing Data

A discretization algorithm converts continuous features into discrete features

1. Some algorithms are limited to discrete inputs
2. Many algorithms discretize as part of the learning algorithm, perhaps not in the best manner
3. Continuous features drastically slow learning
4. Discretized data is more easily viewed

Discretization can be classified on two dimensions:

1. Supervised vs. unsupervised: supervised discretization uses class information
2. Global (mesh) vs. local: some discretization methods can be applied either globally or locally

Here we consider only global discretization of single features.

1. Limit the scope of the problem: allowing discretization of multiple input features is as hard as the entire induction problem
2. Discretization of a single feature is easy to interpret

- Given *k* bins, divide the training-set range into *k* equal-size bins
- Problems
  - (a) Unsupervised
  - (b) Where does *k* come from?
  - (c) Sensitive to outliers

- Split into intervals containing equal numbers of instances
- Divide *m* instances into *k* bins, each containing *m*/*k* (possibly duplicated) values
- This method is unsupervised
- Not often used
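Both unsupervised binning schemes can be sketched in a few lines of Python (the function names are illustrative, not from any library):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each holds about m/k of the m values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    m = len(values)
    bins = [0] * m
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // m, k - 1)
    return bins
```

On `[1, 2, 3, 4, 100]` with *k* = 4, equal-width binning places the first four values in bin 0 and the outlier alone in bin 3, illustrating the sensitivity to outliers noted above.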

A possible local unsupervised method: k-means clustering

- Developed in 1993 by Holte
- Used in OneR induction algorithm
- Induces one-level decision trees (decision stumps)
- Divide range into pure bins
- Each bin contains strong majority of a class
- Each bin must include at least threshold number of instances
- This method is supervised
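A much-simplified sketch of the 1R binning idea for a single numeric feature (illustrative names; real 1R also merges adjacent bins that share a majority class):

```python
def one_r_bins(values, labels, min_bucket=3):
    """Sweep sorted values, closing a bin once its majority class has at
    least min_bucket instances and the next value is distinct (so equal
    values never straddle a boundary). Returns the cut points."""
    pairs = sorted(zip(values, labels))
    cuts, counts = [], {}
    for i, (v, y) in enumerate(pairs):
        counts[y] = counts.get(y, 0) + 1
        nxt = pairs[i + 1][0] if i + 1 < len(pairs) else None
        if max(counts.values()) >= min_bucket and nxt is not None and nxt != v:
            cuts.append((v + nxt) / 2)  # cut midway between distinct values
            counts = {}
    return cuts
```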

- Developed in 1993 by Fayyad and Irani
- Find best threshold split, such that mutual information between feature and label is maximal
- Split data according to threshold
- Given
  - Set of instances *S*
  - Feature *A*
  - Partition boundary *T*
- Class entropy of the partition induced by *T*, E(A,T;S), is calculated as

  E(A,T;S) = (|S_{1}|/|S|) Ent(S_{1}) + (|S_{2}|/|S|) Ent(S_{2})

  where S_{1} and S_{2} are the subsets of S with A ≤ T and A > T

- Recursively discretize each partition
- Stopping condition based on the MDL Principle: stop splitting when

  Gain(A,T;S) < log_{2}(N − 1)/N + Δ(A,T;S)/N

  where
  - N is the number of instances in set S
  - Gain(A,T;S) = Ent(S) − E(A,T;S)
  - Δ(A,T;S) = log_{2}(3^{k} − 2) − [k·Ent(S) − k_{1}·Ent(S_{1}) − k_{2}·Ent(S_{2})], with k, k_{1}, k_{2} the number of distinct classes in S, S_{1}, S_{2}

- Run time is *O*(*km* lg *m*), space is *O*(*m*)

How many partitions? D2 (Catlett), MDL approach (Fayyad and Irani) offer answers.

This method is supervised.
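The core step — finding the single threshold that minimizes the class entropy E(A,T;S) of the induced partition — can be sketched as follows (names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S): class entropy of a set of labels."""
    m = len(labels)
    return -sum((c / m) * log2(c / m) for c in Counter(labels).values()) if m else 0.0

def best_entropy_split(values, labels):
    """Return the threshold T maximizing Gain(A,T;S) = Ent(S) - E(A,T;S)."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    m = len(ys)
    base = entropy(ys)
    best_gain, best_t = 0.0, None
    for i in range(1, m):
        if vs[i] == vs[i - 1]:
            continue  # candidate boundaries lie only between distinct values
        e = (i / m) * entropy(ys[:i]) + ((m - i) / m) * entropy(ys[i:])
        if base - e > best_gain:
            best_gain, best_t = base - e, (vs[i] + vs[i - 1]) / 2
    return best_t, best_gain
```

The recursive method applies this to each side of the chosen cut until the MDL stopping criterion fires.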

- Compare binning, 1R, and entropy-based partitioning
- Use C4.5 and Naive Bayes with and without discretizations
- Test on 16 UCI datasets, all with at least one continuous feature

1. Discretize entire data file
2. Run 10-fold cross-validation
3. Report accuracy with and without discretization
4. Why is this bad? (The supervised discretizers see the test folds' labels before cross-validation, biasing accuracy estimates upward.)

There are large differences in the number of intervals. Here are the results on the diabetes dataset.

| Method | Accuracy | Intervals per attribute |
| --- | --- | --- |
| Entropy | 76.04 | 2, 4, 1, 2, 3, 2, 2, 3 |
| 1R | 72.40 | 6, 13, 4, 6, 18, 25, 41, 12 |
| Binning | 73.44 | 8, 14, 11, 11, 15, 16, 18, 11 |

- All discretization methods for Naive Bayes lead to an average increase in accuracy
- Entropy improves performance on all but three data sets
- For C4.5, not much change

- Similar to Entropy method
- Use C4.5 to build a tree with just the one continuous feature
- Apply pruning to find appropriate number of nodes (number of discretization intervals)
- Increased pruning confidence beyond default value
- Run time is *O*(log_{1/(1−p)} *m* · *m* lg *m*), where *p* is the portion of instances split with each decision
- Space is *O*(*m*)

- How many training examples do we need?
- What type of training examples do we need?
- More training examples, better accuracy
- PAC learning
- Random sampling
- Avoid bias

Selecting a subset of features to give to data mining algorithm.

Motivations

1. Improve accuracy; many algorithms degrade in performance when given too many features
2. Improve comprehensibility
3. Reduce cost and complexity
4. Investigate with respect to the classification task
5. Scale up to datasets with a large number of features

Credit Approval Database

- Initial number of features: 14
- Criterion: maximal information gain
- Selected features
- Other investments
- Savings account balance
- Bank account

- Conclusion: Other 11 attributes not required for predicting class of customer

- Bayesian approach: no bad features
- Information-theoretic approach: prefer features which reduce uncertainty (entropy) in class
- Distance measures: maximize distance between prior and posterior distributions of class
- Dependence measures: correlation
- Consistency measures: find minimum set of features
- Accuracy measures: choose set of features which maximizes accuracy

Given

- Induction algorithm *I*
- Dataset *D*

The *optimal feature subset* *S*^{*} is the set of features that yields the highest-accuracy classifier:

*S*^{*} = argmax_{S' ⊆ F} acc(*I*(*D*_{S'}))

where *I*(*D*_{S'}) is the classifier built by *I* from the dataset *D* using only the features in *S*'.

- Select features in a preprocessing step
- Generally ignores effects on performance of the learning algorithm

- Almuallim and Dietterich, 1991
- Considers all subsets of features
- Selects minimal subset sufficient to determine label value of all training instances (MinFeatures bias)
- Does not generalize well
- Example: use SSN as only feature to discriminate people
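The MinFeatures bias can be sketched as an exhaustive search for the smallest consistent subset (exponential in the number of features; names are illustrative):

```python
from itertools import combinations

def focus(instances, labels):
    """Return the smallest feature subset under which no two instances
    with identical projected values carry different labels."""
    n = len(instances[0])
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            seen = {}
            if all(seen.setdefault(tuple(x[i] for i in subset), y) == y
                   for x, y in zip(instances, labels)):
                return subset
    return tuple(range(n))
```

On XOR-labeled data with one irrelevant feature it returns exactly the two parity features; on data containing an SSN-like unique feature it returns just that feature, showing why the bias can generalize poorly.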

- Compute mutual information between each feature and the class
- MI between two variables: the average reduction in uncertainty about the second variable given the value of the first
- Assume we are evaluating feature *j*; the MI weight *w*_{j} is

  *w*_{j} = Σ_{v} Σ_{c} P(*x*_{j}=v, y=c) · log_{2} [ P(*x*_{j}=v, y=c) / ( P(*x*_{j}=v) · P(y=c) ) ]

- P(y=c) is the proportion of training examples in class *c*
- P(*x*_{j}=v) is the probability that feature *j* has value *v*

- More difficult for real-valued features
- Each feature is treated independently
- Fails on a parity function over *n* features, since each feature alone has zero mutual information with the class
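A direct computation of the MI weight for one discrete feature (illustrative names):

```python
from collections import Counter
from math import log2

def mi_weight(feature_vals, labels):
    """w_j = sum over values v and classes c of
    P(x_j=v, y=c) * log2( P(x_j=v, y=c) / (P(x_j=v) * P(y=c)) )."""
    m = len(labels)
    pv, pc = Counter(feature_vals), Counter(labels)
    pvc = Counter(zip(feature_vals, labels))
    return sum((n / m) * log2((n / m) / ((pv[v] / m) * (pc[c] / m)))
               for (v, c), n in pvc.items())
```

For a class defined as the parity of two features, each feature scores 0 on its own — the independence weakness noted above.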

- Kira and Rendell, 1992
- Assign "relevance" weight to each feature
- Finds all relevant features, not necessarily minimal set
- Collect random sample of instances
- For each instance
- Find nearest instance of same class (nearest hits)
- Find nearest instance of different class (nearest miss)
- Distance similar to nearest neighbor algorithms

- The estimate W[A] of attribute A approximates the following value:
  W[A] = P(different value of A | nearest instance from a different class) − P(different value of A | nearest instance from the same class)
- A good attribute should differentiate between instances from different classes and have the same value for instances from the same class

```
set all weights W[A] := 0
for i := 1 to m do
    randomly select instance R
    find nearest hit H and nearest miss M
    for A := 1 to AllAttributes do
        W[A] := W[A] - diff(A,R,H)/m + diff(A,R,M)/m
```

Here diff(Attribute,Instance1,Instance2) calculates difference between values of the Attribute for two instances.

- Discrete attributes: difference is 0 or 1
- Continuous attributes: difference is the actual difference normalized to the [0,1] interval
- All weights are in interval [-1,1]
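A runnable rendering of the Relief pseudocode, assuming attributes are either numeric or string-valued (helper names are illustrative):

```python
import random

def diff(a, x1, x2, ranges):
    """0/1 for discrete (string) attributes; |x1-x2| normalized to [0,1] otherwise."""
    if isinstance(x1[a], str):
        return 0.0 if x1[a] == x2[a] else 1.0
    return abs(x1[a] - x2[a]) / ranges[a] if ranges[a] else 0.0

def relief(instances, labels, m, seed=0):
    """Relief: reward attributes whose values differ from the nearest miss
    and penalize those that differ from the nearest hit."""
    rng = random.Random(seed)
    n = len(instances[0])
    ranges = [max(x[a] for x in instances) - min(x[a] for x in instances)
              if not isinstance(instances[0][a], str) else 1.0
              for a in range(n)]
    def dist(x1, x2):
        return sum(diff(a, x1, x2, ranges) for a in range(n))
    w = [0.0] * n
    for _ in range(m):
        i = rng.randrange(len(instances))
        r, yr = instances[i], labels[i]
        hit = min((j for j in range(len(instances)) if j != i and labels[j] == yr),
                  key=lambda j: dist(r, instances[j]))
        miss = min((j for j in range(len(instances)) if labels[j] != yr),
                   key=lambda j: dist(r, instances[j]))
        for a in range(n):
            w[a] += (diff(a, r, instances[miss], ranges)
                     - diff(a, r, instances[hit], ranges)) / m
    return w
```

On data where one feature separates the classes and another varies randomly, the informative feature gets a positive weight and the noisy one a negative weight.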

- Generate decision tree using training set
- Decision tree usually only uses subset of features
- Select features that appear in decision tree
- Features useful for decision trees not necessarily useful for nearest neighbor
- Totally irrelevant features will be removed
- Decision trees do not test more than *O*(lg *m*) features in a path


- Kohavi
- Use induction algorithm as a black box
- Conduct search (best-first search here) in the space of subsets
- The estimated prediction accuracy using cross-validation is the search heuristic

- Tested wrapper using ID3 and Naive Bayes
- Best-first search starting with empty set of features
- Final feature subsets evaluated on unseen test instances using five-fold cross validation

- Wrapper very slow
- Wrapper has danger of overfitting
- Idea for both approaches: weight features, integrate into ML algorithm
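A stripped-down wrapper sketch: a 1-nearest-neighbor black box scored by leave-one-out accuracy, with greedy forward search standing in for best-first search (all names illustrative):

```python
def loo_accuracy(instances, labels, subset):
    """Leave-one-out accuracy of 1-NN restricted to the given features."""
    if not subset:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(zip(instances, labels)):
        j = min((k for k in range(len(instances)) if k != i),
                key=lambda k: sum((x[a] - instances[k][a]) ** 2 for a in subset))
        correct += labels[j] == y
    return correct / len(labels)

def wrapper_select(instances, labels):
    """Add the feature that most improves estimated accuracy; stop when
    no addition helps. The induction algorithm is a black box throughout."""
    remaining = set(range(len(instances[0])))
    subset, best = [], 0.0
    while remaining:
        a, acc = max(((a, loo_accuracy(instances, labels, subset + [a]))
                      for a in remaining), key=lambda t: t[1])
        if acc <= best:
            break
        subset.append(a)
        remaining.remove(a)
        best = acc
    return subset, best
```

Because every candidate subset is scored by re-running the learner, the search is slow and can overfit the accuracy estimate, as noted above.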

- Used AutoClass class as input feature
- No significant performance improvement
- AutoClass class was root feature in all of the best trees
- AutoClass results used to remove features without significant change in accuracy

**http://www.research.ibm.com/scout**

- Most prediction methods do not handle missing data well
- Missing values cannot be multiplied or compared
- Solutions
- Use only features with all values
- Use only cases with all values
- These methods lead to bias and insufficient sample sizes
- Fill in missing values during data preparation
- Replace all values with single global constant
- Replace each missing value with a single value (single imputation)
- Replace all missing values with an "unknown category"
- CART uses this approach
- Adds additional parameter to estimate
- Does not reflect fact that missing value is actually part of original value set
- Can form a class just on the "unknown category" value

- Problem: The replaced values are frequently not the correct value

- Replace a missing value with its feature mean (mean imputation)
- Replace a missing value with its feature and class mean
- Replace by expected value calculated from probability distribution
- These methods do not capture uncertainty about the true value
- They can lead to biases
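The first two single-imputation rules are one-liners; a sketch using `None` for missing values:

```python
def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    known = [v for v in values if v is not None]
    mu = sum(known) / len(known)
    return [mu if v is None else v for v in values]

def class_mean_impute(values, labels):
    """Replace each None with the feature mean within the instance's class."""
    mus = {}
    for c in set(labels):
        known = [v for v, y in zip(values, labels) if y == c and v is not None]
        mus[c] = sum(known) / len(known)
    return [mus[y] if v is None else v for v, y in zip(values, labels)]
```

Both share the weakness noted above: the imputed value carries no record of its own uncertainty.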

- Replace each missing value by a vector of *M* imputed values
- Generates *M* complete data sets
- Analyze each set separately

- Pretend the missing value has *all* possible values
- Weight each value according to its frequency among examples in that part of the space

| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
| --- | --- | --- | --- | --- | --- |
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | ? | No |

D14 now becomes two fractional instances:

- D14 | Rain | Mild | High | Weak | No (weight = 8/13 ≈ 0.62)
- D14 | Rain | Mild | High | Strong | No (weight = 5/13 ≈ 0.38)

Gain(Wind) = I(9/14, 5/14) − Remainder(Wind)

Remainder(Wind) = (8.62/14) · I(6/8.62, 2.62/8.62) + (5.38/14) · I(3/5.38, 2.38/5.38)

Gain(Wind) ≈ 0.940 − 0.926 = 0.014
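The weighted-gain arithmetic can be checked directly; this sketch assumes D14's fractional weights of 8/13 (Weak) and 5/13 (Strong):

```python
from math import log2

def info(*ps):
    """I(p1, p2, ...): entropy of a class distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Weighted Yes/No counts under Wind, with D14 split 8/13 Weak and 5/13 Strong
weak_yes, weak_no = 6.0, 2.0 + 8 / 13
strong_yes, strong_no = 3.0, 2.0 + 5 / 13
weak, strong = weak_yes + weak_no, strong_yes + strong_no
total = weak + strong  # all 14 instances

remainder = (weak / total) * info(weak_yes / weak, weak_no / weak) \
          + (strong / total) * info(strong_yes / strong, strong_no / strong)
gain = info(9 / 14, 5 / 14) - remainder  # about 0.014
```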

- Working with Honeywell maintenance data
- Database contains 4,383 records, each with 82 features
- No instance is complete
- Only 41 variables have values for more than 50% of instances

**Methods**

- AutoClass
- Cluster existing data
- AutoClass produces probability distribution of features given classes
- Replace missing value with value of highest probability for the cluster
- AutoClass also ranks variables by influence (useful for feature selection)

```
CLASS 0 - weight 51   normalized weight 0.249   relative strength 1.64e-10

 7 33 R SNcn Log length ......... 1.160 ( 5.25e+00 3.41e-02) 2.64e+00 ( 5.16e+00 7.05e-02)
10 36 R SNcn Log curb-weight .... 1.153 ( 8.08e+00 9.92e-02) 2.57e+00 ( 7.83e+00 1.97e-01)
15  2 D SM   make ............... 0.955
      mazda .............. -4.41e+00 1.01e-03 8.27e-02
      mitsubishi ......... -4.28e+00 8.75e-04 6.33e-02
      honda .............. -4.28e+00 8.75e-04 6.33e-02
      subaru ............. -4.20e+00 8.75e-04 5.85e-02
      volkswagen ......... -4.20e+00 8.78e-04 5.85e-02
      dodge .............. -3.92e+00 8.75e-04 4.39e-02
      plymouth ........... -3.67e+00 8.75e-04 3.42e-02
      isuzu .............. -3.11e+00 8.75e-04 1.96e-02
      chevrolet .......... -2.83e+00 8.75e-04 1.48e-02
      alfa-romero ........ -2.83e+00 8.75e-04 1.48e-02
      renault ............ -2.43e+00 8.75e-04 9.93e-03
      mercury ............ -1.74e+00 8.90e-04 5.08e-03
      volvo ..............  1.38e+00 2.13e-01 5.36e-02
      peugot .............  1.38e+00 2.13e-01 5.36e-02
      jaguar .............  1.38e+00 5.86e-02 1.48e-02
      ...
```

- C4.5
- Use learning system to predict value of missing variable
- Feature becomes target, instead of class value
- Variable values cannot be missing for training data

**Results**

- C4.5 yielded 22.6% error rate for target variable
- AutoClass yielded 48.7% error rate for target variable
- Using top three choices, AutoClass error rate was 18%