Review of Data Mining Algorithms


Algorithm        | Representation                             | Usage                                                                        | Advantages                                          | Disadvantages
Association Rule | propositional if-then rules                | discover correlations                                                        | simple; no oracle needed                            | slow; limited representation; no prediction
Decision Tree    | disjunction of propositional conjunctions  | prediction                                                                   | robust prediction; symbolic rules                   | overfitting; limited representation
Neural Network   | nonlinear numeric f(x)                     | prediction                                                                   | handles numeric and noisy data                      | overfitting; slow; output difficult to interpret
Naive Bayes      | estimated probabilities                    | prediction                                                                   | handles noisy data; provable answer                 | not effective in practice
Belief Network   | network and CPTs                           | prediction; variable influence                                               | uses prior information; probability distribution    | slow; structure must be learned
Nearest Neighbor | stored instances                           | prediction                                                                   | no training; no fixed bias; no loss of information  | slow queries; redundant attributes; no bias
Clustering       | clusters                                   | discover correlations; discover variable influence; find similar instances   | no oracle needed; finds similar features            | difficult to interpret output

Data Mining Applications


Data Preparation


KDD Process


The Problem


Data Preparation


Discretization of Continuous Features


A discretization algorithm converts continuous features into discrete features

\psfig{figure=figures/disc1.ps}

Motivation


1. Some algorithms are limited to discrete inputs.
2. Many algorithms discretize as part of the learning algorithm, perhaps not in the best manner.
3. Continuous features drastically slow learning.
4. The data can be viewed more easily.

Types of discretization


Discretization can be classified along two dimensions:

1. Supervised vs. unsupervised (supervised discretization uses class information)
2. Global (mesh) vs. local

\psfig{figure=figures/disc2.ps}

One can apply some discretization methods either globally or locally

Single Feature Discretization


Here we consider only global discretization of single features.

1. It limits the scope of the problem: discretizing multiple input features jointly is as hard as the entire induction problem.
2. The resulting discretization is easy to interpret.

Equal Interval Width (Binning)


1. Given k bins, divide the training-set range into k equal-size bins (see the sketch after this list).

\psfig{figure=figures/disc3.ps}

2. Problems:
(a) It is unsupervised.
(b) Where does k come from?
(c) It is sensitive to outliers.

\psfig{figure=figures/disc4.ps}
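As a concrete illustration (not from the original notes), here is a minimal equal-width binning sketch; the function name and the use of NumPy are my own choices.

import numpy as np

def equal_width_bins(values, k):
    # Discretize a 1-D array into k equal-width bins spanning the training range.
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Interior edges only, so bin indices run from 1 to k.
    return np.digitize(values, edges[1:-1]) + 1, edges

# A single outlier stretches the range and crowds most values into one bin.
codes, edges = equal_width_bins([1, 2, 2, 3, 3, 4, 100], k=4)
print(edges)   # [  1.    25.75  50.5   75.25 100.  ]
print(codes)   # [1 1 1 1 1 1 4]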

Equal Frequency


A possible local unsupervised method: k-means clustering
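For comparison, a minimal sketch of equal-frequency binning, which places roughly the same number of training values in each of the k bins; again the helper name and the use of NumPy are assumptions for illustration.

import numpy as np

def equal_frequency_bins(values, k):
    # Cut points at the 1/k, 2/k, ... quantiles, so each bin holds about n/k training values.
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, [i / k for i in range(1, k)])
    return np.digitize(values, cuts) + 1, cuts

codes, cuts = equal_frequency_bins([1, 2, 2, 3, 3, 4, 100], k=4)
print(cuts)    # the outlier no longer dominates the bin boundaries
print(codes)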

OneR


\psfig{figure=figures/disc5.ps}

Minimal Entropy Partitioning


How many partitions? Catlett's D2 algorithm and the MDL approach of Fayyad and Irani offer answers.

This method is supervised.
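A sketch of the core step only: choosing the single cut point that minimizes the weighted class entropy of the two resulting partitions. The full method of Fayyad and Irani applies this recursively and uses an MDL test to decide when to stop; that stopping test is omitted here, and all names are mine.

import math
from collections import Counter

def entropy(labels):
    # Class entropy in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    # Return the cut point minimizing the weighted class entropy of the two sides.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # can only cut between distinct feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if e < best_e:
            best_cut, best_e = (pairs[i - 1][0] + pairs[i][0]) / 2, e
    return best_cut, best_e

# Toy example with a clean class boundary between 3 and 4.
cut, e = best_entropy_cut([1, 2, 3, 4, 5, 6], ["no", "no", "no", "yes", "yes", "yes"])
print(cut, e)  # 3.5 0.0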

Comparison 1


The Wrong Way to Experiment


1. Discretize the entire data file.
2. Run 10-fold cross-validation.
3. Report accuracy with and without discretization.
4. Why is this bad?
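The problem is information leakage: the discretization cut points are computed from the test folds as well as the training folds, so (especially with supervised methods) the test data influences the preprocessing. A leakage-free sketch of the protocol, assuming scikit-learn's KBinsDiscretizer, DecisionTreeClassifier, and the breast-cancer dataset purely as stand-ins for the methods and data in the notes:

import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # Fit the discretizer on the training fold ONLY, then apply it to both folds.
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
    X_train = disc.fit_transform(X[train_idx])
    X_test = disc.transform(X[test_idx])
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y[train_idx])
    accs.append(clf.score(X_test, y[test_idx]))
print(np.mean(accs))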

Results


\psfig{figure=figures/disc6.ps}


There are large differences in the number of intervals. Here are the results on the diabetes dataset.


Method    Accuracy (%)   Intervals per attribute
Entropy   76.04          2 4 1 2 3 2 2 3
1R        72.40          6 13 4 6 18 25 41 12
Binning   73.44          8 14 11 11 15 16 18 11

C4.5-Discretization


Sampling


Feature Selection


Selecting a subset of features to give to the data mining algorithm.

Motivations

1. Improve accuracy: many algorithms degrade in performance when given too many features.
2. Improve comprehensibility.
3. Reduce cost and complexity.
4. Investigate the features with respect to the classification task.
5. Scale up to datasets with a large number of features.

Example


Credit Approval Database

Feature Selection Criteria


Optimal Features


Given a dataset D, a set of features S, and an induction algorithm I:

The optimal feature subset S* is the subset of features that yields the highest-accuracy classifier:


\begin{displaymath}S^* \;=\; \mathop{\mathrm{arg\,max}}_{S' \subseteq S} \; \mathrm{acc}(I(D_{S'})),\end{displaymath}

where I(D_{S'}) is the classifier built by I from the dataset D using only the features in S'.
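A brute-force sketch of this definition, using scikit-learn's iris data as a stand-in for D, a decision tree as a stand-in for I, and cross-validated accuracy as acc; none of these choices come from the notes, and real wrappers use greedy search because the subset space is exponential.

from itertools import combinations

from sklearn.datasets import load_iris                  # stand-in for D
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier         # stand-in for I

X, y = load_iris(return_X_y=True)
features = list(range(X.shape[1]))                      # S

def acc(subset):
    # Estimated accuracy of I trained on D restricted to the given features.
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, list(subset)], y, cv=5).mean()

# Exhaustive search over all non-empty subsets S' of S.
best = max((s for r in range(1, len(features) + 1)
            for s in combinations(features, r)), key=acc)
print(best, acc(best))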

The Filter Approach


\psfig{figure=figures/fs8.ps}

Focus


Mutual Information


Relief


Pseudocode


set all weights W[A] = 0
for i = 1 to m do
   randomly select instance R
   find nearest hit H and nearest miss M
   for A = 1 to AllAttributes do
      W[A] = W[A] - diff(A,R,H)/m + diff(A,R,M)/m

Here diff(Attribute, Instance1, Instance2) calculates the difference between the values of Attribute for the two instances.
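A runnable sketch of the pseudocode above for numeric attributes scaled to [0, 1], so that diff is just the absolute attribute difference; the function and variable names are my own.

import numpy as np

def relief(X, y, m, seed=0):
    # Relief weights for two-class data with numeric attributes scaled to [0, 1],
    # following the pseudocode above.
    rng = np.random.default_rng(seed)
    n, n_attrs = X.shape
    W = np.zeros(n_attrs)                             # set all weights W[A] = 0
    for _ in range(m):
        r = rng.integers(n)                           # randomly select instance R
        dists = np.abs(X - X[r]).sum(axis=1)          # Manhattan distance to R
        dists[r] = np.inf                             # never pick R itself
        same = (y == y[r])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest hit H
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest miss M
        # W[A] = W[A] - diff(A,R,H)/m + diff(A,R,M)/m for every attribute at once
        W += (-np.abs(X[r] - X[hit]) + np.abs(X[r] - X[miss])) / m
    return W

# Toy data: the first attribute determines the class, the second is noise.
X = np.array([[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])
print(relief(X, y, m=20))    # the first weight comes out clearly larger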

Feature Filtering Using Decision Trees


Filter Approaches


\psfig{figure=figures/fs9.ps}

The Wrapper Approach


\psfig{figure=figures/fs2.ps}

Experimental Results


Problems


Lakshminarayan et al. for feature selection


In the Spotlight


IBM Advanced Scout

http://www.research.ibm.com/scout

Dealing with Missing Data


Single Imputation


Multiple Imputation


Multiple Imputation Using C4.5


Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High ? No

D14 now becomes
D14 Rain Mild High Weak No (weight = 8/13 = 0.62)
D14 Rain Mild High Strong No (weight = 5/13 = 0.38)


Gain(Wind) = I(9/14, 5/14) - Remainder(Wind)

\begin{displaymath}Remainder(Wind) \;=\;
\frac{p_{weak} + n_{weak}}{p+n}\,
I\left(\frac{p_{weak}}{p_{weak} + n_{weak}}, \frac{n_{weak}}{p_{weak} + n_{weak}}\right)
\;+\;
\frac{p_{strong} + n_{strong}}{p+n}\,
I\left(\frac{p_{strong}}{p_{strong} + n_{strong}}, \frac{n_{strong}}{p_{strong} + n_{strong}}\right)\end{displaymath}

\begin{displaymath}=\; \frac{8.62}{14}\,I\left(\frac{6}{8.62}, \frac{2.62}{8.62}\right) \;+\;
\frac{5.38}{14}\,I\left(\frac{3}{5.38}, \frac{2.38}{5.38}\right)\end{displaymath}
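A small script to check this arithmetic, treating D14 as the two fractionally weighted copies above; the helper names are mine.

from math import log2

def I(*ps):
    # Information content of a class distribution given as probabilities.
    return -sum(p * log2(p) for p in ps if p > 0)

# (weight, wind, plays) for D1..D13 plus the two weighted copies of D14.
examples = ([(1.0, w, p) for w, p in [
    ("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "Yes"),
    ("Weak", "Yes"), ("Strong", "No"), ("Strong", "Yes"), ("Weak", "No"),
    ("Weak", "Yes"), ("Weak", "Yes"), ("Strong", "Yes"), ("Strong", "Yes"),
    ("Weak", "Yes")]]
    + [(8 / 13, "Weak", "No"), (5 / 13, "Strong", "No")])

total = sum(w for w, _, _ in examples)                        # 14
remainder = 0.0
for value in ("Weak", "Strong"):
    group = [(w, p) for w, v, p in examples if v == value]
    n = sum(w for w, _ in group)                              # 8.62 or 5.38
    pos = sum(w for w, p in group if p == "Yes")
    remainder += (n / total) * I(pos / n, (n - pos) / n)

print(I(9 / 14, 5 / 14) - remainder)                          # Gain(Wind)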

Lakshminarayan et al.


Methods

Results