Evaluating Hypotheses


Consider hypotheses H1 and H2 learned by learners L1 and L2

Confidence Intervals


If S contains n examples, drawn independently of h and of each other, and $n \geq 30$

Then with approximately 95% probability, $error_{\cal{D}}(h)$ lies in the interval

\begin{displaymath}error_{S}(h) \pm 1.96 \sqrt{\frac{error_{S}(h) (1 - error_{S}(h))}{n}}\end{displaymath}

Two-Sided and One-Sided Bounds


\psfig{figure=figures/g2.ps} \psfig{figure=figures/g3.ps}

Normal Distribution Approximates Binomial


$error_{S}(h)$ follows a Binomial distribution, with mean

\begin{displaymath}\mu_{error_{S}(h)} = error_{\cal{D}}(h)\end{displaymath}

and standard deviation

\begin{displaymath}\sigma_{error_{S}(h)} = \sqrt{\frac{error_{\cal{D}}(h)(1 - error_{\cal{D}}(h))}{n}}\end{displaymath}

Approximate this by a Normal distribution with the same mean and standard deviation.
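The quality of the Normal approximation can be checked numerically. A minimal sketch (the helper names `binom_cdf` and `normal_cdf` are hypothetical; standard library only), comparing the exact Binomial probability of observing at most r errors with its Normal approximation, for $error_{\cal{D}}(h) = 0.3$ and n = 40:

```python
import math

def binom_cdf(r, n, p):
    """Exact P(at most r errors in n trials) under the Binomial."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(r + 1))

def normal_cdf(x, mu, sigma):
    """Normal approximation to the same probability."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 40, 0.3                       # 40 test examples, errorD(h) = 0.3
mu = n * p                           # mean number of errors
sigma = math.sqrt(n * p * (1 - p))   # std. dev. of the number of errors
```

Evaluating both at r = 12 (with a continuity correction of 0.5 for the Normal) gives probabilities that agree to within a few hundredths at this sample size.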

Confidence Intervals, More Correctly


If S contains n examples, drawn independently of h and of each other, and $n \geq 30$

Then with approximately 95% probability, $error_{S}(h)$ lies in interval

\begin{displaymath}error_{\cal{D}}(h) \pm 1.96 \sqrt{\frac{error_{\cal{D}}(h) (1 - error_{\cal{D}}(h))}{n}}\end{displaymath}



equivalently, $error_{\cal{D}}(h)$ lies in interval

\begin{displaymath}error_{S}(h) \pm 1.96 \sqrt{\frac{error_{\cal{D}}(h) (1 - error_{\cal{D}}(h))}{n}}\end{displaymath}



which is approximately

\begin{displaymath}error_{S}(h) \pm 1.96 \sqrt{\frac{error_{S}(h) (1 - error_{S}(h))}{n}} \end{displaymath}
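This interval is a one-line computation. A minimal sketch (the helper name `error_ci` is hypothetical):

```python
import math

def error_ci(error_s, n, z=1.96):
    """Two-sided confidence interval for errorD(h), given sample error.

    error_s: observed error rate of h on n test examples (n >= 30 assumed);
    z: z-value for the desired confidence level (1.96 for 95%).
    """
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return (error_s - half_width, error_s + half_width)

# e.g., h misclassifies 12 of 40 test examples: error_s = 0.30
lo, hi = error_ci(12 / 40, 40)
```

For this example the 95% interval is roughly (0.16, 0.44), so even at 30% sample error, 40 test examples leave substantial uncertainty about the true error.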

Calculating Confidence Intervals


1.
Pick parameter p to estimate

2.
Choose an estimator

3.
Determine probability distribution that governs estimator

4.
Find interval (L, U) such that N% of probability falls in the interval

Difference Between Hypotheses


Test h1 on sample S1, test h2 on S2

1.
Pick parameter to estimate

\begin{displaymath}d \equiv error_{\cal{D}}(h_{1}) - error_{\cal{D}}(h_{2}) \end{displaymath}

2.
Choose an estimator

\begin{displaymath}\hat{d} \equiv error_{S_{1}}(h_{1}) - error_{S_{2}}(h_{2}) \end{displaymath}

3.
Determine probability distribution that governs estimator

\begin{displaymath}\sigma_{\hat{d}} \approx \sqrt{\frac{error_{S_{1}}(h_{1})(1 - error_{S_{1}}(h_{1}))}{n_{1}} + \frac{error_{S_{2}}(h_{2})(1 - error_{S_{2}}(h_{2}))}{n_{2}}}\end{displaymath}

4.
Find interval (L, U) such that N% of probability mass falls in the interval

\begin{displaymath}\hat{d} \pm z_{N} \sqrt{\frac{error_{S_{1}}(h_{1})(1 - error_{S_{1}}(h_{1}))}{n_{1}} + \frac{error_{S_{2}}(h_{2})(1 - error_{S_{2}}(h_{2}))}{n_{2}}}\end{displaymath}
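The four steps above reduce to a short computation. A minimal sketch (the helper name `diff_ci` is hypothetical):

```python
import math

def diff_ci(e1, n1, e2, n2, z=1.96):
    """CI for d = errorD(h1) - errorD(h2), where h1 has sample error e1
    on n1 test examples and h2 has sample error e2 on n2 test examples."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return (d_hat - z * sigma, d_hat + z * sigma)
```

For e1 = 0.30 and e2 = 0.20 on two samples of 100 examples each, the 95% interval is roughly (-0.02, 0.22); it contains zero, so this sample difference is not significant at the 95% level.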

Hypothesis Testing



P(errorD(h1) > errorD(h2)) = ?

Paired t test to compare hA, hB


1.
Partition data into k disjoint test sets $T_{1}, T_{2},
\ldots, T_{k}$ of equal size, where this size is at least 30.
2.
For i from 1 to k, do
$\delta_{i} \leftarrow error_{T_{i}}(h_{A}) - error_{T_{i}}(h_{B})$

3.
Return the value $\bar{\delta}$, where

\begin{displaymath}\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k} \delta_{i}\end{displaymath}

N% confidence interval estimate for d:

\begin{displaymath}\bar{\delta} \pm t_{N,k-1} \ s_{\bar{\delta}} \end{displaymath}


\begin{displaymath}s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k}(\delta_{i} -
\bar{\delta})^{2}} \end{displaymath}

Note $\delta_{i}$ approximately Normally distributed
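The procedure above can be sketched in a few lines (the helper name `paired_t_ci` is hypothetical; the t value is assumed to be looked up in a table by the caller):

```python
import math

def paired_t_ci(deltas, t_value):
    """N% confidence interval for d from per-fold differences delta_i.

    deltas: list of delta_i = error_Ti(hA) - error_Ti(hB), one per fold;
    t_value: the constant t_{N,k-1} from a t table.
    """
    k = len(deltas)
    d_bar = sum(deltas) / k                           # mean difference
    s_d_bar = math.sqrt(sum((d - d_bar) ** 2 for d in deltas)
                        / (k * (k - 1)))              # std. error of d_bar
    return (d_bar - t_value * s_d_bar, d_bar + t_value * s_d_bar)
```

E.g., for five folds with differences [0.05, 0.02, -0.01, 0.04, 0.03] and $t_{95,4} = 2.776$, the interval is roughly (-0.003, 0.055), centered on $\bar{\delta} = 0.026$.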

Comparing learning algorithms LA and LB


What we'd like to estimate:


\begin{displaymath}E_{S \subset \cal{D}}[ error_{\cal{D}}(L_{A}(S)) -
error_{\cal{D}}(L_{B}(S))] \end{displaymath}

where L(S) is the hypothesis output by learner L using training set S


i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution $\cal{D}$.





But, given limited data D0, what is a good estimator?

Comparing learning algorithms LA and LB


1.
Partition data D0 into k disjoint test sets $T_{1}, T_{2},
\ldots, T_{k}$ of equal size, where this size is at least 30.
2.
For i from 1 to k, do
use Ti for the test set, and the remaining data for training set Si
  $\mbox{$\bullet$}$
$S_{i} \leftarrow\{D_{0} - T_{i}\}$
  $\mbox{$\bullet$}$
$h_{A} \leftarrow L_{A}(S_{i})$
  $\mbox{$\bullet$}$
$h_{B} \leftarrow L_{B}(S_{i})$
  $\mbox{$\bullet$}$
$\delta_{i} \leftarrow error_{T_{i}}(h_{A}) - error_{T_{i}}(h_{B})$
3.
Return the value $\bar{\delta}$, where

\begin{displaymath}\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k} \delta_{i}\end{displaymath}

4.
This is an approximation (not strictly correct, because the training sets $S_{i}$ overlap and hence are not drawn independently)
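The procedure above can be sketched as follows (the interfaces for the learners and the error function are assumptions, not a fixed API):

```python
def compare_learners(learner_a, learner_b, error_fn, data, k):
    """k-fold paired comparison of two learning algorithms.

    learner_a, learner_b: callables mapping a training set to a hypothesis;
    error_fn(h, test): error rate of hypothesis h on a test set.
    Returns delta-bar, the mean per-fold error difference.
    """
    folds = [data[i::k] for i in range(k)]          # k disjoint test sets
    deltas = []
    for i in range(k):
        test = folds[i]                             # T_i
        train = [x for j, fold in enumerate(folds)  # S_i = D_0 - T_i
                 if j != i for x in fold]
        h_a, h_b = learner_a(train), learner_b(train)
        deltas.append(error_fn(h_a, test) - error_fn(h_b, test))
    return sum(deltas) / k
```

The per-fold `deltas` can then be fed to the paired t interval from the previous section, with the caveat noted above about overlapping training sets.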

Analysis of Variance (ANOVA)



\begin{displaymath}{\rm ANOVA:} \;\; F = \frac{MS_{between}}{MS_{within}} \end{displaymath}

The larger the F statistic, the lower the probability that the group means are equal.
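A minimal sketch of the F computation from raw samples (the helper name `anova_f` is hypothetical; a real analysis would also compare F against the F distribution's critical value):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: ratio of between-group to
    within-group mean squares, for a list of sample groups."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total sample size
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```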

Learning Disjunctive Sets of Rules


Method 1: Learn decision tree, convert to rules



Method 2: Sequential covering algorithm:

1.
Learn one rule with high accuracy, any coverage
2.
Remove positive examples covered by this rule
3.
Repeat
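The three steps above can be sketched as follows (a simplified version: `learn_one_rule` and `performance` are caller-supplied assumptions, rules are assumed to expose a `covers` method, and every covered example is removed rather than only the positives):

```python
def sequential_covering(examples, learn_one_rule, performance, threshold):
    """Greedy sequential covering: repeatedly learn one rule, keep it
    while it performs above threshold, and remove what it covers."""
    rules = []
    examples = list(examples)
    rule = learn_one_rule(examples)
    while performance(rule, examples) > threshold:
        rules.append(rule)
        # remove the examples covered by this rule, then learn the next
        examples = [e for e in examples if not rule.covers(e)]
        rule = learn_one_rule(examples)
    return rules
```

Because each rule is chosen greedily against the remaining examples, the resulting disjunctive rule set is not guaranteed to be the smallest one that covers the data.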

Sequential Covering Algorithm


SEQUENTIAL-COVERING( $Target\_attribute, Attributes, Examples, Threshold$)

Learn-One-Rule


\psfig{figure=figures/learn-one-rule.ps}


LEARN-ONE-RULE

Subtleties: Learn One Rule


1.
May use beam search
2.
Easily generalizes to multi-valued target functions
3.
Choose evaluation function to guide search: e.g., entropy, relative frequency of correct predictions, or an m-estimate of rule accuracy

Variants of Rule Learning Programs


Learning First Order Rules


Why do that?

First Order Rule for Classifying Web Pages


[Slattery, 1997]


course(A) $\leftarrow$

has-word(A, instructor),
$\neg$ has-word(A, good),
link-from(A, B),
has-word(B, assign),
$\neg$ link-from(B, C)

Train: 31/31, Test: 31/34

FOIL( $Target\_predicate, Predicates, Examples$)

Specializing Rules in FOIL


Learning rule: $P(x_{1},x_{2}, \ldots, x_{k}) \leftarrow L_{1} \ldots L_{n}$

Candidate specializations add new literal of form $Q(v_{1}, \ldots, v_{r})$, where at least one $v_{i}$ already appears in the rule; $Equal(x_{j}, x_{k})$; or the negation of either of the above forms.

Information Gain in FOIL



\begin{displaymath}Foil\_Gain(L,R) \equiv t \left( \log_{2}\frac{p_{1}}{p_{1}+n_{1}} -
\log_{2}\frac{p_{0}}{p_{0}+n_{0}} \right) \end{displaymath}

Where $L$ is the candidate literal to add to rule $R$; $p_{0}$ and $n_{0}$ are the numbers of positive and negative bindings of $R$; $p_{1}$ and $n_{1}$ are the numbers of positive and negative bindings of $R + L$; and $t$ is the number of positive bindings of $R$ also covered by $R + L$.

Note $-\log_{2}\frac{p_{0}}{p_{0}+n_{0}}$ is the number of bits needed to encode the classification of a positive binding of $R$, so $Foil\_Gain(L,R)$ measures the reduction, due to $L$, in the total number of bits needed to encode the classification of the positive bindings.
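Given the binding counts, the gain is a one-line computation (the helper name `foil_gain` is hypothetical):

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain for adding literal L to rule R, from binding counts:
    p0/n0 for R, p1/n1 for R + L, t positive bindings retained."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```

For example, a literal that eliminates all 4 negative bindings of a rule while keeping all 4 positive bindings gains one bit per retained positive binding: `foil_gain(4, 4, 4, 0, 4)` returns 4.0.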

FOIL Example


\psfig{figure=figures/foil.ps}


Instances:

Target function:

Hypothesis space:

In the Spotlight


