# Evaluating Hypotheses

Consider hypotheses H1 and H2 learned by learners L1 and L2

• How to learn H and estimate accuracy with limited data?
• How well does observed accuracy of H over limited sample estimate accuracy over unseen data?
• If H1 outperforms H2 on sample, will H1 outperform H2 in general?
• Same conclusion for L1 and L2?

# Confidence Intervals

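If

• S contains n examples, drawn independently of h and each other
• $n \ge 30$

Then
• With approximately 95% probability, $error_D(h)$ lies in interval

$$error_S(h) \pm 1.96\sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$$

• With approximately N% probability, $error_D(h)$ lies in interval

$$error_S(h) \pm z_N\sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$$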

where

| N%    | 50%  | 68%  | 80%  | 90%  | 95%  | 98%  | 99%  |
|-------|------|------|------|------|------|------|------|
| $z_N$ | 0.67 | 1.00 | 1.28 | 1.64 | 1.96 | 2.33 | 2.58 |

Example: n = 40, r = 12 misclassified examples, $error_S(h) = 12/40 = 0.30$; the 95% confidence interval is $0.30 \pm 0.14$
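The interval in the example can be checked numerically. The following is a minimal sketch, assuming the Normal approximation to the Binomial; the function name `error_confidence_interval` is illustrative, and $z_N$ is computed from the standard Normal distribution rather than read from the table.

```python
import math
from statistics import NormalDist

def error_confidence_interval(r, n, confidence=0.95):
    """Two-sided interval for error_D(h), given r errors on n test examples."""
    error_s = r / n
    # z_N: two-sided critical value of the standard Normal distribution
    z_n = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

low, high = error_confidence_interval(r=12, n=40)
print(f"{low:.2f} .. {high:.2f}")  # prints "0.16 .. 0.44"
```

This matches $0.30 \pm 0.14$ for n = 40, r = 12.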

# Two-Sided and One-Sided Bounds

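• If $L \le error_D(h) \le U$ with confidence $100(1 - \alpha)\%$
• Then $error_D(h) \le U$ with confidence $100(1 - \alpha/2)\%$
and
$error_D(h) \ge L$ with confidence $100(1 - \alpha/2)\%$
• Example: n = 40, r = 12
• Two-sided, 95% confidence ($\alpha = 0.05$): $0.16 \le error_D(h) \le 0.44$

• One-sided: $error_D(h) \le 0.44$ with 97.5% confidence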
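A one-sided bound can be computed directly from a one-sided critical value. This is a rough sketch under the Normal approximation; `one_sided_upper_bound` is an illustrative name, not part of the slides.

```python
import math
from statistics import NormalDist

def one_sided_upper_bound(r, n, confidence=0.975):
    """Upper bound U such that error_D(h) <= U with the given confidence."""
    error_s = r / n
    # One-sided critical value: all of the excluded probability sits in one tail
    z = NormalDist().inv_cdf(confidence)
    return error_s + z * math.sqrt(error_s * (1 - error_s) / n)

upper = one_sided_upper_bound(r=12, n=40)
print(f"error_D(h) <= {upper:.2f} with 97.5% confidence")
```

For a symmetric distribution, the one-sided 97.5% bound coincides with the upper end of the two-sided 95% interval.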

# Normal Distribution Approximates Binomial

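$error_S(h)$ follows a Binomial distribution, with

• mean $\mu_{error_S(h)} = error_D(h)$

• standard deviation $\sigma_{error_S(h)} = \sqrt{\frac{error_D(h)(1 - error_D(h))}{n}}$

Approximate this by a Normal distribution with

• mean $\mu_{error_S(h)} = error_D(h)$

• standard deviation $\sigma_{error_S(h)} \approx \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$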
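One way to see how good the approximation is: compute the exact Binomial probability that $error_S(h)$ falls within 1.96 standard deviations of $error_D(h)$ and compare it with the nominal 95%. The sketch below assumes a hypothetical true error of 0.3 and n = 40; both values are illustrative only.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r errors in n independent trials with error rate p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

def coverage_of_normal_interval(n, p, z=1.96):
    """Exact Binomial probability that r/n lies within z standard deviations of p."""
    sigma = math.sqrt(p * (1 - p) / n)
    return sum(
        binomial_pmf(r, n, p)
        for r in range(n + 1)
        if abs(r / n - p) <= z * sigma
    )

coverage = coverage_of_normal_interval(n=40, p=0.3)
print(f"exact coverage: {coverage:.3f} (nominal 0.95)")
```

Because the Binomial is discrete, the exact coverage only approximates the nominal level; the gap tends to shrink as n grows.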

# Confidence Intervals, More Correctly

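If

• S contains n examples, drawn independently of h and each other
• $n \ge 30$

Then
• With approximately 95% probability, $error_S(h)$ lies in interval

$$error_D(h) \pm 1.96\sqrt{\frac{error_D(h)(1 - error_D(h))}{n}}$$

equivalently, $error_D(h)$ lies in interval

$$error_S(h) \pm 1.96\sqrt{\frac{error_D(h)(1 - error_D(h))}{n}}$$

which is approximately

$$error_S(h) \pm 1.96\sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$$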

# Calculating Confidence Intervals

1. Pick parameter p to estimate
• $error_D(h)$

2. Choose an estimator
• $error_S(h)$

3. Determine probability distribution that governs estimator
• $error_S(h)$ governed by Binomial distribution, approximated by Normal when $n \ge 30$

4. Find interval (L, U) such that N% of probability mass falls in the interval
• Use table of $z_N$ values

# Difference Between Hypotheses

Test h1 on sample S1, test h2 on S2

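1. Pick parameter to estimate

$$d \equiv error_D(h_1) - error_D(h_2)$$

2. Choose an estimator

$$\hat{d} \equiv error_{S_1}(h_1) - error_{S_2}(h_2)$$

3. Determine probability distribution that governs estimator

$$\sigma_{\hat{d}} \approx \sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$$

4. Find interval (L, U) such that N% of probability mass falls in the interval

$$\hat{d} \pm z_N\sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$$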
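The four steps can be sketched in a few lines. This assumes the difference estimator is approximately Normal with variance equal to the sum of the two sample-error variances; the sample errors and sizes in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Two-sided interval for d = error_D(h1) - error_D(h2)."""
    d_hat = e1 - e2  # estimator of d
    # Variances of the two independent sample errors add
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z_n = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d_hat - z_n * sigma, d_hat + z_n * sigma

# Hypothetical inputs: h1 tested on 100 examples, h2 on another 100
low, high = difference_interval(e1=0.30, n1=100, e2=0.20, n2=100)
print(f"{low:.3f} .. {high:.3f}")
```

With these inputs the interval includes 0, so the two-sided 95% interval alone does not separate the hypotheses.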

# Hypothesis Testing

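$P(error_D(h_1) > error_D(h_2)) = ?$

• Example, with $|S_1| = |S_2| = 100$
$error_{S_1}(h_1) = 0.30$
$error_{S_2}(h_2) = 0.20$
$\hat{d} = 0.10$, $\sigma_{\hat{d}} \approx 0.061$
• $P(\hat{d} < d + 0.10)$ = probability $\hat{d}$ does not overestimate d by more than 0.10
$z_N \cdot \sigma_{\hat{d}} = 0.10$, so $z_N = 1.64$
• $z_N = 1.64$ is the 90% two-sided value, so the one-sided confidence that $d > 0$ is about 95%
• I.e., reject null hypothesis with 0.05 level of significance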
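The probability in this example can be computed rather than read from the table. This sketch assumes 100 test examples per hypothesis (a sample size consistent with $z_N = 1.64$) and a Normal distribution for the difference estimator; `prob_h1_worse` is an illustrative name.

```python
import math
from statistics import NormalDist

def prob_h1_worse(e1, n1, e2, n2):
    """Approximate confidence that error_D(h1) > error_D(h2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    # d > 0 exactly when d_hat does not overestimate d by more than d_hat,
    # i.e. when the estimate is less than d_hat / sigma deviations above d
    return NormalDist().cdf(d_hat / sigma)

p = prob_h1_worse(e1=0.30, n1=100, e2=0.20, n2=100)
print(f"confidence that h1 has higher true error: {p:.2f}")  # prints 0.95
```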

# Paired t test to compare hA,hB

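1. Partition data into k disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, do

$$\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$$

3. Return the value $\bar{\delta}$, where

$$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$$

N% confidence interval estimate for d:

$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}} \qquad s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$$

Note $\delta_i$ approximately Normally distributed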

• Good for comparing two learners, not for multiple pairs
• Determining probability of rejecting null hypothesis (learners perform equally)
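A minimal sketch of the paired t interval follows. It assumes the per-test-set errors are already measured and that the critical value $t_{N,k-1}$ is taken from a t table (2.776 for 95% two-sided with 4 degrees of freedom); the error values are hypothetical.

```python
import math
from statistics import mean

def paired_t_interval(errors_a, errors_b, t_value):
    """Mean difference, its estimated standard deviation, and the interval."""
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(deltas)
    delta_bar = mean(deltas)
    # s: estimated standard deviation of delta_bar, with k - 1 degrees of freedom
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar, s, (delta_bar - t_value * s, delta_bar + t_value * s)

# Hypothetical errors of hA and hB on k = 5 disjoint test sets
errors_a = [0.20, 0.25, 0.30, 0.25, 0.20]
errors_b = [0.15, 0.20, 0.20, 0.25, 0.10]
delta_bar, s, (low, high) = paired_t_interval(errors_a, errors_b, t_value=2.776)
print(f"delta_bar={delta_bar:.3f}, s={s:.4f}, interval=({low:.3f}, {high:.3f})")
```

If the interval excludes 0, the null hypothesis that the two hypotheses perform equally is rejected at the corresponding significance level.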

# Comparing learning algorithms LA and LB

What we'd like to estimate:

$$E_{S \subset \mathcal{D}}\left[error_{\mathcal{D}}(L_A(S)) - error_{\mathcal{D}}(L_B(S))\right]$$

where L(S) is the hypothesis output by learner L using training set S

i.e., the expected difference in true error between hypotheses output by learners $L_A$ and $L_B$, when trained using randomly selected training sets S drawn according to distribution $\mathcal{D}$.

But, given limited data D0, what is a good estimator?

• could partition $D_0$ into training set $S_0$ and test set $T_0$, and measure

$$error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0))$$

• even better, repeat this many times and average the results (next slide)

# Comparing learning algorithms LA and LB

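1. Partition data $D_0$ into k disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, do
use $T_i$ for the test set, and the remaining data for training set $S_i$
  - $S_i \leftarrow \{D_0 - T_i\}$
  - $h_A \leftarrow L_A(S_i)$
  - $h_B \leftarrow L_B(S_i)$
  - $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$

3. Return the value $\bar{\delta}$, where

$$\bar{\delta} \equiv \frac{1}{k}\sum_{i=1}^{k}\delta_i$$

4. This is an approximation (not really correct, because the training sets overlap and so are not independent)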
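The procedure can be sketched as a k-fold loop. Everything below is illustrative: the two toy learners (`majority_learner`, `threshold_learner`) and the synthetic one-dimensional data are assumptions made for the example, not part of the slides.

```python
import random
from statistics import mean

def error(h, examples):
    """Fraction of examples that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in examples) / len(examples)

def compare_learners(data, learner_a, learner_b, k):
    """Average of error_Ti(hA) - error_Ti(hB) over k disjoint test sets."""
    fold_size = len(data) // k
    deltas = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]             # T_i
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # S_i = D0 - T_i
        h_a, h_b = learner_a(train), learner_b(train)
        deltas.append(error(h_a, test) - error(h_b, test))
    return mean(deltas), deltas

def majority_learner(train):
    """Toy learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def threshold_learner(train):
    """Toy learner: predict x >= t for the observed t with least training error."""
    best = min((x for x, _ in train),
               key=lambda t: sum((x >= t) != y for x, y in train))
    return lambda x: x >= best

rng = random.Random(0)
data = [(x, x >= 0.5) for x in (rng.random() for _ in range(300))]  # synthetic
delta_bar, deltas = compare_learners(data, majority_learner, threshold_learner, k=10)
print(f"mean error difference (A - B): {delta_bar:.3f}")
```

With 300 examples and k = 10, each test set has 30 examples, which meets the size requirement in step 1.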

# Analysis of Variance (ANOVA)

• Useful when comparing a large number of learning systems
• Many pairwise comparisons to make
• Is the set of significance values significant?
• Let j = number of groups
• Let k = number of trials per group

Increased F leads to decreased P(means are equal)

• Degrees of Freedom for numerator = j-1
• Degrees of freedom for denominator = j(k-1)
• Look up value in table
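The slide does not spell out F. The sketch below assumes the standard one-way ANOVA definition, F = (between-group mean square) / (within-group mean square), with j groups of k trials each; the error rates are hypothetical.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for j groups of k trials each."""
    j = len(groups)
    k = len(groups[0])
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean
    ss_between = k * sum((mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual trials around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = j - 1, j * (k - 1)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical error rates: 3 learning systems, 3 trials each
groups = [[0.20, 0.22, 0.24], [0.30, 0.32, 0.34], [0.25, 0.27, 0.29]]
f, df_num, df_den = anova_f(groups)
print(f"F = {f:.2f} with ({df_num}, {df_den}) degrees of freedom")
```

The resulting F is then compared against an F table for the given degrees of freedom.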

# Learning Disjunctive Sets of Rules

Method 1: Learn decision tree, convert to rules

Method 2: Sequential covering algorithm:

1. Learn one rule with high accuracy, any coverage
2. Remove positive examples covered by this rule
3. Repeat

# Sequential Covering Algorithm

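SEQUENTIAL-COVERING(Target_attribute, Attributes, Examples, Threshold)

• Learned_rules ← {}
• Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
• while PERFORMANCE(Rule, Examples) > Threshold, do
  - Learned_rules ← Learned_rules + Rule
  - Examples ← Examples - {examples correctly classified by Rule}
  - Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
• Learned_rules ← sort Learned_rules according to PERFORMANCE over Examples
• return Learned_rules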
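A compact sketch of sequential covering on attribute-value data follows. The rule representation (a dict of attribute tests), the greedy `learn_one_rule` stand-in, and the tiny dataset are all assumptions made for illustration; the final sort here uses the full original example set.

```python
def covers(rule, x):
    """A rule is a conjunction of attribute == value tests, stored as a dict."""
    return all(x.get(a) == v for a, v in rule.items())

def performance(rule, examples):
    """Fraction of covered examples that are positive (0 if none are covered)."""
    covered = [y for x, y in examples if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(examples):
    """Hypothetical stand-in: greedily add tests while accuracy improves."""
    rule = {}
    while True:
        candidates = [{**rule, a: v}
                      for x, _ in examples for a, v in x.items() if a not in rule]
        best = max(candidates, key=lambda r: performance(r, examples), default=None)
        if best is None or performance(best, examples) <= performance(rule, examples):
            return rule
        rule = best

def sequential_covering(examples, threshold):
    all_examples = list(examples)
    learned_rules = []
    rule = learn_one_rule(examples)
    while examples and performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Remove the examples this rule classifies correctly (covered positives)
        examples = [(x, y) for x, y in examples if not (covers(rule, x) and y)]
        rule = learn_one_rule(examples)
    learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
    return learned_rules

examples = [  # hypothetical data: (attributes, is_positive)
    ({"outlook": "sunny", "wind": "weak"}, True),
    ({"outlook": "sunny", "wind": "strong"}, True),
    ({"outlook": "rain", "wind": "weak"}, True),
    ({"outlook": "rain", "wind": "strong"}, False),
    ({"outlook": "overcast", "wind": "strong"}, False),
]
rules = sequential_covering(examples, threshold=0.8)
print(rules)
```

Each accepted rule covers at least one remaining positive example, so the example set shrinks on every pass and the loop terminates.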

# Learn-One-Rule

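LEARN-ONE-RULE

• Pos ← positive Examples

• Neg ← negative Examples

• while Pos, do
Learn a NewRule
  - NewRule ← most general rule possible
  - NewRuleNeg ← Neg
  - while NewRuleNeg, do
  Add a new literal to specialize NewRule
    1. Candidate_literals ← generate candidates
    2. Best_literal ← the L in Candidate_literals that maximizes Performance(SpecializeRule(NewRule, L))
    3. add Best_literal to NewRule preconditions
    4. NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule preconditions
  - Learned_rules ← Learned_rules + NewRule
  - Pos ← Pos - {members of Pos covered by NewRule}
• Return Learned_rules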

# Subtleties: Learn One Rule

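1. May use beam search
2. Easily generalizes to multi-valued target functions
3. Choose evaluation function to guide search:
• Entropy (i.e., information gain)
• Sample accuracy:

$$\frac{n_c}{n}$$

where $n_c$ = correct rule predictions, n = all predictions
• m estimate:

$$\frac{n_c + mp}{n + m}$$

where p = prior probability of the predicted class, m = weight given to the prior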
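The three evaluation functions are easy to compare side by side. This sketch assumes the usual definitions: sample accuracy $n_c / n$, m estimate $(n_c + mp)/(n + m)$ with prior p and weight m, and entropy over the class counts of the examples a rule covers. The counts in the usage lines are hypothetical.

```python
import math

def sample_accuracy(n_c, n):
    """Fraction of the rule's predictions that are correct."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """Accuracy smoothed toward prior p, as if m extra examples had been seen."""
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    """Entropy in bits of the class distribution among covered examples."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

# Hypothetical rule: covers 8 examples and predicts 6 of them correctly
print(sample_accuracy(6, 8))        # 0.75
print(m_estimate(6, 8, p=0.5, m=2)) # 0.7
print(round(entropy([6, 2]), 3))    # 0.811
```

The m estimate pulls the score toward the prior when a rule covers few examples, which guards against rules that look perfect on a tiny sample.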

# Variants of Rule Learning Programs

• Sequential or simultaneous covering of data?
• General → specific, or specific → general?
• Generate-and-test, or example-driven?
• Whether and how to post-prune?
• What statistical evaluation function?

# Learning First Order Rules

Why do that?

• Can learn sets of rules such as

Ancestor(x, y) ← Parent(x, y)

Ancestor(x, y) ← Parent(x, z) ∧ Ancestor(z, y)

• General purpose programming language PROLOG: programs are sets of such rules

# First Order Rule for Classifying Web Pages

[Slattery, 1997]


course(A) ←

has-word(A, instructor),

Not has-word(A, good),

link-from(A, B),

has-word(B, assign),

Not link-from(B, C)

Train: 31/31, Test: 31/34


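# FOIL

FOIL(Target_predicate, Predicates, Examples)

• Pos ← positive Examples

• Neg ← negative Examples

• while Pos, do
Learn a NewRule
  - NewRule ← most general rule possible
  - NewRuleNeg ← Neg
  - while NewRuleNeg, do
  Add a new literal to specialize NewRule
    1. Candidate_literals ← generate candidates
    2. Best_literal ← the L in Candidate_literals that maximizes Foil_Gain(L, NewRule)
    3. add Best_literal to NewRule preconditions
    4. NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule preconditions
  - Learned_rules ← Learned_rules + NewRule
  - Pos ← Pos - {members of Pos covered by NewRule}
• Return Learned_rules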

# Specializing Rules in FOIL

Learning rule: $P(x_1, x_2, \ldots, x_k) \leftarrow L_1 \ldots L_n$

Candidate specializations add new literal of form:

• $Q(v_1, \ldots, v_r)$, where at least one of the $v_i$ in the created literal must already exist as a variable in the rule.
• $Equal(x_j, x_k)$, where $x_j$ and $x_k$ are variables already present in the rule
• The negation of either of the above forms of literals

# Information Gain in FOIL

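$$Foil\_Gain(L, R) \equiv t\left(\log_2\frac{p_1}{p_1 + n_1} - \log_2\frac{p_0}{p_0 + n_0}\right)$$

Where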

• L is the candidate literal to add to rule R

• p0 = number of positive bindings of R
• n0 = number of negative bindings of R
• p1 = number of positive bindings of R+L
• n1 = number of negative bindings of R+L
• t is the number of positive bindings of R also covered by R+L

Note

• $-\log_2\frac{p_0}{p_0 + n_0}$ is the optimal number of bits to indicate the class of a positive binding covered by R
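The gain can be computed directly from the five counts defined above. This is a minimal sketch assuming the usual definition, Foil_Gain(L, R) = t (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))); returning 0 when R+L covers no positive bindings is a convention chosen here, and the counts in the usage lines are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """Reduction in bits needed to encode positive bindings, weighted by t."""
    if t == 0 or p1 == 0:
        return 0.0  # assumed convention: no positive bindings survive, no gain
    bits_before = -math.log2(p0 / (p0 + n0))  # cost per positive binding under R
    bits_after = -math.log2(p1 / (p1 + n1))   # cost per positive binding under R+L
    return t * (bits_before - bits_after)

# Hypothetical counts: R has 4 positive and 4 negative bindings;
# adding L keeps 3 of the positives and only 1 negative
gain = foil_gain(p0=4, n0=4, p1=3, n1=1, t=3)
print(f"{gain:.3f}")
```

A literal scores well when it keeps many positive bindings (large t) while raising the proportion of positives among the bindings that remain.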

# FOIL Example

Instances:

• pairs of nodes ⟨x, y⟩, with graph described by literals LinkedTo(0,1), LinkedTo(0,8) etc.

Target function:

• CanReach(x,y) true iff there is a directed path from x to y

Hypothesis space:

• Each $h \in H$ is a set of Horn clauses using predicates LinkedTo (and CanReach)
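To make the hypothesis space concrete, consider one hypothesis in it, assumed here for illustration: CanReach(x,y) ← LinkedTo(x,y) and CanReach(x,y) ← LinkedTo(x,z) ∧ CanReach(z,y). The sketch below evaluates these two clauses bottom-up on a small hypothetical graph, which is not the graph from the slides.

```python
def can_reach_closure(linked_to):
    """All CanReach(x, y) facts derivable from the two Horn clauses."""
    # Clause 1: CanReach(x, y) <- LinkedTo(x, y)
    can_reach = set(linked_to)
    changed = True
    while changed:  # repeat until no new fact can be derived
        changed = False
        for x, z in linked_to:
            for z2, y in list(can_reach):
                # Clause 2: CanReach(x, y) <- LinkedTo(x, z) and CanReach(z, y)
                if z == z2 and (x, y) not in can_reach:
                    can_reach.add((x, y))
                    changed = True
    return can_reach

linked_to = {(0, 1), (1, 2), (2, 3)}  # hypothetical graph: 0 -> 1 -> 2 -> 3
facts = can_reach_closure(linked_to)
print(sorted(facts))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

The recursive second clause is what lets a finite rule set describe paths of any length, which is why first-order rules are needed for this target.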