Consider hypotheses h_{1} and h_{2} learned by learners L_{1} and L_{2}
 How can we learn h and estimate its accuracy with limited data?
 How well does the observed accuracy of h over a limited sample estimate its accuracy over unseen data?
 If h_{1} outperforms h_{2} on the sample, will h_{1} outperform h_{2} in general?
 Same conclusion for L_{1} and L_{2}?
If
 S contains n examples, drawn independently of h and each other
Then
 error_{S}(h) follows a Binomial distribution, with
  mean: error_{D}(h)
  standard deviation: sqrt(error_{D}(h)(1 − error_{D}(h)) / n)
Approximate this by a Normal distribution with
  mean: error_{D}(h)
  standard deviation: sqrt(error_{S}(h)(1 − error_{S}(h)) / n)
If
 S contains n examples, drawn independently of h and each other
Then
 With approximately 95% probability, error_{S}(h) lies in interval
  error_{D}(h) ± 1.96 sqrt(error_{D}(h)(1 − error_{D}(h)) / n)
 equivalently, error_{D}(h) lies in interval
  error_{S}(h) ± 1.96 sqrt(error_{D}(h)(1 − error_{D}(h)) / n)
 which is approximately
  error_{S}(h) ± 1.96 sqrt(error_{S}(h)(1 − error_{S}(h)) / n)
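The approximate interval above can be computed directly; a minimal sketch in Python (z = 1.96 for the ~95% case):

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate two-sided confidence interval for error_D(h), given
    sample error error_S(h) measured over n independently drawn examples.
    z = 1.96 gives the ~95% interval."""
    margin = z * math.sqrt(error_s * (1 - error_s) / n)
    return (error_s - margin, error_s + margin)

# e.g., sample error 0.30 over 100 examples
low, high = error_confidence_interval(0.30, 100)
```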
 1. Pick parameter p to estimate
    error_{D}(h)
 2. Choose an estimator
    error_{S}(h)
 3. Determine probability distribution that governs estimator
    error_{S}(h) governed by Binomial distribution, approximated by Normal when n · error_{S}(h)(1 − error_{S}(h)) ≥ 5
 4. Find interval (L, U) such that N% of probability mass falls in the interval
    Use table of z_{N} values
Test h_{1} on sample S_{1}, test h_{2} on S_{2}
 1. Pick parameter to estimate
    d ≡ error_{D}(h_{1}) − error_{D}(h_{2})
 2. Choose an estimator
    d̂ ≡ error_{S1}(h_{1}) − error_{S2}(h_{2})
 3. Determine probability distribution that governs estimator
    d̂ approximately Normally distributed, with mean d and
    σ_{d̂} ≈ sqrt(error_{S1}(h_{1})(1 − error_{S1}(h_{1})) / n_{1} + error_{S2}(h_{2})(1 − error_{S2}(h_{2})) / n_{2})
 4. Find interval (L, U) such that N% of probability mass falls in the interval
    d̂ ± z_{N} σ_{d̂}
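Steps 2–4 can be sketched in Python, assuming the Normal approximation holds for both samples:

```python
import math

def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Approximate CI for d = error_D(h1) - error_D(h2), built from the
    estimator d_hat = error_S1(h1) - error_S2(h2); z = 1.96 gives ~95%."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return (d_hat - z * sigma, d_hat + z * sigma)

lo, hi = diff_confidence_interval(0.30, 100, 0.20, 100)
```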
P(error_{D}(h_{1}) > error_{D}(h_{2})) = ?
 Example: test h_{1} on sample S_{1} and h_{2} on sample S_{2} (say, n_{1} = n_{2} = 100), observing
    error_{S1}(h_{1}) = 0.30
    error_{S2}(h_{2}) = 0.20
    so d̂ = 0.10
 P(d > 0) = probability that d̂ does not overestimate d by more than 0.10, i.e., P(d̂ < d + 0.10)
    σ_{d̂} ≈ sqrt(0.30 · 0.70 / 100 + 0.20 · 0.80 / 100) ≈ 0.061
    0.10 / σ_{d̂} ≈ 1.64, and z_{N} = 1.64 corresponds to a one-sided confidence level of 95%
 I.e., reject null hypothesis with 0.05 level of significance
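The arithmetic of this example, as a sketch (sample sizes of 100 are assumed, consistent with σ_{d̂} ≈ 0.061):

```python
import math

e1, n1 = 0.30, 100   # error_S1(h1)
e2, n2 = 0.20, 100   # error_S2(h2)
d_hat = e1 - e2      # observed difference, 0.10
# standard deviation of d_hat under the Normal approximation
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
z = d_hat / sigma    # the z_N of a one-sided 95% interval
```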
 1. Partition data into k disjoint test sets T_{1}, T_{2}, ..., T_{k} of equal size, where this size is at least 30.
 2. For i from 1 to k, do
    δ_{i} ← error_{Ti}(h_{1}) − error_{Ti}(h_{2})
 3. Return the value δ̄, where
    δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_{i}
N% confidence interval estimate for d:
    δ̄ ± t_{N,k−1} s_{δ̄}
where
    s_{δ̄} ≡ sqrt( (1 / (k(k−1))) Σ_{i=1}^{k} (δ_{i} − δ̄)² )
Note δ_{i} approximately Normally distributed
 Good for comparing two learners, not for multiple pairs
 Determining probability of rejecting null hypothesis (learners perform equally)
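The paired-test computation above can be sketched as follows; t_value is the t_{N,k−1} constant the caller looks up in a t table:

```python
import math

def paired_t_interval(deltas, t_value):
    """N% confidence interval for d from the per-fold differences
    delta_i = error_Ti(h1) - error_Ti(h2)."""
    k = len(deltas)
    mean = sum(deltas) / k                                    # delta_bar
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return (mean - t_value * s, mean + t_value * s)

# e.g., k = 3 folds, t_{95,2} = 4.303
lo, hi = paired_t_interval([0.05, 0.10, 0.15], 4.303)
```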
What we'd like to estimate:
    E_{S ⊂ D} [ error_{D}(L_{A}(S)) − error_{D}(L_{B}(S)) ]
where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between hypotheses output by learners L_{A} and L_{B}, when trained using randomly selected training sets S drawn according to distribution D.
But, given limited data D_{0}, what is a good estimator?
 1. Partition data D_{0} into k disjoint test sets T_{1}, T_{2}, ..., T_{k} of equal size, where this size is at least 30.
 2. For i from 1 to k, do
    use T_{i} for the test set, and the remaining data for training set S_{i}
      S_{i} ← {D_{0} − T_{i}}
      h_{A} ← L_{A}(S_{i})
      h_{B} ← L_{B}(S_{i})
      δ_{i} ← error_{Ti}(h_{A}) − error_{Ti}(h_{B})
 3. Return the value δ̄, where
    δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_{i}
 4. This is an approximation (not really correct, because the training sets are not independent: they overlap)
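A sketch of the procedure, with assumed interfaces: each learner maps a training list to a hypothesis, and error_fn(h, test) returns the error rate of h on a test list:

```python
def kfold_compare(learner_a, learner_b, data, error_fn, k):
    """Return delta_bar, the mean over k folds of
    error_Ti(h_A) - error_Ti(h_B)."""
    folds = [data[i::k] for i in range(k)]   # k disjoint test sets T_i
    deltas = []
    for i in range(k):
        test = folds[i]
        # remaining data forms the training set S_i
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_a = learner_a(train)               # h_A <- L_A(S_i)
        h_b = learner_b(train)               # h_B <- L_B(S_i)
        deltas.append(error_fn(h_a, test) - error_fn(h_b, test))
    return sum(deltas) / k
```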
 Useful when comparing a large number of learning systems
 Many pairwise comparisons to make
 Is the set of significance values significant?
 Let j = number of groups
 Let k = number of trials per group
F = (variance between group means) / (mean variance within groups)
Increased F leads to decreased P(means are equal)
 Degrees of freedom for numerator = j − 1
 Degrees of freedom for denominator = j(k − 1)
 Look up value in table
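A minimal F-statistic sketch for j groups of k trials each (one-way ANOVA with equal group sizes):

```python
def f_statistic(groups):
    """F = (between-group mean square) / (within-group mean square),
    with j - 1 and j(k - 1) degrees of freedom respectively."""
    j, k = len(groups), len(groups[0])
    means = [sum(g) / k for g in groups]
    grand = sum(means) / j
    msb = k * sum((m - grand) ** 2 for m in means) / (j - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (j * (k - 1))
    return msb / msw
```

The resulting F is then compared against a table value for the two degrees of freedom.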
Method 1: Learn decision tree, convert to rules
Method 2: Sequential covering algorithm:
 1. Learn one rule with high accuracy, any coverage
 2. Remove positive examples covered by this rule
 3. Repeat
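The three steps can be sketched as follows; for simplicity this assumes rules are drawn from a fixed candidate pool and accepted only if they cover no negatives, rather than the guided search a real LEARN-ONE-RULE performs:

```python
def sequential_covering(positives, negatives, candidate_rules):
    """Greedily pick rules (predicates over examples) that cover no
    negatives, removing covered positives until no rule helps."""
    learned, remaining = [], list(positives)
    while remaining:
        # keep only high-accuracy rules: those covering no negatives
        accurate = [r for r in candidate_rules if not any(map(r, negatives))]
        # among those, pick the rule covering the most remaining positives
        best = max(accurate, key=lambda r: sum(map(r, remaining)), default=None)
        if best is None or not any(map(best, remaining)):
            break
        learned.append(best)
        remaining = [e for e in remaining if not best(e)]
    return learned
```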
SEQUENTIAL-COVERING( )
LEARN-ONE-RULE
 1. May use beam search
 2. Easily generalizes to multi-valued target functions
 3. Choose evaluation function to guide search:
     Entropy (i.e., information gain)
     Sample accuracy: n_{c} / n, where n_{c} = correct rule predictions, n = all predictions
     m-estimate: (n_{c} + mp) / (n + m), where p is the prior probability of the predicted class and m weights the prior
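The two accuracy-style evaluation functions, sketched (p is the prior probability of the rule's predicted class, m the equivalent-sample weight):

```python
def sample_accuracy(n_c, n):
    """Fraction of the rule's predictions that are correct."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """m-estimate of accuracy: blends sample accuracy with prior p,
    weighted as if p were observed over m virtual examples."""
    return (n_c + m * p) / (n + m)
```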
Sequential or simultaneous covering of data?
General → specific, or specific → general?
Generate-and-test, or example-driven?
Whether and how to post-prune?
What statistical evaluation function?
Why do that?
 Can learn sets of rules such as
    Ancestor(x, y) ← Parent(x, y)
    Ancestor(x, y) ← Parent(x, z) ∧ Ancestor(z, y)
 General-purpose programming language PROLOG: programs are sets of such rules
[Slattery, 1997]
course(A) ←
    has-word(A, instructor),
    not has-word(A, good),
    link-from(A, B),
    has-word(B, assign),
    not link-from(B, C)
Train: 31/31, Test: 31/34
FOIL( )
Learning rule: P(x_{1}, x_{2}, ..., x_{k}) ← L_{1} ... L_{n}
Candidate specializations add new literal of form:
 Q(v_{1}, ..., v_{r}), where Q is a predicate name and the v_{i} are new or existing variables; at least one of the v_{i} in the created literal must already exist as a variable in the rule
 Equal(x_{j}, x_{k}), where x_{j} and x_{k} are variables already present in the rule
 The negation of either of the above forms of literals
Foil_Gain(L, R) ≡ t ( log_{2}(p_{1} / (p_{1} + n_{1})) − log_{2}(p_{0} / (p_{0} + n_{0})) )
Where
 L is the candidate literal to add to rule R
 p_{0} = number of positive bindings of R
 n_{0} = number of negative bindings of R
 p_{1} = number of positive bindings of R + L
 n_{1} = number of negative bindings of R + L
 t is the number of positive bindings of R also covered by R + L
Note
 −log_{2}(p_{0} / (p_{0} + n_{0})) is the optimal number of bits to indicate the class of a positive binding covered by R
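Foil_Gain from the definitions above, as a sketch:

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """t * (log2 p1/(p1+n1) - log2 p0/(p0+n0)): the reduction, summed over
    the t positive bindings kept, in bits needed to signal a positive binding."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```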
Instances:
 pairs of nodes, with the graph described by literals LinkedTo(0,1), LinkedTo(0,8), etc.
Target function:
 CanReach(x, y), true iff there is a directed path from x to y
Hypothesis space:
 Each h is a set of Horn clauses using predicates LinkedTo (and CanReach)