Russell and Norvig, Chapter 18: Learning from Observations

18.1 A General Model of Learning Agents [Fig18.1, p526]
  - components
    - performance element
      - selects external actions
      - most of our efforts have been focussed here
    - learning element
      - makes improvements to the performance element
    - critic
      - provides performance feedback from the environment to the learning element
    - problem generator
      - suggests actions leading to informative experiences
  - design of learning element affected by
    - components of performance element being improved
      - direct mapping, logical reasoning, decision-theoretic, goal-based
    - representation used for those components
      - functions, logic, belief networks, operators
    - available feedback
      - supervised, reinforcement (reward), unsupervised
    - prior information available
  - all learning can be seen as learning the representation of a function

18.2 Inductive Learning
  - given a set of examples (x,f(x)), find a hypothesis h approximating f
  - bias is any preference among hypotheses other than consistency with examples
    - e.g., Ockham's Razor
      - most likely hypothesis is the simplest one consistent with all observations
  - reflex agent with learning [Fig18.3, p530]
  - types
    - decision trees
    - version space
    - neural networks
    - belief networks
  - expressiveness vs efficiency
  - measuring performance
    - training/testing examples
    - learning curve: average error/accuracy vs training set size

18.3 Learning Decision Trees
  - sample decision tree [Fig18.4, p533]
  - expressiveness
    - propositional logic (no more than one object)
  - examples [Fig18.5, p534]
  - tree construction [Fig18.6, p536]
    - attribute selection
  - algorithm [Fig18.7, p537]
    - stopping conditions
    - choose_attribute(attrs,egs) ?
  - resulting tree [Fig18.8, p537]
    - simpler than first tree
    - unexpected trend with Thai food
  - learning curve [Fig18.9, p539]
  - applications

18.4 Using Information Theory
  - implementing the choose_attribute function
  - information
    - at a node in the tree, how many bits needed to classify an example
    - I(P(c_1),...,P(c_n)) = sum(i=1,n) - P(c_i) lg P(c_i) bits
    - I(1/2,1/2) = 1 bit
    - I(0.01,0.99) = 0.08 bits
  - before split (p positive and n negative examples at the node)
    - I(p/(p+n), n/(p+n))
  - after split on attribute A with values vi (i=1,v)
    - pi, ni = number of positive and negative examples with A=vi
    - remainder(A) = sum(i=1,v) (pi+ni)/(p+n) * I(pi/(pi+ni), ni/(pi+ni))
  - gain(A) = I(p/(p+n), n/(p+n)) - remainder(A)
    - choose attribute with greatest gain
  - noise and overfitting
    - chi-square pre-pruning
    - reduced-error post-pruning
  - missing data
    - assign weighted values based on frequencies
  - many-valued attributes
    - artificially inflate gain
    - gain ratio
      - divide gain by information in example splits
      - e.g., attribute A splits 10 egs into subsets of 3, 3 and 4
        - denominator is I(3/10,3/10,4/10)
  - continuous-valued attributes
    - discretize
    - allow split attributes of the form (value <= threshold)
      - threshold actually appears in data

18.5 Learning General Logical Descriptions
  - hypothesis space
    - size (e.g., decision trees)
  - errors
    - false negative
    - false positive
  - generalization
    - cover more examples
    - e.g., dropping condition
  - specialization
    - cover fewer examples
    - e.g., adding condition
  - current_best_learning(examples) [Fig18.11, p547]
    - generalize h for false negative
    - specialize h for false positive
    - maintain consistency with all examples (or fail)
    - backtracking is a killer
  - least-commitment search
    - maintain set of all consistent hypotheses (version space)
    - version space learning (candidate elimination alg) [Fig18.12, p549]
    - still, hypothesis space too big
      - use interval [MostGeneralHypo,MostSpecificHypo]
      - still some possible size problems

18.6 Why Learning Works: Computational Learning Theory
  - given
    - m training examples sampled from distribution D
  - find hypothesis h from the set of hypotheses H such that
    - [P(error(h) <= epsilon) >= (1 - delta)] for a test set sampled from D
  - need m >= [(1/epsilon) * ( ln (1/delta) + ln |H|)] examples
    - sample complexity
    - probably-approximately correct (PAC) learning
  - |H| = O(2^(2^n)) for propositional concepts
    - n = |attributes|
    - truth-table argument
    - still need exponential number of examples
  - restrictions on H allow sample complexity polynomial in n, 1/epsilon, 1/delta
    - decision lists with at most k literals per condition
      - m = O(n^k lg(n^k)) for fixed epsilon and delta, where n = |attributes|
    - decision list learning [Fig18.17, p556]
      - choose smallest test covering examples of one class
    - comparison to decision tree [Fig18.18, p557]
  - these are worst-case results
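
Worked example for 18.6 (sample complexity). A minimal Python sketch of the bound m >= (1/epsilon) * (ln(1/delta) + ln|H|), showing why the unrestricted propositional hypothesis space needs exponentially many examples. The function names pac_sample_size and ln_h_boolean are assumptions made for illustration, and ln|H| is passed in directly because |H| = 2^(2^n) itself overflows quickly.

```python
import math


def pac_sample_size(epsilon, delta, ln_h):
    """Smallest integer m with m >= (1/epsilon) * (ln(1/delta) + ln|H|).

    ln_h is the natural log of the hypothesis-space size |H|.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(1.0 / delta) + ln_h) / epsilon)


def ln_h_boolean(n):
    """ln|H| when H is every Boolean function of n Boolean attributes.

    |H| = 2^(2^n) by the truth-table argument, so ln|H| = 2^n * ln 2.
    """
    return (2 ** n) * math.log(2)


if __name__ == "__main__":
    # Illustrative settings: 10% error tolerated, 95% confidence.
    for n in (5, 10, 20):
        print(n, pac_sample_size(0.1, 0.05, ln_h_boolean(n)))
```

The required m roughly doubles with each added attribute, which is the "still need exponential number of examples" point; restricting H (e.g., to decision lists) shrinks ln|H| and hence m.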
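
Worked example for 18.4 (choosing an attribute). A minimal Python sketch of information, remainder, gain and gain ratio, assuming two-class examples summarized as (pi, ni) counts per attribute value. The names information, remainder and gain follow the notes; gain_ratio and the example counts are illustrative assumptions, not taken from the text.

```python
import math


def information(*probs):
    """I(P(c_1),...,P(c_n)) in bits; zero-probability terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def remainder(splits):
    """Expected information still needed after the split.

    splits is a list of (pi, ni) pairs, one per value of attribute A.
    """
    total = sum(pi + ni for pi, ni in splits)
    return sum(
        (pi + ni) / total * information(pi / (pi + ni), ni / (pi + ni))
        for pi, ni in splits
        if pi + ni > 0  # skip empty subsets
    )


def gain(splits):
    """gain(A) = I(p/(p+n), n/(p+n)) - remainder(A)."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return information(p / (p + n), n / (p + n)) - remainder(splits)


def gain_ratio(splits):
    """Gain divided by the information in the subset sizes themselves."""
    total = sum(pi + ni for pi, ni in splits)
    split_info = information(*[(pi + ni) / total for pi, ni in splits])
    return gain(splits) / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Illustrative counts: 6 positive and 6 negative examples overall.
    three_way = [(0, 2), (4, 0), (2, 4)]              # two pure subsets
    uninformative = [(1, 1), (1, 1), (2, 2), (2, 2)]  # every subset is 50/50
    print(information(0.5, 0.5))  # 1.0
    print(gain(three_way), gain(uninformative))
    print(gain_ratio(three_way))
```

choose_attribute would then return the attribute whose splits maximize gain (or gain_ratio when many-valued attributes are a concern).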