Russell and Norvig, Chapter 20: Reinforcement Learning

20.1 Introduction
- agent has little prior knowledge and no immediate feedback
- credit assignment for actions is difficult when reward arrives only in the future
- two basic agent designs
  - agent learns utility function U(i) on states
    - used to select actions maximizing expected utility
    - requires a model of action outcomes (M_ij^a)
  - agent learns action-value function
    - gives expected utility Q(a,i) of action a in state i
    - Q-learning Q(a,i)
    - no action outcome model, but cannot look ahead

20.2 Passive Learning in a Known Environment
- sample environment [Fig20.1, p601]
- passive learning agent [Fig20.2, p602]
- given
  - state transition probabilities M
  - percept (state) sequence plus reward
- find state utilities U
  - u_i is the reward_to_go of state i
  - reward_to_go = sum of rewards of states from state i up to the terminal state
  - expected utility = expected reward_to_go
- updating utility values
  - naive
    - least mean squares (LMS) [Fig20.3, p602]
    - running average
    - does not use the constraints imposed by the transition probabilities
    - converges slowly
  - adaptive dynamic programming (ADP)
    - U(i) = R(i) + sum_j M_ij * U(j)
      - where R(i) = reward for being in state i
    - solve n equations in n unknowns, n = |states| (!)
    - use value iteration
  - temporal difference (TD) [Fig20.6, p605]
    - when observed transition from state i to state j
    - U(i) = U(i) + alpha * (R(i) + U(j) - U(i))
    - alpha = learning rate

20.3 Passive Learning in an Unknown Environment
- don't know transition model M_ij
- LMS and TD directly usable
- ADP must learn model as well
  - model estimated from state transition frequencies

20.4 Active Learning in an Unknown Environment
- consider actions, their outcomes, and possible reward
- changes to passive agent
  - model now incorporates actions M_ij^a
  - U(i) = R(i) + max_a sum_j M_ij^a * U(j)
  - performance_element chooses action based on M and U
- active ADP agent [Fig20.9, p608]
  - update active model estimates M
  - could use TD, but converges slower

20.5 Exploration
- actions have two purposes
  - gaining rewards
  - gaining information leading (possibly) to better rewards
- random (wacky) vs greedy
- bandit problems
- use the optimistic utility value U+(i) instead of U(i)
  - U+(i) = R(i) + max_a F(sum_j M_ij^a * U+(j), N(a,i))
  - N(a,i) = number of times action a taken in state i
  - F(u,n) = { R+ if n