Russell and Norvig, Chapter 17: Making Complex Decisions

17.1 Sequential Decision Problems
- no intermediate utility on the way to the goal
- transition model M_ij^a
  - the probability of reaching state j after taking action a in state i
- policy = complete mapping from states to actions
  - want policy maximizing expected utility
  - computed from transition model and state utilities
- see example [Fig17.1, p499; Fig17.2, p501]
  - P(intended direction) = 0.8, P(each right angle to intended) = 0.1
  - U(sequence) = terminal state's value - (1/25)*length(sequence)
- Markov decision problem (MDP)
  - calculating optimal policy in accessible, stochastic environment with known transition model
  - Markov property satisfied
    - M_ij^a depends only on the current state i, not on previous states
- partially observable Markov decision problem (POMDP)
  - inaccessible environment
  - not enough information to determine state, and thus transition probabilities
  - could just calculate expected utility of actions over possible states
    - but utilities change as we gather information
  - can include the value of information

17.2 Value Iteration
- method for calculating optimal policy in accessible environment
- utility function over histories must be separable
  - U_h([s0,s1,...,sn]) = f(s0, U_h([s1,...,sn]))
  - e.g., additive
    - U_h([s0,s1,...,sn]) = reward(s0) + U_h([s1,...,sn])
- best policy(i) = arg max_a sum_j M_ij^a * U(j)
- U(i) = R(i) + max_a sum_j M_ij^a * U(j)
  - R(i) is the reward for entering state i
    - -1/25 for all states except
      - +1 for (4,3)
      - -1 for (4,2)
- if maximum history length n is known
  - use dynamic programming
- else approximate until accurate enough [Fig17.4, p504]
  - U_{t+1}(i) = R(i) + max_a sum_j M_ij^a * U_t(j)
  - when to stop?
    - policy may be optimal before utilities converge

17.3 Policy Iteration [Fig17.7, p506]
- iteratively change policy in each state
  - until no action looks better than the policy's action
- need utility values for states (value determination)
  - use value iteration, replacing the max over 'a' with the action policy(i)
    - takes time to converge early in the process
  - or solve directly for utilities
    - U(i) = R(i) + sum_j M_ij^P(i) * U(j)
    - P(i) = Policy(i)
    - n linear equations with n unknowns (n = # states)

17.4 Decision-Theoretic Agent [Fig17.8, p508]
- X_t is a vector of state random variables at time t
- Belief(X_t) = P(X_t | E_1...E_t, A_1...A_{t-1})
  - X_t is the state at time t
  - E_i is the percept at time i
  - A_i is the action taken at time i
- simplifying assumptions
  - process is Markovian
    - P(X_t | X_1...X_{t-1}, A_1...A_{t-1}) = P(X_t | X_{t-1}, A_{t-1})
  - percept depends only on current state
    - P(E_t | X_1...X_t, A_1...A_{t-1}, E_1...E_{t-1}) = P(E_t | X_t)
  - action taken depends only on previously received percepts
    - P(A_{t-1} | A_1...A_{t-2}, E_1...E_{t-1}) = P(A_{t-1} | E_1...E_{t-1})
- Belief(X_t) calculated in two phases:
  - (1) prediction phase
    - compute the predicted distribution over states ^Belief(X_t) from the distribution over previous states and the action taken A_{t-1}
  - (2) estimation phase
    - incorporates percept E_t using Bayes' rule
    - Belief(X_t) = alpha * P(E_t | X_t) * ^Belief(X_t)
    - alpha is the normalizing constant
- decision-theoretic agent [Fig17.9, p511]
  - sensor model P(E_t | X_t)
    - stationary sensor model
      - P(E_t | X_t) = P(E|X)
    - belief network node CPT describes sensor reliability
      - (quantity) ---> (sensor, CPT)
    - multiple sensors for one quantity can improve accuracy (sensor fusion)
    - sensor model must anticipate failure
      - lane position sensor network [Fig17.12, p514]
  - action model P(X_t | X_{t-1}, A_{t-1}) (see Sec17.5)

17.5 Dynamic Belief Networks
- environment changes as P(X_t | X_{t-1}, A_{t-1})
  - assume P(X_t | X_{t-1}, A_{t-1}) same for all t
  - each state X_t determined only by previous state X_{t-1}
    - state evolution model, or
    - Markov chain
  - agent passively observing and predicting change in environment
    - P(X_t | X_{t-1}, A_{t-1}) = P(X_t | X_{t-1})
- dynamic belief network (DBN) [Fig17.13, p515]
  - state node / percept node pairs (slices)
    - connected by state evolution model
  - big for lots of states
    - probabilistic projection and past states
    - but really only need two slices at a time
- prediction-estimation process [Fig17.14, p516]
  - (1) prediction
    - given slices t-1 and t
    - have calculated Belief(X_{t-1}) incorporating E_1...E_{t-1}
    - calculate ^Belief(X_t)
  - (2) rollup
    - remove slice t-1
    - add ^Belief(X_t) as the prior probability table for State.t
  - (3) estimation
    - add new percept E_t
    - apply standard belief net updating to compute Belief(X_t)
    - add t+1 slice, CPT determined by stationary P(X_t | X_{t-1})
  - implements decision-theoretic agent in Fig17.9
- probabilistic projection into the future
  - accomplished after stage (3) by adding more slices
  - stochastic simulation works well here since there is no future evidence
- example DBN for lane positioning [Fig17.15, p517]

17.6 Dynamic Decision Networks (DDN)
- dynamic belief networks plus
  - utility nodes
  - decision nodes for actions
- general framework [Fig17.16, p518]
- evaluation similar to that in ordinary decision networks
  - agent does not know
    - future evidence
    - future decisions
  - expected utility of a decision sequence is the weighted sum of the utilities computed using each possible percept sequence
    - weight is the probability of the percept sequence given the decision sequence
  - i.e., action evaluation must also consider effects on the agent
    - not only on the environment

17.7 Summary and Discussion
- DDN-based agents can
  - handle uncertainty
  - handle unexpected events (no fixed plan)
  - handle noisy and failed sensors
  - act to obtain relevant information
- what is missing?
  - properties from logic
    - existential quantification
    - functions
  - properties from planning
    - partial-order planning
    - hierarchical decomposition
    - goal-directed
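
Worked sketch for 17.2 (value iteration). The update U_{t+1}(i) = R(i) + max_a sum_j M_ij^a * U_t(j) can be tried out on the 4x3 grid example using the numbers given above: 0.8 for the intended direction, 0.1 for each right angle, R = -1/25, +1 at (4,3), -1 at (4,2). Two details are assumptions taken from the textbook's example rather than from these notes: the obstacle sits at (2,2), and a move into a wall or the obstacle leaves the agent where it is. All function and variable names are illustrative, and the stopping rule (largest change below epsilon) is only one possible answer to "when to stop?".

```python
# Value iteration on the 4x3 grid world (a sketch, not the book's code).
# Assumed, not stated in the notes: obstacle at (2,2); bumping into a wall
# or the obstacle leaves the agent in its current square.

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# The two directions at right angles to each intended direction.
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

OBSTACLE = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) != OBSTACLE]


def reward(state):
    """R(i): +1 or -1 in the terminal squares, -1/25 everywhere else."""
    return TERMINALS.get(state, -1.0 / 25.0)


def move(state, direction):
    """Deterministic result of one step; stay put if the step is blocked."""
    dx, dy = ACTIONS[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transitions(state, action):
    """The row M_ij^a for state i, as a list of (probability, j) pairs."""
    side_a, side_b = SIDEWAYS[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, side_a)),
            (0.1, move(state, side_b))]


def expected_utility(state, action, U):
    """sum_j M_ij^a * U(j)"""
    return sum(p * U[j] for p, j in transitions(state, action))


def value_iteration(epsilon=1e-6):
    """Repeat the update until no utility changes by epsilon or more."""
    U = {s: 0.0 for s in STATES}
    while True:
        new_U = {}
        for s in STATES:
            if s in TERMINALS:
                new_U[s] = reward(s)  # no successors: U(i) = R(i)
            else:
                new_U[s] = reward(s) + max(
                    expected_utility(s, a, U) for a in ACTIONS)
        delta = max(abs(new_U[s] - U[s]) for s in STATES)
        U = new_U
        if delta < epsilon:
            return U


def best_policy(U):
    """policy(i) = arg max_a sum_j M_ij^a * U(j) for non-terminal states."""
    return {s: max(ACTIONS, key=lambda a: expected_utility(s, a, U))
            for s in STATES if s not in TERMINALS}


utilities = value_iteration()
policy = best_policy(utilities)
```

Under these assumptions the utility of the start square (1,1) comes out at roughly 0.71 and the policy there is "up". Calling best_policy on the intermediate utilities of each pass is one way to observe the point made in 17.2 that the policy may already be optimal before the utilities converge.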