For this problem you will implement a reinforcement learning (RL)
agent to navigate the specific wumpus world of Figure 7.2 in Russell and
Norvig. The RL agent will select actions based on the utility of states in
the wumpus world according to the following formula:
$$U(i) = R(i) + \max_a \sum_j M^a_{ij} \, U(j)$$
The state i represents a four-tuple [X,Y,Orient,Gold], where
X,Y (1 ≤ X,Y ≤ 4) is the location of the agent,
Orient ∈ {up, down, left, right} is the orientation of
the agent, and Gold ∈ {no,yes} indicates whether the
agent has the gold. Each action a is one of {goforward,
turnleft, turnright, grab}. The reward R(i) for state i
is -0.05, except for the following 36 terminal states (32 hazard states
and 4 goal states), where _ matches any value:
R([1,3,_,_]) = -1.0
R([3,3,_,_]) = -1.0
R([3,1,_,_]) = -1.0
R([4,4,_,_]) = -1.0
R([1,1,_,yes]) = 1.0
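As a minimal sketch of the state space and reward function just described, the Java class below maps each four-tuple to an index in 0..127 and evaluates R(i). The class name `WumpusStates`, its method names, the index layout, and the clockwise orientation codes are assumptions made for illustration only; they are not part of the assignment.

```java
/** Hypothetical helper: indexes the 128 states and evaluates R(i). */
final class WumpusStates {
    static final int SIZE = 4; // 1 <= X,Y <= 4
    // Orientation codes (an assumed encoding, listed clockwise).
    static final String[] ORIENT = {"up", "right", "down", "left"};
    static final int NUM_STATES = SIZE * SIZE * ORIENT.length * 2; // 128

    // Locations {X, Y} whose states are terminal with reward -1.0.
    static final int[][] DEADLY = {{1, 3}, {3, 3}, {3, 1}, {4, 4}};

    /** Maps [x, y, orient, gold] to an index in 0..127. */
    static int index(int x, int y, int orient, boolean gold) {
        return ((x - 1) * SIZE + (y - 1)) * 8 + orient * 2 + (gold ? 1 : 0);
    }

    static boolean isDeadly(int x, int y) {
        for (int[] d : DEADLY) {
            if (d[0] == x && d[1] == y) return true;
        }
        return false;
    }

    /** Terminal states: any state on a deadly square, or [1,1,_,yes]. */
    static boolean isTerminal(int x, int y, boolean gold) {
        return isDeadly(x, y) || (x == 1 && y == 1 && gold);
    }

    /** R(i); the orientation never affects the reward. */
    static double reward(int x, int y, boolean gold) {
        if (isDeadly(x, y)) return -1.0;
        if (x == 1 && y == 1 && gold) return 1.0;
        return -0.05;
    }

    /** Counts terminal states by enumerating all four-tuples. */
    static int countTerminal() {
        int count = 0;
        for (int x = 1; x <= SIZE; x++)
            for (int y = 1; y <= SIZE; y++)
                for (int o = 0; o < ORIENT.length; o++)
                    for (int g = 0; g < 2; g++)
                        if (isTerminal(x, y, g == 1)) count++;
        return count;
    }
}
```

Enumerating the tuples this way is a quick check that the state space has 128 entries, of which 36 are terminal.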
$M^a_{ij}$ is the probability of being in state
j after taking action a in state i. You may assume
the agent's actions always work as expected; therefore,
$M^a_{ij}$ is always either 0 or 1. For example,
the probability of being in state [2,1,right,no] after executing
goforward in state [1,1,right,no] is 1, while the probability of
being in state [2,1,right,no] after executing goforward in
state [1,1,up,no] is 0. $M^a_{ij} = 0$ for every a and j when
i is one of the above terminal states. Specifically,
- Write a method UpdateUtility that makes a single pass over all
128 states, updating each utility according to the above formula for
U(i). The U(j) values in the formula should all come from
the previous state utilities, not newly computed ones. In other words, you
should not update the current utilities until all of them have been
recomputed. Initially, U(i) = 0 for all states i.
- Write a method RLagent that uses the current utility values to select
actions and move through the states of the wumpus world from Figure 7.2.
Given that the agent is in state i, the action to choose will be the
one maximizing utility:
$$\text{action} = \arg\max_a \sum_j M^a_{ij} \, U(j)$$
RLagent should return a sequence of actions for getting from the initial
state [1,1,right,no] to the goal state [1,1,_,yes], or NULL
if the agent is killed or exceeds some upper bound (e.g., 100) on the
number of actions.
- The main procedure of your program should iterate between calling
UpdateUtility and RLagent until the agent successfully gets the gold and
returns to location (1,1) without being killed. Your program should then
output the successful action sequence and the number of iterations.
- Collect your well-documented, object-oriented Java code for the RL
solution along with a log of the run.
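
The steps above can be sketched roughly as follows. This is one possible arrangement, not a reference solution, and it rests on several assumptions that the assignment does not state: the class name `WumpusRL` and all helper names are hypothetical; X is taken to be the column and Y the row, so that facing right increases X; the gold is assumed to sit at [2,3] as drawn in Figure 7.2; goforward into a wall and grab away from the gold are both assumed to leave the state unchanged; and ties between equally good actions go to the action listed first.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of UpdateUtility, RLagent, and the main loop. */
final class WumpusRL {
    static final String[] ACTIONS = {"goforward", "turnleft", "turnright", "grab"};
    // Orientation codes in clockwise order: 0 = up, 1 = right, 2 = down, 3 = left.
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {1, 0, -1, 0};
    static final int GOLD_X = 2, GOLD_Y = 3; // assumption: gold square read off Figure 7.2
    static final int N = 128;                // 4 * 4 * 4 * 2 states

    double[] utility = new double[N];        // U(i) = 0 initially

    // State index layout: ((x-1)*4 + (y-1)) * 8 + orient * 2 + gold.
    static int index(int x, int y, int o, int g) { return ((x - 1) * 4 + (y - 1)) * 8 + o * 2 + g; }
    static int xOf(int s) { return s / 32 + 1; }
    static int yOf(int s) { return (s / 8) % 4 + 1; }
    static int oOf(int s) { return (s / 2) % 4; }
    static int gOf(int s) { return s % 2; }

    static boolean deadly(int s) {
        int x = xOf(s), y = yOf(s);
        return (x == 1 && y == 3) || (x == 3 && y == 3) || (x == 3 && y == 1) || (x == 4 && y == 4);
    }
    static boolean goal(int s) { return xOf(s) == 1 && yOf(s) == 1 && gOf(s) == 1; }
    static boolean terminal(int s) { return deadly(s) || goal(s); }
    static double reward(int s) { return deadly(s) ? -1.0 : (goal(s) ? 1.0 : -0.05); }

    /** Deterministic successor: the single state j for which M^a_ij = 1. */
    static int next(int s, int a) {
        int x = xOf(s), y = yOf(s), o = oOf(s), g = gOf(s);
        if (a == 0) {                        // goforward; a wall bump is assumed to be a no-op
            int nx = x + DX[o], ny = y + DY[o];
            if (nx >= 1 && nx <= 4 && ny >= 1 && ny <= 4) { x = nx; y = ny; }
        } else if (a == 1) {
            o = (o + 3) % 4;                 // turnleft: counterclockwise
        } else if (a == 2) {
            o = (o + 1) % 4;                 // turnright: clockwise
        } else if (x == GOLD_X && y == GOLD_Y) {
            g = 1;                           // grab succeeds only on the gold square
        }
        return index(x, y, o, g);
    }

    /** One synchronous sweep: every U(j) on the right-hand side is an old value. */
    void updateUtility() {
        double[] fresh = new double[N];
        for (int s = 0; s < N; s++) {
            fresh[s] = reward(s);
            if (terminal(s)) continue;       // M^a_ij = 0 for terminal i, so U(i) = R(i)
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < ACTIONS.length; a++) best = Math.max(best, utility[next(s, a)]);
            fresh[s] += best;
        }
        utility = fresh;                     // swap only after all 128 values are recomputed
    }

    /** Greedy rollout from [1,1,right,no]; returns null on death or after maxActions. */
    List<String> rlAgent(int maxActions) {
        List<String> plan = new ArrayList<>();
        int s = index(1, 1, 1, 0);
        while (!terminal(s)) {
            if (plan.size() >= maxActions) return null;
            int bestA = 0;                   // ties go to the earliest action in ACTIONS
            for (int a = 1; a < ACTIONS.length; a++)
                if (utility[next(s, a)] > utility[next(s, bestA)]) bestA = a;
            plan.add(ACTIONS[bestA]);
            s = next(s, bestA);
        }
        return goal(s) ? plan : null;
    }

    public static void main(String[] args) {
        WumpusRL rl = new WumpusRL();
        List<String> plan = null;
        int iterations = 0;
        while (plan == null) {
            rl.updateUtility();
            iterations++;
            plan = rl.rlAgent(100);
        }
        System.out.println("Iterations: " + iterations);
        System.out.println("Actions: " + plan);
    }
}
```

Because the transitions are deterministic, the sum over j collapses to the utility of the single successor state, which is why `updateUtility` only looks up `utility[next(s, a)]`. The number of iterations reported depends on how ties between equal utilities are broken, so a different tie-breaking rule may legitimately produce a different count.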