The Cliff Walking environment is a gridworld with a discrete state space and a discrete action space. The
agent starts at grid cell S and can move to the four neighboring cells by taking the actions Up, Down,
Left, or Right. The Up and Down actions are deterministic, whereas the Left and Right actions are
stochastic: they succeed with probability 0.7, and with probability 0.3 the agent instead ends up moving in
a perpendicular direction. Attempting to move out of the boundary leaves the agent in its current cell; for
example, taking the Left action from a cell in the leftmost column results in no movement at all. The agent
receives a reward of -1 per step in most states, and a reward of -100 when falling off the cliff. This is an
episodic task: the episode terminates when the agent reaches the goal grid cell G. Falling off the cliff
resets the agent to the start state without terminating the episode.
[Grid figure: the start cell S and the goal cell G, with The Cliff occupying the cells between them.]
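To make these dynamics concrete, here is a minimal sketch of a step function for the environment. The grid size, the coordinates of S and G, and the exact cliff cells are not fixed by the text above, so the constants below are placeholders to be filled in from the figure; the 0.3 slip probability is assumed to be split evenly between the two perpendicular directions, and the function name itself is illustrative.

    import random

    # Placeholder layout constants -- fill these in to match the figure; the
    # grid size, start, goal, and cliff cells below are assumptions.
    N_ROWS, N_COLS = 4, 5
    START, GOAL = (0, 0), (0, 4)
    CLIFF = set()                 # coordinates of the cliff cells from the figure

    MOVES = {"Up": (1, 0), "Down": (-1, 0), "Left": (0, -1), "Right": (0, 1)}
    PERP = {"Left": ("Up", "Down"), "Right": ("Up", "Down"),
            "Up": ("Left", "Right"), "Down": ("Left", "Right")}

    def step(state, action):
        """One transition of the Cliff Walking MDP: returns (next_state, reward, done)."""
        # Left/Right succeed with probability 0.7; otherwise the agent slips
        # into a perpendicular direction (assumed uniform, 0.15 each way).
        if action in ("Left", "Right") and random.random() > 0.7:
            action = random.choice(PERP[action])
        dr, dc = MOVES[action]
        nxt = (state[0] + dr, state[1] + dc)
        # Moving off the boundary leaves the agent where it is.
        if not (0 <= nxt[0] < N_ROWS and 0 <= nxt[1] < N_COLS):
            nxt = state
        if nxt in CLIFF:
            return START, -100, False    # falling off the cliff resets, no termination
        if nxt == GOAL:
            return nxt, -1, True         # reaching G terminates the episode
        return nxt, -1, False            # ordinary step costs -1

With the placeholders filled in from the figure, this step function can be used to simulate episodes or to read off the transition function P(s' | s, a) needed for part I below.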
For the problem described above, answer the following questions:
I. Formulate the problem as an MDP.
II. Use policy iteration to find the optimal policy.
III. Suppose that you are not given the transition or reward function, but that you observe the following
(state, action, reward, state') tuples in Episode 1:
Episode 1:
( (0,0), Up, -1, (1,0) )
( (0,1), Down, -1, (0,0) )
( (0,0), Right, -1, (0,0) )
( (0,0), Left, -1, (0,0) )
( (0,0), Up, -1, (1,0) )
( (0,1), Right, -1, (1,1) )
( (1,1), Right, -1, (1,2) )
( (1,2), Right, -1, (1,3) )
( (1,3), Right, -1, (1,4) )
( (1,4), Down, -1, (0,4) )
Calculate the TD estimates of all the states visited in Episode 1.
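As a reference, here is a minimal sketch of how such TD(0) estimates could be computed from the tuples above. The step size alpha, the discount factor gamma, and the all-zero initial values are assumptions, since the problem statement does not specify them.

    # TD(0) over the observed (state, action, reward, next_state) tuples.
    episode = [
        ((0, 0), "Up",    -1, (1, 0)),
        ((0, 1), "Down",  -1, (0, 0)),
        ((0, 0), "Right", -1, (0, 0)),
        ((0, 0), "Left",  -1, (0, 0)),
        ((0, 0), "Up",    -1, (1, 0)),
        ((0, 1), "Right", -1, (1, 1)),
        ((1, 1), "Right", -1, (1, 2)),
        ((1, 2), "Right", -1, (1, 3)),
        ((1, 3), "Right", -1, (1, 4)),
        ((1, 4), "Down",  -1, (0, 4)),
    ]

    alpha, gamma = 0.5, 1.0        # assumed step size and discount factor
    V = {}                         # state-value estimates, default 0

    for state, action, reward, next_state in episode:
        v, v_next = V.get(state, 0.0), V.get(next_state, 0.0)
        # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
        V[state] = v + alpha * (reward + gamma * v_next - v)

    print(V)

Under these assumptions, only the states that appear as the first element of some tuple receive an update; all other states keep their initial value.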
Answers
Answer:
The rule is simple. Your agent/robot starts at the bottom-left corner (the 'start' sign), and the episode ends at either the +1 or the -1 cell, which gives the corresponding reward. At each step the agent has four possible actions: up, down, left and right, while the black block is a wall that the agent cannot pass through. To keep things straightforward, our first implementation assumes that each action is deterministic, that is, the agent goes where it intends to go. For instance, when the agent decides to take the action up at (2, 0), it lands in (1, 0) rather than (2, 1) or elsewhere. (We will add uncertainty in our second implementation.) However, if the agent hits the wall, it remains at the same position.
Board
So let’s get cracking on the code! Firstly, let’s set up some global rules of the board. (full code)
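A minimal sketch of what those global rules might look like is shown below. The specific grid size and the positions of the +1, -1, wall, and start cells are illustrative assumptions consistent with the description above (start at the bottom-left, rows indexed from the top), not values given by the text.

    # Global board settings (assumed values for illustration).
    BOARD_ROWS = 3          # 3 rows, so the bottom-left corner is row 2
    BOARD_COLS = 4          # assumed number of columns
    WIN_STATE = (0, 3)      # assumed position of the +1 cell
    LOSE_STATE = (1, 3)     # assumed position of the -1 cell
    WALL = (1, 1)           # assumed position of the black wall block
    START = (2, 0)          # bottom-left corner, rows indexed from the top
    DETERMINISTIC = True    # first implementation: actions always succeed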
And as a grid game, it needs a State class to represent each state (position) of our agent and to give a reward according to that state.
When our agent takes an action, the State should provide a function that accepts the action and returns a legal position for the next state.
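Here is a minimal sketch of such a State class, building on the board globals sketched above; the method names (giveReward, nxtPosition) are illustrative, not prescribed by the text.

    class State:
        def __init__(self, state=START):
            self.state = state                               # current (row, col) position
            self.isEnd = state in (WIN_STATE, LOSE_STATE)    # terminal flag

        def giveReward(self):
            # Reward depends only on the current position.
            if self.state == WIN_STATE:
                return 1
            if self.state == LOSE_STATE:
                return -1
            return 0

        def nxtPosition(self, action):
            # Deterministic moves: the agent goes where it intends to go.
            moves = {"up": (-1, 0), "down": (1, 0),
                     "left": (0, -1), "right": (0, 1)}
            dr, dc = moves[action]
            nxt = (self.state[0] + dr, self.state[1] + dc)
            # Hitting the wall or the board edge leaves the agent in place.
            if (0 <= nxt[0] < BOARD_ROWS and 0 <= nxt[1] < BOARD_COLS
                    and nxt != WALL):
                return nxt
            return self.state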
Agent
This is the artificial intelligence part: our agent should be able to learn from the process and think like a human. The key to the magic is value iteration.
Value Iteration
What our agent will finally learn is a policy, and a policy is a mapping from states to actions that simply tells the agent what to do in each state. In our case, instead of learning that mapping directly, we will leverage the estimated value of each state and choose actions greedily with respect to those values, as sketched below.
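The following is a minimal value-iteration sketch over the deterministic board, reusing the State class and board globals assumed above. The discount factor GAMMA and the convergence threshold THETA are illustrative choices, and the +1/-1 reward is modeled as being collected on entering a cell.

    GAMMA, THETA = 0.9, 1e-4
    ACTIONS = ["up", "down", "left", "right"]

    # Initialize the value of every non-wall cell to 0.
    values = {(r, c): 0.0
              for r in range(BOARD_ROWS) for c in range(BOARD_COLS)
              if (r, c) != WALL}

    while True:
        delta = 0.0
        for s in values:
            if s in (WIN_STATE, LOSE_STATE):
                continue                       # terminal cells keep value 0
            candidates = []
            for a in ACTIONS:
                nxt = State(s).nxtPosition(a)
                # Reward for entering the next cell plus its discounted value.
                candidates.append(State(nxt).giveReward() + GAMMA * values[nxt])
            best = max(candidates)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < THETA:                      # stop once the values converge
            break

    # The greedy policy then picks, in each state, the action whose successor
    # state has the highest estimated value.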