SARSA Learning with Python

Hola,
I worked on the SARSA algorithm as well as the Q-learning algorithm, and the two produced different Q matrices (duh!). The difference comes down to methodology: Q-learning is off-policy, because it updates towards the best possible future reward regardless of the action it actually takes next, while SARSA is on-policy, because it picks its next action from the current policy and then uses that action to update the Q matrix.
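For reference, the two update rules differ only in the target they move Q[s][a] towards (alpha is the learning rate, gamma the discount factor, r the immediate reward):

Q-learning (off-policy): Q[s][a] <- Q[s][a] + alpha * (r + gamma * max over a' of Q[s'][a'] - Q[s][a])
SARSA (on-policy):       Q[s][a] <- Q[s][a] + alpha * (r + gamma * Q[s'][a'] - Q[s][a]),  where a' is the action the policy actually picks in s'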

The grid-game example from the previous post gave different results when I implemented SARSA. The SARSA run also involved some repetitive paths, whereas Q-learning didn't show any. Stepping through a single update shows why: SARSA follows the path the agent actually takes, while Q-learning updates as if the agent always took the optimal path.

To implement both, I keep the pseudocode in mind.

QL

initialise Q matrix
Loop (episodes):
   Choose an initial state (s)
   While (goal not reached):
      Choose the action (a) with the maximum Q value for s
      Take it and determine the next state (s') and the immediate reward
      Target -> immediate reward + discounted best future value (gamma * max over a' of Q[s'][a'])
      Update Q[s][a] towards the target
      s <- s'
   New episode
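A minimal Python sketch of tabular Q-learning along these lines follows. This is not the code from the post: the 4x4 grid, the start and goal states, and the rewards (100 at the goal, -1 per step, no penalty boxes) are assumptions made just for illustration.

import numpy as np

# Assumed setup: 4x4 grid, states 0..15 numbered left to right from the top-left
# corner, actions 0=up, 1=down, 2=left, 3=right, goal in the bottom-right corner.
N_STATES, N_ACTIONS, GOAL = 16, 4, 15
ALPHA, GAMMA, EPISODES = 0.8, 0.9, 500

def step(state, action):
    """Deterministic grid move; moves that would leave the grid keep the agent in place."""
    row, col = divmod(state, 4)
    if action == 0:   row = max(row - 1, 0)
    elif action == 1: row = min(row + 1, 3)
    elif action == 2: col = max(col - 1, 0)
    else:             col = min(col + 1, 3)
    next_state = row * 4 + col
    reward = 100 if next_state == GOAL else -1
    return next_state, reward

Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(EPISODES):
    s = 0                              # start in the top-left corner
    while s != GOAL:
        a = int(np.argmax(Q[s]))       # greedy pick, as in the pseudocode; the -1 step
                                       # reward makes untried (zero-valued) actions look
                                       # best, which is what drives exploration here
        s_next, r = step(s, a)
        # Off-policy target: bootstrap from the best action available in s'
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 1))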

SARSA-L

initialise Q matrix
Loop (episodes):
   Choose an initial state (s)
   Choose an action (a) for s from the current policy
   While (goal not reached):
      Take action (a), observe the immediate reward and the next state (s')
      Choose the next action (a') for s' from the same policy
      Target -> immediate reward + gamma * Q[s'][a']
      Update Q[s][a] towards the target (Q[s][a] += alpha * (target - Q[s][a]))
      s <- s'; a <- a'
   New episode
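And a matching SARSA sketch on the same assumed grid. Here the behaviour policy is epsilon-greedy with a fixed epsilon (the post decays it, as described further below); again this is an illustration, not the post's code.

import numpy as np

# Same assumed 4x4 grid as in the Q-learning sketch above.
N_STATES, N_ACTIONS, GOAL = 16, 4, 15
ALPHA, GAMMA, EPSILON, EPISODES = 0.8, 0.9, 0.2, 500
rng = np.random.default_rng(0)

def step(state, action):
    """Deterministic grid move; 0=up, 1=down, 2=left, 3=right; off-grid moves stay put."""
    row, col = divmod(state, 4)
    if action == 0:   row = max(row - 1, 0)
    elif action == 1: row = min(row + 1, 3)
    elif action == 2: col = max(col - 1, 0)
    else:             col = min(col + 1, 3)
    next_state = row * 4 + col
    return next_state, (100 if next_state == GOAL else -1)

def choose_action(Q, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(EPISODES):
    s = 0
    a = choose_action(Q, s, EPSILON)              # first action is chosen before the loop
    while s != GOAL:
        s_next, r = step(s, a)
        a_next = choose_action(Q, s_next, EPSILON)
        # On-policy target: bootstrap from the action we will actually take in s'
        Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next

print(np.round(Q, 1))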

Here are the outputs from Q-L and SARSA-L


The above is Q-L


This one is SARSA 

There is a clear difference between the two Q matrices. I then worked on another example using both Q-learning and SARSA. It might look similar to the mouse-and-cliff problem to some readers, so bear with me.


The code for Naruto-Q-Learning is below






Here is Hinata trying to find her way to her goal by using SARSA




The code for Hinata's SARSA learning is below.

I used the epsilon-greedy method for action selection. I generated a random floating-point number between 0 and 1 and set epsilon to 0.2. If the generated number is greater than 0.2, I select the action with the maximum Q value (argmax); if it is less than 0.2, I select one of the permitted actions at random. With each passing episode I decreased the value of epsilon (epsilon decay). This ensures that as the agent learns its way, it follows that path rather than continuing to explore: exploration is at its maximum at the start of the simulation and gradually decreases as the episodes go by.
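A small sketch of that selection-and-decay logic (the starting epsilon of 0.2 comes from the post; the decay factor, the floor value, and representing the permitted actions as a list are my assumptions):

import numpy as np

rng = np.random.default_rng()
epsilon, decay, min_epsilon = 0.2, 0.99, 0.01        # decay factor and floor are assumed values

def select_action(q_row, permitted_actions, epsilon):
    """Epsilon-greedy over the actions permitted in the current state."""
    if rng.random() < epsilon:                        # explore
        return int(rng.choice(permitted_actions))
    return max(permitted_actions, key=lambda a: q_row[a])  # exploit: best permitted action

for episode in range(500):
    # ... run one episode, calling select_action(Q[s], permitted, epsilon) at each step ...
    epsilon = max(epsilon * decay, min_epsilon)       # epsilon decay after every episode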

This is the decay of epsilon over the episodes.

The path followed in the above simulation is 0 - 4 - 8 - 9 - 10 - 11 - 7. Sometimes the agent also follows the same path as it did during Q-learning. Well, I am continuing my exploration of this and will post more details as I learn more about RL.

Till then, bye

4 comments:

  1. nice update, thank you, please can you explain how you generate your matrices for reward and state, thanks

    Replies
    1. The first row specifies
      TOP BOTTOM LEFT RIGHT:
      0 points if the move goes out of the box,
      100 points if the agent lands on the green box,
      -1 if the agent lands on a box other than green or red,
      -10 if the agent lands on the red box by any of the moves above.

      Each box has its own state number,
      starting from the top left and going horizontally from 0 to 15.

      So -1 is an impossible state.

  2. thanks, I got it from your previous post. please can you share your email address with me?

  3. Its working.
    Can it be ported into hardware?
