Frozen Lake using Q-learning

Import required libraries

OpenAI has developed the Gym library, which bundles many environments (including classic games). It is a toolkit for developing, testing, and comparing reinforcement learning algorithms.

In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

We initialize our Frozen Lake environment here and render its 4x4 grid: S marks the start tile, F a frozen (safe) tile, H a hole, and G the goal.

In [2]:
env = gym.make('FrozenLake-v0')
env.render()
SFFF
FHFH
FFFH
HFFG

We see that there are 16 different states (one per grid cell) and 4 possible actions (Left, Down, Right, Up).

In [3]:
states = env.observation_space.n
actions = env.action_space.n

print('States:', states, ' Actions:', actions)
States: 16  Actions: 4
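
To make these numbers concrete: the 16 states are just the 16 grid cells read row by row, and the 4 action indices follow Gym's FrozenLake convention (0 = Left, 1 = Down, 2 = Right, 3 = Up). A minimal sketch, assuming that layout:

# Decode a state index into (row, col) on the 4x4 grid (state = row * 4 + col)
def state_to_cell(state, ncols=4):
    return state // ncols, state % ncols

print(state_to_cell(0))    # (0, 0): the start tile 'S'
print(state_to_cell(15))   # (3, 3): the goal tile 'G'

action_names = ['Left', 'Down', 'Right', 'Up']
print(action_names[env.action_space.sample()])  # name of a randomly sampled action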

Initialize Q-Table

We initialize our Q-table with zeros; you could also initialize it with random values (see the commented-out line below).

In [4]:
qtable = np.zeros((states, actions))
#qtable = np.random.rand(states, actions)
print(qtable)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
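
Each row of the table corresponds to one of the 16 states and each column to one of the 4 actions, so qtable[state, action] stores the current estimate of the discounted return for taking that action from that state. A quick sanity check:

print(qtable.shape)    # (16, 4): one row per state, one column per action
print(qtable[0, :])    # the four action-value estimates for the start state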

Hyperparameters

Setting up the hyperparameters for our algorithm.

  • episodes: maximum number of training episodes
  • learning_rate: how strongly each update moves the learned Q-values toward the new target
  • gamma: discount factor for future rewards
  • epsilon: probability of choosing a random (exploratory) action instead of the greedy one
  • decay_rate: how quickly epsilon decays towards min_epsilon over the episodes
  • max_steps: maximum number of time steps allowed in an episode
In [5]:
episodes = 8000
# 'speed' adds a pause (in seconds) after each rendered step. Higher value = slower playback (speed = 0.05 gives a comfortable view)
speed = 0

learning_rate = 0.05
gamma = 0.95

epsilon = 1
min_epsilon = 0.01
max_epsilon = 1
decay_rate = 0.001

max_steps = 99
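
The training loop below decays epsilon exponentially: epsilon = min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode). A minimal sketch of what that schedule looks like with the values above (not part of training, just a sanity check):

for ep in [0, 1000, 2000, 4000, 8000]:
    eps = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * ep)
    print(ep, round(eps, 3))
# roughly 1.0, 0.374, 0.144, 0.028, 0.01: mostly exploration early on, mostly exploitation later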

Training algorithm
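
Inside the loop, actions are chosen epsilon-greedily: with probability epsilon the agent explores by sampling a random action, otherwise it exploits the current table with argmax. Each transition (state, action, reward, new_state) is then used for the standard Q-learning update,

    Q(s, a) ← Q(s, a) + learning_rate * (reward + gamma * max_a' Q(s', a') - Q(s, a))

which is exactly the qtable[state, action] = ... line below.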

In [6]:
# To track the rewards for each episode
rewards = []

# Beginning training
for episode in range(episodes):
    
    # Get initial state
    state = env.reset()
    # To track if episode has ended or agent has fallen in hole
    done = False
    t = 0
    # Total reward in an episode
    total_reward = 0
    
    for t in range(max_steps):
        # Exploration and exploitation strategy
        tradeoff_no = random.uniform(0,1)
            
        if tradeoff_no > epsilon:
            action = np.argmax(qtable[state,:])
        else:
            action = env.action_space.sample()
        
        clear_output(wait=True)
        # Displays the environment
        env.render()
        # Take action and receive reward, next_state, status of agent 
        new_state, reward, done, info = env.step(action)
          
        total_reward += reward
        # Q-learning update: move Q(state, action) toward the target reward + gamma * max_a' Q(new_state, a')
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        state = new_state
        time.sleep(speed)
        print(qtable)
        
        # Display stats at end of episode
        if done:
            print("Episode ", episode," finished after {} timesteps".format(t))
            print('Epsilon:', epsilon, 'Total reward: ', total_reward)
            time.sleep(0.5)
            break
        
    # Decrease epsilon value with time (episodes)    
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    #epsilon = np.max((min_epsilon, epsilon*decay_rate)) 
    rewards.append(total_reward)

print("Score over time: " + str(sum(rewards)/episodes))
  (Down)
SFFF
FHFH
FFFH
HFFG
[[0.18410851 0.16220196 0.16305274 0.15413045]
 [0.09592546 0.10498175 0.08064167 0.15535756]
 [0.13561894 0.10566317 0.10892104 0.10176307]
 [0.00087725 0.08103708 0.00139595 0.02404937]
 [0.22878428 0.1711495  0.13149528 0.11337385]
 [0.         0.         0.         0.        ]
 [0.07769191 0.07474414 0.12657049 0.03666889]
 [0.         0.         0.         0.        ]
 [0.16339244 0.17312395 0.1869767  0.28218143]
 [0.22654311 0.39728562 0.3099756  0.28753218]
 [0.37355515 0.28992515 0.2855971  0.16427558]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.30188098 0.36632278 0.57301599 0.36333947]
 [0.5342665  0.7304161  0.64129986 0.60563808]
 [0.         0.         0.         0.        ]]
Episode  7999  finished after 25 timesteps
Epsilon: 0.01033277288228577 Total reward:  1.0
Score over time: 0.467625
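
Before letting the agent play, the greedy policy can be read straight out of the learned table by taking the argmax of each row; the playback loop below does exactly this at every step. A minimal sketch:

# Greedy action per state, reshaped to the 4x4 grid (0 = Left, 1 = Down, 2 = Right, 3 = Up)
greedy_policy = np.argmax(qtable, axis=1).reshape(4, 4)
print(greedy_policy)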

Play with our trained agent

In [8]:
env.reset()
speed = 0.05
goal = 0
for episode in range(10):
    
    state = env.reset()
    done = False
    t = 0
    while not done:
        clear_output(wait=True)
        env.render()
        
        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        
        state = new_state
        t += 1
        time.sleep(speed)
        if done:
            if reward == 1:
                goal += 1
            print("Episode ", episode," finished after {} timesteps".format(t))
            print('Total reward: ', reward)
            time.sleep(speed)
            break

env.close()
print('Reached goal:', goal, "times")
  (Left)
SFFF
FHFH
FFFH
HFFG
Episode  9  finished after 25 timesteps
Total reward:  1.0
Reached goal: 8 times
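
Ten rendered episodes give only a rough picture. A more reliable success rate comes from running the greedy policy silently over many episodes; a minimal sketch (it recreates the environment, since env was closed above):

eval_env = gym.make('FrozenLake-v0')
wins = 0
n_eval = 1000
for _ in range(n_eval):
    state = eval_env.reset()
    done = False
    while not done:
        state, reward, done, info = eval_env.step(np.argmax(qtable[state, :]))
    wins += reward    # reward is 1.0 only when the goal is reached
eval_env.close()
print('Success rate:', wins / n_eval)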