This is the second part of Reinforcement Learning (Q-learning). If you would like to understand RL, Q-learning, and the key terms, please read Part 1.
In this part, we will implement a simple Q-learning example from scratch using the R programming language. You are expected to understand the basics of R programming and to have read Part 1 of this article.
Import libraries
We will code the algorithm using the base R package only; however, we need a couple of libraries to plot the various matrices and visualize the output.
install.packages('plot.matrix')
install.packages('RColorBrewer')
library(plot.matrix)
library(RColorBrewer)
Matrix Plot Function
The function below plots any R matrix using the plot.matrix library. Please refer to the CRAN page for details about the arguments used in the plot() call. plot_matrix() takes the matrix to plot and, optionally, the number of digits to display as input arguments.
plot_matrix <- function(mat_pot, digits_arg) {
  if (missing(digits_arg)) digits_arg <- 0
  plot(mat_pot, cex = 1.2, fmt.cell = paste0('%.', digits_arg, 'f'),
       col = brewer.pal(3, "Blues"), breaks = c(-500, 0, 0, 500),
       xlab = "States", ylab = "States", main = "")
}
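If you want to try the helper before building the environment, you can call it on a small throwaway matrix; the 3 x 3 matrix below is only an illustration and is not used later in the tutorial:
## Plot a dummy 3 x 3 matrix with 1 digit shown per cell
plot_matrix(matrix(c(-100, 0, 10, 10, 10, 10, 10, 10, 100), nrow = 3), 1)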
Define Environment
States & Actions
Let’s consider that there are 5 states, i.e. 1, 2, 3, 4, 5, and that the possible actions are left, right, up, and down.
states <- seq(1, 5, by = 1)
state_seq <- cbind(merge(states, states),
                   state = seq(1, length(states) * length(states)))
state_mat <- matrix(state_seq$state, nrow = length(states), ncol = length(states))
plot_matrix(state_mat,1) ## matrix, digits
The above code creates a sequence of states from 1 to 5 along each axis. Since we can move from one state to another using the actions, we build a states * states matrix, which becomes our environment. If we label every cell of this grid with a number, we get states 1 to 5 * 5 = 25. The plot below can help you understand this better:
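Because state_mat is filled column by column, the state number and the grid position are linked in a simple way; assuming the state_seq data frame built above (where x is the row and y the column), you can check the mapping for any state:
## Where does state 7 sit in the grid? x is its row, y its column
state_seq[state_seq$state == 7, ]   # should show x = 2, y = 2
## And the other way round: the state number at row 2, column 2
state_mat[2, 2]                     # 7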
Rewards & Goal
Once our states and actions are defined, the next thing we need is a reward for each state. Rewards can be negative, positive, or any other set of numeric values, depending on the problem you are solving. Here, we have assigned rewards to the various states and will arrange them into a matrix.
## rewards[s] is the reward for state s (the matrix is filled column-wise,
## exactly like the state matrix)
rewards <- c(   0, -100, -100,    0, -100,
               10,   10,   10,   10, -100,
               10,   10, -100,   10, -100,
               10,   10,   10,   10, -100,
             -100, -100,   10,   10,  100)
rewards_mat <- matrix(rewards, nrow = length(states), ncol = length(states))
plot_matrix(rewards_mat)
If we compare the rewards matrix with the state matrix, state numbers 2, 3, 5, 10, 13, 15, 20, 21, and 22 have a reward of -100 points, which means we do not want our Agent to go there; those are the dangerous areas. State 25 has a reward of 100 points, so this will be our Goal state.
goal <- which(rewards==max(rewards), arr.ind=TRUE)
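As a quick sanity check (assuming the rewards vector and goal defined above), you can list the dangerous states and the Goal directly from the rewards vector:
which(rewards == -100)   # dangerous states: 2 3 5 10 13 15 20 21 22
goal                     # Goal state: 25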
Q-Matrix
Further, we need to create a Q matrix with the same dimensions as the states matrix and initialize it with 0 values.
## Initialize Q-Matrix
Q <- matrix(0, nrow = length(states), ncol= length(states))
plot_matrix(Q,1)
The Next States
Suppose the Agent is at State 3; then, based on the actions (left, right, up, down), the next possible states are 2, 4, and 8. So here we have created a function that lets the Agent know which next states are possible. This will also be required when we calculate Q values using Bellman's equation, which I will explain later. 🙂
getNextStates <- function(cs) {
  stalen <- length(states)
  NS <- stalen * stalen
  aa <- state_seq[state_seq$state == cs, ]   # x gives the state's row in the grid
  if (aa$x == max(states)) {
    ## last row: cs + 1 would wrap into the next column, so skip it
    ns <- c(cs - 1, cs - stalen, cs + stalen)
  } else if (aa$x == min(states)) {
    ## first row: cs - 1 would wrap into the previous column, so skip it
    ns <- c(cs + 1, cs - stalen, cs + stalen)
  } else {
    ns <- c(cs + 1, cs - 1, cs - stalen, cs + stalen)
  }
  ## keep only states within 1..25 (handles the first and last columns)
  nss <- sort(ns[ns > 0 & ns <= NS])
  return(nss)
}
The above function returns all the possible states the Agent can move to from its current state. Refer to the example calls below:
> getNextStates(3)
[1] 2 4 8
> getNextStates(7)
[1] 2 6 8 12
Episodes
An episode is one iteration that takes the Agent from its initial state to the terminal state (the Goal). So the Agent starts from a random state, learns while moving through various states, collects rewards, and updates the Q value at each step. The pseudocode for Q-learning is as below:
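These steps are written to mirror the R implementation that follows; alpha and gamma are the learning rate and discount factor set in the code.
1. Initialize the Q matrix with zeros.
2. For each episode, pick a random starting state cs.
3. Read the reward for cs, get the possible next states with getNextStates(cs), and pick one of them (ns) at random.
4. Update the Q value of the current state with the Bellman-style update: Q[cs] <- Q[cs] + alpha * (reward + gamma * max(Q[getNextStates(ns)]) - Q[cs]).
5. Move to ns and repeat steps 3 to 5 until the Goal state (or a step limit of 20) is reached, then start the next episode.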
Following the above steps, the R code below runs 10 episodes, which can be configured by changing the value of N.
########### Episodes Execution ################
N <- 10      # No. of episodes
alpha <- 0.8 # Learning Rate
gamma <- 0.7 # Discount Factor

for (i in 1:N) {
  current_episode <- i
  cat("\nStart Episode: ", current_episode)

  ## pick a random starting state for this episode
  cs <- sample(state_seq$state, 1)
  cat("\n\tInitial state: ", cs)

  step_num <- 1
  while (TRUE) {
    cat("\n\n\tStep no.: ", step_num)
    cat("\n\t\tCurrent State: ", cs)

    ## reward for the current state (fall back to 0 if it is missing)
    reward <- rewards[cs]
    if (length(reward) == 0 || is.na(reward)) {
      reward <- 0
    }
    cat("\n\t\tReward CS: ", reward)

    ## possible next states from the current state
    next.states <- getNextStates(cs)
    cat("\n\t\tPossible next states: ", next.states)

    ## if there is only one possible next state take it, else pick one at random
    if (length(next.states) == 1) {
      ns <- next.states
    } else {
      ns <- sample(next.states, 1)
    }
    cat("\n\t\tNext state: ", ns)

    ## Bellman update of the Q value for the current state
    Q[cs] <- round(Q[cs] + alpha * (reward +
                   gamma * max(Q[getNextStates(ns)]) - Q[cs]), 1)
    cat("\n\t\tNew Q-Value: ", Q[cs])
    plot_matrix(Q, 1)
    Sys.sleep(0.2)

    ## stop when the Goal is reached or after 20 steps
    if (cs == goal || step_num > 20) {
      break
    }
    cs <- ns
    step_num <- step_num + 1
  }
  cat("\nEnd Episode: ", current_episode)
}
This will execute 10 episodes, updating the Q matrix along the way. To understand how the code works, let's walk through a few steps of Episode 1:
## Start Episode: 1
## Initial state: 5
## Step no.: 1
## Current state: 5
## Reward CS: -100
## Possible next states: 4 10
## Next state: 10
## New Q-Value: -80
## Step no.: 2
## Current state: 10
## Reward CS: -100
## Possible next states: 5 9 15
## Next state: 15
## New Q-Value: -80
In the first 2 steps of Episode 1, the Agent has updated the Q values for 2 states, i.e. 5 and 10. Both updated Q values are negative, which reflects what we saw in the rewards matrix: these states carry negative rewards. The Agent keeps moving like this, reading the reward and updating the Q matrix. Let's fast forward to the terminating step of Episode 1, which is when the current state is State 25.
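To see where the -80 comes from, plug the numbers from step 1 into the update rule. At that point every entry of Q is still 0, so the max term vanishes:
## Step 1 of Episode 1: cs = 5, ns = 10, reward = -100, all Q values still 0
## Q[5] <- Q[5] + alpha * (reward + gamma * max(Q[getNextStates(10)]) - Q[5])
##       = 0    + 0.8   * (-100   + 0.7   * 0                         - 0)    = -80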
## Step no.: 7
## Current State: 25
## Reward CS: 100
## Possible next states: 20 24
## Next state: 24
## New Q-Value: 84.5
## End Episode: 1
In this step, the Agent has updated the Q value for State 25 to the new value 84.5.
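Reversing the same update rule shows where the 84.5 comes from; Q[25] was still 0 here because the Goal is only reached at the end of the episode:
## Step 7 of Episode 1: cs = 25, ns = 24, reward = 100, Q[25] still 0
## Q[25] <- 0 + 0.8 * (100 + 0.7 * max(Q[getNextStates(24)]) - 0) = 84.5
## so max(Q[getNextStates(24)]) must have been roughly 8 by this point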
The Agent will execute all 10 episodes, each one terminating when it reaches State 25. An episode may also terminate when the number of steps reaches 20, because we do not want the Agent to wander back and forth between states indefinitely.
Final Q table
Once the Q matrix is updated, the Agent can choose the highest-valued next state at every step and reach the Goal along the correct, shortest path, as sketched below. We can also observe that our Goal state 25 has the maximum Q value.
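As a rough sketch of that greedy walk (greedy_path() is an illustrative helper, not part of the original code; it only assumes the Q matrix, goal, and getNextStates() defined above):
## Follow the learned Q values greedily from a starting state to the Goal
greedy_path <- function(start, max_steps = 25) {
  path <- start
  cs <- start
  while (cs != goal && length(path) <= max_steps) {
    nxt <- getNextStates(cs)
    cs <- nxt[which.max(Q[nxt])]   # move to the neighbour with the highest Q value
    path <- c(path, cs)
  }
  path
}
greedy_path(1)   # e.g. the states visited from state 1 to the Goal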
We have reached the end of this tutorial. The code used in this tutorial is available in my Git repo.
I hope you have enjoyed the articles and have some takeaways from them. I would be happy to discuss any doubts related to the above code and R programming. Please feel free to reach out to me via LinkedIn or Email.