Project MineRL: Sample-efficient reinforcement learning using human priors
Introduction
MineRL is a challenge to develop a system that can obtain a diamond in Minecraft using a limited amount of training time. Since the full task is very hard, the organizers also created smaller problems, like chopping trees, navigating to a point, and obtaining an iron pickaxe. In this post I am going to share my experience solving the navigate-to-a-point problem.
For those who don’t know what Minecraft is, let me tell you about it briefly. Minecraft is a sandbox game set in a 3D world with a block structure; every object in the game is made from a combination of cube-shaped blocks. I suggest you take a quick look at the game trailer on YouTube here to get an idea of what the game is about.
Source: https://www.minecraft.net/en-us/
There are two modes, Survival and Creative, in which a player can play the game. For our challenge we will be using Survival mode, which means that the agent has limited health and taking damage can result in death. We will be using the ‘minerl’ Python package developed by the organizers to take care of the environment dynamics, so our focus is on teaching the agent to complete the tasks using reinforcement learning.
From now on I will be discussing the navigate-to-a-point task. Let’s take a look at the inputs from the game and the actions available to the agent (documented at minerl.io).
Observation Space:
Dict({
"compassAngle": "Box()",
"inventory": {
"dirt": "Box()"
},
"pov": "Box(64, 64, 3)"
})
The compassAngle observation points toward a location near the goal. To complete the task, the agent must find the unique block (a blue diamond block) at the goal location. The inventory variable stores the number of dirt blocks collected, which can be placed to climb to locations that cannot be reached by jumping. The pov variable holds a frame (RGB image) from the game of size 64x64x3.
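To make this concrete, here is a minimal sketch of how the environment is created and these observation fields are read with the ‘minerl’ package (I use the NavigateDense variant discussed later; the field names follow the space printed above):

import gym
import minerl  # registers the MineRL environments with gym

# Create the dense-reward navigation environment (launches Minecraft under the hood).
env = gym.make('MineRLNavigateDense-v0')
obs = env.reset()

print(obs['compassAngle'])        # angle pointing near the goal location
print(obs['inventory']['dirt'])   # number of dirt blocks collected so far
print(obs['pov'].shape)           # (64, 64, 3) RGB frame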
Action Space:
Dict({
"attack": "Discrete(2)",
"back": "Discrete(2)",
"camera": "Box(2,)",
"forward": "Discrete(2)",
"jump": "Discrete(2)",
"left": "Discrete(2)",
"place": "Enum(none,dirt)",
"right": "Discrete(2)",
"sneak": "Discrete(2)",
"sprint": "Discrete(2)"
})
The action space consists of movement actions (forward, back, left, right, jump, sneak, sprint) and other actions like attack and place. The camera variable controls the agent’s view in the vertical and horizontal directions. Discrete(2) means that the variable can take only two values, either 0 or 1. Box(2,) means that the variable contains two real values, e.g. [0.12, -0.36].
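As a quick example, a single action can be built from the no-op template and sent to the environment created above. This is only a sketch, assuming the action_space.noop() helper provided by the ‘minerl’ package:

# Start from a no-op action and enable only what we need.
action = env.action_space.noop()
action['forward'] = 1            # hold the forward key
action['jump'] = 1               # jump over one-block obstacles
action['camera'] = [0.0, 5.0]    # view rotation deltas in degrees (vertical, horizontal)
action['place'] = 'none'         # enum actions take the value name, e.g. 'none' or 'dirt'

obs, reward, done, info = env.step(action)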
Let’s take a look at how the reward function is defined for this problem. In ‘minerl’ there are two variants to train the model on, NavigateDense and Navigate. In the NavigateDense variant, the agent receives a reward every tick based on how much closer or farther it is from the objective; in the Navigate variant, the agent only receives a reward when it reaches the objective. I used the NavigateDense variant to train the model, as the denser feedback helps the agent learn faster.
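Putting the pieces together, a random policy can be rolled out to watch the dense reward accumulate tick by tick. This is a minimal sketch; swapping in 'MineRLNavigate-v0' gives the sparse variant:

env = gym.make('MineRLNavigateDense-v0')
obs = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()      # random action each tick
    obs, reward, done, info = env.step(action)
    total_reward += reward                  # dense variant: small reward every tick

print('episode return:', total_reward)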
Using human priors for pre-training the model
Disclaimer: The output videos are 64x64 by default, hence they look blurry.
Human priors in this case are recordings of various players performing the same task. Along with the video, the actions were also recorded. One example is shown below.
(Source: minerl.viewer from minerl package)
The red marks on the arrows and other buttons indicate the actions taken at that moment, the camera control shows the movement of the agent’s view, and above the camera control we can see the reward as a function of time: both the reward obtained at each tick and the cumulative reward. There are 194 such trajectories for the NavigateDense variant and 191 trajectories for the Navigate variant.
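The demonstrations are exposed through the data API of the ‘minerl’ package. The sketch below assumes the sarsd_iter interface from the 2019 release, which yields (state, action, reward, next state, done) sequences; newer releases provide a similar batch iterator, so check the documentation for the version you have installed:

import minerl

# Assumes the demonstration dataset has been downloaded and MINERL_DATA_ROOT points to it.
data = minerl.data.make('MineRLNavigateDense-v0')

# Iterate over short sequences of human transitions.
for state, action, reward, next_state, done in data.sarsd_iter(num_epochs=1, max_sequence_len=32):
    print(state['pov'].shape, action['camera'].shape, reward.shape)
    break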
My approach to solve the Navigation task
Since the competition is still in progress, I will only show video clips of my agent trained so far and give a brief overview of how I achieved it.
After doing some research on available reinforcement learning algorithms, I found the DDPG algorithm to be the most suitable for this task. DDPG was developed for problems with continuous action spaces. I used a model architecture similar to the one in the DDPG paper, with some modifications to account for our action space. I used imitation learning to pre-train my models on the human trajectories, running the pre-training on Google Colab. I am still training my agent to get better results and will post the final version of the agent later.
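To give an idea of what such a setup can look like, here is a hypothetical PyTorch sketch of a DDPG-style actor adapted to this observation and action space, together with a behavioral-cloning step for pre-training on the human trajectories. The network sizes, the split into a continuous camera head and binary key heads, and the loss weights are illustrative assumptions, not my exact architecture:

import torch
import torch.nn as nn

BINARY_KEYS = ['attack', 'back', 'forward', 'jump', 'left', 'right', 'sneak', 'sprint']

class Actor(nn.Module):
    """Maps (pov image, compass angle) to camera deltas and binary key presses."""
    def __init__(self):
        super().__init__()
        # Small CNN for the 64x64x3 point-of-view frame.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 4 * 4 + 1, 256), nn.ReLU())
        self.camera_head = nn.Linear(256, 2)                  # continuous camera deltas
        self.binary_head = nn.Linear(256, len(BINARY_KEYS))   # logits for the 0/1 actions

    def forward(self, pov, compass):
        # pov: (batch, 3, 64, 64) float tensor, compass: (batch, 1) float tensor
        x = self.conv(pov / 255.0)
        x = self.fc(torch.cat([x, compass], dim=1))
        return self.camera_head(x), self.binary_head(x)

actor = Actor()
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)

def bc_step(pov, compass, camera_target, binary_target):
    """One behavioral-cloning update on a batch of human transitions."""
    camera_pred, binary_logits = actor(pov, compass)
    loss = nn.functional.mse_loss(camera_pred, camera_target) \
         + nn.functional.binary_cross_entropy_with_logits(binary_logits, binary_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()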
The following video shows the agent taking random actions, without pre-training from human priors.
We can see that it is very difficult for the agent to reach the objective this way (it might never reach the goal!). To make the task easier for the agent, we try imitation learning (showing the agent how the task is done); essentially, we try to copy the human actions as they are. The following video shows the agent’s behavior after pre-training on the human priors for a few iterations.
Awesome!! That’s a great improvement, don’t you think? The agent still doesn’t know it is supposed to touch the special block (the blue diamond block) to complete its task, and it starts spinning in circles when it gets close to the target. This part is better learned from the agent’s own experience, since it now knows it is supposed to follow the compass. After training on its own experience, I believe the agent will be able to complete the navigation task. I will update this post once I get it running.
Stay tuned for more videos and posts!!!