Personal Blog

invopt-ex1

Inventory Optimization: MDP vs RL

Inventory Optimization is a task of maximizing revenue by taking into account the capital investment, warehouse capacity, supply and demand of stock, leadtime and backordering of stocks. This problem has been well researched and is usually presented in form of a Markov Decision Process (MDP). The (s, S) policy is proved to be a optimal solution for such problems.[s: Reorder stock level, S: Target stock level]. Markov Decision Process (MDP) provide a framework to model decision making process where outcomes are partly random and partly under the control of decision maker. The learner or decision maker is called an agent. The agent interacts with the environment which comprises of everything except the agent.
ql-dqnalgo

Deep Q-Learning

Deep Q-learning We introduce deep neural networks to do the Q-Learning, hence the name Deep Q-Learning. Instead of calculating Q-values for each state-action pair, we calculate Q-values for all actions given the state and then select the action with maximum q-value. This concept was first introduced in Playing Atari with Deep Reinforcement Learning paper. The authors show that they were able to surpass human experts on three out of seven Atari games tested using deep neural networks to solve these reinforcement problems.
img

Q-Learning

Q-Learning Q-Learning is a value based reinforcement algorithm. The idea is that we create a Q-Table which has all the states represented as rows of Q-table and actions as columns. Then for each state we would select an action which has maximum value (q-value). This means that we do not change/implement a policy that our agent will follow, instead we improve our Q-Table to always choose the best possible action. Lets take an example of Frozen Lake game. The environment is as shown in following image.
mdp

Reinforcement Learning

Introduction Reinforcement learning is a type of machine learning problem which is solved using the past experiences and interactions of the agent with the world. In other words, Reinforcement learning is mappings of situations with the available actions to maximize a numerical reward signal. Usually rewards are positive if the action taken is desirable and negative if the action is undesirable. The agent is never told which action to take or which action is optimal for a given situation, but instead it must discover on its own which actions to take that would yield maximum reward. In many cases the action taken for a given situation may affect not only immediate reward but also the next situation and may have consequences on future rewards. These characteristics of trial and error search and delayed rewards are distinguishing features of reinforcement learning.