# Learning algorithms¶

- `deer.base_classes.LearningAlgo(environment, …)`: interface that all the Q-networks, actor-critic networks, etc. should inherit
- `deer.learning_algos.q_net_keras.MyQNetwork(…)`: deep Q-learning network using Keras (with any backend)
- `deer.learning_algos.AC_net_keras.MyACNetwork(…)`: actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain
- `deer.learning_algos.CRAR_keras.CRAR(environment)`: Combined Reinforcement learning via Abstract Representations (CRAR) using Keras
class deer.base_classes.LearningAlgo(environment, batch_size)

All the Q-networks, actor-critic networks, etc. should inherit this interface.

environment : object from class Environment
The environment linked to the Q-network
batch_size : int
Number of tuples taken into account for each iteration of gradient descent
chooseBestAction(state)

Get the best action for a pseudo-state

discountFactor()

Get the discount factor

learningRate()

Get the learning rate

qValues(state)

Get the q value for one pseudo-state

setDiscountFactor(df)

Set the discount factor

df : float
The discount factor that has to be set
setLearningRate(lr)

Set the learning rate. NB: the learning rate usually has to be set in the optimizer, hence this function should be overridden. Otherwise, the learning rate change is likely not to be taken into account.

lr : float
The learning rate that has to be set
train(states, actions, rewards, nextStates, terminals)

This method performs the training step (e.g. using Bellman iteration in a deep Q-network) for one batch of tuples.
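
As an illustration, the Bellman iteration mentioned above regresses each Q-value towards the target r + γ·max_a' Q(s', a'), with the bootstrap term dropped on terminal transitions. The following is a minimal stand-alone NumPy sketch of that target computation, not the DeeR implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def q_learning_targets(q_next, rewards, terminals, discount):
    """One Bellman iteration: targets r + gamma * max_a' Q(s', a'),
    with the bootstrap term dropped on terminal transitions.

    q_next    : (batch_size, n_actions) Q-values of the next states
    rewards   : (batch_size,) rewards
    terminals : (batch_size,) booleans
    discount  : scalar discount factor
    """
    max_q_next = np.max(q_next, axis=1)
    return rewards + discount * max_q_next * (1.0 - terminals.astype(np.float32))

# Toy batch of two transitions; the second one is terminal
targets = q_learning_targets(np.array([[1.0, 3.0], [2.0, 0.5]]),
                             np.array([0.0, 1.0]),
                             np.array([False, True]),
                             discount=0.9)
print(targets)  # [2.7 1. ]
```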

class deer.learning_algos.q_net_keras.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_keras.NN'>)

Deep Q-learning network using Keras (with any backend)

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop

random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
default is deer.learning_algos.NN_keras
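
For illustration, the role of freeze_interval can be sketched with plain parameter lists (the same list-of-numpy-arrays format that getAllParams()/setAllParams() use). TargetSync below is a hypothetical helper, not part of DeeR: the target copy stays frozen for freeze_interval training steps, then is overwritten with the online parameters.

```python
import numpy as np

class TargetSync:
    """Hypothetical helper illustrating the freeze_interval mechanism:
    the target parameters stay frozen for `freeze_interval` training
    steps, then are copied from the online network."""
    def __init__(self, params, freeze_interval=1000):
        self.online = params                      # list of numpy arrays
        self.target = [p.copy() for p in params]
        self.freeze_interval = freeze_interval
        self.steps = 0

    def step(self):
        self.steps += 1
        if self.steps % self.freeze_interval == 0:
            self.target = [p.copy() for p in self.online]

sync = TargetSync([np.zeros(2)], freeze_interval=3)
sync.online[0] += 1.0            # the online network keeps learning
sync.step(); sync.step()
frozen = sync.target[0].copy()   # still all zeros: target is frozen
sync.step()                      # step 3: target is refreshed
```
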
chooseBestAction(state, *args, **kwargs)

state : one pseudo-state

The best action : int

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one belief state

state_val : one belief state

The q values for the provided belief state

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the Q-network from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training (RMSE)
Individual (square) losses for each tuple
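
When double_Q is activated, the training target changes as described in van Hasselt et al. (2015): the online network selects the greedy next action and the frozen target network evaluates it. A minimal NumPy sketch of that target (the function name and shapes are assumptions, not the DeeR API):

```python
import numpy as np

def double_q_targets(q_next_online, q_next_target, rewards, terminals, discount):
    """Double Q-learning target: argmax from the online network,
    value estimate from the target network."""
    best = np.argmax(q_next_online, axis=1)             # action selection
    q_eval = q_next_target[np.arange(len(best)), best]  # action evaluation
    return rewards + discount * q_eval * (1.0 - terminals.astype(np.float32))

# One non-terminal transition: online prefers action 1, target values it at 0.5
targets = double_q_targets(np.array([[1.0, 2.0]]), np.array([[5.0, 0.5]]),
                           np.array([1.0]), np.array([False]), discount=0.9)
print(targets)  # [1.45]
```

Decoupling selection from evaluation in this way reduces the overestimation bias of the plain max operator.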

class deer.learning_algos.AC_net_keras.MyACNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network_critic=<class 'deer.learning_algos.NN_keras.NN'>, neural_network_actor=<class 'deer.learning_algos.NN_keras.NN'>)

Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network_critic : object, optional
default is deer.learning_algos.NN_keras
neural_network_actor : object, optional
default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)

state : one pseudo-state

best_action : float
estim_value : float

clip_action(action)

Clip the action if it is outside the action space defined by self._nActions. self._nActions is given as [[low_action1, high_action1], [low_action2, high_action2], …]
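
With bounds in that [[low, high], …] format, the clipping can be done per dimension with np.clip. A stand-alone sketch of the idea, not the DeeR implementation:

```python
import numpy as np

def clip_action(action, bounds):
    """Clip a continuous action to the box defined by
    bounds = [[low_1, high_1], [low_2, high_2], ...]."""
    bounds = np.asarray(bounds, dtype=float)
    # np.clip broadcasts the per-dimension lows and highs over the action
    return np.clip(action, bounds[:, 0], bounds[:, 1])

print(clip_action([2.5, -0.7], [[-1.0, 1.0], [0.0, 1.0]]))  # [1. 0.]
```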

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

gradients(states, actions)

Returns the gradients on the Q-network for the different actions (used for policy update)

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the actor-critic algorithm from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training
Individual losses for each tuple
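
The role of the gradients() method in the DDPG update can be illustrated with a toy critic: the actor's action output is moved along dQ/da, the gradient of the critic's value with respect to the action. Everything below (the quadratic critic, the finite-difference gradient) is a stand-in for illustration only, not the DeeR implementation:

```python
def critic_q(action):
    # Toy critic: value peaks at action = 0.3
    return -(action - 0.3) ** 2

def dq_da(action, eps=1e-4):
    # Finite-difference stand-in for the gradients() method
    return (critic_q(action + eps) - critic_q(action - eps)) / (2 * eps)

# DDPG-style actor step: gradient ascent on Q through the action
action, lr = 0.0, 0.1
for _ in range(100):
    action += lr * dq_da(action)
print(round(action, 3))  # converges to the critic's optimum, 0.3
```

In the real algorithm the same chain rule is applied through the actor network's parameters rather than to a scalar action directly.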

class deer.learning_algos.CRAR_keras.CRAR(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_CRAR_keras.NN'>, **kwargs)

Combined Reinforcement learning via Abstract Representations (CRAR) using Keras

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
Default is deer.learning_algos.NN_keras
chooseBestAction(state, mode, *args, **kwargs)

state : list of numpy arrays
One pseudo-state. The number of arrays and their dimensions matches self.environment.inputDimensions().
mode : int
Identifier of the mode (-1 is reserved for the training mode).

The best action : int

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one pseudo-state (without planning)

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D, 1D or scalar)].

The q values for the provided pseudo-state

qValues_planning(state_val, R, gamma, T, Q, d=5)

Get the average Q-values up to planning depth d for one pseudo-state.

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D, 1D or scalar)].
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The average q values with planning depth up to d for the provided pseudo-state

qValues_planning_abstr(state_abstr_val, R, gamma, T, Q, d, branching_factor=None)

Get the q values for pseudo-state(s) with a planning depth d. This function is called recursively by decreasing the depth d at every step.

state_abstr_val : internal state(s)
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The Q-values with planning depth d for the provided encoded state(s)
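
The recursion implemented here can be written as Q_d(x, a) = R(x, a) + γ(x, a) · max_a' Q_{d-1}(T(x, a), a'), with Q_0 given directly by the model-free Q estimate. In the sketch below, R, G, T and Q are hypothetical callables standing in for the learned reward, discount, transition and Q models; the signature is an assumption, not the DeeR one:

```python
import numpy as np

def q_planning(x, R, G, T, Q, d, n_actions):
    """Depth-d planning in the abstract state space (sketch):
    Q_d(x, a) = R(x, a) + G(x, a) * max_a' Q_{d-1}(T(x, a), a'),
    with Q_0 given directly by the model-free Q estimate."""
    if d == 0:
        return np.array([Q(x, a) for a in range(n_actions)])
    return np.array([
        R(x, a) + G(x, a) * np.max(q_planning(T(x, a), R, G, T, Q, d - 1, n_actions))
        for a in range(n_actions)
    ])

# Toy 1-D abstract space: action 1 moves right, reward on reaching x >= 2
R = lambda x, a: 1.0 if x + (2 * a - 1) >= 2 else 0.0   # reward model
G = lambda x, a: 0.9                                     # discount model
T = lambda x, a: x + (2 * a - 1)                         # transition model
Q0 = lambda x, a: 0.0                                    # model-free Q estimate
print(q_planning(0, R, G, T, Q0, d=2, n_actions=2).tolist())  # [0.0, 0.9]
```

With depth 0 the reward two steps away is invisible; at depth 2 the learned models propagate it back, which is the point of combining planning with the model-free estimate.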

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
setLearningRate(lr)

Set the learning rate

lr : float
The learning rate that has to be set
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train CRAR from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training for the Q-values (RMSE)
Individual (square) losses for the Q-values for each tuple