# Learning algorithms¶

- `deer.base_classes.LearningAlgo(environment, …)`: interface that all the Q-networks, actor-critic networks, etc. should inherit
- `deer.learning_algos.q_net_keras.MyQNetwork(…)`: deep Q-learning network using Keras (with any backend)
- `deer.learning_algos.AC_net_keras.MyACNetwork(…)`: actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain
- `deer.learning_algos.CRAR_keras.CRAR(environment)`: Combined Reinforcement learning via Abstract Representations (CRAR) using Keras
class deer.base_classes.LearningAlgo(environment, batch_size)

All the Q-networks, actor-critic networks, etc. should inherit this interface.

environment : object from class Environment
The environment linked to the Q-network
batch_size : int
Number of tuples taken into account for each iteration of gradient descent
chooseBestAction(state)

Get the best action for a pseudo-state

discountFactor()

Get the discount factor

learningRate()

Get the learning rate

qValues(state)

Get the q value for one pseudo-state

setDiscountFactor(df)

Set the discount factor

df : float
The discount factor that has to be set
setLearningRate(lr)

Set the learning rate. NB: the learning rate usually has to be set in the optimizer, hence this function should be overridden. Otherwise, the learning rate change is likely not to be taken into account.

lr : float
The learning rate that has to be set
train(states, actions, rewards, nextStates, terminals)

This method performs the training step (e.g. using Bellman iteration in a deep Q-network) for one batch of tuples.
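
As an illustration, the Bellman iteration mentioned above regresses each Q-value towards the target r + γ·max_a' Q(s', a'), with the bootstrap term dropped on terminal transitions. The following is a minimal stand-alone NumPy sketch of that target computation, not the DeeR implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def q_learning_targets(q_next, rewards, terminals, discount):
    """One Bellman iteration: targets r + gamma * max_a' Q(s', a'),
    with the bootstrap term dropped on terminal transitions.

    q_next    : (batch_size, n_actions) Q-values of the next states
    rewards   : (batch_size,) rewards
    terminals : (batch_size,) booleans
    discount  : scalar discount factor
    """
    max_q_next = np.max(q_next, axis=1)
    return rewards + discount * max_q_next * (1.0 - terminals.astype(np.float32))

# Toy batch of two transitions; the second one is terminal
targets = q_learning_targets(np.array([[1.0, 3.0], [2.0, 0.5]]),
                             np.array([0.0, 1.0]),
                             np.array([False, True]),
                             discount=0.9)
print(targets)  # [2.7 1. ]
```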

class deer.learning_algos.q_net_keras.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_keras.NN'>)

Deep Q-learning network using Keras (with any backend)

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop

random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
default is deer.learning_algos.NN_keras
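
For illustration, the role of freeze_interval can be sketched with plain parameter lists (the same list-of-numpy-arrays format that getAllParams()/setAllParams() use). TargetSync below is a hypothetical helper, not part of DeeR: the target copy stays frozen for freeze_interval training steps, then is overwritten with the online parameters.

```python
import numpy as np

class TargetSync:
    """Hypothetical helper illustrating the freeze_interval mechanism:
    the target parameters stay frozen for `freeze_interval` training
    steps, then are copied from the online network."""
    def __init__(self, params, freeze_interval=1000):
        self.online = params                      # list of numpy arrays
        self.target = [p.copy() for p in params]
        self.freeze_interval = freeze_interval
        self.steps = 0

    def step(self):
        self.steps += 1
        if self.steps % self.freeze_interval == 0:
            self.target = [p.copy() for p in self.online]

sync = TargetSync([np.zeros(2)], freeze_interval=3)
sync.online[0] += 1.0            # the online network keeps learning
sync.step(); sync.step()
frozen = sync.target[0].copy()   # still all zeros: target is frozen
sync.step()                      # step 3: target is refreshed
```
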
chooseBestAction(state, *args, **kwargs)

state : one pseudo-state

The best action : int

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one belief state

state_val : one belief state

The q values for the provided belief state

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the Q-network from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training (RMSE)
Individual (square) losses for each tuple
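
When double_Q is activated, the training target changes as described in van Hasselt et al. (2015): the online network selects the greedy next action and the frozen target network evaluates it. A minimal NumPy sketch of that target (the function name and shapes are assumptions, not the DeeR API):

```python
import numpy as np

def double_q_targets(q_next_online, q_next_target, rewards, terminals, discount):
    """Double Q-learning target: argmax from the online network,
    value estimate from the target network."""
    best = np.argmax(q_next_online, axis=1)             # action selection
    q_eval = q_next_target[np.arange(len(best)), best]  # action evaluation
    return rewards + discount * q_eval * (1.0 - terminals.astype(np.float32))

# One non-terminal transition: online prefers action 1, target values it at 0.5
targets = double_q_targets(np.array([[1.0, 2.0]]), np.array([[5.0, 0.5]]),
                           np.array([1.0]), np.array([False]), discount=0.9)
print(targets)  # [1.45]
```

Decoupling selection from evaluation in this way reduces the overestimation bias of the plain max operator.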

class deer.learning_algos.AC_net_keras.MyACNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network_critic=<class 'deer.learning_algos.NN_keras.NN'>, neural_network_actor=<class 'deer.learning_algos.NN_keras.NN'>)

Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network_critic : object, optional
default is deer.learning_algos.NN_keras
neural_network_actor : object, optional
default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)

state : one pseudo-state

best_action : float
estim_value : float

clip_action(action)

Clip the action if it is outside the action space defined by self._nActions. self._nActions is given as [[low_action1, high_action1], [low_action2, high_action2], …]
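
With bounds in that [[low, high], …] format, the clipping can be done per dimension with np.clip. A stand-alone sketch of the idea, not the DeeR implementation:

```python
import numpy as np

def clip_action(action, bounds):
    """Clip a continuous action to the box defined by
    bounds = [[low_1, high_1], [low_2, high_2], ...]."""
    bounds = np.asarray(bounds, dtype=float)
    # np.clip broadcasts the per-dimension lows and highs over the action
    return np.clip(action, bounds[:, 0], bounds[:, 1])

print(clip_action([2.5, -0.7], [[-1.0, 1.0], [0.0, 1.0]]))  # [1. 0.]
```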

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

gradients(states, actions)

Returns the gradients on the Q-network for the different actions (used for policy update)

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the actor-critic algorithm from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training
Individual losses for each tuple
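
The role of the gradients() method in the DDPG update can be illustrated with a toy critic: the actor's action output is moved along dQ/da, the gradient of the critic's value with respect to the action. Everything below (the quadratic critic, the finite-difference gradient) is a stand-in for illustration only, not the DeeR implementation:

```python
def critic_q(action):
    # Toy critic: value peaks at action = 0.3
    return -(action - 0.3) ** 2

def dq_da(action, eps=1e-4):
    # Finite-difference stand-in for the gradients() method
    return (critic_q(action + eps) - critic_q(action - eps)) / (2 * eps)

# DDPG-style actor step: gradient ascent on Q through the action
action, lr = 0.0, 0.1
for _ in range(100):
    action += lr * dq_da(action)
print(round(action, 3))  # converges to the critic's optimum, 0.3
```

In the real algorithm the same chain rule is applied through the actor network's parameters rather than to a scalar action directly.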

class deer.learning_algos.CRAR_keras.CRAR(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_CRAR_keras.NN'>, **kwargs)

Combined Reinforcement learning via Abstract Representations (CRAR) using Keras

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is kept frozen and after which it is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule : str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Activate or not double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
Default is deer.learning_algos.NN_keras
chooseBestAction(state, mode, *args, **kwargs)

state : list of numpy arrays
One pseudo-state. The number of arrays and their dimensions matches self.environment.inputDimensions().
mode : int
Identifier of the mode (-1 is reserved for the training mode).

The best action : int

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one pseudo-state (without planning)

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D, 1D or scalar)].

The q values for the provided pseudo-state

qValues_planning(state_val, R, gamma, T, Q, d=5)

Get the average Q-values up to planning depth d for one pseudo-state.

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D, 1D or scalar)].
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The average q values with planning depth up to d for the provided pseudo-state

qValues_planning_abstr(state_abstr_val, R, gamma, T, Q, d, branching_factor=None)

Get the q values for pseudo-state(s) with a planning depth d. This function is called recursively by decreasing the depth d at every step.

state_abstr_val : internal state(s)
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The Q-values with planning depth d for the provided encoded state(s)
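
The recursion implemented here can be written as Q_d(x, a) = R(x, a) + γ(x, a) · max_a' Q_{d-1}(T(x, a), a'), with Q_0 given directly by the model-free Q estimate. In the sketch below, R, G, T and Q are hypothetical callables standing in for the learned reward, discount, transition and Q models; the signature is an assumption, not the DeeR one:

```python
import numpy as np

def q_planning(x, R, G, T, Q, d, n_actions):
    """Depth-d planning in the abstract state space (sketch):
    Q_d(x, a) = R(x, a) + G(x, a) * max_a' Q_{d-1}(T(x, a), a'),
    with Q_0 given directly by the model-free Q estimate."""
    if d == 0:
        return np.array([Q(x, a) for a in range(n_actions)])
    return np.array([
        R(x, a) + G(x, a) * np.max(q_planning(T(x, a), R, G, T, Q, d - 1, n_actions))
        for a in range(n_actions)
    ])

# Toy 1-D abstract space: action 1 moves right, reward on reaching x >= 2
R = lambda x, a: 1.0 if x + (2 * a - 1) >= 2 else 0.0   # reward model
G = lambda x, a: 0.9                                     # discount model
T = lambda x, a: x + (2 * a - 1)                         # transition model
Q0 = lambda x, a: 0.0                                    # model-free Q estimate
print(q_planning(0, R, G, T, Q0, d=2, n_actions=2).tolist())  # [0.0, 0.9]
```

With depth 0 the reward two steps away is invisible; at depth 2 the learned models propagate it back, which is the point of combining planning with the model-free estimate.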

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
List of the parameters to be set (same order as given by getAllParams()).
setLearningRate(lr)

Set the learning rate

lr : float
The learning rate that has to be set
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train CRAR from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training for the Q-values (RMSE)
Individual (square) losses for the Q-values for each tuple