Learning algorithms
deer.base_classes.LearningAlgo(environment, …)
    All the Q-networks, actor-critic networks, etc.
deer.learning_algos.q_net_keras.MyQNetwork(…)
    Deep Q-learning network using Keras (with any backend)
deer.learning_algos.AC_net_keras.MyACNetwork(…)
    Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain
deer.learning_algos.CRAR_keras.CRAR(environment)
    Combined Reinforcement learning via Abstract Representations (CRAR) using Keras
class deer.base_classes.LearningAlgo(environment, batch_size)

All the Q-networks, actor-critic networks, etc. should inherit this interface.
- environment : object from class Environment
- The environment linked to the Q-network
- batch_size : int
- Number of tuples taken into account for each iteration of gradient descent
chooseBestAction(state)
Get the best action for a pseudo-state.
discountFactor()
Getting the discount factor.
learningRate()
Getting the learning rate.
qValues(state)
Get the q value for one pseudo-state.
setDiscountFactor(df)
Setting the discount factor.
- df : float
- The discount factor that has to be set
setLearningRate(lr)
Setting the learning rate. NB: the learning rate usually has to be set in the optimizer, hence this function should be overridden; otherwise, the learning rate change is likely not to be taken into account.
- lr : float
- The learning rate that has to be set
train(states, actions, rewards, nextStates, terminals)
This method performs the training step (e.g. using Bellman iteration in a deep Q-network) for one batch of tuples.
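As an illustration of how this interface is meant to be used, here is a minimal, hypothetical subclass sketch. The TabularQ class, its attributes and its reliance on environment.nActions() are assumptions of this example, not part of the library.

    import numpy as np
    from deer.base_classes import LearningAlgo

    class TabularQ(LearningAlgo):
        """Hypothetical tabular Q-learning algorithm built on the LearningAlgo interface."""

        def __init__(self, environment, batch_size=32, learning_rate=0.1, discount_factor=0.95):
            LearningAlgo.__init__(self, environment, batch_size)
            self._lr = learning_rate
            self._df = discount_factor
            self._n_actions = environment.nActions()  # discrete action space assumed
            self._q = {}                              # (state key, action) -> Q-value

        def _key(self, state):
            # Turn a pseudo-state (list of numpy arrays) into a hashable dictionary key
            return str([np.asarray(x).tolist() for x in state])

        def qValues(self, state):
            k = self._key(state)
            return np.array([self._q.get((k, a), 0.0) for a in range(self._n_actions)])

        def chooseBestAction(self, state, *args, **kwargs):
            return int(np.argmax(self.qValues(state)))

        def learningRate(self):
            return self._lr

        def setLearningRate(self, lr):
            # No optimizer involved here, so simply storing the value is enough
            self._lr = lr

        def discountFactor(self):
            return self._df

        def setDiscountFactor(self, df):
            self._df = df

        def train(self, states, actions, rewards, nextStates, terminals):
            # One Bellman update per tuple; a single observation channel is assumed,
            # so states[0] and nextStates[0] hold one entry per batch element.
            losses = []
            for s, a, r, s2, t in zip(states[0], actions, rewards, nextStates[0], terminals):
                target = r if t else r + self._df * np.max(self.qValues([s2]))
                key = (self._key([s]), int(a))
                delta = target - self._q.get(key, 0.0)
                self._q[key] = self._q.get(key, 0.0) + self._lr * delta
                losses.append(delta ** 2)
            # Same return convention as the Keras-based networks documented below
            return float(np.mean(losses)), np.array(losses)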
class deer.learning_algos.q_net_keras.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_keras.NN'>)

Deep Q-learning network using Keras (with any backend).
- environment : object from class Environment
- The environment in which the agent evolves.
- rho : float
- Parameter for rmsprop. Default : 0.9
- rms_epsilon : float
- Parameter for rmsprop. Default : 0.0001
- momentum : float
- Momentum for SGD. Default : 0
- clip_norm : float
- The gradient tensor will be clipped to a maximum L2 norm given by this value.
- freeze_interval : int
- Period during which the target network is frozen and after which the target network is updated. Default : 1000
- batch_size : int
- Number of tuples taken into account for each iteration of gradient descent. Default : 32
- update_rule: str
- {sgd,rmsprop}. Default : rmsprop
- random_state : numpy random number generator
- Set the random seed.
- double_Q : bool, optional
- Whether to use double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
- neural_network : object, optional
- Default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)
Get the best action for a pseudo-state.
state : one pseudo-state
Returns: The best action : int
getAllParams()
Get all parameters used by the learning algorithm.
Returns: Values of the parameters: list of numpy arrays
qValues(state_val)
Get the q values for one belief state.
state_val : one belief state
Returns: The q values for the provided belief state
setAllParams(list_of_values)
Set all parameters used by the learning algorithm.
- list_of_values : list of numpy arrays
- List of the parameters to be set (in the same order as given by getAllParams()).
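These two methods can be paired to snapshot and later restore the network parameters; a brief sketch, assuming qnetwork is an already constructed MyQNetwork instance:

    # Snapshot the current parameters (a list of numpy arrays)
    snapshot = qnetwork.getAllParams()

    # ... continue training, then roll back to the snapshot if needed
    qnetwork.setAllParams(snapshot)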
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)
Train the Q-network from one batch of data.
- states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- actions_val : numpy array of integers with size [self._batch_size]
- actions[i] is the action taken after having observed states[:][i].
- rewards_val : numpy array of floats with size [self._batch_size]
- rewards[i] is the reward obtained for taking actions[i-1].
- next_states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- terminals_val : numpy array of booleans with size [self._batch_size]
- terminals[i] is True if the transition leads to a terminal state and False otherwise
Returns: Average loss of the batch training (RMSE), and the individual (square) losses for each tuple
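A minimal construction sketch, assuming MyEnv is a user-defined subclass of deer.base_classes.Environment (it is not provided by the library) and that the chosen hyper-parameters are only examples:

    import numpy as np
    from deer.learning_algos.q_net_keras import MyQNetwork
    from my_env import MyEnv  # hypothetical user-defined Environment subclass

    rng = np.random.RandomState(123456)
    env = MyEnv()

    # Q-network with double Q-learning enabled and the default target update period
    qnetwork = MyQNetwork(
        environment=env,
        freeze_interval=1000,
        batch_size=32,
        update_rule='rmsprop',
        double_Q=True,
        random_state=rng)

    qnetwork.setLearningRate(0.0002)
    qnetwork.setDiscountFactor(0.99)

In the examples shipped with DeeR, such a network is typically passed to an agent (deer.agent.NeuralAgent), which manages the replay memory and calls train() on the sampled batches.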
class deer.learning_algos.AC_net_keras.MyACNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network_critic=<class 'deer.learning_algos.NN_keras.NN'>, neural_network_actor=<class 'deer.learning_algos.NN_keras.NN'>)

Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain.
- environment : object from class Environment
- The environment in which the agent evolves.
- rho : float
- Parameter for rmsprop. Default : 0.9
- rms_epsilon : float
- Parameter for rmsprop. Default : 0.0001
- momentum : float
- Momentum for SGD. Default : 0
- clip_norm : float
- The gradient tensor will be clipped to a maximum L2 norm given by this value.
- freeze_interval : int
- Period during which the target network is frozen and after which the target network is updated. Default : 1000
- batch_size : int
- Number of tuples taken into account for each iteration of gradient descent. Default : 32
- update_rule: str
- {sgd,rmsprop}. Default : rmsprop
- random_state : numpy random number generator
- Set the random seed.
- double_Q : bool, optional
- Whether to use double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
- neural_network_critic : object, optional
- default is deer.learning_algos.NN_keras
- neural_network_actor : object, optional
- default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)
Get the best action for a pseudo-state.
state : one pseudo-state
Returns: best_action : float; estim_value : float
clip_action(action)
Clip the possible actions if they fall outside the action space defined by self._nActions. self._nActions is given as [[low_action1,high_action1],[low_action2,high_action2], …].
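The clipping amounts to bounding each action dimension to its [low, high] interval; a small numpy illustration with a hypothetical two-dimensional action space:

    import numpy as np

    # Hypothetical bounds in the format [[low_action1, high_action1], [low_action2, high_action2]]
    action_space = np.array([[-1.0, 1.0], [0.0, 5.0]])

    action = np.array([1.7, -0.3])
    clipped = np.clip(action, action_space[:, 0], action_space[:, 1])
    # clipped is now [1.0, 0.0]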
getAllParams()
Get all parameters used by the learning algorithm.
Returns: Values of the parameters: list of numpy arrays
gradients(states, actions)
Returns the gradients on the Q-network for the different actions (used for policy update).
setAllParams(list_of_values)
Set all parameters used by the learning algorithm.
- list_of_values : list of numpy arrays
- List of the parameters to be set (in the same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)
Train the actor-critic algorithm from one batch of data.
- states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- actions_val : numpy array of integers with size [self._batch_size]
- actions[i] is the action taken after having observed states[:][i].
- rewards_val : numpy array of floats with size [self._batch_size]
- rewards[i] is the reward obtained for taking actions[i-1].
- next_states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- terminals_val : numpy array of booleans with size [self._batch_size]
- terminals[i] is True if the transition leads to a terminal state and False otherwise
Returns: Average loss of the batch training, and the individual losses for each tuple
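A minimal construction sketch for a continuous-action task, assuming PendulumEnv is a user-defined subclass of deer.base_classes.Environment (not provided by the library):

    import numpy as np
    from deer.learning_algos.AC_net_keras import MyACNetwork
    from pendulum_env import PendulumEnv  # hypothetical continuous-action Environment subclass

    rng = np.random.RandomState(123456)
    env = PendulumEnv()

    # DDPG-style actor-critic; both networks default to deer.learning_algos.NN_keras
    acnetwork = MyACNetwork(
        environment=env,
        batch_size=32,
        freeze_interval=1000,
        random_state=rng)

    acnetwork.setDiscountFactor(0.99)
    # chooseBestAction(state) returns the continuous best_action and its estim_value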
class deer.learning_algos.CRAR_keras.CRAR(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_CRAR_keras.NN'>, **kwargs)

Combined Reinforcement learning via Abstract Representations (CRAR) using Keras.
- environment : object from class Environment
- The environment in which the agent evolves.
- rho : float
- Parameter for rmsprop. Default : 0.9
- rms_epsilon : float
- Parameter for rmsprop. Default : 0.0001
- momentum : float
- Momentum for SGD. Default : 0
- clip_norm : float
- The gradient tensor will be clipped to a maximum L2 norm given by this value.
- freeze_interval : int
- Period during which the target network is frozen and after which the target network is updated. Default : 1000
- batch_size : int
- Number of tuples taken into account for each iteration of gradient descent. Default : 32
- update_rule: str
- {sgd,rmsprop}. Default : rmsprop
- random_state : numpy random number generator
- Set the random seed.
- double_Q : bool, optional
- Whether to use double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
- neural_network : object, optional
- Default is deer.learning_algos.NN_keras
chooseBestAction(state, mode, *args, **kwargs)
Get the best action for a pseudo-state.
- state : list of numpy arrays
- One pseudo-state. The number of arrays and their dimensions match self.environment.inputDimensions().
- mode : int
- Identifier of the mode (-1 is reserved for the training mode).
Returns: The best action : int
getAllParams()
Provides all parameters used by the learning algorithm.
Returns: Values of the parameters: list of numpy arrays
qValues(state_val)
Get the q values for one pseudo-state (without planning).
- state_val : array of objects (or list of objects)
- Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D,1D or scalar)].
Returns: The q values for the provided pseudo-state
qValues_planning(state_val, R, gamma, T, Q, d=5)
Get the average Q-values up to planning depth d for one pseudo-state.
- state_val : array of objects (or list of objects)
- Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D,1D or scalar)].
- R : float_model
- Model that fits the reward
- gamma : float_model
- Model that fits the discount factor
- T : transition_model
- Model that fits the transition between abstract representations
- Q : Q_model
- Model that fits the optimal Q-value
- d : int
- planning depth
Returns: The average q values with planning depth up to d for the provided pseudo-state
qValues_planning_abstr(state_abstr_val, R, gamma, T, Q, d, branching_factor=None)
Get the q values for pseudo-state(s) with a planning depth d. This function is called recursively by decreasing the depth d at every step.
- state_abstr_val : internal state(s)
- R : float_model
- Model that fits the reward
- gamma : float_model
- Model that fits the discount factor
- T : transition_model
- Model that fits the transition between abstract representations
- Q : Q_model
- Model that fits the optimal Q-value
- d : int
- planning depth
Returns: The Q-values with planning depth d for the provided encoded state(s)
setAllParams(list_of_values)
Set all parameters used by the learning algorithm.
- list_of_values : list of numpy arrays
- List of the parameters to be set (in the same order as given by getAllParams()).
setLearningRate(lr)
Setting the learning rate.
- lr : float
- The learning rate that has to be set
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)
Train CRAR from one batch of data.
- states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- actions_val : numpy array of integers with size [self._batch_size]
- actions[i] is the action taken after having observed states[:][i].
- rewards_val : numpy array of floats with size [self._batch_size]
- rewards[i] is the reward obtained for taking actions[i-1].
- next_states_val : numpy array of objects
- Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
- terminals_val : numpy array of booleans with size [self._batch_size]
- terminals[i] is True if the transition leads to a terminal state and False otherwise
Returns: Average loss of the batch training for the Q-values (RMSE), and the individual (square) losses for the Q-values for each tuple
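A minimal construction sketch, assuming LabyrinthEnv is a user-defined subclass of deer.base_classes.Environment (not provided by the library):

    import numpy as np
    from deer.learning_algos.CRAR_keras import CRAR
    from labyrinth_env import LabyrinthEnv  # hypothetical user-defined Environment subclass

    rng = np.random.RandomState(123456)
    env = LabyrinthEnv()

    # CRAR learns an abstract representation together with reward, discount,
    # transition and Q-value models on top of it
    crar = CRAR(
        environment=env,
        freeze_interval=1000,
        batch_size=32,
        double_Q=True,
        random_state=rng)

    crar.setLearningRate(0.0002)
    # Q-values without planning for one pseudo-state whose shape is given by
    # env.inputDimensions():
    # q = crar.qValues(state_val)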