Agent
This module contains classes used to define the standard behavior of the agent. It relies on the controllers, the chosen training/test policy and the learning algorithm to specify its behavior in the environment.
NeuralAgent(environment, learning_algo[, …]) : Wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
DataSet(env[, random_state, max_size, …]) : A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
class deer.agent.NeuralAgent(environment, learning_algo, replay_memory_size=1000000, replay_start_size=None, batch_size=32, random_state=<mtrand.RandomState object>, exp_priority=0, train_policy=None, test_policy=None, only_full_history=True)

The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
Attach controllers to it in order to conduct an experiment (when to train the agent, when to test, …).
Parameters:
- environment (object from class Environment) : The environment in which the agent interacts.
- learning_algo (object from class LearningAlgo) : The learning algorithm associated with the agent.
- replay_memory_size (int) : Size of the replay memory. Default: 1000000.
- replay_start_size (int) : Number of observations (i.e. number of time steps taken) stored in the replay memory before learning starts. Default: the minimum possible according to environment.inputDimensions().
- batch_size (int) : Number of tuples taken into account for each iteration of gradient descent. Default: 32.
- random_state (numpy random number generator) : Default: a numpy random number generator with a random seed.
- exp_priority (float) : The exponent that determines how much prioritization is used. Default: 0 (uniform priority). See Schaul et al. (2016), Prioritized Experience Replay.
- train_policy (object from class Policy) : Policy followed when in training mode (mode -1).
- test_policy (object from class Policy) : Policy followed when in modes other than training (validation and test modes).
- only_full_history (boolean) : Whether to train the neural network only on full histories, or to pad with zeros the observations that precede the beginning of the episode.
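For illustration, here is a minimal construction sketch. MyEnv and MyQNetwork are hypothetical placeholders for user-defined subclasses of Environment and LearningAlgo; only the NeuralAgent arguments themselves are taken from the parameter list above:

    import numpy as np
    from deer.agent import NeuralAgent
    from my_project import MyEnv, MyQNetwork  # hypothetical user-defined classes

    rng = np.random.RandomState(123456)
    env = MyEnv(rng)                      # subclass of Environment
    learning_algo = MyQNetwork(env, rng)  # subclass of LearningAlgo

    agent = NeuralAgent(
        env,
        learning_algo,
        replay_memory_size=100000,  # smaller than the 1000000 default
        batch_size=32,
        random_state=rng,
        exp_priority=0)             # 0 = uniform (non-prioritized) replay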
avgBellmanResidual()
Returns the average training loss over the epoch.
avgEpisodeVValue()
Returns the average V value over the episode (on time steps where a non-random action has been taken).
discountFactor()
Gets the discount factor.
dumpNetwork(fname, nEpoch=-1)
Dumps the network to a file.
Parameters:
- fname (string) : Name of the file to which the network will be dumped.
- nEpoch (int) : Epoch number (optional).
learningRate()
Gets the learning rate.
overrideNextAction(action)
Overrides the action chosen by the agent. This should be used when handling the OnActionChosen signal.
run(n_epochs, epoch_length)
Encapsulates the whole learning process. It starts by calling the controllers' "onStart" methods, then runs the given number of epochs, where an epoch is made up of one or more episodes (run with agent._runEpisode) and ends once the number of steps reaches epoch_length. It finishes by calling the controllers' "end" methods.
Parameters:
- n_epochs (int) : Number of epochs.
- epoch_length (int) : Maximum number of steps for a given epoch.
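A short usage sketch, continuing from the construction example above (controllers would normally be attached to the agent beforehand so that training, testing and logging actually happen during the run):

    # Run 50 epochs of at most 1000 steps each; attached controllers
    # receive their callbacks (onStart, ..., end) during this call.
    agent.run(n_epochs=50, epoch_length=1000)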
setControllersActive(toDisable, active)
Activates or deactivates the given controllers.
setDiscountFactor(df)
Sets the discount factor.
setLearningRate(lr)
Sets the learning rate for the gradient descent.
setNetwork(fname, nEpoch=-1)
Loads stored values into the network.
Parameters:
- fname (string) : Name of the file where the values are stored.
- nEpoch (int) : Epoch number (optional).
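A small sketch of saving and restoring network parameters with dumpNetwork and setNetwork; the file name and epoch number below are illustrative:

    # Save the current network parameters under an illustrative name ...
    agent.dumpNetwork("my_agent_params", nEpoch=10)
    # ... and restore them later, e.g. in a new run with the same architecture.
    agent.setNetwork("my_agent_params", nEpoch=10)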
totalRewardOverLastTest()
Returns the average sum of rewards per episode and the number of episodes.
train()
Selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._learning_algo.train).
class deer.agent.DataSet(env, random_state=None, max_size=1000000, use_priority=False, only_full_history=True)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
actions()
Gets all actions currently in the replay memory, ordered by the time at which they were taken.
addSample(obs, action, reward, is_terminal, priority)
Stores the punctual observations, action, reward, terminal flag and priority in the dataset.
Parameters:
- obs (ndarray) : An ndarray(dtype='object') where obs[s] corresponds to the punctual observation s before the agent took action [action].
- action (int) : The action taken after having observed [obs].
- reward (float) : The reward associated with taking this [action].
- is_terminal (bool) : Whether [action] led to a terminal state (i.e. corresponded to a terminal transition).
- priority (float) : The priority to be associated with the sample.
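For illustration, a sketch of building a DataSet directly and storing one transition. The env and rng objects are assumed to come from the NeuralAgent example above, and env.observe() is an assumption about that user-defined environment; the addSample arguments follow the parameter list above:

    from deer.agent import DataSet

    dataset = DataSet(env, random_state=rng, max_size=100000)

    # Punctual observations for each input (observe() is assumed to be
    # provided by the user-defined environment).
    obs = env.observe()
    dataset.addSample(obs, action=0, reward=1.0, is_terminal=False, priority=1.0)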
observations()
Gets all observations currently in the replay memory, ordered by the time at which they were observed.
randomBatch(batch_size, use_priority)
Returns a batch of states, actions, rewards, terminal statuses and next_states for batch_size randomly chosen transitions. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.
Parameters:
- batch_size (int) : Number of transitions to return.
- use_priority (boolean) : Whether to use prioritized replay or not.
Returns:
- states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data, with the only constraint that they are complete with respect to the history size for each observation.
- actions (numpy array of integers [batch_size]) : actions[i] is the action taken after having observed states[:][i].
- rewards (numpy array of floats [batch_size]) : rewards[i] is the reward obtained for taking actions[i-1].
- next_states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
- terminals (numpy array of booleans [batch_size]) : terminals[i] is True if the transition leads to a terminal state, and False otherwise.
Raises:
- SliceError : If a batch of this batch_size could not be built from the current data set (not enough data, or all trajectories are too short).
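A usage sketch, assuming the DataSet from the example above already contains enough transitions:

    # Draw 32 transitions with uniform sampling; the returned values are the
    # states, actions, rewards, next_states and terminals documented above.
    # A SliceError is raised if the memory does not yet hold enough complete histories.
    batch = dataset.randomBatch(32, use_priority=False)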
randomBatch_nstep(batch_size, nstep, use_priority)
Returns the corresponding states, actions, rewards, terminal statuses and next_states for batch_size randomly chosen transitions, each spanning nstep steps. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.
Parameters:
- batch_size (int) : Number of transitions to return.
- nstep (int) : Number of transitions to be considered for each element.
- use_priority (boolean) : Whether to use prioritized replay or not.
Returns:
- states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * (history size + nstep - 1) * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data, with the only constraint that they are complete with respect to the history size for each observation.
- actions (numpy array of integers [batch_size, nstep]) : actions[i] is the action taken after having observed states[:][i].
- rewards (numpy array of floats [batch_size, nstep]) : rewards[i] is the reward obtained for taking actions[i-1].
- next_states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * (history size + nstep - 1) * size of punctual observation (which is 2D, 1D or scalar)].
- terminals (numpy array of booleans [batch_size, nstep]) : terminals[i] is True if the transition leads to a terminal state, and False otherwise.
Raises:
- SliceError : If a batch of this size could not be built from the current data set (not enough data, or all trajectories are too short).
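Similarly, a sketch of sampling n-step transitions, again assuming a sufficiently filled DataSet:

    # Draw 32 sequences of 4 consecutive transitions for n-step targets;
    # actions, rewards and terminals then have shape [32, 4].
    nstep_batch = dataset.randomBatch_nstep(32, nstep=4, use_priority=False)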
rewards()
Gets all rewards currently in the replay memory, ordered by the time at which they were received.
terminals()
Gets all terminal flags currently in the replay memory, ordered by the time at which they were observed.
terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.
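These accessors can be combined to inspect the whole replay memory, for instance:

    # Each array is ordered by time step; as noted above, terminals[i]
    # refers to the transition taken with actions()[i].
    all_observations = dataset.observations()
    all_actions = dataset.actions()
    all_rewards = dataset.rewards()
    all_terminals = dataset.terminals()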
updatePriorities(priorities, rndValidIndices)
Updates the priorities associated with the transitions at the given indices (used with prioritized experience replay).