Agent
This module contains classes used to define the standard behavior of the agent. It relies on the controllers, the chosen training/test policy and the learning algorithm to specify its behavior in the environment.
NeuralAgent(environment, learning_algo[, …]) : Wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
DataSet(env[, random_state, max_size, …]) : A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
class deer.agent.NeuralAgent(environment, learning_algo, replay_memory_size=1000000, replay_start_size=None, batch_size=32, random_state=<mtrand.RandomState object>, exp_priority=0, train_policy=None, test_policy=None, only_full_history=True)

The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
Attach controllers to it in order to conduct an experiment (when to train the agent, when to test, …).
Parameters:
- environment (object from class Environment) : The environment in which the agent interacts.
- learning_algo (object from class LearningAlgo) : The learning algorithm associated with the agent.
- replay_memory_size (int) : Size of the replay memory. Default: 1000000.
- replay_start_size (int) : Number of observations (i.e. number of time steps taken) stored in the replay memory before learning starts. Default: the minimum possible according to environment.inputDimensions().
- batch_size (int) : Number of tuples taken into account for each iteration of gradient descent. Default: 32.
- random_state (numpy random number generator) : Default: a numpy random number generator with a random seed.
- exp_priority (float) : The exponent that determines how much prioritization is used. Default: 0 (uniform priority). See Schaul et al. (2016), Prioritized Experience Replay.
- train_policy (object from class Policy) : Policy followed when in training mode (mode -1).
- test_policy (object from class Policy) : Policy followed when in modes other than training (validation and test modes).
- only_full_history (boolean) : Whether to train the neural network only on full histories, or to pad with zeros the observations that precede the beginning of the episode.
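For illustration, here is a minimal construction sketch. MyEnv and MyQNetwork are hypothetical placeholders for user-defined subclasses of Environment and LearningAlgo; only the NeuralAgent arguments themselves are taken from the parameter list above:

    import numpy as np
    from deer.agent import NeuralAgent
    from my_project import MyEnv, MyQNetwork  # hypothetical user-defined classes

    rng = np.random.RandomState(123456)
    env = MyEnv(rng)                      # subclass of Environment
    learning_algo = MyQNetwork(env, rng)  # subclass of LearningAlgo

    agent = NeuralAgent(
        env,
        learning_algo,
        replay_memory_size=100000,  # smaller than the 1000000 default
        batch_size=32,
        random_state=rng,
        exp_priority=0)             # 0 = uniform (non-prioritized) replay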
avgBellmanResidual()
Returns the average training loss over the epoch.
avgEpisodeVValue()
Returns the average V value over the episode (on time steps where a non-random action has been taken).
discountFactor()
Gets the discount factor.
dumpNetwork(fname, nEpoch=-1)
Dumps the network to a file.
Parameters:
- fname (string) : Name of the file to which the network will be dumped.
- nEpoch (int) : Epoch number (optional).
learningRate()
Gets the learning rate.
overrideNextAction(action)
Overrides the action chosen by the agent. This should be used when handling the OnActionChosen signal.
run(n_epochs, epoch_length)
Encapsulates the whole learning process. It starts by calling the controllers' "onStart" methods, then runs the given number of epochs, where an epoch is made up of one or more episodes (run with agent._runEpisode) and ends once the number of steps reaches epoch_length. It finishes by calling the controllers' "end" methods.
Parameters:
- n_epochs (int) : Number of epochs.
- epoch_length (int) : Maximum number of steps for a given epoch.
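A short usage sketch, continuing from the construction example above (controllers would normally be attached to the agent beforehand so that training, testing and logging actually happen during the run):

    # Run 50 epochs of at most 1000 steps each; attached controllers
    # receive their callbacks (onStart, ..., end) during this call.
    agent.run(n_epochs=50, epoch_length=1000)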
setControllersActive(toDisable, active)
Activates or deactivates the given controllers.
setDiscountFactor(df)
Sets the discount factor.
setLearningRate(lr)
Sets the learning rate for the gradient descent.
setNetwork(fname, nEpoch=-1)
Loads stored values into the network.
Parameters:
- fname (string) : Name of the file where the values are stored.
- nEpoch (int) : Epoch number (optional).
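A small sketch of saving and restoring network parameters with dumpNetwork and setNetwork; the file name and epoch number below are illustrative:

    # Save the current network parameters under an illustrative name ...
    agent.dumpNetwork("my_agent_params", nEpoch=10)
    # ... and restore them later, e.g. in a new run with the same architecture.
    agent.setNetwork("my_agent_params", nEpoch=10)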
totalRewardOverLastTest()
Returns the average sum of rewards per episode and the number of episodes.
train()
Selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._learning_algo.train).
class deer.agent.DataSet(env, random_state=None, max_size=1000000, use_priority=False, only_full_history=True)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
actions()
Gets all actions currently in the replay memory, ordered by the time at which they were taken.
addSample(obs, action, reward, is_terminal, priority)
Stores the punctual observations, action, reward, terminal flag and priority in the dataset.
Parameters:
- obs (ndarray) : An ndarray(dtype='object') where obs[s] corresponds to the punctual observation s before the agent took action [action].
- action (int) : The action taken after having observed [obs].
- reward (float) : The reward associated with taking this [action].
- is_terminal (bool) : Whether [action] led to a terminal state (i.e. corresponded to a terminal transition).
- priority (float) : The priority to be associated with the sample.
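For illustration, a sketch of building a DataSet directly and storing one transition. The env and rng objects are assumed to come from the NeuralAgent example above, and env.observe() is an assumption about that user-defined environment; the addSample arguments follow the parameter list above:

    from deer.agent import DataSet

    dataset = DataSet(env, random_state=rng, max_size=100000)

    # Punctual observations for each input (observe() is assumed to be
    # provided by the user-defined environment).
    obs = env.observe()
    dataset.addSample(obs, action=0, reward=1.0, is_terminal=False, priority=1.0)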
observations()
Gets all observations currently in the replay memory, ordered by the time at which they were observed.
randomBatch(batch_size, use_priority)
Returns a batch of states, actions, rewards, terminal statuses and next_states for batch_size randomly chosen transitions. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.
Parameters:
- batch_size (int) : Number of transitions to return.
- use_priority (boolean) : Whether to use prioritized replay or not.
Returns:
- states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data, with the only constraint that they are complete with respect to the history size for each observation.
- actions (numpy array of integers [batch_size]) : actions[i] is the action taken after having observed states[:][i].
- rewards (numpy array of floats [batch_size]) : rewards[i] is the reward obtained for taking actions[i-1].
- next_states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
- terminals (numpy array of booleans [batch_size]) : terminals[i] is True if the transition leads to a terminal state, and False otherwise.
Raises:
- SliceError : If a batch of this batch_size could not be built from the current data set (not enough data, or all trajectories are too short).
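A usage sketch, assuming the DataSet from the example above already contains enough transitions:

    # Draw 32 transitions with uniform sampling; the returned values are the
    # states, actions, rewards, next_states and terminals documented above.
    # A SliceError is raised if the memory does not yet hold enough complete histories.
    batch = dataset.randomBatch(32, use_priority=False)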
randomBatch_nstep(batch_size, nstep, use_priority)
Returns the corresponding states, actions, rewards, terminal statuses and next_states for batch_size randomly chosen transitions, each spanning nstep steps. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.
Parameters:
- batch_size (int) : Number of transitions to return.
- nstep (int) : Number of transitions to be considered for each element.
- use_priority (boolean) : Whether to use prioritized replay or not.
Returns:
- states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * (history size + nstep - 1) * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data, with the only constraint that they are complete with respect to the history size for each observation.
- actions (numpy array of integers [batch_size, nstep]) : actions[i] is the action taken after having observed states[:][i].
- rewards (numpy array of floats [batch_size, nstep]) : rewards[i] is the reward obtained for taking actions[i-1].
- next_states (numpy array of objects) : Each object is a numpy array relating to one of the observations, with size [batch_size * (history size + nstep - 1) * size of punctual observation (which is 2D, 1D or scalar)].
- terminals (numpy array of booleans [batch_size, nstep]) : terminals[i] is True if the transition leads to a terminal state, and False otherwise.
Raises:
- SliceError : If a batch of this size could not be built from the current data set (not enough data, or all trajectories are too short).
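Similarly, a sketch of sampling n-step transitions, again assuming a sufficiently filled DataSet:

    # Draw 32 sequences of 4 consecutive transitions for n-step targets;
    # actions, rewards and terminals then have shape [32, 4].
    nstep_batch = dataset.randomBatch_nstep(32, nstep=4, use_priority=False)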
rewards()
Gets all rewards currently in the replay memory, ordered by the time at which they were received.
terminals()
Gets all terminal flags currently in the replay memory, ordered by the time at which they were observed.
terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.
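These accessors can be combined to inspect the whole replay memory, for instance:

    # Each array is ordered by time step; as noted above, terminals[i]
    # refers to the transition taken with actions()[i].
    all_observations = dataset.observations()
    all_actions = dataset.actions()
    all_rewards = dataset.rewards()
    all_terminals = dataset.terminals()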
updatePriorities(priorities, rndValidIndices)
Updates the priorities associated with the transitions at the given indices (used with prioritized experience replay).