Agent

This module contains the classes used to define the standard behavior of the agent. The agent relies on its controllers, the chosen training/test policies and the learning algorithm to determine its behavior in the environment.

NeuralAgent(environment, learning_algo[, …]) The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
DataSet(env[, random_state, max_size, …]) A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
class deer.agent.NeuralAgent(environment, learning_algo, replay_memory_size=1000000, replay_start_size=None, batch_size=32, random_state=<mtrand.RandomState object>, exp_priority=0, train_policy=None, test_policy=None, only_full_history=True)

The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.

Attach controllers to it in order to conduct an experiment (when to train the agent, when to test,…).

environment : object from class Environment
The environment in which the agent interacts
learning_algo : object from class LearningAlgo
The learning algorithm associated with the agent
replay_memory_size : int
Size of the replay memory. Default : 1000000
replay_start_size : int
Number of observations (=number of time steps taken) in the replay memory before starting learning. Default: minimum possible according to environment.inputDimensions().
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
random_state : numpy random number generator
Default : random seed.
exp_priority : float
The exponent that determines how much prioritization is used, default is 0 (uniform priority). One may check out Schaul et al. (2016) - Prioritized Experience Replay.
train_policy : object from class Policy
Policy followed when in training mode (mode -1)
test_policy : object from class Policy
Policy followed when in other modes than training (validation and test modes)
only_full_history : boolean
Whether to train the neural network only on full histories, or to fill the observations that precede the beginning of the episode with zeros
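
The sketch below shows one way to build an agent once an Environment subclass and a LearningAlgo instance are available; the build_agent helper and its arguments are illustrative assumptions, not part of the module:

    import numpy as np
    from deer.agent import NeuralAgent

    def build_agent(env, learning_algo):
        """Build a NeuralAgent from a user-provided Environment and LearningAlgo."""
        return NeuralAgent(
            env,
            learning_algo,
            replay_memory_size=1000000,    # size of the replay memory
            batch_size=32,                 # tuples per gradient-descent iteration
            random_state=np.random.RandomState(0),
            exp_priority=0,                # 0 = uniform (non-prioritized) replay
            only_full_history=True)
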
avgBellmanResidual()

Returns the average training loss over the epoch

avgEpisodeVValue()

Returns the average V value over the episode (i.e. over the time steps where a non-random action was taken)

discountFactor()

Get the discount factor

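For instance, these per-epoch statistics can be read from a custom controller; the sketch below assumes the Controller base class from deer.experiment.base_controllers and its onEpochEnd(agent) hook:

    from deer.experiment.base_controllers import Controller

    class EpochStatsController(Controller):
        """Hypothetical controller that prints training statistics after each epoch."""
        def onEpochEnd(self, agent):
            print("average Bellman residual:", agent.avgBellmanResidual())
            print("average V value:", agent.avgEpisodeVValue())
            print("discount factor:", agent.discountFactor())
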
dumpNetwork(fname, nEpoch=-1)

Dump the network

fname : string
Name of the file where the network will be dumped
nEpoch : int
Epoch number (Optional)
learningRate()

Get the learning rate

overrideNextAction(action)

Overrides the action that has been chosen. This should be used in response to the OnActionChosen signal.
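
A possible pattern, assuming a Controller subclass from deer.experiment.base_controllers whose onActionChosen(agent, action) hook fires on that signal (the forbidden/safe action constants are purely illustrative):

    from deer.experiment.base_controllers import Controller

    FORBIDDEN_ACTIONS = {3}   # hypothetical set of actions to veto
    SAFE_ACTION = 0           # hypothetical replacement action

    class ActionFilterController(Controller):
        """Hypothetical controller that replaces vetoed actions."""
        def onActionChosen(self, agent, action):
            # Called on the OnActionChosen signal, before the action is taken.
            if action in FORBIDDEN_ACTIONS:
                agent.overrideNextAction(SAFE_ACTION)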

run(n_epochs, epoch_length)

This function encapsulates the whole learning process. It starts by calling the controllers’ “onStart” method, then runs a given number of epochs, where an epoch is made up of one or many episodes (run with agent._runEpisode) and ends when the number of steps reaches the argument “epoch_length”. It finishes by calling the controllers’ “end” method.

n_epochs : int
Number of epochs
epoch_length : int
Maximum number of steps for a given epoch
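
A sketch of a typical experiment loop; the attach method and the two controllers come from the deer examples and deer.experiment.base_controllers, and the particular choice of controllers and numbers is only illustrative:

    import deer.experiment.base_controllers as bc

    def run_experiment(agent):
        """Attach a minimal set of controllers and run the learning process."""
        agent.attach(bc.VerboseController())   # prints progress information
        agent.attach(bc.TrainerController())   # triggers the training iterations

        # 10 epochs of at most 1000 steps each; one epoch may span several episodes.
        agent.run(n_epochs=10, epoch_length=1000)
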
setControllersActive(toDisable, active)

Activate or deactivate the given controllers

setDiscountFactor(df)

Set the discount factor

setLearningRate(lr)

Set the learning rate for the gradient descent

setNetwork(fname, nEpoch=-1)

Load stored values into the network

fname : string
Name of the file where the values are
nEpoch : int
Epoch number (Optional)
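
Together with dumpNetwork above, this allows checkpointing network parameters; a minimal sketch (file name and epoch number are arbitrary):

    def checkpoint_and_restore(agent, other_agent):
        """Dump the parameters of one agent and load them into another agent."""
        agent.dumpNetwork("my_experiment", nEpoch=10)
        other_agent.setNetwork("my_experiment", nEpoch=10)
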
totalRewardOverLastTest()

Returns the average sum of rewards per episode and the number of episodes

train()

This function selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._learning_algo.train).

class deer.agent.DataSet(env, random_state=None, max_size=1000000, use_priority=False, only_full_history=True)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

actions()

Get all actions currently in the replay memory, ordered by the time at which they were taken.

addSample(obs, action, reward, is_terminal, priority)

Store the punctual observations, action, reward, is_terminal and priority in the dataset.

obs : ndarray

An ndarray(dtype=’object’) where obs[s] corresponds to the punctual observation s before the agent took action [action].
action : int
The action taken after having observed [obs].
reward : float
The reward associated with taking this [action].
is_terminal : bool
Tells whether [action] led to a terminal state (i.e. corresponded to a terminal transition).
priority : float
The priority to be associated with the sample
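
A sketch of storing a single transition by hand, assuming an environment whose inputDimensions() declares one 1-D observation of size 4 (the env argument is a user-provided Environment and the numbers are arbitrary):

    import numpy as np
    from deer.agent import DataSet

    def store_one_transition(env):
        """Store a single transition in a fresh replay memory."""
        dataset = DataSet(env, max_size=100000)

        # One punctual observation per input declared by env.inputDimensions();
        # here a single 1-D observation of size 4 is assumed.
        obs = np.empty(1, dtype="object")
        obs[0] = np.array([0.1, -0.2, 0.0, 0.3])

        dataset.addSample(obs, action=2, reward=1.0, is_terminal=False, priority=1.0)
        return dataset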
observations()

Get all observations currently in the replay memory, ordered by the time at which they were observed.

randomBatch(batch_size, use_priority)

Returns a batch of states, actions, rewards, terminal status, and next_states for a number batch_size of randomly chosen transitions. Note that if terminal[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.

batch_size : int
Number of transitions to return.
use_priority : Boolean
Whether to use prioritized replay or not
states : numpy array of objects
Each object is a numpy array that relates to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data with the only constraint that they are complete with regard to the history size of each observation.
actions : numpy array of integers [batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards : numpy array of floats [batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states : numpy array of objects
Each object is a numpy array that relates to one of the observations, with size [batch_size * history size * size of punctual observation (which is 2D, 1D or scalar)].
terminals : numpy array of booleans [batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise
SliceError
If a batch of this batch_size could not be built based on current data set (not enough data or all trajectories are too short).
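
A sketch of drawing a uniform batch from an already-filled DataSet; the assumption here is that the documented values are returned in the listed order, with any additional trailing values (such as the sampled indices used by updatePriorities) collected in rest:

    def sample_batch(dataset):
        """Draw 32 uniformly sampled transitions from the replay memory."""
        states, actions, rewards, next_states, terminals, *rest = dataset.randomBatch(
            batch_size=32, use_priority=False)
        return states, actions, rewards, next_states, terminals
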
randomBatch_nstep(batch_size, nstep, use_priority)

Return corresponding states, actions, rewards, terminal status, and next_states for a number batch_size of randomly chosen transitions. Note that if terminal[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.

batch_size : int
Number of transitions to return.
nstep : int
Number of transitions to be considered for each element
use_priority : Boolean
Whether to use prioritized replay or not
states : numpy array of objects
Each object is a numpy array that relates to one of the observations, with size [batch_size * (history size+nstep-1) * size of punctual observation (which is 2D, 1D or scalar)]. States are taken randomly from the data with the only constraint that they are complete with regard to the history size of each observation.
actions : numpy array of integers [batch_size, nstep]
actions[i] is the action taken after having observed states[:][i].
rewards : numpy array of floats [batch_size, nstep]
rewards[i] is the reward obtained for taking actions[i-1].
next_states : numpy array of objects
Each object is a numpy array that relates to one of the observations, with size [batch_size * (history size+nstep-1) * size of punctual observation (which is 2D, 1D or scalar)].
terminals : numpy array of booleans [batch_size, nstep]
terminals[i] is True if the transition leads to a terminal state and False otherwise
SliceError
If a batch of this size could not be built based on current data set (not enough data or all trajectories are too short).
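
The n-step variant is sampled in the same way; in the sketch below (same return-order assumption as for randomBatch), actions, rewards and terminals come back with shape [batch_size, nstep]:

    def sample_nstep_batch(dataset, nstep=3):
        """Draw 32 transitions of nstep consecutive steps each."""
        states, actions, rewards, next_states, terminals, *rest = dataset.randomBatch_nstep(
            batch_size=32, nstep=nstep, use_priority=False)
        return actions.shape   # expected: (32, nstep)
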
rewards()

Get all rewards currently in the replay memory, ordered by the time at which they were received.

terminals()

Get all terminals currently in the replay memory, ordered by the time at which they were observed.

terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.

updatePriorities(priorities, rndValidIndices)