Agents

This module contains the classes used to define an agent wrapping a DQN (deep Q-network).

NeuralAgent(environment, q_network[, ...]) The NeuralAgent class wraps a deep Q-network for training and testing in a given environment.
DataSet(env[, randomState, maxSize, ...]) A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

Detailed description

class deer.agent.NeuralAgent(environment, q_network, replay_memory_size, replay_start_size, batch_size, randomState, exp_priority=0)

The NeuralAgent class wraps a deep Q-network for training and testing in a given environment.

Attach controllers to it in order to conduct an experiment (they decide when to train the agent, when to test it, etc.).

Parameters:

environment : object from class Environment

The environment in which the agent interacts

q_network : object from class QNetwork

The q_network associated with the agent

replay_memory_size : int

Size of the replay memory

replay_start_size : int

Number of observations (=number of time steps taken) in the replay memory before starting learning

batch_size : int

Number of tuples taken into account for each iteration of gradient descent

randomState : numpy random number generator

Pseudo-random number generator used by the agent

exp_priority : float, optional

The exponent that determines how much prioritization is used; the default is 0 (uniform priority). See Schaul et al. (2016), "Prioritized Experience Replay".
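
As a rough illustration (not an exact recipe), constructing an agent might look like the sketch below; MyEnv and MyQNetwork are placeholders for concrete Environment and QNetwork subclasses, whose import paths depend on the deer version and backend in use:

    import numpy as np
    from deer.agent import NeuralAgent

    env = MyEnv()                       # placeholder Environment subclass
    qnet = MyQNetwork(environment=env)  # placeholder QNetwork subclass
    rng = np.random.RandomState(123456)

    agent = NeuralAgent(
        environment=env,
        q_network=qnet,
        replay_memory_size=1000000,  # transitions kept in the replay memory
        replay_start_size=50000,     # observations stored before learning starts
        batch_size=32,               # transitions per gradient-descent iteration
        randomState=rng,
        exp_priority=0)              # 0 = uniform sampling from the replay memory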

Methods

attach(controller) Attach a controller to the agent
avgBellmanResidual() Returns the average training loss over the epoch
avgEpisodeVValue() Returns the average V value over the episode
bestAction() Returns the best action
detach(controllerIdx) Detach the controller at index controllerIdx
discountFactor() Get the discount factor
dumpNetwork(fname, nEpoch)
epsilon() Get the epsilon for \(\epsilon\)-greedy exploration
learningRate() Get the learning rate
mode()
overrideNextAction(action) Overrides the action chosen by the agent.
resumeTrainingMode()
run(nEpochs, epochLength)
setControllersActive(toDisable, active) Activate or deactivate the given controllers
setDiscountFactor(df) Set the discount factor
setEpsilon(e) Set the epsilon used for \(\epsilon\)-greedy exploration
setLearningRate(lr) Set the learning rate for the gradient descent
startMode(mode, epochLength)
summarizeTestPerformance()
totalRewardOverLastTest() Returns the average sum of rewards per episode over the last test
train()

avgBellmanResidual()

Returns the average training loss over the epoch

avgEpisodeVValue()

Returns the average V value over the episode

bestAction()

Returns the best action

discountFactor()

Get the discount factor

epsilon()

Get the epsilon for \(\epsilon\)-greedy exploration

learningRate()

Get the learning rate

overrideNextAction(action)

Overrides the action chosen by the agent. This should be done when handling the OnActionChosen signal.

setControllersActive(toDisable, active)

Activate or deactivate the given controllers

setDiscountFactor(df)

Set the discount factor

setEpsilon(e)

Set the epsilon used for \(\epsilon\)-greedy exploration

setLearningRate(lr)

Set the learning rate for the gradient descent

totalRewardOverLastTest()

Returns the average sum of rewards per episode over the last test
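
Putting the pieces together, a typical experiment might be driven roughly as follows; TrainerController stands in for whichever controllers the experiment needs (deer ships several, but their exact names and arguments are not covered here), and agent is the NeuralAgent built in the earlier sketch:

    agent.setDiscountFactor(0.95)   # discount factor for future rewards
    agent.setLearningRate(0.0002)   # learning rate for gradient descent
    agent.setEpsilon(1.0)           # start fully exploratory (epsilon-greedy)

    agent.attach(TrainerController())        # controllers decide when to train/test
    agent.run(nEpochs=10, epochLength=1000)  # 10 epochs of 1000 steps each
                                             # (epochLength assumed to be in time steps)

During the run, quantities such as avgBellmanResidual(), avgEpisodeVValue() or totalRewardOverLastTest() can be queried, typically from controllers, to monitor progress.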

class deer.agent.DataSet(env, randomState=None, maxSize=1000, use_priority=False)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

Methods

actions() Get all actions currently in the replay memory, ordered by the time at which they were taken.
addSample(obs, action, reward, isTerminal, ...) Store an (observation [for all subjects], action, reward, isTerminal) tuple in the dataset.
nElems() Get the number of samples in this dataset (i.e. the current replay memory size).
observations() Get all observations currently in the replay memory, ordered by the time at which they were observed.
randomBatch(size, use_priority) Return the corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions.
rewards() Get all rewards currently in the replay memory, ordered by the time at which they were received.
terminals() Get all terminals currently in the replay memory, ordered by the time at which they were observed.
update_priorities(priorities, rndValidIndices) Update the priorities of the given transitions.

actions()

Get all actions currently in the replay memory, ordered by the time at which they were taken.

addSample(obs, action, reward, isTerminal, priority)

Store an (observation [for all subjects], action, reward, isTerminal) tuple in the dataset.

Parameters:

obs : ndarray

An ndarray(dtype='object') where obs[s] corresponds to the observation made on subject s before the agent took action [action].

action : int

The action taken after having observed [obs].

reward : float

The reward associated with taking this [action].

isTerminal : bool

Tells whether [action] led to a terminal state (i.e. corresponded to a terminal transition).

priority : float

The priority to be associated with the sample
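
A minimal sketch of filling a DataSet by hand is given below; the number of subjects and the observation shapes are purely illustrative, and env is assumed to be an Environment instance as above:

    import numpy as np
    from deer.agent import DataSet

    dataset = DataSet(env, randomState=np.random.RandomState(0), maxSize=10000)

    obs = np.empty(2, dtype='object')  # one entry per observed subject
    obs[0] = np.zeros((8, 8))          # e.g. a 2-D observation for subject 0
    obs[1] = 0.0                       # e.g. a scalar observation for subject 1

    dataset.addSample(obs, action=1, reward=0.5, isTerminal=False, priority=1.0)
    print(dataset.nElems())            # number of samples currently stored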

nElems()

Get the number of samples in this dataset (i.e. the current replay memory size).

observations()

Get all observations currently in the replay memory, ordered by the time at which they were observed.

observations[s][i] corresponds to the observation made on subject s before the agent took actions()[i].

randomBatch(size, use_priority)

Return the corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each subject s.

Parameters:

size : int

Number of transitions to return.

use_priority : bool

Whether to sample the transitions according to their priorities rather than uniformly.

Returns:

states : ndarray

An ndarray(size=number_of_subjects, dtype='object') where states[s] is a 2+D matrix of dimensions size x s.memorySize x "shape of a given observation for this subject". States are sampled randomly from the data, with the only constraint that their histories are complete for each observed subject.

actions : ndarray

An ndarray(size=number_of_subjects, dtype='int32') where actions[i] is the action taken after having observed states[:][i].

rewards : ndarray

An ndarray(size=number_of_subjects, dtype='float32') where rewards[i] is the reward obtained for taking actions[i-1].

next_states : ndarray

Same structure as states, but next_states[s][i] is guaranteed to be the information concerning the state following the one described by states[s][i] for each subject s.

terminals : ndarray

An ndarray(size=number_of_subjects, dtype='bool') where terminals[i] is True if actions[i] led to a terminal state and False otherwise.
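
For example, drawing a mini-batch from the dataset built above might look as follows; the unpacking order matches the Returns section (states, actions, rewards, next_states, terminals) and should be checked against the version in use:

    states, actions, rewards, next_states, terminals = dataset.randomBatch(
        32, use_priority=False)         # uniform sampling of 32 transitions

    # terminals[i] indicates whether the i-th sampled transition ended an
    # episode; its next state is then all zeros, as noted above.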

rewards()

Get all rewards currently in the replay memory, ordered by the time at which they were received.

terminals()

Get all terminals currently in the replay memory, ordered by the time at which they were observed.

terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.

update_priorities(priorities, rndValidIndices)

Update the priorities of the transitions at the given indices.
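
For prioritized replay, a heavily simplified sketch of the update step is given below; how the indices of the sampled transitions are obtained is version-dependent and not documented here, so sampled_indices and td_errors are hypothetical placeholders:

    # td_errors: TD errors computed by the Q-network for the sampled batch (hypothetical)
    # sampled_indices: indices of those transitions in the replay memory (hypothetical)
    new_priorities = np.abs(td_errors) + 1e-4   # small constant keeps priorities non-zero
    dataset.update_priorities(new_priorities, sampled_indices)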