Agents
This module contains classes used to define any agent wrapping a DQN.
NeuralAgent(environment, q_network, ...[, ...]) | The NeuralAgent class wraps a deep Q-network for training and testing in a given environment.
DataSet(env[, randomState, maxSize, ...]) | A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
Detailed description
class deer.agent.NeuralAgent(environment, q_network, replay_memory_size, replay_start_size, batch_size, randomState, exp_priority=0)

The NeuralAgent class wraps a deep Q-network for training and testing in a given environment. Attach controllers to it in order to conduct an experiment (when to train the agent, when to test, ...).
Parameters:

environment : object from class Environment
    The environment in which the agent interacts.
q_network : object from class QNetwork
    The Q-network associated to the agent.
replay_memory_size : int
    Size of the replay memory.
replay_start_size : int
    Number of observations (i.e. number of time steps taken) in the replay memory before learning starts.
batch_size : int
    Number of tuples taken into account for each iteration of gradient descent.
randomState : numpy random number generator
    Pseudo-random number generator used by the agent.
exp_priority : float, optional
    Exponent that determines how much prioritization is used; the default of 0 gives uniform priority. See Schaul et al. (2016), "Prioritized Experience Replay".
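With a positive exp_priority, transitions are no longer replayed uniformly. As a rough guide (following Schaul et al. (2016); the exact bookkeeping inside deer may differ), a transition with priority \(p_i\) is sampled with probability

\[ P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \]

where \(\alpha\) plays the role of exp_priority; \(\alpha = 0\) recovers uniform sampling.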
Methods

attach(controller)
avgBellmanResidual() | Returns the average training loss on the epoch.
avgEpisodeVValue() | Returns the average V value on the episode.
bestAction() | Returns the best action.
detach(controllerIdx)
discountFactor() | Get the discount factor.
dumpNetwork(fname, nEpoch)
epsilon() | Get the epsilon for \(\epsilon\)-greedy exploration.
learningRate() | Get the learning rate.
mode()
overrideNextAction(action) | Possibility to override the chosen action.
resumeTrainingMode()
run(nEpochs, epochLength)
setControllersActive(toDisable, active) | Activate or deactivate controllers.
setDiscountFactor(df) | Set the discount factor.
setEpsilon(e) | Set the epsilon used for \(\epsilon\)-greedy exploration.
setLearningRate(lr) | Set the learning rate for the gradient descent.
startMode(mode, epochLength)
summarizeTestPerformance()
totalRewardOverLastTest() | Returns the average sum of reward per episode.
train()
avgBellmanResidual()
    Returns the average training loss on the epoch.

avgEpisodeVValue()
    Returns the average V value on the episode.

bestAction()
    Returns the best action.

discountFactor()
    Get the discount factor.

epsilon()
    Get the epsilon for \(\epsilon\)-greedy exploration.

learningRate()
    Get the learning rate.

overrideNextAction(action)
    Possibility to override the chosen action. This should be used on the signal OnActionChosen.

setControllersActive(toDisable, active)
    Activate or deactivate controllers.

setDiscountFactor(df)
    Set the discount factor.

setEpsilon(e)
    Set the epsilon used for \(\epsilon\)-greedy exploration.

setLearningRate(lr)
    Set the learning rate for the gradient descent.

totalRewardOverLastTest()
    Returns the average sum of reward per episode.
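For orientation, here is a minimal sketch of wiring an agent together and running it. MyEnv and MyQNetwork are hypothetical stand-ins for concrete Environment and QNetwork implementations (they are not defined on this page), and all hyperparameter values are illustrative; only the NeuralAgent signatures are taken from the documentation above::

    import numpy as np
    from deer.agent import NeuralAgent

    env = MyEnv()            # hypothetical Environment implementation
    qnet = MyQNetwork(env)   # hypothetical QNetwork implementation

    agent = NeuralAgent(env, qnet,
                        replay_memory_size=100000,
                        replay_start_size=1000,
                        batch_size=32,
                        randomState=np.random.RandomState(0))

    agent.setDiscountFactor(0.95)   # documented setters
    agent.setEpsilon(0.1)
    agent.setLearningRate(0.005)

    # Controllers decide when to train and when to test; attach them
    # before running (controller classes are documented elsewhere).
    # agent.attach(some_controller)

    agent.run(10, 500)   # run(nEpochs, epochLength)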
class deer.agent.DataSet(env, randomState=None, maxSize=1000, use_priority=False)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
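To make the circular-buffer idea concrete, here is a self-contained sketch of a uniform replay memory in plain numpy. It is a conceptual illustration only, not deer's implementation: the real DataSet additionally handles multi-subject observations, observation histories and priorities::

    from collections import deque
    import numpy as np

    class TinyReplay:
        """Conceptual circular replay memory (illustration, not deer's DataSet)."""

        def __init__(self, max_size=1000, rng=None):
            # deque(maxlen=...) silently discards the oldest element when
            # full, which is the circular-buffer behaviour described above.
            self.obs = deque(maxlen=max_size)
            self.actions = deque(maxlen=max_size)
            self.rewards = deque(maxlen=max_size)
            self.terminals = deque(maxlen=max_size)
            self.rng = rng or np.random.RandomState()

        def add_sample(self, obs, action, reward, is_terminal):
            self.obs.append(obs)
            self.actions.append(action)
            self.rewards.append(reward)
            self.terminals.append(is_terminal)

        def n_elems(self):
            return len(self.obs)

        def random_batch(self, size):
            # Sample indices < n-1 so that obs[i+1] (the next state) exists.
            idx = self.rng.randint(0, self.n_elems() - 1, size)
            states = np.array([self.obs[i] for i in idx])
            next_states = np.array([
                # Zero out the next state of terminal transitions, mirroring
                # the convention documented for randomBatch below.
                np.zeros_like(self.obs[i]) if self.terminals[i] else self.obs[i + 1]
                for i in idx])
            actions = np.array([self.actions[i] for i in idx])
            rewards = np.array([self.rewards[i] for i in idx])
            terminals = np.array([self.terminals[i] for i in idx])
            return states, actions, rewards, next_states, terminals

    mem = TinyReplay(max_size=100, rng=np.random.RandomState(0))
    for t in range(50):
        mem.add_sample(np.array([float(t)]), action=t % 4,
                       reward=1.0, is_terminal=(t % 10 == 9))
    states, actions, rewards, next_states, terminals = mem.random_batch(8)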
Methods

actions() | Get all actions currently in the replay memory, ordered by the time at which they were taken.
addSample(obs, action, reward, isTerminal, priority) | Store an (observation [for all subjects], action, reward, isTerminal) tuple in the dataset.
nElems() | Get the number of samples in this dataset (i.e. the current replay memory size).
observations() | Get all observations currently in the replay memory, ordered by the time at which they were observed.
randomBatch(size, use_priority) | Return corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions.
rewards() | Get all rewards currently in the replay memory, ordered by the time at which they were received.
terminals() | Get all terminals currently in the replay memory, ordered by the time at which they were observed.
update_priorities(priorities, rndValidIndices)
actions()
    Get all actions currently in the replay memory, ordered by the time at which they were taken.

addSample(obs, action, reward, isTerminal, priority)
    Store an (observation [for all subjects], action, reward, isTerminal) tuple in the dataset.

    Parameters:

    obs : ndarray
        An ndarray(dtype='object') where obs[s] corresponds to the observation made on subject s before the agent took action [action].
    action : int
        The action taken after having observed [obs].
    reward : float
        The reward associated with taking this [action].
    isTerminal : bool
        Tells whether [action] led to a terminal state (i.e. corresponded to a terminal transition).
    priority : float
        The priority to be associated with the sample.
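As a concrete illustration of the obs structure (a sketch; the subjects and shapes are made up), an object array holding one observation per subject can be built as follows::

    import numpy as np

    # One entry per subject; shapes may differ across subjects, hence dtype='object'.
    obs = np.empty(2, dtype='object')
    obs[0] = np.array([0.1, -0.3], dtype='float32')   # subject 0: a 1-D sensor reading
    obs[1] = np.zeros((8, 8), dtype='float32')        # subject 1: an image-like observation

    # With a DataSet instance `dataset` (its construction requires an Environment):
    # dataset.addSample(obs, action=0, reward=0.0, isTerminal=False, priority=1)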
nElems()
    Get the number of samples in this dataset (i.e. the current replay memory size).

observations()
    Get all observations currently in the replay memory, ordered by the time at which they were observed. observations[s][i] corresponds to the observation made on subject s before the agent took actions()[i].
randomBatch(size, use_priority)
    Return corresponding states, actions, rewards, terminal status, and next_states for size randomly chosen transitions. Note that if terminals[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each subject s.

    Parameters:

    size : int
        Number of transitions to return.

    Returns:

    states : ndarray
        An ndarray(size=number_of_subjects, dtype='object') where states[s] is a 2+D matrix of dimensions size x s.memorySize x "shape of a given observation for this subject". States are taken randomly from the data with the only constraint that they are complete regarding the history of each observed subject.
    actions : ndarray
        An ndarray(size=number_of_subjects, dtype='int32') where actions[i] is the action taken after having observed states[:][i].
    rewards : ndarray
        An ndarray(size=number_of_subjects, dtype='float32') where rewards[i] is the reward obtained for taking actions[i-1].
    next_states : ndarray
        Same structure as states, but next_states[s][i] is guaranteed to be the information concerning the state following the one described by states[s][i] for each subject s.
    terminals : ndarray
        An ndarray(size=number_of_subjects, dtype='bool') where terminals[i] is True if actions[i] led to a terminal state and False otherwise.
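When use_priority is true, transitions with larger priorities should be replayed more often. A runnable sketch of that selection rule in plain numpy (following Schaul et al. (2016); deer's internal bookkeeping may differ)::

    import numpy as np

    rng = np.random.RandomState(0)

    # Per-transition priorities, e.g. absolute Bellman residuals.
    priorities = np.array([0.1, 2.0, 0.5, 1.2])
    alpha = 0.6   # plays the role of exp_priority; alpha = 0 is uniform sampling

    probs = priorities ** alpha
    probs /= probs.sum()

    # Indices of the transitions chosen for a minibatch of size 3.
    batch_indices = rng.choice(len(priorities), size=3, p=probs)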
rewards()
    Get all rewards currently in the replay memory, ordered by the time at which they were received.

terminals()
    Get all terminals currently in the replay memory, ordered by the time at which they were observed. terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.
update_priorities(priorities, rndValidIndices)