Welcome to DeeR’s documentation!

DeeR (Deep Reinforcement) is a Python library for training an agent to behave in a given environment so as to maximize a cumulative sum of rewards (see What is deep reinforcement learning?).

Here are key advantages of the library:

  • You have access within a single library to techniques such as double Q-learning, prioritized experience replay, deep deterministic policy gradient (DDPG), Combined Reinforcement learning via Abstract Representations (CRAR), etc.
  • This package provides a general framework where observations are made up of any number of elements (scalars, vectors or frames).
  • You can easily add a validation phase that allows stopping the training process before overfitting. This is useful when the environment depends on scarce data (e.g. limited time series).

In addition, the framework is designed so that it is easy to

  • build any environment
  • modify any part of the learning process
  • use your favorite Python-based framework to code your own learning algorithm or neural network architecture. The provided learning algorithms and neural network architectures are based on Keras.
Figure: General schema of the different elements available in DeeR.

It is a work in progress and input is welcome. Please submit any contribution via pull request.

What is new

Version 0.4

  • Integration of CRAR, which allows combining the model-free and model-based approaches via abstract representations.
  • Expanded documentation; some interfaces have been updated.

Version 0.3

  • Integration of different exploration/exploitation policies and the possibility to easily build your own.
  • Integration of DDPG for continuous action spaces (see actor-critic)
  • The naming convention for this project and some interfaces have been updated. This may break backward compatibility. In that case, adapt your code to the new convention by looking at the API in this documentation or at the current version of the examples.
  • Additional automated tests

Version 0.2

  • Standalone python package (you can simply do pip install deer)
  • Integration of new example environments: toy_env_pendulum, the PLE environment and the Gym environment
  • Double Q-learning and prioritized Experience Replay
  • Augmented documentation
  • First automated tests

Future extensions:

  • Several agents interacting in the same environment

How should I cite DeeR?

Please cite DeeR in your publications if you use it in your research. Here is an example BibTeX entry:

@misc{franccoislavet2016deer,
title={DeeR},
author={Fran\c{c}ois-Lavet, Vincent and others},
year={2016},
howpublished={\url{https://deer.readthedocs.io/}},
}

User Guide

Installation

Dependencies

This framework is tested to work under Python 3.6.

The required dependencies are NumPy >= 1.10 and joblib >= 0.9. You also need Keras, or you can write your own learning algorithms using your favorite deep learning framework.

For running some of the examples, Matplotlib >= 1.1.1 is required. Some examples also require specific dependencies (e.g. for the Atari games, you need to install ALE >= 0.4).

We recommend using the bleeding-edge version and installing it by following the Developer install instructions. If you want a simpler installation procedure and do not intend to modify the learning algorithms yourself, you can follow the User install instructions.

Developer install instructions

As a developer, you can set yourself up with the bleeding-edge version of DeeR with:

git clone -b master https://github.com/VinF/deer.git

Assuming you already have a Python environment with pip, you can automatically install all the dependencies (except the specific dependencies that you may need for some examples) with:

pip install -r requirements.txt

You can then install the framework as a package in develop mode, so that you can make modifications and test them without having to re-install the package:

python setup.py develop

User install instructions

You can install the framework with pip:

pip install deer

For the bleeding edge version (recommended), you can simply use

pip install git+git://github.com/VINF/deer.git@master

Tutorial

What is deep reinforcement learning?

Deep reinforcement learning is the combination of two fields:

  • Reinforcement learning (RL) is a theory that allows an agent to learn a strategy so as to maximize a cumulative sum of (delayed) rewards from any given environment. If you are not familiar with RL, you can get up to speed easily with the book by Sutton and Barto.
  • Deep learning is a branch of machine learning for regression and classification. It is particularly well suited to model high-level abstractions in data by using multiple processing layers composed of multiple non-linear transformations.

This combination makes it possible to learn complex tasks, such as playing ATARI games from high-dimensional sensory inputs. For more information, you can refer to one of the main papers in the domain: "Human-level control through deep reinforcement learning".

How can I get started?

First, make sure you have installed the package properly by following the steps described in Installation.

The general idea of this framework is that you need to instantiate an agent (along with a learning algorithm) and an environment. In order to perform an experiment, you also need to attach to the agent some controllers for controlling the training and the various parameters of your agent.

An environment has to be built specifically for each task, while learning algorithms (such as Q-networks) and many controllers are provided within this package.

The best way to get started is to have a look at the Examples, in particular the first two environments, which are simple to understand.

If you find something that is not yet implemented and if you wish to contribute, you can check the section Development.

Any Question?

You can raise questions about the DeeR project on GitHub: https://github.com/VinF/deer/issues

Examples

You can find these examples at the root of the package. For each example at least two files are provided:

  • A launcher file (whose name usually starts with run_).
  • An environment file (whose name usually ends with _env).

The launcher file performs different actions:

  • It instantiates the environment and the agent along with a learning algorithm (such as a q-network).
  • It binds controllers to the agent.
  • It finally runs the experiment.

You can get started with the following examples:

Toy environment with time series

Description of the environment

This environment simulates the possibility of buying or selling a good. The agent can hold either one unit or zero units of that good. At each transaction with the market, the agent obtains a reward equal to the price of the good when selling it, and the opposite when buying. In addition, a penalty of 0.5 (negative reward) is added for each transaction.

The price pattern is made by repeating the following signal plus a random constant between 0 and 3:

Toy example price pattern

You can see how this environment is built by looking into the file Toy_env.py in examples/toy_env/. It is important to note that every environment must derive from the base class Environment; you can refer to it in order to understand the required methods and their usage.
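
For illustration, here is a hypothetical sketch of how such a price signal could be generated; the base pattern, its length and the way the random constant is applied are made up for the example and do not necessarily match the implementation in Toy_env.py.

import numpy as np

rng = np.random.RandomState(123456)

# Hypothetical repeating base pattern (the actual one is defined in Toy_env.py).
base_pattern = np.array([1., 2., 3., 4., 3., 2., 1., 0., 1., 2.])

def price_signal(n_steps):
    """Concatenate copies of the base pattern, each shifted by a random constant in [0, 3)."""
    segments = []
    while sum(len(s) for s in segments) < n_steps:
        segments.append(base_pattern + rng.uniform(0, 3))
    return np.concatenate(segments)[:n_steps]

prices = price_signal(100)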

How to run

A minimalist way of running this example can be found in the file run_toy_env_simple.py in examples/toy_env/.

  • First, we need to import the agent, the Q-network, the environment and some controllers:

import numpy as np

from deer.agent import NeuralAgent
from deer.learning_algos.q_net_keras import MyQNetwork
from Toy_env import MyEnv as Toy_env
import deer.experiment.base_controllers as bc

  • Then we instantiate the different elements as follows:
rng = np.random.RandomState(123456)

# --- Instantiate environment ---
env = Toy_env(rng)

# --- Instantiate qnetwork ---
qnetwork = MyQNetwork(
    environment=env,
    random_state=rng)

# --- Instantiate agent ---
agent = NeuralAgent(
    env,
    qnetwork,
    random_state=rng)

# --- Bind controllers to the agent ---
# Before every training epoch, we want to print a summary of the agent's epsilon, discount and 
# learning rate as well as the training epoch number.
agent.attach(bc.VerboseController())

# During training epochs, we want to train the agent after every action it takes.
# Plus, we also want to display after each training episode (not after every training step) the average Bellman
# residual and the average of the V values obtained during the last episode.
agent.attach(bc.TrainerController())

# All previous controllers control the agent during the epochs it goes through. However, we want to interleave a 
# "test epoch" between each training epoch. We do not want these test epoch to interfere with the training of the 
# agent. Therefore, we will disable these controllers for the whole duration of the test epochs interleaved this 
# way, using the controllersToDisable argument of the InterleavedTestEpochController. The value of this argument 
# is a list of the indexes of all controllers to disable, their index reflecting in which order they were added.
agent.attach(bc.InterleavedTestEpochController(
    epoch_length=500, 
    controllers_to_disable=[0, 1]))
    
# --- Run the experiment ---
agent.run(n_epochs=100, epoch_length=1000)

Results

Navigate to the folder examples/toy_env/ in a terminal window. The example can then be run by using

python run_toy_env_simple.py

You can also use the full version of the launcher, which specifies the hyperparameters for better performance.

python run_toy_env.py

Every 10 epochs, a graph is saved in the ‘toy_env’ folder. At the end of the training, you can then observe this kind of behaviour for the test policy:

Toy example policy

In this graph, you can see that the agent has successfully learned to take advantage of the price pattern: it buys when the price is low and sells when it is high. This example is of course easy because the pattern is very systematic, which allows the agent to learn it reliably. It is important to note that the results shown are obtained on a validation set that is different from the training set, and we can see that the learning generalizes well. For instance, buying at time steps 7 and 16 is the expected behaviour because, on average, this yields a profit, given that the agent has no information about the future.

Using convolutions vs LSTMs

So far, the neural network was built using a convolutional architecture as follows:

Convolutions architecture

The neural network processes time series thanks to a set of convolutional layers. The output of the convolutions, as well as the other inputs, is followed by fully connected layers and the output layer.

When working with deep reinforcement learning, it is also possible to work with LSTMs (see for instance this introduction to LSTMs).

If you want to use an LSTM architecture, you can import the following:

from deer.learning_algos.NN_keras_LSTM import NN as NN_keras

and then instantiate the Q-network by specifying the ‘neural_network’ argument as follows:

qnetwork = MyQNetwork(
    env,
    neural_network=NN_keras)

Gym environment

Some examples are also provided with the Gym environment.

Here is the resulting policy for the mountain car example:

http://vincent.francois-l.be/img_GeneralDeepQRL/gym_mountain_car.gif

Here is the resulting policy for the pendulum example:

http://vincent.francois-l.be/img_GeneralDeepQRL/gym_pendulum.gif

Two storage devices environment

Description of the environment

This example simulates the operation of a realistic microgrid (such as a smart home, for instance) that is not connected to the main utility grid (off-grid) and that is equipped with PV panels, batteries and hydrogen storage. The battery has the advantage that it is not limited in the instantaneous power that it can provide or store. The hydrogen storage has the advantage that it can store a very large quantity of energy.

python run_MG_two_storage_devices

This example uses the environment defined in MG_two_storage_devices_env.py. The agent can choose either to store energy in the long-term storage or to take energy out of it, while the short-term storage handles the lack or surplus of energy as well as possible by discharging or charging itself, respectively. Whenever the short-term storage is empty and cannot meet the net demand, a penalty (negative reward) is incurred, equal to the value of loss load, set to 2 euro/kWh.

The state of the agent is made up of a history of two to four punctual observations:

  • Charging state of the short term storage (0 is empty, 1 is full)
  • Production and consumption (0 is no production or consumption, 1 is maximal production or consumption)
  • (Distance to equinox)
  • (Predictions of future production: the average production over the next 24 hours and over the next 48 hours)

Two actions are possible for the agent:

  • Action 0 corresponds to discharging the long-term storage
  • Action 1 corresponds to charging the long-term storage
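
To make the reward mechanism more concrete, here is a strongly simplified, hypothetical sketch of one dispatch step. The function and variable names, the fixed 1 kWh hydrogen exchange and the battery model are invented for illustration and do not reproduce MG_two_storage_devices_env.py; only the 2 euro/kWh loss-load penalty comes from the description above.

def dispatch_step(net_demand_kwh, battery_kwh, battery_capacity_kwh, action):
    """One simplified time step of the off-grid dispatch (illustrative only)."""
    # Action 0 discharges the long-term (hydrogen) storage, action 1 charges it;
    # here the exchanged energy is arbitrarily fixed to 1 kWh.
    h2_to_grid_kwh = 1.0 if action == 0 else -1.0
    residual_kwh = net_demand_kwh - h2_to_grid_kwh

    # The short-term storage (battery) absorbs or provides the residual as best it can.
    battery_kwh -= residual_kwh
    lost_load_kwh = 0.0
    if battery_kwh < 0.0:            # battery empty: part of the demand is not supplied
        lost_load_kwh = -battery_kwh
        battery_kwh = 0.0
    battery_kwh = min(battery_kwh, battery_capacity_kwh)   # surplus beyond capacity is curtailed

    reward = -2.0 * lost_load_kwh    # penalty of 2 euro per kWh of load not supplied
    return battery_kwh, reward

# Example: 3 kWh of net demand, a battery holding 1 kWh, and the agent discharging hydrogen.
new_battery_kwh, reward = dispatch_step(3.0, 1.0, 15.0, action=0)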
More information can be found in: Deep Reinforcement Learning Solutions for Energy Microgrids Management, Vincent François-Lavet, David Taralla, Damien Ernst, Raphael Fonteneau.
Annex to the paper
PV production and consumption profiles

Solar irradiance varies throughout the year depending on the season, and it also varies throughout the day depending on the weather and the position of the sun in the sky relative to the PV panels. The main distinction between these profiles is the difference between summer and winter PV production. In particular, production varies by a factor of 1:5 between winter and summer, as can be seen from the measurements of PV panel production for a residential customer located in Belgium in the figures below.

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSMonths_be.png

Total energy produced per month

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSTime_1janv_be.png

Typical production in winter

http://vincent.francois-l.be/img_GeneralDeepQRL/ProductionVSTime_1july_be.png

Typical production in summer

A simple residential consumption profile is considered with a daily average consumption of 18kWh (see figure below).

http://vincent.francois-l.be/img_GeneralDeepQRL/ConsumptionVSTime_random.png

Representative residential consumption profile

Main microgrid parameters

Data used for the PV panels:
  • Cost \(c^{PV}\): \(1 euro/W_p\)
  • Efficiency \(\eta^{PV}\): \(18 \%\)
  • Lifetime \(L^{PV}\): \(20 years\)

Data used for the \(LiFePO_4\) battery:
  • Cost \(c^B\): \(500 euro/kWh\)
  • Discharge efficiency \(\eta_0^B\): \(90\%\)
  • Charge efficiency \(\zeta_0^B\): \(90\%\)
  • Maximum instantaneous power \(P^B\): \(> 10kW\)
  • Lifetime \(L^{B}\): \(20 years\)

Data used for the hydrogen storage device:
  • Cost \(c^{H_2}\): \(14 euro/W_p\)
  • Discharge efficiency \(\eta_0^{H_2}\): \(65\%\)
  • Charge efficiency \(\zeta_0^{H_2}\): \(65\%\)
  • Lifetime \(L^{H_2}\): \(20 years\)

Data used for the reward function:
  • Cost endured per kWh not supplied within the microgrid \(k\): \(2 euro/kWh\)
  • Revenue/cost per kWh of hydrogen produced/used \(k^{H_2}\): \(0.1 euro/kWh\)

Tasks with planning

You can find the following environments, which demonstrate the possibilities of combining the model-based and model-free approaches: simple examples, and how to solve any maze taken from a distribution.

ALE environment

This environment is an interface with the ALE environment that simulates any ATARI game.

Related paper: Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529-533. (Hyper-parameter tuning is necessary if you want to replicate comparable performance.)

Development

DeeR is a work in progress and contributions are welcome via pull request.

For more information, you can check out this link: Contributing to an open source project on GitHub.

You should also make sure that you install the repository appropriately for development (see Developer install instructions).

Guidelines for this project

Here are a few guidelines for this project.

  • Simplicity: Be easy to use but also easy to understand when one digs into the code. Any additional code should be justified by the usefulness of the feature.
  • Modularity: The user should be able to easily use their own code with any part of the DeeR framework (with the possible exception of the core of agent.py, which is coded in a very general way).

These guidelines come of course in addition to all good practices for open source development.

Naming convention for this project

  • All classes and methods use medial capitalization (camel case) for word boundaries: classes are written in UpperCamelCase and methods in lowerCamelCase. Example: “two words” is rendered as “TwoWords” for a class and “twoWords” for a method.
  • All attributes and variables have words separated by underscores. Example: “two words” is rendered as “two_words”
  • If a variable is intended to be ‘private’, it is prefixed by an underscore.
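
For example, these conventions translate into code as follows (the class and member names below are purely illustrative):

class ReplayBuffer:                    # class name: UpperCamelCase
    def __init__(self, max_size):
        self._max_size = max_size      # 'private' attribute: leading underscore, words separated by underscores
        self.stored_samples = []       # public attribute: words separated by underscores

    def addSample(self, new_sample):   # method name: lowerCamelCase
        if len(self.stored_samples) >= self._max_size:
            self.stored_samples.pop(0)
        self.stored_samples.append(new_sample)   # argument and local names: words separated by underscores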

API reference

If you are looking for information on a specific function, class or method, this API is for you.

Agent

This module contains classes used to define the standard behavior of the agent. It relies on the controllers, the chosen training/test policy and the learning algorithm to specify its behavior in the environment.

NeuralAgent(environment, learning_algo[, …]) The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.
DataSet(env[, random_state, max_size, …]) A replay memory consisting of circular buffers for observations, actions, rewards and terminals.
class deer.agent.NeuralAgent(environment, learning_algo, replay_memory_size=1000000, replay_start_size=None, batch_size=32, random_state=<mtrand.RandomState object>, exp_priority=0, train_policy=None, test_policy=None, only_full_history=True)

The NeuralAgent class wraps a learning algorithm (such as a deep Q-network) for training and testing in a given environment.

Attach controllers to it in order to conduct an experiment (when to train the agent, when to test,…).

environment : object from class Environment
The environment in which the agent interacts
learning_algo : object from class LearningAlgo
The learning algorithm associated to the agent
replay_memory_size : int
Size of the replay memory. Default : 1000000
replay_start_size : int
Number of observations (=number of time steps taken) in the replay memory before starting learning. Default: minimum possible according to environment.inputDimensions().
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
random_state : numpy random number generator
Default : random seed.
exp_priority : float
The exponent that determines how much prioritization is used, default is 0 (uniform priority). One may check out Schaul et al. (2016) - Prioritized Experience Replay.
train_policy : object from class Policy
Policy followed when in training mode (mode -1)
test_policy : object from class Policy
Policy followed when in other modes than training (validation and test modes)
only_full_history : boolean
Whether we wish to train the neural network only on full histories, or to pad with zeroes the observations before the beginning of the episode
avgBellmanResidual()

Returns the average training loss on the epoch

avgEpisodeVValue()

Returns the average V value on the episode (on time steps where a non-random action has been taken)

discountFactor()

Get the discount factor

dumpNetwork(fname, nEpoch=-1)

Dump the network

fname : string
Name of the file where the network will be dumped
nEpoch : int
Epoch number (Optional)
learningRate()

Get the learning rate

overrideNextAction(action)

Possibility to override the chosen action. This possibility should be used on the signal OnActionChosen.

run(n_epochs, epoch_length)

This function encapsulates the whole learning process. It starts by calling the controllers’ “onStart” method. It then runs a given number of epochs, where an epoch is made up of one or many episodes (run with agent._runEpisode) and ends once the number of steps reaches the argument “epoch_length”. It finishes by calling the controllers’ “onEnd” method.

n_epochs : int
number of epochs
epoch_length : int
maximum number of steps for a given epoch
setControllersActive(toDisable, active)

Activate controller

setDiscountFactor(df)

Set the discount factor

setLearningRate(lr)

Set the learning rate for the gradient descent

setNetwork(fname, nEpoch=-1)

Set values into the network

fname : string
Name of the file where the values are
nEpoch : int
Epoch number (Optional)
totalRewardOverLastTest()

Returns the average sum of rewards per episode and the number of episodes

train()

This function selects a random batch of data (with self._dataset.randomBatch) and performs a Q-learning iteration (with self._learning_algo.train).
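
To tie the parameters and methods above together, here is a hedged usage sketch. MyEnv and its import path stand for any Environment subclass of your own (hypothetical); the remaining calls only use constructor arguments and methods documented in this section and in the toy example.

import numpy as np

import deer.experiment.base_controllers as bc
from deer.agent import NeuralAgent
from deer.learning_algos.q_net_keras import MyQNetwork

from my_env import MyEnv  # hypothetical: your own Environment subclass

rng = np.random.RandomState(0)
env = MyEnv(rng)

learning_algo = MyQNetwork(environment=env, random_state=rng)

agent = NeuralAgent(
    env,
    learning_algo,
    replay_memory_size=100000,   # smaller replay memory than the 1000000 default
    batch_size=32,               # tuples per gradient descent iteration
    exp_priority=0,              # 0 = uniform sampling (no prioritized replay)
    random_state=rng)

agent.attach(bc.VerboseController())
agent.attach(bc.TrainerController())
agent.run(n_epochs=10, epoch_length=1000)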

class deer.agent.DataSet(env, random_state=None, max_size=1000000, use_priority=False, only_full_history=True)

A replay memory consisting of circular buffers for observations, actions, rewards and terminals.

actions()

Get all actions currently in the replay memory, ordered by time where they were taken.

addSample(obs, action, reward, is_terminal, priority)

Store the punctual observations, action, reward, is_terminal and priority in the dataset.

obs : ndarray
An ndarray(dtype=’object’) where obs[s] corresponds to the punctual observation s before the agent took action [action].
action : int
The action taken after having observed [obs].
reward : float
The reward associated to taking this [action].
is_terminal : bool
Tells whether [action] led to a terminal state (i.e. corresponded to a terminal transition).
priority : float
The priority to be associated with the sample
observations()

Get all observations currently in the replay memory, ordered by time where they were observed.

randomBatch(batch_size, use_priority)

Returns a batch of states, actions, rewards, terminal status, and next_states for a number batch_size of randomly chosen transitions. Note that if terminal[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.

batch_size : int
Number of transitions to return.
use_priority : Boolean
Whether to use prioritized replay or not
states : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]). States are taken randomly in the data with the only constraint that they are complete regarding the history size for each observation.
actions : numpy array of integers [batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards : numpy array of floats [batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]).
terminals : numpy array of booleans [batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise
SliceError
If a batch of this batch_size could not be built based on current data set (not enough data or all trajectories are too short).
randomBatch_nstep(batch_size, nstep, use_priority)

Return corresponding states, actions, rewards, terminal status, and next_states for a number batch_size of randomly chosen transitions. Note that if terminal[i] == True, then next_states[s][i] == np.zeros_like(states[s][i]) for each s.

batch_size : int
Number of transitions to return.
nstep : int
Number of transitions to be considered for each element
use_priority : Boolean
Whether to use prioritized replay or not
states : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * (history size+nstep-1) * size of punctual observation (which is 2D,1D or scalar)]). States are taken randomly in the data with the only constraint that they are complete regarding the history size for each observation.
actions : numpy array of integers [batch_size, nstep]
actions[i] is the action taken after having observed states[:][i].
rewards : numpy array of floats [batch_size, nstep]
rewards[i] is the reward obtained for taking actions[i-1].
next_states : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * (history size+nstep-1) * size of punctual observation (which is 2D,1D or scalar)]).
terminals : numpy array of booleans [batch_size, nstep]
terminals[i] is True if the transition leads to a terminal state and False otherwise
SliceError
If a batch of this size could not be built based on current data set (not enough data or all trajectories are too short).
rewards()

Get all rewards currently in the replay memory, ordered by time where they were received.

terminals()

Get all terminals currently in the replay memory, ordered by time where they were observed.

terminals[i] is True if actions()[i] led to a terminal state (i.e. corresponded to a terminal transition), and False otherwise.

updatePriorities(priorities, rndValidIndices)
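
The replay memory can also be used on its own. Below is a hedged sketch based only on the signatures documented above; the environment is the toy environment from the Examples section and the stored transitions are dummy ones.

import numpy as np

from deer.agent import DataSet
from Toy_env import MyEnv as Toy_env   # any Environment subclass works here

rng = np.random.RandomState(0)
env = Toy_env(rng)
dataset = DataSet(env, random_state=rng, max_size=10000)

# Store a few dummy transitions: the punctual observations returned by
# env.observe(), plus an action, a reward, the terminal flag and a priority.
for _ in range(200):
    obs = env.observe()
    dataset.addSample(obs, 0, 0.0, False, 1.0)

# Sample a random mini-batch of 32 transitions (uniformly, since use_priority is False).
# See the return description of randomBatch() above for the content of the batch.
batch = dataset.randomBatch(32, use_priority=False)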

Controller

This file defines the base Controller class and some preset controllers that you can use for controlling the training and the various parameters of your agents.

Controllers can be attached to an agent using the agent’s attach(Controller) method. The order in which controllers are attached matters. Indeed, if controllers C1, C2 and C3 were attached in this order and C1 and C3 both listen to the onEpisodeEnd signal, the onEpisodeEnd() method of C1 will be called before the onEpisodeEnd() method of C3, whenever an episode ends.
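
For instance, with the attachment order below (mirroring the toy example), VerboseController gets index 0 and TrainerController index 1, and those indexes are what you pass to controllers_to_disable:

# 'agent' is assumed to be a NeuralAgent instance, created as in the Examples section.
import deer.experiment.base_controllers as bc

agent.attach(bc.VerboseController())             # index 0: attached first
agent.attach(bc.TrainerController())             # index 1: attached second
agent.attach(bc.InterleavedTestEpochController(  # index 2
    epoch_length=500,
    controllers_to_disable=[0, 1]))              # disable controllers 0 and 1 during test epochs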

Controller() A base controller that does nothing when receiving the various signals emitted by an agent.
LearningRateController([…]) A controller that modifies the learning rate periodically upon epochs end.
EpsilonController([initial_e, e_decays, …]) A controller that modifies the probability “epsilon” of taking a random action periodically.
DiscountFactorController([…]) A controller that modifies the q-network discount periodically.
TrainerController([evaluate_on, …]) A controller that makes the agent train on its current database periodically.
InterleavedTestEpochController([id, …]) A controller that interleaves a test epoch between training epochs of the agent.
FindBestController([validationID, testID, …]) A controller that finds the neural net performing best in validation mode.
class deer.experiment.base_controllers.Controller

A base controller that does nothing when receiving the various signals emitted by an agent. This class should be the base class of any controller you would want to define.

onActionChosen(agent, action)

Called whenever the agent has chosen an action.

This occurs after the agent state was updated with the new observation it made, but before it applied this action on the environment and before the total reward is updated.

onActionTaken(agent)

Called whenever the agent has taken an action on its environment.

This occurs after the agent applied this action on the environment and before terminality is evaluated. This is called only once, even in the case where the agent skips frames by taking the same action multiple times. In other words, this occurs just before the next observation of the environment.

onEnd(agent)

Called when the agent has finished processing all its epochs, just before returning from its run() method.

onEpisodeEnd(agent, terminal_reached, reward)

Called whenever the agent ends an episode, just after this episode ended and before any onEpochEnd() signal could be sent.

agent : NeuralAgent
The agent firing the event
terminal_reached : bool
Whether the episode ended because a terminal transition occurred. This could be False if the episode was stopped because its step budget was exhausted.
reward : float
The reward obtained on the last transition performed in this episode.
onEpochEnd(agent)

Called whenever the agent ends an epoch, just after the last episode of this epoch was ended and after any onEpisodeEnd() signal was processed.

agent : NeuralAgent
The agent firing the event
onStart(agent)

Called when the agent is going to start working (before anything else).

This corresponds to the moment where the agent’s run() method is called.

agent : NeuralAgent
The agent firing the event
setActive(active)

Activate or deactivate this controller.

A controller should not react to any signal it receives as long as it is deactivated. For instance, if a controller maintains a counter on how many episodes it has seen, this counter should not be updated when this controller is disabled.
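
As an example, here is a minimal sketch of a custom controller that reacts to the onEpochEnd signal described above. The counter logic is purely illustrative, and the _active attribute is assumed to be the flag toggled by setActive(), as in the provided controllers.

import deer.experiment.base_controllers as bc


class EpochCounterController(bc.Controller):
    """Illustrative controller that simply counts the epochs it has seen."""

    def __init__(self):
        super(EpochCounterController, self).__init__()
        self._epoch_count = 0

    def onEpochEnd(self, agent):
        if not self._active:          # do nothing while deactivated (see setActive above)
            return
        self._epoch_count += 1
        print("Epochs completed so far: {}".format(self._epoch_count))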

class deer.experiment.base_controllers.LearningRateController(initial_learning_rate=0.005, learning_rate_decay=1.0, periodicity=1)

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the learning rate periodically upon epochs end.

initial_learning_rate : float
The learning rate upon agent start
learning_rate_decay : float
The factor by which the previous learning rate is multiplied every [periodicity] epochs.
periodicity : int
How many epochs are necessary before an update of the learning rate occurs
class deer.experiment.base_controllers.EpsilonController(initial_e=1.0, e_decays=10000, e_min=0.1, evaluate_on='action', periodicity=1, reset_every='none')

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the probability “epsilon” of taking a random action periodically.

initial_e : float
Start epsilon
e_decays : int
How many updates are necessary for epsilon to reach eMin
e_min : float
End epsilon
evaluate_on : str
After what type of event epsilon should be updated periodically. Possible values: ‘action’, ‘episode’, ‘epoch’.
periodicity : int
How many [evaluateOn] are necessary before an update of epsilon occurs
reset_every : str
After what type of event epsilon should be reset to its initial value. Possible values: ‘none’, ‘episode’, ‘epoch’.
class deer.experiment.base_controllers.DiscountFactorController(initial_discount_factor=0.9, discount_factor_growth=1.0, discount_factor_max=0.99, periodicity=1)

Bases: deer.experiment.base_controllers.Controller

A controller that modifies the q-network discount periodically. More informations in : Francois-Lavet Vincent et al. (2015) - How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies (http://arxiv.org/abs/1512.02011).

initial_discount_factor : float
Start discount
discount_factor_growth : float
The factor by which the previous discount is multiplied every [periodicity] epochs.
discount_factor_max : float
Maximum reachable discount
periodicity : int
How many training epochs are necessary before an update of the discount occurs
class deer.experiment.base_controllers.TrainerController(evaluate_on='action', periodicity=1, show_episode_avg_V_value=True, show_avg_Bellman_residual=True)

Bases: deer.experiment.base_controllers.Controller

A controller that makes the agent train on its current database periodically.

evaluate_on : str
After what type of event the agent should be trained periodically. Possible values: ‘action’, ‘episode’, ‘epoch’. The first training will occur after the first occurrence of [evaluateOn].
periodicity : int
How many [evaluateOn] are necessary before a training occurs
show_avg_Bellman_residual : bool
Whether to show an informative message after each episode end (and after a training if [evaluateOn] is ‘episode’) about the average Bellman residual of this episode
show_episode_avg_V_value : bool
Whether to show an informative message after each episode end (and after a training if [evaluateOn] is ‘episode’) about the average V value of this episode
class deer.experiment.base_controllers.InterleavedTestEpochController(id=0, epoch_length=500, controllers_to_disable=[], periodicity=2, show_score=True, summarize_every=10)

Bases: deer.experiment.base_controllers.Controller

A controller that interleaves a test epoch between training epochs of the agent.

id : int
The identifier (>= 0) of the mode each test epoch triggered by this controller will belong to. Can be used to discriminate between datasets in your Environment subclass (this is the argument that will be given to your environment’s reset() method when starting the test epoch).
epoch_length : float
The total number of transitions that will occur during a test epoch. This means that this epoch could feature several episodes if a terminal transition is reached before this budget is exhausted.
controllers_to_disable : list of int
A list of controllers to disable when this controller wants to start a test epoch. These same controllers will be reactivated after this controller has finished dealing with its test epoch.
periodicity : int
How many epochs are necessary before a test epoch is run (this controller’s test epochs included: “1 test epoch every [periodicity] epochs”). Minimum value: 2.
show_score : bool
Whether to print an informative message on stdout at the end of each test epoch, about the total reward obtained in the course of the test epoch.
summarize_every : int
How many of this controller’s test epochs are necessary before the attached agent’s summarizeTestPerformance() method is called. Give a value <= 0 for “never”. If > 0, the first call will occur just after the first test epoch.
class deer.experiment.base_controllers.FindBestController(validationID=0, testID=None, unique_fname='nnet')

Bases: deer.experiment.base_controllers.Controller

A controller that finds the neural net performing best in validation mode (i.e. for mode = [validationID]) and computes the associated generalization score in test mode (i.e. for mode = [testID], and this only if [testID] is different from None). This controller should never be disabled by InterleavedTestEpochControllers, as it is meant to work in conjunction with them.

At each epoch end where this controller is active, it will look at the current mode the agent is in.

If the mode matches [validationID], it will take the total reward of the agent on this epoch and compare it to its current best score. If it is better, it will ask the agent to dump its current nnet on disk and update its current best score. In all cases, it saves the validation score obtained in a vector.

If the mode matches [testID], it saves the test (= generalization) score in another vector. Note that if [testID] is None, no test mode scores are ever recorded.

At the end of the experiment (onEnd), if active, this controller will print information about the epoch at which the best neural net was found, together with its generalization score, this last piece of information being shown only if [testID] is different from None. Finally, it will dump a dictionary containing the data of the plots ({n: number of epochs elapsed, ts: test scores, vs: validation scores}). Note that if [testID] is None, the value dumped for the ‘ts’ key is [].

validationID : int
See synopsis
testID : int
See synopsis
unique_fname : str
A unique filename (basename for score and network dumps).

Environment

This module defines the base class for the environments.

class deer.base_classes.Environment

All your Environment classes should inherit this interface.

The environment defines the dynamics and the reward signal that the agent observes when interacting with it.

An agent sees at any time step a collection of observable elements from the environment. Observing the environment at time t thus corresponds to obtaining a punctual observation for each of these elements. Depending on the control problem to solve, it might be useful for the agent to take action based not only on the current punctual observations but on a collection of the last punctual observations. In this framework, it is the environment that defines, for each observable element, how many of the last punctual observations are to be considered.

Different “modes” are used in this framework to allow the environment to have different dynamics and/or reward signal. For instance, in training mode, only a part of the dynamics may be available so that it is possible to see how well the agent generalizes to a slightly different one.

act(action)

Applies the agent action [action] on the environment.

action : int
The action selected by the agent to operate on the environment. It should be an identifier between 0 (included) and nActions() (excluded).
end()

Optional hook called at the end of all epochs

inTerminalState()

Tells whether the environment reached a terminal state after the last transition (i.e. whether the last transition that occurred was terminal).

As the majority of the control tasks considered have no end (a continuous control has to be operated), this returns False by default. But in the context of a video game, for instance, terminal states can happen, and in these cases this method should be overridden.

isTerminal : bool
Whether or not the current state is terminal
inputDimensions()

Gets the shape of the input space for this environment.

This returns a list whose length is the number of observations in the environment. Each element of the list is a tuple: the first integer is always the history size considered for this observation and the rest describes the shape of the observation at a given time step. For instance:

  • () or (1,) means each observation at a given time step is a single scalar and the history size is 1 (= only the current observation)
  • (N,) means each observation at a given time step is a single scalar and the history size is N
  • (N, M) means each observation at a given time step is a vector of length M and the history size is N
  • (N, M1, M2) means each observation at a given time step is a 2D matrix with M1 rows and M2 columns and the history size is N

nActions()

Gets the number of different actions that can be taken on this environment. It can be either an integer in the case of a finite discrete number of actions or it can be a list of couples [min_action_value,max_action_value] for a continuous action space

observationType(subject)

Gets the most inner type (np.uint8, np.float32, …) of [subject].

subject : int
The subject
observe()

Gets a list of punctual observations composing this environment.

This returns a list where element i is a punctual observation. Note that the history of observations is not returned and only the current observation is.

See the documentation of inputDimensions() for more information about the shape of the observations.

reset(mode)

Resets the environment and put it in mode [mode]. This function is called when beginning every new episode.

The [mode] can be used to discriminate for instance between an agent which is training or trying to get a validation or generalization score. The mode the environment is in should always be redefined by resetting the environment using this method, meaning that the mode should be preserved until the next call to reset().

mode : int
The mode to put the environment into. Mode “-1” is reserved and always means “training”.

Initial pseudo-state at the beginning of the new episode: a list (of lists) with sizes given by inputDimensions()

summarizePerformance(test_data_set, *args, **kwargs)

Optional hook that can be used to show a summary of the performance of the agent on the environment in the current mode.

test_data_set : agent.DataSet
The dataset maintained by the agent in the current mode, which contains observations, actions taken and rewards obtained, as well as whether each transition was terminal or not. Refer to the documentation of agent.DataSet for more information.
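
Putting these methods together, here is a hedged sketch of a minimal custom environment. The dynamics (a scalar position pushed left or right by two actions, rewarded for staying near zero) are invented for illustration.

from deer.base_classes import Environment


class ScalarPositionEnv(Environment):
    """Illustrative environment: a scalar position moved left or right."""

    def __init__(self, rng):
        self._rng = rng
        self._position = 0.0
        self._mode = -1

    def reset(self, mode):
        self._mode = mode              # mode -1 is reserved for training
        self._position = 0.0
        return [[0.0]]                 # initial pseudo-state, sized according to inputDimensions()

    def act(self, action):
        # Two actions: 0 moves left, 1 moves right; the reward is highest around 0.
        self._position += -1.0 if action == 0 else 1.0
        return -abs(self._position)

    def inputDimensions(self):
        return [(1,)]                  # one scalar observation with a history size of 1

    def nActions(self):
        return 2

    def observe(self):
        return [self._position]        # one punctual observation per observed element

    def inTerminalState(self):
        return False                   # continuous task: no terminal state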

Learning algorithms

deer.base_classes.LearningAlgo(environment, …) All the Q-networks, actor-critic networks, etc.
deer.learning_algos.q_net_keras.MyQNetwork(…) Deep Q-learning network using Keras (with any backend)
deer.learning_algos.AC_net_keras.MyACNetwork(…) Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain
deer.learning_algos.CRAR_keras.CRAR(environment) Combined Reinforcement learning via Abstract Representations (CRAR) using Keras
class deer.base_classes.LearningAlgo(environment, batch_size)

All the Q-networks, actor-critic networks, etc. should inherit this interface.

environment : object from class Environment
The environment linked to the Q-network
batch_size : int
Number of tuples taken into account for each iteration of gradient descent
chooseBestAction(state)

Get the best action for a pseudo-state

discountFactor()

Getting the discount factor

learningRate()

Getting the learning rate

qValues(state)

Get the q value for one pseudo-state

setDiscountFactor(df)

Setting the discount factor

df : float
The discount factor that has to be set
setLearningRate(lr)

Setting the learning rate. NB: the learning rate usually has to be set in the optimizer, hence this function should be overridden. Otherwise, the learning rate change is likely not to be taken into account.

lr : float
The learning rate that has to be set
train(states, actions, rewards, nextStates, terminals)

This method performs the training step (e.g. using Bellman iteration in a deep Q-network) for one batch of tuples.
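
The interface above is what a custom learning algorithm has to expose. Below is a hedged, non-functional skeleton (random Q-values, no actual learning) meant only to show the expected structure; the return conventions of train() and chooseBestAction() should be checked against the provided learning algorithms before plugging it into an agent.

import numpy as np

from deer.base_classes import LearningAlgo


class RandomQAlgo(LearningAlgo):
    """Illustrative skeleton: returns random Q-values and never actually learns."""

    def __init__(self, environment, batch_size, random_state=None):
        LearningAlgo.__init__(self, environment, batch_size)
        self._rng = random_state if random_state is not None else np.random.RandomState(0)
        self._n_actions = environment.nActions()

    def train(self, states, actions, rewards, nextStates, terminals):
        # A real implementation would perform one gradient step here and return its loss(es).
        return 0.0

    def qValues(self, state):
        return self._rng.uniform(size=self._n_actions)

    def chooseBestAction(self, state):
        return np.argmax(self.qValues(state))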

class deer.learning_algos.q_net_keras.MyQNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_keras.NN'>)

Deep Q-learning network using Keras (with any backend)

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is frozen and after which the target network is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule: str
{sgd,rmsprop}. Default : rmsprop

random_state : numpy random number generator
double_Q : bool, optional
Whether to activate double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)

Get the best action for a pseudo-state

state : one pseudo-state

The best action : int

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one belief state

state_val : one belief state

The q values for the provided belief state

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
list of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the Q-network from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)].
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training (RMSE)
Individual (square) losses for each tuple
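
For reference, here is a hedged instantiation sketch using some of the keyword arguments documented above; the environment is the toy environment from the Examples section and the chosen values are arbitrary.

import numpy as np

from deer.learning_algos.q_net_keras import MyQNetwork
from Toy_env import MyEnv as Toy_env

rng = np.random.RandomState(123456)
env = Toy_env(rng)

qnetwork = MyQNetwork(
    environment=env,
    batch_size=32,           # tuples per gradient descent iteration
    freeze_interval=1000,    # how often the target network is updated
    double_Q=True,           # use double Q-learning (van Hasselt et al., 2015)
    random_state=rng)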

class deer.learning_algos.AC_net_keras.MyACNetwork(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network_critic=<class 'deer.learning_algos.NN_keras.NN'>, neural_network_actor=<class 'deer.learning_algos.NN_keras.NN'>)

Actor-critic learning (using Keras) with Deep Deterministic Policy Gradient (DDPG) for the continuous action domain

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is frozen and after which the target network is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule: str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Whether to activate double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network_critic : object, optional
default is deer.learning_algos.NN_keras
neural_network_actor : object, optional
default is deer.learning_algos.NN_keras
chooseBestAction(state, *args, **kwargs)

Get the best action for a pseudo-state

state : one pseudo-state

best_action : float
estim_value : float

clip_action(action)

Clip the action if it is outside the action space defined by self._nActions. self._nActions is given as [[low_action1, high_action1], [low_action2, high_action2], …]

getAllParams()

Get all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

gradients(states, actions)

Returns the gradients on the Q-network for the different actions (used for policy update)

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
list of the parameters to be set (same order as given by getAllParams()).
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train the actor-critic algorithm from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]).
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]).
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training
Individual losses for each tuple

class deer.learning_algos.CRAR_keras.CRAR(environment, rho=0.9, rms_epsilon=0.0001, momentum=0, clip_norm=0, freeze_interval=1000, batch_size=32, update_rule='rmsprop', random_state=<mtrand.RandomState object>, double_Q=False, neural_network=<class 'deer.learning_algos.NN_CRAR_keras.NN'>, **kwargs)

Combined Reinforcement learning via Abstract Representations (CRAR) using Keras

environment : object from class Environment
The environment in which the agent evolves.
rho : float
Parameter for rmsprop. Default : 0.9
rms_epsilon : float
Parameter for rmsprop. Default : 0.0001
momentum : float
Momentum for SGD. Default : 0
clip_norm : float
The gradient tensor will be clipped to a maximum L2 norm given by this value.
freeze_interval : int
Period during which the target network is frozen and after which the target network is updated. Default : 1000
batch_size : int
Number of tuples taken into account for each iteration of gradient descent. Default : 32
update_rule: str
{sgd,rmsprop}. Default : rmsprop
random_state : numpy random number generator
Set the random seed.
double_Q : bool, optional
Whether to activate double Q-learning. More information in: Hado van Hasselt et al. (2015) - Deep Reinforcement Learning with Double Q-learning.
neural_network : object, optional
Default is deer.learning_algos.NN_keras
chooseBestAction(state, mode, *args, **kwargs)

Get the best action for a pseudo-state

state : list of numpy arrays
One pseudo-state. The number of arrays and their dimensions matches self.environment.inputDimensions().
mode : int
Identifier of the mode (-1 is reserved for the training mode).

The best action : int

getAllParams()

Provides all parameters used by the learning algorithm

Values of the parameters: list of numpy arrays

qValues(state_val)

Get the q values for one pseudo-state (without planning)

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D,1D or scalar)]).

The q values for the provided pseudo state

qValues_planning(state_val, R, gamma, T, Q, d=5)

Get the average Q-values up to planning depth d for one pseudo-state.

state_val : array of objects (or list of objects)
Each object is a numpy array that relates to one of the observations with size [1 * history size * size of punctual observation (which is 2D,1D or scalar)]).
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The average q values with planning depth up to d for the provided pseudo-state

qValues_planning_abstr(state_abstr_val, R, gamma, T, Q, d, branching_factor=None)

Get the q values for pseudo-state(s) with a planning depth d. This function is called recursively by decreasing the depth d at every step.

state_abstr_val : internal state(s)
R : float_model
Model that fits the reward
gamma : float_model
Model that fits the discount factor
T : transition_model
Model that fits the transition between abstract representation
Q : Q_model
Model that fits the optimal Q-value
d : int
planning depth

The Q-values with planning depth d for the provided encoded state(s)

setAllParams(list_of_values)

Set all parameters used by the learning algorithm

list_of_values : list of numpy arrays
list of the parameters to be set (same order as given by getAllParams()).
setLearningRate(lr)

Setting the learning rate

lr : float
The learning rate that has to be set
train(states_val, actions_val, rewards_val, next_states_val, terminals_val)

Train CRAR from one batch of data.

states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]).
actions_val : numpy array of integers with size [self._batch_size]
actions[i] is the action taken after having observed states[:][i].
rewards_val : numpy array of floats with size [self._batch_size]
rewards[i] is the reward obtained for taking actions[i-1].
next_states_val : numpy array of objects
Each object is a numpy array that relates to one of the observations with size [batch_size * history size * size of punctual observation (which is 2D,1D or scalar)]).
terminals_val : numpy array of booleans with size [self._batch_size]
terminals[i] is True if the transition leads to a terminal state and False otherwise

Average loss of the batch training for the Q-values (RMSE)
Individual (square) losses for the Q-values for each tuple

Policies

deer.base_classes.Policy(learning_algo, …) Abstract class for all policies.
deer.policies.EpsilonGreedyPolicy(…) The policy acts greedily with probability \(1-\epsilon\) and acts randomly otherwise.
deer.policies.LongerExplorationPolicy(…[, …]) Simple alternative to \(\epsilon\)-greedy that can explore more efficiently for a broad class of realistic problems.
class deer.base_classes.Policy(learning_algo, n_actions, random_state)

Abstract class for all policies. A policy takes observations as input, and outputs an action.

learning_algo : object from class LearningAlgo
n_actions : int or list

Definition of the action space provided by Environment.nActions()

random_state : numpy random number generator

action(state)

Main method of the Policy class. It can be called by agent.py, given a state, and should return a valid action w.r.t. the environment given to the constructor.

bestAction(state, mode=None, *args, **kwargs)

Returns the best Action for the given state. This is an additional encapsulation for q-network.

randomAction()

Returns a random action
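
Here is a hedged sketch of a custom policy built on this interface: it acts greedily most of the time and otherwise falls back to a random action (essentially a re-implementation of epsilon-greedy, for illustration only).

from deer.base_classes import Policy


class SometimesRandomPolicy(Policy):
    """Illustrative policy: greedy action with probability 0.9, random otherwise."""

    def __init__(self, learning_algo, n_actions, random_state):
        Policy.__init__(self, learning_algo, n_actions, random_state)
        self._rng = random_state        # keep our own handle on the random generator

    def action(self, state, mode=None, *args, **kwargs):
        if self._rng.rand() < 0.1:
            return self.randomAction()                        # documented helper of the base class
        return self.bestAction(state, mode, *args, **kwargs)  # documented helper of the base class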

class deer.policies.EpsilonGreedyPolicy(learning_algo, n_actions, random_state, epsilon)

Bases: deer.base_classes.policy.Policy

The policy acts greedily with probability \(1-\epsilon\) and acts randomly otherwise. It is now used as a default policy for the neural agent.

epsilon : float
Proportion of random steps
action(state, mode=None, *args, **kwargs)

Main method of the Policy class. It can be called by agent.py, given a state, and should return a valid action w.r.t. the environment given to the constructor.

epsilon()

Get the epsilon for \(\epsilon\)-greedy exploration

setEpsilon(e)

Set the epsilon used for \(\epsilon\)-greedy exploration
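
An epsilon-greedy policy can also be passed explicitly to the agent through the train_policy and test_policy arguments documented in the Agent section. For example, using the toy environment from the Examples section (the epsilon values are arbitrary):

import numpy as np

from deer.agent import NeuralAgent
from deer.learning_algos.q_net_keras import MyQNetwork
from deer.policies import EpsilonGreedyPolicy
from Toy_env import MyEnv as Toy_env

rng = np.random.RandomState(0)
env = Toy_env(rng)
qnetwork = MyQNetwork(environment=env, random_state=rng)

# Explore heavily during training, act almost greedily during validation/test.
train_policy = EpsilonGreedyPolicy(qnetwork, env.nActions(), rng, 1.0)
test_policy = EpsilonGreedyPolicy(qnetwork, env.nActions(), rng, 0.05)

agent = NeuralAgent(
    env,
    qnetwork,
    random_state=rng,
    train_policy=train_policy,
    test_policy=test_policy)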

class deer.policies.LongerExplorationPolicy(learning_algo, n_actions, random_state, epsilon, length=10)

Bases: deer.base_classes.policy.Policy

Simple alternative to \(\epsilon\)-greedy that can explore more efficiently for a broad class of realistic problems.

epsilon : float
Proportion of random steps
length : int
Length of the exploration sequences that will be considered
action(state, mode=None, *args, **kwargs)

Main method of the Policy class. It can be called by agent.py, given a state, and should return a valid action w.r.t. the environment given to the constructor.

epsilon()

Get the epsilon

setEpsilon(e)

Set the epsilon
