Agent and model overview

A reinforcement learning agent provides methods to process states and return actions, to store past observations, and to load and save models. Most agents employ a Model which implements the algorithms to calculate the next action given the current state and to update model parameters from past experiences.

Environment <-> Runner <-> Agent <-> Model

Parameters to the agent are passed as a Configuration object. The configuration is passed on to the Model.
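
For illustration, a minimal construction sketch (the state/action spec layout and the network layer format shown here are assumptions based on the constructor signatures listed below, not a verbatim API reference):

    from tensorforce.agents import VPGAgent

    # Hypothetical specs for a small control task (placeholders, not a real environment).
    agent = VPGAgent(
        states_spec=dict(shape=(4,), type='float'),
        actions_spec=dict(type='int', num_actions=2),
        network_spec=[dict(type='dense', size=32), dict(type='dense', size=32)],
        batch_size=1000,
    )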

Ready-to-use algorithms

We have implemented some of the most common RL algorithms and try to keep these up to date. Here we provide an overview of all implemented agents and models.

Agent / General parameters

Agent is the base class for all reinforcement learning agents. Every agent inherits from this class.

class tensorforce.agents.Agent(states_spec, actions_spec, batched_observe=1000, scope='base_agent')

Bases: object

Basic reinforcement learning agent. An agent encapsulates the execution logic of a particular reinforcement learning algorithm and defines the external interface to the environment.

The agent hence acts as an intermediate layer between environment and backend execution (value function or policy updates).

act(states, deterministic=False)

Return action(s) for the given state(s). State preprocessing and exploration are applied if configured accordingly.

Parameters:
  • states (any) -- One state (usually a value tuple) or dict of states if multiple states are expected.
  • deterministic (bool) -- If true, no exploration or sampling is applied.
Returns:

Scalar value of the action or dict of multiple actions the agent wants to execute.

static from_spec(spec, kwargs)

Creates an agent from a specification dict.
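
A hedged sketch of how such a specification might be used (the agent type string 'vpg_agent' and the exact spec layout are assumptions):

    from tensorforce.agents import Agent

    spec = dict(
        type='vpg_agent',                                # assumed registry name
        network_spec=[dict(type='dense', size=32)],
        batch_size=1000,
    )
    agent = Agent.from_spec(
        spec=spec,
        kwargs=dict(
            states_spec=dict(shape=(4,), type='float'),  # placeholder specs
            actions_spec=dict(type='int', num_actions=2),
        ),
    )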

initialize_model()

Creates the model for the respective agent based on the specifications given by the user. This is a separate call after constructing the agent, because the agent constructor has to perform a number of checks on the specs first, sometimes adjusting them, e.g. by converting them to a dict.

observe(terminal, reward)

Observe experience from the environment to learn from. Optionally pre-processes the reward. Child classes should call super to obtain the processed reward, e.g. terminal, reward = super()... (a usage sketch follows the parameter list below).

Parameters:
  • terminal (bool) -- boolean indicating if the episode terminated after the observation.
  • reward (float) -- scalar reward that resulted from executing the action.
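
Together, act() and observe() form the basic interaction loop. A minimal sketch (the environment's reset()/execute() interface is an assumption, not part of this section):

    # 'agent' and 'environment' are assumed to have been constructed already.
    agent.reset()
    state = environment.reset()
    terminal = False
    while not terminal:
        action = agent.act(states=state)
        state, terminal, reward = environment.execute(actions=action)
        agent.observe(terminal=terminal, reward=reward)
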
reset()

Reset the agent to its initial state (e.g. on experiment start). Updates the Model's internal episode and timestep counter, internal states, and resets preprocessors.

restore_model(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
save_model(directory=None, append_timestep=True)

Save the TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument as given here.

Parameters:
  • directory (str) -- Optional checkpoint directory.
  • append_timestep (bool) -- Appends the current timestep to the checkpoint file if true. If this is set to True, the load path must include the checkpoint timestep suffix. For example, if stored to models/ and set to true, the exported file will be of the form models/model.ckpt-X where X is the last timestep saved. The load path must precisely match this file name. If this option is turned off, the checkpoint will always overwrite the file specified in path and the model can always be loaded under this path.
Returns:

Checkpoint path where the model was saved.
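
A short checkpointing sketch using the two methods above (directory names are placeholders):

    # Save without the timestep suffix so the same path can be restored directly.
    path = agent.save_model(directory='./checkpoints/', append_timestep=False)

    # Later, on an identically configured agent:
    agent.restore_model(directory='./checkpoints/')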

Model

The Model class is the base class for reinforcement learning models.

class tensorforce.models.Model(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)

Bases: object

Base class for all (TensorFlow-based) models.

act(states, internals, deterministic=False)

Performs a forward pass through the model to retrieve the action outputs given inputs for the state (and internal states, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of incoming internal state tensors.
  • deterministic (bool) -- If True, will not apply exploration after actions are calculated.
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
create_output_operations(states, internals, actions, terminal, reward, update, deterministic)

Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • update -- Single boolean tensor indicating whether this call happens during an update.
  • deterministic -- Boolean tensor indicating whether exploration is skipped when actions are calculated.
get_optimizer_kwargs(states, internals, actions, terminal, reward, update)

Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular fn_loss returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • update -- Single boolean tensor indicating whether this call happens during an update.
Returns:

Dict to be passed into the optimizer op (e.g. 'minimize') as kwargs.

get_summaries()

Returns the TensorFlow summaries reported by the model.

Returns:List of summaries
get_variables(include_non_trainable=False)

Returns the TensorFlow variables used by the model.

Returns:List of variables.
initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover, it adds the internal state placeholders and initialization values to the model.

Parameters:custom_getter -- The custom_getter_ object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) -- Whether the episode has terminated.
  • reward (float) -- The observed reward value.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start.

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
save(directory=None, append_timestep=True)

Save the TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument as given here.

Parameters:
  • directory -- Optional checkpoint directory.
  • append_timestep -- Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) -- The original output action tensor (to be post-processed).
  • exploration (Exploration) -- The Exploration object to use.
  • action_spec (dict) -- Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, update, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • update -- Single boolean tensor indicating whether this call happens during an update.
  • deterministic -- Boolean tensor indicating whether exploration is skipped when actions are calculated.
Returns:

  1. dict of output actions (with or without exploration applied (see deterministic))
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple

tf_discounted_cumulative_reward(terminal, reward, discount=None, final_reward=0.0, horizon=0)

Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.

Example:
  single rewards = 2.0 1.0 0.0 0.5 1.0 -1.0
  terminal       = False False False False True False
  gamma          = 0.95
  final_reward   = 100.0 (only matters for the last episode (r=-1.0), as it has no terminal signal)
  horizon        = 3
  output         = 2.95 1.45 1.38 1.45 1.0 94.0

Parameters:
  • terminal -- Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one True value. If its very last element is False (not terminating), the given final_reward value is assumed to follow the last value in the single rewards sequence (see below).
  • reward -- Tensor (float) holding the sequence of single rewards. If the last element of terminal is False, an assumed last reward of the value of final_reward will be used.
  • discount (float) -- The discount factor (gamma). By default, take the Model's discount factor.
  • final_reward (float) -- Reward value to use if last episode in sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
  • horizon (int) -- The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns:

Discounted cumulative reward tensor with the same shape as reward.
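
For reference, a plain-NumPy sketch that reproduces the example given above for this method (an illustrative reimplementation, not the library's TensorFlow code):

    import numpy as np

    def discounted_cumulative_reward(reward, terminal, discount,
                                     final_reward=0.0, horizon=0):
        # Illustrative reimplementation: horizon 0 means an infinite horizon.
        n = len(reward)
        output = np.zeros(n)
        for t in range(n):
            acc, steps = 0.0, 0
            for k in range(t, n):
                acc += (discount ** steps) * reward[k]
                steps += 1
                if terminal[k] or (horizon > 0 and steps >= horizon):
                    break
            else:
                # Sequence ended without a terminal: assume final_reward follows.
                acc += (discount ** steps) * final_reward
            output[t] = acc
        return output

    print(discounted_cumulative_reward(
        reward=[2.0, 1.0, 0.0, 0.5, 1.0, -1.0],
        terminal=[False, False, False, False, True, False],
        discount=0.95, final_reward=100.0, horizon=3))
    # -> [ 2.95  1.45125  1.3775  1.45  1.  94. ]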

tf_loss(states, internals, actions, terminal, reward, update)

Creates and returns the single loss tensor representing the total loss for a batch, including the mean loss per sample and the regularization loss of the batch.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • update -- Single boolean tensor indicating whether this call happens during an update.
Returns:

Single float-value loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, update)

Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • update -- Single boolean tensor indicating whether this call happens during an update.
Returns:

Loss tensor (first rank is the batch size -> one loss value per sample in the batch).

tf_optimization(states, internals, actions, terminal, reward, update)

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • update -- Single boolean tensor indicating whether this call happens during an update.
Returns:

The optimization operation.

tf_preprocess_reward(states, internals, terminal, reward)

Applies optional preprocessing to the reward.

tf_preprocess_states(states)

Applies optional preprocessing to the states.

tf_regularization_losses(states, internals, update)

Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • update -- Single boolean tensor indicating whether this call happens during an update.
Returns:

Dict of regularization loss tensors (keys == different regularization types, e.g. 'entropy').

update(states, internals, actions, terminal, reward, return_loss_per_instance=False)

Runs self.optimization in the session to update the Model's parameters. Optionally also runs the loss_per_instance calculation and returns its result.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- Terminal boolean tensor (shape=(batch-size,)).
  • reward -- Reward float tensor (shape=(batch-size,)).
  • return_loss_per_instance (bool) -- Whether to also run and return the loss_per_instance Tensor.
Returns:

None or, if return_loss_per_instance is True, the value of the loss_per_instance tensor.

MemoryAgent

class tensorforce.agents.MemoryAgent(states_spec, actions_spec, batched_observe=1000, scope='memory_agent', summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, memory=None, first_update=10000, update_frequency=4, repeat_update=1)

Bases: tensorforce.agents.learning_agent.LearningAgent

The MemoryAgent class implements a replay memory from which it samples batches according to some sampling strategy to update the value function.

import_observations(observations)

Load an iterable of observation dicts into the replay memory.

Parameters:observations -- An iterable with each element containing an observation. Each observation requires the keys 'state', 'action', 'reward', 'terminal', 'internal'. Use an empty list [] for 'internal' if internal state is irrelevant.
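
A hypothetical pre-fill of the replay memory (state/action values are placeholders):

    observations = [
        dict(state=[0.1, 0.2, 0.3, 0.4], action=1, reward=0.0, terminal=False, internal=[]),
        dict(state=[0.5, 0.4, 0.3, 0.2], action=0, reward=1.0, terminal=True, internal=[]),
    ]
    agent.import_observations(observations=observations)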

BatchAgent

class tensorforce.agents.BatchAgent(states_spec, actions_spec, batched_observe=1000, summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, scope='batch_agent', saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True)

Bases: tensorforce.agents.learning_agent.LearningAgent

The BatchAgent class implements a batch memory which generally implies on-policy experience collection and updates.

observe(terminal, reward)

Adds an observation and performs an update if the necessary conditions are satisfied, i.e. if one batch of experience has been collected as defined by the batch size.

In particular, note that episode control happens outside of the agent since the agent should be agnostic to how the training data is created.

Parameters:
  • terminal (bool) -- Whether episode is terminated or not.
  • reward (float) -- The scalar reward value.
reset_batch()

Cleans up after a batch has been processed (observed). Resets all batch information to be ready for new observation data. Batch information contains:

  • observed states
  • internal-variables
  • taken actions
  • observed is-terminal signals/rewards
  • total batch size

Deep-Q-Networks (DQN)

class tensorforce.agents.DQNAgent(states_spec, actions_spec, batched_observe=None, scope='dqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.memory_agent.MemoryAgent

Deep-Q-Network agent (DQN). The pièce de résistance of deep reinforcement learning, as described by Mnih et al. (2015). Includes an option for double DQN (DDQN; van Hasselt et al., 2015).

DQN chooses from one of a number of discrete actions by taking the maximum Q-value from the value function with one output neuron per available action. DQN uses a replay memory for experience playback.
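
A hedged configuration sketch using parameters from the signature above (the spec/layer formats and parameter values shown are placeholders):

    from tensorforce.agents import DQNAgent

    agent = DQNAgent(
        states_spec=dict(shape=(84, 84, 3), type='float'),
        actions_spec=dict(type='int', num_actions=4),
        network_spec=[dict(type='conv2d', size=32), dict(type='dense', size=256)],
        batch_size=32,
        first_update=10000,
        target_sync_frequency=10000,
        double_q_model=True,   # enables the DDQN variant mentioned above
        huber_loss=1.0,
    )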

Normalized Advantage Functions

class tensorforce.agents.NAFAgent(states_spec, actions_spec, batched_observe=1000, scope='naf', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.memory_agent.MemoryAgent

Normalized Advantage Functions (NAF) for continuous DQN: https://arxiv.org/abs/1603.00748

Deep-Q-learning from demonstration (DQFD)

class tensorforce.agents.DQFDAgent(states_spec, actions_spec, batched_observe=1000, scope='dqfd', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)

Bases: tensorforce.agents.memory_agent.MemoryAgent

Deep Q-learning from demonstration (DQFD) agent (Hester et al., 2017). This agent uses DQN to pre-train from demonstration data via an additional supervised loss term.

import_demonstrations(demonstrations)

Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, which directly sets memory contents from an array and expects a different layout.

Parameters:demonstrations -- List of observation dicts
observe(reward, terminal)

Adds observations and updates via sampling from the memories according to the update rate. DQFD samples from the online replay memory and the demo memory, with the fractions controlled by a hyperparameter p called the 'expert sampling ratio'.

pretrain(steps)

Computes pre-train updates.

Parameters:steps -- Number of updates to execute.
set_demonstrations(batch)

Set all demonstrations from batch data. Expects a dict wherein each value is an array containing all states, actions, rewards, terminals and internals, respectively.

Parameters:batch -- Dict of arrays of states, actions, rewards, terminals and internals.
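
A hedged sketch of the pre-training workflow combining the methods above (the demonstration content is a placeholder; the observation-dict keys follow import_observations of MemoryAgent):

    # Expert transitions collected elsewhere; values are placeholders.
    demonstrations = [
        dict(state=[0.1, 0.2], action=1, reward=1.0, terminal=False, internal=[]),
        dict(state=[0.3, 0.1], action=0, reward=0.0, terminal=True, internal=[]),
    ]
    agent.import_demonstrations(demonstrations=demonstrations)
    agent.pretrain(steps=10000)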

Vanilla Policy Gradient

class tensorforce.agents.VPGAgent(states_spec, actions_spec, batched_observe=1000, scope='vpg', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)

Bases: tensorforce.agents.batch_agent.BatchAgent

Vanilla Policy Gradient agent as described by [Sutton et al. (1999)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf).

Trust Region Policy Optimization (TRPO)

class tensorforce.agents.TRPOAgent(states_spec, actions_spec, batched_observe=1000, scope='trpo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False)

Bases: tensorforce.agents.batch_agent.BatchAgent

Trust Region Policy Optimization (Schulman et al., 2015) agent.

State preprocessing

The agent handles state preprocessing. A preprocessor takes the raw state input from the environment and modifies it (for instance, image resize, state concatenation, etc.). You can find information about our ready-to-use preprocessors here.
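
As an illustration, a preprocessing specification might be passed via the states_preprocessing_spec parameter (the preprocessor type names used here are assumptions; see the preprocessor documentation for the exact names and arguments):

    # Hypothetical preprocessing pipeline: resize image states, then convert to grayscale.
    states_preprocessing_spec = [
        dict(type='image_resize', width=84, height=84),  # assumed preprocessor name
        dict(type='grayscale'),                          # assumed preprocessor name
    ]
    # Passed to any agent constructor that accepts states_preprocessing_spec.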

Building your own agent

If you want to build your own agent, it should always inherit from Agent. If your agent uses a replay memory, it should probably inherit from MemoryAgent; if it uses a batch replay that is emptied after each update, it should probably inherit from BatchAgent.

We distinguish between agents and models. The Agent class handles the interaction with the environment, such as state preprocessing, exploration and observation of rewards. The Model class handles the mathematical operations, such as building the TensorFlow operations, calculating the desired action and updating (i.e. optimizing) the model weights.
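
A structural sketch of this split, using only hooks documented above (constructor plumbing and attribute names such as self.states_spec are assumptions; this is not a complete implementation):

    from tensorforce.agents import Agent
    from tensorforce.models import Model

    class MyModel(Model):
        def tf_actions_and_internals(self, states, internals, update, deterministic):
            # Build and return (dict of action tensors, list of posterior internals).
            raise NotImplementedError

        def tf_loss_per_instance(self, states, internals, actions, terminal, reward, update):
            # Return one loss value per sample in the batch.
            raise NotImplementedError

    class MyAgent(Agent):
        def initialize_model(self):
            # Construct the model from the (already checked) specs.
            return MyModel(states_spec=self.states_spec, actions_spec=self.actions_spec)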

To start building your own agent, please refer to this blogpost to gain a deeper understanding of the internals of the TensorForce library. Afterwards, have a look at a sample implementation, e.g. the DQN Agent and DQN Model.