Agent and model overview

A reinforcement learning agent provides methods to process states and return actions, to store past observations, and to load and save models. Most agents employ a Model which implements the algorithms to calculate the next action given the current state and to update model parameters from past experiences.

Environment <-> Runner <-> Agent <-> Model

Parameters to the agent are passed as a Configuration object. The configuration is passed on to the Model.
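
For illustration, a minimal construction sketch using one of the ready-to-use agents documented below; the state/action specification attributes (type, shape, num_actions) and the layered-network spec are assumptions for this example:

    # Minimal sketch; spec attribute names and values are assumptions for illustration.
    from tensorforce.agents import DQNAgent

    agent = DQNAgent(
        states=dict(type='float', shape=(4,)),     # assumed states specification
        actions=dict(type='int', num_actions=2),   # assumed actions specification
        network=[                                  # assumed layered-network specification
            dict(type='dense', size=32),
            dict(type='dense', size=32)
        ]
    )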

Ready-to-use algorithms

We have implemented some of the most common RL algorithms and try to keep them up to date. Here we provide an overview of all implemented agents and models.

Agent / General parameters

Agent is the base class for all reinforcement learning agents. Every agent inherits from this class.

class tensorforce.agents.Agent(states, actions, batched_observe=True, batching_capacity=1000)

Bases: object

Base class for TensorForce agents.

__init__(states, actions, batched_observe=True, batching_capacity=1000)

Initializes the agent.

Parameters:
  • states -- States specification, with the following attributes (required):
  • actions -- Actions specification, with the following attributes (required):
  • batched_observe (bool) -- Specifies whether calls to model.observe() are batched, for improved performance (default: true).
  • batching_capacity (int) -- Batching capacity of agent and model (default: 1000).
act(states, deterministic=False, independent=False, fetch_tensors=None)

Return action(s) for given state(s). States preprocessing and exploration are applied if configured accordingly.

Parameters:
  • states (any) -- One state (usually a value tuple) or dict of states if multiple states are expected.
  • deterministic (bool) -- If true, no exploration and sampling is applied.
  • independent (bool) -- If true, action is not followed by observe (and hence not included in updates).
  • fetch_tensors (list) -- Optional list of names of tensors to fetch.
Returns:

Scalar value of the action, or dict of multiple actions, that the agent wants to execute. If fetch_tensors is given, additionally returns a dict (fetched_tensors) with the fetched named tensors.
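
For example, a hedged sketch of the different act() modes; `agent` and `state` are placeholders, and the return layout with fetch_tensors follows the description above:

    # Sketch; `agent` and `state` are placeholders, the tensor name is an assumption.
    action = agent.act(states=state)                      # exploration applied if configured
    action = agent.act(states=state, deterministic=True)  # no exploration or sampling
    action = agent.act(states=state, independent=True)    # not followed by observe()
    action, fetched = agent.act(states=state, fetch_tensors=['q_values'])  # assumed tensor name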

static from_spec(spec, kwargs)

Creates an agent from a specification dict.
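
A hedged sketch of creating an agent from a specification dict; the spec layout (a 'type' key naming a registered agent plus constructor arguments) is an assumption here:

    # Sketch; the 'type' value and the spec layout are assumptions.
    from tensorforce.agents import Agent

    agent = Agent.from_spec(
        spec=dict(type='dqn', network=[dict(type='dense', size=32)]),
        kwargs=dict(
            states=dict(type='float', shape=(4,)),
            actions=dict(type='int', num_actions=2)
        )
    )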

initialize_model()

Creates the model for the respective agent based on specifications given by user. This is a separate call after constructing the agent because the agent constructor has to perform a number of checks on the specs first, sometimes adjusting them e.g. by converting to a dict.

observe(terminal, reward)

Observe experience from the environment to learn from. Optionally pre-processes rewards. Child classes should call super to get the processed reward, e.g. terminal, reward = super()...

Parameters:
  • terminal (bool) -- boolean indicating if the episode terminated after the observation.
  • reward (float) -- scalar reward that resulted from executing the action.
reset()

Reset the agent to its initial state (e.g. on experiment start). Updates the Model's internal episode and time step counter, internal states, and resets preprocessors.
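
Together, act(), observe() and reset() form the agent's interaction loop; a minimal sketch, where the environment object and its reset()/execute() methods are placeholders:

    # Interaction-loop sketch; `environment` is a placeholder following the
    # Environment <-> Runner <-> Agent <-> Model layout shown above.
    for episode in range(100):
        state = environment.reset()
        agent.reset()
        terminal = False
        while not terminal:
            action = agent.act(states=state)
            state, terminal, reward = environment.execute(action)
            agent.observe(terminal=terminal, reward=reward)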

restore_model(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
save_model(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument as given here.

Parameters:
  • directory (str) -- Optional checkpoint directory.
  • append_timestep (bool) -- Appends the current timestep to the checkpoint file if true. If set to True, the load path must include the checkpoint timestep suffix: for example, if saved to models/ with this option enabled, the exported file will be of the form models/model.ckpt-X, where X is the last timestep saved, and the load path must match this file name exactly. If turned off, the checkpoint always overwrites the file specified by the path, and the model can always be loaded from this path.
Returns:

Checkpoint path where the model was saved.
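
A short sketch of saving and restoring, illustrating the append_timestep behaviour described above; the directory is a placeholder:

    # Sketch; 'models/' is a placeholder directory.
    path = agent.save_model(directory='models/', append_timestep=True)
    # path is of the form 'models/model.ckpt-X'; restore the latest checkpoint:
    agent.restore_model(directory='models/')

    # With append_timestep=False, the returned path can be reused directly:
    path = agent.save_model(directory='models/', append_timestep=False)
    agent.restore_model(file=path)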

Model

The Model class is the base class for reinforcement learning models.

class tensorforce.models.Model(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Bases: object

Base class for all (TensorFlow-based) models.

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Model.

Parameters:
  • states (spec) -- The state-space description dictionary.
  • actions (spec) -- The action-space description dictionary.
  • scope (str) -- The root scope str to use for tf variable scoping.
  • device (str) -- The name of the device to run the graph of this model on.
  • saver (spec) -- Dict specifying whether and how to save the model's parameters.
  • summarizer (spec) -- Dict specifying which tensorboard summaries should be created and added to the graph.
  • execution (spec) -- Dict specifying whether and how to do distributed training on the model's graph.
  • batching_capacity (int) -- Batching capacity.
  • variable_noise (float) -- The stddev value of a Normal distribution used for adding random noise to the model's output (for each batch, noise can be toggled and - if active - will be resampled). Use None for not adding any noise.
  • states_preprocessing (spec / dict of specs) -- Dict specifying whether and how to preprocess state signals (e.g. normalization, grayscale, etc.).
  • actions_exploration (spec / dict of specs) -- Dict specifying whether and how to add exploration to the model's "action outputs" (e.g. epsilon-greedy).
  • reward_preprocessing (spec) -- Dict specifying whether and how to preprocess rewards coming from the Environment (e.g. reward normalization).
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Performs a forward pass through the model to retrieve the action outputs given inputs for the state (and internal state, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) -- Dict of state values (each key represents one state space component).
  • internals (dict) -- Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) -- If True, will not apply exploration after actions are calculated.
  • independent (bool) -- If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates output operations for acting, observing and interacting with the memory.

get_component(component_name)

Looks up a component by its name.

Parameters:component_name -- The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary mapping component names to components for all components within this model.

Returns:(dict) The mapping of name to component.
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model.

Returns:List of summaries
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules -- Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable -- Includes non-trainable variables if true.
Returns:

List of variables.

initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover, it adds the internal state placeholders and initialization values to the model.

Parameters:custom_getter -- The custom_getter object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) -- Whether the episode has terminated.
  • reward (float) -- The observed reward value.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component's parameters from a save location.

Parameters:
  • component_name -- The component to restore.
  • save_path -- The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends current timestep to prevent overwriting previous checkpoint files. Turn off to be able to load model from the same given path argument as given here.

Parameters:
  • directory -- Optional checkpoint directory.
  • append_timestep -- Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name -- The component to save.
  • save_path -- The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) -- The original output action tensor (to be post-processed).
  • exploration (Exploration) -- The Exploration object to use.
  • action_spec (dict) -- Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • deterministic -- Boolean tensor indicating whether action should be chosen deterministically.
Returns:

  1. dict of output actions (with or without exploration applied, depending on deterministic)
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple

tf_observe_timestep(states, internals, actions, terminal, reward)

Creates the TensorFlow operations for performing the observation of a full time step's information.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals -- List of prior internal state tensors.
  • actions -- Dict of action tensors.
  • terminal -- Terminal boolean tensor.
  • reward -- Reward tensor.
Returns:

The observation operation.

MemoryAgent

BatchAgent

Deep-Q-Networks (DQN)

class tensorforce.agents.DQNAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqn', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Deep Q-Network agent (Mnih et al., 2015).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqn', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Initializes the DQN agent.

Parameters:
  • update_mode -- Update mode specification, with the following attributes:
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • double_q_model (bool) -- Specifies whether double DQN mode is used (default: false).
  • huber_loss (float) -- Huber loss clipping (default: none).
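
Putting the defaults listed above together, a hedged DQN configuration sketch; the state, action and network specs as well as the concrete values are placeholders:

    # Sketch; specs and values are placeholders, parameter names follow the list above.
    from tensorforce.agents import DQNAgent

    agent = DQNAgent(
        states=dict(type='float', shape=(4,)),
        actions=dict(type='int', num_actions=2),
        network=[dict(type='dense', size=64), dict(type='dense', size=64)],
        memory=dict(type='replay', include_next_states=True, capacity=10000),
        optimizer=dict(type='adam', learning_rate=1e-3),
        target_sync_frequency=10000,
        target_update_weight=1.0,
        double_q_model=True,
        huber_loss=1.0
    )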

Normalized Advantage Functions

class tensorforce.agents.NAFAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='naf', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Normalized Advantage Function agent (Gu et al., 2016).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='naf', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Initializes the NAF agent.

Parameters:
  • update_mode -- Update mode specification, with the following attributes:
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • double_q_model (bool) -- Specifies whether double DQN mode is used (default: false).
  • huber_loss (float) -- Huber loss clipping (default: none).

Deep-Q-learning from demonstration (DQFD)

class tensorforce.agents.DQFDAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqfd', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)

Bases: tensorforce.agents.learning_agent.LearningAgent

Deep Q-learning from demonstration agent (Hester et al., 2017).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqfd', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)

Initializes the DQFD agent.

Parameters:
  • update_mode -- Update mode specification, with the following attributes:
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • huber_loss (float) -- Huber loss clipping (default: none).
  • expert_margin (float) -- Enforced supervised margin between expert action Q-value and other Q-values (default: 0.5).
  • supervised_weight (float) -- Weight of supervised loss term (default: 0.1).
  • demo_memory_capacity (int) -- Capacity of expert demonstration memory (default: 10000).
  • demo_sampling_ratio (float) -- Runtime sampling ratio of expert data (default: 0.2).
import_demonstrations(demonstrations)

Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate; it directly sets memory contents to an array and expects a different layout.

Parameters:demonstrations -- List of observation dicts
pretrain(steps)

Computes pre-train updates.

Parameters:steps -- Number of updates to execute.
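
A hedged sketch of seeding a DQFD agent with expert data and running pre-train updates; the layout of each observation dict is an assumption based on the per-timestep information the model observes (states, internals, actions, terminal, reward):

    # Sketch; the observation-dict keys are assumptions and `expert_data` is a placeholder.
    demonstrations = [
        dict(states=state, internals=[], actions=action, terminal=terminal, reward=reward)
        for state, action, terminal, reward in expert_data
    ]
    agent.import_demonstrations(demonstrations=demonstrations)
    agent.pretrain(steps=10000)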

Vanilla Policy Gradient

class tensorforce.agents.VPGAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='vpg', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Vanilla policy gradient agent (Williams, 1992).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='vpg', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)

Initializes the VPG agent.

Parameters:
  • update_mode -- Update mode specification, with the following attributes:
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='latest', include_next_states=false, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • baseline_mode (str) -- One of 'states', 'network' (default: none).
  • baseline (spec) -- Baseline specification, see core.baselines module for more information (default: none).
  • baseline_optimizer (spec) -- Baseline optimizer specification, see core.optimizers module for more information (default: none).
  • gae_lambda (float) -- Lambda factor for generalized advantage estimation (default: none).
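
A hedged sketch of a VPG agent with a state-value baseline and generalized advantage estimation; the baseline spec and all concrete values are assumptions:

    # Sketch; specs and values are assumptions, parameter names follow the list above.
    from tensorforce.agents import VPGAgent

    agent = VPGAgent(
        states=dict(type='float', shape=(4,)),
        actions=dict(type='int', num_actions=2),
        network=[dict(type='dense', size=64), dict(type='dense', size=64)],
        baseline_mode='states',
        baseline=dict(type='mlp', sizes=[32, 32]),
        baseline_optimizer=dict(type='adam', learning_rate=1e-3),
        gae_lambda=0.97
    )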

Trust Region Policy Optimization (TRPO)

class tensorforce.agents.TRPOAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='trpo', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, ls_max_iterations=10, ls_accept_ratio=0.9, ls_unroll_loop=False)

Bases: tensorforce.agents.learning_agent.LearningAgent

Trust Region Policy Optimization agent (Schulman et al., 2015).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='trpo', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, ls_max_iterations=10, ls_accept_ratio=0.9, ls_unroll_loop=False)

Initializes the TRPO agent.

Parameters:
  • update_mode -- Update mode specification, with the following attributes:
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='latest', include_next_states=false, capacity=1000*batch_size}).
  • optimizer (spec) -- The TRPO agent implicitly defines an optimized-step natural-gradient optimizer.
  • baseline_mode (str) -- One of 'states', 'network' (default: none).
  • baseline (spec) -- Baseline specification, see core.baselines module for more information (default: none).
  • baseline_optimizer (spec) -- Baseline optimizer specification, see core.optimizers module for more information (default: none).
  • gae_lambda (float) -- Lambda factor for generalized advantage estimation (default: none).
  • likelihood_ratio_clipping (float) -- Likelihood ratio clipping for policy gradient (default: none).
  • learning_rate (float) -- Learning rate of natural-gradient optimizer (default: 1e-3).
  • cg_max_iterations (int) -- Conjugate-gradient max iterations (default: 20).
  • cg_damping (float) -- Conjugate-gradient damping (default: 1e-3).
  • cg_unroll_loop (bool) -- Conjugate-gradient unroll loop (default: false).
  • ls_max_iterations (int) -- Line-search max iterations (default: 10).
  • ls_accept_ratio (float) -- Line-search accept ratio (default: 0.9).
  • ls_unroll_loop (bool) -- Line-search unroll loop (default: false).

State preprocessing

The agent handles state preprocessing. A preprocessor takes the raw state input from the environment and modifies it (for instance, image resize, state concatenation, etc.). You can find information about our ready-to-use preprocessors here.
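
For instance, a hedged states_preprocessing specification for image inputs; the preprocessor type names and their parameters are assumptions for this sketch:

    # Sketch; preprocessor type names and parameters are assumptions.
    states_preprocessing = [
        dict(type='image_resize', width=84, height=84),
        dict(type='grayscale'),
        dict(type='sequence', length=4)
    ]
    # Passed to any agent constructor above via states_preprocessing=states_preprocessing.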

Building your own agent

If you want to build your own agent, it should always inherit from Agent. If your agent uses a replay memory, it should probably inherit from MemoryAgent; if it uses a batch replay that is emptied after each update, it should probably inherit from BatchAgent.

We distinguish between agents and models. The Agent class handles the interaction with the environment, such as state preprocessing, exploration and the observation of rewards. The Model class handles the mathematical operations, such as building the TensorFlow operations, calculating the desired action and updating (i.e. optimizing) the model weights.
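
A minimal skeleton of that split; the CustomAgent/CustomModel names, the forwarded arguments and the attribute names are hypothetical:

    # Skeleton sketch; class names, attributes and forwarded arguments are hypothetical.
    from tensorforce.agents import Agent


    class CustomAgent(Agent):

        def __init__(self, states, actions, batched_observe=True, batching_capacity=1000):
            super(CustomAgent, self).__init__(
                states=states,
                actions=actions,
                batched_observe=batched_observe,
                batching_capacity=batching_capacity
            )

        def initialize_model(self):
            # Called after the constructor has checked and possibly adjusted the specs.
            return CustomModel(states=self.states, actions=self.actions)  # hypothetical model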

To start building your own agent, please refer to this blogpost to gain a deeper understanding of the internals of the TensorForce library. Afterwards, have a look at a sample implementation, e.g. the DQN Agent and DQN Model.