Agent and model overview

A reinforcement learning agent provides methods to process states and return actions, to store past observations, and to load and save models. Most agents employ a Model which implements the algorithms to calculate the next action given the current state and to update model parameters from past experiences.

Environment <-> Runner <-> Agent <-> Model

Parameters to the agent are passed as a Configuration object. The configuration is passed on to the Model.

Ready-to-use algorithms

We have implemented some of the most common RL algorithms and try to keep them up to date. Here we provide an overview of all implemented agents and models.

Agent / General parameters

Agent is the base class for all reinforcement learning agents. Every agent inherits from this class.

class tensorforce.agents.Agent(states, actions, batched_observe=True, batching_capacity=1000)

Bases: object

Base class for TensorForce agents.

__init__(states, actions, batched_observe=True, batching_capacity=1000)

Initializes the agent.

Parameters:
  • states (spec) -- States specification (required); see the example below.
  • actions (spec) -- Actions specification (required); see the example below.
  • batched_observe (bool) -- Specifies whether calls to model.observe() are batched, for improved performance (default: true).
  • batching_capacity (int) -- Batching capacity of agent and model (default: 1000).
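
For illustration, states and actions specifications are typically given as plain dicts (or dicts of named component dicts). The attribute names used below (type, shape, num_actions) follow common TensorForce usage and are shown here as an assumption, not an exhaustive reference:

    # Minimal sketch of states/actions specification dicts (attribute names assumed).
    states_spec = dict(type='float', shape=(8,))        # one 8-dimensional float state
    actions_spec = dict(type='int', num_actions=4)      # one discrete action with 4 choices

    # Multiple components can be given as a dict of named specifications:
    multi_states_spec = dict(
        screen=dict(type='float', shape=(84, 84, 3)),
        velocity=dict(type='float', shape=(2,))
    )
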
act(states, deterministic=False, independent=False, fetch_tensors=None)

Return action(s) for given state(s). State preprocessing and exploration are applied if configured accordingly.

Parameters:
  • states (any) -- One state (usually a value tuple) or dict of states if multiple states are expected.
  • deterministic (bool) -- If true, no exploration or sampling is applied.
  • independent (bool) -- If true, the action is not followed by observe (and hence not included in updates).
  • fetch_tensors (list) -- Optional list of names of tensors to fetch.
Returns:

Scalar value of the action, or dict of multiple actions, that the agent wants to execute. If fetch_tensors is given, additionally returns fetched_tensors, a dict of the fetched named tensors.
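
In practice, act() is paired with observe() in the environment interaction loop. The following is a sketch only; the env object and its reset()/execute() interface are assumptions used for illustration:

    # Sketch of the basic act/observe loop; `agent` and `env` are assumed to exist,
    # and the environment interface (reset/execute) is a simplifying assumption.
    state = env.reset()
    terminal = False
    while not terminal:
        action = agent.act(states=state)                 # forward pass, with exploration unless deterministic=True
        state, terminal, reward = env.execute(action)    # hypothetical environment step
        agent.observe(terminal=terminal, reward=reward)  # feed the outcome back for learning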

static from_spec(spec, kwargs)

Creates an agent from a specification dict.
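
For example, an agent might be created from a specification dict along the following lines. This is a sketch: the registry name 'dqn_agent' and the exact split between spec and kwargs are assumptions, not guaranteed by this reference:

    # Sketch: creating an agent from a specification dict (agent type name assumed).
    from tensorforce.agents import Agent

    agent = Agent.from_spec(
        spec=dict(type='dqn_agent', network=[dict(type='dense', size=64)]),
        kwargs=dict(
            states=dict(type='float', shape=(8,)),
            actions=dict(type='int', num_actions=4)
        )
    )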

initialize_model()

Creates and returns the model (including a local replica in case of distributed learning) for this agent, based on the specifications given by the user. This method needs to be implemented by the different agent subclasses.

observe(terminal, reward)

Observes experience from the environment in order to learn from it. Optionally pre-processes the reward. Child classes should call super to obtain the processed reward, e.g. terminal, reward = super()...

Parameters:
  • terminal (bool) -- boolean indicating if the episode terminated after the observation.
  • reward (float) -- scalar reward that resulted from executing the action.
reset()

Resets the agent to its initial state (e.g. on experiment start). Updates the Model's internal episode and time step counter, internal states, and resets preprocessors.

restore_model(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
save_model(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path as given here.

Parameters:
  • directory (str) -- Optional checkpoint directory.
  • append_timestep (bool) -- If true, appends the current timestep to the checkpoint file name. For example, if saving to models/ with this enabled, the exported file will be of the form models/model.ckpt-X, where X is the last saved timestep, and the load path must precisely match this file name. If disabled, the checkpoint always overwrites the file specified by the path, so the model can always be loaded from that path.
Returns:

Checkpoint path where the model was saved.
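
Together with restore_model(), a simple checkpointing workflow might look like this sketch, which only uses the signatures documented above:

    # Sketch: saving and later restoring an agent's TensorFlow model.
    checkpoint_path = agent.save_model(directory='models/', append_timestep=True)

    # Later, or in a new process with an identically configured agent:
    agent.restore_model(directory='models/')     # restores the latest checkpoint in models/
    agent.restore_model(file=checkpoint_path)    # or restore one specific checkpoint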

Model

The Model class is the base class for reinforcement learning models.

class tensorforce.models.Model(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, tf_session_dump_dir='')

Bases: object

The Model class coordinates the creation and execution of all TensorFlow operations within a model. It implements the reset, act and update functions, which form the interface the Agent class communicates with, and which should not need to be overwritten. Instead, the following TensorFlow functions need to be implemented:

  • tf_actions_and_internals(states, internals, deterministic) returning the batch of
    actions and successor internal states.

Further, the following TensorFlow functions should be extended accordingly:

  • setup_placeholders() defining TensorFlow input placeholders for states, actions, rewards, etc..
  • setup_template_funcs() builds all TensorFlow functions from "tf_"-methods via tf.make_template.
  • get_variables() returning the list of TensorFlow variables (to be optimized) of this model.

Finally, the following TensorFlow functions can be useful in some cases:

  • tf_preprocess(states, internals, reward) for states/action/reward preprocessing (e.g. reward normalization),
    returning the pre-processed tensors.
  • tf_action_exploration(action, exploration, action_spec) for action postprocessing (e.g. exploration),
    returning the processed batch of actions.
  • create_output_operations(states, internals, actions, terminal, reward, deterministic) for further output operations,
    similar to the two above for Model.act and Model.update.
  • tf_optimization(states, internals, actions, terminal, reward) for further optimization operations
    (e.g. the baseline update in a PGModel or the target network update in a QModel), returning a single grouped optimization operation.
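
As an illustration of the interface above (not an official subclass), a minimal custom model overriding tf_actions_and_internals might look as follows. The actions_spec attribute name and the passthrough of internals are assumptions made for this sketch:

    # Illustrative sketch of a Model subclass; assumes TensorFlow 1.x and that the
    # base Model exposes an `actions_spec` attribute (an assumption).
    import tensorflow as tf
    from tensorforce.models import Model


    class ZeroActionModel(Model):
        """Hypothetical model that always outputs a zero action per component."""

        def tf_actions_and_internals(self, states, internals, deterministic):
            # Contract (see above): return the dict of action tensors and the
            # posterior internal states; internals are passed through unchanged here.
            batch_size = tf.shape(next(iter(states.values())))[0]
            actions = {
                name: tf.zeros(shape=[batch_size], dtype=tf.int32)
                for name in self.actions_spec
            }
            return actions, internals
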
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, tf_session_dump_dir='')

Model.

Parameters:
  • states (spec) -- The state-space description dictionary.
  • actions (spec) -- The action-space description dictionary.
  • scope (str) -- The root scope str to use for tf variable scoping.
  • device (str) -- The name of the device to run the graph of this model on.
  • saver (spec) -- Dict specifying whether and how to save the model's parameters.
  • summarizer (spec) -- Dict specifying which tensorboard summaries should be created and added to the graph.
  • execution (spec) -- Dict specifying whether and how to do distributed training on the model's graph.
  • batching_capacity (int) -- Batching capacity.
  • variable_noise (float) -- The stddev value of a Normal distribution used for adding random noise to the model's output (for each batch, noise can be toggled and - if active - will be resampled). Use None for not adding any noise.
  • states_preprocessing (spec / dict of specs) -- Dict specifying whether and how to preprocess state signals (e.g. normalization, greyscale, etc..).
  • actions_exploration (spec / dict of specs) -- Dict specifying whether and how to add exploration to the model's "action outputs" (e.g. epsilon-greedy).
  • reward_preprocessing (spec) -- Dict specifying whether and how to preprocess rewards coming from the Environment (e.g. reward normalization).
  • tf_session_dump_dir (str) -- If non-empty string, all session.run calls will be dumped using the tensorflow offline-debug session into the given directory.
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action outputs given inputs for states (and internal states, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) -- Dict of state values (each key represents one state space component).
  • internals (dict) -- Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) -- If True, will not apply exploration after actions are calculated.
  • independent (bool) -- If true, action is not followed by observe (and hence not included in updates).
  • fetch_tensors (list) -- List of names of additional tensors (from the model's network) to fetch (and return).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
close()

Saves the model (if a saver directory is given) and closes the session.

create_act_operations(states, internals, deterministic, independent)

Creates and stores tf operations that are fetched when calling act(): actions_output, internals_output and timestep_output.

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals (dict) -- Dict of prior internal state tensors (each key represents one internal state component).
  • deterministic -- 0D (bool) tensor (whether to not use action exploration).
  • independent (bool) -- 0D (bool) tensor (whether to store states/internals/action in local buffer).
create_observe_operations(terminal, reward)

Returns the tf op to fetch when an observation batch is passed in (e.g. an episode's rewards and terminals). Uses the filled tf buffers for states, actions and internals to run the tf_observe_timestep (model-dependent), resets buffer index and increases counters (episodes, timesteps).

Parameters:
  • terminal -- The 1D tensor (bool) of terminal signals to process (more than one True within that list is ok).
  • reward -- The 1D tensor (float) of rewards to process.

Returns: Tf op to fetch when observe() is called.

create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates and stores tf operations for when act() and observe() are called.

get_component(component_name)

Looks up a component by its name.

Parameters:component_name -- The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary mapping component names to components for all components within this model.

Returns:(dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)

Returns the feed-dict for the model's acting and observing tf fetches.

Parameters:
  • states (dict) -- Dict of state values (each key represents one state space component).
  • internals (dict) -- Dict of internal state values (each key represents one internal state component).
  • actions (dict) -- Dict of actions (each key represents one action space component).
  • terminal (List[bool]) -- List of is-terminal signals.
  • reward (List[float]) -- List of reward signals.
  • deterministic (bool) -- Whether actions should be picked without exploration.
  • independent (bool) -- Whether we are doing an independent act (not followed by call to observe; not to be stored in model's buffer).

Returns: The feed dict to use for the fetch.

get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model.

Returns:List of summaries
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules -- Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable -- Includes non-trainable variables if true.
Returns:

List of variables.

observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (List[bool]) -- List of is-terminal signals.
  • reward (List[float]) -- List of reward signals.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory -- Optional checkpoint directory.
  • file -- Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component's parameters from a save location.

Parameters:
  • component_name -- The component to restore.
  • save_path -- The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path as given here.

Parameters:
  • directory -- Optional checkpoint directory.
  • append_timestep -- Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name -- The component to save.
  • save_path -- The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph, starts the servers (distributed mode), creates summarizers and savers, initializes (and enters) the TensorFlow session.

setup_components_and_tf_funcs(custom_getter=None)

Allows child models to create the model's component objects, such as optimizers and memories. Creates all TensorFlow functions via tf.make_template calls on all of the class's "tf_"-methods.

Parameters:custom_getter -- The custom_getter object to use for tf.make_template when creating TensorFlow functions. If None, a default custom_getter is used.

Returns: The custom_getter passed in (or a default one if custom_getter was None).

setup_graph()

Creates our Graph and figures out which shared/global model to hook up to. If we are in a global model's setup procedure, we do not create a new graph (return None as the context); we will instead use the already existing local replica graph of the model.

Returns: None or the graph's as_default()-context.

setup_placeholders()

Creates the TensorFlow placeholders, variables, ops and functions for this model. NOTE: Does not add the internal state placeholders and initialization values to the model yet as that requires the model's Network (if any) to be generated first.

setup_saver()

Creates the tf.train.Saver object and stores it in self.saver.

setup_scaffold(summary_op)

Creates the tf.train.Scaffold object with the given summary_op and assigns it to self.scaffold. Other fields of the Scaffold are generated automatically.

setup_session(server, hooks, graph_default_context)

Creates and then enters the session for this model (finalizes the graph).

Parameters:
  • server (tf.train.Server) -- The tf.train.Server object to connect to (None for single execution).
  • hooks (list) -- A list of (saver, summary, etc..) hooks to be passed to the session.
  • graph_default_context -- The graph as_default() context that we are currently in.
setup_summary_and_saver_hooks()

Creates and returns a list of saver and summarizer hooks to use in a session. Populates self.saver_directory, self.summarizer_hook and self.summarizer.

Returns: List of hooks to use in a session.

start_server()

Creates and stores a tf server (and optionally joins it if we are a parameter server). Only relevant if we are running in distributed mode.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) -- The original output action tensor (to be post-processed).
  • exploration (Exploration) -- The Exploration object to use.
  • action_spec (dict) -- Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals (dict) -- Dict of internal state tensors (each key represents one internal space component).
  • deterministic -- Boolean tensor indicating whether action should be chosen deterministically.
Returns:

  1. dict of output actions (with or without exploration applied (see deterministic))
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple

tf_initialize()

Creates tf Variables for the local state/internals/action-buffers and for the local and global counters for timestep and episode.

tf_observe_timestep(states, internals, actions, terminal, reward)

Creates the TensorFlow operations for processing a batch of observations coming in from our buffer (state, action, internals) as well as from the agent's python-batch (terminal-signals and rewards from the env).

Parameters:
  • states (dict) -- Dict of state tensors (each key represents one state space component).
  • internals (dict) -- Dict of prior internal state tensors (each key represents one internal state component).
  • actions (dict) -- Dict of action tensors (each key represents one action space component).
  • terminal -- 1D (bool) tensor of terminal signals.
  • reward -- 1D (float) tensor of rewards.
Returns:

The observation operation depending on the model type.

tf_preprocess(states, actions, reward)

Applies preprocessing ops to the raw states/action/reward inputs.

Parameters:
  • states (dict) -- Dict of raw state tensors.
  • actions (dict) -- Dict of raw action tensors.
  • reward -- 1D (float) raw rewards tensor.

Returns: The preprocessed versions of the input tensors.

MemoryAgent

BatchAgent

Deep-Q-Networks (DQN)

class tensorforce.agents.DQNAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqn', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Deep Q-Network agent (Mnih et al., 2015).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqn', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Initializes the DQN agent.

Parameters:
  • update_mode (spec) -- Update mode specification.
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • double_q_model (bool) -- Specifies whether double DQN mode is used (default: false).
  • huber_loss (float) -- Huber loss clipping (default: none).
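
For reference, a DQN agent might be constructed along these lines. The sketch relies on the constructor signature above; the update_mode attribute names (unit, batch_size, frequency) and the layer spec format are assumptions consistent with common TensorForce usage:

    # Sketch: constructing a DQN agent with the documented constructor arguments.
    from tensorforce.agents import DQNAgent

    agent = DQNAgent(
        states=dict(type='float', shape=(4,)),
        actions=dict(type='int', num_actions=2),
        network=[dict(type='dense', size=64), dict(type='dense', size=64)],
        update_mode=dict(unit='timesteps', batch_size=32, frequency=4),   # attribute names assumed
        memory=dict(type='replay', include_next_states=True, capacity=10000),
        optimizer=dict(type='adam', learning_rate=1e-3),
        target_sync_frequency=1000,
        double_q_model=True
    )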

Normalized Advantage Functions

class tensorforce.agents.NAFAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='naf', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Normalized Advantage Function agent (Gu et al., 2016).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='naf', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)

Initializes the NAF agent.

Parameters:
  • update_mode (spec) -- Update mode specification.
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • double_q_model (bool) -- Specifies whether double DQN mode is used (default: false).
  • huber_loss (float) -- Huber loss clipping (default: none).

Deep-Q-learning from demonstration (DQFD)

class tensorforce.agents.DQFDAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqfd', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)

Bases: tensorforce.agents.learning_agent.LearningAgent

Deep Q-learning from demonstration agent (Hester et al., 2017).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='dqfd', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)

Initializes the DQFD agent.

Parameters:
  • update_mode (spec) -- Update mode specification.
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='replay', include_next_states=true, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • target_sync_frequency (int) -- Target network sync frequency (default: 10000).
  • target_update_weight (float) -- Target network update weight (default: 1.0).
  • huber_loss (float) -- Huber loss clipping (default: none).
  • expert_margin (float) -- Enforced supervised margin between expert action Q-value and other Q-values (default: 0.5).
  • supervised_weight (float) -- Weight of supervised loss term (default: 0.1).
  • demo_memory_capacity (int) -- Capacity of expert demonstration memory (default: 10000).
  • demo_sampling_ratio (float) -- Runtime sampling ratio of expert data (default: 0.2).
import_demonstrations(demonstrations)

Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, as it directly sets memory contents to an array and expects a different layout.

Parameters:demonstrations -- List of observation dicts
pretrain(steps)

Computes pre-train updates.

Parameters:steps -- Number of updates to execute.
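
Combining both methods, a demonstration workflow could look like the following sketch; the exact keys expected in each observation dict (states, internals, actions, terminal, reward) are an assumption here, as is the expert_data variable:

    # Sketch: seeding a DQFD agent with expert data and running pre-training updates.
    # The observation-dict layout (states/internals/actions/terminal/reward) is assumed.
    demonstrations = [
        dict(states=expert_state, internals=[], actions=expert_action,
             terminal=False, reward=1.0)
        for expert_state, expert_action in expert_data   # hypothetical expert transitions
    ]
    agent.import_demonstrations(demonstrations=demonstrations)
    agent.pretrain(steps=10000)   # supervised/TD pre-training before environment interaction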

Vanilla Policy Gradient

class tensorforce.agents.VPGAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='vpg', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)

Bases: tensorforce.agents.learning_agent.LearningAgent

Vanilla policy gradient agent (Williams, 1992).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='vpg', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, optimizer=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)

Initializes the VPG agent.

Parameters:
  • update_mode (spec) -- Update mode specification.
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='latest', include_next_states=false, capacity=1000*batch_size}).
  • optimizer (spec) -- Optimizer specification, see core.optimizers module for more information (default: {type='adam', learning_rate=1e-3}).
  • baseline_mode (str) -- One of 'states', 'network' (default: none).
  • baseline (spec) -- Baseline specification, see core.baselines module for more information (default: none).
  • baseline_optimizer (spec) -- Baseline optimizer specification, see core.optimizers module for more information (default: none).
  • gae_lambda (float) -- Lambda factor for generalized advantage estimation (default: none).
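
For instance, a VPG agent with a state-value baseline and generalized advantage estimation might be configured as in this sketch; the baseline spec (type 'mlp' with sizes) and the update_mode attribute names are assumptions:

    # Sketch: VPG agent with a states-baseline and GAE.
    from tensorforce.agents import VPGAgent

    agent = VPGAgent(
        states=dict(type='float', shape=(8,)),
        actions=dict(type='int', num_actions=4),
        network=[dict(type='dense', size=64), dict(type='dense', size=64)],
        update_mode=dict(unit='episodes', batch_size=10),        # attribute names assumed
        baseline_mode='states',
        baseline=dict(type='mlp', sizes=[64, 64]),               # baseline spec assumed
        baseline_optimizer=dict(type='adam', learning_rate=1e-3),
        gae_lambda=0.97
    )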

Trust Region Policy Optimization (TRPO)

class tensorforce.agents.TRPOAgent(states, actions, network, batched_observe=True, batching_capacity=1000, scope='trpo', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, ls_max_iterations=10, ls_accept_ratio=0.9, ls_unroll_loop=False)

Bases: tensorforce.agents.learning_agent.LearningAgent

Trust Region Policy Optimization agent (Schulman et al., 2015).

__init__(states, actions, network, batched_observe=True, batching_capacity=1000, scope='trpo', device=None, saver=None, summarizer=None, execution=None, variable_noise=None, states_preprocessing=None, actions_exploration=None, reward_preprocessing=None, update_mode=None, memory=None, discount=0.99, distributions=None, entropy_regularization=None, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, ls_max_iterations=10, ls_accept_ratio=0.9, ls_unroll_loop=False)

Initializes the TRPO agent.

Parameters:
  • update_mode (spec) -- Update mode specification.
  • memory (spec) -- Memory specification, see core.memories module for more information (default: {type='latest', include_next_states=false, capacity=1000*batch_size}).
  • optimizer (spec) -- The TRPO agent implicitly defines an optimized-step natural-gradient optimizer.
  • baseline_mode (str) -- One of 'states', 'network' (default: none).
  • baseline (spec) -- Baseline specification, see core.baselines module for more information (default: none).
  • baseline_optimizer (spec) -- Baseline optimizer specification, see core.optimizers module for more information (default: none).
  • gae_lambda (float) -- Lambda factor for generalized advantage estimation (default: none).
  • likelihood_ratio_clipping (float) -- Likelihood ratio clipping for policy gradient (default: none).
  • learning_rate (float) -- Learning rate of natural-gradient optimizer (default: 1e-3).
  • cg_max_iterations (int) -- Conjugate-gradient max iterations (default: 20).
  • cg_damping (float) -- Conjugate-gradient damping (default: 1e-3).
  • cg_unroll_loop (bool) -- Conjugate-gradient unroll loop (default: false).
  • ls_max_iterations (int) -- Line-search max iterations (default: 10).
  • ls_accept_ratio (float) -- Line-search accept ratio (default: 0.9).
  • ls_unroll_loop (bool) -- Line-search unroll loop (default: false).

State preprocessing

The agent handles state preprocessing. A preprocessor takes the raw state input from the environment and modifies it (for instance, image resize, state concatenation, etc.). You can find information about our ready-to-use preprocessors here.
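
A states_preprocessing specification is typically a list of preprocessor spec dicts applied in order. The preprocessor type names below (image_resize, grayscale, normalize) are assumptions; consult the preprocessors documentation for the exact set:

    # Sketch: preprocessing pipeline turning raw RGB frames into resized grayscale input.
    states_preprocessing = [
        dict(type='image_resize', width=84, height=84),
        dict(type='grayscale'),
        dict(type='normalize')
    ]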

Building your own agent

If you want to build your own agent, it should always inherit from Agent. If your agent uses a replay memory, it should probably inherit from MemoryAgent; if it uses a batch replay that is emptied after each update, it should probably inherit from BatchAgent.

We distinguish between agents and models. The Agent class handles the interaction with the environment, such as state preprocessing, exploration and observation of rewards. The Model class handles the mathematical operations, such as building the TensorFlow operations, calculating the desired action and updating (i.e. optimizing) the model weights.

To start building your own agent, please refer to this blogpost to gain a deeper understanding of the internals of the TensorForce library. Afterwards, have a look at a sample implementation, e.g. the DQN Agent and DQN Model.
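
As a starting point, a custom agent skeleton might look like the following sketch. MyModel stands in for your own Model subclass (e.g. along the lines of the sketch in the Model section above), and the attribute handling is an assumption, not library-prescribed:

    # Illustrative skeleton of a custom agent; MyModel is a hypothetical placeholder
    # for your own Model subclass.
    from tensorforce.agents import Agent


    class MyAgent(Agent):
        """Hypothetical custom agent that wires up its own model."""

        def __init__(self, states, actions, my_parameter=0.1,
                     batched_observe=True, batching_capacity=1000):
            # Store whatever the model will need before the base class builds it.
            self.my_states_spec = states
            self.my_actions_spec = actions
            self.my_parameter = my_parameter
            super(MyAgent, self).__init__(
                states=states, actions=actions,
                batched_observe=batched_observe, batching_capacity=batching_capacity
            )

        def initialize_model(self):
            # Required by Agent: create and return the Model this agent drives.
            return MyModel(
                states=self.my_states_spec,
                actions=self.my_actions_spec,
                my_parameter=self.my_parameter
            )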