Tensorforce: a TensorFlow library for applied reinforcement learning

Tensorforce is an open-source deep reinforcement learning framework, with an emphasis on modularized, flexible library design and straightforward usability for applications in research and practice. Tensorforce is built on top of Google’s TensorFlow framework and compatible with Python 3 (Python 2 support was dropped with version 0.5).

Tensorforce follows a set of high-level design choices which differentiate it from other similar libraries:

  • Modular component-based design: Feature implementations, above all, strive to be as generally applicable and configurable as possible, potentially at some cost of faithfully resembling details of the introducing paper.
  • Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.
  • Full-on TensorFlow models: The entire reinforcement learning logic, including control flow, is implemented in TensorFlow, to enable portable computation graphs independent of application programming language, and to facilitate the deployment of models.

Installation

A stable version of Tensorforce is periodically published on PyPI and can be installed as follows:

pip3 install tensorforce

To always use the latest version of Tensorforce, install the GitHub version instead:

git clone https://github.com/tensorforce/tensorforce.git
cd tensorforce
pip3 install -e .

Tensorforce is built on top of Google’s TensorFlow and requires that either tensorflow or tensorflow-gpu is installed, currently as version 1.13.1. To include the correct version of TensorFlow with the installation of Tensorforce, simply add the flag tf for the normal CPU version or tf_gpu for the GPU version:

# PyPI version plus TensorFlow CPU version
pip3 install tensorforce[tf]

# GitHub version plus TensorFlow GPU version
pip3 install -e .[tf_gpu]

Some environments require additional packages, for which installation options are available (mazeexp, gym, retro, vizdoom; or envs for all environments); however, some also require other tools to be installed (see the environments documentation).
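
For example, to install Tensorforce together with the additional packages required for the OpenAI Gym environments:

# PyPI version plus OpenAI Gym dependencies
pip3 install tensorforce[gym]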

Getting started

Initializing an environment

It is recommended to initialize an environment via the Environment.create(...) interface.

from tensorforce.environments import Environment

For instance, the OpenAI CartPole environment can be initialized as follows:

environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

Gym’s pre-defined versions are also accessible:

environment = Environment.create(environment='gym', level='CartPole-v1')

Alternatively, an environment can be specified as a config file:

{
    "environment": "gym",
    "level": "CartPole"
}

Environment config files can be loaded by passing their file path:

environment = Environment.create(
    environment='environment.json', max_episode_timesteps=500
)

Custom Gym environments can be used in the same way, but require the corresponding class(es) to be imported and registered accordingly.
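
For illustration, a hedged sketch of this procedure, assuming a hypothetical custom Gym environment class my_module.MyCustomEnv:

from gym.envs.registration import register

# Register the custom Gym environment under an id (module/class names are placeholders)
register(id='MyCustomEnv-v0', entry_point='my_module:MyCustomEnv')

environment = Environment.create(
    environment='gym', level='MyCustomEnv-v0', max_episode_timesteps=500
)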

Finally, it is possible to implement a custom environment using Tensorforce’s Environment interface:

import numpy as np

class CustomEnvironment(Environment):

    def __init__(self):
        super().__init__()

    def states(self):
        return dict(type='float', shape=(8,))

    def actions(self):
        return dict(type='int', num_values=4)

    # Optional, should only be defined if environment has a natural maximum
    # episode length
    def max_episode_timesteps(self):
        return super().max_episode_timesteps()

    # Optional
    def close(self):
        super().close()

    def reset(self):
        state = np.random.random(size=(8,))
        return state

    def execute(self, actions):
        assert 0 <= actions.item() <= 3
        next_state = np.random.random(size=(8,))
        terminal = np.random.random() < 0.5
        reward = np.random.random()
        return next_state, terminal, reward

Custom environment implementations can be loaded by passing their module path:

environment = Environment.create(
    environment='custom_env.CustomEnvironment', max_episode_timesteps=10
)

It is strongly recommended to specify the max_episode_timesteps argument of Environment.create(...) unless it is defined by the environment itself (or the environment is only used for evaluation), as otherwise more agent parameters may require specification.

Initializing an agent

Similarly to environments, it is recommended to initialize an agent via the Agent.create(...) interface.

from tensorforce.agents import Agent

For instance, the generic Tensorforce agent can be initialized as follows:

agent = Agent.create(
    agent='tensorforce', environment=environment, update=64,
    objective='policy_gradient', reward_estimation=dict(horizon=20)
)

Other pre-defined agent classes can alternatively be used, for instance, Proximal Policy Optimization:

agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10, learning_rate=1e-3
)

Alternatively, an agent can be specified as a config file:

{
    "agent": "tensorforce",
    "update": 64,
    "objective": "policy_gradient",
    "reward_estimation": {
        "horizon": 20
    }
}

Agent config files can be loaded by passing their file path:

agent = Agent.create(agent='agent.json', environment=environment)

It is recommended to pass the environment object returned by Environment.create(...) as environment argument of Agent.create(...), so that the states, actions and max_episode_timesteps arguments are automatically specified accordingly.

Training and evaluation

It is recommended to use the execution utilities for training and evaluation, like the Runner utility, which offer a range of configuration options:

from tensorforce.execution import Runner

A basic experiment consisting of training and subsequent evaluation can be written in a few lines of code:

runner = Runner(
    agent='agent.json',
    environment=dict(environment='gym', level='CartPole'),
    max_episode_timesteps=500
)

runner.run(num_episodes=200)

runner.run(num_episodes=100, evaluation=True)

runner.close()

The execution utility classes take care of handling the agent-environment interaction correctly, and thus should be used where possible. Alternatively, if more detailed control over the agent-environment interaction is required, a simple training and evaluation loop can be written as follows:

# Create agent and environment
environment = Environment.create(
    environment='environment.json', max_episode_timesteps=500
)
agent = Agent.create(agent='agent.json', environment=environment)

# Train for 200 episodes
for _ in range(200):
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)

# Evaluate for 100 episodes
sum_rewards = 0.0
for _ in range(100):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = agent.act(states=states, internals=internals, evaluation=True)
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward

print('Mean episode reward:', sum_rewards / 100)

# Close agent and environment
agent.close()
environment.close()

Module specification

Agents are instantiated via Agent.create(agent=...), with any of the specification alternatives presented below (agent acts as type argument). It is recommended to pass the application Environment implementation as the second argument environment, from which the corresponding states, actions and max_episode_timesteps arguments of the agent are automatically extracted.

How to specify modules

Dictionary with module type and arguments

Agent.create(...
    policy=dict(network=dict(type='layered', layers=[dict(type='dense', size=32)])),
    memory=dict(type='replay', capacity=10000), ...
)

JSON specification file (plus additional arguments)

Agent.create(...
    policy=dict(network='network.json'),
    memory=dict(type='memory.json', capacity=10000), ...
)

Module path (plus additional arguments)

Agent.create(...
    policy=dict(network='my_module.TestNetwork'),
    memory=dict(type='tensorforce.core.memories.Replay', capacity=10000), ...
)

Callable or Type (plus additional arguments)

Agent.create(...
    policy=dict(network=TestNetwork),
    memory=dict(type=Replay, capacity=10000), ...
)

Default module: only arguments or first argument

Agent.create(...
    policy=dict(network=[dict(type='dense', size=32)]),
    memory=dict(capacity=10000), ...
)

Static vs dynamic hyperparameters

Tensorforce distinguishes between agent/module arguments of primitive type (bool/int/long/float) which specify part of the TensorFlow model architecture, like the layer size, and those which specify a value within the architecture, like the learning rate. Whereas the former are statically defined as part of the agent initialization, the latter can be dynamically adjusted afterwards. These dynamic hyperparameters are indicated by parameter as part of their type specification in the documentation, and can alternatively be assigned a parameter module instead of a constant value, for instance, to specify a decaying learning rate.

Example: exponentially decaying exploration

Agent.create(...
    exploration=dict(
        type='decaying', unit='timesteps', decay='exponential',
        initial_value=0.1, decay_steps=1000, decay_rate=0.5
    ), ...
)

Example: linearly increasing horizon

Agent.create(...
    reward_estimation=dict(horizon=dict(
        type='decaying', dtype='long', unit='episodes', decay='polynomial',
        initial_value=10.0, decay_steps=1000, final_value=50.0, power=1.0
    )), ...
)

Features

Parallel environment execution

Execute multiple environments running locally in one call / batched:

runner = Runner(
    agent='benchmarks/configs/ppo1.json', environment='CartPole-v1',
    num_parallel=5
)
runner.run(num_episodes=100, batch_agent_calls=True)

Execute environments running in different processes whenever ready / unbatched:

runner = Runner(
    agent='benchmarks/configs/ppo1.json', environment='CartPole-v1',
    num_parallel=5, remote='multiprocessing'
)
runner.run(num_episodes=100)

Execute environments running on different machines, here using run.py instead of Runner:

# Environment machine 1
python run.py --environment gym --level CartPole-v1 --remote socket-server \
    --port 65432

# Environment machine 2
python run.py --environment gym --level CartPole-v1 --remote socket-server \
    --port 65433

# Agent machine
python run.py --agent benchmarks/configs/ppo1.json --episodes 100 \
    --num-parallel 2 --remote socket-client --host 127.0.0.1,127.0.0.1 \
    --port 65432,65433 --batch-agent-calls

Action masking

agent = Agent.create(
    states=dict(type='float', shape=(10,)),
    actions=dict(type='int', shape=(), num_values=3), ...
)
...
states = dict(
    state=np.random.random_sample(size=(10,)),  # regular state
    action_mask=[True, False, True]  # mask as '[ACTION-NAME]_mask'
)
action = agent.act(states=states)
assert action != 1

Record & pretrain

agent = Agent.create(...
    recorder=dict(
        directory='data/traces',
        frequency=100  # record a traces file every 100 episodes
    ), ...
)
...
agent.close()

# Pretrain agent on recorded traces
agent = Agent.create(...)
agent.pretrain(
    directory='data/traces',
    num_iterations=100  # perform 100 update iterations on traces (more configurations possible)
)

Save & restore

TensorFlow saver (full model)

agent = Agent.create(...
    saver=dict(
        directory='data/checkpoints',
        frequency=600  # save checkpoint every 600 seconds (10 minutes)
    ), ...
)
...
agent.close()

# Restore latest agent checkpoint
agent = Agent.load(directory='data/checkpoints')

NumPy / HDF5 (only weights)

agent = Agent.create(...
    saver=dict(
        directory='data/checkpoints',
        frequency=600  # save checkpoint every 600 seconds (10 minutes)
    ), ...
)
...
agent.save(directory='data/checkpoints', format='numpy', append='episodes')

# Restore latest agent checkpoint
agent = Agent.load(directory='data/checkpoints', format='numpy')

TensorBoard

Agent.create(...
    summarizer=dict(
        directory='data/summaries',
        # list of labels, or 'all'
        labels=['graph', 'entropy', 'kl-divergence', 'losses', 'rewards'],
        frequency=100  # store values every 100 timesteps
        # (infrequent update summaries every update; other configurations possible)
    ), ...
)
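
The recorded summaries can then be inspected by pointing TensorBoard at the summarizer directory:

tensorboard --logdir data/summaries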

run.py – Runner

Agent arguments

--[a]gent (string, required unless “socket-server” remote mode) – Agent (name, configuration JSON file, or library module)
--[n]etwork (string, default: not specified) – Network (name, configuration JSON file, or library module)

Environment arguments

--[e]nvironment (string, required unless “socket-client” remote mode) – Environment (name, configuration JSON file, or library module)
--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[m]ax-episode-timesteps (int, default: not specified) – Maximum number of timesteps per episode
--visualize (bool, default: false) – Visualize agent–environment interaction, if supported
--visualize-directory (string, default: not specified) – Directory to store videos of agent–environment interaction, if supported
--import-modules (string, default: not specified) – Import comma-separated modules required for environment

Parallel execution arguments

--num-parallel (int, default: no parallel execution) – Number of environment instances to execute in parallel
--batch-agent-calls (bool, default: false) – Batch agent calls for parallel environment execution
--sync-timesteps (bool, default: false) – Synchronize parallel environment execution on timestep-level
--sync-episodes (bool, default: false) – Synchronize parallel environment execution on episode-level
--remote (str, default: local execution) – Communication mode for remote execution of parallelized environments: “multiprocessing” | “socket-client” | “socket-server”. In case of “socket-server”, runs the environment in a server communication loop until closed.
--blocking (bool, default: false) – Remote environments should be blocking
--host (str, only for “socket-client” remote mode) – Socket server hostname(s) or IP address(es), single value or comma-separated list
--port (str, only for “socket-client/server” remote mode) – Socket server port(s), single value or comma-separated list, increasing sequence if single host and port given

Runner arguments

--e[v]aluation (bool, default: false) – Run environment (last if multiple) in evaluation mode
--e[p]isodes (int, default: not specified) – Number of episodes
--[t]imesteps (int, default: not specified) – Number of timesteps
--[u]pdates (int, default: not specified) – Number of agent updates
--mean-horizon (int, default: 1) – Number of episodes over which progress-bar values and the evaluation score are averaged
--save-best-agent (string, default: not specified) – Directory to save the best version of the agent according to the evaluation score

Logging arguments

--[r]epeat (int, default: 1) – Number of repetitions
--path (string, default: not specified) – Logging path, directory plus filename without extension

--seaborn (bool, default: false) – Use seaborn
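
For instance, combining the arguments above, a local training run might be started as follows (episode count chosen for illustration):

python run.py --agent benchmarks/configs/ppo1.json --environment gym \
    --level CartPole-v1 --max-episode-timesteps 500 --episodes 300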

tune.py – Hyperparameter tuner

Required arguments

#1: environment (string) – Environment (name, configuration JSON file, or library module)

Optional arguments

--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[m]ax-repeats (int, default: 1) – Maximum number of repetitions
--[n]um-iterations (int, default: 1) – Number of BOHB iterations
--[d]irectory (string, default: “tuner”) – Output directory
--[r]estore (string, default: not specified) – Restore from given directory
--id (string, default: “worker”) – Unique worker id
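
For instance, a tuning run for CartPole might be started as follows (illustrative values):

python tune.py gym --level CartPole-v1 --max-repeats 2 --num-iterations 3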

Agent interface

Initialization and termination

static TensorforceAgent.create(agent='tensorforce', environment=None, **kwargs)

Creates an agent from a specification.

Parameters:
  • agent (specification | Agent class/object) – JSON file, specification key, configuration dictionary, library module, or Agent class/object (default: Policy agent).
  • environment (Environment object) – Environment which the agent is supposed to be trained on; environment-related arguments like state/action space specifications and maximum episode length will be extracted if given (recommended).
  • kwargs – Additional arguments.
TensorforceAgent.close()

Closes the agent.

Main reinforcement learning interface

TensorforceAgent.act(states, internals=None, parallel=0, independent=False, deterministic=False, evaluation=False, query=None, **kwargs)

Returns action(s) for the given state(s), needs to be followed by observe(...) unless independent mode is set via independent/evaluation.

Parameters:
  • states (dict[state] | iter[dict[state]]) – Dictionary containing state(s) to be acted on (required).
  • internals (dict[internal] | iter[dict[internal]]) – Dictionary containing current internal agent state(s), either given by initial_internals() at the beginning of an episode or as return value of the preceding act(...) call (required if independent mode and agent has internal states).
  • parallel (int | iter[int]) – Parallel execution index (default: 0).
  • independent (bool) – Whether act is not part of the main agent-environment interaction, and this call is thus not followed by observe (default: false).
  • deterministic (bool) – In independent mode, whether to act deterministically, i.e. without exploration and sampling (default: false).
  • evaluation (bool) – Whether the agent is currently evaluated, implies independent and deterministic (default: false).
  • query (list[str]) – Names of tensors to retrieve (default: none).
  • kwargs – Additional input values, for instance, for dynamic hyperparameters.
Returns:

dict[action] | iter[dict[action]], dict[internal] | iter[dict[internal]] if internals argument given, plus optional list[str]: Dictionary containing action(s), dictionary containing next internal agent state(s) if independent mode, plus queried tensor values if requested.

TensorforceAgent.observe(reward, terminal=False, parallel=0, query=None, **kwargs)

Observes reward and whether a terminal state is reached, needs to be preceded by act(...).

Parameters:
  • reward (float | iter[float]) – Reward (required).
  • terminal (bool | 0 | 1 | 2 | iter[..]) – Whether a terminal state is reached or 2 if the episode was aborted (default: false).
  • parallel (int, iter[int]) – Parallel execution index (default: 0).
  • query (list[str]) – Names of tensors to retrieve (default: none).
  • kwargs – Additional input values, for instance, for dynamic hyperparameters.
Returns:

Whether an update was performed, plus queried tensor values if requested.

Return type:

(bool | int, optional list[str])

Required for evaluation at episode start

TensorforceAgent.initial_internals()

Returns the initial internal agent state(s), to be used at the beginning of an episode as internals argument for act(...) in independent mode.

Returns: Dictionary containing initial internal agent state(s).
Return type: dict[internal]

Loading and saving

static TensorforceAgent.load(directory=None, filename=None, format=None, environment=None, **kwargs)

Restores an agent from a specification directory/file.

Parameters:
  • directory (str) – Checkpoint directory (default: current directory “.”).
  • filename (str) – Checkpoint filename, with or without append and extension (default: “agent”).
  • format ("tensorflow" | "numpy" | "hdf5") – File format (default: format matching directory and filename, required to be unambiguous).
  • environment (Environment object) – Environment which the agent is supposed to be trained on; environment-related arguments like state/action space specifications and maximum episode length will be extracted if given (recommended).
  • kwargs – Additional arguments.
TensorforceAgent.save(directory=None, filename=None, format='tensorflow', append=None)

Saves the agent to a checkpoint.

Parameters:
  • directory (str) – Checkpoint directory (default: directory specified for TensorFlow saver, otherwise current directory).
  • filename (str) – Checkpoint filename, without extension (default: filename specified for TensorFlow saver, otherwise name of agent).
  • format ("tensorflow" | "numpy" | "hdf5") – File format, “tensorflow” uses TensorFlow saver to store both variables and graph meta information, whereas the others only store variables as NumPy/HDF5 file. (default: TensorFlow format).
  • append ("timesteps" | "episodes" | "updates") – Append current timestep/episode/update to checkpoint filename (default: none).
Returns:

Checkpoint path.

Return type:

str

Get and assign variables

TensorforceAgent.get_variables()

Returns the names of all agent variables.

Returns: Names of variables.
Return type: list[str]
TensorforceAgent.get_variable(variable)

Returns the value of the variable with the given name.

Parameters: variable (string) – Variable name (required).
Returns: Variable value.
Return type: numpy-array
TensorforceAgent.assign_variable(variable, value)

Assigns the given value to the variable with the given name.

Parameters:
  • variable (string) – Variable name (required).
  • value (variable-compatible value) – Value to assign to variable (required).
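
A minimal sketch of combining these functions, assuming an agent created as in the earlier examples (variable names depend on the agent configuration):

# List all agent variables and read the value of the first one
names = agent.get_variables()
value = agent.get_variable(variable=names[0])

# Assign a modified value back to the same variable
agent.assign_variable(variable=names[0], value=0.5 * value)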

Advanced functions for specialized use cases

TensorforceAgent.experience(states, actions, terminal, reward, internals=None, query=None, **kwargs)

Feed experience traces.

Parameters:
  • states (dict[array[state]]) – Dictionary containing arrays of states (required).
  • actions (dict[array[action]]) – Dictionary containing arrays of actions (required).
  • terminal (array[bool]) – Array of terminals (required).
  • reward (array[float]) – Array of rewards (required).
  • internals (dict[state]) – Dictionary containing arrays of internal agent states (default: no internal states).
  • query (list[str]) – Names of tensors to retrieve (default: none).
  • kwargs – Additional input values, for instance, for dynamic hyperparameters.
TensorforceAgent.update(query=None, **kwargs)

Perform an update.

Parameters:
  • query (list[str]) – Names of tensors to retrieve (default: none).
  • kwargs – Additional input values, for instance, for dynamic hyperparameters.
TensorforceAgent.pretrain(directory, num_iterations, num_traces=1, num_updates=1)

Pretrain from experience traces.

Parameters:
  • directory (path) – Directory with experience traces, e.g. obtained via recorder; episode length has to be consistent with agent configuration (required).
  • num_iterations (int > 0) – Number of iterations consisting of loading new traces and performing multiple updates (required).
  • num_traces (int > 0) – Number of traces to load per iteration; has to at least satisfy the update batch size (default: 1).
  • num_updates (int > 0) – Number of updates per iteration (default: 1).
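
For illustration, a hedged sketch of feeding externally collected transitions via experience(...) followed by an update(...), assuming an agent whose specification matches the CustomEnvironment above (8-dimensional float state, int action with 4 values); the 'state'/'action' dictionary keys are assumptions:

import numpy as np

# Hypothetical recorded episode of length 10; shapes have to match the agent specification
agent.experience(
    states=dict(state=np.random.random(size=(10, 8))),
    actions=dict(action=np.random.randint(4, size=(10,))),
    terminal=np.array([False] * 9 + [True]),
    reward=np.random.random(size=(10,))
)

# Perform an update based on the fed experience
agent.update()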

Others

TensorforceAgent.reset()

Resets all agent buffers and discards unfinished episodes.

TensorforceAgent.get_output_tensors(function)

Returns the names of output tensors for the given function.

Parameters: function (str) – Function name (required).
Returns: Names of output tensors.
Return type: list[str]
TensorforceAgent.get_available_summaries()

Returns the summary labels provided by the agent.

Returns: Available summary labels.
Return type: list[str]

Constant Agent

class tensorforce.agents.ConstantAgent(states, actions, max_episode_timesteps=None, action_values=None, name='agent', device=None, seed=None, summarizer=None, recorder=None, config=None)

Agent returning constant action values (specification key: constant).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • action_values (dict[value]) – Constant value per action (default: false for binary boolean actions, 0 for discrete integer actions, 0.0 for continuous actions).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • name (string) – Agent name, used e.g. for TensorFlow scopes (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0) – how frequently in timesteps to record summaries (default: always).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all or list of summaries to record, from the following labels (default: only "graph"):
    • "graph": graph summary
    • "parameters": parameter scalars
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
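
A minimal sketch of creating a constant agent, for example as a trivial baseline, assuming an environment object created via Environment.create(...) as in the Getting started section:

# Constant agent returning the default constant action values
agent = Agent.create(agent='constant', environment=environment)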

Random Agent

class tensorforce.agents.RandomAgent(states, actions, max_episode_timesteps=None, name='agent', device=None, seed=None, summarizer=None, recorder=None, config=None)

Agent returning random action values (specification key: random).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • name (string) – Agent name, used e.g. for TensorFlow scopes (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0) – how frequently in timesteps to record summaries (default: always).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all or list of summaries to record, from the following labels (default: only "graph"):
    • "graph": graph summary
    • "parameters": parameter scalars
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
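
For instance, a random agent can serve as a quick sanity check of a custom environment implementation (a sketch reusing the act/observe loop from the Getting started section):

agent = Agent.create(agent='random', environment=environment)

states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)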

Tensorforce Agent

class tensorforce.agents.TensorforceAgent(states, actions, update, objective, reward_estimation, max_episode_timesteps=None, policy='default', memory=None, optimizer='adam', baseline_policy=None, baseline_optimizer=None, baseline_objective=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, buffer_observe=True, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)

Tensorforce agent (specification key: tensorforce).

Highly configurable agent and basis for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • policy (specification) – Policy configuration, see policies (default: “default”, action distributions parametrized by an automatically configured network).
  • memory (int | specification) – Memory configuration, see memories (default: replay memory with given or inferred capacity).
  • update (int | specification) – Model update configuration with the following attributes (required, default: timesteps batch size):
    • unit ("timesteps" | "episodes") – unit for update attributes (required).
    • batch_size (parameter, long > 0) – size of update batch in number of units (required).
    • frequency ("never" | parameter, long > 0) – frequency of updates (default: batch_size).
    • start (parameter, long >= batch_size) – number of units before first update (default: 0).
  • optimizer (specification) – Optimizer configuration, see optimizers (default: Adam optimizer).
  • objective (specification) – Optimization objective configuration, see objectives (required).
  • reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
    • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation (required).
    • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 1.0).
    • estimate_horizon (false | "early" | "late") – Whether to estimate the value of horizon states, and if so, whether to estimate early when experience is stored, or late when it is retrieved (default: "late" if any of the baseline_* arguments is specified, else false).
    • estimate_actions (bool) – Whether to estimate state-action values instead of state values (default: false).
    • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
    • estimate_advantage (bool) – Whether to estimate the advantage by subtracting the current estimate (default: false).
  • baseline_policy (specification) – Baseline policy configuration, main policy will be used as baseline if none (default: none).
  • baseline_optimizer (float > 0.0 | specification) –

    Baseline optimizer configuration, see optimizers, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).

  • baseline_objective (specification) –

    Baseline optimization objective configuration, see objectives, main objective will be used for baseline if none (default: none).

  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • buffer_observe (bool | int > 0) – Maximum number of timesteps within an episode to buffer before executing internal observe operations, to reduce calls to TensorFlow for improved performance (default: max_episode_timesteps or 1000, unless summarizer specified).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
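
Putting several of the arguments above together, a hedged example configuration might look as follows (illustrative, untuned values):

agent = Agent.create(
    agent='tensorforce', environment=environment,
    policy=dict(network=[dict(type='dense', size=32), dict(type='dense', size=32)]),
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20, discount=0.99)
)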

Deep Q-Network

class tensorforce.agents.DeepQNetwork(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, huber_loss=0.0, horizon=0, discount=0.99, estimate_terminal=False, target_sync_frequency=1, target_update_weight=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)

Deep Q-Network agent (specification key: dqn).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
  • batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • target_sync_frequency (parameter, int > 0) – Interval between target network updates (default: every update).
  • target_update_weight (parameter, 0.0 < float <= 1.0) – Target network update weight (default: 1.0).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
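
A hedged example of creating a DQN agent with the arguments above (illustrative, untuned values):

agent = Agent.create(
    agent='dqn', environment=environment,
    memory=10000,      # replay memory capacity
    batch_size=32,     # timesteps per update batch
    learning_rate=1e-3,
    exploration=0.1    # probability of a uniformly random action
)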

Dueling DQN

class tensorforce.agents.DuelingDQN(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, huber_loss=0.0, horizon=0, discount=0.99, estimate_terminal=False, target_sync_frequency=1, target_update_weight=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)

Dueling DQN agent (specification key: dueling_dqn).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
  • batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • target_sync_frequency (parameter, int > 0) – Interval between target network updates (default: every update).
  • target_update_weight (parameter, 0.0 < float <= 1.0) – Target network update weight (default: 1.0).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).

Vanilla Policy Gradient

class tensorforce.agents.VanillaPolicyGradient(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, discount=0.99, estimate_terminal=False, baseline_network=None, baseline_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)

Vanilla Policy Gradient aka REINFORCE agent (specification key: vpg).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • baseline_network (specification) – Baseline network configuration, see networks, main policy will be used as baseline if none (default: none).
  • baseline_optimizer (float > 0.0 | specification) – Baseline optimizer configuration, see optimizers, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
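
For illustration, a VPG agent can be created via the Agent.create(...) interface using its specification key. This is a minimal sketch: the hyperparameter values are illustrative rather than tuned recommendations, and the environment is the CartPole example from the getting-started section.

from tensorforce.agents import Agent
from tensorforce.environments import Environment

# CartPole environment as in the getting-started section
environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

agent = Agent.create(
    agent='vpg',              # specification key for VanillaPolicyGradient
    environment=environment,  # states/actions/max_episode_timesteps taken from here
    batch_size=10,            # episodes per update batch
    learning_rate=3e-4,
    baseline_network='auto',  # separate baseline network (main policy if omitted)
    baseline_optimizer=1.0    # float: custom weight for the baseline loss
)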

Actor-Critic

class tensorforce.agents.ActorCritic(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, horizon=0, discount=0.99, state_action_value=False, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Actor-Critic agent (specification key: ac).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • state_action_value (bool) – Whether to estimate state-action values instead of state values (default: false).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks (default: "auto").
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + one episode (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
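
The critic-specific arguments can be passed in the same way. A minimal sketch, reusing the environment from the previous example; values are illustrative only.

from tensorforce.agents import Agent

agent = Agent.create(
    agent='ac',               # specification key for ActorCritic
    environment=environment,
    batch_size=10,
    horizon=10,               # n-step discounted-sum reward before the critic estimate
    critic_network='auto',
    critic_optimizer=1.0      # float: custom weight for the critic loss
)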

Advantage Actor-Critic

class tensorforce.agents.AdvantageActorCritic(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, horizon=0, discount=0.99, state_action_value=False, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Advantage Actor-Critic agent (specification key: a2c).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • state_action_value (bool) – Whether to estimate state-action values instead of state values (default: false).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks (default: "auto").
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + one episode (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
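
Since A2C shares the actor-critic arguments above, the sketch below instead highlights the exploration- and regularization-related parameters; the values are again illustrative and `environment` is assumed to exist as in the earlier examples.

from tensorforce.agents import Agent

agent = Agent.create(
    agent='a2c',                  # specification key for AdvantageActorCritic
    environment=environment,
    batch_size=10,
    horizon='episode',            # full-episode return before the critic estimate
    entropy_regularization=0.01,  # discourage an overly "certain" / spiked policy
    exploration=0.1               # random-action probability resp. Gaussian noise std. dev.
)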

Deterministic Policy Gradient

class tensorforce.agents.DeterministicPolicyGradient(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, horizon=0, discount=0.99, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Deterministic Policy Gradient agent (specification key: dpg). The action space must consist of a single float action.

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
  • batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks (default: "auto").
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
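
DPG requires an explicit replay memory and an action space consisting of a single float action. A minimal sketch follows; Gym's Pendulum is merely an assumed example of such an environment, and the remaining values are illustrative.

from tensorforce.agents import Agent
from tensorforce.environments import Environment

# Continuous-control environment with a single float action (assumed example)
environment = Environment.create(
    environment='gym', level='Pendulum-v0', max_episode_timesteps=200
)

agent = Agent.create(
    agent='dpg',              # specification key for DeterministicPolicyGradient
    environment=environment,
    memory=10000,             # replay memory capacity (required)
    batch_size=32,            # timesteps per update batch
    start_updating=1000,      # timesteps collected before the first update
    exploration=0.1           # std. dev. of Gaussian noise added to the float action
)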

Proximal Policy Optimization

class tensorforce.agents.ProximalPolicyOptimization(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, subsampling_fraction=0.33, optimization_steps=10, likelihood_ratio_clipping=0.2, discount=0.99, estimate_terminal=False, critic_network=None, critic_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Proximal Policy Optimization agent (specification key: ppo).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • subsampling_fraction (parameter, 0.0 < float <= 1.0) – Fraction of batch timesteps to subsample (default: 0.33).
  • optimization_steps (parameter, int > 0) – Number of optimization steps (default: 10).
  • likelihood_ratio_clipping (parameter, float > 0.0) – Likelihood-ratio clipping threshold (default: 0.2).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks, main policy will be used as critic if none (default: none).
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, main optimizer will be used for critic if none, a float implies none and specifies a custom weight for the critic loss (default: none).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
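
A PPO sketch mainly differs in the clipped-surrogate arguments: each update batch is subsampled and optimized over several steps. The values below simply restate the signature defaults for clarity; `environment` as in the earlier examples.

from tensorforce.agents import Agent

agent = Agent.create(
    agent='ppo',                   # specification key for ProximalPolicyOptimization
    environment=environment,
    batch_size=10,
    learning_rate=3e-4,
    subsampling_fraction=0.33,     # fraction of batch timesteps used per optimization step
    optimization_steps=10,         # optimization steps per update
    likelihood_ratio_clipping=0.2  # clipping threshold of the surrogate objective
)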

Trust-Region Policy Optimization

class tensorforce.agents.TrustRegionPolicyOptimization(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.001, likelihood_ratio_clipping=0.2, discount=0.99, estimate_terminal=False, critic_network=None, critic_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Trust Region Policy Optimization agent (specification key: trpo).

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 1e-3).
  • likelihood_ratio_clipping (parameter, float > 0.0) – Likelihood-ratio clipping threshold (default: 0.2).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks, main policy will be used as critic if none (default: none).
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, main optimizer will be used for critic if none, a float implies none and specifies a custom weight for the critic loss (default: none).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
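
TRPO takes essentially the same arguments as PPO except for the subsampling/optimization-steps pair; a minimal sketch with the signature defaults, again reusing `environment`:

from tensorforce.agents import Agent

agent = Agent.create(
    agent='trpo',                  # specification key for TrustRegionPolicyOptimization
    environment=environment,
    batch_size=10,
    learning_rate=1e-3,            # trpo default, see signature above
    likelihood_ratio_clipping=0.2
)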

Distributions

class tensorforce.core.distributions.Bernoulli(name, action_spec, embedding_shape, summary_labels=None)[source]

Bernoulli distribution, for binary boolean actions (specification key: bernoulli).

Parameters:
  • name (string) – Distribution name (internal use).
  • action_spec (specification) – Action specification (internal use).
  • embedding_shape (iter[int > 0]) – Embedding shape (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Beta(name, action_spec, embedding_shape, summary_labels=None)[source]

Beta distribution, for bounded continuous actions (specification key: beta).

Parameters:
  • name (string) – Distribution name (internal use).
  • action_spec (specification) – Action specification (internal use).
  • embedding_shape (iter[int > 0]) – Embedding shape (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Categorical(name, action_spec, embedding_shape, infer_states_value=True, summary_labels=None)[source]

Categorical distribution, for discrete integer actions (specification key: categorical).

Parameters:
  • name (string) – Distribution name (internal use).
  • action_spec (specification) – Action specification (internal use).
  • embedding_shape (iter[int > 0]) – Embedding shape (internal use).
  • infer_states_value (bool) – Whether to infer the state value from state-action values as softmax denominator (default: true).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Gaussian(name, action_spec, embedding_shape, summary_labels=None)[source]

Gaussian distribution, for unbounded continuous actions (specification key: gaussian).

Parameters:
  • name (string) – Distribution name (internal use).
  • action_spec (specification) – Action specification (internal use).
  • embedding_shape (iter[int > 0]) – Embedding shape (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
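
Which distribution parametrizes an action follows from the action specification: boolean actions use Bernoulli, discrete integer actions Categorical, unbounded continuous actions Gaussian, and bounded continuous actions Beta (per the descriptions above). The actions specification below is purely illustrative:

# Illustrative actions specification and the distribution each entry implies
actions = dict(
    toggle=dict(type='bool'),                                # -> Bernoulli
    choice=dict(type='int', num_values=4),                   # -> Categorical
    force=dict(type='float', shape=(2,)),                    # -> Gaussian
    angle=dict(type='float', min_value=-1.0, max_value=1.0)  # -> Beta
)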

Layers

Default layer: Function with default argument function

Convolutional layers

class tensorforce.core.layers.Conv1d(name, size, window=3, stride=1, padding='same', dilation=1, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]

1-dimensional convolutional layer (specification key: conv1d).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • window (int > 0) – Window size (default: 3).
  • stride (int > 0) – Stride size (default: 1).
  • padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
  • dilation (int > 0 | (int > 0, int > 0)) – Dilation value (default: 1).
  • bias (bool) – Whether to add a trainable bias variable (default: true).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Conv2d(name, size, window=3, stride=1, padding='same', dilation=1, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]

2-dimensional convolutional layer (specification key: conv2d).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • window (int > 0 | (int > 0, int > 0)) – Window size (default: 3).
  • stride (int > 0 | (int > 0, int > 0)) – Stride size (default: 1).
  • padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
  • dilation (int > 0 | (int > 0, int > 0)) – Dilation value (default: 1).
  • bias (bool) – Whether to add a trainable bias variable (default: true).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
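
A network specification can be given as a list of layer dictionaries using the specification keys above; the 'type'-keyed dict form is assumed here, and the sizes are illustrative. A convolutional stack for image-shaped states might look as follows (the flatten layer is documented under pooling layers below):

# Illustrative convolutional network specification for image-shaped states
network = [
    dict(type='conv2d', size=32, window=8, stride=4, activation='relu'),
    dict(type='conv2d', size=64, window=4, stride=2, activation='relu'),
    dict(type='flatten'),
    dict(type='dense', size=256)
]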

Dense layers

class tensorforce.core.layers.Dense(name, size, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]

Dense fully-connected layer (specification key: dense).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • bias (bool) – Whether to add a trainable bias variable (default: true).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Linear(name, size, bias=True, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]

Linear layer (specification key: linear).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • bias (bool) – Whether to add a trainable bias variable (default: true).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
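
For vector states, a plain multi-layer perceptron built from dense and linear layers is the usual choice; a sketch with illustrative sizes, in the same 'type'-keyed list form assumed above:

# Illustrative MLP network specification
network = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=64, activation='tanh'),
    dict(type='linear', size=32)  # final projection without nonlinearity
]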

Embedding layers

class tensorforce.core.layers.Embedding(name, size, num_embeddings=None, max_norm=None, bias=False, activation='tanh', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]

Embedding layer (specification key: embedding).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • num_embeddings (int > 0) – If set, specifies the number of embeddings (default: none).
  • max_norm (float) – If set, embeddings are clipped if their L2-norm is larger (default: none).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "tanh").
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for potential parent class.
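
An embedding layer is typically the first layer for int-typed states, mapping discrete values to dense vectors; num_embeddings would match the state's num_values. An illustrative sketch:

# Illustrative network for an int state with num_values=100
network = [
    dict(type='embedding', size=32, num_embeddings=100),
    dict(type='dense', size=64)
]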

Recurrent layers

class tensorforce.core.layers.Gru(name, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Gated recurrent unit layer (specification key: gru).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras GRU layer, see TensorFlow docs.
class tensorforce.core.layers.Lstm(name, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Long short-term memory layer (specification key: lstm).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras LSTM layer, see TensorFlow docs.
class tensorforce.core.layers.Rnn(name, cell, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Recurrent neural network layer (specification key: rnn).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • cell ('gru' | 'lstm') – The recurrent cell type (required).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras RNN layer, see TensorFlow docs.
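
For example, a sequential network specification (see networks below) might end in an lstm layer that returns its final state; a minimal sketch, assuming a flat float state:

network = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='lstm', size=32, return_final_state=True)
]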

Pooling layers

class tensorforce.core.layers.Flatten(name, input_spec=None, summary_labels=None)[source]

Flatten layer (specification key: flatten).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Pooling(name, reduction, input_spec=None, summary_labels=None)[source]

Pooling layer (global pooling) (specification key: pooling).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • reduction ('concat' | 'max' | 'mean' | 'product' | 'sum') – Pooling type (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Pool1d(name, reduction, window=2, stride=2, padding='same', input_spec=None, summary_labels=None)[source]

1-dimensional pooling layer (local pooling) (specification key: pool1d).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • reduction ('average' | 'max') – Pooling type (required).
  • window (int > 0) – Window size (default: 2).
  • stride (int > 0) – Stride size (default: 2).
  • padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Pool2d(name, reduction, window=2, stride=2, padding='same', input_spec=None, summary_labels=None)[source]

2-dimensional pooling layer (local pooling) (specification key: pool2d).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • reduction ('average' | 'max') – Pooling type (required).
  • window (int > 0 | (int > 0, int > 0)) – Window size (default: 2).
  • stride (int > 0 | (int > 0, int > 0)) – Stride size (default: 2).
  • padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
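
As a hedged sketch, a convolutional stack for an image state might combine local and global pooling as follows (layer arguments such as window sizes are illustrative):

network = [
    dict(type='conv2d', size=32, window=3),
    dict(type='pool2d', reduction='max', window=2, stride=2),
    dict(type='conv2d', size=64, window=3),
    dict(type='pooling', reduction='mean')  # global pooling over the remaining axes
]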

Normalization layers

class tensorforce.core.layers.ExponentialNormalization(name, decay=0.999, axes=None, input_spec=None, summary_labels=None)[source]

Normalization layer based on the exponential moving average (specification key: exponential_normalization).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • decay (parameter, 0.0 <= float <= 1.0) – Decay rate (default: 0.999).
  • axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all but last axis).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.InstanceNormalization(name, axes=None, input_spec=None, summary_labels=None)[source]

Instance normalization layer (specification key: instance_normalization).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).

Misc layers

class tensorforce.core.layers.Activation(name, nonlinearity, input_spec=None, summary_labels=None)[source]

Activation layer (specification key: activation).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • nonlinearity ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Nonlinearity (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Clipping(name, upper, lower=None, input_spec=None, summary_labels=None)[source]

Clipping layer (specification key: clipping).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • upper (parameter, float) – Upper clipping value (required).
  • lower (parameter, float) – Lower clipping value (default: negative upper value).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Deltafier(name, concatenate=False, input_spec=None, summary_labels=None)[source]

Deltafier layer computing the difference between the current and the previous input; can only be used as preprocessing layer (specification key: deltafier).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • concatenate (False | int >= 0) – Whether to concatenate instead of replace deltas with input, and if so, concatenation axis (default: false).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Dropout(name, rate, input_spec=None, summary_labels=None)[source]

Dropout layer (specification key: dropout).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • rate (parameter, 0.0 <= float < 1.0) – Dropout rate (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Image(name, height=None, width=None, grayscale=False, input_spec=None, summary_labels=None)[source]

Image preprocessing layer (specification key: image).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • height (int) – Height of resized image (default: no resizing or relative to width).
  • width (int) – Width of resized image (default: no resizing or relative to height).
  • grayscale (bool | iter[float]) – Turn into grayscale image, optionally using given weights (default: false).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Reshape(name, shape, input_spec=None, summary_labels=None)[source]

Reshape layer (specification key: reshape).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • shape (int | iter[int]) – New shape (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Sequence(name, length, axis=-1, concatenate=True, input_spec=None, summary_labels=None)[source]

Sequence layer stacking the current and previous inputs; can only be used as preprocessing layer (specification key: sequence).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • length (int > 0) – Number of inputs to concatenate (required).
  • axis (int >= 0) – Concatenation axis, excluding batch axis (default: last axis).
  • concatenate (bool) – Whether to concatenate inputs at given axis, otherwise introduce new sequence axis (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
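
Standalone activation, dropout and clipping layers can be interleaved with linear layers instead of using the corresponding dense-layer arguments; a minimal sketch:

network = [
    dict(type='linear', size=64),
    dict(type='activation', nonlinearity='relu'),
    dict(type='dropout', rate=0.1),
    dict(type='linear', size=64),
    dict(type='activation', nonlinearity='relu')
]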

Layers with internal states

class tensorforce.core.layers.InternalGru(name, size, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Internal state GRU cell layer (specification key: internal_gru).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • length (parameter, long > 0) – Number of past timesteps the cell processes, i.e. horizon for truncated backpropagation through time (required).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras GRU layer, see TensorFlow docs.
class tensorforce.core.layers.InternalLstm(name, size, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Internal state LSTM cell layer (specification key: internal_lstm).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • length (parameter, long > 0) – Number of past timesteps the cell processes, i.e. horizon for truncated backpropagation through time (required).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras LSTM layer, see TensorFlow docs.
class tensorforce.core.layers.InternalRnn(name, cell, size, length, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Internal state RNN cell layer (specification key: internal_rnn).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • cell ('gru' | 'lstm') – The recurrent cell type (required).
  • size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
  • length (parameter, long > 0) – Number of past timesteps the cell processes, i.e. horizon for truncated backpropagation through time (required).
  • bias (bool) – Whether to add a trainable bias variable (default: false).
  • activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
  • dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
  • is_trainable (bool) – Whether layer variables are trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Additional arguments for Keras RNN cell layer, see TensorFlow docs.
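
A policy network can end in an internal-state recurrent cell so that the agent carries the recurrent state across timesteps; a minimal sketch, with the length value chosen purely for illustration:

network = [
    dict(type='dense', size=64, activation='relu'),
    dict(type='internal_lstm', size=64, length=10)
]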

Special layers

class tensorforce.core.layers.Block(name, layers, input_spec=None)[source]

Block of layers (specification key: block).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • layers (iter[specification]) –

    Layers configuration, see layers (required).

  • input_spec (specification) – Input tensor specification (internal use).
class tensorforce.core.layers.Function(name, function, output_spec=None, input_spec=None, summary_labels=None, l2_regularization=None)[source]

Custom TensorFlow function layer (specification key: function).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • function (lambda[x -> x]) – TensorFlow function (required).
  • output_spec (specification) – Output tensor specification containing type and/or shape information (default: same as input).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Keras(name, layer, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]

Keras layer (specification key: keras).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • layer (string) – Keras layer class name, see TensorFlow docs (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • kwargs – Arguments for the Keras layer, see TensorFlow docs.
class tensorforce.core.layers.Register(name, tensor, input_spec=None, summary_labels=None)[source]

Tensor registration layer, which is useful when defining more complex network architectures which do not follow the sequential layer-stack pattern, for instance, when handling multiple inputs (specification key: register).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • tensor (string) – Name under which tensor will be registered (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Retrieve(name, tensors, aggregation='concat', axis=0, input_spec=None, summary_labels=None)[source]

Tensor retrieval layer, which is useful when defining more complex network architectures which do not follow the sequential layer-stack pattern, for instance, when handling multiple inputs (specification key: retrieve).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • tensors (iter[string]) – Names of global tensors to retrieve, for instance, state names or previously registered global tensor names (required).
  • aggregation ('concat' | 'product' | 'stack' | 'sum') – Aggregation type in case of multiple tensors (default: ‘concat’).
  • axis (int >= 0) – Aggregation axis, excluding batch axis (default: 0).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Reuse(name, layer, is_trainable=True, input_spec=None)[source]

Reuse layer (specification key: reuse).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • layer (string) – Name of a previously defined layer (required).
  • is_trainable (bool) – Whether reused layer variables are kept trainable (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
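
The register and retrieve layers allow non-sequential architectures, for instance one layer-stack per input state whose outputs are then combined; a hedged sketch assuming two states named 'vision' and 'status':

network = [
    [
        dict(type='retrieve', tensors=['vision']),
        dict(type='conv2d', size=32, window=3),
        dict(type='flatten'),
        dict(type='register', tensor='vision-embedding')
    ],
    [
        dict(type='retrieve', tensors=['status']),
        dict(type='dense', size=32),
        dict(type='register', tensor='status-embedding')
    ],
    [
        dict(type='retrieve', tensors=['vision-embedding', 'status-embedding'], aggregation='concat'),
        dict(type='dense', size=64)
    ]
]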

Memories

Default memory: Replay with default argument capacity

class tensorforce.core.memories.Recent(name, capacity, values_spec, device=None, summary_labels=None)[source]

Batching memory which always retrieves most recent experiences (specification key: recent).

Parameters:
  • name (string) – Memory name (internal use).
  • capacity (int > 0) – Memory capacity, in experience timesteps (required).
  • values_spec (specification) – Values specification (internal use).
  • device (string) – Device name (default: inherit value of parent module).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.memories.Replay(name, capacity, values_spec, device=None, summary_labels=None)[source]

Replay memory which randomly retrieves experiences (specification key: replay).

Parameters:
  • name (string) – Memory name (internal use).
  • capacity (int > 0) – Memory capacity, in experience timesteps (required).
  • values_spec (specification) – Values specification (internal use).
  • device (string) – Device name (default: inherit value of parent module).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
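
A memory is typically passed to the agent as a specification dictionary; a minimal sketch, with agent type and remaining arguments chosen only for illustration:

from tensorforce.agents import Agent

agent = Agent.create(
    agent='dqn', environment=environment,
    memory=dict(type='replay', capacity=10000), batch_size=32
)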

Networks

Default network: LayeredNetwork with default argument layers

class tensorforce.core.networks.AutoNetwork(name, inputs_spec, size=64, depth=2, final_size=None, final_depth=1, internal_rnn=False, device=None, summary_labels=None, l2_regularization=None)[source]

Network which is automatically configured based on its input tensors, offering high-level customization (specification key: auto).

Parameters:
  • name (string) – Network name (internal use).
  • inputs_spec (specification) – Input tensors specification (internal use).
  • size (int > 0) – Layer size, before concatenation if multiple states (default: 64).
  • depth (int > 0) – Number of layers per state, before concatenation if multiple states (default: 2).
  • final_size (int > 0) – Layer size after concatenation if multiple states (default: layer size).
  • final_depth (int > 0) – Number of layers after concatenation if multiple states (default: 1).
  • internal_rnn (false | parameter, long >= 0) – Whether to add an internal state LSTM cell as last layer, and if so, horizon of the LSTM (default: false).
  • device (string) – Device name (default: inherit value of parent module).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.networks.LayeredNetwork(name, layers, inputs_spec, device=None, summary_labels=None, l2_regularization=None)[source]

Network consisting of Tensorforce layers, which can be specified as either a list of layer specifications in the case of a standard sequential layer-stack architecture, or as a list of list of layer specifications in the case of a more complex architecture consisting of multiple sequential layer-stacks (specification key: custom or layered).

Parameters:
  • name (string) – Network name (internal use).
  • layers (iter[specification] | iter[iter[specification]]) – Layers configuration, see layers (required).
  • inputs_spec (specification) – Input tensors specification (internal use).
  • device (string) – Device name (default: inherit value of parent module).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
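
The network argument of a policy or agent accepts either the automatically configured network or an explicit layer stack; a minimal sketch of both variants:

# AutoNetwork, configured based on the input tensors
network = 'auto'

# LayeredNetwork, sequential layer stack
network = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=64, activation='tanh')
]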

Objectives

class tensorforce.core.objectives.DeterministicPolicyGradient(name, summary_labels=None)[source]

Deterministic policy gradient objective (specification key: det_policy_gradient).

Parameters:
  • name (string) – Module name (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.objectives.Plus(name, objective1, objective2, summary_labels=None)[source]

Additive combination of two objectives (specification key: plus).

Parameters:
  • name (string) – Module name (internal use).
  • objective1 (specification) – First objective configuration (required).
  • objective2 (specification) – Second objective configuration (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.objectives.PolicyGradient(name, ratio_based=False, clipping_value=0.0, early_reduce=False, summary_labels=None)[source]

Policy gradient objective, which maximizes the log-likelihood or likelihood-ratio scaled by the target reward value (specification key: policy_gradient).

Parameters:
  • name (string) – Module name (internal use).
  • ratio_based (bool) – Whether to scale the likelihood-ratio instead of the log-likelihood (default: false).
  • clipping_value (parameter, float > 0.0) – Clipping threshold for the maximized value (default: no clipping).
  • early_reduce (bool) – Whether to compute objective for reduced likelihoods instead of per likelihood (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.objectives.Value(name, value='state', huber_loss=0.0, early_reduce=False, summary_labels=None)[source]

Value approximation objective, which minimizes the L2-distance between the state-(action-)value estimate and the target reward value (specification key: value).

Parameters:
  • name (string) – Module name (internal use).
  • value ("state" | "action") – Whether to approximate the state- or state-action-value (default: “state”).
  • huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
  • early_reduce (bool) – Whether to compute objective for reduced values instead of value per action (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).

Optimizers

Default optimizer: MetaOptimizerWrapper

class tensorforce.core.optimizers.ClippingStep(name, optimizer, threshold, mode='global_norm', summary_labels=None)[source]

Clipping-step meta optimizer, which clips the updates of the given optimizer (specification key: clipping_step).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • threshold (parameter, float > 0.0) – Clipping threshold (required).
  • mode ('global_norm' | 'norm' | 'value') – Clipping mode (default: ‘global_norm’).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.Evolutionary(name, learning_rate, num_samples=1, unroll_loop=False, summary_labels=None)[source]

Evolutionary optimizer, which samples random perturbations and applies them either as positive or negative update depending on their improvement of the loss (specification key: evolutionary).

Parameters:
  • name (string) – Module name (internal use).
  • learning_rate (parameter, float > 0.0) – Learning rate (required).
  • num_samples (parameter, int > 0) – Number of sampled perturbations (default: 1).
  • unroll_loop (bool) – Whether to unroll the sampling loop (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.GlobalOptimizer(name, optimizer, summary_labels=None)[source]

Global meta optimizer, which applies the given optimizer to the local variables, then applies the update to a corresponding set of global variables, and subsequently updates the local variables to the value of the global variables; will likely change in the future (specification key: global_optimizer).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.MetaOptimizerWrapper(name, optimizer, multi_step=1, subsampling_fraction=1.0, clipping_threshold=None, optimizing_iterations=0, summary_labels=None, **kwargs)[source]

Meta optimizer wrapper (specification key: meta_optimizer_wrapper).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • multi_step (parameter, int > 0) – Number of optimization steps (default: single step).
  • subsampling_fraction (parameter, 0.0 < float <= 1.0) – Fraction of batch timesteps to subsample (default: no subsampling).
  • clipping_threshold (parameter, float > 0.0) – Clipping threshold (default: no clipping).
  • optimizing_iterations (parameter, int >= 0) – Maximum number of line search iterations (default: no optimizing).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.MultiStep(name, optimizer, num_steps, unroll_loop=False, summary_labels=None)[source]

Multi-step meta optimizer, which applies the given optimizer for a number of times (specification key: multi_step).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • num_steps (parameter, int > 0) – Number of optimization steps (required).
  • unroll_loop (bool) – Whether to unroll the repetition loop (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.NaturalGradient(name, learning_rate, cg_max_iterations=10, cg_damping=0.001, cg_unroll_loop=False, summary_labels=None)[source]

Natural gradient optimizer (specification key: natural_gradient).

Parameters:
  • name (string) – Module name (internal use).
  • learning_rate (parameter, float > 0.0) – Learning rate as KL-divergence of distributions between optimization steps (required).
  • cg_max_iterations (int > 0) – Maximum number of conjugate gradient iterations. (default: 10).
  • cg_damping (float > 0.0) – Conjugate gradient damping factor. (default: 1e-3).
  • cg_unroll_loop (bool) – Whether to unroll the conjugate gradient loop (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.OptimizingStep(name, optimizer, ls_max_iterations=10, ls_accept_ratio=0.9, ls_mode='exponential', ls_parameter=0.5, ls_unroll_loop=False, summary_labels=None)[source]

Optimizing-step meta optimizer, which applies line search to the given optimizer to find a more optimal step size (specification key: optimizing_step).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • ls_max_iterations (parameter, int > 0) – Maximum number of line search iterations (default: 10).
  • ls_accept_ratio (parameter, float > 0.0) – Line search acceptance ratio (default: 0.9).
  • ls_mode ('exponential' | 'linear') – Line search mode, see line search solver (default: ‘exponential’).
  • ls_parameter (parameter, float > 0.0) – Line search parameter, see line search solver (default: 0.5).
  • ls_unroll_loop (bool) – Whether to unroll the line search loop (default: false).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.Plus(name, optimizer1, optimizer2, summary_labels=None)[source]

Additive combination of two optimizers (specification key: plus).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer1 (specification) – First optimizer configuration (required).
  • optimizer2 (specification) – Second optimizer configuration (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.SubsamplingStep(name, optimizer, fraction, summary_labels=None)[source]

Subsampling-step meta optimizer, which randomly samples a subset of batch instances before applying the given optimizer (specification key: subsampling_step).

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (specification) – Optimizer configuration (required).
  • fraction (parameter, 0.0 < float < 1.0) – Fraction of batch timesteps to subsample (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.Synchronization(name, sync_frequency=1, update_weight=1.0, summary_labels=None)[source]

Synchronization optimizer, which updates variables periodically to the value of a corresponding set of source variables (specification key: synchronization).

Parameters:
  • name (string) – Module name (internal use).
  • sync_frequency (parameter, int > 0) – Interval between updates which also perform a synchronization step (default: every update).
  • update_weight (parameter, 0.0 < float <= 1.0) – Update weight (default: 1.0).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.optimizers.TFOptimizer(name, optimizer, learning_rate=0.0003, gradient_norm_clipping=1.0, summary_labels=None, **kwargs)[source]

TensorFlow optimizer (specification key: tf_optimizer, adadelta, adagrad, adam, adamax, adamw, ftrl, lazyadam, nadam, radam, ranger, rmsprop, sgd, sgdw)

Parameters:
  • name (string) – Module name (internal use).
  • optimizer (adadelta | adagrad | adam | adamax | adamw | ftrl | lazyadam | nadam | radam | ranger | rmsprop | sgd | sgdw) – TensorFlow optimizer name, see TensorFlow docs and TensorFlow Addons docs (required unless given by specification key).
  • learning_rate (parameter, float > 0.0) – Learning rate (default: 3e-4).
  • gradient_norm_clipping (parameter, float > 0.0) – Clip gradients by the ratio of the sum of their norms (default: 1.0).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • kwargs – Arguments for the TensorFlow optimizer, special values “decoupled_weight_decay”, “lookahead” and “moving_average”, see TensorFlow docs and TensorFlow Addons docs.
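
Optimizers are specified as dictionaries; a hedged sketch of an Adam optimizer wrapped by the meta optimizer wrapper for multiple update steps and update clipping (values illustrative):

optimizer = dict(
    type='meta_optimizer_wrapper',
    optimizer=dict(type='adam', learning_rate=1e-3),
    multi_step=5, clipping_threshold=0.5
)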

Parameters

Default parameter: Constant

class tensorforce.core.parameters.Constant(name, value, dtype, summary_labels=None)[source]

Constant hyperparameter.

Parameters:
  • name (string) – Module name (internal use).
  • value (dtype-dependent) – Constant hyperparameter value (required).
  • dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.parameters.Decaying(name, dtype, unit, decay, initial_value, decay_steps, increasing=False, inverse=False, scale=1.0, summary_labels=None, **kwargs)[source]

Decaying hyperparameter.

Parameters:
  • name (string) – Module name (internal use).
  • dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
  • unit ("timesteps" | "episodes" | "updates") – Unit of decay schedule (required).
  • decay ("cosine" | "cosine_restarts" | "exponential" | "inverse_time" | "linear_cosine" | "linear_cosine_noisy" | "polynomial") – Decay type, see TensorFlow docs (required).
  • initial_value (float) – Initial value (required).
  • decay_steps (long) – Number of decay steps (required).
  • increasing (bool) – Whether to subtract the decayed value from 1.0 (default: false).
  • inverse (bool) – Whether to take the inverse of the decayed value (default: false).
  • scale (float) – Scaling factor for (inverse) decayed value (default: 1.0).
  • summary_labels ("all" | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • kwargs – Additional arguments depend on decay mechanism.
    Cosine decay:
    • alpha (float) – Minimum learning rate value as a fraction of learning_rate (default: 0.0).
    Cosine decay with restarts:
    • t_mul (float) – Used to derive the number of iterations in the i-th period (default: 2.0).
    • m_mul (float) – Used to derive the initial learning rate of the i-th period (default: 1.0).
    • alpha (float) – Minimum learning rate value as a fraction of the learning_rate (default: 0.0).
    Exponential decay:
    • decay_rate (float) – Decay rate (required).
    • staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion. (default: false).
    Inverse time decay:
    • decay_rate (float) – Decay rate (required).
    • staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion. (default: false).
    Linear cosine decay:
    • num_periods (float) – Number of periods in the cosine part of the decay (default: 0.5).
    • alpha (float) – Alpha value (default: 0.0).
    • beta (float) – Beta value (default: 0.001).
    Natural exponential decay:
    • decay_rate (float) – Decay rate (required).
    • staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion. (default: false).
    Noisy linear cosine decay:
    • initial_variance (float) – Initial variance for the noise (default: 1.0).
    • variance_decay (float) – Decay for the noise's variance (default: 0.55).
    • num_periods (float) – Number of periods in the cosine part of the decay (default: 0.5).
    • alpha (float) – Alpha value (default: 0.0).
    • beta (float) – Beta value (default: 0.001).
    Polynomial decay:
    • final_value (float) – Final value (required).
    • power (float) – Power of polynomial (default: 1.0, thus linear).
    • cycle (bool) – Whether to cycle beyond decay_steps (default: false).
class tensorforce.core.parameters.OrnsteinUhlenbeck(name, dtype, theta=0.15, sigma=0.3, mu=0.0, summary_labels=None)[source]

Ornstein-Uhlenbeck process.

Parameters:
  • name (string) – Module name (internal use).
  • dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
  • theta (float > 0.0) – Theta value (default: 0.15).
  • sigma (float > 0.0) – Sigma value (default: 0.3).
  • mu (float) – Mu value (default: 0.0).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.parameters.PiecewiseConstant(name, dtype, unit, boundaries, values, summary_labels=None)[source]

Piecewise-constant hyperparameter.

Parameters:
  • name (string) – Module name (internal use).
  • dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
  • unit ("timesteps" | "episodes" | "updates") – Unit of interval boundaries (required).
  • boundaries (iter[long]) – Strictly increasing interval boundaries for constant segments (required).
  • values (iter[dtype-dependent]) – Interval values of constant segments, one more than the number of boundaries (required).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.parameters.Random(name, dtype, distribution, shape=(), summary_labels=None, **kwargs)[source]

Random hyperparameter.

Parameters:
  • name (string) – Module name (internal use).
  • dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
  • distribution ("normal" | "uniform") – Distribution type for random hyperparameter value (required).
  • shape (iter[int > 0]) – Tensor shape (default: scalar).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • kwargs – Additional arguments dependent on distribution type.
    Normal distribution:
    • mean (float) – Mean (default: 0.0).
    • stddev (float > 0.0) – Standard deviation (default: 1.0).
    Uniform distribution:
    • minval (int / float) – Lower bound (default: 0 / 0.0).
    • maxval (float > minval) – Upper bound (default: 1.0 for float, required for int).
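
Anywhere a parameter argument is accepted, either a plain Python value (constant) or a specification dictionary can be passed; a minimal sketch of an exponentially decaying value, for instance for a learning rate or exploration (values illustrative; the dtype is typically supplied by the accepting module):

learning_rate = dict(
    type='decaying', unit='timesteps', decay='exponential',
    initial_value=1e-3, decay_steps=10000, decay_rate=0.5
)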

Preprocessing

class tensorforce.core.layers.Activation(name, nonlinearity, input_spec=None, summary_labels=None)[source]

Activation layer (specification key: activation).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • nonlinearity ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Nonlinearity (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Clipping(name, upper, lower=None, input_spec=None, summary_labels=None)[source]

Clipping layer (specification key: clipping).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • upper (parameter, float) – Upper clipping value (required).
  • lower (parameter, float) – Lower clipping value (default: negative upper value).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Deltafier(name, concatenate=False, input_spec=None, summary_labels=None)[source]

Deltafier layer computing the difference between the current and the previous input; can only be used as preprocessing layer (specification key: deltafier).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • concatenate (False | int >= 0) – Whether to concatenate instead of replace deltas with input, and if so, concatenation axis (default: false).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Dropout(name, rate, input_spec=None, summary_labels=None)[source]

Dropout layer (specification key: dropout).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • rate (parameter, 0.0 <= float < 1.0) – Dropout rate (required).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.ExponentialNormalization(name, decay=0.999, axes=None, input_spec=None, summary_labels=None)[source]

Normalization layer based on the exponential moving average (specification key: exponential_normalization).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • decay (parameter, 0.0 <= float <= 1.0) – Decay rate (default: 0.999).
  • axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all but last axis).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Image(name, height=None, width=None, grayscale=False, input_spec=None, summary_labels=None)[source]

Image preprocessing layer (specification key: image).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • height (int) – Height of resized image (default: no resizing or relative to width).
  • width (int) – Width of resized image (default: no resizing or relative to height).
  • grayscale (bool | iter[float]) – Turn into grayscale image, optionally using given weights (default: false).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.InstanceNormalization(name, axes=None, input_spec=None, summary_labels=None)[source]

Instance normalization layer (specification key: instance_normalization).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Sequence(name, length, axis=-1, concatenate=True, input_spec=None, summary_labels=None)[source]

Sequence layer stacking the current and previous inputs; can only be used as preprocessing layer (specification key: sequence).

Parameters:
  • name (string) – Layer name (default: internally chosen).
  • length (int > 0) – Number of inputs to concatenate (required).
  • axis (int >= 0) – Concatenation axis, excluding batch axis (default: last axis).
  • concatenate (bool) – Whether to concatenate inputs at given axis, otherwise introduce new sequence axis (default: true).
  • input_spec (specification) – Input tensor specification (internal use).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
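
Preprocessing layers are typically passed via the agent's preprocessing argument, specified per state (and optionally for the reward); a hedged sketch, with the exact keys depending on the agent's state names and types:

preprocessing = dict(
    state=[
        dict(type='image', height=64, width=64, grayscale=True),
        dict(type='exponential_normalization', decay=0.99)
    ],
    reward=dict(type='clipping', upper=1.0)
)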

Policies

Default policy: ParametrizedDistributions

class tensorforce.core.policies.ParametrizedDistributions(name, states_spec, actions_spec, network='auto', distributions=None, temperature=0.0, device=None, summary_labels=None, l2_regularization=None)[source]

Policy which parametrizes independent distributions per action conditioned on the output of a central states-processing neural network (supports both stochastic and action-value-based policy interface) (specification key: parametrized_distributions).

Parameters:
  • name (string) – Module name (internal use).
  • states_spec (specification) – States specification (internal use).
  • actions_spec (specification) – Actions specification (internal use).
  • network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
  • distributions (dict[specification]) – Distributions configuration, see distributions, specified per action-type or -name (default: per action-type, Bernoulli distribution for binary boolean actions, categorical distribution for discrete integer actions, Gaussian distribution for unbounded continuous actions, Beta distribution for bounded continuous actions).
  • temperature (parameter | dict[parameter], float >= 0.0) – Sampling temperature, global or per action (default: 0.0).
  • device (string) – Device name (default: inherit value of parent module).
  • summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
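
An explicit policy specification in this style might look as follows; a hedged sketch with a custom network and a Gaussian distribution for float actions (distribution choice illustrative):

policy = dict(
    type='parametrized_distributions',
    network=[dict(type='dense', size=64), dict(type='dense', size=64)],
    distributions=dict(float='gaussian'),
    temperature=0.1
)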

Runner

class tensorforce.execution.Runner(agent, environment=None, max_episode_timesteps=None, evaluation=False, num_parallel=None, environments=None, remote=None, blocking=False, host=None, port=None)[source]

Tensorforce runner utility.

Parameters:
  • agent (specification | Agent object) – Agent specification or object, the latter is not closed automatically as part of runner.close() (required).
  • environment (specification | Environment object) – Environment specification or object, the latter is not closed automatically as part of runner.close() (required, or alternatively environments, invalid for “socket-client” remote mode).
  • max_episode_timesteps (int > 0) – Maximum number of timesteps per episode, overwrites the environment default if defined (default: environment default, invalid for “socket-client” remote mode).
  • evaluation (bool) – Whether to run the (last if multiple) environment in evaluation mode (default: no evaluation).
  • num_parallel (int > 0) – Number of environment instances to execute in parallel (default: no parallel execution, implicitly specified by environments).
  • environments (list[specification | Environment object]) – Environment specifications or objects to execute in parallel, the latter are not closed automatically as part of runner.close() (default: no parallel execution, alternatively specified via environment and num_parallel, invalid for “socket-client” remote mode).
  • remote ("multiprocessing" | "socket-client") – Communication mode for remote environment execution of parallelized environment execution, “socket-client” mode requires a corresponding “socket-server” running (default: local execution).
  • blocking (bool) – Whether remote environment calls should be blocking, only valid if remote mode given (default: not blocking, invalid unless “multiprocessing” or “socket-client” remote mode).
  • host (str, iter[str]) – Socket server hostname(s) or IP address(es) (required only for “socket-client” remote mode).
  • port (int, iter[int]) – Socket server port(s), increasing sequence if single host and port given (required only for “socket-client” remote mode).
run(num_episodes=None, num_timesteps=None, num_updates=None, batch_agent_calls=False, sync_timesteps=False, sync_episodes=False, num_sleep_secs=0.001, callback=None, callback_episode_frequency=None, callback_timestep_frequency=None, use_tqdm=True, mean_horizon=1, evaluation=False, save_best_agent=None, evaluation_callback=None)[source]

Run experiment.

Parameters:
  • num_episodes (int > 0) – Number of episodes to run experiment (default: no episode limit).
  • num_timesteps (int > 0) – Number of timesteps to run experiment (default: no timestep limit).
  • num_updates (int > 0) – Number of agent updates to run experiment (default: no update limit).
  • batch_agent_calls (bool) – Whether to batch agent calls for parallel environment execution (default: separate call per environment).
  • sync_timesteps (bool) – Whether to synchronize parallel environment execution on timestep-level, implied by batch_agent_calls (default: not synchronized unless batch_agent_calls).
  • sync_episodes (bool) – Whether to synchronize parallel environment execution on episode-level (default: not synchronized).
  • num_sleep_secs (float) – Sleep duration if no environment is ready (default: one millisecond).
  • callback ((Runner, parallel) -> bool) – Callback function taking the runner instance plus parallel index and returning a boolean value indicating whether execution should continue (default: callback always true).
  • callback_episode_frequency (int) – Episode interval between callbacks (default: every episode).
  • callback_timestep_frequency (int) – Timestep interval between callbacks (default: not specified).
  • use_tqdm (bool) – Whether to display a tqdm progress bar for the experiment run (default: display progress bar).
  • mean_horizon (int) – Number of episodes over which progress bar values and the evaluation score are averaged (default: not averaged).
  • evaluation (bool) – Whether to run in evaluation mode, only valid if a single environment (default: no evaluation).
  • save_best_agent (string) – Directory to save the best version of the agent according to the evaluation score (default: best agent is not saved).
  • evaluation_callback (int | Runner -> float) – Callback function taking the runner instance and returning an evaluation score (default: cumulative evaluation reward averaged over mean_horizon episodes).
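
Typical runner usage, assuming agent and environment objects created as shown earlier in this document:

from tensorforce.execution import Runner

runner = Runner(agent=agent, environment=environment)
runner.run(num_episodes=200)
runner.run(num_episodes=10, evaluation=True)
runner.close()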

Environment interface

Initialization and termination

static Environment.create(environment=None, max_episode_timesteps=None, remote=None, blocking=False, host=None, port=None, **kwargs)[source]

Creates an environment from a specification. In case of “socket-server” remote mode, runs environment in server communication loop until closed.

Parameters:
  • environment (specification | Environment class/object) – JSON file, specification key, configuration dictionary, library module, Environment class/object, or gym.Env (required, invalid for "socket-client" remote mode).
  • max_episode_timesteps (int > 0) – Maximum number of timesteps per episode, overwrites the environment default if defined (default: environment default, invalid for “socket-client” remote mode).
  • remote ("multiprocessing" | "socket-client" | "socket-server") – Communication mode for remote environment execution of parallelized environment execution, “socket-client” mode requires a corresponding “socket-server” running, and “socket-server” mode runs environment in server communication loop until closed (default: local execution).
  • blocking (bool) – Whether remote environment calls should be blocking (default: not blocking, invalid unless “multiprocessing” or “socket-client” remote mode).
  • host (str) – Socket server hostname or IP address (required only for “socket-client” remote mode).
  • port (int) – Socket server port (required only for “socket-client/server” remote mode).
  • kwargs – Additional arguments.
Environment.close()[source]

Closes the environment.

Attributes

Environment.states()[source]

Returns the state space specification.

Returns:Arbitrarily nested dictionary of state descriptions with the following attributes:
  • type ("bool" | "int" | "float") – state data type (default: "float").
  • shape (int | iter[int]) – state shape (required).
  • num_states (int > 0) – number of discrete state values (required for type "int").
  • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
Return type:specification
Environment.actions()[source]

Returns the action space specification.

Returns:Arbitrarily nested dictionary of action descriptions with the following attributes:
  • type ("bool" | "int" | "float") – action data type (required).
  • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
  • num_actions (int > 0) – number of discrete action values (required for type "int").
  • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
Return type:specification
Environment.max_episode_timesteps()[source]

Returns the maximum number of timesteps per episode.

Returns:Maximum number of timesteps per episode.
Return type:int

Interaction functions

Environment.reset()[source]

Resets the environment to start a new episode.

Returns:Dictionary containing initial state(s) and auxiliary information.
Return type:dict[state]
Environment.execute(actions)[source]

Executes the given action(s) and advances the environment by one step.

Parameters:actions (dict[action]) – Dictionary containing action(s) to be executed (required).
Returns:Dictionary containing next state(s), a terminal indicator (true/1 if a terminal state is reached, 2 if the episode was aborted, false/0 otherwise), and the observed reward.
Return type:dict[state], bool | 0 | 1 | 2, float
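
These interaction functions underlie a manual act-observe loop; a minimal sketch for a single episode, assuming an agent created via Agent.create:

states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)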

Arcade Learning Environment

class tensorforce.environments.ArcadeLearningEnvironment(level, life_loss_terminal=False, life_loss_punishment=0.0, repeat_action_probability=0.0, visualize=False, frame_skip=1, seed=None)[source]

Arcade Learning Environment adapter (specification key: ale, arcade_learning_environment).

May require:

sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev cmake

git clone https://github.com/mgbellemare/Arcade-Learning-Environment.git
cd Arcade-Learning-Environment

mkdir build && cd build
cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=ON ..
make -j 4
cd ..

pip3 install .
Parameters:
  • level (string) – ALE rom file (required).
  • life_loss_terminal (bool) – Whether a loss of life is signaled as a terminal state (default: false).
  • life_loss_punishment (float) – Punishment (negative reward) on loss of life (default: 0.0).
  • repeat_action_probability (float) – Repeats last action with given probability (default: 0.0).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
  • seed (int) – Random seed (default: none).

Maze Explorer

class tensorforce.environments.MazeExplorer(level, visualize=False)[source]

MazeExplorer environment adapter (specification key: mazeexp, maze_explorer).

May require:

sudo apt-get install freeglut3-dev

pip3 install mazeexp
Parameters:
  • level (int) – Game mode, see GitHub (required).
  • visualize (bool) – Whether to visualize interaction (default: false).

Open Sim

class tensorforce.environments.OpenSim(level, visualize=False, integrator_accuracy=5e-05)[source]

OpenSim environment adapter (specification key: osim, open_sim).

Parameters:
  • level ('Arm2D' | 'L2Run' | 'Prosthetics') – Environment id (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • integrator_accuracy (float) – Integrator accuracy (default: 5e-5).

OpenAI Gym

class tensorforce.environments.OpenAIGym(level, visualize=False, max_episode_steps=None, terminal_reward=0.0, reward_threshold=None, drop_states_indices=None, visualize_directory=None, **kwargs)[source]

OpenAI Gym environment adapter (specification key: gym, openai_gym).

May require:

pip3 install gym
pip3 install gym[all]
Parameters:
  • level (string | gym.Env) – Gym id or instance (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • max_episode_steps (false | int > 0) – Whether to terminate episodes after a maximum number of timesteps, and if so, that maximum (default: Gym default).
  • terminal_reward (float) – Additional reward for early termination, if otherwise indistinguishable from termination due to maximum number of timesteps (default: Gym default).
  • reward_threshold (float) – Gym environment argument, the reward threshold before the task is considered solved (default: Gym default).
  • drop_states_indices (list[int]) – Drop states indices (default: none).
  • visualize_directory (string) – Visualization output directory (default: none).
  • kwargs – Additional Gym environment arguments.

OpenAI Retro

class tensorforce.environments.OpenAIRetro(level, visualize=False, visualize_directory=None, **kwargs)[source]

OpenAI Retro environment adapter (specification key: retro, openai_retro).

May require:

pip3 install gym-retro
Parameters:
  • level (string) – Game id (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • visualize_directory (string) – Visualization output directory (default: none).
  • kwargs – Additional Retro environment arguments.

PyGame Learning Environment

class tensorforce.environments.PyGameLearningEnvironment(level, visualize=False, frame_skip=1, fps=30)[source]

PyGame Learning Environment environment adapter (specification key: ple, pygame_learning_environment).

May require:

sudo apt-get install git python3-dev python3-setuptools python3-numpy python3-opengl     libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libsdl1.2-dev     libportmidi-dev libswscale-dev libavformat-dev libavcodec-dev libtiff5-dev libx11-6     libx11-dev fluid-soundfont-gm timgm6mb-soundfont xfonts-base xfonts-100dpi xfonts-75dpi     xfonts-cyrillic fontconfig fonts-freefont-ttf libfreetype6-dev

pip3 install git+https://github.com/pygame/pygame.git

pip3 install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
Parameters:
  • level (string | subclass of ple.games.base) – Game instance or name of class in ple.games, like “Catcher”, “Doom”, “FlappyBird”, “MonsterKong”, “Pixelcopter”, “Pong”, “PuckWorld”, “RaycastMaze”, “Snake”, “WaterWorld” (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
  • fps (int > 0) – The desired frames per second we want to run our game at (default: 30).

ViZDoom

class tensorforce.environments.ViZDoom(level, visualize=False, include_variables=False, factored_action=False, frame_skip=12, seed=None)[source]

ViZDoom environment adapter (specification key: vizdoom).

May require:

sudo apt-get install g++ build-essential libsdl2-dev zlib1g-dev libmpg123-dev libjpeg-dev     libsndfile1-dev nasm tar libbz2-dev libgtk2.0-dev make cmake git chrpath timidity     libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip libboost-all-dev     liblua5.1-dev

pip3 install vizdoom
Parameters:
  • level (string) – ViZDoom configuration file (required).
  • include_variables (bool) – Whether to include game variables to state (default: false).
  • factored_action (bool) – Whether to use factored action representation (default: false).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 12).
  • seed (int) – Random seed (default: none).