Tensorforce: a TensorFlow library for applied reinforcement learning¶
Tensorforce is an open-source deep reinforcement learning framework, with an emphasis on modularized flexible library design and straightforward usability for applications in research and practice. Tensorforce is built on top of Google’s TensorFlow framework version 2.0 (!) and compatible with Python 3 (Python 2 support was dropped with version 0.5).
Tensorforce follows a set of high-level design choices which differentiate it from other similar libraries:
- Modular component-based design: Feature implementations, above all, strive to be as generally applicable and configurable as possible, potentially at some cost of faithfully resembling details of the introducing paper.
- Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.
- Full-on TensorFlow models: The entire reinforcement learning logic, including control flow, is implemented in TensorFlow, to enable portable computation graphs independent of application programming language, and to facilitate the deployment of models.
Installation¶
A stable version of Tensorforce is periodically updated on PyPI and installed as follows:
pip3 install tensorforce
To always use the latest version of Tensorforce, install the GitHub version instead:
git clone https://github.com/tensorforce/tensorforce.git
cd tensorforce
pip3 install -e .
Tensorforce is built on top of Google’s TensorFlow and requires that either tensorflow or tensorflow-gpu is installed, currently as version 1.13.1. To include the correct version of TensorFlow with the installation of Tensorforce, simply add the flag tf for the normal CPU version or tf_gpu for the GPU version:
# PyPI version plus TensorFlow CPU version
pip3 install tensorforce[tf]
# GitHub version plus TensorFlow GPU version
pip3 install -e .[tf_gpu]
Some environments require additional packages, for which install options are also available (mazeexp, gym, retro, vizdoom; or envs for all environments); however, some also require other tools to be installed (see environments documentation).
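For example, to install the Gym environment dependencies (or all environment dependencies via the envs option):
# PyPI version plus Gym environment dependencies
pip3 install tensorforce[gym]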
Getting started¶
Initializing an environment¶
It is recommended to initialize an environment via the Environment.create(...) interface.
from tensorforce.environments import Environment
For instance, the OpenAI CartPole environment can be initialized as follows:
environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)
Gym’s pre-defined versions are also accessible:
environment = Environment.create(environment='gym', level='CartPole-v1')
Alternatively, an environment can be specified as a config file:
{
    "environment": "gym",
    "level": "CartPole"
}
Environment config files can be loaded by passing their file path:
environment = Environment.create(
    environment='environment.json', max_episode_timesteps=500
)
Custom Gym environments can be used in the same way, but require the corresponding class(es) to be imported and registered accordingly.
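A hedged sketch of what this might look like, assuming a hypothetical Gym-compatible class MyGymEnv in a module my_envs:
from gym.envs.registration import register

# Register the hypothetical custom Gym environment under an id
register(id='MyGymEnv-v0', entry_point='my_envs:MyGymEnv')

# Afterwards it can be loaded like any other Gym environment
environment = Environment.create(environment='gym', level='MyGymEnv-v0')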
Finally, it is possible to implement a custom environment using Tensorforce’s Environment interface:
import numpy as np


class CustomEnvironment(Environment):

    def __init__(self):
        super().__init__()

    def states(self):
        return dict(type='float', shape=(8,))

    def actions(self):
        return dict(type='int', num_values=4)

    # Optional, should only be defined if environment has a natural maximum
    # episode length
    def max_episode_timesteps(self):
        return super().max_episode_timesteps()

    # Optional
    def close(self):
        super().close()

    def reset(self):
        state = np.random.random(size=(8,))
        return state

    def execute(self, actions):
        assert 0 <= actions.item() <= 3
        next_state = np.random.random(size=(8,))
        terminal = np.random.random() < 0.5
        reward = np.random.random()
        return next_state, terminal, reward
Custom environment implementations can be loaded by passing their module path:
environment = Environment.create(
    environment='custom_env.CustomEnvironment', max_episode_timesteps=10
)
It is strongly recommended to specify the max_episode_timesteps argument of Environment.create(...) unless specified by the environment (or for evaluation), as otherwise more agent parameters may require specification.
Initializing an agent¶
Similarly to environments, it is recommended to initialize an agent via the Agent.create(...) interface.
from tensorforce.agents import Agent
For instance, the generic Tensorforce agent can be initialized as follows:
agent = Agent.create(
    agent='tensorforce', environment=environment, update=64,
    objective='policy_gradient', reward_estimation=dict(horizon=20)
)
Other pre-defined agent classes can alternatively be used, for instance, Proximal Policy Optimization:
agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10, learning_rate=1e-3
)
Alternatively, an agent can be specified as a config file:
{
    "agent": "tensorforce",
    "update": 64,
    "objective": "policy_gradient",
    "reward_estimation": {
        "horizon": 20
    }
}
Agent config files can be loaded by passing their file path:
agent = Agent.create(agent='agent.json', environment=environment)
It is recommended to pass the environment object returned by Environment.create(...) as the environment argument of Agent.create(...), so that the states, actions and max_episode_timesteps arguments are automatically specified accordingly.
Training and evaluation¶
It is recommended to use the execution utilities for training and evaluation, like the Runner utility, which offer a range of configuration options:
from tensorforce.execution import Runner
A basic experiment consisting of training and subsequent evaluation can be written in a few lines of code:
runner = Runner(
    agent='agent.json',
    environment=dict(environment='gym', level='CartPole'),
    max_episode_timesteps=500
)

runner.run(num_episodes=200)
runner.run(num_episodes=100, evaluation=True)
runner.close()
The execution utility classes take care of handling the agent-environment interaction correctly, and thus should be used where possible. Alternatively, if more detailed control over the agent-environment interaction is required, a simple training and evaluation loop can be written as follows:
# Create agent and environment
environment = Environment.create(
    environment='environment.json', max_episode_timesteps=500
)
agent = Agent.create(agent='agent.json', environment=environment)

# Train for 200 episodes
for _ in range(200):
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)

# Evaluate for 100 episodes
sum_rewards = 0.0
for _ in range(100):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        actions, internals = agent.act(states=states, internals=internals, evaluation=True)
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward

print('Mean episode reward:', sum_rewards / 100)

# Close agent and environment
agent.close()
environment.close()
Module specification¶
Agents are instantiated via Agent.create(agent=...), with either of the specification alternatives presented below (agent acts as the type argument). It is recommended to pass the application’s Environment implementation as the second argument environment, which automatically provides the corresponding states, actions and max_episode_timesteps arguments of the agent.
How to specify modules¶
Dictionary with module type and arguments¶
Agent.create(...
    policy=dict(network=dict(type='layered', layers=[dict(type='dense', size=32)])),
    memory=dict(type='replay', capacity=10000), ...
)
JSON specification file (plus additional arguments)¶
Agent.create(...
    policy=dict(network='network.json'),
    memory=dict(type='memory.json', capacity=10000), ...
)
Module path (plus additional arguments)¶
Agent.create(...
    policy=dict(network='my_module.TestNetwork'),
    memory=dict(type='tensorforce.core.memories.Replay', capacity=10000), ...
)
Callable or Type (plus additional arguments)¶
Agent.create(...
    policy=dict(network=TestNetwork),
    memory=dict(type=Replay, capacity=10000), ...
)
Default module: only arguments or first argument¶
Agent.create(...
    policy=dict(network=[dict(type='dense', size=32)]),
    memory=dict(capacity=10000), ...
)
Static vs dynamic hyperparameters¶
Tensorforce distinguishes between agent/module arguments (primitive types: bool/int/long/float) which specify part of the TensorFlow model architecture, like the layer size, and those which specify a value within the architecture, like the learning rate. Whereas the former are statically defined as part of the agent initialization, the latter can be dynamically adjusted afterwards. These dynamic hyperparameters are indicated by parameter as part of their type specification in the documentation, and can alternatively be assigned a parameter module instead of a constant value, for instance, to specify a decaying learning rate.
Example: exponentially decaying exploration¶
Agent.create(...
    exploration=dict(
        type='decaying', unit='timesteps', decay='exponential',
        initial_value=0.1, decay_steps=1000, decay_rate=0.5
    ), ...
)
Example: linearly increasing horizon¶
Agent.create(...
    reward_estimation=dict(horizon=dict(
        type='decaying', dtype='long', unit='episodes', decay='polynomial',
        initial_value=10.0, decay_steps=1000, final_value=50.0, power=1.0
    )), ...
)
Features¶
Parallel environment execution¶
Execute multiple environments running locally in one call / batched:
runner = Runner(
    agent='benchmarks/configs/ppo1.json', environment='CartPole-v1',
    num_parallel=5
)
runner.run(num_episodes=100, batch_agent_calls=True)
Execute environments running in different processes whenever ready / unbatched:
runner = Runner(
    agent='benchmarks/configs/ppo1.json', environment='CartPole-v1',
    num_parallel=5, remote='multiprocessing'
)
runner.run(num_episodes=100)
Execute environments running on different machines, here using run.py instead of Runner:
# Environment machine 1
python run.py --environment gym --level CartPole-v1 --remote socket-server \
--port 65432
# Environment machine 2
python run.py --environment gym --level CartPole-v1 --remote socket-server \
--port 65433
# Agent machine
python run.py --agent benchmarks/configs/ppo1.json --episodes 100 \
--num-parallel 2 --remote socket-client --host 127.0.0.1,127.0.0.1 \
--port 65432,65433 --batch-agent-calls
Action masking¶
agent = Agent.create(
    states=dict(type='float', shape=(10,)),
    actions=dict(type='int', shape=(), num_values=3), ...
)
...
states = dict(
    state=np.random.random_sample(size=(10,)),  # regular state
    action_mask=[True, False, True]  # mask as '[ACTION-NAME]_mask'
)
action = agent.act(states=states)
assert action != 1
Record & pretrain¶
agent = Agent.create(...
    recorder=dict(
        directory='data/traces',
        frequency=100  # record a traces file every 100 episodes
    ), ...
)
...
agent.close()
# Pretrain agent on recorded traces
agent = Agent.create(...)
agent.pretrain(
    directory='data/traces',
    num_iterations=100  # perform 100 update iterations on traces (more configurations possible)
)
Save & restore¶
TensorFlow saver (full model)¶
agent = Agent.create(...
    saver=dict(
        directory='data/checkpoints',
        frequency=600  # save checkpoint every 600 seconds (10 minutes)
    ), ...
)
...
agent.close()
# Restore latest agent checkpoint
agent = Agent.load(directory='data/checkpoints')
NumPy / HDF5 (only weights)¶
agent = Agent.create(...
    saver=dict(
        directory='data/checkpoints',
        frequency=600  # save checkpoint every 600 seconds (10 minutes)
    ), ...
)
...
agent.save(directory='data/checkpoints', format='numpy', append='episodes')
# Restore latest agent checkpoint
agent = Agent.load(directory='data/checkpoints', format='numpy')
TensorBoard¶
Agent.create(...
    summarizer=dict(
        directory='data/summaries',
        # list of labels, or 'all'
        labels=['graph', 'entropy', 'kl-divergence', 'losses', 'rewards'],
        frequency=100  # store values every 100 timesteps
        # (infrequent update summaries every update; other configurations possible)
    ), ...
)
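The recorded summaries can then be viewed with the standard TensorBoard command-line tool, for example:
tensorboard --logdir data/summaries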
run.py – Runner¶
Agent arguments¶
--[a]gent (string, required unless “socket-server” remote mode) – Agent (name, configuration JSON file, or library module)
--[n]etwork (string, default: not specified) – Network (name, configuration JSON file, or library module)
Environment arguments¶
--[e]nvironment (string, required unless “socket-client” remote mode) – Environment (name, configuration JSON file, or library module)
--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[m]ax-episode-timesteps (int, default: not specified) – Maximum number of timesteps per episode
--visualize (bool, default: false) – Visualize agent–environment interaction, if supported
--visualize-directory (string, default: not specified) – Directory to store videos of agent–environment interaction, if supported
--import-modules (string, default: not specified) – Import comma-separated modules required for environment
Parallel execution arguments¶
--num-parallel (int, default: no parallel execution) – Number of environment instances to execute in parallel
--batch-agent-calls (bool, default: false) – Batch agent calls for parallel environment execution
--sync-timesteps (bool, default: false) – Synchronize parallel environment execution on timestep-level
--sync-episodes (bool, default: false) – Synchronize parallel environment execution on episode-level
--remote (str, default: local execution) – Communication mode for remote execution of parallelized environments: “multiprocessing” | “socket-client” | “socket-server”. In case of “socket-server”, runs the environment in a server communication loop until closed.
--blocking (bool, default: false) – Remote environments should be blocking
--host (str, only for “socket-client” remote mode) – Socket server hostname(s) or IP address(es), single value or comma-separated list
--port (str, only for “socket-client/server” remote mode) – Socket server port(s), single value or comma-separated list, increasing sequence if single host and port given
Runner arguments¶
--e[v]aluation (bool, default: false) – Run environment (last if multiple) in evaluation mode
--e[p]isodes (int, default: not specified) – Number of episodes
--[t]imesteps (int, default: not specified) – Number of timesteps
--[u]pdates (int, default: not specified) – Number of agent updates
--mean-horizon (int, default: 1) – Number of episodes progress bar values and evaluation score are averaged over
--save-best-agent (bool, default: false) – Save the best version of the agent according to the evaluation score
Logging arguments¶
--[r]epeat (int, default: 1) – Number of repetitions
--path (string, default: not specified) – Logging path, directory plus filename without extension
--seaborn (bool, default: false) – Use seaborn
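For illustration, a training run combining several of the arguments above might look as follows (assuming the referenced agent configuration file exists):
python run.py --agent benchmarks/configs/ppo1.json --environment gym \
    --level CartPole-v1 --max-episode-timesteps 500 --episodes 100 \
    --repeat 2 --path logs/cartpole_ppo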
tune.py – Hyperparameter tuner¶
Required arguments¶
#1: environment (string) – Environment (name, configuration JSON file, or library module)
Optional arguments¶
--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[m]ax-repeats (int, default: 1) – Maximum number of repetitions
--[n]um-iterations (int, default: 1) – Number of BOHB iterations
--[d]irectory (string, default: “tuner”) – Output directory
--[r]estore (string, default: not specified) – Restore from given directory
--id (string, default: “worker”) – Unique worker id
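For illustration, a tuning run for CartPole using the arguments above might be started as follows:
python tune.py gym --level CartPole-v1 --max-repeats 2 --num-iterations 3 \
    --directory tuner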
Agent interface¶
Initialization and termination¶
static TensorforceAgent.create(agent='tensorforce', environment=None, **kwargs)¶
Creates an agent from a specification.
Parameters:
- agent (specification | Agent class/object) – JSON file, specification key, configuration dictionary, library module, or Agent class/object (default: Policy agent).
- environment (Environment object) – Environment which the agent is supposed to be trained on, environment-related arguments like state/action space specifications and maximum episode length will be extracted if given (recommended).
- kwargs – Additional arguments.
TensorforceAgent.close()¶
Closes the agent.
Main reinforcement learning interface¶
TensorforceAgent.act(states, internals=None, parallel=0, independent=False, deterministic=False, evaluation=False, query=None, **kwargs)¶
Returns action(s) for the given state(s), needs to be followed by observe(...) unless independent mode is set via independent/evaluation.
Parameters:
- states (dict[state] | iter[dict[state]]) – Dictionary containing state(s) to be acted on (required).
- internals (dict[internal] | iter[dict[internal]]) – Dictionary containing current internal agent state(s), either given by initial_internals() at the beginning of an episode or as return value of the preceding act(...) call (required if independent mode and agent has internal states).
- parallel (int | iter[int]) – Parallel execution index (default: 0).
- independent (bool) – Whether act is not part of the main agent-environment interaction, and this call is thus not followed by observe (default: false).
- deterministic (bool) – If independent mode, whether to act deterministically, so no exploration and sampling (default: false).
- evaluation (bool) – Whether the agent is currently evaluated, implies independent and deterministic (default: false).
- query (list[str]) – Names of tensors to retrieve (default: none).
- kwargs – Additional input values, for instance, for dynamic hyperparameters.
Returns: dict[action] | iter[dict[action]], plus dict[internal] | iter[dict[internal]] if internals argument given, plus optional list[str] – Dictionary containing action(s), dictionary containing next internal agent state(s) if independent mode, plus queried tensor values if requested.
TensorforceAgent.observe(reward, terminal=False, parallel=0, query=None, **kwargs)¶
Observes reward and whether a terminal state is reached, needs to be preceded by act(...).
Parameters:
- reward (float | iter[float]) – Reward (required).
- terminal (bool | 0 | 1 | 2 | iter[..]) – Whether a terminal state is reached, or 2 if the episode was aborted (default: false).
- parallel (int | iter[int]) – Parallel execution index (default: 0).
- query (list[str]) – Names of tensors to retrieve (default: none).
- kwargs – Additional input values, for instance, for dynamic hyperparameters.
Returns: Whether an update was performed, plus queried tensor values if requested.
Return type: (bool | int, optional list[str])
Required for evaluation at episode start¶
TensorforceAgent.initial_internals()¶
Returns the initial internal agent state(s), to be used at the beginning of an episode as internals argument for act(...) in independent mode.
Returns: Dictionary containing initial internal agent state(s).
Return type: dict[internal]
Loading and saving¶
static TensorforceAgent.load(directory=None, filename=None, format=None, environment=None, **kwargs)¶
Restores an agent from a specification directory/file.
Parameters:
- directory (str) – Checkpoint directory (default: current directory “.”).
- filename (str) – Checkpoint filename, with or without append and extension (default: “agent”).
- format ("tensorflow" | "numpy" | "hdf5") – File format (default: format matching directory and filename, required to be unambiguous).
- environment (Environment object) – Environment which the agent is supposed to be trained on, environment-related arguments like state/action space specifications and maximum episode length will be extracted if given (recommended).
- kwargs – Additional arguments.
TensorforceAgent.save(directory=None, filename=None, format='tensorflow', append=None)¶
Saves the agent to a checkpoint.
Parameters:
- directory (str) – Checkpoint directory (default: directory specified for TensorFlow saver, otherwise current directory).
- filename (str) – Checkpoint filename, without extension (default: filename specified for TensorFlow saver, otherwise name of agent).
- format ("tensorflow" | "numpy" | "hdf5") – File format, “tensorflow” uses TensorFlow saver to store both variables and graph meta information, whereas the others only store variables as NumPy/HDF5 file (default: TensorFlow format).
- append ("timesteps" | "episodes" | "updates") – Append current timestep/episode/update to checkpoint filename (default: none).
Returns: Checkpoint path.
Return type: str
Get and assign variables¶
TensorforceAgent.get_variables()¶
Returns the names of all agent variables.
Returns: Names of variables.
Return type: list[str]
TensorforceAgent.get_variable(variable)¶
Returns the value of the variable with the given name.
Parameters: variable (string) – Variable name (required).
Returns: Variable value.
Return type: numpy-array
TensorforceAgent.assign_variable(variable, value)¶
Assigns the given value to the variable with the given name.
Parameters:
- variable (string) – Variable name (required).
- value (variable-compatible value) – Value to assign to variable (required).
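A minimal sketch of how these three functions might be combined, assuming an already initialized agent (actual variable names depend on the agent configuration):
# Inspect agent variables and overwrite one of them (illustrative only)
names = agent.get_variables()
value = agent.get_variable(variable=names[0])
agent.assign_variable(variable=names[0], value=value * 0.5)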
Advanced functions for specialized use cases¶
TensorforceAgent.experience(states, actions, terminal, reward, internals=None, query=None, **kwargs)[source]¶
Feed experience traces.
Parameters:
- states (dict[array[state]]) – Dictionary containing arrays of states (required).
- actions (dict[array[action]]) – Dictionary containing arrays of actions (required).
- terminal (array[bool]) – Array of terminals (required).
- reward (array[float]) – Array of rewards (required).
- internals (dict[state]) – Dictionary containing arrays of internal agent states (default: no internal states).
- query (list[str]) – Names of tensors to retrieve (default: none).
- kwargs – Additional input values, for instance, for dynamic hyperparameters.
TensorforceAgent.update(query=None, **kwargs)[source]¶
Perform an update.
Parameters:
- query (list[str]) – Names of tensors to retrieve (default: none).
- kwargs – Additional input values, for instance, for dynamic hyperparameters.
TensorforceAgent.pretrain(directory, num_iterations, num_traces=1, num_updates=1)[source]¶
Pretrain from experience traces.
Parameters:
- directory (path) – Directory with experience traces, e.g. obtained via recorder; episode length has to be consistent with agent configuration (required).
- num_iterations (int > 0) – Number of iterations consisting of loading new traces and performing multiple updates (required).
- num_traces (int > 0) – Number of traces to load per iteration; has to at least satisfy the update batch size (default: 1).
- num_updates (int > 0) – Number of updates per iteration (default: 1).
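A minimal sketch of how experience(...) and update(...) might be combined, assuming an agent with a single 8-dimensional float state and a single integer action with 4 values (as in the custom environment example above); the exact array/dictionary structure may need to be adapted to the agent specification:
import numpy as np

# Feed 100 recorded timesteps forming one complete episode, then trigger an update
agent.experience(
    states=np.random.random_sample(size=(100, 8)),  # assumed single float state
    actions=np.random.randint(4, size=(100,)),  # assumed single int action
    terminal=np.concatenate([np.zeros(shape=(99,), dtype=bool), np.ones(shape=(1,), dtype=bool)]),
    reward=np.random.random_sample(size=(100,))
)
agent.update()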
Others¶
TensorforceAgent.reset()¶
Resets all agent buffers and discards unfinished episodes.
TensorforceAgent.get_output_tensors(function)¶
Returns the names of output tensors for the given function.
Parameters: function (str) – Function name (required).
Returns: Names of output tensors.
Return type: list[str]
TensorforceAgent.get_available_summaries()¶
Returns the summary labels provided by the agent.
Returns: Available summary labels.
Return type: list[str]
Constant Agent¶
class tensorforce.agents.ConstantAgent(states, actions, max_episode_timesteps=None, action_values=None, name='agent', device=None, seed=None, summarizer=None, recorder=None, config=None)[source]¶
Agent returning constant action values (specification key: constant).
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- action_values (dict[value]) – Constant value per action (default: false for binary boolean actions, 0 for discrete integer actions, 0.0 for continuous actions).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- name (string) – Agent name, used e.g. for TensorFlow scopes (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0) – how frequently in timesteps to record summaries (default: always).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all or list of summaries to record, from the following labels (default: only "graph"):
- "graph": graph summary
- "parameters": parameter scalars
- recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
Random Agent¶
class tensorforce.agents.RandomAgent(states, actions, max_episode_timesteps=None, name='agent', device=None, seed=None, summarizer=None, recorder=None, config=None)[source]¶
Agent returning random action values (specification key: random).
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- name (string) – Agent name, used e.g. for TensorFlow scopes (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0) – how frequently in timesteps to record summaries (default: always).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all or list of summaries to record, from the following labels (default: only "graph"):
- "graph": graph summary
- "parameters": parameter scalars
- recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
Tensorforce Agent¶
class tensorforce.agents.TensorforceAgent(states, actions, update, objective, reward_estimation, max_episode_timesteps=None, policy='default', memory=None, optimizer='adam', baseline_policy=None, baseline_optimizer=None, baseline_objective=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, buffer_observe=True, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Tensorforce agent (specification key: tensorforce).
Highly configurable agent and basis for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- policy (specification) – Policy configuration, see policies (default: “default”, action distributions parametrized by an automatically configured network).
- memory (int | specification) – Memory configuration, see memories (default: replay memory with given or inferred capacity).
- update (int | specification) – Model update configuration with the following attributes (required, default: timesteps batch size):
- unit ("timesteps" | "episodes") – unit for update attributes (required).
- batch_size (parameter, long > 0) – size of update batch in number of units (required).
- frequency ("never" | parameter, long > 0) – frequency of updates (default: batch_size).
- start (parameter, long >= batch_size) – number of units before first update (default: 0).
- optimizer (specification) – Optimizer configuration, see optimizers (default: Adam optimizer).
- objective (specification) – Optimization objective configuration, see objectives (required).
- reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation (required).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 1.0).
- estimate_horizon (false | "early" | "late") – Whether to estimate the value of horizon states, and if so, whether to estimate early when experience is stored, or late when it is retrieved (default: "late" if any of the baseline_* arguments is specified, else false).
- estimate_actions (bool) – Whether to estimate state-action values instead of state values (default: false).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- estimate_advantage (bool) – Whether to estimate the advantage by subtracting the current estimate (default: false).
- baseline_policy (specification) – Baseline policy configuration, main policy will be used as baseline if none (default: none).
- baseline_optimizer (float > 0.0 | specification) – Baseline optimizer configuration, see optimizers, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).
- baseline_objective (specification) – Baseline optimization objective configuration, see objectives, main objective will be used for baseline if none (default: none).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
- variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- buffer_observe (bool | int > 0) – Maximum number of timesteps within an episode to buffer before executing internal observe operations, to reduce calls to TensorFlow for improved performance (default: max_episode_timesteps or 1000, unless summarizer specified).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
Deep Q-Network¶
class tensorforce.agents.DeepQNetwork(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, huber_loss=0.0, horizon=0, discount=0.99, estimate_terminal=False, target_sync_frequency=1, target_update_weight=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Deep Q-Network agent (specification key: dqn).
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
- batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- target_sync_frequency (parameter, int > 0) – Interval between target network updates (default: every update).
- target_update_weight (parameter, 0.0 < float <= 1.0) – Target network update weight (default: 1.0).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
- variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
Dueling DQN¶
class tensorforce.agents.DuelingDQN(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, huber_loss=0.0, horizon=0, discount=0.99, estimate_terminal=False, target_sync_frequency=1, target_update_weight=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Dueling DQN agent (specification key: dueling_dqn).
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
- batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- target_sync_frequency (parameter, int > 0) – Interval between target network updates (default: every update).
- target_update_weight (parameter, 0.0 < float <= 1.0) – Target network update weight (default: 1.0).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
- variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
Vanilla Policy Gradient¶
class tensorforce.agents.VanillaPolicyGradient(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, discount=0.99, estimate_terminal=False, baseline_network=None, baseline_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Vanilla Policy Gradient aka REINFORCE agent (specification key: vpg).
Parameters:
- states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
- network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- baseline_network (specification) – Baseline network configuration, see networks, main policy will be used as baseline if none (default: none).
- baseline_optimizer (float > 0.0 | specification) – Baseline optimizer configuration, see optimizers, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).
- memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
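For illustration, such an agent is usually constructed through the Agent.create(...) interface using the specification key above rather than by instantiating the class directly. A minimal sketch, assuming an environment object has already been created and with the remaining arguments simply restating the defaults listed above:

from tensorforce.agents import Agent

# Sketch: VPG agent via its specification key; states, actions and
# max_episode_timesteps are inferred from the environment argument.
agent = Agent.create(
    agent='vpg', environment=environment,
    batch_size=10, learning_rate=3e-4, discount=0.99
)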
Actor-Critic¶
class tensorforce.agents.ActorCritic(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, horizon=0, discount=0.99, state_action_value=False, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Actor-Critic agent (specification key: ac).
Parameters:
- states (specification) – States specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of state descriptions (usually taken fromEnvironment.states()
) with the following attributes:- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of action descriptions (usually taken fromEnvironment.actions()
) with the following attributes:- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode
(default: not given, better implicitly
specified via
environment
argument forAgent.create(...)
). - network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- state_action_value (bool) – Whether to estimate state-action values instead of state values (default: false).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- critic_network (specification) –
Critic network configuration, see networks (default: “auto”).
- critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
- memory (int > 0) – Memory capacity, has to fit at least around batch_size + one episode (default: minimum required size).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
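As a sketch (argument values are illustrative, not recommendations), an Actor-Critic agent with an explicit critic can be specified via the critic_network argument, and the float form of critic_optimizer documented above can be used to weight the critic loss:

from tensorforce.agents import Agent

# Sketch: 'ac' agent with a small explicit critic network (layer
# specification format, see the layers documentation below); the
# `environment` object is assumed to exist already.
agent = Agent.create(
    agent='ac', environment=environment,
    batch_size=10, horizon=10,
    critic_network=[dict(type='dense', size=64), dict(type='dense', size=64)],
    critic_optimizer=2.0  # float: main optimizer is reused, with this critic loss weight
)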
Advantage Actor-Critic¶
class tensorforce.agents.AdvantageActorCritic(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, horizon=0, discount=0.99, state_action_value=False, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Advantage Actor-Critic agent (specification key: a2c).
Parameters:
- states (specification) – States specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of state descriptions (usually taken fromEnvironment.states()
) with the following attributes:- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of action descriptions (usually taken fromEnvironment.actions()
) with the following attributes:- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode
(default: not given, better implicitly
specified via
environment
argument forAgent.create(...)
). - network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- state_action_value (bool) – Whether to estimate state-action values instead of state values (default: false).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- critic_network (specification) –
Critic network configuration, see networks (default: “auto”).
- critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
- memory (int > 0) – Memory capacity, has to fit at least around batch_size + one episode (default: minimum required size).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
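To illustrate the summarizer configuration listed above, the following sketch enables TensorBoard summaries for an A2C agent; the directory name and label selection are illustrative assumptions:

from tensorforce.agents import Agent

# Sketch: 'a2c' agent writing selected TensorBoard summaries; the
# 'summaries' directory name is an arbitrary choice.
agent = Agent.create(
    agent='a2c', environment=environment,
    batch_size=10,
    summarizer=dict(
        directory='summaries',
        labels=['losses', 'rewards'],
        frequency=100  # record act-summaries every 100 timesteps
    )
)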
Deterministic Policy Gradient¶
class tensorforce.agents.DeterministicPolicyGradient(states, actions, memory, max_episode_timesteps=None, network='auto', batch_size=32, update_frequency=None, start_updating=None, learning_rate=0.0003, horizon=0, discount=0.99, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Deterministic Policy Gradient agent (specification key: dpg). Action space is required to consist of only a single float action.
Parameters:
- states (specification) – States specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of state descriptions (usually taken fromEnvironment.states()
) with the following attributes:- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of action descriptions (usually taken fromEnvironment.actions()
) with the following attributes:- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode
(default: not given, better implicitly
specified via
environment
argument forAgent.create(...)
). - network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- memory (int) – Replay memory capacity, has to fit at least around batch_size + one episode (required).
- batch_size (parameter, long > 0) – Number of timesteps per update batch (default: 32 timesteps).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- start_updating (parameter, long >= batch_size) – Number of timesteps before first update (default: none).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- critic_network (specification) –
Critic network configuration, see networks (default: "auto").
- critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
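A sketch of a DPG agent, which, unlike the on-policy agents above, requires an explicit replay memory capacity and a single float action; the argument values are illustrative only:

from tensorforce.agents import Agent

# Sketch: 'dpg' agent with a replay memory of 10000 timesteps; updates
# start after 1000 timesteps, and exploration adds Gaussian noise to the
# float action.
agent = Agent.create(
    agent='dpg', environment=environment,
    memory=10000, batch_size=32, start_updating=1000,
    exploration=0.1
)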
Proximal Policy Optimization¶
class tensorforce.agents.ProximalPolicyOptimization(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, subsampling_fraction=0.33, optimization_steps=10, likelihood_ratio_clipping=0.2, discount=0.99, estimate_terminal=False, critic_network=None, critic_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Proximal Policy Optimization agent (specification key: ppo).
Parameters:
- states (specification) – States specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of state descriptions (usually taken fromEnvironment.states()
) with the following attributes:- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of action descriptions (usually taken fromEnvironment.actions()
) with the following attributes:- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode
(default: not given, better implicitly
specified via
environment
argument forAgent.create(...)
). - network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
- subsampling_fraction (parameter, 0.0 < float <= 1.0) – Fraction of batch timesteps to subsample (default: 0.33).
- optimization_steps (parameter, int > 0) – Number of optimization steps (default: 10).
- likelihood_ratio_clipping (parameter, float > 0.0) – Likelihood-ratio clipping threshold (default: 0.2).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- critic_network (specification) –
Critic network configuration, see networks, main policy will be used as critic if none (default: none).
- critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, main optimizer will be used for critic if none, a float implies none and specifies a custom weight for the critic loss (default: none).
- memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
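A sketch of a PPO agent created via its specification key, spelling out the surrogate-optimization arguments at the defaults listed above; the environment object is assumed to exist already:

from tensorforce.agents import Agent

# Sketch: 'ppo' agent; the explicit values simply restate the defaults
# documented above.
agent = Agent.create(
    agent='ppo', environment=environment,
    batch_size=10, learning_rate=3e-4,
    subsampling_fraction=0.33, optimization_steps=10,
    likelihood_ratio_clipping=0.2
)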
Trust-Region Policy Optimization¶
class tensorforce.agents.TrustRegionPolicyOptimization(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.001, likelihood_ratio_clipping=0.2, discount=0.99, estimate_terminal=False, critic_network=None, critic_optimizer=None, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]¶
Trust Region Policy Optimization agent (specification key: trpo).
Parameters:
- states (specification) – States specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of state descriptions (usually taken fromEnvironment.states()
) with the following attributes:- type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_values (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
- actions (specification) – Actions specification
(required, better implicitly specified via
environment
argument forAgent.create(...)
), arbitrarily nested dictionary of action descriptions (usually taken fromEnvironment.actions()
) with the following attributes:- type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_values (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
- max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode
(default: not given, better implicitly
specified via
environment
argument forAgent.create(...)
). - network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
- batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
- update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
- learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 1e-3).
- likelihood_ratio_clipping (parameter, float > 0.0) – Likelihood-ratio clipping threshold (default: 0.2).
- discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
- estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
- critic_network (specification) –
Critic network configuration, see networks, main policy will be used as critic if none (default: none).
- critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, main optimizer will be used for critic if none, a float implies none and specifies a custom weight for the critic loss (default: none).
- memory (int > 0) – Memory capacity, has to fit at least around batch_size + 1 episodes (default: minimum required size).
- preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
- exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action,
defined as the probability for uniformly random output in case of
bool
andint
actions, and the standard deviation of Gaussian noise added to every output in case offloat
actions (default: 0.0). - variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
- l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
- entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
- name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
- device (string) – Device name (default: TensorFlow default).
- parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
- seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
- execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
- saver (specification) – TensorFlow saver configuration with the following attributes
(default: no saver):
- directory (path) – saver directory (required).
- filename (string) – model filename (default: agent name).
- frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
- load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
- max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
- summarizer (specification) – TensorBoard summarizer configuration with the following
attributes (default: no summarizer):
- directory (path) – summarizer directory (required).
- frequency (int > 0, dict[int > 0]) – how frequently in timesteps to record summaries for act-summaries if specified globally (default: always), otherwise specified for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
- flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
- max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
- labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
- "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
- "dropout": dropout zero fraction
- "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
- "graph": graph summary
- "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
- "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
- "parameters": parameter scalars
- "relu": ReLU activation zero fraction
- "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
- "update-norm": update norm
- "updates": update mean and variance scalars
- "updates-histogram": update histograms
- "variables": variable mean and variance scalars
- "variables-histogram": variable histograms
- recorder (specification) – Experience traces recorder configuration with the following
attributes (default: no recorder):
- directory (path) – recorder directory (required).
- frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
- start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
- max-traces (int > 0) – maximum number of traces to keep (default: all).
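A sketch of a TRPO agent with periodic checkpointing via the saver attributes listed above; the directory name is an arbitrary assumption:

from tensorforce.agents import Agent

# Sketch: 'trpo' agent saving the model every 600 seconds to the
# (hypothetical) 'checkpoints' directory.
agent = Agent.create(
    agent='trpo', environment=environment,
    batch_size=10, learning_rate=1e-3,
    saver=dict(directory='checkpoints', frequency=600)
)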
Distributions¶
class tensorforce.core.distributions.Bernoulli(name, action_spec, embedding_shape, summary_labels=None)[source]¶
Bernoulli distribution, for binary boolean actions (specification key: bernoulli).
Parameters:
- name (string) – Distribution name (internal use).
- action_spec (specification) – Action specification (internal use).
- embedding_shape (iter[int > 0]) – Embedding shape (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Beta(name, action_spec, embedding_shape, summary_labels=None)[source]¶
Beta distribution, for bounded continuous actions (specification key: beta).
Parameters:
- name (string) – Distribution name (internal use).
- action_spec (specification) – Action specification (internal use).
- embedding_shape (iter[int > 0]) – Embedding shape (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Categorical(name, action_spec, embedding_shape, infer_states_value=True, summary_labels=None)[source]¶
Categorical distribution, for discrete integer actions (specification key: categorical).
Parameters:
- name (string) – Distribution name (internal use).
- action_spec (specification) – Action specification (internal use).
- embedding_shape (iter[int > 0]) – Embedding shape (internal use).
- infer_states_value (bool) – Whether to infer the state value from state-action values as softmax denominator (default: true).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.distributions.Gaussian(name, action_spec, embedding_shape, summary_labels=None)[source]¶
Gaussian distribution, for unbounded continuous actions (specification key: gaussian).
Parameters:
- name (string) – Distribution name (internal use).
- action_spec (specification) – Action specification (internal use).
- embedding_shape (iter[int > 0]) – Embedding shape (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
Layers¶
Default layer: Function with default argument function
Convolutional layers¶
class tensorforce.core.layers.Conv1d(name, size, window=3, stride=1, padding='same', dilation=1, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶
1-dimensional convolutional layer (specification key: conv1d).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- window (int > 0) – Window size (default: 3).
- stride (int > 0) – Stride size (default: 1).
- padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
- dilation (int > 0 | (int > 0, int > 0)) – Dilation value (default: 1).
- bias (bool) – Whether to add a trainable bias variable (default: true).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Conv2d(name, size, window=3, stride=1, padding='same', dilation=1, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶
2-dimensional convolutional layer (specification key: conv2d).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- window (int > 0 | (int > 0, int > 0)) – Window size (default: 3).
- stride (int > 0 | (int > 0, int > 0)) – Stride size (default: 1).
- padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
- dilation (int > 0 | (int > 0, int > 0)) – Dilation value (default: 1).
- bias (bool) – Whether to add a trainable bias variable (default: true).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
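In practice, layers are rarely instantiated directly; instead they are referenced by their specification keys inside a network specification. A sketch of a convolutional network as a list of layer dicts, assuming an image-shaped state (the 'flatten' and 'dense' keys are documented further below):

# Sketch: convolutional policy network as a list of layer specifications.
network = [
    dict(type='conv2d', size=32, window=3, stride=2, activation='relu'),
    dict(type='conv2d', size=64, window=3, stride=2),
    dict(type='flatten'),
    dict(type='dense', size=128)
]

# The list can then be passed as the network argument of an agent, e.g.
# Agent.create(agent='ppo', environment=environment, network=network).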
Dense layers¶
class tensorforce.core.layers.Dense(name, size, bias=True, activation='relu', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶
Dense fully-connected layer (specification key: dense).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- bias (bool) – Whether to add a trainable bias variable (default: true).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "relu").
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
class tensorforce.core.layers.Linear(name, size, bias=True, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶
Linear layer (specification key: linear).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- bias (bool) – Whether to add a trainable bias variable (default: true).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
Embedding layers¶
class tensorforce.core.layers.Embedding(name, size, num_embeddings=None, max_norm=None, bias=False, activation='tanh', dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶
Embedding layer (specification key: embedding).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- num_embeddings (int > 0) – If set, specifies the number of embeddings (default: none).
- max_norm (float) – If set, embeddings are clipped if their L2-norm is larger (default: none).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: "tanh").
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for potential parent class.
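A sketch of a network specification that embeds a discrete integer state before further processing; the sizes are illustrative assumptions:

# Sketch: embedding layer mapping integer indices (< num_embeddings) to
# 32-dimensional vectors, followed by dense layers.
network = [
    dict(type='embedding', size=32, num_embeddings=1000),
    dict(type='dense', size=64),
    dict(type='dense', size=64)
]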
Recurrent layers¶
class tensorforce.core.layers.Gru(name, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶
Gated recurrent unit layer (specification key: gru).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras GRU layer, see TensorFlow docs.
class tensorforce.core.layers.Lstm(name, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶
Long short-term memory layer (specification key: lstm).
Parameters:
- name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras LSTM layer, see TensorFlow docs.
class tensorforce.core.layers.Rnn(name, cell, size, return_final_state=True, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶
Recurrent neural network layer (specification key: rnn).
Parameters:
- name (string) – Layer name (default: internally chosen).
- cell ('gru' | 'lstm') – The recurrent cell type (required).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- return_final_state (bool) – Whether to return the final state instead of the per-step outputs (default: true).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras RNN layer, see TensorFlow docs.
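A sketch of a network using the recurrent layers above for a state that carries a time axis; with return_final_state=True (the default) only the last step's output is passed on:

# Sketch: sequence state processed by a dense layer, then an LSTM whose
# final state feeds the subsequent policy head.
network = [
    dict(type='dense', size=64),
    dict(type='lstm', size=64, return_final_state=True)
]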
Pooling layers¶
class tensorforce.core.layers.Flatten(name, input_spec=None, summary_labels=None)[source]¶
Flatten layer (specification key: flatten).
Parameters:
- name (string) – Layer name (default: internally chosen).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
class tensorforce.core.layers.Pooling(name, reduction, input_spec=None, summary_labels=None)[source]¶
Pooling layer (global pooling) (specification key: pooling).
Parameters:
- name (string) – Layer name (default: internally chosen).
- reduction ('concat' | 'max' | 'mean' | 'product' | 'sum') – Pooling type (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Pool1d
(name, reduction, window=2, stride=2, padding='same', input_spec=None, summary_labels=None)[source]¶ 1-dimensional pooling layer (local pooling) (specification key:
pool1d
).Parameters: - name (string) – Layer name (default: internally chosen).
- reduction ('average' | 'max') – Pooling type (required).
- window (int > 0) – Window size (default: 2).
- stride (int > 0) – Stride size (default: 2).
- padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Pool2d
(name, reduction, window=2, stride=2, padding='same', input_spec=None, summary_labels=None)[source]¶ 2-dimensional pooling layer (local pooling) (specification key:
pool2d
).Parameters: - name (string) – Layer name (default: internally chosen).
- reduction ('average' | 'max') – Pooling type (required).
- window (int > 0 | (int > 0, int > 0)) – Window size (default: 2).
- stride (int > 0 | (int > 0, int > 0)) – Stride size (default: 2).
- padding ('same' | 'valid') – Padding type, see TensorFlow docs (default: ‘same’).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
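For illustration, a hedged sketch of how the pooling layers above might appear in an image-processing layer stack, assuming a convolution layer with specification key conv2d as documented earlier among the convolutional layers (all sizes arbitrary):

# Illustrative layer stack: convolution, local 2-dimensional max-pooling, then flattening
vision_network = [
    dict(type='conv2d', size=32, window=3),
    dict(type='pool2d', reduction='max', window=2, stride=2),
    dict(type='flatten')
]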
Normalization layers¶
-
class
tensorforce.core.layers.
ExponentialNormalization
(name, decay=0.999, axes=None, input_spec=None, summary_labels=None)[source]¶ Normalization layer based on the exponential moving average (specification key:
exponential_normalization
).Parameters: - name (string) – Layer name (default: internally chosen).
- decay (parameter, 0.0 <= float <= 1.0) – Decay rate (default: 0.999).
- axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all but last axis).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
-
class
tensorforce.core.layers.
InstanceNormalization
(name, axes=None, input_spec=None, summary_labels=None)[source]¶ Instance normalization layer (specification key:
instance_normalization
).Parameters: - name (string) – Layer name (default: internally chosen).
- axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
Misc layers¶
-
class
tensorforce.core.layers.
Activation
(name, nonlinearity, input_spec=None, summary_labels=None)[source]¶ Activation layer (specification key:
activation
).Parameters: - name (string) – Layer name (default: internally chosen).
- nonlinearity ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Nonlinearity (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Clipping
(name, upper, lower=None, input_spec=None, summary_labels=None)[source]¶ Clipping layer (specification key:
clipping
).Parameters: - name (string) – Layer name (default: internally chosen).
- upper (parameter, float) – Upper clipping value (required).
- lower (parameter, float) – Lower clipping value (default: negative upper value).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Deltafier
(name, concatenate=False, input_spec=None, summary_labels=None)[source]¶ Deltafier layer computing the difference between the current and the previous input; can only be used as preprocessing layer (specification key:
deltafier
).Parameters: - name (string) – Layer name (default: internally chosen).
- concatenate (False | int >= 0) – Whether to concatenate instead of replace deltas with input, and if so, concatenation axis (default: false).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Dropout
(name, rate, input_spec=None, summary_labels=None)[source]¶ Dropout layer (specification key:
dropout
).Parameters: - name (string) – Layer name (default: internally chosen).
- rate (parameter, 0.0 <= float < 1.0) – Dropout rate (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Image
(name, height=None, width=None, grayscale=False, input_spec=None, summary_labels=None)[source]¶ Image preprocessing layer (specification key:
image
).Parameters: - name (string) – Layer name (default: internally chosen).
- height (int) – Height of resized image (default: no resizing or relative to width).
- width (int) – Width of resized image (default: no resizing or relative to height).
- grayscale (bool | iter[float]) – Turn into grayscale image, optionally using given weights (default: false).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Reshape
(name, shape, input_spec=None, summary_labels=None)[source]¶ Reshape layer (specification key:
reshape
).Parameters: - name (string) – Layer name (default: internally chosen).
- shape (int | iter[int]) – New shape (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Sequence
(name, length, axis=-1, concatenate=True, input_spec=None, summary_labels=None)[source]¶ Sequence layer stacking the current and previous inputs; can only be used as preprocessing layer (specification key:
sequence
).Parameters: - name (string) – Layer name (default: internally chosen).
- length (int > 0) – Number of inputs to concatenate (required).
- axis (int >= 0) – Concatenation axis, excluding batch axis (default: last axis).
- concatenate (bool) – Whether to concatenate inputs at given axis, otherwise introduce new sequence axis (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
Layers with internal states¶
-
class
tensorforce.core.layers.
InternalGru
(name, size, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶ Internal state GRU cell layer (specification key:
internal_gru
).Parameters: - name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- length (parameter, long > 0) – Temporal horizon of the internal state (required).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras GRU layer, see TensorFlow docs.
-
class
tensorforce.core.layers.
InternalLstm
(name, size, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶ Internal state LSTM cell layer (specification key:
internal_lstm
).Parameters: - name (string) – Layer name (default: internally chosen).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- length (parameter, long > 0) – Temporal horizon of the internal state (required).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras LSTM layer, see TensorFlow docs.
-
class
tensorforce.core.layers.
InternalRnn
(name, cell, size, length, bias=False, activation=None, dropout=0.0, is_trainable=True, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶ Internal state RNN cell layer (specification key:
internal_rnn
).Parameters: - name (string) – Layer name (default: internally chosen).
- cell ('gru' | 'lstm') – The recurrent cell type (required).
- size (int >= 0) – Layer output size, 0 implies additionally removing the axis (required).
- length (parameter, long > 0) – Temporal horizon of the internal state (required).
- bias (bool) – Whether to add a trainable bias variable (default: false).
- activation ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Activation nonlinearity (default: none).
- dropout (parameter, 0.0 <= float < 1.0) – Dropout rate (default: 0.0).
- is_trainable (bool) – Whether layer variables are trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
- kwargs – Additional arguments for Keras RNN cell layer, see TensorFlow docs.
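A brief sketch of how an internal-state cell might be used as the last layer of a policy network, so that the agent carries recurrent state across timesteps; the size and the horizon value below are illustrative only:

# Illustrative network ending in an internal-state LSTM cell (specification key: internal_lstm);
# length is assumed here to denote the cell's temporal horizon
recurrent_network = [
    dict(type='dense', size=64),
    dict(type='internal_lstm', size=64, length=10)
]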
Special layers¶
-
class
tensorforce.core.layers.
Block
(name, layers, input_spec=None)[source]¶ Block of layers (specification key:
block
).Parameters: - name (string) – Layer name (default: internally chosen).
- layers (iter[specification]) –
Layers configuration, see layers (required).
- input_spec (specification) – Input tensor specification (internal use).
-
class
tensorforce.core.layers.
Function
(name, function, output_spec=None, input_spec=None, summary_labels=None, l2_regularization=None)[source]¶ Custom TensorFlow function layer (specification key:
function
).Parameters: - name (string) – Layer name (default: internally chosen).
- function (lambda[x -> x]) – TensorFlow function (required).
- output_spec (specification) – Output tensor specification containing type and/or shape information (default: same as input).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Keras
(name, layer, input_spec=None, summary_labels=None, l2_regularization=None, **kwargs)[source]¶ Keras layer (specification key:
keras
).Parameters: - layer (string) – Keras layer class name, see TensorFlow docs (required).
- kwargs – Arguments for the Keras layer, see TensorFlow docs.
-
class
tensorforce.core.layers.
Register
(name, tensor, input_spec=None, summary_labels=None)[source]¶ Tensor registration layer, which registers its input tensor under a name for later retrieval, useful when defining more complex network architectures which do not follow the sequential layer-stack pattern, for instance, when handling multiple inputs (specification key:
register
).Parameters: - name (string) – Layer name (default: internally chosen).
- tensor (string) – Name under which tensor will be registered (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Retrieve
(name, tensors, aggregation='concat', axis=0, input_spec=None, summary_labels=None)[source]¶ Tensor retrieval layer, which is useful when defining more complex network architectures which do not follow the sequential layer-stack pattern, for instance, when handling multiple inputs (specification key:
retrieve
).Parameters: - name (string) – Layer name (default: internally chosen).
- tensors (iter[string]) – Names of global tensors to retrieve, for instance, state names or previously registered global tensor names (required).
- aggregation ('concat' | 'product' | 'stack' | 'sum') – Aggregation type in case of multiple tensors (default: ‘concat’).
- axis (int >= 0) – Aggregation axis, excluding batch axis (default: 0).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Reuse
(name, layer, is_trainable=True, input_spec=None)[source]¶ Reuse layer (specification key:
reuse
).Parameters: - name (string) – Layer name (default: internally chosen).
- layer (string) – Name of a previously defined layer (required).
- is_trainable (bool) – Whether reused layer variables are kept trainable (default: true).
- input_spec (specification) – Input tensor specification (internal use).
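The register and retrieve layers above are intended for non-sequential architectures. Below is a hedged sketch of a two-input network specified as a list of layer-stacks; the state names 'observation' and 'velocity' and the registered tensor names are made up for illustration:

# Illustrative multi-input network: each inner list is one sequential layer-stack
multi_input_network = [
    [
        dict(type='retrieve', tensors=['observation']),
        dict(type='conv2d', size=32),
        dict(type='flatten'),
        dict(type='register', tensor='obs-embedding')
    ],
    [
        dict(type='retrieve', tensors=['velocity']),
        dict(type='dense', size=32),
        dict(type='register', tensor='vel-embedding')
    ],
    [
        dict(type='retrieve', tensors=['obs-embedding', 'vel-embedding'], aggregation='concat'),
        dict(type='dense', size=64)
    ]
]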
Memories¶
Default memory: Replay
with default argument capacity
-
class
tensorforce.core.memories.
Recent
(name, capacity, values_spec, device=None, summary_labels=None)[source]¶ Batching memory which always retrieves most recent experiences (specification key:
recent
).Parameters: - name (string) – Memory name (internal use).
- capacity (int > 0) – Memory capacity, in experience timesteps (required).
- values_spec (specification) – Values specification (internal use).
- device (string) – Device name (default: inherit value of parent module).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.memories.
Replay
(name, capacity, values_spec, device=None, summary_labels=None)[source]¶ Replay memory which randomly retrieves experiences (specification key:
replay
).Parameters: - name (string) – Memory name (internal use).
- capacity (int > 0) – Memory capacity, in experience timesteps (required).
- values_spec (specification) – Values specification (internal use).
- device (string) – Device name (default: inherit value of parent module).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
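As a usage sketch, a memory is typically passed to the agent as a specification dictionary; the agent type, capacity and batch size below are arbitrary, and environment is assumed to have been created as in the getting-started section:

from tensorforce.agents import Agent

# Illustrative: DQN-style agent with an explicit replay memory (specification key: replay)
agent = Agent.create(
    agent='dqn', environment=environment,
    memory=dict(type='replay', capacity=10000),
    batch_size=32  # illustrative value
)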
Networks¶
Default network: LayeredNetwork
with default argument layers
-
class
tensorforce.core.networks.
AutoNetwork
(name, inputs_spec, size=64, depth=2, final_size=None, final_depth=1, internal_rnn=False, device=None, summary_labels=None, l2_regularization=None)[source]¶ Network which is automatically configured based on its input tensors, offering high-level customization (specification key:
auto
).Parameters: - name (string) – Network name (internal use).
- inputs_spec (specification) – Input tensors specification (internal use).
- size (int > 0) – Layer size, before concatenation if multiple states (default: 64).
- depth (int > 0) – Number of layers per state, before concatenation if multiple states (default: 2).
- final_size (int > 0) – Layer size after concatenation if multiple states (default: layer size).
- final_depth (int > 0) – Number of layers after concatenation if multiple states (default: 1).
- internal_rnn (false | parameter, long >= 0) – Whether to add an internal state LSTM cell as last layer, and if so, horizon of the LSTM (default: false).
- device (string) – Device name (default: inherit value of parent module).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
-
class
tensorforce.core.networks.
LayeredNetwork
(name, layers, inputs_spec, device=None, summary_labels=None, l2_regularization=None)[source]¶ Network consisting of Tensorforce layers, which can be specified as either a list of layer specifications in the case of a standard sequential layer-stack architecture, or as a list of list of layer specifications in the case of a more complex architecture consisting of multiple sequential layer-stacks (specification key:
custom
orlayered
).Parameters: - name (string) – Network name (internal use).
- layers (iter[specification] | iter[iter[specification]]) – Layers configuration, see layers (required).
- inputs_spec (specification) – Input tensors specification (internal use).
- device (string) – Device name (default: inherit value of parent module).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
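For illustration, a minimal sketch of a layered-network specification as it could be passed wherever a network is expected; layer types and sizes are arbitrary:

# Illustrative sequential layer-stack network (specification key: layered / custom)
policy_network = dict(
    type='layered',
    layers=[
        dict(type='dense', size=64, activation='relu'),
        dict(type='dense', size=64, activation='relu')
    ]
)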
Objectives¶
-
class
tensorforce.core.objectives.
DeterministicPolicyGradient
(name, summary_labels=None)[source]¶ Deterministic policy gradient objective (specification key:
det_policy_gradient
).Parameters: - name (string) – Module name (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.objectives.
Plus
(name, objective1, objective2, summary_labels=None)[source]¶ Additive combination of two objectives (specification key:
plus
).Parameters: - name (string) – Module name (internal use).
- objective1 (specification) – First objective configuration (required).
- objective2 (specification) – Second objective configuration (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.objectives.
PolicyGradient
(name, ratio_based=False, clipping_value=0.0, early_reduce=False, summary_labels=None)[source]¶ Policy gradient objective, which maximizes the log-likelihood or likelihood-ratio scaled by the target reward value (specification key:
policy_gradient
).Parameters: - name (string) – Module name (internal use).
- ratio_based (bool) – Whether to scale the likelihood-ratio instead of the log-likelihood (default: false).
- clipping_value (parameter, float > 0.0) – Clipping threshold for the maximized value (default: no clipping).
- early_reduce (bool) – Whether to compute objective for reduced likelihoods instead of per likelihood (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.objectives.
Value
(name, value='state', huber_loss=0.0, early_reduce=False, summary_labels=None)[source]¶ Value approximation objective, which minimizes the L2-distance between the state-(action-)value estimate and the target reward value (specification key:
value
).Parameters: - name (string) – Module name (internal use).
- value ("state" | "action") – Whether to approximate the state- or state-action-value (default: “state”).
- huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no huber loss).
- early_reduce (bool) – Whether to compute objective for reduced values instead of value per action (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
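Objectives are typically only configured explicitly for the generic tensorforce agent; as a hedged sketch, a likelihood-ratio policy-gradient objective could be specified as follows (the clipping value is arbitrary):

# Illustrative objective specification (specification key: policy_gradient)
objective = dict(type='policy_gradient', ratio_based=True, clipping_value=0.2)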
Optimizers¶
Default optimizer: MetaOptimizerWrapper
-
class
tensorforce.core.optimizers.
ClippingStep
(name, optimizer, threshold, mode='global_norm', summary_labels=None)[source]¶ Clipping-step meta optimizer, which clips the updates of the given optimizer (specification key:
clipping_step
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- threshold (parameter, float > 0.0) – Clipping threshold (required).
- mode ('global_norm' | 'norm' | 'value') – Clipping mode (default: ‘global_norm’).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
Evolutionary
(name, learning_rate, num_samples=1, unroll_loop=False, summary_labels=None)[source]¶ Evolutionary optimizer, which samples random perturbations and applies them either as positive or negative update depending on their improvement of the loss (specification key:
evolutionary
).Parameters: - name (string) – Module name (internal use).
- learning_rate (parameter, float > 0.0) – Learning rate (required).
- num_samples (parameter, int > 0) – Number of sampled perturbations (default: 1).
- unroll_loop (bool) – Whether to unroll the sampling loop (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
GlobalOptimizer
(name, optimizer, summary_labels=None)[source]¶ Global meta optimizer, which applies the given optimizer to the local variables, then applies the update to a corresponding set of global variables, and subsequently updates the local variables to the value of the global variables; will likely change in the future (specification key:
global_optimizer
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
MetaOptimizerWrapper
(name, optimizer, multi_step=1, subsampling_fraction=1.0, clipping_threshold=None, optimizing_iterations=0, summary_labels=None, **kwargs)[source]¶ Meta optimizer wrapper (specification key:
meta_optimizer_wrapper
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- multi_step (parameter, int > 0) – Number of optimization steps (default: single step).
- subsampling_fraction (parameter, 0.0 < float <= 1.0) – Fraction of batch timesteps to subsample (default: no subsampling).
- clipping_threshold (parameter, float > 0.0) – Clipping threshold (default: no clipping).
- optimizing_iterations (parameter, int >= 0) – Maximum number of line search iterations (default: no optimizing).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
MultiStep
(name, optimizer, num_steps, unroll_loop=False, summary_labels=None)[source]¶ Multi-step meta optimizer, which applies the given optimizer for a number of times (specification key:
multi_step
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- num_steps (parameter, int > 0) – Number of optimization steps (required).
- unroll_loop (bool) – Whether to unroll the repetition loop (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
NaturalGradient
(name, learning_rate, cg_max_iterations=10, cg_damping=0.001, cg_unroll_loop=False, summary_labels=None)[source]¶ Natural gradient optimizer (specification key:
natural_gradient
).Parameters: - name (string) – Module name (internal use).
- learning_rate (parameter, float > 0.0) – Learning rate as KL-divergence of distributions between optimization steps (required).
- cg_max_iterations (int > 0) – Maximum number of conjugate gradient iterations (default: 10).
- cg_damping (float > 0.0) – Conjugate gradient damping factor (default: 1e-3).
- cg_unroll_loop (bool) – Whether to unroll the conjugate gradient loop (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
OptimizingStep
(name, optimizer, ls_max_iterations=10, ls_accept_ratio=0.9, ls_mode='exponential', ls_parameter=0.5, ls_unroll_loop=False, summary_labels=None)[source]¶ Optimizing-step meta optimizer, which applies line search to the given optimizer to find a more optimal step size (specification key:
optimizing_step
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- ls_max_iterations (parameter, int > 0) – Maximum number of line search iterations (default: 10).
- ls_accept_ratio (parameter, float > 0.0) – Line search acceptance ratio (default: 0.9).
- ls_mode ('exponential' | 'linear') – Line search mode, see line search solver (default: ‘exponential’).
- ls_parameter (parameter, float > 0.0) – Line search parameter, see line search solver (default: 0.5).
- ls_unroll_loop (bool) – Whether to unroll the line search loop (default: false).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
Plus
(name, optimizer1, optimizer2, summary_labels=None)[source]¶ Additive combination of two optimizers (specification key:
plus
).Parameters: - name (string) – Module name (internal use).
- optimizer1 (specification) – First optimizer configuration (required).
- optimizer2 (specification) – Second optimizer configuration (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
SubsamplingStep
(name, optimizer, fraction, summary_labels=None)[source]¶ Subsampling-step meta optimizer, which randomly samples a subset of batch instances before applying the given optimizer (specification key:
subsampling_step
).Parameters: - name (string) – Module name (internal use).
- optimizer (specification) – Optimizer configuration (required).
- fraction (parameter, 0.0 < float < 1.0) – Fraction of batch timesteps to subsample (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
Synchronization
(name, sync_frequency=1, update_weight=1.0, summary_labels=None)[source]¶ Synchronization optimizer, which updates variables periodically to the value of a corresponding set of source variables (specification key:
synchronization
).Parameters: - name (string) – Module name (internal use).
- sync_frequency (parameter, int > 0) – Interval between updates which also perform a synchronization step (default: every update).
- update_weight (parameter, 0.0 < float <= 1.0) – Update weight (default: 1.0).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.optimizers.
TFOptimizer
(name, optimizer, learning_rate=0.0003, gradient_norm_clipping=1.0, summary_labels=None, **kwargs)[source]¶ TensorFlow optimizer (specification key:
tf_optimizer
,adadelta
,adagrad
,adam
,adamax
,adamw
,ftrl
,lazyadam
,nadam
,radam
,ranger
,rmsprop
,sgd
,sgdw
)Parameters: - name (string) – Module name (internal use).
- optimizer (
adadelta
|adagrad
|adam
|adamax
|adamw
|ftrl
|lazyadam
|nadam
|radam
|ranger
|rmsprop
|sgd
|sgdw
) – TensorFlow optimizer name, see TensorFlow docs and TensorFlow Addons docs (required unless given by specification key). - learning_rate (parameter, float > 0.0) – Learning rate (default: 3e-4).
- gradient_norm_clipping (parameter, float > 0.0) – Clip gradients by the ratio of the sum of their norms (default: 1.0).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- kwargs – Arguments for the TensorFlow optimizer, special values “decoupled_weight_decay”, “lookahead” and “moving_average”, see TensorFlow docs and TensorFlow Addons docs.
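As a usage sketch, optimizers are specified via their key; the snippet below shows a plain Adam optimizer and the same optimizer nested inside a multi_step meta optimizer (all values arbitrary):

# Illustrative: plain Adam optimizer
optimizer = dict(type='adam', learning_rate=3e-4)

# Illustrative: applying the inner optimizer several times via the multi_step meta optimizer
optimizer = dict(
    type='multi_step',
    optimizer=dict(type='adam', learning_rate=1e-3),
    num_steps=5
)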
Parameters¶
Default parameter: Constant
-
class
tensorforce.core.parameters.
Constant
(name, value, dtype, summary_labels=None)[source]¶ Constant hyperparameter.
Parameters: - name (string) – Module name (internal use).
- value (dtype-dependent) – Constant hyperparameter value (required).
- dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.parameters.
Decaying
(name, dtype, unit, decay, initial_value, decay_steps, increasing=False, inverse=False, scale=1.0, summary_labels=None, **kwargs)[source]¶ Decaying hyperparameter.
Parameters: - name (string) – Module name (internal use).
- dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
- unit ("timesteps" | "episodes" | "updates") – Unit of decay schedule (required).
- decay ("cosine" | "cosine_restarts" | "exponential" | "inverse_time" | "linear_cosine" | "linear_cosine_noisy" | "polynomial") – Decay type, see TensorFlow docs (required).
- initial_value (float) – Initial value (required).
- decay_steps (long) – Number of decay steps (required).
- increasing (bool) – Whether to subtract the decayed value from 1.0 (default: false).
- inverse (bool) – Whether to take the inverse of the decayed value (default: false).
- scale (float) – Scaling factor for (inverse) decayed value (default: 1.0).
- summary_labels ("all" | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- kwargs – Additional arguments depend on decay mechanism.
Cosine decay:
- alpha (float) – Minimum learning rate value as a fraction of learning_rate (default: 0.0).
Cosine restarts decay:
- t_mul (float) – Used to derive the number of iterations in the i-th period (default: 2.0).
- m_mul (float) – Used to derive the initial learning rate of the i-th period (default: 1.0).
- alpha (float) – Minimum learning rate value as a fraction of the learning_rate (default: 0.0).
Exponential decay:
- decay_rate (float) – Decay rate (required).
- staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion (default: false).
Inverse time decay:
- decay_rate (float) – Decay rate (required).
- staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion (default: false).
Linear cosine decay:
- num_periods (float) – Number of periods in the cosine part of the decay (default: 0.5).
- alpha (float) – Alpha value (default: 0.0).
- beta (float) – Beta value (default: 0.001).
Natural exponential decay:
- decay_rate (float) – Decay rate (required).
- staircase (bool) – Whether to apply decay in a discrete staircase, as opposed to continuous, fashion (default: false).
Noisy linear cosine decay:
- initial_variance (float) – Initial variance for the noise (default: 1.0).
- variance_decay (float) – Decay for the noise's variance (default: 0.55).
- num_periods (float) – Number of periods in the cosine part of the decay (default: 0.5).
- alpha (float) – Alpha value (default: 0.0).
- beta (float) – Beta value (default: 0.001).
Polynomial decay:
- final_value (float) – Final value (required).
- power (float) – Power of polynomial (default: 1.0, thus linear).
- cycle (bool) – Whether to cycle beyond decay_steps (default: false).
-
class
tensorforce.core.parameters.
OrnsteinUhlenbeck
(name, dtype, theta=0.15, sigma=0.3, mu=0.0, summary_labels=None)[source]¶ Ornstein-Uhlenbeck process.
Parameters: - name (string) – Module name (internal use).
- dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
- theta (float > 0.0) – Theta value (default: 0.15).
- sigma (float > 0.0) – Sigma value (default: 0.3).
- mu (float) – Mu value (default: 0.0).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.parameters.
PiecewiseConstant
(name, dtype, unit, boundaries, values, summary_labels=None)[source]¶ Piecewise-constant hyperparameter.
Parameters: - name (string) – Module name (internal use).
- dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
- unit ("timesteps" | "episodes" | "updates") – Unit of interval boundaries (required).
- boundaries (iter[long]) – Strictly increasing interval boundaries for constant segments (required).
- values (iter[dtype-dependent]) – Interval values of constant segments, one more than the number of boundaries (required).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.parameters.
Random
(name, dtype, distribution, shape=(), summary_labels=None, **kwargs)[source]¶ Random hyperparameter.
Parameters: - name (string) – Module name (internal use).
- dtype ("bool" | "int" | "long" | "float") – Tensor type (required).
- distribution ("normal" | "uniform") – Distribution type for random hyperparameter value (required).
- shape (iter[int > 0]) – Tensor shape (default: scalar).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- kwargs – Additional arguments dependent on distribution type.
Normal distribution:
- mean (float) – Mean (default: 0.0).
- stddev (float > 0.0) – Standard deviation (default: 1.0).
Uniform distribution:
- minval (int / float) – Lower bound (default: 0 / 0.0).
- maxval (float > minval) – Upper bound (default: 1.0 for float, required for int).
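Hyperparameters of this kind can generally be given wherever an agent argument is documented as a parameter; a hedged sketch of an exponentially decaying value, for instance for exploration (all numbers arbitrary):

# Illustrative decaying parameter specification
exploration = dict(
    type='decaying', dtype='float', unit='timesteps',
    decay='exponential', initial_value=0.1,
    decay_steps=1000, decay_rate=0.5
)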
Preprocessing¶
-
class
tensorforce.core.layers.
Activation
(name, nonlinearity, input_spec=None, summary_labels=None)[source] Activation layer (specification key:
activation
).Parameters: - name (string) – Layer name (default: internally chosen).
- nonlinearity ('crelu' | 'elu' | 'leaky-relu' | 'none' | 'relu' | 'selu' | 'sigmoid' | 'softmax' | 'softplus' | 'softsign' | 'swish' | 'tanh') – Nonlinearity (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Clipping
(name, upper, lower=None, input_spec=None, summary_labels=None)[source] Clipping layer (specification key:
clipping
).Parameters: - name (string) – Layer name (default: internally chosen).
- upper (parameter, float) – Upper clipping value (required).
- lower (parameter, float) – Lower clipping value (default: negative upper value).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Deltafier
(name, concatenate=False, input_spec=None, summary_labels=None)[source] Deltafier layer computing the difference between the current and the previous input; can only be used as preprocessing layer (specification key:
deltafier
).Parameters: - name (string) – Layer name (default: internally chosen).
- concatenate (False | int >= 0) – Whether to concatenate instead of replace deltas with input, and if so, concatenation axis (default: false).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Dropout
(name, rate, input_spec=None, summary_labels=None)[source] Dropout layer (specification key:
dropout
).Parameters: - name (string) – Layer name (default: internally chosen).
- rate (parameter, 0.0 <= float < 1.0) – Dropout rate (required).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
ExponentialNormalization
(name, decay=0.999, axes=None, input_spec=None, summary_labels=None)[source] Normalization layer based on the exponential moving average (specification key:
exponential_normalization
).Parameters: - name (string) – Layer name (default: internally chosen).
- decay (parameter, 0.0 <= float <= 1.0) – Decay rate (default: 0.999).
- axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all but last axis).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Image
(name, height=None, width=None, grayscale=False, input_spec=None, summary_labels=None)[source] Image preprocessing layer (specification key:
image
).Parameters: - name (string) – Layer name (default: internally chosen).
- height (int) – Height of resized image (default: no resizing or relative to width).
- width (int) – Width of resized image (default: no resizing or relative to height).
- grayscale (bool | iter[float]) – Turn into grayscale image, optionally using given weights (default: false).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
InstanceNormalization
(name, axes=None, input_spec=None, summary_labels=None)[source] Instance normalization layer (specification key:
instance_normalization
).Parameters: - name (string) – Layer name (default: internally chosen).
- axes (iter[int >= 0]) – Normalization axes, excluding batch axis (default: all).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
-
class
tensorforce.core.layers.
Sequence
(name, length, axis=-1, concatenate=True, input_spec=None, summary_labels=None)[source] Sequence layer stacking the current and previous inputs; can only be used as preprocessing layer (specification key:
sequence
).Parameters: - name (string) – Layer name (default: internally chosen).
- length (int > 0) – Number of inputs to concatenate (required).
- axis (int >= 0) – Concatenation axis, excluding batch axis (default: last axis).
- concatenate (bool) – Whether to concatenate inputs at given axis, otherwise introduce new sequence axis (default: true).
- input_spec (specification) – Input tensor specification (internal use).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
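A hedged sketch of how the preprocessing layers above might be combined, assuming the agent accepts a preprocessing argument mapping tensor names to layer lists; the key 'state', the image dimensions and the sequence length are illustrative only:

# Illustrative preprocessing pipeline: resize to grayscale 64x64, then stack the last 4 inputs
preprocessing = dict(
    state=[
        dict(type='image', height=64, width=64, grayscale=True),
        dict(type='sequence', length=4)
    ]
)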
Policies¶
Default policy: ParametrizedDistributions
-
class
tensorforce.core.policies.
ParametrizedDistributions
(name, states_spec, actions_spec, network='auto', distributions=None, temperature=0.0, device=None, summary_labels=None, l2_regularization=None)[source]¶ Policy which parametrizes independent distributions per action conditioned on the output of a central states-processing neural network (supports both stochastic and action-value-based policy interface) (specification key:
parametrized_distributions
).Parameters: - name (string) – Module name (internal use).
- states_spec (specification) – States specification (internal use).
- actions_spec (specification) – Actions specification (internal use).
- network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
- distributions (dict[specification]) – Distributions configuration, see distributions, specified per action-type or -name (default: per action-type, Bernoulli distribution for binary boolean actions, categorical distribution for discrete integer actions, Gaussian distribution for unbounded continuous actions, Beta distribution for bounded continuous actions).
- temperature (parameter | dict[parameter], float >= 0.0) – Sampling temperature, global or per action (default: 0.0).
- device (string) – Device name (default: inherit value of parent module).
- summary_labels ('all' | iter[string]) – Labels of summaries to record (default: inherit value of parent module).
- l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
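A minimal sketch of such a policy specification, assuming it is passed via an agent's policy argument (network choice and temperature value are arbitrary):

# Illustrative parametrized-distributions policy with an automatically configured network
policy = dict(type='parametrized_distributions', network='auto', temperature=0.5)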
Runner¶
-
class
tensorforce.execution.
Runner
(agent, environment=None, max_episode_timesteps=None, evaluation=False, num_parallel=None, environments=None, remote=None, blocking=False, host=None, port=None)[source]¶ Tensorforce runner utility.
Parameters: - agent (specification | Agent object) – Agent specification or object, the latter is not
closed automatically as part of
runner.close()
(required). - environment (specification | Environment object) – Environment specification or object, the
latter is not closed automatically as part of
runner.close()
(required, or alternativelyenvironments
, invalid for “socket-client” remote mode). - max_episode_timesteps (int > 0) – Maximum number of timesteps per episode, overwrites the environment default if defined (default: environment default, invalid for “socket-client” remote mode).
- evaluation (bool) – Whether to run the (last if multiple) environment in evaluation mode (default: no evaluation).
- num_parallel (int > 0) – Number of environment instances to execute in parallel
(default: no parallel execution, implicitly
specified by
environments
). - environments (list[specification | Environment object]) – Environment specifications or
objects to execute in parallel, the latter are not closed automatically as part of
runner.close()
(default: no parallel execution, alternatively specified viaenvironment
andnum_parallel
, invalid for “socket-client” remote mode). - remote ("multiprocessing" | "socket-client") – Communication mode for remote environment execution of parallelized environment execution, “socket-client” mode requires a corresponding “socket-server” running (default: local execution).
- blocking (bool) – Whether remote environment calls should be blocking, only valid if remote mode given (default: not blocking, invalid unless “multiprocessing” or “socket-client” remote mode).
- host (str, iter[str]) – Socket server hostname(s) or IP address(es) (required only for “socket-client” remote mode).
- port (int, iter[int]) – Socket server port(s), increasing sequence if single host and port given (required only for “socket-client” remote mode).
-
run
(num_episodes=None, num_timesteps=None, num_updates=None, batch_agent_calls=False, sync_timesteps=False, sync_episodes=False, num_sleep_secs=0.001, callback=None, callback_episode_frequency=None, callback_timestep_frequency=None, use_tqdm=True, mean_horizon=1, evaluation=False, save_best_agent=None, evaluation_callback=None)[source]¶ Run experiment.
Parameters: - num_episodes (int > 0) – Number of episodes to run experiment (default: no episode limit).
- num_timesteps (int > 0) – Number of timesteps to run experiment (default: no timestep limit).
- num_updates (int > 0) – Number of agent updates to run experiment (default: no update limit).
- batch_agent_calls (bool) – Whether to batch agent calls for parallel environment execution (default: separate call per environment).
- sync_timesteps (bool) – Whether to synchronize parallel environment execution on timestep-level, implied by batch_agent_calls (default: not synchronized unless batch_agent_calls).
- sync_episodes (bool) – Whether to synchronize parallel environment execution on episode-level (default: not synchronized).
- num_sleep_secs (float) – Sleep duration if no environment is ready (default: one millisecond).
- callback ((Runner, parallel) -> bool) – Callback function taking the runner instance plus parallel index and returning a boolean value indicating whether execution should continue (default: callback always true).
- callback_episode_frequency (int) – Episode interval between callbacks (default: every episode).
- callback_timestep_frequency (int) – Timestep interval between callbacks (default: not specified).
- use_tqdm (bool) – Whether to display a tqdm progress bar for the experiment run (default: display progress bar).
- mean_horizon (int) – Number of episodes over which progress bar values and the evaluation score are averaged (default: not averaged).
- evaluation (bool) – Whether to run in evaluation mode, only valid if a single environment (default: no evaluation).
- save_best_agent (string) – Directory to save the best version of the agent according to the evaluation score (default: best agent is not saved).
- evaluation_callback (int | Runner -> float) – Callback function taking the runner instance and returning an evaluation score (default: cumulative evaluation reward averaged over mean_horizon episodes).
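A short usage sketch of the runner utility; the agent specification file agent.json is a hypothetical example, and the environment is the CartPole setup from the getting-started section:

from tensorforce.execution import Runner

# Illustrative: train for 200 episodes, then evaluate for 10 episodes
runner = Runner(
    agent='agent.json',  # hypothetical agent specification file
    environment=dict(environment='gym', level='CartPole'),
    max_episode_timesteps=500
)
runner.run(num_episodes=200)
runner.run(num_episodes=10, evaluation=True)
runner.close()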
Environment interface¶
Initialization and termination¶
-
static
Environment.
create
(environment=None, max_episode_timesteps=None, remote=None, blocking=False, host=None, port=None, **kwargs)[source]¶ Creates an environment from a specification. In case of “socket-server” remote mode, runs environment in server communication loop until closed.
Parameters: - environment (specification | Environment class/object) – JSON file, specification key,
configuration dictionary, library module,
Environment
class/object, or gym.Env (required, invalid for "socket-client" remote mode). - max_episode_timesteps (int > 0) – Maximum number of timesteps per episode, overwrites the environment default if defined (default: environment default, invalid for “socket-client” remote mode).
- remote ("multiprocessing" | "socket-client" | "socket-server") – Communication mode for remote environment execution of parallelized environment execution, “socket-client” mode requires a corresponding “socket-server” running, and “socket-server” mode runs environment in server communication loop until closed (default: local execution).
- blocking (bool) – Whether remote environment calls should be blocking (default: not blocking, invalid unless “multiprocessing” or “socket-client” remote mode).
- host (str) – Socket server hostname or IP address (required only for “socket-client” remote mode).
- port (int) – Socket server port (required only for “socket-client/server” remote mode).
- kwargs – Additional arguments.
Attributes¶
-
Environment.
states
()[source]¶ Returns the state space specification.
Returns: Arbitrarily nested dictionary of state descriptions with the following attributes: - type ("bool" | "int" | "float") – state data type (default: "float").
- shape (int | iter[int]) – state shape (required).
- num_states (int > 0) – number of discrete state values (required for type "int").
- min_value/max_value (float) – minimum/maximum state value (optional for type "float").
Return type: specification
-
Environment.
actions
()[source]¶ Returns the action space specification.
Returns: Arbitrarily nested dictionary of action descriptions with the following attributes: - type ("bool" | "int" | "float") – action data type (required).
- shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
- num_actions (int > 0) – number of discrete action values (required for type "int").
- min_value/max_value (float) – minimum/maximum action value (optional for type "float").
Return type: specification
Interaction functions¶
-
Environment.
reset
()[source]¶ Resets the environment to start a new episode.
Returns: Dictionary containing initial state(s) and auxiliary information. Return type: dict[state]
-
Environment.
execute
(actions)[source]¶ Executes the given action(s) and advances the environment by one step.
Parameters: actions (dict[action]) – Dictionary containing action(s) to be executed (required). Returns: Dictionary containing next state(s), whether a terminal state is reached or 2 if the episode was aborted, and observed reward. Return type: dict[state], bool | 0 | 1 | 2, float
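Taken together, reset and execute support the usual act-observe loop; a minimal sketch using an already created agent and environment object:

# Illustrative single-episode interaction loop
states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)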
Arcade Learning Environment¶
-
class
tensorforce.environments.
ArcadeLearningEnvironment
(level, life_loss_terminal=False, life_loss_punishment=0.0, repeat_action_probability=0.0, visualize=False, frame_skip=1, seed=None)[source]¶ Arcade Learning Environment adapter (specification key:
ale
,arcade_learning_environment
).May require:
sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev cmake
git clone https://github.com/mgbellemare/Arcade-Learning-Environment.git
cd Arcade-Learning-Environment
mkdir build && cd build
cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=ON ..
make -j 4
cd ..
pip3 install .
Parameters: - level (string) – ALE rom file (required).
- life_loss_terminal (bool) – Whether the loss of a life is signaled as a terminal state (default: false).
- life_loss_punishment (float) – Reward/penalty on loss of a life, negative values act as a penalty (default: 0.0).
- repeat_action_probability (float) – Repeats last action with given probability (default: 0.0).
- visualize (bool) – Whether to visualize interaction (default: false).
- frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
- seed (int) – Random seed (default: none).
Maze Explorer¶
-
class
tensorforce.environments.
MazeExplorer
(level, visualize=False)[source]¶ MazeExplorer environment adapter (specification key:
mazeexp
,maze_explorer
).May require:
sudo apt-get install freeglut3-dev
pip3 install mazeexp
Parameters: - level (int) – Game mode, see GitHub (required).
- visualize (bool) – Whether to visualize interaction (default: false).
Open Sim¶
-
class
tensorforce.environments.
OpenSim
(level, visualize=False, integrator_accuracy=5e-05)[source]¶ OpenSim environment adapter (specification key:
osim
,open_sim
).Parameters: - level ('Arm2D' | 'L2Run' | 'Prosthetics') – Environment id (required).
- visualize (bool) – Whether to visualize interaction (default: false).
- integrator_accuracy (float) – Integrator accuracy (default: 5e-5).
OpenAI Gym¶
-
class
tensorforce.environments.
OpenAIGym
(level, visualize=False, max_episode_steps=None, terminal_reward=0.0, reward_threshold=None, drop_states_indices=None, visualize_directory=None, **kwargs)[source]¶ OpenAI Gym environment adapter (specification key:
gym
,openai_gym
).May require:
pip3 install gym
pip3 install gym[all]
Parameters: - level (string | gym.Env) – Gym id or instance (required).
- visualize (bool) – Whether to visualize interaction (default: false).
- max_episode_steps (false | int > 0) – Whether to terminate an episode after a while, and if so, maximum number of timesteps per episode (default: Gym default).
- terminal_reward (float) – Additional reward for early termination, if otherwise indistinguishable from termination due to maximum number of timesteps (default: Gym default).
- reward_threshold (float) – Gym environment argument, the reward threshold before the task is considered solved (default: Gym default).
- drop_states_indices (list[int]) – Drop states indices (default: none).
- visualize_directory (string) – Visualization output directory (default: none).
- kwargs – Additional Gym environment arguments.
OpenAI Retro¶
-
class
tensorforce.environments.
OpenAIRetro
(level, visualize=False, visualize_directory=None, **kwargs)[source]¶ OpenAI Retro environment adapter (specification key:
retro
,openai_retro
).May require:
pip3 install gym-retro
Parameters: - level (string) – Game id (required).
- visualize (bool) – Whether to visualize interaction (default: false).
- visualize_directory (string) – Visualization output directory (default: none).
- kwargs – Additional Retro environment arguments.
PyGame Learning Environment¶
-
class
tensorforce.environments.
PyGameLearningEnvironment
(level, visualize=False, frame_skip=1, fps=30)[source]¶ PyGame Learning Environment environment adapter (specification key:
ple
,pygame_learning_environment
).May require:
sudo apt-get install git python3-dev python3-setuptools python3-numpy python3-opengl libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libsdl1.2-dev libportmidi-dev libswscale-dev libavformat-dev libavcodec-dev libtiff5-dev libx11-6 libx11-dev fluid-soundfont-gm timgm6mb-soundfont xfonts-base xfonts-100dpi xfonts-75dpi xfonts-cyrillic fontconfig fonts-freefont-ttf libfreetype6-dev
pip3 install git+https://github.com/pygame/pygame.git
pip3 install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
Parameters: - level (string | subclass of
ple.games.base
) – Game instance or name of class inple.games
, like “Catcher”, “Doom”, “FlappyBird”, “MonsterKong”, “Pixelcopter”, “Pong”, “PuckWorld”, “RaycastMaze”, “Snake”, “WaterWorld” (required). - visualize (bool) – Whether to visualize interaction (default: false).
- frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
- fps (int > 0) – The desired frames per second we want to run our game at (default: 30).
ViZDoom¶
-
class
tensorforce.environments.
ViZDoom
(level, visualize=False, include_variables=False, factored_action=False, frame_skip=12, seed=None)[source]¶ ViZDoom environment adapter (specification key:
vizdoom
).May require:
sudo apt-get install g++ build-essential libsdl2-dev zlib1g-dev libmpg123-dev libjpeg-dev libsndfile1-dev nasm tar libbz2-dev libgtk2.0-dev make cmake git chrpath timidity libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip libboost-all-dev liblua5.1-dev
pip3 install vizdoom
Parameters: - level (string) – ViZDoom configuration file (required).
- include_variables (bool) – Whether to include game variables to state (default: false).
- factored_action (bool) – Whether to use factored action representation (default: false).
- visualize (bool) – Whether to visualize interaction (default: false).
- frame_skip (int > 0) – Number of times to repeat an action without observing (default: 12).
- seed (int) – Random seed (default: none).