Tensorforce: a TensorFlow library for applied reinforcement learning

Tensorforce is an open-source deep reinforcement learning framework, with an emphasis on modularized flexible library design and straightforward usability for applications in research and practice. Tensorforce is built on top of Google’s TensorFlow framework and compatible with Python 3 (Python 2 support was dropped with version 0.5).

Tensorforce follows a set of high-level design choices which differentiate it from other similar libraries:

  • Modular component-based design: Feature implementations, above all, strive to be as generally applicable and configurable as possible, potentially at some cost of faithfully resembling details of the introducing paper.
  • Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.
  • Full-on TensorFlow models: The entire reinforcement learning logic, including control flow, is implemented in TensorFlow, to enable portable computation graphs independent of application programming language, and to facilitate the deployment of models.

Installation

A stable version of Tensorforce is periodically published on PyPI and can be installed as follows:

pip install tensorforce

To always use the latest version of Tensorforce, install the GitHub version instead:

git clone https://github.com/tensorforce/tensorforce.git
cd tensorforce
pip install -e .

Tensorforce is built on top of Google’s TensorFlow and requires that either tensorflow or tensorflow-gpu is installed, currently as version 1.13.1. To include the correct version of TensorFlow with the installation of Tensorforce, simply add the flag tf for the normal CPU version or tf_gpu for the GPU version:

# PyPI version plus TensorFlow CPU version
pip install tensorforce[tf]

# GitHub version plus TensorFlow GPU version
pip install -e .[tf_gpu]

Some environments require additional packages, for which install options are available (mazeexp, gym, retro, vizdoom; or envs for all environments); however, some environments also require other tools to be installed separately (see the environments documentation).
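
For example, to install Tensorforce together with the OpenAI Gym environment dependencies, use the gym option listed above (the other environment options work analogously):

# PyPI version plus Gym environment dependencies
pip install tensorforce[gym]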

Getting started

Training

from tensorforce.agents import Agent
from tensorforce.environments import Environment

# Setup environment
# (Tensorforce or custom implementation, ideally using the Environment interface)
environment = Environment.create(environment='environment.json')

# Create and initialize agent
agent = Agent.create(agent='agent.json', environment=environment)
agent.initialize()

# Reset agent and environment at the beginning of a new episode
agent.reset()
states = environment.reset()
terminal = False

# Agent-environment interaction training loop
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)

# Close agent and environment
agent.close()
environment.close()

Evaluation / application

# Reset environment at the beginning of the evaluation episode
states = environment.reset()
terminal = False

# Agent-environment interaction evaluation loop
while not terminal:
    actions = agent.act(states=states, evaluation=True)
    states, terminal, reward = environment.execute(actions=actions)

Runner utility

from tensorforce.execution import Runner

# Tensorforce runner utility
runner = Runner(agent='agent.json', environment='environment.json')

# Run training
runner.run(num_episodes=500)

# Close runner
runner.close()

Module specification

Agents are instantiated via Agent.create(agent=...), using any of the specification alternatives presented below (agent acts as the type argument). It is recommended to pass the application's Environment implementation as the second argument environment, from which the corresponding states, actions and max_episode_timesteps arguments of the agent are extracted automatically.
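
As a sketch of the two ways of providing these arguments (the explicit state/action specification and values below are illustrative assumptions, not defaults):

# Recommended: derive states, actions and max_episode_timesteps from the environment
environment = Environment.create(environment='environment.json')
agent = Agent.create(agent='agent.json', environment=environment)

# Alternative: specify them explicitly
agent = Agent.create(
    agent='agent.json',
    states=dict(type='float', shape=(10,)),
    actions=dict(type='int', shape=(), num_actions=3),
    max_episode_timesteps=500
)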

How to specify modules

Dictionary with module type and arguments

Agent.create(...
    policy=dict(network=dict(type='layered', layers=[dict(type='dense', size=32)])),
    memory=dict(type='replay', capacity=10000), ...
)

JSON specification file (plus additional arguments)

Agent.create(...
    policy=dict(network='network.json'),
    memory=dict(type='memory.json', capacity=10000), ...
)

Module path (plus additional arguments)

Agent.create(...
    policy=dict(network='my_module.TestNetwork'),
    memory=dict(type='tensorforce.core.memories.Replay', capacity=10000), ...
)

Callable or Type (plus additional arguments)

Agent.create(...
    policy=dict(network=TestNetwork),
    memory=dict(type=Replay, capacity=10000), ...
)

Default module: only arguments or first argument

Agent.create(...
    policy=dict(network=[dict(type='dense', size=32)]),
    memory=dict(capacity=10000), ...
)

Static vs dynamic hyperparameters

Tensorforce distinguishes between agent/module arguments (primitive types: bool/int/long/float) that specify part of the TensorFlow model architecture, like the layer size, and those that specify a value within the architecture, like the learning rate. Whereas the former are statically defined as part of the agent initialization, the latter can be adjusted dynamically afterwards. These dynamic hyperparameters are indicated by parameter as part of their type specification in the documentation, and can alternatively be assigned a parameter module instead of a constant value, for instance, to specify a decaying learning rate.

Example: exponentially decaying exploration

Agent.create(...
    exploration=dict(
        type='decaying', unit='timesteps', decay='exponential',
        initial_value=0.1, decay_steps=1000, decay_rate=0.5
    ), ...
)

Example: linearly increasing horizon

Agent.create(...
    reward_estimation=dict(horizon=dict(
        type='decaying', dtype='long', unit='episodes', decay='polynomial',
        initial_value=10.0, decay_steps=1000, final_value=50.0, power=1.0
    )), ...
)

Features

Action masking

agent = Agent.create(
    states=dict(type='float', shape=(10,)),
    actions=dict(type='int', shape=(), num_actions=3), ...
)
...
states = dict(
    state=np.random.random_sample(size=(10,)),  # regular state
    action_mask=[True, False, True]  # mask as '[ACTION-NAME]_mask'
)
action = agent.act(states=states)
assert action != 1

Record & pretrain

agent = Agent.create(...
    recorder=dict(
        directory='data/traces',
        frequency=100  # record a traces file every 100 episodes
    ), ...
)
...
agent.close()

# Pretrain agent on recorded traces
agent = Agent.create(...)
agent.pretrain(
    directory='data/traces',
    num_updates=100  # perform 100 updates on traces (other configurations possible)
)

Save & restore

agent = Agent.create(...
    saver=dict(
        directory='data/checkpoints',
        frequency=600  # save checkpoint every 600 seconds (10 minutes)
    ), ...
)
...
agent.close()

# Restore latest agent checkpoint
agent = Agent.load(directory='data/checkpoints')

TensorBoard

Agent.create(...
    summarizer=dict(
        directory='data/summaries',
        labels=['graph', 'losses', 'rewards'],  # list of labels, or 'all'
        frequency=100  # store values every 100 timesteps
        # (update-related summaries are stored once per update; other configurations possible)
    ), ...
)
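
The recorded summaries can then be inspected by pointing TensorBoard at the summary directory, for example:

tensorboard --logdir=data/summaries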

run.py – Runner

Required arguments

#1: agent (string) – Agent (configuration JSON file, name, or library module)
#2: environment (string) – Environment (name, configuration JSON file, or library module)

Optional arguments

Agent arguments

--[n]etwork (string, default: not specified) – Network (configuration JSON file, name, or library module)

Environment arguments

--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[i]mport-modules (string, default: not specified) – Import comma-separated modules required for environment
--visualize (bool, default: false) – Visualize agent–environment interaction, if supported

Runner arguments

--[t]imesteps (int, default: not specified) – Number of timesteps
--[e]pisodes (int, default: not specified) – Number of episodes
--[m]ax-episode-timesteps (int, default: not specified) – Maximum number of timesteps per episode
--mean-horizon (int, default: 10) – Number of timesteps/episodes for mean reward computation
--e[v]aluation (bool, default: false) – Evaluation mode
--[s]ave-best-agent (bool, default: false) – Save best-performing agent

Logging arguments

--[r]epeat (int, default: 1) – Number of repetitions
--[p]ath (string, default: not specified) – Logging path, directory plus filename without extension
--seaborn (bool, default: false) – Use seaborn
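
Putting these arguments together, a hypothetical invocation might look as follows (the agent configuration file and the CartPole level are illustrative; any agent/environment combination accepted by the required arguments works analogously):

python run.py agent.json gym --level CartPole-v1 --episodes 300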

tune.py – Hyperparameter tuner

Required arguments

#1: environment (string) – Environment (name, configuration JSON file, or library module)

Optional arguments

--[l]evel (string, default: not specified) – Level or game id, like CartPole-v1, if supported
--[m]ax-repeats (int, default: 1) – Maximum number of repetitions
--[n]um-iterations (int, default: 1) – Number of BOHB iterations
--[d]irectory (string, default: “tuner”) – Output directory
--[r]estore (string, default: not specified) – Restore from given directory
--id (string, default: “worker”) – Unique worker id
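
For example, a hypothetical tuning run on the CartPole Gym environment (argument values chosen purely for illustration):

python tune.py gym --level CartPole-v1 --max-repeats 2 --num-iterations 3 --directory tuner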

Agent interface

Constant Agent

Random Agent

Tensorforce Agent

Deep Q-Network

Dueling DQN

Vanilla Policy Gradient

Actor-Critic

Advantage Actor-Critic

Deterministic Policy Gradient

Proximal Policy Optimization

Trust-Region Policy Optimization

Distributions

Layers

Default layer: Function with default argument function

Convolutional layers

Dense layers

Embedding layers

Recurrent layers

Pooling layers

Normalization layers

Misc layers

Layers with internal states

Special layers

Memories

Default memory: Replay with default argument capacity

Networks

Default network: LayeredNetwork with default argument layers

Objectives

Optimizers

Default optimizer: MetaOptimizerWrapper

Parameters

Default parameter: Constant

Preprocessing

Policies

Default policy: ParametrizedDistributions

Environment interface

class tensorforce.environments.Environment

Tensorforce environment interface.

actions()

Returns the action space specification.

Returns: Arbitrarily nested dictionary of action descriptions with the following attributes:
  • type ("bool" | "int" | "float") – action data type (required).
  • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
  • num_actions (int > 0) – number of discrete action values (required for type "int").
  • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
Return type: specification
close()

Closes the environment.

static create(environment, **kwargs)

Creates an environment from a specification.

Parameters:
  • environment (specification) – JSON file, specification key, configuration dictionary, library module, or Environment subclass (required).
  • kwargs – Additional arguments.
execute(actions)

Executes the given action(s) and advances the environment by one step.

Parameters: actions (dict[action]) – Dictionary containing action(s) to be executed (required).
Returns: Dictionary containing next state(s), whether a terminal state is reached (or 2 if the episode was aborted), and the observed reward.
Return type: (dict[state], bool | 0 | 1 | 2, float)
max_episode_timesteps()

Returns the maximum number of timesteps per episode.

Returns: Maximum number of timesteps per episode.
Return type: int
reset()

Resets the environment to start a new episode.

Returns: Dictionary containing initial state(s) and auxiliary information.
Return type: dict[state]
states()

Returns the state space specification.

Returns: Arbitrarily nested dictionary of state descriptions with the following attributes:
  • type ("bool" | "int" | "float") – state data type (default: "float").
  • shape (int | iter[int]) – state shape (required).
  • num_states (int > 0) – number of discrete state values (required for type "int").
  • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
Return type: specification
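
As an illustration of this interface, a minimal custom environment might be implemented along the following lines (a sketch only: the class name, state/action shapes and toy dynamics are made up for illustration, not an official example):

import numpy as np
from tensorforce.environments import Environment

class CustomEnvironment(Environment):

    def __init__(self):
        super().__init__()

    def states(self):
        # Single unnamed 10-dimensional continuous state
        return dict(type='float', shape=(10,))

    def actions(self):
        # Single discrete action with 3 values
        return dict(type='int', shape=(), num_actions=3)

    def max_episode_timesteps(self):
        return 100

    def reset(self):
        self.timestep = 0
        return np.random.random_sample(size=(10,))

    def execute(self, actions):
        # Toy dynamics: random next state, terminal after 100 timesteps,
        # reward of 1.0 whenever action 1 is chosen
        self.timestep += 1
        next_state = np.random.random_sample(size=(10,))
        terminal = self.timestep >= 100
        reward = 1.0 if actions == 1 else 0.0
        return next_state, terminal, reward

Such a subclass can then be passed to Environment.create(environment=CustomEnvironment) or used directly in the agent-environment interaction loop shown in the getting-started section.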

Arcade Learning Environment

class tensorforce.environments.ArcadeLearningEnvironment(level, life_loss_terminal=False, life_loss_punishment=0.0, repeat_action_probability=0.0, visualize=False, frame_skip=1, seed=None)

Arcade Learning Environment adapter (specification key: ale, arcade_learning_environment).

May require:

sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev cmake

git clone https://github.com/mgbellemare/Arcade-Learning-Environment.git
cd Arcade-Learning-Environment

mkdir build && cd build
cmake -DUSE_SDL=ON -DUSE_RLGLUE=OFF -DBUILD_EXAMPLES=ON ..
make -j 4
cd ..

pip install .
Parameters:
  • level (string) – ALE rom file (required).
  • life_loss_terminal (bool) – Whether to signal a terminal state on loss of life (default: false).
  • life_loss_punishment (float) – Reward/penalty on loss of life (negative values are a penalty) (default: 0.0).
  • repeat_action_probability (float) – Repeats last action with given probability (default: 0.0).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
  • seed (int) – Random seed (default: none).

Maze Explorer

class tensorforce.environments.MazeExplorer(level, visualize=False)

MazeExplorer environment adapter (specification key: mazeexp, maze_explorer).

May require:

sudo apt-get install freeglut3-dev

pip install mazeexp
Parameters:
  • level (int) – Game mode, see GitHub (required).
  • visualize (bool) – Whether to visualize interaction (default: false).

Open Sim

class tensorforce.environments.OpenSim(level, visualize=False, integrator_accuracy=5e-05)

OpenSim environment adapter (specification key: osim, open_sim).

Parameters:
  • level ('Arm2D' | 'L2Run' | 'Prosthetics') – Environment id (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • integrator_accuracy (float) – Integrator accuracy (default: 5e-5).

OpenAI Gym

class tensorforce.environments.OpenAIGym(level, visualize=False, max_episode_timesteps=None, terminal_reward=0.0, reward_threshold=None, tags=None, visualize_directory=None, **kwargs)

OpenAI Gym environment adapter (specification key: gym, openai_gym).

May require:

pip install gym
pip install gym[all]
Parameters:
  • level (string | gym.Env) – Gym id or instance (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • max_episode_timesteps (false | int > 0) – Whether to terminate an episode after a maximum number of timesteps and, if so, the maximum number of timesteps per episode (default: Gym default).
  • terminal_reward (float) – Additional reward for early termination, if otherwise indistinguishable from termination due to maximum number of timesteps (default: Gym default).
  • reward_threshold (float) – Gym environment argument, the reward threshold before the task is considered solved (default: Gym default).
  • tags (dict) – Gym environment argument, a set of arbitrary key-value tags on this environment, including simple property=True tags (default: Gym default).
  • visualize_directory (string) – Visualization output directory (default: none).
  • kwargs – Additional Gym environment arguments.

OpenAI Retro

class tensorforce.environments.OpenAIRetro(level, visualize=False, visualize_directory=None, **kwargs)

OpenAI Retro environment adapter (specification key: retro, openai_retro).

May require:

pip install gym-retro
Parameters:
  • level (string) – Game id (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • visualize_directory (string) – Visualization output directory (default: none).
  • kwargs – Additional Retro environment arguments.

PyGame Learning Environment

class tensorforce.environments.PyGameLearningEnvironment(level, visualize=False, frame_skip=1, fps=30)

PyGame Learning Environment environment adapter (specification key: ple, pygame_learning_environment).

May require:

sudo apt-get install git python3-dev python3-setuptools python3-numpy python3-opengl \
    libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libsdl1.2-dev \
    libportmidi-dev libswscale-dev libavformat-dev libavcodec-dev libtiff5-dev libx11-6 \
    libx11-dev fluid-soundfont-gm timgm6mb-soundfont xfonts-base xfonts-100dpi xfonts-75dpi \
    xfonts-cyrillic fontconfig fonts-freefont-ttf libfreetype6-dev

pip install git+https://github.com/pygame/pygame.git

pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
Parameters:
  • level (string | subclass of ple.games.base) – Game instance or name of class in ple.games, like ‘doom’, ‘flappybird’, ‘monsterkong’, ‘catcher’, ‘pixelcopter’, ‘pong’, ‘puckworld’, ‘raycastmaze’, ‘snake’, ‘waterworld’ (required).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 1).
  • fps (int > 0) – Frames per second at which the game runs (default: 30).

ViZDoom

class tensorforce.environments.ViZDoom(level, visualize=False, include_variables=False, factored_action=False, frame_skip=12, seed=None)

ViZDoom environment adapter (specification key: vizdoom).

May require:

sudo apt-get install g++ build-essential libsdl2-dev zlib1g-dev libmpg123-dev libjpeg-dev \
    libsndfile1-dev nasm tar libbz2-dev libgtk2.0-dev make cmake git chrpath timidity \
    libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip libboost-all-dev \
    liblua5.1-dev

pip install vizdoom
Parameters:
  • level (string) – ViZDoom configuration file (required).
  • include_variables (bool) – Whether to include game variables in the state (default: false).
  • factored_action (bool) – Whether to use factored action representation (default: false).
  • visualize (bool) – Whether to visualize interaction (default: false).
  • frame_skip (int > 0) – Number of times to repeat an action without observing (default: 12).
  • seed (int) – Random seed (default: none).