TensorForce - modular deep reinforcement learning in TensorFlow¶
TensorForce is an open source reinforcement learning library focused on providing clear APIs, readability and modularisation to deploy reinforcement learning solutions both in research and practice. TensorForce is built on top of TensorFlow.
Quick start¶
For a quick start, you can run one of our example scripts using the provided configurations, e.g. to run the PPO agent on CartPole, execute from the repository root:
python examples/openai_gym.py CartPole-v0 -a examples/configs/ppo.json -n examples/configs/mlp2_network.json
In Python, it could look like this:
# examples/quickstart.py
import numpy as np

from tensorforce.agents import PPOAgent
from tensorforce.execution import Runner
from tensorforce.contrib.openai_gym import OpenAIGym

# Create an OpenAI Gym environment
env = OpenAIGym('CartPole-v0', visualize=True)

# Network as list of layers
network_spec = [
    dict(type='dense', size=32, activation='tanh'),
    dict(type='dense', size=32, activation='tanh')
]

agent = PPOAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=4096,
    # BatchAgent
    keep_last_timestep=True,
    # PPOAgent
    step_optimizer=dict(
        type='adam',
        learning_rate=1e-3
    ),
    optimization_steps=10,
    # Model
    scope='ppo',
    discount=0.99,
    # DistributionModel
    distributions_spec=None,
    entropy_regularization=0.01,
    # PGModel
    baseline_mode=None,
    baseline=None,
    baseline_optimizer=None,
    gae_lambda=None,
    # PGLRModel
    likelihood_ratio_clipping=0.2,
    summary_spec=None,
    distributed_spec=None
)

# Create the runner
runner = Runner(agent=agent, environment=env)


# Callback function printing episode statistics
def episode_finished(r):
    print("Finished episode {ep} after {ts} timesteps (reward: {reward})".format(
        ep=r.episode, ts=r.episode_timestep, reward=r.episode_rewards[-1]))
    return True


# Start learning
runner.run(episodes=3000, max_episode_timesteps=200, episode_finished=episode_finished)

# Print statistics
print("Learning finished. Total episodes: {ep}. Average reward of last 100 episodes: {ar}.".format(
    ep=runner.episode,
    ar=np.mean(runner.episode_rewards[-100:]))
)
Agent and model overview¶
A reinforcement learning agent provides methods to process states and
return actions, to store past observations, and to load and save models.
Most agents employ a Model
which implements the algorithms to
calculate the next action given the current state and to update model
parameters from past experiences.
Environment <-> Runner <-> Agent <-> Model
Parameters to the agent are passed as a Configuration object. The configuration is passed on to the Model.
Ready-to-use algorithms¶
We implemented some of the most common RL algorithms and try to keep these up to date. Here we provide an overview of all implemented agents and models.
Agent / General parameters¶
Agent
is the base class for all reinforcement learning agents. Every
agent inherits from this class.
- class
tensorforce.agents.
Agent
(states_spec, actions_spec, batched_observe=1000, scope='base_agent')Bases:
object
Basic Reinforcement learning agent. An agent encapsulates execution logic of a particular reinforcement learning algorithm and defines the external interface to the environment.
The agent hence acts as an intermediate layer between environment and backend execution (value function or policy updates).
act
(states, deterministic=False)Return action(s) for given state(s). States preprocessing and exploration are applied if configured accordingly.
Parameters:
- states (any) -- One state (usually a value tuple) or dict of states if multiple states are expected.
- deterministic (bool) -- If true, no exploration and sampling is applied.
Returns: Scalar value of the action or dict of multiple actions the agent wants to execute.
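For example (a minimal sketch; state is assumed to be a state previously obtained from the environment):
# During training: exploration and sampling are applied if configured
action = agent.act(states=state)
# During evaluation: act greedily, without exploration or sampling
action = agent.act(states=state, deterministic=True)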
- static
from_spec
(spec, kwargs)Creates an agent from a specification dict.
initialize_model
()Creates the model for the respective agent based on specifications given by user. This is a separate call after constructing the agent because the agent constructor has to perform a number of checks on the specs first, sometimes adjusting them e.g. by converting to a dict.
observe
(terminal, reward)Observe experience from the environment to learn from. Optionally pre-processes rewards. Child classes should call super to get the processed reward, e.g. terminal, reward = super()...
Parameters:
- terminal (bool) -- boolean indicating if the episode terminated after the observation.
- reward (float) -- scalar reward that resulted from executing the action.
reset
()Reset the agent to its initial state (e.g. on experiment start). Updates the Model's internal episode and timestep counter, internal states, and resets preprocessors.
restore_model
(directory=None, file=None)Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).
Parameters:
- directory -- Optional checkpoint directory.
- file -- Optional checkpoint file, or path if directory not given.
save_model
(directory=None, append_timestep=True)Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends current timestep to prevent overwriting previous checkpoint files. Turn off to be able to load model from the same given path argument as given here.
Parameters:
- directory (str) -- Optional checkpoint directory.
- append_timestep (bool) -- Appends the current timestep to the checkpoint file if true. If this is set to True, the load path must include the checkpoint timestep suffix. For example, if stored to models/ and set to true, the exported file will be of the form models/model.ckpt-X where X is the last timestep saved. The load path must precisely match this file name. If this option is turned off, the checkpoint will always overwrite the file specified in path and the model can always be loaded under this path.
Returns: Checkpoint path where the model was saved.
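For example, a minimal save/restore round trip might look like this (the directory path is illustrative):
# Save a checkpoint; the returned value is the checkpoint path
checkpoint_path = agent.save_model(directory='./checkpoints/', append_timestep=True)

# ... later: restore the latest checkpoint from that directory
agent.restore_model(directory='./checkpoints/')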
Model¶
The Model
class is the base class for reinforcement learning models.
- class
tensorforce.models.
Model
(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)Bases:
object
Base class for all (TensorFlow-based) models.
act
(states, internals, deterministic=False)Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of incoming internal state tensors.
- deterministic (bool) -- If True, will not apply exploration after actions are calculated.
Returns:
- Actual action-outputs (batched if state input is a batch).
Return type: tuple
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- update -- Single boolean tensor indicating whether this call happens during an update.
- deterministic -- Boolean tensor indicating whether exploration will be skipped when actions are calculated.
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular
fn_loss
returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- update -- Single boolean tensor indicating whether this call happens during an update.
Returns: Dict to be passed into the optimizer op (e.g. 'minimize') as kwargs.
get_summaries
()Returns the TensorFlow summaries reported by the model
Returns: List of summaries
get_variables
(include_non_trainable=False)Returns the TensorFlow variables used by the model.
Returns: List of variables.
initialize
(custom_getter)Creates the TensorFlow placeholders and functions for this model. Moreover adds the internal state placeholders and initialization values to the model.
Parameters: custom_getter -- The custom_getter object to use for tf.make_template when creating TensorFlow functions.
observe
(terminal, reward)Adds an observation (reward and is-terminal) to the model without updating its trainable variables.
Parameters:
- terminal (bool) -- Whether the episode has terminated.
- reward (float) -- The observed reward value.
Returns: The value of the model-internal episode counter.
reset
()Resets the model to its initial state on episode start.
Returns: Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors. Return type: tuple
restore
(directory=None, file=None)Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model's default saver directory is used (unless file specifies the entire path).
Parameters:
- directory -- Optional checkpoint directory.
- file -- Optional checkpoint file, or path if directory not given.
save
(directory=None, append_timestep=True)Save TensorFlow model. If no checkpoint directory is given, the model's default saver directory is used. Optionally appends current timestep to prevent overwriting previous checkpoint files. Turn off to be able to load model from the same given path argument as given here.
Parameters:
- directory -- Optional checkpoint directory.
- append_timestep -- Appends the current timestep to the checkpoint file if true.
Returns: Checkpoint path where the model was saved.
setup
()Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.
tf_action_exploration
(action, exploration, action_spec)Applies optional exploration to the action (post-processor for action outputs).
Parameters:
- action (tf.Tensor) -- The original output action tensor (to be post-processed).
- exploration (Exploration) -- The Exploration object to use.
- action_spec (dict) -- Dict specifying the action space.
Returns: The post-processed action output tensor.
tf_actions_and_internals
(states, internals, update, deterministic)Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- update -- Single boolean tensor indicating whether this call happens during an update.
- deterministic -- Boolean tensor indicating whether exploration will be skipped when actions are calculated.
Returns:
- dict of output actions (with or without exploration applied, see deterministic)
- list of posterior internal state Tensors (empty for non-internal state models)
Return type: tuple
tf_discounted_cumulative_reward
(terminal, reward, discount=None, final_reward=0.0, horizon=0)Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.
Example:
single rewards = 2.0, 1.0, 0.0, 0.5, 1.0, -1.0
terminal = False, False, False, False, True, False
gamma = 0.95
final_reward = 100.0 (only matters for the last episode (r=-1.0) as this episode has no terminal signal)
horizon = 3
output = 2.95, 1.45, 1.38, 1.45, 1.0, 94.0
Parameters:
- terminal -- Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one True value. If its very last element is False (not terminating), the given final_reward value is assumed to follow the last value in the single rewards sequence (see below).
- reward -- Tensor (float) holding the sequence of single rewards. If the last element of terminal is False, an assumed last reward of the value of final_reward will be used.
- discount (float) -- The discount factor (gamma). By default, the Model's discount factor is used.
- final_reward (float) -- Reward value to use if the last episode in the sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
- horizon (int) -- The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns: Discounted cumulative reward tensor with the same shape as
reward
.
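To make the example above concrete, here is a plain NumPy sketch of the same computation (an illustration only, assuming terminal[i] == True marks the last reward of an episode; the actual method builds equivalent TensorFlow operations):
import numpy as np

def discounted_cumulative_reward(terminal, reward, discount=0.95, final_reward=0.0, horizon=0):
    terminal, reward = list(terminal), list(reward)
    length = len(reward)
    if not terminal[-1]:
        # Sequence does not end on a terminal: assume final_reward follows it
        reward.append(final_reward)
        terminal.append(True)
    window = horizon if horizon > 0 else len(reward)
    out = []
    for i in range(length):
        cumulative = 0.0
        for k in range(window):
            j = i + k
            if j >= len(reward):
                break
            cumulative += (discount ** k) * reward[j]
            if terminal[j]:  # do not accumulate across episode boundaries
                break
        out.append(cumulative)
    return np.array(out)

rewards = [2.0, 1.0, 0.0, 0.5, 1.0, -1.0]
terminals = [False, False, False, False, True, False]
print(discounted_cumulative_reward(terminals, rewards, discount=0.95, final_reward=100.0, horizon=3))
# -> [2.95, 1.45125, 1.3775, 1.45, 1.0, 94.0], matching the example output above up to rounding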
tf_loss
(states, internals, actions, terminal, reward, update)Creates and returns the single loss Tensor representing the total loss for a batch, including the mean loss per sample and the regularization loss of the batch.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- update -- Single boolean tensor indicating whether this call happens during an update.
Returns: Single float-value loss tensor.
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- update -- Single boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor (first rank is the batch size -> one loss value per sample in the batch).
tf_optimization
(states, internals, actions, terminal, reward, update)Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- update -- Single boolean tensor indicating whether this call happens during an update.
Returns: The optimization operation.
tf_preprocess_reward
(states, internals, terminal, reward)Applies optional preprocessing to the reward.
tf_preprocess_states
(states)Applies optional preprocessing to the states.
tf_regularization_losses
(states, internals, update)Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- update -- Single boolean tensor indicating whether this call happens during an update.
Returns: Dict of regularization loss tensors (keys == different regularization types, e.g. 'entropy').
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)Runs the self.optimization in the session to update the Model's parameters. Optionally, also runs the
loss_per_instance
calculation and returns the result of that.
Parameters:
- states (dict) -- Dict of state tensors (each key represents one state space component).
- internals -- List of prior internal state tensors.
- actions (dict) -- Dict of action tensors (each key represents one action space component).
- terminal -- Terminal boolean tensor (shape=(batch-size,)).
- reward -- Reward float tensor (shape=(batch-size,)).
- return_loss_per_instance (bool) -- Whether to also run and return the
loss_per_instance
Tensor.Returns: void or - if return_loss_per_instance is True - the value of the
loss_per_instance
Tensor.
MemoryAgent¶
- class
tensorforce.agents.
MemoryAgent
(states_spec, actions_spec, batched_observe=1000, scope='memory_agent', summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, memory=None, first_update=10000, update_frequency=4, repeat_update=1)Bases:
tensorforce.agents.learning_agent.LearningAgent
The
MemoryAgent
class implements a replay memory from which it samples batches according to some sampling strategy to update the value function.
import_observations
(observations)Load an iterable of observation dicts into the replay memory.
Parameters: observations -- An iterable with each element containing an observation. Each observation requires keys 'state','action','reward','terminal', 'internal'. Use an empty list [] for 'internal' if internal state is irrelevant.
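A hedged sketch of pre-filling the replay memory (recorded_transitions is a hypothetical iterable of recorded data; the state and action formats follow your states_spec and actions_spec):
observations = [
    dict(state=s, action=a, reward=r, terminal=t, internal=[])
    for (s, a, r, t) in recorded_transitions  # hypothetical recorded data
]
agent.import_observations(observations)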
BatchAgent¶
- class
tensorforce.agents.
BatchAgent
(states_spec, actions_spec, batched_observe=1000, summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, scope='batch_agent', saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True)Bases:
tensorforce.agents.learning_agent.LearningAgent
The
BatchAgent
class implements a batch memory which generally implies on-policy experience collection and updates.
observe
(terminal, reward)Adds an observation and performs an update if the necessary conditions are satisfied, i.e. if one batch of experience has been collected as defined by the batch size.
In particular, note that episode control happens outside of the agent since the agent should be agnostic to how the training data is created.
Parameters:
- terminal (bool) -- Whether episode is terminated or not.
- reward (float) -- The scalar reward value.
reset_batch
()Cleans up after a batch has been processed (observed). Resets all batch information to be ready for new observation data. Batch information contains:
- observed states
- internal-variables
- taken actions
- observed is-terminal signals/rewards
- total batch size
Deep-Q-Networks (DQN)¶
- class
tensorforce.agents.
DQNAgent
(states_spec, actions_spec, batched_observe=None, scope='dqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep-Q-Network agent (DQN). The pièce de résistance of deep reinforcement learning as described by Mnih et al. (2015). Includes an option for double DQN (DDQN; van Hasselt et al., 2015).
DQN chooses from one of a number of discrete actions by taking the maximum Q-value from the value function with one output neuron per available action. DQN uses a replay memory for experience playback.
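A minimal construction sketch (hedged: the replay memory is assumed to be specified as a dict with 'type' and 'capacity' keys, as in the example configuration files; env and network_spec are defined as in the quick start):
from tensorforce.agents import DQNAgent

agent = DQNAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=32,
    memory=dict(type='replay', capacity=100000),  # assumed memory spec format
    first_update=10000,
    update_frequency=4,
    target_sync_frequency=10000,
    double_q_model=True  # optional double-DQN variant
)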
Normalized Advantage Functions¶
- class
tensorforce.agents.
NAFAgent
(states_spec, actions_spec, batched_observe=1000, scope='naf', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)Bases:
tensorforce.agents.memory_agent.MemoryAgent
Normalized Advantage Functions (NAF) for continuous DQN: https://arxiv.org/abs/1603.00748
Deep-Q-learning from demonstration (DQFD)¶
- class
tensorforce.agents.
DQFDAgent
(states_spec, actions_spec, batched_observe=1000, scope='dqfd', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep Q-learning from demonstration (DQFD) agent (Hester et al., 2017). This agent uses DQN to pre-train from demonstration data via an additional supervised loss term.
import_demonstrations
(demonstrations)Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, which directly sets memory contents to an array and expects a different layout.
Parameters: demonstrations -- List of observation dicts
observe
(reward, terminal)Adds observations, updates via sampling from memories according to update rate. DQFD samples from the online replay memory and the demo memory with the fractions controlled by a hyperparameter called the 'expert sampling ratio'.
pretrain
(steps)Computes pre-train updates.
Parameters: steps -- Number of updates to execute.
set_demonstrations
(batch)Set all demonstrations from batch data. Expects a dict wherein each value contains an array containing all states, actions, rewards, terminals and internals respectively.
Parameters: batch --
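A hedged sketch of the demonstration workflow; expert_transitions is a hypothetical iterable of recorded expert data, and the observation keys are assumed to follow the layout described for MemoryAgent.import_observations above:
demonstrations = [
    dict(state=s, action=a, reward=r, terminal=t, internal=[])
    for (s, a, r, t) in expert_transitions  # hypothetical recorded expert data
]
agent.import_demonstrations(demonstrations)
agent.pretrain(steps=10000)  # supervised pre-training updates from the demo memory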
Vanilla Policy Gradient¶
- class
tensorforce.agents.
VPGAgent
(states_spec, actions_spec, batched_observe=1000, scope='vpg', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)Bases:
tensorforce.agents.batch_agent.BatchAgent
Vanilla Policy Gradient agent as described by [Sutton et al. (1999)] (https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf).
Trust Region Policy Optimization (TRPO)¶
- class
tensorforce.agents.
TRPOAgent
(states_spec, actions_spec, batched_observe=1000, scope='trpo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False)Bases:
tensorforce.agents.batch_agent.BatchAgent
Trust Region Policy Optimization (Schulman et al., 2015) agent.
State preprocessing¶
The agent handles state preprocessing. A preprocessor takes the raw state input from the environment and modifies it (for instance, image resize, state concatenation, etc.). You can find information about our ready-to-use preprocessors here.
Building your own agent¶
If you want to build your own agent, it should always inherit from Agent. If your agent uses a replay memory, it should probably inherit from MemoryAgent; if it uses a batch memory that is emptied after each update, it should probably inherit from BatchAgent.
We distinguish between agents and models. The Agent
class handles the
interaction with the environment, such as state preprocessing, exploration
and observation of rewards. The Model
class handles the mathematical
operations, such as building the TensorFlow operations, calculating the
desired action and updating (i.e. optimizing) the model weights.
To start building your own agent, please refer to this blogpost to gain a deeper understanding of the internals of the TensorForce library. Afterwards, have a look at a sample implementation, e.g. the DQN Agent and DQN Model.
Environments¶
A reinforcement learning environment provides the API to a simulated or real environment as the subject for optimization. It could be anything from video games (e.g. Atari) to robots or trading systems. The agent interacts with this environment and learns to act optimally in its dynamics.
Environment <-> Runner <-> Agent <-> Model
- class
tensorforce.environments.
Environment
Base environment class.
actions
Return the action space. Might include subdicts if multiple actions are available simultaneously.
Returns: dict of action properties (continuous, number of actions)
close
()Close environment. No other method calls possible afterwards.
execute
(actions)Executes action, observes next state(s) and reward.
Parameters: actions -- Actions to execute. Returns: (Dict of) next state(s), boolean indicating terminal, and reward signal.
reset
()Reset environment and setup for new episode.
Returns: initial state of reset environment.
seed
(seed)Sets the random seed of the environment to the given value (current time, if seed=None). Naturally deterministic Environments (e.g. ALE or some gym Envs) don't have to implement this method.
Parameters: seed (int) -- The seed to use for initializing the pseudo-random number generator (default=epoch time in sec). Returns: The actual seed (int) used OR None if Environment did not override this method (no seeding supported).
states
Return the state space. Might include subdicts if multiple states are available simultaneously.
Returns: dict of state properties (shape and type).
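A minimal custom environment sketch (hedged: the spec dicts are assumed to use the usual shape/type keys for states and type/num_actions for discrete actions, and the class itself is a toy illustration):
import random

from tensorforce.environments import Environment


class CoinFlipEnvironment(Environment):
    """Hypothetical toy environment: guess the outcome of a coin flip."""

    def __init__(self):
        self.state = [0.0]

    @property
    def states(self):
        # One float state component (unused in this toy example)
        return dict(shape=(1,), type='float')

    @property
    def actions(self):
        # Two discrete actions: guess heads (0) or tails (1)
        return dict(type='int', num_actions=2)

    def reset(self):
        self.state = [0.0]
        return self.state

    def execute(self, actions):
        # Reward 1.0 for a correct guess, 0.0 otherwise; one-step episodes
        reward = 1.0 if actions == random.randint(0, 1) else 0.0
        terminal = True
        return self.state, terminal, reward

    def close(self):
        pass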
Ready-to-use environments¶
OpenAI Gym¶
- class
tensorforce.contrib.openai_gym.
OpenAIGym
(gym_id, monitor=None, monitor_safe=False, monitor_video=0, visualize=False)Bases:
tensorforce.environments.environment.Environment
__init__
(gym_id, monitor=None, monitor_safe=False, monitor_video=0, visualize=False)Initialize OpenAI Gym.
Parameters:
- gym_id -- OpenAI Gym environment ID. See https://gym.openai.com/envs
- monitor -- Output directory. Setting this to None disables monitoring.
- monitor_safe -- Setting this to True prevents existing log files from being overwritten. Default False.
- monitor_video -- Save a video every monitor_video steps. Setting this to 0 disables recording of videos.
- visualize -- If set to True, the program will visualize the training in the Gym environment. Note that such visualization will probably slow down training.
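For example (the monitor directory is illustrative):
from tensorforce.contrib.openai_gym import OpenAIGym

# Plain environment, no rendering
env = OpenAIGym('CartPole-v0')

# With monitoring: write Gym monitor output to ./gym_monitor/ and save a video every 100 steps
env = OpenAIGym('CartPole-v0', monitor='./gym_monitor/', monitor_video=100, visualize=False)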
OpenAI Universe¶
- class
tensorforce.contrib.openai_universe.
OpenAIUniverse
(env_id)Bases:
tensorforce.environments.environment.Environment
OpenAI Universe Integration: https://universe.openai.com/. Contains OpenAI Gym: https://gym.openai.com/.
__init__
(env_id)Initialize OpenAI universe environment.
Parameters: env_id -- string with id/descriptor of the universe environment, e.g. 'HarvestDay-v0'.
Deepmind Lab¶
- class
tensorforce.contrib.deepmind_lab.
DeepMindLab
(level_id, repeat_action=1, state_attribute='RGB_INTERLACED', settings={'width': '320', 'appendCommand': '', 'fps': '60', 'height': '240'})Bases:
tensorforce.environments.environment.Environment
DeepMind Lab Integration: https://arxiv.org/abs/1612.03801 https://github.com/deepmind/lab
Since DeepMind Lab is only available as source code, a manual install via bazel is required. Further, due to the way bazel handles external dependencies, cloning TensorForce into lab is the most convenient way to run it using the bazel BUILD file we provide. To use lab, first download and install it according to the instructions at https://github.com/deepmind/lab/blob/master/docs/build.md:
git clone https://github.com/deepmind/lab.git
Add to the lab main BUILD file:
Clone TensorForce into the lab directory, then run the TensorForce bazel runner.
Note that using any specific configuration file currently requires changing the Tensorforce BUILD file to adjust environment parameters.
bazel run //tensorforce:lab_runner
Please note that we have not tried to reproduce any lab results yet, and these instructions just explain connectivity in case someone wants to get started there.
__init__
(level_id, repeat_action=1, state_attribute='RGB_INTERLACED', settings={'width': '320', 'appendCommand': '', 'fps': '60', 'height': '240'})Initialize DeepMind Lab environment.
Parameters:
- level_id -- string with id/descriptor of the level, e.g. 'seekavoid_arena_01'.
- repeat_action -- number of frames the environment is advanced, executing the given action during every frame.
- state_attribute -- Attributes which represents the state for this environment, should adhere to the specification given in DeepMindLabEnvironment.state_spec(level_id).
- settings -- dict specifying additional settings as key-value string pairs. The following options are recognized: 'width' (horizontal resolution of the observation frames), 'height' (vertical resolution of the observation frames), 'fps' (frames per second) and 'appendCommand' (commands for the internal Quake console).
close
()Closes the environment and releases the underlying Quake III Arena instance. No other method calls possible afterwards.
execute
(actions)Pass actions to the lab environment; returns the next state, reward, terminal flag and additional info.
Parameters: action -- action to execute as numpy array, should have dtype np.intc and should adhere to the specification given in DeepMindLabEnvironment.action_spec(level_id) Returns: dict containing the next state, the reward, and a boolean indicating if the next state is a terminal state
fps
An advisory metric that correlates discrete environment steps ("frames") with real (wallclock) time: the number of frames per (real) second.
num_steps
Number of frames since the last reset() call.
reset
()Resets the environment to its initialization state. This method needs to be called to start a new episode after the last episode ended.
Returns: initial state
Unreal Engine 4 Games¶
- class
tensorforce.contrib.unreal_engine.
UE4Environment
(host='localhost', port=6025, connect=True, discretize_actions=False, delta_time=0, num_ticks=4)Bases:
tensorforce.contrib.remote_environment.RemoteEnvironment
,tensorforce.contrib.state_settable_environment.StateSettableEnvironment
A special RemoteEnvironment for UE4 game connections. Communicates with the remote to receive information on the definitions of action- and observation spaces. Sends UE4 Action- and Axis-mappings as RL-actions and receives observations back defined by ducandu plugin Observer objects placed in the Game (these could be camera pixels or other observations, e.g. a x/y/z position of some game actor).
__init__
(host='localhost', port=6025, connect=True, discretize_actions=False, delta_time=0, num_ticks=4)
Parameters:
- host (str) -- The hostname to connect to.
- port (int) -- The port to connect to.
- connect (bool) -- Whether to connect already in the constructor.
- discretize_actions (bool) -- Whether to treat axis-mappings defined in the UE4 game as discrete actions. This would be necessary e.g. for agents that use q-networks, where the outputs are q-values per discrete state-action pair.
- delta_time (float) -- The fake delta time to use for each single game tick.
- num_ticks (int) -- The number of ticks to be executed in this step (each tick will repeat the same given actions).
discretize_action_space_desc
()Creates a list of discrete action(-combinations) in case we want to learn with a discrete set of actions, but only have action-combinations (maybe even continuous) available from the env. E.g. the UE4 game has the following action/axis-mappings:
{
    'Fire': {'type': 'action', 'keys': ('SpaceBar',)},
    'MoveRight': {'type': 'axis', 'keys': (('Right', 1.0), ('Left', -1.0), ('A', -1.0), ('D', 1.0))},
}
-> this method will discretize them into the following 6 discrete actions:
[
    [(Right, 0.0), (SpaceBar, False)],
    [(Right, 0.0), (SpaceBar, True)],
    [(Right, -1.0), (SpaceBar, False)],
    [(Right, -1.0), (SpaceBar, True)],
    [(Right, 1.0), (SpaceBar, False)],
    [(Right, 1.0), (SpaceBar, True)],
]
execute
(actions)Executes a single step in the UE4 game. This step may be comprised of one or more actual game ticks for all of which the same given action- and axis-inputs (or action number in case of discretized actions) are repeated. UE4 distinguishes between action-mappings, which are boolean actions (e.g. jump or dont-jump) and axis-mappings, which are continuous actions like MoveForward with values between -1.0 (run backwards) and 1.0 (run forwards), 0.0 would mean: stop.
reset
()Same as step (no kwargs to pass), but needs to block and return the observation dict. Stores the received observation in self.last_observation.
translate_abstract_actions_to_keys
(abstract)Translates a list of tuples ([pretty mapping], [value]) into a list of tuples ([some key], [translated value]). Each single item in abstract will undergo the following translation:
Example 1: we want "MoveRight": 5.0. Possible keys for the action are ("Right", 1.0), ("Left", -1.0). Result: "Right": 5.0 * 1.0 = 5.0
Example 2: we want "MoveRight": -0.5. Possible keys for the action are ("Left", -1.0), ("Right", 1.0). Result: "Left": -0.5 * -1.0 = 0.5 (same as "Right": -0.5)
Preprocessing¶
Often it is necessary to modify state input tensors before passing them to the reinforcement learning agent. This could be due to various reasons, e.g.:
- Feature scaling / input normalization,
- Data reduction,
- Ensuring the Markov property by concatenating multiple states (e.g. in Atari)
TensorForce comes with a number of ready-to-use preprocessors, a preprocessing stack and easy ways to implement your own preprocessors.
Usage¶
Each preprocessor implements three methods:
- The constructor (__init__) for parameter initialization
- process(state) takes a state and returns the processed state
- processed_shape(original_shape) takes a shape and returns the processed shape
The preprocessing stack iteratively calls these functions of all preprocessors in the stack and returns the result.
Using one preprocessor¶
from tensorforce.core.preprocessing import Sequence
pp_seq = Sequence(4) # initialize preprocessor (return sequence of last 4 states)
state = env.reset() # reset environment
processed_state = pp_seq.process(state) # process state
Using a preprocessing stack¶
You can stack multiple preprocessors:
from tensorforce.core.preprocessing import Preprocessing, Grayscale, Sequence
pp_gray = Grayscale() # initialize grayscale preprocessor
pp_seq = Sequence(4) # initialize sequence preprocessor
stack = Preprocessing() # initialize preprocessing stack
stack.add(pp_gray) # add grayscale preprocessor to stack
stack.add(pp_seq) # add sequence preprocessor to stack
state = env.reset() # reset environment
processed_state = stack.process(state) # process state
Using a configuration dict¶
If you use configuration objects, you can build your preprocessing stack from a config:
from tensorforce.core.preprocessing import Preprocessing
preprocessing_config = [
    {
        "type": "image_resize",
        "width": 84,
        "height": 84
    }, {
        "type": "grayscale"
    }, {
        "type": "center"
    }, {
        "type": "sequence",
        "length": 4
    }
]
stack = Preprocessing.from_spec(preprocessing_config)
config.state_shape = stack.shape(config.state_shape)
The Agent class expects a state preprocessing configuration parameter (states_preprocessing_spec in the agent signatures above) and then handles preprocessing automatically:
from tensorforce.agents import DQNAgent

agent = DQNAgent(
    states_spec=...,
    actions_spec=...,
    network_spec=...,
    states_preprocessing_spec=preprocessing_config,
    # ...
)
Ready-to-use preprocessors¶
These are the preprocessors that come with TensorForce:
Standardize¶
- class
tensorforce.core.preprocessing.
Standardize
(across_batch=False, scope='standardize', summary_labels=())Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Standardize state. Subtract mean and divide by standard deviation.
Grayscale¶
- class
tensorforce.core.preprocessing.
Grayscale
(weights=(0.299, 0.587, 0.114), scope='grayscale', summary_labels=())Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Turn 3D color state into grayscale.
ImageResize¶
- class
tensorforce.core.preprocessing.
ImageResize
(width, height, scope='image_resize', summary_labels=())Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Resize image to width x height.
Normalize¶
- class
tensorforce.core.preprocessing.
Normalize
(scope='normalize', summary_labels=())Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Normalize state. Subtract minimal value and divide by range.
Sequence¶
- class
tensorforce.core.preprocessing.
Sequence
(length=2, scope='sequence', summary_labels=())Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Concatenate
length
state vectors. Example: Used in Atari problems to create the Markov property.
Building your own preprocessor¶
All preprocessors should inherit from
tensorforce.core.preprocessing.Preprocessor
.
For a start, please refer to the source of the Grayscale preprocessor.
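As a starting point, here is a hedged skeleton following the three-method interface described in the Usage section above. The base-class constructor arguments (scope, summary_labels) are assumed from the built-in preprocessors' signatures, the exact hooks may differ between TensorForce versions, and the Clip class with its bound parameter is purely illustrative:
import numpy as np

from tensorforce.core.preprocessing import Preprocessor


class Clip(Preprocessor):
    """Hypothetical preprocessor that clips states to [-bound, bound]."""

    def __init__(self, bound=1.0, scope='clip', summary_labels=()):
        super(Clip, self).__init__(scope=scope, summary_labels=summary_labels)
        self.bound = bound

    def process(self, state):
        # Return the clipped state
        return np.clip(state, -self.bound, self.bound)

    def processed_shape(self, original_shape):
        # Clipping does not change the state shape
        return original_shape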
TensorForce: Details for "summary_spec" agent parameters¶
summary_spec¶
TensorForce has the ability to record summary data for use with TensorBoard, as well as STDIO and file export. This is accomplished through a dictionary parameter called "summary_spec" passed to the agent on initialization.
"summary_spec" supports the following optional dictionary entries:
Key | Value |
---|---|
directory | (str) Path to storage for TensorBoard summary data |
steps | (int) Frequency in steps between storage of summary data |
seconds | (int) Frequency in seconds to store summary data |
labels | (list) Requested exports, see the "LABELS" section |
meta_dict | (dict) For use with the "configuration" label |
LABELS¶
Entry | Data produced |
---|---|
losses | Training total-loss and "loss-without-regularization" |
total-loss | Final calculated loss value |
variables | Network variables |
inputs | Equivalent to: ['states', 'actions', 'rewards'] |
states | Histogram of input state space |
actions | Histogram of input action space |
rewards | Histogram of input reward space |
gradients | Histogram and scalar gradients |
gradients_histogram | Variable gradients as histograms |
gradients_scalar | Variable Mean/Variance of gradients as scalar |
regularization | Regularization values |
configuration | See Configuration Export for more detail |
configuration | Export configuration to "TEXT" tab in TensorBoard |
print_configuration | Prints configuration to STDOUT |
from tensorforce.agents import PPOAgent

# Create a Proximal Policy Optimization agent
agent = PPOAgent(
    states_spec=...,
    actions_spec=...,
    network_spec=...,
    summary_spec=dict(
        directory="./board/",
        steps=50,
        labels=['configuration',
                'gradients_scalar',
                'regularization',
                'inputs',
                'losses',
                'variables']
    ),
    ...
)
Configuration Export¶
Adding the "configuration" label will create a "TEXT" tab in TensorBoard that contains all the parameters passed to the Agent. By using the additional "summary_spec" dictionary key "meta_dict", custom keys and values can be added to the data export. The user may want to pass "Description", "Experiement #", "InputDataSet", etc.
If a key is already in use within TensorForce an error will be raised to notify you to change the key value. To use the custom feature, create a dictionary with keys to export:
from tensorforce.agents import PPOAgent

import numpy as np

metaparams = dict()
metaparams['MyDescription'] = "This experiment covers the first test ...."
metaparams['My2D'] = np.ones((9, 9))  # 9x9 matrix of 1.0's
metaparams['My1D'] = np.ones((9,))    # Column of 9 1.0's

# Create a Proximal Policy Optimization agent
agent = PPOAgent(
    states_spec=...,
    actions_spec=...,
    network_spec=...,
    summary_spec=dict(
        directory="./board/",
        steps=50,
        meta_dict=metaparams,  # Add custom keys to export
        labels=['configuration',
                'gradients_scalar',
                'regularization',
                'inputs',
                'losses',
                'variables']
    ),
    ...
)
Use the "print_configuration" label to export the configuration data to the command line's STDOUT.
Runners¶
A "runner" manages the interaction between the Environment and the Agent. TensorForce comes with ready-to-use runners. Of course, you can implement your own runners, too. If you are not using simulation environments, the runner is simply your application code using the Agent API.
Environment <-> Runner <-> Agent <-> Model
Ready-to-use runners¶
We implemented a standard runner, a threaded runner (for real-time interaction e.g. with OpenAI Universe) and a distributed runner for A3C variants.
Runner¶
This is the standard runner. It requires an agent and an environment for initialization:
from tensorforce.execution import Runner
runner = Runner(
    agent=agent,       # Agent object
    environment=env    # Environment object
)
A reinforcement learning agent observes states from the environment, selects actions and collects experience which is used to update its model and improve action selection. You can get information about our ready-to-use agents here.
The environment object is either the "real" environment, or a proxy which fulfills the actions selected by the agent in the real world. You can find information about environments here.
The runner is started with the Runner.run(...)
method:
runner.run(
    episodes=int,                  # number of episodes to run
    max_episode_timesteps=int,     # maximum number of timesteps per episode
    episode_finished=object,       # callback function called when an episode is finished
)
You can use the episode_finished callback for printing performance feedback:
def episode_finished(r):
    if r.episode % 10 == 0:
        print("Finished episode {ep} after {ts} timesteps".format(ep=r.episode + 1, ts=r.timestep + 1))
        print("Episode reward: {}".format(r.episode_rewards[-1]))
        print("Average of last 10 rewards: {}".format(np.mean(r.episode_rewards[-10:])))
    return True
Using the Runner¶
Here is some example code for using the runner (without preprocessing).
import logging

from tensorforce.contrib.openai_gym import OpenAIGym
from tensorforce.agents import DQNAgent
from tensorforce.execution import Runner


def main():
    gym_id = 'CartPole-v0'
    max_episodes = 10000
    max_timesteps = 1000

    logging.basicConfig(level=logging.INFO)  # make the logging.info calls below visible

    env = OpenAIGym(gym_id)

    network_spec = [
        dict(type='dense', size=32, activation='tanh'),
        dict(type='dense', size=32, activation='tanh')
    ]

    agent = DQNAgent(
        states_spec=env.states,
        actions_spec=env.actions,
        network_spec=network_spec,
        batch_size=64
    )

    runner = Runner(agent, env)

    report_episodes = 10

    def episode_finished(r):
        if r.episode % report_episodes == 0:
            logging.info("Finished episode {ep} after {ts} timesteps".format(ep=r.episode, ts=r.timestep))
            logging.info("Episode reward: {}".format(r.episode_rewards[-1]))
            logging.info("Average of last 100 rewards: {}".format(sum(r.episode_rewards[-100:]) / 100))
        return True

    print("Starting {agent} for Environment '{env}'".format(agent=agent, env=env))

    runner.run(episodes=max_episodes, max_episode_timesteps=max_timesteps, episode_finished=episode_finished)

    print("Learning finished. Total episodes: {ep}".format(ep=runner.episode))


if __name__ == '__main__':
    main()
Building your own runner¶
There are three mandatory tasks any runner implements: Obtaining an action from the agent, passing it to the environment, and passing the resulting observation to the agent.
# Get action
action = agent.act(states=state)
# Execute action in the environment
state, terminal, reward = environment.execute(actions=action)
# Pass observation to the agent
agent.observe(terminal=terminal, reward=reward)
The key idea here is the separation of concerns. External code should not need to manage batches or remember network features; that is what the agent is for. Conversely, an agent need not concern itself with how a model is implemented, and the API should facilitate easy combination of different agents and models.
If you would like to build your own runner, it is probably a good idea to take a look at the source code of our Runner class.
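For orientation, here is a minimal episode loop built only from the Agent and Environment methods documented above (a sketch of the idea, not the actual Runner implementation; num_episodes is chosen by the caller):
episode_rewards = []
for _ in range(num_episodes):
    agent.reset()
    state = environment.reset()
    terminal = False
    episode_reward = 0.0
    while not terminal:
        action = agent.act(states=state)
        state, terminal, reward = environment.execute(actions=action)
        agent.observe(terminal=terminal, reward=reward)
        episode_reward += reward
    episode_rewards.append(episode_reward)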
tensorforce package¶
Subpackages¶
tensorforce.agents package¶
Submodules¶
tensorforce.agents.agent module¶
-
class
tensorforce.agents.agent.
Agent
(states_spec, actions_spec, batched_observe=1000, scope='base_agent')¶ Bases:
object
Basic Reinforcement learning agent. An agent encapsulates execution logic of a particular reinforcement learning algorithm and defines the external interface to the environment.
The agent hence acts as an intermediate layer between environment and backend execution (value function or policy updates).
-
act
(states, deterministic=False)¶ Return action(s) for given state(s). States preprocessing and exploration are applied if configured accordingly.
Parameters: - states (any) – One state (usually a value tuple) or dict of states if multiple states are expected.
- deterministic (bool) – If true, no exploration and sampling is applied.
Returns: Scalar value of the action or dict of multiple actions the agent wants to execute.
-
close
()¶
-
static
from_spec
(spec, kwargs)¶ Creates an agent from a specification dict.
-
initialize_model
()¶ Creates the model for the respective agent based on specifications given by user. This is a separate call after constructing the agent because the agent constructor has to perform a number of checks on the specs first, sometimes adjusting them e.g. by converting to a dict.
-
last_observation
()¶
-
observe
(terminal, reward)¶ Observe experience from the environment to learn from. Optionally pre-processes rewards. Child classes should call super to get the processed reward, e.g. terminal, reward = super()…
Parameters: - terminal (bool) – boolean indicating if the episode terminated after the observation.
- reward (float) – scalar reward that resulted from executing the action.
-
static
process_action_spec
(actions_spec)¶
-
static
process_state_spec
(states_spec)¶
-
reset
()¶ Reset the agent to its initial state (e.g. on experiment start). Updates the Model’s internal episode and timestep counter, internal states, and resets preprocessors.
-
restore_model
(directory=None, file=None)¶ Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters: - directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save_model
(directory=None, append_timestep=True)¶ Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends current timestep to prevent overwriting previous checkpoint files. Turn off to be able to load model from the same given path argument as given here.
Parameters: - directory (str) – Optional checkpoint directory.
- append_timestep (bool) – Appends the current timestep to the checkpoint file if true. If this is set to True, the load path must include the checkpoint timestep suffix. For example, if stored to models/ and set to true, the exported file will be of the form models/model.ckpt-X where X is the last timestep saved. The load path must precisely match this file name. If this option is turned off, the checkpoint will always overwrite the file specified in path and the model can always be loaded under this path.
Returns: Checkpoint path where the model was saved.
-
should_stop
()¶
-
tensorforce.agents.batch_agent module¶
-
class
tensorforce.agents.batch_agent.
BatchAgent
(states_spec, actions_spec, batched_observe=1000, summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, scope='batch_agent', saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True)¶ Bases:
tensorforce.agents.learning_agent.LearningAgent
The
BatchAgent
class implements a batch memory which generally implies on-policy experience collection and updates.-
observe
(terminal, reward)¶ Adds an observation and performs an update if the necessary conditions are satisfied, i.e. if one batch of experience has been collected as defined by the batch size.
In particular, note that episode control happens outside of the agent since the agent should be agnostic to how the training data is created.
Parameters: - terminal (bool) – Whether episode is terminated or not.
- reward (float) – The scalar reward value.
-
reset_batch
()¶ Cleans up after a batch has been processed (observed). Resets all batch information to be ready for new observation data. Batch information contains:
- observed states
- internal-variables
- taken actions
- observed is-terminal signals/rewards
- total batch size
-
tensorforce.agents.constant_agent module¶
Constant agent that always returns the same action. Useful to be able to get agents that return specific action values, e.g. for sanity checks.
-
class
tensorforce.agents.constant_agent.
ConstantAgent
(states_spec, actions_spec, batched_observe=1000, scope='constant', action_values=None)¶ Bases:
tensorforce.agents.agent.Agent
Constant action agent for sanity checks. Returns a constant value at every step, useful to debug continuous problems.
-
initialize_model
()¶
-
tensorforce.agents.ddqn_agent module¶
-
class
tensorforce.agents.ddqn_agent.
DDQNAgent
(states_spec, actions_spec, batched_observe=1000, scope='ddqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Double DQN agent based on van Hasselt et al. (2015). A simple extension to DQN which improves stability.
-
initialize_model
()¶
-
tensorforce.agents.dqfd_agent module¶
-
class
tensorforce.agents.dqfd_agent.
DQFDAgent
(states_spec, actions_spec, batched_observe=1000, scope='dqfd', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep Q-learning from demonstration (DQFD) agent (Hester et al., 2017). This agent uses DQN to pre-train from demonstration data via an additional supervised loss term.
-
import_demonstrations
(demonstrations)¶ Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, which directly sets memory contents to an array and expects a different layout.
Parameters: demonstrations – List of observation dicts
-
initialize_model
()¶
-
observe
(reward, terminal)¶ Adds observations, updates via sampling from memories according to update rate. DQFD samples from the online replay memory and the demo memory with the fractions controlled by a hyperparameter called the 'expert sampling ratio'.
-
pretrain
(steps)¶ Computes pre-train updates.
Parameters: steps – Number of updates to execute.
-
set_demonstrations
(batch)¶ Set all demonstrations from batch data. Expects a dict wherein each value contains an array containing all states, actions, rewards, terminals and internals respectively.
Parameters: batch –
-
tensorforce.agents.dqn_agent module¶
-
class
tensorforce.agents.dqn_agent.
DQNAgent
(states_spec, actions_spec, batched_observe=None, scope='dqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep-Q-Network agent (DQN). The pièce de résistance of deep reinforcement learning as described by Mnih et al. (2015). Includes an option for double DQN (DDQN; van Hasselt et al., 2015).
DQN chooses from one of a number of discrete actions by taking the maximum Q-value from the value function with one output neuron per available action. DQN uses a replay memory for experience playback.
-
initialize_model
()¶
-
tensorforce.agents.dqn_nstep_agent module¶
-
class
tensorforce.agents.dqn_nstep_agent.
DQNNstepAgent
(states_spec, actions_spec, batched_observe=1000, scope='dqn-nstep', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, keep_last_timestep=True, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
N-step Deep-Q-Network agent (DQN).
-
initialize_model
()¶
-
tensorforce.agents.memory_agent module¶
-
class
tensorforce.agents.memory_agent.
MemoryAgent
(states_spec, actions_spec, batched_observe=1000, scope='memory_agent', summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, memory=None, first_update=10000, update_frequency=4, repeat_update=1)¶ Bases:
tensorforce.agents.learning_agent.LearningAgent
The
MemoryAgent
class implements a replay memory from which it samples batches according to some sampling strategy to update the value function.-
import_observations
(observations)¶ Load an iterable of observation dicts into the replay memory.
Parameters: observations – An iterable with each element containing an observation. Each observation requires keys ‘state’,’action’,’reward’,’terminal’, ‘internal’. Use an empty list [] for ‘internal’ if internal state is irrelevant.
-
observe
(terminal, reward)¶
-
tensorforce.agents.naf_agent module¶
-
class
tensorforce.agents.naf_agent.
NAFAgent
(states_spec, actions_spec, batched_observe=1000, scope='naf', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Normalized Advantage Functions (NAF) for continuous DQN: https://arxiv.org/abs/1603.00748
-
initialize_model
()¶
-
tensorforce.agents.ppo_agent module¶
-
class
tensorforce.agents.ppo_agent.
PPOAgent
(states_spec, actions_spec, batched_observe=1000, scope='ppo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=0.01, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, step_optimizer=None, optimization_steps=10)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Proximal Policy Optimization agent ([Schulman et al., 2017](https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf)).
-
initialize_model
()¶
-
tensorforce.agents.random_agent module¶
-
class
tensorforce.agents.random_agent.
RandomAgent
(states_spec, actions_spec, batched_observe=1000, scope='random')¶ Bases:
tensorforce.agents.agent.Agent
Random agent, useful as a baseline and sanity check.
-
initialize_model
()¶
-
tensorforce.agents.trpo_agent module¶
-
class
tensorforce.agents.trpo_agent.
TRPOAgent
(states_spec, actions_spec, batched_observe=1000, scope='trpo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Trust Region Policy Optimization (Schulman et al., 2015) agent.
-
initialize_model
()¶
-
tensorforce.agents.vpg_agent module¶
-
class
tensorforce.agents.vpg_agent.
VPGAgent
(states_spec, actions_spec, batched_observe=1000, scope='vpg', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Vanilla Policy Gradient agent as described by [Sutton et al. (1999)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf).
-
initialize_model
()¶
-
Module contents¶
-
class
tensorforce.agents.
Agent
(states_spec, actions_spec, batched_observe=1000, scope='base_agent')¶ Bases:
object
Basic Reinforcement learning agent. An agent encapsulates execution logic of a particular reinforcement learning algorithm and defines the external interface to the environment.
The agent hence acts as an intermediate layer between environment and backend execution (value function or policy updates).
-
act
(states, deterministic=False)¶ Return action(s) for given state(s). States preprocessing and exploration are applied if configured accordingly.
Parameters: - states (any) – One state (usually a value tuple) or dict of states if multiple states are expected.
- deterministic (bool) – If true, no exploration and sampling is applied.
Returns: Scalar value of the action or dict of multiple actions the agent wants to execute.
-
close
()¶
-
static
from_spec
(spec, kwargs)¶ Creates an agent from a specification dict.
-
initialize_model
()¶ Creates the model for the respective agent based on specifications given by user. This is a separate call after constructing the agent because the agent constructor has to perform a number of checks on the specs first, sometimes adjusting them e.g. by converting to a dict.
-
last_observation
()¶
-
observe
(terminal, reward)¶ Observe experience from the environment to learn from. Optionally pre-processes rewards. Child classes should call super to get the processed reward, e.g. terminal, reward = super()…
Parameters: - terminal (bool) – boolean indicating if the episode terminated after the observation.
- reward (float) – scalar reward that resulted from executing the action.
-
static
process_action_spec
(actions_spec)¶
-
static
process_state_spec
(states_spec)¶
-
reset
()¶ Reset the agent to its initial state (e.g. on experiment start). Updates the Model’s internal episode and timestep counter, internal states, and resets preprocessors.
-
restore_model
(directory=None, file=None)¶ Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters: - directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save_model
(directory=None, append_timestep=True)¶ Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model from the same path argument given here.
Parameters: - directory (str) – Optional checkpoint directory.
- append_timestep (bool) – Appends the current timestep to the checkpoint file if true. If this is set to True, the load path must include the checkpoint timestep suffix. For example, if stored to models/ and set to true, the exported file will be of the form models/model.ckpt-X where X is the last timestep saved. The load path must precisely match this file name. If this option is turned off, the checkpoint will always overwrite the file specified in path and the model can always be loaded under this path.
Returns: Checkpoint path where the model was saved.
-
should_stop
()¶
-
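To show how act, observe, reset, save_model and restore_model interact outside of the Runner used in the quick start, here is a hedged sketch of a manual control loop; the environment object, the timestep budget, the checkpoint directory and the assumed return order of execute() are illustrative assumptions.
# Manual interaction loop sketch; `agent` and `env` are assumed to exist, and
# the (state, terminal, reward) return order of execute() is assumed to follow
# the Runner convention.
state = env.reset()
agent.reset()

for _ in range(1000):
    action = agent.act(states=state)                 # applies preprocessing/exploration if configured
    state, terminal, reward = env.execute(actions=action)
    agent.observe(terminal=terminal, reward=reward)  # may trigger an internal update
    if terminal:
        state = env.reset()
        agent.reset()

path = agent.save_model(directory='models/')  # returns the checkpoint path
agent.restore_model(directory='models/')      # restores the latest checkpoint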
-
class
tensorforce.agents.
ConstantAgent
(states_spec, actions_spec, batched_observe=1000, scope='constant', action_values=None)¶ Bases:
tensorforce.agents.agent.Agent
Constant action agent for sanity checks. Returns a constant value at every step, useful to debug continuous problems.
-
initialize_model
()¶
-
-
class
tensorforce.agents.
RandomAgent
(states_spec, actions_spec, batched_observe=1000, scope='random')¶ Bases:
tensorforce.agents.agent.Agent
Random agent, useful as a baseline and sanity check.
-
initialize_model
()¶
-
-
class
tensorforce.agents.
LearningAgent
(states_spec, actions_spec, batched_observe=1000, scope='learning_agent', summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None)¶ Bases:
tensorforce.agents.agent.Agent
An Agent that actually learns by optimizing the parameters of its TensorFlow model.
-
class
tensorforce.agents.
BatchAgent
(states_spec, actions_spec, batched_observe=1000, summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, scope='batch_agent', saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True)¶ Bases:
tensorforce.agents.learning_agent.LearningAgent
The
BatchAgent
class implements a batch memory which generally implies on-policy experience collection and updates.-
observe
(terminal, reward)¶ Adds an observation and performs an update if the necessary conditions are satisfied, i.e. if one batch of experience has been collected as defined by the batch size.
In particular, note that episode control happens outside of the agent since the agent should be agnostic to how the training data is created.
Parameters: - terminal (bool) – Whether episode is terminated or not.
- reward (float) – The scalar reward value.
-
reset_batch
()¶ Cleans up after a batch has been processed (observed). Resets all batch information to be ready for new observation data. Batch information contains:
- observed states
- internal-variables
- taken actions
- observed is-terminal signals/rewards
- total batch size
-
-
class
tensorforce.agents.
MemoryAgent
(states_spec, actions_spec, batched_observe=1000, scope='memory_agent', summary_spec=None, network_spec=None, discount=0.99, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, memory=None, first_update=10000, update_frequency=4, repeat_update=1)¶ Bases:
tensorforce.agents.learning_agent.LearningAgent
The
MemoryAgent
class implements a replay memory from which it samples batches according to some sampling strategy to update the value function.-
import_observations
(observations)¶ Load an iterable of observation dicts into the replay memory.
Parameters: observations – An iterable with each element containing an observation. Each observation requires keys ‘state’,’action’,’reward’,’terminal’, ‘internal’. Use an empty list [] for ‘internal’ if internal state is irrelevant.
-
observe
(terminal, reward)¶
-
-
class
tensorforce.agents.
VPGAgent
(states_spec, actions_spec, batched_observe=1000, scope='vpg', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Vanilla Policy Gradient agent as described by [Sutton et al. (1999)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf).
-
initialize_model
()¶
-
-
class
tensorforce.agents.
TRPOAgent
(states_spec, actions_spec, batched_observe=1000, scope='trpo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, learning_rate=0.001, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Trust Region Policy Optimization (Schulman et al., 2015) agent.
-
initialize_model
()¶
-
-
class
tensorforce.agents.
PPOAgent
(states_spec, actions_spec, batched_observe=1000, scope='ppo', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=0.01, batch_size=1000, keep_last_timestep=True, baseline_mode=None, baseline=None, baseline_optimizer=None, gae_lambda=None, likelihood_ratio_clipping=None, step_optimizer=None, optimization_steps=10)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
Proximal Policy Optimization agent ([Schulman et al., 2017](https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf)).
-
initialize_model
()¶
-
-
class
tensorforce.agents.
DQNAgent
(states_spec, actions_spec, batched_observe=None, scope='dqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep Q-Network agent (DQN), the pièce de résistance of deep reinforcement learning as described by Mnih et al. (2015). Includes an option for double DQN (DDQN; van Hasselt et al., 2015).
DQN chooses among a number of discrete actions by taking the action with the maximum Q-value from a value function with one output neuron per available action. DQN uses a replay memory for experience replay.
-
initialize_model
()¶
-
-
class
tensorforce.agents.
DDQNAgent
(states_spec, actions_spec, batched_observe=1000, scope='ddqn', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Double DQN agent based on van Hasselt et al. (2015), a simple extension to DQN which improves stability.
-
initialize_model
()¶
-
-
class
tensorforce.agents.
DQNNstepAgent
(states_spec, actions_spec, batched_observe=1000, scope='dqn-nstep', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, keep_last_timestep=True, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.batch_agent.BatchAgent
N-step Deep-Q-Network agent (DQN).
-
initialize_model
()¶
-
-
class
tensorforce.agents.
DQFDAgent
(states_spec, actions_spec, batched_observe=1000, scope='dqfd', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, huber_loss=None, expert_margin=0.5, supervised_weight=0.1, demo_memory_capacity=10000, demo_sampling_ratio=0.2)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Deep Q-learning from demonstration (DQFD) agent (Hester et al., 2017). This agent uses DQN to pre-train from demonstration data via an additional supervised loss term.
-
import_demonstrations
(demonstrations)¶ Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, which directly sets memory contents to an array and expects a different layout.
Parameters: demonstrations – List of observation dicts
-
initialize_model
()¶
-
observe
(reward, terminal)¶ Adds observations, updates via sampling from memories according to the update rate. DQFD samples from the online replay memory and the demo memory with the fractions controlled by a hyperparameter p called the 'expert sampling ratio'.
-
pretrain
(steps)¶ Computes pre-train updates.
Parameters: steps – Number of updates to execute.
-
set_demonstrations
(batch)¶ Set all demonstrations from batch data. Expects a dict wherein each value contains an array containing all states, actions, rewards, terminals and internals respectively.
Parameters: batch –
-
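A hedged sketch of the pre-training workflow follows; env, network_spec and expert_data are assumed to exist, the hyperparameter values are illustrative, and the observation key layout mirrors import_observations as documented for MemoryAgent.
# DQFD pre-training sketch; env and network_spec are assumed to be defined as
# in the quick-start example, and `expert_data` is a hypothetical iterable.
from tensorforce.agents import DQFDAgent

agent = DQFDAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=32,
    demo_memory_capacity=10000,
    demo_sampling_ratio=0.2
)

# Each demonstration is an observation dict; 'internal' may be an empty list.
demonstrations = [
    dict(state=state, action=action, reward=reward, terminal=terminal, internal=[])
    for state, action, reward, terminal in expert_data
]
agent.import_demonstrations(demonstrations)
agent.pretrain(steps=10000)  # supervised + Q-learning updates from the demo memory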
-
class
tensorforce.agents.
NAFAgent
(states_spec, actions_spec, batched_observe=1000, scope='naf', summary_spec=None, network_spec=None, device=None, session_config=None, saver_spec=None, distributed_spec=None, optimizer=None, discount=0.99, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None, distributions_spec=None, entropy_regularization=None, batch_size=32, memory=None, first_update=10000, update_frequency=4, repeat_update=1, target_sync_frequency=10000, target_update_weight=1.0, double_q_model=False, huber_loss=None)¶ Bases:
tensorforce.agents.memory_agent.MemoryAgent
Normalized Advantage Functions (NAF) for continuous DQN: https://arxiv.org/abs/1603.00748
-
initialize_model
()¶
-
tensorforce.contrib package¶
Submodules¶
tensorforce.contrib.ale module¶
Arcade Learning Environment (ALE). https://github.com/mgbellemare/Arcade-Learning-Environment
-
class
tensorforce.contrib.ale.
ALE
(rom, frame_skip=1, repeat_action_probability=0.0, loss_of_life_termination=False, loss_of_life_reward=0, display_screen=False, seed=<mtrand.RandomState object>)¶ Bases:
tensorforce.environments.environment.Environment
-
action_names
¶
-
actions
¶
-
close
()¶
-
current_state
¶
-
execute
(actions)¶
-
is_terminal
¶
-
reset
()¶
-
states
¶
-
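A short usage sketch follows; the ROM path is a placeholder and assumes a locally available Atari ROM.
# ALE environment sketch; '/path/to/breakout.bin' is a placeholder assumption.
from tensorforce.contrib.ale import ALE

env = ALE(rom='/path/to/breakout.bin', frame_skip=4, display_screen=False)
state = env.reset()
print(env.states, env.actions)  # state/action specifications exposed as properties
env.close()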
tensorforce.contrib.deepmind_lab module¶
-
class
tensorforce.contrib.deepmind_lab.
DeepMindLab
(level_id, repeat_action=1, state_attribute='RGB_INTERLACED', settings={'width': '320', 'appendCommand': '', 'fps': '60', 'height': '240'})¶ Bases:
tensorforce.environments.environment.Environment
DeepMind Lab Integration: https://arxiv.org/abs/1612.03801 https://github.com/deepmind/lab
Since DeepMind Lab is only available as source code, a manual install via bazel is required. Further, due to the way bazel handles external dependencies, cloning TensorForce into lab is the most convenient way to run it using the bazel BUILD file we provide. To use lab, first download and install it according to the instructions at https://github.com/deepmind/lab/blob/master/docs/build.md:
git clone https://github.com/deepmind/lab.git
Add to the lab main BUILD file:
Clone TensorForce into the lab directory, then run the TensorForce bazel runner.
Note that using any specific configuration file currently requires changing the Tensorforce BUILD file to adjust environment parameters.
bazel run //tensorforce:lab_runner
Please note that we have not tried to reproduce any lab results yet, and these instructions just explain connectivity in case someone wants to get started there.
-
actions
¶
-
close
()¶ Closes the environment and releases the underlying Quake III Arena instance. No other method calls possible afterwards.
-
execute
(actions)¶ Pass action to the lab environment; returns the reward, the next state, the terminal flag and additional info.
Parameters: action – action to execute as numpy array, should have dtype np.intc and should adhere to the specification given in DeepMindLabEnvironment.action_spec(level_id) Returns: dict containing the next state, the reward, and a boolean indicating if the next state is a terminal state
-
fps
¶ An advisory metric that correlates discrete environment steps (“frames”) with real (wallclock) time: the number of frames per (real) second.
-
num_steps
¶ Number of frames since the last reset() call.
-
reset
()¶ Resets the environment to its initialization state. This method needs to be called to start a new episode after the last episode ended.
Returns: initial state
-
states
¶
-
tensorforce.contrib.maze_explorer module¶
-
class
tensorforce.contrib.maze_explorer.
MazeExplorer
(mode_id=0, visible=True)¶ Bases:
tensorforce.environments.environment.Environment
MazeExplorer Integration: https://github.com/mryellow/maze_explorer.
-
actions
¶
-
close
()¶
-
execute
(actions)¶
-
reset
()¶
-
states
¶
-
tensorforce.contrib.openai_gym module¶
OpenAI Gym Integration: https://gym.openai.com/.
tensorforce.contrib.openai_universe module¶
-
class
tensorforce.contrib.openai_universe.
OpenAIUniverse
(env_id)¶ Bases:
tensorforce.environments.environment.Environment
OpenAI Universe Integration: https://universe.openai.com/. Contains OpenAI Gym: https://gym.openai.com/.
-
actions
¶
-
close
()¶
-
configure
(*args, **kwargs)¶
-
execute
(actions)¶
-
render
(*args, **kwargs)¶
-
reset
()¶
-
states
¶
-
tensorforce.contrib.remote_environment module¶
-
class
tensorforce.contrib.remote_environment.
MsgPackNumpyProtocol
(max_msg_len=8192)¶ Bases:
object
A simple protocol to communicate over tcp sockets, which can be used by RemoteEnvironment implementations. The protocol is based on msgpack-numpy encoding and decoding.
Each message has a simple 8-byte header, which encodes the length of the subsequent msgpack-numpy encoded byte-string. All messages received need to have the ‘status’ field set to ‘ok’. If ‘status’ is set to ‘error’, the field ‘message’ should be populated with some error information.
Examples: client sends: “[8-byte header]msgpack-encoded({“cmd”: “seed”, “value”: 200})” server responds: “[8-byte header]msgpack-encoded({“status”: “ok”, “value”: 200})”
client sends: “[8-byte header]msgpack-encoded({“cmd”: “reset”})” server responds: “[8-byte header]msgpack-encoded({“status”: “ok”})”
client sends: “[8-byte header]msgpack-encoded({“cmd”: “step”, “action”: 5})” server responds: “[8-byte header]msgpack-encoded({“status”: “ok”, “obs_dict”: {… some observations}, “reward”: -10.0, “is_terminal”: False})”
-
recv
(socket_)¶ Receives a message as msgpack-numpy encoded byte-string from the given socket object. Blocks until something was received.
Parameters: socket – The python socket object to use. Returns: The decoded (as dict) message received.
-
send
(message, socket_)¶ Sends a message (dict) to the socket. Message is encoded via msgpack-numpy.
Parameters: - message – The message dict (e.g. {“cmd”: “reset”})
- socket – The python socket object to use.
-
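A client-side sketch of this protocol is shown below; the host, port and command payload follow the examples above, while the raw socket handling is an assumption.
# Client-side protocol sketch; host/port values are assumptions.
import socket
from tensorforce.contrib.remote_environment import MsgPackNumpyProtocol

protocol = MsgPackNumpyProtocol()
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('localhost', 6025))

protocol.send({"cmd": "reset"}, sock)  # 8-byte length header + msgpack-numpy body
response = protocol.recv(sock)         # blocks until a full message was received
assert response.get("status") == "ok"
sock.close()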
-
class
tensorforce.contrib.remote_environment.
RemoteEnvironment
(host='localhost', port=6025)¶ Bases:
tensorforce.environments.environment.Environment
-
close
()¶ Same as disconnect method.
-
connect
()¶ Starts the server tcp connection on the given host:port.
-
current_state
¶
-
disconnect
()¶ Ends our server tcp connection.
-
tensorforce.contrib.state_settable_environment module¶
-
class
tensorforce.contrib.state_settable_environment.
StateSettableEnvironment
¶ Bases:
tensorforce.environments.environment.Environment
An Environment that implements the set_state method to set the current state to some new state using setter instructions.
-
set_state
(**kwargs)¶ Sets the current state of the environment manually to some other state and returns a new observation.
Parameters: **kwargs – The set instruction(s) to be executed by the environment. A single set instruction usually sets a single property of the state/observation vector to some new value.
Returns: The observation dictionary of the Environment after(!) setting it to the new state.
-
tensorforce.contrib.unreal_engine module¶
-
class
tensorforce.contrib.unreal_engine.
UE4Environment
(host='localhost', port=6025, connect=True, discretize_actions=False, delta_time=0, num_ticks=4)¶ Bases:
tensorforce.contrib.remote_environment.RemoteEnvironment
,tensorforce.contrib.state_settable_environment.StateSettableEnvironment
A special RemoteEnvironment for UE4 game connections. Communicates with the remote to receive information on the definitions of action- and observation spaces. Sends UE4 Action- and Axis-mappings as RL-actions and receives observations back defined by ducandu plugin Observer objects placed in the Game (these could be camera pixels or other observations, e.g. a x/y/z position of some game actor).
-
actions
()¶
-
connect
()¶
-
discretize_action_space_desc
()¶ Creates a list of discrete action(-combinations) in case we want to learn with a discrete set of actions, but only have action-combinations (maybe even continuous) available from the env. E.g. the UE4 game has the following action/axis-mappings:
{
    'Fire': {'type': 'action', 'keys': ('SpaceBar',)},
    'MoveRight': {'type': 'axis', 'keys': (('Right', 1.0), ('Left', -1.0), ('A', -1.0), ('D', 1.0))},
}
-> this method will discretize them into the following 6 discrete actions:
[
    [('Right', 0.0), ('SpaceBar', False)],
    [('Right', 0.0), ('SpaceBar', True)],
    [('Right', -1.0), ('SpaceBar', False)],
    [('Right', -1.0), ('SpaceBar', True)],
    [('Right', 1.0), ('SpaceBar', False)],
    [('Right', 1.0), ('SpaceBar', True)],
]
-
execute
(actions)¶ Executes a single step in the UE4 game. This step may comprise one or more actual game ticks, all of which repeat the same given action- and axis-inputs (or action number in the case of discretized actions). UE4 distinguishes between action-mappings, which are boolean actions (e.g. jump or don't jump), and axis-mappings, which are continuous actions like MoveForward with values between -1.0 (run backwards) and 1.0 (run forwards), where 0.0 means stop.
-
static
extract_observation
(message)¶
-
reset
()¶ Same as step (no kwargs to pass), but needs to block and return the observation dict.
- stores the received observation in self.last_observation
-
seed
(seed=None)¶
-
set_state
(setters, **kwargs)¶
-
states
()¶
-
translate_abstract_actions_to_keys
(abstract)¶ Translates a list of tuples ([pretty mapping], [value]) to a list of tuples ([some key], [translated value]). Each single item in abstract will undergo the following translation:
Example1: we want: “MoveRight”: 5.0 possible keys for the action are: (“Right”, 1.0), (“Left”, -1.0) result: “Right”: 5.0 * 1.0 = 5.0
Example2: we want: “MoveRight”: -0.5 possible keys for the action are: (“Left”, -1.0), (“Right”, 1.0) result: “Left”: -0.5 * -1.0 = 0.5 (same as “Right”: -0.5)
-
Module contents¶
tensorforce.core package¶
Subpackages¶
-
class
tensorforce.core.baselines.aggregated_baseline.
AggregatedBaseline
(baselines, scope='aggregated-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.baseline.Baseline
Baseline which aggregates per-state baselines.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_predict
(states, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.baselines.baseline.
Baseline
(scope='baseline', summary_labels=None)¶ Bases:
object
Base class for baseline value functions.
-
static
from_spec
(spec, kwargs=None)¶ Creates a baseline from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the baseline
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the baseline.
Returns: List of variables
-
tf_loss
(states, reward, update)¶ Creates the TensorFlow operations for calculating the L2 loss between predicted state values and actual rewards.
Parameters: - states – State tensors
- reward – Reward tensor
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor
-
tf_predict
(states, update)¶ Creates the TensorFlow operations for predicting the value function of given states.
Parameters: - states – State tensors
- update – Boolean tensor indicating whether this call happens during an update.
Returns: State value tensor
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the baseline regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.baselines.cnn_baseline.
CNNBaseline
(conv_sizes, dense_sizes, scope='cnn-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.network_baseline.NetworkBaseline
CNN baseline (single-state) consisting of convolutional layers followed by dense layers.
-
class
tensorforce.core.baselines.mlp_baseline.
MLPBaseline
(sizes, scope='mlp-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.network_baseline.NetworkBaseline
Multi-layer perceptron baseline (single-state) consisting of dense layers.
-
class
tensorforce.core.baselines.network_baseline.
NetworkBaseline
(network_spec, scope='network-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.baseline.Baseline
Baseline based on a TensorForce network, used when parameters are shared between the value function and the baseline.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_predict
(states, update)¶
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the baseline regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.baselines.
Baseline
(scope='baseline', summary_labels=None)¶ Bases:
object
Base class for baseline value functions.
-
static
from_spec
(spec, kwargs=None)¶ Creates a baseline from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the baseline
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the baseline.
Returns: List of variables
-
tf_loss
(states, reward, update)¶ Creates the TensorFlow operations for calculating the L2 loss between predicted state values and actual rewards.
Parameters: - states – State tensors
- reward – Reward tensor
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor
-
tf_predict
(states, update)¶ Creates the TensorFlow operations for predicting the value function of given states.
Parameters: - states – State tensors
- update – Boolean tensor indicating whether this call happens during an update.
Returns: State value tensor
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the baseline regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.baselines.
AggregatedBaseline
(baselines, scope='aggregated-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.baseline.Baseline
Baseline which aggregates per-state baselines.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_predict
(states, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.baselines.
NetworkBaseline
(network_spec, scope='network-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.baseline.Baseline
Baseline based on a TensorForce network, used when parameters are shared between the value function and the baseline.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_predict
(states, update)¶
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the baseline regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.baselines.
MLPBaseline
(sizes, scope='mlp-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.network_baseline.NetworkBaseline
Multi-layer perceptron baseline (single-state) consisting of dense layers.
-
class
tensorforce.core.baselines.
CNNBaseline
(conv_sizes, dense_sizes, scope='cnn-baseline', summary_labels=())¶ Bases:
tensorforce.core.baselines.network_baseline.NetworkBaseline
CNN baseline (single-state) consisting of convolutional layers followed by dense layers.
-
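As a hedged sketch of how these baselines are typically plugged into a policy-gradient agent, the specification below uses 'type'/'sizes' keys mirroring the constructor arguments above; the exact spec format, the 'states' baseline mode and all hyperparameter values are assumptions.
# Baseline configuration sketch for a policy-gradient agent; env and
# network_spec are assumed to be defined as in the quick-start example.
from tensorforce.agents import VPGAgent

agent = VPGAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=1000,
    baseline_mode='states',                     # baseline receives raw states (assumed value)
    baseline=dict(type='mlp', sizes=[32, 32]),  # corresponds to MLPBaseline(sizes=[32, 32])
    baseline_optimizer=dict(type='adam', learning_rate=1e-3),
    gae_lambda=0.97
)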
class
tensorforce.core.distributions.bernoulli.
Bernoulli
(shape, probability=0.5, scope='bernoulli', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Bernoulli distribution for binary actions.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.beta.
Beta
(shape, min_value, max_value, alpha=0.0, beta=0.0, scope='beta', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Beta distribution, for bounded continuous actions
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.categorical.
Categorical
(shape, num_actions, probabilities=None, scope='categorical', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Categorical distribution, for discrete actions
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.distribution.
Distribution
(scope='distribution', summary_labels=None)¶ Bases:
object
Base class for policy distributions.
-
static
from_spec
(spec, kwargs=None)¶ Creates a distribution from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the distribution.
Returns: List of summaries.
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the distribution.
Returns: List of variables.
-
tf_entropy
(distr_params)¶ Creates the TensorFlow operations for calculating the entropy of a distribution.
Parameters: distr_params – Tuple of distribution parameter tensors. Returns: Entropy tensor.
-
tf_kl_divergence
(distr_params1, distr_params2)¶ Creates the TensorFlow operations for calculating the KL divergence between two distributions.
Parameters: - distr_params1 – Tuple of parameter tensors for first distribution.
- distr_params2 – Tuple of parameter tensors for second distribution.
Returns: KL divergence tensor.
-
tf_log_probability
(distr_params, action)¶ Creates the TensorFlow operations for calculating the log probability of an action for a distribution.
Parameters: - distr_params – Tuple of distribution parameter tensors.
- action – Action tensor.
Returns: Log probability tensor.
-
tf_parameterize
(x)¶ Creates the TensorFlow operations for parameterizing a distribution conditioned on the given input.
Parameters: x – Input tensor which the distribution is conditioned on. Returns: Tuple of distribution parameter tensors.
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the distribution regularization loss.
Returns: Regularization loss tensor.
-
tf_sample
(distr_params, deterministic)¶ Creates the TensorFlow operations for sampling an action based on a distribution.
Parameters: - distr_params – Tuple of distribution parameter tensors.
- deterministic – Boolean input tensor indicating whether the maximum likelihood action should be returned.
Returns: Sampled action tensor.
-
-
class
tensorforce.core.distributions.gaussian.
Gaussian
(shape, mean=0.0, log_stddev=0.0, scope='gaussian', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Gaussian distribution, for unbounded continuous actions.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.
Distribution
(scope='distribution', summary_labels=None)¶ Bases:
object
Base class for policy distributions.
-
static
from_spec
(spec, kwargs=None)¶ Creates a distribution from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the distribution.
Returns: List of summaries.
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the distribution.
Returns: List of variables.
-
tf_entropy
(distr_params)¶ Creates the TensorFlow operations for calculating the entropy of a distribution.
Parameters: distr_params – Tuple of distribution parameter tensors. Returns: Entropy tensor.
-
tf_kl_divergence
(distr_params1, distr_params2)¶ Creates the TensorFlow operations for calculating the KL divergence between two distributions.
Parameters: - distr_params1 – Tuple of parameter tensors for first distribution.
- distr_params2 – Tuple of parameter tensors for second distribution.
Returns: KL divergence tensor.
-
tf_log_probability
(distr_params, action)¶ Creates the TensorFlow operations for calculating the log probability of an action for a distribution.
Parameters: - distr_params – Tuple of distribution parameter tensors.
- action – Action tensor.
Returns: Log probability tensor.
-
tf_parameterize
(x)¶ Creates the TensorFlow operations for parameterizing a distribution conditioned on the given input.
Parameters: x – Input tensor which the distribution is conditioned on. Returns: Tuple of distribution parameter tensors.
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the distribution regularization loss.
Returns: Regularization loss tensor.
-
tf_sample
(distr_params, deterministic)¶ Creates the TensorFlow operations for sampling an action based on a distribution.
Parameters: - distr_params – Tuple of distribution parameter tensors.
- deterministic – Boolean input tensor indicating whether the maximum likelihood action should be returned.
Returns: Sampled action tensor.
-
-
class
tensorforce.core.distributions.
Bernoulli
(shape, probability=0.5, scope='bernoulli', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Bernoulli distribution for binary actions.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.
Categorical
(shape, num_actions, probabilities=None, scope='categorical', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Categorical distribution, for discrete actions
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.
Gaussian
(shape, mean=0.0, log_stddev=0.0, scope='gaussian', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Gaussian distribution, for unbounded continuous actions.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
state_action_value
(distr_params, action)¶
-
state_value
(distr_params)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
-
class
tensorforce.core.distributions.
Beta
(shape, min_value, max_value, alpha=0.0, beta=0.0, scope='beta', summary_labels=())¶ Bases:
tensorforce.core.distributions.distribution.Distribution
Beta distribution, for bounded continuous actions
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_entropy
(distr_params)¶
-
tf_kl_divergence
(distr_params1, distr_params2)¶
-
tf_log_probability
(distr_params, action)¶
-
tf_parameterize
(x)¶
-
tf_regularization_loss
()¶
-
tf_sample
(distr_params, deterministic)¶
-
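The distribution used for each action normally follows from the action's type in actions_spec; the sketch below illustrates this mapping, which is an assumption about the default behaviour (it can be overridden via distributions_spec).
# Action specification sketch; the default type-to-distribution mapping noted
# in the comments is an assumption.
actions_spec = dict(
    fire=dict(type='bool'),                                   # -> Bernoulli
    steer=dict(type='float', min_value=-1.0, max_value=1.0),  # bounded float -> Beta
    throttle=dict(type='float'),                              # unbounded float -> Gaussian
    gear=dict(type='int', num_actions=5)                      # -> Categorical
)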
-
class
tensorforce.core.explorations.constant.
Constant
(constant=0.0, scope='constant', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Explore via adding a constant term.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.epsilon_anneal.
EpsilonAnneal
(initial_epsilon=1.0, final_epsilon=0.1, timesteps=10000, start_timestep=0, scope='epsilon_anneal', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Annealing epsilon parameter based on ratio of current timestep to total timesteps.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.epsilon_decay.
EpsilonDecay
(initial_epsilon=1.0, final_epsilon=0.1, timesteps=10000, start_timestep=0, half_lives=10, scope='epsilon_anneal', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Exponentially decaying epsilon parameter based on ratio of difference between current and final epsilon to total timesteps.
-
tf_explore
(episode=0, timestep=0, action_shape=(1, ))¶
-
-
class
tensorforce.core.explorations.exploration.
Exploration
(scope='exploration', summary_labels=None)¶ Bases:
object
Abstract exploration object.
-
static
from_spec
(spec)¶ Creates an exploration object from a specification dict.
-
get_variables
()¶ Returns exploration variables.
Returns: List of variables.
-
tf_explore
(episode, timestep, action_shape)¶ Creates exploration value, e.g. compute an epsilon for epsilon-greedy or sample normal noise.
-
-
class
tensorforce.core.explorations.linear_decay.
LinearDecay
(scope='exploration', summary_labels=None)¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Linear decay based on episode number.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.ornstein_uhlenbeck_process.
OrnsteinUhlenbeckProcess
(sigma=0.3, mu=0.0, theta=0.15, scope='ornstein_uhlenbeck', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Explores via an Ornstein-Uhlenbeck process.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.
Exploration
(scope='exploration', summary_labels=None)¶ Bases:
object
Abstract exploration object.
-
static
from_spec
(spec)¶ Creates an exploration object from a specification dict.
-
get_variables
()¶ Returns exploration variables.
Returns: List of variables.
-
tf_explore
(episode, timestep, action_shape)¶ Creates exploration value, e.g. compute an epsilon for epsilon-greedy or sample normal noise.
-
-
class
tensorforce.core.explorations.
Constant
(constant=0.0, scope='constant', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Explore via adding a constant term.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.
LinearDecay
(scope='exploration', summary_labels=None)¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Linear decay based on episode number.
-
tf_explore
(episode, timestep, action_shape)¶
-
-
class
tensorforce.core.explorations.
EpsilonDecay
(initial_epsilon=1.0, final_epsilon=0.1, timesteps=10000, start_timestep=0, half_lives=10, scope='epsilon_anneal', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Exponentially decaying epsilon parameter based on ratio of difference between current and final epsilon to total timesteps.
-
tf_explore
(episode=0, timestep=0, action_shape=(1, ))¶
-
-
class
tensorforce.core.explorations.
OrnsteinUhlenbeckProcess
(sigma=0.3, mu=0.0, theta=0.15, scope='ornstein_uhlenbeck', summary_labels=())¶ Bases:
tensorforce.core.explorations.exploration.Exploration
Explores via an Ornstein-Uhlenbeck process.
-
tf_explore
(episode, timestep, action_shape)¶
-
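Explorations are usually passed to an agent via explorations_spec as specification dicts; the sketch below mirrors the constructor arguments above, but the exact spec format (and any per-action nesting) is an assumption.
# Exploration specification sketches; all values are illustrative assumptions.
epsilon_exploration = dict(
    type='epsilon_anneal',
    initial_epsilon=1.0,
    final_epsilon=0.1,
    timesteps=10000
)

# For continuous actions, Ornstein-Uhlenbeck noise could be specified instead:
ou_exploration = dict(type='ornstein_uhlenbeck', sigma=0.3, mu=0.0, theta=0.15)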
-
class
tensorforce.core.memories.memory.
Memory
(states_spec, actions_spec)¶ Bases:
object
Abstract memory class.
-
add_observation
(states, internals, actions, terminal, reward)¶ Inserts a single experience to the memory.
Parameters: - states –
- internals –
- actions –
- terminal –
- reward –
Returns:
-
static
from_spec
(spec, kwargs=None)¶ Creates a memory from a specification dict.
-
get_batch
(batch_size, next_states=False)¶ Samples a batch from the memory.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, internal states, actions, terminals, rewards (and next states)
-
set_memory
(states, internals, actions, terminals, rewards)¶ Deletes memory content and sets content to provided observations.
Parameters: - states –
- internals –
- actions –
- terminals –
- rewards –
-
update_batch
(loss_per_instance)¶ Updates loss values for sampling strategies based on loss functions.
Parameters: loss_per_instance –
-
-
class
tensorforce.core.memories.naive_prioritized_replay.
NaivePrioritizedReplay
(states_spec, actions_spec, capacity, prioritization_weight=1.0)¶ Bases:
tensorforce.core.memories.memory.Memory
Prioritised replay sampling based on loss per experience.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False)¶ Samples a batch of the specified size according to priority.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
update_batch
(loss_per_instance)¶ Computes priorities according to loss.
Parameters: loss_per_instance –
-
-
class
tensorforce.core.memories.prioritized_replay.
PrioritizedReplay
(states_spec, actions_spec, capacity, prioritization_weight=1.0, prioritization_constant=0.0)¶ Bases:
tensorforce.core.memories.memory.Memory
Prioritised replay sampling based on loss per experience.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False)¶ Samples a batch of the specified size according to priority.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
update_batch
(loss_per_instance)¶ Computes priorities according to loss.
Parameters: loss_per_instance –
-
-
class
tensorforce.core.memories.prioritized_replay.
SumTree
(capacity)¶ Bases:
object
Sum tree data structure where data is stored in leaves and each node on the tree contains a sum of the children.
Items and priorities are stored in leaf nodes, while each internal node stores the sum of the priorities of all its descendants. Internally, a single list stores the internal nodes followed by the leaf nodes.
See:
- Usage:
tree = SumTree(100)
tree.push('item1', priority=0.5)
tree.push('item2', priority=0.6)
item, priority = tree[0]
batch = tree.sample_minibatch(2)
-
move
(external_index, new_priority)¶ Change the priority of a leaf node
-
put
(item, priority=None)¶ Stores a transition in replay memory.
If the memory is full, the oldest entry is replaced.
-
sample_minibatch
(batch_size)¶ Sample minibatch of size batch_size.
-
class
tensorforce.core.memories.replay.
Replay
(states_spec, actions_spec, capacity, random_sampling=True)¶ Bases:
tensorforce.core.memories.memory.Memory
Replay memory to store observations and sample mini batches for training from.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False, keep_terminal_states=True)¶ Samples a batch of the specified size by selecting a random start/end point and returning the contained sequence or random indices depending on the field ‘random_sampling’.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
- keep_terminal_states – A boolean flag indicating whether to keep terminal states when
next_states
are requested. In this case, the next state is not from the same episode and should probably not be used to learn a model of the environment. However, if the environment produces sparse rewards (i.e. only one reward at the end of the episode) we cannot exclude terminal states, as otherwise there would never be a reward to learn from.
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
set_memory
(states, internals, actions, terminal, reward)¶ Convenience function to set whole batches as memory content to bypass calling the insert function for every single experience.
Parameters: - states –
- internals –
- actions –
- terminal –
- reward –
Returns:
-
update_batch
(loss_per_instance)¶
-
-
class
tensorforce.core.memories.
Memory
(states_spec, actions_spec)¶ Bases:
object
Abstract memory class.
-
add_observation
(states, internals, actions, terminal, reward)¶ Inserts a single experience to the memory.
Parameters: - states –
- internals –
- actions –
- terminal –
- reward –
Returns:
-
static
from_spec
(spec, kwargs=None)¶ Creates a memory from a specification dict.
-
get_batch
(batch_size, next_states=False)¶ Samples a batch from the memory.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, internal states, actions, terminals, rewards (and next states)
-
set_memory
(states, internals, actions, terminals, rewards)¶ Deletes memory content and sets content to provided observations.
Parameters: - states –
- internals –
- actions –
- terminals –
- rewards –
-
update_batch
(loss_per_instance)¶ Updates loss values for sampling strategies based on loss functions.
Parameters: loss_per_instance –
-
-
class
tensorforce.core.memories.
Replay
(states_spec, actions_spec, capacity, random_sampling=True)¶ Bases:
tensorforce.core.memories.memory.Memory
Replay memory to store observations and sample mini batches for training from.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False, keep_terminal_states=True)¶ Samples a batch of the specified size by selecting a random start/end point and returning the contained sequence or random indices depending on the field ‘random_sampling’.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
- keep_terminal_states – A boolean flag indicating whether to keep terminal states when
next_states
are requested. In this case, the next state is not from the same episode and should probably not be used to learn a model of the environment. However, if the environment produces sparse rewards (i.e. only one reward at the end of the episode) we cannot exclude terminal states, as otherwise there would never be a reward to learn from.
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
set_memory
(states, internals, actions, terminal, reward)¶ Convenience function to set whole batches as memory content to bypass calling the insert function for every single experience.
Parameters: - states –
- internals –
- actions –
- terminal –
- reward –
Returns:
-
update_batch
(loss_per_instance)¶
-
-
class
tensorforce.core.memories.
PrioritizedReplay
(states_spec, actions_spec, capacity, prioritization_weight=1.0, prioritization_constant=0.0)¶ Bases:
tensorforce.core.memories.memory.Memory
Prioritised replay sampling based on loss per experience.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False)¶ Samples a batch of the specified size according to priority.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
update_batch
(loss_per_instance)¶ Computes priorities according to loss.
Parameters: loss_per_instance –
-
-
class
tensorforce.core.memories.
NaivePrioritizedReplay
(states_spec, actions_spec, capacity, prioritization_weight=1.0)¶ Bases:
tensorforce.core.memories.memory.Memory
Prioritised replay sampling based on loss per experience.
-
add_observation
(states, internals, actions, terminal, reward)¶
-
get_batch
(batch_size, next_states=False)¶ Samples a batch of the specified size according to priority.
Parameters: - batch_size – The batch size
- next_states – A boolean flag indicating whether ‘next_states’ values should be included
Returns: A dict containing states, actions, rewards, terminals, internal states (and next states)
-
update_batch
(loss_per_instance)¶ Computes priorities according to loss.
Parameters: loss_per_instance –
-
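Memory classes are normally selected through the memory specification dict of a MemoryAgent subclass; the sketch below mirrors the constructor arguments above, while the 'type' strings, capacity and the spec-dict format itself are assumptions.
# Memory specification sketches for a MemoryAgent subclass such as DQNAgent;
# env and network_spec are assumed to be defined as in the quick-start example.
from tensorforce.agents import DQNAgent

replay_memory = dict(type='replay', capacity=100000, random_sampling=True)
prioritized_memory = dict(type='prioritized_replay', capacity=100000, prioritization_weight=1.0)

agent = DQNAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    memory=replay_memory
)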
Collection of custom layer implementations. We prefer not to use contrib-layers to retain full control over shapes and internal states.
-
class
tensorforce.core.networks.layer.
Conv1d
(size, window=3, stride=1, padding='SAME', bias=True, activation='relu', l2_regularization=0.0, l1_regularization=0.0, scope='conv1d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
1-dimensional convolutional layer.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Conv2d
(size, window=3, stride=1, padding='SAME', bias=True, activation='relu', l2_regularization=0.0, l1_regularization=0.0, scope='conv2d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
2-dimensional convolutional layer.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Dense
(size=None, bias=True, activation='tanh', l2_regularization=0.0, l1_regularization=0.0, skip=False, scope='dense', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dense layer, i.e. linear fully connected layer with subsequent non-linearity.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Dropout
(rate=0.0, scope='dropout', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dropout layer. If using dropout, add this layer after inputs and after dense layers. For LSTM, dropout is handled independently as an argument. Not available for Conv2d yet.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.layer.
Dueling
(size, bias=False, activation='none', l2_regularization=0.0, l1_regularization=0.0, output=None, scope='dueling', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dueling layer, i.e. separate pipelines for the expectation (value) and advantage streams, to help with stability.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Embedding
(indices, size, l2_regularization=0.0, l1_regularization=0.0, scope='embedding', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Embedding layer.
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Flatten
(scope='flatten', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Flatten layer reshaping the input.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.layer.
InternalLstm
(size, dropout=None, scope='internal_lstm', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Long short-term memory layer for internal state management.
-
internals_init
()¶
-
internals_input
()¶
-
tf_apply
(x, update, state)¶
-
-
class
tensorforce.core.networks.layer.
Layer
(num_internals=0, scope='layer', summary_labels=None)¶ Bases:
object
Base class for network layers.
-
static
from_spec
(spec, kwargs=None)¶ Creates a layer from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the layer.
Returns: List of summaries.
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the layer.
Returns: List of variables.
-
internals_init
()¶ Returns the TensorFlow tensors for internal state initializations.
Returns: List of internal state initialization tensors.
-
internals_input
()¶ Returns the TensorFlow placeholders for internal state inputs.
Returns: List of internal state input placeholders.
-
tf_apply
(x, update)¶ Creates the TensorFlow operations for applying the layer to the given input.
Parameters: - x – Layer input tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Layer output tensor.
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the layer regularization loss.
Returns: Regularization loss tensor.
-
tf_tensors
(named_tensors)¶ Attaches the named_tensors dictionary to the layer for examination and update.
Parameters: named_tensors – Dictionary of named tensors to be used as inputs or recorded from outputs. Returns: NA
-
-
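A single layer can also be created directly from such a specification dict via the static from_spec method documented above; a minimal sketch, assuming the 'dense' type string resolves to the Dense layer of this module:
# Minimal sketch of Layer.from_spec (assuming 'dense' resolves to the Dense
# layer documented in this module).
from tensorforce.core.networks import Layer

layer = Layer.from_spec(spec=dict(type='dense', size=64, activation='relu'))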
class
tensorforce.core.networks.layer.
Linear
(size, weights=None, bias=True, l2_regularization=0.0, l1_regularization=0.0, scope='linear', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Linear fully-connected layer.
-
tf_apply
(x, update=False)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.layer.
Lstm
(size, dropout=None, scope='lstm', summary_labels=(), return_final_state=True)¶ Bases:
tensorforce.core.networks.layer.Layer
-
tf_apply
(x, update, sequence_length=None)¶
-
-
class
tensorforce.core.networks.layer.
Nonlinearity
(name='relu', scope='nonlinearity', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Non-linearity layer applying a non-linear transformation.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.layer.
Pool2d
(pooling_type='max', window=2, stride=2, padding='SAME', scope='pool2d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
2-dimensional pooling layer.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.network.
LayerBasedNetwork
(scope='layerbased-network', summary_labels=())¶ Bases:
tensorforce.core.networks.network.Network
Base class for networks using TensorForce layers.
-
add_layer
(layer)¶
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
internals_init
()¶
-
internals_input
()¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.network.
LayeredNetwork
(layers_spec, scope='layered-network', summary_labels=())¶ Bases:
tensorforce.core.networks.network.LayerBasedNetwork
Network consisting of a sequence of layers, which can be created from a specification dict.
-
static
from_json
(filename)¶ Creates a layered_network_builder from a JSON file.
Parameters: filename – Path to configuration Returns: A layered_network_builder function with layers generated from the JSON
-
tf_apply
(x, internals, update, return_internals=False)¶
-
-
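A small usage sketch for from_json; the file name here is illustrative, and the JSON file is assumed to contain a list of layer specification dicts:
# Sketch: the JSON file (path is illustrative) contains a list of layer
# specification dicts, e.g. [{"type": "dense", "size": 32}, {"type": "dense", "size": 32}].
from tensorforce.core.networks import LayeredNetwork

network_builder = LayeredNetwork.from_json('configs/mlp2_network.json')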
class
tensorforce.core.networks.network.
Network
(scope='network', summary_labels=None)¶ Bases:
object
Base class for neural networks.
-
static
from_spec
(spec, kwargs=None)¶ Creates a network from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the network.
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the network.
Returns: List of variables
-
internals_init
()¶ Returns the TensorFlow tensors for internal state initializations.
Returns: List of internal state initialization tensors
-
internals_input
()¶ Returns the TensorFlow placeholders for internal state inputs.
Returns: List of internal state input placeholders
-
tf_apply
(x, internals, update, return_internals=False)¶ Creates the TensorFlow operations for applying the network to the given input.
Parameters: - x – Network input tensor or dict of input tensors.
- internals – List of prior internal state tensors
- update – Boolean tensor indicating whether this call happens during an update.
- return_internals – If true, also returns posterior internal state tensors
Returns: Network output tensor, plus optionally list of posterior internal state tensors
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the network regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.networks.
Layer
(num_internals=0, scope='layer', summary_labels=None)¶ Bases:
object
Base class for network layers.
-
static
from_spec
(spec, kwargs=None)¶ Creates a layer from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the layer.
Returns: List of summaries.
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the layer.
Returns: List of variables.
-
internals_init
()¶ Returns the TensorFlow tensors for internal state initializations.
Returns: List of internal state initialization tensors.
-
internals_input
()¶ Returns the TensorFlow placeholders for internal state inputs.
Returns: List of internal state input placeholders.
-
tf_apply
(x, update)¶ Creates the TensorFlow operations for applying the layer to the given input.
Parameters: - x – Layer input tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Layer output tensor.
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the layer regularization loss.
Returns: Regularization loss tensor.
-
tf_tensors
(named_tensors)¶ Attaches the named_tensors dictionary to the layer for examination and update.
Parameters: named_tensors – Dictionary of named tensors to be used as inputs or recorded from outputs. Returns: NA
-
-
class
tensorforce.core.networks.
Nonlinearity
(name='relu', scope='nonlinearity', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Non-linearity layer applying a non-linear transformation.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.
Dropout
(rate=0.0, scope='dropout', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dropout layer. If using dropout, add this layer after inputs and after dense layers. For LSTM, dropout is handled independently as an argument. Not available for Conv2d yet.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.
Flatten
(scope='flatten', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Flatten layer reshaping the input.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.
Pool2d
(pooling_type='max', window=2, stride=2, padding='SAME', scope='pool2d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
2-dimensional pooling layer.
-
tf_apply
(x, update)¶
-
-
class
tensorforce.core.networks.
Embedding
(indices, size, l2_regularization=0.0, l1_regularization=0.0, scope='embedding', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Embedding layer.
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
Linear
(size, weights=None, bias=True, l2_regularization=0.0, l1_regularization=0.0, scope='linear', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Linear fully-connected layer.
-
tf_apply
(x, update=False)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
Dense
(size=None, bias=True, activation='tanh', l2_regularization=0.0, l1_regularization=0.0, skip=False, scope='dense', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dense layer, i.e. linear fully connected layer with subsequent non-linearity.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
Dueling
(size, bias=False, activation='none', l2_regularization=0.0, l1_regularization=0.0, output=None, scope='dueling', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Dueling layer, i.e. separate pipelines for the expectation (value) and advantage estimates, recombined to help with stability.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
Conv1d
(size, window=3, stride=1, padding='SAME', bias=True, activation='relu', l2_regularization=0.0, l1_regularization=0.0, scope='conv1d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
1-dimensional convolutional layer.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
Conv2d
(size, window=3, stride=1, padding='SAME', bias=True, activation='relu', l2_regularization=0.0, l1_regularization=0.0, scope='conv2d', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
2-dimensional convolutional layer.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
tf_apply
(x, update)¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
InternalLstm
(size, dropout=None, scope='internal_lstm', summary_labels=())¶ Bases:
tensorforce.core.networks.layer.Layer
Long short-term memory layer for internal state management.
-
internals_init
()¶
-
internals_input
()¶
-
tf_apply
(x, update, state)¶
-
-
class
tensorforce.core.networks.
Lstm
(size, dropout=None, scope='lstm', summary_labels=(), return_final_state=True)¶ Bases:
tensorforce.core.networks.layer.Layer
-
tf_apply
(x, update, sequence_length=None)¶
-
-
class
tensorforce.core.networks.
Network
(scope='network', summary_labels=None)¶ Bases:
object
Base class for neural networks.
-
static
from_spec
(spec, kwargs=None)¶ Creates a network from a specification dict.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the network.
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the network.
Returns: List of variables
-
internals_init
()¶ Returns the TensorFlow tensors for internal state initializations.
Returns: List of internal state initialization tensors
-
internals_input
()¶ Returns the TensorFlow placeholders for internal state inputs.
Returns: List of internal state input placeholders
-
tf_apply
(x, internals, update, return_internals=False)¶ Creates the TensorFlow operations for applying the network to the given input.
Parameters: - x – Network input tensor or dict of input tensors.
- internals – List of prior internal state tensors
- update – Boolean tensor indicating whether this call happens during an update.
- return_internals – If true, also returns posterior internal state tensors
Returns: Network output tensor, plus optionally list of posterior internal state tensors
-
tf_regularization_loss
()¶ Creates the TensorFlow operations for the network regularization loss.
Returns: Regularization loss tensor
-
-
class
tensorforce.core.networks.
LayerBasedNetwork
(scope='layerbased-network', summary_labels=())¶ Bases:
tensorforce.core.networks.network.Network
Base class for networks using TensorForce layers.
-
add_layer
(layer)¶
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
internals_init
()¶
-
internals_input
()¶
-
tf_regularization_loss
()¶
-
-
class
tensorforce.core.networks.
LayeredNetwork
(layers_spec, scope='layered-network', summary_labels=())¶ Bases:
tensorforce.core.networks.network.LayerBasedNetwork
Network consisting of a sequence of layers, which can be created from a specification dict.
-
static
from_json
(filename)¶ Creates a layered_network_builder from a JSON file.
Parameters: filename – Path to configuration Returns: A layered_network_builder function with layers generated from the JSON
-
tf_apply
(x, internals, update, return_internals=False)¶
-
-
class
tensorforce.core.optimizers.solvers.conjugate_gradient.
ConjugateGradient
(max_iterations, damping, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.iterative.Iterative
Conjugate gradient algorithm which iteratively finds a solution $x$ for a system of linear equations of the form $A x = b$, where $A x$ could be, for instance, a locally linear approximation of a high-dimensional function.
See below pseudo-code taken from Wikipedia:
def conjgrad(A, b, x_0):
    r_0 := b - A * x_0
    c_0 := r_0
    r_0^2 := r_0^T * r_0
    for t in 0, ..., max_iterations - 1:
        Ac := A * c_t
        cAc := c_t^T * Ac
        \alpha := r_t^2 / cAc
        x_{t+1} := x_t + \alpha * c_t
        r_{t+1} := r_t - \alpha * Ac
        r_{t+1}^2 := r_{t+1}^T * r_{t+1}
        if r_{t+1}^2 < \epsilon:
            break
        \beta := r_{t+1}^2 / r_t^2
        c_{t+1} := r_{t+1} + \beta * c_t
    return x_{t+1}
-
tf_initialize
(x_init, b)¶ Initialization step preparing the arguments for the first iteration of the loop body: $x_0, 0, p_0, r_0, r_0^2$.
Parameters: - x_init – Initial solution guess $x_0$, zero vector if None.
- b – The right-hand side $b$ of the system of linear equations.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, conjugate, residual, squared_residual)¶ Termination condition: max number of iterations, or residual sufficiently small.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- conjugate – Current conjugate $c_t$.
- residual – Current residual $r_t$.
- squared_residual – Current squared residual $r_t^2$.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, b)¶ Iteratively solves the system of linear equations $A x = b$.
Parameters: - fn_x – A callable returning the left-hand side $A x$ of the system of linear equations.
- x_init – Initial solution guess $x_0$, zero vector if None.
- b – The right-hand side $b$ of the system of linear equations.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, conjugate, residual, squared_residual)¶ Iteration loop body of the conjugate gradient algorithm.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- conjugate – Current conjugate $c_t$.
- residual – Current residual $r_t$.
- squared_residual – Current squared residual $r_t^2$.
Returns: Updated arguments for next iteration.
-
-
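For intuition, the same iteration can be written as a standalone NumPy function; this sketch mirrors the pseudo-code above and is independent of the TensorFlow solver implementation:
# Standalone NumPy sketch of the conjugate gradient iteration above;
# fn_Ax(x) returns the matrix-vector product A x.
import numpy as np

def conjugate_gradient(fn_Ax, b, x_init=None, max_iterations=20, eps=1e-10):
    x = np.zeros_like(b) if x_init is None else x_init.astype(float)
    r = b - fn_Ax(x)      # residual r_0
    c = r.copy()          # conjugate direction c_0
    r2 = r.dot(r)         # squared residual r_0^2
    for _ in range(max_iterations):
        Ac = fn_Ax(c)
        alpha = r2 / c.dot(Ac)
        x = x + alpha * c
        r = r - alpha * Ac
        r2_next = r.dot(r)
        if r2_next < eps:
            break
        beta = r2_next / r2
        c = r + beta * c
        r2 = r2_next
    return x

# Example: solve a small symmetric positive-definite system A x = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: A.dot(v), b))  # approx. [0.0909, 0.6364]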
class
tensorforce.core.optimizers.solvers.iterative.
Iterative
(max_iterations, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.solver.Solver
Generic solver which iteratively solves an equation/optimization problem. Involves an initialization step, the iteration loop body and the termination condition.
-
tf_initialize
(x_init, *args)¶ Initialization step preparing the arguments for the first iteration of the loop body (default: initial solution guess and iteration counter).
Parameters: - x_init – Initial solution guess $x_0$.
- *args –
Additional solver-specific arguments.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, *args)¶ Termination condition (default: max number of iterations).
Parameters: - x – Current solution estimate.
- iteration – Current iteration counter.
- *args –
Additional solver-specific arguments.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, *args)¶ Iteratively solves an equation/optimization for $x$ involving an expression $f(x)$.
Parameters: - fn_x – A callable returning an expression $f(x)$ given $x$.
- x_init – Initial solution guess $x_0$.
- *args –
Additional solver-specific arguments.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, *args)¶ Iteration loop body of the iterative solver (default: increment iteration step). The first two loop arguments have to be the current solution estimate and the iteration step.
Parameters: - x – Current solution estimate.
- iteration – Current iteration counter.
- *args –
Additional solver-specific arguments.
Returns: Updated arguments for next iteration.
-
-
class
tensorforce.core.optimizers.solvers.line_search.
LineSearch
(max_iterations, accept_ratio, mode, parameter, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.iterative.Iterative
Line search algorithm which iteratively optimizes the value $f(x)$ for $x$ on the line between $x’$ and $x_0$ by optimistically taking the first acceptable $x$ starting from $x_0$ and moving towards $x’$.
-
tf_initialize
(x_init, base_value, target_value, estimated_improvement)¶ Initialization step preparing the arguments for the first iteration of the loop body.
Parameters: - x_init – Initial solution guess $x_0$.
- base_value – Value $f(x’)$ at $x = x’$.
- target_value – Value $f(x_0)$ at $x = x_0$.
- estimated_improvement – Estimated value at $x = x_0$, $f(x’)$ if None.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, deltas, improvement, last_improvement, estimated_improvement)¶ Termination condition: max number of iterations, or no improvement for last step, or improvement less than acceptable ratio, or estimated value not positive.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- deltas – Current difference $x_t - x’$.
- improvement – Current improvement $(f(x_t) - f(x’)) / v’$.
- last_improvement – Last improvement $(f(x_{t-1}) - f(x’)) / v’$.
- estimated_improvement – Current estimated value $v’$.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, base_value, target_value, estimated_improvement=None)¶ Iteratively optimizes $f(x)$ for $x$ on the line between $x’$ and $x_0$.
Parameters: - fn_x – A callable returning the value $f(x)$ at $x$.
- x_init – Initial solution guess $x_0$.
- base_value – Value $f(x’)$ at $x = x’$.
- target_value – Value $f(x_0)$ at $x = x_0$.
- estimated_improvement – Estimated improvement for $x = x_0$, $f(x’)$ if None.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, deltas, improvement, last_improvement, estimated_improvement)¶ Iteration loop body of the line search algorithm.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- deltas – Current difference $x_t - x’$.
- improvement – Current improvement $(f(x_t) - f(x’)) / v’$.
- last_improvement – Last improvement $(f(x_{t-1}) - f(x’)) / v’$.
- estimated_improvement – Current estimated value $v’$.
Returns: Updated arguments for next iteration.
-
-
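As a rough standalone analogy (not the exact logic of the TensorFlow solver above), a backtracking line search can be sketched in NumPy as follows; estimated_improvement plays the role of the optimistic improvement estimate described above, and the acceptance test is a simplification:
# Standalone NumPy analogy of a backtracking line search: start from the full
# step x_full and shrink towards x_base until the observed improvement is an
# acceptable fraction of the (scaled) estimated improvement. This is a
# simplified illustration, not the exact acceptance test used by LineSearch.
import numpy as np

def backtracking_line_search(fn_value, x_base, x_full, estimated_improvement,
                             accept_ratio=0.9, parameter=0.5, max_iterations=10):
    base_value = fn_value(x_base)
    deltas = x_full - x_base
    fraction = 1.0
    for _ in range(max_iterations):
        x = x_base + fraction * deltas
        improvement = fn_value(x) - base_value
        if improvement >= accept_ratio * fraction * estimated_improvement:
            return x
        fraction *= parameter  # 'exponential' mode: shrink the step each iteration
    return x_base

# Example: maximize a concave function along the line from x_base to x_full.
fn = lambda x: -np.sum((x - 1.0) ** 2)
print(backtracking_line_search(fn, x_base=np.zeros(2), x_full=np.array([2.5, 2.5]),
                               estimated_improvement=2.0))  # accepts a shrunken step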
class
tensorforce.core.optimizers.solvers.solver.
Solver
¶ Bases:
object
Generic TensorFlow-based solver which solves a not yet further specified equation/optimization problem.
-
static
from_config
(config, kwargs=None)¶ Creates a solver from a specification dict.
-
tf_solve
(fn_x, *args)¶ Solves an equation/optimization for $x$ involving an expression $f(x)$.
Parameters: - fn_x – A callable returning an expression $f(x)$ given $x$.
- *args –
Additional solver-specific arguments.
Returns: A solution $x$ to the problem as given by the solver.
-
-
class
tensorforce.core.optimizers.solvers.
Solver
¶ Bases:
object
Generic TensorFlow-based solver which solves a not yet further specified equation/optimization problem.
-
static
from_config
(config, kwargs=None)¶ Creates a solver from a specification dict.
-
tf_solve
(fn_x, *args)¶ Solves an equation/optimization for $x$ involving an expression $f(x)$.
Parameters: - fn_x – A callable returning an expression $f(x)$ given $x$.
- *args –
Additional solver-specific arguments.
Returns: A solution $x$ to the problem as given by the solver.
-
-
class
tensorforce.core.optimizers.solvers.
Iterative
(max_iterations, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.solver.Solver
Generic solver which iteratively solves an equation/optimization problem. Involves an initialization step, the iteration loop body and the termination condition.
-
tf_initialize
(x_init, *args)¶ Initialization step preparing the arguments for the first iteration of the loop body (default: initial solution guess and iteration counter).
Parameters: - x_init – Initial solution guess $x_0$.
- *args –
Additional solver-specific arguments.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, *args)¶ Termination condition (default: max number of iterations).
Parameters: - x – Current solution estimate.
- iteration – Current iteration counter.
- *args –
Additional solver-specific arguments.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, *args)¶ Iteratively solves an equation/optimization for $x$ involving an expression $f(x)$.
Parameters: - fn_x – A callable returning an expression $f(x)$ given $x$.
- x_init – Initial solution guess $x_0$.
- *args –
Additional solver-specific arguments.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, *args)¶ Iteration loop body of the iterative solver (default: increment iteration step). The first two loop arguments have to be the current solution estimate and the iteration step.
Parameters: - x – Current solution estimate.
- iteration – Current iteration counter.
- *args –
Additional solver-specific arguments.
Returns: Updated arguments for next iteration.
-
-
class
tensorforce.core.optimizers.solvers.
ConjugateGradient
(max_iterations, damping, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.iterative.Iterative
Conjugate gradient algorithm which iteratively finds a solution $x$ for a system of linear equations of the form $A x = b$, where $A x$ could be, for instance, a locally linear approximation of a high-dimensional function.
See below pseudo-code taken from Wikipedia:
def conjgrad(A, b, x_0):
    r_0 := b - A * x_0
    c_0 := r_0
    r_0^2 := r_0^T * r_0
    for t in 0, ..., max_iterations - 1:
        Ac := A * c_t
        cAc := c_t^T * Ac
        \alpha := r_t^2 / cAc
        x_{t+1} := x_t + \alpha * c_t
        r_{t+1} := r_t - \alpha * Ac
        r_{t+1}^2 := r_{t+1}^T * r_{t+1}
        if r_{t+1}^2 < \epsilon:
            break
        \beta := r_{t+1}^2 / r_t^2
        c_{t+1} := r_{t+1} + \beta * c_t
    return x_{t+1}
-
tf_initialize
(x_init, b)¶ Initialization step preparing the arguments for the first iteration of the loop body: $x_0, 0, p_0, r_0, r_0^2$.
Parameters: - x_init – Initial solution guess $x_0$, zero vector if None.
- b – The right-hand side $b$ of the system of linear equations.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, conjugate, residual, squared_residual)¶ Termination condition: max number of iterations, or residual sufficiently small.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- conjugate – Current conjugate $c_t$.
- residual – Current residual $r_t$.
- squared_residual – Current squared residual $r_t^2$.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, b)¶ Iteratively solves the system of linear equations $A x = b$.
Parameters: - fn_x – A callable returning the left-hand side $A x$ of the system of linear equations.
- x_init – Initial solution guess $x_0$, zero vector if None.
- b – The right-hand side $b$ of the system of linear equations.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, conjugate, residual, squared_residual)¶ Iteration loop body of the conjugate gradient algorithm.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- conjugate – Current conjugate $c_t$.
- residual – Current residual $r_t$.
- squared_residual – Current squared residual $r_t^2$.
Returns: Updated arguments for next iteration.
-
-
class
tensorforce.core.optimizers.solvers.
LineSearch
(max_iterations, accept_ratio, mode, parameter, unroll_loop=False)¶ Bases:
tensorforce.core.optimizers.solvers.iterative.Iterative
Line search algorithm which iteratively optimizes the value $f(x)$ for $x$ on the line between $x’$ and $x_0$ by optimistically taking the first acceptable $x$ starting from $x_0$ and moving towards $x’$.
-
tf_initialize
(x_init, base_value, target_value, estimated_improvement)¶ Initialization step preparing the arguments for the first iteration of the loop body.
Parameters: - x_init – Initial solution guess $x_0$.
- base_value – Value $f(x’)$ at $x = x’$.
- target_value – Value $f(x_0)$ at $x = x_0$.
- estimated_improvement – Estimated value at $x = x_0$, $f(x’)$ if None.
Returns: Initial arguments for tf_step.
-
tf_next_step
(x, iteration, deltas, improvement, last_improvement, estimated_improvement)¶ Termination condition: max number of iterations, or no improvement for last step, or improvement less than acceptable ratio, or estimated value not positive.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- deltas – Current difference $x_t - x’$.
- improvement – Current improvement $(f(x_t) - f(x’)) / v’$.
- last_improvement – Last improvement $(f(x_{t-1}) - f(x’)) / v’$.
- estimated_improvement – Current estimated value $v’$.
Returns: True if another iteration should be performed.
-
tf_solve
(fn_x, x_init, base_value, target_value, estimated_improvement=None)¶ Iteratively optimizes $f(x)$ for $x$ on the line between $x’$ and $x_0$.
Parameters: - fn_x – A callable returning the value $f(x)$ at $x$.
- x_init – Initial solution guess $x_0$.
- base_value – Value $f(x’)$ at $x = x’$.
- target_value – Value $f(x_0)$ at $x = x_0$.
- estimated_improvement – Estimated improvement for $x = x_0$, $f(x’)$ if None.
Returns: A solution $x$ to the problem as given by the solver.
-
tf_step
(x, iteration, deltas, improvement, last_improvement, estimated_improvement)¶ Iteration loop body of the line search algorithm.
Parameters: - x – Current solution estimate $x_t$.
- iteration – Current iteration counter $t$.
- deltas – Current difference $x_t - x’$.
- improvement – Current improvement $(f(x_t) - f(x’)) / v’$.
- last_improvement – Last improvement $(f(x_{t-1}) - f(x’)) / v’$.
- estimated_improvement – Current estimated value $v’$.
Returns: Updated arguments for next iteration.
-
-
class
tensorforce.core.optimizers.clipped_step.
ClippedStep
(optimizer, clipping_value, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The clipped-step meta optimizer clips the optimization step proposed by another optimizer to a given maximum value (clipping_value).
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.evolutionary.
Evolutionary
(learning_rate, num_samples=1, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Evolutionary optimizer which samples random perturbations and applies them either positively or negatively, depending on their improvement of the loss.
-
tf_step
(time, variables, fn_loss, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.global_optimizer.
GlobalOptimizer
(optimizer, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The global optimizer applies an optimizer to the local variables. In addition, it also applies the update to a corresponding set of global variables and subsequently updates the local variables to the value of these global variables. Note: This is used for the current distributed mode, and will likely change with the next major version update.
-
tf_step
(time, variables, global_variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- global_variables – List of global variables to apply the proposed optimization step to.
- **kwargs –
??? coming soon
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.meta_optimizer.
MetaOptimizer
(optimizer, **kwargs)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
A meta optimizer takes the optimization implemented by another optimizer and modifies/optimizes its proposed result. For example, line search might be applied to find a more optimal step size.
-
get_variables
()¶
-
-
class
tensorforce.core.optimizers.multi_step.
MultiStep
(optimizer, num_steps=5, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The multi-step meta optimizer repeatedly applies the optimization step proposed by another optimizer a number of times.
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.natural_gradient.
NaturalGradient
(learning_rate, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Natural gradient optimizer.
-
tf_step
(time, variables, fn_loss, fn_kl_divergence, return_estimated_improvement=False, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- fn_kl_divergence – A callable returning the KL-divergence relative to the current model.
- return_estimated_improvement – Returns the estimated improvement resulting from the natural gradient calculation if true.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.optimized_step.
OptimizedStep
(optimizer, ls_max_iterations=10, ls_accept_ratio=0.9, ls_mode='exponential', ls_parameter=0.5, ls_unroll_loop=False, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The optimized-step meta optimizer applies line search to the proposed optimization step of another optimizer to find a more optimal step size.
-
tf_step
(time, variables, fn_loss, fn_reference=None, fn_compare=None, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- fn_reference – A callable returning the reference values necessary for comparison.
- fn_compare – A callable comparing the current model to the reference model given by its values.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
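Meta optimizers such as OptimizedStep are typically configured by nesting specification dicts, for instance a natural-gradient step whose size is refined by line search. In the sketch below the 'type' strings are assumptions (taken from the module names above); only documented constructor parameters are used.
# Sketch of a nested optimizer specification: a natural-gradient step refined
# by line search. The 'type' strings are assumed to follow the module names.
optimizer_spec = dict(
    type='optimized_step',
    optimizer=dict(
        type='natural_gradient',
        learning_rate=1e-2,
        cg_max_iterations=20,
        cg_damping=1e-3
    ),
    ls_max_iterations=10,
    ls_accept_ratio=0.9
)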
class
tensorforce.core.optimizers.optimizer.
Optimizer
(summaries=None, summary_labels=None)¶ Bases:
object
Generic TensorFlow optimizer which minimizes a not yet further specified expression, usually some kind of loss function. More generally, an optimizer can be considered as some method of updating a set of variables.
-
apply_step
(variables, deltas)¶ Applies step deltas to variable values.
Parameters: - variables – List of variables.
- deltas – List of deltas of same length.
Returns: The step-applied operation.
-
static
from_spec
(spec, kwargs=None)¶ Creates an optimizer from a specification dict.
-
get_variables
()¶ Returns the TensorFlow variables used by the optimizer.
Returns: List of variables.
-
minimize
(time, variables, **kwargs)¶ Performs an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional optimizer-specific arguments. The following arguments are used by some optimizers:
- fn_loss – A callable returning the loss of the current model.
- fn_kl_divergence – A callable returning the KL-divergence relative to the current model.
- return_estimated_improvement – Returns the estimated improvement resulting from the natural gradient calculation if true.
- fn_reference – A callable returning the reference values necessary for comparison.
- fn_compare – A callable comparing the current model to the reference model given by its values.
- source_variables – List of source variables to synchronize with.
- global_variables – List of global variables to apply the proposed optimization step to.
Returns: The optimization operation.
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments depending on the specific optimizer implementation. For instance, often includes
fn_loss
if a loss function is optimized.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.synchronization.
Synchronization
(sync_frequency=1, update_weight=1.0)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
The synchronization optimizer updates variables periodically to the value of a corresponding set of source variables.
-
get_variables
()¶
-
tf_step
(time, variables, source_variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- source_variables – List of source variables to synchronize with.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.tf_optimizer.
TFOptimizer
(optimizer, summaries=None, summary_labels=None, **kwargs)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Wrapper class for TensorFlow optimizers.
-
get_variables
()¶
-
static
get_wrapper
(optimizer)¶ Returns a TFOptimizer constructor callable for the given optimizer name.
Parameters: optimizer – The name of the optimizer, one of ‘adadelta’, ‘adagrad’, ‘adam’, ‘gradient_descent’, ‘momentum’, ‘nadam’, ‘rmsprop’.
Returns: The TFOptimizer constructor callable.
-
tf_optimizers
= mapping of the optimizer names ‘adadelta’, ‘adagrad’, ‘adam’, ‘gradient_descent’, ‘momentum’, ‘nadam’ and ‘rmsprop’ to the corresponding TensorFlow optimizer classes¶
-
tf_step
(time, variables, fn_loss, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
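In agent configurations these optimizers are usually selected by specification dicts resolved via Optimizer.from_spec; a hedged sketch, assuming the 'adam' name is handled by the TFOptimizer wrapper and 'multi_step' by the meta optimizer documented above:
# Sketch: a TensorFlow Adam optimizer wrapped by the multi-step meta optimizer,
# expressed as nested specification dicts (type strings are assumptions).
optimizer_spec = dict(
    type='multi_step',
    optimizer=dict(type='adam', learning_rate=1e-3),
    num_steps=5
)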
class
tensorforce.core.optimizers.
Optimizer
(summaries=None, summary_labels=None)¶ Bases:
object
Generic TensorFlow optimizer which minimizes a not yet further specified expression, usually some kind of loss function. More generally, an optimizer can be considered as some method of updating a set of variables.
-
apply_step
(variables, deltas)¶ Applies step deltas to variable values.
Parameters: - variables – List of variables.
- deltas – List of deltas of same length.
Returns: The step-applied operation.
-
static
from_spec
(spec, kwargs=None)¶ Creates an optimizer from a specification dict.
-
get_variables
()¶ Returns the TensorFlow variables used by the optimizer.
Returns: List of variables.
-
minimize
(time, variables, **kwargs)¶ Performs an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional optimizer-specific arguments. The following arguments are used by some optimizers:
- fn_loss – A callable returning the loss of the current model.
- fn_kl_divergence – A callable returning the KL-divergence relative to the current model.
- return_estimated_improvement – Returns the estimated improvement resulting from the natural gradient calculation if true.
- fn_reference – A callable returning the reference values necessary for comparison.
- fn_compare – A callable comparing the current model to the reference model given by its values.
- source_variables – List of source variables to synchronize with.
- global_variables – List of global variables to apply the proposed optimization step to.
Returns: The optimization operation.
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments depending on the specific optimizer implementation. For instance, often includes
fn_loss
if a loss function is optimized.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
MetaOptimizer
(optimizer, **kwargs)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
A meta optimizer takes the optimization implemented by another optimizer and modifies/optimizes its proposed result. For example, line search might be applied to find a more optimal step size.
-
get_variables
()¶
-
-
class
tensorforce.core.optimizers.
TFOptimizer
(optimizer, summaries=None, summary_labels=None, **kwargs)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Wrapper class for TensorFlow optimizers.
-
get_variables
()¶
-
static
get_wrapper
(optimizer)¶ Returns a TFOptimizer constructor callable for the given optimizer name.
Parameters: optimizer – The name of the optimizer, one of ‘adadelta’, ‘adagrad’, ‘adam’, ‘gradient_descent’, ‘momentum’, ‘nadam’, ‘rmsprop’.
Returns: The TFOptimizer constructor callable.
-
tf_optimizers
= mapping of the optimizer names ‘adadelta’, ‘adagrad’, ‘adam’, ‘gradient_descent’, ‘momentum’, ‘nadam’ and ‘rmsprop’ to the corresponding TensorFlow optimizer classes¶
-
tf_step
(time, variables, fn_loss, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
Evolutionary
(learning_rate, num_samples=1, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Evolutionary optimizer which samples random perturbations and applies them either positively or negatively, depending on their improvement of the loss.
-
tf_step
(time, variables, fn_loss, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
NaturalGradient
(learning_rate, cg_max_iterations=20, cg_damping=0.001, cg_unroll_loop=False, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
Natural gradient optimizer.
-
tf_step
(time, variables, fn_loss, fn_kl_divergence, return_estimated_improvement=False, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- fn_kl_divergence – A callable returning the KL-divergence relative to the current model.
- return_estimated_improvement – Returns the estimated improvement resulting from the natural gradient calculation if true.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
MultiStep
(optimizer, num_steps=5, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The multi-step meta optimizer repeatedly applies the optimization step proposed by another optimizer a number of times.
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
OptimizedStep
(optimizer, ls_max_iterations=10, ls_accept_ratio=0.9, ls_mode='exponential', ls_parameter=0.5, ls_unroll_loop=False, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The optimized-step meta optimizer applies line search to the proposed optimization step of another optimizer to find a more optimal step size.
-
tf_step
(time, variables, fn_loss, fn_reference=None, fn_compare=None, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- fn_loss – A callable returning the loss of the current model.
- fn_reference – A callable returning the reference values necessary for comparison.
- fn_compare – A callable comparing the current model to the reference model given by its values.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
Synchronization
(sync_frequency=1, update_weight=1.0)¶ Bases:
tensorforce.core.optimizers.optimizer.Optimizer
The synchronization optimizer updates variables periodically to the value of a corresponding set of source variables.
-
get_variables
()¶
-
tf_step
(time, variables, source_variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- source_variables – List of source variables to synchronize with.
- **kwargs –
Additional arguments, not used.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
ClippedStep
(optimizer, clipping_value, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The clipped-step meta optimizer clips the optimization step proposed by another optimizer to a given maximum value (clipping_value).
-
tf_step
(time, variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- **kwargs –
Additional arguments passed on to the internal optimizer.
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.optimizers.
GlobalOptimizer
(optimizer, summaries=None, summary_labels=None)¶ Bases:
tensorforce.core.optimizers.meta_optimizer.MetaOptimizer
The global optimizer applies an optimizer to the local variables. In addition, it also applies the update to a corresponding set of global variables and subsequently updates the local variables to the value of these global variables. Note: This is used for the current distributed mode, and will likely change with the next major version update.
-
tf_step
(time, variables, global_variables, **kwargs)¶ Creates the TensorFlow operations for performing an optimization step.
Parameters: - time – Time tensor.
- variables – List of variables to optimize.
- global_variables – List of global variables to apply the proposed optimization step to.
- **kwargs –
??? coming soon
Returns: List of delta tensors corresponding to the updates for each optimized variable.
-
-
class
tensorforce.core.preprocessing.clip.
Clip
(min_value, max_value, scope='clip', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Clip by min/max.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.divide.
Divide
(scale, scope='divide', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Divide state by scale.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.grayscale.
Grayscale
(weights=(0.299, 0.587, 0.114), scope='grayscale', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Turn 3D color state into grayscale.
-
processed_shape
(shape)¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.image_resize.
ImageResize
(width, height, scope='image_resize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Resize image to width x height.
-
processed_shape
(shape)¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.normalize.
Normalize
(scope='normalize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Normalize state. Subtract minimal value and divide by range.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.preprocessor.
Preprocessor
(scope='preprocessor', summary_labels=None)¶ Bases:
object
-
get_variables
()¶ Returns the TensorFlow variables used by the preprocessor.
Returns: List of variables.
-
processed_shape
(shape)¶ Shape of preprocessed state given original shape.
Parameters: shape – original shape. Returns: processed tensor shape
-
reset
()¶
-
tf_process
(tensor)¶ Process state.
Parameters: tensor – tensor to process. Returns: processed tensor.
-
-
class
tensorforce.core.preprocessing.preprocessor_stack.
PreprocessorStack
¶ Bases:
object
-
static
from_spec
(spec)¶ Creates a preprocessing stack from a specification dict.
-
get_variables
()¶
-
process
(tensor)¶ Process state.
Parameters: tensor – tensor to process Returns: processed state
-
processed_shape
(shape)¶ Shape of preprocessed state given original shape.
Parameters: shape – original state shape Returns: processed state shape
-
reset
()¶
-
-
class
tensorforce.core.preprocessing.running_standardize.
RunningStandardize
(axis=None, reset_after_batch=True, scope='running_standardize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Standardize state w.r.t past states. Subtract mean and divide by standard deviation of sequence of past states.
-
reset
()¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.sequence.
Sequence
(length=2, scope='sequence', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Concatenate
length
state vectors. Example: Used in Atari problems to create the Markov property.
-
processed_shape
(shape)¶
-
reset
()¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.standardize.
Standardize
(across_batch=False, scope='standardize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Standardize state. Subtract mean and divide by standard deviation.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
Preprocessor
(scope='preprocessor', summary_labels=None)¶ Bases:
object
-
get_variables
()¶ Returns the TensorFlow variables used by the preprocessor.
Returns: List of variables.
-
processed_shape
(shape)¶ Shape of preprocessed state given original shape.
Parameters: shape – original shape. Returns: processed tensor shape
-
reset
()¶
-
tf_process
(tensor)¶ Process state.
Parameters: tensor – tensor to process. Returns: processed tensor.
-
-
class
tensorforce.core.preprocessing.
Sequence
(length=2, scope='sequence', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Concatenate
length
state vectors. Example: Used in Atari problems to create the Markov property.
-
processed_shape
(shape)¶
-
reset
()¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
Standardize
(across_batch=False, scope='standardize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Standardize state. Subtract mean and divide by standard deviation.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
RunningStandardize
(axis=None, reset_after_batch=True, scope='running_standardize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Standardize state w.r.t past states. Subtract mean and divide by standard deviation of sequence of past states.
-
reset
()¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
Normalize
(scope='normalize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Normalize state. Subtract minimal value and divide by range.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
Grayscale
(weights=(0.299, 0.587, 0.114), scope='grayscale', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Turn 3D color state into grayscale.
-
processed_shape
(shape)¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
ImageResize
(width, height, scope='image_resize', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Resize image to width x height.
-
processed_shape
(shape)¶
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
PreprocessorStack
¶ Bases:
object
-
static
from_spec
(spec)¶ Creates a preprocessing stack from a specification dict.
-
get_variables
()¶
-
process
(tensor)¶ Process state.
Parameters: tensor – tensor to process Returns: processed state
-
processed_shape
(shape)¶ Shape of preprocessed state given original shape.
Parameters: shape – original state shape Returns: processed state shape
-
reset
()¶
-
-
class
tensorforce.core.preprocessing.
Divide
(scale, scope='divide', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Divide state by scale.
-
tf_process
(tensor)¶
-
-
class
tensorforce.core.preprocessing.
Clip
(min_value, max_value, scope='clip', summary_labels=())¶ Bases:
tensorforce.core.preprocessing.preprocessor.Preprocessor
Clip by min/max.
-
tf_process
(tensor)¶
-
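Preprocessors are usually combined into a stack described by a list of specification dicts and passed to the model/agent as states_preprocessing_spec; a sketch for image states, assuming the 'type' strings match the preprocessor scope names documented above:
# Sketch of a preprocessing stack for image states (type strings are assumed
# to match the preprocessor scopes documented above). This list would be
# passed as states_preprocessing_spec when configuring a model/agent.
states_preprocessing_spec = [
    dict(type='image_resize', width=84, height=84),
    dict(type='grayscale'),
    dict(type='sequence', length=4)
]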
Module contents¶
tensorforce.environments package¶
Submodules¶
tensorforce.environments.environment module¶
-
class
tensorforce.environments.environment.
Environment
¶ Bases:
object
Base environment class.
-
actions
¶ Return the action space. Might include subdicts if multiple actions are available simultaneously.
Returns: dict of action properties (continuous, number of actions)
-
close
()¶ Close environment. No other method calls possible afterwards.
-
execute
(actions)¶ Executes action, observes next state(s) and reward.
Parameters: actions – Actions to execute. Returns: (Dict of) next state(s), boolean indicating terminal, and reward signal.
-
reset
()¶ Reset environment and setup for new episode.
Returns: initial state of reset environment.
-
seed
(seed)¶ Sets the random seed of the environment to the given value (current time, if seed=None). Naturally deterministic Environments (e.g. ALE or some gym Envs) don’t have to implement this method.
Parameters: seed (int) – The seed to use for initializing the pseudo-random number generator (default=epoch time in sec). Returns: The actual seed (int) used OR None if Environment did not override this method (no seeding supported).
-
states
¶ Return the state space. Might include subdicts if multiple states are available simultaneously.
Returns: dict of state properties (shape and type).
-
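For reference, a minimal custom environment might look roughly as follows; the exact keys of the states/actions property dicts (shape, type, num_actions) are assumptions based on the descriptions above and may differ between versions:
# Minimal custom environment sketch. The spec dict keys in states/actions are
# assumptions based on the property descriptions above.
from tensorforce.environments import Environment

class RandomWalk(Environment):

    @property
    def states(self):
        return dict(shape=(1,), type='float')

    @property
    def actions(self):
        return dict(type='int', num_actions=2)

    def reset(self):
        self.position = 0.0
        return (self.position,)

    def execute(self, actions):
        # Move left or right; the episode ends five steps away from the origin.
        self.position += 1.0 if actions == 1 else -1.0
        terminal = abs(self.position) >= 5.0
        reward = 1.0 if self.position >= 5.0 else 0.0
        return (self.position,), terminal, reward

    def close(self):
        pass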
tensorforce.environments.minimal_test module¶
-
class
tensorforce.environments.minimal_test.
MinimalTest
(specification)¶ Bases:
tensorforce.environments.environment.Environment
-
actions
¶
-
close
()¶
-
execute
(actions)¶
-
reset
()¶
-
states
¶
-
-
tensorforce.environments.minimal_test.
random
() → x in the interval [0, 1).¶
Module contents¶
-
class
tensorforce.environments.
Environment
¶ Bases:
object
Base environment class.
-
actions
¶ Return the action space. Might include subdicts if multiple actions are available simultaneously.
Returns: dict of action properties (continuous, number of actions)
-
close
()¶ Close environment. No other method calls possible afterwards.
-
execute
(actions)¶ Executes action, observes next state(s) and reward.
Parameters: actions – Actions to execute. Returns: (Dict of) next state(s), boolean indicating terminal, and reward signal.
-
reset
()¶ Reset environment and setup for new episode.
Returns: initial state of reset environment.
-
seed
(seed)¶ Sets the random seed of the environment to the given value (current time, if seed=None). Naturally deterministic Environments (e.g. ALE or some gym Envs) don’t have to implement this method.
Parameters: seed (int) – The seed to use for initializing the pseudo-random number generator (default=epoch time in sec). Returns: The actual seed (int) used OR None if Environment did not override this method (no seeding supported).
-
states
¶ Return the state space. Might include subdicts if multiple states are available simultaneously.
Returns: dict of state properties (shape and type).
-
tensorforce.execution package¶
Submodules¶
tensorforce.execution.runner module¶
-
class
tensorforce.execution.runner.
Runner
(agent, environment, repeat_actions=1, history=None)¶ Bases:
object
Simple runner for non-realtime single-process execution.
-
reset
(history=None)¶
-
run
(timesteps=None, episodes=None, max_episode_timesteps=None, deterministic=False, episode_finished=None)¶ Runs the agent on the environment.
Parameters: - timesteps (int) – Max. number of total timesteps to run (across episodes).
- episodes (int) – Max. number of episodes to run.
- max_episode_timesteps (int) – Max. number of timesteps per episode.
- deterministic (bool) – If true, pick actions from model without exploration/sampling.
- episode_finished (callable) – Function handler taking a
Runner
argument and returning a boolean indicating whether to continue execution. For instance, useful for reporting intermediate performance or integrating termination conditions.
-
tensorforce.execution.threaded_runner module¶
-
class
tensorforce.execution.threaded_runner.
ThreadedRunner
(agents, environments, repeat_actions=1, save_path=None, save_episodes=None)¶ Bases:
object
Runner for non-realtime threaded execution of multiple agents.
-
run
(episodes=-1, max_episode_timesteps=-1, episode_finished=None, summary_report=None, summary_interval=0, max_timesteps=None)¶ Parameters: - episodes (int) – Max. number of episodes to run.
- max_episode_timesteps (int) – Max. number of timesteps per episode.
- episode_finished (callable) –
- summary_report (callable) – Function that produces a tensorboard summary update.
- summary_interval (int) –
- max_timesteps (int) – Deprecated; see max_episode_timesteps
-
-
tensorforce.execution.threaded_runner.
WorkerAgentGenerator
(agent_class)¶ Worker Agent generator, receives an Agent class and creates a Worker Agent class that inherits from that Agent.
Module contents¶
-
class
tensorforce.execution.
Runner
(agent, environment, repeat_actions=1, history=None)¶ Bases:
object
Simple runner for non-realtime single-process execution.
-
reset
(history=None)¶
-
run
(timesteps=None, episodes=None, max_episode_timesteps=None, deterministic=False, episode_finished=None)¶ Runs the agent on the environment.
Parameters: - timesteps (int) – Max. number of total timesteps to run (across episodes).
- episodes (int) – Max. number of episodes to run.
- max_episode_timesteps (int) – Max. number of timesteps per episode.
- deterministic (bool) – If true, pick actions from model without exploration/sampling.
- episode_finished (callable) – Function handler taking a
Runner
argument and returning a boolean indicating whether to continue execution. For instance, useful for reporting intermediate performance or integrating termination conditions.
-
-
class
tensorforce.execution.
ThreadedRunner
(agents, environments, repeat_actions=1, save_path=None, save_episodes=None)¶ Bases:
object
Runner for non-realtime threaded execution of multiple agents.
-
run
(episodes=-1, max_episode_timesteps=-1, episode_finished=None, summary_report=None, summary_interval=0, max_timesteps=None)¶ Parameters: - episodes (int) – Max. number of episodes to run.
- max_episode_timesteps (int) – Max. number of timesteps per episode.
- episode_finished (callable) –
- summary_report (callable) – Function that produces a tensorboard summary update.
- summary_interval (int) –
- max_timesteps (int) – Deprecated; see max_episode_timesteps
-
tensorforce.models package¶
Submodules¶
tensorforce.models.constant_model module¶
-
class
tensorforce.models.constant_model.
ConstantModel
(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, action_values)¶ Bases:
tensorforce.models.model.Model
Utility class to return constant actions of a desired shape and with given bounds.
-
tf_actions_and_internals
(states, internals, update, deterministic)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tensorforce.models.distribution_model module¶
-
class
tensorforce.models.distribution_model.
DistributionModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)¶ Bases:
tensorforce.models.model.Model
Base class for models using distributions parametrized by a neural network.
-
create_distributions
()¶
-
static
get_distributions_summaries
(distributions)¶
-
static
get_distributions_variables
(distributions, include_non_trainable=False)¶
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_actions_and_internals
(states, internals, update, deterministic)¶
-
tf_kl_divergence
(states, internals, update)¶
-
tf_regularization_losses
(states, internals, update)¶
-
tensorforce.models.model module¶
The Model
class coordinates the creation and execution of all TensorFlow operations within a model.
It implements the reset
, act
and update
functions, which form the interface the Agent
class
communicates with, and which should not need to be overwritten. Instead, the following TensorFlow
functions need to be implemented:
- tf_actions_and_internals(states, internals, deterministic) returning the batch of actions and successor internal states.
- tf_loss_per_instance(states, internals, actions, terminal, reward) returning the loss per instance for a batch.
Further, the following TensorFlow functions should be extended accordingly:
- initialize(custom_getter) defining TensorFlow placeholders/functions and adding internal states.
- get_variables() returning the list of TensorFlow variables (to be optimized) of this model.
- tf_regularization_losses(states, internals) returning a dict of regularization losses.
- get_optimizer_kwargs(states, internals, actions, terminal, reward) returning a dict of potential arguments (argument-free functions) to the optimizer.
Finally, the following TensorFlow functions can be useful in some cases:
- preprocess_states(states) for state preprocessing, returning the processed batch of states.
- tf_action_exploration(action, exploration, action_spec) for action postprocessing (e.g. exploration), returning the processed batch of actions.
- tf_preprocess_reward(states, internals, terminal, reward) for reward preprocessing (e.g. reward normalization), returning the processed batch of rewards.
- create_output_operations(states, internals, actions, terminal, reward, deterministic) for further output operations, similar to the two above for Model.act and Model.update.
- tf_optimization(states, internals, actions, terminal, reward) for further optimization operations (e.g. the baseline update in a PGModel or the target network update in a QModel), returning a single grouped optimization operation.
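To make this contract concrete, here is a minimal sketch of a custom model that fills in the two required TensorFlow functions. It is an illustration only, not a model shipped with the library: it assumes the base constructor exposes the action specification as self.actions_spec, emits zero-valued scalar float actions, and uses a dummy per-instance loss (structurally similar to ConstantModel/RandomModel below).
# custom_model_sketch.py (hypothetical example)
import tensorflow as tf

from tensorforce.models import Model


class ZeroModel(Model):
    """Illustrative model returning zero-valued actions and a zero loss."""

    def tf_actions_and_internals(self, states, internals, update, deterministic):
        # Assumes self.actions_spec is available and all actions are scalar floats
        some_state = next(iter(states.values()))
        batch_size = tf.shape(some_state)[0]
        actions = {
            name: tf.zeros(shape=(batch_size,), dtype=tf.float32)
            for name in self.actions_spec
        }
        # This model keeps no internal states
        return actions, list()

    def tf_loss_per_instance(self, states, internals, actions, terminal, reward, update):
        # One (dummy) loss value per sample in the batch
        return tf.zeros_like(reward)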
-
class
tensorforce.models.model.
Model
(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)¶ Bases:
object
Base class for all (TensorFlow-based) models.
-
act
(states, internals, deterministic=False)¶ Does a forward pass through the model to retrieve action outputs given the state inputs (and internal states, if applicable, e.g. for RNNs).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of incoming internal state tensors.
- deterministic (bool) – If True, will not apply exploration after actions are calculated.
Returns: - Actual action-outputs (batched if state input is a batch).
Return type: tuple
-
close
()¶
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶ Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration should be skipped when actions are calculated.
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶ Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular
fn_loss
returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict to be passed into the optimizer op (e.g. ‘minimize’) as kwargs.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the model
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the model.
Returns: List of variables.
-
initialize
(custom_getter)¶ Creates the TensorFlow placeholders and functions for this model. It also adds the internal state placeholders and initialization values to the model.
Parameters: custom_getter – The custom_getter_
object to use fortf.make_template
when creating TensorFlow functions.
-
observe
(terminal, reward)¶ Adds an observation (reward and is-terminal) to the model without updating its trainable variables.
Parameters: - terminal (bool) – Whether the episode has terminated.
- reward (float) – The observed reward value.
Returns: The value of the model-internal episode counter.
-
reset
()¶ Resets the model to its initial state on episode start.
Returns: Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors. Return type: tuple
-
restore
(directory=None, file=None)¶ Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters: - directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save
(directory=None, append_timestep=True)¶ Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path as given here.
Parameters: - directory – Optional checkpoint directory.
- append_timestep – Appends the current timestep to the checkpoint file if true.
Returns: Checkpoint path where the model was saved.
-
setup
()¶ Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.
-
tf_action_exploration
(action, exploration, action_spec)¶ Applies optional exploration to the action (post-processor for action outputs).
Parameters: - action (tf.Tensor) – The original output action tensor (to be post-processed).
- exploration (Exploration) – The Exploration object to use.
- action_spec (dict) – Dict specifying the action space.
Returns: The post-processed action output tensor.
-
tf_actions_and_internals
(states, internals, update, deterministic)¶ Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration should be skipped when actions are calculated.
Returns: - dict of output actions (with or without exploration applied (see
deterministic
)) - list of posterior internal state Tensors (empty for non-internal state models)
Return type: tuple
-
tf_discounted_cumulative_reward
(terminal, reward, discount=None, final_reward=0.0, horizon=0)¶ Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.
Example:
single rewards = 2.0, 1.0, 0.0, 0.5, 1.0, -1.0
terminal = False, False, False, False, True, False
gamma = 0.95
final_reward = 100.0 (only matters for the last episode (r=-1.0) as it has no terminal signal)
horizon = 3
output = 2.95, 1.45, 1.38, 1.45, 1.0, 94.0
Parameters: - terminal – Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one
True value. If its very last element is False (not terminating), the given
final_reward
value is assumed to follow the last value in the single rewards sequence (see below). - reward – Tensor (float) holding the sequence of single rewards. If the last element of
terminal
is False, an assumed last reward of the value of final_reward
will be used. - discount (float) – The discount factor (gamma). By default, take the Model’s discount factor.
- final_reward (float) – Reward value to use if last episode in sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
- horizon (int) – The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns: Discounted cumulative reward tensor with the same shape as reward.
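The following NumPy sketch reproduces the worked example above. It is a plain-Python illustration of the n-step discounting logic, not the library's TensorFlow implementation, and the helper name is hypothetical.
# discounted_reward_sketch.py (hypothetical example)
import numpy as np

def discounted_cumulative_reward(terminal, reward, discount=0.95, final_reward=0.0, horizon=0):
    n = len(reward)
    out = np.zeros(n)
    for t in range(n):
        total, gamma, steps, k = 0.0, 1.0, 0, t
        while True:
            total += gamma * reward[k]
            gamma *= discount
            steps += 1
            if terminal[k] or (horizon and steps >= horizon):
                break
            k += 1
            if k == n:
                # Sequence ends without a terminal signal: assume final_reward
                # follows the last single reward
                total += gamma * final_reward
                break
        out[t] = total
    return out

print(discounted_cumulative_reward(
    terminal=[False, False, False, False, True, False],
    reward=[2.0, 1.0, 0.0, 0.5, 1.0, -1.0],
    discount=0.95, final_reward=100.0, horizon=3
))
# -> approximately [2.95, 1.45, 1.38, 1.45, 1.0, 94.0]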
-
tf_loss
(states, internals, actions, terminal, reward, update)¶ Creates and returns the single loss Tensor representing the total loss for a batch, including the mean loss per sample and the regularization loss of the batch.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Single float-value loss tensor.
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor (first rank is the batch size -> one loss value per sample in the batch).
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: The optimization operation.
-
tf_preprocess_reward
(states, internals, terminal, reward)¶ Applies optional preprocessing to the reward.
-
tf_preprocess_states
(states)¶ Applies optional preprocessing to the states.
-
tf_regularization_losses
(states, internals, update)¶ Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict of regularization loss tensors (keys == different regularization types, e.g. ‘entropy’).
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶ Runs self.optimization in the session to update the Model’s parameters. Optionally also runs the
loss_per_instance
calculation and returns the result of that.Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- return_loss_per_instance (bool) – Whether to also run and return the
loss_per_instance
Tensor.
Returns: void or - if return_loss_per_instance is True - the value of the
loss_per_instance
Tensor.
-
tensorforce.models.pg_log_prob_model module¶
-
class
tensorforce.models.pg_log_prob_model.
PGLogProbModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing log likelihoods, e.g. VPG.
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tensorforce.models.pg_model module¶
-
class
tensorforce.models.pg_model.
PGModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the
tf_loss_per_instance
function, but requires subclasses to implementtf_pg_loss_per_instance
.-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.
Parameters: - states – Dict of state tensors.
- internals – List of prior internal state tensors.
- actions – Dict of action tensors.
- terminal – Terminal boolean tensor.
- reward – Reward tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor.
-
tf_regularization_losses
(states, internals, update)¶
-
tf_reward_estimation
(states, internals, terminal, reward, update)¶
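To make tf_pg_loss_per_instance more concrete: for the log-likelihood variant (PGLogProbModel above), the per-instance loss is conventionally the negative action log-probability weighted by the estimated reward/advantage. The sketch below states that convention in plain TensorFlow; it is not the library's implementation, and both inputs are placeholders.
# Hedged sketch of a vanilla policy-gradient per-instance loss
import tensorflow as tf

def pg_log_prob_loss_sketch(log_prob, advantage):
    # log_prob: log pi(action | state) per sample
    # advantage: estimated reward/advantage per sample (e.g. from tf_reward_estimation)
    # Minimizing this loss maximizes the advantage-weighted log-likelihood
    return -log_prob * advantage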
-
tensorforce.models.pg_prob_ratio_model module¶
-
class
tensorforce.models.pg_prob_ratio_model.
PGProbRatioModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
-
get_optimizer_kwargs
(states, actions, terminal, reward, internals, update)¶
-
initialize
(custom_getter)¶
-
tf_compare
(states, internals, actions, terminal, reward, update, reference)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_reference
(states, internals, actions, update)¶
-
tensorforce.models.q_demo_model module¶
-
class
tensorforce.models.q_demo_model.
QDemoModel
(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)¶ Bases:
tensorforce.models.q_model.QModel
Model for deep Q-learning from demonstration. Principal structure similar to double deep Q-networks but uses additional loss terms for demo data.
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶
-
demonstration_update
(states, internals, actions, terminal, reward)¶
-
initialize
(custom_getter)¶
-
tf_demo_loss
(states, actions, terminal, reward, internals, update)¶
-
tf_demo_optimization
(states, internals, actions, terminal, reward, update)¶
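For context, the demonstration-specific term that QDemoModel adds is, in the spirit of the Deep Q-learning from Demonstrations (DQfD) approach, a large-margin supervised loss. The sketch below is one interpretation of how expert_margin and supervised_weight could enter such a term, not the library's exact implementation.
# Hedged sketch of a DQfD-style large-margin supervised loss term
import tensorflow as tf

def demo_loss_sketch(q_values, expert_action_one_hot, expert_margin, supervised_weight):
    # Margin is added for every action except the demonstrated (expert) one
    margin = expert_margin * (1.0 - expert_action_one_hot)
    # Q-value of the demonstrated action
    expert_q = tf.reduce_sum(q_values * expert_action_one_hot, axis=-1)
    # max_a [Q(s, a) + margin(a)] - Q(s, a_expert)
    supervised = tf.reduce_max(q_values + margin, axis=-1) - expert_q
    return supervised_weight * supervised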
-
tensorforce.models.q_model module¶
-
class
tensorforce.models.q_model.
QModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Q-value model.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶ Creates the deltas (or advantage) of the Q values.
Returns: A list of deltas per action
-
tf_q_value
(embedding, distr_params, action, name)¶
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶
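For orientation, the one-step temporal-difference delta that tf_q_delta above is concerned with can be sketched as follows. This is an assumption-laden simplification: it treats next_q_value as the already-selected (target-network) successor value and ignores Huber-loss and double-Q details.
# Hedged sketch of a one-step Q-learning delta (not the library's exact code)
import tensorflow as tf

def q_delta_sketch(q_value, next_q_value, terminal, reward, discount=0.99):
    # Zero out the bootstrap term where the episode has terminated
    zeros = tf.zeros_like(reward)
    bootstrapped = tf.where(terminal, zeros, discount * next_q_value)
    # TD target minus current estimate
    return reward + bootstrapped - q_value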
-
tensorforce.models.q_naf_model module¶
-
class
tensorforce.models.q_naf_model.
QNAFModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_q_value
(embedding, distr_params, action, name)¶
-
tf_regularization_losses
(states, internals, update)¶
-
tensorforce.models.q_nstep_model module¶
-
class
tensorforce.models.q_nstep_model.
QNstepModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶
-
tensorforce.models.random_model module¶
-
class
tensorforce.models.random_model.
RandomModel
(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)¶ Bases:
tensorforce.models.model.Model
Utility class to return random actions of a desired shape and with given bounds.
-
tf_actions_and_internals
(states, internals, update, deterministic)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
Module contents¶
-
class
tensorforce.models.
Model
(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)¶ Bases:
object
Base class for all (TensorFlow-based) models.
-
act
(states, internals, deterministic=False)¶ Does a forward pass through the model to retrieve action outputs given the state inputs (and internal states, if applicable, e.g. for RNNs).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of incoming internal state tensors.
- deterministic (bool) – If True, will not apply exploration after actions are calculated.
Returns: - Actual action-outputs (batched if state input is a batch).
Return type: tuple
-
close
()¶
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶ Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration should be skipped when actions are calculated.
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶ Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular
fn_loss
returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict to be passed into the optimizer op (e.g. ‘minimize’) as kwargs.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the model
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the model.
Returns: List of variables.
-
initialize
(custom_getter)¶ Creates the TensorFlow placeholders and functions for this model. It also adds the internal state placeholders and initialization values to the model.
Parameters: custom_getter – The custom_getter_
object to use fortf.make_template
when creating TensorFlow functions.
-
observe
(terminal, reward)¶ Adds an observation (reward and is-terminal) to the model without updating its trainable variables.
Parameters: - terminal (bool) – Whether the episode has terminated.
- reward (float) – The observed reward value.
Returns: The value of the model-internal episode counter.
-
reset
()¶ Resets the model to its initial state on episode start.
Returns: Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors. Return type: tuple
-
restore
(directory=None, file=None)¶ Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters: - directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save
(directory=None, append_timestep=True)¶ Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path as given here.
Parameters: - directory – Optional checkpoint directory.
- append_timestep – Appends the current timestep to the checkpoint file if true.
Returns: Checkpoint path where the model was saved.
-
setup
()¶ Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.
-
tf_action_exploration
(action, exploration, action_spec)¶ Applies optional exploration to the action (post-processor for action outputs).
Parameters: - action (tf.Tensor) – The original output action tensor (to be post-processed).
- exploration (Exploration) – The Exploration object to use.
- action_spec (dict) – Dict specifying the action space.
Returns: The post-processed action output tensor.
-
tf_actions_and_internals
(states, internals, update, deterministic)¶ Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration should be skipped when actions are calculated.
Returns: - dict of output actions (with or without exploration applied (see
deterministic
)) - list of posterior internal state Tensors (empty for non-internal state models)
Return type: tuple
-
tf_discounted_cumulative_reward
(terminal, reward, discount=None, final_reward=0.0, horizon=0)¶ Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.
Example:
single rewards = 2.0, 1.0, 0.0, 0.5, 1.0, -1.0
terminal = False, False, False, False, True, False
gamma = 0.95
final_reward = 100.0 (only matters for the last episode (r=-1.0) as it has no terminal signal)
horizon = 3
output = 2.95, 1.45, 1.38, 1.45, 1.0, 94.0
Parameters: - terminal – Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one
True value. If its very last element is False (not terminating), the given
final_reward
value is assumed to follow the last value in the single rewards sequence (see below). - reward – Tensor (float) holding the sequence of single rewards. If the last element of
terminal
is False, an assumed last reward of the value of final_reward
will be used. - discount (float) – The discount factor (gamma). By default, take the Model’s discount factor.
- final_reward (float) – Reward value to use if last episode in sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
- horizon (int) – The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns: Discounted cumulative reward tensor with the same shape as reward.
-
tf_loss
(states, internals, actions, terminal, reward, update)¶ Creates and returns the single loss Tensor representing the total loss for a batch, including the mean loss per sample and the regularization loss of the batch.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Single float-value loss tensor.
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor (first rank is the batch size -> one loss value per sample in the batch).
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: The optimization operation.
-
tf_preprocess_reward
(states, internals, terminal, reward)¶ Applies optional preprocessing to the reward.
-
tf_preprocess_states
(states)¶ Applies optional preprocessing to the states.
-
tf_regularization_losses
(states, internals, update)¶ Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict of regularization loss tensors (keys == different regularization types, e.g. ‘entropy’).
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶ Runs self.optimization in the session to update the Model’s parameters. Optionally also runs the
loss_per_instance
calculation and returns the result of that.Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- return_loss_per_instance (bool) – Whether to also run and return the
loss_per_instance
Tensor.
Returns: void or - if return_loss_per_instance is True - the value of the
loss_per_instance
Tensor.
-
-
class
tensorforce.models.
DistributionModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)¶ Bases:
tensorforce.models.model.Model
Base class for models using distributions parametrized by a neural network.
-
create_distributions
()¶
-
static
get_distributions_summaries
(distributions)¶
-
static
get_distributions_variables
(distributions, include_non_trainable=False)¶
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_actions_and_internals
(states, internals, update, deterministic)¶
-
tf_kl_divergence
(states, internals, update)¶
-
tf_regularization_losses
(states, internals, update)¶
-
-
class
tensorforce.models.
PGModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the
tf_loss_per_instance
function, but requires subclasses to implementtf_pg_loss_per_instance
.-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.
Parameters: - states – Dict of state tensors.
- internals – List of prior internal state tensors.
- actions – Dict of action tensors.
- terminal – Terminal boolean tensor.
- reward – Reward tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor.
-
tf_regularization_losses
(states, internals, update)¶
-
tf_reward_estimation
(states, internals, terminal, reward, update)¶
-
-
class
tensorforce.models.
PGProbRatioModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
-
get_optimizer_kwargs
(states, actions, terminal, reward, internals, update)¶
-
initialize
(custom_getter)¶
-
tf_compare
(states, internals, actions, terminal, reward, update, reference)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_reference
(states, internals, actions, update)¶
-
-
class
tensorforce.models.
PGLogProbModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing log likelihoods, e.g. VPG.
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
-
class
tensorforce.models.
QModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Q-value model.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶ Creates the deltas (or advantage) of the Q values.
Returns: A list of deltas per action
-
tf_q_value
(embedding, distr_params, action, name)¶
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶
-
-
class
tensorforce.models.
QNstepModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶
-
-
class
tensorforce.models.
QNAFModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_q_value
(embedding, distr_params, action, name)¶
-
tf_regularization_losses
(states, internals, update)¶
-
-
class
tensorforce.models.
QDemoModel
(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)¶ Bases:
tensorforce.models.q_model.QModel
Model for deep Q-learning from demonstration. Principal structure similar to double deep Q-networks but uses additional loss terms for demo data.
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶
-
demonstration_update
(states, internals, actions, terminal, reward)¶
-
initialize
(custom_getter)¶
-
tf_demo_loss
(states, actions, terminal, reward, internals, update)¶
-
tf_demo_optimization
(states, internals, actions, terminal, reward, update)¶
-
tensorforce.tests package¶
Submodules¶
tensorforce.tests.base_agent_test module¶
-
class
tensorforce.tests.base_agent_test.
BaseAgentTest
¶ Bases:
tensorforce.tests.base_test.BaseTest
Base class for tests of fundamental Agent functionality, i.e. various action types and shapes and internal states.
-
config
= None¶
-
exclude_bool
= False¶
-
exclude_bounded
= False¶
-
exclude_float
= False¶
-
exclude_int
= False¶
-
exclude_lstm
= False¶
-
exclude_multi
= False¶
-
multi_config
= None¶
-
test_bool
()¶ Tests the case of one boolean action.
-
test_bounded_float
()¶ Tests the case of one bounded float action, i.e. with min and max value.
-
test_float
()¶ Tests the case of one float action.
-
test_int
()¶ Tests the case of one integer action.
-
test_lstm
()¶ Tests the case of using internal states via an LSTM layer (for one integer action).
-
test_multi
()¶ Tests the case of multiple actions of different type and shape.
-
tensorforce.tests.base_test module¶
-
class
tensorforce.tests.base_test.
BaseTest
¶ Bases:
object
Base class for tests of Agent functionality.
-
agent
= None¶
-
base_test_pass
(name, environment, network_spec, **kwargs)¶ Basic test loop; requires the Agent to achieve a certain performance on an environment.
-
base_test_run
(name, environment, network_spec, **kwargs)¶ Run test; only checks whether the algorithm can run and update without compilation errors, not whether it passes.
-
deterministic
= None¶
-
pass_threshold
= 0.8¶
-
pre_run
(agent, environment)¶ Called before
Runner.run
.
-
requires_network
= True¶
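A new agent test typically combines BaseAgentTest with unittest.TestCase and overrides the class attributes documented above, as the concrete test classes below do. A minimal hypothetical sketch:
# test_my_agent_sketch.py (hypothetical example)
import unittest

from tensorforce.agents import VPGAgent
from tensorforce.tests.base_agent_test import BaseAgentTest


class TestMyVPGVariant(BaseAgentTest, unittest.TestCase):
    # Agent class under test and a placeholder configuration
    agent = VPGAgent
    config = dict(batch_size=8)
    deterministic = False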
-
tensorforce.tests.test_constant_agent module¶
-
class
tensorforce.tests.test_constant_agent.
TestConstantAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
ConstantAgent
-
config
= {'action_values': {'action': 1.0}}¶
-
deterministic
= False¶
-
exclude_bool
= True¶
-
exclude_bounded
= True¶
-
exclude_int
= True¶
-
exclude_lstm
= True¶
-
exclude_multi
= True¶
-
requires_network
= False¶
-
tensorforce.tests.test_ddqn_agent module¶
-
class
tensorforce.tests.test_ddqn_agent.
TestDDQNAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
DDQNAgent
-
config
= {'optimizer': {'learning_rate': 0.002, 'type': 'adam'}, 'repeat_update': 4, 'memory': {'capacity': 1000, 'type': 'replay'}, 'target_sync_frequency': 10, 'first_update': 64, 'batch_size': 32}¶
-
deterministic
= True¶
-
exclude_bounded
= True¶
-
exclude_float
= True¶
-
multi_config
= {'optimizer': {'learning_rate': 0.01, 'type': 'adam'}, 'repeat_update': 1, 'memory': {'capacity': 1000, 'type': 'replay'}, 'target_sync_frequency': 10, 'first_update': 16, 'batch_size': 16}¶
-
tensorforce.tests.test_dqfd_agent module¶
-
class
tensorforce.tests.test_dqfd_agent.
TestDQFDAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
DQFDAgent
-
config
= {'demo_sampling_ratio': 0.2, 'memory': {'capacity': 1000, 'type': 'replay'}, 'target_sync_frequency': 10, 'first_update': 10, 'demo_memory_capacity': 100, 'batch_size': 8}¶
-
deterministic
= True¶
-
exclude_bounded
= True¶
-
exclude_float
= True¶
-
multi_config
= {'optimizer': {'learning_rate': 0.01, 'type': 'adam'}, 'target_sync_frequency': 10, 'first_update': 16, 'demo_memory_capacity': 100, 'batch_size': 16, 'repeat_update': 1, 'memory': {'capacity': 1000, 'type': 'replay'}, 'demo_sampling_ratio': 0.2}¶
-
pre_run
(agent, environment)¶
-
tensorforce.tests.test_dqn_agent module¶
-
class
tensorforce.tests.test_dqn_agent.
TestDQNAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
DQNAgent
-
config
= {'optimizer': {'learning_rate': 0.002, 'type': 'adam'}, 'repeat_update': 4, 'memory': {'capacity': 1000, 'type': 'replay'}, 'target_sync_frequency': 10, 'first_update': 64, 'batch_size': 32}¶
-
deterministic
= True¶
-
exclude_bounded
= True¶
-
exclude_float
= True¶
-
multi_config
= {'optimizer': {'learning_rate': 0.01, 'type': 'adam'}, 'repeat_update': 1, 'memory': {'capacity': 1000, 'type': 'replay'}, 'target_sync_frequency': 10, 'first_update': 16, 'batch_size': 16}¶
-
tensorforce.tests.test_dqn_memories module¶
tensorforce.tests.test_dqn_nstep_agent module¶
-
class
tensorforce.tests.test_dqn_nstep_agent.
TestDQNNstepAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
DQNNstepAgent
-
config
= {'optimizer': {'learning_rate': 0.01, 'type': 'adam'}, 'batch_size': 8}¶
-
deterministic
= True¶
-
exclude_bounded
= True¶
-
exclude_float
= True¶
-
exclude_multi
= True¶
-
tensorforce.tests.test_naf_agent module¶
-
class
tensorforce.tests.test_naf_agent.
TestNAFAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
NAFAgent
-
config
= {'optimizer': {'learning_rate': 0.001, 'type': 'adam'}, 'repeat_update': 4, 'memory': {'capacity': 1000, 'type': 'replay'}, 'explorations_spec': {'type': 'ornstein_uhlenbeck'}, 'target_sync_frequency': 10, 'first_update': 8, 'batch_size': 8}¶
-
deterministic
= True¶
-
exclude_bool
= True¶
-
exclude_bounded
= True¶
-
exclude_int
= True¶
-
exclude_lstm
= True¶
-
exclude_multi
= True¶
-
tensorforce.tests.test_ppo_agent module¶
-
class
tensorforce.tests.test_ppo_agent.
TestPPOAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
PPOAgent
-
config
= {'batch_size': 8}¶
-
deterministic
= False¶
-
multi_config
= {'step_optimizer': {'learning_rate': 0.001, 'type': 'adam'}, 'batch_size': 32}¶
-
tensorforce.tests.test_quickstart_example module¶
tensorforce.tests.test_random_agent module¶
tensorforce.tests.test_reward_estimation module¶
tensorforce.tests.test_trpo_agent module¶
-
class
tensorforce.tests.test_trpo_agent.
TestTRPOAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
TRPOAgent
-
config
= {'learning_rate': 0.005, 'batch_size': 16}¶
-
deterministic
= False¶
-
multi_config
= {'learning_rate': 0.1, 'batch_size': 64}¶
-
tensorforce.tests.test_tutorial_code module¶
-
class
tensorforce.tests.test_tutorial_code.
TestTutorialCode
(methodName='runTest')¶ Bases:
unittest.case.TestCase
Validation of code snippets from blog posts, so that we are notified when old posts need to be updated.
-
test_blogpost_introduction
()¶ Test of introduction blog post examples.
-
test_blogpost_introduction_runner
()¶
-
test_reinforceio_homepage
()¶ Code example from the homepage and README.md.
-
tensorforce.tests.test_vpg_agent module¶
-
class
tensorforce.tests.test_vpg_agent.
TestVPGAgent
(methodName='runTest')¶ Bases:
tensorforce.tests.base_agent_test.BaseAgentTest
,unittest.case.TestCase
-
agent
¶ alias of
VPGAgent
-
config
= {'batch_size': 8}¶
-
deterministic
= False¶
-
multi_config
= {'optimizer': {'learning_rate': 0.01, 'type': 'adam'}, 'batch_size': 64}¶
-
tensorforce.tests.test_vpg_baselines module¶
-
class
tensorforce.tests.test_vpg_baselines.
TestVPGBaselines
(methodName='runTest')¶ Bases:
tensorforce.tests.base_test.BaseTest
,unittest.case.TestCase
-
agent
¶ alias of
VPGAgent
-
deterministic
= False¶
-
test_baseline_no_optimizer
()¶
-
test_gae_baseline
()¶
-
test_multi_baseline
()¶
-
test_network_baseline
()¶
-
test_states_baseline
()¶
-
tensorforce.tests.test_vpg_optimizers module¶
-
class
tensorforce.tests.test_vpg_optimizers.
TestVPGOptimizers
(methodName='runTest')¶ Bases:
tensorforce.tests.base_test.BaseTest
,unittest.case.TestCase
-
agent
¶ alias of
VPGAgent
-
deterministic
= False¶
-
test_adam
()¶
-
test_clipped_step
()¶
-
test_evolutionary
()¶
-
test_multi_step
()¶
-
test_natural_gradient
()¶
-
test_optimized_step
()¶
-
Module contents¶
Submodules¶
tensorforce.exception module¶
-
exception
tensorforce.exception.
TensorForceError
¶ Bases:
exceptions.Exception
TensorForce error
tensorforce.meta_parameter_recorder module¶
-
class
tensorforce.meta_parameter_recorder.
MetaParameterRecorder
(current_frame)¶ Bases:
object
Class to record MetaParameters as well as Summary/Description for TensorBoard (TEXT & FILE will come later)
General:
- format_type: configures the data conversion format (0 = TensorBoard; TEXT and JSON are not yet implemented).
-
build_metagraph_list
()¶ Convert MetaParams into TF Summary Format and create summary_op
Parameters: None. Returns: Merged TF op for TEXT summary elements; should only be executed once to reduce data duplication.
-
convert_data_to_string
(data, indent=0, format_type=0, separator=None, eol=None)¶
-
convert_dictionary_to_string
(data, indent=0, format_type=0, separator=None, eol=None)¶
-
convert_list_to_string
(data, indent=0, format_type=0, eol=None, count=True)¶
-
convert_ndarray_to_md
(data, format_type=0, eol=None)¶
-
merge_custom
(custom_dict)¶
-
text_output
(format_type=1)¶
tensorforce.util module¶
-
tensorforce.util.
cumulative_discount
(values, terminals, discount, cumulative_start=0.0)¶ Compute cumulative discounts.
Parameters: - values – Values to discount.
- terminals – Booleans indicating terminal states.
- discount – Discount factor.
- cumulative_start – Float or ndarray, estimated reward for state t + 1. Default 0.0.
Returns: The cumulative discounted rewards. Return type: discounted_values
-
tensorforce.util.
get_object
(obj, predefined_objects=None, default_object=None, kwargs=None)¶ Utility method to map some kind of object specification to its content, e.g. optimizer or baseline specifications to the respective classes.
Parameters: - obj – A specification dict (value for key ‘type’ optionally specifies the object, options as follows), a module path (e.g., my_module.MyClass), a key in predefined_objects, or a callable (e.g., the class type object).
- predefined_objects – Dict containing predefined set of objects, accessible via their key
- default_object – Default object if no other is specified.
- kwargs – Arguments for object creation
Returns: The retrieved object
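A brief usage sketch of the mapping described above, based only on the parameter descriptions; the example class and values are made up.
# get_object usage sketch (hypothetical example)
from tensorforce import util

class MyBaseline(object):
    def __init__(self, size=64):
        self.size = size

# Resolve 'custom' via the predefined-objects lookup and pass constructor kwargs
baseline = util.get_object(
    obj='custom',
    predefined_objects=dict(custom=MyBaseline),
    kwargs=dict(size=128)
)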
-
tensorforce.util.
np_dtype
(dtype)¶ Translates dtype specifications in configurations to numpy data types.
Parameters: dtype – String describing a numerical type (e.g. ‘float’) or numerical type primitive.
Returns: Numpy data type
-
tensorforce.util.
prod
(xs)¶ Computes the product over the elements of an iterable. Returns 1 for an empty iterable.
Parameters: xs – Iterable containing numbers. Returns: Product along iterable.
-
tensorforce.util.
rank
(x)¶
-
tensorforce.util.
shape
(x, unknown=-1)¶
-
tensorforce.util.
tf_dtype
(dtype)¶ Translates dtype specifications in configurations to TensorFlow data types.
Parameters: dtype – String describing a numerical type (e.g. ‘float’), numpy data type, or numerical type primitive. Returns: TensorFlow data type
More information¶
You can find more information at our TensorForce GitHub repository.
We have a separate repository for benchmarking our algorithm implementations at https://github.com/reinforceio/tensorforce-benchmark.