tensorforce.models package

Submodules

tensorforce.models.constant_model module

class tensorforce.models.constant_model.ConstantModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, action_values)

Bases: tensorforce.models.model.Model

Utility class to return constant actions of a desired shape and with given bounds.

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, action_values)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
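
A minimal usage sketch of calling act() on an already constructed model instance follows. The state-component name 'state', its values, and the exact layout of the returned tuple are assumptions and may differ between versions:

    # Hypothetical sketch: `model` is an already-built ConstantModel instance
    # and 'state' is an assumed state-component name.
    states = {'state': [0.1, -0.3, 0.7, 0.0]}   # one state-space component
    internals = {}                              # empty when no internal (RNN) state is used
    result = model.act(states=states, internals=internals, deterministic=True)
    actions = result[0]                         # action outputs come first in the returned tuple
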
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates output operations for acting, observing and interacting with the memory.

get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary of component name to component of all the components within this model.

Returns: (dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model.

Returns: List of summaries.
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules – Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable – Includes non-trainable variables if true.
Returns:

List of variables.

initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover, it adds the internal state placeholders and initialization values to the model.

Parameters: custom_getter – The custom_getter object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.
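
A hedged checkpoint round-trip sketch; the directory path is an assumption:

    # Hypothetical checkpoint round-trip; the directory path is an assumption.
    checkpoint_path = model.save(directory='/tmp/constant_model', append_timestep=False)
    # With append_timestep=False the checkpoint name stays stable, so the model
    # can later be restored from the same directory (or from the returned path).
    model.restore(directory='/tmp/constant_model')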

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_initialize()
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_preprocess(states, actions, reward)

tensorforce.models.distribution_model module

class tensorforce.models.distribution_model.DistributionModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, requires_deterministic)

Bases: tensorforce.models.memory_model.MemoryModel

Base class for models using distributions parametrized by a neural network.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, requires_deterministic)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.
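
A sketch of saving and restoring a single component using the COMPONENT_NETWORK constant defined above; the save path is an assumption:

    # Hypothetical component round-trip using the documented component name constant.
    from tensorforce.models.distribution_model import DistributionModel

    path = model.save_component(
        component_name=DistributionModel.COMPONENT_NETWORK,   # == 'network'
        save_path='/tmp/network_checkpoint'                   # assumed path
    )
    model.restore_component(
        component_name=DistributionModel.COMPONENT_NETWORK,
        save_path=path
    )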

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.
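
The recurrence this operation computes can be sketched in plain Python (an illustration of the documented semantics, not the TensorFlow implementation):

    # Plain-Python sketch of the recurrence (not the TensorFlow code): working
    # backwards, R_t = r_t + discount * R_{t+1}, where the running sum restarts at
    # terminals and is seeded with final_reward at the end of the sequence.
    def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0):
        cumulative = final_reward
        result = []
        for is_terminal, r in zip(reversed(terminal), reversed(reward)):
            if is_terminal:
                cumulative = 0.0
            cumulative = r + discount * cumulative
            result.append(cumulative)
        return list(reversed(result))

    # Example: rewards [1, 1, 1], discount 0.9, no terminal, final_reward 0.0
    # yields [2.71, 1.9, 1.0].
    discounted_cumulative_reward([False, False, False], [1.0, 1.0, 1.0], 0.9)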

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; if none are available, they can be fetched from the agent via agent.current_internals.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
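
A hedged sketch of importing a small off-policy batch via the corresponding import_experience() method; the component names, the per-key batch layout, and the empty internals value are assumptions:

    # Hypothetical off-policy import; 'state' and 'action' are assumed component names.
    model.import_experience(
        states={'state': [[0.1, 0.2], [0.3, 0.4]]},
        internals=[],            # empty when there is no internal (RNN) state to set
        actions={'action': [0, 1]},
        terminal=[False, True],
        reward=[1.0, 0.5],
    )
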
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the loss per batch instance.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss per instance tensor.

tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
Returns:

The optimization operation.

tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

tensorforce.models.model module

The Model class coordinates the creation and execution of all TensorFlow operations within a model. It implements the reset, act and update functions, which form the interface the Agent class communicates with, and which should not need to be overridden. Instead, the following TensorFlow functions need to be implemented:

  • tf_actions_and_internals(states, internals, deterministic) returning the batch of
    actions and successor internal states.
  • tf_loss_per_instance(states, internals, actions, terminal, reward) returning the loss
    per instance for a batch.

Further, the following TensorFlow functions should be extended accordingly:

  • initialize(custom_getter) defining TensorFlow placeholders/functions and adding internal states.
  • get_variables() returning the list of TensorFlow variables (to be optimized) of this model.
  • tf_regularization_losses(states, internals) returning a dict of regularization losses.
  • get_optimizer_kwargs(states, internals, actions, terminal, reward) returning a dict of potential
    arguments (argument-free functions) to the optimizer.

Finally, the following TensorFlow functions can be useful in some cases:

  • preprocess_states(states) for state preprocessing, returning the processed batch of states.
  • tf_action_exploration(action, exploration, action_spec) for action postprocessing (e.g. exploration),
    returning the processed batch of actions.
  • tf_preprocess_reward(states, internals, terminal, reward) for reward preprocessing (e.g. reward normalization),
    returning the processed batch of rewards.
  • create_output_operations(states, internals, actions, terminal, reward, deterministic) for further output operations,
    similar to the two above for Model.act and Model.update.
  • tf_optimization(states, internals, actions, terminal, reward) for further optimization operations
    (e.g. the baseline update in a PGModel or the target network update in a QModel), returning a single grouped optimization operation.
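
A schematic, heavily simplified sketch of the subclassing pattern described above; the toy action and loss definitions are placeholders and assumptions, not a working model:

    # Schematic sketch only: the single int action named 'action' and the toy loss
    # are assumptions, not a functional model implementation.
    import tensorflow as tf
    from tensorforce.models.model import Model

    class MyModel(Model):

        def tf_actions_and_internals(self, states, internals, deterministic):
            # Return a dict of action tensors plus the list of successor internal states.
            some_state = next(iter(states.values()))
            actions = {'action': tf.zeros(shape=tf.shape(some_state)[:1], dtype=tf.int32)}
            return actions, internals

        def tf_loss_per_instance(self, states, internals, actions, terminal, reward):
            # Toy per-instance loss: the negative reward.
            return -reward
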
class tensorforce.models.model.Model(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Bases: object

Base class for all (TensorFlow-based) models.

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Model.

Parameters:
  • states (spec) – The state-space description dictionary.
  • actions (spec) – The action-space description dictionary.
  • scope (str) – The root scope str to use for tf variable scoping.
  • device (str) – The name of the device to run the graph of this model on.
  • saver (spec) – Dict specifying whether and how to save the model’s parameters.
  • summarizer (spec) – Dict specifying which tensorboard summaries should be created and added to the graph.
  • execution (spec) – Dict specifying whether and how to do distributed training on the model’s graph.
  • batching_capacity (int) – Batching capacity.
  • variable_noise (float) – The standard deviation of a Normal distribution used for adding random noise to the model’s output (for each batch, noise can be toggled and, if active, is resampled). Use None to add no noise.
  • states_preprocessing (spec / dict of specs) – Dict specifying whether and how to preprocess state signals (e.g. normalization, greyscale, etc.).
  • actions_exploration (spec / dict of specs) – Dict specifying whether and how to add exploration to the model’s “action outputs” (e.g. epsilon-greedy).
  • reward_preprocessing (spec) – Dict specifying whether and how to preprocess rewards coming from the Environment (e.g. reward normalization).
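
The states and actions parameters expect spec dictionaries; a hypothetical example of what such specs might look like (the exact keys are assumptions and may differ across versions):

    # Hypothetical state- and action-space specs of the kind the `states` and
    # `actions` parameters describe; the exact keys are assumptions.
    states_spec = dict(shape=(4,), type='float')       # a single 4-dimensional float state
    actions_spec = dict(type='int', num_actions=2)     # a single discrete action with 2 choices
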
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates output operations for acting, observing and interacting with the memory.

get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary of component name to component of all the components within this model.

Returns: (dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model.

Returns: List of summaries.
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules – Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable – Includes non-trainable variables if true.
Returns:

List of variables.

initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover, it adds the internal state placeholders and initialization values to the model.

Parameters: custom_getter – The custom_getter object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.
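
A hedged sketch of how act() and observe() interleave over one episode; the environment object, its execute() return order, and the 'state' key are assumptions, not part of this API:

    # Hypothetical act/observe loop; `env` and its execute() signature are assumed.
    states = {'state': env.reset()}
    internals = {}
    terminal = False
    while not terminal:
        result = model.act(states=states, internals=internals)
        actions = result[0]                    # action outputs come first in the returned tuple
        next_state, terminal, reward = env.execute(actions)
        episode = model.observe(terminal=terminal, reward=reward)
        states = {'state': next_state}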

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) – Dict of state tensors (each key represents one state space component).
  • internals – List of prior internal state tensors.
  • deterministic – Boolean tensor indicating whether action should be chosen deterministically.
Returns:

  1. dict of output actions (with or without exploration applied, depending on deterministic)
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple

tf_initialize()
tf_observe_timestep(states, internals, actions, terminal, reward)

Creates the TensorFlow operations for performing the observation of a full time step’s information.

Parameters:
  • states (dict) – Dict of state tensors (each key represents one state space component).
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
Returns:

The observation operation.

tf_preprocess(states, actions, reward)

tensorforce.models.pg_log_prob_model module

class tensorforce.models.pg_log_prob_model.PGLogProbModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing log likelihoods, e.g. VPG.
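
A generic sketch of the per-instance objective this family of models is built around (REINFORCE/VPG-style); this is an illustration, not the library's actual implementation:

    # Generic log-likelihood policy-gradient loss per instance (a sketch, not the
    # library code): loss_i = -log pi(a_i | s_i) * R_i for an estimated return R_i.
    import tensorflow as tf

    def pg_log_prob_loss_per_instance(log_prob, estimated_reward):
        # Gradients flow only through the log-probability; the reward estimate is
        # treated as a constant.
        return -log_prob * tf.stop_gradient(estimated_reward)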

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; if none are available, they can be fetched from the agent via agent.current_internals.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)

tensorforce.models.pg_model module

class tensorforce.models.pg_model.PGModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)

Bases: tensorforce.models.distribution_model.DistributionModel

Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the tf_loss_per_instance function, but requires subclasses to implement tf_pg_loss_per_instance.
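
The gae_lambda constructor argument refers to generalized advantage estimation; a plain-NumPy sketch of the standard GAE recursion follows (a generic illustration, not the library's tf_reward_estimation implementation):

    # Generic GAE sketch (NumPy); not the library's TensorFlow code.
    import numpy as np

    def generalized_advantage_estimation(reward, value, terminal, discount, gae_lambda):
        # `value` holds V(s_t) for each step plus one bootstrap value for the final state.
        advantage = np.zeros(len(reward))
        last = 0.0
        for t in reversed(range(len(reward))):
            non_terminal = 0.0 if terminal[t] else 1.0
            delta = reward[t] + discount * value[t + 1] * non_terminal - value[t]
            last = delta + discount * gae_lambda * non_terminal * last
            advantage[t] = last
        return advantage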

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; if none are available, they can be fetched from the agent via agent.current_internals.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the loss per batch instance.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss per instance tensor.

tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)

tensorforce.models.pg_prob_ratio_model module

class tensorforce.models.pg_prob_ratio_model.PGProbRatioModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
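
A generic sketch of the clipped probability-ratio surrogate that the likelihood_ratio_clipping constructor argument refers to (PPO-style); this is an illustration, not the library's implementation:

    # Generic clipped-ratio objective sketch (not the library code).
    import tensorflow as tf

    def clipped_ratio_loss_per_instance(log_prob, old_log_prob, advantage, clipping=0.2):
        ratio = tf.exp(log_prob - tf.stop_gradient(old_log_prob))
        clipped_ratio = tf.clip_by_value(ratio, 1.0 - clipping, 1.0 + clipping)
        # Take the more pessimistic of the two surrogates and negate it as a loss.
        return -tf.minimum(ratio * advantage, clipped_ratio * advantage)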

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; if none are available, they can be fetched from the agent via agent.current_internals.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)
tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)

tensorforce.models.q_demo_model module

class tensorforce.models.q_demo_model.QDemoModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, expert_margin, supervised_weight, demo_memory_capacity, demo_batch_size)

Bases: tensorforce.models.q_model.QModel

Model for deep Q-learning from demonstration (DQfD). Its principal structure is similar to double deep Q-networks, but it uses additional loss terms for demonstration data.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, expert_margin, supervised_weight, demo_memory_capacity, demo_batch_size)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve the action outputs for the given state inputs (and internal state inputs, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type: tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
demo_update()

Performs a demonstration update by calling the demo optimization operation. Note that the batch data does not have to be fetched from the demo memory beforehand, since sampling it is part of the demo update’s TensorFlow operation.
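
A hedged pre-training sketch combining import_demo_experience() and demo_update(); the demo_* variables and the number of update steps are placeholders:

    # Hypothetical demonstration pre-training; the demo_* variables and the step
    # count are assumptions.
    model.import_demo_experience(
        states=demo_states, internals=demo_internals, actions=demo_actions,
        terminal=demo_terminal, reward=demo_reward,
    )
    for _ in range(10000):      # arbitrary number of supervised pre-training steps
        model.demo_update()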

get_component(component_name)

Looks up a component by its name.

Parameters: component_name – The name of the component to look up.
Returns: The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all components of this model that can be individually saved and restored, for instance the network or distribution.

Returns: List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Returns: List of variables.
import_demo_experience(states, internals, actions, terminal, reward)

Stores demonstrations in the demo memory.

import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns: Current episode, timestep counter, and the shallow-copied list of internal state initialization Tensors.
Return type: tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.
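
As a simplified illustration of what such a post-processor typically does (epsilon-greedy for discrete actions, additive clipped noise for continuous ones), here is a plain NumPy sketch; the spec keys and default values are assumptions, and the actual behaviour depends on the configured Exploration object:

    import numpy as np

    def explore(action, epsilon=0.1, noise_std=0.3, action_spec=None):
        """Post-process a greedy action according to a (simplified) action spec."""
        spec = action_spec or {'type': 'int', 'num_actions': 4}
        if spec['type'] == 'int':
            # Epsilon-greedy: occasionally replace the action by a uniform sample.
            if np.random.rand() < epsilon:
                return np.random.randint(spec['num_actions'])
            return action
        # Continuous case: add Gaussian noise and clip to the declared bounds.
        noisy = action + np.random.normal(scale=noise_std, size=np.shape(action))
        return np.clip(noisy, spec.get('min_value', -1.0), spec.get('max_value', 1.0))

    print(explore(2))                                               # discrete action
    print(explore(np.array([0.4]), action_spec={'type': 'float'}))  # continuous action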

tf_actions_and_internals(states, internals, deterministic)
tf_combined_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Combines Q-loss and demo loss.

tf_demo_loss(states, actions, terminal, reward, internals, update, reference=None)

Extends the Q-model loss with the DQfD large-margin loss.
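
For reference, the DQfD large-margin supervised term has the form max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where l is a positive margin for every action other than the demonstrated action a_E. A minimal NumPy sketch of that term (illustration only, not the model’s TensorFlow implementation; the margin value is an arbitrary choice):

    import numpy as np

    def large_margin_demo_loss(q_values, demo_actions, margin=0.8):
        """DQfD-style supervised loss over a batch of Q-values.

        q_values: array of shape (batch, num_actions)
        demo_actions: integer array of shape (batch,) with the demonstrated actions
        margin: penalty added to every non-demonstrated action
        """
        batch, num_actions = q_values.shape
        # Margin l(a_E, a): 0 for the demonstrated action, `margin` otherwise.
        margins = np.full((batch, num_actions), margin)
        margins[np.arange(batch), demo_actions] = 0.0
        # max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), averaged over the batch.
        q_demo = q_values[np.arange(batch), demo_actions]
        return np.mean(np.max(q_values + margins, axis=1) - q_demo)

    q = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.3]])
    print(large_margin_demo_loss(q, demo_actions=np.array([1, 2])))  # (0.0 + 0.7) / 2 = 0.35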

tf_demo_optimization(states, internals, actions, terminal, reward, next_states, next_internals)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.
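
For intuition, a minimal NumPy reimplementation of the backward recursion this operation performs (illustration only): the return is accumulated from the end of the sequence, reset at terminal steps, and seeded with final_reward for bootstrapping.

    import numpy as np

    def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0):
        """Backward recursion R_t = r_t + discount * R_{t+1}, reset at terminals."""
        returns = np.zeros(len(reward))
        running = final_reward  # bootstrap value beyond the last step
        for t in reversed(range(len(reward))):
            if terminal[t]:
                running = 0.0   # no future reward beyond a terminal step
            running = reward[t] + discount * running
            returns[t] = running
        return returns

    # Two short episodes in one sequence, discount = 0.9:
    print(discounted_cumulative_reward(
        terminal=[False, True, False, False],
        reward=[1.0, 2.0, 0.5, 1.0],
        discount=0.9,
    ))
    # -> [2.8, 2.0, 1.4, 1.0]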

tf_import_demo_experience(states, internals, actions, terminal, reward)

Imports a single demonstration experience into the demo memory.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
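
As a rough illustration of the expected batch layout (component names and shapes here are hypothetical; they simply mirror the parameter descriptions above):

    import numpy as np

    # A hypothetical batch of 3 off-policy transitions for a model with one state
    # component named 'state' (shape (4,)) and one discrete action component 'action'.
    batch = dict(
        states=dict(state=np.random.randn(3, 4).astype(np.float32)),
        internals=[],                                # no internal states (feed-forward network)
        actions=dict(action=np.array([0, 2, 1])),    # discrete action indices
        terminal=np.array([False, False, True]),
        reward=np.array([0.1, -0.2, 1.0], dtype=np.float32),
    )
    # The keys of `states` / `actions` must match the model's state and action
    # space component names; values are batched along the first axis.
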
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantage) of the Q values.

Returns:A list of deltas per action
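
As a rough guide, the delta is the one-step temporal-difference error: the target r + discount * (1 - terminal) * Q_target(s', a') minus the current estimate Q(s, a). A simplified NumPy sketch (illustration only; in the model the discount comes from its configuration rather than an argument):

    import numpy as np

    def q_delta(q_value, next_q_value, terminal, reward, discount=0.99):
        """One-step TD error; the next-state value is masked out at terminal steps."""
        terminal = np.asarray(terminal, dtype=np.float64)
        target = reward + discount * (1.0 - terminal) * next_q_value
        return target - q_value

    print(q_delta(q_value=np.array([1.0, 0.5]),
                  next_q_value=np.array([2.0, 3.0]),
                  terminal=[False, True],
                  reward=np.array([0.0, 1.0])))
    # -> [0.98, 0.5]
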
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

tensorforce.models.q_model module

class tensorforce.models.q_model.QModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.distribution_model.DistributionModel

Q-value model.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
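
The COMPONENT_* constants above are the names accepted by get_component, save_component and restore_component. A hedged usage sketch (the model object is a placeholder for an already constructed and set-up QModel; in practice the model is usually owned by an agent):

    from tensorforce.models.q_model import QModel

    def checkpoint_network(model, path='/tmp/qmodel-network'):
        # Save only the online network's parameters ...
        save_path = model.save_component(QModel.COMPONENT_NETWORK, path)
        # ... and restore them later, e.g. into a freshly built model.
        model.restore_component(QModel.COMPONENT_NETWORK, save_path)
        return save_path
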
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantage) of the Q values.

Returns:A list of deltas per action
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

tensorforce.models.q_naf_model module

class tensorforce.models.q_naf_model.QNAFModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.q_model.QModel

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantage) of the Q values.

Returns:A list of deltas per action
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

tensorforce.models.q_nstep_model module

class tensorforce.models.q_nstep_model.QNstepModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.q_model.QModel

Deep Q-network using n-step rewards, as described in Asynchronous Methods for Deep Reinforcement Learning (Mnih et al., 2016).
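
In an n-step Q-model the bootstrap target uses the discounted sum of the next n rewards plus the discounted value of the state reached after n steps, instead of a single-step backup. A small NumPy sketch of that target (illustration only; values are hypothetical):

    import numpy as np

    def n_step_target(rewards, bootstrap_value, discount=0.99):
        """Target for the first state of an n-step segment:
        r_0 + g*r_1 + ... + g^(n-1)*r_(n-1) + g^n * max_a Q_target(s_n, a)."""
        n = len(rewards)
        discounts = discount ** np.arange(n)
        return float(np.sum(discounts * rewards) + discount ** n * bootstrap_value)

    # 3-step segment with discount 0.9 and a bootstrap value of 2.0:
    print(n_step_target(np.array([1.0, 0.0, 1.0]), bootstrap_value=2.0, discount=0.9))
    # 1 + 0 + 0.81 + 0.729 * 2 = 3.268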

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

tensorforce.models.random_model module

class tensorforce.models.random_model.RandomModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity)

Bases: tensorforce.models.model.Model

Utility class to return random actions of a desired shape and with given bounds.
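
A rough illustration of what returning random actions of a desired shape and with given bounds amounts to, for a simplified action spec (the spec keys follow common conventions but are assumptions here):

    import numpy as np

    def random_action(action_spec):
        """Sample one action uniformly from a (simplified) action-space spec."""
        shape = action_spec.get('shape', ())
        if action_spec['type'] == 'int':
            return np.random.randint(action_spec['num_actions'], size=shape)
        if action_spec['type'] == 'bool':
            return np.random.rand(*shape) < 0.5
        low = action_spec.get('min_value', -1.0)
        high = action_spec.get('max_value', 1.0)
        return np.random.uniform(low, high, size=shape)

    print(random_action({'type': 'int', 'num_actions': 4}))
    print(random_action({'type': 'float', 'shape': (2,), 'min_value': -2.0, 'max_value': 2.0}))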

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates output operations for acting, observing and interacting with the memory.

get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary of component name to component of all the components within this model.

Returns:(dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model

Returns:List of summaries
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules – Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable – Includes non-trainable variables if true.
Returns:

List of variables.

initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover adds the internal state placeholders and initialization values to the model.

Parameters:custom_getter – The custom_getter_ object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_initialize()
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_preprocess(states, actions, reward)

Module contents

class tensorforce.models.Model(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Bases: object

Base class for all (TensorFlow-based) models.

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing)

Model.

Parameters:
  • states (spec) – The state-space description dictionary.
  • actions (spec) – The action-space description dictionary.
  • scope (str) – The root scope str to use for tf variable scoping.
  • device (str) – The name of the device to run the graph of this model on.
  • saver (spec) – Dict specifying whether and how to save the model’s parameters.
  • summarizer (spec) – Dict specifying which tensorboard summaries should be created and added to the graph.
  • execution (spec) – Dict specifying whether and how to do distributed training on the model’s graph.
  • batching_capacity (int) – Batching capacity.
  • variable_noise (float) – The stddev value of a Normal distribution used for adding random noise to the model’s output (for each batch, noise can be toggled and - if active - will be resampled). Use None to add no noise.
  • states_preprocessing (spec / dict of specs) – Dict specifying whether and how to preprocess state signals (e.g. normalization, greyscale, etc.).
  • actions_exploration (spec / dict of specs) – Dict specifying whether and how to add exploration to the model’s “action outputs” (e.g. epsilon-greedy).
  • reward_preprocessing (spec) – Dict specifying whether and how to preprocess rewards coming from the Environment (e.g. reward normalization).
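
A hypothetical example of the kind of description dictionaries and scalar values these parameters expect (component and key names are illustrative, not exhaustive):

    # One state component and one discrete action component (hypothetical names).
    states = dict(state=dict(type='float', shape=(8,)))
    actions = dict(action=dict(type='int', num_actions=4))

    # Scalar settings from the list above (illustrative choices).
    batching_capacity = 1000  # capacity of the act/observe batching buffer
    variable_noise = 0.1      # stddev of Normal noise on outputs, or None for no noise
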
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)

Creates output operations for acting, observing and interacting with the memory.

get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary of component name to component of all the components within this model.

Returns:(dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()

Returns the TensorFlow summaries reported by the model

Returns:List of summaries
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Parameters:
  • include_submodules – Includes variables of submodules (e.g. baseline, target network) if true.
  • include_nontrainable – Includes non-trainable variables if true.
Returns:

List of variables.

initialize(custom_getter)

Creates the TensorFlow placeholders and functions for this model. Moreover adds the internal state placeholders and initialization values to the model.

Parameters:custom_getter – The custom_getter_ object to use for tf.make_template when creating TensorFlow functions.
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) – Dict of state tensors (each key represents one state space component).
  • internals – List of prior internal state tensors.
  • deterministic – Boolean tensor indicating whether action should be chosen deterministically.
Returns:

  1. dict of output actions (with or without exploration applied (see deterministic))
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple
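
The returned tuple therefore pairs an actions dict with a (possibly empty) list of posterior internals. A schematic illustration of its structure in plain Python (names are hypothetical; in the graph these values are tf.Tensor objects):

    actions = dict(action=2)   # one discrete action component named 'action'
    internals = []             # posterior internal states; empty for feed-forward networks
    result = (actions, internals)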

tf_initialize()
tf_observe_timestep(states, internals, actions, terminal, reward)

Creates the TensorFlow operations for performing the observation of a full time step’s information.

Parameters:
  • states (dict) – Dict of state tensors (each key represents one state space component).
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
Returns:

The observation operation.

tf_preprocess(states, actions, reward)
class tensorforce.models.MemoryModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount)

Bases: tensorforce.models.model.Model

A memory model is a generic model that accumulates and samples data.

__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount)

Memory model.

Parameters:
  • states (spec) – The state-space description dictionary.
  • actions (spec) – The action-space description dictionary.
  • scope (str) – The root scope str to use for tf variable scoping.
  • device (str) – The name of the device to run the graph of this model on.
  • saver (spec) – Dict specifying whether and how to save the model’s parameters.
  • summarizer (spec) – Dict specifying which tensorboard summaries should be created and added to the graph.
  • execution (spec) – Dict specifying whether and how to do distributed training on the model’s graph.
  • batching_capacity (int) – Batching capacity.
  • variable_noise (float) – The stddev value of a Normal distribution used for adding random noise to the model’s output (for each batch, noise can be toggled and - if active - will be resampled). Use None to add no noise.
  • states_preprocessing (spec / dict of specs) – Dict specifying whether and how to preprocess state signals (e.g. normalization, greyscale, etc.).
  • actions_exploration (spec / dict of specs) – Dict specifying whether and how to add exploration to the model’s “action outputs” (e.g. epsilon-greedy).
  • reward_preprocessing (spec) – Dict specifying whether and how to preprocess rewards coming from the Environment (e.g. reward normalization).
  • update_mode (spec) – Update mode.
  • memory (spec) – Memory.
  • optimizer (spec) – Dict specifying the tf optimizer to use for tuning the model’s trainable parameters.
  • discount (float) – The RL reward discount factor (gamma).
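
Illustrative spec dictionaries for the additional MemoryModel arguments (the exact keys accepted depend on the configured memory and optimizer types, so treat these as assumptions):

    update_mode = dict(unit='timesteps', batch_size=64, frequency=4)
    memory = dict(type='replay', capacity=100000, include_next_states=True)
    optimizer = dict(type='adam', learning_rate=1e-3)
    discount = 0.99  # RL reward discount factor (gamma)
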
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()

Returns a dictionary of component name to component of all the components within this model.

Returns:(dict) The mapping of name to component.
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)

Returns the optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
Returns:

Optimizer arguments as dict.
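
Conceptually, the returned dict bundles everything a step-based optimizer needs to perform an update. A hedged sketch of its shape (the key names below are assumptions for illustration; the real dict is built and consumed internally):

    # Hypothetical optimizer-arguments dict.
    arguments = dict(
        time=0,                                               # global timestep counter
        variables=['network/dense0/W', 'network/dense0/b'],   # variables to optimize
        fn_loss=lambda: 0.0,                                  # callable returning the batch loss
    )
    # A step-based optimizer would then consume it roughly as
    # optimizer.minimize(time=..., variables=..., fn_loss=...).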

reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)

Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).

Parameters:
  • states (dict) – Dict of state tensors (each key represents one state space component).
  • internals – List of prior internal state tensors.
  • deterministic – Boolean tensor indicating whether action should be chosen deterministically.
Returns:

  1. dict of output actions (with or without exploration applied (see deterministic))
  2. list of posterior internal state Tensors (empty for non-internal state models)

Return type:

tuple

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import with keys as state names and values as values to set.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import with keys as action names and values as values to set.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the loss per batch instance.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss per instance tensor.
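
The full batch loss is, in essence, the mean of the per-instance losses plus any regularization terms. A minimal NumPy sketch of that relationship (illustration only):

    import numpy as np

    def batch_loss(loss_per_instance, regularization_losses=()):
        """Aggregate per-instance losses into the full batch loss."""
        return float(np.mean(loss_per_instance) + sum(regularization_losses))

    print(batch_loss(np.array([0.2, 0.4, 0.6]), regularization_losses=[0.01]))  # ~0.41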

tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
Returns:

The optimization operation.

tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)

Creates the TensorFlow operations for calculating the regularization losses for the given input states.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Dict of regularization loss tensors.

class tensorforce.models.DistributionModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, requires_deterministic)

Bases: tensorforce.models.memory_model.MemoryModel

Base class for models using distributions parametrized by a neural network.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
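
The core idea: the network maps states to an embedding, each action’s distribution is parametrized from that embedding, and the action is either sampled from the distribution or taken deterministically (its mode). A compact NumPy sketch of this flow for a categorical action (illustration only; the real parametrization is a TensorFlow distribution object):

    import numpy as np

    def softmax(logits):
        z = np.exp(logits - np.max(logits))
        return z / np.sum(z)

    def act(embedding, weights, deterministic=False):
        """Parametrize a categorical distribution from an embedding and pick an action."""
        logits = embedding @ weights          # distribution parameters from the network
        probs = softmax(logits)
        if deterministic:
            return int(np.argmax(probs))      # mode of the distribution
        return int(np.random.choice(len(probs), p=probs))  # sample an action

    rng = np.random.default_rng(0)
    embedding = rng.normal(size=4)
    weights = rng.normal(size=(4, 3))         # three discrete actions
    print(act(embedding, weights), act(embedding, weights, deterministic=True))
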
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, requires_deterministic)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.
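
A brief usage sketch of save() and restore() with the signatures listed here; the directory path is a placeholder and model stands for an already-constructed model instance.

    # illustrative only: `model` is assumed to be an already-built model
    path = model.save(directory='/tmp/my-model', append_timestep=True)  # returns the checkpoint path
    model.restore(directory='/tmp/my-model')  # restores the latest checkpoint in that directory
    model.restore(file=path)                  # or restores one specific checkpoint by its path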

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.
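
Conceptually, exploration post-processing perturbs the computed action before it is returned. A minimal NumPy sketch for a bounded continuous action, using Gaussian noise as one possible Exploration (the actual behaviour depends on the configured exploration object; this is not the library's implementation):

    import numpy as np

    def explore_continuous(action, noise_stddev, min_value, max_value, rng=np.random.default_rng(0)):
        """Add Gaussian exploration noise to a continuous action and clip it back into its bounds."""
        noisy = np.asarray(action, dtype=float) + rng.normal(scale=noise_stddev, size=np.shape(action))
        return np.clip(noisy, min_value, max_value)

    print(explore_continuous([0.9, -0.2], noise_stddev=0.1, min_value=-1.0, max_value=1.0))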

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.
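
The discounted cumulative reward is computed backwards over the sequence, restarting at episode boundaries (terminal steps do not bootstrap) and bootstrapping the tail of the sequence with final_reward. A NumPy sketch of the same recursion (the library's version builds TensorFlow operations instead):

    import numpy as np

    def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0):
        """R_t = r_t + discount * R_{t+1}, with the recursion restarted after terminal steps."""
        returns = np.zeros(len(reward))
        running = final_reward
        for t in reversed(range(len(reward))):
            if terminal[t]:
                running = 0.0                  # no bootstrapping across episode boundaries
            running = reward[t] + discount * running
            returns[t] = running
        return returns

    print(discounted_cumulative_reward([False, False, True], [1.0, 1.0, 1.0], discount=0.9))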

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
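
A hedged sketch of feeding off-policy data through import_experience() with the signature above; the component names 'state' and 'action', the batch of three transitions, and the empty internals list are illustrative assumptions, and model stands for an already-built model.

    # illustrative only: three transitions for made-up 'state'/'action' components
    model.import_experience(
        states=dict(state=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),
        internals=list(),                # assumed empty here; otherwise e.g. agent.current_internals
        actions=dict(action=[0, 1, 0]),
        terminal=[False, False, True],
        reward=[1.0, 0.0, 2.0],
    )
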
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the loss per batch instance.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss per instance tensor.

tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
Returns:

The optimization operation.

tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
class tensorforce.models.PGModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)

Bases: tensorforce.models.distribution_model.DistributionModel

Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the tf_loss_per_instance function, but requires subclasses to implement tf_pg_loss_per_instance.
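
The gae_lambda argument corresponds to generalized advantage estimation over the baseline's value estimates. A NumPy sketch of the standard GAE recursion, under the assumption that terminal steps do not bootstrap (conceptual only, not the library's exact code):

    import numpy as np

    def gae_advantages(rewards, values, terminal, discount, gae_lambda, final_value=0.0):
        """A_t = delta_t + discount * gae_lambda * A_{t+1}, with delta_t = r_t + discount * V_{t+1} - V_t."""
        advantages = np.zeros(len(rewards))
        next_value, running = final_value, 0.0
        for t in reversed(range(len(rewards))):
            if terminal[t]:
                next_value, running = 0.0, 0.0      # restart the recursion at episode boundaries
            delta = rewards[t] + discount * next_value - values[t]
            running = delta + discount * gae_lambda * running
            advantages[t] = running
            next_value = values[t]
        return advantages

    print(gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [False, False, True], discount=0.99, gae_lambda=0.95))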

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.
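
A typical baseline loss is a squared error between the baseline's value prediction and the (discounted) reward target; a minimal NumPy sketch of that idea (the actual formulation depends on the configured baseline):

    import numpy as np

    def baseline_mse_loss(predicted_values, reward_targets):
        """Mean squared error between baseline value predictions and reward targets."""
        diff = np.asarray(predicted_values, dtype=float) - np.asarray(reward_targets, dtype=float)
        return 0.5 * np.mean(diff ** 2)

    print(baseline_mse_loss([0.8, 1.2], [1.0, 1.0]))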

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the loss per batch instance.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss per instance tensor.

tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)
class tensorforce.models.PGProbRatioModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
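
The likelihood_ratio_clipping argument corresponds to a PPO-style clipped surrogate objective. A NumPy sketch of the per-instance loss (the negated objective), assuming log-probabilities under the current and the old (reference) policy plus precomputed advantages; this illustrates the idea rather than the library's exact implementation:

    import numpy as np

    def clipped_ratio_loss(log_probs, old_log_probs, advantages, clipping):
        """Per-instance loss: -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
        ratio = np.exp(np.asarray(log_probs) - np.asarray(old_log_probs))
        advantages = np.asarray(advantages, dtype=float)
        clipped_ratio = np.clip(ratio, 1.0 - clipping, 1.0 + clipping)
        return -np.minimum(ratio * advantages, clipped_ratio * advantages)

    print(clipped_ratio_loss([-0.5, -1.2], [-0.7, -1.0], [1.0, -0.5], clipping=0.2))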

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)
tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)
class tensorforce.models.DPGTargetModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, critic_network, critic_optimizer, target_sync_frequency, target_update_weight)

Bases: tensorforce.models.distribution_model.DistributionModel

Deterministic policy gradient model with a target network (e.g. DDPG).
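
The target network tracks the trained network through periodic soft updates governed by target_sync_frequency and target_update_weight. A NumPy sketch of that update rule (conceptual only):

    import numpy as np

    def soft_target_update(target_weights, source_weights, target_update_weight):
        """target <- (1 - w) * target + w * source, applied every target_sync_frequency steps."""
        return [(1.0 - target_update_weight) * t + target_update_weight * s
                for t, s in zip(target_weights, source_weights)]

    target = [np.zeros(3)]
    source = [np.ones(3)]
    print(soft_target_update(target, source, target_update_weight=0.01))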

COMPONENT_CRITIC = 'critic'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, critic_network, critic_optimizer, target_sync_frequency, target_update_weight)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_predict_target_q(states, internals, terminal, actions, reward, update)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
tf_target_actions_and_internals(states, internals, deterministic=True)
class tensorforce.models.PGLogProbModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing log likelihoods, e.g. VPG.
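
For the log-likelihood variant, the per-instance policy gradient loss is the negative action log-probability weighted by the estimated (baseline-corrected) reward. A NumPy sketch of that expression (conceptual only):

    import numpy as np

    def log_prob_loss(log_probs, reward_estimates):
        """Per-instance loss: -log pi(a|s) * estimated reward (or advantage)."""
        return -np.asarray(log_probs, dtype=float) * np.asarray(reward_estimates, dtype=float)

    print(log_prob_loss([-0.5, -1.2], [2.0, -0.5]))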

COMPONENT_BASELINE = 'baseline'
COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
baseline_optimizer_arguments(states, internals, reward)

Returns the baseline optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
Returns:

Baseline optimizer arguments as dict.

close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_baseline_loss(states, internals, reward, update, reference=None)

Creates the TensorFlow operations for calculating the baseline loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • reward – Reward tensor.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
tf_reward_estimation(states, internals, terminal, reward, update)
class tensorforce.models.QModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.distribution_model.DistributionModel

Q-value model.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, keyed by state name.
  • internals – Internal values to set; can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, keyed by action name.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantages) of the Q-values.

Returns:A list of deltas per action.
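
Conceptually, the delta for a transition is the temporal-difference error between the current Q-value and the bootstrapped target. A NumPy sketch in which terminal steps do not bootstrap and the discount appears as an explicit parameter (the optional double-Q and Huber-loss variants are omitted):

    import numpy as np

    def q_delta(q_value, next_q_value, terminal, reward, discount=0.99):
        """delta = reward + discount * next_q * (1 - terminal) - q_value."""
        not_terminal = 1.0 - np.asarray(terminal, dtype=float)
        target = np.asarray(reward, dtype=float) + discount * np.asarray(next_q_value, dtype=float) * not_terminal
        return target - np.asarray(q_value, dtype=float)

    print(q_delta([1.0, 2.0], [1.5, 0.0], [False, True], [0.0, 1.0]))
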
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
class tensorforce.models.QNstepModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.q_model.QModel

Deep Q-network using n-step rewards, as described in "Asynchronous Methods for Deep Reinforcement Learning".
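
With n-step rewards, the target for the first state of a rollout accumulates the discounted rewards collected over up to n steps and then bootstraps from the target network's Q-value, unless the rollout ended in a terminal state. A NumPy sketch of that target (conceptual only; the discount and bootstrap value are illustrative parameters):

    import numpy as np

    def n_step_target(rewards, terminal, bootstrap_value, discount=0.99):
        """Discounted sum of the collected rewards, bootstrapped if the rollout did not terminate."""
        target = 0.0 if terminal[-1] else bootstrap_value
        for r in reversed(rewards):
            target = r + discount * target
        return target

    print(n_step_target([1.0, 0.0, 1.0], [False, False, False], bootstrap_value=2.0))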

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all of the components this model consists of that can be individually saved and restored. For instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Turn this off to be able to load the model from the same path argument given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.
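
For intuition, a minimal NumPy reference sketch of the same computation outside the graph (illustrative only, not the TensorFlow implementation): returns are accumulated backwards, the running return is reset at terminal steps, and final_reward seeds the bootstrap value for a truncated, non-terminal final step.

    import numpy as np

    def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0):
        # Reference sketch: backward accumulation of discounted returns.
        cumulative = final_reward
        returns = np.zeros(len(reward))
        for t in reversed(range(len(reward))):
            if terminal[t]:
                cumulative = 0.0  # no bootstrapping past an episode end
            cumulative = reward[t] + discount * cumulative
            returns[t] = cumulative
        return returns

    # Example: two steps, episode terminating at the second step.
    print(discounted_cumulative_reward([False, True], [1.0, 2.0], discount=0.9))
    # returns approximately [2.8, 2.0]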

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, with state names as keys and the values to set as values.
  • internals – Internal state values to set; these can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, with action names as keys and the values to set as values.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
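
A hedged sketch of calling import_experience() with such data; `model`, the state/action keys, and the shapes are placeholders that must match the specification the model was built with:

    # Hypothetical call; `model` is an already constructed model instance.
    model.import_experience(
        states=dict(state=[[0.1, 0.2], [0.3, 0.4]]),   # one entry per timestep
        internals=dict(),            # placeholder; e.g. agent.current_internals, or empty
        actions=dict(action=[0, 1]),
        terminal=[False, True],
        reward=[1.0, -1.0]
    )
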
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
class tensorforce.models.QNAFModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)

Bases: tensorforce.models.q_model.QModel
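
The class name refers to normalized advantage functions (NAF, Gu et al. 2016), i.e. Q-learning for continuous actions via Q(s, a) = V(s) + A(s, a) with a quadratic advantage term. The following NumPy sketch only illustrates that decomposition; it is not the class implementation, and all values are placeholders:

    import numpy as np

    # NAF decomposition: A(s, a) = -0.5 * (a - mu(s))^T P(s) (a - mu(s)),
    # with P(s) = L(s) L(s)^T and L(s) a lower-triangular network output.
    a = np.array([0.2, -0.1])        # action
    mu = np.array([0.0, 0.0])        # network action mean mu(s)
    L = np.array([[1.0, 0.0],
                  [0.3, 0.8]])       # lower-triangular factor L(s)
    V = 1.5                          # state value V(s)

    P = L @ L.T
    advantage = -0.5 * (a - mu) @ P @ (a - mu)
    q_value = V + advantage
    print(q_value)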

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Performs a forward pass through the model to retrieve the action outputs given the state inputs (and internal states, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all the components this model consists of that can be individually saved and restored, for instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)
import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to the checkpoint filename to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the exact path given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, with state names as keys and the values to set as values.
  • internals – Internal state values to set; these can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, with action names as keys and the values to set as values.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantage) of the Q values.

Returns:A list of deltas per action
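
Assuming this follows the standard one-step Q-learning target (a sketch, not the graph implementation), the delta per action takes the form below; the discount value is a placeholder:

    import numpy as np

    def q_delta(q_value, next_q_value, terminal, reward, discount=0.99):
        # Illustrative one-step delta: r + gamma * Q(s', a') * (1 - terminal) - Q(s, a);
        # terminal steps drop the bootstrapped next-state value.
        terminal = np.asarray(terminal, dtype=float)
        return reward + discount * next_q_value * (1.0 - terminal) - q_value

    print(q_delta(q_value=1.0, next_q_value=2.0, terminal=False, reward=0.5))
    # approximately 1.48
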
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)
class tensorforce.models.QDemoModel(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, expert_margin, supervised_weight, demo_memory_capacity, demo_batch_size)

Bases: tensorforce.models.q_model.QModel

Model for deep Q-learning from demonstration (DQfD). Its principal structure is similar to double deep Q-networks, but it uses additional loss terms for demonstration data.

COMPONENT_DISTRIBUTION = 'distribution'
COMPONENT_NETWORK = 'network'
COMPONENT_TARGET_DISTRIBUTION = 'target_distribution'
COMPONENT_TARGET_NETWORK = 'target_network'
__init__(states, actions, scope, device, saver, summarizer, execution, batching_capacity, variable_noise, states_preprocessing, actions_exploration, reward_preprocessing, update_mode, memory, optimizer, discount, network, distributions, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, expert_margin, supervised_weight, demo_memory_capacity, demo_batch_size)
act(states, internals, deterministic=False, independent=False, fetch_tensors=None)

Performs a forward pass through the model to retrieve the action outputs given the state inputs (and internal states, if applicable, e.g. for RNNs).

Parameters:
  • states (dict) – Dict of state values (each key represents one state space component).
  • internals (dict) – Dict of internal state values (each key represents one internal state component).
  • deterministic (bool) – If True, will not apply exploration after actions are calculated.
  • independent (bool) – If true, action is not followed by observe (and hence not included in updates).
Returns:

  • Actual action-outputs (batched if state input is a batch).

Return type:tuple
as_local_model()
close()
create_act_operations(states, internals, deterministic, independent)
create_distributions()
create_observe_operations(terminal, reward)
create_operations(states, internals, actions, terminal, reward, deterministic, independent)
demo_update()

Performs a demonstration update by calling the demo optimization operation. Note that the demo batch does not have to be fetched from the demo memory beforehand, since sampling it is part of the demo-update TensorFlow operation.

get_component(component_name)

Looks up a component by its name.

Parameters:component_name – The name of the component to look up.
Returns:The component for the provided name or None if there is no such component.
get_components()
get_feed_dict(states=None, internals=None, actions=None, terminal=None, reward=None, deterministic=None, independent=None)
get_savable_components()

Returns the list of all the components this model consists of that can be individually saved and restored, for instance the network or distribution.

Returns:List of util.SavableComponent
get_summaries()
get_variables(include_submodules=False, include_nontrainable=False)

Returns the TensorFlow variables used by the model.

Returns:List of variables.
import_demo_experience(states, internals, actions, terminal, reward)

Stores demonstrations in the demo memory.
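
A hedged sketch of combining import_demo_experience() and demo_update() for pretraining on demonstrations; `model`, the keys, shapes, and the number of updates are placeholders:

    # Hypothetical pretraining on demonstration data; `model` is an already
    # constructed QDemoModel and the batch keys/shapes must match its spec.
    demo_batch = dict(
        states=dict(state=[[0.1, 0.2], [0.3, 0.4]]),
        internals=dict(),            # placeholder; empty if no internal states
        actions=dict(action=[0, 1]),
        terminal=[False, True],
        reward=[1.0, 0.0]
    )

    # Store the demonstrations in the demo memory once ...
    model.import_demo_experience(**demo_batch)

    # ... then run supervised demo updates; the demo batch itself is sampled
    # inside the demo-update TensorFlow operation.
    for _ in range(100):
        model.demo_update()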

import_experience(states, internals, actions, terminal, reward)

Stores experiences.

initialize(custom_getter)
observe(terminal, reward)

Adds an observation (reward and is-terminal) to the model without updating its trainable variables.

Parameters:
  • terminal (bool) – Whether the episode has terminated.
  • reward (float) – The observed reward value.
Returns:

The value of the model-internal episode counter.

optimizer_arguments(states, internals, actions, terminal, reward, next_states, next_internals)
reset()

Resets the model to its initial state on episode start. This should also reset all preprocessor(s).

Returns:Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors.
Return type:tuple
restore(directory=None, file=None)

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:
  • directory – Optional checkpoint directory.
  • file – Optional checkpoint file, or path if directory not given.
restore_component(component_name, save_path)

Restores a component’s parameters from a save location.

Parameters:
  • component_name – The component to restore.
  • save_path – The save location.
save(directory=None, append_timestep=True)

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to the checkpoint filename to prevent overwriting previous checkpoint files; turn this off to be able to load the model later from the exact path given here.

Parameters:
  • directory – Optional checkpoint directory.
  • append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:

Checkpoint path where the model was saved.

save_component(component_name, save_path)

Saves a component of this model to the designated location.

Parameters:
  • component_name – The component to save.
  • save_path – The location to save to.
Returns:

Checkpoint path where the component was saved.

setup()

Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.

target_optimizer_arguments()

Returns the target optimizer arguments including the time, the list of variables to optimize, and various functions which the optimizer might require to perform an update step.

Returns:Target optimizer arguments as dict.
tf_action_exploration(action, exploration, action_spec)

Applies optional exploration to the action (post-processor for action outputs).

Parameters:
  • action (tf.Tensor) – The original output action tensor (to be post-processed).
  • exploration (Exploration) – The Exploration object to use.
  • action_spec (dict) – Dict specifying the action space.
Returns:

The post-processed action output tensor.

tf_actions_and_internals(states, internals, deterministic)
tf_combined_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Combines Q-loss and demo loss.

tf_demo_loss(states, actions, terminal, reward, internals, update, reference=None)

Extends the Q-model loss via the DQfD large-margin loss.
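
For reference, the DQfD large-margin term (Hester et al.) is J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where the margin l(a_E, a) equals expert_margin for a != a_E and 0 otherwise; presumably it is then added to the Q-loss weighted by the supervised_weight constructor argument. A NumPy sketch of the margin term (illustrative only, not the graph code):

    import numpy as np

    def demo_margin_loss(q_values, expert_action, expert_margin):
        # DQfD large-margin loss for one demonstration step (sketch):
        # max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = expert_margin off-expert.
        margins = np.full(len(q_values), expert_margin)
        margins[expert_action] = 0.0
        return np.max(q_values + margins) - q_values[expert_action]

    q = np.array([1.0, 2.0, 0.5])
    print(demo_margin_loss(q, expert_action=1, expert_margin=0.5))   # -> 0.0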

tf_demo_optimization(states, internals, actions, terminal, reward, next_states, next_internals)
tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • discount – Discount factor.
  • final_reward – Last reward value in the sequence.
Returns:

Discounted cumulative reward tensor.

tf_import_demo_experience(states, internals, actions, terminal, reward)

Imports a single demonstration experience into the demo memory.

tf_import_experience(states, internals, actions, terminal, reward)

Imports experiences into the TensorFlow memory structure. Can be used to import off-policy data.

Parameters:
  • states – Dict of state values to import, with state names as keys and the values to set as values.
  • internals – Internal state values to set; these can be fetched from the agent via agent.current_internals if no values are available.
  • actions – Dict of action values to import, with action names as keys and the values to set as values.
  • terminal – Terminal value(s).
  • reward – Reward value(s).
tf_initialize()
tf_kl_divergence(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_loss(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)

Creates the TensorFlow operations for calculating the full loss of a batch.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
  • reference – Optional reference tensor(s), in case of a comparative loss.
Returns:

Loss tensor.

tf_loss_per_instance(states, internals, actions, terminal, reward, next_states, next_internals, update, reference=None)
tf_observe_timestep(states, internals, actions, terminal, reward)
tf_optimization(states, internals, actions, terminal, reward, next_states=None, next_internals=None)
tf_preprocess(states, actions, reward)
tf_q_delta(q_value, next_q_value, terminal, reward)

Creates the deltas (or advantage) of the Q values.

Returns:A list of deltas per action
tf_q_value(embedding, distr_params, action, name)
tf_reference(states, internals, actions, terminal, reward, next_states, next_internals, update)

Creates the TensorFlow operations for obtaining the reference tensor(s), in case of a comparative loss.

Parameters:
  • states – Dict of state tensors.
  • internals – List of prior internal state tensors.
  • actions – Dict of action tensors.
  • terminal – Terminal boolean tensor.
  • reward – Reward tensor.
  • next_states – Dict of successor state tensors.
  • next_internals – List of posterior internal state tensors.
  • update – Boolean tensor indicating whether this call happens during an update.
Returns:

Reference tensor(s).

tf_regularization_losses(states, internals, update)