tensorforce.models package
Submodules
tensorforce.models.constant_model module
-
class tensorforce.models.constant_model.ConstantModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, action_values)
Bases: tensorforce.models.model.Model
Utility class to return constant actions of a desired shape and with given bounds.
-
tf_actions_and_internals(states, internals, update, deterministic)
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
-
tensorforce.models.distribution_model module
-
class tensorforce.models.distribution_model.DistributionModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)
Bases: tensorforce.models.model.Model
Base class for models using distributions parametrized by a neural network.
-
create_distributions()
-
static get_distributions_summaries(distributions)
-
static get_distributions_variables(distributions, include_non_trainable=False)
-
get_optimizer_kwargs(states, internals, actions, terminal, reward, update)
-
get_summaries()
-
get_variables(include_non_trainable=False)
-
initialize(custom_getter)
-
tf_actions_and_internals(states, internals, update, deterministic)
-
tf_kl_divergence(states, internals, update)
-
tf_regularization_losses(states, internals, update)
-
tensorforce.models.model module
The Model class coordinates the creation and execution of all TensorFlow operations within a model. It implements the reset, act and update functions, which form the interface the Agent class communicates with, and which should not need to be overwritten. Instead, the following TensorFlow functions need to be implemented:
- tf_actions_and_internals(states, internals, update, deterministic) returning the batch of actions and successor internal states.
- tf_loss_per_instance(states, internals, actions, terminal, reward, update) returning the loss per instance for a batch.
Further, the following TensorFlow functions should be extended accordingly:
- initialize(custom_getter) defining TensorFlow placeholders/functions and adding internal states.
- get_variables(include_non_trainable=False) returning the list of TensorFlow variables (to be optimized) of this model.
- tf_regularization_losses(states, internals, update) returning a dict of regularization losses.
- get_optimizer_kwargs(states, internals, actions, terminal, reward, update) returning a dict of potential arguments (argument-free functions) to the optimizer.
Finally, the following TensorFlow functions can be useful in some cases:
- tf_preprocess_states(states) for state preprocessing, returning the processed batch of states.
- tf_action_exploration(action, exploration, action_spec) for action postprocessing (e.g. exploration), returning the processed batch of actions.
- tf_preprocess_reward(states, internals, terminal, reward) for reward preprocessing (e.g. reward normalization), returning the processed batch of rewards.
- create_output_operations(states, internals, actions, terminal, reward, update, deterministic) for further output operations, similar to the two above for Model.act and Model.update.
- tf_optimization(states, internals, actions, terminal, reward, update) for further optimization operations (e.g. the baseline update in a PGModel or the target network update in a QModel), returning a single grouped optimization operation.
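The split between fixed entry points (act, update) and overridable hooks can be sketched in plain Python. This is only an illustrative analogue, not TensorForce code: SketchModel and ConstantSketchModel are hypothetical names, the hooks work on Python lists instead of TensorFlow tensors, states is a list rather than a dict, and the update argument is left out for brevity.

```python
class SketchModel:
    """Hypothetical stand-in for a model base class (not the TensorForce Model)."""

    def act(self, states, internals, deterministic=False):
        # Fixed entry point: delegates to the hook that subclasses implement.
        return self.tf_actions_and_internals(states, internals, deterministic)

    def update(self, states, internals, actions, terminal, reward):
        # Fixed entry point: reduces the per-instance losses to one batch loss.
        losses = self.tf_loss_per_instance(states, internals, actions, terminal, reward)
        return sum(losses) / len(losses)

    def tf_actions_and_internals(self, states, internals, deterministic):
        raise NotImplementedError

    def tf_loss_per_instance(self, states, internals, actions, terminal, reward):
        raise NotImplementedError


class ConstantSketchModel(SketchModel):
    """Returns the same action for every state in the batch."""

    def __init__(self, action_value):
        self.action_value = action_value

    def tf_actions_and_internals(self, states, internals, deterministic):
        # One action per state; internal states are passed through unchanged.
        return [self.action_value] * len(states), internals

    def tf_loss_per_instance(self, states, internals, actions, terminal, reward):
        # A constant policy has nothing to learn, so every instance has zero loss.
        return [0.0] * len(states)


model = ConstantSketchModel(action_value=1)
actions, internals = model.act(states=[[0.1], [0.2]], internals=[])
print(actions)  # [1, 1]
```

Callers only ever use act and update; a new model variant overrides the two hooks and inherits the rest.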
-
class tensorforce.models.model.Model(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)
Bases: object
Base class for all (TensorFlow-based) models.
-
act(states, internals, deterministic=False)
Does a forward pass through the model to retrieve actions (outputs) given inputs for the state (and the internal state, if applicable, e.g. for RNNs).
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of incoming internal state tensors.
- deterministic (bool) – If True, will not apply exploration after actions are calculated.
Returns: Actual action outputs (batched if the state input is a batch).
Return type: tuple
-
close()
-
create_output_operations(states, internals, actions, terminal, reward, update, deterministic)
Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration is skipped when actions are calculated.
-
get_optimizer_kwargs(states, internals, actions, terminal, reward, update)
Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular fn_loss returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict to be passed into the optimizer op (e.g. ‘minimize’) as kwargs.
-
get_summaries()
Returns the TensorFlow summaries reported by the model.
Returns: List of summaries.
-
get_variables(include_non_trainable=False)
Returns the TensorFlow variables used by the model.
Returns: List of variables.
-
initialize(custom_getter)
Creates the TensorFlow placeholders and functions for this model. Moreover, adds the internal state placeholders and initialization values to the model.
Parameters:
- custom_getter – The custom_getter object to use for tf.make_template when creating TensorFlow functions.
-
observe(terminal, reward)
Adds an observation (reward and is-terminal) to the model without updating its trainable variables.
Parameters:
- terminal (bool) – Whether the episode has terminated.
- reward (float) – The observed reward value.
Returns: The value of the model-internal episode counter.
-
reset()
Resets the model to its initial state on episode start.
Returns: Current episode, timestep counter and the shallow-copied list of internal state initialization tensors.
Return type: tuple
-
restore(directory=None, file=None)
Restores the TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters:
- directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save(directory=None, append_timestep=True)
Saves the TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to prevent overwriting previous checkpoint files. Set append_timestep to False to be able to load the model from the same path that was given here.
Parameters:
- directory – Optional checkpoint directory.
- append_timestep – Appends the current timestep to the checkpoint file if True.
Returns: Checkpoint path where the model was saved.
-
setup()
Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.
-
tf_action_exploration(action, exploration, action_spec)
Applies optional exploration to the action (post-processor for action outputs).
Parameters:
- action (tf.Tensor) – The original output action tensor (to be post-processed).
- exploration (Exploration) – The Exploration object to use.
- action_spec (dict) – Dict specifying the action space.
Returns: The post-processed action output tensor.
-
tf_actions_and_internals(states, internals, update, deterministic)
Creates and returns the TensorFlow operations for retrieving the actions and, if applicable, the posterior internal state tensors in reaction to the given input states (and prior internal states).
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean tensor indicating whether exploration is skipped when actions are calculated.
Returns:
- Dict of output actions (with or without exploration applied, see deterministic).
- List of posterior internal state tensors (empty for models without internal states).
Return type: tuple
-
tf_discounted_cumulative_reward(terminal, reward, discount=None, final_reward=0.0, horizon=0)
Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.
Example:
single rewards = 2.0   1.0   0.0   0.5   1.0   -1.0
terminal       = False False False False True  False
gamma          = 0.95
final_reward   = 100.0 (only matters for the last episode (r=-1.0), as this episode has no terminal signal)
horizon        = 3
output         = 2.95  1.45  1.38  1.45  1.0   94.0
Parameters:
- terminal – Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one True value. If its very last element is False (not terminating), the given final_reward value is assumed to follow the last value in the single rewards sequence (see below).
- reward – Tensor (float) holding the sequence of single rewards. If the last element of terminal is False, an assumed last reward of the value of final_reward will be used.
- discount (float) – The discount factor (gamma). By default, take the Model’s discount factor.
- final_reward (float) – Reward value to use if last episode in sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
- horizon (int) – The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns: Discounted cumulative reward tensor with the same shape as reward.
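The semantics described for tf_discounted_cumulative_reward can be checked against a plain-Python reference. The sketch below is not the TensorFlow implementation: the function name and the use of Python lists are assumptions made for illustration, and discount has to be passed explicitly because there is no model to fall back on. It is written only to match the documented example and parameter descriptions.

```python
def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0, horizon=0):
    """Plain-Python reference for the documented behaviour (lists in, list out)."""
    n = len(reward)
    result = []
    for start in range(n):
        total = 0.0
        weight = 1.0  # discount ** (number of steps looked ahead so far)
        steps = 0
        index = start
        # horizon == 0 means no limit; otherwise sum at most `horizon` rewards.
        while horizon == 0 or steps < horizon:
            if index == n:
                # Ran past the end without a terminal signal: final_reward
                # stands in for the reward that would have followed.
                total += weight * final_reward
                break
            total += weight * reward[index]
            if terminal[index]:
                break  # never accumulate across an episode boundary
            weight *= discount
            steps += 1
            index += 1
        result.append(total)
    return result


rewards = [2.0, 1.0, 0.0, 0.5, 1.0, -1.0]
terminals = [False, False, False, False, True, False]
values = discounted_cumulative_reward(
    terminals, rewards, discount=0.95, final_reward=100.0, horizon=3
)
print([round(v, 2) for v in values])  # [2.95, 1.45, 1.38, 1.45, 1.0, 94.0]
```

With horizon=1 the loop adds exactly one reward per position, which is why final_reward is ignored in that case and the result equals the one obtained with a discount of 0.0.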
-
tf_loss(states, internals, actions, terminal, reward, update)
Creates and returns the single loss tensor representing the total loss for a batch, including the mean loss per sample and the regularization loss of the batch.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Single float-value loss tensor.
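As a rough illustration of how a per-instance loss and a dict of regularization losses could be reduced to one scalar, here is a minimal plain-Python sketch. It assumes the regularization terms are simply added to the mean per-instance loss without extra weighting; the function name total_loss and the 'l2' key are hypothetical, and the actual TensorFlow graph may combine the terms differently.

```python
def total_loss(loss_per_instance, regularization_losses):
    # Mean over the batch dimension (one loss value per sample).
    mean_loss = sum(loss_per_instance) / len(loss_per_instance)
    # Assumption: every regularization term is added with weight 1.
    return mean_loss + sum(regularization_losses.values())


loss = total_loss([0.5, 1.5, 1.0], {"entropy": 0.25, "l2": 0.25})
print(loss)  # 1.5
```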
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor (the first dimension is the batch size, i.e. one loss value per sample in the batch).
-
tf_optimization(states, internals, actions, terminal, reward, update)
Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: The optimization operation.
-
tf_preprocess_reward(states, internals, terminal, reward)
Applies optional preprocessing to the reward.
-
tf_preprocess_states(states)
Applies optional preprocessing to the states.
-
tf_regularization_losses(states, internals, update)
Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict of regularization loss tensors (keys are the different regularization types, e.g. ‘entropy’).
-
update(states, internals, actions, terminal, reward, return_loss_per_instance=False)
Runs self.optimization in the session to update the Model’s parameters. Optionally, also runs the loss_per_instance calculation and returns its result.
Parameters:
- states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- return_loss_per_instance (bool) – Whether to also run and return the loss_per_instance tensor.
Returns: Nothing or, if return_loss_per_instance is True, the value of the loss_per_instance tensor.
-
tensorforce.models.pg_log_prob_model module
-
class tensorforce.models.pg_log_prob_model.PGLogProbModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
Bases: tensorforce.models.pg_model.PGModel
Policy gradient model based on computing log likelihoods, e.g. VPG.
-
tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)
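For orientation, the textbook log-likelihood policy gradient loss weights the negative log probability of each taken action by its (estimated) reward. The sketch below shows that standard form in plain Python. It is an assumption that PGLogProbModel follows exactly this form, and the function name and list-based inputs are hypothetical.

```python
import math


def pg_log_prob_loss(action_probs, rewards):
    # Per-instance loss: -log pi(a|s) * R. Minimizing it raises the
    # probability of actions with positive reward and lowers it otherwise.
    return [-math.log(p) * r for p, r in zip(action_probs, rewards)]


losses = pg_log_prob_loss(action_probs=[0.5, 0.25], rewards=[1.0, -2.0])
```

The first instance yields a positive loss of log(2); the second, with a negative reward, yields a negative loss, so the gradient pushes the probability of that action down.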
-
tensorforce.models.pg_model module
-
class tensorforce.models.pg_model.PGModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)
Bases: tensorforce.models.distribution_model.DistributionModel
Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the tf_loss_per_instance function, but requires subclasses to implement tf_pg_loss_per_instance.
-
get_summaries()
-
get_variables(include_non_trainable=False)
-
initialize(custom_getter)
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
-
tf_optimization(states, internals, actions, terminal, reward, update)
-
tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)
Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.
Parameters:
- states – Dict of state tensors.
- internals – List of prior internal state tensors.
- actions – Dict of action tensors.
- terminal – Terminal boolean tensor.
- reward – Reward tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor.
-
tf_regularization_losses(states, internals, update)
-
tf_reward_estimation(states, internals, terminal, reward, update)
-
tensorforce.models.pg_prob_ratio_model module
-
class tensorforce.models.pg_prob_ratio_model.PGProbRatioModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)
Bases: tensorforce.models.pg_model.PGModel
Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
-
get_optimizer_kwargs(states, actions, terminal, reward, internals, update)
-
initialize(custom_getter)
-
tf_compare(states, internals, actions, terminal, reward, update, reference)
-
tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)
-
tf_reference(states, internals, actions, update)
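A likelihood-ratio loss compares the probability of an action under the current policy with its probability under a reference (pre-update) policy. The plain-Python sketch below shows the commonly used PPO-style clipped form for a single instance. Only the parameter name likelihood_ratio_clipping is taken from the signature above; the function name, the scalar inputs, and the assumption that the model uses exactly this clipped objective are illustrative.

```python
def pg_prob_ratio_loss(prob, reference_prob, reward, likelihood_ratio_clipping=None):
    # Ratio between the current and the reference probability of the action.
    ratio = prob / reference_prob
    if likelihood_ratio_clipping is None:
        # Unclipped surrogate objective, negated to obtain a loss.
        return -ratio * reward
    low = 1.0 - likelihood_ratio_clipping
    high = 1.0 + likelihood_ratio_clipping
    clipped_ratio = min(max(ratio, low), high)
    # Pessimistic bound: take the smaller of the two surrogate objectives.
    return -min(ratio * reward, clipped_ratio * reward)


clipped = pg_prob_ratio_loss(0.6, 0.4, reward=1.0, likelihood_ratio_clipping=0.2)
unclipped = pg_prob_ratio_loss(0.6, 0.4, reward=1.0)
```

In this example the ratio of about 1.5 exceeds the upper bound of 1.2, so the clipped loss is about -1.2 instead of -1.5, which limits how far a single update can move the policy.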
-
tensorforce.models.q_demo_model module
-
class tensorforce.models.q_demo_model.QDemoModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)
Bases: tensorforce.models.q_model.QModel
Model for deep Q-learning from demonstration. Principal structure similar to double deep Q-networks but uses additional loss terms for demo data.
-
create_output_operations(states, internals, actions, terminal, reward, update, deterministic)
-
demonstration_update(states, internals, actions, terminal, reward)
-
initialize(custom_getter)
-
tf_demo_loss(states, actions, terminal, reward, internals, update)
-
tf_demo_optimization(states, internals, actions, terminal, reward, update)
-
tensorforce.models.q_model module
-
class tensorforce.models.q_model.QModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)
Bases: tensorforce.models.distribution_model.DistributionModel
Q-value model.
-
get_summaries()
-
get_variables(include_non_trainable=False)
-
initialize(custom_getter)
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
-
tf_optimization(states, internals, actions, terminal, reward, update)
-
tf_q_delta(q_value, next_q_value, terminal, reward)
Creates the deltas (or advantage) of the Q values.
Returns: A list of deltas per action.
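The delta of a Q value is usually the one-step temporal-difference error: the bootstrapped target minus the current estimate. The sketch below shows that standard form for a single transition in plain Python. It is an assumption that tf_q_delta uses this exact form and sign convention; the function name is hypothetical, and discount is passed explicitly here whereas the method above takes it from the model.

```python
def q_delta(q_value, next_q_value, terminal, reward, discount):
    # No bootstrapping from the next state once the episode has terminated.
    bootstrap = 0.0 if terminal else discount * next_q_value
    # TD error: (reward + discounted next value) - current estimate.
    return reward + bootstrap - q_value


delta = q_delta(q_value=1.0, next_q_value=2.0, terminal=False, reward=0.5, discount=0.5)
print(delta)  # 0.5
```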
-
tf_q_value(embedding, distr_params, action, name)
-
update(states, internals, actions, terminal, reward, return_loss_per_instance=False)
-
tensorforce.models.q_naf_model module
-
class tensorforce.models.q_naf_model.QNAFModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)
Bases: tensorforce.models.q_model.QModel
-
get_variables(include_non_trainable=False)
-
initialize(custom_getter)
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
-
tf_q_value(embedding, distr_params, action, name)
-
tf_regularization_losses(states, internals, update)
-
tensorforce.models.q_nstep_model module
-
class tensorforce.models.q_nstep_model.QNstepModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)
Bases: tensorforce.models.q_model.QModel
Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.
-
tf_q_delta(q_value, next_q_value, terminal, reward)
-
tensorforce.models.random_model module
-
class tensorforce.models.random_model.RandomModel(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)
Bases: tensorforce.models.model.Model
Utility class to return random actions of a desired shape and with given bounds.
-
tf_actions_and_internals(states, internals, update, deterministic)
-
tf_loss_per_instance(states, internals, actions, terminal, reward, update)
-
Module contents
-
class tensorforce.models.Model(states_spec, actions_spec, device=None, session_config=None, scope='base_model', saver_spec=None, summary_spec=None, distributed_spec=None, optimizer=None, discount=0.0, variable_noise=None, states_preprocessing_spec=None, explorations_spec=None, reward_preprocessing_spec=None)
Bases: object
Base class for all (TensorFlow-based) models.
-
act
(states, internals, deterministic=False)¶ Does a forward pass through the model to retrieve action (outputs) given inputs for state (and internal state, if applicable (e.g. RNNs))
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of incoming internal state tensors.
- deterministic (bool) – If True, will not apply exploration after actions are calculated.
Returns: - Actual action-outputs (batched if state input is a batch).
Return type: tuple
-
close
()¶
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶ Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean Tensor indicating, whether we will not apply exploration when actions are calculated.
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶ Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular
fn_loss
returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict to be passed into the optimizer op (e.g. ‘minimize’) as kwargs.
-
get_summaries
()¶ Returns the TensorFlow summaries reported by the model
Returns: List of summaries
-
get_variables
(include_non_trainable=False)¶ Returns the TensorFlow variables used by the model.
Returns: List of variables.
-
initialize
(custom_getter)¶ Creates the TensorFlow placeholders and functions for this model. Moreover adds the internal state placeholders and initialization values to the model.
Parameters: custom_getter – The custom_getter_
object to use fortf.make_template
when creating TensorFlow functions.
-
observe
(terminal, reward)¶ Adds an observation (reward and is-terminal) to the model without updating its trainable variables.
Parameters: - terminal (bool) – Whether the episode has terminated.
- reward (float) – The observed reward value.
Returns: The value of the model-internal episode counter.
-
reset
()¶ Resets the model to its initial state on episode start.
Returns: Current episode, timestep counter and the shallow-copied list of internal state initialization Tensors. Return type: tuple
-
restore
(directory=None, file=None)¶ Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).
Parameters: - directory – Optional checkpoint directory.
- file – Optional checkpoint file, or path if directory not given.
-
save
(directory=None, append_timestep=True)¶ Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends current timestep to prevent overwriting previous checkpoint files. Turn off to be able to load model from the same given path argument as given here.
Parameters: - directory – Optional checkpoint directory.
- append_timestep – Appends the current timestep to the checkpoint file if true.
Returns: Checkpoint path were the model was saved.
-
setup
()¶ Sets up the TensorFlow model graph and initializes (and enters) the TensorFlow session.
-
tf_action_exploration
(action, exploration, action_spec)¶ Applies optional exploration to the action (post-processor for action outputs).
Parameters: - action (tf.Tensor) – The original output action tensor (to be post-processed).
- exploration (Exploration) – The Exploration object to use.
- action_spec (dict) – Dict specifying the action space.
Returns: The post-processed action output tensor.
-
tf_actions_and_internals
(states, internals, update, deterministic)¶ Creates and returns the TensorFlow operations for retrieving the actions and - if applicable - the posterior internal state Tensors in reaction to the given input states (and prior internal states).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
- deterministic – Boolean Tensor indicating, whether we will not apply exploration when actions are calculated.
Returns: - dict of output actions (with or without exploration applied (see
deterministic
)) - list of posterior internal state Tensors (empty for non-internal state models)
Return type: tuple
-
tf_discounted_cumulative_reward
(terminal, reward, discount=None, final_reward=0.0, horizon=0)¶ Creates and returns the TensorFlow operations for calculating the sequence of discounted cumulative rewards for a given sequence of single rewards.
Example: single rewards = 2.0 1.0 0.0 0.5 1.0 -1.0 terminal = False, False, False, False True False gamma = 0.95 final_reward = 100.0 (only matters for last episode (r=-1.0) as this episode has no terminal signal) horizon=3 output = 2.95 1.45 1.38 1.45 1.0 94.0
Parameters: - terminal – Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one
True value. If its very last element is False (not terminating), the given
final_reward
value is assumed to follow the last value in the single rewards sequence (see below). - reward – Tensor (float) holding the sequence of single rewards. If the last element of
terminal
is False, an assumed last reward of the value offinal_reward
will be used. - discount (float) – The discount factor (gamma). By default, take the Model’s discount factor.
- final_reward (float) – Reward value to use if last episode in sequence does not terminate (terminal sequence ends with False). This value will be ignored if horizon == 1 or discount == 0.0.
- horizon (int) – The length of the horizon (e.g. for n-step cumulative rewards in continuous tasks without terminal signals). Use 0 (default) for an infinite horizon. Note that horizon=1 leads to the exact same results as a discount factor of 0.0.
Returns: Discounted cumulative reward tensor with the same shape as
reward
.- terminal – Tensor (bool) holding the is-terminal sequence. This sequence may contain more than one
True value. If its very last element is False (not terminating), the given
-
tf_loss
(states, internals, actions, terminal, reward, update)¶ Creates and returns the single loss Tensor representing the total loss for a batch, including the mean loss per sample, the regularization loss of the batch, .
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Single float-value loss tensor.
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates and returns the TensorFlow operations for calculating the loss per batch instance (sample) of the given input state(s) and action(s).
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor (first rank is the batch size -> one loss value per sample in the batch).
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: The optimization operation.
-
tf_preprocess_reward
(states, internals, terminal, reward)¶ Applies optional preprocessing to the reward.
-
tf_preprocess_states
(states)¶ Applies optional preprocessing to the states.
-
tf_regularization_losses
(states, internals, update)¶ Creates and returns the TensorFlow operations for calculating the different regularization losses for the given batch of state/internal state inputs.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- update – Single boolean tensor indicating whether this call happens during an update.
Returns: Dict of regularization loss tensors (keys == different regularization types, e.g. ‘entropy’).
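A minimal sketch of how the per-instance losses and the regularization losses documented above could combine into the single scalar loss. It uses plain Python lists in place of tensors, and the reduction (batch mean plus the sum of all regularization terms) is an assumption for illustration, not a statement of the library's exact implementation. The helper name `total_loss` is hypothetical.

```python
def total_loss(loss_per_instance, regularization_losses):
    """Reduce per-sample losses to one scalar and add regularization terms.

    Assumption: the scalar loss is the batch mean of the per-instance
    losses plus the sum of all regularization losses.
    """
    # Mean over the batch dimension (the first rank of the loss tensor).
    mean_loss = sum(loss_per_instance) / len(loss_per_instance)
    # Keys name the regularization type (e.g. 'entropy'); values are scalars.
    return mean_loss + sum(regularization_losses.values())


batch_losses = [0.5, 1.5, 1.0]       # one loss value per sample
regularization = {"entropy": 0.25}   # hypothetical regularization term
print(total_loss(batch_losses, regularization))  # 1.25
```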
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶ Runs self.optimization in the session to update the model’s parameters. Optionally also runs the
loss_per_instance
calculation and returns its result.
Parameters: - states (dict) – Dict of state tensors (each key represents one state space component).
- internals – List of prior internal state tensors.
- actions (dict) – Dict of action tensors (each key represents one action space component).
- terminal – Terminal boolean tensor (shape=(batch-size,)).
- reward – Reward float tensor (shape=(batch-size,)).
- return_loss_per_instance (bool) – Whether to also run and return the
loss_per_instance
Tensor.
Returns: None or, if return_loss_per_instance is True, the value of the
loss_per_instance
tensor.
-
-
class
tensorforce.models.
DistributionModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)¶ Bases:
tensorforce.models.model.Model
Base class for models using distributions parametrized by a neural network.
-
create_distributions
()¶
-
static
get_distributions_summaries
(distributions)¶
-
static
get_distributions_variables
(distributions, include_non_trainable=False)¶
-
get_optimizer_kwargs
(states, internals, actions, terminal, reward, update)¶
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_actions_and_internals
(states, internals, update, deterministic)¶
-
tf_kl_divergence
(states, internals, update)¶
-
tf_regularization_losses
(states, internals, update)¶
-
-
class
tensorforce.models.
PGModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the
tf_loss_per_instance
function, but requires subclasses to implement
tf_pg_loss_per_instance
.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶ Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.
Parameters: - states – Dict of state tensors.
- internals – List of prior internal state tensors.
- actions – Dict of action tensors.
- terminal – Terminal boolean tensor.
- reward – Reward tensor.
- update – Boolean tensor indicating whether this call happens during an update.
Returns: Loss tensor.
-
tf_regularization_losses
(states, internals, update)¶
-
tf_reward_estimation
(states, internals, terminal, reward, update)¶
-
-
class
tensorforce.models.
PGProbRatioModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.
-
get_optimizer_kwargs
(states, actions, terminal, reward, internals, update)¶
-
initialize
(custom_getter)¶
-
tf_compare
(states, internals, actions, terminal, reward, update, reference)¶
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_reference
(states, internals, actions, update)¶
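A minimal sketch of a likelihood-ratio loss with clipping, in the style of PPO, using plain Python floats instead of tensors. The clipping range `[1 - clipping, 1 + clipping]` follows the common PPO formulation and is an assumption; the library may bound the ratio differently. The function name and the `reference_log_probs` argument (log probabilities under the policy before the update) are hypothetical.

```python
import math


def clipped_ratio_loss_per_instance(log_probs, reference_log_probs,
                                    advantages, clipping=0.2):
    """Per-sample clipped surrogate loss (assumed PPO-style formulation)."""
    losses = []
    for log_prob, ref_log_prob, advantage in zip(
            log_probs, reference_log_probs, advantages):
        # Likelihood ratio between the current and the reference policy.
        ratio = math.exp(log_prob - ref_log_prob)
        # Keep the ratio within [1 - clipping, 1 + clipping].
        clipped = min(max(ratio, 1.0 - clipping), 1.0 + clipping)
        # Pessimistic (minimum) objective, negated to obtain a loss.
        losses.append(-min(ratio * advantage, clipped * advantage))
    return losses


losses = clipped_ratio_loss_per_instance(
    log_probs=[0.0, math.log(2.0)],
    reference_log_probs=[0.0, 0.0],
    advantages=[2.0, 1.0],
)
```

With an unchanged policy the ratio is 1 and the loss is the negated advantage. When the ratio grows beyond the clipping range and the advantage is positive, the clipped term caps the objective.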
-
-
class
tensorforce.models.
PGLogProbModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶ Bases:
tensorforce.models.pg_model.PGModel
Policy gradient model based on computing log likelihoods, e.g. VPG.
-
tf_pg_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
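A minimal sketch of a log-likelihood policy gradient loss, as used in vanilla policy gradient (VPG), written with plain Python floats instead of tensors. It assumes the per-instance loss is the negated log probability of the taken action, weighted by the (possibly baseline-adjusted) reward. The function name and its arguments are hypothetical.

```python
import math


def pg_log_prob_loss_per_instance(action_probs, rewards):
    """Per-sample loss: -log pi(a|s) * reward (assumed VPG-style form)."""
    return [-math.log(prob) * reward
            for prob, reward in zip(action_probs, rewards)]


# Probability the policy assigned to each taken action, and the reward
# (or advantage) estimate attached to it.
losses = pg_log_prob_loss_per_instance(
    action_probs=[0.5, 0.25],
    rewards=[1.0, -2.0],
)
```

Minimizing this loss raises the probability of actions with positive reward and lowers it for actions with negative reward.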
-
-
class
tensorforce.models.
QModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.distribution_model.DistributionModel
Q-value model.
-
get_summaries
()¶
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_optimization
(states, internals, actions, terminal, reward, update)¶
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶ Creates the deltas (or advantage) of the Q values.
Returns: A list of deltas per action.
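A minimal sketch of a one-step Q-value delta (temporal-difference error) for a single transition, using plain Python floats. It assumes the standard form in which the bootstrapped next value is dropped on terminal transitions; the `discount` default is a hypothetical value, and the function is illustrative rather than the library's implementation.

```python
def q_delta(q_value, next_q_value, terminal, reward, discount=0.99):
    """One-step TD error: reward + discount * next_q - q (assumed form)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if terminal else discount * next_q_value
    return reward + bootstrap - q_value


print(q_delta(1.0, 2.0, False, 0.5, discount=0.5))  # 0.5
print(q_delta(1.0, 2.0, True, 0.5, discount=0.5))   # -0.5
```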
-
tf_q_value
(embedding, distr_params, action, name)¶
-
update
(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶
-
-
class
tensorforce.models.
QNstepModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.
-
tf_q_delta
(q_value, next_q_value, terminal, reward)¶
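A minimal sketch of n-step deltas over one trajectory segment, using plain Python lists. It assumes the targets are discounted cumulative rewards that bootstrap from a value estimate after the last step and reset at terminal transitions. The function name and the `bootstrap_value` argument are hypothetical; this illustrates the idea rather than the library's exact computation.

```python
def n_step_deltas(q_values, rewards, terminals, bootstrap_value, discount):
    """Deltas between n-step return targets and current Q-value estimates."""
    deltas = [0.0] * len(rewards)
    running_return = bootstrap_value
    # Walk the segment backwards, accumulating the discounted return.
    for t in reversed(range(len(rewards))):
        if terminals[t]:
            # Nothing after a terminal step contributes to its return.
            running_return = 0.0
        running_return = rewards[t] + discount * running_return
        deltas[t] = running_return - q_values[t]
    return deltas


print(n_step_deltas(
    q_values=[0.0, 0.0, 0.0],
    rewards=[1.0, 1.0, 1.0],
    terminals=[False, False, False],
    bootstrap_value=4.0,
    discount=0.5,
))  # [2.25, 2.5, 3.0]
```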
-
-
class
tensorforce.models.
QNAFModel
(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶ Bases:
tensorforce.models.q_model.QModel
-
get_variables
(include_non_trainable=False)¶
-
initialize
(custom_getter)¶
-
tf_loss_per_instance
(states, internals, actions, terminal, reward, update)¶
-
tf_q_value
(embedding, distr_params, action, name)¶
-
tf_regularization_losses
(states, internals, update)¶
-
-
class
tensorforce.models.
QDemoModel
(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)¶ Bases:
tensorforce.models.q_model.QModel
Model for deep Q-learning from demonstration. Principal structure similar to double deep Q-networks but uses additional loss terms for demo data.
-
create_output_operations
(states, internals, actions, terminal, reward, update, deterministic)¶
-
demonstration_update
(states, internals, actions, terminal, reward)¶
-
initialize
(custom_getter)¶
-
tf_demo_loss
(states, actions, terminal, reward, internals, update)¶
-
tf_demo_optimization
(states, internals, actions, terminal, reward, update)¶
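A minimal sketch of the kind of additional loss term used for demonstration data, assuming the large-margin supervised loss from the deep Q-learning from demonstrations literature. Plain Python floats stand in for tensors. How `expert_margin` and `supervised_weight` enter the library's actual loss is an assumption here, and both function names are hypothetical.

```python
def supervised_margin_loss(q_values, expert_action, expert_margin):
    """Large-margin loss for a single demonstration sample (assumed form)."""
    # Add the margin to every action except the expert's action.
    augmented = [
        q if action == expert_action else q + expert_margin
        for action, q in enumerate(q_values)
    ]
    # Zero only if the expert action beats all others by at least the margin.
    return max(augmented) - q_values[expert_action]


def demo_loss(q_loss, margin_loss, supervised_weight):
    """Combine the Q-learning loss with the weighted supervised term."""
    return q_loss + supervised_weight * margin_loss


margin = supervised_margin_loss([1.0, 2.0, 0.5], expert_action=0,
                                expert_margin=0.5)
print(margin)                       # 1.5
print(demo_loss(1.0, margin, 0.5))  # 1.75
```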
-