tensorforce.models package¶

Submodules¶

tensorforce.models.constant_model module¶

class tensorforce.models.constant_model.ConstantModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, action_values)¶

Bases: tensorforce.models.model.Model

Utility class to return constant actions of a desired shape and with given bounds.

tf_actions_and_internals(states, internals, update, deterministic)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tensorforce.models.distribution_model module¶

class tensorforce.models.distribution_model.DistributionModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)¶

Bases: tensorforce.models.model.Model

Base class for models using distributions parameterized by a neural network.

create_distributions()¶

static get_distributions_summaries(distributions)¶

static get_distributions_variables(distributions, include_non_trainable=False)¶

get_optimizer_kwargs(states, internals, actions, terminal, reward, update)¶

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_actions_and_internals(states, internals, update, deterministic)¶

tf_kl_divergence(states, internals, update)¶

tf_regularization_losses(states, internals, update)¶

tensorforce.models.model module¶

The Model class coordinates the creation and execution of all TensorFlow operations within a model. It implements the reset, act and update functions, which form the interface the Agent class communicates with and which should not need to be overridden. Instead, the following TensorFlow functions need to be implemented:

tf_actions_and_internals(states, internals, update, deterministic) returning the batch of actions and successor internal states.
tf_loss_per_instance(states, internals, actions, terminal, reward, update) returning the loss per instance for a batch.

Further, the following TensorFlow functions should be extended accordingly:

initialize(custom_getter) defining TensorFlow placeholders/functions and adding internal states.
get_variables() returning the list of TensorFlow variables (to be optimized) of this model.
tf_regularization_losses(states, internals, update) returning a dict of regularization losses.
get_optimizer_kwargs(states, internals, actions, terminal, reward, update) returning a dict of potential arguments (argument-free functions) to the optimizer.

Finally, the following TensorFlow functions can be useful in some cases:

tf_preprocess_states(states) for state preprocessing, returning the processed batch of states.
tf_action_exploration(action, exploration, action_spec) for action postprocessing (e.g. exploration), returning the processed batch of actions.
tf_preprocess_reward(states, internals, terminal, reward) for reward preprocessing (e.g. reward normalization), returning the processed batch of rewards.
create_output_operations(states, internals, actions, terminal, reward, update, deterministic) for further output operations, similar to those for Model.act and Model.update.
tf_optimization(states, internals, actions, terminal, reward, update) for further optimization operations (e.g. the baseline update in a PGModel or the target network update in a QModel), returning a single grouped optimization operation.
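
The division of labour described above (fixed act and update entry points, with subclasses supplying the action and loss functions) can be pictured with a plain-Python sketch. This is not the library's implementation: SketchModel, ZeroActionModel and the un-prefixed hook names are hypothetical, no TensorFlow graph is built, and averaging the per-instance losses into one batch loss is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class SketchModel(ABC):
    """Hypothetical stand-in for the Model base class (plain Python, no TensorFlow)."""

    def act(self, states, internals, deterministic=False):
        # Fixed entry point: delegates to the hook implemented by the subclass.
        return self.actions_and_internals(states, internals, deterministic)

    def update(self, states, internals, actions, terminal, reward):
        # Fixed entry point: reduces the per-instance losses to one batch loss
        # (the mean is an assumption of this sketch).
        losses = self.loss_per_instance(states, internals, actions, terminal, reward)
        return sum(losses) / len(losses)

    @abstractmethod
    def actions_and_internals(self, states, internals, deterministic):
        """Return the batch of actions and the successor internal states."""

    @abstractmethod
    def loss_per_instance(self, states, internals, actions, terminal, reward):
        """Return one loss value per batch instance."""


class ZeroActionModel(SketchModel):
    """Toy subclass: always acts with 0 and uses the squared reward as its loss."""

    def actions_and_internals(self, states, internals, deterministic):
        return [0 for _ in states], internals

    def loss_per_instance(self, states, internals, actions, terminal, reward):
        return [r * r for r in reward]


model = ZeroActionModel()
actions, internals = model.act(states=[[0.1], [0.2]], internals=[])
batch_loss = model.update([[0.1], [0.2]], [], actions, [False, True], [1.0, 3.0])
```

Only the two abstract hooks differ between subclasses here; the callers of act and update never need to know which subclass they are talking to.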

class tensorforce.models.model.Model(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec)¶

Bases: object

Base class for all (TensorFlow-based) models.

act(states, internals, deterministic=False)¶

close()¶

create_output_operations(states, internals, actions, terminal, reward, update, deterministic)¶

Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update. deterministic – Boolean tensor indicating whether action should be chosen deterministically.

get_optimizer_kwargs(states, internals, actions, terminal, reward, update)¶

Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular fn_loss returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Dict of optimizer arguments.

get_summaries()¶

Returns the TensorFlow summaries reported by the model.

Returns:	List of summaries.

get_variables(include_non_trainable=False)¶

Returns the TensorFlow variables used by the model.

Returns:	List of variables.

initialize(custom_getter)¶

Creates the TensorFlow placeholders and functions for this model. It also adds the internal state placeholders and initialization values to the model.

Parameters:	custom_getter – The `custom_getter_` object to use for `tf.make_template` when creating TensorFlow functions.

observe(terminal, reward)¶

reset()¶

Resets the model to its initial state on episode start.

Returns:	Current episode and timestep counter, and a list containing the internal states initializations.

restore(directory=None, file=None)¶

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:	directory – Optional checkpoint directory. file – Optional checkpoint file, or path if directory not given.

save(directory=None, append_timestep=True)¶

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to the checkpoint file name to prevent overwriting previous checkpoint files; turn this off to be able to load the model from the same path that was passed here.

Parameters:	directory – Optional checkpoint directory. append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:	Checkpoint path where the model was saved.
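
The effect of append_timestep on checkpoint file names can be illustrated with a small sketch. The helper checkpoint_path, the base name "model" and the "-" separator are assumptions chosen for illustration (modelled on how TensorFlow savers suffix a step number), not the library's actual naming code.

```python
import os


def checkpoint_path(directory, timestep, append_timestep=True, basename="model"):
    """Hypothetical helper: the path a checkpoint would be written to."""
    path = os.path.join(directory, basename)
    if append_timestep:
        # Suffixing the timestep gives every save its own file name.
        path = "{}-{}".format(path, timestep)
    return path


# With the suffix, later saves do not overwrite earlier ones.
stamped = [checkpoint_path("ckpt", t) for t in (1000, 2000)]
# Without it, every save targets the same, predictable path.
fixed = [checkpoint_path("ckpt", t, append_timestep=False) for t in (1000, 2000)]
```

Under these assumptions, the un-suffixed path is the one that can be passed back unchanged when restoring, at the cost of keeping only the most recent checkpoint.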

setup()¶

Sets up the TensorFlow model graph and initializes the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)¶

Applies optional exploration to the action.

tf_actions_and_internals(states, internals, update, deterministic)¶

Creates the TensorFlow operations for retrieving the actions (and posterior internal states) in reaction to the given input states (and prior internal states).

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. update – Boolean tensor indicating whether this call happens during an update. deterministic – Boolean tensor indicating whether action should be chosen deterministically.
Returns:	Actions and list of posterior internal state tensors.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)¶

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:	terminal – Terminal boolean tensor. reward – Reward tensor. discount – Discount factor. final_reward – Last reward value in the sequence.
Returns:	Discounted cumulative reward tensor.
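
A pure-Python sketch of the computation may help. It walks the sequence from right to left and restarts the sum at terminal steps. Treating final_reward as the value carried in from beyond the end of the sequence is an assumption of this sketch, as is the use of plain lists instead of tensors.

```python
def discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0):
    """Sketch: accumulate right to left, restarting at terminal steps."""
    cumulative = final_reward
    result = []
    for term, rew in zip(reversed(terminal), reversed(reward)):
        # A terminal step ends its episode, so nothing later is carried into it.
        cumulative = rew if term else rew + discount * cumulative
        result.append(cumulative)
    result.reverse()
    return result


returns = discounted_cumulative_reward(
    terminal=[False, False, True, False, False],
    reward=[1.0, 1.0, 1.0, 2.0, 2.0],
    discount=0.5,
    final_reward=4.0,
)
# returns == [1.75, 1.5, 1.0, 4.0, 4.0]
```

The first three entries belong to an episode that ends at the third step; the last two belong to an unfinished episode and therefore include the discounted final_reward.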

tf_loss(states, internals, actions, terminal, reward, update)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for calculating the loss per batch instance of the given input states and actions.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Loss tensor.

tf_optimization(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	The optimization operation.

tf_preprocess_reward(states, internals, terminal, reward)¶

Applies optional pre-processing to the reward.

tf_preprocess_states(states)¶

Applies optional pre-processing to the states.

tf_regularization_losses(states, internals, update)¶

Creates the TensorFlow operations for calculating the regularization losses for the given input states.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Dict of regularization loss tensors.

update(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶

tensorforce.models.pg_log_prob_model module¶

class tensorforce.models.pg_log_prob_model.PGLogProbModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing log likelihoods, e.g. VPG.

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶
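
As a rough illustration of a log-likelihood policy gradient loss, the sketch below weights the negative log probability of each taken action by its (estimated) reward. The function name and the list-based inputs are hypothetical; the library operates on tensors and obtains the log probabilities from its distributions.

```python
def log_prob_loss_per_instance(log_probs, rewards):
    """Sketch: per-instance loss -log pi(a|s) * R for a batch."""
    return [-log_prob * rew for log_prob, rew in zip(log_probs, rewards)]


losses = log_prob_loss_per_instance(log_probs=[-0.5, -2.0], rewards=[2.0, -1.0])
# losses == [1.0, -2.0]
```

Minimizing this loss raises the log probability of actions with positive reward and lowers it for actions with negative reward.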

tensorforce.models.pg_model module¶

class tensorforce.models.pg_model.PGModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶

Bases: tensorforce.models.distribution_model.DistributionModel

Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the tf_loss_per_instance function, but requires subclasses to implement tf_pg_loss_per_instance.

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_optimization(states, internals, actions, terminal, reward, update)¶

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Loss tensor.

tf_regularization_losses(states, internals, update)¶

tf_reward_estimation(states, internals, terminal, reward, update)¶
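
The role of a baseline together with gae_lambda can be sketched with generalized advantage estimation in plain Python. This is a textbook formulation under stated assumptions (a single episode segment, a list of baseline value predictions, and next_value as the value after the last step); the function and argument names other than gae_lambda are hypothetical and the library's own estimation may differ in detail.

```python
def generalized_advantage_estimate(rewards, values, next_value, discount, gae_lambda):
    """Sketch of GAE: exponentially weighted sum of one-step TD residuals."""
    advantages = []
    advantage = 0.0
    following_value = next_value
    for rew, value in zip(reversed(rewards), reversed(values)):
        # One-step TD residual of the baseline prediction.
        td_residual = rew + discount * following_value - value
        # Later residuals are decayed by discount * gae_lambda per step.
        advantage = td_residual + discount * gae_lambda * advantage
        advantages.append(advantage)
        following_value = value
    advantages.reverse()
    return advantages


advantages = generalized_advantage_estimate(
    rewards=[1.0, 1.0], values=[2.0, 4.0], next_value=0.0,
    discount=0.5, gae_lambda=0.5,
)
# advantages == [0.25, -3.0]
```

With gae_lambda set to 0 the estimate reduces to the one-step TD residual, and with 1 it becomes the discounted return minus the baseline value.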

tensorforce.models.pg_prob_ratio_model module¶

class tensorforce.models.pg_prob_ratio_model.PGProbRatioModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)¶

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.

get_optimizer_kwargs(states, actions, terminal, reward, internals, update)¶

initialize(custom_getter)¶

tf_compare(states, internals, actions, terminal, reward, update, reference)¶

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_reference(states, internals, actions, update)¶
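
A likelihood-ratio loss with optional clipping, in the spirit of the likelihood_ratio_clipping constructor argument, might look like the following plain-Python sketch. It follows the clipped surrogate objective commonly associated with PPO; the function name, the list-based inputs and the exact clipping form are assumptions, not a transcription of the library code.

```python
import math


def prob_ratio_loss_per_instance(log_probs, old_log_probs, advantages, clipping=None):
    """Sketch: negative (optionally clipped) likelihood-ratio surrogate."""
    losses = []
    for log_prob, old_log_prob, adv in zip(log_probs, old_log_probs, advantages):
        # Ratio of new to old action probability, computed from log probabilities.
        ratio = math.exp(log_prob - old_log_prob)
        if clipping is None:
            surrogate = ratio * adv
        else:
            clipped = min(max(ratio, 1.0 - clipping), 1.0 + clipping)
            # Take the more pessimistic of the clipped and unclipped terms.
            surrogate = min(ratio * adv, clipped * adv)
        losses.append(-surrogate)
    return losses


losses = prob_ratio_loss_per_instance(
    log_probs=[0.0, math.log(2.0)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, 1.0],
    clipping=0.2,
)
```

In this example the first ratio is 1 and is left untouched, while the second ratio of about 2 is clipped to 1.2, which limits how much a single update is rewarded for moving the policy.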

tensorforce.models.q_demo_model module¶

class tensorforce.models.q_demo_model.QDemoModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)¶

Bases: tensorforce.models.q_model.QModel

Model for deep Q-learning from demonstration. Its principal structure is similar to double deep Q-networks, but it uses additional loss terms for demonstration data.

create_output_operations(states, internals, actions, terminal, reward, update, deterministic)¶

demonstration_update(states, internals, actions, terminal, reward)¶

initialize(custom_getter)¶

tf_demo_loss(states, actions, terminal, reward, internals, update)¶

tf_demo_optimization(states, internals, actions, terminal, reward, update)¶

tensorforce.models.q_model module¶

class tensorforce.models.q_model.QModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.distribution_model.DistributionModel

Q-value model.

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_optimization(states, internals, actions, terminal, reward, update)¶

tf_q_delta(q_value, next_q_value, terminal, reward)¶

Creates the deltas (or advantage) of the Q values.

Returns:	A list of deltas per action
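
The delta can be read as a one-step temporal-difference error. The sketch below is a plain-Python version for a batch of scalar Q-values; passing discount as an explicit argument and zeroing the bootstrap term at terminal steps are assumptions of this sketch (the documented method takes no discount argument).

```python
def q_delta(q_value, next_q_value, terminal, reward, discount):
    """Sketch: reward + discount * next Q-value - current Q-value, per instance."""
    deltas = []
    for q, next_q, term, rew in zip(q_value, next_q_value, terminal, reward):
        # No future value is bootstrapped once the episode has ended.
        bootstrap = 0.0 if term else discount * next_q
        deltas.append(rew + bootstrap - q)
    return deltas


deltas = q_delta(
    q_value=[1.0, 2.0],
    next_q_value=[4.0, 8.0],
    terminal=[False, True],
    reward=[1.0, 1.0],
    discount=0.5,
)
# deltas == [2.0, -1.0]
```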

tf_q_value(embedding, distr_params, action, name)¶

update(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶

tensorforce.models.q_naf_model module¶

class tensorforce.models.q_naf_model.QNAFModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.q_model.QModel

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_q_value(embedding, distr_params, action, name)¶

tf_regularization_losses(states, internals, update)¶

tensorforce.models.q_nstep_model module¶

class tensorforce.models.q_nstep_model.QNstepModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.q_model.QModel

Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.

tf_q_delta(q_value, next_q_value, terminal, reward)¶

tensorforce.models.random_model module¶

class tensorforce.models.random_model.RandomModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec)¶

Bases: tensorforce.models.model.Model

Utility class to return random actions of a desired shape and with given bounds.

tf_actions_and_internals(states, internals, update, deterministic)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

Module contents¶

class tensorforce.models.Model(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec)¶

Bases: object

Base class for all (TensorFlow-based) models.

act(states, internals, deterministic=False)¶

close()¶

create_output_operations(states, internals, actions, terminal, reward, update, deterministic)¶

Calls all the relevant TensorFlow functions for this model and hence creates all the TensorFlow operations involved.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update. deterministic – Boolean tensor indicating whether action should be chosen deterministically.

get_optimizer_kwargs(states, internals, actions, terminal, reward, update)¶

Returns the optimizer arguments including the time, the list of variables to optimize, and various argument-free functions (in particular fn_loss returning the combined 0-dim batch loss tensor) which the optimizer might require to perform an update step.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Dict of optimizer arguments.

get_summaries()¶

Returns the TensorFlow summaries reported by the model.

Returns:	List of summaries.

get_variables(include_non_trainable=False)¶

Returns the TensorFlow variables used by the model.

Returns:	List of variables.

initialize(custom_getter)¶

Creates the TensorFlow placeholders and functions for this model. It also adds the internal state placeholders and initialization values to the model.

Parameters:	custom_getter – The `custom_getter_` object to use for `tf.make_template` when creating TensorFlow functions.

observe(terminal, reward)¶

reset()¶

Resets the model to its initial state on episode start.

Returns:	Current episode and timestep counter, and a list containing the internal states initializations.

restore(directory=None, file=None)¶

Restore TensorFlow model. If no checkpoint file is given, the latest checkpoint is restored. If no checkpoint directory is given, the model’s default saver directory is used (unless file specifies the entire path).

Parameters:	directory – Optional checkpoint directory. file – Optional checkpoint file, or path if directory not given.

save(directory=None, append_timestep=True)¶

Save TensorFlow model. If no checkpoint directory is given, the model’s default saver directory is used. Optionally appends the current timestep to the checkpoint file name to prevent overwriting previous checkpoint files; turn this off to be able to load the model from the same path that was passed here.

Parameters:	directory – Optional checkpoint directory. append_timestep – Appends the current timestep to the checkpoint file if true.
Returns:	Checkpoint path where the model was saved.

setup()¶

Sets up the TensorFlow model graph and initializes the TensorFlow session.

tf_action_exploration(action, exploration, action_spec)¶

Applies optional exploration to the action.

tf_actions_and_internals(states, internals, update, deterministic)¶

Creates the TensorFlow operations for retrieving the actions (and posterior internal states) in reaction to the given input states (and prior internal states).

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. update – Boolean tensor indicating whether this call happens during an update. deterministic – Boolean tensor indicating whether action should be chosen deterministically.
Returns:	Actions and list of posterior internal state tensors.

tf_discounted_cumulative_reward(terminal, reward, discount, final_reward=0.0)¶

Creates the TensorFlow operations for calculating the discounted cumulative rewards for a given sequence of rewards.

Parameters:	terminal – Terminal boolean tensor. reward – Reward tensor. discount – Discount factor. final_reward – Last reward value in the sequence.
Returns:	Discounted cumulative reward tensor.

tf_loss(states, internals, actions, terminal, reward, update)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for calculating the loss per batch instance of the given input states and actions.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Loss tensor.

tf_optimization(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for performing an optimization update step based on the given input states and actions batch.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	The optimization operation.

tf_preprocess_reward(states, internals, terminal, reward)¶

Applies optional pre-processing to the reward.

tf_preprocess_states(states)¶

Applies optional pre-processing to the states.

tf_regularization_losses(states, internals, update)¶

Creates the TensorFlow operations for calculating the regularization losses for the given input states.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Dict of regularization loss tensors.

update(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶

class tensorforce.models.DistributionModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization)¶

Bases: tensorforce.models.model.Model

Base class for models using distributions parameterized by a neural network.

create_distributions()¶

static get_distributions_summaries(distributions)¶

static get_distributions_variables(distributions, include_non_trainable=False)¶

get_optimizer_kwargs(states, internals, actions, terminal, reward, update)¶

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_actions_and_internals(states, internals, update, deterministic)¶

tf_kl_divergence(states, internals, update)¶

tf_regularization_losses(states, internals, update)¶

class tensorforce.models.PGModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶

Bases: tensorforce.models.distribution_model.DistributionModel

Base class for policy gradient models. It optionally defines a baseline and handles its optimization. It implements the tf_loss_per_instance function, but requires subclasses to implement tf_pg_loss_per_instance.

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_optimization(states, internals, actions, terminal, reward, update)¶

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶

Creates the TensorFlow operations for calculating the (policy-gradient-specific) loss per batch instance of the given input states and actions, after the specified reward/advantage calculations.

Parameters:	states – Dict of state tensors. internals – List of prior internal state tensors. actions – Dict of action tensors. terminal – Terminal boolean tensor. reward – Reward tensor. update – Boolean tensor indicating whether this call happens during an update.
Returns:	Loss tensor.

tf_regularization_losses(states, internals, update)¶

tf_reward_estimation(states, internals, terminal, reward, update)¶

class tensorforce.models.PGProbRatioModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda, likelihood_ratio_clipping)¶

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing likelihood ratios, e.g. TRPO and PPO.

get_optimizer_kwargs(states, actions, terminal, reward, internals, update)¶

initialize(custom_getter)¶

tf_compare(states, internals, actions, terminal, reward, update, reference)¶

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_reference(states, internals, actions, update)¶

class tensorforce.models.PGLogProbModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, baseline_mode, baseline, baseline_optimizer, gae_lambda)¶

Bases: tensorforce.models.pg_model.PGModel

Policy gradient model based on computing log likelihoods, e.g. VPG.

tf_pg_loss_per_instance(states, internals, actions, terminal, reward, update)¶

class tensorforce.models.QModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.distribution_model.DistributionModel

Q-value model.

get_summaries()¶

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_optimization(states, internals, actions, terminal, reward, update)¶

tf_q_delta(q_value, next_q_value, terminal, reward)¶

Creates the deltas (or advantage) of the Q values.

Returns:	A list of deltas per action

tf_q_value(embedding, distr_params, action, name)¶

update(states, internals, actions, terminal, reward, return_loss_per_instance=False)¶

class tensorforce.models.QNstepModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.q_model.QModel

Deep Q network using n-step rewards as described in Asynchronous Methods for Deep Reinforcement Learning.

tf_q_delta(q_value, next_q_value, terminal, reward)¶

class tensorforce.models.QNAFModel(states_spec, actions_spec, network_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix)¶

Bases: tensorforce.models.q_model.QModel

get_variables(include_non_trainable=False)¶

initialize(custom_getter)¶

tf_loss_per_instance(states, internals, actions, terminal, reward, update)¶

tf_q_value(embedding, distr_params, action, name)¶

tf_regularization_losses(states, internals, update)¶

class tensorforce.models.QDemoModel(states_spec, actions_spec, device, session_config, scope, saver_spec, summary_spec, distributed_spec, optimizer, discount, variable_noise, states_preprocessing_spec, explorations_spec, reward_preprocessing_spec, network_spec, distributions_spec, entropy_regularization, target_sync_frequency, target_update_weight, double_q_model, huber_loss, random_sampling_fix, expert_margin, supervised_weight)¶

Bases: tensorforce.models.q_model.QModel

Model for deep Q-learning from demonstration. Its principal structure is similar to double deep Q-networks, but it uses additional loss terms for demonstration data.

create_output_operations(states, internals, actions, terminal, reward, update, deterministic)¶

demonstration_update(states, internals, actions, terminal, reward)¶

initialize(custom_getter)¶

tf_demo_loss(states, actions, terminal, reward, internals, update)¶

tf_demo_optimization(states, internals, actions, terminal, reward, update)¶