Policies

The default policy depends on the agent configuration, but its default argument is always network (whose default argument in turn is layers), so a list is a short-form specification of a sequential layer-stack network architecture:

Agent.create(
    ...
    policy=[
        dict(type='dense', size=64, activation='tanh'),
        dict(type='dense', size=64, activation='tanh')
    ],
    ...
)

Or simply:

Agent.create(
    ...
    policy=dict(network='auto'),
    ...
)

See the networks documentation for more information about how to specify a network.

Example of a full parametrized-distributions policy specification with customized distributions and decaying temperature:

Agent.create(
    ...
    policy=dict(
        type='parametrized_distributions',
        network=[
            dict(type='dense', size=64, activation='tanh'),
            dict(type='dense', size=64, activation='tanh')
        ],
        distributions=dict(
            float=dict(type='gaussian', stddev_mode='global'),
            bounded_action=dict(type='beta')
        ),
        temperature=dict(
            type='decaying', decay='exponential', unit='episodes',
            num_steps=100, initial_value=0.01, decay_rate=0.5
        )
    ),
    ...
)

In the case of multiple action components, some policy types, like parametrized_distributions, support specifying additional network outputs for some or all actions via registered tensors:

Agent.create(
    ...
    actions=dict(
        action1=dict(type='int', shape=(), num_values=5),
        action2=dict(type='float', shape=(), min_value=-1.0, max_value=1.0)
    ),
    ...
    policy=dict(
        type='parametrized_distributions',
        network=[
            dict(type='dense', size=64),
            dict(type='register', tensor='action1-embedding'),
            dict(type='dense', size=64)
            # Final output implicitly used for remaining actions
        ],
        single_output=False
    ),
    ...
)

class tensorforce.core.policies.ParametrizedActionValue(network='auto', *, device=None, l2_regularization=None, name=None, states_spec=None, auxiliaries_spec=None, internals_spec=None, actions_spec=None)

Policy which parametrizes an action-value function, conditioned on the output of a neural network processing the input state (specification key: parametrized_action_value).

Parameters:
  • network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
  • device (string) – Device name (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • name (string) – internal use.
  • states_spec (specification) – internal use.
  • auxiliaries_spec (specification) – internal use.
  • internals_spec (specification) – internal use.
  • actions_spec (specification) – internal use.
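
Following the elided Agent.create pattern used above, a minimal sketch of how this policy type might be specified via its specification key (the layer sizes and activations are illustrative assumptions, not defaults):

Agent.create(
    ...
    policy=dict(
        type='parametrized_action_value',
        network=[
            # illustrative layer stack, not a prescribed default
            dict(type='dense', size=64, activation='relu'),
            dict(type='dense', size=64, activation='relu')
        ]
    ),
    ...
)
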
class tensorforce.core.policies.ParametrizedDistributions(network='auto', *, single_output=True, distributions=None, temperature=1.0, use_beta_distribution=False, device=None, l2_regularization=None, name=None, states_spec=None, auxiliaries_spec=None, internals_spec=None, actions_spec=None)

Policy which parametrizes independent distributions per action, conditioned on the output of a central neural network processing the input state, supporting both a stochastic and value-based policy interface (specification key: parametrized_distributions).

Parameters:
  • network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
  • single_output (bool) – Whether the network returns a single embedding tensor or, in the case of multiple action components, specifies additional outputs for some/all action distributions, via registered tensors with name “[ACTION]-embedding” (default: single output).
  • distributions (dict[specification]) – Distributions configuration, see distributions, specified per action-type or -name (default: per action-type, Bernoulli distribution for binary boolean actions, categorical distribution for discrete integer actions, Gaussian distribution for unbounded continuous actions, Beta distribution for bounded continuous actions).
  • temperature (parameter | dict[parameter], float >= 0.0) – Sampling temperature, global or per action (default: 1.0).
  • use_beta_distribution (bool) – Whether to use the Beta distribution for bounded continuous actions by default (default: false).
  • device (string) – Device name (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • name (string) – internal use.
  • states_spec (specification) – internal use.
  • auxiliaries_spec (specification) – internal use.
  • internals_spec (specification) – internal use.
  • actions_spec (specification) – internal use.
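
For example, the temperature argument can also be given per action; a hedged sketch in the same elided style, assuming per-action values are keyed by action name (action names and temperature values are illustrative):

Agent.create(
    ...
    actions=dict(
        action1=dict(type='int', shape=(), num_values=5),
        action2=dict(type='float', shape=(), min_value=-1.0, max_value=1.0)
    ),
    ...
    policy=dict(
        type='parametrized_distributions',
        network='auto',
        use_beta_distribution=True,
        # assumed: per-action temperatures keyed by action name
        temperature=dict(action1=1.0, action2=0.5)
    ),
    ...
)
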
class tensorforce.core.policies.ParametrizedStateValue(network='auto', *, device=None, l2_regularization=None, name=None, states_spec=None, auxiliaries_spec=None, internals_spec=None, actions_spec=None)

Policy which parametrizes a state-value function, conditioned on the output of a neural network processing the input state (specification key: parametrized_state_value).

Parameters:
  • network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
  • device (string) – Device name (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • name (string) – internal use.
  • states_spec (specification) – internal use.
  • auxiliaries_spec (specification) – internal use.
  • internals_spec (specification) – internal use.
  • actions_spec (specification) – internal use.
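
A state-value policy is typically used as a critic/baseline rather than as the main acting policy. A hedged sketch in the same elided style, assuming the agent accepts a baseline policy specification via its baseline argument (layer sizes and activations are illustrative):

Agent.create(
    ...
    baseline=dict(
        type='parametrized_state_value',
        network=[
            # illustrative layer stack
            dict(type='dense', size=64, activation='tanh'),
            dict(type='dense', size=64, activation='tanh')
        ]
    ),
    ...
)
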
class tensorforce.core.policies.ParametrizedValuePolicy(network='auto', *, single_output=True, state_value_mode='separate', device=None, l2_regularization=None, name=None, states_spec=None, auxiliaries_spec=None, internals_spec=None, actions_spec=None)

Policy which parametrizes independent action-/advantage-/state-value functions per action and optionally a state-value function, conditioned on the output of a central neural network processing the input state (specification key: parametrized_value_policy).

Parameters:
  • network ('auto' | specification) – Policy network configuration, see networks (default: ‘auto’, automatically configured network).
  • single_output (bool) – Whether the network returns a single embedding tensor or, in the case of multiple action components, specifies additional outputs for some/all action/state value functions, via registered tensors with name “[ACTION]-embedding” or “state-embedding”/”[ACTION]-state-embedding” depending on the state_value_mode argument (default: single output).
  • state_value_mode ('implicit' | 'separate' | 'separate-per-action') – Whether to compute the state value implicitly as maximum action value (like DQN), or as either a single separate state-value function or a function per action (like DuelingDQN) (default: single separate state-value function).
  • device (string) – Device name (default: inherit value of parent module).
  • l2_regularization (float >= 0.0) – Scalar controlling L2 regularization (default: inherit value of parent module).
  • name (string) – internal use.
  • states_spec (specification) – internal use.
  • auxiliaries_spec (specification) – internal use.
  • internals_spec (specification) – internal use.
  • actions_spec (specification) – internal use.
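
Following the same elided style, a minimal sketch that makes the state_value_mode choice explicit (layer sizes and activations are illustrative assumptions; 'separate' is already the default):

Agent.create(
    ...
    policy=dict(
        type='parametrized_value_policy',
        network=[
            dict(type='dense', size=64, activation='relu'),
            dict(type='dense', size=64, activation='relu')
        ],
        # 'separate' adds a single separate state-value function,
        # similar to the dueling-network decomposition
        state_value_mode='separate'
    ),
    ...
)
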