Tensorforce Policy Agent

class tensorforce.agents.PolicyAgent(states, actions, update, objective, reward_estimation, max_episode_timesteps=None, policy=None, network='auto', memory=None, optimizer='adam', baseline_policy=None, baseline_network=None, baseline_optimizer=None, baseline_objective=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, buffer_observe=True, seed=None, execution=None, saver=None, summarizer=None, recorder=None)[source]

Policy Agent (specification key: policy).

Base class for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.
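
A minimal construction sketch, assuming a 4-dimensional float state and a binary discrete action (for example a CartPole-style task); the memory capacity, Adam learning rate and the 'policy_gradient' objective key below are illustrative choices rather than recommended settings:

    from tensorforce.agents import PolicyAgent

    # Minimal sketch: the required arguments (states, actions, update, objective,
    # reward_estimation) plus a few common optional ones; all values are examples.
    agent = PolicyAgent(
        states=dict(type='float', shape=(4,)),            # 4-dimensional observation
        actions=dict(type='int', num_actions=2),          # binary discrete action
        max_episode_timesteps=500,
        memory=10000,                                     # memory capacity (assumed value)
        update=dict(unit='timesteps', batch_size=64),     # update every 64 timesteps
        optimizer=dict(type='adam', learning_rate=1e-3),  # assumed learning rate
        objective='policy_gradient',                      # assumed key, see objectives
        reward_estimation=dict(horizon=20, discount=0.99)
    )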

Parameters:
  • states (specification) – States specification (required), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_states (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_actions (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Maximum number of timesteps per episode (default: not given).
  • policy (specification) – Policy configuration, currently best to ignore and use the network argument instead.
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • memory (int | specification) – Memory configuration, see memories (default: replay memory with given or inferred capacity).
  • update (int | specification) – Model update configuration with the following attributes (required; an int is shorthand for a timesteps-unit batch size):
    • unit ("timesteps" | "episodes") – unit for update attributes (required).
    • batch_size (parameter, long > 0) – size of update batch in number of units (required).
    • frequency ("never" | parameter, long > 0) – frequency of updates (default: batch_size).
    • start (parameter, long >= 2 * batch_size) – number of units before first update (default: 0).
  • optimizer (specification) – Optimizer configuration, see optimizers (default: Adam optimizer).
  • objective (specification) – Optimization objective configuration, see objectives (required).
  • reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
    • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation (required).
    • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 1.0).
    • estimate_horizon (false | "early" | "late") – Whether to estimate the value of horizon states, and if so, whether to estimate early when experience is stored, or late when it is retrieved (default: "late").
    • estimate_actions (bool) – Whether to estimate state-action values instead of state values (default: false).
    • estimate_terminal (bool) – Whether to estimate the value of terminal states (default: false).
    • estimate_advantage (bool) – Whether to estimate the advantage by subtracting the current estimate (default: false).
  • baseline_policy ("same" | "equal" | specification) – Baseline policy configuration, “same” refers to reusing the main policy as baseline, “equal” refers to using the same configuration as the main policy (default: none).
  • baseline_network ("same" | "equal" | specification) – Baseline network configuration, see networks, “same” refers to reusing the main network as part of the baseline policy, “equal” refers to using the same configuration as the main network (default: none).
  • baseline_optimizer ("same" | "equal" | specification) – Baseline optimizer configuration, see optimizers, “same” refers to reusing the main optimizer for the baseline, “equal” refers to using the same configuration as the main optimizer (default: none).
  • baseline_objective ("same" | "equal" | specification) – Baseline optimization objective configuration, see objectives, “same” refers to reusing the main objective for the baseline, “equal” refers to using the same configuration as the main objective (default: none); a combined baseline configuration is sketched in the example after this parameter list.

  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • buffer_observe (bool | int > 0) – Maximum number of timesteps within an episode to buffer before executing internal observe operations, to reduce calls to TensorFlow for improved performance (default: max_episode_timesteps or 1000, unless summarizer specified).
  • seed (int) – Random seed to set for Python, NumPy and TensorFlow (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: "model").
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0 | dict[int > 0]) – how frequently in timesteps to record summaries; applies to "variables" and "act" if specified globally (default: always), otherwise specified per "variables"/"act" in timesteps and per "observe"/"update" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropy": entropy of policy distribution
    • "graph": graph summary
    • "kl-divergence": KL-divergence of previous and updated policy distribution
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "processed-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-full": update histograms
    • "variables": variable mean and variance scalars
    • "variables-full": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
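
The sketch below combines several of the optional arguments described above: an “equal” baseline acting as critic for late horizon and advantage estimation, exploration and entropy regularization, and summarizer/saver/recorder output. Directory names and all numeric values are illustrative assumptions, as is the 'policy_gradient' objective key:

    from tensorforce.agents import PolicyAgent

    # More fully configured agent: continuous 2-dimensional action, episode-based
    # updates, a baseline that reuses the main network/optimizer/objective
    # configuration ("equal"), and TensorBoard summaries, checkpoints and
    # experience traces written to hypothetical local directories.
    agent = PolicyAgent(
        states=dict(type='float', shape=(8,)),
        actions=dict(type='float', shape=(2,), min_value=-1.0, max_value=1.0),
        max_episode_timesteps=1000,
        network='auto',
        memory=50000,
        update=dict(unit='episodes', batch_size=10),
        optimizer=dict(type='adam', learning_rate=3e-4),
        objective='policy_gradient',                      # assumed objective key
        reward_estimation=dict(
            horizon='episode',
            discount=0.99,
            estimate_horizon='late',
            estimate_advantage=True
        ),
        baseline_network='equal',                         # same configuration as the main network
        baseline_optimizer='equal',
        baseline_objective='equal',                       # documented shorthand; see note below
        exploration=0.1,                                  # std of Gaussian noise on the float actions
        entropy_regularization=0.01,
        summarizer=dict(directory='summaries',            # hypothetical directory
                        labels=['graph', 'losses', 'rewards'],
                        frequency=100),
        saver=dict(directory='checkpoints', frequency=600),
        recorder=dict(directory='traces', frequency=10)
    )

Note that "equal" for the baseline arguments simply copies the main configuration; in practice a dedicated value-style objective (see objectives) is the more common choice for a baseline/critic, so treat the combination above as a structural sketch rather than a recommended setup.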
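
Finally, since the states and actions specifications are usually taken from Environment.states() and Environment.actions(), the following sketch shows the surrounding act/observe interaction loop. Environment.create, env.execute and the act/observe methods belong to the broader Tensorforce Agent/Environment API documented separately, and the Gym level name is an illustrative assumption:

    from tensorforce.agents import PolicyAgent
    from tensorforce.environments import Environment

    # Environment-derived specifications and a manual act/observe loop.
    env = Environment.create(environment='gym', level='CartPole-v1',
                             max_episode_timesteps=500)
    agent = PolicyAgent(
        states=env.states(), actions=env.actions(), max_episode_timesteps=500,
        memory=10000, update=dict(unit='timesteps', batch_size=64),
        objective='policy_gradient', reward_estimation=dict(horizon=20)
    )

    for episode in range(100):
        states = env.reset()
        terminal = False
        while not terminal:
            actions = agent.act(states=states)
            # execute() returns next states, a terminal flag and the reward
            states, terminal, reward = env.execute(actions=actions)
            agent.observe(terminal=terminal, reward=reward)

    agent.close()
    env.close()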