Advantage Actor-Critic

class tensorforce.agents.AdvantageActorCritic(states, actions, max_episode_timesteps, network='auto', batch_size=10, update_frequency=None, learning_rate=0.0003, horizon=0, discount=0.99, state_action_value=False, estimate_terminal=False, critic_network='auto', critic_optimizer=1.0, memory=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Advantage Actor-Critic agent (specification key: a2c).
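
A minimal usage sketch (assuming the bundled OpenAI Gym wrapper and a CartPole-v1 level are available; the hyperparameter values shown are illustrative, not recommendations):

    from tensorforce.agents import Agent
    from tensorforce.environments import Environment

    # Illustrative environment; states, actions and max_episode_timesteps
    # are then taken from it implicitly by Agent.create(...).
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    # Create the agent via its specification key 'a2c'; unspecified
    # arguments keep the defaults documented below.
    agent = Agent.create(
        agent='a2c',
        environment=environment,
        batch_size=10,
        learning_rate=3e-4,
        horizon=0,
        discount=0.99,
        critic_network='auto',
        critic_optimizer=1.0  # float: custom weight for the critic loss
    )

    # Basic act-observe interaction loop for a single episode.
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)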

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes (for an explicit specification, see the sketch after this parameter list):
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • network ("auto" | specification) – Policy network configuration, see networks (default: “auto”, automatically configured network).
  • batch_size (parameter, long > 0) – Number of episodes per update batch (default: 10 episodes).
  • update_frequency ("never" | parameter, long > 0) – Frequency of updates (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 3e-4).
  • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 0).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • state_action_value (bool) – Whether to estimate state-action values instead of state values (default: false).
  • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
  • critic_network (specification) – Critic network configuration, see networks (default: “auto”).
  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see optimizers, a float instead specifies a custom weight for the critic loss (default: 1.0).
  • memory (int > 0) – Memory capacity, has to fit at least around batch_size + one episode (default: minimum required size).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0 | dict[int > 0]) – how frequently to record summaries: if specified globally as a single int, applies to act-summaries in timesteps (default: always); otherwise specified per category, for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – all excluding "*-histogram" labels, or list of summaries to record, from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
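
The dictionary-valued arguments above can be sketched as follows: an explicit states/actions specification replaces the environment argument, and the saver/summarizer/recorder dictionaries use the attributes documented in this list (shapes, directory names and frequencies are placeholders, not defaults):

    from tensorforce.agents import Agent

    agent = Agent.create(
        agent='a2c',
        # Explicit specifications instead of the environment argument
        # (shape and num_values are placeholders).
        states=dict(type='float', shape=(8,)),
        actions=dict(type='int', num_values=4),
        max_episode_timesteps=500,
        batch_size=10,
        # Checkpointing, TensorBoard summaries and experience traces.
        saver=dict(directory='checkpoints', filename='a2c-model', frequency=600),
        summarizer=dict(directory='summaries', labels=['graph', 'losses', 'rewards'], frequency=100),
        recorder=dict(directory='traces', frequency=10)
    )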