Advantage Actor-Critic

class tensorforce.agents.AdvantageActorCritic(states, actions, batch_size, max_episode_timesteps=None, network='auto', use_beta_distribution=False, memory='minimum', update_frequency=1.0, learning_rate=0.001, horizon=1, discount=0.99, return_processing=None, advantage_processing=None, predict_terminal_values=False, critic='auto', critic_optimizer=1.0, state_preprocessing='linear_normalization', reward_preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, parallel_interactions=1, config=None, saver=None, summarizer=None, tracking=None, recorder=None, **kwargs)

Advantage Actor-Critic agent (specification key: a2c).

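A minimal usage sketch follows; the Gym environment wrapper, the CartPole-v1 level, and the hyperparameter values are illustrative assumptions rather than recommended settings.

    from tensorforce import Agent, Environment

    # The environment provides the states/actions specifications implicitly
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    # Create the agent via its specification key 'a2c'
    agent = Agent.create(
        agent='a2c',
        environment=environment,  # states, actions, max_episode_timesteps taken from here
        batch_size=10,
        learning_rate=1e-3,
        horizon=1,
        discount=0.99
    )

Passing the environment argument to Agent.create(...) is what the "better specified implicitly via the environment argument" notes below refer to.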
Parameters:
  • states (specification) – States specification (required, but better specified implicitly via the environment argument of Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, but better specified implicitly via the environment argument of Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for the number of timesteps per episode (default: not given, but better specified implicitly via the environment argument of Agent.create(...)).
  • batch_size (parameter, int > 0) – Number of timesteps per update batch (required).
  • network ("auto" | specification) – Policy network configuration, see the networks documentation (default: “auto”, automatically configured network).
  • use_beta_distribution (bool) – Whether to use the Beta distribution for bounded continuous actions by default (default: false).
  • memory (int > 0) – Batch memory capacity, has to fit at least maximum batch_size + maximum network/estimator horizon + 1 timesteps (default: minimum capacity, usually does not need to be changed).
  • update_frequency (“never” | parameter, int > 0 | 0.0 < float <= 1.0) – Frequency of updates, relative to batch_size if float (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 1e-3).
  • horizon (“episode” | parameter, int >= 0) – Horizon of discounted-sum reward estimation before critic estimate (default: 1).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • return_processing (specification) – Return processing as layer or list of layers, see the preprocessing documentation (default: no return processing).
  • advantage_processing (specification) – Advantage processing as layer or list of layers, see the preprocessing documentation (default: no advantage processing).

  • predict_terminal_values (bool) – Whether to predict the value of terminal states, usually not required since max_episode_timesteps terminals are handled separately (default: false).
  • critic (specification) – Critic network configuration, see the networks documentation (default: “auto”); a combined configuration example is sketched after this parameter list.

  • critic_optimizer (float > 0.0 | specification) – Critic optimizer configuration, see the optimizers documentation, a float instead specifies a custom weight for the critic loss (default: 1.0).
  • state_preprocessing (dict[specification]) – State preprocessing as layer or list of layers, see the preprocessing documentation, specified per state-type or -name (default: linear normalization of bounded float states to [-2.0, 2.0]).
  • reward_preprocessing (specification) – Reward preprocessing as layer or list of layers, see the preprocessing documentation (default: no reward preprocessing).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, defined as the probability of uniformly random output in the case of bool and int actions, and as the standard deviation of Gaussian noise added to every output in the case of float actions, specified globally or per action-type or -name (default: no exploration).
  • variable_noise (parameter, float >= 0.0) – Add Gaussian noise with the given standard deviation to all trainable variables, as an alternative exploration mechanism (default: no variable noise).
  • l2_regularization (parameter, float >= 0.0) – L2 regularization loss weight (default: no L2 regularization).
  • entropy_regularization (parameter, float >= 0.0) – Entropy regularization loss weight, to discourage the policy distribution from becoming “too certain” (default: no entropy regularization).

  • For the arguments below, see the Tensorforce agent documentation.
  • parallel_interactions (int > 0) –
  • config (specification) –
  • saver (path | specification) –
  • summarizer (path | specification) –
  • tracking ("all" | iter[string]) –
  • recorder (path | specification) –
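As a fuller illustration of how the parameters above combine, the sketch below configures the policy and critic networks as explicit layer lists, weights the critic loss via a float critic_optimizer, and runs the standard act/observe interaction loop. All layer sizes, hyperparameter values, and the Gym level are assumptions for illustration only. Note that with batch_size=10 and the default update_frequency=1.0, an update is performed every 10 timesteps; update_frequency=0.5 would update every 5 timesteps.

    from tensorforce import Agent, Environment

    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    agent = Agent.create(
        agent='a2c',
        environment=environment,
        batch_size=10,
        update_frequency=1.0,  # one update per batch_size timesteps
        learning_rate=1e-3,
        # Policy and critic networks as explicit layer lists (sizes are illustrative)
        network=[dict(type='dense', size=32), dict(type='dense', size=32)],
        critic=[dict(type='dense', size=32)],
        critic_optimizer=2.0,  # float: custom weight for the critic loss (see critic_optimizer above)
        exploration=0.01,  # random-action probability (bool/int) or Gaussian noise std-dev (float)
        entropy_regularization=0.01
    )

    # Standard act/observe training loop
    for episode in range(100):
        states = environment.reset()
        terminal = False
        while not terminal:
            actions = agent.act(states=states)
            states, terminal, reward = environment.execute(actions=actions)
            agent.observe(terminal=terminal, reward=reward)

    agent.close()
    environment.close()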