Vanilla Policy Gradient

class tensorforce.agents.VanillaPolicyGradient(states, actions, max_episode_timesteps, batch_size, network='auto', use_beta_distribution=False, memory='minimum', update_frequency=1.0, learning_rate=0.001, discount=0.99, reward_processing=None, return_processing=None, advantage_processing=None, predict_terminal_values=False, baseline=None, baseline_optimizer=None, state_preprocessing='linear_normalization', exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, parallel_interactions=1, config=None, saver=None, summarizer=None, tracking=None, recorder=None, **kwargs)

Vanilla Policy Gradient agent, also known as REINFORCE (specification key: vpg or reinforce).
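
A minimal usage sketch, assuming a Gym environment is available via Environment.create; the environment level and hyperparameter values below are illustrative, not recommendations:

    from tensorforce import Agent, Environment

    # Illustrative environment; any Tensorforce-compatible environment works.
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    # Create the agent via its specification key ('vpg' or 'reinforce');
    # states, actions and max_episode_timesteps are taken from the environment.
    agent = Agent.create(
        agent='vpg',
        environment=environment,
        batch_size=10,       # illustrative: update after every 10 episodes
        learning_rate=1e-3
    )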

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes; an explicit-specification sketch follows this parameter list:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • batch_size (parameter, int > 0) – Number of episodes per update batch (required).
  • network ("auto" | specification) – Policy network configuration, see the networks documentation (default: “auto”, automatically configured network).
  • use_beta_distribution (bool) – Whether to use the Beta distribution for bounded continuous actions by default (default: false).
  • memory (int > 0) – Batch memory capacity; has to fit at least maximum batch_size + 1 episodes (default: minimum capacity, usually does not need to be changed).
  • update_frequency (“never” | parameter, int > 0 | 0.0 < float <= 1.0) – Frequency of updates, relative to batch_size if float (default: batch_size).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 1e-3).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • return_processing (specification) – Return processing as layer or list of layers, see the preprocessing documentation (default: no return processing).
  • advantage_processing (specification) – Advantage processing as layer or list of layers, see the preprocessing documentation (default: no advantage processing).
  • predict_terminal_values (bool) – Whether to predict the value of terminal states, usually not required since max_episode_timesteps terminals are handled separately (default: false).
  • reward_processing (specification) – Reward preprocessing as layer or list of layers, see the preprocessing documentation (default: no reward processing).
  • baseline (specification) – Baseline network configuration, see the networks documentation, main policy will be used as baseline if none (default: none).
  • baseline_optimizer (float > 0.0 | specification) – Baseline optimizer configuration, see the optimizers documentation, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).
  • l2_regularization (parameter, float >= 0.0) – L2 regularization loss weight (default: no L2 regularization).
  • entropy_regularization (parameter, float >= 0.0) – Entropy regularization loss weight, to discourage the policy distribution from being “too certain” (default: no entropy regularization).
  • state_preprocessing (dict[specification]) – State preprocessing as layer or list of layers, see the preprocessing documentation, specified per state-type or -name (default: linear normalization of bounded float states to [-2.0, 2.0]).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions, specified globally or per action-type or -name (default: no exploration).
  • variable_noise (parameter, float >= 0.0) – Add Gaussian noise with given standard deviation to all trainable variables, as alternative exploration mechanism (default: no variable noise).

  • >>> – For arguments below, see the Tensorforce agent documentation.
  • parallel_interactions (int > 0) –
  • config (specification) –
  • saver (path | specification) –
  • summarizer (path | specification) –
  • tracking ("all" | iter[string]) –
  • recorder (path | specification) –
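
For explicit states/actions specifications (rather than the environment argument), a rough configuration sketch follows; the shapes, network sizes, baseline settings, and regularization values are placeholders chosen for illustration only:

    from tensorforce import Agent

    agent = Agent.create(
        agent='vpg',
        # Explicit specifications instead of the environment argument:
        states=dict(type='float', shape=(8,), min_value=-1.0, max_value=1.0),
        actions=dict(type='int', num_values=4),
        max_episode_timesteps=500,
        batch_size=10,
        network='auto',
        # Optional separate baseline network and optimizer (placeholder values):
        baseline=dict(type='auto', size=32, depth=1),
        baseline_optimizer=dict(type='adam', learning_rate=1e-3),
        # Illustrative exploration and regularization settings:
        exploration=0.1,
        entropy_regularization=0.01
    )

The agent is then used like any other Tensorforce agent, i.e. via agent.act(states=...) followed by agent.observe(terminal=..., reward=...) after each environment step.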