Dueling DQN

class tensorforce.agents.DuelingDQN(states, actions, memory, batch_size, max_episode_timesteps=None, network='auto', update_frequency='batch_size', start_updating=None, learning_rate=0.001, huber_loss=None, horizon=1, discount=0.99, predict_terminal_values=False, target_update_weight=1.0, target_sync_frequency=1, state_preprocessing='linear_normalization', reward_preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, parallel_interactions=1, config=None, saver=None, summarizer=None, recorder=None, estimate_terminal=None, **kwargs)

Dueling DQN agent (specification key: dueling_dqn).
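
A minimal usage sketch (the Gym level and hyperparameter values below are illustrative choices, not defaults): states, actions and max_episode_timesteps are taken implicitly from the environment, while memory and batch_size must be given.

    from tensorforce import Agent, Environment

    # The environment implicitly supplies the states/actions specifications
    # and max_episode_timesteps ('CartPole-v1' is just an example level).
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    # Create the agent via its specification key; memory and batch_size are required.
    agent = Agent.create(
        agent='dueling_dqn',
        environment=environment,
        memory=10000,
        batch_size=32,
        exploration=0.1,
    )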

Parameters:
  • states (specification) – States specification (required, but better specified implicitly via the environment argument of Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states(); an explicit example is sketched after this parameter list) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, but better specified implicitly via the environment argument of Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for the number of timesteps per episode (default: not given, but better specified implicitly via the environment argument of Agent.create(...)).
  • memory (int > 0) – Replay memory capacity; must hold at least maximum batch_size + maximum network/estimator horizon + 1 timesteps (required).
  • batch_size (parameter, int > 0) – Number of timesteps per update batch (required).
  • network ("auto" | specification) – Policy network configuration, see the networks documentation (default: “auto”, automatically configured network).
  • update_frequency (“never” | parameter, int > 0) – Frequency of updates (default: batch_size).
  • start_updating (parameter, int >= batch_size) – Number of timesteps before first update (default: none).
  • learning_rate (parameter, float > 0.0) – Optimizer learning rate (default: 1e-3).
  • huber_loss (parameter, float > 0.0) – Huber loss threshold (default: no Huber loss).
  • horizon (parameter, int >= 1) – n-step DQN, horizon of discounted-sum reward estimation before target network estimate (default: 1).
  • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 0.99).
  • predict_terminal_values (bool) – Whether to predict the value of terminal states (default: false).
  • target_update_weight (parameter, 0.0 < float <= 1.0) – Target network update weight (default: 1.0).
  • target_sync_frequency (parameter, int >= 1) – Interval between target network updates (default: every update).
  • l2_regularization (parameter, float >= 0.0) – L2 regularization loss weight (default: no L2 regularization).
  • entropy_regularization (parameter, float >= 0.0) – Entropy regularization loss weight, to discourage the policy distribution from being “too certain” (default: no entropy regularization).
  • state_preprocessing (dict[specification]) – State preprocessing as layer or list of layers, see the preprocessing documentation, specified per state-type or -name (default: linear normalization of bounded float states to [-2.0, 2.0]).
  • reward_preprocessing (specification) – Reward preprocessing as layer or list of layers, see the preprocessing documentation (default: no reward preprocessing).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions, specified globally or per action-type or -name (default: no exploration).
  • variable_noise (parameter, float >= 0.0) – Add Gaussian noise with given standard deviation to all trainable variables, as alternative exploration mechanism (default: no variable noise).
  • others – See the Tensorforce agent documentation.
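
For the explicit form of the states/actions specifications referenced above, a sketch with hand-written specifications and a few of the optional parameters (all shapes, bounds and values are illustrative only, not defaults):

    from tensorforce import Agent

    # Explicit specifications, normally taken from Environment.states()
    # and Environment.actions(); shapes and bounds here are made up.
    states = dict(type='float', shape=(8,), min_value=-1.0, max_value=1.0)
    actions = dict(type='int', num_values=4)

    agent = Agent.create(
        agent='dueling_dqn',
        states=states,
        actions=actions,
        max_episode_timesteps=200,
        memory=50000,               # must hold at least batch_size + horizon + 1 timesteps
        batch_size=64,
        update_frequency=4,
        start_updating=1000,
        learning_rate=2.5e-4,
        huber_loss=1.0,
        horizon=3,                  # 3-step return before the target network estimate
        discount=0.99,
        target_sync_frequency=100,  # sync the target network every 100 updates
        exploration=0.1,            # probability of a uniformly random int action
    )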