Tensorforce Agent

class tensorforce.agents.TensorforceAgent(states, actions, update, optimizer, objective, reward_estimation, max_episode_timesteps=None, policy='auto', memory=None, baseline=None, baseline_optimizer=None, baseline_objective=None, l2_regularization=0.0, entropy_regularization=0.0, state_preprocessing='linear_normalization', reward_preprocessing=None, exploration=0.0, variable_noise=0.0, parallel_interactions=1, config=None, saver=None, summarizer=None, recorder=None, baseline_policy=None, name=None, buffer_observe=None, device=None, seed=None)

Tensorforce agent (specification key: tensorforce).

Highly configurable agent and basis for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.

  • states (specification) – States specification (required; better specified implicitly via the environment argument of Agent.create()), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required; better specified implicitly via the environment argument of Agent.create()), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
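For illustration, such state and action specifications can be written as plain dictionaries. The shapes, bounds and value counts below are made up; in practice these dictionaries are usually taken from Environment.states() and Environment.actions() rather than written by hand:

```python
# Illustrative specification dictionaries for a hypothetical environment
# with one bounded float state and one discrete action.
states = dict(
    type='float',    # state data type (default: 'float')
    shape=(4,),      # required: state shape
    min_value=-1.0,  # optional bounds for 'float' states
    max_value=1.0,
)
actions = dict(
    type='int',      # required: action data type
    shape=(),        # scalar action (the default)
    num_values=3,    # required for type 'int'
)
```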
  • max_episode_timesteps (int > 0) – Upper bound for the number of timesteps per episode (default: not given; better specified implicitly via the environment argument of Agent.create()).
  • policy (specification) – Policy configuration, see networks and policies documentation (default: action distributions or value functions parametrized by an automatically configured network).
  • memory (int | specification) – Replay memory capacity, or memory configuration, see the memories documentation (default: minimum capacity recent memory).
  • update (int | specification) – Model update configuration with the following attributes (required, default: timesteps batch size):
    • unit ("timesteps" | "episodes") – unit for update attributes (required).
    • batch_size (parameter, int > 0) – size of update batch in number of units (required).
    • frequency ("never" | parameter, int > 0) – frequency of updates (default: batch_size).
    • start (parameter, int >= batch_size) – number of units before first update (default: none).
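A sketch of an update configuration following the attributes above: one update per 64 timesteps, delayed until 256 timesteps have been observed. The batch size, frequency and start values are purely illustrative:

```python
# Illustrative update configuration (all values are examples).
update = dict(
    unit='timesteps',  # required: unit for the attributes below
    batch_size=64,     # required: update batch size in number of units
    frequency=64,      # defaults to batch_size if omitted
    start=256,         # number of units observed before the first update
)
```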
  • optimizer (specification) – Optimizer configuration, see the optimizers documentation (default: Adam optimizer).
  • objective (specification) – Optimization objective configuration, see the objectives documentation (required).
  • reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
    • horizon ("episode" | parameter, int >= 1) – Horizon of discounted-sum return estimation (required).
    • discount (parameter, 0.0 <= float <= 1.0) – Discount factor of future rewards for discounted-sum return estimation (default: 1.0).
    • estimate_advantage (bool) – Whether to use an estimate of the advantage (return minus baseline value prediction) instead of the return as learning signal (default: false, unless baseline_policy is specified but baseline_objective/optimizer are not).
    • predict_horizon_values (false | "early" | "late") – Whether to include a baseline prediction of the horizon value as part of the return estimation, and if so, whether to compute the horizon value prediction "early" when experiences are stored to memory, or "late" when batches of experience are retrieved for the update (default: "late" if baseline_policy or baseline_objective are specified, else false).
    • predict_action_values (bool) – Whether to predict state-action- instead of state-values as horizon values and for advantage estimation (default: false).
    • predict_terminal_values (bool) – Whether to predict the value of terminal states (default: false).
    • return_processing (specification) – Return processing as layer or list of layers, see the preprocessing documentation (default: no return processing).
    • advantage_processing (specification) – Advantage processing as layer or list of layers, see the preprocessing documentation (default: no advantage processing).
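A sketch of a reward-estimation configuration with a 10-step horizon, discounting, and advantage estimation. All values are illustrative, and the clipping-layer arguments in return_processing are an assumption about the preprocessing-layer interface:

```python
# Illustrative reward-estimation configuration.
reward_estimation = dict(
    horizon=10,                     # required: return-estimation horizon
    discount=0.99,                  # discount factor (default: 1.0)
    estimate_advantage=True,        # learn from return minus baseline value
    predict_horizon_values='late',  # predict horizon values at update time
    # Assumed clipping-layer arguments; see the preprocessing documentation.
    return_processing=dict(type='clipping', lower=-10.0, upper=10.0),
)
```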
  • baseline (specification) – Baseline configuration, the policy will be used as baseline if none, see the networks and potentially the policies documentation (default: none).
  • baseline_optimizer (specification | parameter, float > 0.0) – Baseline optimizer configuration, see the optimizers documentation; the main optimizer will be used for the baseline if none, while a float implies none and specifies a custom weight for the baseline loss (default: none).
  • baseline_objective (specification) – Baseline optimization objective configuration, see the objectives documentation; required if a baseline optimizer is specified, and the main objective will be used for the baseline if neither baseline objective nor optimizer are specified (default: none).
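A sketch of a separate baseline with its own optimizer and objective. The network size and learning rate are illustrative, and the exact 'auto' network and 'state_value' objective keys are assumptions about the networks/objectives specification keys documented elsewhere:

```python
# Illustrative baseline configuration: a small automatically configured
# network, trained with its own Adam optimizer on a state-value objective.
baseline = dict(type='auto', size=32, depth=1)
baseline_optimizer = dict(type='adam', learning_rate=1e-3)
baseline_objective = dict(type='state_value')

# Alternatively, a plain float for baseline_optimizer reuses the main
# optimizer and merely weights the baseline loss:
baseline_optimizer_as_weight = 1.0
```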

  • l2_regularization (parameter, float >= 0.0) – L2 regularization loss weight (default: no L2 regularization).
  • entropy_regularization (parameter, float >= 0.0) – Entropy regularization loss weight, to discourage the policy distribution from being “too certain” (default: no entropy regularization).
  • state_preprocessing (dict[specification]) – State preprocessing as layer or list of layers, see the preprocessing documentation, specified per state-type or -name (default: linear normalization of bounded float states to [-2.0, 2.0]).
  • reward_preprocessing (specification) – Reward preprocessing as layer or list of layers, see the preprocessing documentation (default: no reward preprocessing).

  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions, specified globally or per action-type or -name (default: no exploration).
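Exploration can thus be given per action name or globally. Below is a sketch with made-up action names, plus a globally decaying exploration expressed as a parameter specification; the exponential-decay attribute names are assumptions about the parameter-specification interface:

```python
# Illustrative per-action exploration (action names are made up):
exploration = dict(
    gripper=0.1,  # int action: probability of a uniformly random action
    torque=0.1,   # float action: stddev of additive Gaussian noise
)

# Illustrative global exploration, decayed over time via an assumed
# exponential-decay parameter specification:
exploration_global = dict(
    type='exponential', unit='timesteps', num_steps=10000,
    initial_value=0.5, decay_rate=0.5,
)
```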
  • variable_noise (parameter, float >= 0.0) – Add Gaussian noise with given standard deviation to all trainable variables, as alternative exploration mechanism (default: no variable noise).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or agents within an environment (default: 1).
  • config (specification) – Additional configuration options:
    • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: "agent").
    • device (string) – Device name (default: TensorFlow default).
    • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow; the environment seed may need to be set separately for fully deterministic execution (default: none).
    • buffer_observe (false | "episode" | int > 0) – Number of timesteps within an episode to buffer before calling the internal observe function, to reduce calls to TensorFlow for improved performance (default: configuration-specific maximum number which can be buffered without affecting performance).
    • enable_int_action_masking (bool) – Whether int action options can be masked via an optional "[ACTION-NAME]_mask" state input (default: true).
    • create_tf_assertions (bool) – Whether to create internal TensorFlow assertion operations (default: true).
    • eager_mode (bool) – Whether to run functions eagerly instead of running as a traced graph function, can be helpful for debugging (default: false).
    • tf_log_level (int >= 0) – TensorFlow log level, additional C++ logging messages can be enabled by setting os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"/"2" before importing Tensorforce/TensorFlow (default: 40, only error and critical).
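A sketch of a config dictionary combining the options above for a reproducible debugging run; all values are illustrative:

```python
# Illustrative additional-configuration options.
config = dict(
    name='demo-agent',   # TensorFlow scope and default checkpoint filename
    seed=42,             # sets Python, NumPy and TensorFlow seeds
    buffer_observe=100,  # buffer 100 timesteps before the internal observe
    eager_mode=True,     # run eagerly for easier debugging (slower)
    tf_log_level=40,     # errors and critical messages only
)
```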
  • saver (path | specification) – TensorFlow checkpoints directory, or checkpoint manager configuration with the following attributes, for periodic implicit saving as alternative to explicit saving via agent.save() (default: no saver):
    • directory (path) – checkpoint directory (required).
    • filename (string) – checkpoint filename (default: agent name).
    • frequency (int > 0) – how frequently to save a checkpoint (required).
    • unit ("timesteps" | "episodes" | "updates") – frequency unit (default: updates).
    • max_checkpoints (int > 0) – maximum number of checkpoints to keep (default: 10).
    • max_hour_frequency (int > 0) – independent of max_checkpoints, always keep a checkpoint at the given hour frequency (default: none).
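A sketch of periodic implicit checkpointing: one checkpoint every 10 updates, keeping the 5 most recent. Directory, filename and frequency are illustrative:

```python
# Illustrative checkpoint-manager configuration for implicit saving.
saver = dict(
    directory='checkpoints',  # required: checkpoint directory
    filename='demo-agent',    # defaults to the agent name
    frequency=10,             # required: save every 10 units
    unit='updates',           # frequency unit (the default)
    max_checkpoints=5,        # keep the 5 most recent checkpoints
)
```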
  • summarizer (path | specification) – TensorBoard summaries directory, or summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • filename (path) – summarizer filename; max_summaries does not apply if a filename is specified (default: "summary-%Y%m%d-%H%M%S").
    • max_summaries (int > 0) – maximum number of (generically-named) summaries to keep (default: 7, the number of distinct colors in TensorBoard).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • summaries ("all" | iter[string]) – which summaries to record, "all" implies all numerical summaries, so all summaries except "graph" (default: "all"):
      • "action-value": value of each action (timestep-based)
      • "distribution": distribution parameters like probabilities or mean and stddev (timestep-based)
      • "entropy": entropy of (per-action) policy distribution(s) (timestep-based)
      • "graph": computation graph
      • "kl-divergence": KL-divergence of previous and updated (per-action) policy distribution(s) (update-based)
      • "loss": policy and baseline loss plus loss components (update-based)
      • "parameters": parameter values (according to parameter unit)
      • "reward": timestep and episode reward, plus intermediate reward/return estimates (timestep/episode/update-based)
      • "update-norm": global norm of update (update-based)
      • "updates": mean and variance of update tensors per variable (update-based)
      • "variables": mean of trainable variables tensors (update-based)
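A sketch of a summarizer configuration recording only the reward and loss summary labels listed above; the directory and flush interval are illustrative:

```python
# Illustrative summarizer configuration for TensorBoard.
summarizer = dict(
    directory='summaries',           # required: summaries directory
    summaries=['reward', 'loss'],    # record only these summary labels
    flush=10,                        # flush the writer every 10 seconds
    max_summaries=5,                 # keep the 5 most recent summaries
)
```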
  • recorder (path | specification) – Traces recordings directory, or recorder configuration with the following attributes (see record-and-pretrain script for example application) (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
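Putting the pieces together, an end-to-end specification might look as follows. Every value is illustrative, and the 'auto' network, 'adam' optimizer and 'policy_gradient' objective keys are assumptions about the specification keys documented in the respective modules; in practice this dictionary would be passed as keyword arguments to Agent.create(agent='tensorforce', environment=environment, **agent_spec):

```python
# Illustrative end-to-end Tensorforce agent specification (values and
# specification keys are examples, not a recommended configuration).
agent_spec = dict(
    policy=dict(network='auto'),                     # auto-configured network
    memory=10000,                                    # replay-memory capacity
    update=dict(unit='timesteps', batch_size=64),    # periodic batch updates
    optimizer=dict(type='adam', learning_rate=1e-3),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20, discount=0.99),
    exploration=0.1,                                 # global exploration
    recorder=dict(directory='traces', frequency=10, start=5),
)
```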