Tensorforce Agent

class tensorforce.agents.TensorforceAgent(states, actions, update, objective, reward_estimation, max_episode_timesteps=None, policy='default', memory=None, optimizer='adam', baseline_policy=None, baseline_optimizer=None, baseline_objective=None, preprocessing=None, exploration=0.0, variable_noise=0.0, l2_regularization=0.0, entropy_regularization=0.0, name='agent', device=None, parallel_interactions=1, buffer_observe=True, seed=None, execution=None, saver=None, summarizer=None, recorder=None, config=None)[source]

Tensorforce agent (specification key: tensorforce).

Highly configurable agent and basis for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.
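For illustration, a minimal agent of this kind can be created via Agent.create(...) with the specification key "tensorforce" plus the required update, objective and reward_estimation arguments; the environment choice, hyperparameter values and objective below are illustrative assumptions rather than recommended settings, and states/actions are inferred from the environment argument:

    from tensorforce import Agent, Environment

    # Any environment works; the Gym CartPole level is only an illustrative choice.
    environment = Environment.create(
        environment='gym', level='CartPole-v1', max_episode_timesteps=500
    )

    agent = Agent.create(
        agent='tensorforce',
        environment=environment,  # states/actions/max_episode_timesteps inferred
        update=dict(unit='timesteps', batch_size=64),
        objective='policy_gradient',
        reward_estimation=dict(horizon=20)
    )

All remaining arguments fall back to the defaults documented below.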

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create(...)), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create(...)).
  • policy (specification) – Policy configuration, see policies (default: “default”, action distributions parametrized by an automatically configured network).
  • memory (int | specification) – Memory configuration, see memories (default: replay memory with given or inferred capacity).
  • update (int | specification) – Model update configuration with the following attributes (required, a single int implies unit timesteps with that batch size; see the configuration sketch after this parameter list):
    • unit ("timesteps" | "episodes") – unit for update attributes (required).
    • batch_size (parameter, long > 0) – size of update batch in number of units (required).
    • frequency ("never" | parameter, long > 0) – frequency of updates (default: batch_size).
    • start (parameter, long >= batch_size) – number of units before first update (default: 0).
  • optimizer (specification) – Optimizer configuration, see optimizers (default: Adam optimizer).
  • objective (specification) – Optimization objective configuration, see objectives (required).
  • reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
    • horizon ("episode" | parameter, long >= 0) – Horizon of discounted-sum reward estimation (required).
    • discount (parameter, 0.0 <= float <= 1.0) – Discount factor for future rewards of discounted-sum reward estimation (default: 1.0).
    • estimate_horizon (false | "early" | "late") – Whether to estimate the value of horizon states, and if so, whether to estimate early when experience is stored, or late when it is retrieved (default: "late" if any of the baseline_* arguments is specified, else false).
    • estimate_actions (bool) – Whether to estimate state-action values instead of state values (default: false).
    • estimate_terminal (bool) – Whether to estimate the value of (real) terminal states (default: false).
    • estimate_advantage (bool) – Whether to estimate the advantage by subtracting the current estimate (default: false).
  • baseline_policy (specification) – Baseline policy configuration, main policy will be used as baseline if none (default: none).
  • baseline_optimizer (float > 0.0 | specification) – Baseline optimizer configuration, see optimizers, main optimizer will be used for baseline if none, a float implies none and specifies a custom weight for the baseline loss (default: none).
  • baseline_objective (specification) – Baseline optimization objective configuration, see objectives, main objective will be used for baseline if none (default: none).
  • preprocessing (dict[specification]) – Preprocessing as layer or list of layers, see preprocessing, specified per state-type or -name and for reward (default: none).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, global or per action, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions (default: 0.0).
  • variable_noise (parameter, float >= 0.0) – Standard deviation of Gaussian noise added to all trainable float variables (default: 0.0).
  • l2_regularization (parameter, float >= 0.0) – Scalar controlling L2 regularization (default: 0.0).
  • entropy_regularization (parameter, float >= 0.0) – Scalar controlling entropy regularization, to discourage the policy distribution being too “certain” / spiked (default: 0.0).
  • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: “agent”).
  • device (string) – Device name (default: TensorFlow default).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or (centrally controlled) agents within an environment (default: 1).
  • buffer_observe (bool | int > 0) – Maximum number of timesteps within an episode to buffer before executing internal observe operations, to reduce calls to TensorFlow for improved performance (default: max_episode_timesteps or 1000, unless summarizer specified).
  • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed has to be set separately for a fully deterministic execution (default: none).
  • execution (specification) – TensorFlow execution configuration with the following attributes (default: standard): …
  • saver (specification) – TensorFlow saver configuration with the following attributes (default: no saver):
    • directory (path) – saver directory (required).
    • filename (string) – model filename (default: agent name).
    • frequency (int > 0) – how frequently in seconds to save the model (default: 600 seconds).
    • load (bool | str) – whether to load the existing model, or which model filename to load (default: true).
    • max-checkpoints (int > 0) – maximum number of checkpoints to keep (default: 5).
  • summarizer (specification) – TensorBoard summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • frequency (int > 0 | dict[int > 0]) – how frequently in timesteps to record act-summaries if specified globally (default: always); otherwise, specified per group: for act-summaries via "act" in timesteps, for observe/experience-summaries via "observe"/"experience" in episodes, and for update/variables-summaries via "update"/"variables" in updates (default: never).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • max-summaries (int > 0) – maximum number of summaries to keep (default: 5).
    • labels ("all" | iter[string]) – "all" for all labels excluding the "*-histogram" ones, or a list of summaries to record from the following labels (default: only "graph"):
    • "distributions" or "bernoulli", "categorical", "gaussian", "beta": distribution-specific parameters
    • "dropout": dropout zero fraction
    • "entropies" or "entropy", "action-entropies": entropy of policy distribution(s)
    • "graph": graph summary
    • "kl-divergences" or "kl-divergence", "action-kl-divergences": KL-divergence of previous and updated polidcy distribution(s)
    • "losses" or "loss", "objective-loss", "regularization-loss", "baseline-loss", "baseline-objective-loss", "baseline-regularization-loss": loss scalars
    • "parameters": parameter scalars
    • "relu": ReLU activation zero fraction
    • "rewards" or "timestep-reward", "episode-reward", "raw-reward", "empirical-reward", "estimated-reward": reward scalar
    • "update-norm": update norm
    • "updates": update mean and variance scalars
    • "updates-histogram": update histograms
    • "variables": variable mean and variance scalars
    • "variables-histogram": variable histograms
  • recorder (specification) – Experience traces recorder configuration with the following attributes (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
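As a sketch of how several of the above arguments combine, the following explicit specification creates an agent with a separate baseline, exploration, a saver and a summarizer. All values are illustrative assumptions rather than recommended settings, and the explicit states/actions dictionaries stand in for what would normally be inferred from an environment object:

    from tensorforce import Agent

    agent = Agent.create(
        agent='tensorforce',
        # Explicit state/action specifications instead of an environment object
        states=dict(type='float', shape=(8,)),
        actions=dict(type='int', num_values=4),
        max_episode_timesteps=500,
        policy=dict(network='auto'),
        memory=20000,  # replay capacity, assumed sufficient for the update settings
        update=dict(unit='episodes', batch_size=10),
        optimizer=dict(optimizer='adam', learning_rate=1e-3),
        objective='policy_gradient',
        reward_estimation=dict(
            horizon='episode', discount=0.99, estimate_advantage=True
        ),
        # Separate baseline/critic used for reward estimation
        baseline_policy=dict(network='auto'),
        baseline_optimizer=dict(optimizer='adam', learning_rate=1e-3),
        baseline_objective=dict(type='value', value='state'),
        exploration=0.1,
        # Checkpointing every 600 seconds and TensorBoard summaries
        saver=dict(directory='model-checkpoints', frequency=600),
        summarizer=dict(directory='summaries', labels=['losses', 'rewards'])
    )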