Tensorforce Agent

class tensorforce.agents.TensorforceAgent(states, actions, update, optimizer, objective, reward_estimation, max_episode_timesteps=None, policy='auto', memory=None, baseline=None, baseline_optimizer=None, baseline_objective=None, l2_regularization=0.0, entropy_regularization=0.0, state_preprocessing='linear_normalization', exploration=0.0, variable_noise=0.0, parallel_interactions=1, config=None, saver=None, summarizer=None, tracking=None, recorder=None, **kwargs)

Tensorforce agent (specification key: tensorforce).

Highly configurable agent and basis for a broad class of deep reinforcement learning agents, which act according to a policy parametrized by a neural network, leverage a memory module for periodic updates based on batches of experience, and optionally employ a baseline/critic/target policy for improved reward estimation.
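For illustration (not part of the parameter reference), a minimal creation sketch assuming a Gym CartPole-v1 environment; the update, optimizer, objective and reward_estimation values are arbitrary starting points rather than recommendations:

```python
from tensorforce import Agent, Environment

# Illustrative environment; any Environment.create() specification works the same way.
environment = Environment.create(
    environment='gym', level='CartPole-v1', max_episode_timesteps=500
)

agent = Agent.create(
    agent='tensorforce',
    environment=environment,  # states, actions, max_episode_timesteps taken from the environment
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=1e-3),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20)
)
```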

Parameters:
  • states (specification) – States specification (required, better implicitly specified via environment argument for Agent.create()), arbitrarily nested dictionary of state descriptions (usually taken from Environment.states()) with the following attributes:
    • type ("bool" | "int" | "float") – state data type (default: "float").
    • shape (int | iter[int]) – state shape (required).
    • num_values (int > 0) – number of discrete state values (required for type "int").
    • min_value/max_value (float) – minimum/maximum state value (optional for type "float").
  • actions (specification) – Actions specification (required, better implicitly specified via environment argument for Agent.create()), arbitrarily nested dictionary of action descriptions (usually taken from Environment.actions()) with the following attributes:
    • type ("bool" | "int" | "float") – action data type (required).
    • shape (int > 0 | iter[int > 0]) – action shape (default: scalar).
    • num_values (int > 0) – number of discrete action values (required for type "int").
    • min_value/max_value (float) – minimum/maximum action value (optional for type "float").
  • max_episode_timesteps (int > 0) – Upper bound for the number of timesteps per episode (default: not given, better implicitly specified via environment argument for Agent.create()).
  • policy (specification) – Policy configuration, see networks and policies documentation (default: action distributions or value functions parametrized by an automatically configured network).
  • memory (int | specification) – Replay memory capacity, or memory configuration, see the memories documentation (default: minimum capacity recent memory).
  • update (int | specification) – Model update configuration with the following attributes (required, default: timesteps batch size):
    • unit ("timesteps" | "episodes") – unit for update attributes (required).
    • batch_size (parameter, int > 0) – size of update batch in number of units (required).
    • frequency ("never" | parameter, int > 0 | 0.0 < float <= 1.0) – frequency of updates, relative to batch_size if float (default: batch_size).
    • start (parameter, int >= batch_size) – number of units before first update (default: none).
  • optimizer (specification) – Optimizer configuration, see the optimizers documentation (default: Adam optimizer).
  • objective (specification) – Optimization objective configuration, see the objectives documentation (required).
  • reward_estimation (specification) – Reward estimation configuration with the following attributes (required):
    • horizon ("episode" | parameter, int >= 1) – Horizon of discounted-sum return estimation (required).
    • discount (parameter, 0.0 <= float <= 1.0) – Discount factor of future rewards for discounted-sum return estimation (default: 1.0).
    • predict_horizon_values (false | "early" | "late") – Whether to include a baseline prediction of the horizon value as part of the return estimation, and if so, whether to compute the horizon value prediction "early" when experiences are stored to memory, or "late" when batches of experience are retrieved for the update (default: "late" if baseline or baseline_objective are specified, else false).
    • estimate_advantage (false | "early" | "late") – Whether to use an estimate of the advantage (return minus baseline value prediction) instead of the return as learning signal, and if so, whether to do so "late", i.e. after the baseline update, or "early", i.e. before it (default: false, unless baseline is specified but baseline_objective/baseline_optimizer are not).
    • predict_action_values (bool) – Whether to predict state-action- instead of state-values as horizon values and for advantage estimation (default: false).
    • reward_processing (specification) – Reward processing as layer or list of layers, see the preprocessing documentation (default: no reward processing).
    • return_processing (specification) – Return processing as layer or list of layers, see the preprocessing documentation (default: no return processing).
    • advantage_processing (specification) – Advantage processing as layer or list of layers, see the preprocessing documentation (default: no advantage processing).
    • predict_terminal_values (bool) – Whether to predict the value of terminal states, usually not required since max_episode_timesteps terminals are handled separately (default: false).
  • baseline (specification) – Baseline configuration; the main policy is used as baseline if none, see the networks and potentially the policies documentation (default: none). A configuration sketch including a separate baseline follows this parameter list.
  • baseline_optimizer (specification | parameter, float > 0.0) – Baseline optimizer configuration, see the optimizers documentation; the main optimizer is used for the baseline if none, while a float implies none and specifies a custom weight for the baseline loss (default: none).
  • baseline_objective (specification) – Baseline optimization objective configuration, see the objectives documentation; required if a baseline optimizer is specified, and the main objective is used for the baseline if neither baseline objective nor baseline optimizer are specified (default: none).
  • l2_regularization (parameter, float >= 0.0) – L2 regularization loss weight (default: no L2 regularization).
  • entropy_regularization (parameter, float >= 0.0) – Entropy regularization loss weight, to discourage the policy distribution from being “too certain” (default: no entropy regularization).
  • state_preprocessing (dict[specification]) – State preprocessing as layer or list of layers, see the preprocessing documentation, specified per state-type or -name (default: linear normalization of bounded float states to [-2.0, 2.0]).
  • exploration (parameter | dict[parameter], float >= 0.0) – Exploration, defined as the probability for uniformly random output in case of bool and int actions, and the standard deviation of Gaussian noise added to every output in case of float actions, specified globally or per action-type or -name (default: no exploration).
  • variable_noise (parameter, float >= 0.0) – Add Gaussian noise with given standard deviation to all trainable variables, as alternative exploration mechanism (default: no variable noise).
  • parallel_interactions (int > 0) – Maximum number of parallel interactions to support, for instance, to enable multiple parallel episodes, environments or agents within an environment (default: 1).
  • config (specification) – Additional configuration options:
    • name (string) – Agent name, used e.g. for TensorFlow scopes and saver default filename (default: "agent").
    • device (string) – Device name (default: CPU). Unlike (un)supervised deep learning, RL does not always benefit from running on a GPU, depending on the environment and agent configuration. In particular, for RL-typical environments with low-dimensional state spaces (i.e., no images), one usually gets better performance by running on CPU only. Consequently, Tensorforce is configured to run on CPU by default, which can be changed, for instance, by setting this value to 'GPU' instead.
    • seed (int) – Random seed to set for Python, NumPy (both set globally!) and TensorFlow, environment seed may have to be set separately for fully deterministic execution, generally not recommended since results in a fully deterministic setting are less meaningful/representative (default: none).
    • buffer_observe (false | "episode" | int > 0) – Number of timesteps within an episode to buffer before calling the internal observe function, to reduce calls to TensorFlow for improved performance (default: configuration-specific maximum number which can be buffered without affecting performance).
    • enable_int_action_masking (bool) – Whether int action options can be masked via an optional "[ACTION-NAME]_mask" state input (default: true).
    • create_tf_assertions (bool) – Whether to create internal TensorFlow assertion operations (default: true).
    • eager_mode (bool) – Whether to run functions eagerly instead of running as a traced graph function, can be helpful for debugging (default: false).
    • tf_log_level (int >= 0) – TensorFlow log level, additional C++ logging messages can be enabled by setting os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"/"2" before importing Tensorforce/TensorFlow (default: 40, only error and critical).
  • saver (path | specification) – TensorFlow checkpoints directory, or checkpoint manager configuration with the following attributes, for periodic implicit saving as an alternative to explicit saving via agent.save(); a combined saver/summarizer/recorder sketch follows this parameter list (default: no saver):
    • directory (path) – checkpoint directory (required).
    • filename (string) – checkpoint filename (default: agent name).
    • frequency (int > 0) – how frequently to save a checkpoint (required).
    • unit ("timesteps" | "episodes" | "updates") – frequency unit (default: updates).
    • max_checkpoints (int > 0) – maximum number of checkpoints to keep (default: 10).
    • max_hour_frequency (int > 0) – ignoring max-checkpoints, definitely keep a checkpoint in given hour frequency (default: none).
  • summarizer (path | specification) – TensorBoard summaries directory, or summarizer configuration with the following attributes (default: no summarizer):
    • directory (path) – summarizer directory (required).
    • filename (path) – summarizer filename; max_summaries does not apply if a filename is specified (default: "summary-%Y%m%d-%H%M%S").
    • max_summaries (int > 0) – maximum number of (generically-named) summaries to keep (default: 7, the number of different colors in TensorBoard).
    • flush (int > 0) – how frequently in seconds to flush the summary writer (default: 10).
    • summaries ("all" | iter[string]) – which summaries to record, "all" implies all numerical summaries, so all summaries except "graph" (default: "all"):
    • "action-value": value of each action (timestep-based)
    • "distribution": distribution parameters like probabilities or mean and stddev (timestep-based)
    • "entropy": entropy of (per-action) policy distribution(s) (timestep-based)
    • "graph": computation graph
    • "kl-divergence": KL-divergence of previous and updated (per-action) policy distribution(s) (update-based)
    • "loss": policy and baseline loss plus loss components (update-based)
    • "parameters": parameter values (according to parameter unit)
    • "reward": reward per timestep, episode length and reward, plus intermediate reward/return/advantage estimates and processed values (timestep/episode/update-based)
    • "update-norm": global norm of update (update-based)
    • "updates": mean and variance of update tensors per variable (update-based)
    • "variables": mean of trainable variables tensors (update-based)
  • tracking ("all" | iter[string]) – Which tensors to track, available values are a subset of the values of summarizer[summaries] above (default: no tracking). The current value of tracked tensors can be retrieved via tracked_tensors() at any time, however, note that tensor values change at different timescales (timesteps, episodes, updates).
  • recorder (path | specification) – Trace recordings directory, or recorder configuration with the following attributes (see the record-and-pretrain script for an example application) (default: no recorder):
    • directory (path) – recorder directory (required).
    • frequency (int > 0) – how frequently in episodes to record traces (default: every episode).
    • start (int >= 0) – how many episodes to skip before starting to record traces (default: 0).
    • max-traces (int > 0) – maximum number of traces to keep (default: all).
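
As referenced in the baseline parameter above, a more explicit configuration sketch with hand-written state/action specifications and a separate baseline; shapes, sizes and hyperparameter values, as well as the 'auto' network and 'value' objective chosen for the baseline, are illustrative assumptions rather than recommendations:

```python
from tensorforce import Agent

agent = Agent.create(
    agent='tensorforce',
    # Explicit specifications instead of an environment argument (placeholder shapes/values)
    states=dict(type='float', shape=(10,), min_value=-1.0, max_value=1.0),
    actions=dict(type='int', shape=(), num_values=5),
    max_episode_timesteps=500,
    policy=dict(network='auto'),
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20, discount=0.99, estimate_advantage='late'),
    # Separate baseline network, optimizer and state-value objective
    baseline=dict(type='auto', size=32, depth=1),
    baseline_optimizer=dict(type='adam', learning_rate=1e-3),
    baseline_objective=dict(type='value', value='state')
)
```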
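
Finally, a sketch of the housekeeping options (exploration, config, saver, summarizer, tracking, recorder), as referenced in the saver parameter above; the CartPole environment, directory names and frequencies are arbitrary placeholders:

```python
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1')

agent = Agent.create(
    agent='tensorforce',
    environment=environment,
    update=dict(unit='episodes', batch_size=10),
    optimizer=dict(type='adam', learning_rate=1e-3),
    objective='policy_gradient',
    reward_estimation=dict(horizon='episode', discount=0.99),
    exploration=0.01,  # random-action probability (bool/int) or Gaussian noise stddev (float)
    config=dict(name='cartpole-agent', seed=17),
    # Periodic implicit checkpointing as an alternative to explicit agent.save()
    saver=dict(directory='checkpoints', frequency=100, unit='updates', max_checkpoints=5),
    # TensorBoard summaries
    summarizer=dict(directory='summaries', summaries=['reward', 'loss', 'entropy']),
    # Track tensors for retrieval via agent.tracked_tensors()
    tracking='all',
    recorder=dict(directory='traces', frequency=10)  # traces usable e.g. for pretraining
)
```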