Agent and model overview¶
A reinforcement learning agent provides methods to process states and
return actions, to store past observations, and to load and save models.
Most agents employ a Model which implements the algorithms to
calculate the next action given the current state and to update model
parameters from past experiences.

Environment <-> Runner <-> Agent <-> Model

Parameters to the agent are passed as a Configuration object. The
configuration is passed on to the Model.
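The control flow above can be sketched in plain Python. The classes below are illustrative stand-ins, not the actual TensorForce classes; they only show how the runner mediates between environment, agent, and (implicitly) model.

```python
import random

# Hypothetical minimal stand-ins to illustrate the control flow; the real
# TensorForce classes have much richer interfaces.
class Environment:
    def reset(self):
        return 0.0                       # initial state

    def execute(self, action):
        state = random.random()          # next state
        reward = 1.0 if action == 1 else 0.0
        terminal = state > 0.9           # episode ends on this condition
        return state, reward, terminal

class Agent:
    def act(self, state):
        return random.randint(0, 1)      # the model would compute this

    def observe(self, reward, terminal):
        pass                             # the model would update from this

def run_episode(environment, agent, max_steps=100):
    """The runner's core loop: state in, action out, reward observed."""
    state = environment.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        state, reward, terminal = environment.execute(action)
        agent.observe(reward, terminal)
        total_reward += reward
        if terminal:
            break
    return total_reward
```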
Ready-to-use algorithms¶
We implemented some of the most common RL algorithms and try to keep these up-to-date. Here we provide an overview of all implemented agents and models.
Agent / General parameters¶
Agent is the base class for all reinforcement learning agents; every agent inherits from this class.
- class tensorforce.agents.Agent(config, model=None)¶
Basic reinforcement learning agent. An agent encapsulates the execution logic of a particular reinforcement learning algorithm and defines the external interface to the environment.
The agent hence acts as an intermediate layer between environment and backend execution (value function or policy updates).
Each agent requires the following configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The configuration is passed to the Model and should thus include its configuration parameters, too.
act(state, deterministic=False)¶
Return action(s) for the given state(s). First, the states are preprocessed using the given preprocessing configuration. Then, the states are passed to the model to calculate the desired action(s) to execute.
After obtaining the actions, exploration might be added by the agent, depending on the exploration configuration.
Parameters:
- state – One state (usually a value tuple) or dict of states if multiple states are expected.
- deterministic – If true, no exploration and sampling is applied.
Returns: Scalar value of the action or dict of multiple actions the agent wants to execute.
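The role of the deterministic flag can be illustrated with a simple epsilon-greedy scheme. This is a hypothetical stand-in for the agent's exploration configuration, not the actual TensorForce implementation:

```python
import random

def apply_exploration(action, num_actions, epsilon, deterministic=False):
    """Epsilon-greedy exploration sketch: with probability epsilon,
    replace the model's action by a uniformly random one. When
    deterministic is True, no exploration or sampling is applied."""
    if deterministic:
        return action
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return action
```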
observe(reward, terminal)¶
Observe experience from the environment to learn from. Optionally preprocesses rewards. Child classes should call super to obtain the processed reward, e.g. reward, terminal = super()...
Parameters:
- reward – scalar reward that resulted from executing the action.
- terminal – boolean indicating if the episode terminated after the observation.
Returns: the processed reward and terminal state.
reset()¶
Reset the agent after an episode. Increments the internal episode count and resets internal states and preprocessors.
Returns: void
Model¶
The Model class is the base class for reinforcement learning models.
- class tensorforce.models.Model(config)¶
Bases: object
Base model class.
Each model requires the following configuration parameters:
- discount: float of discount factor (gamma).
- learning_rate: float of learning rate (alpha).
- optimizer: string of optimizer to use (e.g. ‘adam’).
- device: string of tensorflow device name.
- tf_summary: string directory to write tensorflow summaries. Default None.
- tf_summary_level: int indicating which tensorflow summaries to create.
- tf_summary_interval: int number of calls to get_action until writing tensorflow summaries on update.
- log_level: string containing log level (e.g. ‘info’).
- distributed: boolean indicating whether to use distributed tensorflow.
- global_model: global model.
- session: session to use.
create_tf_operations(config)¶
Creates generic TensorFlow operations and placeholders required for models.
Parameters: config – Model configuration which must contain entries for states and actions. Returns:
load_model(path)¶
Import model from path using tf.train.Saver.
Parameters: path – Path to checkpoint Returns:
reset()¶
Resets the internal state to the initial state.
Returns: A list containing the internal_inits field.
save_model(path, use_global_step=True)¶
Export model using a tf.train.Saver. Optionally appends the current time step so that previous checkpoint files are not overwritten. Set use_global_step to False to be able to load the model from the exact path it was saved to when restarting the program.
Parameters:
- path – Model export directory
- use_global_step – Whether to append the current timestep to the checkpoint path.
Returns:
update(batch)¶
Generic batch update operation for Q-learning and policy gradient algorithms. Takes a batch of experiences and updates the model parameters accordingly.
Parameters: batch – Batch of experiences. Returns:
MemoryAgent¶
- class tensorforce.agents.MemoryAgent(config, model=None)¶
Bases: tensorforce.agents.agent.Agent
The MemoryAgent class implements a replay memory from which it samples batches to update the value function.
Each agent requires the following Configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The MemoryAgent class additionally requires the following parameters:
- batch_size: integer of the batch size.
- memory_capacity: integer of maximum experiences to store.
- memory: string indicating memory type (‘replay’ or ‘prioritized_replay’).
- update_frequency: integer indicating the number of steps between model updates.
- first_update: integer indicating the number of steps to pass before the first update.
- repeat_update: integer indicating how often to repeat the model update.
import_observations(observations)¶
Load an iterable of observation dicts into the replay memory.
Parameters:
- observations – An iterable with each element containing an observation. Each observation requires the keys 'state', 'action', 'reward', 'terminal', 'internal'. Use an empty list [] for 'internal' if internal state is irrelevant.
Returns:
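The replay memory underlying a MemoryAgent can be sketched as follows. This is a hypothetical simplification; the actual ‘replay’ and ‘prioritized_replay’ memories are more elaborate:

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory sketch: stores observation dicts up to
    memory_capacity (oldest entries drop out) and samples uniform
    random batches for updates."""

    def __init__(self, capacity):
        self.experiences = deque(maxlen=capacity)

    def add(self, state, action, reward, terminal, internal=()):
        self.experiences.append(dict(state=state, action=action,
                                     reward=reward, terminal=terminal,
                                     internal=list(internal)))

    def sample(self, batch_size):
        # Uniform sampling; prioritized replay would weight by TD error.
        return random.sample(list(self.experiences), batch_size)
```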
BatchAgent¶
- class tensorforce.agents.BatchAgent(config, model=None)¶
Bases: tensorforce.agents.agent.Agent
The BatchAgent class implements a batch memory which is cleared after every update.
Each agent requires the following Configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The BatchAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- keep_last: bool, optionally keep the last observation for use in the next batch.
observe(reward, terminal)¶
Adds an observation and performs an update if the necessary conditions are satisfied, i.e. if one batch of experience has been collected as defined by the batch size.
In particular, note that episode control happens outside of the agent, since the agent should be agnostic to how the training data is created.
Parameters:
- reward – float of a scalar reward
- terminal – boolean whether episode is terminated or not
Returns: void
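The control flow of a batch-based observe can be sketched like this. The helper class is hypothetical, purely to illustrate the collect-update-clear cycle and the keep_last option:

```python
class BatchCollector:
    """Collect experiences until batch_size is reached, then trigger
    one model update and clear the batch, optionally keeping the last
    observation for use in the next batch."""

    def __init__(self, batch_size, keep_last=False):
        self.batch_size = batch_size
        self.keep_last = keep_last
        self.batch = []
        self.updates = 0

    def observe(self, reward, terminal):
        self.batch.append((reward, terminal))
        if len(self.batch) >= self.batch_size:
            self.updates += 1            # the model update would run here
            self.batch = self.batch[-1:] if self.keep_last else []
```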
Deep-Q-Networks (DQN)¶
- class tensorforce.agents.DQNAgent(config, model=None)¶
Bases: tensorforce.agents.memory_agent.MemoryAgent
Deep Q-Network agent (DQN). The pièce de résistance of deep reinforcement learning as described by Mnih et al. (2015). Includes an option for double DQN (DDQN; van Hasselt et al., 2015).
DQN chooses from one of a number of discrete actions by taking the maximum Q-value from the value function with one output neuron per available action. DQN uses a replay memory for experience playback.
Configuration:
Each agent requires the following configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The MemoryAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- memory_capacity: integer of maximum experiences to store.
- memory: string indicating memory type (‘replay’ or ‘prioritized_replay’).
- update_frequency: integer indicating the number of steps between model updates.
- first_update: integer indicating the number of steps to pass before the first update.
- repeat_update: integer indicating how often to repeat the model update.

Each model requires the following configuration parameters:

- discount: float of discount factor (gamma).
- learning_rate: float of learning rate (alpha).
- optimizer: string of optimizer to use (e.g. ‘adam’).
- device: string of tensorflow device name.
- tf_summary: string directory to write tensorflow summaries. Default None.
- tf_summary_level: int indicating which tensorflow summaries to create.
- tf_summary_interval: int number of calls to get_action until writing tensorflow summaries on update.
- log_level: string containing log level (e.g. ‘info’).
- distributed: boolean indicating whether to use distributed tensorflow.
- global_model: global model.
- session: session to use.

The DQN agent expects the following additional configuration parameters:

- target_update_frequency: int of states between updates of the target network.
- update_target_weight: float of update target weight (tau parameter).
- double_dqn: boolean indicating whether to use double DQN.
- clip_loss: float; if not 0, uses the Huber loss with clip_loss as the linear bound.
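Two of these parameters can be made concrete with small sketches: the soft target-network update governed by update_target_weight (tau), and the double-DQN target. These are illustrative pure-Python versions on plain lists, not the TensorFlow implementation:

```python
def soft_target_update(target_weights, online_weights, tau):
    """Soft target-network update with the tau parameter:
    target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]

def double_dqn_target(q_online_next, q_target_next, reward, terminal, discount):
    """Double-DQN target (van Hasselt et al., 2015): the online network
    selects the argmax action, the target network evaluates it."""
    if terminal:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + discount * q_target_next[best_action]
```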
Normalized Advantage Functions¶
- class tensorforce.agents.NAFAgent(config, model=None)¶
Bases: tensorforce.agents.memory_agent.MemoryAgent
Normalized Advantage Functions (NAF) agent (Gu et al., 2016), a.k.a. DQN for continuous actions.
Configuration:
Each agent requires the following configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The MemoryAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- memory_capacity: integer of maximum experiences to store.
- memory: string indicating memory type (‘replay’ or ‘prioritized_replay’).
- update_frequency: integer indicating the number of steps between model updates.
- first_update: integer indicating the number of steps to pass before the first update.
- repeat_update: integer indicating how often to repeat the model update.

Each model requires the following configuration parameters:

- discount: float of discount factor (gamma).
- learning_rate: float of learning rate (alpha).
- optimizer: string of optimizer to use (e.g. ‘adam’).
- device: string of tensorflow device name.
- tf_summary: string directory to write tensorflow summaries. Default None.
- tf_summary_level: int indicating which tensorflow summaries to create.
- tf_summary_interval: int number of calls to get_action until writing tensorflow summaries on update.
- log_level: string containing log level (e.g. ‘info’).
- distributed: boolean indicating whether to use distributed tensorflow.
- global_model: global model.
- session: session to use.

The NAF agent expects the following additional configuration parameters:

- target_update_frequency: int of states between updates of the target network.
- update_target_weight: float of update target weight (tau parameter).
- clip_loss: float; if not 0, uses the Huber loss with clip_loss as the linear bound.
Deep-Q-learning from demonstration (DQFD)¶
- class tensorforce.agents.DQFDAgent(config, model=None)¶
Bases: tensorforce.agents.memory_agent.MemoryAgent
Deep Q-learning from demonstration (DQFD) agent (Hester et al., 2017). This agent uses DQN to pre-train from demonstration data.
Configuration:
Each agent requires the following configuration parameters:
- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

Each model requires the following configuration parameters:

- discount: float of discount factor (gamma).
- learning_rate: float of learning rate (alpha).
- optimizer: string of optimizer to use (e.g. ‘adam’).
- device: string of tensorflow device name.
- tf_summary: string directory to write tensorflow summaries. Default None.
- tf_summary_level: int indicating which tensorflow summaries to create.
- tf_summary_interval: int number of calls to get_action until writing tensorflow summaries on update.
- log_level: string containing log level (e.g. ‘info’).
- distributed: boolean indicating whether to use distributed tensorflow.
- global_model: global model.
- session: session to use.

The DQFDAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- memory_capacity: integer of maximum experiences to store.
- memory: string indicating memory type (‘replay’ or ‘prioritized_replay’).
- min_replay_size: integer of minimum replay size before the first update.
- update_rate: float of the update rate (e.g. 0.25 = every 4 steps).
- target_network_update_rate: float of target network update rate (e.g. 0.01 = every 100 steps).
- use_target_network: boolean indicating whether to use a target network.
- update_repeat: integer of how many times to repeat an update.
- update_target_weight: float of update target weight (tau parameter).
- demo_sampling_ratio: float, ratio of expert data used at runtime to train from.
- supervised_weight: float, weight of large margin classifier loss.
- expert_margin: float of difference in Q-values between expert action and other actions enforced by the large margin function.
- clip_loss: float; if not 0, uses the Huber loss with clip_loss as the linear bound.
import_demonstrations(demonstrations)¶
Imports demonstrations, i.e. expert observations. Note that for large numbers of observations, set_demonstrations is more appropriate, as it directly sets memory contents to an array and expects a different layout.
Parameters: demonstrations – List of observation dicts Returns:
observe(reward, terminal)¶
Adds observations and updates via sampling from memories according to the update rate. DQFD samples from the online replay memory and the demo memory, with the fractions controlled by a hyperparameter called the ‘expert sampling ratio’.
Parameters:
- reward –
- terminal –
Returns:
pretrain(steps)¶
Computes pretrain updates.
Parameters: steps – Number of updates to execute. Returns:
set_demonstrations(batch)¶
Set all demonstrations from batch data. Expects a dict wherein each value contains an array containing all states, actions, rewards, terminals and internals, respectively.
Parameters: batch – dict of demonstration data arrays.
Returns:
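The mixed demo/online sampling behind observe can be sketched as follows. This is a hypothetical simplification of how demo_sampling_ratio splits each batch, not the exact DQFD formula:

```python
import random

def sample_dqfd_batch(replay_memory, demo_memory, batch_size, demo_sampling_ratio):
    """Draw a fraction of each update batch from the demonstration
    memory and the remainder from the online replay memory."""
    num_demo = round(batch_size * demo_sampling_ratio)
    num_online = batch_size - num_demo
    batch = random.sample(demo_memory, num_demo)
    batch += random.sample(replay_memory, num_online)
    return batch
```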
Vanilla Policy Gradient¶
- class tensorforce.agents.VPGAgent(config, model=None)¶
Bases: tensorforce.agents.batch_agent.BatchAgent
Vanilla Policy Gradient agent as described by Sutton et al. (1999).
Configuration:
Each agent requires the following Configuration parameters:

- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The BatchAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- keep_last: bool, optionally keep the last observation for use in the next batch.

A Policy Gradient Model expects the following additional configuration parameters:

- baseline: string indicating the baseline value function (currently ‘linear’ or ‘mlp’).
- gae_rewards: boolean indicating whether to use GAE reward estimation.
- gae_lambda: GAE lambda.
- normalize_rewards: boolean indicating whether to normalize rewards.

The VPG agent does not require any additional configuration parameters.
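The gae_rewards/gae_lambda options refer to generalized advantage estimation (Schulman et al., 2015). A pure-Python sketch of the computation, assuming a single completed episode so the bootstrap value after the terminal state is zero:

```python
def generalized_advantage_estimation(rewards, values, discount, gae_lambda):
    """GAE: advantage_t = sum_k (discount * lambda)^k * delta_{t+k},
    where delta_t = r_t + discount * V(s_{t+1}) - V(s_t).
    Computed backwards in one pass over the episode."""
    values = list(values) + [0.0]        # V after terminal state is 0
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * gae_lambda * gae
        advantages[t] = gae
    return advantages
```

With gae_lambda = 0 this reduces to one-step TD errors; with gae_lambda = 1 it reduces to discounted Monte Carlo returns minus the baseline.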
Trust Region Policy Optimization (TRPO)¶
- class tensorforce.agents.TRPOAgent(config, model=None)¶
Bases: tensorforce.agents.batch_agent.BatchAgent
Trust Region Policy Optimization (Schulman et al., 2015) agent.
Configuration:
Each agent requires the following Configuration parameters:

- states: dict containing one or more state definitions.
- actions: dict containing one or more action definitions.
- preprocessing: dict or list containing state preprocessing configuration.
- exploration: dict containing action exploration configuration.

The BatchAgent class additionally requires the following parameters:

- batch_size: integer of the batch size.
- keep_last: bool, optionally keep the last observation for use in the next batch.

A Policy Gradient Model expects the following additional configuration parameters:

- baseline: string indicating the baseline value function (currently ‘linear’ or ‘mlp’).
- gae_rewards: boolean indicating whether to use GAE reward estimation.
- gae_lambda: GAE lambda.
- normalize_rewards: boolean indicating whether to normalize rewards.

The TRPO agent expects the following additional configuration parameters:

- learning_rate: float of learning rate (alpha).
- optimizer: string of optimizer to use (e.g. ‘adam’).
- cg_damping: float of the damping factor for the conjugate gradient method.
- line_search_steps: int of how many steps to take during line search.
- max_kl_divergence: float indicating the maximum KL divergence to allow for updates.
- cg_iterations: int of count of conjugate gradient iterations.
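The cg_iterations and cg_damping parameters control the conjugate gradient solver TRPO uses to approximately solve A x = b (A being the Fisher information matrix, accessed only through matrix-vector products). An illustrative pure-Python sketch:

```python
def conjugate_gradient(matrix_vector_product, b, iterations=10, damping=0.1):
    """Solve (A + damping * I) x = b approximately, given only a
    function computing A v. The damping term stabilizes the solve."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                          # residual b - A x, with x = 0
    p = list(b)                          # search direction
    rs_old = sum(v * v for v in r)
    for _ in range(iterations):
        ap = [av + damping * pv
              for av, pv in zip(matrix_vector_product(p), p)]
        alpha = rs_old / sum(pv * av for pv, av in zip(p, ap))
        x = [xv + alpha * pv for xv, pv in zip(x, p)]
        r = [rv - alpha * av for rv, av in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < 1e-10:               # residual small enough: converged
            break
        p = [rv + (rs_new / rs_old) * pv for rv, pv in zip(r, p)]
        rs_old = rs_new
    return x
```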
State preprocessing¶
The agent handles state preprocessing. A preprocessor takes the raw state input from the environment and modifies it (for instance, image resize, state concatenation, etc.). You can find information about our ready-to-use preprocessors here.
Building your own agent¶
If you want to build your own agent, it should always inherit from Agent. If your agent uses a replay memory, it should probably inherit from MemoryAgent; if it uses a batch memory that is emptied after each update, it should probably inherit from BatchAgent.
We distinguish between agents and models. The Agent class handles the interaction with the environment, such as state preprocessing, exploration and observation of rewards. The Model class handles the mathematical operations, such as building the tensorflow operations, calculating the desired action and updating (i.e. optimizing) the model weights.
To start building your own agent, please refer to this blogpost to gain a deeper understanding of the internals of the TensorForce library. Afterwards, have a look at a sample implementation, e.g. the DQN Agent and DQN Model.
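The agent/model split described above can be sketched in plain Python. All names here are hypothetical; the sketch only mirrors the division of labor, not the actual TensorForce base classes:

```python
class MyModel:
    """Hypothetical model: holds the mathematics (action computation
    and parameter updates)."""

    def get_action(self, state):
        return 0                         # e.g. argmax over Q-values

    def update(self, batch):
        pass                             # e.g. one gradient step


class MyBatchStyleAgent:
    """Hypothetical agent skeleton: handles preprocessing, exploration
    hooks and bookkeeping, and delegates the math to its model."""

    def __init__(self, batch_size, model=None):
        self.model = model or MyModel()
        self.batch_size = batch_size
        self.batch = []

    def preprocess(self, state):
        return state                     # e.g. normalization, resizing

    def act(self, state, deterministic=False):
        state = self.preprocess(state)
        # Exploration would wrap the model's action here when not
        # deterministic; omitted for brevity.
        return self.model.get_action(state)

    def observe(self, reward, terminal):
        self.batch.append((reward, terminal))
        if len(self.batch) >= self.batch_size:
            self.model.update(self.batch)
            self.batch = []              # batch memory cleared after update
```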