Simulation Flow Control¶
About¶
palaestrAI lets you define when an agent acts, when environments are updated (stepped), and at which point an episode or phase ends. This is called simulation flow control and is achieved through simulation controllers and termination conditions.
Simulation controllers make the simulation tick; they define which data is passed to which entity at which point. For example, the taking turns simulation controller lets each agent act in turn, and steps all environments between agent actions.
Termination conditions decide when an episode or a phase ends. For example, an episode can end when a particular agent is successful enough, or a phase could end when a fixed number of episodes have been executed.
Simulation Controllers¶
Taking Turns¶
Vanilla (Scatter-Gather)¶
- palaestrai.simulation.VanillaSimulationController¶
alias of VanillaSimController
Termination Conditions Available¶
Agent Objective¶
- class palaestrai.experiment.AgentObjectiveTerminationCondition(*args, **kwargs)[source]¶
Controls the simulation flow based on the overall success of an agent.
This termination condition lets you control the simulation flow based on the overall success of an agent. Users may supply any objective average as a threshold, which leads to a SimulationFlowControl.RESET on episode level and a SimulationFlowControl.STOP_PHASE on phase level. I.e., when an agent becomes successful during an episode, it requests a restart of that episode. If the agent stays successful over a number of episodes, the phase ends.
Threshold values are given in the termination condition’s parameters for each agent. Under each agent key, the actual threshold values are given. The keys follow a specific pattern: {brain|phase}_avg{number}, where “{brain|phase}” means either “brain” or “phase”, and “{number}” is the window size of the floating average.
brain_avgN

specifies that an agent signals the end of an episode once the mean of the last N objective values is greater than or equal to the given number X. The simulation controller can then decide to end the episode. This change in flow control is only relevant for the current worker; i.e., other workers continue until they are equally successful, or until the phase ends for another reason. I.e.,

\[\frac{1}{N} \sum_{t=T-N+1}^{T} r_t \ge X\]

phase_avgN

signals termination of a phase once the average per-step reward of the last N episodes is greater than or equal to the given number X. I.e., this parameter considers the average reward of all steps over all workers (1 worker = 1 episode), since a worker acts within one particular episode. Put in math:

\[\frac{1}{N} \sum_{\mathit{episode}=1}^{N} \frac{1}{M} \sum_{m=1}^{M} r_m^{(\mathit{episode})} \ge X\]

where M is the number of steps in a particular episode.
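The two averages can be sketched in plain Python. This is a minimal illustration of the semantics described above, not palaestrAI’s actual implementation; the function names are made up:

```python
def brain_avg_triggers(objective_values, n, threshold):
    """True once the mean of the last n objective values reaches threshold.

    Only evaluated after at least n values have been recorded,
    mirroring the avgN window requirement described below.
    """
    if len(objective_values) < n:
        return False
    return sum(objective_values[-n:]) / n >= threshold


def phase_avg_triggers(episode_rewards, n, threshold):
    """True once the mean per-step objective of the last n finished
    episodes reaches threshold.

    episode_rewards is a list of lists: one list of per-step objective
    values per finished episode (1 worker = 1 episode).
    """
    if len(episode_rewards) < n:
        return False
    episode_means = [sum(ep) / len(ep) for ep in episode_rewards[-n:]]
    return sum(episode_means) / n >= threshold
```

Note how phase_avgN first averages within each episode and then across episodes, while brain_avgN is a plain sliding window over steps.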
Note

Any particular phase_avgN must hold for all workers. Suppose you have 2 workers; then phase_avg10: 1.0 forces both workers to have at least 10 successful episodes, where the average objective value over all steps is at least 1.0.

E.g.:

brain_avg100: 8.9: the episode ends once the brain reaches an objective score of at least 8.9, averaged over the last 100 actions.

brain_avg10: 8.9: similar to the above, except that the averaging is done over 10 actions.

phase_avg10: 1.0: ends the phase once the average cumulative success of the brain over the last 10 episodes of all workers is at least 1.0.
Warning

A word of caution: make sure that your brain_avgN and phase_avgN definitions are compatible, mathematically speaking. A brain_avg10: 100 does not necessarily imply that phase_avg10: 100 also holds: brain_avg10 considers the last 10 steps of one episode, while phase_avg10 considers the average objective value of all steps in 10 episodes. Misaligning them can easily create a setup in which the phase never terminates. As an example, suppose the objective value of step 1 is 1, step 2 yields an objective value of 2, step 3 of 3, etc. Then, brain_avg10: 100 will terminate after 105 steps, because the average objective value over the last 10 steps is greater than 100, as (96 + 97 + … + 104 + 105) / 10 = 100.5. However, the average objective value over all steps of the episode is (1 + 2 + … + 105) / 105 = 53, so the average over the last 10 episodes is also 53. Thus, the condition phase_avg10: 100 never triggers and the phase never terminates, as always 53 < 100.

If you specify any avgN, the termination condition ensures that at least N actions are recorded before calculating the average. Meaning: if your environment terminates after N steps, but you specify a brain_avgM with N < M, then the termination condition is never evaluated. To calculate the average of the last 10 steps, the agent must have had the chance to act 10 times, after all.
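The arithmetic in the warning above is easy to check; the following plain-Python snippet only verifies the numbers:

```python
# Objective value of step t is simply t, for t = 1, ..., 105.
values = list(range(1, 106))

# brain_avg10 looks at the mean of the last 10 steps:
brain_window_mean = sum(values[-10:]) / 10
# (96 + 97 + ... + 105) / 10 = 100.5 >= 100, so brain_avg10: 100 fires.

# phase_avg10 looks at the per-episode mean over all steps:
episode_mean = sum(values) / len(values)
# (1 + 2 + ... + 105) / 105 = 53.0 < 100, so phase_avg10: 100 never fires.
```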
Note

For technical reasons, you must specify a brain_avg* parameter if you want to use phase_avg*, as the result of the brain objective averaging is transmitted to the phase-specific portion of the termination condition. However, a special case exists when specifying a brain_avgN parameter, but not a phase_avgN parameter: then, the first agent that triggers the termination condition during an episode ends the whole phase.

Examples
The following snippet is a shortened example from palaestrAI’s experiment definition:
definitions:
  agents:
    myagent:
      name: My Agent
      # (Other agent definitions omitted)
  simulation:
    tt:
      name: palaestrai.simulation:TakingTurns
      conditions:
        - name: palaestrai.experiment:AgentObjectiveTerminationCondition
          params:
            My Agent:
              brain_avg100: 8.9
  run_config:
    condition:
      name: palaestrai.experiment:AgentObjectiveTerminationCondition
      params:
        My Agent:
          phase_avg100: 8.9
This configuration means that an episode ends once the last 100 steps have an average objective of at least 8.9, and the phase ends once the average reward of the last 100 episodes is at least 8.9. For illustration, suppose phase_avg10: 8.9 were given instead and the last 10 episodes had average rewards of 10, 11, 6, 12, 15, 20, 17, 11, 9, and 10; then the phase termination condition would hold, as (10 + 11 + 6 + 12 + 15 + 20 + 17 + 11 + 9 + 10) / 10 = 12.1 ≥ 8.9.
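The episode averages in the example can be verified directly (plain Python, just checking the arithmetic):

```python
# Average reward of each of the last 10 episodes, as in the example:
episode_means = [10, 11, 6, 12, 15, 20, 17, 11, 9, 10]

# The phase-level condition averages these per-episode means:
phase_avg = sum(episode_means) / len(episode_means)
# 121 / 10 = 12.1, which is >= 8.9, so the phase would end.
```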
Environment Termination Condition¶
- class palaestrai.experiment.EnvironmentTerminationCondition[source]¶
Terminates the current episode when an Environment terminates

This TerminationCondition examines updates from an Environment and checks whether the environment itself signals termination. When an environment termination signal is received, this TerminationCondition ends the current episode.

Example
The following snippet is a shortened example from palaestrAI’s experiment definition in which an episode is ended when the environment terminates:
definitions:
  # (Definitions of environment, agents and phase_config are omitted.)
  simulation:
    vanilla:
      name: palaestrai.simulation:Vanilla
      conditions:
        - name: palaestrai.experiment:EnvironmentTerminationCondition
          params: {}
# (Definition of the run configuration is also omitted.)
Maximum Number of Episodes¶
- class palaestrai.experiment.MaxEpisodesTerminationCondition[source]¶
Checks whether a maximum number of episodes has been exceeded.
This termination condition triggers only on phase level. It uses the episodes key in the phase configuration to check whether the maximum number of episodes has been reached.

Examples
Consider the following experiment phase definition:
schedule:
  Training:
    phase_config:
      mode: train
      worker: 2
      episodes: 100
    simulation:
      conditions:
        - name: palaestrai.experiment:MaxEpisodesTerminationCondition
          params: {}
      name: palaestrai.simulation:TakingTurns
    run_config:
      condition:
        name: palaestrai.experiment:MaxEpisodesTerminationCondition
        params: {}
Then, the phase would end when both workers (worker: 2) have reached 100 episodes (episodes: 100).
Default (Vanilla) Phase Termination Condition¶
- class palaestrai.experiment.VanillaRunGovernorTerminationCondition[source]¶
A combination of environment and max episodes flow control.
This TerminationCondition combines the EnvironmentTerminationCondition and the MaxEpisodesTerminationCondition to end an episode when the environment terminates, and the phase when all workers have reached the maximum number of episodes.

Example
The following excerpt from a phase configuration shows an example of using this termination condition to end the phase once both workers have experienced 10 episodes each, where each episode runs until the environment terminates:
schedule:
  - phase_0:
      # (Definition of environment and agents omitted.)
      simulation:
        name: palaestrai.simulation:Vanilla
        conditions:
          - name: palaestrai.simulation:VanillaSimControllerTerminationCondition
            params: {}
      phase_config:
        # Additional config for this phase
        mode: train
        worker: 2
        episodes: 10
      run_config:
        condition:
          name: palaestrai.experiment:VanillaRunGovernorTerminationCondition
          params: {}
Multiple Termination Conditions¶
Multiple TerminationConditions can be combined via custom classes such as the VanillaRunGovernorTerminationCondition, but they can also be combined directly in the experiment file. Conditions on the episode level can be used together, i.e., ORed, by adding them to the conditions list, e.g.:
definitions:
  agents:
    myagent:
      name: &agent_name My Agent
      # (Other agent definitions omitted)
  phase_config:
    mode: train
    worker: 2
  simulation:
    taking_turns:
      name: palaestrai.simulation:Vanilla
      conditions:
        - name: palaestrai.experiment:EnvironmentTerminationCondition
          params: {}
        - name: palaestrai.experiment:AgentObjectiveTerminationCondition
          params:
            *agent_name :
              brain_avg200: 10.0
  run_config:
    condition:
      name: palaestrai.experiment:AgentObjectiveTerminationCondition
      params:
        *agent_name :
          phase_avg5: 1.0
This configuration means that an episode of one of the two workers of My Agent ends once that worker has an average objective of at least 10.0 over the last 200 steps of the current episode, OR once the Environment terminates. Furthermore, independently of the episode-level TerminationConditions, the phase ends once the average objective value over all steps per episode, averaged over the last 5 episodes, is at least 1.0.
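Conceptually, the episode-level conditions in the conditions list are ORed: the episode ends as soon as any single condition fires. A minimal Python sketch of that logic follows; the class names and state keys are hypothetical and only illustrate the combination, they are not palaestrAI’s actual API:

```python
class EnvTerminated:
    """Fires when the environment signals termination (hypothetical)."""

    def check(self, state):
        return state["env_done"]


class ObjectiveReached:
    """Fires when the mean of the last n objective values reaches
    the threshold (hypothetical brain_avgN analogue)."""

    def __init__(self, n, threshold):
        self.n, self.threshold = n, threshold

    def check(self, state):
        values = state["objective_values"]
        return (len(values) >= self.n
                and sum(values[-self.n:]) / self.n >= self.threshold)


def episode_should_end(conditions, state):
    # Episode-level conditions are combined with OR: any single
    # condition firing ends the episode.
    return any(c.check(state) for c in conditions)
```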
Note

If the maximum number of steps in an episode is less than 200, then the average of the objective values for the brain_avg200: 10.0 condition is never calculated, because the objective values of the steps never fill the required window of the last 200 steps. In this case, the episode-level condition of the AgentObjectiveTerminationCondition is effectively inert, but it is nevertheless required for the phase-level condition.