Simulation Flow Control¶
About¶
palaestrAI lets you define when an agent acts, when environments are updated (stepped), and at which point an episode or phase ends. This is called simulation flow control and is achieved through simulation controllers and termination conditions.
Simulation controllers make the simulation tick; they define which data is passed to which entity at which point. For example, the taking turns simulation controller lets each agent act in turn, and steps all environments between agent actions.
Termination conditions decide when an episode or a phase ends. For example, an episode can end when a particular agent is successful enough, or a phase could end when a fixed number of episodes have been executed.
Simulation Controllers¶
Taking Turns¶
Vanilla (Scatter-Gather)¶
- palaestrai.simulation.VanillaSimulationController¶
alias of VanillaSimController
Termination Conditions Available¶
Agent Objective¶
- class palaestrai.experiment.AgentObjectiveTerminationCondition(*args, **kwargs)[source]¶
Controls the simulation flow based on the overall success of an agent.
This termination condition lets you control the simulation flow based on the overall success of an agent. Users may supply any objective average as a threshold, which leads to a SimulationFlowControl.RESET on episode level and a SimulationFlowControl.STOP_PHASE on phase level. I.e., when an agent becomes successful during an episode, it requests a restart of that episode. If the agent stays successful over a number of episodes, the phase ends.
Threshold values are given in the termination condition’s parameters for each agent. Under each agent key, the actual threshold values are given. The keys follow a specific pattern: {brain|phase}_avg{number}, where “{brain|phase}” means either “brain” or “phase”, and “{number}” is the window size of the floating average.
brain_avgN

specifies that an agent signals the end of an episode once the mean of the last N objective values is greater than or equal to the given number X. The simulation controller can then decide to end the episode. This change in flow control is only relevant for the current worker; i.e., other workers continue until they are equally successful, or until the phase ends for another reason. I.e.,

\[\frac{1}{N} \sum_{t=T-N+1}^{T} r_t \ge X\]

phase_avgN

signals termination of a phase once the average per-step reward of the last N episodes is greater than or equal to the given number X. I.e., this parameter considers the average reward of all steps over all workers (1 worker = 1 episode), since a worker acts within one particular episode. Put in math:

\[\frac{1}{N} \sum_{\mathit{episode}=1}^{N} \frac{1}{M} \sum_{m=1}^{M} r_m^{(\mathit{episode})} \ge X\]

where M is the number of steps in a particular episode.
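The two averages can be sketched in plain Python. This is a minimal illustration of the semantics described above, not palaestrAI’s actual implementation; the function names are made up:

```python
def brain_avg_triggers(objective_values, n, threshold):
    """True once the mean of the last n objective values reaches threshold.

    Only evaluated after at least n values have been recorded,
    mirroring the avgN window requirement described below.
    """
    if len(objective_values) < n:
        return False
    return sum(objective_values[-n:]) / n >= threshold


def phase_avg_triggers(episode_rewards, n, threshold):
    """True once the mean per-step objective of the last n finished
    episodes reaches threshold.

    episode_rewards is a list of lists: one list of per-step objective
    values per finished episode (1 worker = 1 episode).
    """
    if len(episode_rewards) < n:
        return False
    episode_means = [sum(ep) / len(ep) for ep in episode_rewards[-n:]]
    return sum(episode_means) / n >= threshold
```

Note how phase_avgN first averages within each episode and then across episodes, while brain_avgN is a plain sliding window over steps.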
Note

Any particular phase_avgN must hold for all workers. Suppose you have 2 workers; then phase_avg10: 1.0 forces both workers to have at least 10 successful episodes, where the average objective value over all steps is at least 1.0.

E.g.:

brain_avg100: 8.9: the episode ends once the brain reaches an objective score of at least 8.9, averaged over the last 100 actions.

brain_avg10: 8.9: similar to the above, except that the averaging is done over 10 actions.

phase_avg10: 1.0: ends the phase once the average cumulative success of the brain over the last 10 episodes of all workers is at least 1.0.
Warning

A word of caution: make sure that your brain_avgN and phase_avgN definitions are compatible, mathematically speaking. A brain_avg10: 100 does not necessarily imply that phase_avg10: 100 also holds: brain_avg10 considers the last 10 steps of one episode, while phase_avg10 considers the average objective value of all steps in 10 episodes. Misaligning them can easily create a setup in which the phase never terminates. As an example, suppose the objective value of step 1 is 1, step 2 yields an objective value of 2, step 3 of 3, etc. Then, brain_avg10: 100 will terminate after 105 steps, because the average objective value over the last 10 steps is greater than 100, as (96 + 97 + … + 104 + 105) / 10 = 100.5. However, the average objective value over all steps of the episode is (1 + 2 + … + 105) / 105 = 53, so the average over the last 10 episodes is also 53. Thus, the condition phase_avg10: 100 never triggers and the phase never terminates, as always 53 < 100.

If you specify any avgN, the termination condition ensures that at least N actions are recorded before calculating the average. Meaning: if your environment terminates after N steps, but you specify a brain_avgM with N < M, then the termination condition is never evaluated. To calculate the average of the last 10 steps, the agent must have had the chance to act 10 times, after all.
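The arithmetic in the warning above is easy to check; the following plain-Python snippet only verifies the numbers:

```python
# Objective value of step t is simply t, for t = 1, ..., 105.
values = list(range(1, 106))

# brain_avg10 looks at the mean of the last 10 steps:
brain_window_mean = sum(values[-10:]) / 10
# (96 + 97 + ... + 105) / 10 = 100.5 >= 100, so brain_avg10: 100 fires.

# phase_avg10 looks at the per-episode mean over all steps:
episode_mean = sum(values) / len(values)
# (1 + 2 + ... + 105) / 105 = 53.0 < 100, so phase_avg10: 100 never fires.
```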
Note

For technical reasons, you must specify a brain_avg* parameter if you want to use phase_avg*, as the result of the brain objective averaging is transmitted to the phase-specific portion of the termination condition. However, a special case exists when specifying a brain_avgN parameter, but not a phase_avgN parameter: then, the first agent that triggers the termination condition during an episode ends the whole phase.

Examples
The following snippet is a shortened example from palaestrAI’s experiment definition:
definitions:
  agents:
    myagent:
      name: My Agent
      # (Other agent definitions omitted)
  simulation:
    tt:
      name: palaestrai.simulation:TakingTurns
      conditions:
        - name: palaestrai.experiment:AgentObjectiveTerminationCondition
          params:
            My Agent:
              brain_avg100: 8.9
  run_config:
    condition:
      name: palaestrai.experiment:AgentObjectiveTerminationCondition
      params:
        My Agent:
          phase_avg100: 8.9
This configuration means that an episode ends once the last 100 steps have an average objective of at least 8.9, and the phase ends once the average reward of the last 100 episodes is at least 8.9. For illustration, suppose phase_avg10: 8.9 were given instead and the last 10 episodes had average rewards of 10, 11, 6, 12, 15, 20, 17, 11, 9, and 10; then the phase termination condition would hold, as (10 + 11 + 6 + 12 + 15 + 20 + 17 + 11 + 9 + 10) / 10 = 12.1 ≥ 8.9.
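The episode averages in the example can be verified directly (plain Python, just checking the arithmetic):

```python
# Average reward of each of the last 10 episodes, as in the example:
episode_means = [10, 11, 6, 12, 15, 20, 17, 11, 9, 10]

# The phase-level condition averages these per-episode means:
phase_avg = sum(episode_means) / len(episode_means)
# 121 / 10 = 12.1, which is >= 8.9, so the phase would end.
```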
Environment Termination Condition¶
- class palaestrai.experiment.EnvironmentTerminationCondition[source]¶
Terminates the current episode when an Environment terminates

This TerminationCondition examines updates from an Environment and checks whether the environment itself signals termination. When an environment termination signal is received, this TerminationCondition ends the current episode.

Example
The following snippet is a shortened example from palaestrAI’s experiment definition in which an episode is ended when the environment terminates:
definitions:
  # (Definitions of environment, agents and phase_config are omitted.)
  simulation:
    vanilla:
      name: palaestrai.simulation:Vanilla
      conditions:
        - name: palaestrai.experiment:EnvironmentTerminationCondition
          params: {}
# (Definition of the run configuration is also omitted.)
Maximum Number of Episodes¶
- class palaestrai.experiment.MaxEpisodesTerminationCondition[source]¶
Checks whether a maximum number of episodes has been exceeded.
This termination condition triggers only on phase level. It uses the episodes key in the phase configuration to check whether the maximum number of episodes has been reached.

Examples
Consider the following experiment phase definition:
schedule:
  Training:
    phase_config:
      mode: train
      worker: 2
      episodes: 100
    simulation:
      conditions:
        - name: palaestrai.experiment:MaxEpisodesTerminationCondition
          params: {}
      name: palaestrai.simulation:TakingTurns
    run_config:
      condition:
        name: palaestrai.experiment:MaxEpisodesTerminationCondition
        params: {}
Then, the phase would end when both workers (worker: 2) have reached 100 episodes (episodes: 100).
Default (Vanilla) Phase Termination Condition¶
- class palaestrai.experiment.VanillaRunGovernorTerminationCondition[source]¶
A combination of environment and max episodes flow control.
This TerminationCondition combines the EnvironmentTerminationCondition and the MaxEpisodesTerminationCondition to end an episode when the environment terminates, and the phase when all workers have reached the maximum number of episodes.

Example
The following excerpt from a phase configuration shows an example of using this termination condition to end the phase once both workers have experienced 10 episodes each, where each episode runs until the environment terminates:
schedule:
  - phase_0:
      # (Definition of environment and agents omitted.)
      simulation:
        name: palaestrai.simulation:Vanilla
        conditions:
          - name: palaestrai.simulation:VanillaSimControllerTerminationCondition
            params: {}
      phase_config:
        # Additional config for this phase
        mode: train
        worker: 2
        episodes: 10
      run_config:
        condition:
          name: palaestrai.experiment:VanillaRunGovernorTerminationCondition
          params: {}
Multiple Termination Conditions¶
Multiple TerminationConditions can be combined via custom classes such as the VanillaRunGovernorTerminationCondition, but they can also be combined directly in the experiment file. Conditions on the episode level can be used together, i.e., ORed, by adding them to the conditions list, e.g.:
definitions:
  agents:
    myagent:
      name: &agent_name My Agent
      # (Other agent definitions omitted)
  phase_config:
    mode: train
    worker: 2
  simulation:
    taking_turns:
      name: palaestrai.simulation:Vanilla
      conditions:
        - name: palaestrai.experiment:EnvironmentTerminationCondition
          params: {}
        - name: palaestrai.experiment:AgentObjectiveTerminationCondition
          params:
            *agent_name :
              brain_avg200: 10.0
  run_config:
    condition:
      name: palaestrai.experiment:AgentObjectiveTerminationCondition
      params:
        *agent_name :
          phase_avg5: 1.0
This configuration means that an episode of one of the two workers of My Agent ends once that worker has an average objective of at least 10.0 over the last 200 steps of the current episode, OR once the Environment terminates. Furthermore, independently of the episode-level TerminationConditions, the phase ends once the average objective value over all steps per episode, averaged over the last 5 episodes, is at least 1.0.
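Conceptually, the episode-level conditions in the conditions list are ORed: the episode ends as soon as any single condition fires. A minimal Python sketch of that logic follows; the class names and state keys are hypothetical and only illustrate the combination, they are not palaestrAI’s actual API:

```python
class EnvTerminated:
    """Fires when the environment signals termination (hypothetical)."""

    def check(self, state):
        return state["env_done"]


class ObjectiveReached:
    """Fires when the mean of the last n objective values reaches
    the threshold (hypothetical brain_avgN analogue)."""

    def __init__(self, n, threshold):
        self.n, self.threshold = n, threshold

    def check(self, state):
        values = state["objective_values"]
        return (len(values) >= self.n
                and sum(values[-self.n:]) / self.n >= self.threshold)


def episode_should_end(conditions, state):
    # Episode-level conditions are combined with OR: any single
    # condition firing ends the episode.
    return any(c.check(state) for c in conditions)
```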
Note

If the maximum number of steps in an episode is less than 200, then the average of the objective values for the brain_avg200: 10.0 condition is never calculated, because the objective values of the steps never fill the required window of the last 200 steps. In this case, the episode-level condition of the AgentObjectiveTerminationCondition is effectively inert, but it is nevertheless required for the phase-level condition.