2. Using Environments: Playing Tic-Tac-Toe#

This tutorial gently guides you through your second end-to-end run with palaestrAI. Here, you will learn how to create an experiment run file, how to customize agents and environments, how to execute the run, and how to query the store for data. We will use an agent from hARL and the Tic-Tac-Toe environment from palaestrai-environments.

This tutorial will call the palaestrAI API directly from the notebook. The command-line interface (CLI) does exactly that under the hood, too: There is no difference in the general usage or the layout of the experiment run files. But with the Jupyter notebook, we can have everything neatly in one place.

So sit back and follow us through your experiment run… Have a lot of fun!

2.1. palaestrAI Modules Installation#

In order to install the hARL and palaestrai-environments modules and make use of the full palaestrAI stack necessary for this example, first clone the repositories of these two modules and then run the code cell below once (or install the modules manually):

Note: Replace the placeholders in the code cell below with the correct directory paths.

[1]:
# %pip install [Directory path of cloned hARL repo]
# %pip install [Directory path of cloned palaestrai-environments repo]

2.2. Imports#

Let’s start by importing necessary modules. This will be what we need for palaestrAI, namely the entrypoint, the runtime config, and the database access stuff:

[2]:
import palaestrai  # Will provide palaestrai.execute
import palaestrai.core  # RuntimeConfig
import palaestrai.store  # store.Session for database connectivity
import palaestrai.store.database_util
import palaestrai.store.database_model as paldb

The typical data science analysis toolstack uses pandas and matplotlib, so let’s import those, too.

[3]:
import numpy as np
import pandas as pd

We will need jsonpickle to inspect the reward information objects later on. Since these contain NumPy data, we also need the jsonpickle extension for numpy:

[4]:
import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy

jsonpickle_numpy.register_handlers()

There are also some of the usual suspects from Python’s standard library, which we’ll import here without further comment:

[5]:
import io
import pprint
import tempfile
from pathlib import Path

2.3. Experiment Run Document#

Everything palaestrAI does depends on its configuration, or rather, experiments. When you do real design of experiments, you first create an experiment document, in which you define strategies for sampling your factors. Each sample is an experiment run, which will be executed by palaestrAI. We won’t do the full DoE dance here, but rather provide an experiment run document directly.

Experiments and experiment runs have unique names (uid). When they are not given, they are auto-generated, but usually the user wants to set them in order to find them in the store later on. Choosing a good name might seem hard (it isn't, any string will do); being forced to choose a unique name might seem an unnecessary constraint. However, it isn't: Each experiment run must be repeatable, i.e., always have the same result, no matter how often it is run. A change in an experiment run definition can yield different results. Therefore, each experiment run is unique, and so its name should be, too. We will define the experiment run name as a separate variable so that we don't have to remember it later on when we query the store:

[6]:
experiment_run_name = "Tutorial Experiment Run"

Experiment (run) documents also have a version. It serves as a discriminator to catch semantic changes in the document: an additional safeguard that emits a log message on a mismatch, but does not stop execution.

For this tutorial, we set the document’s version to palaestrAI’s version. That is okay here since we need to keep this document up-to-date in any case. When experiment runs are archived, the version number (and its immutability!) becomes more important.

[7]:
experiment_run_version = palaestrai.__version__

And now to the document itself. Apart from the uid, the version, and the random seed (seed), it provides the configuration of the experiment run. Experiment runs have phases, so the most important key here is the experiment schedule.

A schedule defines the phases of an experiment run. A phase comprises environments, agents, simulation parameters such as the termination condition, as well as general configuration flags. Schedule configurations are cascading: Values defined in the previous phase are applied to following phases, too, unless they are explicitly overwritten.
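This cascading can be pictured as a simple dictionary merge in which each phase starts from its predecessor's effective configuration and overrides only the keys it defines itself. A minimal sketch of the idea (not palaestrAI's actual merge code):

```python
# Sketch of cascading phase configurations: each phase inherits the
# effective settings of the previous phase; keys it defines itself
# take precedence. Illustrative only, NOT palaestrAI's implementation.


def effective_phases(schedule):
    merged = []
    effective = {}
    for phase in schedule:
        effective = {**effective, **phase}  # later keys overwrite earlier ones
        merged.append(dict(effective))
    return merged


schedule = [
    {"mode": "train", "worker": 1, "episodes": 64},
    {"mode": "test", "episodes": 16},  # 'worker' is inherited from phase 1
]

train_phase, test_phase = effective_phases(schedule)
print(test_phase)  # {'mode': 'test', 'worker': 1, 'episodes': 16}
```

Note how the second phase only has to state what differs; everything else carries over, exactly as in the experiment run document below.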

In this experiment, we will simulate the Tic-Tac-Toe board game. In order to run it, you need to install two modules:

  • hARL: This module provides a set of brain/muscle implementations powered by well-known RL algorithms.

    • To install it, clone the source from the repository at hARL and install it from source with pip install {harl source path}. If you want the module to be updated automatically whenever you update the source, use pip install -e {harl source path} instead.

  • palaestrai-environments: This module provides implementations of several environments, among them Tic-Tac-Toe.

    • To install it, clone the source from the repository at palaestrai-environments and install it from source with pip install {palaestrai-environments source path}. If you want the module to be updated automatically whenever you update the source, use pip install -e {palaestrai-environments source path} instead.

    • Note: The Tic-Tac-Toe environment is designed so that you need only one agent to run the experiment; the competitor is implemented as an agent embedded in the environment and makes a move after each action of the main agent.

This experiment run contains a training phase in train mode with one worker, followed by a test phase. If you ran the Dummy experiment, you will notice that the experiment configuration template is still the same, just with different values for environment and agent. The TicTacToeEnvironment class is used to set up the experiment environment. The board is implemented as an array of length 9, exposed through 9 sensors and 1 actuator, which must be configured for the agent under the same names as in the source code. Each sensor is responsible for checking the status of one tile. Two parameters are passed to the environment:

  • randomness: The rate at which the environment chooses a random move over the optimal move.

  • invalid_turn_limit: How many invalid turns the agent is allowed to make before the episode is terminated.
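Since the nine tile sensors follow a regular row-column naming scheme, their UIDs (as used in the experiment run document below) can be generated rather than typed out:

```python
# The Tic-Tac-Toe board exposes one sensor per tile, named "Tile <row>-<col>".
# Prefixed with the environment uid ("myenv"), this yields the sensor list
# that appears in the experiment run document.
env_uid = "myenv"
sensor_names = [
    f"{env_uid}.Tile {row}-{col}" for row in range(1, 4) for col in range(1, 4)
]
print(sensor_names[0], sensor_names[-1])  # myenv.Tile 1-1 myenv.Tile 3-3
```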

The agent that tries to play Tic-Tac-Toe uses the [Soft Actor Critic (SAC)](https://spinningup.openai.com/en/latest/algorithms/sac.html) algorithm. SAC has a number of hyperparameters, which can be found in the documentation, and are:

  • replay_size (int): The maximum size of the replay buffer

  • fc_dims (a list of integers, default is [256, 256]): The number of neurons in the hidden layers of the policy networks. (“fc” stands for “fully connected”.)

  • activation: The activation function the network uses. This must be a PyTorch module; the default is torch.nn.ReLU.

  • gamma: The discount factor; default is 0.99.

  • polyak: Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

    \[\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + ( 1-\rho) \theta,\]

    where \(\rho\) is polyak. (Always between 0 and 1, usually close to 1.) Default is 0.995.

  • lr: Learning rate used for policy and value learning; defaults to 1e-3.

  • batch_size: Minibatch size of the gradient descent; defaults to 100.

  • update_after: Number of interactions with the environment the agent should collect before the training starts. Higher numbers lead to more random (“dumb”) interactions of the agent with the environment before it actually tries to develop a strategy. However, this also means that the replay buffer contains enough interesting data to learn from. Defaults to 1000 interactions.

  • update_every: How many interactions should happen between trainings. Defaults to 50.
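The polyak update rule above is easy to write out. A minimal sketch with plain Python floats standing in for network parameters (real implementations such as PyTorch apply the same rule tensor-wise):

```python
# Polyak averaging: the target parameter moves slowly towards the main
# parameter, theta_targ <- rho * theta_targ + (1 - rho) * theta.
def polyak_update(theta_targ, theta, rho=0.995):
    return rho * theta_targ + (1 - rho) * theta


theta_targ = 0.0
theta = 1.0
for _ in range(1000):
    theta_targ = polyak_update(theta_targ, theta)
print(theta_targ)  # slowly approaches 1.0
```

With rho close to 1, each update moves the target only a tiny step, which stabilizes the learning targets.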

(Please note that we’re using an f-string here, and hence the YAML dict {} becomes {{}}.)
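To see the doubled braces in action (a generic Python detail, not specific to palaestrAI):

```python
# In an f-string, literal braces must be doubled: {{}} renders as {}.
uid = "Tutorial Experiment Run"
snippet = f'uid: "{uid}"\nparams: {{}}'
print(snippet)
# uid: "Tutorial Experiment Run"
# params: {}
```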

[8]:
experiment_run_document = f"""
uid: "{experiment_run_name}"
seed: 5831341
version: "{experiment_run_version}"
schedule:  # The schedule for this run; it is a list
  - Training Phase:  # Name of the current phase. Can be any user-chosen name
      environments:  # Definition of the environments for this phase
        - environment:
            name: palaestrai_environments.tictactoe:TicTacToeEnvironment
            uid: myenv
            params:
              randomness: 0.7
              invalid_turn_limit: 3
      agents:  # Definition of agents for this phase
        - name: ttt_agent
          brain:
            name: harl:SACBrain
            params:
              update_after: 32
              update_every: 3
          muscle:
            name: harl:SACMuscle
            params:
              start_steps: 32
          objective:
            name: palaestrai.agent.dummy_objective:DummyObjective
            params: {{ }}
          sensors:
            - myenv.Tile 1-1
            - myenv.Tile 1-2
            - myenv.Tile 1-3
            - myenv.Tile 2-1
            - myenv.Tile 2-2
            - myenv.Tile 2-3
            - myenv.Tile 3-1
            - myenv.Tile 3-2
            - myenv.Tile 3-3
          actuators:
            - myenv.Field selector
      simulation:  # Definition of the simulation controller for this phase
        name: palaestrai.simulation:TakingTurns
        conditions:
          - name: palaestrai.simulation:VanillaSimControllerTerminationCondition
            params: {{ }}
      phase_config:
        mode: train
        worker: 1
        episodes: 64
  - Test Phase:
      phase_config:
        mode: test
        worker: 1
        episodes: 16
run_config:  # Not a runTIME config
  condition:
    name: palaestrai.experiment:VanillaRunGovernorTerminationCondition
    params: {{}}
"""

2.4. Runtime Config#

With the experiment run neatly defined, there is something else that defines how palaestrAI behaves: Its runtime config. It has nothing to do with an experiment run, but defines the behavior of palaestrAI on a certain machine. This includes log levels or the URI defining how to connect to the database. Usually, one does not touch it once the framework is installed.

In this case, we’re playing it safe and provide some sane defaults that are only relevant for the scope of this notebook. For example, we’ll resort to using SQLite in a temporary directory instead of PostgreSQL + TimescaleDB (speed is not of importance here).

Let’s create the database in a temporary location:

[9]:
store_dir = tempfile.TemporaryDirectory()
store_dir
[9]:
<TemporaryDirectory '/tmp/tmpq6cysob2'>
[10]:
runtime_config = palaestrai.core.RuntimeConfig()
runtime_config.reset()
runtime_config.load(
    {
        "store_uri": "sqlite:///%s/palaestrai.db" % store_dir.name,
        "executor_bus_port": 24747,
        "logger_port": 24748,
    }
)
pprint.pprint(runtime_config.to_dict())
{'data_path': './_outputs',
 'executor_bus_port': 24747,
 'logger_port': 24748,
 'logging': {'filters': {'debug_filter': {'()': 'palaestrai.core.runtime_config.DebugLogFilter'}},
             'formatters': {'debug': {'format': '%(asctime)s '
                                                '%(name)s[%(process)d]: '
                                                '%(levelname)s - %(message)s '
                                                '(%(module)s.%(funcName)s in '
                                                '%(filename)s:%(lineno)d)'},
                            'simple': {'format': '%(asctime)s '
                                                 '%(name)s[%(process)d]: '
                                                 '%(levelname)s - '
                                                 '%(message)s'}},
             'handlers': {'console': {'class': 'logging.StreamHandler',
                                      'formatter': 'simple',
                                      'level': 'INFO',
                                      'stream': 'ext://sys.stdout'},
                          'console_debug': {'class': 'logging.StreamHandler',
                                            'filters': ['debug_filter'],
                                            'formatter': 'debug',
                                            'level': 'DEBUG',
                                            'stream': 'ext://sys.stdout'}},
             'loggers': {'palaestrai.agent': {'level': 'ERROR'},
                         'palaestrai.agent.agent_conductor': {'level': 'ERROR'},
                         'palaestrai.agent.brain': {'level': 'ERROR'},
                         'palaestrai.agent.muscle': {'level': 'ERROR'},
                         'palaestrai.core': {'level': 'ERROR'},
                         'palaestrai.environment': {'level': 'ERROR'},
                         'palaestrai.experiment': {'level': 'ERROR'},
                         'palaestrai.simulation': {'level': 'ERROR'},
                         'palaestrai.store': {'level': 'ERROR'},
                         'palaestrai.types': {'level': 'ERROR'},
                         'palaestrai.util': {'level': 'ERROR'},
                         'palaestrai.visualization': {'level': 'ERROR'},
                         'sqlalchemy.engine': {'level': 'ERROR'}},
             'root': {'handlers': ['console', 'console_debug'],
                      'level': 'ERROR'},
             'version': 1},
 'major_domo_client_retries': 3,
 'major_domo_client_timeout': 300000,
 'profile': False,
 'public_bind': False,
 'store_buffer_size': 20,
 'store_uri': 'sqlite:////tmp/tmpq6cysob2/palaestrai.db',
 'time_series_store_uri': 'influx+localhost:8086'}

The nice thing about the RuntimeConfig is that it is a singleton available everywhere in the framework. So whatever we set here pertains throughout the run.

2.5. Database Initialization#

Since we’ve opted to start fresh with a new SQLite database in a temporary directory, we will have to create and initialize it. Usually, one does this once (e.g., from the CLI with palaestrai database-create) and is then done with it, but in this case we do it every time we run the notebook—it is a one-shot tutorial, after all. :-)

Luckily, palaestrAI has just the function we need to do it for us:

[11]:
palaestrai.store.database_util.setup_database(runtime_config.store_uri)
Could not create extension timescaledb and create hypertables: (sqlite3.OperationalError) near "EXTENSION": syntax error
[SQL: CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;]
(Background on this error at: https://sqlalche.me/e/14/e3q8). Your database setup might lead to noticeable slowdowns with larger experiment runs. Please upgrade to PostgreSQL with TimescaleDB for the best performance.

You will see a warning regarding the TimescaleDB extension. That is okay and just a warning. Since we’re not running a big, sophisticated experiment, we can live with a bit of a performance penalty.

2.6. Experiment Run Execution#

Next up: Actually executing the experiment run! It just consists of one line: A call to palaestrai.execute(). This method can cope with three types of parameters:

  1. An ExperimentRun object. Nice in cases where one has already loaded it (e.g., de-serialized it).

  2. A str. palaestrai.execute() interprets this as a path to a file—one of the most common use cases.

  3. A TextIO object: Any stream that delivers text. Useful when the experiment run document is not yet deserialized, and exactly what we need.

To turn a str into a TextIO, we simply wrap it into a StringIO object. Make it so!

[12]:
rc = palaestrai.execute(io.StringIO(experiment_run_document))

The execution should yield no errors.

[13]:
assert rc[1].name == "EXITED"

2.7. Querying the Store#

Let’s get a custom session to the database first:

[14]:
dbh = palaestrai.store.Session()

palaestrAI offers a lightweight convenience API that wraps the most common database queries. I.e., instead of having to craft queries by hand, it is possible to resort to this API. The API returns pandas DataFrames, which are a convenient format for working with the data.

The query API is available from the module palaestrai.store.query.

[15]:
import palaestrai.store.query as palq

Let’s first retrieve all experiments that we’ve run and look for the most recently executed one (which is ours):

[16]:
all_experiments = palq.experiments_and_runs_configurations(dbh)
all_experiments
[16]:
experiment_id experiment_name experiment_document experiment_run_id experiment_run_uid experiment_run_document experiment_run_instance_id experiment_run_instance_uid experiment_run_phase_id experiment_run_phase_uid experiment_run_phase_mode
0 1 Dummy Experiment record for ExperimentRun Tuto... None 1 Tutorial Experiment Run !ExperimentRun\n_canonical_config: null\n_inst... 1 30c80b8e-bc4c-4449-8b22-ff07bc5abb1a 1 Training Phase train
1 1 Dummy Experiment record for ExperimentRun Tuto... None 1 Tutorial Experiment Run !ExperimentRun\n_canonical_config: null\n_inst... 1 30c80b8e-bc4c-4449-8b22-ff07bc5abb1a 2 Test Phase test

This table gives us a very good idea of what was executed, because it lists (among other things) the experiment run name and the phases therein. We have two phases (training and test) with several episodes each, so there should be two entries.

[17]:
assert len(all_experiments) == 2

Usually, the participating agents and their configurations are also interesting. The query API has a function for this, too. Of course, we’re interested only in the agents that participated in one particular experiment or experiment run. The query API has a convenient way to do this: We can pass a dataframe that contains the information interesting to us, and palaestrAI will use that to construct the query. The keyword argument is called like_dataframe:

[18]:
agents = palq.agents_configurations(
    dbh,
    like_dataframe=all_experiments[all_experiments.experiment_run_uid == experiment_run_name]
)
agents
[18]:
agent_uid agent_name agent_configuration experiment_run_phase_id experiment_run_phase_uid experiment_run_phase_configuration experiment_run_instance_uid experiment_run_id experiment_run_uid experiment_id experiment_name
agent_id
2 ttt_agent ttt_agent {'name': 'ttt_agent', 'brain': {'name': 'harl:... 2 Test Phase {'mode': 'test', 'worker': 1, 'episodes': 16} 30c80b8e-bc4c-4449-8b22-ff07bc5abb1a 1 Tutorial Experiment Run 1 Dummy Experiment record for ExperimentRun Tuto...
1 ttt_agent ttt_agent {'name': 'ttt_agent', 'brain': {'name': 'harl:... 1 Training Phase {'mode': 'train', 'worker': 1, 'episodes': 64} 30c80b8e-bc4c-4449-8b22-ff07bc5abb1a 1 Tutorial Experiment Run 1 Dummy Experiment record for ExperimentRun Tuto...
[19]:
agents[agents.experiment_run_phase_uid == "Training Phase"].index[0]
[19]:
1
[20]:
assert len(agents) == 2

Great, here’s our agent. In order to know how the agent fared, we need to parse the reward. As of this writing, it does not get extracted automatically, so we need to formulate the query ourselves. This is a bit of a hassle, sadly, but not that hard, because we can use SQLAlchemy’s facilities. Thankfully, SQLAlchemy’s index operator translates directly into a JSON path query if we want it to.

[21]:
import sqlalchemy as sa
[22]:
query = sa.select(
    paldb.MuscleAction,
    paldb.MuscleAction.rewards[(0, "py/state", "value", 0)].label("reward")
).where(paldb.MuscleAction.agent_id.in_(agents.index))
actions = pd.read_sql_query(query, dbh.bind)
actions
[22]:
id walltime agent_id simtimes sensor_readings actuator_setpoints rewards objective statistics reward
0 1 2024-03-28 15:09:39.141412 1 {'py/object': 'collections.defaultdict', 'myen... None None None 0.0 None NaN
1 2 2024-03-28 15:09:39.547742 1 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... 1.0 {} 1.0
2 3 2024-03-28 15:09:39.561917 1 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0
3 4 2024-03-28 15:09:39.579567 1 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0
4 5 2024-03-28 15:09:39.591170 1 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0
... ... ... ... ... ... ... ... ... ... ...
163 164 2024-03-28 15:13:07.834700 2 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0
164 165 2024-03-28 15:13:07.854811 2 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... None -100.0 {} NaN
165 166 2024-03-28 15:13:07.963022 2 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0
166 167 2024-03-28 15:13:12.161790 2 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... None -100.0 {} NaN
167 168 2024-03-28 15:13:12.180373 2 {'py/object': 'collections.defaultdict', 'myen... [{'py/object': 'palaestrai.agent.sensor_inform... [{'py/object': 'palaestrai.agent.actuator_info... [{'py/object': 'palaestrai.agent.reward_inform... -100.0 {} -100.0

168 rows × 10 columns
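The path tuple (0, "py/state", "value", 0) in the query above mirrors plain Python indexing into the deserialized rewards structure. A sketch with a hypothetical reward record shaped like the jsonpickle output shown in the table (the exact layout of real records may differ):

```python
# Hypothetical reward record as jsonpickle might serialize it: a list of
# reward information objects, each carrying its numeric value under
# ["py/state"]["value"][0]. The SQLAlchemy index operator
# rewards[(0, "py/state", "value", 0)] walks the same path inside the DB.
rewards = [
    {
        "py/object": "palaestrai.agent.reward_information.RewardInformation",
        "py/state": {"value": [-100.0], "uid": "reward"},
    }
]

reward = rewards[0]["py/state"]["value"][0]
print(reward)  # -100.0
```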

And now we can plot it. Since, for tutorial reasons, the run is rather short, we don’t expect too much of the agent now. But never mind, let’s see some curves! :-)

[23]:
pd.set_option("plotting.backend", "plotly")
pd.concat(
    [
        actions[
            actions.agent_id == agents[agents.experiment_run_phase_uid == "Training Phase"].index[0]
        ].rename({"reward": "Training Reward"}, axis=1),
        actions[
            actions.agent_id == agents[agents.experiment_run_phase_uid == "Test Phase"].index[0]
        ].rename({"reward": "Test Reward"}, axis=1)
    ],
)[["Training Reward", "Test Reward"]].plot()

2.8. Conclusion#

This concludes our learning agents tutorial. We hope you enjoyed the whole run. If you encountered any errors, head over to the palaestrAI issue tracker on GitLab and let us know!