Running palaestrAI as a Service

About

Usually, experiments are executed by issuing palaestrai experiment-start on the machine that runs palaestrAI. palaestrai serve adds a second way to drive palaestrAI: it launches a long-running HTTP service that exposes a REST API. External tools (and, later, a web frontend) can then create and launch experiment runs, poll their status, and retrieve their logs over HTTP, without having to run palaestrAI locally or copy experiment files around.

The service stays online until it receives SIGINT or SIGTERM (e.g., Ctrl+C); there is no separate daemon management.

Quickstart

Start the server (here on the default port 4247, listening on all interfaces):

palaestrai serve

The following curl session walks through a full lifecycle. It assumes an experiment run document named my_run.yml, in the same format that palaestrai experiment-start accepts.

Create the experiment run in the database (this does not run it). The identifier is read from the document’s uid field:

curl -X PUT http://localhost:4247/experiment_runs \
    -H "Content-Type: application/x-yaml" \
    --data-binary @my_run.yml

Launch a new instance of that run. The request body is ignored; the response contains the new instance UID:

curl -X PUT http://localhost:4247/experiment_runs/<run-uid>/instances

{"instance_uid": "<instance-uid>", "experiment_run_uid": "<run-uid>"}

Poll the instance status. The HTTP status code mirrors the lifecycle state (202 scheduled, 200 running/finished, 500 error, 404 unknown):

curl -i http://localhost:4247/experiment_run_instances/<instance-uid>

{"uid": "<instance-uid>", "status": "RUNNING"}

Retrieve the persisted log entries for that instance (see Logging below for the available filters):

curl "http://localhost:4247/experiment_run_instances/<instance-uid>/logs?level=INFO"

Shut the service down by sending SIGINT/SIGTERM to the palaestrai serve process (e.g. Ctrl+C). The parent asks the executor child to shut down gracefully, waits, and then exits.
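The same lifecycle can be driven from a script. The following is a minimal sketch using only the Python standard library; it builds the requests shown in the curl session but does not send them. The helper names (create_run_request, launch_instance_request, status_request) are illustrative and not part of palaestrAI.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:4247"  # default port used in the quickstart


def create_run_request(run_yaml: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the PUT request that stores an experiment run document."""
    return urllib.request.Request(
        base_url + "/experiment_runs",
        data=run_yaml,
        headers={"Content-Type": "application/x-yaml"},
        method="PUT",
    )


def launch_instance_request(run_uid: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the PUT request that schedules a new instance (no body needed)."""
    path = "/experiment_runs/" + quote(run_uid, safe="") + "/instances"
    return urllib.request.Request(base_url + path, method="PUT")


def status_request(instance_uid: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET request that polls an instance's status."""
    path = "/experiment_run_instances/" + quote(instance_uid, safe="")
    return urllib.request.Request(base_url + path, method="GET")
```

Each request can be sent with urllib.request.urlopen(request). Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so a status poll that hits 404 or 500 has to catch that exception and read the body from it.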

Parameters

CLI options

palaestrai serve accepts the following options:

-l, --listen

IP address to bind to. May be given multiple times to request several addresses. Default: all available interfaces (0.0.0.0). Because uvicorn binds a single host, requesting more than one explicit address falls back to binding all interfaces.

-p, --port

TCP port to listen on. Default: 4247.

Example:

palaestrai serve --listen 127.0.0.1 --port 8080
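The fallback rule for --listen can be summarized in a few lines. This is a sketch of the documented behavior only; resolve_bind_host is a hypothetical helper, not the function palaestrAI uses internally.

```python
from typing import Sequence

ALL_INTERFACES = "0.0.0.0"


def resolve_bind_host(listen: Sequence[str]) -> str:
    """Pick the single host to bind, following the documented rule.

    Exactly one explicit address is used as given. No address (the default)
    or more than one address binds all interfaces, because only a single
    host can be bound.
    """
    if len(listen) == 1:
        return listen[0]
    return ALL_INTERFACES
```

For example, resolve_bind_host(["127.0.0.1"]) keeps the service local, while passing two addresses has the same effect as passing none.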

Runtime-config keys

Beyond the CLI options, palaestrai serve reads the usual runtime configuration. The keys most relevant to the service are:

store_uri

SQLAlchemy-style URI of the results store database that every route reads from and writes to. The service ensures the schema exists on startup. Default: sqlite:///palaestrai.db.

log_store_uri

SQLAlchemy-style URI of the separate SQLite database that holds the log entries served by GET /experiment_run_instances/{uid}/logs (see Logging). It is kept separate from store_uri so logs do not bloat the results store. The executor child writes it and the API parent reads it, so both sides must agree on this value (which they do, as the parent snapshots its runtime config for the child). Default: sqlite:///palaestrai-log.db.

broker_uri / executor_bus_port / public_bind

Control the in-palaestrAI ZeroMQ message broker the executor child uses internally. They behave exactly as for palaestrai experiment-start; see Runtime Configuration. Defaults: broker_uri derived from the other two, executor_bus_port: 4242, public_bind: False.

fork_method

Multiprocessing start method. The supervisor pins the executor subtree to this method so its synchronization primitives stay consistent. Default: spawn.

Note

The HTTP listen address/port are not runtime-config keys; they are set only via the --listen/--port CLI options above.

Logging

Under palaestrai serve the separate log store is enabled automatically: the executor child attaches a SQLiteLogHandler to the root logger, writing to log_store_uri. A few properties follow from how the handler works:

  • The log store is optional and is not used outside palaestrai serve; serve switches it on automatically, without any extra configuration.

  • Only records that are associated with an experiment run instance are stored; every record is keyed by its experiment_run_instance_uid. Records not tied to an instance (executor/broker/pre-run records) are dropped from the log store and only reach stdout.

  • DEBUG records are dropped by default (the handler stores INFO and above). Lower the relevant logger levels in the runtime config if you need finer-grained logs.
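The filtering rules above can be illustrated with a small stand-in handler. This is not palaestrAI's SQLiteLogHandler: the class name, the table layout, and the assumption that the instance UID arrives as a record attribute named experiment_run_instance_uid (set here via the logging "extra" mechanism) are all illustrative.

```python
import logging
import sqlite3


class InstanceLogHandler(logging.Handler):
    """Illustrative stand-in that mimics the documented filtering rules."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        # Handler level INFO: DEBUG records never reach emit().
        super().__init__(level=logging.INFO)
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS log_entries ("
            "experiment_run_instance_uid TEXT, created_at REAL, "
            "level TEXT, levelno INTEGER, logger TEXT, message TEXT)"
        )

    def emit(self, record: logging.LogRecord) -> None:
        # Assumption: the UID is attached to the record as an attribute.
        uid = getattr(record, "experiment_run_instance_uid", None)
        if uid is None:
            return  # not tied to an instance: dropped from the log store
        self._db.execute(
            "INSERT INTO log_entries VALUES (?, ?, ?, ?, ?, ?)",
            (uid, record.created, record.levelname, record.levelno,
             record.name, record.getMessage()),
        )
        self._db.commit()
```

Attached to a logger, this handler stores a WARNING record that carries an instance UID, but ignores both a DEBUG record with a UID and an INFO record without one.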

REST API

All routes negotiate content: a response is returned as YAML when the client sends an Accept header asking for YAML (application/x-yaml, application/yaml, text/yaml, text/x-yaml) and as JSON otherwise. The PUT routes default to parsing the request body as YAML; because JSON is a subset of YAML, a JSON body is accepted as well. GET /experiments/{name} always returns the stored YAML document.
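The negotiation rule amounts to a membership test on the Accept header. The sketch below shows one plausible way to express it; wants_yaml is a hypothetical helper, and it deliberately ignores quality values (q=...), which the service may or may not honor.

```python
from typing import Optional

# Media types listed above as selecting a YAML response.
YAML_MEDIA_TYPES = frozenset(
    {"application/x-yaml", "application/yaml", "text/yaml", "text/x-yaml"}
)


def wants_yaml(accept_header: Optional[str]) -> bool:
    """Return True if any media type in the Accept header asks for YAML."""
    if not accept_header:
        return False
    for item in accept_header.split(","):
        # Strip parameters such as ";q=0.9" and normalize case.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type in YAML_MEDIA_TYPES:
            return True
    return False
```

Under this rule a wildcard such as */* still yields JSON, since JSON is the default whenever YAML is not named explicitly.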

Error responses follow a thin error→HTTP translation: not-found lookups become 404, integrity/uniqueness violations become 409, schema/syntax validation failures become 422, immutability violations become 405 (with an explanatory message), and any other unhandled error becomes 500.
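As a sketch, such a translation layer is little more than a lookup table. The exception classes below are placeholders invented for this example; the exceptions palaestrAI actually raises are not named in this document.

```python
# Hypothetical error categories, one per row of the translation above.
class NotFoundError(Exception):
    pass


class ConflictError(Exception):
    pass


class ValidationError(Exception):
    pass


class ImmutableError(Exception):
    pass


_STATUS_BY_ERROR = (
    (NotFoundError, 404),    # not-found lookups
    (ConflictError, 409),    # integrity/uniqueness violations
    (ValidationError, 422),  # schema/syntax validation failures
    (ImmutableError, 405),   # immutability violations
)


def http_status_for(error: Exception) -> int:
    """Translate an error into an HTTP status code; anything else is 500."""
    for error_type, status in _STATUS_BY_ERROR:
        if isinstance(error, error_type):
            return status
    return 500
```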

Experiments

GET /experiments

List all experiments, each including its experiment runs under the experiment_runs key. Optional query parameter name filters by name (SQL LIKE syntax). Returns 200 with a JSON (or YAML) list.

GET /experiments/{name}

Return the stored experiment document (and nothing else) as YAML. 200 on success, 404 if no such experiment exists.

PUT /experiments

Store a new experiment from its (arsenAI) document body. The experiment name is taken from the document’s top-level uid field; a missing uid is a 405. Returns 201 with {"name": ..., "id": ...}. A duplicate name surfaces as 409.

DELETE /experiments/{name}

Cascading delete of an experiment and everything below it (runs, instances, phases, …). 200 with {"deleted": name}; 404 if unknown.

POST /experiments/{name}

Experiments are immutable, so this always returns 405 with the message “Experiments are immutable. …”.

Experiment runs

GET /experiment_runs

List all experiment runs; each entry adds an experiment key with the parent experiment’s name. Optional query parameter uid filters by run UID (SQL LIKE syntax). Returns 200.

GET /experiment_runs/{uid}

Return full data on an experiment run, travelling down the hierarchy to its instances and their phases (environments and agents), but not down to muscle actions, environment/world states, or brain dumps. 200 on success, 404 if unknown.

PUT /experiment_runs

Create a new experiment run in the database from its YAML body (does not run it). The run UID and the parent experiment association (experiment_uid) are read from the document; the body is schema-validated. Returns 201 with {"uid": ..., "experiment_uid": ...}. A schema/syntax failure is 422.

POST /experiment_runs/{uid}

Update an experiment run, but only if it has never been executed (no instance exists yet). If it has been executed at least once, this returns 405; create a new run instead. 200 with {"uid": ...} on success, 404 if unknown.

DELETE /experiment_runs/{uid}

Cascading delete of the run and all data associated with it. 200 with {"deleted": uid}; 404 if unknown.

PUT /experiment_runs/{uid}/instances

Schedule a new instance of the experiment run. The request body is ignored. The run is reconstructed (its instance UID is generated at construction) and handed to the executor child for scheduling; the instance row itself is created asynchronously by the store under the same UID. Returns 202 with {"instance_uid": ..., "experiment_run_uid": uid} immediately, 404 if the run is unknown.

Experiment run instances

GET /experiment_run_instances/{uid}

Report the status of an instance. The body is {"uid": ..., "status": ...} and the HTTP status code mirrors the lifecycle state:

Status      HTTP code
SCHEDULED   202
RUNNING     200
FINISHED    200
ERROR       500
UNKNOWN     404

UNKNOWN is reported (with 404) when no instance row exists for the UID.
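Because RUNNING and FINISHED both map to 200, a client cannot tell them apart from the status code alone and has to read the status field in the body. The following polling sketch reflects that. The function wait_for_instance and its parameters are illustrative; fetch_status stands for any callable that performs the GET request and returns the status string. Treating UNKNOWN as "keep waiting" is a design choice made here, on the grounds that the instance row is created asynchronously after scheduling.

```python
import time
from typing import Callable

# Status-to-HTTP-code mapping as documented above.
HTTP_CODE_BY_STATUS = {
    "SCHEDULED": 202,
    "RUNNING": 200,
    "FINISHED": 200,
    "ERROR": 500,
    "UNKNOWN": 404,
}
TERMINAL_STATUSES = frozenset({"FINISHED", "ERROR"})


def wait_for_instance(
    fetch_status: Callable[[], str],
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the instance reaches FINISHED or ERROR, or polls run out.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = "UNKNOWN"
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
    return status
```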

GET /experiment_run_instances/{uid}/logs

Return the persisted log entries for an instance, read from the separate log store (log_store_uri). Existence of the instance is checked against the results store: an unknown instance is a 404 (body {"uid": ..., "status": "UNKNOWN"}); a known instance with no logs yields a 200 with an empty list. The response body is:

{"instance_uid": ..., "count": <n>, "logs": [ ... ]}

where each log entry is an object with the keys created_at, level, levelno, logger, and message.

The following query parameters filter and page the result:

level

Minimum log level by name (e.g. INFO, WARNING). Entries with a numeric level at or above this level are returned. An unknown level name is a 422.

since

ISO-8601 timestamp; only entries at or after this time are returned. A value that is not a valid ISO-8601 timestamp is a 422.

Warning

URL-encode the timestamp. A + in a raw query string decodes to a space, so an offset like +00:00 must be encoded (%2B00:00); otherwise the value fails validation and the request is rejected with 422.

logger

Filter by logger name (SQL LIKE syntax).

limit

Maximum number of entries to return. Default 1000; clamped to a maximum of 10000.

offset

Number of entries to skip (for paging). Default 0.

Entries are ordered by created_at ascending, then by insertion order.
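Building the query string with a URL-encoding function avoids the + pitfall described in the warning above. The sketch below uses urllib.parse.urlencode from the Python standard library; logs_url is a hypothetical helper, and the client-side clamp merely mirrors the documented server-side limit.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import quote, urlencode

MAX_LIMIT = 10000  # documented upper bound for "limit"


def logs_url(
    base_url: str,
    instance_uid: str,
    level: Optional[str] = None,
    since: Optional[datetime] = None,
    logger: Optional[str] = None,
    limit: int = 1000,
    offset: int = 0,
) -> str:
    """Build the logs URL with every query value percent-encoded."""
    params = {"limit": min(limit, MAX_LIMIT), "offset": offset}
    if level is not None:
        params["level"] = level
    if since is not None:
        # isoformat() emits "+00:00" for UTC; urlencode turns "+" into %2B.
        params["since"] = since.isoformat()
    if logger is not None:
        params["logger"] = logger  # SQL LIKE pattern; "%" becomes %25
    path = "/experiment_run_instances/" + quote(instance_uid, safe="") + "/logs"
    return base_url + path + "?" + urlencode(params)


url = logs_url(
    "http://localhost:4247",
    "inst-1",
    level="WARNING",
    since=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(url)
```

Paging works by repeating the request with offset increased by limit until a response returns fewer entries than limit.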

Process Model

A single palaestrai serve invocation runs two processes:

  1. API process (parent). Runs uvicorn together with a FastAPI application. It owns the HTTP socket(s) and the user-facing signal handlers. All read routes (GET) and the create/update/delete routes (PUT/POST/DELETE) talk to the results store database directly; these are pure store operations that do not need a running executor. The one exception is instance scheduling (PUT /experiment_runs/{uid}/instances), which is forwarded to the executor child.

  2. Executor process (child). Runs Executor(is_service=True).execute() in its own event loop. In service mode the executor does not shut down when no run is scheduled; it idles instead, waiting for work. It owns the experiment-run-instance status lifecycle (SCHEDULED → RUNNING → FINISHED/ERROR), which it writes authoritatively to the database.

The parent supervises the child over a small command channel (a multiprocessing pipe). Scheduling a new instance (PUT /experiment_runs/{uid}/instances) sends a SCHEDULE command to the child; everything else is served from the database. If the executor child dies unexpectedly, the parent marks any still-RUNNING instance as ERROR (crash safety), cleans up orphaned processes, and respawns the child.
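The command channel can be pictured as a loop on the child side that blocks on a pipe until the parent sends a command. The sketch below is a simplified model, not palaestrAI's implementation: only the SCHEDULE command name comes from this document, while the SHUTDOWN command, the acknowledgement replies, and the function names are assumptions made for illustration.

```python
import multiprocessing


def executor_loop(conn) -> None:
    """Child side: idle on the pipe, handle commands, stop on SHUTDOWN.

    `conn` is one end of a multiprocessing.Pipe(). The reply messages are
    an assumption of this sketch.
    """
    while True:
        command, payload = conn.recv()  # blocks while no work is scheduled
        if command == "SHUTDOWN":
            conn.send(("BYE", None))
            break
        if command == "SCHEDULE":
            # A real executor would start the run here; the sketch only
            # acknowledges the request.
            conn.send(("SCHEDULED", payload))


def schedule(conn, run_uid: str):
    """Parent side: ask the child to schedule a run and wait for the reply."""
    conn.send(("SCHEDULE", run_uid))
    return conn.recv()


def make_channel():
    """Create the (parent_end, child_end) pair of the command channel."""
    return multiprocessing.Pipe()
```

To run the loop in a real child process, executor_loop can be passed as the target of multiprocessing.get_context("spawn").Process together with the child end of the pipe, with the start-up code placed under an if __name__ == "__main__" guard as the spawn start method requires.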