Architecture

System Overview

The sapporo-service is a FastAPI application that accepts WES API requests, prepares a run directory for each workflow execution, and delegates the actual workflow engine invocation to a shell script (run.sh). Each workflow engine runs inside its own Docker container, spawned as a sibling container via the host's Docker socket. See Installation - Volume Mounts for details on the DinD volume mount requirements.

         +-----------+
         |  Client   |
         +-----------+
               |
          HTTP request
               |
               v
         +-----------+
         |  FastAPI  |
         +-----------+
               |
         Python call
               |
               v
         +-----------+
         |  run.py   |
         +-----------+
               |
          subprocess
               |
               v
         +-----------+
         |  run.sh   |
         +-----------+
               |
          docker run
               |
               v
         +-----------+
         |  Engine   |
         | Container |
         +-----------+

The Python side (run.py) never calls a workflow engine directly. It prepares the run directory, writes all input files, then forks run.sh as a subprocess. All run data is persisted to the filesystem, with a SQLite index for fast listing.
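The hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the actual run.py: the function names `prepare_run_dir` and `launch_run_sh`, the choice of `bash` as the interpreter, and the convention of passing the run directory as the first argument to run.sh are all assumptions made for the example.

```python
import json
import subprocess
from pathlib import Path


def prepare_run_dir(run_dir: Path, run_request: dict) -> None:
    """Write the input files that run.sh is expected to find (hypothetical helper)."""
    exe_dir = run_dir / "exe"
    exe_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "outputs").mkdir(exist_ok=True)
    (run_dir / "run_request.json").write_text(json.dumps(run_request, indent=2))
    # Assumption: workflow parameters arrive as a JSON-compatible value in the request.
    params = run_request.get("workflow_params", {})
    (exe_dir / "workflow_params.json").write_text(json.dumps(params, indent=2))


def launch_run_sh(run_sh: Path, run_dir: Path) -> subprocess.Popen:
    """Start run.sh as a child process and record its PID in run.pid."""
    # Assumption: run.sh takes the run directory as its only argument.
    proc = subprocess.Popen(["bash", str(run_sh), str(run_dir)], cwd=run_dir)
    (run_dir / "run.pid").write_text(str(proc.pid))
    return proc
```

The point of the sketch is the ordering: every input is on disk before the subprocess starts, so the shell script only needs the run directory path to find its work.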

run.sh: Workflow Engine Abstraction

The run.sh script is the single interface between the sapporo-service (Python) and all workflow engines. It dispatches to a run_<engine>() function based on the workflow_engine field in the run request.

Each engine function constructs a docker run command that:

Mounts the Docker socket (-v /var/run/docker.sock:...)
Mounts the run directory (-v ${run_dir}:${run_dir})
Sets the working directory to the execution directory (-w=${exe_dir})
Runs the engine-specific command with the workflow URL and parameters
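As an illustration of the four elements listed above, the following Python sketch assembles an equivalent argument list. The real command is built in shell inside run.sh; `build_docker_cmd`, its parameters, and the image name are hypothetical. The container-side path of the Docker socket mount is elided in the text above, so the sketch takes the whole mount specification as an argument rather than guessing it.

```python
import shlex


def build_docker_cmd(
    socket_mount: str,       # full "-v" value for the Docker socket, supplied by the caller
    run_dir: str,
    exe_dir: str,
    image: str,              # hypothetical engine image name
    engine_args: list[str],  # engine-specific command, workflow URL, parameters
) -> list[str]:
    """Assemble a docker run argument list with the mounts and working directory."""
    return [
        "docker", "run",
        "-v", socket_mount,
        "-v", f"{run_dir}:{run_dir}",  # same path inside and outside the container
        f"-w={exe_dir}",
        image,
        *engine_args,
    ]


def render_cmd(cmd: list[str]) -> str:
    """Render the argument list as one shell-quoted line, e.g. for logging."""
    return shlex.join(cmd)
```

Mounting the run directory at an identical path on both sides matters with sibling containers: any path the engine passes on to the host's Docker daemon must resolve on the host as well.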

Override the default run.sh location using --run-sh or SAPPORO_RUN_SH. See Configuration - Custom run.sh for details.

Engine Dispatch

run.sh uses a naming convention to dispatch to the correct engine function. For a request with "workflow_engine": "cwltool", it calls run_cwltool(). If no matching function exists, the run fails with EXECUTOR_ERROR.
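The naming convention can be mimicked in Python to show the idea. This is an analogy only: the real dispatch happens in shell, and `resolve_engine`, `run_workflow`, and the function table are hypothetical names.

```python
from typing import Callable, Mapping, Optional

EngineFunc = Callable[[], int]  # returns the engine's exit code


def resolve_engine(engine: str, functions: Mapping[str, EngineFunc]) -> Optional[EngineFunc]:
    """Look up run_<engine> by name; None means no such engine function exists."""
    return functions.get(f"run_{engine}")


def run_workflow(run_request: dict, functions: Mapping[str, EngineFunc]) -> str:
    """Return the resulting run state for a request (simplified)."""
    func = resolve_engine(run_request["workflow_engine"], functions)
    if func is None:
        return "EXECUTOR_ERROR"  # no matching run_<engine>() function
    return "COMPLETE" if func() == 0 else "EXECUTOR_ERROR"
```

With a table such as `{"run_cwltool": some_function}`, a request carrying `"workflow_engine": "cwltool"` resolves to `some_function`, while an unknown engine name yields `EXECUTOR_ERROR` without running anything.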

Cancellation

run.sh runs the workflow function as a background process and waits for it. Cancellation is handled via Unix signals: the Python side sends SIGUSR1 to the run.sh process, which triggers the cancel() function. This allows engine-specific cleanup before writing the CANCELED state.
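On the Python side, sending the cancellation signal might look like the sketch below. It assumes the PID is read from the run.pid file in the run directory; the function name `request_cancel` and its return convention are hypothetical.

```python
import os
import signal
from pathlib import Path


def request_cancel(run_dir: Path) -> bool:
    """Send SIGUSR1 to the run.sh process recorded in run.pid.

    Returns True if the signal was sent, False if there is no PID file
    or the process no longer exists.
    """
    pid_file = run_dir / "run.pid"
    if not pid_file.exists():
        return False
    pid = int(pid_file.read_text().strip())
    try:
        os.kill(pid, signal.SIGUSR1)
    except ProcessLookupError:
        return False  # process already gone; nothing to cancel
    return True
```

Using a dedicated signal rather than SIGTERM keeps user-requested cancellation distinguishable from the process being killed by the system, which maps to a different final state.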

Error Handling

run.sh uses trap to handle errors and signals:

ERR -> SYSTEM_ERROR (unexpected failure)
SIGHUP/SIGINT/SIGQUIT/SIGTERM -> SYSTEM_ERROR (killed by system)
Unknown signals -> SYSTEM_ERROR with exit code 1 (catch-all)
USR1 -> CANCELED (user-requested cancellation)
Non-zero exit from a run_<engine>() function -> EXECUTOR_ERROR

RO-Crate metadata is generated on the COMPLETE and EXECUTOR_ERROR paths. For EXECUTOR_ERROR, the RO-Crate uses FailedActionStatus and may have no output files. If RO-Crate generation itself fails, run.sh writes {"@error": "RO-Crate generation failed. Check stderr.log for details."} to ro-crate-metadata.json and appends the Python error traceback to stderr.log. The SYSTEM_ERROR and CANCELED paths skip RO-Crate generation because their preconditions (e.g., run request, start time) may not be satisfied.
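The RO-Crate rules in the paragraph above can be condensed into a short sketch. It is written in Python for illustration, although the described fallback is performed by run.sh; `write_ro_crate` and the `generate` callback are hypothetical stand-ins for the real generator.

```python
import json
import traceback
from pathlib import Path
from typing import Callable

# States for which RO-Crate metadata is generated, per the text above.
RO_CRATE_STATES = {"COMPLETE", "EXECUTOR_ERROR"}
ERROR_DOC = {"@error": "RO-Crate generation failed. Check stderr.log for details."}


def write_ro_crate(run_dir: Path, state: str, generate: Callable[[Path], dict]) -> bool:
    """Write ro-crate-metadata.json; return False when the state skips generation."""
    if state not in RO_CRATE_STATES:
        return False  # SYSTEM_ERROR and CANCELED paths skip generation
    target = run_dir / "ro-crate-metadata.json"
    try:
        target.write_text(json.dumps(generate(run_dir), indent=2))
    except Exception:
        # Fallback: leave a marker document and append the traceback to stderr.log.
        target.write_text(json.dumps(ERROR_DOC))
        with (run_dir / "stderr.log").open("a") as log:
            log.write(traceback.format_exc())
    return True
```

A failure in metadata generation therefore never changes the run's state; it only replaces the metadata file with the error marker.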

Adding a New Engine

To add a new workflow engine, define a run_<engine>() function in run.sh that constructs the appropriate docker run command. The function must:

Build a docker run command that mounts the Docker socket and run directory
Write the command to ${cmd} for logging
Execute the command, redirecting stdout/stderr to ${stdout} and ${stderr}
Call executor_error $? on non-zero exit

For a complete example, see the StreamFlow addition PR.

Run Directory

Each workflow execution is stored on the filesystem at {run_dir}/{run_id[:2]}/{run_id}/. The run directory is the single source of truth for all run data.
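The path layout is easy to express in code. In this small sketch, `resolve_run_dir` is a hypothetical helper name and the run ID is the one used in the directory listing in this section.

```python
from pathlib import Path


def resolve_run_dir(base: Path, run_id: str) -> Path:
    """Map a run ID to {run_dir}/{run_id[:2]}/{run_id}."""
    return base / run_id[:2] / run_id


path = resolve_run_dir(Path("runs"), "29109b85-7935-4e13-8773-9def402c7775")
print(path.as_posix())  # runs/29/29109b85-7935-4e13-8773-9def402c7775
```

Sharding by the first two characters of the ID keeps any single directory from accumulating every run.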

Directory Structure

runs/
├── 29/
│   └── 29109b85-7935-4e13-8773-9def402c7775/
│       ├── cmd.txt
│       ├── end_time.txt
│       ├── exe/
│       │   └── workflow_params.json
│       ├── exit_code.txt
│       ├── outputs/
│       │   └── <output_file>
│       ├── outputs.json
│       ├── run.pid
│       ├── run_request.json
│       ├── runtime_info.json
│       ├── start_time.txt
│       ├── state.txt
│       ├── stderr.log
│       ├── stdout.log
│       ├── system_logs.json
│       └── workflow_engine_params.txt
├── 2d/
│   └── ...
└── sapporo.db

File Descriptions

| File | Description |
| --- | --- |
| `run_request.json` | Original run request (workflow URL, engine, parameters) |
| `state.txt` | Current run state (e.g., `RUNNING`, `COMPLETE`) |
| `exe/` | Execution directory (working directory for the workflow engine) |
| `exe/workflow_params.json` | Workflow parameters |
| `outputs/` | Output files produced by the workflow |
| `outputs.json` | JSON listing of output files |
| `cmd.txt` | The `docker run` command that was executed |
| `stdout.log` / `stderr.log` | Workflow engine stdout/stderr |
| `start_time.txt` / `end_time.txt` | ISO 8601 timestamps |
| `exit_code.txt` | Process exit code |
| `run.pid` | PID of the `run.sh` subprocess |
| `runtime_info.json` | Runtime metadata |
| `system_logs.json` | System-level logs |
| `workflow_engine_params.txt` | Engine-specific parameters |

Reconciliation

Reconciliation detects runs left in RUNNING or QUEUED after a process restart and marks them as SYSTEM_ERROR.

reconcile_runs() runs at startup (before init_db()) and periodically in the background (at the snapshot interval, default: 30 minutes). For each run in a non-terminal state, it reads run.pid and checks process liveness via os.kill(pid, 0):

| PID file | Process alive | Action |
| --- | --- | --- |
| Present | Yes | Skip (running normally) |
| Present | No | Set `SYSTEM_ERROR` (reason: "process vanished") |
| Absent | N/A | Set `SYSTEM_ERROR` (reason: "no pid file") |

Runs in a terminal state (COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, CANCELED, DELETED) or in UNKNOWN are skipped. For each reconciled run, state.txt is set to SYSTEM_ERROR, the current timestamp is written to end_time.txt, and the reason is logged to system_logs.json.
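The decision table and the liveness check can be sketched per run as follows. This is an approximation of what reconcile_runs() is described as doing, not its actual code: `is_alive` and `reconcile_run` are hypothetical names, and the write to system_logs.json is left out because its format is not described here.

```python
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

SKIPPED_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED", "DELETED", "UNKNOWN"}


def is_alive(pid: int) -> bool:
    """Probe a process with signal 0, which checks existence without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def reconcile_run(run_dir: Path) -> Optional[str]:
    """Mark a stuck run as SYSTEM_ERROR; return the reason, or None if skipped."""
    state_file = run_dir / "state.txt"
    if state_file.read_text().strip() in SKIPPED_STATES:
        return None
    pid_file = run_dir / "run.pid"
    if not pid_file.exists():
        reason = "no pid file"
    elif is_alive(int(pid_file.read_text().strip())):
        return None  # running normally
    else:
        reason = "process vanished"
    state_file.write_text("SYSTEM_ERROR")
    (run_dir / "end_time.txt").write_text(datetime.now(timezone.utc).isoformat())
    return reason
```

Note that a PID probe cannot tell a live run.sh from an unrelated process that has reused the same PID, so a check like this errs on the side of leaving a run untouched.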

SQLite Index

The SQLite database (sapporo.db) is an index, not a data store. It is rebuilt at a configurable interval (default: 30 minutes) by a background asyncio task that scans the run directories, and can be deleted at any time without data loss. It exists solely to make GET /runs (list all runs) fast. Individual run queries (GET /runs/{run_id}) always read from the filesystem.
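To illustrate why the index is disposable, here is a minimal sketch of a rebuild that derives every row from the run directories. The table name `runs`, its two columns, and the function name `rebuild_index` are assumptions for the example; the real schema is not described in this section.

```python
import sqlite3
from pathlib import Path


def rebuild_index(runs_dir: Path, db_path: Path) -> int:
    """Recreate the index from the filesystem; return the number of runs indexed."""
    rows = []
    # Layout: {runs_dir}/{run_id[:2]}/{run_id}/state.txt
    for state_file in sorted(runs_dir.glob("*/*/state.txt")):
        rows.append((state_file.parent.name, state_file.read_text().strip()))

    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS runs")
            conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, state TEXT)")
            conn.executemany("INSERT INTO runs VALUES (?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Because every row comes from files under the run directories, deleting the database and calling the function again yields the same contents.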

RO-Crate

After each run completes (or fails at the executor level), the service generates RO-Crate metadata (ro-crate-metadata.json). See RO-Crate for the full specification including conformance profiles, entity graph, custom properties, bioinformatics extensions, and validation.