# Architecture

## System Overview
The sapporo-service is a FastAPI application that accepts WES API requests, prepares a run directory for each workflow execution, and delegates the actual workflow engine invocation to a shell script (run.sh). Each workflow engine runs inside its own Docker container, spawned as a sibling container via the host's Docker socket. See Installation - Volume Mounts for details on the DinD volume mount requirements.
```
+--------+     +---------+     +--------+     +---------------------+
| Client | --> | FastAPI | --> | run.py | --> | run.sh (subprocess) |
+--------+     +---------+     +--------+     +---------------------+
                                                         |
                                                    docker run
                                                         |
                                                 +------------+
                                                 |   Engine   |
                                                 | Container  |
                                                 +------------+
```
The Python side (run.py) never calls a workflow engine directly. It prepares the run directory, writes all input files, then forks run.sh as a subprocess. All run data is persisted to the filesystem, with a SQLite index for fast listing.
## `run.sh`: Workflow Engine Abstraction
The run.sh script is the single interface between the sapporo-service (Python) and all workflow engines. It dispatches to a run_<engine>() function based on the workflow_engine field in the run request.
Each engine function constructs a docker run command that:
- Mounts the Docker socket (`-v /var/run/docker.sock:...`)
- Mounts the run directory (`-v ${run_dir}:${run_dir}`)
- Sets the working directory to the execution directory (`-w=${exe_dir}`)
- Runs the engine-specific command with the workflow URL and parameters
Override the default run.sh location using --run-sh or SAPPORO_RUN_SH. See Configuration - Custom run.sh for details.
### Engine Dispatch
run.sh uses a naming convention to dispatch to the correct engine function. For a request with "workflow_engine": "cwltool", it calls run_cwltool(). If no matching function exists, the run fails with EXECUTOR_ERROR.
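The naming-convention dispatch can be sketched with bash's `declare -F`, which tests whether a function of the expected name exists. This is an illustrative sketch, not the actual `run.sh` source; the function bodies and messages are placeholders.

```shell
#!/usr/bin/env bash
# Illustrative sketch of name-based engine dispatch (not the real run.sh).

run_cwltool() {
  # A real engine function would build and execute a docker run command here.
  echo "dispatched to cwltool"
}

dispatch() {
  local engine="$1"
  if declare -F "run_${engine}" > /dev/null; then
    "run_${engine}"                                   # naming-convention call
  else
    echo "EXECUTOR_ERROR: no run_${engine} defined" >&2
    return 1
  fi
}

dispatch cwltool    # -> dispatched to cwltool
```

An unknown engine name simply has no matching `run_<engine>()` function, so `dispatch` fails without any per-engine registration table.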
### Cancellation
run.sh runs the workflow function as a background process and waits for it. Cancellation is handled via Unix signals: the Python side sends SIGUSR1 to the run.sh process, which triggers the cancel() function. This allows engine-specific cleanup before writing the CANCELED state.
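A minimal, self-signaling sketch of this pattern is shown below. The function and file names are illustrative, and the `kill` subshell stands in for the Python side sending `SIGUSR1`; the real `cancel()` would also perform engine-specific cleanup.

```shell
#!/usr/bin/env bash
# Sketch of run.sh-style cancellation via SIGUSR1 (names are illustrative).

state_file=$(mktemp)

cancel() {
  # Engine-specific cleanup would happen here before the terminal state.
  echo "CANCELED" > "${state_file}"
}
trap cancel USR1

run_engine() { sleep 2; }           # stand-in for the real engine invocation

run_engine &                        # run the engine function in the background
engine_pid=$!

# Simulate the Python side, which sends SIGUSR1 to the run.sh process.
( sleep 0.2; kill -s USR1 $$ ) &

wait "${engine_pid}" || true        # wait is interrupted; the USR1 trap fires
kill "${engine_pid}" 2>/dev/null || true
```

Running the engine function in the background is what makes this work: the foreground `wait` is interruptible by trapped signals, so cancellation takes effect without waiting for the engine to finish.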
### Error Handling
`run.sh` uses `trap` to handle errors and signals:

- `ERR` -> `SYSTEM_ERROR` (unexpected failure)
- `SIGHUP`/`SIGINT`/`SIGQUIT`/`SIGTERM` -> `SYSTEM_ERROR` (killed by system)
- Unknown signals -> `SYSTEM_ERROR` with exit code 1 (catch-all)
- `USR1` -> `CANCELED` (user-requested cancellation)
- Non-zero exit from a `run_<engine>()` function -> `EXECUTOR_ERROR`
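This trap table might be wired roughly as follows. The handler names and the `run_dir`/`state.txt` layout are assumptions mirroring the transitions above, not the actual `run.sh` source.

```shell
#!/usr/bin/env bash
# Hypothetical trap wiring for the state transitions above.

run_dir="${run_dir:-$(mktemp -d)}"

write_state() { echo "$1" > "${run_dir}/state.txt"; }

system_error()   { write_state "SYSTEM_ERROR"; exit 1; }
canceled()       { write_state "CANCELED"; exit 0; }
executor_error() { write_state "EXECUTOR_ERROR"; exit "$1"; }  # called on non-zero engine exit

trap system_error ERR HUP INT QUIT TERM   # unexpected failure / killed by system
trap canceled USR1                        # user-requested cancellation
```

Centralizing the state writes in one helper keeps every exit path writing a single, well-defined terminal state.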
RO-Crate metadata is generated on the COMPLETE and EXECUTOR_ERROR paths. For EXECUTOR_ERROR, the RO-Crate uses FailedActionStatus and may have no output files. If RO-Crate generation itself fails, run.sh writes {"@error": "RO-Crate generation failed. Check stderr.log for details."} to ro-crate-metadata.json and appends the Python error traceback to stderr.log. The SYSTEM_ERROR and CANCELED paths skip RO-Crate generation because their preconditions (e.g., run request, start time) may not be satisfied.
### Adding a New Engine
To add a new workflow engine, define a run_<engine>() function in run.sh that constructs the appropriate docker run command. The function must:
- Build a `docker run` command that mounts the Docker socket and run directory
- Write the command to `${cmd}` for logging
- Execute the command, redirecting stdout/stderr to `${stdout}` and `${stderr}`
- Call `executor_error $?` on non-zero exit
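A skeleton following these four steps might look like the sketch below. The engine name, image, and workflow URL are placeholders, and the execution step is mocked with `echo` so the sketch runs without Docker; a real function would execute the built command.

```shell
#!/usr/bin/env bash
# Skeleton for a hypothetical run_myengine(); image, flags, and URL are
# placeholders, and ${run_dir}/${exe_dir}/${cmd}/${stdout}/${stderr} follow
# the conventions described above.

run_dir=$(mktemp -d)
exe_dir="${run_dir}/exe"; mkdir -p "${exe_dir}"
cmd="${run_dir}/cmd.txt"
stdout="${run_dir}/stdout.log"
stderr="${run_dir}/stderr.log"
wf_url="https://example.com/workflow.cwl"

executor_error() { echo "EXECUTOR_ERROR" > "${run_dir}/state.txt"; exit "$1"; }

run_myengine() {
  local cmd_txt="docker run --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ${run_dir}:${run_dir} -w=${exe_dir} \
    ghcr.io/example/myengine:latest run ${wf_url}"
  echo "${cmd_txt}" > "${cmd}"                   # 2. write the command for logging
  # 3. Real code would execute ${cmd_txt} here; this sketch only echoes.
  echo "mock engine output" > "${stdout}" 2> "${stderr}" \
    || executor_error $?                         # 4. report non-zero engine exit
}

run_myengine
```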
For a complete example, see the StreamFlow addition PR.
## Run Directory
Each workflow execution is stored on the filesystem at {run_dir}/{run_id[:2]}/{run_id}/. The run directory is the single source of truth for all run data.
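The two-character shard prefix is simply the first two characters of the run ID, which fans runs out across subdirectories. In shell terms (the `runs` base directory here is illustrative):

```shell
# Sketch of the run-directory layout rule: {run_dir}/{run_id[:2]}/{run_id}/
run_id="29109b85-7935-4e13-8773-9def402c7775"
run_dir="runs/${run_id:0:2}/${run_id}"
echo "${run_dir}"    # -> runs/29/29109b85-7935-4e13-8773-9def402c7775
```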
### Directory Structure
```
runs/
├── 29/
│   └── 29109b85-7935-4e13-8773-9def402c7775/
│       ├── cmd.txt
│       ├── end_time.txt
│       ├── exe/
│       │   └── workflow_params.json
│       ├── exit_code.txt
│       ├── outputs/
│       │   └── <output_file>
│       ├── outputs.json
│       ├── run.pid
│       ├── run_request.json
│       ├── runtime_info.json
│       ├── start_time.txt
│       ├── state.txt
│       ├── stderr.log
│       ├── stdout.log
│       ├── system_logs.json
│       └── workflow_engine_params.txt
├── 2d/
│   └── ...
└── sapporo.db
```
### File Descriptions
| File | Description |
|---|---|
| `run_request.json` | Original run request (workflow URL, engine, parameters) |
| `state.txt` | Current run state (e.g., `RUNNING`, `COMPLETE`) |
| `exe/` | Execution directory (working directory for the workflow engine) |
| `exe/workflow_params.json` | Workflow parameters |
| `outputs/` | Output files produced by the workflow |
| `outputs.json` | JSON listing of output files |
| `cmd.txt` | The `docker run` command that was executed |
| `stdout.log` / `stderr.log` | Workflow engine stdout/stderr |
| `start_time.txt` / `end_time.txt` | ISO 8601 timestamps |
| `exit_code.txt` | Process exit code |
| `run.pid` | PID of the `run.sh` subprocess |
| `runtime_info.json` | Runtime metadata |
| `system_logs.json` | System-level logs |
| `workflow_engine_params.txt` | Engine-specific parameters |
## Orphan Recovery
When the sapporo process restarts (e.g., container recreation), any run.sh subprocesses from the previous instance are dead. Runs that were in a non-terminal state are now orphans — their state.txt still says RUNNING or QUEUED, but no process is driving them forward.
At startup, before the SQLite index is built, recover_orphaned_runs() scans all run directories and transitions orphaned runs to SYSTEM_ERROR.
### Target States
Runs in the following non-terminal states are recovered:
- `INITIALIZING`
- `QUEUED`
- `RUNNING`
- `PAUSED`
- `PREEMPTED`
- `CANCELING`
- `DELETING`
Runs in terminal states (COMPLETE, EXECUTOR_ERROR, SYSTEM_ERROR, CANCELED, DELETED) and UNKNOWN are left unchanged.
### Recovery Actions
For each orphaned run, the recovery process:
- Sets `state.txt` to `SYSTEM_ERROR`
- Writes the current timestamp to `end_time.txt`
- Appends a descriptive message to `system_logs.json`
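The recovery pass can be approximated in shell as below. The real `recover_orphaned_runs()` is Python and also appends to `system_logs.json`; the directory layout and state names follow the sections above, while the function name and demo data are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: transition orphaned non-terminal runs to SYSTEM_ERROR at startup.

recover_orphans() {
  local base="$1"
  for state_file in "${base}"/??/*/state.txt; do
    [ -e "${state_file}" ] || continue
    case "$(cat "${state_file}")" in
      INITIALIZING|QUEUED|RUNNING|PAUSED|PREEMPTED|CANCELING|DELETING)
        run_dir=$(dirname "${state_file}")
        echo "SYSTEM_ERROR" > "${state_file}"
        date -u +%Y-%m-%dT%H:%M:%SZ > "${run_dir}/end_time.txt"
        ;;
    esac
  done
}

# Demo: one orphaned RUNNING run and one already-terminal COMPLETE run.
base=$(mktemp -d)
mkdir -p "${base}/29/run-a" "${base}/2d/run-b"
echo "RUNNING"  > "${base}/29/run-a/state.txt"
echo "COMPLETE" > "${base}/2d/run-b/state.txt"
recover_orphans "${base}"
```

Terminal states fall through the `case` untouched, which is what keeps completed and already-failed runs stable across restarts.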
### Ordering
recover_orphaned_runs() runs before init_db() in the application lifespan, so the SQLite index reflects the corrected states from its first build.
## SQLite Index
The SQLite database (sapporo.db) is an index, not a data store. It is rebuilt at a configurable interval (default: 30 minutes) by scanning the run directories and can be deleted at any time without data loss. It exists solely to make GET /runs (list all runs) fast. Individual run queries (GET /runs/{run_id}) always read from the filesystem.
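Because the database is derived entirely from the run directories, rebuilding it is conceptually just a scan. A minimal sketch of the listing it accelerates (the layout follows the directory tree above; the helper name and demo data are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: the run listing behind sapporo.db can always be recomputed
# by walking the run directories.

list_runs() {
  local base="$1"
  for state_file in "${base}"/??/*/state.txt; do
    [ -e "${state_file}" ] || continue
    printf '%s\t%s\n' "$(basename "$(dirname "${state_file}")")" \
                      "$(cat "${state_file}")"
  done
}

base=$(mktemp -d)
mkdir -p "${base}/29/29109b85-7935-4e13-8773-9def402c7775"
echo "COMPLETE" > "${base}/29/29109b85-7935-4e13-8773-9def402c7775/state.txt"
list_runs "${base}"    # prints run ID and state, tab-separated
```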
## RO-Crate
After each run completes (or fails at the executor level), the service generates RO-Crate metadata (ro-crate-metadata.json). See RO-Crate for the full specification including conformance profiles, entity graph, custom properties, bioinformatics extensions, and validation.