RO-Crate

Overview

RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on Schema.org annotations in JSON-LD.

After each workflow run completes (or fails at the executor level), sapporo-service automatically generates an ro-crate-metadata.json file in the run directory. This metadata captures the full provenance of the run: the workflow executed, input parameters, output files, timestamps, exit codes, and runtime environment, enabling reproducibility and comparison of workflow executions.

Conformance

The generated RO-Crate metadata conforms to the following specifications:

Specification	Version	Notes
RO-Crate	1.1	ro-crate-py 0.14.x default
WRROC Process Run Crate	0.5	Prerequisite for Workflow Run Crate
WRROC Workflow Run Crate	0.5	sapporo treats engines as black boxes
Workflow RO-Crate	1.0	Standard ComputationalWorkflow representation

Provenance Run Crate is out of scope because it requires per-step execution info (HowToStep, ControlAction, OrganizeAction), which sapporo does not have. Sapporo delegates execution to workflow engines as black boxes and does not track individual step-level provenance.

Generation Conditions

RO-Crate generation is triggered from run.sh after the workflow engine finishes. The following table describes when metadata is generated:

Run State	RO-Crate Generated	Reason
`COMPLETE`	Yes	Normal success path
`EXECUTOR_ERROR`	Yes (`FailedActionStatus`)	Outputs may be absent, but metadata is still valuable
`SYSTEM_ERROR`	No	Preconditions not satisfied
`CANCELED`	No	Preconditions not satisfied

If RO-Crate generation itself fails, run.sh writes {"@error": "RO-Crate generation failed. Check stderr.log for details."} to ro-crate-metadata.json and appends the Python error traceback to stderr.log. The @error key in the response indicates a generation failure, distinguishing it from a valid RO-Crate (which contains @graph) or a run where RO-Crate was not generated (which returns null via the API).
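A client can tell the three cases apart by inspecting the top-level keys. The following is a minimal sketch, not part of sapporo-service; the function name `classify_ro_crate_response` and the returned labels are hypothetical and chosen only for illustration.

```python
import json
from typing import Optional


def classify_ro_crate_response(body: Optional[dict]) -> str:
    """Classify the documented shapes of RO-Crate metadata (hypothetical helper)."""
    if body is None:
        # The API returned null: no RO-Crate was generated for this run.
        return "not_generated"
    if "@error" in body:
        # Placeholder written by run.sh when generation itself failed.
        return "generation_failed"
    if "@graph" in body:
        # A regular RO-Crate document.
        return "valid"
    return "unknown"


failed = json.loads(
    '{"@error": "RO-Crate generation failed. Check stderr.log for details."}'
)
print(classify_ro_crate_response(failed))
```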

Graceful Degradation

sapporo-service accepts arbitrary workflow engines and workflow languages via the WES API. Unknown engines or languages are expected cases, not errors. RO-Crate generation applies the following fallback rules to produce valid metadata without crashing. Except for a missing run_request.json, which aborts generation, all fallbacks are normal operation and do not emit warning logs (to prevent log file bloat in production).

Condition	Fallback Behavior
Unknown workflow language (`ro-crate-py` `LANG_MAP` miss)	Generic `ComputerLanguage` entity with `name` = workflow type string. WDL is special-cased with `@id` = `https://openwdl.org`
Unknown workflow engine (`_ENGINE_URL_MAP` miss)	`SoftwareApplication` with fragment identifier (`#engine_name`). No `url` property
Local workflow file not found on disk	URL string used as-is for `ComputationalWorkflow` identifier
File metadata I/O error (`stat`, read, hash)	Error suppressed; affected properties omitted from the entity
Non-numeric `exit_code`	`FailedActionStatus` set; `exitCode` property omitted
Output file disappeared after listing	File skipped; remaining outputs processed normally
Missing `run_request.json`	`TypeError` raised with a descriptive message (generation cannot proceed)
tataki Docker unavailable or fails	Enrichment silently skipped; `encodingFormat` left unchanged
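As an illustration of the non-numeric `exit_code` rule in the table above, the sketch below maps a raw exit code value to an action status and an optional integer. It is an assumption-laden example rather than the code in sapporo/ro_crate.py: the helper name `resolve_action_status` is hypothetical, and the status values are written as full Schema.org IRIs.

```python
from typing import Any, Optional, Tuple

# Schema.org action status IRIs (how sapporo serializes them is not shown here).
COMPLETED = "http://schema.org/CompletedActionStatus"
FAILED = "http://schema.org/FailedActionStatus"


def resolve_action_status(raw_exit_code: Any) -> Tuple[str, Optional[int]]:
    """Return (actionStatus, exitCode); exitCode is None when it must be omitted."""
    try:
        code = int(str(raw_exit_code).strip())
    except ValueError:
        # Non-numeric exit code: mark the run as failed and omit exitCode.
        return FAILED, None
    return (COMPLETED if code == 0 else FAILED), code


print(resolve_action_status("0"))
print(resolve_action_status("not-a-number"))
```

Returning `None` instead of raising keeps the caller free to drop the `exitCode` property, which matches the "omit rather than fail" pattern of the other fallbacks.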

Entity Graph

@context: [ro-crate/1.1, wfrun/context, sapporo]

Root Dataset (./)
  +-- conformsTo    -> [process/0.5, workflow/0.5, workflow-ro-crate/1.0]
  +-- mainEntity    -> ComputationalWorkflow
  +-- mentions      -> [CreateAction]
  +-- hasPart       -> [all data entities]
  +-- datePublished -> ISO 8601 datetime

ComputationalWorkflow
  +-- @type: [File, SoftwareSourceCode, ComputationalWorkflow]
  +-- programmingLanguage -> ComputerLanguage
  +-- input  -> [FormalParameter...]
  +-- output -> [FormalParameter...]

CreateAction (#run_id)
  +-- instrument      -> ComputationalWorkflow
  +-- object          -> [File (inputs), PropertyValue (params)]
  +-- result          -> [File (outputs)]
  +-- agent           -> Person (from username.txt)
  +-- executedBy      -> [SoftwareApplication (engine), SoftwareApplication (sapporo)]
  +-- startTime       -> ISO 8601 datetime
  +-- endTime         -> ISO 8601 datetime
  +-- actionStatus    -> CompletedActionStatus | FailedActionStatus
  +-- exitCode        -> int (sapporo context)
  +-- description     -> summary text (e.g., "Executed wf.cwl using cwltool")
  +-- error           -> stderr tail (failure only)
  +-- containerImage  -> ContainerImage
  +-- subjectOf       -> [stdout.log, stderr.log, cmd.txt, system_logs.json, workflow_engine_params.txt]
  +-- multiqcStats    -> File (MultiQC stats)
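
The graph above can be walked with plain JSON tooling. The following sketch, which is not part of sapporo, follows the Root Dataset's `mentions` references to locate the `CreateAction`. The helper name `find_create_action` and the sample identifier `#example-run` are hypothetical.

```python
from typing import Optional


def find_create_action(crate: dict) -> Optional[dict]:
    """Follow Root Dataset -> mentions and return the first CreateAction entity."""
    by_id = {e["@id"]: e for e in crate.get("@graph", []) if "@id" in e}
    mentions = by_id.get("./", {}).get("mentions", [])
    if isinstance(mentions, dict):
        # JSON-LD allows a single reference instead of a list.
        mentions = [mentions]
    for ref in mentions:
        if not isinstance(ref, dict):
            continue
        entity = by_id.get(ref.get("@id"))
        if entity is None:
            continue
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "CreateAction" in types:
            return entity
    return None


# Tiny hand-written crate fragment used only to exercise the helper.
sample = {
    "@graph": [
        {"@id": "./", "@type": "Dataset", "mentions": [{"@id": "#example-run"}]},
        {
            "@id": "#example-run",
            "@type": "CreateAction",
            "actionStatus": {"@id": "http://schema.org/CompletedActionStatus"},
        },
    ]
}
action = find_create_action(sample)
print(action["actionStatus"]["@id"])
```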

Root Data Entity

The root dataset includes:

name / description: Generated from the run ID.
datePublished: ISO 8601 datetime of when the RO-Crate was generated.
license: A textual note stating that licensing of individual files is determined by their respective owners. The RO-Crate Metadata Descriptor (ro-crate-metadata.json) is separately licensed under CC0 1.0.
publisher: Sapporo WES Project organization (the entity that generated and serves this crate).

Workflow Entity

The ComputationalWorkflow entity represents the executed workflow. It conforms to the Bioschemas ComputationalWorkflow 1.0-RELEASE profile. Supported workflow languages are resolved via ro-crate-py, except WDL, which is special-cased (see Graceful Degradation):

Type	Language
`CWL`	Common Workflow Language
`WDL`	Workflow Description Language
`NFL`	Nextflow
`SMK`	Snakemake

Unknown workflow types fall back to a generic ComputerLanguage entity with name set to the type string. The MIME type (encodingFormat) falls back to text/plain. These fallbacks are normal operation and do not produce log output.

CreateAction

The CreateAction entity records the execution provenance:

instrument: References the ComputationalWorkflow.
object: Input files (workflow_attachment) and parameters (workflow_params.json key-value pairs as PropertyValue entities).
result: Output files from the outputs/ directory.
agent: A Person entity derived from username.txt (when authentication is enabled).
executedBy: References to SoftwareApplication entities for the workflow engine and sapporo.
actionStatus: CompletedActionStatus (exit code 0) or FailedActionStatus (non-zero or non-numeric exit code).
description: Summary text (e.g., "Executed wf.cwl using cwltool").
error: Last 20 lines of stderr.log (failure only).
containerImage: Docker image extracted from cmd.txt (e.g., quay.io/commonwl/cwltool:3.1.x).
subjectOf: References to stdout.log, stderr.log, cmd.txt, system_logs.json, and workflow_engine_params.txt.
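
The `error` property is described above as the last 20 lines of stderr.log. A minimal sketch of that kind of tail extraction, assuming the log has already been read into a string, could look like the following; the helper name `tail_lines` is hypothetical and the real implementation may differ.

```python
def tail_lines(text: str, n: int = 20) -> str:
    """Return the last n lines of text, joined with newlines."""
    if n <= 0:
        return ""
    lines = text.splitlines()
    return "\n".join(lines[-n:])


# Synthetic 30-line log used only for demonstration.
log = "\n".join(f"line {i}" for i in range(30))
tail = tail_lines(log)
print(tail.splitlines()[0])
```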

Custom Properties (sapporo context)

Custom properties are defined under the https://w3id.org/ro/terms/sapporo context. These properties enable the Tonkaz workflow comparison tool to perform fine-grained file-level comparison between runs.

Property	Domain	Description
`exitCode`	`CreateAction`	Process exit code
`executedBy`	`CreateAction`	References to SoftwareApplication entities (engine, sapporo)
`lineCount`	`File`	Number of lines in a text file
`text`	`File`	Embedded file content (files <= 10 KB)
`multiqcStats`	`CreateAction`	Reference to MultiQC general stats JSON
`FileStats`	(type)	Type for samtools/vcftools statistics
`stats`	`File`	Link from File to FileStats

File checksums use sha256 (defined in the wfrun context).
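
To make the file-level properties concrete, the sketch below computes `sha256`, `lineCount`, and `text` for a single file. It is an illustration under stated assumptions, not the sapporo implementation: the helper name `describe_file` is hypothetical, "10 KB" is assumed to mean 10 * 1024 bytes, and files that do not decode as UTF-8 are assumed to be binary and receive only a checksum.

```python
import hashlib
from pathlib import Path

# Assumption: the 10 KB embedding limit is 10 * 1024 bytes.
TEXT_EMBED_LIMIT = 10 * 1024


def describe_file(path: Path) -> dict:
    """Compute sha256, and for UTF-8 text files also lineCount and (if small) text."""
    data = path.read_bytes()  # reads the whole file; fine for a sketch
    props = {"sha256": hashlib.sha256(data).hexdigest()}
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Treated as binary here: no lineCount or embedded text.
        return props
    props["lineCount"] = len(text.splitlines())
    if len(data) <= TEXT_EMBED_LIMIT:
        props["text"] = text
    return props
```

A comparison tool can then compare two runs file by file: identical `sha256` values mean identical content, while `lineCount` and `text` give cheaper hints about how differing files diverge.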

Bioinformatics Extensions

sapporo automatically runs bioinformatics analysis tools on output files to embed domain-specific statistics in the RO-Crate metadata.

MultiQC Statistics

MultiQC is run in a Docker container (quay.io/biocontainers/multiqc:1.33--pyhdfd78af_0) automatically on the entire run directory after workflow completion. If Docker is not available, MultiQC is skipped. If MultiQC finds supported tool outputs (e.g., FastQC, samtools), it generates a multiqc_general_stats.json file. This file is:

Stored at {run_dir}/multiqc_general_stats.json.
Added to the crate as a File entity with full content embedded.
Referenced from the CreateAction via the multiqcStats property.

samtools Stats (BAM/SAM)

For output files with BAM (.bam) or SAM (.sam) format (detected via EDAM ontology), `samtools flagstat` is run in a Docker container (quay.io/biocontainers/samtools:1.23--h96c455f_0). The resulting FileStats entity includes:

Property	Description
`totalReads`	Total number of reads
`mappedReads`	Number of mapped reads
`unmappedReads`	Number of unmapped reads
`duplicateReads`	Number of duplicate reads
`mappedRate`	Mapped reads / total reads
`unmappedRate`	Unmapped reads / total reads
`duplicateRate`	Duplicate reads / total reads
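
The three rate properties follow directly from the counts. The sketch below shows the arithmetic; the helper name `compute_read_rates` is hypothetical, and returning 0.0 for an empty file (zero total reads) is an assumption made here to avoid division by zero, not documented sapporo behavior.

```python
def compute_read_rates(
    total_reads: int, mapped_reads: int, unmapped_reads: int, duplicate_reads: int
) -> dict:
    """Derive mappedRate, unmappedRate, and duplicateRate from read counts."""

    def rate(count: int) -> float:
        # Assumption: report 0.0 when there are no reads at all.
        return count / total_reads if total_reads else 0.0

    return {
        "mappedRate": rate(mapped_reads),
        "unmappedRate": rate(unmapped_reads),
        "duplicateRate": rate(duplicate_reads),
    }


rates = compute_read_rates(
    total_reads=200, mapped_reads=150, unmapped_reads=50, duplicate_reads=20
)
print(rates)
```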

vcftools Stats (VCF)

For output files with VCF format (.vcf, .vcf.gz), vcf-stats is run in a Docker container (quay.io/biocontainers/vcftools:0.1.17--pl5321h077b44d_0). The resulting FileStats entity includes:

Property	Description
`variantCount`	Total number of variants
`snpsCount`	Number of SNPs
`indelsCount`	Number of indels

EDAM Format Auto-detection

Output files are automatically annotated with EDAM ontology format identifiers based on file extension. EDAM entities use @type: "Thing" as they represent ontology terms rather than web resources. The mapping is defined in sapporo/ro_crate.py (EDAM_MAPPING dict). Common non-bioinformatics formats (JSON, CSV, TSV, HTML, YAML, Markdown, ZIP, gzip, plain text) are also mapped to their IANA media types.
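
Extension-based lookup needs to handle compound suffixes such as .vcf.gz. The following sketch shows one way to do that. The dictionary `EXAMPLE_EDAM_MAPPING` is a small illustrative subset written for this example and is not the actual `EDAM_MAPPING` from sapporo/ro_crate.py; the helper name `guess_edam_format` is likewise hypothetical.

```python
from pathlib import PurePath
from typing import Optional

# Illustrative subset only; the real EDAM_MAPPING may differ in keys and shape.
EXAMPLE_EDAM_MAPPING = {
    ".bam": "http://edamontology.org/format_2572",
    ".sam": "http://edamontology.org/format_2573",
    ".vcf": "http://edamontology.org/format_3016",
    ".vcf.gz": "http://edamontology.org/format_3016",
    ".fastq": "http://edamontology.org/format_1930",
}


def guess_edam_format(filename: str) -> Optional[str]:
    """Return an EDAM format IRI based on the file extension, or None."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    # Try the longest compound suffix first so ".vcf.gz" is matched as a whole.
    for start in range(len(suffixes)):
        candidate = "".join(suffixes[start:])
        if candidate in EXAMPLE_EDAM_MAPPING:
            return EXAMPLE_EDAM_MAPPING[candidate]
    return None


print(guess_edam_format("calls.vcf.gz"))
print(guess_edam_format("sample.sorted.bam"))
```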

tataki Content-Based Format Detection

tataki is run in a Docker container (ghcr.io/sapporo-wes/tataki:latest) against all output files after the extension-based EDAM detection. tataki detects file formats by inspecting file content (magic bytes, structure analysis) rather than relying on file extensions, covering both bioinformatics formats (BAM, VCF, FASTQ, ...) and common formats (TSV, CSV, JSON, HTML, PDF, PNG, SVG).

When tataki identifies a file's format, the file's encodingFormat is replaced with the EDAM ontology entity returned by tataki. Files that tataki cannot identify retain their original encodingFormat (extension-based EDAM + MIME type).

This enrichment enables Tonkaz Level 1-3 file-content comparison on typical workflow outputs. If Docker is not available or tataki fails, the enrichment is silently skipped.

API Endpoint

`GET /runs/{run_id}/ro-crate`

Retrieve the RO-Crate metadata for a completed run.

Parameter	Response	Content-Type
`download=false`	JSON-LD	`application/ld+json`
`download=true`	ZIP archive	`application/zip`

When download=true, the response is a ZIP archive containing all files referenced in the crate. When download=false, only the ro-crate-metadata.json content is returned as JSON-LD.

When authentication is enabled, this endpoint is protected and requires a valid JWT token.
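
A request to this endpoint can be assembled with the Python standard library. The sketch below only builds the request object and does not send it. Several details are assumptions: the helper name `build_ro_crate_request` is hypothetical, the base URL, run ID, and token are placeholders, and sending the JWT as an `Authorization: Bearer` header is the common convention rather than something this page specifies.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_ro_crate_request(
    base_url: str, run_id: str, download: bool = False, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a GET request for /runs/{run_id}/ro-crate."""
    query = urllib.parse.urlencode({"download": "true" if download else "false"})
    safe_run_id = urllib.parse.quote(run_id, safe="")
    url = f"{base_url.rstrip('/')}/runs/{safe_run_id}/ro-crate?{query}"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Assumption: the JWT is passed as a Bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Placeholder values for illustration only.
req = build_ro_crate_request(
    "http://localhost:1122/", "example-run-id", download=True, token="example-token"
)
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would then return JSON-LD or a ZIP archive depending on the `download` parameter.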

Implementation

RO-Crate generation is implemented in sapporo/ro_crate.py and called from run.sh after the workflow engine completes (or fails). It runs in the same subprocess as the workflow execution.

The entry point is generate_ro_crate(run_dir), invoked from run.sh via the CLI as:

sapporo-cli generate-ro-crate ${run_dir}

The generation flow:

Create a base crate with WRROC profiles and sapporo context.
Add the ComputationalWorkflow entity from the run request.
Add SoftwareApplication entities for the workflow engine and sapporo.
Build the CreateAction with inputs, outputs, logs, and metadata.
Run MultiQC in Docker and attach statistics (skipped if Docker is unavailable).
Run samtools/vcftools in Docker on applicable output files (skipped if Docker is unavailable).
Run tataki in Docker to enrich output files with EDAM format IDs (skipped if Docker is unavailable).
Write ro-crate-metadata.json and README.md to the run directory.

Validation

The generated RO-Crate metadata can be validated using roc-validator:

uv run roc-validator validate ro-crate-metadata.json

All REQUIRED checks from the RO-Crate 1.1 specification should pass. RECOMMENDED checks may produce warnings for optional properties that sapporo does not populate (e.g., author on the Root Data Entity, license as a CreativeWork entity).

Example

A complete RO-Crate example is available in tests/ro-crate/:

ro-crate-metadata.json: Generated metadata (quick reference copy)
ro-crate_dir/: Sample run directory with all source files and generated metadata

See tests/ro-crate/README.md for details.