RO-Crate
Overview
RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on Schema.org annotations in JSON-LD.
After each workflow run completes (or fails at the executor level), sapporo-service automatically generates an ro-crate-metadata.json file in the run directory. This metadata captures the full provenance of the run: the workflow executed, input parameters, output files, timestamps, exit codes, and runtime environment, enabling reproducibility and comparison of workflow executions.
Conformance
The generated RO-Crate metadata conforms to the following specifications:
| Specification | Version | Notes |
|---|---|---|
| RO-Crate | 1.1 | ro-crate-py 0.14.x default |
| WRROC Process Run Crate | 0.5 | Prerequisite for Workflow Run Crate |
| WRROC Workflow Run Crate | 0.5 | sapporo treats engines as black boxes |
| Workflow RO-Crate | 1.0 | Standard ComputationalWorkflow representation |
Provenance Run Crate is out of scope because it requires per-step execution info (HowToStep, ControlAction, OrganizeAction), which sapporo does not have. Sapporo delegates execution to workflow engines as black boxes and does not track individual step-level provenance.
Generation Conditions
RO-Crate generation is triggered from run.sh after the workflow engine finishes. The following table describes when metadata is generated:
| Run State | RO-Crate Generated | Reason |
|---|---|---|
| COMPLETE | Yes | Normal success path |
| EXECUTOR_ERROR | Yes (FailedActionStatus) | Outputs may be absent, but metadata is still valuable |
| SYSTEM_ERROR | No | Preconditions not satisfied |
| CANCELED | No | Preconditions not satisfied |
If RO-Crate generation itself fails, run.sh writes {"@error": "RO-Crate generation failed. Check stderr.log for details."} to ro-crate-metadata.json and appends the Python error traceback to stderr.log. The @error key in the response indicates a generation failure, distinguishing it from a valid RO-Crate (which contains @graph) or a run where RO-Crate was not generated (which returns null via the API).
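A client can tell these three cases apart by inspecting the parsed document. The following is a minimal sketch; the function name `classify_ro_crate` and the returned labels are illustrative and are not part of sapporo-service.

```python
import json
from typing import Any, Optional


def classify_ro_crate(payload: Optional[Any]) -> str:
    """Classify a parsed ro-crate-metadata.json document (hypothetical helper)."""
    if payload is None:
        # The API returned null: no RO-Crate was generated for this run.
        return "not_generated"
    if isinstance(payload, dict) and "@error" in payload:
        # run.sh wrote the error placeholder; details are in stderr.log.
        return "generation_failed"
    if isinstance(payload, dict) and "@graph" in payload:
        # A JSON-LD document with an entity graph, i.e. an actual crate.
        return "crate"
    return "unknown"


failed = json.loads(
    '{"@error": "RO-Crate generation failed. Check stderr.log for details."}'
)
print(classify_ro_crate(failed))
print(classify_ro_crate({"@context": [], "@graph": []}))
print(classify_ro_crate(None))
```

Checking `@error` before `@graph` keeps the failure case explicit even if a future placeholder carried both keys.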
Graceful Degradation
sapporo-service accepts arbitrary workflow engines and workflow languages via the WES API. Unknown engines or languages are expected cases, not errors. The RO-Crate generation applies the following fallback rules to produce valid metadata without crashing. All fallbacks are normal operation and do not emit warning logs (to prevent log file bloat in production).
| Condition | Fallback Behavior |
|---|---|
| Unknown workflow language (ro-crate-py LANG_MAP miss) | Generic ComputerLanguage entity with name = workflow type string. WDL is special-cased with @id = https://openwdl.org |
| Unknown workflow engine (_ENGINE_URL_MAP miss) | SoftwareApplication with fragment identifier (#engine_name). No url property |
| Local workflow file not found on disk | URL string used as-is for ComputationalWorkflow identifier |
| File metadata I/O error (stat, read, hash) | Error suppressed; affected properties omitted from the entity |
| Non-numeric exit_code | FailedActionStatus set; exitCode property omitted |
| Output file disappeared after listing | File skipped; remaining outputs processed normally |
| Missing run_request.json | TypeError raised with a descriptive message (generation cannot proceed) |
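Two of these fallbacks can be sketched in a few lines of Python. This is not the sapporo implementation: `ENGINE_URL_MAP` is a hypothetical stand-in for `_ENGINE_URL_MAP`, the function names are invented for illustration, and using the URL as `@id` for a known engine is an assumption.

```python
from typing import Any, Dict

# Hypothetical stand-in for sapporo's _ENGINE_URL_MAP; the real contents may differ.
ENGINE_URL_MAP: Dict[str, str] = {
    "cwltool": "https://github.com/common-workflow-language/cwltool",
}


def engine_entity(engine_name: str) -> Dict[str, Any]:
    """Build a SoftwareApplication entity, degrading for unknown engines."""
    url = ENGINE_URL_MAP.get(engine_name)
    if url is None:
        # Unknown engine: fragment identifier, no url property.
        return {
            "@id": f"#{engine_name}",
            "@type": "SoftwareApplication",
            "name": engine_name,
        }
    # Assumption: a known engine is identified by its URL.
    return {"@id": url, "@type": "SoftwareApplication", "name": engine_name, "url": url}


def action_status(raw_exit_code: Any) -> Dict[str, Any]:
    """Map a raw exit code to actionStatus, omitting exitCode if non-numeric."""
    try:
        code = int(raw_exit_code)
    except (TypeError, ValueError):
        # Non-numeric exit code: mark the run as failed and omit exitCode.
        return {"actionStatus": "FailedActionStatus"}
    status = "CompletedActionStatus" if code == 0 else "FailedActionStatus"
    # The serialized crate may use a full schema.org URI instead of the bare name.
    return {"actionStatus": status, "exitCode": code}
```

Both functions return plain dictionaries and never raise on unexpected input, which mirrors the rule that unknown engines and malformed exit codes are expected cases rather than errors.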
Entity Graph
@context: [ro-crate/1.1, wfrun/context, sapporo]
Root Dataset (./)
+-- conformsTo -> [process/0.5, workflow/0.5, workflow-ro-crate/1.0]
+-- mainEntity -> ComputationalWorkflow
+-- mentions -> [CreateAction]
+-- hasPart -> [all data entities]
+-- datePublished -> ISO 8601 datetime
ComputationalWorkflow
+-- @type: [File, SoftwareSourceCode, ComputationalWorkflow]
+-- programmingLanguage -> ComputerLanguage
+-- input -> [FormalParameter...]
+-- output -> [FormalParameter...]
CreateAction (#run_id)
+-- instrument -> ComputationalWorkflow
+-- object -> [File (inputs), PropertyValue (params)]
+-- result -> [File (outputs)]
+-- agent -> Person (from username.txt)
+-- executedBy -> [SoftwareApplication (engine), SoftwareApplication (sapporo)]
+-- startTime -> ISO 8601 datetime
+-- endTime -> ISO 8601 datetime
+-- actionStatus -> CompletedActionStatus | FailedActionStatus
+-- exitCode -> int (sapporo context)
+-- description -> summary text (e.g., "Executed wf.cwl using cwltool")
+-- error -> stderr tail (failure only)
+-- containerImage -> ContainerImage
+-- subjectOf -> [stdout.log, stderr.log, cmd.txt, system_logs.json, workflow_engine_params.txt]
+-- multiqcStats -> File (MultiQC stats)
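The graph above can be traversed with plain dictionary lookups once the JSON-LD document is loaded. The sketch below assumes the root dataset has `@id` `./` and that references are `{"@id": ...}` objects, as is usual for RO-Crate; the helper names and the sample document are illustrative only.

```python
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]


def as_list(value: Any) -> List[Any]:
    """Normalize a JSON-LD value that may be absent, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_run(crate: Dict[str, Any]) -> Tuple[Entity, List[Entity]]:
    """Return the main workflow and the CreateAction entities of a crate."""
    by_id = {e["@id"]: e for e in crate.get("@graph", []) if "@id" in e}
    root = by_id["./"]
    # mainEntity points at the ComputationalWorkflow.
    workflow = by_id[root["mainEntity"]["@id"]]
    # mentions lists the run actions; keep only CreateAction entities.
    mentioned = [by_id[ref["@id"]] for ref in as_list(root.get("mentions"))]
    runs = [a for a in mentioned if "CreateAction" in as_list(a.get("@type"))]
    return workflow, runs


# Tiny hand-written sample, not real sapporo output.
sample = {
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "mainEntity": {"@id": "wf.cwl"}, "mentions": [{"@id": "#run-1"}]},
        {"@id": "wf.cwl",
         "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
        {"@id": "#run-1", "@type": "CreateAction", "instrument": {"@id": "wf.cwl"}},
    ]
}
workflow, runs = find_run(sample)
print(workflow["@id"], [r["@id"] for r in runs])
```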
Root Data Entity
The root dataset includes:
- name/description: Generated from the run ID.
- datePublished: ISO 8601 datetime of when the RO-Crate was generated.
- license: A textual note stating that licensing of individual files is determined by their respective owners. The RO-Crate Metadata Descriptor (ro-crate-metadata.json) is separately licensed under CC0 1.0.
- publisher: Sapporo WES Project organization (the entity that generated and serves this crate).
Workflow Entity
The ComputationalWorkflow entity represents the executed workflow. It conforms to the Bioschemas ComputationalWorkflow 1.0-RELEASE profile. Supported workflow languages are resolved via ro-crate-py:
| Type | Language |
|---|---|
| CWL | Common Workflow Language |
| WDL | Workflow Description Language |
| NFL | Nextflow |
| SMK | Snakemake |
Unknown workflow types fall back to a generic ComputerLanguage entity with name set to the type string. The MIME type (encodingFormat) falls back to text/plain. These fallbacks are normal operation and do not produce log output.
CreateAction
The CreateAction entity records the execution provenance:
- instrument: References the ComputationalWorkflow.
- object: Input files (workflow_attachment) and parameters (workflow_params.json key-value pairs as PropertyValue entities).
- result: Output files from the outputs/ directory.
- agent: A Person entity derived from username.txt (when authentication is enabled).
- executedBy: References to SoftwareApplication entities for the workflow engine and sapporo.
- actionStatus: CompletedActionStatus (exit code 0) or FailedActionStatus (non-zero).
- description: Summary text (e.g., "Executed wf.cwl using cwltool").
- error: Last 20 lines of stderr.log (failure only).
- containerImage: Docker image extracted from cmd.txt (e.g., quay.io/commonwl/cwltool:3.1.x).
- subjectOf: References to stdout.log, stderr.log, cmd.txt, system_logs.json, and workflow_engine_params.txt.
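As an illustration of the error property, the tail of stderr.log could be extracted roughly as follows. This is a sketch under stated assumptions: the function name `stderr_tail` is hypothetical, and returning `None` on a read failure or an empty log is an assumption in the spirit of the graceful-degradation rules, not confirmed behavior.

```python
from pathlib import Path
from typing import Optional


def stderr_tail(stderr_path: Path, max_lines: int = 20) -> Optional[str]:
    """Return the last max_lines lines of a log file, or None if unavailable."""
    try:
        # errors="replace" keeps undecodable bytes from aborting the read.
        text = stderr_path.read_text(encoding="utf-8", errors="replace")
    except OSError:
        # Assumption: an unreadable log simply means the property is omitted.
        return None
    lines = text.splitlines()
    if not lines:
        return None
    return "\n".join(lines[-max_lines:])
```

A caller would attach the returned string as the error property only when the action status is FailedActionStatus.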
Custom Properties (sapporo context)
Custom properties are defined under the https://w3id.org/ro/terms/sapporo context. These properties enable the Tonkaz workflow comparison tool to perform fine-grained file-level comparison between runs.
| Property | Domain | Description |
|---|---|---|
| exitCode | CreateAction | Process exit code |
| executedBy | CreateAction | References to SoftwareApplication entities (engine, sapporo) |
| lineCount | File | Number of lines in a text file |
| text | File | Embedded file content (files <= 10 KB) |
| multiqcStats | CreateAction | Reference to MultiQC general stats JSON |
| FileStats | (type) | Type for samtools/vcftools statistics |
| stats | File | Link from File to FileStats |
File checksums use sha256 (defined in the wfrun context).
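The file-level properties can be computed with the standard library alone. The sketch below is illustrative rather than the actual sapporo code: the helper name is hypothetical, "10 KB" is assumed to mean 10 * 1024 bytes, and treating non-UTF-8 content as binary is also an assumption.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict

TEXT_EMBED_LIMIT = 10 * 1024  # assumption: "10 KB" means 10 * 1024 bytes


def file_properties(path: Path) -> Dict[str, Any]:
    """Compute sha256, lineCount, and text for a file, omitting what fails."""
    props: Dict[str, Any] = {}
    try:
        data = path.read_bytes()
    except OSError:
        # I/O error: suppress it and omit the affected properties.
        return props
    props["sha256"] = hashlib.sha256(data).hexdigest()
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: files that are not valid UTF-8 are treated as binary.
        return props
    props["lineCount"] = len(text.splitlines())
    if len(data) <= TEXT_EMBED_LIMIT:
        # Small text files are embedded in full.
        props["text"] = text
    return props
```

Because each property is added only after its computation succeeds, a partial failure yields an entity with fewer properties instead of an exception.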
Bioinformatics Extensions
sapporo automatically runs bioinformatics analysis tools on output files to embed domain-specific statistics in the RO-Crate metadata.
MultiQC Statistics
MultiQC is run in a Docker container (quay.io/biocontainers/multiqc:1.33--pyhdfd78af_0) automatically on the entire run directory after workflow completion. If Docker is not available, MultiQC is skipped. If MultiQC finds supported tool outputs (e.g., FastQC, samtools), it generates a multiqc_general_stats.json file. This file is:
- Stored at {run_dir}/multiqc_general_stats.json.
- Added to the crate as a File entity with full content embedded.
- Referenced from the CreateAction via the multiqcStats property.
samtools Stats (BAM/SAM)
For output files with BAM (.bam) or SAM (.sam) format (detected via EDAM ontology), samtools flagstats is run in a Docker container (quay.io/biocontainers/samtools:1.23--h96c455f_0). The resulting FileStats entity includes:
| Property | Description |
|---|---|
| totalReads | Total number of reads |
| mappedReads | Number of mapped reads |
| unmappedReads | Number of unmapped reads |
| duplicateReads | Number of duplicate reads |
| mappedRate | Mapped reads / total reads |
| unmappedRate | Unmapped reads / total reads |
| duplicateRate | Duplicate reads / total reads |
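The rate properties follow directly from the counts. The following sketch shows the arithmetic only; the function name is hypothetical, deriving unmapped reads as total minus mapped is an assumption, and so is reporting 0.0 when the file contains no reads.

```python
from typing import Dict, Union


def read_stats(total: int, mapped: int, duplicates: int) -> Dict[str, Union[int, float]]:
    """Build FileStats-style properties from three read counts (illustrative)."""
    # Assumption: unmapped reads are derived rather than read directly.
    unmapped = total - mapped

    def rate(count: int) -> float:
        # Assumption: report 0.0 for an empty file instead of dividing by zero.
        return count / total if total else 0.0

    return {
        "totalReads": total,
        "mappedReads": mapped,
        "unmappedReads": unmapped,
        "duplicateReads": duplicates,
        "mappedRate": rate(mapped),
        "unmappedRate": rate(unmapped),
        "duplicateRate": rate(duplicates),
    }


print(read_stats(total=100, mapped=90, duplicates=5))
```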
vcftools Stats (VCF)
For output files with VCF format (.vcf, .vcf.gz), vcf-stats is run in a Docker container (quay.io/biocontainers/vcftools:0.1.17--pl5321h077b44d_0). The resulting FileStats entity includes:
| Property | Description |
|---|---|
| variantCount | Total number of variants |
| snpsCount | Number of SNPs |
| indelsCount | Number of indels |
EDAM Format Auto-detection
Output files are automatically annotated with EDAM ontology format identifiers based on file extension. EDAM entities use @type: "Thing" as they represent ontology terms rather than web resources. The mapping is defined in sapporo/ro_crate.py (EDAM_MAPPING dict). Common non-bioinformatics formats (JSON, CSV, TSV, HTML, YAML, Markdown, ZIP, gzip, plain text) are also mapped to their IANA media types.
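Extension-based detection can be sketched as a suffix lookup. The dictionary below is an illustrative subset written for this example and is not the actual EDAM_MAPPING from sapporo/ro_crate.py; the function name is likewise hypothetical.

```python
from typing import Dict, Optional

# Illustrative subset only; the real EDAM_MAPPING may contain different entries.
EDAM_BY_EXTENSION: Dict[str, str] = {
    ".bam": "http://edamontology.org/format_2572",
    ".sam": "http://edamontology.org/format_2573",
    ".vcf": "http://edamontology.org/format_3016",
    ".vcf.gz": "http://edamontology.org/format_3016",
    ".fastq": "http://edamontology.org/format_1930",
}


def detect_edam(file_name: str) -> Optional[Dict[str, str]]:
    """Return an EDAM format entity for a file name, or None if unmapped."""
    lower = file_name.lower()
    # Check longer suffixes first so compound extensions such as ".vcf.gz"
    # are matched before any shorter suffix they end with.
    for ext in sorted(EDAM_BY_EXTENSION, key=len, reverse=True):
        if lower.endswith(ext):
            # EDAM terms are ontology terms, hence @type "Thing".
            return {"@id": EDAM_BY_EXTENSION[ext], "@type": "Thing"}
    return None


print(detect_edam("sample.bam"))
print(detect_edam("calls.vcf.gz"))
print(detect_edam("notes.xyz"))
```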
API Endpoint
GET /runs/{run_id}/ro-crate
Retrieve the RO-Crate metadata for a completed run.
| Parameter | Response Format | Content Type |
|---|---|---|
| download=false | JSON-LD | application/ld+json |
| download=true | ZIP archive | application/zip |
When download=true, the response is a ZIP archive containing all files referenced in the crate. When download=false, only the ro-crate-metadata.json content is returned as JSON-LD.
When authentication is enabled, this endpoint is protected and requires a valid JWT token.
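A request to this endpoint can be assembled with the Python standard library. The sketch below only builds the request; the helper name and the base URL are placeholders, and sending the JWT as a Bearer token in the Authorization header is an assumption about the deployment.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_ro_crate_request(
    base_url: str,
    run_id: str,
    download: bool = False,
    token: Optional[str] = None,
) -> Request:
    """Build a GET request for /runs/{run_id}/ro-crate (hypothetical helper)."""
    query = urlencode({"download": "true" if download else "false"})
    url = f"{base_url.rstrip('/')}/runs/{quote(run_id, safe='')}/ro-crate?{query}"
    # Ask for the content type that matches the download mode.
    headers = {"Accept": "application/zip" if download else "application/ld+json"}
    if token:
        # Assumption: the JWT is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return Request(url, headers=headers, method="GET")


# Placeholder base URL and run ID; adjust to the actual deployment.
request = build_ro_crate_request("http://localhost:1122", "example-run-id")
print(request.full_url)
```

The returned object can be passed to `urllib.request.urlopen`. With download=false the body parses as JSON, and with download=true it is the ZIP archive as bytes.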
Implementation
RO-Crate generation is implemented in sapporo/ro_crate.py and called from run.sh after the workflow engine completes (or fails). It runs in the same subprocess as the workflow execution.
The entry point is generate_ro_crate(run_dir), which run.sh invokes through the CLI.
The generation flow:
- Create a base crate with WRROC profiles and sapporo context.
- Add the ComputationalWorkflow entity from the run request.
- Add SoftwareApplication entities for the workflow engine and sapporo.
- Build the CreateAction with inputs, outputs, logs, and metadata.
- Run MultiQC in Docker and attach statistics (skipped if Docker is unavailable).
- Run samtools/vcftools in Docker on applicable output files (skipped if Docker is unavailable).
- Write ro-crate-metadata.json and README.md to the run directory.
Validation
The generated RO-Crate metadata can be validated using roc-validator.
All REQUIRED checks from the RO-Crate 1.1 specification should pass. RECOMMENDED checks may produce warnings for optional properties that sapporo does not populate (e.g., author on the Root Data Entity, license as a CreativeWork entity).
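For a quick sanity check before running a full validator, a few structural properties can be tested directly. This sketch is not a substitute for roc-validator and covers only a small part of the specification; the function name is hypothetical, and the checked keys are the root dataset properties that sapporo populates as described above.

```python
from typing import Any, Dict, List

ROOT_KEYS = ("name", "description", "datePublished", "license")


def quick_check(crate: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems found in a parsed crate."""
    problems: List[str] = []
    by_id = {e.get("@id"): e for e in crate.get("@graph", [])}

    # The metadata descriptor must exist and point at the root dataset.
    descriptor = by_id.get("ro-crate-metadata.json")
    if descriptor is None:
        problems.append("missing metadata descriptor")
    else:
        about = descriptor.get("about")
        if not isinstance(about, dict) or about.get("@id") != "./":
            problems.append("metadata descriptor does not point at ./")

    # The root dataset must exist and carry its basic properties.
    root = by_id.get("./")
    if root is None:
        problems.append("missing root dataset")
    else:
        for key in ROOT_KEYS:
            if key not in root:
                problems.append(f"root dataset lacks {key}")
    return problems
```

An empty list means only that these basic checks passed; conformance to the RO-Crate and WRROC profiles still needs the dedicated validator.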
Example
A complete RO-Crate example is available in tests/ro-crate/:
- ro-crate-metadata.json: Generated metadata (quick reference copy)
- ro-crate_dir/: Sample run directory with all source files and generated metadata
See tests/ro-crate/README.md for details.