Runtime Orchestrator Spec
This document defines the first implementation boundary for Hosted Runtime Analysis.
sandtrace-ingest already accepts normalized run uploads. The missing piece is the service that decides when to execute a hosted runtime job, how a worker claims it, and how the result is handed off to ingest.
Scope
This spec covers:
- job submission
- job state transitions
- worker lease behavior
- runtime execution payloads
- result upload handoff into sandtrace-ingest
- minimal database schema
This spec does not cover:
- customer billing calculations
- dedicated worker pools
- custom base images
- private networking
- full GitHub App design
Core model
Job lifecycle
Each hosted runtime execution is a runtime_job.
Required high-level states:
- queued
- running
- uploaded
- failed
- canceled
Optional internal states that are useful but not required on day one:
- lease_acquired
- checking_out
- executing
- uploading
The public API should expose only the high-level states unless debugging requires more detail.
State rules
- new jobs start as queued
- only a worker with an active lease can move a job to running
- a job becomes uploaded only after sandtrace-ingest acknowledges the run payload
- terminal states are uploaded, failed, and canceled
- terminal jobs cannot be resumed; retries create a new job row linked to the original
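The state rules above can be expressed as a small transition check. The following is a minimal sketch, not the orchestrator's actual implementation: the `ALLOWED` table, the function name `can_transition`, and the `has_active_lease` / `ingest_acked` flags are assumptions made for illustration. In particular, the running-to-queued edge is assumed here to model a lease expiry that requeues the job.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    UPLOADED = "uploaded"
    FAILED = "failed"
    CANCELED = "canceled"


# Terminal states as listed in the state rules.
TERMINAL = {JobStatus.UPLOADED, JobStatus.FAILED, JobStatus.CANCELED}

# Assumed transition table; the spec gives the rules but not an explicit table.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELED},
    JobStatus.RUNNING: {
        JobStatus.UPLOADED,
        JobStatus.FAILED,
        JobStatus.CANCELED,
        JobStatus.QUEUED,  # assumed: lease expiry returns the job to the queue
    },
}


def can_transition(current, target, *, has_active_lease=False, ingest_acked=False):
    """Return True if moving from `current` to `target` obeys the state rules."""
    if current in TERMINAL:
        return False  # terminal jobs cannot be resumed
    if target not in ALLOWED.get(current, set()):
        return False
    if target is JobStatus.RUNNING and not has_active_lease:
        return False  # only a lease holder may start a job
    if target is JobStatus.UPLOADED and not ingest_acked:
        return False  # uploaded requires an ingest acknowledgement
    return True
```

Keeping the guard conditions as explicit keyword arguments makes it harder for a caller to move a job to running or uploaded without stating why the move is permitted.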
Job submission API
The orchestrator should accept a single create-job request from either the product UI or a GitHub-triggered integration layer.
POST /v1/runtime/jobs
Creates a hosted runtime job.
Example request:
{
  "org_slug": "sandtrace",
  "project_slug": "web",
  "source": {
    "kind": "github",
    "repo_url": "https://github.com/cc-consulting-nv/web.git",
    "owner": "cc-consulting-nv",
    "repo": "web",
    "ref": "refs/heads/main",
    "git_commit": "c13aa82903ea336cf3f21bdf2d930dc1a41f65cf",
    "pull_request_number": 98
  },
  "execution": {
    "working_directory": ".",
    "command": ["pnpm", "install"],
    "timeout_seconds": 300,
    "allow_network": true,
    "allow_exec": true
  },
  "trigger": {
    "kind": "pull_request",
    "actor": "github-app"
  }
}
Example response:
{
  "job_id": "rtj_01kkygkagq0jk17bx6y1w8c3df",
  "status": "queued",
  "created_at": "2026-03-17T20:00:00Z"
}
Validation rules
- org must have Hosted Runtime Analysis enabled
- repo/project must be enabled for hosted runtime analysis
- command must be non-empty
- timeout_seconds must be within the allowed plan limit
- repo_url and git_commit must be present
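The validation rules above could be checked in one pass over the request body. This is a sketch under stated assumptions: the function name `validate_create_job`, the `org_enabled` / `project_enabled` flags, and the `max_timeout_seconds` plan limit are hypothetical inputs that the real service would look up from its own entitlement data.

```python
def validate_create_job(request, *, org_enabled, project_enabled, max_timeout_seconds):
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    if not org_enabled:
        errors.append("org does not have Hosted Runtime Analysis enabled")
    if not project_enabled:
        errors.append("repo/project is not enabled for hosted runtime analysis")

    source = request.get("source") or {}
    execution = request.get("execution") or {}

    # repo_url and git_commit must be present.
    for field in ("repo_url", "git_commit"):
        if not source.get(field):
            errors.append(f"source.{field} must be present")

    # command must be a non-empty list.
    command = execution.get("command")
    if not isinstance(command, list) or not command:
        errors.append("execution.command must be non-empty")

    # timeout_seconds must be a positive integer within the plan limit.
    timeout = execution.get("timeout_seconds")
    valid_timeout = (
        isinstance(timeout, int)
        and not isinstance(timeout, bool)
        and 0 < timeout <= max_timeout_seconds
    )
    if not valid_timeout:
        errors.append("execution.timeout_seconds must be within the allowed plan limit")
    return errors
```

Returning every error at once, rather than failing on the first, lets the UI or integration layer show the caller all problems in a single response.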
Job query API
GET /v1/runtime/jobs
Lists jobs for a visible org or project.
Recommended filters:
- project_slug
- status
- trigger_kind
- git_commit
- limit
GET /v1/runtime/jobs/{job_id}
Returns the job record, current status, and last event summary.
POST /v1/runtime/jobs/{job_id}/cancel
Cancels a job if it is still queued or running.
If a worker already holds a lease, the worker should observe the cancellation signal and stop execution as soon as possible.
Worker lease API
Workers should not scan the database directly for jobs. Use a lease endpoint so orchestration policy stays centralized.
POST /v1/runtime/leases
Claims one queued job and returns a worker lease plus the full execution payload.
Example request:
{
  "worker_id": "wrk_01kkygmfjf4j9s26hzd0h93j0r",
  "pool": "shared-linux",
  "capabilities": {
    "linux": true,
    "ptrace": true,
    "namespaces": true
  }
}
Example response:
{
  "lease_id": "rtl_01kkygn4z3fkef0g4tgm6g3b1j",
  "job": {
    "job_id": "rtj_01kkygkagq0jk17bx6y1w8c3df",
    "org_slug": "sandtrace",
    "project_slug": "web",
    "source": {
      "repo_url": "https://github.com/cc-consulting-nv/web.git",
      "owner": "cc-consulting-nv",
      "repo": "web",
      "ref": "refs/heads/main",
      "git_commit": "c13aa82903ea336cf3f21bdf2d930dc1a41f65cf"
    },
    "execution": {
      "working_directory": ".",
      "command": ["pnpm", "install"],
      "timeout_seconds": 300,
      "allow_network": true,
      "allow_exec": true
    }
  },
  "lease_expires_at": "2026-03-17T20:05:00Z"
}
Lease rules
- a lease is exclusive to one worker
- a lease must expire automatically
- workers must renew the lease while running long jobs
- expired leases return the job to queued or failed, depending on retry policy
- lease expiry should emit a job event
POST /v1/runtime/leases/{lease_id}/heartbeat
Renews the lease expiry while the job is still healthy.
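The lease rules and heartbeat renewal can be illustrated with an in-memory model. This is only a sketch: the `LeaseTable` class, the five-minute `LEASE_TTL`, the `lease-N` identifiers, and the `lease_expired` event name are assumptions, and a real orchestrator would back this with the database rather than Python lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LEASE_TTL = timedelta(minutes=5)  # assumed TTL; the spec does not fix a value


@dataclass
class Lease:
    lease_id: str
    job_id: str
    worker_id: str
    expires_at: datetime


class LeaseTable:
    def __init__(self):
        self.queued = []  # job ids waiting for a worker, oldest first
        self.active = {}  # lease_id -> Lease
        self.events = []  # (event_type, job_id) tuples
        self._counter = 0

    def claim(self, worker_id, now):
        """Give one queued job to one worker; a claimed job leaves the queue."""
        if not self.queued:
            return None
        job_id = self.queued.pop(0)
        self._counter += 1
        lease = Lease(f"lease-{self._counter}", job_id, worker_id, now + LEASE_TTL)
        self.active[lease.lease_id] = lease
        return lease

    def heartbeat(self, lease_id, now):
        """Extend a live lease; an unknown or already expired lease is refused."""
        lease = self.active.get(lease_id)
        if lease is None or lease.expires_at <= now:
            return False
        lease.expires_at = now + LEASE_TTL
        return True

    def expire(self, now, requeue=True):
        """Drop expired leases, emit an event, and requeue if policy allows."""
        for lease_id, lease in list(self.active.items()):
            if lease.expires_at <= now:
                del self.active[lease_id]
                self.events.append(("lease_expired", lease.job_id))
                if requeue:
                    self.queued.append(lease.job_id)
```

Passing `now` explicitly instead of reading the clock inside each method keeps the expiry logic deterministic and easy to test.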
POST /v1/runtime/leases/{lease_id}/complete
Marks worker execution complete and provides the ingest handoff details.
Example request:
{
  "result": {
    "status": "uploaded",
    "ingest_run_id": "run_20260317143329_bd39b25f6f31",
    "uploaded_at": "2026-03-17T20:03:20Z"
  }
}
POST /v1/runtime/leases/{lease_id}/fail
Marks the job failed and includes failure metadata.
Example request:
{
  "result": {
    "status": "failed",
    "reason": "sandbox_apply_failed",
    "message": "Namespace creation failed: EPERM"
  }
}
Result upload handoff
The worker should upload the final run result to sandtrace-ingest using the existing ingest contract instead of inventing a second result store.
Worker flow
- worker claims lease
- worker checks out repo
- worker executes sandtrace run
- worker uploads the normalized run payload to POST /v1/ingest/run
- worker records the returned run_id
- worker completes the lease with status=uploaded
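The worker flow above can be sketched as a single function that receives each step as a callable. Everything named here is hypothetical: `run_leased_job`, the five injected callables, and the `worker_error` failure reason are illustrative stand-ins, not part of any existing worker. The request bodies follow the complete and fail examples earlier in this spec, with `uploaded_at` omitted for brevity.

```python
def run_leased_job(lease, *, checkout, execute, upload_run, complete_lease, fail_lease):
    """Run one leased job end to end and report the outcome on the lease.

    `lease` is the response body from the lease endpoint. The callables are
    hypothetical hooks: checkout(source) -> workdir, execute(workdir, execution)
    -> run payload, upload_run(payload) -> run_id, and complete_lease /
    fail_lease(lease_id, body) for the two terminal lease calls.
    """
    job = lease["job"]
    try:
        workdir = checkout(job["source"])                 # check out the repo
        run_payload = execute(workdir, job["execution"])  # execute sandtrace run
        run_id = upload_run(run_payload)                  # upload to ingest
    except Exception as exc:
        # Assumed failure shape; "worker_error" is a placeholder reason.
        fail_lease(
            lease["lease_id"],
            {"result": {"status": "failed", "reason": "worker_error", "message": str(exc)}},
        )
        return None
    # Only report uploaded once ingest has returned a run id.
    complete_lease(
        lease["lease_id"],
        {"result": {"status": "uploaded", "ingest_run_id": run_id}},
    )
    return run_id
```

Injecting the steps keeps the ordering logic testable without a real checkout, sandbox, or network call.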
Required run upload metadata
The worker-generated upload should include:
- org_slug
- project_slug
- repo_url
- git_commit
- command
- trigger_kind
- worker_id
- job_id
job_id should be preserved inside the run payload metadata so the UI can link a hosted runtime job to the stored run record.
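As an illustration of how a worker might assemble this metadata from the leased job, here is a small sketch. The helper name `build_run_metadata` is hypothetical. Because the lease response example in this spec does not carry trigger information, `trigger_kind` is passed in separately here, which is an assumption about where the worker obtains it.

```python
REQUIRED_METADATA_FIELDS = (
    "org_slug", "project_slug", "repo_url", "git_commit",
    "command", "trigger_kind", "worker_id", "job_id",
)


def build_run_metadata(job, *, worker_id, trigger_kind):
    """Map a leased job payload onto the required run upload metadata."""
    metadata = {
        "org_slug": job["org_slug"],
        "project_slug": job["project_slug"],
        "repo_url": job["source"]["repo_url"],
        "git_commit": job["source"]["git_commit"],
        "command": list(job["execution"]["command"]),
        "trigger_kind": trigger_kind,
        "worker_id": worker_id,
        "job_id": job["job_id"],  # lets the UI link the job to the stored run
    }
    missing = [name for name in REQUIRED_METADATA_FIELDS if not metadata.get(name)]
    if missing:
        raise ValueError(f"missing run metadata: {', '.join(missing)}")
    return metadata
```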
Minimal database schema
The first implementation only needs three tables.
runtime_jobs
Suggested columns:
- id
- job_ulid
- org_slug
- project_slug
- source_kind
- repo_url
- repo_owner
- repo_name
- git_ref
- git_commit
- pull_request_number
- trigger_kind
- trigger_actor
- working_directory
- command_json
- timeout_seconds
- allow_network
- allow_exec
- status
- retry_of_job_ulid
- ingest_run_id
- failure_reason
- failure_message
- created_at
- started_at
- finished_at
Indexes:
- (org_slug, project_slug, created_at desc)
- (org_slug, git_commit)
- (status, created_at)
- unique (job_ulid)
runtime_job_events
Suggested columns:
- id
- job_ulid
- event_type
- actor_kind
- actor_id
- payload_json
- created_at
Purpose:
- audit trail
- debugging
- timeline rendering
runtime_worker_leases
Suggested columns:
- id
- lease_ulid
- job_ulid
- worker_id
- pool
- status
- leased_at
- expires_at
- completed_at
Indexes:
- unique (lease_ulid)
- (job_ulid, status)
- (worker_id, status)
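To make the three tables concrete, the following sketch creates them in an in-memory SQLite database. The column and index column lists come from this spec; the column types, NOT NULL constraints, defaults, index names, and the choice of SQLite are all assumptions for illustration, and the production database may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE runtime_jobs (
    id INTEGER PRIMARY KEY,
    job_ulid TEXT NOT NULL,
    org_slug TEXT NOT NULL,
    project_slug TEXT NOT NULL,
    source_kind TEXT NOT NULL,
    repo_url TEXT NOT NULL,
    repo_owner TEXT,
    repo_name TEXT,
    git_ref TEXT,
    git_commit TEXT NOT NULL,
    pull_request_number INTEGER,
    trigger_kind TEXT,
    trigger_actor TEXT,
    working_directory TEXT,
    command_json TEXT NOT NULL,
    timeout_seconds INTEGER NOT NULL,
    allow_network INTEGER NOT NULL DEFAULT 0,
    allow_exec INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'queued',
    retry_of_job_ulid TEXT,
    ingest_run_id TEXT,
    failure_reason TEXT,
    failure_message TEXT,
    created_at TEXT NOT NULL,
    started_at TEXT,
    finished_at TEXT
);
CREATE UNIQUE INDEX runtime_jobs_ulid ON runtime_jobs (job_ulid);
CREATE INDEX runtime_jobs_recent ON runtime_jobs (org_slug, project_slug, created_at DESC);
CREATE INDEX runtime_jobs_commit ON runtime_jobs (org_slug, git_commit);
CREATE INDEX runtime_jobs_status ON runtime_jobs (status, created_at);

CREATE TABLE runtime_job_events (
    id INTEGER PRIMARY KEY,
    job_ulid TEXT NOT NULL,
    event_type TEXT NOT NULL,
    actor_kind TEXT,
    actor_id TEXT,
    payload_json TEXT,
    created_at TEXT NOT NULL
);

CREATE TABLE runtime_worker_leases (
    id INTEGER PRIMARY KEY,
    lease_ulid TEXT NOT NULL,
    job_ulid TEXT NOT NULL,
    worker_id TEXT NOT NULL,
    pool TEXT,
    status TEXT NOT NULL,
    leased_at TEXT NOT NULL,
    expires_at TEXT NOT NULL,
    completed_at TEXT
);
CREATE UNIQUE INDEX runtime_worker_leases_ulid ON runtime_worker_leases (lease_ulid);
CREATE INDEX runtime_worker_leases_job ON runtime_worker_leases (job_ulid, status);
CREATE INDEX runtime_worker_leases_worker ON runtime_worker_leases (worker_id, status);
"""


def create_schema(conn):
    """Create the three orchestrator tables and their indexes."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Timestamps are stored as ISO 8601 text in this sketch because SQLite has no native timestamp type; a database with one would likely use it instead.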
Retry policy
The first version should stay conservative.
- no automatic retry when execution succeeded but the upload failed; these cases need operator review
- allow one automatic retry for worker crash or lease expiry
- do not retry permanent validation failures
- retries create a new runtime_jobs row with retry_of_job_ulid set
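The retry policy can be sketched as a function that either returns a new job row or declines. The reason strings `worker_crash` and `lease_expired`, the function name `plan_retry`, and the dict-based row are hypothetical; the spec does not define a failure reason vocabulary beyond its examples. "One automatic retry" is interpreted here as: a job that is itself a retry is never retried automatically.

```python
# Assumed failure reasons that qualify for the single automatic retry.
AUTO_RETRY_REASONS = {"worker_crash", "lease_expired"}


def plan_retry(job, failure_reason, new_job_ulid):
    """Return a new queued job row for an automatic retry, or None.

    `job` is a dict shaped like a runtime_jobs row. Upload failures and
    validation failures fall through to None, leaving them to an operator.
    """
    if failure_reason not in AUTO_RETRY_REASONS:
        return None
    if job.get("retry_of_job_ulid"):
        return None  # this job is already a retry; do not chain retries
    retry = dict(job)  # copy so the original terminal row is left untouched
    retry.update(
        job_ulid=new_job_ulid,
        retry_of_job_ulid=job["job_ulid"],
        status="queued",
        ingest_run_id=None,
        failure_reason=None,
        failure_message=None,
        started_at=None,
        finished_at=None,
    )
    return retry
```

Because the original row is copied rather than mutated, the failed job keeps its terminal state and failure details for the audit trail.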
UI implications
The product UI will need these read shapes later:
- recent hosted jobs per project
- job detail by job_id
- status badge for queued, running, uploaded, failed, canceled
- link from job detail to the uploaded run detail when ingest_run_id exists
That means the orchestrator should preserve ingest_run_id and terminal failure details from the first version onward.
MVP recommendations
The first implementation should:
- support only GitHub-backed jobs
- use one shared Linux worker pool
- allow one command per repo
- support only manual and pull-request triggers
- upload only the final normalized run payload
Do not add these yet:
- customer-provided worker images
- multi-step pipelines
- arbitrary environment variable passthrough
- private networking
- non-GitHub source providers
Relationship to current product behavior
Until the orchestrator exists:
- audit and sbom stay in standard CI
- run stays local or self-hosted on a privileged Linux environment
This spec is the bridge from that model to a hosted paid add-on.