Expand DataRecord with first-class provenance fields; remove all hidden scoping from data access
Opened by stack72 · 4/7/2026· GitHub #1123
Problem
Swamp has three data access paths today, and they exist because each one tries to enforce a different implicit scope:
| Path | Scope it enforces | How |
|---|---|---|
| `context.readModelData(name, spec?)` | Current workflow run | `DataAccessService` filters on `ownerDefinition.workflowRunId` |
| `data.findBySpec(name, spec)` | Current workflow run | `ModelResolver` delegate reads `context.workflowRunId` and filters server-side |
| `context.queryData(pred)` / `data.query(pred)` | Current workflow run (in driver) / nothing (in CEL) | Driver string-concatenates `&& tags.workflowRunId == "${id}"` into the predicate; CEL delegate does nothing |
This is the root of the 15+ scoping bugs filed against data access (#914, #966, #987, #1020, #1023, #1058, #1066, #1105, #1113, #497, etc.). Every fix has been a one-off because the underlying problem isn't any particular filter; it's that scoping is hidden inside the framework. A predicate that looks like it should return all matching data quietly returns a workflow-scoped subset, because some delegate or driver wrapper added a clause the author can't see. Different callers see different results from the same call.
The fix is not to add more scoping options, a scope parameter, or a magic ctx variable bound into the predicate environment. The fix is to remove hidden scoping entirely and make the data shape rich enough that any filter the author wants is expressible — and visible — in the predicate string itself.
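To make the failure mode concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: `ScopedDemoRecord`, `store`, `query`, and `scopedQuery` are illustrative stand-ins, and predicates are modeled as plain functions rather than predicate strings. It only shows how a wrapper that appends a hidden clause makes the same predicate return different results depending on the path it is called through.

```typescript
// Hypothetical, simplified record: only the fields needed for the demo.
interface ScopedDemoRecord {
  name: string;
  modelName: string;
  workflowRunId: string; // "" when not produced inside a workflow
}

type Pred = (r: ScopedDemoRecord) => boolean;

const store: ScopedDemoRecord[] = [
  { name: "ep-1", modelName: "dedup", workflowRunId: "run-1" },
  { name: "ep-2", modelName: "dedup", workflowRunId: "run-2" },
  { name: "ep-3", modelName: "dedup", workflowRunId: "" },
];

// Pass-through query: returns exactly what the predicate asks for.
function query(pred: Pred): ScopedDemoRecord[] {
  return store.filter(pred);
}

// Stand-in for a driver wrapper that silently adds a run filter.
function scopedQuery(pred: Pred, currentRunId: string): ScopedDemoRecord[] {
  return query((r) => pred(r) && r.workflowRunId === currentRunId);
}

const allDedup: Pred = (r) => r.modelName === "dedup";

// Same predicate, two paths, two different answers.
const direct = query(allDedup).map((r) => r.name); // ep-1, ep-2, ep-3
const viaDriver = scopedQuery(allDedup, "run-1").map((r) => r.name); // ep-1 only
```

Nothing at the `scopedQuery(allDedup, ...)` call site says that two of the three records will be dropped, which is the bug class in miniature.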
Design principles
- Extensions see everything by default. No filtering happens unless the author wrote the clause.
- The method signature is the contract. Reading `data.query('modelName == "dedup"')` should tell you the entire scope. There is no other filter being applied behind the scenes.
- Provenance is data, not metadata. Workflow run, job, step, and source are properties of a data record, not contextual scoping rules. They belong as first-class queryable fields.
- Filtering is the author's job. The framework gives you queryable fields. You write the predicate. If you want "data from this workflow run", you write `workflowRunId == "..."`. If you want "everything ever produced by this model", you write `modelName == "..."`. If you want "data from the dedup step of this run", you write `stepName == "dedup" && workflowRunId == "..."`.
Proposed solution
1. Promote provenance fields out of tags / ownerDefinition into first-class DataRecord fields
Today the data writer (`data_writer.ts:509-527`) merges these into `data.tags`:
- `specName` (auto-injected)
- `modelName` (auto-injected, originally for orphan recovery #370)
- `type` (auto-injected: "resource" or "file")
- Workflow tag overrides from `execution_service.ts:440-446`: `source`, `workflow`, `workflowRunId`, `step` (and presumably `job`)
And ownerDefinition separately carries ownerType, ownerRef, and (redundantly) workflowRunId.
tags is supposed to be the user-defined tag namespace. Today it's a junk drawer where the framework smuggles ownership/provenance metadata so that predicates can reach it (since that's the only field-of-maps available in QUERY_FIELDS). This is the wrong place for it.
Promote everything to first-class DataRecord fields:
```typescript
interface DataRecord {
  // existing
  id: string;
  name: string;
  version: number;
  createdAt: string;
  attributes: Record<string, unknown>;
  modelName: string; // already first-class
  modelType: string;
  specName: string; // already first-class
  dataType: string;
  contentType: string;
  lifetime: string;
  ownerType: string;
  streaming: boolean;
  size: number;
  content: string;

  // promoted from ownerDefinition / framework-managed tags
  ownerRef: string; // from ownerDefinition.ownerRef
  workflowRunId: string; // "" when not produced inside a workflow
  workflowName: string; // "" when not produced inside a workflow
  jobName: string; // "" when not produced inside a workflow step
  stepName: string; // "" when not produced inside a workflow step
  source: string; // e.g. "step-output", "manual", etc.

  // user-defined only
  tags: Record<string, string>;
}
```

Add corresponding columns to the catalog schema with appropriate indexes (`workflow_run_id`, `step_name`, etc.). Add the new field names to `QUERY_FIELDS` in `query_predicate.ts`.
The data writer stops smuggling these into tags. tags becomes purely user-defined, restoring its intended meaning.
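As a rough illustration of the promotion, the sketch below splits a legacy record's tags into promoted provenance fields and user-defined tags. The `LegacyRecord` and `Provenance` types, the `promoteProvenance` helper, and the `FRAMEWORK_TAG_KEYS` set are all hypothetical. The mapping of the `workflow`, `job`, and `step` tags onto `workflowName`, `jobName`, and `stepName` is an assumption, as is the fallback order for `workflowRunId`.

```typescript
// Hypothetical legacy shape: provenance lives in tags and ownerDefinition.
interface LegacyRecord {
  tags: Record<string, string>;
  ownerDefinition: { ownerType: string; ownerRef: string; workflowRunId?: string };
}

// The promoted provenance fields from the proposed DataRecord.
interface Provenance {
  ownerRef: string;
  workflowRunId: string;
  workflowName: string;
  jobName: string;
  stepName: string;
  source: string;
  tags: Record<string, string>; // user-defined only
}

// Tag keys assumed to be framework-managed today (taken from the list above).
const FRAMEWORK_TAG_KEYS = new Set([
  "specName", "modelName", "type",
  "source", "workflow", "workflowRunId", "step", "job",
]);

function promoteProvenance(legacy: LegacyRecord): Provenance {
  const { tags, ownerDefinition } = legacy;
  // Keep only tags the framework did not inject.
  const userTags: Record<string, string> = {};
  for (const [key, value] of Object.entries(tags)) {
    if (!FRAMEWORK_TAG_KEYS.has(key)) userTags[key] = value;
  }
  return {
    ownerRef: ownerDefinition.ownerRef,
    // Assumption: prefer the tag, fall back to ownerDefinition, else "".
    workflowRunId: tags["workflowRunId"] ?? ownerDefinition.workflowRunId ?? "",
    workflowName: tags["workflow"] ?? "",
    jobName: tags["job"] ?? "",
    stepName: tags["step"] ?? "",
    source: tags["source"] ?? "",
    tags: userTags,
  };
}

// Example input; the ownerType value is made up for illustration.
const promoted = promoteProvenance({
  tags: { modelName: "dedup", step: "dedup", workflowRunId: "run-1", encode: "reencode" },
  ownerDefinition: { ownerType: "model", ownerRef: "dedup" },
});
// promoted.tags now holds only the user-defined "encode" tag.
```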
2. Remove all hidden scoping logic
Once provenance fields are first-class, the code paths that hide scoping behind delegates and wrappers are unnecessary and should be removed:
`raw_execution_driver.ts:142-149`: delete the `queryData` wrapper. Pass `dataQueryService.query` through directly. If an extension wants workflow scoping it writes `&& workflowRunId == "..."` itself.

```typescript
// before
const queryData = this.context.queryData && workflowRunId
  ? (predicate, select?) => {
      const scopedPredicate = `(${predicate}) && tags.workflowRunId == "${workflowRunId}"`;
      return this.context.queryData!(scopedPredicate, select);
    }
  : this.context.queryData;

// after
const queryData = this.context.queryData;
```

`model_resolver.ts:632-685`: `findBySpec` becomes a thin pass-through to `data.query` (or is deleted entirely once callers migrate). Its `runId` filter (lines 657-661) is removed. `data.query` is already a pass-through; nothing to change there.

`DataAccessService.readModelData`: the `workflowRunId` filter (lines 149-154) is removed. Since this whole method is going to be retired in favor of `queryData`, the cleanup happens during caller migration.
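One possible shape for the thin `findBySpec` pass-through is sketched below. `QueryFn`, `specPredicate`, and `makeFindBySpec` are hypothetical names, not the actual `ModelResolver` code, and the sketch assumes the model and spec names contain no quote characters.

```typescript
// Hypothetical minimal type for a predicate-string query function.
type QueryFn<T> = (predicate: string) => Promise<T[]>;

// Builds the predicate findBySpec stands for, with no run filter added.
function specPredicate(modelName: string, specName: string): string {
  return `modelName == "${modelName}" && specName == "${specName}"`;
}

// Thin pass-through: the only clauses are the ones visible here.
function makeFindBySpec<T>(query: QueryFn<T>) {
  return (modelName: string, specName: string): Promise<T[]> =>
    query(specPredicate(modelName, specName));
}

// A fake query that records the predicate it receives.
const seenPredicates: string[] = [];
const fakeQuery: QueryFn<unknown> = (predicate) => {
  seenPredicates.push(predicate);
  return Promise.resolve([]);
};

const findBySpec = makeFindBySpec(fakeQuery);
void findBySpec("dedup", "episode");
// seenPredicates[0] is: modelName == "dedup" && specName == "episode"
```

Because the delegate no longer reads a run id from context, the predicate it forwards is the same no matter where it is called from.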
3. Predicate examples after the change
| What you want | Predicate |
|---|---|
| Every episode the dedup model has ever produced | `modelName == "dedup" && specName == "episode"` |
| Episodes the dedup model produced in workflow run X | `modelName == "dedup" && specName == "episode" && workflowRunId == "X"` |
| Episodes from the dedup step in run X | `stepName == "dedup" && workflowRunId == "X"` |
| Everything produced by manual `swamp model method run` invocations (no workflow) | `workflowRunId == ""` |
| Everything ever produced by a model, regardless of source | `modelName == "X"` |
| Episodes where a user-defined tag says it's a re-encode | `modelName == "dedup" && tags.encode == "reencode"` |
Each predicate is self-contained. Reading the line tells you exactly what data will come back. There is no driver, delegate, or context wrapper changing the answer behind the author's back.
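The toy evaluator below runs a few of these predicates against sample records to show that the result depends only on the predicate string. It is not the real predicate engine: `PredicateDemoRecord`, `fieldValue`, and `matches` are hypothetical, and the evaluator handles only `&&`-joined `field == "literal"` clauses, treating a missing tag as `""`. Field names follow the proposed `DataRecord` interface.

```typescript
// Hypothetical record subset; field names follow the proposed DataRecord.
interface PredicateDemoRecord {
  name: string;
  modelName: string;
  specName: string;
  workflowRunId: string;
  stepName: string;
  tags: Record<string, string>;
}

// Resolves "modelName" or "tags.encode" style paths; missing values become "".
function fieldValue(rec: PredicateDemoRecord, path: string): string {
  if (path.startsWith("tags.")) return rec.tags[path.slice(5)] ?? "";
  const value = (rec as unknown as Record<string, unknown>)[path];
  return typeof value === "string" ? value : "";
}

// Evaluates only conjunctions of `field == "literal"` clauses.
// Literals containing "&&" or quotes are not supported by this toy parser.
function matches(rec: PredicateDemoRecord, predicate: string): boolean {
  return predicate.split("&&").every((clause) => {
    const m = /^\s*([\w.]+)\s*==\s*"([^"]*)"\s*$/.exec(clause);
    if (!m) throw new Error(`unsupported clause: ${clause}`);
    return fieldValue(rec, m[1]) === m[2];
  });
}

const records: PredicateDemoRecord[] = [
  { name: "a", modelName: "dedup", specName: "episode", workflowRunId: "X", stepName: "dedup", tags: {} },
  { name: "b", modelName: "dedup", specName: "episode", workflowRunId: "", stepName: "", tags: { encode: "reencode" } },
  { name: "c", modelName: "fetch", specName: "feed", workflowRunId: "X", stepName: "fetch", tags: {} },
];

const names = (predicate: string): string[] =>
  records.filter((r) => matches(r, predicate)).map((r) => r.name);

const everyEpisode = names('modelName == "dedup" && specName == "episode"'); // a, b
const inRunX = names('modelName == "dedup" && specName == "episode" && workflowRunId == "X"'); // a
const manualOnly = names('workflowRunId == ""'); // b
const reencodes = names('modelName == "dedup" && tags.encode == "reencode"'); // b
```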
What this enables
The two callers that drove this refactor become trivial:
mms_dedup.ts:194 (extension method):
```typescript
// before
const items = await context.readModelData(args.sourceModel, "episode");

// after
const items = await context.queryData(
  `modelName == "${args.sourceModel}" && specName == "episode"`
);
```

The author decides whether to also filter by `workflowRunId`, `stepName`, etc. Nothing happens implicitly.
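Since callers now build predicate strings by interpolation, a value containing a double quote could change the meaning of the predicate. A small quoting helper is one way to guard against that. The sketch below is an assumption on two counts: `quoteLiteral` and `modelSpecPredicate` are hypothetical helpers, and it assumes the predicate language accepts backslash escapes for `\` and `"` inside double-quoted string literals.

```typescript
// Escapes a value for use inside a double-quoted predicate string literal.
// Assumption: the predicate language accepts backslash escapes for \ and ".
function quoteLiteral(value: string): string {
  const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `"${escaped}"`;
}

// Builds the same predicate as the "after" example, with quoted values.
function modelSpecPredicate(modelName: string, specName: string): string {
  return `modelName == ${quoteLiteral(modelName)} && specName == ${quoteLiteral(specName)}`;
}

// Ordinary values produce the predicate unchanged.
const plainPredicate = modelSpecPredicate("dedup", "episode");

// Embedded quotes are escaped, so the value stays a single string literal.
const hostilePredicate = modelSpecPredicate('x" || modelName != "', "episode");
```

With a helper like this, the call in the extension would read `context.queryData(modelSpecPredicate(args.sourceModel, "episode"))`, and the scope is still fully visible at the call site.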
workflows/discover-and-download.yaml and eztv-check.yaml (workflow forEach):
```yaml
# before
in: ${{ data.findBySpec("dedup", "episode") }}

# after: author writes the scoping they actually want
in: ${{ data.query('modelName == "dedup" && specName == "episode" && workflowRunId == "' + workflow.runId + '"') }}
```

The workflow YAML author has to know about `workflow.runId` and concatenate it themselves. That's fine; it's explicit. (If `workflow.runId` isn't already exposed in the workflow expression context, that's a small separate change to make it available as a regular variable, not as a magic predicate binding.)
Why not a scope option, or a ctx predicate variable, or auto-injection
All of these were considered and rejected:
- `scope: { workflowRunId }` option on `DataQueryOptions`: same hidden behavior, same magic, just relocated from the driver wrapper into the service. Users still can't see the filter from the call site. Future requirements (parent run scoping, time windows, etc.) require new options matrix entries.
- Magic `ctx` variable bound into the predicate environment: better than auto-injection but still hidden, because the predicate becomes context-dependent in a way that's not obvious from reading the call. Same predicate string returns different results in different contexts. Same problem.
- Auto-injecting clauses at any layer: this is what `findBySpec` and the driver wrapper do today. It is the source of the bug class.
The point of this refactor is that the predicate string IS the contract. If it doesn't say it, it isn't happening.
Out of scope for this issue
- Migration of existing extensions / workflow YAML to the new API (separate work, dependent on this)
- Deprecation/removal of `readModelData`, `findBySpec`, `DataAccessService` (separate work, dependent on this)
- Vault reference resolution in query results (not needed for current callers)
- Orphan data recovery (not needed for current callers; #370 was the original justification but `modelName` being a first-class field would handle the same use case explicitly)
- Exposing additional workflow expression variables (`workflow.runId`, etc.): small follow-up if not already present
Automoved by swampadmin from GitHub issue #1123