DMDL Reference
Workflow
Workflows orchestrate the execution of your data transformations. They bring together your model, mappings, and connection profiles into an executable pipeline.
Workflow Structure
A workflow file has four main components:
- Workflow metadata - ID, name, and description of the workflow
- Model reference - Path to the data model file
- Mappings list - Ordered list of mapping files to execute
- Connection profile - Which database connection to use
workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into business-ready entities
  model: models/ecommerce-model.yaml
  mappings:
    - mappings/customer-mapping.yaml
    - mappings/order-mapping.yaml
  connection: dev
Tip: Generate a workflow template with daana-cli generate workflow -o workflow.yaml to get started quickly.
Workflow Metadata
Every workflow needs identifying information:
workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into analytics-ready format
  description: |
    This workflow processes customer and order data from the
    operational database and transforms it into business entities
    for analytics and reporting.
| Field | Purpose |
|---|---|
| id | Unique identifier. Used to reference the workflow in commands. |
| name | Human-readable name. Displayed in logs and status reports. |
| definition | One-line summary of what the workflow does. |
| description | Optional detailed explanation with business context. Supports multi-line text. |
Tip: Choose a workflow ID that clearly reflects the business domain (e.g., ECOMMERCE_WORKFLOW, customer_analytics).
Model Reference
The model field is the path to your data model file (relative to your project):
workflow:
  model: models/ecommerce-model.yaml
You can reference either:
- YAML files (.yaml, .yml) - Will be auto-compiled during workflow execution
- JSON files (.json) - Pre-compiled model for production use
# Development: use YAML for easier editing
model: models/ecommerce-model.yaml
# JSON is optional if you already have a compiled model
model: models/ecommerce-model.json
Best Practice: Use YAML unless you have a specific reason to maintain pre-compiled JSON files.
Mappings
The mappings field is an ordered list of mapping file paths:
workflow:
  mappings:
    - mappings/customer-mapping.yaml
    - mappings/product-mapping.yaml
    - mappings/order-mapping.yaml
    - mappings/order-item-mapping.yaml
Mapping Order Matters
Mappings execute in the order listed. When entities have relationships (foreign keys), list parent entities before child entities:
mappings:
  # 1. Independent entities first (no foreign keys)
  - mappings/customer-mapping.yaml
  - mappings/product-mapping.yaml
  # 2. Then entities that reference the above
  - mappings/order-mapping.yaml  # References CUSTOMER
  # 3. Finally, entities that reference those
  - mappings/order-item-mapping.yaml  # References ORDER and PRODUCT
Rule: If entity A has a relationship to entity B, mapping B must come before mapping A in the list.
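The ordering rule can be checked mechanically before a run. The following is a minimal sketch, not part of daana-cli: the function name `find_order_violations` and the hand-written `references` dictionary are assumptions made for illustration, with the entity names taken from the example above.

```python
from typing import Dict, List, Set


def find_order_violations(order: List[str], references: Dict[str, Set[str]]) -> List[str]:
    """Return one message for every entity that is listed before an entity it references."""
    position = {entity: index for index, entity in enumerate(order)}
    violations = []
    for entity in order:
        for parent in sorted(references.get(entity, set())):
            if parent not in position:
                violations.append(f"{entity} references {parent}, which has no mapping in the list")
            elif position[parent] > position[entity]:
                violations.append(f"{parent} must come before {entity}")
    return violations


# Hypothetical relationship data mirroring the example above:
# ORDER references CUSTOMER; ORDER_ITEM references ORDER and PRODUCT.
references = {
    "ORDER": {"CUSTOMER"},
    "ORDER_ITEM": {"ORDER", "PRODUCT"},
}
good_order = ["CUSTOMER", "PRODUCT", "ORDER", "ORDER_ITEM"]
bad_order = ["ORDER", "CUSTOMER", "PRODUCT", "ORDER_ITEM"]

print(find_order_violations(good_order, references))
print(find_order_violations(bad_order, references))
```

An empty result means every referenced entity is mapped earlier in the list than the entities that depend on it.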
Connection Profile
The connection field specifies which database connection profile to use:
workflow:
  connection: dev
This references a named profile from your connections.yaml file:
# connections.yaml
connections:
  dev:
    type: postgresql
    host: localhost
    port: 5432
    user: dev
    password: devpass
    database: customerdb
  production:
    type: postgresql
    host: prod.example.com
    # ...
See Connections for details on configuring connection profiles.
Batch Processing
For large datasets, configure batch processing in the advanced section. This enables incremental data loading for pipelines using ingestion_strategy: INCREMENTAL or TRANSACTIONAL.
Basic Batch Configuration
Set batch_expression to the timestamp column that tracks new data in your source tables. Daana auto-generates the filter; no SQL is required.
workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms e-commerce data with incremental loading
  model: models/ecommerce-model.yaml
  mappings:
    - mappings/order-mapping.yaml
  connection: production
  advanced:
    batch_expression: updated_at
batch_expression
The timestamp column used to filter source data during incremental processing. When set, the framework auto-generates the batch filter SQL.
advanced:
  batch_expression: updated_at
Common choices:
- updated_at - Standard timestamp column
- ingest_ts - Ingestion timestamp from data lake
- batch_date - Batch date for periodic loads
Per-Table Override
When your source tables use different timestamp column names, override batch_expression on individual tables in the mapping YAML:
# mapping.yaml
tables:
  - table: batch.daily_usage
    batch_expression: batch_date  # This table uses batch_date
    ingestion_strategy: INCREMENTAL
    attributes:
      - id: USAGE_MINUTES
        transformation_expression: minutes_watched
  - table: streaming.events
    # No batch_expression; uses workflow default (updated_at)
    ingestion_strategy: TRANSACTIONAL
    attributes:
      - id: EVENT_TYPE
        transformation_expression: event_type
Which Pipelines Use Batch Filtering?
| Ingestion Strategy | Batch filter applied? | Why |
|---|---|---|
| INCREMENTAL | Yes | Only reads new data via watermark |
| TRANSACTIONAL | Yes | No delta detection; without filtering, rows are duplicated |
| FULL | No | Must read all data for correct change detection |
| FULL_LOG | No | Must read all data for soft-delete detection |
IDFR pipelines always apply batch filtering regardless of strategy; they skip previously registered identifiers.
Important: When setting batch_expression at workflow level, ensure the column exists on all source tables, including those used by multi-IDFR entities. If a source table lacks the column, the pipeline will fail with a SQL error. Use per-table batch_expression overrides when different tables have different timestamp columns, or omit the workflow-level setting and only configure it per-table.
Automatic Batch Tracking
Daana tracks every pipeline run in a metadata table (batch_history). Each row records the batch window and a status lifecycle: R (running) → C (complete) or F (failed). The next run's lower bound (its watermark) comes from this table, so continuity between runs is always correct as long as a previous run completed successfully.
How the watermark is resolved
Before each pipeline runs, the framework queries batch_history for that pipeline's proc_key:
| State of batch_history for this proc_key | Resulting watermark |
|---|---|
| No rows (first run) | Epoch (1970-01-01T00:00:00Z); process all historical data |
| At least one C row | The end_tmstp of the most-recently-completed C row |
| Rows exist but no C (only F/R) | Strategy-dependent; see "Aborted pipeline recovery" below |
The watermark is computed per-pipeline, not per-workflow. If one pipeline aborts, sibling pipelines in the same workflow continue.
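The three cases in the table can be read as a small decision function. The sketch below is illustrative rather than the framework's actual implementation: it assumes the batch_history rows for one proc_key are available as (status, end_tmstp) pairs ordered from oldest to newest, and it returns None for the strategy-dependent case instead of deciding it.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def resolve_watermark(history: List[Tuple[str, datetime]]) -> Optional[datetime]:
    """Resolve the lower bound for the next run of a single pipeline.

    history holds (status, end_tmstp) pairs for one proc_key, oldest first
    (an assumption of this sketch). Returns None when rows exist but none
    completed, because that outcome depends on the ingestion strategy.
    """
    if not history:
        return EPOCH  # first run: process all historical data
    for status, end_tmstp in reversed(history):
        if status == "C":
            return end_tmstp  # most recently completed run
    return None  # only F/R rows: strategy-dependent


jan = datetime(2024, 1, 31, tzinfo=timezone.utc)
feb = datetime(2024, 2, 29, tzinfo=timezone.utc)

print(resolve_watermark([]))
print(resolve_watermark([("C", jan), ("C", feb), ("F", feb)]))
print(resolve_watermark([("F", jan), ("R", jan)]))
```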
--batch-start and --batch-end flags
The explicit batch-window flags must always be used as a pair:
- Both omitted (default): the watermark is the lower bound; NOW is the upper bound. This is the routine incremental-loading mode.
- Both specified: the explicit values are used and written into batch_history. The next run's watermark anchors at this run's --batch-end.
- Only one specified: rejected with a clear error. A half-defined window can silently miss data or rewind the watermark.
- Both specified, --batch-end <= --batch-start: rejected. Zero-width or transposed windows are almost always operator typos.
# Routine: continues from last successful batch
daana-cli execute
# Backfill: explicit window (must include both flags; end > start)
daana-cli execute --batch-start "2024-01-01T00:00:00Z" --batch-end "2024-01-31T00:00:00Z"
# Reset and reload: drop affected data, then resume from epoch
daana-cli execute --full-refresh --entity ORDER --yes
daana-cli execute # next run starts from epoch for the refreshed scope
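The pairing rules for --batch-start and --batch-end amount to a four-way check. A minimal sketch of that logic follows, assuming the watermark and the current time are already known; `resolve_batch_window` and its error messages are hypothetical and are not the CLI's own output.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def resolve_batch_window(
    watermark: datetime,
    now: datetime,
    batch_start: Optional[datetime] = None,
    batch_end: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return the (lower, upper) bounds of the batch window or raise ValueError."""
    if batch_start is None and batch_end is None:
        return watermark, now  # routine incremental mode
    if batch_start is None or batch_end is None:
        raise ValueError("--batch-start and --batch-end must be specified together")
    if batch_end <= batch_start:
        raise ValueError("--batch-end must be later than --batch-start")
    return batch_start, batch_end  # explicit backfill window


watermark = datetime(2024, 2, 1, tzinfo=timezone.utc)
now = datetime(2024, 2, 2, tzinfo=timezone.utc)
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 31, tzinfo=timezone.utc)

print(resolve_batch_window(watermark, now))
print(resolve_batch_window(watermark, now, start, end))
```

Rejecting a half-specified window up front, rather than defaulting the missing side, is what prevents the silent data gaps described in the list above.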
Aborted pipeline recovery
When batch_history has rows for a pipeline but none with status C (e.g., after a series of failures), the framework refuses to silently rewind to epoch. Behavior depends on the pipeline's ingestion strategy:
| Strategy | Behavior when prior runs failed but none completed |
|---|---|
| INCREMENTAL, TRANSACTIONAL | Pipeline aborts; the workflow exits non-zero with the affected proc_key/entity in the error message. Sibling pipelines continue. |
| FULL, FULL_LOG | Logs a warning and proceeds with epoch as the watermark. Truedelta treats idempotent re-reads as no-ops, so the load remains safe. |
Recovery is operator-driven. Pick one of:
- --full-refresh --entity <ID>: clears the affected entity's data and re-executes in the same invocation. See Full refresh below for the operational details across all three scopes.
- --full-refresh (no --entity): the same recovery model applied to every entity in the model. See Full refresh below.
- Manual surgery on batch_history: set the latest F row to C (with the correct end_tmstp) if the underlying load actually completed but the bookkeeping row didn't.
Full refresh
daana-cli execute --full-refresh resets data tables and reloads from source in one invocation. The flag has three scopes: model, entity, and attribute. See execute in the CLI reference for the precise structural definition of each scope, including what gets deleted and what's preserved.
This subsection covers when to use each scope and what to expect operationally.
Model-level full refresh (--full-refresh with no --entity) resets every entity in the model. Use it when:
- You changed the model in a way that affects every entity (e.g., reworked the IDFR strategy, renamed framework columns, switched mapping conventions for every entity), and existing data needs to be re-derived from source.
- You deployed a new model for the first time and want a clean initial backfill across the whole warehouse.
- Every pipeline is in a bad state. This is rare; usually one entity or one pipeline is the underlying cause, and --full-refresh --entity <ID> is the narrower tool.
What to expect operationally:
- All pipelines re-run from epoch. The source's full history is read for every entity. For sources with deep history, this can be a large initial load.
- INCREMENTAL and TRANSACTIONAL pipelines behave like first-time runs after the reset. There is no in-flight state to preserve.
- Layer 3 consumers see partial state during the refresh window. Entities reload in workflow order. For consistent reads against a model-level refresh, pause downstream jobs until the workflow exits zero.
- The operation is idempotent. Re-running --full-refresh --yes after a successful refresh deletes against already-empty tables (no-op) and reloads the same data again.
- Framework infrastructure is preserved: semantic views, functions, and metadata table structure stay in place. There is no need to re-run install or deploy afterward.
daana-cli execute --full-refresh --yes
Entity-level full refresh (--full-refresh --entity <ID>) scopes the reset to one entity and its relationships. The same operational expectations apply, scoped down. Use it when the affected scope is a single entity (the typical case for aborted-pipeline recovery, schema migrations that only touch one entity, or per-entity backfills).
Attribute-level full refresh (--full-refresh --entity <ID> --attribute <NAME>) scopes the reset to one attribute of one entity. Use it when the affected scope is narrower still: a single attribute changed (added a transformation, fixed a mapping bug for one column) and the rest of the entity's data is correct.
--yes bypasses the interactive confirmation prompt and is required for non-interactive runs (CI, scheduled jobs).
Runbook contract: machine-readable diagnostics
When run with --no-tui, an aborted-pipeline diagnostic is emitted as a single JSON line on stdout in addition to the human-readable error on stderr:
daana-cli execute --no-tui
# stdout (one line per aborted pipeline):
# {"type":"incomplete_history","proc_key":...,"entity_id":"ORDER","investigate_sql":"SELECT ...",...}
Tooling can grep stdout line-by-line for "type":"incomplete_history" and parse each match. The schema is wire-stable; new diagnostic types will be added with new type values rather than mutating existing ones.
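A runbook script might filter the captured stdout along these lines. This is a sketch, not an official client: `collect_incomplete_history` is a hypothetical helper, and the sample line is invented for illustration, using only field names that appear in the example above (the proc_key value and SQL text are placeholders).

```python
import json
from typing import Dict, Iterable, List


def collect_incomplete_history(stdout_lines: Iterable[str]) -> List[Dict]:
    """Return the parsed diagnostics whose type is 'incomplete_history'."""
    diagnostics = []
    for line in stdout_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object line; skip ordinary output
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; skip it
        if isinstance(record, dict) and record.get("type") == "incomplete_history":
            diagnostics.append(record)
    return diagnostics


# Invented sample output; real lines carry more fields than shown here.
sample_stdout = [
    "some ordinary log line",
    '{"type":"incomplete_history","proc_key":"example-proc-key","entity_id":"ORDER","investigate_sql":"SELECT 1"}',
    '{"type":"some_future_diagnostic","entity_id":"CUSTOMER"}',
]

for diagnostic in collect_incomplete_history(sample_stdout):
    print(diagnostic["entity_id"], diagnostic["investigate_sql"])
```

Matching on the parsed type value rather than on a raw substring keeps the script from breaking when new diagnostic types with other type values appear.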
Other flags
# Force: override stale run detection (a stuck 'R' row past stale-timeout)
daana-cli execute --force
Workflow Commands
Generate Template
Create a new workflow file from a template:
daana-cli generate workflow -o workflow.yaml
Check Workflow
Validate your workflow configuration before deploying:
daana-cli check workflow
This validates:
- Workflow YAML structure
- Model file exists and is valid
- Mapping files exist and are valid
- Connection profile exists
Execute Workflow
Run the data transformation:
# Routine execution (continues from the last completed batch)
daana-cli execute
# With an explicit batch window (both flags are required; end must be later than start)
daana-cli execute --batch-start "2024-01-01" --batch-end "2024-01-31"
Complete Example
Here's a complete workflow showing all components together:
workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into analytics-ready format
  description: |
    This workflow processes customer and order data from the
    operational database and transforms it into business entities
    for analytics and reporting.
  model: models/ecommerce-model.yaml
  mappings:
    - mappings/customer-mapping.yaml
    - mappings/product-mapping.yaml
    - mappings/order-mapping.yaml
    - mappings/order-item-mapping.yaml
  connection: production
  advanced:
    batch_expression: updated_at
Quick Reference
Looking up a specific field? Here's the complete reference for all fields in workflow.yaml.
Workflow Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Unique identifier for the workflow |
| name | string | ✓ | Human-readable name for the workflow |
| definition | string | ✓ | One-line description of the workflow purpose |
| description | string | ○ | Detailed description of the workflow |
| connection | string | ✓ | Connection profile to use for execution |
| model | ModelReference | ✓ | Path to the data model file |
| mappings | list of string | ✓ | List of mapping files to execute ⚠ Order matters: list parent entities before children |
| advanced | object | ○ | Advanced workflow settings |
| └advanced.batch_expression | string | ○ | Batch filtering expression for incremental source reads ⚠ Per-table override: set batch_expression on individual tables in your mapping YAML ⚠ If value contains ${BATCH_START} or ${BATCH_END}, it's used as raw SQL (auto-detected) ⚠ FULL and FULL_LOG pipelines ignore this setting (they read all data) |
| └advanced.batch_stale_timeout | string | ○ | Timeout for stale pipeline run detection ⚠ Uses Go duration format (e.g., 8h, 30m, 1h30m) ⚠ Use --force to override stale detection during execution |
✓ = required, ○ = optional
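For readers unfamiliar with the Go duration format used by batch_stale_timeout, the sketch below parses the common forms into a Python timedelta. It covers only a subset of what Go's time.ParseDuration accepts (unsigned numbers with the units ns, us, ms, s, m, h), and `parse_go_duration` is a hypothetical helper, not something daana-cli exposes.

```python
import re
from datetime import timedelta

# Seconds per unit for the subset of Go duration units handled here.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# Multi-letter units come first so "ms" is not read as "m" followed by a stray "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def parse_go_duration(text: str) -> timedelta:
    """Parse strings such as '8h', '30m', or '1h30m' into a timedelta."""
    position = 0
    total_seconds = 0.0
    while position < len(text):
        match = _TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0:
        raise ValueError("empty duration")
    # timedelta has microsecond resolution, so sub-microsecond parts are rounded.
    return timedelta(seconds=total_seconds)


for value in ["8h", "30m", "1h30m"]:
    print(value, "->", parse_go_duration(value))
```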
Best Practices
- Use descriptive workflow IDs - Choose names that reflect the business domain (e.g., ECOMMERCE_WORKFLOW, CUSTOMER_ANALYTICS)
- Order mappings by dependency - Independent entities first, then dependent ones
- Validate before deploying - Always run daana-cli check workflow before deployment
- Configure batch processing - For large datasets, use INCREMENTAL ingestion with batch settings
- Use environment-specific connections - Create separate connection profiles for dev, staging, and production
- Document your workflows - Use the description field for detailed documentation
- Keep models in YAML - Unless you have a specific reason to maintain compiled JSON files
- Test incrementally - Test incremental logic with small date ranges first
Next Steps
- Configure connections to your data sources
- Create mappings to define data transformations
- Follow the tutorial for a complete end-to-end example