DMDL Reference

Workflow

Workflows orchestrate the execution of your data transformations. They bring together your model, mappings, and connection profiles into an executable pipeline.

Workflow Structure

A workflow file has four main components:

  1. Workflow metadata - ID, name, and description of the workflow
  2. Model reference - Path to the data model file
  3. Mappings list - Ordered list of mapping files to execute
  4. Connection profile - Which database connection to use

workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into business-ready entities

  model: models/ecommerce-model.yaml

  mappings:
    - mappings/customer-mapping.yaml
    - mappings/order-mapping.yaml

  connection: dev

Tip: Generate a workflow template with daana-cli generate workflow -o workflow.yaml to get started quickly.

Workflow Metadata

Every workflow needs identifying information:

workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into analytics-ready format
  description: |
    This workflow processes customer and order data from the
    operational database and transforms it into business entities
    for analytics and reporting.

Field | Purpose
id | Unique identifier. Used to reference the workflow in commands.
name | Human-readable name. Displayed in logs and status reports.
definition | One-line summary of what the workflow does.
description | Optional detailed explanation with business context. Supports multi-line text.

Tip: Choose a workflow ID that clearly reflects the business domain (e.g., ECOMMERCE_WORKFLOW, CUSTOMER_ANALYTICS).

Model Reference

The model field is the path to your data model file (relative to your project):

workflow:
  model: models/ecommerce-model.yaml

You can reference either:

  • YAML files (.yaml, .yml) - Will be auto-compiled during workflow execution
  • JSON files (.json) - Pre-compiled model for production use

# Development: use YAML for easier editing
model: models/ecommerce-model.yaml

# JSON is optional if you already have a compiled model
model: models/ecommerce-model.json

Best Practice: Use YAML unless you have a specific reason to maintain pre-compiled JSON files.

Mappings

The mappings field is an ordered list of mapping file paths:

workflow:
  mappings:
    - mappings/customer-mapping.yaml
    - mappings/product-mapping.yaml
    - mappings/order-mapping.yaml
    - mappings/order-item-mapping.yaml

Mapping Order Matters

Mappings execute in the order listed. When entities have relationships (foreign keys), list parent entities before child entities:

mappings:
  # 1. Independent entities first (no foreign keys)
  - mappings/customer-mapping.yaml
  - mappings/product-mapping.yaml

  # 2. Then entities that reference the above
  - mappings/order-mapping.yaml      # References CUSTOMER

  # 3. Finally, entities that reference those
  - mappings/order-item-mapping.yaml  # References ORDER and PRODUCT

Rule: If entity A has a relationship to entity B, mapping B must come before mapping A in the list.
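The rule amounts to a topological sort over entity relationships. As a minimal sketch in Python (not part of daana-cli), the snippet below derives a valid order with the standard library's graphlib module and checks an existing list against the rule. The DEPENDS_ON map and the order_mappings and violations helpers are hypothetical names for illustration; Daana does not necessarily compute or expose dependencies this way.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: mapping file -> mapping files whose entities
# it references. The edges mirror the example above; they are assumptions
# for illustration, not something read from Daana.
DEPENDS_ON = {
    "mappings/customer-mapping.yaml": set(),
    "mappings/product-mapping.yaml": set(),
    "mappings/order-mapping.yaml": {"mappings/customer-mapping.yaml"},
    "mappings/order-item-mapping.yaml": {
        "mappings/order-mapping.yaml",
        "mappings/product-mapping.yaml",
    },
}


def order_mappings(depends_on):
    """Return mapping paths with every referenced mapping listed first."""
    # TopologicalSorter takes {node: predecessors}. static_order() yields
    # predecessors before their dependents and raises graphlib.CycleError
    # if the relationships form a cycle.
    return list(TopologicalSorter(depends_on).static_order())


def violations(mappings, depends_on):
    """List (child, parent) pairs where the parent is listed after the child."""
    position = {path: index for index, path in enumerate(mappings)}
    return [
        (child, parent)
        for child, parents in depends_on.items()
        for parent in parents
        if position[parent] > position[child]
    ]


if __name__ == "__main__":
    # Print the result in the shape of a workflow "mappings:" list.
    for path in order_mappings(DEPENDS_ON):
        print(f"    - {path}")
```

Several orders can satisfy the rule (customer and product may appear in either order), so the sketch checks for violations rather than comparing against one fixed list.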

Connection Profile

The connection field specifies which database connection profile to use:

workflow:
  connection: dev

This references a named profile from your connections.yaml file:

# connections.yaml
connections:
  dev:
    type: postgresql
    host: localhost
    port: 5432
    user: dev
    password: devpass
    database: customerdb

  production:
    type: postgresql
    host: prod.example.com
    # ...

See Connections for details on configuring connection profiles.

Batch Processing

For large datasets, configure batch processing in the advanced section. This enables incremental data loading for pipelines using ingestion_strategy: INCREMENTAL or TRANSACTIONAL.

Basic Batch Configuration

Set batch_expression to the timestamp column that tracks new data in your source tables. Daana auto-generates the filter; no SQL is required.

workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms e-commerce data with incremental loading

  model: models/ecommerce-model.yaml
  mappings:
    - mappings/order-mapping.yaml
  connection: production

  advanced:
    batch_expression: updated_at

batch_expression

The timestamp column used to filter source data during incremental processing. When set, the framework auto-generates the batch filter SQL.

advanced:
  batch_expression: updated_at

Common choices:

  • updated_at - Standard timestamp column
  • ingest_ts - Ingestion timestamp from data lake
  • batch_date - Batch date for periodic loads

Per-Table Override

When your source tables use different timestamp column names, override batch_expression on individual tables in the mapping YAML:

# mapping.yaml
tables:
  - table: batch.daily_usage
    batch_expression: batch_date        # This table uses batch_date
    ingestion_strategy: INCREMENTAL
    attributes:
      - id: USAGE_MINUTES
        transformation_expression: minutes_watched

  - table: streaming.events
    # No batch_expression; uses workflow default (updated_at)
    ingestion_strategy: TRANSACTIONAL
    attributes:
      - id: EVENT_TYPE
        transformation_expression: event_type

Which Pipelines Use Batch Filtering?

Ingestion Strategy | Batch filter applied? | Why
INCREMENTAL | Yes | Only reads new data via watermark
TRANSACTIONAL | Yes | No delta detection; without filtering, rows are duplicated
FULL | No | Must read all data for correct change detection
FULL_LOG | No | Must read all data for soft-delete detection

IDFR pipelines always apply batch filtering regardless of strategy; they skip previously registered identifiers.

Important: When setting batch_expression at workflow level, ensure the column exists on all source tables, including those used by multi-IDFR entities. If a source table lacks the column, the pipeline will fail with a SQL error. Use per-table batch_expression overrides when different tables have different timestamp columns, or omit the workflow-level setting and only configure it per-table.
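The precedence described in this section (per-table override first, then the workflow default, with FULL and FULL_LOG ignoring both) can be sketched as a small Python function. This is an illustration of the documented rules only; effective_batch_expression is a hypothetical helper, not a Daana API, and it does not model IDFR pipelines, which always apply batch filtering.

```python
from typing import Optional

# Strategies that apply a batch filter, per the table above.
FILTERED_STRATEGIES = {"INCREMENTAL", "TRANSACTIONAL"}


def effective_batch_expression(
    ingestion_strategy: str,
    table_expression: Optional[str],
    workflow_expression: Optional[str],
) -> Optional[str]:
    """Return the column to filter on, or None when no batch filter applies."""
    if ingestion_strategy not in FILTERED_STRATEGIES:
        # FULL and FULL_LOG read all data, so batch_expression is ignored.
        return None
    # A per-table batch_expression wins over the workflow-level default.
    # If neither is set this returns None; how Daana treats that case is
    # not covered here.
    return table_expression or workflow_expression


if __name__ == "__main__":
    # batch.daily_usage overrides the workflow default with batch_date.
    print(effective_batch_expression("INCREMENTAL", "batch_date", "updated_at"))
    # streaming.events has no override and falls back to updated_at.
    print(effective_batch_expression("TRANSACTIONAL", None, "updated_at"))
```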

Automatic Batch Tracking

Daana tracks every pipeline run in a metadata table (batch_history). Each row records the batch window and a status lifecycle: R (running) → C (complete) or F (failed). The next run's lower bound (its watermark) comes from this table, so continuity between runs is always correct as long as a previous run completed successfully.

How the watermark is resolved

Before each pipeline runs, the framework queries batch_history for that pipeline's proc_key:

State of batch_history for this proc_key | Resulting watermark
No rows (first run) | Epoch (1970-01-01T00:00:00Z); process all historical data
At least one C row | The end_tmstp of the most-recently-completed C row
Rows exist but no C (only F/R) | Strategy-dependent; see "Aborted pipeline recovery" below

The watermark is computed per-pipeline, not per-workflow. If one pipeline aborts, sibling pipelines in the same workflow continue.

--batch-start and --batch-end flags

The explicit batch-window flags must always be used as a pair:

  • Both omitted (default): the watermark is the lower bound; NOW is the upper bound. This is the routine incremental-loading mode.
  • Both specified: the explicit values are used and written into batch_history. The next run's watermark anchors at this run's --batch-end.
  • Only one specified: rejected with a clear error. A half-defined window can silently miss data or rewind the watermark.
  • Both specified, --batch-end <= --batch-start: rejected. Zero-width or transposed windows are almost always operator typos.

# Routine: continues from last successful batch
daana-cli execute

# Backfill: explicit window (must include both flags; end > start)
daana-cli execute --batch-start "2024-01-01T00:00:00Z" --batch-end "2024-01-31T00:00:00Z"

# Reset and reload: drops the affected data and reloads it from epoch in the same invocation
daana-cli execute --full-refresh --entity ORDER --yes
daana-cli execute  # subsequent runs continue from the refreshed scope's new watermark
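For wrapper scripts that build these commands, the pairing rules above can be checked before daana-cli is invoked. The following is a minimal Python sketch of those rules, assuming ISO 8601 timestamps like the ones in the backfill example; validate_batch_window and parse_utc are hypothetical helpers, and the error messages are not the CLI's actual wording.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_batch_window(
    batch_start: Optional[str], batch_end: Optional[str]
) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) for an explicit window, or None for routine mode."""
    if batch_start is None and batch_end is None:
        # Both omitted: the watermark is the lower bound, NOW the upper bound.
        return None
    if batch_start is None or batch_end is None:
        # A half-defined window is rejected.
        raise ValueError("--batch-start and --batch-end must be used together")
    start, end = parse_utc(batch_start), parse_utc(batch_end)
    if end <= start:
        # Zero-width or transposed windows are rejected.
        raise ValueError("--batch-end must be later than --batch-start")
    return start, end


if __name__ == "__main__":
    print(validate_batch_window("2024-01-01T00:00:00Z", "2024-01-31T00:00:00Z"))
```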

Aborted pipeline recovery

When batch_history has rows for a pipeline but none with status C (e.g., after a series of failures), the framework refuses to silently rewind to epoch. Behavior depends on the pipeline's ingestion strategy:

Strategy | Behavior when prior runs failed but none completed
INCREMENTAL, TRANSACTIONAL | Pipeline aborts; the workflow exits non-zero with the affected proc_key/entity in the error message. Sibling pipelines continue.
FULL, FULL_LOG | Logs a warning and proceeds with epoch as the watermark. Truedelta treats idempotent re-reads as no-ops, so the load remains safe.
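Taken together, the watermark table and the recovery table describe a small decision procedure. The Python sketch below restates it under simplifying assumptions: each batch_history row is reduced to a (status, end_tmstp) pair, and "most recently completed" is approximated by the largest end_tmstp among C rows. The names resolve_watermark and IncompleteHistory are hypothetical; this is not the framework's implementation.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed row shape for illustration: (status, end_tmstp), where status is
# "R", "C", or "F". The real batch_history table has more columns.
Row = Tuple[str, Optional[datetime]]


class IncompleteHistory(Exception):
    """Rows exist for the pipeline, but none has status C."""


def resolve_watermark(rows: List[Row], ingestion_strategy: str) -> datetime:
    """Return the lower bound for the next run of one pipeline."""
    if not rows:
        # First run: process all historical data.
        return EPOCH
    completed = [end for status, end in rows if status == "C" and end is not None]
    if completed:
        # Continue from the latest completed batch window.
        return max(completed)
    if ingestion_strategy in ("FULL", "FULL_LOG"):
        # Documented behavior: warn and proceed from epoch.
        return EPOCH
    # INCREMENTAL and TRANSACTIONAL abort instead of silently rewinding.
    raise IncompleteHistory(
        f"no completed batch for {ingestion_strategy} pipeline; operator action required"
    )
```

Because the watermark is resolved per pipeline, a caller would invoke this once per proc_key and let an IncompleteHistory for one pipeline leave the others unaffected.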

Recovery is operator-driven. Pick one of:

  • --full-refresh --entity <ID>: clears the affected entity's data and re-executes in the same invocation. See Full refresh below for the operational details across all three scopes.
  • --full-refresh (no --entity): the same recovery model applied to every entity in the model. See Full refresh below.
  • Manual surgery on batch_history: set the latest F row to C (with the correct end_tmstp) if the underlying load actually completed but the bookkeeping row didn't.

Full refresh

daana-cli execute --full-refresh resets data tables and reloads from source in one invocation. The flag has three scopes: model, entity, and attribute. See execute in the CLI reference for the precise structural definition of each scope, including what gets deleted and what's preserved.

This subsection covers when to use each scope and what to expect operationally.

Model-level full refresh (--full-refresh with no --entity) resets every entity in the model. Use it when:

  • You changed the model in a way that affects every entity (e.g., reworked the IDFR strategy, renamed framework columns, switched mapping conventions for every entity), and existing data needs to be re-derived from source.
  • You deployed a new model for the first time and want a clean initial backfill across the whole warehouse.
  • Every pipeline is in a bad state. This is rare; usually one entity or one pipeline is the underlying cause, and --full-refresh --entity <ID> is the narrower tool.

What to expect operationally:

  • All pipelines re-run from epoch. The source's full history is read for every entity. For sources with deep history, this can be a large initial load.
  • INCREMENTAL and TRANSACTIONAL pipelines behave like first-time runs after the reset. There is no in-flight state to preserve.
  • Layer 3 consumers see partial state during the refresh window. Entities reload in workflow order. For consistent reads against a model-level refresh, pause downstream jobs until the workflow exits zero.
  • The operation is idempotent. Re-running --full-refresh --yes after a successful refresh deletes against already-empty tables (no-op) and reloads the same data again.
  • Framework infrastructure is preserved: semantic views, functions, and metadata table structure stay in place. There is no need to re-run install or deploy afterward.

daana-cli execute --full-refresh --yes

Entity-level full refresh (--full-refresh --entity <ID>) scopes the reset to one entity and its relationships. The same operational expectations apply, scoped down. Use it when the affected scope is a single entity (the typical case for aborted-pipeline recovery, schema migrations that only touch one entity, or per-entity backfills).

Attribute-level full refresh (--full-refresh --entity <ID> --attribute <NAME>) scopes the reset to one attribute of one entity. Use it when the affected scope is narrower still: a single attribute changed (added a transformation, fixed a mapping bug for one column) and the rest of the entity's data is correct.

--yes bypasses the interactive confirmation prompt and is required for non-interactive runs (CI, scheduled jobs).

Runbook contract: machine-readable diagnostics

When run with --no-tui, an aborted-pipeline diagnostic is emitted as a single JSON line on stdout in addition to the human-readable error on stderr:

daana-cli execute --no-tui
# stdout (one line per aborted pipeline):
# {"type":"incomplete_history","proc_key":...,"entity_id":"ORDER","investigate_sql":"SELECT ...",...}

Tooling can grep stdout line-by-line for "type":"incomplete_history" and parse each match. The schema is wire-stable; new diagnostic types will be added with new type values rather than mutating existing ones.
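As one way to consume this contract, the Python sketch below filters stdout lines for incomplete_history records and parses each as JSON. Only the type, entity_id, and investigate_sql keys shown above are relied on; the function name incomplete_history_events and the idea of piping stdout into a script are assumptions for illustration.

```python
import json
import sys
from typing import Iterable, List


def incomplete_history_events(stdout_lines: Iterable[str]) -> List[dict]:
    """Collect JSON diagnostics whose type is 'incomplete_history'."""
    events = []
    for line in stdout_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object line
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that only look like JSON
        if isinstance(record, dict) and record.get("type") == "incomplete_history":
            events.append(record)
    return events


if __name__ == "__main__":
    # Reads the CLI's stdout from standard input, one line at a time.
    for event in incomplete_history_events(sys.stdin):
        print(event.get("entity_id"), event.get("investigate_sql"))
```

Unknown type values are skipped rather than treated as errors, which matches the stated policy that new diagnostics arrive under new type values.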

Other flags

# Force: override stale run detection (a stuck 'R' row past stale-timeout)
daana-cli execute --force

Workflow Commands

Generate Template

Create a new workflow file from a template:

daana-cli generate workflow -o workflow.yaml

Check Workflow

Validate your workflow configuration before deploying:

daana-cli check workflow

This validates:

  • Workflow YAML structure
  • Model file exists and is valid
  • Mapping files exist and are valid
  • Connection profile exists
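The existence checks in this list can be approximated in a few lines of Python, for example in a pre-commit hook that runs before daana-cli check workflow. This is a rough sketch that assumes the workflow and connections files have already been parsed into dictionaries; the preflight function is hypothetical, and it does not validate model or mapping contents the way the real command does.

```python
from pathlib import Path
from typing import List


def preflight(workflow: dict, connections: dict, project_root: Path) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    wf = workflow.get("workflow", {})

    # The model path is relative to the project.
    model = wf.get("model")
    if not model or not (project_root / model).is_file():
        problems.append(f"model file not found: {model}")

    # Every listed mapping file must exist.
    for mapping in wf.get("mappings", []):
        if not (project_root / mapping).is_file():
            problems.append(f"mapping file not found: {mapping}")

    # The connection must name a profile defined in connections.yaml.
    profile = wf.get("connection")
    if profile not in connections.get("connections", {}):
        problems.append(f"unknown connection profile: {profile}")

    return problems
```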

Execute Workflow

Run the data transformation:

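# Routine run: continues from the last successful batch
daana-cli execute

# Explicit batch window (e.g., a backfill): both flags are required and end must be after start
daana-cli execute --batch-start "2024-01-01T00:00:00Z" --batch-end "2024-01-31T00:00:00Z"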

Complete Example

Here's a complete workflow showing all components together:

workflow:
  id: ECOMMERCE_WORKFLOW
  name: E-commerce Data Pipeline
  definition: Transforms raw e-commerce data into analytics-ready format
  description: |
    This workflow processes customer and order data from the
    operational database and transforms it into business entities
    for analytics and reporting.

  model: models/ecommerce-model.yaml

  mappings:
    - mappings/customer-mapping.yaml
    - mappings/product-mapping.yaml
    - mappings/order-mapping.yaml
    - mappings/order-item-mapping.yaml

  connection: production

  advanced:
    batch_expression: updated_at

Quick Reference

Looking up a specific field? Here's the complete reference for all fields in workflow.yaml.

Workflow Fields

Field | Type | Required | Description
id | string | | Unique identifier for the workflow
name | string | | Human-readable name for the workflow
definition | string | | One-line description of the workflow purpose
description | string | | Detailed description of the workflow
connection | string | | Connection profile to use for execution
model | ModelReference | | Path to the data model file
mappings | list of string | | List of mapping files to execute. Order matters: list parent entities before children.
advanced | object | | Advanced workflow settings
advanced.batch_expression | string | | Batch filtering expression for incremental source reads. Per-table override: set batch_expression on individual tables in your mapping YAML. If the value contains ${BATCH_START} or ${BATCH_END}, it is used as raw SQL (auto-detected). FULL and FULL_LOG pipelines ignore this setting (they read all data).
advanced.batch_stale_timeout | string | | Timeout for stale pipeline run detection. Uses Go duration format (e.g., 8h, 30m, 1h30m). Use --force to override stale detection during execution.

✓ = required, ○ = optional
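To make the batch_stale_timeout format concrete, here is a small Python sketch that parses the simple duration strings shown above and applies them to a run's start time. It handles only whole hours, minutes, and seconds, which is a subset of Go's duration syntax. parse_duration and is_stale are hypothetical helpers, and the exact comparison Daana uses for stale detection is an assumption.

```python
import re
from datetime import datetime, timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '8h', '30m', or '1h30m' into a timedelta."""
    position, seconds = 0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            break  # unexpected characters between tokens
        seconds += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=seconds)


def is_stale(started_at: datetime, now: datetime, timeout: str) -> bool:
    """True when a run still marked R has been running longer than the timeout."""
    return now - started_at > parse_duration(timeout)


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 0, 0)
    # A run started nine hours ago exceeds an eight-hour timeout.
    print(is_stale(started, datetime(2024, 1, 1, 9, 0), "8h"))
```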

Best Practices

  1. Use descriptive workflow IDs - Choose names that reflect the business domain (e.g., ECOMMERCE_WORKFLOW, CUSTOMER_ANALYTICS)
  2. Order mappings by dependency - Independent entities first, then dependent ones
  3. Validate before deploying - Always run daana-cli check workflow before deployment
  4. Configure batch processing - For large datasets, use INCREMENTAL ingestion with batch settings
  5. Use environment-specific connections - Create separate connection profiles for dev, staging, and production
  6. Document your workflows - Use the description field for detailed documentation
  7. Keep models in YAML - Unless you have a specific reason to maintain compiled JSON files
  8. Test incrementally - Test incremental logic with small date ranges first
