Chapter 4 - Loading Strategies

Goal: Understand the different ingestion strategies and when to use each one.

Prerequisites: You must have completed Chapter 3: Structuring Business Logic.

Strategies You've Already Used

In Chapter 1, your project used three ingestion strategies without requiring any configuration decisions from you:

stage.customers and stage.customer_addresses used FULL_LOG, because both tables contain multiple rows per customer representing historical versions.
stage.loyalty_memberships used FULL, because the table contains exactly one row per entity representing current state.
stage.sales_orders used TRANSACTIONAL, because every row is an immutable order event written once and never updated.

This chapter explains why those choices were made and introduces the remaining strategy, INCREMENTAL.

The Key Question

Before choosing a strategy, ask: "Does my source contain history, or just current state?"

Source Contains	Strategy	Source Timestamp	Example
Current state only (1 row per entity)	`FULL`	not required	Reference tables, dimensions
Complete history (N rows per entity)	`FULL_LOG`	effective time per row	Change logs, snapshots, CDC
Mutable rows (each row may be updated after insert)	`INCREMENTAL`	watermark + effective time	Million-row CRM customers with `updated_at`
Immutable rows (each row written once, never updated)	`TRANSACTIONAL`	watermark + effective time	Order events, clickstream, sensor data

The watermark (batch_expression) is a filter on the source: it controls which rows are read this run (WHERE batch_expression > previous_high_water_mark). The effective timestamp (entity_effective_timestamp_expression) is per-row placement on the warehouse timeline: it controls when this version of the row became true, which Daana uses to merge attribute-level history (a new history version is written only when an attribute's value actually changed at a new effective time). The same source column often plays both roles, but the framework keeps them separate.

In practice, prefer a staging clock column for batch_expression (one whose value is set when the row enters the warehouse-facing table, e.g. ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW()). It is monotonic at load time, immune to source-system clock skew, and unaffected by out-of-order business events. Business timestamps like order_date or updated_at work, but tying the reader filter to them mixes "when did this row arrive" with "when did this thing happen", which is harder to reason about under late arrival or backfill.

The Four Strategies

FULL: Source Has Current State Only

Use when your source table has exactly one row per entity representing its current state.

ingestion_strategy: FULL

How it works: Daana reads the entire table each run. Over multiple runs, Daana compares snapshots to detect what changed and builds history automatically.

In your project, stage.loyalty_memberships uses FULL. Each customer has one membership row with a current tier level:

 customer_id | tier_level | join_date
-------------+------------+------------
 1           | Gold       | 2023-01-10
 2           | Silver     | 2023-02-14
 4           | Gold       | 2023-04-20

If a tier level changes between pipeline runs, Daana detects the difference and records a new version, with eff_tmstp set to the source's update time when the table carries one, or to the run timestamp otherwise.

FULL_LOG: Source Contains Complete History

Use when your source table has multiple rows per entity, each representing a historical version.

ingestion_strategy: FULL_LOG

How it works: Daana treats each row as a point-in-time record and preserves the complete history from the source.

In your project, stage.customers uses FULL_LOG. Customer 1 appears twice, representing a status change:

 customer_id | account_status | updated_at
-------------+----------------+------------------------
 1           | Active         | 2024-11-09 17:44:59
 1           | Inactive       | 2024-11-25 14:30:22

Similarly, stage.customer_addresses uses FULL_LOG with daily snapshots. Daana deduplicates identical consecutive snapshots and retains only rows where an actual change occurred.

INCREMENTAL: Watermarked Loading for Large Tables

Use for large tables where reading everything each run is too slow. Requires a timestamp column to act as a watermark.

ingestion_strategy: INCREMENTAL
batch_expression: ingested_at         # staging clock - when the row landed
entity_effective_timestamp_expression: updated_at  # business time - when the row became true

How it works: Daana auto-generates a filter so each run only reads rows where the batch_expression column is greater than the previous successful run's high-water mark. This avoids re-reading millions of unchanged rows on every execution.

Tip: batch_expression and entity_effective_timestamp_expression can point at the same column or at different ones. The split shown above is the recommended default for mutable sources: the staging clock advances monotonically as rows land, while the business timestamp tracks when each version of the row became true. The watermark can also be set once at the workflow level and overridden per table.
See Batch Processing for how Daana tracks the watermark between runs.

Late arrivals. Splitting batch_expression (staging clock) from entity_effective_timestamp_expression (business time) lets INCREMENTAL accept rows whose business-effective timestamp is in the past. A row that lands in the current batch window but reports updated_at weeks ago is read by the watermark and written to the warehouse at its updated_at value, not at ingest time. Existing target rows at later effective timestamps remain unchanged; the late row appears as an earlier observation in the history.

Two consequences worth being aware of:

Re-asking the warehouse "what was the value at time T?" can return a different answer after a late arrival is processed. The warehouse represents the current understanding of history, refined as new observations arrive. The same warehouse state is returned by re-running an unchanged batch or by re-running the same source through a full reprocess.
view_{entity}_hist returns one row per value-run, anchored at the earliest known eff_tmstp. A late arrival that pushes the anchor earlier replaces the earlier anchor in the view's output. See Understanding Daana Output: value-run compaction.

TRANSACTIONAL: Append-Only Events

Use for immutable data that is inserted once and never updated.

ingestion_strategy: TRANSACTIONAL
batch_expression: ingested_at         # staging clock

How it works: Appends new rows without checking for updates. This is the fastest strategy, suitable for event streams, clickstream data, sensor readings, or financial transactions. A batch_expression is optional: without one, every run scans the whole source table and relies on Daana's primary-key deduplication. With one, each run only reads rows that landed after the previous successful run.

In your project, stage.sales_orders uses TRANSACTIONAL with batch_expression: ingested_at (set in Chapter 1). Each order row is written once when the order is placed and never updated, which is the contract this strategy is built around. ingested_at is the staging clock: a column with DEFAULT NOW() that records when each row landed in the staging table, so each execute only reads what arrived since the previous run. Exercise 3 walks through that handoff. The sample schema also ships stage.loyalty_transactions (append-only point redemptions and accruals); the same fit applies.

Hands-On Exercises

The exercises below mix CLI commands with SQL queries. Open a psql shell in a second terminal so the first stays free for daana-cli invocations:

docker exec -it daana-customerdb psql -U dev -d customerdb

Exercise 1: Examine FULL_LOG Sources

Both stage.customers and stage.customer_addresses use FULL_LOG. Examine the source data to see why:

SELECT customer_id, account_status, updated_at
FROM stage.customers
WHERE customer_id = 1
ORDER BY updated_at;

Customer 1 appears twice, representing a status change from Active to Inactive:

 customer_id | account_status |       updated_at
-------------+----------------+------------------------
           1 | Active         | 2024-11-09 17:44:59+00
           1 | Inactive       | 2024-11-25 14:30:22+00
(2 rows)

stage.customer_addresses is the same idea for a different grain: one row per customer per day. Customer 2 appears on many dates (for example Philadelphia on 2024-11-29, then Boston on earlier snapshot dates). Snapshot sources naturally produce duplicate-looking rows when nothing changed from day to day; Daana deduplicates those and tracks only real changes.

Exercise 2: Add New History and Re-Execute

Insert a new address snapshot to simulate Emily moving to San Francisco:

INSERT INTO stage.customer_addresses
  (customer_id, address_type, is_default, recipient_name, address_line1, city, state, postal_code, country, snapshot_date)
VALUES
  (2, 'Shipping', TRUE, 'Emily Johnson', '100 Tech Drive', 'San Francisco', 'CA', '94105', 'USA', '2024-11-30');

Re-execute the pipeline (in your CLI terminal):

daana-cli execute

Query Emily's updated history (back in the psql terminal), newest first:

SELECT customer_city, eff_tmstp
FROM daana_dw.view_customer_hist
WHERE customer_key = '2'
ORDER BY eff_tmstp DESC;

You should see San Francisco at the top, then Philadelphia, then older Boston rows, and an earlier row with an empty customer_city from the initial customer load before address snapshots populated the attribute.

The new San Francisco row appears in the history automatically. Any empty customer_city in older rows reflects the initial ingestion from stage.customers (no city column); later snapshots from stage.customer_addresses populated the attribute.

Exercise 3: Load `sales_orders` in Two Batches

stage.sales_orders already uses TRANSACTIONAL with batch_expression: ingested_at (Chapter 1). This exercise treats your initial load as batch 1 and lands a second batch by inserting three new orders into the staging table. That is how Daana picks up newly arrived data on every run: the staging clock ticks on insert, and daana-cli execute reads everything beyond the previous run's high-water mark. No reset, no mapping change, no explicit batch flags.

Throughout this exercise, reuse the same pair of snapshots so you always compare apples to apples. Staging shows what arrived (including ingested_at); the warehouse view shows what the pipeline has materialized.

SELECT order_id, order_date, ingested_at
FROM stage.sales_orders
ORDER BY order_id::int;

SELECT order_id, order_status, order_purchase_ts
FROM daana_dw.view_order
ORDER BY order_id::int;

Baseline (after Chapter 1, before you insert new orders). Run both statements. You should see 13 rows in each result. The ingested_at column was set by tutorial_setup.sh via DEFAULT NOW(), so those timestamps cluster within a few seconds. The previous daana-cli execute recorded a moment slightly later than the maximum ingested_at as that batch's end_tmstp in daana_metadata.batch_history (you inspect that table in step 6).

Insert three new orders. The schema sets ingested_at NOT NULL DEFAULT CURRENT_TIMESTAMP, so you don't specify it: each row gets NOW() at insert time, strictly after the previous batch's end_tmstp.

INSERT INTO stage.sales_orders
  (order_id, order_number, customer_id, order_date, order_channel, location_id,
   order_status, payment_status, fulfillment_method, total_amount, currency)
VALUES
  (14, 'ORD-2024-000014', 3, '2024-12-02 10:00:00', 'Online',   4, 'Processing', 'Authorized', 'Ship to Address',   42.50, 'USD'),
  (15, 'ORD-2024-000015', 5, '2024-12-04 14:30:00', 'In-Store', 2, 'Delivered',  'Captured',   'In-Store Purchase', 28.99, 'USD'),
  (16, 'ORD-2024-000016', 7, '2024-12-06 09:15:00', 'Online',   5, 'Shipped',    'Captured',   'Ship to Address',   67.25, 'USD');

After the INSERT, before execute. Run the same two snapshots again. Staging now has 16 rows and the last three have newer ingested_at values. The warehouse snapshot still has 13 rows until the next run picks up the new staging rows.
Run execute with no flags. Daana resolves the new batch's lower bound from the previous run's end_tmstp, and its upper bound from wallclock now. Only rows with ingested_at in that window pass the reader filter:
```
daana-cli execute
```
After execute. Run the same two snapshots again. Staging still shows 16 orders. view_order now shows 16 rows; the three new orders (IDs 14, 15, 16) appear with statuses Processing, Delivered, and Shipped and with the expected order_purchase_ts values.
Inspect the batch trail to see the watermark handoff:
```
SELECT start_tmstp, end_tmstp, row_st
FROM daana_metadata.batch_history
WHERE source_table = 'sales_orders'
ORDER BY popln_tmstp;
```
Two completed batches show up. Run 2's start_tmstp matches run 1's end_tmstp exactly; that boundary is what the reader filter uses to avoid re-reading rows already absorbed. Daana writes one row per attribute pipeline (and the relationship pipeline) on every batch. The ORDER entity has five such pipelines, so each batch produces five rows. All rows for this source-entity group share a single master_proc_key that identifies the pipeline across runs; individual batches are distinguished by proc_key and procinst_key. The row_st = 'C' column means completed; rows in flight carry 'R'.
Idempotency. Running daana-cli execute again right now is safe. The watermark has advanced past order 16, no new rows are in scope, and view_order stays at 16 (order_desc stays at one row per attribute). batch_history is the exception: it records a fresh five rows on every run, since each invocation is a distinct execution attempt. This makes retries after partial failures safe to run blindly.

When would you reach for INCREMENTAL?

TRANSACTIONAL trusts the source: every row that passes the watermark is appended. INCREMENTAL adds truedelta on the merge side, which is what you want when the source can re-emit rows it has already published (mutable CRM tables, snapshot feeds). stage.sales_orders writes each order exactly once, so TRANSACTIONAL is the right fit here.

Choosing the Right Strategy

Question	Answer	Strategy
Does the source have one row per entity?	Yes	`FULL`
Does the source have multiple rows per entity (history)?	Yes	`FULL_LOG`
Is the table large and only current state?	Yes	`INCREMENTAL`
Are rows immutable events?	Yes	`TRANSACTIONAL`

When in doubt, start with FULL for current-state tables and FULL_LOG for tables with history. Both are safe defaults. Switch to INCREMENTAL when table size becomes a performance concern, and use TRANSACTIONAL for event streams where rows are never updated.

Next: Chapter 5 - Mastering DMDL