Chapter 4: Loading Strategies

Goal: Understand the different ingestion strategies and when to use each one.

Prerequisites: You must have completed Chapter 3: Structuring Business Logic.


The Key Question

Before choosing a strategy, ask: "Does my source contain history, or just current state?"

 Source Contains                        | Strategy      | Example
----------------------------------------+---------------+-------------------------------------
 Current state only (1 row per entity)  | FULL          | Reference tables, dimensions
 Complete history (N rows per entity)   | FULL_LOG      | Change logs, snapshots, CDC
 Current state, large table             | INCREMENTAL   | Large orders table with updated_at
 Immutable events                       | TRANSACTIONAL | Clickstream, sensor data

The Four Strategies

FULL - Source Has Current State Only

Use when your source table has exactly one row per entity representing its current state.

ingestion_strategy: FULL

How it works: Daana reads the entire table each run. Over multiple runs, Daana compares snapshots to detect what changed and builds history automatically.

Example source (one row per product):

product_id | name           | price
1          | The Handmaid's | 16.99   ← Current state only
2          | American Gods  | 22.99   ← Current state only
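The snapshot comparison can be pictured with a minimal Python sketch. This is only an illustration of the idea, not Daana's actual implementation; `diff_snapshots` and the sample data are hypothetical:

```python
def diff_snapshots(previous, current):
    """Compare two full snapshots keyed by entity id and report
    what changed between runs (insert / update / delete)."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key))
    return changes

# Run 1 and run 2 each contain only the current state (one row per product).
run_1 = {1: {"name": "The Handmaid's", "price": 16.99}}
run_2 = {1: {"name": "The Handmaid's", "price": 18.99},   # price changed
         2: {"name": "American Gods",  "price": 22.99}}   # new product

changes = diff_snapshots(run_1, run_2)
print(changes)
```

Even though the source never stores history, comparing successive snapshots like this is enough to reconstruct it over time.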

FULL_LOG - Source Contains Complete History

Use when your source table has multiple rows per entity, each representing a historical version.

ingestion_strategy: FULL_LOG

How it works: Daana treats each row as a point-in-time record and preserves the complete history from the source.

Example source (multiple rows per customer):

customer_id | status   | updated_at
1           | Active   | 2024-11-09  ← Version 1
1           | Inactive | 2024-11-25  ← Version 2 (status changed)

This is what you used in Chapter 1! Both stage.customers and stage.customer_addresses contain multiple rows per entity, so both use FULL_LOG.
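With FULL_LOG there is nothing to diff: each row is already a point-in-time version, so building history amounts to ordering the rows per entity. A minimal Python sketch of that idea (hypothetical sample data, not Daana's code):

```python
from collections import defaultdict

# Each source row is already a historical version of the entity.
rows = [
    {"customer_id": 1, "status": "Inactive", "updated_at": "2024-11-25"},
    {"customer_id": 1, "status": "Active",   "updated_at": "2024-11-09"},
]

# History per entity is just the rows, ordered by their effective timestamp.
history = defaultdict(list)
for row in sorted(rows, key=lambda r: r["updated_at"]):
    history[row["customer_id"]].append((row["updated_at"], row["status"]))

print(history[1])
```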

INCREMENTAL - Like FULL, But Faster

Use for large tables where reading everything is too slow. Requires a watermark column.

ingestion_strategy: INCREMENTAL
effective_datetime: updated_at

How it works: Only reads rows where updated_at > last_successful_run.
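The watermark filter is conceptually simple; this Python sketch shows the idea (the function name and sample rows are hypothetical, not Daana's API):

```python
def incremental_read(rows, last_successful_run):
    """Return only rows touched since the last successful run,
    using updated_at as the watermark column."""
    return [r for r in rows if r["updated_at"] > last_successful_run]

orders = [
    {"order_id": 1, "updated_at": "2024-11-20"},
    {"order_id": 2, "updated_at": "2024-11-28"},
]

# Only order 2 changed since the last run on 2024-11-25.
new_rows = incremental_read(orders, "2024-11-25")
print(new_rows)
```

In practice this filter runs as a WHERE clause in the source database, so unchanged rows are never transferred at all.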

TRANSACTIONAL - Append-Only Events

Use for immutable data that's inserted once and never updated.

ingestion_strategy: TRANSACTIONAL

How it works: Appends new rows without checking for updates. Fastest strategy.
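The contrast with FULL is that there is no comparison step at all. A hypothetical Python sketch of the append-only behavior:

```python
warehouse = []

def append_events(target, events):
    """Events are immutable, so each run simply appends new rows.
    No lookup or diff against existing rows is needed."""
    target.extend(events)

# Two successive runs, each delivering new events.
append_events(warehouse, [{"event": "click", "ts": "2024-11-30T10:00:00"}])
append_events(warehouse, [{"event": "click", "ts": "2024-11-30T10:00:05"}])
print(len(warehouse))
```

Skipping the update check is exactly what makes this the fastest strategy, and also why it is only safe for data that never changes after insert.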


Hands-On Exercises

Exercise 1: Examine FULL_LOG Sources

Both of your Chapter 1 sources use FULL_LOG. Let's see why:

# See multiple rows per customer in stage.customers
docker exec -it daana-customerdb psql -U dev -d customerdb -c \
  "SELECT customer_id, account_status, updated_at
   FROM stage.customers
   WHERE customer_id = 1
   ORDER BY updated_at;"

You'll see customer 1 appears twice (Active then Inactive). This is a FULL_LOG pattern!

 customer_id | account_status |       updated_at
-------------+----------------+------------------------
           1 | Active         | 2024-11-09 17:44:59+00
           1 | Inactive       | 2024-11-25 14:30:22+00
(2 rows)

# See multiple rows per customer in stage.customer_addresses
docker exec -it daana-customerdb psql -U dev -d customerdb -c \
  "SELECT customer_id, city, snapshot_date
   FROM stage.customer_addresses
   WHERE customer_id = 2
   ORDER BY snapshot_date DESC LIMIT 5;"

Most customers appear many times (daily snapshots). Also FULL_LOG!

 customer_id |     city     | snapshot_date
-------------+--------------+---------------
           2 | Philadelphia | 2024-11-29
           2 | Boston       | 2024-11-28
           2 | Boston       | 2024-11-27
           2 | Boston       | 2024-11-26
           2 | Boston       | 2024-11-25
(5 rows)

Snapshotted sources naturally contain many duplicate records (days on which nothing changed). Daana automatically collapses these duplicates and records only the actual changes.
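That collapsing step can be pictured with a short Python sketch. It is an illustration of the idea only; `collapse_snapshots` and the sample data are hypothetical, not Daana's implementation:

```python
def collapse_snapshots(snapshots):
    """Drop consecutive daily snapshots whose tracked value did not
    change, keeping only the rows where something actually changed."""
    changes = []
    last = None
    for date, city in sorted(snapshots):
        if city != last:
            changes.append((date, city))
            last = city
    return changes

# Emily's daily address snapshots: four identical days, then a move.
emily = [("2024-11-25", "Boston"), ("2024-11-26", "Boston"),
         ("2024-11-27", "Boston"), ("2024-11-28", "Boston"),
         ("2024-11-29", "Philadelphia")]

collapsed = collapse_snapshots(emily)
print(collapsed)
```

Five snapshot rows collapse to two history records: the first Boston row and the change to Philadelphia.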

Exercise 2: Add New History and Re-Execute

# Add a new address snapshot (Emily moves to SF!)
docker exec -it daana-customerdb psql -U dev -d customerdb -c "
INSERT INTO stage.customer_addresses
(customer_id, address_type, is_default, recipient_name, address_line1, city, state, postal_code, country, snapshot_date)
VALUES (2, 'Shipping', TRUE, 'Emily Johnson', '100 Tech Drive', 'San Francisco', 'CA', '94105', 'USA', '2024-11-30');"

# Re-execute
daana-cli execute
# See Emily's complete history: Boston → Philadelphia → San Francisco
# Note: History views use lowercase column names
docker exec -it daana-customerdb psql -U dev -d customerdb -c \
  "SELECT customer_city, eff_tmstp FROM daana_dw.view_customer_hist
   WHERE customer_key = '2' ORDER BY eff_tmstp DESC LIMIT 3;"

Result:

 customer_city |       eff_tmstp
---------------+------------------------
 San Francisco | 2024-11-30 00:00:00
 Philadelphia  | 2024-11-29 00:00:00
 Boston        | 2024-11-05 00:00:00
(3 rows)

Exercise 3: Clean Up

docker exec -it daana-customerdb psql -U dev -d customerdb -c \
  "DELETE FROM stage.customer_addresses WHERE snapshot_date = '2024-11-30';"
daana-cli execute

Best Practices

  1. Look at your source data - Does it have 1 row or N rows per entity?
  2. FULL_LOG for history - If source tracks changes as separate rows, use FULL_LOG
  3. FULL for current state - If source only has current state, use FULL
  4. INCREMENTAL for scale - Switch from FULL when tables get large (>1M rows)
  5. TRANSACTIONAL for events - Use for immutable append-only data

Next: Chapter 5 - Mastering DMDL