Tutorial

Chapter 1 - Jump Start

Goal: Observe how Daana generates an integrated, time-aware data warehouse from four different source-data shapes, using only YAML declarations and three CLI commands.

What you'll see: Customer data merged from three independent source systems and full change history tracked automatically, all available for query immediately.

Time: typically 10-15 minutes

Before starting: Complete the Quickstart Tutorial to set up your project and databases.


The Source Data

The Book Retailer's customer data resides in four independent source tables. The columns differ between tables, and the shape of each table differs as well. Daana represents each shape with one value of a single mapping field, ingestion_strategy. The four shapes used in this tutorial are described below.

stage.customers (shape: change log). One row is written each time CRM data changes, timestamped with updated_at. Customer 1 appears twice because their status changed:

 customer_id | first_name | last_name | account_status | updated_at
-------------+------------+-----------+----------------+---------------------
 1           | John       | Smith     | Active         | 2024-11-09 17:44:59
 2           | Emily      | Johnson   | Active         | 2024-11-04 18:24:50
 3           | Michael    | Brown     | Active         | 2024-10-16 10:37:12
 ...
 1           | John       | Smith     | Inactive       | 2024-11-25 14:30:22  <- status changed
 3           | Michael    | Brown     | Suspended      | 2024-11-27 16:45:18  <- status changed

stage.customer_addresses (shape: daily snapshot). The source writes one row per customer per night, regardless of whether the address changed. Most rows are duplicates of the previous day's row (275 in total, one per customer per day):

 customer_id | city         | state | snapshot_date
-------------+--------------+-------+---------------
 2           | Boston       | MA    | 2024-11-05
 2           | Boston       | MA    | 2024-11-28
 2           | Philadelphia | PA    | 2024-11-29    <- address changed
 ...

stage.loyalty_memberships (shape: current-state lookup). One row per active member, updated in place when the tier changes. The source system stores no history of its own; only the current state is available:

 customer_id | tier_level | join_date
-------------+------------+------------
 1           | Gold       | 2023-01-10
 2           | Silver     | 2023-02-14
 4           | Gold       | 2023-04-20
 6           | Platinum   | 2023-06-10
 ...

When a tier value changes between runs, Daana writes a new version instead of overwriting; chapter 4 covers how.

stage.sales_orders (shape: transactional). Immutable order events. Each row is written once when the order is placed and never updated, and the row carries both a monetary amount and its currency code:

 order_id | customer_id | order_status | total_amount | currency | order_date
----------+-------------+--------------+--------------+----------+---------------------
 1        | 1           | Delivered    | 60.33        | USD      | 2024-11-01 10:30:00
 2        | 2           | Delivered    | 24.49        | USD      | 2024-11-02 14:15:00
 6        | 6           | Delivered    | 92.38        | USD      | 2024-11-06 13:20:00

Why this is hard without Daana

Producing a single unified, time-aware customer view from these four shapes the traditional way requires four different ingestion patterns inside one pipeline:

  • CDC merge for the change log, picking the latest version per updated_at.
  • Snapshot deduplication for the daily address dumps, reducing 275 rows to one row per actual address change.
  • SCD2 reconstruction for loyalty memberships, because the source overwrites in place and history has to be derived downstream.
  • Append-only inserts with deduplication for the orders, plus a separate decision about how to keep total_amount and currency aligned if either ever changes.

Then a FULL OUTER JOIN across the four streams, key reconciliation, and valid_from/valid_to columns so analysts can answer "what did this customer look like on 27 November?". The result is a custom ETL implementation that has to be re-checked whenever the source schema drifts.

In Daana, each shape maps to one ingestion_strategy. This chapter uses three: FULL_LOG for both the change log and the daily snapshot, FULL for the loyalty lookup, and TRANSACTIONAL for the orders. This configuration is enough for Daana to generate the merging, attribute-level history, and snapshot deduplication automatically. Chapter 4 walks through each strategy in detail.


Author Your Model and Mappings

You've already run daana-cli init my-tutorial --example quickstart, which set up the data, Docker, and a workflow scaffold. Two pieces remain for you to author by hand:

  • model.yaml at the project root, which declares the business entities, attributes, and relationships. The file ships with a placeholder so this chapter can replace it directly.
  • Two mapping files in mappings/, which connect each entity to its source tables. The directory ships empty; you generate the skeletons with daana-cli generate mapping and complete them.

This is the same authoring loop used on real projects. The remainder of this chapter walks through each file.

Author the Model

Start with the model file. Open model.yaml in your editor and replace its contents with the following:

model:
  id: BOOK_RETAILER_MODEL
  name: BookRetailerModel
  definition: Unified customer and order data from multiple sources
  description: Merges customer data with address snapshots for time-travel analytics

  entities:
    - id: CUSTOMER
      name: CUSTOMER
      definition: Customer who places orders
      description: Unified customer data from customers table and address snapshots
      attributes:
        - id: customer_id
          name: customer_id
          definition: Unique customer identifier
          type: STRING

        - id: customer_name
          name: customer_name
          definition: Full name of the customer
          type: STRING
          effective_timestamp: true

        - id: customer_status
          name: customer_status
          definition: Account status (Active, Inactive, Suspended)
          type: STRING
          effective_timestamp: true

        - id: customer_city
          name: customer_city
          definition: City where customer is located
          type: STRING
          effective_timestamp: true

        - id: customer_state
          name: customer_state
          definition: State where customer is located
          type: STRING
          effective_timestamp: true

        - id: customer_loyalty_tier
          name: customer_loyalty_tier
          definition: Loyalty program tier level (Bronze, Silver, Gold, Platinum)
          type: STRING
          effective_timestamp: true

    - id: ORDER
      name: ORDER
      definition: Purchase order
      description: Orders placed by customers
      attributes:
        - id: order_id
          name: order_id
          definition: Unique order identifier
          type: STRING

        - id: order_status
          name: order_status
          definition: Current order status
          type: STRING
          effective_timestamp: true

        - id: order_purchase_ts
          name: order_purchase_ts
          definition: When order was placed
          type: START_TIMESTAMP

        - id: order_value
          name: order_value
          definition: Total monetary value of the order
          description: Amount and currency bundled as an atomic concept
          effective_timestamp: true
          group:
            - id: order_value_amount
              name: order_value_amount
              definition: Monetary amount
              type: NUMBER
            - id: order_value_currency
              name: order_value_currency
              definition: Currency code (ISO 4217, e.g. USD, EUR)
              type: UNIT

  relationships:
    - name: IS_PLACED_BY
      definition: Order is placed by a customer
      source_entity_id: ORDER
      target_entity_id: CUSTOMER

Validate the file:

daana-cli check model

The model declares two entities (CUSTOMER and ORDER) and the IS_PLACED_BY relationship between them. Most attributes carry effective_timestamp: true so Daana tracks their change history independently. Chapter 2 walks through that and the other field types in detail. One type is worth flagging now because it explains an asymmetry between the two mappings you are about to write:

  • type: START_TIMESTAMP marks order_purchase_ts as the event timestamp that places the order on the timeline. Daana uses it as the entity's effective timestamp for the order, so the order mapping does not need to set entity_effective_timestamp_expression for this attribute. See START_TIMESTAMP.

Generate Mapping Skeletons

Bootstrap one mapping file per entity:

daana-cli generate mapping

Output:

Generating mapping templates...

Model: model.yaml
Output: mappings/

Generated 2 mapping template(s)
  Generated CUSTOMER: customer-mapping.yaml
  Generated ORDER: order-mapping.yaml

This writes mappings/customer-mapping.yaml and mappings/order-mapping.yaml. Each skeleton lists every entity attribute once with a placeholder transformation_expression, sets primary_keys to the natural-key attribute id (customer_id, order_id), defaults ingestion_strategy to FULL_LOG, and pre-fills connection, table, and (where applicable) entity_effective_timestamp_expression with placeholder values you replace. The order skeleton also lists each model relationship once with the correct id and a placeholder target_transformation_expression. None of the placeholders are validated against the source schema; the next two sections replace them with the real expressions.

Each skeleton starts with a single source table block. CUSTOMER is sourced from three systems in this tutorial, so the next section adds two more table blocks under tables:; ORDER is sourced from a single table and stays one block.

Complete the Customer Mapping

Open mappings/customer-mapping.yaml and replace its contents with the following:

entity_id: CUSTOMER

mapping_groups:
  - name: default_mapping_group
    tables:
      # Source 1: CRM data, core customer records with full status change history
      - connection: dev
        table: stage.customers
        primary_keys:
          - customer_id
        ingestion_strategy: FULL_LOG
        entity_effective_timestamp_expression: updated_at
        attributes:
          - id: customer_id
            transformation_expression: customer_id
          - id: customer_name
            transformation_expression: CONCAT(first_name, ' ', last_name)
          - id: customer_status
            transformation_expression: account_status

      # Source 2: Address snapshots, merged automatically with Source 1
      - connection: dev
        table: stage.customer_addresses
        primary_keys:
          - customer_id
        ingestion_strategy: FULL_LOG
        entity_effective_timestamp_expression: snapshot_date
        attributes:
          - id: customer_city
            transformation_expression: city
          - id: customer_state
            transformation_expression: state

      # Source 3: Loyalty tier, merged automatically with Sources 1 and 2
      - connection: dev
        table: stage.loyalty_memberships
        primary_keys:
          - customer_id
        ingestion_strategy: FULL
        entity_effective_timestamp_expression: join_date
        attributes:
          - id: customer_loyalty_tier
            transformation_expression: tier_level

The mapping assembles CUSTOMER from three source tables in a single mapping group: the CRM change log (stage.customers, FULL_LOG), the daily address snapshots (stage.customer_addresses, FULL_LOG), and the current-state loyalty roster (stage.loyalty_memberships, FULL). Each table block specifies the connection, the source table, the primary key column, the ingestion strategy, the column that supplies the effective timestamp, and the transformation expression for each attribute. Merging across tables is generated by Daana when the entity is materialized.

connection: dev resolves against the dev profile in connections.yaml, which --example quickstart set to point at the local PostgreSQL container. See Connection Profiles for the full reference.

The mapping_groups list has a single entry in this chapter; Chapter 2 covers when an entity needs more than one.

Complete the Order Mapping

Open mappings/order-mapping.yaml and replace its contents with the following:

entity_id: ORDER

mapping_groups:
  - name: default_mapping_group
    tables:
      - connection: dev
        table: stage.sales_orders
        primary_keys:
          - order_id
        ingestion_strategy: TRANSACTIONAL
        batch_expression: ingested_at
        entity_effective_timestamp_expression: order_date
        attributes:
          - id: order_id
            transformation_expression: order_id
          - id: order_status
            transformation_expression: order_status
          - id: order_purchase_ts
            transformation_expression: order_date
          - id: order_value_amount           # child attribute of the order_value group
            transformation_expression: total_amount
          - id: order_value_currency         # child attribute of the order_value group
            transformation_expression: currency

    relationships:
      - id: ORDER_IS_PLACED_BY_CUSTOMER
        source_table: stage.sales_orders
        target_transformation_expression: customer_id

The order is sourced from a single table (stage.sales_orders, TRANSACTIONAL). The two child attributes of the order_value group are mapped individually (order_value_amount and order_value_currency), and the relationships block resolves IS_PLACED_BY by pointing at the customer_id column on the source table. daana-cli generate mapping pre-filled the relationship id. Chapter 2 walks through the naming rule.

Validate

daana-cli check workflow

A successful check confirms the model and both mappings parse, the entity and attribute references resolve, and the mappings cover every attribute declared on the model. deploy and execute invoke check internally, so this step is optional, but running it once after authoring catches typos before any database changes happen.

The check also surfaces three warnings about the dev connection profile: the password and user fields are hardcoded rather than read from environment variables, and sslmode is disable. These are intentional for the local sandbox so the tutorial runs without extra setup. Chapter 6 covers the production hardening (secrets, TLS, separate profiles).


Deploy and Execute

You already ran daana-cli install during the tutorial setup. Install is environment-level and runs once per database. From here, deploy the workflow and execute it:

daana-cli deploy

daana-cli execute

Explore What Daana Built

For each entity in your model, Daana generates three types of views in the daana_dw schema:

ViewContents
view_<entity>Current state, one row per entity
view_<entity>_histFull change history, one row per tracked version
view_<entity>_with_relCurrent state with all related entities joined

Daana also generates per-relationship junction views (such as view_order_is_placed_by_customer) and internal v_<entity>_desc views (here v_customer_desc and v_order_desc) over the raw EAV-style storage (Entity-Attribute-Value: each attribute lives in its own row keyed by entity id and attribute name, instead of as a wide table column); Chapter 2 lists the full set and explains the underlying tables.

These views form a complete, historized business layer. They can be queried directly or used as building blocks for a traditional dimensional mart (dimensions and fact tables). Chapter 7 walks through that process. Future versions of Daana will generate dimensional marts automatically.

customer_key is each customer's surrogate key (always typed character varying); the ::integer cast in ORDER BY clauses sorts numerically. Chapter 2 explains how Daana derives it.

Open a psql shell against the local container so you can run the upcoming SQL queries:

docker exec -it daana-customerdb psql -U dev -d customerdb

Query 1: Three sources, one unified view

All three source systems are merged into a single row per customer:

SELECT
  customer_key,
  customer_name,
  customer_status,
  customer_city,
  customer_state,
  customer_loyalty_tier
FROM daana_dw.view_customer
ORDER BY customer_key::integer;

Result:

 customer_key | customer_name       | customer_status | customer_city | customer_state | customer_loyalty_tier
--------------+---------------------+-----------------+---------------+----------------+-----------------------
 1            | John Smith          | Inactive        | New York      | NY             | Gold
 2            | Emily Johnson       | Active          | Philadelphia  | PA             | Silver
 3            | Michael Brown       | Suspended       | Brooklyn      | NY             | Bronze
 4            | Sarah Davis         | Active          | Cambridge     | MA             | Gold
 5            | David Wilson        | Active          | Queens        | NY             | Silver
 6            | Jennifer Martinez   | Active          | New York      | NY             | Platinum
 7            | Robert Anderson     | Active          | San Francisco | CA             |
 8            | Lisa Taylor         | Active          | San Francisco | CA             | Bronze
 9            | James Thomas        | Active          | Los Angeles   | CA             | Silver
 10           | Maria Garcia        | Active          | Boston        | MA             | Bronze
(10 rows)

One row per customer, assembled from three independent source tables. Robert Anderson (customer 7) has no loyalty membership record; his customer_loyalty_tier is empty. Daana handles missing data across sources without additional configuration.

The 12 rows in stage.customers reduce to 10 here because customers 1 and 3 each carry an additional status-change row (Active -> Inactive, Active -> Suspended); FULL_LOG ingestion preserves those as historical versions rather than duplicate customers, and the current-state view shows only the latest version per entity.

Query 2: The full change history for one customer

Emily (customer 2) has dozens of rows in stage.customer_addresses and two rows in stage.customers. Daana deduplicates the snapshots, detects only actual changes, and stores one row per distinct version:

SELECT
  customer_name,
  customer_city,
  customer_state,
  customer_status,
  customer_loyalty_tier,
  eff_tmstp
FROM daana_dw.view_customer_hist
WHERE customer_key = '2'
ORDER BY eff_tmstp;

Result:

 customer_name | customer_city | customer_state | customer_status | customer_loyalty_tier |      eff_tmstp
---------------+---------------+----------------+-----------------+-----------------------+---------------------
               |               |                |                 | Silver                | 2023-02-14 00:00:00
 Emily Johnson |               |                | Active          | Silver                | 2024-11-04 18:24:50
 Emily Johnson | Boston        | MA             | Active          | Silver                | 2024-11-05 00:00:00
 Emily Johnson | Philadelphia  | PA             | Active          | Silver                | 2024-11-29 00:00:00
(4 rows)

The first row (2023-02-14) originates from stage.loyalty_memberships. On that date Emily's loyalty tier (Silver) is the only thing the business knows about her; the CRM has not yet created a record and no address snapshot exists, so name, city, state, and status remain empty. Each effective_timestamp: true attribute is historized independently, so a single timestamp can carry a partial picture without invalidating the others. Subsequent rows fill in CRM and address attributes as they arrive. Emily's 25 daily address snapshots reduce to 2 actual change events (Boston, then Philadelphia).

Query 3: Point-in-time query

Query Emily's address on 27 November 2024, prior to her relocation:

SELECT
  customer_name,
  customer_city,
  customer_state,
  eff_tmstp
FROM daana_dw.view_customer_hist
WHERE customer_key = '2'
  AND eff_tmstp <= '2024-11-27'
ORDER BY eff_tmstp DESC
LIMIT 1;

Result:

 customer_name | customer_city | customer_state |       eff_tmstp
---------------+---------------+----------------+------------------------
 Emily Johnson | Boston        | MA             | 2024-11-05 00:00:00

No additional query structure is required. The history is stored and timestamped by Daana automatically.

Daana also exposes a parameterized helper, daana_dw.f_customer(p_eff_tmstp), that returns the same point-in-time snapshot in one call. The function applies to every customer at once, where the view_customer_hist query above returned only one row:

SELECT
  customer_key,
  customer_name,
  customer_city,
  customer_state,
  customer_status
FROM daana_dw.f_customer('2024-11-27')
WHERE customer_key = '2';

Result:

 customer_key | customer_name | customer_city | customer_state | customer_status
--------------+---------------+---------------+----------------+-----------------
 2            | Emily Johnson | Boston        | MA             | Active

Drop the WHERE clause to get the full per-customer snapshot. See Chapter 2: Time-travel queries for the full reference.

Query 4: Relationship view

The _with_rel view resolves the IS_PLACED_BY relationship by attaching each customer's order columns alongside the customer's own attributes, so a single query returns customer state and the order they placed:

SELECT
  order_purchase_ts,
  customer_key,
  customer_name,
  customer_loyalty_tier,
  order_id,
  order_status,
  order_value_amount,
  order_value_currency
FROM daana_dw.view_customer_with_rel
ORDER BY order_purchase_ts DESC;

Result:

  order_purchase_ts  | customer_key |   customer_name   | customer_loyalty_tier | order_id | order_status | order_value_amount | order_value_currency
---------------------+--------------+-------------------+-----------------------+----------+--------------+--------------------+----------------------
 2024-11-25 16:45:00 | 6            | Jennifer Martinez | Platinum              | 13       | Delivered    |        56.53000000 | USD
 2024-11-20 11:00:00 | 2            | Emily Johnson     | Silver                | 12       | Delivered    |        35.88000000 | USD
 2024-11-15 13:30:00 | 1            | John Smith        | Gold                  | 11       | Delivered    |        48.38000000 | USD
 2024-11-10 14:20:00 | 10           | Maria Garcia      | Bronze                | 10       | Processing   |        65.77000000 | USD
 2024-11-09 09:45:00 | 9            | James Thomas      | Silver                | 9        | Delivered    |        38.60000000 | USD
 2024-11-08 15:30:00 | 8            | Lisa Taylor       | Bronze                | 8        | Delivered    |        72.50000000 | USD
 2024-11-07 10:00:00 | 7            | Robert Anderson   |                       | 7        | Shipped      |        57.07000000 | USD
 2024-11-06 13:20:00 | 6            | Jennifer Martinez | Platinum              | 6        | Delivered    |        92.38000000 | USD
 2024-11-05 11:30:00 | 5            | David Wilson      | Silver                | 5        | Delivered    |        19.52000000 | USD
 2024-11-04 16:45:00 | 4            | Sarah Davis       | Gold                  | 4        | Processing   |        44.03000000 | USD
 2024-11-03 09:20:00 | 3            | Michael Brown     | Bronze                | 3        | Shipped      |        71.39000000 | USD
 2024-11-02 14:15:00 | 2            | Emily Johnson     | Silver                | 2        | Delivered    |        24.49000000 | USD
 2024-11-01 10:30:00 | 1            | John Smith        | Gold                  | 1        | Delivered    |        60.33000000 | USD
(13 rows)

The IS_PLACED_BY relationship was declared as ORDER -> CUSTOMER and Daana materializes it on both sides: view_order_with_rel returns each order with the customer's columns attached, and view_customer_with_rel (queried above) returns each customer with the order columns attached. The relationship is many-orders-per-customer: customers 1, 2, and 6 each appear twice in the result above (a recent November order at the top and an earlier order further down). An aggregate query (e.g. COUNT(order_id) GROUP BY customer_key) rolls these back up to one row per customer. No manual JOIN logic is required at query time.


What Happened Automatically

Four capabilities were applied without any custom code:

Multi-source merge. Three independent tables (stage.customers, stage.customer_addresses, stage.loyalty_memberships) were consolidated into a single CUSTOMER entity. Daana matched rows by customer_id across all three sources, assembled the attributes, and handled customers with no loyalty membership (such as Robert Anderson) without special-case logic.

Snapshot deduplication and attribute-level history. Emily's 25 address snapshot rows (out of 275 in the dataset) contained mostly identical records; most rows reflected the same address from day to day. Daana identified only the genuine changes and stored four historical versions: loyalty registration, first customer record, first address, and the move to Philadelphia. Every attribute with effective_timestamp: true receives this treatment automatically.

Grouped monetary attribute. order_value_amount and order_value_currency were declared as a single concept (order_value). They are stored together and will always change together in the historical record; an amount value cannot exist without its corresponding currency. Chapter 3 shows where the grouping lives in storage and why it matters.

Relationship views. The IS_PLACED_BY relationship between ORDER and CUSTOMER was resolved into pre-joined views on both sides: view_order_with_rel (each order with customer columns) and view_customer_with_rel (each customer with order columns). The customer-side view returns one row per order, as Query 4 showed for John, Emily, and Jennifer; aggregate (COUNT, SUM, etc.) when you need one row per customer. No manual JOIN logic is required at query time.


What You Avoided

To implement this manually in SQL (dbt, Dataform, or stored procedures), the following work would be required:

RequirementManual implementation
Three-source mergeFULL OUTER JOIN across three tables with COALESCE conflict resolution
Address snapshot deduplicationWindow function change detection using LAG() across 275 rows
SCD2 history tablevalid_from/valid_to columns, gap detection, row hashing
Point-in-time queriesWHERE valid_from <= :date AND valid_to > :date on every query
Grouped monetary attributeConvention enforcement in application code or separate constraint tables
Order-customer join viewManual CREATE VIEW with JOIN on surrogate keys

Manual effort: Multiple SQL files, significant development time, and ongoing maintenance whenever a source schema changes.

In Daana: Six attributes in a model file and three source table entries in a mapping.


Next Steps

This chapter demonstrated the end result. The following chapters unpack how it was built: