Mapping

Mappings define how source data from your operational databases maps to your Daana data model.

Mapping Structure

A mapping file connects source tables to a single entity in your model. It has four main components:

Entity target - Which entity this mapping populates
Mapping groups - Container for table mappings (typically one group)
Tables - Source tables with column-to-attribute mappings
Relationships - How this entity connects to other entities (foreign keys)

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: public.customers
 
        primary_keys:
          - customer_id
 
        ingestion_strategy: FULL
 
        # Default timestamp expression for all attributes with effective_timestamp: true
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: customer_id
            where: customer_id IS NOT NULL
 
          - id: CUSTOMER_NAME
            transformation_expression: customer_name
            where: customer_name IS NOT NULL

Tip: Always use schema.table format (e.g., public.customers, stage.raw_data) to avoid ambiguity across database schemas.

Ingestion Strategies

The ingestion_strategy field determines how Daana loads data from your source table. Choose based on your data characteristics:

Strategy	Best For	Description
FULL	Small dimension tables	Complete snapshot of all data for each delivery
INCREMENTAL	Large fact tables	Only changed or new data since last load (requires watermark column)
FULL_LOG	Change history tables	Complete history of all changes, with multiple rows per instance
TRANSACTIONAL	Append-only event logs	Append-only data where each instance is delivered exactly once

See the Glossary for detailed descriptions and examples of each strategy.

Effective Timestamps

When your model has attributes with effective_timestamp: true, Daana tracks when values change over time. Instead of manually creating timestamp entries in your mapping, use the DRY expression fields.

entity_effective_timestamp_expression

Set a default timestamp expression for all attributes with effective_timestamp: true at the table level:

tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL
 
    # Default for ALL attributes with effective_timestamp: true
    entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
    attributes:
      - id: CUSTOMER_NAME
        transformation_expression: customer_name
      - id: CUSTOMER_EMAIL
        transformation_expression: email

With this setup, any attribute that has effective_timestamp: true in the model will automatically use CURRENT_TIMESTAMP for its effective timestamp.

attribute_effective_timestamp_expression

Override the table default for specific attributes that need a different timestamp source:

tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL
 
    # Default timestamp expression
    entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
    attributes:
      - id: CUSTOMER_NAME
        transformation_expression: customer_name
        # Uses entity default: CURRENT_TIMESTAMP
 
      - id: CUSTOMER_ADDRESS
        transformation_expression: address
        # Override: use the actual update timestamp from the source
        attribute_effective_timestamp_expression: updated_at

When to use each

Scenario	Use
Same timestamp for all attributes	`entity_effective_timestamp_expression` only
Most attributes share a timestamp, some differ	Both (entity default + attribute overrides)
Each attribute has a unique timestamp	`attribute_effective_timestamp_expression` on each

Transformation Expressions

Transformation expressions define how to extract and transform data from source columns. These expressions use the SQL dialect of your source database, so syntax varies by platform (PostgreSQL, Snowflake, BigQuery, etc.).

Note: The examples below use PostgreSQL syntax. Consult your database documentation for platform-specific functions and operators.

Direct Column Mapping

- id: CUSTOMER_NAME
  transformation_expression: customer_name

SQL Functions

- id: CUSTOMER_NAME
  transformation_expression: UPPER(TRIM(customer_name))
 
- id: CREATED_DATE
  transformation_expression: TO_TIMESTAMP(created_at, 'YYYY-MM-DD HH24:MI:SS')
 
- id: TOTAL_AMOUNT
  transformation_expression: CAST(amount AS DECIMAL(10,2))

Concatenation

- id: FULL_NAME
  transformation_expression: first_name || ' ' || last_name

Multiline Expressions

For complex SQL expressions like CASE statements, subqueries, or nested function calls, use YAML's multiline syntax. This makes your mappings much more readable.

Folded Style (`>`)

The folded style (>) joins lines with spaces, ideal for SQL that would normally be written on one line:

- id: DISCOUNT_AMOUNT
  transformation_expression: >
    CASE
      WHEN total > 1000 THEN total * 0.15
      WHEN total > 500 THEN total * 0.10
      WHEN total > 100 THEN total * 0.05
      ELSE 0
    END

Literal Style (`|`)

The literal style (|) preserves line breaks exactly, useful when formatting matters:

- id: CATEGORY_NAME
  transformation_expression: |
    (SELECT category_name
     FROM categories c
     WHERE c.id = products.category_id)

Multiline with Modifiers

You can add modifiers to control trailing newlines:

Indicator	Description
`>` or `\|`	Default: single trailing newline
`>-` or `\|-`	Strip: no trailing newline
`>+` or `\|+`	Keep: preserve all trailing newlines

# Strip trailing newline with >-
- id: STATUS_CODE
  transformation_expression: >-
    COALESCE(
      status,
      'UNKNOWN'
    )

Practical Examples

Complex CASE expression:

- id: ORDER_PRIORITY
  transformation_expression: >
    CASE
      WHEN is_expedited = true THEN 'HIGH'
      WHEN order_date < CURRENT_DATE - INTERVAL '30 days' THEN 'HIGH'
      WHEN total_amount > 10000 THEN 'MEDIUM'
      ELSE 'LOW'
    END

Nested function calls:

- id: FORMATTED_ADDRESS
  transformation_expression: >
    CONCAT_WS(', ',
      NULLIF(TRIM(street), ''),
      NULLIF(TRIM(city), ''),
      NULLIF(TRIM(state), ''),
      NULLIF(TRIM(postal_code), '')
    )

Correlated subquery:

- id: LATEST_ORDER_DATE
  transformation_expression: |
    (SELECT MAX(order_date)
     FROM orders o
     WHERE o.customer_id = customers.customer_id)

Tip: Use folded style (>) for most SQL expressions. Use literal style (|) when you need exact line breaks preserved (rare).

Where Clauses

Filter conditions to select relevant data:

- id: EMAIL
  transformation_expression: email
  where: email IS NOT NULL AND email LIKE '%@%'
 
- id: ACTIVE_STATUS
  transformation_expression: is_active
  where: deleted_at IS NULL

Relationship Mappings

If your model defines relationships between entities, you map them by specifying how to extract the foreign key from your source data. Relationships are defined at the mapping group level (not inside tables).

Basic Relationship Mapping

Suppose your model has an ORDER_ITEM entity with a relationship to ORDER (defined as ORDER_ITEM_BELONGS_TO_ORDER in model.yaml). In your order item mapping, you specify how to find the order ID:

entity_id: ORDER_ITEM
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: stage.order_items
        primary_keys:
          - order_id || ' ' || order_item_id
        ingestion_strategy: FULL
 
        attributes:
          - id: ORDER_ITEM_ID
            transformation_expression: order_item_id
          - id: PRICE
            transformation_expression: price
 
    # Relationships are defined at the mapping group level
    relationships:
      - id: ORDER_ITEM_BELONGS_TO_ORDER
        source_table: stage.order_items
        target_transformation_expression: order_id

Relationship Fields

Field	Description
`id`	The relationship ID from your model (e.g., `ORDER_ITEM_BELONGS_TO_ORDER`, `IS_PLACED_BY`)
`source_table`	The table containing the foreign key data (use `schema.table` format)
`target_transformation_expression`	SQL expression that returns the target entity's identifier

Multi-Table Mappings

Map a single entity from multiple source tables:

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: archive.legacy_customers
        primary_keys:
          - old_customer_id
        ingestion_strategy: FULL
        entity_effective_timestamp_expression: migrated_at
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: old_customer_id
 
      - connection: dev
        table: public.new_customers
        primary_keys:
          - customer_uuid
        ingestion_strategy: FULL
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: customer_uuid

Multiple Identifiers

By default, each entity has a single identifier (primary key). The allow_multiple_identifiers setting is an advanced feature for rare scenarios involving multi-source data integration.

When Multiple Identifiers Are Needed

This feature is designed for multi-source data integration scenarios where the same real-world entity exists in different source systems with different identifiers, and you need to preserve both:

Scenario	Example
Multi-source integration	Same customer exists as `CRM_ID: 12345` in Salesforce and `LEGACY_ID: A789` in your legacy ERP - both IDs must remain queryable
M&A data consolidation	After acquiring a company, customer `ACME_ID: 100` must coexist with your `CORP_ID: 5000` for the same person
Composite business keys	Order line items identified by both `order_id` + `line_number` as a composite key from source

When NOT to Use Multiple Identifiers

For most use cases, a single surrogate key is preferred:

Customer lookup by email/phone - Use attributes with indexes, not multiple identifiers
Alternative lookup keys - Model these as regular attributes, not identifiers
Product SKU vs barcode - One should be the identifier, the other an attribute

Industry best practice: Data warehouses typically use surrogate keys rather than composite business keys. Multiple identifiers add complexity and are only justified when you must preserve source system identity relationships.

Enabling Multiple Identifiers

Set allow_multiple_identifiers: true in your mapping group:

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: true  # Enable multiple identifiers
 
    tables:
      # Source 1: CRM system
      - connection: crm_db
        table: crm.customers
        primary_keys:
          - crm_customer_id
        ingestion_strategy: FULL
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: crm_customer_id
 
      # Source 2: Legacy ERP (same customers, different IDs)
      - connection: legacy_db
        table: erp.customers
        primary_keys:
          - legacy_id
        ingestion_strategy: FULL
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: legacy_id

Warning: This setting is irreversible. Once you enable multiple identifiers for an entity and materialize the mapping, you cannot go back to single identifier mode. This is a permanent structural change to how the entity is stored.

Key Considerations

Exhaust alternatives first - Can you solve this with a single identifier and additional attributes?
Plan ahead - Decide whether you need multiple identifiers before your first deployment
Performance impact - Multiple identifiers increase storage and may affect query performance
No rollback - If you enable this by mistake, you'll need to recreate the entity from scratch

Complete Example

entity_id: ORDER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: public.orders
 
        primary_keys:
          - order_id
 
        ingestion_strategy: FULL
 
        # Default timestamp for most attributes
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
        attributes:
          # Identifier
          - id: ORDER_ID
            transformation_expression: order_id
            where: order_id IS NOT NULL
 
          # Order date
          - id: ORDER_DATE
            transformation_expression: order_date
            where: order_date IS NOT NULL
 
          # Total amount - uses source's updated_at instead of default
          - id: TOTAL_AMOUNT
            transformation_expression: CAST(total_amount AS DECIMAL(10,2))
            where: total_amount > 0
            attribute_effective_timestamp_expression: updated_at
 
          # Status - uses source's status_changed_at
          - id: STATUS
            transformation_expression: UPPER(status)
            where: status IN ('pending', 'completed', 'cancelled')
            attribute_effective_timestamp_expression: status_changed_at
 
    # Relationships are defined at the mapping group level
    relationships:
      - id: ORDER_IS_PLACED_BY_CUSTOMER
        source_table: public.orders
        target_transformation_expression: customer_id

Quick Reference

Looking up a specific field? Here's the complete reference for all fields in mapping YAML files.

Mapping Root Fields

Field	Type	Required	Description
entity_id	`string`	✓	Target entity ID from your model
mapping_groups	list of `MappingGroup`	✓	List of mapping groups for this entity

✓ = required, ○ = optional

Mapping Group Fields

Field	Type	Required	Description
name	`string`	✓	Name of the mapping group
allow_multiple_identifiers	`bool`	✓	Allow multiple identifier values per entity ⚠ This setting is irreversible - once enabled, you cannot go back to single identifier mode⚠ Only enable if you need multiple primary keys per entity
tables	list of `Table`	✓	List of source tables in this mapping group
relationships	list of `RelationshipMapping`	○	Relationship mappings at the group level

✓ = required, ○ = optional

Table Fields

Field	Type	Required	Description
connection	`string`	✓	Connection profile name
table	`string`	✓	Source table name with schema
primary_keys	list of `string`	✓	Primary key columns from the source table
ingestion_strategy	`IngestionStrategy`	✓	How to load data from this table
where	`string`	○	SQL WHERE clause to filter source data ⚠ Syntax is database-specific - consult your platform's SQL documentation
entity_effective_timestamp_expression	`string`	○	Default timestamp expression for all attributes ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Used as default for attributes with effective_timestamp: true in the model
attributes	list of `AttributeMapping`	✓	List of attribute mappings

✓ = required, ○ = optional

Attribute Mapping Fields

Field	Type	Required	Description
id	`string`	✓	Attribute ID from the model
transformation_expression	`string`	✓	SQL expression to extract/transform the value ⚠ Syntax is database-specific - consult your platform's SQL documentation
ingestion_strategy	`IngestionStrategy`	○	Override ingestion strategy for this attribute
where	`string`	○	SQL WHERE clause for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation
attribute_effective_timestamp_expression	`string`	○	Override timestamp expression for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Only needed when this attribute has a different timestamp source than other attributes

✓ = required, ○ = optional

Relationship Mapping Fields

Field	Type	Required	Description
id	`string`	✓	Relationship ID from the model
source_table	`string`	✓	Source table name with schema
target_transformation_expression	`string`	✓	Expression to get the target entity identifier ⚠ Syntax is database-specific - consult your platform's SQL documentation

✓ = required, ○ = optional

Best Practices

Write clear definitions and descriptions - Good documentation makes your mappings self-explanatory
Generate templates first - Start with generated templates and customize as needed
Always use schema.table format - Specify schema.table (e.g., public.customers, stage.raw_data) to avoid ambiguity
Use descriptive expressions - Make transformation logic clear and maintainable
Add where clauses - Filter out invalid or incomplete data at the source
Use entity_effective_timestamp_expression - Set a DRY default at the table level
Override when needed - Use attribute_effective_timestamp_expression for attributes with different timestamp sources
Validate data types - Use CAST functions to ensure correct data types
Choose the right ingestion strategy - Use FULL for small tables, INCREMENTAL for large fact tables

Next Steps

Define workflows to orchestrate data transformations
Configure connections to your data sources
Follow the tutorial for a complete end-to-end example

Model Workflow

Mapping

Mapping Structure

Ingestion Strategies

Effective Timestamps

entity_effective_timestamp_expression

attribute_effective_timestamp_expression

When to use each

Transformation Expressions

Direct Column Mapping

SQL Functions

Concatenation

Multiline Expressions

Folded Style (>)

Literal Style (|)

Multiline with Modifiers

Practical Examples

Where Clauses

Relationship Mappings

Basic Relationship Mapping

Relationship Fields

Multi-Table Mappings

Multiple Identifiers

When Multiple Identifiers Are Needed

When NOT to Use Multiple Identifiers

Enabling Multiple Identifiers

Key Considerations

Complete Example

Quick Reference

Mapping Root Fields

Mapping Group Fields

Table Fields

Attribute Mapping Fields

Relationship Mapping Fields

Best Practices

Next Steps

Folded Style (`>`)

Literal Style (`|`)