DMDL
Mapping

Mappings define how source data from your operational databases maps to your Daana data model.

Mapping Structure

A mapping file connects source tables to a single entity in your model. It has four main components:

  1. Entity target - Which entity this mapping populates
  2. Mapping groups - Container for table mappings (typically one group)
  3. Tables - Source tables with column-to-attribute mappings
  4. Relationships - How this entity connects to other entities (foreign keys)

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: public.customers
 
        primary_keys:
          - customer_id
 
        ingestion_strategy: FULL
 
        # Default timestamp expression for all attributes with effective_timestamp: true
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: customer_id
            where: customer_id IS NOT NULL
 
          - id: CUSTOMER_NAME
            transformation_expression: customer_name
            where: customer_name IS NOT NULL

Tip: Always use schema.table format (e.g., public.customers, stage.raw_data) to avoid ambiguity across database schemas.

Ingestion Strategies

The ingestion_strategy field determines how Daana loads data from your source table. Choose based on your data characteristics:

| Strategy | Best For | Description |
|---|---|---|
| FULL | Small dimension tables | Complete snapshot of all data for each delivery |
| INCREMENTAL | Large fact tables | Only changed or new data since the last load (requires a watermark column) |
| FULL_LOG | Change history tables | Complete history of all changes, with multiple rows per instance |
| TRANSACTIONAL | Append-only event logs | Append-only data where each instance is delivered exactly once |

See the Glossary for detailed descriptions and examples of each strategy.
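
For example, a large fact table might use INCREMENTAL instead of FULL. A sketch, assuming a hypothetical public.order_events source table; per the table above, INCREMENTAL also requires a watermark column on the source (see the Glossary):

```yaml
tables:
  - connection: dev
    table: public.order_events   # hypothetical large fact table
    primary_keys:
      - event_id
    # Load only rows changed or added since the last delivery
    # (requires a watermark column on the source -- see the Glossary)
    ingestion_strategy: INCREMENTAL
    attributes:
      - id: EVENT_ID
        transformation_expression: event_id
```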

Effective Timestamps

When your model has attributes with effective_timestamp: true, Daana tracks when values change over time. Instead of manually creating timestamp entries in your mapping, use the expression fields below to keep your mappings DRY (Don't Repeat Yourself).

entity_effective_timestamp_expression

Set a default timestamp expression for all attributes with effective_timestamp: true at the table level:

tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL
 
    # Default for ALL attributes with effective_timestamp: true
    entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
    attributes:
      - id: CUSTOMER_NAME
        transformation_expression: customer_name
      - id: CUSTOMER_EMAIL
        transformation_expression: email

With this setup, any attribute that has effective_timestamp: true in the model will automatically use CURRENT_TIMESTAMP for its effective timestamp.

attribute_effective_timestamp_expression

Override the table default for specific attributes that need a different timestamp source:

tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL
 
    # Default timestamp expression
    entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
    attributes:
      - id: CUSTOMER_NAME
        transformation_expression: customer_name
        # Uses entity default: CURRENT_TIMESTAMP
 
      - id: CUSTOMER_ADDRESS
        transformation_expression: address
        # Override: use the actual update timestamp from the source
        attribute_effective_timestamp_expression: updated_at

When to use each

| Scenario | Use |
|---|---|
| Same timestamp for all attributes | entity_effective_timestamp_expression only |
| Most attributes share a timestamp, some differ | Both (entity default + attribute overrides) |
| Each attribute has a unique timestamp | attribute_effective_timestamp_expression on each |
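
For the last scenario, the table-level default can be omitted entirely and each attribute carries its own expression. A sketch, assuming the source columns name_changed_at and email_changed_at exist:

```yaml
tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL

    # No entity_effective_timestamp_expression: every attribute sets its own
    attributes:
      - id: CUSTOMER_NAME
        transformation_expression: customer_name
        attribute_effective_timestamp_expression: name_changed_at
      - id: CUSTOMER_EMAIL
        transformation_expression: email
        attribute_effective_timestamp_expression: email_changed_at
```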

Transformation Expressions

Transformation expressions define how to extract and transform data from source columns. These expressions use the SQL dialect of your source database, so syntax varies by platform (PostgreSQL, Snowflake, BigQuery, etc.).

Note: The examples below use PostgreSQL syntax. Consult your database documentation for platform-specific functions and operators.

Direct Column Mapping

- id: CUSTOMER_NAME
  transformation_expression: customer_name

SQL Functions

- id: CUSTOMER_NAME
  transformation_expression: UPPER(TRIM(customer_name))
 
- id: CREATED_DATE
  transformation_expression: TO_TIMESTAMP(created_at, 'YYYY-MM-DD HH24:MI:SS')
 
- id: TOTAL_AMOUNT
  transformation_expression: CAST(amount AS DECIMAL(10,2))

Concatenation

- id: FULL_NAME
  transformation_expression: first_name || ' ' || last_name
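
Note that in most SQL dialects, concatenating a NULL with || yields NULL. If either name column can be NULL, a guarded variant might look like:

```yaml
- id: FULL_NAME
  # COALESCE guards each operand: NULL || anything is NULL in most dialects
  transformation_expression: COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
```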

Multiline Expressions

For complex SQL expressions like CASE statements, subqueries, or nested function calls, use YAML's multiline syntax. This makes your mappings much more readable.

Folded Style (>)

The folded style (>) joins lines with spaces, ideal for SQL that would normally be written on one line:

- id: DISCOUNT_AMOUNT
  transformation_expression: >
    CASE
      WHEN total > 1000 THEN total * 0.15
      WHEN total > 500 THEN total * 0.10
      WHEN total > 100 THEN total * 0.05
      ELSE 0
    END

Literal Style (|)

The literal style (|) preserves line breaks exactly, useful when formatting matters:

- id: CATEGORY_NAME
  transformation_expression: |
    (SELECT category_name
     FROM categories c
     WHERE c.id = products.category_id)

Multiline with Modifiers

You can add modifiers to control trailing newlines:

| Indicator | Description |
|---|---|
| `>` or `\|` | Default: single trailing newline |
| `>-` or `\|-` | Strip: no trailing newline |
| `>+` or `\|+` | Keep: preserve all trailing newlines |

# Strip trailing newline with >-
- id: STATUS_CODE
  transformation_expression: >-
    COALESCE(
      status,
      'UNKNOWN'
    )

Practical Examples

Complex CASE expression:

- id: ORDER_PRIORITY
  transformation_expression: >
    CASE
      WHEN is_expedited = true THEN 'HIGH'
      WHEN order_date < CURRENT_DATE - INTERVAL '30 days' THEN 'HIGH'
      WHEN total_amount > 10000 THEN 'MEDIUM'
      ELSE 'LOW'
    END

Nested function calls:

- id: FORMATTED_ADDRESS
  transformation_expression: >
    CONCAT_WS(', ',
      NULLIF(TRIM(street), ''),
      NULLIF(TRIM(city), ''),
      NULLIF(TRIM(state), ''),
      NULLIF(TRIM(postal_code), '')
    )

Correlated subquery:

- id: LATEST_ORDER_DATE
  transformation_expression: |
    (SELECT MAX(order_date)
     FROM orders o
     WHERE o.customer_id = customers.customer_id)

Tip: Use folded style (>) for most SQL expressions. Use literal style (|) when you need exact line breaks preserved (rare).

Where Clauses

Filter conditions to select relevant data:

- id: EMAIL
  transformation_expression: email
  where: email IS NOT NULL AND email LIKE '%@%'
 
- id: ACTIVE_STATUS
  transformation_expression: is_active
  where: deleted_at IS NULL
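
The Table Fields reference also lists a table-level where clause, which filters rows before any attributes are extracted. A sketch, assuming a deleted_at soft-delete column on the source:

```yaml
tables:
  - connection: dev
    table: public.customers
    primary_keys:
      - customer_id
    ingestion_strategy: FULL
    # Table-level filter: applies to every attribute mapped from this table
    where: deleted_at IS NULL
    attributes:
      - id: CUSTOMER_ID
        transformation_expression: customer_id
```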

Relationship Mappings

If your model defines relationships between entities, you map them by specifying how to extract the foreign key from your source data. Relationships are defined at the mapping group level (not inside tables).

Basic Relationship Mapping

Suppose your model has an ORDER_ITEM entity with a relationship to ORDER (defined as ORDER_ITEM_BELONGS_TO_ORDER in model.yaml). In your order item mapping, you specify how to find the order ID:

entity_id: ORDER_ITEM
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: stage.order_items
        primary_keys:
          - order_id || ' ' || order_item_id
        ingestion_strategy: FULL
 
        attributes:
          - id: ORDER_ITEM_ID
            transformation_expression: order_item_id
          - id: PRICE
            transformation_expression: price
 
    # Relationships are defined at the mapping group level
    relationships:
      - id: ORDER_ITEM_BELONGS_TO_ORDER
        source_table: stage.order_items
        target_transformation_expression: order_id

Relationship Fields

| Field | Description |
|---|---|
| id | The relationship ID from your model (e.g., ORDER_ITEM_BELONGS_TO_ORDER, IS_PLACED_BY) |
| source_table | The table containing the foreign key data (use schema.table format) |
| target_transformation_expression | SQL expression that returns the target entity's identifier |

Multi-Table Mappings

Map a single entity from multiple source tables:

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: archive.legacy_customers
        primary_keys:
          - old_customer_id
        ingestion_strategy: FULL
        entity_effective_timestamp_expression: migrated_at
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: old_customer_id
 
      - connection: dev
        table: public.new_customers
        primary_keys:
          - customer_uuid
        ingestion_strategy: FULL
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: customer_uuid

Multiple Identifiers

By default, each entity has a single identifier (primary key). The allow_multiple_identifiers setting is an advanced feature for rare scenarios involving multi-source data integration.

When Multiple Identifiers Are Needed

This feature is designed for multi-source data integration scenarios where the same real-world entity exists in different source systems with different identifiers, and you need to preserve both:

| Scenario | Example |
|---|---|
| Multi-source integration | Same customer exists as CRM_ID: 12345 in Salesforce and LEGACY_ID: A789 in your legacy ERP; both IDs must remain queryable |
| M&A data consolidation | After acquiring a company, customer ACME_ID: 100 must coexist with your CORP_ID: 5000 for the same person |
| Composite business keys | Order line items identified by both order_id + line_number as a composite key from the source |

When NOT to Use Multiple Identifiers

For most use cases, a single surrogate key is preferred:

  • Customer lookup by email/phone - Use attributes with indexes, not multiple identifiers
  • Alternative lookup keys - Model these as regular attributes, not identifiers
  • Product SKU vs barcode - One should be the identifier, the other an attribute

Industry best practice: Data warehouses typically use surrogate keys rather than composite business keys. Multiple identifiers add complexity and are only justified when you must preserve source system identity relationships.
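
For instance, rather than enabling multiple identifiers for a product that has both a SKU and a barcode, keep one as the identifier and map the other as a regular attribute. A sketch with a hypothetical PRODUCT entity:

```yaml
entity_id: PRODUCT

mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false   # a single identifier is enough

    tables:
      - connection: dev
        table: public.products
        primary_keys:
          - sku
        ingestion_strategy: FULL
        attributes:
          - id: PRODUCT_ID
            transformation_expression: sku       # SKU is the identifier
          - id: BARCODE
            transformation_expression: barcode   # barcode is just an attribute
```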

Enabling Multiple Identifiers

Set allow_multiple_identifiers: true in your mapping group:

entity_id: CUSTOMER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: true  # Enable multiple identifiers
 
    tables:
      # Source 1: CRM system
      - connection: crm_db
        table: crm.customers
        primary_keys:
          - crm_customer_id
        ingestion_strategy: FULL
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: crm_customer_id
 
      # Source 2: Legacy ERP (same customers, different IDs)
      - connection: legacy_db
        table: erp.customers
        primary_keys:
          - legacy_id
        ingestion_strategy: FULL
        attributes:
          - id: CUSTOMER_ID
            transformation_expression: legacy_id

Warning: This setting is irreversible. Once you enable multiple identifiers for an entity and materialize the mapping, you cannot go back to single identifier mode. This is a permanent structural change to how the entity is stored.

Key Considerations

  1. Exhaust alternatives first - Can you solve this with a single identifier and additional attributes?
  2. Plan ahead - Decide whether you need multiple identifiers before your first deployment
  3. Performance impact - Multiple identifiers increase storage and may affect query performance
  4. No rollback - If you enable this by mistake, you'll need to recreate the entity from scratch

Complete Example

entity_id: ORDER
 
mapping_groups:
  - name: default_mapping_group
    allow_multiple_identifiers: false
 
    tables:
      - connection: dev
        table: public.orders
 
        primary_keys:
          - order_id
 
        ingestion_strategy: FULL
 
        # Default timestamp for most attributes
        entity_effective_timestamp_expression: CURRENT_TIMESTAMP
 
        attributes:
          # Identifier
          - id: ORDER_ID
            transformation_expression: order_id
            where: order_id IS NOT NULL
 
          # Order date
          - id: ORDER_DATE
            transformation_expression: order_date
            where: order_date IS NOT NULL
 
          # Total amount - uses source's updated_at instead of default
          - id: TOTAL_AMOUNT
            transformation_expression: CAST(total_amount AS DECIMAL(10,2))
            where: total_amount > 0
            attribute_effective_timestamp_expression: updated_at
 
          # Status - uses source's status_changed_at
          - id: STATUS
            transformation_expression: UPPER(status)
            where: status IN ('pending', 'completed', 'cancelled')
            attribute_effective_timestamp_expression: status_changed_at
 
    # Relationships are defined at the mapping group level
    relationships:
      - id: ORDER_IS_PLACED_BY_CUSTOMER
        source_table: public.orders
        target_transformation_expression: customer_id

Quick Reference

Looking up a specific field? Here's the complete reference for all fields in mapping YAML files.

Mapping Root Fields

| Field | Type | Required | Description |
|---|---|---|---|
| entity_id | string | ✓ | Target entity ID from your model |
| mapping_groups | list of MappingGroup | ✓ | List of mapping groups for this entity |

✓ = required, ○ = optional

Mapping Group Fields

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | ✓ | Name of the mapping group |
| allow_multiple_identifiers | bool | ○ | Allow multiple identifier values per entity. This setting is irreversible: once enabled, you cannot go back to single-identifier mode. Only enable it if you need multiple primary keys per entity |
| tables | list of Table | ✓ | List of source tables in this mapping group |
| relationships | list of RelationshipMapping | ○ | Relationship mappings at the group level |

✓ = required, ○ = optional

Table Fields

| Field | Type | Required | Description |
|---|---|---|---|
| connection | string | ✓ | Connection profile name |
| table | string | ✓ | Source table name with schema |
| primary_keys | list of string | ✓ | Primary key columns from the source table |
| ingestion_strategy | IngestionStrategy | ✓ | How to load data from this table |
| where | string | ○ | SQL WHERE clause to filter source data. Syntax is database-specific; consult your platform's SQL documentation |
| entity_effective_timestamp_expression | string | ○ | Default timestamp expression for all attributes with effective_timestamp: true in the model. Syntax is database-specific; consult your platform's SQL documentation |
| attributes | list of AttributeMapping | ✓ | List of attribute mappings |

✓ = required, ○ = optional

Attribute Mapping Fields

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Attribute ID from the model |
| transformation_expression | string | ✓ | SQL expression to extract/transform the value. Syntax is database-specific; consult your platform's SQL documentation |
| ingestion_strategy | IngestionStrategy | ○ | Override the ingestion strategy for this attribute |
| where | string | ○ | SQL WHERE clause for this attribute. Syntax is database-specific; consult your platform's SQL documentation |
| attribute_effective_timestamp_expression | string | ○ | Override timestamp expression for this attribute; only needed when it has a different timestamp source than the other attributes. Syntax is database-specific; consult your platform's SQL documentation |

✓ = required, ○ = optional

Relationship Mapping Fields

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Relationship ID from the model |
| source_table | string | ✓ | Source table name with schema |
| target_transformation_expression | string | ✓ | Expression that returns the target entity's identifier. Syntax is database-specific; consult your platform's SQL documentation |

✓ = required, ○ = optional

Best Practices

  1. Write clear definitions and descriptions - Good documentation makes your mappings self-explanatory
  2. Generate templates first - Start with generated templates and customize as needed
  3. Always use schema.table format - Specify schema.table (e.g., public.customers, stage.raw_data) to avoid ambiguity
  4. Use descriptive expressions - Make transformation logic clear and maintainable
  5. Add where clauses - Filter out invalid or incomplete data at the source
  6. Use entity_effective_timestamp_expression - Set a DRY default at the table level
  7. Override when needed - Use attribute_effective_timestamp_expression for attributes with different timestamp sources
  8. Validate data types - Use CAST functions to ensure correct data types
  9. Choose the right ingestion strategy - Use FULL for small tables, INCREMENTAL for large fact tables

Next Steps