Mapping
Mappings define how source data from your operational databases maps to your Daana data model.
Mapping Structure
A mapping file connects source tables to a single entity in your model. It has four main components:
- Entity target - Which entity this mapping populates
- Mapping groups - Container for table mappings (typically one group)
- Tables - Source tables with column-to-attribute mappings
- Relationships - How this entity connects to other entities (foreign keys)
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default timestamp expression for all attributes with effective_timestamp: true
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: CUSTOMER_ID
transformation_expression: customer_id
where: customer_id IS NOT NULL
- id: CUSTOMER_NAME
transformation_expression: customer_name
where: customer_name IS NOT NULLTip: Always use
schema.tableformat (e.g.,public.customers,stage.raw_data) to avoid ambiguity across database schemas.
Ingestion Strategies
The ingestion_strategy field determines how Daana loads data from your source table. Choose based on your data characteristics:
| Strategy | Best For | Description |
|---|---|---|
| FULL | Small dimension tables | Complete snapshot of all data for each delivery |
| INCREMENTAL | Large fact tables | Only changed or new data since last load (requires watermark column) |
| FULL_LOG | Change history tables | Complete history of all changes, with multiple rows per instance |
| TRANSACTIONAL | Append-only event logs | Append-only data where each instance is delivered exactly once |
See the Glossary for detailed descriptions and examples of each strategy.
Effective Timestamps
When your model has attributes with effective_timestamp: true, Daana tracks when values change over time. Instead of manually creating timestamp entries in your mapping, use the DRY expression fields.
entity_effective_timestamp_expression
Set a default timestamp expression for all attributes with effective_timestamp: true at the table level:
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default for ALL attributes with effective_timestamp: true
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: CUSTOMER_NAME
transformation_expression: customer_name
- id: CUSTOMER_EMAIL
transformation_expression: emailWith this setup, any attribute that has effective_timestamp: true in the model will automatically use CURRENT_TIMESTAMP for its effective timestamp.
attribute_effective_timestamp_expression
Override the table default for specific attributes that need a different timestamp source:
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default timestamp expression
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: CUSTOMER_NAME
transformation_expression: customer_name
# Uses entity default: CURRENT_TIMESTAMP
- id: CUSTOMER_ADDRESS
transformation_expression: address
# Override: use the actual update timestamp from the source
attribute_effective_timestamp_expression: updated_atWhen to use each
| Scenario | Use |
|---|---|
| Same timestamp for all attributes | entity_effective_timestamp_expression only |
| Most attributes share a timestamp, some differ | Both (entity default + attribute overrides) |
| Each attribute has a unique timestamp | attribute_effective_timestamp_expression on each |
Transformation Expressions
Transformation expressions define how to extract and transform data from source columns. These expressions use the SQL dialect of your source database, so syntax varies by platform (PostgreSQL, Snowflake, BigQuery, etc.).
Note: The examples below use PostgreSQL syntax. Consult your database documentation for platform-specific functions and operators.
Direct Column Mapping
- id: CUSTOMER_NAME
transformation_expression: customer_nameSQL Functions
- id: CUSTOMER_NAME
transformation_expression: UPPER(TRIM(customer_name))
- id: CREATED_DATE
transformation_expression: TO_TIMESTAMP(created_at, 'YYYY-MM-DD HH24:MI:SS')
- id: TOTAL_AMOUNT
transformation_expression: CAST(amount AS DECIMAL(10,2))Concatenation
- id: FULL_NAME
transformation_expression: first_name || ' ' || last_nameMultiline Expressions
For complex SQL expressions like CASE statements, subqueries, or nested function calls, use YAML's multiline syntax. This makes your mappings much more readable.
Folded Style (>)
The folded style (>) joins lines with spaces, ideal for SQL that would normally be written on one line:
- id: DISCOUNT_AMOUNT
transformation_expression: >
CASE
WHEN total > 1000 THEN total * 0.15
WHEN total > 500 THEN total * 0.10
WHEN total > 100 THEN total * 0.05
ELSE 0
ENDLiteral Style (|)
The literal style (|) preserves line breaks exactly, useful when formatting matters:
- id: CATEGORY_NAME
transformation_expression: |
(SELECT category_name
FROM categories c
WHERE c.id = products.category_id)Multiline with Modifiers
You can add modifiers to control trailing newlines:
| Indicator | Description |
|---|---|
> or | | Default: single trailing newline |
>- or |- | Strip: no trailing newline |
>+ or |+ | Keep: preserve all trailing newlines |
# Strip trailing newline with >-
- id: STATUS_CODE
transformation_expression: >-
COALESCE(
status,
'UNKNOWN'
)Practical Examples
Complex CASE expression:
- id: ORDER_PRIORITY
transformation_expression: >
CASE
WHEN is_expedited = true THEN 'HIGH'
WHEN order_date < CURRENT_DATE - INTERVAL '30 days' THEN 'HIGH'
WHEN total_amount > 10000 THEN 'MEDIUM'
ELSE 'LOW'
ENDNested function calls:
- id: FORMATTED_ADDRESS
transformation_expression: >
CONCAT_WS(', ',
NULLIF(TRIM(street), ''),
NULLIF(TRIM(city), ''),
NULLIF(TRIM(state), ''),
NULLIF(TRIM(postal_code), '')
)Correlated subquery:
- id: LATEST_ORDER_DATE
transformation_expression: |
(SELECT MAX(order_date)
FROM orders o
WHERE o.customer_id = customers.customer_id)Tip: Use folded style (
>) for most SQL expressions. Use literal style (|) when you need exact line breaks preserved (rare).
Where Clauses
Filter conditions to select relevant data:
- id: EMAIL
transformation_expression: email
where: email IS NOT NULL AND email LIKE '%@%'
- id: ACTIVE_STATUS
transformation_expression: is_active
where: deleted_at IS NULLRelationship Mappings
If your model defines relationships between entities, you map them by specifying how to extract the foreign key from your source data. Relationships are defined at the mapping group level (not inside tables).
Basic Relationship Mapping
Suppose your model has an ORDER_ITEM entity with a relationship to ORDER (defined as ORDER_ITEM_BELONGS_TO_ORDER in model.yaml). In your order item mapping, you specify how to find the order ID:
entity_id: ORDER_ITEM
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: stage.order_items
primary_keys:
- order_id || ' ' || order_item_id
ingestion_strategy: FULL
attributes:
- id: ORDER_ITEM_ID
transformation_expression: order_item_id
- id: PRICE
transformation_expression: price
# Relationships are defined at the mapping group level
relationships:
- id: ORDER_ITEM_BELONGS_TO_ORDER
source_table: stage.order_items
target_transformation_expression: order_idRelationship Fields
| Field | Description |
|---|---|
id | The relationship ID from your model (e.g., ORDER_ITEM_BELONGS_TO_ORDER, IS_PLACED_BY) |
source_table | The table containing the foreign key data (use schema.table format) |
target_transformation_expression | SQL expression that returns the target entity's identifier |
Multi-Table Mappings
Map a single entity from multiple source tables:
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: archive.legacy_customers
primary_keys:
- old_customer_id
ingestion_strategy: FULL
entity_effective_timestamp_expression: migrated_at
attributes:
- id: CUSTOMER_ID
transformation_expression: old_customer_id
- connection: dev
table: public.new_customers
primary_keys:
- customer_uuid
ingestion_strategy: FULL
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: CUSTOMER_ID
transformation_expression: customer_uuidMultiple Identifiers
By default, each entity has a single identifier (primary key). The allow_multiple_identifiers setting is an advanced feature for rare scenarios involving multi-source data integration.
When Multiple Identifiers Are Needed
This feature is designed for multi-source data integration scenarios where the same real-world entity exists in different source systems with different identifiers, and you need to preserve both:
| Scenario | Example |
|---|---|
| Multi-source integration | Same customer exists as CRM_ID: 12345 in Salesforce and LEGACY_ID: A789 in your legacy ERP - both IDs must remain queryable |
| M&A data consolidation | After acquiring a company, customer ACME_ID: 100 must coexist with your CORP_ID: 5000 for the same person |
| Composite business keys | Order line items identified by both order_id + line_number as a composite key from source |
When NOT to Use Multiple Identifiers
For most use cases, a single surrogate key is preferred:
- Customer lookup by email/phone - Use attributes with indexes, not multiple identifiers
- Alternative lookup keys - Model these as regular attributes, not identifiers
- Product SKU vs barcode - One should be the identifier, the other an attribute
Industry best practice: Data warehouses typically use surrogate keys rather than composite business keys. Multiple identifiers add complexity and are only justified when you must preserve source system identity relationships.
Enabling Multiple Identifiers
Set allow_multiple_identifiers: true in your mapping group:
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: true # Enable multiple identifiers
tables:
# Source 1: CRM system
- connection: crm_db
table: crm.customers
primary_keys:
- crm_customer_id
ingestion_strategy: FULL
attributes:
- id: CUSTOMER_ID
transformation_expression: crm_customer_id
# Source 2: Legacy ERP (same customers, different IDs)
- connection: legacy_db
table: erp.customers
primary_keys:
- legacy_id
ingestion_strategy: FULL
attributes:
- id: CUSTOMER_ID
transformation_expression: legacy_idWarning: This setting is irreversible. Once you enable multiple identifiers for an entity and materialize the mapping, you cannot go back to single identifier mode. This is a permanent structural change to how the entity is stored.
Key Considerations
- Exhaust alternatives first - Can you solve this with a single identifier and additional attributes?
- Plan ahead - Decide whether you need multiple identifiers before your first deployment
- Performance impact - Multiple identifiers increase storage and may affect query performance
- No rollback - If you enable this by mistake, you'll need to recreate the entity from scratch
Complete Example
entity_id: ORDER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: public.orders
primary_keys:
- order_id
ingestion_strategy: FULL
# Default timestamp for most attributes
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
# Identifier
- id: ORDER_ID
transformation_expression: order_id
where: order_id IS NOT NULL
# Order date
- id: ORDER_DATE
transformation_expression: order_date
where: order_date IS NOT NULL
# Total amount - uses source's updated_at instead of default
- id: TOTAL_AMOUNT
transformation_expression: CAST(total_amount AS DECIMAL(10,2))
where: total_amount > 0
attribute_effective_timestamp_expression: updated_at
# Status - uses source's status_changed_at
- id: STATUS
transformation_expression: UPPER(status)
where: status IN ('pending', 'completed', 'cancelled')
attribute_effective_timestamp_expression: status_changed_at
# Relationships are defined at the mapping group level
relationships:
- id: ORDER_IS_PLACED_BY_CUSTOMER
source_table: public.orders
target_transformation_expression: customer_idQuick Reference
Looking up a specific field? Here's the complete reference for all fields in mapping YAML files.
Mapping Root Fields
| Field | Type | Required | Description |
|---|---|---|---|
| entity_id | string | ✓ | Target entity ID from your model |
| mapping_groups | list of MappingGroup | ✓ | List of mapping groups for this entity |
✓ = required, ○ = optional
Mapping Group Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | ✓ | Name of the mapping group |
| allow_multiple_identifiers | bool | ✓ | Allow multiple identifier values per entity ⚠ This setting is irreversible - once enabled, you cannot go back to single identifier mode⚠ Only enable if you need multiple primary keys per entity |
| tables | list of Table | ✓ | List of source tables in this mapping group |
| relationships | list of RelationshipMapping | ○ | Relationship mappings at the group level |
✓ = required, ○ = optional
Table Fields
| Field | Type | Required | Description |
|---|---|---|---|
| connection | string | ✓ | Connection profile name |
| table | string | ✓ | Source table name with schema |
| primary_keys | list of string | ✓ | Primary key columns from the source table |
| ingestion_strategy | IngestionStrategy | ✓ | How to load data from this table |
| where | string | ○ | SQL WHERE clause to filter source data ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| entity_effective_timestamp_expression | string | ○ | Default timestamp expression for all attributes ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Used as default for attributes with effective_timestamp: true in the model |
| attributes | list of AttributeMapping | ✓ | List of attribute mappings |
✓ = required, ○ = optional
Attribute Mapping Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Attribute ID from the model |
| transformation_expression | string | ✓ | SQL expression to extract/transform the value ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| ingestion_strategy | IngestionStrategy | ○ | Override ingestion strategy for this attribute |
| where | string | ○ | SQL WHERE clause for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| attribute_effective_timestamp_expression | string | ○ | Override timestamp expression for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Only needed when this attribute has a different timestamp source than other attributes |
✓ = required, ○ = optional
Relationship Mapping Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Relationship ID from the model |
| source_table | string | ✓ | Source table name with schema |
| target_transformation_expression | string | ✓ | Expression to get the target entity identifier ⚠ Syntax is database-specific - consult your platform's SQL documentation |
✓ = required, ○ = optional
Best Practices
- Write clear definitions and descriptions - Good documentation makes your mappings self-explanatory
- Generate templates first - Start with generated templates and customize as needed
- Always use schema.table format - Specify
schema.table(e.g.,public.customers,stage.raw_data) to avoid ambiguity - Use descriptive expressions - Make transformation logic clear and maintainable
- Add where clauses - Filter out invalid or incomplete data at the source
- Use entity_effective_timestamp_expression - Set a DRY default at the table level
- Override when needed - Use
attribute_effective_timestamp_expressionfor attributes with different timestamp sources - Validate data types - Use CAST functions to ensure correct data types
- Choose the right ingestion strategy - Use FULL for small tables, INCREMENTAL for large fact tables
Next Steps
- Define workflows to orchestrate data transformations
- Configure connections to your data sources
- Follow the tutorial for a complete end-to-end example