DMDL Reference
Mapping
Mappings define how source data from your operational databases maps to your Daana data model.
Mapping Structure
A mapping file connects source tables to a single entity in your model. It has four main components:
- Entity target - Which entity this mapping populates
- Mapping groups - Container for table mappings (typically one group)
- Tables - Source tables with column-to-attribute mappings
- Relationships - How this entity connects to other entities (foreign keys)
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default timestamp expression for all attributes with effective_timestamp: true
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: customer_id
transformation_expression: customer_id
where: customer_id IS NOT NULL
- id: customer_name
transformation_expression: customer_name
where: customer_name IS NOT NULL
Casing convention. Throughout the mapping examples, attribute ids use
lower_snake_caseso they line up with the lowercase columns in the generatedview_<entity>views, while entity ids, relationship ids, and atomic-context values useUPPER_SNAKE_CASE.
Tip: Always use
schema.tableformat (e.g.,public.customers,stage.raw_data) to avoid ambiguity across database schemas.
Composite and Computed Primary Keys
primary_keys accepts two forms. The simple form is a list of one or more source column names; that is what the generator emits and what most mappings use:
primary_keys:
- customer_id
When the source has no single column that uniquely identifies a row, you can use a SQL expression directly. The expression is evaluated against the source table on every read and its result becomes the natural-key string Daana hashes into the surrogate key:
# Composite key: order_line_id only unique within an order
primary_keys:
- CONCAT(CAST(order_id AS TEXT), '-', CAST(order_line_id AS TEXT))
Tutorial Chapter 5 walks through this for ORDER_LINE (Step 5: Edit the Order Line Mapping).
Stability matters. Daana derives
<entity>_keydeterministically from the natural-key value the expression returns for each row. Two expressions that produce the same string for the same row (for exampleCONCAT(a, '-', b)) yield the same surrogate keys. Changing the expression so that it returns a different string (for example dropping a separator, or changing the order of columns) does change every surrogate key, so treat any change to the value shape the same way you would treat changing a primary-key column.
Ingestion Strategies
The ingestion_strategy field determines how Daana loads data from your source table. Choose based on your data characteristics:
| Strategy | Best For | Description |
|---|---|---|
| FULL | Source has current state only (1 row per entity), small enough to scan each run | Complete snapshot of all data for each delivery |
| FULL_LOG | Source already contains history (N rows per entity), small enough to scan each run | Complete history of all changes, with multiple rows per instance |
| INCREMENTAL | Source is mutable (rows can be updated in place) and large enough that watermarked reads pay off | Only changed or new data since last load (requires watermark column) |
| TRANSACTIONAL | Source is an append-only event log (rows are written once and never updated) | Append-only data where each instance is delivered exactly once |
For narrative coverage of when to pick each strategy, walked through with a real source for every choice, see Tutorial Chapter 4: Loading Strategies. For per-attribute overrides (e.g., INCREMENTAL table with a FULL_LOG attribute), see Advanced Mapping Patterns.
Effective Timestamps
When your model has attributes with effective_timestamp: true, Daana tracks when values change over time. Instead of manually creating timestamp entries in your mapping, use the DRY expression fields.
entity_effective_timestamp_expression
Set a default timestamp expression for all attributes with effective_timestamp: true at the table level:
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default for ALL attributes with effective_timestamp: true
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: customer_name
transformation_expression: customer_name
- id: customer_email
transformation_expression: email
With this setup, any attribute that has effective_timestamp: true in the model will automatically use CURRENT_TIMESTAMP for its effective timestamp.
attribute_effective_timestamp_expression
Override the table default for specific attributes that need a different timestamp source:
tables:
- connection: dev
table: public.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
# Default timestamp expression
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: customer_name
transformation_expression: customer_name
# Uses entity default: CURRENT_TIMESTAMP
- id: customer_address
transformation_expression: address
# Override: use the actual update timestamp from the source
attribute_effective_timestamp_expression: updated_at
When to use each
| Scenario | Use |
|---|---|
| Same timestamp for all attributes | entity_effective_timestamp_expression only |
| Most attributes share a timestamp, some differ | Both (entity default + attribute overrides) |
| Each attribute has a unique timestamp | attribute_effective_timestamp_expression on each |
Transformation Expressions
Transformation expressions define how to extract and transform data from source columns. These expressions use the SQL dialect of your source database, so syntax varies by platform (PostgreSQL, Snowflake, BigQuery, etc.).
Note: The examples below use PostgreSQL syntax. Consult your database documentation for platform-specific functions and operators.
Direct Column Mapping
- id: customer_name
transformation_expression: customer_name
SQL Functions
- id: customer_name
transformation_expression: UPPER(TRIM(customer_name))
- id: created_date
transformation_expression: TO_TIMESTAMP(created_at, 'YYYY-MM-DD HH24:MI:SS')
- id: total_amount
transformation_expression: CAST(amount AS DECIMAL(10,2))
Concatenation
- id: full_name
transformation_expression: first_name || ' ' || last_name
Multiline Expressions
For complex SQL expressions like CASE statements, subqueries, or nested function calls, use YAML's multiline syntax. This makes your mappings much more readable.
Folded Style (>)
The folded style (>) joins lines with spaces, ideal for SQL that would normally be written on one line:
- id: discount_amount
transformation_expression: >
CASE
WHEN total > 1000 THEN total * 0.15
WHEN total > 500 THEN total * 0.10
WHEN total > 100 THEN total * 0.05
ELSE 0
END
Literal Style (|)
The literal style (|) preserves line breaks exactly, useful when formatting matters:
- id: category_name
transformation_expression: |
(SELECT category_name
FROM categories c
WHERE c.id = products.category_id)
Multiline with Modifiers
You can add modifiers to control trailing newlines:
| Indicator | Description |
|---|---|
> or | | Default: single trailing newline |
>- or |- | Strip: no trailing newline |
>+ or |+ | Keep: preserve all trailing newlines |
# Strip trailing newline with >-
- id: status_code
transformation_expression: >-
COALESCE(
status,
'UNKNOWN'
)
Practical Examples
Complex CASE expression:
- id: order_priority
transformation_expression: >
CASE
WHEN is_expedited = true THEN 'HIGH'
WHEN order_date < CURRENT_DATE - INTERVAL '30 days' THEN 'HIGH'
WHEN total_amount > 10000 THEN 'MEDIUM'
ELSE 'LOW'
END
Nested function calls:
- id: formatted_address
transformation_expression: >
CONCAT_WS(', ',
NULLIF(TRIM(street), ''),
NULLIF(TRIM(city), ''),
NULLIF(TRIM(state), ''),
NULLIF(TRIM(postal_code), '')
)
Correlated subquery:
- id: latest_order_date
transformation_expression: |
(SELECT MAX(order_date)
FROM orders o
WHERE o.customer_id = customers.customer_id)
Tip: Use folded style (
>) for most SQL expressions. Use literal style (|) when you need exact line breaks preserved (rare).
Where Clauses
Filter conditions to select relevant data:
- id: email
transformation_expression: email
where: email IS NOT NULL AND email LIKE '%@%'
- id: active_status
transformation_expression: is_active
where: deleted_at IS NULL
Relationship Mappings
If your model defines relationships between entities, you map them by specifying how to extract the foreign key from your source data. Relationships are defined at the mapping group level (not inside tables).
Basic Relationship Mapping
Suppose your model has an ORDER_ITEM entity with a relationship to ORDER, defined with name: BELONGS_TO, source_entity_id: ORDER_ITEM, and target_entity_id: ORDER. Daana derives the relationship id ORDER_ITEM_BELONGS_TO_ORDER. In the order item mapping you reference that id and specify how to find the order ID:
entity_id: ORDER_ITEM
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: stage.order_items
primary_keys:
- order_id || ' ' || order_item_id
ingestion_strategy: FULL
attributes:
- id: order_item_id
transformation_expression: order_item_id
- id: price
transformation_expression: price
# Relationships are defined at the mapping group level
relationships:
- id: ORDER_ITEM_BELONGS_TO_ORDER
source_table: stage.order_items
target_transformation_expression: order_id
Relationship Fields
| Field | Description |
|---|---|
id | The relationship's id, derived by Daana from the model definition following the pattern {SOURCE_ENTITY_ID}_{RELATIONSHIP_NAME}_{TARGET_ENTITY_ID} (e.g., ORDER_IS_PLACED_BY_CUSTOMER) |
source_table | The table containing the foreign key data (use schema.table format) |
target_transformation_expression | SQL expression that returns the target entity's identifier |
Finding the correct
idvalue:idfollows a fixed pattern: a model-relationship withname: "IS_PLACED_BY"between source entityORDERand target entityCUSTOMERgives the mapping-relationshipidORDER_IS_PLACED_BY_CUSTOMER. The easiest way to get the exact value is to rundaana-cli generate mapping, which pre-fills it for each relationship.
Multi-Source Relationships
When a relationship's data comes from multiple sources, or from a different table than the entity's attributes, you have several patterns available.
Multi-Group Pattern
Two mapping groups can drive the same relationship independently. Each group generates its own pipeline with a unique DATA_KEY, and truedelta operates independently per group:
entity_id: ORDER
mapping_groups:
- name: online_orders
tables:
- connection: dev
table: online.orders
primary_keys: [order_id]
ingestion_strategy: INCREMENTAL
batch_expression: created_at
attributes:
- id: order_total_attr
transformation_expression: total_amount
relationships:
- id: ORDER_PLACED_BY_CUSTOMER
source_table: online.orders
target_transformation_expression: customer_id
- name: retail_orders
tables:
- connection: dev
table: retail.pos_transactions
primary_keys: [transaction_id]
ingestion_strategy: FULL
attributes:
- id: order_total_attr
transformation_expression: amount
relationships:
- id: ORDER_PLACED_BY_CUSTOMER
source_table: retail.pos_transactions
target_transformation_expression: loyalty_customer_id
Both groups populate the same relationship (ORDER_PLACED_BY_CUSTOMER) from different source tables. The DATA_KEY isolation ensures each group's truedelta operates independently: online order relationships never interfere with retail order relationships.
Relationship-Only Source Tables
A table entry can have empty attributes (attributes: []) and exist solely to provide primary_keys, ingestion_strategy, and batch_expression for a relationship's source_table. This is useful when a relationship's source data comes from a different table than the entity's attributes.
entity_id: CONTENT
mapping_groups:
- name: content_group
tables:
# Primary source: provides attributes
- connection: dev
table: catalog.content_items
primary_keys: [content_id]
ingestion_strategy: FULL
attributes:
- id: content_title_attr
transformation_expression: title
# Relationship-only source: no attributes, provides key derivation + batch config
- connection: dev
table: catalog.content_contributors
primary_keys: [content_id]
ingestion_strategy: INCREMENTAL
batch_expression: updated_at
attributes: []
relationships:
- id: CONTENT_HAS_CONTRIBUTOR
source_table: catalog.content_contributors
target_transformation_expression: contributor_id
The relationship pipeline uses catalog.content_contributors as its source, deriving entity keys from that table's primary_keys and applying batch filtering from its ingestion_strategy and batch_expression. No TYPE pipelines are generated for this table since it has no attributes.
Both daana-cli check and daana-cli deploy accept this pattern.
source_table Resolution
The source_table field must exactly match one of the tables in the same mapping group's tables list. If no match is found, the relationship is silently skipped (no error is produced). Double-check for typos if a relationship isn't appearing in your output.
The relationship inherits these properties from the matched source table:
- primary_keys: used to derive the entity key
- ingestion_strategy: determines the SQL template and batch filtering
- batch_expression: controls watermark-based incremental reads
Multi-Table Mappings
Map a single entity from multiple source tables:
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: archive.legacy_customers
primary_keys:
- old_customer_id
ingestion_strategy: FULL
entity_effective_timestamp_expression: migrated_at
attributes:
- id: customer_id
transformation_expression: old_customer_id
- connection: dev
table: public.new_customers
primary_keys:
- customer_uuid
ingestion_strategy: FULL
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
- id: customer_id
transformation_expression: customer_uuid
Multiple Identifiers
By default, each entity has a single identifier (primary key). The allow_multiple_identifiers setting is an advanced feature for rare scenarios involving multi-source data integration.
When Multiple Identifiers Are Needed
This feature is designed for multi-source data integration scenarios where the same real-world entity exists in different source systems with different identifiers, and you need to preserve both:
| Scenario | Example |
|---|---|
| Multi-source integration | Same customer exists as CRM_ID: 12345 in Salesforce and LEGACY_ID: A789 in your legacy ERP - both IDs must remain queryable |
| M&A data consolidation | After acquiring a company, customer ACME_ID: 100 must coexist with your CORP_ID: 5000 for the same person |
| Composite business keys | Order line items identified by both order_id + line_number as a composite key from source |
When NOT to Use Multiple Identifiers
For most use cases, a single surrogate key is preferred:
- Customer lookup by email/phone - Use attributes with indexes, not multiple identifiers
- Alternative lookup keys - Model these as regular attributes, not identifiers
- Product SKU vs barcode - One should be the identifier, the other an attribute
Industry best practice: Data warehouses typically use surrogate keys rather than composite business keys. Multiple identifiers add complexity and are only justified when you must preserve source system identity relationships.
Enabling Multiple Identifiers
Set allow_multiple_identifiers: true in your mapping group:
entity_id: CUSTOMER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: true # Enable multiple identifiers
tables:
# Source 1: CRM system. Single identifier (customer_id).
- connection: crm_db
table: crm.customers
primary_keys:
- customer_id
ingestion_strategy: FULL
attributes:
- id: customer_id
transformation_expression: customer_id
# Source 2: Legacy ERP. Carries its own identifier plus the CRM
# customer_id as a cross-reference (the overlapping identifier that
# binds the two sources together).
- connection: legacy_db
table: erp.customers
primary_keys:
- legacy_id
- customer_id
ingestion_strategy: FULL
attributes:
- id: customer_id
transformation_expression: customer_id
- id: legacy_id
transformation_expression: legacy_id
The overlap is intentional: every legacy row must carry the customer_id value of the same customer in the CRM, so that resolution at INTEGRATION time matches the existing IDFR row and reuses the CRM-assigned surrogate. The CRM source does not need to know about legacy_id. The overlap on customer_id is sufficient for the legacy system's identifiers to attach to the existing entity.
How resolution works
Daana supports up to 5 identifier types per entity, stored in slots IDFR_01 through IDFR_05. When allow_multiple_identifiers: true, each source row carries one identifier value per slot (drawn from the mapping group's primary_keys), and the pipeline resolves the row to an entity by co-occurrence:
- READER populates the IDFR slots with the identifier values from the source row.
- INTEGRATION looks each slot value up against the existing
<entity>_idfrtable. The lookup is a LEFT JOIN across all five slots; any match returns the entity's surrogate key. - TRUEDELTA classifies the row:
- If at least one slot value matches an existing identifier, Daana reuses that surrogate key for all identifiers on the row. The row carries an existing entity.
- If no slot value matches anything, Daana generates a fresh UUID. The row carries a new entity.
- INSERT writes only the new identifier values to the
<entity>_idfrtable. Existing identifiers are not re-inserted.
This is the mechanism that binds different source systems together. Every time a source row carries at least one identifier already known to the system, the other identifiers it carries also become attached to the same entity.
Example. A CRM source emits a row for one customer with customer_id = C-12345. Daana finds no match in IDFR and assigns a new surrogate UUID-AAA. The IDFR table now has one row:
customer_idfr | customer_key | row_st
C-12345 | UUID-AAA | Y
A legacy ERP source later emits a row for the same customer carrying legacy_id = L-A789 and customer_id = C-12345 together. Daana's lookup finds C-12345 → UUID-AAA, classifies the row as an existing entity, and inserts only the new identifier:
customer_idfr | customer_key | row_st
C-12345 | UUID-AAA | Y
L-A789 | UUID-AAA | Y
Both identifiers resolve to the same surrogate. Downstream queries on customer_key = UUID-AAA return both source-side identifiers as different facets of one entity.
Two invariants of resolution worth stating explicitly:
- Resolution matches values, not column names. The IDFR table stores identifier values in a single column (
<entity>_idfr); the source column the value originated from is not preserved. If the legacy source's cross-reference column were namedcrm_refinstead ofcustomer_id, resolution would still work as long as the valueC-12345appears in some row of the legacy source. What binds sources together is the overlap of values, not the overlap of column names. - The order of
primary_keysin the mapping does not affect resolution. The pipeline checks every slot against the IDFR table; any match returns the existing surrogate. Writingprimary_keys: [legacy_id, customer_id]versusprimary_keys: [customer_id, legacy_id]produces identical behavior. Slot indices are structural (they keep up to 5 identifier values on one source row distinct inside the pipeline); they are not priority ranks.
Identifier transformations
The primary_keys field accepts SQL expressions, not just column names (Composite and Computed Primary Keys covers the syntax). For multi-identifier setups, this matters: each source's primary_keys is what lands in the IDFR table, and a normalization expression lets two sources that store the same logical identifier in different shapes still produce matching values.
For example, if the CRM stores customer IDs as plain strings (12345) but the legacy source prefixes them (CRM-12345), the CRM mapping can normalize on read:
- connection: crm_db
primary_keys:
- CONCAT('CRM-', CAST(customer_id AS TEXT))
...
- connection: legacy_db
primary_keys:
- external_ref # already prefixed 'CRM-...' in legacy
- legacy_id
...
Both sources now write CRM-12345 to IDFR for the same customer, and resolution proceeds normally. The same pattern handles composite natural keys (one source carries tenant_id + customer_id, the other carries a pre-combined global_customer_id): a CONCAT expression collapses the composite into a single value that can overlap with the other source.
The overlap requirement still applies to the transformed values. If the expression on each side produces different strings for the same real-world entity, the result is identifier orphaning.
Identifier orphaning
The resolution above requires that overlapping sources share at least one identifier value per row. If they don't, Daana cannot tell that two source rows describe the same real-world entity and creates two separate surrogates.
Concrete failure mode. Assume the legacy ERP source arrives without the customer_id cross-reference (carrying only legacy_id = L-A789):
customer_idfr | customer_key
L-A789 | UUID-XXX
The CRM source then arrives with customer_id = C-12345 and no overlapping identifier. Daana finds no match and assigns a fresh surrogate:
customer_idfr | customer_key
L-A789 | UUID-XXX
C-12345 | UUID-YYY
Two surrogate keys now exist for one real-world customer. Downstream tables keyed by customer_key treat them as separate entities. The warehouse silently carries a duplicate.
Operational requirement. For multi-identifier resolution to work, every legacy row must carry the corresponding customer_id value (or whichever identifier is the cross-reference). In the example above, the working configuration has the legacy ERP source emit both legacy_id and customer_id on every row; the failing configuration has it emit only legacy_id.
If a source system emits rows with only its own identifiers and no overlap with any other source, multi-identifier resolution is not the right tool. Solve the integration upstream (data cleaning, ID-mapping table, or a master-data-management system) before the data reaches Daana.
Adding identifiers later
Within the 5-slot ceiling, identifier evolution is rowwise, not schema-wise:
- New identifier types from existing sources flow as new rows in
<entity>_idfrwithout any DDL change. If a CRM row that previously carried onlycustomer_idstarts also carryingcustomer_email(a new identifier type), the new value lands in an unused slot, becomes a new IDFR row, and resolves like any other identifier. - New source systems that emit identifier types overlapping with existing sources attach via the same mechanism. They contribute new rows that resolve via co-occurrence; existing rows are unchanged.
- No schema migrations are required for either case. The IDFR table's column shape is fixed at install time.
Warning: This setting is irreversible. Once you enable multiple identifiers for an entity and materialize the mapping, you cannot go back to single-identifier mode. This is a permanent structural change to how the entity is stored.
Why irreversible. When
allow_multiple_identifiers: true, the IDFR table becomes the authoritative source of entity identity: surrogate keys are generated by co-occurrence resolution in IDFR and propagated to every<entity>_desc,<entity>_focal, and relationship_xrow. Disabling multi-identifier mode requires regenerating every surrogate from a different path (thef_<entity>()derivation used in single-identifier mode), which cascades to every downstream row that holds the old surrogate. There is no in-place migration. Recovery requires dropping the entity's four physical tables (_idfr,_focal,_desc, and each_x), re-runninginstallanddeploy, and re-executing the workflow to rebuild from source. Downstream consumers (Layer 3 views, BI dashboards) see new surrogate keys for every entity during the rebuild.
Key Considerations
- Exhaust alternatives first - Can you solve this with a single identifier and additional attributes?
- Plan ahead - Decide whether you need multiple identifiers before your first deployment
- Performance impact - Multiple identifiers increase storage and may affect query performance
- No rollback - If you enable this by mistake, you'll need to recreate the entity from scratch
Complete Example
entity_id: ORDER
mapping_groups:
- name: default_mapping_group
allow_multiple_identifiers: false
tables:
- connection: dev
table: public.orders
primary_keys:
- order_id
ingestion_strategy: FULL
# Default timestamp for most attributes
entity_effective_timestamp_expression: CURRENT_TIMESTAMP
attributes:
# Identifier
- id: order_id
transformation_expression: order_id
where: order_id IS NOT NULL
# Order date
- id: order_date
transformation_expression: order_date
where: order_date IS NOT NULL
# Total amount - uses source's updated_at instead of default
- id: total_amount
transformation_expression: CAST(total_amount AS DECIMAL(10,2))
where: total_amount > 0
attribute_effective_timestamp_expression: updated_at
# Status - uses source's status_changed_at
- id: status
transformation_expression: UPPER(status)
where: status IN ('pending', 'completed', 'cancelled')
attribute_effective_timestamp_expression: status_changed_at
# Relationships are defined at the mapping group level
relationships:
- id: ORDER_IS_PLACED_BY_CUSTOMER
source_table: public.orders
target_transformation_expression: customer_id
Quick Reference
Looking up a specific field? Here's the complete reference for all fields in mapping YAML files.
Mapping Root Fields
| Field | Type | Required | Description |
|---|---|---|---|
| entity_id | string | ✓ | Target entity ID from your model |
| mapping_groups | list of MappingGroup | ✓ | List of mapping groups for this entity |
✓ = required, ○ = optional
Mapping Group Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | ✓ | Name of the mapping group |
| allow_multiple_identifiers | bool | ✓ | Allow multiple identifier values per entity ⚠ This setting is irreversible - once enabled, you cannot go back to single identifier mode⚠ Only enable if you need multiple primary keys per entity |
| tables | list of Table | ✓ | List of source tables in this mapping group |
| relationships | list of RelationshipMapping | ○ | Relationship mappings at the group level |
✓ = required, ○ = optional
Table Fields
| Field | Type | Required | Description |
|---|---|---|---|
| connection | string | ✓ | Connection profile name |
| table | string | ✓ | Source table name with schema |
| primary_keys | list of string | ✓ | Primary key columns or SQL expression for the source table ⚠ A single entry containing SQL operators or function calls is evaluated as an expression against the source table⚠ Multiple entries are treated as column names and combined into the natural key in the listed order |
| ingestion_strategy | IngestionStrategy | ✓ | How to load data from this table |
| where | string | ○ | SQL WHERE clause to filter source data ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| entity_effective_timestamp_expression | string | ○ | Default timestamp expression for all attributes ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Used as default for attributes with effective_timestamp: true in the model |
| batch_expression | string | ○ | Batch filtering expression (overrides workflow default) ⚠ Only applies to INCREMENTAL and TRANSACTIONAL ingestion strategies⚠ If not set, falls back to the workflow-level batch_expression⚠ If value contains ${BATCH_START} or ${BATCH_END}, it's used as raw SQL |
| attributes | list of AttributeMapping | ✓ | List of attribute mappings |
✓ = required, ○ = optional
Attribute Mapping Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Attribute ID from the model |
| transformation_expression | string | ✓ | SQL expression to extract/transform the value ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| ingestion_strategy | IngestionStrategy | ○ | Override ingestion strategy for this attribute |
| where | string | ○ | SQL WHERE clause for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation |
| attribute_effective_timestamp_expression | string | ○ | Override timestamp expression for this attribute ⚠ Syntax is database-specific - consult your platform's SQL documentation ⚠ Only needed when this attribute has a different timestamp source than other attributes |
✓ = required, ○ = optional
Relationship Mapping Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Relationship ID from the model ⚠ Do not add a separate `atomic_context:` line. The mapping schema only defines `id` for relationships; an extra `atomic_context:` line is silently ignored by the YAML parser, so it looks like it works but contributes nothing |
| source_table | string | ✓ | Source table name with schema |
| target_transformation_expression | string | ✓ | Expression to get the target entity identifier ⚠ Syntax is database-specific - consult your platform's SQL documentation |
✓ = required, ○ = optional
Best Practices
- Generate templates first - Start with generated templates and customize as needed
- Always use schema.table format - Specify
schema.table(e.g.,public.customers,stage.raw_data) to avoid ambiguity - Use descriptive expressions - Make transformation logic clear and maintainable
- Add where clauses - Filter out invalid or incomplete data at the source
- Use entity_effective_timestamp_expression - Set a DRY default at the table level
- Override when needed - Use
attribute_effective_timestamp_expressionfor attributes with different timestamp sources - Validate data types - Use CAST functions to ensure correct data types
- Choose the right ingestion strategy - Match the strategy to the shape of the source: FULL for small dimensions with one row per entity, FULL_LOG when the source already carries history, INCREMENTAL for large mutable tables (use a staging clock for
batch_expression), TRANSACTIONAL for append-only event logs
Next Steps
- Define workflows to orchestrate data transformations
- Configure connections to your data sources
- Follow the tutorial for a complete end-to-end example