AI-Assisted Data Mapping
Overview
This guide describes how to use AI assistants to accelerate your data modeling and mapping work with Daana CLI. The workflow combines AI's ability to understand semantics and generate SQL with rigorous validation against real data.
AI works particularly well with Daana because Daana's Focal Point data model provides an unambiguous definition of all logic and metadata - the AI knows exactly what rules to play by. This clear structure enables more accurate and reliable AI-generated mappings compared to working with vanilla SQL.
Recommended setup: An AI-enabled IDE (Cursor, GitHub Copilot, Windsurf, Claude Code, or similar) provides the best experience, as it can directly access your local files, execute queries, and iterate within your development environment. However, the workflow also works with web-based AI assistants (Claude, ChatGPT) where you manually share context.
What you'll learn:
- A 4-step workflow for AI-assisted mapping
- How to validate transformations before deployment
- Best practices for working with AI on data mapping
- Common pitfalls and how to avoid them
Who this is for:
- Data engineers building Daana mappings
- Analytics engineers familiar with SQL transformations
- Anyone comfortable prompting LLMs and validating output
Why AI-Assisted Mapping Matters
Mapping source data to your semantic model involves:
- Understanding business definitions
- Exploring unfamiliar source schemas
- Writing and testing SQL transformations
- Validating against real data patterns
AI assistants excel at the first three but require your oversight on the fourth. This combination of AI speed and human validation creates an efficient, reliable workflow for data mapping at scale.
What AI does well:
- Translate semantic definitions into SQL logic
- Explore schemas and suggest transformations
- Iterate quickly on complex SQL transformations
What AI needs from you:
- Access to source data for validation
- Domain expertise to verify semantic correctness
- daana-cli check as the final validation gate
Quick Start
For experienced users, here's the TL;DR:
- Share your model.yaml entity definitions with AI
- Give AI access to the source schema (e.g., select * from information_schema.tables with sample data)
- For any transformation, write a validation query and run it against real data
- Test the proposed SQL before putting it in the mapping
- Run daana-cli check mapping as the final validation
- Ideally, use an AI agent for all of the above steps
Golden rule: Never finalize a mapping without testing the SQL against real data first.
Prerequisites
Before starting:
- You have defined your entities in model.yaml (see Model Definition docs)
- You have access to your source database (BigQuery, Snowflake, or Postgres)
- You're familiar with basic SQL (CASE, window functions, aggregations)
- You have an AI assistant with code execution or can run queries yourself
The 4-Step Workflow
Step 1: Understand the Model
Before writing mappings, ensure AI understands your semantic data model.
Share your entity definition:
# From your model.yaml
- id: "ORDER_ENTITY"
name: "ORDER"
definition: "A purchase order placed by a customer"
attributes:
- id: "ORDER_STATUS_ATTR"
name: "ORDER_STATUS"
definition: "Current state of the order"
description: "PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED"
- id: "ORDER_AMOUNT_ATTR"
name: "ORDER_AMOUNT"
definition: "Total order value including tax"
data_type: "NUMERIC"What to share with AI:
- Entity definition and business meaning
- All attributes with definitions and expected values
- Data types and constraints
- Relationships to other entities
Example prompt:
"I'm mapping the ORDER entity. Here's the model definition [paste YAML]. This entity represents purchase orders in our e-commerce system. Help me map this from our source tables."
Step 2: Explore Source Data
AI needs visibility into your source schema and actual data patterns.
Get table schema:
bq show --schema --format=prettyjson "project:dataset.orders_raw"
Check actual values in key columns:
-- What order statuses actually exist in the data?
SELECT
order_status,
COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
Share with AI:
- Column names and types
- Column descriptions (if available)
- Sample values from key columns
- Cardinality (how many distinct values)
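The bq command above is BigQuery-specific. On Postgres or Snowflake, the same schema discovery works through information_schema (a minimal sketch; the schema and table names are placeholders, and note that Snowflake stores unquoted identifiers in uppercase):
-- Column names, types, and nullability for one source table
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'dataset'   -- placeholder schema name
  AND table_name = 'orders_raw'  -- placeholder table name
ORDER BY ordinal_position;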
Key questions AI should answer:
- What columns are available?
- What do column descriptions say about meaning?
- What values actually exist in the data?
- Are there multiple source tables for this entity?
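To answer the last question, a name-based search over the dataset's tables is a quick first pass (a sketch using BigQuery's INFORMATION_SCHEMA; the LIKE pattern is only a heuristic):
-- Which tables in the dataset might hold order data?
SELECT table_name
FROM `project.dataset`.INFORMATION_SCHEMA.TABLES
WHERE LOWER(table_name) LIKE '%order%'
ORDER BY table_name;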
Example prompt:
"Here's the schema for our source table
orders_raw[paste schema]. Can you explore the actual data and tell me: 1) What distinct values exist for the status-related columns, 2) Are there any NULL values in key fields, 3) What's the grain of this table (one row per order or can there be duplicates)? Run queries to validate your understanding before suggesting mappings."
Step 3: Write & Test SQL Expressions
This is the critical step. For any non-trivial transformation, always validate the SQL against real data before finalizing the mapping.
Example: Deriving ORDER_STATUS
Model expectation: PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED
Source columns available: status_code, fulfillment_state, cancelled_flag
AI proposes transformation:
CASE
WHEN cancelled_flag = 1 THEN 'CANCELLED'
WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
WHEN status_code = 'confirmed' THEN 'CONFIRMED'
ELSE 'PENDING'
END
Before accepting this, test it:
SELECT
status_code,
fulfillment_state,
cancelled_flag,
-- Test the proposed expression
CASE
WHEN cancelled_flag = 1 THEN 'CANCELLED'
WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
WHEN status_code = 'confirmed' THEN 'CONFIRMED'
ELSE 'PENDING'
END AS order_status,
COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1, 2, 3, 4
ORDER BY cnt DESC
LIMIT 20;
Review the output:
- Does the logic produce expected values?
- Are there unexpected combinations?
- Do any rows fall into 'PENDING' that shouldn't?
- Are there NULL values to handle?
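To answer the last question, a NULL-combination profile of the columns feeding the expression is cheap to run (a sketch in BigQuery SQL):
-- Which NULL combinations occur across the driver columns?
SELECT
  status_code IS NULL AS status_code_null,
  fulfillment_state IS NULL AS fulfillment_state_null,
  cancelled_flag IS NULL AS cancelled_flag_null,
  COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
GROUP BY 1, 2, 3
ORDER BY cnt DESC;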
Example prompt:
"I need to map ORDER_STATUS which expects values: PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED. The source table has columns status_code, fulfillment_state, and cancelled_flag. Please: 1) Write a CASE expression to derive ORDER_STATUS, 2) Create a validation query that shows all combinations of source columns and the derived status with row counts, 3) Run this query and review the results for any edge cases or unexpected values."
Example: Window Functions for Counters
When deriving sequential counters (like "order number for this customer"), understand the grain of source data first.
Check for duplicates at key level:
-- Expected: one row per order_id
SELECT order_id, COUNT(*) as row_count
FROM `project.dataset.orders_raw`
GROUP BY 1
HAVING COUNT(*) > 1
LIMIT 10;
If duplicates exist, handle them appropriately:
-- Use DENSE_RANK for stable counter derivation
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
) - 1 AS customer_order_number
Test with a real customer:
SELECT
customer_id,
order_id,
order_date,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
) - 1 AS customer_order_number
FROM `project.dataset.orders_raw`
WHERE customer_id = 'CUST_12345' -- Pick a real customer ID
ORDER BY order_date, order_id;
Verify:
- Does customer_order_number start at 0 (or 1, depending on your model)?
- Is the sequence stable across re-runs?
- Are there gaps that indicate missing data?
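These spot checks can be automated across all customers (a sketch, assuming a 0-based counter; any rows returned indicate duplicate (order_date, order_id) pairs that the counter silently collapses):
-- If the counter is dense and 0-based, MAX(counter) = COUNT(*) - 1 per customer
WITH numbered AS (
  SELECT
    customer_id,
    DENSE_RANK() OVER (
      PARTITION BY customer_id
      ORDER BY order_date, order_id
    ) - 1 AS customer_order_number
  FROM `project.dataset.orders_raw`
)
SELECT
  customer_id,
  COUNT(*) AS orders,
  MAX(customer_order_number) AS max_counter
FROM numbered
GROUP BY customer_id
HAVING MAX(customer_order_number) != COUNT(*) - 1
LIMIT 10;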
Step 4: Validate with CLI
After writing the mapping, run Daana CLI validation.
Validate specific mapping:
daana-cli check mapping mappings/order-mapping.yaml \
--model model.yaml \
--connections connections.yaml
Validate all mappings:
daana-cli check mapping mappings/
Expected output:
Schema Validation:
✓ Schema validation passed
Entity: ORDER_ENTITY
Tables: 2
Attributes: 12/12 mapped
Relationships: 2/2 mapped
Schema: validated
✓ Mapping valid
If validation fails:
- Check error message for specific issues
- Common causes: wrong column names, mismatched IDs, missing attributes
- Fix and re-validate
Example prompt:
"I've created this mapping file [paste mapping YAML]. Please help me validate it by: 1) Running
daana-cli check mappingon this file, 2) If there are any errors, explain what they mean and suggest fixes, 3) Once validation passes, confirm that all required attributes from the model are mapped."
Best Practices for AI-Assisted Mapping
1. Be Critical of AI-Proposed Mappings
When adding new source tables to an entity, don't assume AI-proposed mappings are correct without validation.
Example scenario:
You're adding orders_failed to an entity that already maps orders_success.
# Review existing logic
mapping_groups:
- name: orders_success_group
tables:
- table: orders_success
attributes:
- id: ORDER_STATUS_ATTR
transformation_expression: >
CASE
WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
ELSE 'CONFIRMED'
END
# Adding new source - verify the logic makes sense for both!
- name: orders_failed_group
tables:
- table: orders_failed
attributes:
- id: ORDER_STATUS_ATTR
transformation_expression: "'CANCELLED'" # Different logic!Question to ask:
- Does the existing CASE statement make sense for failed orders?
- Are there overlapping values that need consistent handling?
- Should failed orders always be 'CANCELLED' or could they have other statuses?
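Those questions are answerable with data. A quick distribution check on the new table settles whether the constant is safe (a sketch, assuming orders_failed also carries a fulfillment_state column; substitute whatever status columns it actually has):
-- Do any 'failed' orders look shipped or delivered? If so, the constant
-- 'CANCELLED' mapping is wrong for them.
SELECT fulfillment_state, COUNT(*) AS cnt
FROM `project.dataset.orders_failed`
GROUP BY 1
ORDER BY cnt DESC;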
2. Check Data Contracts
If you have data contracts or schema documentation, share them with AI.
Example data contract:
columns:
- name: order_source
description: "System that created the order"
type: STRING
possible_values:
- WEBSITE: "Direct customer purchase via website"
- MOBILE_APP: "Purchase through mobile application"
- API: "Third-party integration order"
- ADMIN: "Manual order entry by staff"
Why this helps:
- AI understands semantic meaning, not just column names
- Reduces guesswork about value mappings
- Prevents incorrect assumptions
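Contracts can drift from reality, so confirm the data actually honors the documented values before building logic on them (a sketch):
-- Any order_source values outside the four documented in the contract?
SELECT order_source, COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
WHERE order_source IS NULL
   OR order_source NOT IN ('WEBSITE', 'MOBILE_APP', 'API', 'ADMIN')
GROUP BY 1
ORDER BY cnt DESC;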
3. Validate Semantic Assumptions
When AI proposes categorizations, verify the reasoning.
Example dialogue:
AI suggests:
CASE
WHEN order_source IN ('WEBSITE', 'MOBILE_APP') THEN 'CUSTOMER_INITIATED'
WHEN order_source = 'API' THEN 'PARTNER_INITIATED'
WHEN order_source = 'ADMIN' THEN 'MANUAL'
END
You question:
"Are you certain that API orders are always partner-initiated? Could API orders be customer-initiated through a partner app?"
AI refines after checking documentation:
-- Updated after verifying API can be both customer and partner initiated
CASE
WHEN order_source IN ('WEBSITE', 'MOBILE_APP') THEN 'CUSTOMER_INITIATED'
WHEN order_source = 'API' AND partner_type = 'PLATFORM' THEN 'PARTNER_INITIATED'
WHEN order_source = 'API' AND partner_type IS NULL THEN 'CUSTOMER_INITIATED'
WHEN order_source = 'ADMIN' THEN 'MANUAL'
END
Lesson: Your domain knowledge is essential for semantic correctness.
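Part of that validation is checking the refinement itself against data (a sketch):
-- How do API orders actually split by partner_type?
SELECT order_source, partner_type, COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
WHERE order_source = 'API'
GROUP BY 1, 2
ORDER BY cnt DESC;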
4. Identify Model Issues Early
Sometimes the problem isn't the mapping - it's the model itself.
Common model issues:
Duplicate attributes:
# Two attributes trying to capture the same source data
- id: ORDER_PAYMENT_METHOD_ATTR # ← Overlap!
- id: ORDER_PAYMENT_PROVIDER_ATTR # ← Overlap!
# Both want to map from: payment_method column
# Solution: Decide which attribute to keep, remove the duplicate
Missing source data:
- id: ORDER_EXPECTED_DELIVERY_DATE_ATTR
# No column in source data contains this information!
# Solution: Mark as optional or remove from model
Semantic overlap:
- id: ORDER_FULFILLMENT_STATUS_ATTR
description: "PENDING, IN_PROGRESS, COMPLETED"
- id: ORDER_SHIPPING_STATUS_ATTR
description: "PENDING, SHIPPED, DELIVERED"
# These are hard to distinguish clearly
# Solution: Consolidate into single attribute or clarify definitions
The Four Essential Validation Steps
When working with AI on mappings, ensure these four steps are completed:
| Step | Action | Tool | Purpose |
|---|---|---|---|
| 1. Schema Discovery | Understand available columns and meanings | select * from information_schema.tables | Know what data exists |
| 2. Data Exploration | See what values actually exist | SELECT ... GROUP BY | Understand data patterns |
| 3. Expression Testing | Validate transformation logic | SELECT with test SQL | Catch logic errors early |
| 4. CLI Validation | Confirm mapping structure | daana-cli check | Ensure schema compatibility |
Never skip Step 3. Testing SQL against real data is the most critical validation step.
Common Pitfalls
Pitfall 1: Wrong Column Names
Error message:
Error: Unrecognized name: user_id at [5:10]
Cause: The column name doesn't match the source schema.
Solution:
# Check actual column name
bq show --schema "project:dataset.orders_raw" | grep -i user
# Actual column: customer_id (not user_id)
Fix mapping:
- id: ORDER_CUSTOMER_ID_ATTR
transformation_expression: "customer_id" # Corrected!Pitfall 2: Mismatched Relationship IDs
Error message:
Error: Relationship 'ORDER_ENTITY_BELONGS_TO_CUSTOMER_ENTITY' not found in model
Cause: The relationship ID in the mapping doesn't match the model definition.
Model defines:
relationships:
- id: "ORDER_ENTITY_IS_PLACED_BY_CUSTOMER_ENTITY"Mapping incorrectly uses:
relationships:
- id: "ORDER_ENTITY_BELONGS_TO_CUSTOMER_ENTITY" # ❌ Doesn't match!Fix:
relationships:
- id: "ORDER_ENTITY_IS_PLACED_BY_CUSTOMER_ENTITY" # ✓ Match exactly!Pitfall 3: Missing Workflow Entry
Error message:
Error: Mapping file 'mappings/order-mapping.yaml' not included in workflow
Cause: The new mapping was not added to workflow.yaml.
Fix:
# workflow.yaml must include all mapping files
mappings:
- mappings/customer-mapping.yaml
- mappings/order-mapping.yaml # ← Don't forget new mappings!
- mappings/product-mapping.yaml
Pitfall 4: Untested NULL Handling
Problem: AI generates logic that doesn't handle NULLs properly.
Example:
-- This looks fine but silently mishandles NULLs
CASE
WHEN order_amount <= 0 THEN 'UNPAID'
ELSE 'PAID'
END
-- NULL <= 0 evaluates to NULL, which falls through to ELSE
-- Result: NULL amounts incorrectly marked as 'PAID'!
Test for NULLs:
SELECT
order_amount,
order_amount <= 0 AS amount_check,
CASE
WHEN order_amount <= 0 THEN 'UNPAID'
ELSE 'PAID'
END AS status,
COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1, 2, 3
ORDER BY cnt DESC;
Fix:
CASE
WHEN order_amount IS NULL THEN 'UNKNOWN'
WHEN order_amount <= 0 THEN 'UNPAID'
ELSE 'PAID'
END
Validation Checklist
Before finalizing any mapping, ensure:
Model understanding:
- AI has seen full entity definition
- Expected values are documented
- Relationships are clear
Source exploration:
- Schema retrieved and shared with AI
- Key columns explored with GROUP BY queries
- Data patterns understood (cardinality, NULLs, outliers)
SQL testing:
- Transformation SQL tested against real data
- Edge cases checked (NULLs, zeros, empty strings)
- Results manually reviewed for correctness
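A compact probe for the usual numeric edge cases, adaptable per column (a sketch in BigQuery SQL; COUNTIF becomes COUNT(CASE WHEN ... THEN 1 END) on other warehouses):
-- Profile a key column before trusting any expression built on it
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(order_amount IS NULL) AS null_amounts,
  COUNTIF(order_amount = 0) AS zero_amounts,
  COUNTIF(order_amount < 0) AS negative_amounts
FROM `project.dataset.orders_raw`;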
AI agent self-validation (when using agentic AI):
- Agent prompted to run quality checks on actual data
- Agent tests transformations against common pitfalls (NULLs, negative values, unexpected data types)
- Agent uses appropriate filters, aggregations, and limits for large tables
- Agent validates unexpected/anomalous values and categories
Suggested prompt for AI agents:
"Run queries on actual data to ensure all transformations work as intended and don't fail on common pitfalls such as NULLs, negative values, unexpected data types, and unexpected/anomalous values or categories. Assume tables can be huge, so always use applicable filters, aggregations, and limits when assessing data quality."
CLI validation:
- daana-cli check mapping passes
- All attributes mapped (or explicitly marked optional)
- Relationships validated
- Mapping added to workflow.yaml
Example: Complete Workflow
Here's a full example of the workflow in action:
1. Share model with AI:
- id: "CUSTOMER_ENTITY"
name: "CUSTOMER"
attributes:
- id: "CUSTOMER_SEGMENT_ATTR"
name: "CUSTOMER_SEGMENT"
definition: "Customer classification based on behavior"
description: "BRONZE, SILVER, GOLD, PLATINUM"2. Explore source:
# What columns exist?
bq show --schema "project:dataset.customers_raw"
-- What values exist for customer_tier?
SELECT customer_tier, COUNT(*) as cnt
FROM `project.dataset.customers_raw`
GROUP BY 1
ORDER BY 2 DESC;
-- Result: basic, premium, vip, enterprise
3. AI proposes mapping:
CASE
WHEN customer_tier = 'basic' THEN 'BRONZE'
WHEN customer_tier = 'premium' THEN 'SILVER'
WHEN customer_tier = 'vip' THEN 'GOLD'
WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
ELSE 'BRONZE'
END
4. Test it:
SELECT
customer_tier,
CASE
WHEN customer_tier = 'basic' THEN 'BRONZE'
WHEN customer_tier = 'premium' THEN 'SILVER'
WHEN customer_tier = 'vip' THEN 'GOLD'
WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
ELSE 'BRONZE'
END AS customer_segment,
COUNT(*) as cnt
FROM `project.dataset.customers_raw`
GROUP BY 1, 2
ORDER BY cnt DESC;
5. Verify the output looks correct, then add to the mapping:
mapping_groups:
- name: customers_group
tables:
- table: customers_raw
attributes:
- id: CUSTOMER_SEGMENT_ATTR
transformation_expression: >
CASE
WHEN customer_tier = 'basic' THEN 'BRONZE'
WHEN customer_tier = 'premium' THEN 'SILVER'
WHEN customer_tier = 'vip' THEN 'GOLD'
WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
ELSE 'BRONZE'
END
6. Validate:
daana-cli check mapping mappings/customer-mapping.yaml
# ✓ Mapping valid
Tips for Effective AI Collaboration
Give AI context incrementally:
- Start with model definition
- Then share schema
- Then share sample data
- Then validate proposed SQL
Ask AI to explain reasoning:
"Why did you choose to map 'basic' to 'BRONZE' rather than 'SILVER'?"
Request multiple options:
"Show me 2-3 ways to derive CUSTOMER_SEGMENT from the available columns"
Validate assumptions explicitly:
"What assumptions are you making about NULL values in customer_tier?"
Use AI for tedious work:
- Generating boilerplate YAML structure
- Writing GROUP BY queries for exploration
- Converting similar logic across multiple tables
Keep the human in the loop for:
- Semantic correctness
- Business logic validation
- Final SQL testing
- Model refinement decisions
Summary
The workflow:
- Share model → AI understands semantics
- Explore schema → AI sees available data
- Write & test SQL → Validate logic against real data
- Run CLI check → Confirm mapping structure
The golden rule: Never finalize a mapping without testing the SQL against real data.
AI excels at:
- Understanding semantic definitions
- Generating SQL transformations
- Iterating quickly on complex logic
You provide:
- Domain expertise
- Semantic validation
- SQL testing discipline
- Final approval
The combination creates an efficient, reliable workflow for data mapping at scale.
Related Documentation
- DMDL: Model Definition - Defining entities and attributes
- DMDL: Mapping - Mapping reference and syntax
- Commands - All available commands
- Understanding Daana Output - How to read Daana's output