
AI-Assisted Data Mapping

Overview

This guide describes how to use AI assistants to accelerate your data modeling and mapping work with Daana CLI. The workflow combines AI's ability to understand semantics and generate SQL with rigorous validation against real data.

AI works particularly well with Daana because Daana's Focal Point data model provides an unambiguous definition of all logic and metadata, so the AI knows exactly which rules to play by. This clear structure enables more accurate and reliable AI-generated mappings than working with vanilla SQL.

Recommended setup: An AI-enabled IDE (Cursor, GitHub Copilot, Windsurf, Claude Code, or similar) provides the best experience, as it can directly access your local files, execute queries, and iterate within your development environment. However, the workflow also works with web-based AI assistants (Claude, ChatGPT) where you manually share context.

What you'll learn:

  • A 4-step workflow for AI-assisted mapping
  • How to validate transformations before deployment
  • Best practices for working with AI on data mapping
  • Common pitfalls and how to avoid them

Who this is for:

  • Data engineers building Daana mappings
  • Analytics engineers familiar with SQL transformations
  • Anyone comfortable prompting LLMs and validating output

Why AI-Assisted Mapping Matters

Mapping source data to your semantic model involves:

  • Understanding business definitions
  • Exploring unfamiliar source schemas
  • Writing and testing SQL transformations
  • Validating against real data patterns

AI assistants excel at the first three but require your oversight on the fourth. This combination of AI speed and human validation creates an efficient, reliable workflow for data mapping at scale.

What AI does well:

  • Translate semantic definitions into SQL logic
  • Explore schemas and suggest transformations
  • Iterate quickly on complex SQL transformations

What AI needs from you:

  • Access to source data for validation
  • Domain expertise to verify semantic correctness
  • daana-cli check as the final validation gate

Quick Start

For experienced users, here's the TL;DR:

  1. Share your model.yaml entity definitions with AI
  2. Give AI access to the source schema (e.g., query information_schema) and sample data
  3. For any transformation, write a validation query and run it against real data
  4. Test the proposed SQL before putting it in the mapping
  5. Run daana-cli check mapping as final validation
  6. Ideally use an AI agent for all the above steps
⚠️ Golden rule: Never finalize a mapping without testing the SQL against real data first.


Prerequisites

Before starting:

  • You have defined your entities in model.yaml (see Model Definition docs)
  • You have access to the source database (BigQuery, Snowflake, or Postgres)
  • You're familiar with basic SQL (CASE, window functions, aggregations)
  • You have an AI assistant with code execution or can run queries yourself

The 4-Step Workflow

Step 1: Understand the Model

Before writing mappings, ensure AI understands your semantic data model.

Share your entity definition:

# From your model.yaml
- id: "ORDER_ENTITY"
  name: "ORDER"
  definition: "A purchase order placed by a customer"
  attributes:
    - id: "ORDER_STATUS_ATTR"
      name: "ORDER_STATUS"
      definition: "Current state of the order"
      description: "PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED"
 
    - id: "ORDER_AMOUNT_ATTR"
      name: "ORDER_AMOUNT"
      definition: "Total order value including tax"
      data_type: "NUMERIC"

What to share with AI:

  • Entity definition and business meaning
  • All attributes with definitions and expected values
  • Data types and constraints
  • Relationships to other entities

Example prompt:

"I'm mapping the ORDER entity. Here's the model definition [paste YAML]. This entity represents purchase orders in our e-commerce system. Help me map this from our source tables."

Step 2: Explore Source Data

AI needs visibility into your source schema and actual data patterns.

Get table schema:

bq show --schema --format=prettyjson "project:dataset.orders_raw"

Check actual values in key columns:

-- What order statuses actually exist in the data?
SELECT
  order_status,
  COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
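
To cover cardinality and NULL rates in one pass, a quick profiling query helps before sharing results with AI. A minimal sketch in BigQuery syntax, reusing the example orders_raw columns (adjust names for your source):

-- Profile key columns: row count, cardinality, and NULL counts
SELECT
  COUNT(*)                      AS total_rows,
  COUNT(DISTINCT order_status)  AS distinct_statuses,
  COUNTIF(order_status IS NULL) AS null_statuses,
  COUNT(DISTINCT customer_id)   AS distinct_customers,
  COUNTIF(customer_id IS NULL)  AS null_customer_ids
FROM `project.dataset.orders_raw`;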

Share with AI:

  • Column names and types
  • Column descriptions (if available)
  • Sample values from key columns
  • Cardinality (how many distinct values)

Key questions AI should answer:

  • What columns are available?
  • What do column descriptions say about meaning?
  • What values actually exist in the data?
  • Are there multiple source tables for this entity?

Example prompt:

"Here's the schema for our source table orders_raw [paste schema]. Can you explore the actual data and tell me: 1) What distinct values exist for the status-related columns, 2) Are there any NULL values in key fields, 3) What's the grain of this table (one row per order or can there be duplicates)? Run queries to validate your understanding before suggesting mappings."

Step 3: Write & Test SQL Expressions

This is the critical step. For any non-trivial transformation, always validate the SQL against real data before finalizing the mapping.

Example: Deriving ORDER_STATUS

Model expectation: PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED

Source columns available: status_code, fulfillment_state, cancelled_flag

AI proposes transformation:

CASE
  WHEN cancelled_flag = 1 THEN 'CANCELLED'
  WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
  WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
  WHEN status_code = 'confirmed' THEN 'CONFIRMED'
  ELSE 'PENDING'
END

Before accepting this, test it:

SELECT
  status_code,
  fulfillment_state,
  cancelled_flag,
  -- Test the proposed expression
  CASE
    WHEN cancelled_flag = 1 THEN 'CANCELLED'
    WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
    WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
    WHEN status_code = 'confirmed' THEN 'CONFIRMED'
    ELSE 'PENDING'
  END AS order_status,
  COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1, 2, 3, 4
ORDER BY cnt DESC
LIMIT 20;
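
A complementary check isolates only the rows that land in the ELSE bucket, where surprises usually hide. A minimal sketch that mirrors the CASE conditions above (same assumed columns):

-- Which rows fall through to 'PENDING', and why?
SELECT
  status_code,
  fulfillment_state,
  cancelled_flag,
  COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
WHERE COALESCE(cancelled_flag, 0) != 1
  AND COALESCE(fulfillment_state, '') NOT IN ('shipped', 'delivered')
  AND COALESCE(status_code, '') != 'confirmed'
GROUP BY 1, 2, 3
ORDER BY cnt DESC
LIMIT 20;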

Review the output:

  • Does the logic produce expected values?
  • Are there unexpected combinations?
  • Do any rows fall into 'PENDING' that shouldn't?
  • Are there NULL values to handle?

Example prompt:

"I need to map ORDER_STATUS which expects values: PENDING, CONFIRMED, SHIPPED, DELIVERED, CANCELLED. The source table has columns status_code, fulfillment_state, and cancelled_flag. Please: 1) Write a CASE expression to derive ORDER_STATUS, 2) Create a validation query that shows all combinations of source columns and the derived status with row counts, 3) Run this query and review the results for any edge cases or unexpected values."

Example: Window Functions for Counters

When deriving sequential counters (like "order number for this customer"), understand the grain of source data first.

Check for duplicates at key level:

-- Expected: one row per order_id
SELECT order_id, COUNT(*) as row_count
FROM `project.dataset.orders_raw`
GROUP BY 1
HAVING COUNT(*) > 1
LIMIT 10;
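
If duplicates do show up, inspect one before deciding how to handle it: the rows may be true duplicates or versioned records that need deduplication. A sketch (ORD_98765 is a placeholder; use an order_id returned by the query above):

-- Inspect all rows for one duplicated order_id
SELECT *
FROM `project.dataset.orders_raw`
WHERE order_id = 'ORD_98765'  -- placeholder: use an ID from the duplicate check
ORDER BY order_date;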

If duplicates exist, handle appropriately:

-- Use DENSE_RANK for stable counter derivation
DENSE_RANK() OVER (
  PARTITION BY customer_id
  ORDER BY order_date, order_id
) - 1 AS customer_order_number
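
If the duplicates should be collapsed rather than ranked over, another option is to deduplicate first and then derive the counter. A sketch, assuming one row per order_id is the intended grain:

-- Deduplicate to one row per order_id, then derive the counter
WITH orders_dedup AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date) AS rn
  FROM `project.dataset.orders_raw`
)
SELECT
  customer_id,
  order_id,
  order_date,
  DENSE_RANK() OVER (
    PARTITION BY customer_id
    ORDER BY order_date, order_id
  ) - 1 AS customer_order_number
FROM orders_dedup
WHERE rn = 1;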

Test with a real customer:

SELECT
  customer_id,
  order_id,
  order_date,
  DENSE_RANK() OVER (
    PARTITION BY customer_id
    ORDER BY order_date, order_id
  ) - 1 AS customer_order_number
FROM `project.dataset.orders_raw`
WHERE customer_id = 'CUST_12345'  -- Pick a real customer ID
ORDER BY order_date, order_id;

Verify:

  • Does customer_order_number start at 0 (or 1, depending on your model)?
  • Is the sequence stable across re-runs?
  • Are there gaps that indicate missing data?

Step 4: Validate with CLI

After writing the mapping, run Daana CLI validation.

Validate specific mapping:

daana-cli check mapping mappings/order-mapping.yaml \
  --model model.yaml \
  --connections connections.yaml

Validate all mappings:

daana-cli check mapping mappings/

Expected output:

Schema Validation:
  ✓ Schema validation passed

Entity: ORDER_ENTITY
  Tables: 2
  Attributes: 12/12 mapped
  Relationships: 2/2 mapped
  Schema: validated

✓ Mapping valid

If validation fails:

  • Check error message for specific issues
  • Common causes: wrong column names, mismatched IDs, missing attributes
  • Fix and re-validate

Example prompt:

"I've created this mapping file [paste mapping YAML]. Please help me validate it by: 1) Running daana-cli check mapping on this file, 2) If there are any errors, explain what they mean and suggest fixes, 3) Once validation passes, confirm that all required attributes from the model are mapped."


Best Practices for AI-Assisted Mapping

1. Be Critical of AI-Proposed Mappings

When adding new source tables to an entity, don't assume AI-proposed mappings are correct without validation.

Example scenario: You're adding orders_failed to an entity that already maps orders_success.

# Review existing logic
mapping_groups:
  - name: orders_success_group
    tables:
      - table: orders_success
        attributes:
          - id: ORDER_STATUS_ATTR
            transformation_expression: >
              CASE
                WHEN fulfillment_state = 'delivered' THEN 'DELIVERED'
                WHEN fulfillment_state = 'shipped' THEN 'SHIPPED'
                ELSE 'CONFIRMED'
              END
 
  # Adding new source - verify the logic makes sense for both!
  - name: orders_failed_group
    tables:
      - table: orders_failed
        attributes:
          - id: ORDER_STATUS_ATTR
            transformation_expression: "'CANCELLED'"  # Different logic!

Questions to ask (see the comparison query below):

  • Does the existing CASE statement make sense for failed orders?
  • Are there overlapping values that need consistent handling?
  • Should failed orders always be 'CANCELLED' or could they have other statuses?
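
One way to ground these questions in data is to compare the status-driving values in both sources side by side. A sketch that assumes both tables live in the same dataset and expose a fulfillment_state column:

-- Compare status-driving values across both source tables
SELECT 'orders_success' AS source_table, fulfillment_state, COUNT(*) AS cnt
FROM `project.dataset.orders_success`
GROUP BY fulfillment_state
UNION ALL
SELECT 'orders_failed', fulfillment_state, COUNT(*)
FROM `project.dataset.orders_failed`
GROUP BY fulfillment_state
ORDER BY source_table, cnt DESC;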

2. Check Data Contracts

If you have data contracts or schema documentation, share them with AI.

Example data contract:

columns:
  - name: order_source
    description: "System that created the order"
    type: STRING
    possible_values:
      - WEBSITE: "Direct customer purchase via website"
      - MOBILE_APP: "Purchase through mobile application"
      - API: "Third-party integration order"
      - ADMIN: "Manual order entry by staff"
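
Contracts can drift from the data they describe, so verify that the documented values are the only ones actually present. A sketch against the example orders_raw table:

-- Flag values that appear in the data but not in the contract
SELECT order_source, COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
WHERE order_source IS NULL
   OR order_source NOT IN ('WEBSITE', 'MOBILE_APP', 'API', 'ADMIN')
GROUP BY order_source
ORDER BY cnt DESC;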

Why this helps:

  • AI understands semantic meaning, not just column names
  • Reduces guesswork about value mappings
  • Prevents incorrect assumptions

3. Validate Semantic Assumptions

When AI proposes categorizations, verify the reasoning.

Example dialogue (formalized):

AI suggests:

CASE
  WHEN order_source IN ('WEBSITE', 'MOBILE_APP') THEN 'CUSTOMER_INITIATED'
  WHEN order_source = 'API' THEN 'PARTNER_INITIATED'
  WHEN order_source = 'ADMIN' THEN 'MANUAL'
END

You question:

"Are you certain that API orders are always partner-initiated? Could API orders be customer-initiated through a partner app?"

AI refines after checking documentation:

-- Updated after verifying API can be both customer and partner initiated
CASE
  WHEN order_source IN ('WEBSITE', 'MOBILE_APP') THEN 'CUSTOMER_INITIATED'
  WHEN order_source = 'API' AND partner_type = 'PLATFORM' THEN 'PARTNER_INITIATED'
  WHEN order_source = 'API' AND partner_type IS NULL THEN 'CUSTOMER_INITIATED'
  WHEN order_source = 'ADMIN' THEN 'MANUAL'
END
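
Refinements like this can still leave gaps (for example, API orders with a non-PLATFORM partner_type fall through to NULL), so check coverage before accepting them. A sketch using the same assumed columns:

-- Find combinations the refined CASE leaves unmapped
SELECT
  order_source,
  partner_type,
  COUNT(*) AS cnt
FROM `project.dataset.orders_raw`
WHERE CASE
        WHEN order_source IN ('WEBSITE', 'MOBILE_APP') THEN 'CUSTOMER_INITIATED'
        WHEN order_source = 'API' AND partner_type = 'PLATFORM' THEN 'PARTNER_INITIATED'
        WHEN order_source = 'API' AND partner_type IS NULL THEN 'CUSTOMER_INITIATED'
        WHEN order_source = 'ADMIN' THEN 'MANUAL'
      END IS NULL
GROUP BY order_source, partner_type
ORDER BY cnt DESC;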

Lesson: Your domain knowledge is essential for semantic correctness.


4. Identify Model Issues Early

Sometimes the problem isn't the mapping - it's the model itself.

Common model issues:

Duplicate attributes:

# Two attributes trying to capture the same source data
- id: ORDER_PAYMENT_METHOD_ATTR      # ← Overlap!
- id: ORDER_PAYMENT_PROVIDER_ATTR    # ← Overlap!
 
# Both want to map from: payment_method column
# Solution: Decide which attribute to keep, remove the duplicate

Missing source data:

- id: ORDER_EXPECTED_DELIVERY_DATE_ATTR
  # No column in source data contains this information!
  # Solution: Mark as optional or remove from model

Semantic overlap:

- id: ORDER_FULFILLMENT_STATUS_ATTR
  description: "PENDING, IN_PROGRESS, COMPLETED"
 
- id: ORDER_SHIPPING_STATUS_ATTR
  description: "PENDING, SHIPPED, DELIVERED"
 
# These are hard to distinguish clearly
# Solution: Consolidate into single attribute or clarify definitions

The Four Essential Validation Steps

When working with AI on mappings, ensure these four steps are completed:

Step                  | Action                                     | Tool                                     | Purpose
1. Schema Discovery   | Understand available columns and meanings | select * from information_schema.tables | Know what data exists
2. Data Exploration   | See what values actually exist            | SELECT ... GROUP BY                      | Understand data patterns
3. Expression Testing | Validate transformation logic             | SELECT with test SQL                     | Catch logic errors early
4. CLI Validation     | Confirm mapping structure                  | daana-cli check                          | Ensure schema compatibility
⚠️ Never skip Step 3. Testing SQL against real data is the most critical validation step.


Common Pitfalls

Pitfall 1: Wrong Column Names

Error message:

Error: Unrecognized name: user_id at [5:10]

Cause: Column name doesn't match schema.

Solution:

# Check actual column name
bq show --schema "project:dataset.orders_raw" | grep -i user
 
# Actual column: customer_id (not user_id)
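
If you prefer staying in SQL rather than using the bq CLI, INFORMATION_SCHEMA can answer the same question. A sketch in BigQuery syntax:

-- Search for the actual column name via INFORMATION_SCHEMA
SELECT table_name, column_name, data_type
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'orders_raw'
  AND (LOWER(column_name) LIKE '%user%' OR LOWER(column_name) LIKE '%customer%');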

Fix mapping:

- id: ORDER_CUSTOMER_ID_ATTR
  transformation_expression: "customer_id"  # Corrected!

Pitfall 2: Mismatched Relationship IDs

Error message:

Error: Relationship 'ORDER_ENTITY_BELONGS_TO_CUSTOMER_ENTITY' not found in model

Cause: Relationship ID in mapping doesn't match model definition.

Model defines:

relationships:
  - id: "ORDER_ENTITY_IS_PLACED_BY_CUSTOMER_ENTITY"

Mapping incorrectly uses:

relationships:
  - id: "ORDER_ENTITY_BELONGS_TO_CUSTOMER_ENTITY"  # ❌ Doesn't match!

Fix:

relationships:
  - id: "ORDER_ENTITY_IS_PLACED_BY_CUSTOMER_ENTITY"  # ✓ Match exactly!

Pitfall 3: Missing Workflow Entry

Error message:

Error: Mapping file 'mappings/order-mapping.yaml' not included in workflow

Cause: New mapping not added to workflow.yaml.

Fix:

# workflow.yaml must include all mapping files
mappings:
  - mappings/customer-mapping.yaml
  - mappings/order-mapping.yaml      # ← Don't forget new mappings!
  - mappings/product-mapping.yaml

Pitfall 4: Untested NULL Handling

Problem: AI generates logic that doesn't handle NULLs properly.

Example:

-- This looks fine but silently mishandles NULLs
CASE
  WHEN order_amount <= 0 THEN 'UNPAID'
  ELSE 'PAID'
END
-- NULL <= 0 evaluates to NULL, which falls through to ELSE
-- Result: NULL amounts incorrectly marked as 'PAID'!

Test for NULLs:

SELECT
  order_amount,
  order_amount <= 0 AS amount_check,
  CASE
    WHEN order_amount <= 0 THEN 'UNPAID'
    ELSE 'PAID'
  END AS status,
  COUNT(*) as cnt
FROM `project.dataset.orders_raw`
GROUP BY 1, 2, 3
ORDER BY cnt DESC;

Fix:

CASE
  WHEN order_amount IS NULL THEN 'UNKNOWN'
  WHEN order_amount <= 0 THEN 'UNPAID'
  ELSE 'PAID'
END

Validation Checklist

Before finalizing any mapping, ensure:

Model understanding:

  • AI has seen full entity definition
  • Expected values are documented
  • Relationships are clear

Source exploration:

  • Schema retrieved and shared with AI
  • Key columns explored with GROUP BY queries
  • Data patterns understood (cardinality, NULLs, outliers)

SQL testing:

  • Transformation SQL tested against real data
  • Edge cases checked (NULLs, zeros, empty strings)
  • Results manually reviewed for correctness

AI agent self-validation (when using agentic AI):

  • Agent prompted to run quality checks on actual data
  • Agent tests transformations against common pitfalls (NULLs, negative values, unexpected data types)
  • Agent uses appropriate filters, aggregations, and limits for large tables
  • Agent validates unexpected/anomalous values and categories

Suggested prompt for AI agents:

"Run queries on actual data to ensure all transformations work as intended and don't fail on common pitfalls such as NULLs, negative values, unexpected data types, and unexpected/anomalous values or categories. Assume tables can be huge, so always use applicable filters, aggregations, and limits when assessing data quality."

CLI validation:

  • daana-cli check mapping passes
  • All attributes mapped (or explicitly marked optional)
  • Relationships validated
  • Mapping added to workflow.yaml

Example: Complete Workflow

Here's a full example of the workflow in action:

1. Share model with AI:

- id: "CUSTOMER_ENTITY"
  name: "CUSTOMER"
  attributes:
    - id: "CUSTOMER_SEGMENT_ATTR"
      name: "CUSTOMER_SEGMENT"
      definition: "Customer classification based on behavior"
      description: "BRONZE, SILVER, GOLD, PLATINUM"

2. Explore source:

# What columns exist?
bq show --schema "project:dataset.customers_raw"

-- What values exist for customer_tier?
SELECT customer_tier, COUNT(*) as cnt
FROM `project.dataset.customers_raw`
GROUP BY 1
ORDER BY 2 DESC;

-- Result: basic, premium, vip, enterprise

3. AI proposes mapping:

CASE
  WHEN customer_tier = 'basic' THEN 'BRONZE'
  WHEN customer_tier = 'premium' THEN 'SILVER'
  WHEN customer_tier = 'vip' THEN 'GOLD'
  WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
  ELSE 'BRONZE'
END

4. Test it:

SELECT
  customer_tier,
  CASE
    WHEN customer_tier = 'basic' THEN 'BRONZE'
    WHEN customer_tier = 'premium' THEN 'SILVER'
    WHEN customer_tier = 'vip' THEN 'GOLD'
    WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
    ELSE 'BRONZE'
  END AS customer_segment,
  COUNT(*) as cnt
FROM `project.dataset.customers_raw`
GROUP BY 1, 2
ORDER BY cnt DESC;

5. Verify output looks correct, then add to mapping:

mapping_groups:
  - name: customers_group
    tables:
      - table: customers_raw
        attributes:
          - id: CUSTOMER_SEGMENT_ATTR
            transformation_expression: >
              CASE
                WHEN customer_tier = 'basic' THEN 'BRONZE'
                WHEN customer_tier = 'premium' THEN 'SILVER'
                WHEN customer_tier = 'vip' THEN 'GOLD'
                WHEN customer_tier = 'enterprise' THEN 'PLATINUM'
                ELSE 'BRONZE'
              END

6. Validate:

daana-cli check mapping mappings/customer-mapping.yaml
# ✓ Mapping valid

Tips for Effective AI Collaboration

Give AI context incrementally:

  • Start with model definition
  • Then share schema
  • Then share sample data
  • Then validate proposed SQL

Ask AI to explain reasoning:

"Why did you choose to map 'basic' to 'BRONZE' rather than 'SILVER'?"

Request multiple options:

"Show me 2-3 ways to derive CUSTOMER_SEGMENT from the available columns"

Validate assumptions explicitly:

"What assumptions are you making about NULL values in customer_tier?"

Use AI for tedious work:

  • Generating boilerplate YAML structure
  • Writing GROUP BY queries for exploration
  • Converting similar logic across multiple tables

Keep the human in the loop for:

  • Semantic correctness
  • Business logic validation
  • Final SQL testing
  • Model refinement decisions

Summary

The workflow:

  1. Share model → AI understands semantics
  2. Explore schema → AI sees available data
  3. Write & test SQL → Validate logic against real data
  4. Run CLI check → Confirm mapping structure

The golden rule: Never finalize a mapping without testing the SQL against real data.

AI excels at:

  • Understanding semantic definitions
  • Generating SQL transformations
  • Iterating quickly on complex logic

You provide:

  • Domain expertise
  • Semantic validation
  • SQL testing discipline
  • Final approval

The combination creates an efficient, reliable workflow for data mapping at scale.


Related Documentation