Database Design (Part 3): Schemas, Fact/Dimension Models & Practical Patterns (AI/ML Perspective)

Aug 2, 2025

1. Why Schema Design is the Bridge Between Databases & AI/ML Analytics

“Your database schema determines how easily you can extract meaningful insights and features for AI/ML systems.”

In Part 1, we covered the fundamentals of relational databases and why database design is the silent differentiator in AI/ML. In Part 2, we dived deep into data integrity, normalization, and ERDs, exploring how these practices enforce data quality and consistency.

Now, in this final part, we close the loop by focusing on schemas, fact/dimension models, and practical database patterns that support analytics, OLAP workloads, and modern AI/ML pipelines.

Because here’s the reality:

  • You can design normalized databases perfectly (1NF, 2NF, 3NF).

  • You can enforce every integrity constraint and draw ERDs that make architects smile.

But if your schema design doesn’t support analytics at scale, your data science and ML teams will spend 80% of their time writing convoluted queries, fighting join complexity, and building fragile pipelines.

Schema design is where database design meets analytics and machine learning.

2. Understanding Schemas: The Logical Blueprint of Your Database

A database schema defines the structure, organization, and relationships within your database.

  • It is the logical blueprint of how data is stored.

  • Schemas define tables, columns, data types, constraints, and relationships.

  • In large organizations, schemas also define security boundaries (who can access what).

2.1 Types of Schemas

  1. Logical Schema:

    • High-level design showing how data is logically organized.

    • Includes tables, relationships, constraints, and keys.

    • Example: ERDs, normalization diagrams.

  2. Physical Schema:

    • Actual storage implementation on disk.

    • Includes partitions, indexes, file structures.

  3. External Schema (Views):

    • Customized data views for specific users or applications.

AI/ML Perspective:

  • Logical schemas help data scientists understand how to query and join data.

  • External schemas (views) can expose curated feature sets.
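
As a minimal sketch of the external-schema idea (all table and column names here are hypothetical, and SQLite is used purely for portability), a view can expose a curated, non-sensitive feature set without granting access to the underlying table:

```python
import sqlite3

# In-memory database for illustration; schema and names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,            -- sensitive: not exposed to data scientists
    segment TEXT,
    lifetime_value REAL
);
-- External schema: a view exposing only curated, non-sensitive features.
CREATE VIEW v_customer_features AS
SELECT customer_id, segment, lifetime_value
FROM customers;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com', 'premium', 1200.0)")
features = conn.execute("SELECT * FROM v_customer_features").fetchall()
print(features)  # [(1, 'premium', 1200.0)]
```

In a real warehouse the view would also be the natural place to attach grants, so the security boundary and the curated feature set coincide.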

2.2 Why Schemas Matter for AI/ML

  1. Feature Engineering Efficiency:

    • Well-designed schemas reduce the need for excessive joins.

    • Makes feature pipelines cleaner and reproducible.

  2. Data Warehouse/Feature Store Design:

    • OLAP systems (Snowflake, BigQuery, Redshift) depend heavily on dimensional schemas.

  3. Compliance and Governance:

    • Schemas enforce access controls and data lineage tracking.

3. Fact & Dimension Models: The Heart of Analytical Schema Design

Traditional OLTP (transactional) databases are highly normalized for data integrity. But analytics workloads and AI/ML pipelines require fast aggregations, joins, and large-scale queries.

This is where fact/dimension modeling comes in.

3.1 The Star Schema

The Star Schema is the most popular dimensional model.

  • Fact Table: Central table containing measurable metrics.

  • Dimension Tables: Surrounding tables providing context to facts.

📌 Example: E-Commerce Analytics

Fact Table (FACT_SALES):

  • Columns: sale_id, product_id, customer_id, date_id, quantity, amount

Dimension Tables:

  • DIM_CUSTOMER: Attributes like name, location, segment.

  • DIM_PRODUCT: Attributes like category, brand.

  • DIM_DATE: Calendar attributes (day, week, month).

Why It Works:

  • Single fact table, multiple dimensions → fast aggregations.

  • Data scientists can query "total sales by region, product category, month" easily.
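
The e-commerce star schema above can be sketched end to end (data values are invented, and SQLite stands in for a real warehouse) to show why the aggregation stays simple: every query is one hop from the fact table to a dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, category TEXT, brand TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
-- Central fact table: one row per sale, keyed to each dimension.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product,
    customer_id INTEGER REFERENCES dim_customer,
    date_id     INTEGER REFERENCES dim_date,
    quantity    INTEGER,
    amount      REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, 'electronics', 'Acme'), (2, 'books', 'Pressly')])
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada', 'EU')")
conn.execute("INSERT INTO dim_date VALUES (1, '2025-08-01', '2025-08')")
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?,?)",
                 [(1, 1, 1, 1, 2, 199.98), (2, 2, 1, 1, 1, 25.00)])

# One join per dimension -> a simple, fast "total sales by category" query.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('books', 25.0), ('electronics', 199.98)]
```

Adding "by region" or "by month" just adds one more join to `dim_customer` or `dim_date`; the shape of the query never gets deeper.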

3.2 The Snowflake Schema

The Snowflake Schema is a normalized version of the star schema.

  • Dimension tables are further broken down into sub-dimensions.

  • Reduces redundancy but increases join complexity.

📌 Example:

  • DIM_PRODUCT split into DIM_PRODUCT, DIM_CATEGORY, DIM_BRAND.
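
The snowflaked product dimension can be sketched as follows (names are illustrative, SQLite used for portability). Note that reading a category name now costs one extra join per sub-dimension, which is exactly the trade-off discussed below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: brand and category normalized out.
CREATE TABLE dim_brand    (brand_id INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id     INTEGER REFERENCES dim_brand,
    category_id  INTEGER REFERENCES dim_category
);
INSERT INTO dim_brand VALUES (1, 'Acme');
INSERT INTO dim_category VALUES (1, 'electronics');
INSERT INTO dim_product VALUES (1, 'Widget', 1, 1);
""")
# Reaching descriptive attributes requires an extra join per sub-dimension.
row = conn.execute("""
    SELECT p.product_name, b.brand_name, c.category_name
    FROM dim_product p
    JOIN dim_brand b USING (brand_id)
    JOIN dim_category c USING (category_id)
""").fetchone()
print(row)  # ('Widget', 'Acme', 'electronics')
```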

Trade-Off:

  • Pros: Saves storage, improves consistency.

  • Cons: More joins, harder for analytics.

AI/ML Perspective:

  • Star schemas are preferred for feature stores because they minimize joins.

  • Snowflake schemas may be better for governance-heavy environments.

3.3 Fact Table Design Best Practices

  1. Grain: Decide the lowest level of detail (e.g., one row per sale transaction).

  2. Surrogate Keys: Use synthetic keys (integers) for better performance.

  3. Additive Measures: Store measures (e.g., sales amount) that can be aggregated.

  4. Time Dimension: Always include a date/time dimension for temporal analysis.

3.4 Dimension Table Design Best Practices

  1. Descriptive Attributes: Include attributes for easy slicing (e.g., region, product category).

  2. Slowly Changing Dimensions (SCD):

    • Handle changes in attributes over time (e.g., customer moves city).

    • Type 1: Overwrite.

    • Type 2: Create new row with effective date range.

  3. Hierarchies: Support hierarchies (e.g., Country → State → City).
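
The SCD Type 2 pattern above can be sketched in two statements: expire the current row, then insert a new current row with a fresh effective date. All column names (`valid_from`, `valid_to`, `is_current`) are one common convention, not the only one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,   -- surrogate key
    customer_id INTEGER,               -- natural/business key
    city        TEXT,
    valid_from  TEXT,
    valid_to    TEXT,                  -- NULL = still current
    is_current  INTEGER
)""")
conn.execute("INSERT INTO dim_customer VALUES (1, 42, 'London', '2024-01-01', NULL, 1)")

def scd2_update_city(conn, customer_id, new_city, change_date):
    """SCD Type 2: close out the current row, insert a new current row."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id))
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date))

scd2_update_city(conn, 42, 'Berlin', '2025-08-02')  # customer moves city
history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(history)
# [('London', '2024-01-01', '2025-08-02', 0), ('Berlin', '2025-08-02', None, 1)]
```

Because the old row survives with its date range, models can be trained against the customer's attributes *as they were* at prediction time.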

4. Schemas & AI/ML Feature Stores: The Connection

Feature stores are, in effect, specialized OLAP systems for ML. They rely on dimensional schemas for performance and maintainability.

4.1 Why Feature Stores Use Star Schemas

  1. Fast Feature Retrieval:

    • One fact table (e.g., transactions)

    • Dimensions (e.g., customers, products) provide context

  2. Reusability:

    • Dimensions like DIM_CUSTOMER can be reused across multiple models.

  3. Consistency:

    • Same schema used for training and serving reduces training-serving skew.

4.2 Example: Churn Prediction Feature Store

Fact Table: FACT_CALLS

  • Grain: One row per call

  • Measures: call_duration, call_cost

Dimensions:

  • DIM_CUSTOMER: demographics, segment

  • DIM_DATE: call date/time

  • DIM_PLAN: subscription plan details

Feature:

  • average_call_duration_last_30_days = Aggregation from FACT_CALLS
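
The `average_call_duration_last_30_days` feature can be sketched as a single windowed aggregation over `FACT_CALLS` (sample data is invented; the explicit `as_of` date keeps the feature reproducible between training and serving):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_calls (
    call_id       INTEGER PRIMARY KEY,
    customer_id   INTEGER,
    call_date     TEXT,
    call_duration REAL   -- minutes
)""")
conn.executemany("INSERT INTO fact_calls VALUES (?,?,?,?)", [
    (1, 7, '2025-07-10', 4.0),
    (2, 7, '2025-07-25', 6.0),
    (3, 7, '2025-05-01', 30.0),   # outside the 30-day window
])

# Feature: average call duration per customer over the last 30 days,
# relative to a fixed "as of" date rather than wall-clock now().
as_of = '2025-08-02'
feature = conn.execute("""
    SELECT customer_id, AVG(call_duration)
    FROM fact_calls
    WHERE call_date >= date(?, '-30 days')
    GROUP BY customer_id
""", (as_of,)).fetchall()
print(feature)  # [(7, 5.0)]
```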

5. Practical Patterns: Balancing Normalization & Denormalization

5.1 OLTP vs OLAP

  • OLTP (Online Transaction Processing):

    • Transactional systems

    • Highly normalized (3NF)

    • Optimized for writes

  • OLAP (Online Analytical Processing):

    • Analytical systems (data warehouses, feature stores)

    • Denormalized (star/snowflake)

    • Optimized for reads

AI/ML Perspective:

  • OLTP: Source of truth for raw data.

  • OLAP: Curated for analytics and ML features.

5.2 Hybrid Approaches

Some modern systems (e.g., BigQuery, Snowflake) allow you to query OLTP-like normalized data efficiently. But most teams still build ETL pipelines to transform OLTP → OLAP.

5.3 When to Denormalize

  • When joins become bottlenecks for analytics.

  • When feature pipelines require low-latency aggregations.

  • When data volumes are manageable and storage is cheap.

Example:

  • Combine DIM_CUSTOMER attributes directly into FACT_TRANSACTIONS to reduce joins in training pipelines.
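
That denormalization step can be sketched as a one-off `CREATE TABLE ... AS SELECT` that folds the customer attributes into a wide table, so training pipelines read one table instead of re-joining every run (table names follow the hypothetical example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT, region TEXT);
CREATE TABLE fact_transactions (
    txn_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'premium', 'EU');
INSERT INTO fact_transactions VALUES (100, 1, 50.0);
-- Denormalize: fold customer attributes directly into a wide fact table.
CREATE TABLE fact_transactions_wide AS
SELECT f.txn_id, f.customer_id, f.amount, c.segment, c.region
FROM fact_transactions f JOIN dim_customer c USING (customer_id);
""")
wide = conn.execute("SELECT * FROM fact_transactions_wide").fetchall()
print(wide)  # [(100, 1, 50.0, 'premium', 'EU')]
```

The cost is that the wide table goes stale when `dim_customer` changes, so it needs a refresh step in the pipeline.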

6. Advanced Schema Design for AI/ML

6.1 Partitioning

  • Horizontal Partitioning (Sharding): Split tables by key (e.g., region).

  • Vertical Partitioning: Split columns into separate tables.

AI/ML Use:

  • Speeds up queries for localized models.

6.2 Indexing

  • Clustered Indexes: Define physical order of rows.

  • Non-Clustered Indexes: Pointers for quick lookups.

AI/ML Use:

  • Speeds up feature retrieval for specific IDs.
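
A minimal sketch of the non-clustered case (SQLite here, where secondary indexes play this role; the exact query-plan wording varies by engine and version): after adding an index on the lookup key, point queries search the index instead of scanning the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_calls (call_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, call_duration REAL)")
conn.executemany("INSERT INTO fact_calls VALUES (?,?,?)",
                 [(i, i % 100, 1.0) for i in range(1000)])

# Secondary (non-clustered) index: point lookups by customer_id
# no longer require a full table scan.
conn.execute("CREATE INDEX idx_calls_customer ON fact_calls (customer_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_calls WHERE customer_id = 7"
).fetchall()
print(plan[0][-1])  # plan detail should mention idx_calls_customer
```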

6.3 Materialized Views

  • Precompute common aggregations (e.g., average purchase per customer).

  • Reduce feature pipeline computation.
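
As a sketch of the idea: SQLite (used here for portability) has no native materialized views, so this simulates one with a precomputed table plus a refresh step; engines such as PostgreSQL, BigQuery, and Snowflake support materialized views directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_purchases (customer_id INTEGER, amount REAL);
INSERT INTO fact_purchases VALUES (1, 10.0), (1, 30.0), (2, 5.0);
""")

def refresh_avg_purchase(conn):
    """Simulated materialized view: recompute and store the aggregate."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_avg_purchase;
    CREATE TABLE mv_avg_purchase AS
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM fact_purchases GROUP BY customer_id;
    """)

refresh_avg_purchase(conn)
mv = conn.execute("SELECT * FROM mv_avg_purchase ORDER BY customer_id").fetchall()
print(mv)  # [(1, 20.0), (2, 5.0)]
```

Feature pipelines then read `mv_avg_purchase` directly instead of recomputing the aggregation on every training run.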

6.4 Schema Evolution

  • Databases evolve; schemas must adapt without breaking pipelines.

  • Use versioned tables or SCD Type 2 for historical tracking.

7. Case Studies: Schema Design Lessons from AI/ML Projects

Case Study 1 – Real-Time Fraud Detection

  • Required low-latency access to transaction history.

  • Star schema with pre-aggregated materialized views reduced feature lookup time from 2s → 100ms.

Case Study 2 – Recommendation Engine Feature Store

  • Snowflake schema caused complex joins.

  • Migrated to star schema; reduced pipeline code by 40%.

Case Study 3 – Healthcare Predictive Models

  • Needed full historical patient context.

  • Implemented Slowly Changing Dimensions (SCD Type 2) in DIM_PATIENT.

  • Models improved accuracy because they captured temporal attribute changes.

8. Tools for Schema Design & Validation

  1. ERD & Schema Design: dbdiagram.io, MySQL Workbench, Lucidchart

  2. Data Warehouse Modeling: dbt, Snowflake Schema Designer

  3. Validation: Great Expectations (ensures schema and data integrity)

9. Preparing for Your AI/ML Data Architecture

This 3-part Database Design Recap series has covered:

  1. Part 1: Relational database fundamentals & why design matters.

  2. Part 2: Data integrity, normalization & ERDs.

  3. Part 3 (this post): Schema design, fact/dimension modeling & practical patterns.

With these concepts, you can design data architectures that support robust analytics, reproducible ML pipelines, and scalable AI systems.

Final Thought

“Database schemas are the bridge between raw data and intelligence. Build them thoughtfully.”

When you invest in dimensional modeling and schema design:

  • Your feature stores become reliable.

  • Your ML pipelines are simpler and faster.

  • Your data scientists can focus on building models, not firefighting data issues.

💬 Your Turn

👉 Which schema design patterns have you used in your AI/ML projects?

  • Star schema? Snowflake schema?

  • Slowly Changing Dimensions?

  • Hybrid OLTP/OLAP?

Share your experiences in the comments—I’ll highlight the most insightful ones for the community.
