Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Aug 2, 2025

Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Aug 2, 2025

Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Aug 2, 2025

1. Why Data Integrity, Normalization & ERDs Matter in AI/ML

"Data quality is not something you fix at the end; it's something you design for at the start."

In Part 1, we discussed how database design can make or break AI/ML systems. We explored why relational databases are the backbone of structured data pipelines and how poor database design silently sabotages models.

But one big truth remains: having a database isn't enough.

  • What if your database allows duplicate records?

  • What if you can’t enforce relationships between tables?

  • What if you don’t have a clear map of how entities are connected?

This is where Data Integrity, Normalization, and Entity-Relationship Diagrams (ERDs) come into play.

In this Part 2, we’ll go deeper into:

  1. Data Integrity: The rules that keep data valid and trustworthy.

  2. Normalization: The process of structuring data to eliminate redundancy and anomalies.

  3. ERDs: How to design databases visually so they’re scalable and maintainable.

We’ll connect each concept to real-world AI/ML use cases and include examples, visuals, and SEO-optimized explanations to make this a definitive resource.

2. Data Integrity: The Foundation of Trustworthy Data

Data Integrity is the assurance that data is accurate, consistent, and reliable over its entire lifecycle.

In AI/ML, data integrity issues are silent killers.

  • Models trained on duplicates overfit.

  • Orphaned records break joins in feature engineering.

  • Inconsistent values degrade predictive accuracy.

2.1 Types of Data Integrity

Relational databases enforce data integrity through four core pillars:

1. Entity Integrity

  • Ensures that each record is uniquely identifiable.

  • Enforced using Primary Keys (PK).

📌 Example:
Each CUSTOMER_ID in the CUSTOMER table must be unique and non-null.

CREATE TABLE CUSTOMER (
  CUSTOMER_ID INT PRIMARY KEY,
  NAME VARCHAR(100),
  EMAIL VARCHAR(100)
);

AI/ML Impact:
Duplicate customer records can lead to inflated purchase history features, skewing churn models.

2. Referential Integrity

  • Maintains consistency between related tables.

  • Enforced using Foreign Keys (FK).

📌 Example:
Each ORDER.CUSTOMER_ID must match an existing CUSTOMER.CUSTOMER_ID.

CREATE TABLE ORDER (
  ORDER_ID INT PRIMARY KEY,
  CUSTOMER_ID INT,
  AMOUNT DECIMAL,
  FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER(CUSTOMER_ID)
);

AI/ML Impact:
Orphaned orders without customers break joins when computing features like "average basket size."

3. Domain Integrity

  • Validates that values fall within acceptable ranges.

  • Enforced using Data Types, CHECK Constraints, and Defaults.

📌 Example:

CREATE TABLE TRANSACTION (
  TXN_ID INT PRIMARY KEY,
  AMOUNT DECIMAL CHECK (AMOUNT > 0),
  STATUS VARCHAR(20) CHECK (STATUS IN ('PENDING','SUCCESS','FAILED'))
);

AI/ML Impact:
Incorrect negative amounts can silently skew features like total spend.

4. User-Defined Integrity

  • Custom business rules beyond standard constraints.

  • Implemented via Triggers, Procedures, or Application Logic.

📌 Example:
A trigger ensuring that customers flagged as "inactive" cannot place new orders.

AI/ML Impact:
Helps enforce edge-case conditions that ensure training data reflects reality.

2.2 How Integrity Constraints Improve AI/ML Pipelines

  • Cleaner Training Data: Constraints prevent garbage data from entering feature stores.

  • Faster Feature Engineering: Fewer nulls and anomalies mean less preprocessing.

  • Better Reproducibility: Enforced constraints keep historical data stable for retraining.

3. Normalization: Eliminating Redundancy and Anomalies

"Normalization is like spring-cleaning your database—removing unnecessary clutter so data is consistent and manageable."

3.1 Why Normalization?

Databases often start simple but accumulate redundancy as they grow.

  • Customer addresses copied in multiple tables.

  • Product details repeated in order records.

This leads to three major anomalies:

  1. Insertion Anomaly:
    You can’t insert data without other unrelated data.

    • Example: Can’t add a new product without an order.

  2. Update Anomaly:
    You must update the same piece of data in multiple places.

    • Example: Changing a customer’s email in 3 tables.

  3. Deletion Anomaly:
    Deleting a record unintentionally deletes related information.

    • Example: Removing the last order deletes customer info.

3.2 Normalization in Action: The Normal Forms

Normalization is achieved through Normal Forms (NF), a series of increasingly strict design rules.

1st Normal Form (1NF): Atomic Data

Rule:

  • No repeating groups.

  • All attributes must be atomic (indivisible).

📌 Before (Bad Design):

CUSTOMER_ID

NAME

EMAIL

PHONES

1

Anil

anil@email.com

12345, 67890

📌 After (1NF):

CUSTOMER_ID

NAME

EMAIL

PHONE

1

Anil

anil@email.com

12345

1

Anil

anil@email.com

67890

AI/ML Impact:
Phones can now be used reliably as categorical features without string-splitting hacks.

2nd Normal Form (2NF): Remove Partial Dependencies

Rule:

  • Must already be in 1NF.

  • Non-key attributes must depend on the entire primary key.

📌 Before:

STUDENT_ID

COURSE_ID

COURSE_NAME

GRADE

Here, COURSE_NAME depends only on COURSE_ID, not the full composite key (STUDENT_ID, COURSE_ID).

📌 After (2NF):

  • Separate COURSE table

COURSE_ID

COURSE_NAME

101

Math

AI/ML Impact:
Eliminates redundant data that could bias features like "average grade per course."

3rd Normal Form (3NF): Remove Transitive Dependencies

Rule:

  • Must already be in 2NF.

  • Non-key attributes must not depend on other non-key attributes.

📌 Before:

EMP_ID

DEPT_ID

DEPT_NAME

LOCATION

Here, LOCATION depends on DEPT_NAME, not directly on EMP_ID.

📌 After (3NF):

  • Create a separate DEPARTMENT table

DEPT_ID

DEPT_NAME

LOCATION

AI/ML Impact:
Prevents anomalies that corrupt location-based features.

3.3 Trade-Offs: Normalization vs Performance

  • Highly Normalized:

    • Pros: Cleaner data, minimal redundancy.

    • Cons: Requires more joins, slower for analytics.

  • Denormalized:

    • Pros: Faster reads, simpler queries.

    • Cons: Redundancy and potential inconsistency.

AI/ML Best Practice:

  • Use normalization for OLTP (transactional systems).

  • Use denormalization for OLAP (analytics/feature stores) where read performance is critical.

4. Entity-Relationship Diagrams (ERDs): Visualizing Database Design

4.1 What is an ERD?

An Entity-Relationship Diagram (ERD) is a visual blueprint of how entities (tables) relate.

  • Entities: Represent tables

  • Attributes: Columns

  • Relationships: Connections (1:1, 1:M, M:N)

Why they matter:

  • Clarify complex relationships before implementation.

  • Ensure alignment with AI/ML feature needs.

4.2 ERD Components

  1. Entities (Tables): Rectangles

  2. Relationships: Lines connecting entities

    • Crow’s foot notation for cardinality

  3. Attributes: Ovals or listed within entities

4.3 Example: E-Commerce ERD

  • CUSTOMER:

    • CUSTOMER_ID (PK)

  • ORDER:

    • ORDER_ID (PK)

    • CUSTOMER_ID (FK)

  • ORDER_ITEM:

    • Composite PK (ORDER_ID, PRODUCT_ID)

4.4 AI/ML Benefits of ERDs

  • Helps feature engineers understand relationships for joins.

  • Identifies potential data leakage points.

  • Reveals missing constraints early.

5. Real-World AI/ML Failures Due to Missing Integrity & ERDs

Case Study – Churn Prediction Model Failure

A telecom churn model underperformed because the data team couldn’t reliably join call detail records with customer profiles.

Root Cause:

  • Missing foreign keys.

  • No ERD to show relationships.

Impact:
Feature engineering relied on incomplete call histories, corrupting model accuracy.

Case Study – Duplicate Feature Values in Feature Store

A recommendation engine’s feature store contained duplicate product embeddings due to lack of primary keys in upstream databases.

Impact:
The model kept recommending the same product multiple times.

Lesson:
Entity integrity isn’t optional when your AI/ML stack depends on feature stores.

6. Building ERDs and Applying Normalization in Practice

6.1 Tools

  • Draw.io / Lucidchart: General-purpose diagramming

  • dbdiagram.io: Database-specific ERDs

  • MySQL Workbench: Automatically generate ERDs

6.2 Steps

  1. Identify Entities: Based on business domain.

  2. Define Attributes: Ensure atomicity (1NF).

  3. Establish Relationships:

    • Determine cardinality (1:1, 1:M, M:N).

    • Add PK and FK constraints.

  4. Apply Normalization: Up to 3NF for most cases.

  5. Review with Stakeholders: Feature engineers, analysts, and product owners.

7. Preparing for Part 3: Schemas, Fact/Dimension Models & Practical Patterns

In Part 3, we’ll connect the dots:

  • Schema Design: Star vs Snowflake for analytics

  • Fact & Dimension Tables: How they power OLAP systems and feature stores

  • Best Practices: Balancing normalization with performance

These concepts will help you design databases that support real-time inference, large-scale analytics, and reliable retraining pipelines.

Final Thought

“Data integrity and normalization aren’t just database buzzwords—they’re the insurance policy for your AI/ML pipelines.”

When you enforce constraints and design databases visually with ERDs:

  • Your feature stores stay clean.

  • Your models retrain reproducibly.

  • Your data scientists spend less time cleaning and more time innovating.

💬 Your Turn

👉 How have you used normalization and ERDs in your data architecture?

  • Did it save your AI/ML team from silent data bugs?

  • Or have you experienced failures because these steps were skipped?

Drop your experiences below—I’ll feature the most valuable insights in Part 3.

Schema design is the bridge between databases and AI/ML analytics. In this final...

Aug 2, 2025

Schema design is the bridge between databases and AI/ML analytics. In this final...

Aug 2, 2025

Schema design is the bridge between databases and AI/ML analytics. In this final...

Aug 2, 2025

"Your models are only as good as the data they’re trained on. In AI/ML, we spen...

Aug 1, 2025

"Your models are only as good as the data they’re trained on. In AI/ML, we spen...

Aug 1, 2025

"Your models are only as good as the data they’re trained on. In AI/ML, we spen...

Aug 1, 2025

The final post in the 'Math That Makes AI Work (For You)' series — a practical,...

Jun 25, 2025

The final post in the 'Math That Makes AI Work (For You)' series — a practical,...

Jun 25, 2025

The final post in the 'Math That Makes AI Work (For You)' series — a practical,...

Jun 25, 2025