Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Home

Aug 2, 2025

Home

Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Aug 2, 2025

Home

Database Design (Part 2): Data Integrity, Normalisation & ERDs (AI/ML Perspective)

Aug 2, 2025

1. Why Data Integrity, Normalization & ERDs Matter in AI/ML

"Data quality is not something you fix at the end; it's something you design for at the start."

In Part 1, we discussed how database design can make or break AI/ML systems. We explored why relational databases are the backbone of structured data pipelines and how poor database design silently sabotages models.

But one big truth remains: having a database isn't enough.

What if your database allows duplicate records?
What if you can’t enforce relationships between tables?
What if you don’t have a clear map of how entities are connected?

This is where Data Integrity, Normalization, and Entity-Relationship Diagrams (ERDs) come into play.

In this Part 2, we’ll go deeper into:

Data Integrity: The rules that keep data valid and trustworthy.
Normalization: The process of structuring data to eliminate redundancy and anomalies.
ERDs: How to design databases visually so they’re scalable and maintainable.

We’ll connect each concept to real-world AI/ML use cases and include examples, visuals, and SEO-optimized explanations to make this a definitive resource.

2. Data Integrity: The Foundation of Trustworthy Data

Data Integrity is the assurance that data is accurate, consistent, and reliable over its entire lifecycle.

In AI/ML, data integrity issues are silent killers.

Models trained on duplicates overfit.
Orphaned records break joins in feature engineering.
Inconsistent values degrade predictive accuracy.

2.1 Types of Data Integrity

Relational databases enforce data integrity through four core pillars:

1. Entity Integrity

Ensures that each record is uniquely identifiable.
Enforced using Primary Keys (PK).

📌 Example:
Each CUSTOMER_ID in the CUSTOMER table must be unique and non-null.

CREATE TABLE CUSTOMER (
  CUSTOMER_ID INT PRIMARY KEY,
  NAME VARCHAR(100),
  EMAIL VARCHAR(100)
);

AI/ML Impact:
Duplicate customer records can lead to inflated purchase history features, skewing churn models.

2. Referential Integrity

Maintains consistency between related tables.
Enforced using Foreign Keys (FK).

📌 Example:
Each ORDER.CUSTOMER_ID must match an existing CUSTOMER.CUSTOMER_ID.

CREATE TABLE ORDER (
  ORDER_ID INT PRIMARY KEY,
  CUSTOMER_ID INT,
  AMOUNT DECIMAL,
  FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER(CUSTOMER_ID)
);

AI/ML Impact:
Orphaned orders without customers break joins when computing features like "average basket size."

3. Domain Integrity

Validates that values fall within acceptable ranges.
Enforced using Data Types, CHECK Constraints, and Defaults.

📌 Example:

CREATE TABLE TRANSACTION (
  TXN_ID INT PRIMARY KEY,
  AMOUNT DECIMAL CHECK (AMOUNT > 0),
  STATUS VARCHAR(20) CHECK (STATUS IN ('PENDING','SUCCESS','FAILED'))
);

AI/ML Impact:
Incorrect negative amounts can silently skew features like total spend.

4. User-Defined Integrity

Custom business rules beyond standard constraints.
Implemented via Triggers, Procedures, or Application Logic.

📌 Example:
A trigger ensuring that customers flagged as "inactive" cannot place new orders.

AI/ML Impact:
Helps enforce edge-case conditions that ensure training data reflects reality.

2.2 How Integrity Constraints Improve AI/ML Pipelines

Cleaner Training Data: Constraints prevent garbage data from entering feature stores.
Faster Feature Engineering: Fewer nulls and anomalies mean less preprocessing.
Better Reproducibility: Enforced constraints keep historical data stable for retraining.

3. Normalization: Eliminating Redundancy and Anomalies

"Normalization is like spring-cleaning your database—removing unnecessary clutter so data is consistent and manageable."

3.1 Why Normalization?

Databases often start simple but accumulate redundancy as they grow.

Customer addresses copied in multiple tables.
Product details repeated in order records.

This leads to three major anomalies:

Insertion Anomaly:
You can’t insert data without other unrelated data.
- Example: Can’t add a new product without an order.
Update Anomaly:
You must update the same piece of data in multiple places.
- Example: Changing a customer’s email in 3 tables.
Deletion Anomaly:
Deleting a record unintentionally deletes related information.
- Example: Removing the last order deletes customer info.

3.2 Normalization in Action: The Normal Forms

Normalization is achieved through Normal Forms (NF), a series of increasingly strict design rules.

1st Normal Form (1NF): Atomic Data

Rule:

No repeating groups.
All attributes must be atomic (indivisible).

📌 Before (Bad Design):

CUSTOMER_ID	NAME	EMAIL	PHONES
1	Anil	anil@email.com	12345, 67890

📌 After (1NF):

CUSTOMER_ID	NAME	EMAIL	PHONE
1	Anil	anil@email.com	12345
1	Anil	anil@email.com	67890

AI/ML Impact:
Phones can now be used reliably as categorical features without string-splitting hacks.

2nd Normal Form (2NF): Remove Partial Dependencies

Rule:

Must already be in 1NF.
Non-key attributes must depend on the entire primary key.

📌 Before:

STUDENT_ID	COURSE_ID	COURSE_NAME	GRADE

Here, COURSE_NAME depends only on COURSE_ID, not the full composite key (STUDENT_ID, COURSE_ID).

📌 After (2NF):

Separate COURSE table

COURSE_ID	COURSE_NAME
101	Math

AI/ML Impact:
Eliminates redundant data that could bias features like "average grade per course."

3rd Normal Form (3NF): Remove Transitive Dependencies

Rule:

Must already be in 2NF.
Non-key attributes must not depend on other non-key attributes.

📌 Before:

EMP_ID	DEPT_ID	DEPT_NAME	LOCATION

Here, LOCATION depends on DEPT_NAME, not directly on EMP_ID.

📌 After (3NF):

Create a separate DEPARTMENT table

DEPT_ID	DEPT_NAME	LOCATION

AI/ML Impact:
Prevents anomalies that corrupt location-based features.

3.3 Trade-Offs: Normalization vs Performance

Highly Normalized:
- Pros: Cleaner data, minimal redundancy.
- Cons: Requires more joins, slower for analytics.
Denormalized:
- Pros: Faster reads, simpler queries.
- Cons: Redundancy and potential inconsistency.

AI/ML Best Practice:

Use normalization for OLTP (transactional systems).
Use denormalization for OLAP (analytics/feature stores) where read performance is critical.

4. Entity-Relationship Diagrams (ERDs): Visualizing Database Design

4.1 What is an ERD?

An Entity-Relationship Diagram (ERD) is a visual blueprint of how entities (tables) relate.

Entities: Represent tables
Attributes: Columns
Relationships: Connections (1:1, 1:M, M:N)

Why they matter:

Clarify complex relationships before implementation.
Ensure alignment with AI/ML feature needs.

4.2 ERD Components

Entities (Tables): Rectangles
Relationships: Lines connecting entities
- Crow’s foot notation for cardinality
Attributes: Ovals or listed within entities

4.3 Example: E-Commerce ERD

CUSTOMER:
- CUSTOMER_ID (PK)
ORDER:
- ORDER_ID (PK)
- CUSTOMER_ID (FK)
ORDER_ITEM:
- Composite PK (ORDER_ID, PRODUCT_ID)

4.4 AI/ML Benefits of ERDs

Helps feature engineers understand relationships for joins.
Identifies potential data leakage points.
Reveals missing constraints early.

5. Real-World AI/ML Failures Due to Missing Integrity & ERDs

Case Study – Churn Prediction Model Failure

A telecom churn model underperformed because the data team couldn’t reliably join call detail records with customer profiles.

Root Cause:

Missing foreign keys.
No ERD to show relationships.

Impact:
Feature engineering relied on incomplete call histories, corrupting model accuracy.

Case Study – Duplicate Feature Values in Feature Store

A recommendation engine’s feature store contained duplicate product embeddings due to lack of primary keys in upstream databases.

Impact:
The model kept recommending the same product multiple times.

Lesson:
Entity integrity isn’t optional when your AI/ML stack depends on feature stores.

6. Building ERDs and Applying Normalization in Practice

6.1 Tools

Draw.io / Lucidchart: General-purpose diagramming
dbdiagram.io: Database-specific ERDs
MySQL Workbench: Automatically generate ERDs

6.2 Steps

Identify Entities: Based on business domain.
Define Attributes: Ensure atomicity (1NF).
Establish Relationships:
- Determine cardinality (1:1, 1:M, M:N).
- Add PK and FK constraints.
Apply Normalization: Up to 3NF for most cases.
Review with Stakeholders: Feature engineers, analysts, and product owners.