
    Database Normalization

    With normalization, the goal is to reduce or eliminate data redundancy. As data grows, it becomes increasingly difficult for software developers to maintain the same information in multiple places, as you've learned in our relational data modeling article. Now, there are instances where data redundancy is beneficial. It can improve performance by reducing the time it takes to access your data, for example, with fewer tables to query across. Or it can improve reliability by giving you multiple locations to access the data in case one of your nodes goes down. As with any architectural problem, you have to weigh the quality attributes against each other. There's never one solution to all of your problems, only different solutions at different stages of the company, and you have to be strategic. This article focuses on a scenario where the cost of maintaining the data far outweighs the performance benefit of keeping it all in one table.

    The Original Dataset We Want to Normalize

    Order
    OrderId OrderDate CustomerName CustomerAddress
    1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701
    1002 2015-03-23 00:06:07 Elena Burris 78 Darly Rd. Addison, IL 60101
    1003 2015-03-23 00:10:02 John Smith 700 Burnet Dr. Addison, IL 60101
    1004 2015-03-23 00:10:11 Don Mckee 3 Spring County Blvd. Coward, SC 29530
    1005 2015-03-23 00:11:41 Kristina Farrell 803 Potter Rd. Seattle, WA 98109
    1006 2015-03-23 00:11:41 Denise Marsh 419 Lowel Rd. Chicago, IL 60607
    1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison, IL 60101

    First Normal Form (1NF)

    First Normal Form is defined by Wikipedia as a property of a relation in a relational database: a relation is in first normal form if the domain of each field contains only atomic values, and the value of each field contains only a single value from that domain. Now, what does this actually mean? We'll go over each of the qualifiers and explain it with examples from our original dataset. Most of these rules reinforce the idea that your data must live in a two-dimensional table.

    Columns and Rows

    Simply put, the table must consist of columns and rows to represent the data. Additionally, one column must equal one field and have a unique field name. So this table would not be compliant with this 1NF rule:

    OrderId, OrderDate <--violation CustomerName CustomerAddress
    1001
    2015-03-23 00:04:21
    Dan Burke 35 East Parson St. Yonkers, NY 10701

    However, this table would:

    OrderId OrderDate CustomerName CustomerAddress
    1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701

    Complying with this rule makes tracking down records by OrderDate much easier.

    Strict Types

    Secondly, column entries must be of the same kind. As you've noticed, all of our entries have the same consistent format. All names include a first and last name, and all addresses contain the address, city, state, and zip code. If we were to have inconsistent address formats, it would be more difficult to query against and perform analyses. The following table would not be compliant:

    OrderId OrderDate CustomerName CustomerAddress
    1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers, NY 10701
    1002 2015-03-23 00:06:07 Elena Burris Addison, IL <--violation

    Unique Rows

    Third, rows must be unique. If you scroll back to the top of this page and look at the original dataset, you'll see that we haven't violated this rule. While you will see repeated values, no row is a complete copy of another because the OrderId is always unique. This can also be accomplished with other fields, such as a social security number or email address, but we aren't capturing those.
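    Here's a minimal sketch of how a primary key enforces the unique-rows rule. It uses Python's built-in sqlite3 module rather than the MySQL setup later in this article, so you can try it without a server; the table and column names are illustrative.

```python
import sqlite3

# In-memory database stands in for a real MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT)")
conn.execute("INSERT INTO orders VALUES (1001, '2015-03-23 00:04:21')")

# A second row with the same OrderId would be a duplicate key,
# and the primary key constraint rejects it.
try:
    conn.execute("INSERT INTO orders VALUES (1001, '2015-03-23 00:06:07')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```

    Note that the repeated values in other columns (the same date, the same name) are still allowed; only a full key collision is blocked.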

    1 Column/Field = 1 Value

    Fourth, each field must contain exactly one value, and a single value can't be spread across multiple columns; otherwise the table is no longer two-dimensional. To better illustrate this, we'll simplify our data to focus only on the customers and addresses. The following tables would violate that rule:

    CustomerName CustomerAddress
    Dan Burke 35 East Parson St. Yonkers, NY 10701
    1003 Brighton St. Arlington, TX 76010 <--violation
    Elena Burris 78 Darly Rd. Addison, IL 60101

    CustomerName CustomerAddress1 CustomerAddress2
    Dan Burke 35 East Parson St. Yonkers, NY 10701 1003 Brighton St. Arlington, TX 76010
    Elena Burris 78 Darly Rd. Addison, IL 60101 Null <--violation

    See how Dan Burke now has two addresses under the same field? And in the second table, Elena Burris, who lives in one location, has a null value in her CustomerAddress2 column. You want to avoid this. We can solve the problem by splitting customer data from address data, like so:

    Customer
    CustomerId CustomerName
    2001 Dan Burke
    2002 Elena Burris

    Address
    AddressId Address
    3001 35 East Parson St. Yonkers, NY 10701
    3002 1003 Brighton St. Arlington, TX 76010
    3003 78 Darly Rd. Addison, IL 60101

    Customer-Address
    CustomerId AddressId
    2001 3001
    2001 3002
    2002 3003

    Now, we have the ability to keep both sets of data isolated and communicate the many-to-many, or n:n, relationship in a separate Customer-Address table.
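    The three tables above can be sketched out like this, again using Python's sqlite3 as a stand-in so you can run it anywhere; the snake_case table names are just this sketch's convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (address_id INTEGER PRIMARY KEY, address TEXT);
    -- The junction table holds only pairs of keys, one row per relationship.
    CREATE TABLE customer_addresses (customer_id INTEGER, address_id INTEGER);

    INSERT INTO customers VALUES (2001, 'Dan Burke'), (2002, 'Elena Burris');
    INSERT INTO addresses VALUES
        (3001, '35 East Parson St. Yonkers, NY 10701'),
        (3002, '1003 Brighton St. Arlington, TX 76010'),
        (3003, '78 Darly Rd. Addison, IL 60101');
    INSERT INTO customer_addresses VALUES (2001, 3001), (2001, 3002), (2002, 3003);
""")

# Joining through the junction table recovers every address for a customer,
# with no NULL placeholder columns for customers who have fewer addresses.
rows = conn.execute("""
    SELECT a.address
    FROM customers c
    JOIN customer_addresses ca ON ca.customer_id = c.customer_id
    JOIN addresses a ON a.address_id = ca.address_id
    WHERE c.name = 'Dan Burke'
    ORDER BY a.address_id
""").fetchall()
print(len(rows))  # 2
```

    Dan Burke's two addresses come back as two rows, and Elena Burris simply has one row in the junction table instead of a null-filled second column.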

    Atomic Values

    Lastly, you want to make sure your values are atomic, storing the smallest possible unit of data. Here's our Order table with the name and address broken into atomic fields:

    OrderId OrderDate CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
    1001 2015-03-23 00:04:21 Dan Burke 35 East Parson St. Yonkers NY 10701
    1002 2015-03-23 00:06:07 Elena Burris 78 Darly Rd. Addison IL 60101
    1003 2015-03-23 00:10:02 John Smith 700 Burnet Dr. Addison IL 60101
    1004 2015-03-23 00:10:11 Don Mckee 3 Spring County Blvd. Coward SC 29530
    1005 2015-03-23 00:11:41 Kristina Farrell 803 Potter Rd. Seattle WA 98109
    1006 2015-03-23 00:11:41 Denise Marsh 419 Lowel Rd. Chicago IL 60607
    1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison IL 60101
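    The payoff of atomic values is that queries become simple filters. A quick sketch in Python's sqlite3 (column names are this sketch's own, mirroring the table above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
    address TEXT, city TEXT, state TEXT, zip_code TEXT)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1002, 'Elena', 'Burris', '78 Darly Rd.', 'Addison', 'IL', '60101'),
     (1003, 'John', 'Smith', '700 Burnet Dr.', 'Addison', 'IL', '60101'),
     (1005, 'Kristina', 'Farrell', '803 Potter Rd.', 'Seattle', 'WA', '98109')])

# With atomic columns, a per-state report is a simple equality filter --
# no string parsing of a combined "address, city, state zip" field.
illinois = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE state = 'IL'").fetchone()[0]
print(illinois)  # 2
```

    With the original single CustomerAddress column, the same report would require fragile substring matching against free-form text.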

    Second Normal Form (2NF)

    2NF includes everything in 1NF plus an additional qualification: 2NF = 1NF + every non-key column depends on the entire primary key. So let's focus in on two entries of our dataset to get a better understanding of this. In the previous article, we mentioned having two orders from a John Smith at the same location, but because the gap between the two orders was so large, we couldn't tell whether this was the same John Smith or a new John Smith who had moved into the residence much later.

    OrderId OrderDate CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
    1003 2015-03-23 00:10:02 John Smith 700 Burnet Dr. Addison IL 60101
    1789 2017-05-24 12:12:06 John Smith 700 Burnet Dr. Addison IL 60101

    2NF addresses this by making sure that every entry can be associated with a primary key, so when you see duplicate data, you can tell whether or not it represents the same entity. We should have stored the customer data in a separate Customer table from the beginning and given each customer a CustomerId to uniquely identify accounts independently of orders. Take a look at the following data model, which is 2NF compliant:

    Order
    OrderId OrderDate CustomerId
    1003 2015-03-23 00:10:02 2003
    1789 2017-05-24 12:12:06 3002

    Customer
    CustomerId CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
    2003 John Smith 700 Burnet Dr. Addison IL 60101
    3002 John Smith 700 Burnet Dr. Addison IL 60101

    Now you can see that the CustomerIds don't match, meaning the orders were made by two different John Smiths who happened to live in the same residence. Of course, it's also possible that this is the same person who just happened to order from a different account, but you get the point. If you really want to be sure, you should also store the user's email address or some other unique value.
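    Here's a small sqlite3 sketch of that lookup, counting the distinct CustomerIds behind the "John Smith" orders (table and column names are this sketch's own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
        order_date TEXT, customer_id INTEGER REFERENCES customers(customer_id));

    INSERT INTO customers VALUES (2003, 'John', 'Smith'), (3002, 'John', 'Smith');
    INSERT INTO orders VALUES
        (1003, '2015-03-23 00:10:02', 2003),
        (1789, '2017-05-24 12:12:06', 3002);
""")

# Counting distinct CustomerIds behind the "John Smith" orders tells us
# whether one account or two placed them.
distinct_customers = conn.execute("""
    SELECT COUNT(DISTINCT o.customer_id)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.first_name = 'John' AND c.last_name = 'Smith'
""").fetchone()[0]
print(distinct_customers)  # 2 -- two different accounts
```

    In the pre-2NF layout, the same query could only count matching name strings, which can never distinguish two people who share a name.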

    Primary Keys vs Foreign Keys

    So in our Order table, we've designated the OrderId as our primary key. Primary keys are used to uniquely identify a record in the same table. When defining the relationship between two tables, there must always be one common data element between them. That common data element is the foreign key. Our Order table references the Customer table by way of the CustomerId.

    Third Normal Form (3NF)

    3NF is very similar to 2NF in that it all ties back to the primary key, except it deals with a scenario where a column depends on another column, which in turn depends on the primary key. This is also known as a transitive dependency. Let's take a look at our newly formed Customer table to get a better understanding of this concept:

    CustomerId CustomerFirstName CustomerLastName CustomerAddress CustomerCity CustomerState CustomerZipCode
    2003 John Smith 700 Burnet Dr. Addison IL 60101
    3002 John Smith 700 Burnet Dr. Addison IL 60101

    As you can see, although every row is tied to the primary key (CustomerId), there are still dependencies within the data. CustomerCity and CustomerState don't really depend on the customer at all; they depend on CustomerZipCode, which in turn depends on CustomerId. To solve this, you can move the zip-code-determined fields into a separate ZipCode table. Note that the street address stays with the customer, since many addresses share one zip code:

    Customer
    CustomerId CustomerFirstName CustomerLastName CustomerAddress ZipCode
    2003 John Smith 700 Burnet Dr. 60101
    3002 John Smith 700 Burnet Dr. 60101

    ZipCode
    ZipCode City State
    60101 Addison IL
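    A quick sqlite3 sketch of one reasonable 3NF decomposition: the street address is kept on the customer row (a zip code determines exactly one city and state, but covers many addresses), and a join rebuilds the full record. Names are this sketch's own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- City and state depend only on the zip code, so they live here once.
    CREATE TABLE zip_codes (zip_code TEXT PRIMARY KEY, city TEXT, state TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT, address TEXT,
        zip_code TEXT REFERENCES zip_codes(zip_code));

    INSERT INTO zip_codes VALUES ('60101', 'Addison', 'IL');
    INSERT INTO customers VALUES
        (2003, 'John', 'Smith', '700 Burnet Dr.', '60101'),
        (3002, 'John', 'Smith', '700 Burnet Dr.', '60101');
""")

# The city/state pair is stored exactly once; a join rebuilds the full record.
row = conn.execute("""
    SELECT c.address, z.city, z.state
    FROM customers c JOIN zip_codes z ON z.zip_code = c.zip_code
    WHERE c.customer_id = 2003
""").fetchone()
print(row)  # ('700 Burnet Dr.', 'Addison', 'IL')
```

    If Addison were ever renamed, you'd update one row in zip_codes instead of every customer who lives there.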

    Creating Our Data Model in MySQL

    Follow this guide in order to have a MySQL instance ready to connect to.

    Set Up MySQL Tables

        
        # ssh from the root of vagrant-ansible-mysql
        vagrant ssh
        # log in to mysql
        mysql -uroot
        # all of the following commands are...
        # ...executed from within mysql>
        SHOW DATABASES;
        +--------------------+
        | Database           |
        +--------------------+
        | information_schema |
        | mysql              |
        | performance_schema |
        +--------------------+
        3 rows in set (0.01 sec)
        # use mysql db
        USE mysql;
        # create schema for zip_codes table; city and state depend only on the zip code
        CREATE TABLE zip_codes( zip_code VARCHAR(20) NOT NULL, city VARCHAR(100), state VARCHAR(100), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(zip_code) );
        # create schema for customers table; the street address stays with the customer
        CREATE TABLE customers( id INT NOT NULL AUTO_INCREMENT, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, address VARCHAR(100) NOT NULL, zip_code VARCHAR(20) NOT NULL, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id), FOREIGN KEY(zip_code) REFERENCES zip_codes(zip_code) ON UPDATE CASCADE ON DELETE CASCADE );
        # create schema for order table
        CREATE TABLE orders( id INT NOT NULL AUTO_INCREMENT, order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, customer_id INT NOT NULL, PRIMARY KEY(id), FOREIGN KEY(customer_id) REFERENCES customers(id) ON UPDATE CASCADE ON DELETE CASCADE );
        
      

    Great, now let's take a look at all of our schemas.

        
        # show orders schema
        SHOW COLUMNS FROM mysql.orders;
        +-------------+-----------+------+-----+-------------------+----------------+
        | Field       | Type      | Null | Key | Default           | Extra          |
        +-------------+-----------+------+-----+-------------------+----------------+
        | id          | int(11)   | NO   | PRI | NULL              | auto_increment |
        | order_date  | timestamp | NO   |     | CURRENT_TIMESTAMP |                |
        | customer_id | int(11)   | NO   | MUL | NULL              |                |
        +-------------+-----------+------+-----+-------------------+----------------+
        3 rows in set (0.00 sec)
        # show customers schema using describe
        DESCRIBE mysql.customers;
        +------------+--------------+------+-----+-------------------+----------------+
        | Field      | Type         | Null | Key | Default           | Extra          |
        +------------+--------------+------+-----+-------------------+----------------+
        | id         | int(11)      | NO   | PRI | NULL              | auto_increment |
        | first_name | varchar(100) | NO   |     | NULL              |                |
        | last_name  | varchar(100) | NO   |     | NULL              |                |
        | address    | varchar(100) | NO   |     | NULL              |                |
        | zip_code   | varchar(20)  | NO   | MUL | NULL              |                |
        | created    | timestamp    | NO   |     | CURRENT_TIMESTAMP |                |
        +------------+--------------+------+-----+-------------------+----------------+
        6 rows in set (0.00 sec)
        # show zip_codes schema
        DESCRIBE mysql.zip_codes;
        +----------+--------------+------+-----+-------------------+-------+
        | Field    | Type         | Null | Key | Default           | Extra |
        +----------+--------------+------+-----+-------------------+-------+
        | zip_code | varchar(20)  | NO   | PRI | NULL              |       |
        | city     | varchar(100) | YES  |     | NULL              |       |
        | state    | varchar(100) | YES  |     | NULL              |       |
        | created  | timestamp    | NO   |     | CURRENT_TIMESTAMP |       |
        +----------+--------------+------+-----+-------------------+-------+
        4 rows in set (0.00 sec)
        
      
