One simple way to begin to understand data modeling is to think back on the last time you filled out a form. You're presented with a series of fields to enter values into: First Name, Last Name, D.O.B., and so on. Before you ever saw that form, someone had to design the structure of that data, including the data types of each field (string, number, boolean), in order to make records as manageable as possible. Data modeling is about applying various methodologies to a list of data elements in order to arrive at an optimal relational database design, one that makes your data easy to manage and inexpensive to scale.
Let's say you wanted to store customer and order data for an application that sells a desktop app through a website. You might come up with the following data model:
- OrderID (Integer)
- OrderDate (Timestamp)
- CustomerName (String)
- CustomerAddress (String)
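The model above can be sketched as a single table. Here's a minimal sketch using Python's built-in `sqlite3` module; SQLite and the exact column types are my assumptions for illustration, not something the model prescribes:

```python
import sqlite3

# In-memory database for illustration; a real app would use a file or a server.
conn = sqlite3.connect(":memory:")

# "Order" is a reserved word in SQL, so it has to be quoted as an identifier.
conn.execute("""
    CREATE TABLE "Order" (
        OrderID         INTEGER,  -- unique per order (our eventual primary key)
        OrderDate       TEXT,     -- SQLite stores timestamps as ISO-8601 text
        CustomerName    TEXT,
        CustomerAddress TEXT
    )
""")

conn.execute(
    'INSERT INTO "Order" VALUES (?, ?, ?, ?)',
    (1001, "2015-03-23 00:04:21", "Dan Burke", "35 East Parson St. Yonkers, NY 10701"),
)
print(conn.execute('SELECT CustomerName FROM "Order"').fetchone()[0])  # Dan Burke
```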
With relational data, you want to follow these three steps after coming up with your data model, ultimately ending up with an entity-relationship model:
- Identify candidate keys
- Select a primary key
- Apply normalization
Keys are essentially data elements. When identifying candidate keys, you want to determine which of them, if any, are unique. So in our data model, we need to identify the keys for which no two orders can share the same value. Take a second to try to determine this yourself, then read on for the answer.
The correct answer is OrderID. OrderDate isn't unique because multiple customers can place orders at the same time, and multiple customers can share the same CustomerName, so OrderID is the only truly unique key in this data model. If that's a little hard to grasp, don't worry. We'll be revisiting this as we enter data into our table. We'll also indirectly be learning about normal forms before covering them explicitly in the follow-up article.
Before we do that, I'd like to point out that this was an example where we came out with one candidate key. I'd also like to provide an example where you'd have two candidate keys. You may have multiple keys: one that's explicitly designed to be unique by the software architect, and another that's unique by the nature of the data itself. So if we had an Employee table, the explicitly designed unique key would be the EmployeeID, and the key that's unique by nature could be a Social Security Number. Whichever one you select as the primary key, unique keys provide the system with quicker access to the data instance you're looking for.
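Here's a minimal sketch of that two-candidate-key Employee table in SQLite via Python; the column names and constraint choices are illustrative assumptions, not a real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# EmployeeID is the key designed by the architect; SSN is unique by the
# nature of the data. Declaring both lets the database enforce uniqueness.
conn.execute("""
    CREATE TABLE Employee (
        EmployeeID INTEGER PRIMARY KEY,  -- the selected primary key
        SSN        TEXT UNIQUE,          -- the alternate candidate key
        Name       TEXT
    )
""")

conn.execute("INSERT INTO Employee VALUES (1, '123-45-6789', 'Jane Doe')")
try:
    # A second row reusing the same SSN violates the UNIQUE constraint.
    conn.execute("INSERT INTO Employee VALUES (2, '123-45-6789', 'John Roe')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The database rejects the second row, so either key could serve as the primary key; EmployeeID is the usual choice since it's under the system's control.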
Now, let's organize this data into a table. As you'll see below, our fields form the header, each column holds the values for one field, and each row (line item) represents one instance of the data. This data will live in an Order table. Remember, our goal here is to see why OrderID is our only unique candidate key.
The sample data below was generated with rig, a handy command-line tool that produces random identities and addresses. Install it with brew, apt, or yum.
| OrderID | OrderDate           | CustomerName     | CustomerAddress                        |
|---------|---------------------|------------------|----------------------------------------|
| 1001    | 2015-03-23 00:04:21 | Dan Burke        | 35 East Parson St. Yonkers, NY 10701   |
| 1002    | 2015-03-23 00:06:07 | Elena Burris     | 78 Darly Rd. Addison, IL 60101         |
| 1003    | 2015-03-23 00:10:02 | John Smith       | 700 Burnet Dr. Addison, IL 60101       |
| 1004    | 2015-03-23 00:10:11 | Don Mckee        | 3 Spring County Blvd. Coward, SC 29530 |
| 1005    | 2015-03-23 00:11:41 | Kristina Farrell | 803 Potter Rd. Seattle, WA 98109       |
| 1006    | 2015-03-23 00:11:41 | Denise Marsh     | 419 Lowel Rd. Chicago, IL 60607        |
| 1789    | 2017-05-24 12:12:06 | John Smith       | 700 Burnet Dr. Addison, IL 60101       |
As you can see from the data above, we've already run into a scenario where two orders (1005 and 1006) came in at the same time, so OrderDate is definitely not a candidate key. We also have two John Smiths with the same address. John Smith is a common name, and there's a long gap between the two order numbers and dates, so perhaps this is a different John Smith who moved into the same location. We can't really tell using this data model, and this is what normalization is for. We do, however, know that OrderID is our only unique key.
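The candidate-key check above can be done mechanically: a column qualifies only if its values are distinct across all rows. A minimal sketch in plain Python, using the rows from the table above:

```python
# The rows from the Order table, as (OrderID, OrderDate, CustomerName, CustomerAddress).
rows = [
    (1001, "2015-03-23 00:04:21", "Dan Burke",        "35 East Parson St. Yonkers, NY 10701"),
    (1002, "2015-03-23 00:06:07", "Elena Burris",     "78 Darly Rd. Addison, IL 60101"),
    (1003, "2015-03-23 00:10:02", "John Smith",       "700 Burnet Dr. Addison, IL 60101"),
    (1004, "2015-03-23 00:10:11", "Don Mckee",        "3 Spring County Blvd. Coward, SC 29530"),
    (1005, "2015-03-23 00:11:41", "Kristina Farrell", "803 Potter Rd. Seattle, WA 98109"),
    (1006, "2015-03-23 00:11:41", "Denise Marsh",     "419 Lowel Rd. Chicago, IL 60607"),
    (1789, "2017-05-24 12:12:06", "John Smith",       "700 Burnet Dr. Addison, IL 60101"),
]

columns = ["OrderID", "OrderDate", "CustomerName", "CustomerAddress"]
for i, name in enumerate(columns):
    values = [row[i] for row in rows]
    # A column is a candidate key only if no value repeats.
    unique = len(set(values)) == len(values)
    print(f"{name}: {'candidate key' if unique else 'not unique'}")
```

Running this confirms that OrderID is the only column with no repeated values: OrderDate repeats at 00:11:41, and both CustomerName and CustomerAddress repeat for the two John Smith orders.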
Now, imagine if this data were to continue growing and our stakeholders decided they wanted to use the data for more than just identifying unique orders. What if they also wanted to leverage the customer data to get a better understanding of their different demographics? Let's say they wanted to know where most of their customers lived so they could open up more stores in those areas. Well, we do record their addresses, but because of the structure of the data, we're going to see a lot of duplicates. If John Smith were to place 50 more orders, we'd have 50 more entries in Addison, IL. Sure, we could perform a query and use the CustomerName and CustomerAddress to realize that it's the same person and throw out the redundant data, but what if that person were to change their address? We'd have to make sure to update every order entry with the new address. And even if we had a system to do that whenever we needed to, you can see how the cost of these operations continues to grow with the database. Why not have a separate table for the customers? And what if Addison were to be renamed to something different? Well, we should have a separate table for addresses, as well. Normalization to the rescue, again.
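One way to sketch that split into a separate customer table, again using Python's `sqlite3`. Note that the surrogate CustomerID key, the table layout, and the new address are my own illustrative assumptions, not part of the original model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each customer is stored exactly once; orders reference them by CustomerID.
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT,
        Address    TEXT
    );
    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        OrderDate  TEXT,
        CustomerID INTEGER REFERENCES Customer(CustomerID)
    );
""")

conn.execute(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    (1, "John Smith", "700 Burnet Dr. Addison, IL 60101"),
)
conn.execute('INSERT INTO "Order" VALUES (?, ?, ?)', (1003, "2015-03-23 00:10:02", 1))
conn.execute('INSERT INTO "Order" VALUES (?, ?, ?)', (1789, "2017-05-24 12:12:06", 1))

# An address change now touches exactly one row, no matter how many orders exist.
# (The new address is a made-up example value.)
conn.execute(
    "UPDATE Customer SET Address = ? WHERE CustomerID = ?",
    ("1 New Example Ave. Addison, IL 60101", 1),
)

# Every order sees the updated address through the join.
for order_id, address in conn.execute(
    'SELECT o.OrderID, c.Address FROM "Order" o JOIN Customer c ON o.CustomerID = c.CustomerID'
):
    print(order_id, address)
```

With this layout, the duplicated name and address disappear from the Order table, and an address update is a single-row operation regardless of order volume.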