Normalizing – How do you do it??

Hopefully everyone has read FundamentalsOfDBDesign by Paul Litwin that I mentioned in the last post.  (If not, perhaps take a few minutes to stop and read it and then come back and join us, understanding some of those basics will be key to this installment.  It’s ok, this post will still be here waiting when you finish.)

There are many pages out there that talk about normalizing data.  I feel that many of them only go part way.  Some of them are a little too “canned”.  Even Litwin’s examples are some what simplified.  They all have a membership database, or a customer database or something simple and a list of fields to be categorized.  Well that’s all fine and dandy, but how do you get to that point?  Where did that list of fields come from?  How do you work with data in the real world?  I could never find any websites or blog posts on HOW to get that.  Here’s how I do it.

I was lucky enough at the Court to be able to design and build from scratch and then re-do, a couple of times for version upgrades, a core Court application – the Jury Management System (JMS).  When I started at the Court the “Jury System” on the AS400 was basically an upload of a text file of summoned Jurors on which one or two functions or processes could be performed.  All the rest of the work was performed manually – mail merging letters, tracking time worked, tracking panels served on, tracking trials for which a panelist was selected, and making payments, just to name a few.  it took months to get a Jury term processed and delivered to Accounting to start the payment process. There was one report that was manually compiled every two weeks. I was assigned the task of creating a new system that would allow the staff to automate all these processes and more.

During the initial design phase I took the time to create a proper relational database.  Over the past 10 years and 3 major revisions that database structure is still valid and the only required modifications have been to include new data that wasn’t part of the original application or to accommodate major changes in functionality.

When you are lucky enough to get to design from scratch here are some things that you should look for to help determine what fields you need to have in your database:

What are your inputs?

In my case it was the text file that I was being sent that contained the names of people who had been summoned for jury duty.  Did I need everything from the file? No, but I could look at the file and determine I should probably get their name, address, date for which they were summoned, other information that identified the people that have been asked to complete a form and return it to the Jury Division. I didn’t need the name of the Authorizing Clerk of the Court or the Court’s address. Those pieces of information may have been important to the Agency creating and sending the Court the file, but not to our application.

What are you expected to output?

Like most applications, what I was building was replacing something, some manual task that some person has to spend a lot of time compiling and working. Well, what did that person produce? Is it a report? What is on the report? If you are being asked to report the number of Trials that a person served on you need to make sure that somewhere in the database you capture some trial information and what jurors served on that trial. If you don’t capture it, you can’t report on it. A lot of developers skip this step and it is CRUCIAL!

What is your application expected to do? What data will it need in order to perform that task?

One function the Jury application was expected to perform was to calculate the number of hours a Juror worked in order to pay them. Well, in order to do that you have to know all the days a person was in attendance, what time they arrive and what time they left. But, there are special rules for Public Employees. They only get paid when they are at the Court outside their regular working hours because Agencies are required to pay employees while on Jury Duty. So now we need to know which of the people who showed up for Jury Duty are public employees and what their regular start and end times are. But then there are special cases like Substitute Teachers, who are Public Employees, but get paid for all hours worked; or, teachers on Summer Break. Jurors also get paid for mileage to and from their primary residence to the Court round-trip for each day they come. Now you need to know the Round Trip Mileage for each person.
All of these rules determine what data is required to be stored in order for the application to do the appropriate calculation to pay the Juror correctly.

Now you have a list of all the fields you think that your application will need to do the work it is supposed to do.  Now what?  Well, you have to decide what tables to create.  How do you do that??

Well, by looking at the functions your application is required to perform, inputs and outputs you should have a general idea of some of your major tables.  In the case of the JMS, and given just the information I shared above, I would determine a need for the following tables: JUROR, TRIAL and ATTENDANCE.  Plus, we also need some where to keep all the information about PUBLIC EMPLOYEES – we need to know who they are and when they work and if there are any special conditions that affect their pay.  We need to know where the JURORs live so an ADDRESS table might be a good candidate.

So you go through your list of potential fields and determine what table they belong to.  It doesn’t matter if you don’t get it right the first time, you just need to make a first pass and see what you KNOW goes certain places and then you’ll be left with the fields you’re not quite sure about.  It may turn out that something you thought was a FOR SURE you end up changing.

Each table must have a PRIMARY KEY.  I prefer auto-incrementing keys.  They are easy to create and maintain.  But regardless, perhaps there is a natural key that you prefer using, every table has a KEY.  So each table starts off having an ID field.  This ID refers to ONE row in the set.

The rest of the fiels in each table are added by asking yourself, when deciding if a field belongs in a table, “Does this field pertain to ONLY what is in this table?”.  Some things are no brainers.  In our list we have the fields from our input file: name, address, SSN, DOB, the date they are to show up for jury duty.  Those might at first be all things that you would select for the JURORs table.  The TRIALS that a JUROR served on?  That really isn’t information about the JUROR.  So you wouldn’t want to put TRIAL information in the JUROR table.

So where would you put TRIAL information that is about the JURORS?  We have a JUROR table that lists everything we know about the JUROR. We have a TRIAL table that lists everything about the TRIAL.  Where DO you store which JURORS served on which TRIALS?  In the JUROR_TRIAL table!   We will have a table that contains an ID for each row in this set and an ID for the JUROR and an ID for the TRIAL they served on.  (This structure is how a many-to-many relationship is represented: many jurors can serve on many trials.  We will get into relationships another day.)

Hopefully this has helped you to understand the actual steps taken to get a database design into a normalized state.

Feel free to contact me if you need any assistance with your design or have any questions about normalization!

theplatypusdatabase@gmail.com


Leave a comment