query tip « The Platypus Database

I was pulled into a meeting to help out with a query that needed to find some data to update a field. Basically the situation is:
1. a file is received from an external entity that has 2 pieces of information needed for the lookup, a GROUP and a UCC*
2. this file is loaded into a table and then an ETL (Extract, Transform, Load) process is called
3. the ETL needs to know a PriceType that is stored in a table that contains GROUP and VendorID
4. VendorID can be found in a query using the UCC

Ultimately what I need is the PriceType to include in my ETL process.
If I know the GROUP and UCC, I can find the VendorID that is needed to determine the correct PriceType.

So, the current process takes FOREVER – days!! It is using cursors and fetch next into the cursor and a whole lot of things that I don’t get! That is why we were having the meeting…to try and find a way to make the load run in a reasonable amount of time.

So I am listening to the conversation and my co-worker keeps talking about looping through the database to find the VendorID. And I am thinking, maybe I am not getting it, but it just seems like we can take the data we know, GROUP and UPC, and using JOINs get the VendorID. Why would we need to loop? So I asked and everyone agreed that what I was saying sounded reasonable, but weren’t really positive that what I was saying would work.

So, I rewrote this UPDATE_CURSOR statement with a query. I got it 90% there. The query was then put in a table and called as part of the ETL and I was missing a key field to join on in order to perform the update, but once that piece was included the process completed in an hour!

From 27 hours with no completion to complete in an hour! That is the power of joins and thinking in a set. If you are looping through your data set to find data in your tables it try to approach it and see if you can find a way for it to be done in a single query!!

The idea of iterating is a very programmatic approach and has it’s place, but not when you are trying to extract data from a database. You should try to think of everything you want to get and what connections have to be made in order to get that data.

Start with a small query. What is the main table that you need the most data from or has the key field that will be needed to find data in other tables? Write the query that just gets those fields from that main table. Where is the next piece of data that you need? Are there any criteria that need to be applied before JOINing into that data? Can you JOIN into the table or do you need to write a SUB-QUERY to extract just some of the data from that table? Once you get the details of the secondary information worked out, JOIN your main query to that and add the fields you need. It maybe that some of the fields you included you don’t need as output, but you do need as something to JOIN on in a later part of the query.

As you can see, building a query to get your full result set takes some time and knowledge of your database structure but it really how queries are supposed to work.

Now, once your application gets the result set returned, feel free to iterate through it as many times as you like to get data from it, but the idea of running one query to get a piece of information and looping through that data to run another query should now be contrary to all you believe as a Platypus Hunter.

*The UCC is part of the UPC that is on products. It is assigned to a manufacturer and is the first 6 digits in a UPC. Every product a company makes has a UPC that starts with their assigned UCC (more information can be found here http://www.insightu.org/hobby/guide_wd/ch3.htm)

I am sure that there will be many posts about dealing with bad design. So, I am going to number them as we go so I can see just how many different bad designs need a solution!

Duplicate Fields

Like many shops, the Court never had a DBA. If you look at the Court’s PDB (Platypus Database) it is obvious that developers with no idea on how to create a relational database had “designed” it. For over 20 years, developers had just added tables as they needed for their specific task. To make matters worse, the platform is an IBM iSeries (formerly known as the AS400) where, depending on how a program is written, a change to a table design requires the recompilation of all programs that use the table. And since there was no source control process it was impossible to tell WHICH programs would need to be recompiled.

The solution of course was to copy an existing table that had most of what the new process needed, rename it, and just add a few new fields…viola!!
That was the database design process.

This causes havoc when trying to write applications that need to interface with the database. The first issue, how do you keep all the data synchronized? If you update data in one table, how does it get updated in the other? The second issue, which fields need to be included in a query? The ones from the first table or the second, almost identical, table? Unfortunately in cases like this the only way to know is to ask someone and learn the structure. Eventually you will figure out when to use which table.

Fake Dates

Another example, back in the day when storage was a main consideration and before there were specific date fields, the Court decided to use a six digit INTEGER data type as the ‘Date’ field, YYMMDD. So looking at the date 111 – what date would you figure that to be? Did you guess January 11, 2000? No? huh, I’m surprised that you didn’t see that 😉

Then there was the Y2K issue. A perfect time to fix that bad decision because by then the AS400 DID have a Date datatype. Nope…let’s just add an EIGHT long INTEGER data type to every table that duplicates the six digit date and store the dates as YYYYMMDD.

And again we have to rely on someone making sure that all the dates are updated correctly. But of course there are exceptions. There’s at least one table that the eight digit date did not get added. It has to be joined into another table using the six digit date…but sometimes that data isn’t updated so the query returns no data.

Committing Spreadsheet

But I think one of the more prevalent bad design decisions is not creating the proper many to many structure – committing spreadsheet. Remember in the last post we talked about normalization and the many-to-many relationship of Jurors to Trials. The WRONG way to store this information is to do something like:

TrialID DefendantName Juror1 Juror2 Juror3 Juror4 Juror5 Juror6 ....Juror 25

Here’s why this is wrong. It is rigid – what if someone wants 26 jurors? Now you have to change the table design (and maybe the application) just to add one more person to a Jury Panel. What if they only want 20 Jurors? now there are NULLs in the data for Juror21 – Juror25. (While NULL has its place in databases this is not one of them – NULL means ‘unknown’ – these Jurors are not unknown – they don’t exist, that is a different state.)

Other Platypus Designs

I would love to hear how your PDB is designed! Send me an email or comment on your design issues and I will see if I have a solution to post for you!

Developer Golden Egg

When you have to work with a database where a non-PDB Hunter has committed spreadsheet, you can use SQL to create what SHOULD have been done. Using a UNION we can create the many-to-many table like so:

SELECT 1 as ID, TrialID, Juror1 As JurorID FROM BadTable Where Juror1 is not null UNION SELECT 2, TrialID, Juror2 FROM BadTable Where Juror2 is not null ... SELECT 25, TrialID, Juror25 FROM BadTable Where Juror25 is not null

Now you have a result set that is normalized and can be used as part of another query, to join into, display results from, the same things you could do with it if it HAD been designed correctly.