set theory « The Platypus Database

I was pulled into a meeting to help out with a query that needed to find some data to update a field. Basically the situation is:
1. a file is received from an external entity that has 2 pieces of information needed for the lookup, a GROUP and a UCC*
2. this file is loaded into a table and then an ETL (Extract, Transform, Load) process is called
3. the ETL needs to know a PriceType that is stored in a table that contains GROUP and VendorID
4. VendorID can be found in a query using the UCC

Ultimately what I need is the PriceType to include in my ETL process.
If I know the GROUP and UCC, I can find the VendorID that is needed to determine the correct PriceType.

So, the current process takes FOREVER – days!! It is using cursors and fetch next into the cursor and a whole lot of things that I don’t get! That is why we were having the meeting…to try and find a way to make the load run in a reasonable amount of time.

So I am listening to the conversation and my co-worker keeps talking about looping through the database to find the VendorID. And I am thinking, maybe I am not getting it, but it just seems like we can take the data we know, GROUP and UPC, and using JOINs get the VendorID. Why would we need to loop? So I asked and everyone agreed that what I was saying sounded reasonable, but weren’t really positive that what I was saying would work.

So, I rewrote this UPDATE_CURSOR statement with a query. I got it 90% there. The query was then put in a table and called as part of the ETL and I was missing a key field to join on in order to perform the update, but once that piece was included the process completed in an hour!

From 27 hours with no completion to complete in an hour! That is the power of joins and thinking in a set. If you are looping through your data set to find data in your tables it try to approach it and see if you can find a way for it to be done in a single query!!

The idea of iterating is a very programmatic approach and has it’s place, but not when you are trying to extract data from a database. You should try to think of everything you want to get and what connections have to be made in order to get that data.

Start with a small query. What is the main table that you need the most data from or has the key field that will be needed to find data in other tables? Write the query that just gets those fields from that main table. Where is the next piece of data that you need? Are there any criteria that need to be applied before JOINing into that data? Can you JOIN into the table or do you need to write a SUB-QUERY to extract just some of the data from that table? Once you get the details of the secondary information worked out, JOIN your main query to that and add the fields you need. It maybe that some of the fields you included you don’t need as output, but you do need as something to JOIN on in a later part of the query.

As you can see, building a query to get your full result set takes some time and knowledge of your database structure but it really how queries are supposed to work.

Now, once your application gets the result set returned, feel free to iterate through it as many times as you like to get data from it, but the idea of running one query to get a piece of information and looping through that data to run another query should now be contrary to all you believe as a Platypus Hunter.

*The UCC is part of the UPC that is on products. It is assigned to a manufacturer and is the first 6 digits in a UPC. Every product a company makes has a UPC that starts with their assigned UCC (more information can be found here http://www.insightu.org/hobby/guide_wd/ch3.htm)

For years I have been and will continue to recommend that anyone working with databases and queries read the following articles

The Fundamentals of Relational Database Design by Paul Litwin

and

Understanding SQL Joins @ devshed.com

These two articles helped solidify my formal education in databases and for years I thought I understood the Relational Model and some of the more advanced SQL Joins…which I did, to a point. However, to be Platypus Hunters, we must delve deeper and understand how databases are represented by the relational model.

Over the past few weeks I have been reading Exam 70-461: Querying MS SQL Server 2012 Training Kit preparing to become a Microsoft Certified Solutions Associate: SQL Server 2012. After reading Chapter 1 – Foundations of Querying I feel I have an even better understanding of the relational model and what makes something relational. So, my Hunters-to-be, let’s discover Cobb and his relations.

There were several “aha” moments reading this first chapter. The first came from reading this:

A relation in the relational model is what SQL calls a table. The two are not synonymous. You could say that a table is an attempt by SQL to represent a relation….

That made me re-think some of my preconceived notions about what I knew about databases, tables and the relational model.

So going forward with an open mind, I read:

Some of the most important principals to understand about T-SQL stem from the relational model’s core foundations-set theory and predicate logic.

Remember that the heading of a relation is a set of attributes, and the body a set of tuples. So what is a set? According to the creator of mathematical set theory, Georg Cantor, a set is described as follows:

“By a ‘set’ we mean any collection of M into a whole of definite, distinct objects m (which are called the ‘elements’ of M) of our perception or of our thought.”

Set Theory

Reading this chapter exposed me to concepts that I had knowledge of but no understanding. I had heard “real” database people refer to tuples but had never really “gotten it”. I had to go look up “tuple” on the All-Knowing Wikipedia. As it relates to math it

is an ordered list of elements. Tuples are usually written by listing the elements within parentheses ‘()’ and separated by commas; for example, (2,7,4,1,7) denotes a 5-tuple.

That was another “aha” for me. I could see that this was a set of data, like what is returned from a query. That made sense to me.

Looking into set theory I found that a relation has a heading and a body. The heading is a set of attributes. These attributes are referred to by name and type . The body is a set of tuples. These concepts become transformed by SQL into something recognizable. The relation is a table showing the heading as column names and tuples as rows in the table.

(So, remember in the last episode when we said that thinking of a database like a spreadsheet was ok for non-Hunters, this concept is the real deal.)

Do you remember back in elementary school when you learned about VENN diagrams? You had to draw all those circles that represented MAMMALS and EGG LAYING CREATURES and come up with the INTERSECTION showing Mammals that are Egg Laying Creatures?

Venn Diagram

That’s Set Theory – see you know this – you just learned it so long ago, you’ve forgotten!

Back to Cantor’s definition of a set. It should be considered as a single, whole thing….a SET of something, in this case tuples. It should be a set with no duplicates (there are exceptions to this in the real world, but theoretically…) and while not explicitly stated, no order to the set – as one of the tek-tip gurus would say, think of a result set as a bag of marbles, all the records jumbled up together – (when, in a query, you add sorting you have created a sequence, not a relation – a topic for another day!)

Predicate Logic

A predicate is defined as an expression that when applied to an object makes a proposition evaluate to either true or false. For instance, ‘hire date greater than 1/1/2012’ is an example of a predicate. For each employee you can evaluate this predicate and create a proposition that returns true if that employee was hired after that date. Any expression that you can write to evaluate criteria to true or false is a predicate. Is it greater than another number, does a name match, any other criteria.

Predicate Logic is a core element of the relational model. Using predicates you can enforce data integrity, filter data, and even define the data model.

While I didn’t study these concepts formally, I did have an intuitive understanding of them. Understanding these concepts will be the first weapon in your arsenal to fight the Platypus Database.

Developer’s Golden Egg

I have decided that when I have some tidbit that might be helpful to a certain group of my readership, the Platypus will give you a Golden Egg – the first one goes to the Developers! In fact, for a while, most of them will probably be for developers…that is what I did for a long time!

One thing that I tried to do as an application developer was iterate through the results of my queries in my application and not on the database. Remember, think of the results of your query as a set. So, if you need to get all the rows that meet some criteria try to figure out a way using a query to just return to your program the information in which you are interested. Conversely, try not to return ALL the rows and have your application determine which ones meet your criteria, let the database do what databases are good at doing.

Also. try to use aggregate functions when possible. Don’t bring back a whole record set to your application and count the number of rows, if you just need to know the total number of rows use the COUNT function and just return what you need.