Anatomy of a SELECT Statement Part 5 – GROUP BY and HAVING

After a short break from this series, I was thinking it was time to add another post to the Anatomy of the SELECT statement series. This post will focus on the GROUP BY and the HAVING clauses.  These two statements can be used together, although when using the GROUP BY a HAVING is not required.  So far in the series we have covered the SELECT, FROM, JOINS clauses as well as the order of processing.  We have also covered the PIVOT, which follows the FROM clause.

The GROUP BY can be used to get an aggregation for a group. For example it can be used to get the number of times each product has been sold.  As with most of my other posts, all examples in this post will use the AdventureWorks2014 sample database.

SELECT p.ProductID, p.name, COUNT(p.name) AS ‘NumberOfTimesSold’
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name

In the above statement, notice the use of an aggregate function.  This is really what the GROUP BY is about.  The GROUP BY and the use of the COUNT() aggregate function work together to get the results I am seeking, the number of times each product has been sold.

Requirements of the GROUP BY

As with all other statements and clauses in TSQL, there are requirements that must be followed.  Below are the basic requirements of the GROUP BY.

  • Cannot use a column from an indexed view
  • The column used in the aggregate function does not need to appear in the SELECT list
  • Cannot group on the following data types
    • XML
    • TEXT
    • NTEXT
    • IMAGE
  • Can group on XML if the column is using in a function that outputs an acceptable datatype
  • All columns must either be in the GROUP BY or in an aggregate function

Using the GROUP BY

Now that we have covered a few of the requirements. let’s see how we can use the GROUP BY.

Using the same code as in the example above, let’s dig a little deeper.  

SELECT p.ProductID
, p.name
, p.Color
, COUNT(s.ProductID) AS ‘NumberOfTimesSold’
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name, p.color

As you can see, the GROUP BY will follow the FROM clause and any joins.  After the GROUP BY you will simply list the columns you would like to have each group be made of.  The order of the columns listed in the GROUP BY does not matter and will not change the results or the column order in the results set, that is determined by the SELECT list. You can list them in any order you like, although I prefer to list them in the same order as they appear in the SELECT list.  Again, that is just a preference and not a requirement of TSQL.

In the example, the groups will consist of the Productid, name and color columns.  If you look at the image below, the columns in the green box make up the group, while the column in the blue box represents the aggregation, in this case it is a COUNT.

When using the GROUP  BY, all columns in the SELECT must either be in the GROUP BY or using an aggregate function.  If this requirement is not met, you well see this error:

Msg 8120, Level 16, State 1, Line 2
Column ‘Production.product.Name’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

To prevent this error, I like to do a simple test.  Look at each column and check to see if it is in the GROUP BY or using an aggregate.

SELECT p.ProductID  –Is in the Group by
, p.name   –Is in the Group by
, p.Color  –Is NOT in the Group by, will cause an error
, COUNT(s.ProductID) AS ‘NumberOfTimesSold’  –Is using an aggregate
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name

Can Expressions be used in the GROUP BY?

Now let’s change what we are looking for.  Let’s now look for how many times each product was sold by year and month.  In order to do this, I can use an expression on the GROUP BY clause.

SELECT p.name
, MONTH(h.orderdate) AS ‘MONTH’
, YEAR(h.orderdate) AS ‘YEAR’
, COUNT(s.ProductID) AS ‘NumberOfTimesSold’
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
INNER JOIN Sales.SalesOrderHeader h
ON s.SalesOrderID = h.SalesOrderID
GROUP BY p.name, MONTH(h.orderdate), YEAR(h.orderdate)
ORDER BY Year DESC, Month DESC

Notice in the GROUP BY the MONTH and YEAR functions are being used.  As you can see in the results below, the “Classic Vest, M” product was sold 14 times during the month of June 2014.

Final thoughts on GROUP BY

Using the GROUP BY really isn’t very complicated, you just need to do three things:

  1. Determine what columns you want to make up the groups
  2. Determine which aggregate function you would like to use
  3. Before running the statement, check all the columns in the SELECT list to make sure each column is either in the GROUP BY or using an aggregate function

Using the HAVING

Now that I have my groups and my aggregate totals, what if I want to apply a filter to the results, similar to the WHERE clause?  That is where the HAVING comes in.  You can think if this as a where clause for groups.  The HAVING clause appears after the GROUP BY and includes both the HAVING keyword and an expression.  In the case below, the expression is COUNT(s.ProductID) > 10.

SELECT p.name
, COUNT(s.ProductID) AS ‘NumberOfTimesSold’
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
INNER JOIN Sales.SalesOrderHeader h
ON s.SalesOrderID = h.SalesOrderID
GROUP BY p.name
HAVING COUNT(s.ProductID) > 10

Using HAVING also requires the use of an aggregate function.  This aggregate really should be the same as what is in the SELECT list. If you do not include an aggregate and use something like this, HAVING s.rowguid > 10,  you will get the error below.

Msg 8121, Level 16, State 1, Line 11
Column ‘Sales.SalesOrderDetail.ProductID’ is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.

Conclusion

Using the GROUP BY and HAVING clauses are very commonly used and having a good understanding of them is essential for anyone who writes TSQL code.  Below you will find two links to documentation from Microsoft

GROUP BY

HAVING

Thanks for visiting my blog and I hope you learned something!

PIVOT…What is it and Where do the column names come from

Microsoft defines the PIVOT as:

“Rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and performs aggregations where they are required on any remaining column values that are wanted in the final output” – link here

If you have used Excel, you will find the results when using the PIVOT are very similar to a Pivot Table.  Below is a screenshot of a Pivot table from a video from Microsoft, link here. Notice the values across the top and on the left side.  The numbers in the middle are aggregations of the data.

If you look at the example below of a result set using the TSQL PIVOT , you will notice the same structure.  Values across the top and on the left side with aggregations in the middle.

Now that we have talked about what the PIVOT command does, lets go over some basics.

  • The PIVOT was first introduced into SQL Server with the release of SQL Server 2005
  • The PIVOT keyword is placed after the FROM clause
  • The resulting pivot table requires a table alias
  • “When aggregate functions are used with PIVOT, the presence of any null values in the value column are not considered when computing an aggregation.” – per Microsoft documentation

Elements of the PIVOT statement

Let’s go over the various parts of the PIVOT.  Using the image below, I will dissect each component of the PIVOT command.

Part A

This is simply the TSQL statement that will pull the data you would like to use in the PIVOT. My suggestion is to get this query to return what you are looking to pivot first, then add the necessary code for the PIVOT around it.

Part B

These are hard coded values that will determine the names of the columns in the final result set.  We will cover how to dynamically get the column names later in this post.

Part C

This is the column in the PIVOT table that will be aggregated.

Part D

This is the actual PIVOT table.   Notice that the column does not appear to the pivot table at the bottom.  Only the year and the aggregate column appear.  As stated earlier, this requires an alias. If you do not use an alias you will receive this error.

Msg 102, Level 15, State 1, Line 16
Incorrect syntax near ‘)’.

By looking at the image below, you can see the alias has been commented out.  The causes the above error, as well the hard coded column to be marked with an invalid column name error.

Part E

In this case, it is the first column.  It start with the Name column in the inner query and works its way to the outer query and eventually to the PIVOT table results.  Notice that the column alias flows through the entire process.  If you change the column alias in the inner query, you will also have to change it on the outer query as well. If you use an alias for the Name column in the inner query and you do not change the column name in the outer query you will see this error.

Msg 207, Level 16, State 1, Line 1
Invalid column name ‘Name’.

Part F

This is the aggregate column.  The column in the inner query does not have a aggregate function on it, but it can if your needs dictate that it does.

Part G

These are both the column names and the values that will be used to complete the grouping for the aggregation.  If you change one if them you will get an invalid column name error.

Here is the completed code.  I used the AdventureWorks2014 database.

SELECT Name, [2011], [2012], [2013]
FROM
( SELECT loc.name
, sde.OrderQty
, YEAR(she.OrderDate) AS Year
FROM Production.Location loc
INNER JOIN Production.ProductInventory pin
ON loc.LocationID = pin.LocationID
INNER JOIN Sales.SalesOrderDetail sde
ON pin.ProductID = sde.ProductID
INNER JOIN sales.SalesOrderHeader she
ON sde.SalesOrderID = she.SalesOrderID
) AS x
PIVOT
( SUM(OrderQty)
FOR Year IN ([2011], [2012], [2013]))
AS PivotTable

What do I do if I don’t want to hard code the values for the column names?

This a question I get a lot in my classes.  The answer is dynamic SQL.  Yes I know about all the potential issue with dynamic SQL, so you need to exercise caution when doing this.

The first thing we need is a couple of variables.

DECLARE @cols      AS NVARCHAR(MAX)
DECLARE @query  AS NVARCHAR(MAX);

The first variable is to hold what will be your column names, while the second variable will be used to execute the SQL.

Next we need to go get the values we want to use as our column names.  In order to do this we will use the keyword STUFF.  What is STUFF, it simply is used to insert a string in a string….in other words, stuff a string in a string.  Go here to get more detailed information about the STUFF keyword.

The STUFF keyword has four parameters.

  1. Character expression – this is the string we want replaced
  2. Starting point
  3. Length
  4. The string we are going to insert

We will also need to the QUOTENAME function, click here for more information.  This function simply adds delimiters. If we look at the code below and PRINT the variable we will see this:  [2011],[2012],[2013],[2014]

Below is the code we can use to get the column names.  If you want to use this, all you have to do is replace everything from the CONVERT to the ASC with your query.  Remember, it can only return one column.

SET @cols = STUFF((SELECT DISTINCT ‘,’ + QUOTENAME(
CONVERT(VARCHAR(4),YEAR(she.OrderDate))) AS Year
FROM Production.Location loc
INNER JOIN Production.ProductInventory pin
ON loc.LocationID = pin.LocationID
INNER JOIN Sales.SalesOrderDetail sde
ON pin.ProductID = sde.ProductID
INNER JOIN sales.SalesOrderHeader she
ON sde.SalesOrderID = she.SalesOrderID
ORDER BY YEAR ASC
FOR XML PATH(), TYPE
).VALUE(‘.’, ‘NVARCHAR(MAX)’)
,1,1,)

The next thing we need to do is set the variable for execution.

SET @query = ‘SELECT name AS LocationName, ‘ + @cols + ‘ from
            (
            SELECT loc.name
            FROM Production.Location loc
               INNER JOIN Production.ProductInventory pin
                   ON loc.LocationID = pin.LocationID
               INNER JOIN Sales.SalesOrderDetail sde
                   ON pin.ProductID = sde.ProductID
               INNER JOIN sales.SalesOrderHeader she
                   ON sde.SalesOrderID = she.SalesOrderID
           ) x
            pivot
            (
                  SUM(OrderQty)
FOR Year IN  (‘ + @cols + ‘)
            ) p ‘

Finally the last step, execute the query.

EXECUTE(@query)

You will find the entire code here: PivotDemo.

While researching this post I found a very nice post about another method to use to do this using Json.  It can be found on Red-Gate’s web site, this link is below.

Phil Factor from Red-Gate shows how to use Json to pivot: click here

Thank you for visiting my blog, hope you enjoyed and learned something!

Select Distinct and XML data

As you know, the DISTINCT key word can be used to eliminate duplicate rows from the result set of a query.  

For example:

If you run this query against the AdventureWorks2014 database you will get 19614 rows with many duplicates.

SELECT City
FROM [AdventureWorks2014].[Person].[Address]

As you can see in the above results set there are duplicates.  Albany appears  4 times. However, my goal is to simply see what cities the database has addresses in.

I can address this by using the DISTINCT keyword.  The code below show that you simply add it just after the SELECT keyword.  Here is a link for more information on DISTINCT.

SELECT DISTINCT City
FROM [AdventureWorks2014].[Person].[Address]
ORDER BY City

The result set below shows that Albany only appears one time, duplicates have been removed.

Well, what if the result set contains XML data?  Will the DISTINCT work the same way?  The answer is no.  

Since I was working on an example for my next performance class, I decided to use an example from that session.  In the query below the qp.* part of the column list will return a column named query_plan which is an XML data type.

SELECT DISTINCT pa.,cp.,qp.*
FROM
(SELECT plan_handle
FROM sys.dm_exec_cached_plans) cp
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle ) qp
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) As pa
WHERE is_cache_key = 1 AND pa.attribute = ‘set_options’

However, when I add the DISTINCT keyword I get the error below.

So what do I do about.  You can simply use the CONVERT function to convert that datatype from XML to VARCHAR(MAX) like below.

SELECT Distinct qp.dbid
, qp.encrypted
, CONVERT(VARCHAR(MAX),qp.query_plan)
FROM
(SELECT plan_handle
FROM sys.dm_exec_cached_plans) cp
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle ) qp
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) As pa
WHERE is_cache_key = 1 AND pa.attribute = ‘set_options’


Just a disclaimer, I didn’t analyze this for performance because the class is about an unrelated topic. If you plan on using this in production code, please assess the impact on performance.

I will put this into the “You learn something new everyday” category!   As always, it is always fun to learn and I hope you did as well.  Thank you for visiting my blog.