Anatomy of a SELECT Statement Part 5 – GROUP BY and HAVING

After a short break from this series, I thought it was time to add another post to the Anatomy of the SELECT Statement series. This post will focus on the GROUP BY and HAVING clauses.  These two clauses can be used together, although a HAVING is not required when using a GROUP BY.  So far in the series we have covered the SELECT, FROM and JOIN clauses, as well as the order of processing.  We have also covered the PIVOT, which follows the FROM clause.

The GROUP BY can be used to get an aggregation for a group. For example, it can be used to get the number of times each product has been sold.  As with most of my other posts, all examples in this post will use the AdventureWorks2014 sample database.

SELECT p.ProductID, p.name, COUNT(p.name) AS 'NumberOfTimesSold'
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name

In the above statement, notice the use of an aggregate function.  This is really what the GROUP BY is about.  The GROUP BY and the use of the COUNT() aggregate function work together to get the results I am seeking, the number of times each product has been sold.
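COUNT is not the only aggregate that works this way; SUM, AVG, MIN and MAX follow the same pattern.  A quick sketch against AdventureWorks2014 (the column choices here are just illustrative):

```sql
-- Several aggregates in one grouped query; each is computed per group.
SELECT p.ProductID
, p.Name
, COUNT(s.SalesOrderID) AS 'NumberOfTimesSold'
, SUM(s.OrderQty) AS 'TotalQuantitySold'
, AVG(s.UnitPrice) AS 'AverageUnitPrice'
FROM Production.Product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.Name
```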

Requirements of the GROUP BY

As with all other statements and clauses in TSQL, there are requirements that must be followed.  Below are the basic requirements of the GROUP BY.

  • Cannot use a column from an indexed view
  • The column used in the aggregate function does not need to appear in the SELECT list
  • Cannot group on the following data types
    • XML
    • TEXT
    • NTEXT
    • IMAGE
  • Can group on XML if the column is used in a function that outputs an acceptable data type
  • All columns must either be in the GROUP BY or in an aggregate function
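To illustrate the XML bullet, here is a minimal sketch.  The XPath and namespace are assumptions based on the Demographics XML column in AdventureWorks; the point is simply that the XML column itself cannot be grouped, but the scalar output of a function over it can.

```sql
-- Extract a scalar from the XML in a derived table, then group on it.
SELECT x.TotalChildren
, COUNT(*) AS 'PersonCount'
FROM
( SELECT p.Demographics.value(
    'declare namespace d="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey";
     (/d:IndividualSurvey/d:TotalChildren)[1]', 'INT') AS TotalChildren
  FROM Person.Person p
) AS x
GROUP BY x.TotalChildren
```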

Using the GROUP BY

Now that we have covered a few of the requirements, let’s see how we can use the GROUP BY.

Using the same code as in the example above, let’s dig a little deeper.  

SELECT p.ProductID
, p.name
, p.Color
, COUNT(s.ProductID) AS 'NumberOfTimesSold'
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name, p.color

As you can see, the GROUP BY follows the FROM clause and any joins.  After the GROUP BY you simply list the columns you would like each group to be made of.  The order of the columns listed in the GROUP BY does not matter and will not change the results or the column order in the result set; that is determined by the SELECT list. You can list them in any order you like, although I prefer to list them in the same order as they appear in the SELECT list.  Again, that is just a preference and not a requirement of TSQL.

In the example, the groups will consist of the ProductID, name and color columns.  If you look at the image below, the columns in the green box make up the group, while the column in the blue box represents the aggregation; in this case it is a COUNT.

When using the GROUP BY, all columns in the SELECT must either be in the GROUP BY or use an aggregate function.  If this requirement is not met, you will see this error:

Msg 8120, Level 16, State 1, Line 2
Column 'Production.product.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

To prevent this error, I like to do a simple test.  Look at each column and check to see if it is in the GROUP BY or using an aggregate.

SELECT p.ProductID  --Is in the GROUP BY
, p.name   --Is in the GROUP BY
, p.Color  --Is NOT in the GROUP BY, will cause an error
, COUNT(s.ProductID) AS 'NumberOfTimesSold'  --Is using an aggregate
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
GROUP BY p.ProductID, p.name

Can Expressions be used in the GROUP BY?

Now let’s change what we are looking for.  Let’s now look for how many times each product was sold by year and month.  In order to do this, I can use an expression in the GROUP BY clause.

SELECT p.name
, MONTH(h.orderdate) AS 'MONTH'
, YEAR(h.orderdate) AS 'YEAR'
, COUNT(s.ProductID) AS 'NumberOfTimesSold'
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
INNER JOIN Sales.SalesOrderHeader h
ON s.SalesOrderID = h.SalesOrderID
GROUP BY p.name, MONTH(h.orderdate), YEAR(h.orderdate)
ORDER BY Year DESC, Month DESC

Notice in the GROUP BY the MONTH and YEAR functions are being used.  As you can see in the results below, the “Classic Vest, M” product was sold 14 times during the month of June 2014.

Final thoughts on GROUP BY

Using the GROUP BY really isn’t very complicated; you just need to do three things:

  1. Determine what columns you want to make up the groups
  2. Determine which aggregate function you would like to use
  3. Before running the statement, check all the columns in the SELECT list to make sure each column is either in the GROUP BY or using an aggregate function

Using the HAVING

Now that I have my groups and my aggregate totals, what if I want to apply a filter to the results, similar to the WHERE clause?  That is where the HAVING comes in.  You can think of this as a WHERE clause for groups.  The HAVING clause appears after the GROUP BY and includes both the HAVING keyword and an expression.  In the case below, the expression is COUNT(s.ProductID) > 10.

SELECT p.name
, COUNT(s.ProductID) AS 'NumberOfTimesSold'
FROM Production.product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
INNER JOIN Sales.SalesOrderHeader h
ON s.SalesOrderID = h.SalesOrderID
GROUP BY p.name
HAVING COUNT(s.ProductID) > 10

Using HAVING also requires the use of an aggregate function, and that aggregate really should be the same as the one in the SELECT list. If you do not include an aggregate and use something like HAVING s.ProductID > 10, you will get the error below.

Msg 8121, Level 16, State 1, Line 11
Column 'Sales.SalesOrderDetail.ProductID' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.
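It is also worth noting that WHERE and HAVING can appear in the same statement: WHERE filters the rows before the groups are built, while HAVING filters the groups after the aggregation.  A sketch against AdventureWorks2014 (the Color filter is just for illustration):

```sql
SELECT p.Name
, COUNT(s.ProductID) AS 'NumberOfTimesSold'
FROM Production.Product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
WHERE p.Color = 'Black'          -- row filter, applied before grouping
GROUP BY p.Name
HAVING COUNT(s.ProductID) > 10   -- group filter, applied after aggregation
```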

Conclusion

The GROUP BY and HAVING clauses are very commonly used, and a good understanding of them is essential for anyone who writes TSQL code.  Below you will find two links to documentation from Microsoft.

GROUP BY

HAVING

Thanks for visiting my blog and I hope you learned something!

PIVOT…What is it and Where do the column names come from

Microsoft defines the PIVOT as:

“Rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and performs aggregations where they are required on any remaining column values that are wanted in the final output” – link here

If you have used Excel, you will find the results when using the PIVOT are very similar to a Pivot Table.  Below is a screenshot of a Pivot table from a video from Microsoft, link here. Notice the values across the top and on the left side.  The numbers in the middle are aggregations of the data.

If you look at the example below of a result set using the TSQL PIVOT, you will notice the same structure: values across the top and on the left side, with aggregations in the middle.

Now that we have talked about what the PIVOT command does, let’s go over some basics.

  • The PIVOT was first introduced into SQL Server with the release of SQL Server 2005
  • The PIVOT keyword is placed after the FROM clause
  • The resulting pivot table requires a table alias
  • “When aggregate functions are used with PIVOT, the presence of any null values in the value column are not considered when computing an aggregation.” – per Microsoft documentation

Elements of the PIVOT statement

Let’s go over the various parts of the PIVOT.  Using the image below, I will dissect each component of the PIVOT command.

Part A

This is simply the TSQL statement that will pull the data you would like to use in the PIVOT. My suggestion is to get this query to return what you are looking to pivot first, then add the necessary code for the PIVOT around it.

Part B

These are hard coded values that will determine the names of the columns in the final result set.  We will cover how to dynamically get the column names later in this post.

Part C

This is the column in the PIVOT table that will be aggregated.

Part D

This is the actual PIVOT table.   Notice that the column does not appear in the pivot table at the bottom.  Only the year and the aggregate column appear.  As stated earlier, this requires an alias. If you do not use an alias you will receive this error.

Msg 102, Level 15, State 1, Line 16
Incorrect syntax near ')'.

By looking at the image below, you can see the alias has been commented out.  This causes the above error, as well as causing the hard-coded columns to be marked with an invalid column name error.

Part E

In this case, it is the first column.  It starts with the Name column in the inner query and works its way to the outer query and eventually to the PIVOT table results.  Notice that the column alias flows through the entire process.  If you change the column alias in the inner query, you will also have to change it in the outer query. If you use an alias for the Name column in the inner query and do not change the column name in the outer query, you will see this error.

Msg 207, Level 16, State 1, Line 1
Invalid column name 'Name'.

Part F

This is the aggregate column.  The column in the inner query does not have an aggregate function on it, but it can if your needs dictate that it does.

Part G

These are both the column names and the values that will be used to complete the grouping for the aggregation.  If you change one of them you will get an invalid column name error.

Here is the completed code.  I used the AdventureWorks2014 database.

SELECT Name, [2011], [2012], [2013]
FROM
( SELECT loc.name
, sde.OrderQty
, YEAR(she.OrderDate) AS Year
FROM Production.Location loc
INNER JOIN Production.ProductInventory pin
ON loc.LocationID = pin.LocationID
INNER JOIN Sales.SalesOrderDetail sde
ON pin.ProductID = sde.ProductID
INNER JOIN sales.SalesOrderHeader she
ON sde.SalesOrderID = she.SalesOrderID
) AS x
PIVOT
( SUM(OrderQty)
FOR Year IN ([2011], [2012], [2013]))
AS PivotTable

What do I do if I don’t want to hard code the values for the column names?

This is a question I get a lot in my classes.  The answer is dynamic SQL.  Yes, I know about all the potential issues with dynamic SQL, so you need to exercise caution when doing this.

The first thing we need is a couple of variables.

DECLARE @cols      AS NVARCHAR(MAX)
DECLARE @query  AS NVARCHAR(MAX);

The first variable is to hold what will be your column names, while the second variable will be used to execute the SQL.

Next we need to go get the values we want to use as our column names.  In order to do this we will use the STUFF function.  What is STUFF?  It is simply used to insert a string into a string…in other words, stuff a string in a string.  Go here to get more detailed information about the STUFF function.

The STUFF function has four parameters.

  1. Character expression – the string we are inserting into
  2. Starting point – where the deletion and insert begin
  3. Length – the number of characters to delete
  4. The string we are going to insert

We will also need the QUOTENAME function; click here for more information.  This function simply adds delimiters. If we look at the code below and PRINT the variable, we will see this:  [2011],[2012],[2013],[2014]
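If you want to see both functions in isolation before putting them together, a quick sketch:

```sql
-- STUFF(string, start, length, replacement): starting at position 1,
-- delete 1 character (the leading comma) and insert an empty string.
SELECT STUFF(',[2011],[2012],[2013]', 1, 1, '')  -- [2011],[2012],[2013]
-- QUOTENAME adds the default [ ] delimiters.
SELECT QUOTENAME('2011')                         -- [2011]
```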

Below is the code we can use to get the column names.  If you want to reuse this, replace the query inside the STUFF() call, everything before the FOR XML, with your own query.  Remember, it can only return one column.

SET @cols = STUFF((SELECT ',' + QUOTENAME(
CONVERT(VARCHAR(4),YEAR(she.OrderDate)))
FROM Production.Location loc
INNER JOIN Production.ProductInventory pin
ON loc.LocationID = pin.LocationID
INNER JOIN Sales.SalesOrderDetail sde
ON pin.ProductID = sde.ProductID
INNER JOIN sales.SalesOrderHeader she
ON sde.SalesOrderID = she.SalesOrderID
GROUP BY YEAR(she.OrderDate)
ORDER BY YEAR(she.OrderDate) ASC
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')

The next thing we need to do is set the variable for execution.

SET @query = 'SELECT name AS LocationName, ' + @cols + ' from
            (
            SELECT loc.name
            , sde.OrderQty
            , YEAR(she.OrderDate) AS Year
            FROM Production.Location loc
               INNER JOIN Production.ProductInventory pin
                   ON loc.LocationID = pin.LocationID
               INNER JOIN Sales.SalesOrderDetail sde
                   ON pin.ProductID = sde.ProductID
               INNER JOIN sales.SalesOrderHeader she
                   ON sde.SalesOrderID = she.SalesOrderID
           ) x
            pivot
            (
                  SUM(OrderQty)
            FOR Year IN  (' + @cols + ')
            ) p '

Finally the last step, execute the query.

EXECUTE(@query)
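As a side note, sp_executesql will run the same string and is generally preferred for dynamic SQL because it supports parameters; in this example there are no parameters, so it is a straight swap:

```sql
-- @query must be NVARCHAR for sp_executesql.
EXECUTE sp_executesql @query
```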

You will find the entire code here: PivotDemo.

While researching this post I found a very nice post about another method to do this using JSON.  It can be found on Red-Gate’s web site; the link is below.

Phil Factor from Red-Gate shows how to use JSON to pivot: click here

Thank you for visiting my blog, hope you enjoyed and learned something!

Select Distinct and XML data

As you know, the DISTINCT key word can be used to eliminate duplicate rows from the result set of a query.  

For example:

If you run this query against the AdventureWorks2014 database you will get 19,614 rows with many duplicates.

SELECT City
FROM [AdventureWorks2014].[Person].[Address]

As you can see in the above result set there are duplicates; Albany appears 4 times. However, my goal is simply to see which cities the database has addresses in.

I can address this by using the DISTINCT keyword.  The code below shows that you simply add it just after the SELECT keyword.  Here is a link for more information on DISTINCT.

SELECT DISTINCT City
FROM [AdventureWorks2014].[Person].[Address]
ORDER BY City

The result set below shows that Albany only appears one time, duplicates have been removed.

Well, what if the result set contains XML data?  Will the DISTINCT work the same way?  The answer is no.  

Since I was working on an example for my next performance class, I decided to use an example from that session.  In the query below the qp.* part of the column list will return a column named query_plan which is an XML data type.

SELECT pa.*, cp.*, qp.*
FROM
(SELECT plan_handle
FROM sys.dm_exec_cached_plans) cp
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) AS pa
WHERE is_cache_key = 1 AND pa.attribute = 'set_options'

However, when I add the DISTINCT keyword I get the error below.

So what can I do about it?  You can simply use the CONVERT function to convert the data type from XML to VARCHAR(MAX), like below.

SELECT DISTINCT qp.dbid
, qp.encrypted
, CONVERT(VARCHAR(MAX),qp.query_plan)
FROM
(SELECT plan_handle
FROM sys.dm_exec_cached_plans) cp
CROSS APPLY sys.dm_exec_query_plan (cp.plan_handle) qp
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) AS pa
WHERE is_cache_key = 1 AND pa.attribute = 'set_options'


Just a disclaimer, I didn’t analyze this for performance because the class is about an unrelated topic. If you plan on using this in production code, please assess the impact on performance.

I will put this into the “You learn something new every day” category!   It is always fun to learn, and I hope you did as well.  Thank you for visiting my blog.

UNION vs UNION ALL

Before we get into which is better to use, let’s talk about some UNION basics.  First of all, both UNION and UNION ALL are set-based operations, and the purpose of both is to combine two distinct result sets into a single result set.  Let’s say I wanted to send an invitation to both employees and customers for an event; UNION or UNION ALL would work perfectly to get the list I am looking for.

Requirements

The requirements for both UNION and UNION ALL are pretty simple.  Both queries must have the same number of columns and the columns must have similar data types.

Below is an example of a UNION query using the AdventureWorks2014 database.

SELECT Name
FROM Purchasing.Vendor
UNION
SELECT Firstname + ' ' + LastName
FROM person.person

Even though the second query is using two columns from the Person.Person table, they are being concatenated into a single column.

So what happens if the number of columns do not match?  You will get this error.

Msg 205, Level 16, State 1, Line 1
All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists.

This is a pretty easy error to fix, simply make sure the number of columns match.

Now on to the second requirement: the columns must be similar data types.  Notice that I didn’t say the exact same data types.  If the data types are different, SQL Server will use data type precedence and convert them to the same data type.  If you go here, you can see the data type precedence.  If you look at the image below, you will see that BusinessEntityID is INT while LastName is NVARCHAR.  According to the precedence list, INT is higher up the list than NVARCHAR, so SQL Server will attempt to convert the LastName column data to INT, resulting in this error.

Msg 245, Level 16, State 1, Line 8
Conversion failed when converting the nvarchar value 'Abbas' to data type int.

If you get this error, you will need to use the CONVERT or CAST function to convert the column with the higher-precedence data type to the lower one.  Obviously, the two columns mentioned are a pairing you probably wouldn’t see in the real world, but I wanted to use it for demonstration purposes.
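A sketch of the fix, using the same two columns: explicitly convert the INT column so both sides of the UNION are character data.

```sql
-- CAST the INT side; both queries now return NVARCHAR.
SELECT CAST(BusinessEntityID AS NVARCHAR(20))
FROM Person.Person
UNION
SELECT LastName
FROM Person.Person
```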

UNION vs UNION ALL

The difference between these two is pretty simple: when using just the UNION keyword, duplicate records will be eliminated, while UNION ALL will include any duplicates.

If you execute this statement against the AdventureWorks2014 database you will see that it will return 20,077 rows and will include duplicates.  If you run the same two queries with UNION instead of UNION ALL you will only get 19,622 rows.  Andrew Cencini will appear two times in the first query and only one time in the second.

Includes duplicates:

SELECT Name
FROM Purchasing.Vendor
UNION ALL
SELECT Firstname + ' ' + LastName
FROM person.person

Does not include duplicates:

SELECT Name
FROM Purchasing.Vendor
UNION
SELECT Firstname + ' ' + LastName
FROM person.person

How can I tell what query each row is coming from?   In the query below you will see the text in pink, this is simply a string with a column alias.  In this case the string is the name of the table the row comes from with an alias of Source.

SELECT Name, 'Vendor' AS 'Source'
FROM Purchasing.Vendor
UNION ALL
SELECT Firstname + ' ' + LastName, 'Person' AS 'Person'
FROM person.person

When doing this, the actual name of the column will come from the first query.  In the query above there are two aliases, Source in the first query and Person in the second.  The below image shows where the column name will come from.  Also notice the values in the column identify the source table for that row.

ORDER BY

Of course, the only way to guarantee the order of the result set is to use the ORDER BY keyword.  When using UNION or UNION ALL, the ORDER BY must come after the last query.  If you try to use it in an earlier query, you will get this error.

Msg 156, Level 15, State 1, Line 4
Incorrect syntax near the keyword 'UNION'.
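The fix is to write the ORDER BY once, after the last query; it then sorts the combined result set.

```sql
SELECT Name
FROM Purchasing.Vendor
UNION ALL
SELECT Firstname + ' ' + LastName
FROM person.person
ORDER BY Name   -- the column name comes from the first query
```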

What about performance?

Generally the UNION ALL will be faster than the UNION.  Here are a few links to some great posts that give more information about the impact on performance when using these statements.

Pinal Dave

SQLPerformance.com

Thanks for visiting my blog and please contact me if you have any questions.

Anatomy of a SELECT statement – JOINS – Part 4

If your database is normalized, data points will only appear in the database one time.  For example, a customer name will appear in only one table, a customer address will appear in only one table, and an employee name will appear in only one table. Here is a nice Wikipedia article on normalization. Well, what do you do if you have a request to create a report on sales that must contain both the sales person name and the customer name?  Since the sales person name is in one table and the customer name is in another table, you will need to use the JOIN keyword. Simply put, the purpose of a join is to bring columns from different tables into the same result set.

There are a number of join types that can be used in SQL Server. In this post we will cover the logical joins, the ones you use in your code.

Logical Join Types

INNER
OUTER
CROSS
FULL

Behind the scenes the query engine will use physical joins.  There are four of these: Nested Loop, Hash Match, Merge and, starting in SQL Server 2017, Adaptive joins.  Although it is important to know these when working on query optimization, they will be covered in a future post.

Logical Join Requirements

Before getting into the types of joins, we should review the requirements of joins.

  • Column names do not need to match
  • Data types must be compatible
  • SELECT list does not need to include columns from every table
  • Although not an absolute requirement, it is highly recommended that you use table aliases when joining tables

Keep in mind, if the data types are not the same, the data will have to be converted.  This can be done explicitly using CAST or CONVERT, or it will be done with an implicit convert.

What Columns to Join On

Typically the join will be on the Primary Key column in one table and a foreign key column in another.  However, there are times in which this will not be the case.

Table Alias

When using joins, it is important to give your tables an alias.  When using table aliases, I like to have them be somewhat descriptive, like e for employee or s for sales. I also like to keep them the same length whenever possible and keep them to three or fewer characters.  These are just my preferences and in no way a requirement.

If you don’t use any aliases, the columns are not referenced properly in the SELECT statement, and a column name you want to use exists in more than one table in the FROM clause, you will get an error.

In the statement below, you will see that the ProductID column does NOT use the alias as expected. Also notice the error below it, “Ambiguous column name ProductID”. This simply means that the column name exists in more than one table and SQL Server does not know which table to get the column from.
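A minimal sketch of a statement that triggers this error, followed by the aliased version that fixes it:

```sql
-- Fails: ProductID exists in both tables, so SQL Server cannot
-- tell which one is meant.
SELECT ProductID, Name
FROM Production.Product
INNER JOIN Sales.SalesOrderDetail
ON Production.Product.ProductID = Sales.SalesOrderDetail.ProductID

-- Works: the alias tells SQL Server exactly which table to use.
SELECT p.ProductID, p.Name
FROM Production.Product p
INNER JOIN Sales.SalesOrderDetail s
ON p.ProductID = s.ProductID
```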

 

INNER Join

An INNER join would be used when you want to return data that has matching values in both tables in the join.

For example, you want to see all the products that have been sold. In order for a query to pull this data, there needs to be a matching value in both tables.  Using the AdventureWorks2014 database, the product information is in the Production.Product table while the products sold are in the Sales.SalesOrderDetail table.  If you look at the image below, you will see that the two columns that we need to join on are both named ProductID.  However, as I stated before, that is not a requirement, but something you will see quite often.  The data types for both of these columns are also the same, INT.  However, that is also not a requirement; they just need to be similar data types.

 

To meet the requirement of seeing all the products that have been sold, the ProductID must exist in both tables.  If product ID number 1 was sold at least one time, you will see it in the ProductID column in the SalesOrderDetail table.  The image below visualizes this.  The green circle is the Product table while the orange circle represents the SalesOrderDetail table.  The area where the two circles overlap represents the values that can be found in both tables.

 

Using the code below, we can query the AdventureWorks2014 database to find all the products that have been sold at least one time.  What makes this work is that a ProductID will be returned ONLY if it appears in at least one order.

SELECT p.productid
, p.Name
, s.ProductID
, s.SalesOrderID
FROM production.product p
INNER JOIN [Sales].[SalesOrderDetail] s
ON p.ProductID = s.ProductID
ORDER BY p.ProductID

When you run the above query, you will get the results below.  Notice that the result set is ordered by ProductID in ascending order.  You will also notice that ProductID 1 does not appear, despite existing in the Production.Product table.  This is because ProductID 1 has not been sold.

This is confirmed by using the two queries below, one for the Product table and the other for the SalesOrderDetail table. Since this is an INNER JOIN, it is expected that productid number 1 would not be returned in the above result set.

Another small note about the INNER JOIN is that it is the default join type.  In the statement below, you will notice that the INNER keyword is no longer present, yet it returns the exact same result set as above.

SELECT      p.productid
, p.Name
, s.ProductID
FROM production.product p
JOIN [Sales].[SalesOrderDetail] s
ON p.ProductID = s.ProductID
ORDER BY p.ProductID

OUTER JOIN

Let’s change the requirements just a bit.  Instead of wanting to see all the products that have been sold, we now want to know which products have NOT sold.  The OUTER join to the rescue!

An OUTER join will return all rows from the table on one side of the join and only matching values from the other table.

Starting with the LEFT OUTER JOIN: this join returns all rows from the table on the left and only the matching values from the table on the right. We will use LEFT OUTER JOIN in this blog; the OUTER keyword is technically optional, I just prefer to always use it.

SELECT p.productid
, p.Name
, s.ProductID
FROM production.product p
LEFT OUTER JOIN [Sales].[SalesOrderDetail] s
ON p.ProductID = s.ProductID

Below you will find the results of the above statement.  Notice that there are a number of rows that have NULL in the ProductID column from the SalesOrderDetail table.  This is what you will see when using an outer join and there is not a matching value in the table.  In this case, ProductID 530 does not have a matching value in the SalesOrderDetail table.

 

Of course if we switch this from a LEFT OUTER JOIN to a RIGHT OUTER JOIN SQL Server will now pull all the rows from the SalesOrderDetail table and the matching values from the Product table.

The code below is an example of a RIGHT OUTER JOIN.  All I did was change the word LEFT to RIGHT.  Now it will pull all the records from the SalesOrderDetail table and the matching values in the Product table.  One thing you will notice when you run this is that there are not any NULL values in the p.productid column.  The reason for this is that referential integrity will not allow a product that doesn’t exist in the Production.Product table to be entered into the SalesOrderDetail table.

SELECT p.productid
, p.Name
, s.ProductID
FROM production.product p
RIGHT OUTER JOIN [Sales].[SalesOrderDetail] s
ON p.ProductID = s.ProductID

How are the LEFT and RIGHT tables determined?

If you look at the two statements below, you will notice that the only difference is the last line.  In the first statement, the product table is listed first after the ON keyword, while in the second statement the SalesOrderDetail table is listed first.  Despite this change, both results had the exact same number of rows, 121,555.

SELECT p.productid
, p.Name
, s.ProductID
FROM production.product p
LEFT OUTER JOIN [Sales].[SalesOrderDetail] s
ON p.ProductID = s.ProductID

SELECT p.productid
, p.Name
, s.ProductID
FROM production.product p
LEFT OUTER JOIN [Sales].[SalesOrderDetail] s
ON s.ProductID = p.ProductID

However, if you switch the table order in the FROM clause, like in the statement below, you will see a big difference in the result set.

SELECT p.productid
, p.Name
, s.ProductID
FROM [Sales].[SalesOrderDetail] s
LEFT OUTER JOIN production.product p
ON p.ProductID = s.ProductID

This statement returns 121,317 rows.  This is every row from the SalesOrderDetail table and matching values in the Product table.  With some simple math, we can confirm this.  If you take the 121,555 and subtract the 121,317 you will get 238.  Using the code below, we will see that there are exactly 238 products that have not been sold.

Now let’s get to the original question: what products have not been sold? In the example below we can simply add a WHERE clause and look for any column from the right-side table that is NULL.  The reason it is the right-side table is because this is a LEFT join.

SELECT p.productid
, p.Name
, s.ProductID
FROM production.product p
LEFT OUTER JOIN [Sales].[SalesOrderDetail] s
ON s.ProductID = p.ProductID
WHERE s.ProductID is null

When you run this you will see that there are 238 products that have not been sold.

Full Join

The code for this example can be found here: FullOuterJoin.

A full join is kind of a combination of the inner and both outer joins. Below you can see the two tables we will use for the example.
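If you want to follow along without downloading the script, a minimal setup along these lines will work.  The exact rows in the linked script may differ; these values are just assumptions that give a partial overlap on id.

```sql
CREATE TABLE TableA (id INT, name VARCHAR(50));
CREATE TABLE TableB (id INT, name VARCHAR(50));

-- Five rows in TableA, four in TableB, overlapping on ids 1 and 2.
INSERT INTO TableA (id, name)
VALUES (1,'Alpha'), (2,'Bravo'), (3,'Charlie'), (4,'Delta'), (5,'Echo');
INSERT INTO TableB (id, name)
VALUES (1,'Alpha'), (2,'Bravo'), (6,'Foxtrot'), (7,'Golf');
```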

SELECT a.id
, a.name
, b.id
, b.name
FROM TableA a
FULL OUTER JOIN TableB b
ON A.id = b.id

If you run the above code you will see that it returns all rows that have matching values in each table, all the rows that are in TableA and not in TableB, as well as the rows in TableB that are not in TableA.

 

CROSS JOIN

A CROSS join is also known as a Cartesian product.  This type of join returns every possible combination of rows from the two tables.

SELECT a.id
, b.id
FROM TableA a
CROSS JOIN TableB b

Below are the results.  The first ID column is from TableA while the second ID column is from TableB.  Notice that the results pair each value from TableA with each value from TableB.  That is why you see 5 ones in the second column: there are 5 values in TableA.

Joining More than One Table

There are many times when you will need to join more than two tables.  In this case we want to know the dates of each time a product was sold, so we need to add the SalesOrderHeader table to our query. Although there are three tables in the query, a join can only be between two tables at a time.

SELECT p.productid
, p.Name
, s.ProductID
, h.OrderDate
FROM [Sales].[SalesOrderDetail] s
INNER JOIN production.product p
ON p.ProductID = s.ProductID
INNER JOIN sales.SalesOrderHeader h
ON s.SalesOrderID = h.SalesOrderID

 

In a future post I will discuss the physical joins SQL Server will use to implement these logical joins.

Thank you for visiting my blog!!

Table Variable and Rollback

I am currently at PASS Summit in Seattle and learning a ton of new stuff.  There are a lot of things that I have learned so far, but one thing really jumped out at me.  I attended Brent Ozar’s session on cardinality estimates, and he showed that when you use table variables, the changes do not get rolled back with the transaction. That is what is great about SQL Server, there is always something to learn!!

If you run this code, you will see that when the select runs, it will actually not return any rows.

CREATE TABLE #test
(Name VARCHAR(100))

BEGIN TRAN
    INSERT #Test
(Name)
VALUES
('Dave'),
('Raf')

ROLLBACK

SELECT * FROM #test

DROP TABLE #test

However, when you run this code, you will see that the rows are still in the table variable even though the ROLLBACK command has completed.

DECLARE @TestV TABLE
(Name VARCHAR(100))

BEGIN TRAN
        INSERT @Testv
(Name)
VALUES
('Dave'),
('Raf')

ROLLBACK

SELECT * FROM @TestV

This is another reason why table variables should be used with caution. Table variables do not get statistics on their columns, which of course can be a performance killer!!!

SSMS Options – Part 2

In SSMS Options – Part 1, I went over several options, including the color theme, track changes, line numbers and how to get to the Options screen.  This post is a continuation of that post.  We will cover a few more options that could be helpful to you when using SQL Server Management Studio.

This post will cover these options:

  • Select\Edit
  • Surround object names with brackets
  • Automatically populate Find What with text from the Editor
  • Show Errors

SELECT\EDIT

If you right click on a table in the Object Explorer, you will see a number of options, as seen below.  I would like to show how to use Options to change two of these, Select Top 1000 Rows and Edit Top 200 Rows.

These two options do exactly what they say: select rows, or return rows that you can edit.  Although these options are there, I still prefer and recommend that you write the DML statements you need instead.

When you use the “Select” option you will see something similar to what is below.  SSMS writes the DML statement for you and includes a comment at the top.  The results will be returned to you in a grid format.  However, it will return only the number of rows indicated.  In this case, 1000 rows.  You cannot edit these rows in SSMS without writing the proper DML statement.

However, if you click the “Edit…” menu item, you can edit the data in the grid by clicking in the cell you would like to edit.  Although, as I stated before, I recommend that you use the proper TSQL DML statements to complete data modifications.  When you click the menu item, you will see the grid below and can begin editing data.

When you do start editing you will see a small pencil just to the left of the first column.  Once you move to another record or close the table, the record will be saved.

Again, I would recommend that you use the proper DML statements to modify your data.  This looks like it might be a nice shortcut…but I wouldn’t do it, because the modification can fail.
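If you would rather write the statements yourself, a minimal sketch of the equivalent DML might look like the following.  The table, columns, and values here are only examples (Person.Person from AdventureWorks); adjust them to your own data.

```sql
-- Equivalent of "Select Top 1000 Rows", with an explicit sort order
SELECT TOP (1000) BusinessEntityID
, FirstName
, LastName
FROM Person.Person
ORDER BY BusinessEntityID

-- Equivalent of changing one cell in the "Edit Top 200 Rows" grid
UPDATE Person.Person
SET FirstName = 'David'
WHERE BusinessEntityID = 1
```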

The question you may have is, how do I change these values?  You will find the answer below.  Change these to what you feel is appropriate.  Notice that both settings state the TOP keyword, yet we cannot specify a column to order by.


Surround object names with brackets

As you may know, when you drag an object name or a column name from the Object Explorer to the query window, SSMS will drop the name in for you.  This setting determines whether SSMS places brackets around the name or not.


This option can be found just above the two settings discussed earlier.  The value can be either True or False.

As expected, the results are below.

Automatically populate Find What with text from the Editor

This setting actually can make our lives just a little bit easier.  When this is turned on, SSMS will populate the Find What box with what is selected in the text editor.  You can find this setting in the location below.

If you click “Find and Replace” under the Edit menu, the box highlighted below will appear.  In the box will be whatever you have selected in the text, as seen below.

Show Errors

This option can make it easier to locate the errors in your code.  It can be found under “Text Editor”.

When this is enabled, you will see a red indicator near the right scroll bar when there is an error.  Notice that it is actually on the line below the one that contains the error.

That is if there is one error.  Well, what happens if there are two or more errors?  The statement below contains two errors; if you notice, the error indicator is on the line below the first error.

When the Show Errors option is turned off, you will not see it, just like in the image below.

Thanks for stopping by my blog!

Optimize for Ad Hoc Workloads

Optimize for Ad Hoc Workloads is one of those server-level settings that is not changed very often, but it is still good to know about. Before we get into the details, let’s talk about what it is. SQL Server reserves a portion of memory for the plan cache. The Optimize for Ad Hoc Workloads setting controls what SQL Server places into this plan cache for single-use queries. When it is turned off, all single-use queries will have the entire plan cached, therefore consuming more space. By turning it on, you are asking SQL Server not to store the entire plan when the query is executed the first time; SQL Server will store a plan stub instead, which consumes less memory than the full plan would. Something to keep in mind: the next time that the query is executed, SQL Server will flush the stub from the cache and replace it with the full plan.

There are two methods you can use to determine if this setting is on or off. The first is SQL Server Management Studio (SSMS). In the Object Explorer, right click on the server and then click Properties.  The default is False, meaning that the entire plan will be placed in cache when a query is compiled and executed.

As with many things in SQL Server, you can also use TSQL to get this information. You will need to query sys.configurations to get this setting via TSQL.

SELECT name
, value
, description
FROM sys.configurations
WHERE name = 'optimize for ad hoc workloads'

Here is the result of the above query.
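If you want to change the value with TSQL instead of SSMS, a short sketch using sp_configure looks like this.  It is an advanced option, so show advanced options has to be enabled first.

```sql
-- 'optimize for ad hoc workloads' is an advanced option
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- 1 = on (cache a stub on first execution), 0 = off (the default)
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;
```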

How to determine if there are a lot of Single use Queries in Cache

I think there are two measures we need to understand when it comes to finding out what is consuming the cache. One is the amount of space being consumed by Ad Hoc queries and the other is the number of queries that are considered to be Ad Hoc queries.

Below is the Microsoft description of the dm_exec_cached_plans DMV.  This DMV can be used to capture the amount of space consumed by Ad Hoc queries.

“Returns a row for each query plan that is cached by SQL Server for faster query execution. You can use this dynamic management view to find cached query plans, cached query text, the amount of memory taken by cached plans, and the reuse count of the cached plans.”  – Microsoft

The code below uses the dm_exec_cached_plans to find out how much space in cache is being consumed by Ad Hoc queries.

From Pinal Dave:

SELECT AdHoc_Plan_MB
, Total_Cache_MB
, AdHoc_Plan_MB * 100.0 / Total_Cache_MB AS 'AdHoc %'
FROM (
    SELECT SUM(CASE WHEN objtype = 'Adhoc'
                    THEN CONVERT(BIGINT, size_in_bytes)
                    ELSE 0 END) / 1048576.0 AS AdHoc_Plan_MB
    , SUM(CONVERT(BIGINT, size_in_bytes)) / 1048576.0 AS Total_Cache_MB
    FROM sys.dm_exec_cached_plans) T

Below you will see the results of the above query.  Note that only 2.8% of the cache is being consumed by Ad Hoc queries.

So now let’s see how many Ad Hoc query plans are currently in cache.  The query below uses a few additional functions, sys.dm_exec_sql_text and sys.dm_exec_query_plan, both of which accept the plan handle as an argument.

SELECT SUM(c.usecounts)
, c.objtype
FROM sys.dm_exec_cached_plans AS c
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS q
GROUP BY c.objtype

The results are below

By looking at these two result sets you can see that although there are 2,542 Ad Hoc query plans in cache, they are consuming a small percentage of the total cache, 2.8%.

So what does this mean?  Well, although there are a lot of Ad Hoc queries, if the server is suffering from pressure in memory or in the plan cache, it is probably not due to the Ad Hoc queries.
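Since the setting only affects single-use plans, it can also help to count just those.  Here is a small sketch that tallies the Ad Hoc plans with a use count of one and the space they take:

```sql
-- Count single-use Ad Hoc plans and the memory they consume
SELECT COUNT(*) AS SingleUsePlans
, SUM(CONVERT(BIGINT, size_in_bytes)) / 1048576.0 AS SingleUse_MB
FROM sys.dm_exec_cached_plans
WHERE objtype = 'Adhoc'
AND usecounts = 1
```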

How do I get the Query Plan

There is a post at www.sqlshack.com that gives a really good explanation of how to get the query plan.

This code is from the above link.  It not only tells us the query text and the plan, but also gives us the number of times that plan was used.

SELECT cplan.usecounts
, cplan.objtype
, qtext.text
, qplan.query_plan
FROM sys.dm_exec_cached_plans AS cplan
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS qtext
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qplan
ORDER BY cplan.usecounts DESC
Below is the result set from the above query.  With this, I can review the query and the execution plan for each Ad Hoc query.

Clear Cache

Of course, restarting the SQL Server service will clear the cache; however, there is also a method we can use to clear the cache without restarting.  We can use the code below.

DBCC FREEPROCCACHE

When the cache is cleared, whether by a restart or by using the above command, you should let the server run for a bit so the cache gets reloaded with plans.

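A quick aside: DBCC FREEPROCCACHE can also take a plan handle, so you can evict a single plan instead of the whole cache.  Here is a sketch; the handle shown is only a placeholder value, so use one returned on your own server.

```sql
-- Find the plan handle for the query you want to evict
SELECT cp.plan_handle
, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%Production.Product%'

-- Remove only that plan (the handle below is a placeholder)
DBCC FREEPROCCACHE (0x060006001ECA270EC0215D05000000000000000000000000)
```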
Does Case Matter?

When talking about the plan cache, case does matter, even though both versions of the query will result in the same execution plan.

Using the AdventureWorks2014 database, when you run the following query you will see the plan in cache.

SELECT *
FROM [AdventureWorks2014].[Person].[Person]
Where BusinessEntityID = 3

The plan is below in the green box.  If you look at the other queries, you will also see the same query with one small difference: the “w” in Where is lower case.  Even so, both queries will use the same execution plan.  The reason this happens is that SQL Server creates a hash of the query text and, if that hash value exists in the cache, it will reuse the plan.  However, because the “w” is a different case, the hash will also be different.

Let’s See it at Work

This is just a simple demo to show that when Optimize for Ad Hoc Workloads is turned on, the plan does not appear in the plan cache with the first execution, only the second.

The first step is to go to the properties of the server and make sure that Optimize for Ad Hoc Workloads is set to False.  Once that is done, flush the cache either by restarting the service or by using DBCC FREEPROCCACHE.

Once that is done, run this code to confirm that there are not any plans in cache for the query we are about to run.

The query we are going to use is in the AdventureWorks2014 database:

SELECT * FROM Production.Product

Here is the code to check if there is already a plan in cache for this query.  Make sure to change the database ID; it is 8 on the system I am using.

SELECT c.usecounts
, c.objtype
, t.text
, q.query_plan
FROM sys.dm_exec_cached_plans AS c
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS q
WHERE t.text LIKE '%select% *%' AND q.dbid = 8
ORDER BY c.usecounts DESC

Once you run the above code, you will see something similar to the image below.  Notice that there is not yet an entry for our query of the product table.

Now run the product table query.  Once you do, you will see something like below.  As you can see, not only is there a record for our query, but a plan as well.  This is because Optimize for Ad Hoc Workloads is turned off.

Now, set Optimize for Ad Hoc Workloads to True, flush the cache, and let’s see what happens.  Sure enough, no plan, just as before.

Run the following query:

SELECT * FROM Production.Product

then the query to get the plans.  When you do, you will see the record for the execution, but no plan.

Run the query for the products table a second time.  Now you will see the plan is also in cache.  This is because Optimize for Ad Hoc Workloads, when set to True, only stores a stub in the cache on the first execution of a query, then the entire plan on the second execution.

As mentioned earlier in this post, spaces and case do make a difference when SQL Server looks up a query in the plan cache.  Notice below that there are two rows for what appears to be the same query.  However, if you look closer, you will see that the second query actually has an additional space between FROM and the table name.  This results in a different hash, which in turn results in a different entry in the plan cache.

Given all this information, should I turn the setting on or off?

Well, that is a good question.  In my opinion, unless you have evidence that there is pressure on the plan cache, I would leave it at the default setting.  I have never actually had a need to change this setting, but am open to it if there is evidence it would benefit the overall performance of the server.  So let’s take a look at what others say.

“I can’t really give you a situation that I would expect to see where this shouldn’t be enabled honestly. There are theoretical examples out there – maybe a series of queries – lots of them – with reusable plans executed only twice. Or something like that. I answer a lot of questions on DBA.StackExchange – it’s a great site in a great network and the consensus there seems to generally be that it almost always makes sense in this answer. When I do a SQL Server Health assessment? I want to see that enabled.”

Kendra Little Says this on her blog:

“Arguing about this setting is a lot like arguing about whether or not it’s better to put cream in your coffee: if the coffee is decent, it probably tastes fine either way.

My general preference is to not turn on a setting unless I have a good reason to believe that it will make a positive difference, so I don’t turn this on by default. Instead, I’d rather monitor the number of single use plans in cache, and investigate and act accordingly if that number starts spiking upwards.

But admittedly, that’s being fussy: if I had a flock of 1,000 SQL Servers to manage myself and I knew they were all patched to recent supported versions, I’d probably enable it on them all and I wouldn’t feel a bit bad about it, because I wouldn’t have the bandwidth to do it the very best, artisanal, hand-crafted way.”

Pinal Dave states this on his blog:

“Based on the result you can make your own conclusion and change your settings. I, personally, prefer to turn Optimize for Ad Hoc Workloads settings on when I see AdHoc Percentages between 20-30%. Please note that this value gets reset when you restart SQL Server services. Hence, before you change the settings, make sure that your server is up for a quite a few days.

If you are not sure, what to do, just reach out to me with the result and I will be happy to help you.

One final note, This is just one settings, which contributes to your SQL Server Performance”

I tend to be more in line with what Pinal Dave is saying, with the understanding that this is only one setting; however, all three make very valid points.  There are a number of other SQL Server settings that could impact performance, and you should review them all, looking at the big picture, before deciding whether your systems could benefit from changing this one.

Thanks for stopping by my blog….

SSMS Options – Part 1

We all use SQL Server Management Studio on a daily basis, although many of the options you will more than likely never change at all.  There are a few that would be good to change to help enhance our experience in SSMS.  This post, as well as a few future posts, will go over a number of these options.  The thing to remember when changing options in SSMS is that some changes will require a restart of SSMS.  I will point these out as I cover the various options.

Obviously, the logical starting point is to go over how we get to the Options form.  In SSMS you go to the “Tools” menu, under which you will see the “Options…” menu.

Once you click Options you will see this..

There are many options available that can be changed.  These options will not only change the way SSMS behaves, they can also change how it looks.

When you first look, on the left you will see a tree structure.  If you see an arrow to the left of a menu item, there are more options underneath that you can see when you click the menu item.  You can see this in the image below.

As you can see, there are a ton of options.  If you are not sure of the exact name, you can search the options.  In the upper left-hand corner of the Options window, you can enter the words you are looking for, and SSMS will return all the options that contain those words.


Color Theme

This can be used to change the overall appearance of SSMS.  With this option you have two choices, Blue and Light.  Blue is the default.  Changing this does not require a restart of SSMS.

Blue

This give a darker look to SSMS. The borders are blue and the headers are blue as well.

Light

As the name would imply, this turns SSMS into a light colored theme.


Track Changes

This option is used to track any changes you have made to the TSQL you are working with.  You can find this option under “Text Editor”.  Changing it does not require a restart of SSMS, and it is on by default.

When this option is turned on, you will see a yellow bar on the left side for every line you modify.  Once you save the file, the bar will turn green, as in the two images below.

Line Number

This option is off by default.  Some folks like to turn this on, some prefer it off.  Personally, I usually have it off.

When this option is turned on, you will see the line numbers on the left side.  When you copy the code, it does NOT copy the line numbers as well.



In order to keep this post kind of short, I will do another post on additional options that you might want to learn about.

Thanks for stopping by my blog and hope you learned something.

And we wait!!

Since I am at the airport waiting for a very delayed flight to take off, I thought it would be appropriate to do a post on…WAITFOR!!

As expected, this statement can be used to pause the execution of a batch.  It can also be used to have a batch wait until a specific time to continue the execution.

There are really two options, DELAY and TIME.

DELAY

The DELAY argument is used to pause the execution of the batch.  You can force a batch to wait for up to 24 hours.  This argument uses the “time to pass” value to determine how long to wait, meaning how much time must pass before execution of the batch can continue.  You cannot pass a date into this argument; it must be a time in the following format: hh:mm:ss.mss.  The time passed must not exceed 24 hours.  If it does, you will get the following error.

If you put WAITFOR DELAY '23:59:59', this will run fine; however, if you put WAITFOR DELAY '24:00:00', it will fail with the error above.
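Here is a quick sketch of DELAY in action; the batch below runs the first SELECT, pauses for ten seconds, and then runs the second SELECT:

```sql
SELECT GETDATE() AS BeforeWait

-- Pause this batch for 10 seconds
WAITFOR DELAY '00:00:10'

SELECT GETDATE() AS AfterWait

GO
```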

TIME

The TIME argument is used to have a batch pause and continue executing at a specific time.  You will hear this referred to as “time_to_execute”.  This argument has the same 24-hour limitation as the DELAY argument.  If you attempt to run this, WAITFOR TIME '25:48:00', you will receive the same error as above.

In the code below, the batch will execute the first SELECT and then wait until 1:48 AM to execute the second SELECT.

SELECT *
FROM [Person].[Person]

WAITFOR TIME '1:48:00'

SELECT *
FROM [Person].[Person]

GO

When specifying a time, you can use 24-hour time, WAITFOR TIME '13:48:00', or you can use AM/PM, WAITFOR TIME '1:48:00 PM'.

A point of caution if using WAITFOR in production code: make sure that by using it you are not introducing timeouts in the application.  You may need to adjust the timeout setting of the application.  Also use a great deal of caution if using this in a transaction…it can make for a bad day!!

Keep in mind, the time specified starts when the thread is scheduled.  If there is a delay in scheduling, it may actually wait longer than expected. This can happen on really busy servers.

The statement below is also important to remember if you decide to use WAITFOR a lot.

“SQL Server monitors the number of threads associated with WAITFOR statements, and randomly selects some of these threads to exit if the server starts to experience thread starvation.” – Microsoft

Per Microsoft, you need to be careful using this because you can potentially increase the number of deadlocks.

Here is a link to Microsoft’s page on this topic.

Here is a very good video by Bert Wagner. I like that he has a stored procedure named BuyDonuts!!!

Thanks for visiting my blog.