Queries Based Upon The Absence of Data
Aaron Alton, March 20, 2008
It is a fairly common requirement to base the results of a query upon the absence of data in a specific object. For example:
I need to copy records from one table to another, but only if the rows do not already exist in the destination table.
I need to report on all customers who have not placed sales orders in the last six months.
I need to show all invoices which are missing service charge lines.
A modified version of the second example will be used for the code examples within this Wiki Page (modified to account for the transaction dates in AdventureWorks).
Whether the target rows are qualified by the absence of data in the same table, or in a different table, there are four commonly accepted query strategies:
EXCEPT Queries (SQL 2005 and Greater)
Correlated Subqueries
Left-Anti-Semi Joins
NOT IN Queries
Except
The Except Operator, available in SQL Server 2005 or newer versions, excludes rows from the "Left" query which also exist in the "Right" query. The general format of this type of query is:
--LEFT SIDE
SELECT <Column List>
FROM MyTable
EXCEPT
--RIGHT SIDE
SELECT <Column List>
FROM MyOtherTable
The EXCEPT operator demands that the number of columns in the "Left" query matches the number of colums in the "Right". It also (naturally) requires that the datatypes of columns with the same ordinal in both queries match (ex. Column1 in Query1 is compatable with Column1 in Query2).
The following example uses
AdventureWorks to demonstrate the EXCEPT operator:
USE AdventureWorks
GO
--Left Query
SELECT c.CustomerID
FROM Sales.Customer AS c
EXCEPT
--Right Query
SELECT sh.CustomerID
FROM Sales.SalesOrderHeader AS sh
WHERE OrderDate BETWEEN '2004-01-01' AND GETDATE()
Limitations: As the EXCEPT operator requires that the column list from both queries match, it is not always the
simplest solution for the type of example cited above. Usually, multiple columns need to be included in the result set, which requires the use of the EXCEPT block as a filter condition for the "overall" query.
Correlated Subqueries
Correlated subqueries involve two queries which are mutually dependent. The values from the "outer" query feed into the "inner" query, which in turn determines the results of the "outer" query. Correlated subqueries have many applications outside of the scope of this page. For more information see the links at the bottom of this page. One (of many) general examples of this query is as follows:
SELECT <Column List>
FROM MyTable AS a
WHERE NOT EXISTS (SELECT Column1
FROM MyOtherTable AS b
WHERE a.KeyColumn = b.KeyColumn)
The following example uses
AdventureWorks to demonstrate a correlated subquery:
USE AdventureWorks
GO
SELECT c.CustomerID
FROM Sales.Customer AS c
WHERE NOT EXISTS
( SELECT sh.CustomerID
FROM Sales.SalesOrderHeader AS sh
WHERE c.CustomerID = sh.CustomerID
AND sh.OrderDate BETWEEN '2004-01-01' AND GETDATE())
Caution: Under some circumstances, a correlated subquery can result in poor query performance (the solution is less of a pure set solution than the other examples on this page. As with any query, test each option and analyze the query plans under realistic test circumstances prior to deciding on a strategy.
Left Anti Semi Joins
A Left Anti Semi Join is a LEFT OUTER join which is used to determine the absence of a record in the JOINed table. A general example follows:
SELECT <Column List>
FROM MyTable AS a
LEFT OUTER JOIN MyTable AS b ON a.KeyColumn = b.KeyColumn
WHERE b.KeyColumn IS NULL
Because of the "AND b.KeyColumn IS NULL" provision, the LEFT OUTER JOIN effectively filters rows where a.KeyColumn matches b.KeyColumn (where a match exists).
The following example uses
AdventureWorks to demonstrate a Left Anti Semi Join:
USE AdventureWorks
GO
SELECT c.CustomerID
FROM Sales.Customer AS c
LEFT OUTER JOIN (SELECT sh.CustomerID
FROM Sales.SalesOrderHeader AS sh
WHERE sh.OrderDate BETWEEN '2004-01-01' AND GETDATE()) AS b ON b.CustomerID = c.CustomerID
WHERE b.CustomerID IS NULL
NOT IN Queries
The NOT IN query compares the value in a column from the primary query to a column list derived from another query. The latter query creates a list of the values which are to be eliminated in the former query, and the NOT IN clause performs the filtering. A general example follows:
SELECT <Column List>
FROM MyTable AS a
WHERE a.KeyColumn NOT IN
(SELECT b.KeyColumn
FROM MyOtherTable AS b
WHERE <WHERE Clause>)
The following example uses
AdventureWorks to demonstrate a NOT IN Query:
USE AdventureWorks
GO
SELECT c.CustomerID
FROM Sales.Customer AS c
WHERE c.CustomerID NOT IN
(SELECT sh.CustomerID
FROM Sales.SalesOrderHeader AS sh
WHERE sh.OrderDate BETWEEN '2004-01-01' AND GETDATE())
Which Method Do I Use?
There is no singular answer. Where the efficiency of the query is critical, write it in each format, and test, test, test! As data volumes change, the optimal query method may also change.