Or Using SQL Server 2012 Offset Functions to Detect Gaps In Your Data
Today we are going to examine date island problems and the ways in which SQL Server allows us to detect them.
I will demonstrate the use of the new to SQL Server 2012 LEAD and LAG functions (also known as Offset Window Functions) for detecting those problems.
Finally, I will review equivalent pre-SQL Server 2012 solutions and compare differences in performance.
So, What Is The Islands Problem?
Let’s assume the tb_DailyTrafficSum table that summarizes the number of departed and arrived flights in an airport for each and every day of the year:
One day, an executive approaches you, and he isn’t looking so happy. He says that querying the database for the bi-monthly total flights for February-March 2012 returns an average of 13 flights, while the usual average is between 72 and 78. You connect to the database, and after finding the right date range in the table you get the picture immediately:
Someone or something (hopefully not you) deleted records in unpredictable date ranges and broke the sequence of dates; therefore, the result of average flights for the last two months turned out to be too low. Such an anomaly of data gaps and islands resulting in partially sequential values is called an Islands Problem.
You have managed to find the gaps, which caused the low result of flight average, but before trying to recover the missing records, how can you be sure that there aren’t other rows missing? And how can you detect them?
SQL Server 2012 introduces quite a few profound window functions, expanding the initial window function implementations of SQL Server 2005, which included functions such as ROW_NUMBER (), RANK () and DENSE_RANK (see Asaf Aviv’s post on The Difference Between ROW_NUMBER, RANK, DENSE_RANK, NTILE).
There are currently four types of window functions in SQL Server: Aggregate, Ranking, Distribution and Offset. This post will focus on the latter, demonstrating its LAG and LEAD implementations.
Similarly to other window functions, the LAG and LEAD functions have the following structure:
window function(value) OVER(<PARTITION BY value> <ORDER BY value>)
They allow you to return the value of the previous (LAG) or next (LEAD) row from the current row within the same window partition.
To simplify things, let’s have a look at the next example using a table of non-sequential numbers and see what LAG and LEAD do when running the following query:
PgSQL
SELECT NonSequentialNumbers, LAG(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers) AS [LAGValue], LEAD(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers) AS [LEADValue] FROM dbo.NumbersWithGaps;
1
2
3
4
5
6
SELECT NonSequentialNumbers,
LAG(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers) AS
[LAGValue],
LEAD(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers) AS
[LEADValue]
FROM dbo.NumbersWithGaps;
We can see that LAG returned the previous value offset by -1 row, and LEAD returned the next value offset by +1 row to the current row.
The order is always determined by the ORDER BY clause: for instance, if we run the same query with the DESC argument, the result will be inverted:
PgSQL
SELECT NonSequentialNumbers, LAG(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers DESC) AS [LAGValue], LEAD(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers DESC) AS [LEADValue] FROM dbo.NumbersWithGaps;
1
2
3
4
5
6
SELECT NonSequentialNumbers,
LAG(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LAGValue],
LEAD(NonSequentialNumbers) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LEADValue]
FROM dbo.NumbersWithGaps;
The default offset value for LAG and LEAD is 1 if not specified otherwise. In order to specify a different offset value add the argument as follows (i.e., steps of 3):
PgSQL
SELECT NonSequentialNumbers, LAG(NonSequentialNumbers, 3) OVER(ORDER BY NonSequentialNumbers DESC) AS [LAGValue], LEAD(NonSequentialNumbers, 3) OVER(ORDER BY NonSequentialNumbers DESC) AS [LEADValue] FROM dbo.NumbersWithGaps;
1
2
3
4
5
6
SELECT NonSequentialNumbers,
LAG(NonSequentialNumbers, 3) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LAGValue],
LEAD(NonSequentialNumbers, 3) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LEADValue]
FROM dbo.NumbersWithGaps;
As you probably noticed in the last three examples, when there is no previous or next value to return, SQL Server marks the row with a NULL. In order to exchange NULLS for an alternate value, specify a third argument (i.e., 0):
PgSQL
SELECT NonSequentialNumbers, LAG(NonSequentialNumbers, 3, 0) OVER(ORDER BY NonSequentialNumbers DESC) AS [LAGValue], LEAD(NonSequentialNumbers, 3, 0) OVER(ORDER BY NonSequentialNumbers DESC) AS [LEADValue] FROM dbo.NumbersWithGaps;
1
2
3
4
5
6
SELECT NonSequentialNumbers,
LAG(NonSequentialNumbers, 3, 0) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LAGValue],
LEAD(NonSequentialNumbers, 3, 0) OVER(ORDER BY NonSequentialNumbers DESC) AS
[LEADValue]
FROM dbo.NumbersWithGaps;
Now that we have practiced LEADing and LAGging with simple integers, we should go back to our tb_DailyTrafficSum table and find out which dates are missing before our executive gets really upset.
PgSQL
WITH ComputeLEAD AS ( SELECT TrafficDay AS CurrentDate, LEAD(TrafficDay) OVER(ORDER BY TrafficDay) AS NextDate FROM dbo.tb_DailyTrafficSum ) SELECT DATEADD(DAY, 1, CurrentDate) AS GapBegin, DATEADD(DAY, -1, NextDate) AS GapEnd FROM ComputeLEAD WHERE DATEDIFF(DAY, CurrentDate, NextDate) > 1;
1
2
3
4
5
6
7
8
9
10
WITH ComputeLEAD AS
(
SELECT TrafficDay AS CurrentDate,
LEAD(TrafficDay) OVER(ORDER BY TrafficDay) AS NextDate
FROM dbo.tb_DailyTrafficSum
)
SELECT DATEADD(DAY, 1, CurrentDate) AS GapBegin,
DATEADD(DAY, -1, NextDate) AS GapEnd
FROM ComputeLEAD
WHERE DATEDIFF(DAY, CurrentDate, NextDate) > 1;
Here we used only the LEAD function, as we wanted to obtain only the following value after the current one, and we used the default +1 offset since our interval is one day. We wrapped the first query with a CTE (Common Table Expression) in order to query easily against the generated column.
With date and time functions such as DATEADD and DATEDIFF, we manipulated the window function results to represent the first day of the gap and the last day of the gap and return only the rows that indicate a gap of one day or more.
As you can see, this simple query helped us find other missing date gaps during the year. Now we have all the ranges of damaged data that should be restored.
Let’s assume that our airport database is still running on SQL Server 2005, the airport hasn’t obtained their copy of SQL Server 2012, and they cannot enjoy the LAG and LEAD functions. is there a way for them to detect gaps and islands?
The answer is yes, using other window functions implemented since SQL Server 2005, we can write a query that provides the same solution but is slightly more complicated and less optimizer- friendly:
PgSQL
WITH TrafficDayRowNum AS ( SELECT TrafficDay, ROW_NUMBER() OVER(ORDER BY TrafficDay) AS rn FROM dbo.tb_DailyTrafficSum ) SELECT DATEADD(DAY, 1, b.TrafficDay) AS GapBegin, DATEADD(DAY, -1, e.TrafficDay) AS GapEnd FROM TrafficDayRowNum b LEFT OUTER JOIN TrafficDayRowNum e ON b.rn = e.rn -1 WHERE DATEDIFF(DAY, b.TrafficDay, e.TrafficDay) > 1;
1
2
3
4
5
6
7
8
9
10
11
12
13
WITH TrafficDayRowNum AS (
SELECT TrafficDay,
ROW_NUMBER() OVER(ORDER BY TrafficDay) AS
rn
FROM dbo.tb_DailyTrafficSum
)
SELECT DATEADD(DAY, 1, b.TrafficDay) AS GapBegin,
DATEADD(DAY, -1, e.TrafficDay) AS GapEnd
FROM TrafficDayRowNum b
LEFT OUTER JOIN TrafficDayRowNum e
ON b.rn = e.rn -1
WHERE DATEDIFF(DAY, b.TrafficDay, e.TrafficDay) > 1;
Here we used the ROW_NUMBER () ranking window function to obtain a unique sequential number per row, and then simulated the LEAD function by using a -1 step in the JOIN predicate. The other parts of the query are completely similar to the SQL Server 2012 version.
Let’s have a look at the execution plans for both offset function and rank function solutions:
Let’s start with the SQL Server 2005/2008 ROW_NUMBER () function solution:
We notice that the query optimizer chose to access the clustered index on our date column and scan it twice, and in addition spent a fair amount of time on joining the results from both scans all together. This query took three seconds to complete.
The above screenshot shows the execution plan for our SQL Server 2012 LEAD function solution; the optimizer chose to scan the clustered index on the date column (the only column we choose to access) once, and then proceed to the Window Spool and Stream Aggregate iterators.
The Window Spool is a new iterator, using in-memory or in-disk cache mechanism to store and process rows within a window context. The Stream Aggregate iterator (which is in fact implemented as part of Window Spool) groups rows by windows and computes their aggregates. This magic pair of iterators is significantly less expensive when compared to the pre-SQL Server 2012 multi-index scans + merge joins, completing the query in less than one second. This difference may seem trivial processing 50 rows, but multiply the number of records and the difference will stand out.
This was just a taster of the simplest and most useful window functions of SQL Server 2012. If you want to know more about offset functions and window functions in general, I truly recommend the following book:
It is an in-depth, mind-blowing expedition into the world of window functions and the broad range of uses that they offer, essential reading for anyone who writes T-SQL queries and is interested in expanding their skills.
Hoping this will save you time finding your lost days!
“Today is only one day in all the days that will ever be. But what will happen in all the other days that ever come can depend on what you do today.” Ernest Hemingway, Islands In The Stream
Comentarios