T-SQL

Practical Tally Table to WHILE Comparison (Generating Calendar)

Practical Tally Table to WHILE Comparison (Generating Calendar)

The “Tally table” is a well-known concept not only in the SQL Server world but in all set-based (relational) database engines. I don’t want to repeat some basic or go much deep into it. You can do it on your own starting i.g. on sqlservercentral.com. My post is just another reminder of why it’s really good to go with Tally table instead of row-by-row direction. And yes, it’s about the performance as usual. Let’s start the game.

Our mission is simple: We would like to generate a simple calendar table with one date (day) per row for all dates between @StartDate and @EndDate. We will use the WHILE loop and add one day to the @StartDate on each cycle. We will measure the total duration of the WHILE loop execution.

SET NOCOUNT ON

DECLARE @StartDate DATE
DECLARE @EndDate DATE
DECLARE @ST DATETIME 

DROP TABLE IF EXISTS [#Calendar]

CREATE TABLE [#Calendar] ( [CalendarDate] [DATE] NOT NULL PRIMARY KEY )

SET @StartDate = '1500-01-01'
SET @EndDate = '2100-12-31'

SET @ST = GETDATE()

   WHILE @StartDate <= @EndDate
   BEGIN
		
	INSERT INTO [#Calendar] VALUES (@StartDate)

	SET @StartDate = DATEADD(DAY, 1, @StartDate)

   END

PRINT DATEDIFF(ms, @ST, GETDATE())

SELECT * FROM [#Calendar] ORDER BY 1
GO

There are totally more than 200k rows for the 6 hundred years range inserted into the calendar table. The execution time is around 2.900 ms on my test machine with really small differences for multiple executions. Not bad, right?:)

But, can it be better? Let’s try. Instead of firing more than 200k atomic inserts, we can try to hit the line in one step using the Tally table horsepower.

SET NOCOUNT ON

DECLARE @StartDate DATE
DECLARE @EndDate DATE
DECLARE @ST DATETIME 

DROP TABLE IF EXISTS [#Calendar]

CREATE TABLE [#Calendar] ( [CalendarDate] [DATE] NOT NULL PRIMARY KEY )

SET @StartDate = '1500-01-01'
SET @EndDate = '2100-12-31'

SET @ST = GETDATE()

	;WITH Tally (n) AS
	(
		SELECT 
			ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
		FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
		   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
		   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
		   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
		   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
		   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n)
	)
	INSERT INTO [#Calendar]
	   ( [CalendarDate] )
	   SELECT DATEADD(DAY, n-1, @StartDate) 
	   FROM Tally
	   WHERE DATEADD(DAY, n-1, @StartDate) <=  @EndDate
           --WHERE n <= DateDiff(Day, @StartDate, @EndDate) + 1
	   ORDER BY 1

PRINT DATEDIFF(ms, @ST, GETDATE())

SELECT * FROM [#Calendar] ORDER BY 1
GO

The same amount of rows was inserted into the calendar table. But instead of 3 seconds, it took about 360 ms now! Amazing.

There are two (or even more) possible WHERE clauses to limit the final number of rows we would like to get because the Tally table needs to be generated most of the time bigger than the final result set. In your case, there are 1.000.000 rows ready by the Tally table (the CROSS JOINs count * VALUEs count) but we need only 219.511 rows to be used for the required dates range. This is why we are using the WHERE condition to limit it.

It’s interesting to compare these two conditions via performance and their execution plans.

WHERE DATEADD(DAY, n-1, @StartDate) <= @EndDate => duration 360 ms

And the second one:

WHERE n <= DateDiff(Day, @StartDate, @EndDate) + 1 => duration 550 ms

This is quite an interesting difference in execution plans and performance. Looks like it depends on how data are finally sorted for the clustered index insert: if in Sort operator before or in Clustered Index Insert operator itself. If we will remove PRIMARY KEY constraint from the #Calendar table definition, both execution plans will align to be the same (except parallelism used or not in favor of the first WHERE condition). Btw, removing the clustered key will speed it up from about 360 ms to 90 ms. This is a great gain but the price is we can’t be sure all dates in the #Calendar table are unique and we will need to index the column soon or later to improve its usage in other queries.

The Tally table size

Our Tally table has a total size of 1.000.000 available numbers as mentioned above. Does it somehow affect the performance in this scenario? Let’s test it. We will remove some unnecessary (0) to limit available numbers to 300k only because we know this will be enough for our requested data range.

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
   CROSS JOIN (VALUES(0),(0),(0)) f (n)

With every (0) removed performance gets better by about 5-10%, in our case from about 360 ms to 220 ms. This is matching to linear scaling (one (0) is 100k rows removed). So it’s really important to size the Tally table as much as possible to what is needed.

We can optimize it further and remove this dependency if the Tally table will be prepared in advance, like in this sample:

DROP TABLE IF EXISTS [#Tally]

CREATE TABLE [#Tally] ( [n] [INT] NOT NULL PRIMARY KEY )

INSERT INTO [#Tally]
	( [n] )
SELECT 	ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) n
FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n)
ORDER BY n
GO

DECLARE @StartDate DATE
DECLARE @EndDate DATE
DECLARE @ST DATETIME 

DROP TABLE IF EXISTS [#Calendar]

CREATE TABLE [#Calendar] ( [CalendarDate] [DATE] NOT NULL PRIMARY KEY )

SET @StartDate = '1500-01-01'
SET @EndDate = '2100-12-31'

SET @ST = GETDATE()

	INSERT INTO [#Calendar]
	   ( [CalendarDate] )
	   SELECT DATEADD(DAY, n-1, @StartDate) 
	   FROM #Tally
	   WHERE n <= DATEDIFF(day, @StartDate, @EndDate) + 1

PRINT DATEDIFF(ms, @ST, GETDATE())

SELECT * FROM [#Calendar] ORDER BY 1
GO

We have inserted 1.000.000 rows into the #Tally temporary table and indexed it using the PRIMARY KEY which means that clustered index was created. We have also modified the WHERE condition to limit the total number of n values used from the Tally table. The measured duration is now 80 ms with this execution plan:

The main reason why we should persist the Tally table and reuse it is not only because of the performance but the fact, that it nearly doesn’t matter how many rows we have in it. The set is ordered and Index Seek (or Scan) operator will need the same IO operations to stop at the desired number of rows required by the WHERE condition. If we will modify the above case and create #Tally table with 100 billion rows the duration and IOs will be the same (there will be a little bit different memory grants or other estimates).

Leave a Reply

Your email address will not be published. Required fields are marked *