Skip to content

User Defined Functions and Performance

There is definitely a lack of awareness in the SQL world regarding the use of user defined functions and the potential performance hit they can have when using within your queries. Don’t get me wrong, I would love nothing more than to be able to centralize my commonly used code into functions for reuse. In a lot of cases this is possible, however there are specific cases where this can cause a huge performance impact.

The Problem

The one thing we need to be aware of with SQL is that its efficiency lies in the fact that it deals with data in SETS. Meaning that its power does not come in performing row-by-row operations, rather it wants to retrieve chunks of data and manipulate them as recordsets. Keeping this in mind, you can look out for scenarios where certain operations will cause more of a row-by-row operation and therefore impact performance.

The most common no no, is the use of scalar functions within a set based operation. It seems (but I can’t prove) that SQL 2008 has actually made some great strides in being able to deal with these situations, however there will always be a negative impact. First, let’s look at a common scenario.

The Test

First, let’s deploy this scalar user defined function which calculates the End of month for a given date:

CREATE FUNCTION [dbo].[ufn_GetLastDayOfMonth] ( @pInputDate    DATETIME )
RETURNS DATETIME
BEGIN

    DECLARE @vOutputDate        DATETIME

    SET @vOutputDate = CAST(YEAR(@pInputDate) AS VARCHAR(4)) + '/' +
                       CAST(MONTH(@pInputDate) AS VARCHAR(2)) + '/01'
    SET @vOutputDate = DATEADD(DD, -1, DATEADD(M, 1, @vOutputDate))

    RETURN @vOutputDate

END

Now, lets run a test against the SalesOrderDetail table in Adventureworks. In the first example, we’ll put the scalar function within the select list passing in the value of ModifiedDate from the SalesOrderDetail table. In the second example, we’ll put only the code from the inline function within the select list.

DECLARE @now DATETIME;
SET @now = GETDATE();

-- use the scalar udf function
SELECT
	EOM = dbo.ufn_GetLastDayOfMonth2(ModifiedDate)
INTO #MEME
FROM Sales.SalesOrderDetail

SELECT 'scalar timing', DATEDIFF(ms, @now, GETDATE())
SET @now = GETDATE();

-- use only the code from the scalar function
SELECT
	EOM = DATEADD(DD, -1, DATEADD(M, 1, CAST(YEAR(ModifiedDate) AS VARCHAR(4)) + '/' + CAST(MONTH(ModifiedDate) AS VARCHAR(2)) + '/01'))
INTO #MEME2
FROM Sales.SalesOrderDetail

SELECT 'non scalar timing', DATEDIFF(ms, @now, GETDATE())

From the results, we can see that the scalar function made the query about 4 times slower. The reason for this is because the optimizer could not interweave the execution plan of the external query with the execution plan of the scalar function. The scalar function is basically a black box. Because of this, the query needed to do a row-by-row operation on the values being passed in, rather than deal with the result as a SET.

The Solution

So is there anyway around this? Well, there kind of is. The best way to intermingle execution plans is to use an inline table valued function. This should not be mistaken with a multi-valued table valued function (which performs as a black box). The inline table value function is able to expose its execution plan to the external query, and thus increasing throughput. Let’s create the same function except we’ll use an inline table valued function:

CREATE FUNCTION [dbo].[ufn_GetLastDayOfMonthTABLE] ( @pInputDate    DATETIME )
RETURNS TABLE
AS RETURN

    SELECT EOMDate = DATEADD(DD, -1, DATEADD(M, 1, CAST(YEAR(@pInputDate) AS VARCHAR(4)) + '/' +
                       CAST(MONTH(@pInputDate) AS VARCHAR(2)) + '/01'))

Now let’s run the same query above, but we’ll use an apply operator to return the result:


SET @NOW = GETDATE();

SELECT
	EOM = t.EOMDate
INTO #MEME3
FROM Sales.SalesOrderDetail sod
OUTER APPLY ufn_GetLastDayOfMonthTABLE(sod.ModifiedDate) t

SELECT DATEDIFF(ms, @now, GETDATE())

Now we see that the result is actually just as fast as the inline code and we gain the benefit of being able to encapsulate it.

So the conclusion? Scalar functions by themselves don’t slow down queries, people who place them in in set based statements slow down queries.

Increase Job History

The SQL Server Agent by default only keeps 1000 rows of history for all jobs. There is also a limitation for each job to keep only 100 rows each. These rows do not only include the a row for the entire job, but the rows for each step in the job also. These rows can quickly get filled leaving you in the dark when you need to troubleshoot what had happened in a job.

To change these values, you need to have sysadmin access to the SQL Server Agent. Within the object explorer, right click on the SQL Server Agent node, and choose properties.

A new pop-up window will appear displaying the SQL Server Agent properties. Select the History node and the history settings are displayed:

Here you have the option to not limit the agent job history, by unchecking the corresponding box. You can also choose the maximum # of rows to store for all jobs, and an option to limit each job. The last option is to Remove history that’s older than a specified time period. This would be recommended if you do not limit the history log size, since you will not be pruning any records otherwise.

Using DBCC INPUTBUFFER

This command shows the last statement executed for a particular SPID. This statement is mostly used for troubleshooting purposes to determine the exact command a particular SPID is running. You must be a member of the sys admin fixed server role, or have VIEW SERVER STATE permission (if SQL 2005+). As for a real world use, I will mostly use this command after using sp_who2 to find out which SPID is taking a lot of resources.

To execute, simply replace the SPID 55 below with the one you want to spy on.

DBCC INPUTBUFFER(55)

And here are the results:

The results you are looking for are displayed in the third column above. This shows the last statement that has been or is currently being executed by the SPID.

Another alternative to the DBCC INPUTBUFFER is to use the dm_exec_requests DMV. This query will show the currently executing statement for a SPID:

SELECT
	SQLStatement       =
        SUBSTRING
        (
            qt.text,
            er.statement_start_offset/2,
            (CASE WHEN er.statement_end_offset = -1
                THEN LEN(CONVERT(nvarchar(MAX), qt.text)) * 2
                ELSE er.statement_end_offset
                END - er.statement_start_offset)/2
        )

FROM sys.dm_exec_requests er
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) as qt
WHERE er.session_id = 54

This DMV will only work on SQL 2005+ and you also need the VIEW SERVER STATE permission as a minimum. The one advantage DBCC INPUTBUFFER has over this, is the DMV will only show the currently executing request and not the previously executed request so you need to be a little quicker to catch it. The above DMV is also embedded in the sp_who3 procedure.

Creating Unique Random Numbers

This is typically a hot topic and I’m going to try and tackle it with my rudimentary math. Ultimately given time and the frequency of generation, there is no such thing as a completely unique random number. There will always be some chance that a random number can be regenerated even though the chances do go way down when given a larger set of bytes.

Using Numbers

Let’s first look at probably the most random way to generate a random number. Using a function introduced in SQL 2008 called CRYPT_GEN_RANDOM(). This function takes a parameter of byte length, and returns a random hex based number within that range of bytes. We can easily convert this varbinary data to an integer based value. This function is based upon a low level windows API and is cryptographically secure.

SELECT CAST(CRYPT_GEN_RANDOM(8) AS bigint)

Using the method above returning an 8 byte random number and casting it as a bigint, your chances of repeating a duplicate are (roughly) between 1 and 15 billion. To break that out in terms of time, if you generated a random number every second you would without a doubt hit a duplicate random number between 31 and 47 years. However there’s no guarantee that it might not happen way before or way after however the chances go down the farther you move from the mean. Even though this may sound like a long time, the randomness that a bigint provides may not be enough for your application. In which case you may choose to possibly generate 2 bigints (if you wanted to go the number route) or you would have to settle for either varbinary or a GUID. Also note that these statistics do assume that you are taking advantage of the negative range of numbers bigint provides.

Using Numbers – Method 2

Let’s assume you want to provide customers with a random number. Another method you could use is to populate a table of integers and hand them out randomly while marking the ones that have been handed out. You could populate the table by looping through the number of integers you want to provide. When selecting an integer specify the TOP 1 clause along with calling NEWID(). This will provide a random ordering of the table.

SELECT TOP 1 rand_cust_id
FROM cust_numbers
WHERE is_given = 0
ORDER BY NEWID()

This query would return a single customer number that is randomly selected. It also guarantees the same number will not be selected which is something the other methods cannot provide. You would need to update the is_given flag in this method or you could alternatively place the given number in a separate table and perform a NOT EXISTS against that table when selecting numbers.

Using Binary

Aside from the previous method using binary will provide the highest chances for a unique id. The CRYPT_GEN_RANDOM function can return a maximum of an 8000 byte number. I can’t imagine ever needing something to that precision, and I won’t even begin to try to calculate it. I will say I can’t imagine needing anything over 16.

SELECT CRYPT_GEN_RANDOM(8000)
Using a GUID

We got a partial taste of the GUID method above when we used the NEWID() function. This method returns a Globally Unique IDentifier which is based off a 16 byte number. Our bigint above was based off an 8 byte number however don’t think they are double the randomness. According to my calculations (don’t crucify me I may be off) but if you generate a guid every second, you will not repeat it until 1 billion millenia.

To create a GUID simply call NEWID()

DECLARE @guid uniqueidentifier;

SET @guid = SELECT NEWID();

That does it. No I didn’t talk about RAND() because I guess I don’t see a need for it with CRYPT_GEN_RANDOM. RAND() I hear can also be considered deterministic, especially when given a seed value.

Oh, how did I come up with the numbers? I actually did it the hard way. I generated random numbers for tinyint, smallint, and int. When I got enough samples, I realized the mean of the numbers generated before a collision was roughly (n ^ 2) * .85 for twice the bytes. It may not be exact science, but I believe it’s somewhere in the ballpark.

Overcome Divide by Zero Using NULLIF

Anytime we are dividing we need to think of the divide by zero scenario. Even if you think you will never encounter this with your result set, it’s advisable to guard against it because when divide by zero is encountered, an error is thrown.

The best method I’ve found to overcome this is by using the NULLIF function. This function takes two parameters and if they are equal, a NULL value is returned.

Lets take a look at an example that throws a divide by zero error.

DECLARE @iter float;
DECLARE @num float

SET @num = 10;
SET @iter = 5;

WHILE @iter > -5
BEGIN
	SELECT @num / @iter

	SET @iter = @iter - 1
END

Running the following query, we see that once the variable @iter becomes zero, we receive an error.

So the most elegant way to overcome this is to use NULLIF function and compare @iter to zero. When it does equal zero, it will instead change it to a null. And when dividing anything by NULL will equal a NULL.

DECLARE @iter float;
DECLARE @num float;

SET @num = 10;
SET @iter = 5;

WHILE @iter > -5
BEGIN

	SELECT @num / NULLIF(@iter,0);

	SET @iter = @iter - 1;

END

This executes without error, however we still receive a null as a result. If you need otherwise, then you may want to wrap the equation in an ISNULL, to return a different value.

DECLARE @iter float;
DECLARE @num float;

SET @num = 10;
SET @iter = 5;

WHILE @iter > -5
BEGIN

	SELECT ISNULL(@num / NULLIF(@iter,0),@num);

	SET @iter = @iter - 1;

END

This will just return the same number you are dividing by, if you encounter a NULL denominator.

A Query Method Faster Than BETWEEN

This is an invaluable trick that can be sometimes be used instead of the BETWEEN operator. A common place this can be used is in an historical table that contains EffectiveFrom & EffectiveTo dates. In these kinds of tables, many historical versions of a record can be stored. Each record in this scenario will contain a Start & End Date which signifies the time span when a record is active. Because two records cannot be active at the same time, we can be sure that the Start & End dates will never overlap.

While this describes one particular scenario for this example, there are many other situations this method can be used.

Let’s look at the typical query we would use to find the record which is active for an employee at a specific point in time.

SELECT *
FROM [HumanResources].[EmployeeDepartmentHistory]
WHERE EmployeeID = 274
AND '2001-04-28' BETWEEN StartDate AND EndDate

This query will return to us the single record that is active as of ’2001-04-28′. However if you think about the constraints our table contains, we actually only need to query one column (the StartDate) in order to isolate this record. This is because the StartDate will not overlap any other StartDate & EndDate for this employee.

Here’s an example:

SELECT TOP 1 *
FROM [HumanResources].[EmployeeDepartmentHistory]
WHERE EmployeeID = 274
AND StartDate >= '2001-04-28'
ORDER BY StartDate

Because only one record can be active for an employee at a single point in time, we can be sure that if we take the minimum start date that is less than or equal to the as of date, this will also return the active record. Since this query only needs a one column in order to isolate the record, it can be much quicker than our BETWEEN version.

If we wanted to be sure this date range fell within our as of date, would could also use this fail safe:

SELECT *
FROM (
SELECT TOP 1 *
FROM [HumanResources].[EmployeeDepartmentHistory]
WHERE EmployeeID = 274
AND StartDate >= '2001-04-28'
ORDER BY StartDate
) t
WHERE '2001-04-28' <= EndDate

While this trick may show marginal improvements in some tables, it can be many many times faster in other scenarios. It all depends on how many versions of a single record exist, and the size of the table. I recently employed this method that took a query down from 30 ms, to under half of one millisecond.

SELECT INTO IDENTITY

Adding an identity column to a SELECT..INTO query is probably something I use nearly on a daily basis. I’m providing it here merely for reference purposes. It useful when you want to loop through a temporary table.

SELECT
	 id			= IDENTITY(int,1,1)
	,FirstName	= pc.FirstName
	,LastName	= pc.LastName
INTO #person_contact
FROM Person.Contact pc
WHERE EmailPromotion = 2
ORDER BY pc.LastName
  • The first parameter for the IDENTITY function is that data type. It must be of data type int, bigint, smallint, tinyint, or decimal or numeric with a scale of 0, and constrained to be nonnullable.
  • The second parameter is the seed (or starting) value.
  • The third parameter is the increment.

An alternative method is to use the ROW_NUMBER window function. This can actually be a bit more pliable in certain situations because of it’s optional partitioning function. The obvious difference here though is that any rows subsequently added to the table do not get an identity value.

SELECT
	 id			= ROW_NUMBER() OVER (ORDER BY pc.LastName)
	,FirstName	= pc.FirstName
	,LastName	= pc.LastName
INTO #person_contact2
FROM Person.Contact pc
WHERE EmailPromotion = 2

Note that with the IDENTITY method, Microsoft does not guarantee that the identity will be in the correct order when performing an order by clause, especially when using TOP or ROWCOUNT.

Making a Procedure Available in all Databases

If you’ve ever wondered how to make a procedure available in any database, it’s actually pretty simple. If you create a procedure in the master database with the prefix of “sp_”, it will be callable from any database.

I personally think this feature is great for utilities, however I would not use this for dependent objects. In other words, I would not advise creating a procedure (or function that is referenced by other procedures) within the master database. The biggest reason is that when restoring databases to other servers, you typically do not restore the master database. So it can be easily left out or forgotten. If that happens you may have a mission critical troubleshooting situation on your hands. However sometimes there may not be a way around it. But you can look into utilizing synonyms as an alternative.

Another interesting observation regarding placing procedures in the master database with the sp_ prefix is that the procedure is actually executed from the context of the master database. In other words, if you create a procedure to enumerate all the other procedures within your database and place it in master, once you execute the procedure from a database other than master, it will still enumerate the procedures in the master database.

Here’s an example:

After creating this procedure and calling it from AdventureWorks, it still enumerates all the procedures within the master database.

To get around this, we can use the undocumented procedure which will mark the procedure as a system object. Once we do that, the context of the call is from the database we are executing within.

The code is below:

USE MASTER
GO

CREATE PROCEDURE sp_routines
AS
BEGIN

	SELECT *
	FROM INFORMATION_SCHEMA.ROUTINES

END
GO
EXECUTE sp_MS_marksystemobject 'sp_routines'

USE AdventureWorks
GO

EXEC sp_routines

Find Triggers and Associated Tables

These simple snippets of code will show all triggers in the current database, along with the tables they belong to. Unfortunately the INFORMATION_SCHEMA views do not show triggers so we need to revert to using less documented options. The proper way would be to use the INFORMATION_SCHEMA.TRIGGERS view, however SQL Server has yet to create and publish this view. In the meantime we need to use one of the following workarounds:

SELECT
	table_name		= so.name
	,trigger_name	= st.name
	,trigger_text	= sc.text
	,create_date	= st.create_date
FROM sys.triggers st
JOIN sysobjects so
	ON st.parent_id = so.id
JOIN syscomments sc
	ON sc.id = st.[object_id]

You can also use only sysobjects:

SELECT
	 table_name		= so2.name
	,trigger_name	= so.name
	,trigger_text	= sc.text
	,create_date	= so.crdate
FROM sysobjects so
JOIN sysobjects so2
	ON so.parent_obj = so2.id
JOIN syscomments sc
	ON sc.id = so.id
WHERE so.type = 'tr'

I would imagine that SQL Server will provide more documented ways to query this information, until then we need to take a chance by deploying these solutions.

Compare Stored Procedure Output by Creating a Test Harness

When optimizing procedures often times you need to rewrite the procedure in order to speed it up. However, when you’re finished how can you be sure that the result set is still the same? This article will show you how to run a thorough test to make sure. We’ll execute the old and new stored procedure then compare their result sets by using a checksum aggregate. We’ll also compare rowcounts and elapsed execution time. The benefit of this test harness is that it will allow you to loop through and execute the procedures with as many dynamic parameter values as you wish, then compare the result set of the old procedure vs the new procedure for each execution.

First, we need to do a couple things to set up our environment.

  1. Setup a local linked server
  2. Create a wrapper procedure

We need the local linked server in order to dynamically retrieve the result set of the stored procedure using OPENQUERY. We could skip this step and create the temp table explicitly if we knew all the columns and data types being returned, but sometimes this is a hassle because there could be many columns that are returned and explicitly creating a temp table for each procedure we wanted to test is a pain. And that’s probably why you’re here in the first place isn’t it?

So to create a local linked server, do the following:

EXEC sp_addlinkedserver
@server='LOCALSERVER',
@srvproduct='SQLSERVER',
@provider='SQLNCLI',
@datasrc='Your_Server_Name_Here'

For this example, the server name must be LOCALSERVER

Next, we need to create a wrapper procedure that will take in the stored procedure along with dynamically passed parameters, and predefined temp table name, in order to populate the temp table. The reason we need this wrapper is because the OPENQUERY command does not allow concatenated strings, thus we need to use dynamic SQL to build and execute the openquery command within the wrapper procedure. To build the wrapper proc, execute the following code.

CREATE PROCEDURE [dbo].[proc_to_table]
(
     @temp_table_name nvarchar(100)
    ,@sp_and_parameters nvarchar(max)
    ,@elapsed_ms int OUTPUT
    ,@row_count   int OUTPUT
)
AS

BEGIN
    DECLARE @Driver nvarchar(20);
    DECLARE @SQL nvarchar(max);
    DECLARE @RowsetSQL nvarchar(max);
    DECLARE @now datetime;
    DECLARE @local_server_name varchar(100) = 'LOCALSERVER';

    SET @RowsetSQL = '''' +'SET FMTONLY OFF ' + '' + @sp_and_parameters + '' + '''';

    SET @SQL = 'INSERT INTO #' + @temp_table_name + ' SELECT * FROM OPENQUERY(' + @local_server_name;

    SET @SQL = @SQL + ',' + @RowsetSQL + ')';

    SET @now = GETDATE();

    EXEC (@SQL);

    SET @row_count = @@ROWCOUNT;

    SET @elapsed_ms = DATEDIFF(ms, @now, GETDATE());
END

Ok, so now that we have the requirements done, let’s go through a simple example of it’s use, then we’ll look at a more complete example.

Simple Example
-- Create shell of temp table
SELECT TOP 0 *
INTO #holding_table
FROM OPENQUERY
(
      LOCALSERVER
      ,'set fmtonly off exec your_db.dbo.spr_your_procedure 1' -- hard code proc and any params
);

-- form procedure execution and parameters into a string
SET @sp_with_params = 'exec your_db.dbo.spr_your_procedure '
      + CAST(@employee_id AS VARCHAR(50));

-- pass in temp table name & procedure string
EXEC proc_to_table 'holdingtable'
      ,@sp_with_params
      ,@elapsed_ms OUTPUT
      ,@row_count OUTPUT;

-- Display
SELECT *
FROM #holding_table;

You won’t see the benefit from this simple example until we actually loop it, however this is the crux of the work. Now let’s look at a more functional example.
This example will retrieve a specified number of parameters, then loop through and test the old and new procedure with this parameter. You can just as well have multiple parameters also.

Complete Example
DECLARE @rows int = 1000; -- number of records to test

--drop table if exists
IF OBJECT_ID('tempdb..#employee_ids') IS NOT NULL
BEGIN
      DROP TABLE #employee_ids
END

-- retrieve the parameters we want to pass to the procedures
SELECT TOP (@rows)
      employee_id
      ,rownum = ROW_NUMBER() OVER (ORDER BY e.employee_id DESC)
INTO #employee_ids
FROM dbo.employee e
WHERE e.employee_type_id IN (1,2,3)

DECLARE @employee_id int;
DECLARE @sp_with_params varchar(max)
DECLARE @elapsed_ms int;
DECLARE @row_count int;
DECLARE @chksum int;
DECLARE @chksum_old int;
DECLARE @iter int = (SELECT COUNT(1) FROM #employee_ids);

WHILE @iter >= 1
BEGIN

      /*************************************************
      * first procedure run
      **************************************************/
      --drop table
      IF OBJECT_ID('tempdb..#holdingtable') IS NOT NULL
      BEGIN
            DROP TABLE #holdingtable
      END

      -- select one of the parameters to test with
      SET @employee_id = (SELECT employee_id FROM #employee_ids WHERE rownum = @iter);

      -- create empty temp table
      SELECT TOP 0 *
      INTO #holdingtable
      FROM OPENQUERY(LOCALSERVER,'set fmtonly off exec yourdb.dbo.spr_your_procedure 1');

      -- set the parameters of the procedure
      SET @sp_with_params = 'exec yourdb.dbo.spr_your_procedure ' + CAST(@employee_id AS VARCHAR(50))

      -- execute the procedure
      exec proc_to_table 'holdingtable', @sp_with_params, @elapsed_ms OUTPUT, @row_count OUTPUT

      -- calculate checksum for entire table
      SELECT @chksum = CHECKSUM_AGG(BINARY_CHECKSUM(*))
      FROM #holdingtable

      -- insert results
      INSERT INTO tbl_test_spr_your_procedure
      (
            ID,
            employee_id,
            new_elapsed_ms,
            new_chksum,
            new_rowcount,
            datetime_inserted
      )
      SELECT
            @iter,
            @employee_id,
            @elapsed_ms,
            @chksum,
            @row_count,
            GETDATE()   

      /*************************************************
      * second procedure run
      **************************************************/
      IF OBJECT_ID('tempdb..#holdingtable2') IS NOT NULL
      BEGIN
            DROP TABLE #holdingtable2
      END

      -- create temp table
      SELECT TOP 0 *
      INTO #holdingtable2
      FROM OPENQUERY(LOCALSERVER,'set fmtonly off exec yourdb.dbo.spr_your_procedure 1');

      -- set the parameters of the procedure
      SET @sp_with_params = 'exec yourdb.dbo.spr_your_procedure ' + CAST(@employee_id AS VARCHAR(50))

      -- execute the procedure
      exec proc_to_table 'holdingtable2', @sp_with_params, @elapsed_ms OUTPUT, @row_count OUTPUT

      -- calculate checksum for entire table
      SELECT @chksum_old = CHECKSUM_AGG(BINARY_CHECKSUM(*))
      FROM #holdingtable2

      -- update results
      UPDATE tbl_test_spr_your_procedure
      SET
            old_elapsed_ms = @elapsed_ms,
            old_chksum = @chksum_old,
            old_rowcount = @row_count
      WHERE ID = @iter  

      SET @iter -= 1;

END

/**********************************************************
*     compare the results
***********************************************************/
SELECT *
FROM tbl_test_spr_your_procedure
WHERE
(
      new_chksum != old_chksum
      OR
      new_rowcount != old_rowcount
)

SELECT AVG(new_elapsed_ms), AVG(old_elapsed_ms)
FROM tbl_test_spr_your_procedure

That’s it! I’m sure this code can also be extended to be more generic, if you have any other ideas or enhancements please comment about them below.