Boost your User-Defined Functions in PostgreSQL

postgresql performance

Emanuel Calvo & Anthony Sotolongo – Feb 5, 2021

Introduction

Using the RDBMS only to store data is restricting the full potential of the database systems, which were designed for server-side processing and provide other options besides being a data container. Some of these options are stored procedures and functions that allow the user to write server-side code, using the principle of bringing computation to data, avoiding large datasets round trips and taking advantage of server resources. PostgreSQL allows programming inside the database since the beginning, with User Defined Functions (UDFs). These functions can be written in several languages like SQL, PL/pgsql, PL/Python, PL/Perl, and others. But the most common are the first two mentioned: SQL and PL/pgsql. However, there may be “anti-patterns” in your code within functions and they can affect performance. This blog will show the reader some simple tips, examples and explanations about increasing performance in server-side processing with User Defined Functions in PostgreSQL. It is also important to clarify that the intention of this post isn’t to discuss whether Business Logic should be placed, but only how you can take advantage of the resources of the database server.

Avoid these antipatterns in your PL/pgsql’s code

1: Use PL/pgsql functions for simple SQL statements

SQL functions, in certain conditions, can have their function bodies inlined into the main query directly. This can be a performance advantage because the function code can be analyzed by the planner, which can apply some optimizations. When can I apply this pattern? When you have a query or group of simple queries that do not require intermediate analysis/process before returning the result. On the other hand, writing these simple SQL sentences in PL/pgsql requires overhead for the PL/pgsql compiler. Example:

1- CREATE OR REPLACE FUNCTION hemisphere_sql (character varying)
    RETURNS character varying     LANGUAGE sql 
    AS $$
    SELECT
        CASE WHEN $1 IN ('UK', 'Germany', 'Japan', 'US', 'China', 'Canada', 'Russia', 'France') THEN
            'North'
        WHEN $1 IN ('South Africa', 'Australia', 'Chile') THEN
            'South'
        ELSE
            'unknown'
        END
$$;

2- CREATE OR REPLACE FUNCTION hemisphere_plpgsql (character varying)
    RETURNS character varying   LANGUAGE plpgsql
    AS $$
DECLARE
    result character varying;
BEGIN
    result:= (
        SELECT
            CASE WHEN $1 IN ('UK', 'Germany', 'Japan', 'US', 'China', 'Canada', 'Russia', 'France') THEN
                'North'
            WHEN $1 IN ('South Africa', 'Australia', 'Chile') THEN
                'South'
            ELSE
                'unknown'
            END);
    RETURN result;
END;
$$;

1- EXPLAIN (ANALYZE,VERBOSE ) SELECT hemisphere_sql(country) FROM customers;

Seq Scan on public.customers  (cost=0.00..963.00 rows=20000 width=32) (actual time=0.039..29.309 rows=20000 loops=1)
   Output: CASE WHEN ((country)::text = ANY ('{UK,Germany,Japan,US,China,Canada,Russia,France}'::text[])) THEN 'North'::text WHEN ((country)::text = ANY ('{"South Africa",Australia,Chile}'::text[])) THEN 'South'::text ELSE 'unknown'::text END
 Planning Time: 0.458 ms
 Execution Time: 32.306 ms

2-EXPLAIN (ANALYZE,VERBOSE ) SELECT hemisphere_plpgsql(country) FROM customers;

 Seq Scan on public.customers  (cost=0.00..5688.00 rows=20000 width=32) (actual time=0.654..174.685 rows=20000 loops=1)
   Output: hemisphere_plpgsql(country)
 Planning Time: 0.082 ms
 Execution Time: 175.972 ms

As we can see in the outputs of the explains commands, on first EXPLAIN, output tag has the SQL code that belongs to SQL function, which means that this code is inlined into the calling query, rather than call the function, as the second EXPLAIN, this can improve the performance of our queries.

2: Unnecessary usage of SELECT INTO clause

Inside the PL/pgsql function, it is a bit more costly to assign a value using SELECT INTO than a simple assignment using :=. When can I apply this? When the := operator can replace the INTO clause. Example:

1- CREATE OR REPLACE FUNCTION simple ()  RETURNS void
    AS $$
DECLARE
    s int DEFAULT 0;
BEGIN
    FOR i IN 1..10000 LOOP
        s := s + 1;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

2- CREATE OR REPLACE FUNCTION using_select ()  RETURNS void
    AS $$
DECLARE
    s int DEFAULT 0;
BEGIN
    FOR i IN 1..10000 LOOP
        SELECT s + 1 INTO s;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

1-SELECT simple();
 Time: 16.980 ms
2-SELECT using_select();
Time: 86.931 ms

3: Overusing RAISE clause

RAISE clause can be useful to debug and show some information about the code, but it carries or has an extra load within the functions. Only use it if necessary in the production environments. Example:

1- CREATE OR REPLACE FUNCTION some_sum (val int, cnt int)
    RETURNS int
    AS $$
DECLARE
    i int;
    result int:= 0;
BEGIN
    FOR i IN 1.. $2 LOOP
        result:= result + $1;
    END LOOP;
    RAISE notice 'Final value of result: %', result;
    RETURN result;
END;
$$
LANGUAGE plpgsql;
2- CREATE OR REPLACE FUNCTION some_sum_raise (val int, cnt int)
    RETURNS int
    AS $$
DECLARE
    i int;
    result int:= 0;
BEGIN
    FOR i IN 1.. $2 LOOP
        result:= result + $1;
        RAISE notice 'Temporary value of result: %', result;
    END LOOP;
    RAISE notice 'Final value of result: %', result;
    RETURN result;
END;
$$
LANGUAGE plpgsql;

1- SELECT  some_sum(3,100);
NOTICE:  Final value of result: 300
Time: 1.843 ms

2- SELECT  some_sum_raise(3,100);
NOTICE:  Temporary value of result: 3
...
NOTICE:  Temporary value of result: 300
NOTICE:  Final value of result: 300
Time: 8.578 ms

4: Overusing the high-level programming coding style for SQL activities

Even programmers who come from high-level programming are unaware of the benefits of SQL, and ADVANCED SQL a language that can speed up the performance considerably by avoiding unnecessary loops. For example iterating on a FOR LOOP and doing a select within can be replaced by a single query using the LATERAL clause, which essentially is like a SQL for each loop

1- CREATE OR REPLACE  FUNCTION oldest_orders_by_customer (int) RETURNS SETOF t_oldest_orders_by_customer  AS $$
 DECLARE
    c customers;
    result record;
 BEGIN 
	 FOR c IN SELECT * FROM customers c2 WHERE age > $1    loop
	     SELECT c.firstname,o.orderid, o.orderdate , o.totalamount into result
              FROM orders o 
              WHERE o.customerid = c.customerid
              ORDER BY o.orderdate DESC
              LIMIT 1;
         IF result is not null THEN       
             RETURN NEXT result;   
         END IF;   
     END LOOP;
    RETURN;
 END;
 $$
 LANGUAGE plpgsql;


2- CREATE OR REPLACE  FUNCTION oldest_orders_by_customer_lateral (int) RETURNS SETOF t_oldest_orders_by_customer  AS  $$
 BEGIN 
	RETURN QUERY SELECT customer_sub.firstname  , o_sub.*
       FROM (SELECT * FROM customers c2 WHERE age > $1) customer_sub,
       LATERAL (SELECT o.orderid, o.orderdate , o.totalamount 
                FROM orders o 
                WHERE o.customerid = customer_sub.customerid
                ORDER BY o.orderdate DESC
                LIMIT 1) o_sub;
 END;
 $$
 LANGUAGE plpgsql;

 
1- SELECT * FROM oldest_orders_by_customer(80);
Time: 89.296 ms

2- SELECT * FROM oldest_orders_by_customer_lateral(80);
Time: 45.230 ms

Using functions properties

The definition of functions has some properties that can help with function performance, for example:

1: Use PARALLEL SAFE whenever possible

The planner cannot determine automatically if a function is parallel safe, but under certain conditions parallel mode can boost performance significantly if you process a large dataset. The number of workers that the planner will use is limited by the parameters max_parallel_workers_per_gather, which in turn are taken from the pool of processes established by max_worker_processes, limited by max_parallel_workers, the maximum number of concurrent queries to execute with parallelism is determined by the following formula, as long as max_worker_processes<= server cores:

#Q_concurrent_par = max_worker_processes /max_parallel_workers_per_gather (integer division)

When is it safe to use PARALLEL in a function? As long as your code does not perform the following, you should be ready to use it:

Writes to the database.
Access sequences.
Change the transaction state.
Makes persistent changes to settings.
Access temporary tables.
Use cursors.
Defines prepared statements

Example:


1- CREATE OR REPLACE FUNCTION pair_div_4 (i int) RETURNS boolean
    AS $$
BEGIN
    IF $1%2 = 0 AND $1%4 = 0 THEN
        RETURN TRUE;
    END IF;
    RETURN FALSE;
END;
$$
LANGUAGE plpgsql;

2- CREATE OR REPLACE FUNCTION pair_div_4_ps (i int) RETURNS boolean
    AS $$
BEGIN
    IF $1%2 = 0 AND $1%4 = 0 THEN
        RETURN TRUE;
    END IF;
    RETURN FALSE;
END;
$$
LANGUAGE plpgsql
PARALLEL SAFE;


1- EXPLAIN ANALYZE  SELECT * from trade where pair_div_4 (id);

 Seq Scan on trade  (cost=0.00..448684.86 rows=563520 width=16) (actual time=0.323..2459.553 rows=422640 loops=1)
   Filter: pair_div_4(id)
   Rows Removed by Filter: 1267921
 Planning Time: 0.070 ms
 Execution Time: 2471.796 ms


2- explain analyze  select * from trade where pair_div_4_ps (id);

 Gather  (cost=1000.00..249635.11 rows=563520 width=16) (actual time=0.883..1386.856 rows=422640 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on trade  (cost=0.00..192283.11 rows=234800 width=16) (actual time=0.868..1301.220 rows=140880 loops=3)
         Filter: pair_div_4_ps(id)
         Rows Removed by Filter: 422640
 Planning Time: 0.138 ms
 Execution Time: 1405.412 ms

As shown in the outputs of the explains, on the second EXPLAIN, Workers Launched tag has value 2, this means that this query used 2 workers to execute, and the first EXPLAIN was executed without parallelisms

2: Use IMMUTABLE when it is possible

The IMMUTABLE option informs the query optimizer about the behavior of the function and can apply some optimization. Any call to the function with all-constant arguments can be immediately replaced with the function value. To mark a function as IMMUTABLE you need to comply with the following:

You cannot modify the database (state) and always returns the same result for the same argument values;
Do not search in the databases or use information that is not directly present in its argument list values.

Example:

1- CREATE OR REPLACE FUNCTION get_date (date) RETURNS int
    AS $$
DECLARE
    i int;
    result int:= 0;
BEGIN
   RETURN extract (day from $1);
END;
$$
LANGUAGE plpgsql;


2- CREATE OR REPLACE FUNCTION get_date_i (date) RETURNS int
    AS $$
DECLARE
    i int;
    result int:= 0;
BEGIN
   RETURN extract (day from $1);
END;
$$
LANGUAGE plpgsql IMMUTABLE;


1- SELECT 1  from trade where id=get_date(current_date);
 Time: 2557.521 ms (00:02.558)
2- SELECT 1  from trade where id=get_date_i(current_date);;
 Time: 2208.848 ms (00:02.209)

Monitoring performance of functions

PostgreSQL allows the user to track the performance of functions in the database. For example, we can see the performance stats using the view pg_stat_user_functions, as long as you configure the parameter named track_functions, that allows tracking function call counts and time spent. To simplify the configuration we can leverage the option that gives us postgresqlcon.nf to share a configuration file, download it and apply it to your server. Specifically, to track function performance, select the download format alter_system, apply the modification to your server, and reload the configuration using select pg_reload_conf(). This allows us to detect which functions are working as expected or are slow.

For example, to use this view you can write a query like this:

select schemaname||'.'||funcname func_name, calls, total_time, round((total_time/calls)::numeric,2) as mean_time, self_time   from pg_catalog.pg_stat_user_functions;


         func_name         | calls | total_time | mean_time | self_time 
---------------------------+-------+------------+-----------+-----------
 public.f_plpgsql          |     2 |     93.908 |     46.95 |    93.908
 public.auditoria_clientes |  2684 |    593.705 |      0.22 |   593.705
 public.prc_clientes       |     2 |      1.447 |      0.72 |     0.387
 public.max_pro_min        |     3 |      1.589 |      0.53 |     1.589
 public.registro_ddl       |    17 |     39.217 |      2.31 |    39.217
 public.registro_ddl_drop  |     2 |    422.386 |    211.19 |   422.386

calls: Number of times this function has been called

total_time: Time(ms) spent in this function and all other functions called by it inside their code

mean_time: AVG Time(ms) spent in this function and all other functions called by it inside their code

self_time: Time(ms) spent in this function itself, without including other functions called by it

Conclusions

The tips and examples shown above have shown us that sometimes with minimum changes written in our code in PostgreSQL’s functions we can get some performance benefits. These tips are not exclusive, whenever possible these can be combined to achieve an improvement. e.g.: PARALLEL SAFE and avoid overusing the RAISE clause. If you know any other tips or examples please feel free to share them with us. Also, we can monitor our function’s performance by issuing a simple change in PostgreSQL’s configuration.