How to optimize your queries using deep filters (optimizing joins)

Introduction

Optimizing SQL queries in Firebolt can improve performance, particularly for joins that involve large datasets. One effective technique is the deep filter. In this article, you'll learn what deep filters are and work through a practical example of applying them to speed up query execution.

TL;DR

  • Deep filters add extra WHERE clause filters to reduce the data volume processed during joins.
  • They prune unnecessary data reads, improving query efficiency.
  • Applicable primarily when an existing filter condition on one table can logically be extended to the other tables in the join.

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, see the Ultra Fast Gaming Firebolt Sample Dataset.

Step 1: Identify Potential for Deep Filters

Deep filters are particularly useful in scenarios involving joins where one table's filter can logically be extended to another table to reduce the dataset before the join operation. Start by reviewing your query plans to identify expensive joins that might benefit from additional filtering.

Example SQL code:

SELECT p2.email, AVG(ps.currentscore) AS avg_low_level_score
FROM playstats ps
INNER JOIN players_200million p2
	ON ps.playerid = p2.playerid
WHERE ps.gameid = 1 AND ps.tournamentid < 5
GROUP BY ALL

Let's look at the query execution plan for this query by prepending it with the EXPLAIN keyword. This is the result:

[0] [Projection] players_200million.email, avg(playstats.currentscore)
 \_[1] [Aggregate] GroupBy: [players_200million.email] Aggregates: [avg(playstats.currentscore)]
    \_[2] [Projection] players_200million.email, playstats.currentscore
       \_[3] [Join] Mode: Inner [(players_200million.playerid = playstats.playerid)]
          \_[4] [StoredTable] Name: 'players_200million', used 2/10 column(s) FACT
          \_[5] [Projection] playstats.playerid, playstats.currentscore
             \_[6] [Predicate] (playstats.gameid = 1), (playstats.tournamentid < 5)
                \_[7] [StoredTable] Name: 'playstats', used 4/13 column(s) FACT

You can see that in this case, we first filter the playstats table (by gameid and tournamentid), then perform the join with the players_200million table, and finally aggregate the average currentscore per email. On a small engine, this takes about 1.9 seconds. Looking in engine_query_history, it scans 189 million rows.
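
To pull these numbers yourself, you can query engine_query_history directly. The following is a minimal sketch; it assumes the view is exposed as information_schema.engine_query_history with scanned_rows, duration_us, and start_time columns, which may vary by Firebolt version:

SELECT query_text,
       scanned_rows,                           -- rows read from storage (assumed column name)
       duration_us / 1e6 AS duration_seconds   -- runtime in seconds (assumed column name)
FROM information_schema.engine_query_history
WHERE query_text LIKE '%avg_low_level_score%'
ORDER BY start_time DESC
LIMIT 1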

Step 2: Implement Deep Filtering

Based on the filter already applied to one table, extend it to reduce the number of rows coming from the other table in the join. This involves adding a subquery that selects the relevant entries based on the existing condition.

Example SQL code:

SELECT p2.email, AVG(ps.currentscore) AS avg_low_level_score
FROM playstats ps
INNER JOIN players_200million p2
	ON ps.playerid = p2.playerid
WHERE ps.gameid = 1 AND ps.tournamentid < 5
-- apply the deep filter: keep only players that pass the playstats filter
AND p2.playerid IN (SELECT DISTINCT playerid FROM playstats WHERE gameid = 1 AND tournamentid < 5)
GROUP BY ALL
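
The IN subquery above is one way to express the deep filter. An equivalent sketch uses a correlated EXISTS, which planners typically also turn into a semi-join; pick whichever reads better in your codebase:

SELECT p2.email, AVG(ps.currentscore) AS avg_low_level_score
FROM playstats ps
INNER JOIN players_200million p2
	ON ps.playerid = p2.playerid
WHERE ps.gameid = 1 AND ps.tournamentid < 5
-- deep filter expressed as EXISTS instead of IN
AND EXISTS (
    SELECT 1 FROM playstats f
    WHERE f.playerid = p2.playerid
      AND f.gameid = 1 AND f.tournamentid < 5
)
GROUP BY ALL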

Step 3: Evaluate the Impact

After implementing deep filters, assess the performance improvement: check the execution plan, then compare the execution time and the volume of data processed in engine_query_history. This step is crucial to validate that the deep filters are actually effective.

Again, let's look at the query execution plan by prepending the query with the EXPLAIN keyword. This is the result:

[0] [Projection] players_200million.email, avg(playstats.currentscore)
 \_[1] [Aggregate] GroupBy: [players_200million.email] Aggregates: [avg(playstats.currentscore)]
    \_[2] [Projection] players_200million.email, playstats.currentscore
       \_[3] [Join] Mode: Inner [(players_200million.playerid = playstats.playerid)]
          \_[4] [Join] Mode: Semi [(players_200million.playerid = playstats.playerid)]
          |  \_[5] [StoredTable] Name: 'players_200million', used 2/10 column(s) FACT
          |  \_[6] [Projection] playstats.playerid
          |     \_[7] [Predicate] (playstats.gameid = 1), (playstats.tournamentid < 5)
          |        \_[8] [StoredTable] Name: 'playstats', used 3/13 column(s) FACT
          \_[9] [Projection] playstats.playerid, playstats.currentscore
             \_[10] [Predicate] (playstats.gameid = 1), (playstats.tournamentid < 5)
                \_[11] [StoredTable] Name: 'playstats', used 4/13 column(s) FACT	

You can see that in this case, the planner performs a semi-join on the players_200million table before the inner join, reducing the number of rows to be considered. On a small engine, this takes about 0.9 seconds. Looking in engine_query_history, it scans 278 thousand rows.

Important note

Note that while deep filters are usually a very effective optimization technique, there are edge cases where they can harm query performance. This usually occurs on large datasets where the additional deep filter isn't selective enough: the extra pass prunes little data, so it only adds time to the total execution of the query.
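
One way to guard against this is to estimate the filter's selectivity before adding it. The following is a rough sketch using the tables from this article: if matching_players is close to total_players, the semi-join will prune little and the extra scan of playstats may only add overhead.

SELECT
    -- players that would survive the deep filter
    (SELECT COUNT(DISTINCT playerid)
     FROM playstats
     WHERE gameid = 1 AND tournamentid < 5) AS matching_players,
    -- all players on the other side of the join
    (SELECT COUNT(*) FROM players_200million) AS total_players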