How to optimize your queries using deep filters

WHY

Joins are expensive operations in SQL, and using a deep filter is an optimization technique that can help speed up queries that use joins. This article will briefly describe deep filters and show an example of when and how they can be used.

WHAT ARE DEEP FILTERS

When optimizing queries, a typical approach is to dissect the query and find the bottlenecks. One of the usual bottlenecks is the join operation between a fact table and a dimension table, or between two fact tables, where the engine's resources are spent finding the rows from all tables involved in the join, performing the join in memory based on the join condition, and then transferring the results back to the client.

Firebolt is great at finding the rows needed for a join when there are predicates in the WHERE clause that filter data out, so that less data is read from the underlying storage, transferred to the engine, and kept in memory. The more selective the filters in the WHERE clause, the better the pruning (skipping reads of unnecessary data). Firebolt prunes data efficiently when the filtered columns are part of the table's partition key and/or primary index.
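As an illustration, pruning on lo_orderdate works best when that column is part of the table's primary index. A hedged sketch of what such DDL could look like (the column list is abbreviated and the index choice is an assumption for this example, not the full SSB schema):

```sql
-- Illustrative only: a Firebolt fact table whose primary index covers the
-- columns queries filter and join on. Column list is abbreviated.
CREATE FACT TABLE lineorder (
    lo_orderkey  BIGINT,
    lo_custkey   BIGINT,
    lo_orderdate DATE,
    lo_revenue   BIGINT
) PRIMARY INDEX lo_orderdate, lo_custkey;
```

With an index like this, a filter on lo_orderdate lets the engine skip reading data ranges that cannot match.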

Deep filters are a technique to add another layer of filtering to the fact table so that the rows needed for the join are reduced to a minimum.

HOW TO USE DEEP FILTERS WITH AN EXAMPLE

The first step in using deep filters is to understand when they can be used.

If you have a query where table A is INNER JOINed to table B and there is a predicate (filter) in the WHERE clause on table A, that's probably a good candidate for a deep filter. The deep filter is essentially another predicate that keeps only the relevant rows from table B, based on the already existing filter on table A.

Let's look at an example to make this more practical. We'll be using the popular SSB (Star Schema Benchmark) schema - see the attached schema in the appendix below. We have a query that finds the region with the most revenue in January 1992. It works by joining the orders with the customers, filtering for orders placed in January 1992, and then grouping by the customer's region and summing the revenue.

Our original query looks like this:

SELECT
    c.c_region          AS region,
    SUM(l.lo_revenue)   AS total_revenue
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
WHERE
    lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY 1
ORDER BY 2 DESC;

Let's look at the execution plan for this query by prepending it with the EXPLAIN keyword. This is the result:

[Image: execution plan of the original query]

You can see that in this case, we are first filtering the lineorder table (by lo_orderdate), then we are aggregating the revenue by customer, and then we are performing the inner join with the customer table and aggregating again on the region level. In our sample dataset, this query takes around 0.7 seconds.

Let's now apply the deep filter here. We will add another predicate to the WHERE clause (this additional predicate is the deep filter) which filters the customer table down to only the customers that placed an order in January 1992. This is how our query looks with the deep filter added:

SELECT
    c.c_region          AS region,
    SUM(l.lo_revenue)   AS total_revenue
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
WHERE
    lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31'
    --deep filter added below:
    AND c.c_custkey IN (SELECT lo_custkey FROM lineorder
                        WHERE lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31')
GROUP BY 1
ORDER BY 2 DESC;

After applying the deep filter, the query execution plan looks like this:

[Image: execution plan with the deep filter applied]

As you can see, on the right side we now filter the customer table down to only the customers that placed an order in January 1992, and then join that reduced dataset with the lineorder table to do the aggregation. The extra filter shrinks the right side of the inner join, so the join processes fewer rows and the query now executes in 0.17 seconds.

Another way of applying a deep filter in this scenario is to rewrite the join itself: instead of joining directly on the customer table, join on a derived table containing only the customers who placed an order in January 1992. The resulting optimization is the same.
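Both rewrites are easy to sanity-check: a deep filter must change only the execution plan, never the result set. Below is a minimal sketch using Python's sqlite3 as a stand-in for Firebolt; the table and column names follow the article, but the rows are invented for illustration:

```python
import sqlite3

# Miniature, invented SSB-style dataset (sqlite3 stands in for Firebolt).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER, c_region TEXT);
CREATE TABLE lineorder (lo_custkey INTEGER, lo_orderdate TEXT, lo_revenue INTEGER);
INSERT INTO customer VALUES (1, 'AMERICA'), (2, 'EUROPE'), (3, 'ASIA');
INSERT INTO lineorder VALUES
  (1, '1992-01-15', 100),
  (2, '1992-01-20', 250),
  (3, '1993-05-01', 999);  -- outside the date range, should be pruned
""")

QUERY = """
SELECT c.c_region AS region, SUM(l.lo_revenue) AS total_revenue
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
WHERE lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31'
{deep_filter}
GROUP BY 1
ORDER BY 2 DESC
"""

# Original query (no deep filter) and the deep-filter version.
original = conn.execute(QUERY.format(deep_filter="")).fetchall()
deep = conn.execute(QUERY.format(deep_filter="""
    AND c.c_custkey IN (SELECT lo_custkey FROM lineorder
                        WHERE lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31')
""")).fetchall()

# The alternative rewrite: join on a pre-reduced set of customers instead.
derived = conn.execute("""
SELECT c.c_region AS region, SUM(l.lo_revenue) AS total_revenue
FROM lineorder AS l
INNER JOIN (SELECT c_custkey, c_region FROM customer
            WHERE c_custkey IN (SELECT lo_custkey FROM lineorder
                                WHERE lo_orderdate
                                BETWEEN '1992-01-01' AND '1992-01-31')) AS c
    ON c.c_custkey = l.lo_custkey
WHERE lo_orderdate BETWEEN '1992-01-01' AND '1992-01-31'
GROUP BY 1
ORDER BY 2 DESC
""").fetchall()

print(original == deep == derived)  # True
print(original)                     # [('EUROPE', 250), ('AMERICA', 100)]
```

If the three result sets ever diverge, the deep filter was written incorrectly; the timing difference is something only the real engine and dataset can show.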

Nested deep filters

Deep filters can be nested as well. Let's look at an example where we want to get the total quantity across regions for a specific industry (retail in our case). The original query looks like this, where the industry is contained in the market table:

SELECT
    c.c_region          AS region,
    SUM(l.lo_quantity)  AS total_quantity
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
INNER JOIN market AS m ON c.c_mktsegment = m.m_mktsegment
WHERE
    m.m_industry = 'RETAIL'
GROUP BY 1
ORDER BY 2 DESC;

This query takes around 2.5 seconds to execute. Let's apply a deep filter here:

SELECT
    c.c_region          AS region,
    SUM(l.lo_quantity)  AS total_quantity
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
INNER JOIN market AS m ON c.c_mktsegment = m.m_mktsegment
WHERE
    m.m_industry = 'RETAIL'
    --nested deep filter
    AND lo_custkey IN (SELECT c_custkey FROM customer WHERE c_mktsegment IN
                        (SELECT m_mktsegment FROM market WHERE m_industry = 'RETAIL'))
GROUP BY 1
ORDER BY 2 DESC;

As you can see, we can nest this filter all the way down to the market table and propagate it up to the lineorder table. This gives us a performance boost, and the query now takes 1.2 seconds.
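The same sanity check works for the nested variant: the nested deep filter must leave the result unchanged. A minimal sketch with Python's sqlite3 and an invented three-table dataset (sqlite3 again stands in for Firebolt):

```python
import sqlite3

# Invented miniature dataset with the three tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE market (m_mktsegment TEXT, m_industry TEXT);
CREATE TABLE customer (c_custkey INTEGER, c_region TEXT, c_mktsegment TEXT);
CREATE TABLE lineorder (lo_custkey INTEGER, lo_quantity INTEGER);
INSERT INTO market VALUES ('BUILDING', 'RETAIL'), ('MACHINERY', 'HEAVY');
INSERT INTO customer VALUES (1, 'ASIA', 'BUILDING'), (2, 'EUROPE', 'MACHINERY');
INSERT INTO lineorder VALUES (1, 10), (1, 5), (2, 7);
""")

QUERY = """
SELECT c.c_region AS region, SUM(l.lo_quantity) AS total_quantity
FROM lineorder AS l
INNER JOIN customer AS c ON c.c_custkey = l.lo_custkey
INNER JOIN market AS m ON c.c_mktsegment = m.m_mktsegment
WHERE m.m_industry = 'RETAIL'
{deep_filter}
GROUP BY 1
ORDER BY 2 DESC
"""

# The nested deep filter, propagated from market up to lineorder.
NESTED = """
    AND lo_custkey IN (SELECT c_custkey FROM customer WHERE c_mktsegment IN
                        (SELECT m_mktsegment FROM market WHERE m_industry = 'RETAIL'))
"""

plain = conn.execute(QUERY.format(deep_filter="")).fetchall()
nested = conn.execute(QUERY.format(deep_filter=NESTED)).fetchall()

print(plain == nested)  # True
print(nested)           # [('ASIA', 15)]
```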

Important note

Note that while deep filters are usually a very effective optimization technique, there are edge cases where they can hurt query performance. This usually happens on large datasets where the additional deep filter isn't selective enough: the subquery prunes little data but still has to be evaluated, adding to the query's total execution time.

CONCLUSION

Now that you have a solid understanding of what deep filters are, go and find queries to optimize using this simple technique!

Appendix

Star Schema used in the example: SSB schema DDL and DML.sql