Introduction
In Firebolt, aggregating indexes can be leveraged not only for accelerating aggregate functions but also for enhancing the performance of queries by pre-calculating distinct values of specific columns. This approach is particularly useful for filtering operations on columns that are frequently queried but are not prominent in the primary index, ensuring fast retrieval and efficient data pruning. By the end of this article, you will know how to create and use aggregating indexes to optimize filtering on specific columns.
TL;DR
- Purpose: Improve query performance for filtering operations.
- Creation: Use CREATE AGGREGATING INDEX to store distinct column values.
- Mechanism: Use SELECT DISTINCT in a subquery so that the query uses the aggregating index.
- Maintenance: Firebolt automatically updates these indexes as new data is ingested.
Step-by-Step Guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, see the Ultra Fast Gaming Firebolt Sample Dataset.
Step 1: Understanding the Use Case for Secondary Indexes
Aggregating indexes can be used to pre-calculate and store distinct values of specific columns, optimizing filtering operations. This technique is beneficial when certain columns are frequently used in WHERE clauses, enabling faster data retrieval and reducing query execution time.
Example SQL code:
-- Common SELECT query, filtering on Currentscore
SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172;
Step 2: When to Use Aggregating Indexes for Filtering
You can benefit from using an aggregating index as a secondary index when three conditions are met:
- High-Cardinality Field: Your query contains a filter over a high-cardinality field.
- Non-Primary Index Field: The filtered field is not part of the scanned table's primary index, or it is part of the primary index but sits in a late position where it contributes little to pruning.
- Optimized Primary Index: The scanned table has a well-thought-out primary index definition that is designed to serve other frequently filtered fields.
In our example, the PlayStats table is very large (over 1 billion records), so retrieving rows by their CurrentScore values can be expensive. Additionally, this is not a column we'd like to add to the primary index, because the table's primary index is already optimized for other filters.
Example SQL code:
-- High-Cardinality field:
SELECT APPROX_COUNT_DISTINCT(city_hash(CurrentScore)) FROM playstats;
-- Result: 770,357
-- The primary index of the FACT table, which should remain unchanged:
SELECT primary_index FROM information_schema.tables WHERE table_name = 'playstats';
-- Result: playerid, stattime, tournamentid, gameid
Step 3: Choosing a Primary Index Column
The secondary index includes the filtered field as well as one column from the primary index. Choose a primary index column that has high cardinality but a low ratio of distinct values relative to the filtered field (below 0.2).
Example SQL code:
-- The Primary index field you will add to the aggregating index:
SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID)) FROM playstats;
-- Result: 5,406
-- Low ratio between the primary index column and the filtered field
SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID)) / APPROX_COUNT_DISTINCT(city_hash(CurrentScore))::decimal FROM PlayStats;
-- Result: 0.007017526
-- Result should be less than 0.2
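The two cardinality checks and the ratio test can also be combined into one query. The following is a minimal sketch that reuses the functions shown above; the column aliases and the verdict labels are illustrative and do not come from Firebolt, and the 0.2 threshold is the rule of thumb stated in this step.

```sql
-- Sketch: compute both approximate cardinalities in one scan,
-- then derive the ratio and compare it with the 0.2 rule of thumb.
-- Aliases (player_cardinality, score_cardinality, ratio, verdict) are hypothetical names.
SELECT
    player_cardinality,
    score_cardinality,
    player_cardinality / score_cardinality::decimal AS ratio,
    CASE
        WHEN player_cardinality / score_cardinality::decimal < 0.2
            THEN 'good candidate'
        ELSE 'reconsider'
    END AS verdict
FROM (
    SELECT
        APPROX_COUNT_DISTINCT(city_hash(PlayerID))     AS player_cardinality,
        APPROX_COUNT_DISTINCT(city_hash(CurrentScore)) AS score_cardinality
    FROM PlayStats
) AS cardinalities;
```

With the results shown above (5,406 and 770,357), the ratio is about 0.007, well under the threshold.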
Step 4: Defining the Aggregating Index
Create the aggregating index like any other aggregating index, containing one of the primary index columns and the columns you want to filter by. There is an implicit count(*) included in every aggregating index, so there is no need to specify an aggregation column.
By defining an aggregating index that includes PlayerID and CurrentScore, Firebolt can use this smaller index to retrieve values, resulting in less data being read and faster performance.
Example SQL code:
-- This index helps filter data efficiently based on CurrentScore
CREATE AGGREGATING INDEX idx_playstats_playerid_currentscore
ON PlayStats
(
PlayerID,
CurrentScore
);
Step 5: Querying with the Aggregating Index
Once you create a secondary index, you may need to adjust your query pattern so that it consistently uses the aggregating index for pruning. Note that an aggregating index is not used in JOINs or in single-table queries without aggregations. The filter pattern therefore needs a subquery that the aggregating index can serve, as follows:
-- Use SELECT DISTINCT in your queries
SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172
AND PlayerID IN (SELECT DISTINCT PlayerID FROM PlayStats WHERE CurrentScore = 574172);
When you use the SELECT DISTINCT construction, the subquery is answered from the aggregating index, and Firebolt uses the resulting PlayerID values to prune the scan of the main table.
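The same pattern can be written for a range of values rather than a single value by repeating the filter in both the outer query and the subquery. This is a sketch only: the score bounds are made-up values, and whether the aggregating index is actually chosen for a range predicate is an assumption you should confirm with EXPLAIN on your own engine.

```sql
-- Sketch: range filter using the same SELECT DISTINCT pattern.
-- The bounds 574000 and 575000 are illustrative values, not from the data set documentation.
SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore BETWEEN 574000 AND 575000
AND PlayerID IN (
    SELECT DISTINCT PlayerID
    FROM PlayStats
    WHERE CurrentScore BETWEEN 574000 AND 575000
);
```

The predicate must be identical in both places; otherwise the subquery may return a set of PlayerID values that does not cover the rows selected by the outer filter.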
Step 6: Validating the Effectiveness of the Index
Validate the effectiveness of the aggregating index by comparing the execution time of queries before and after creating the index. You can confirm that the aggregating index is actually used by running EXPLAIN on the query and checking the plan.
Example SQL code:
-- Compare the query execution plans to see the performance improvement.
-- Original query, without the SELECT DISTINCT subquery:
EXPLAIN (ANALYZE) SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172;
-- Rewritten query, which can prune using the aggregating index:
EXPLAIN (ANALYZE) SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172
AND PlayerID IN (SELECT DISTINCT PlayerID FROM PlayStats WHERE CurrentScore = 574172);
Step 7: Maintaining Aggregating Indexes
Firebolt automatically maintains aggregating indexes as new data is ingested, ensuring that the distinct values remain up-to-date without requiring manual intervention.
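Although no manual maintenance is required, it can be useful to check that the index still exists on the table. The query below is a sketch that assumes your Firebolt version exposes an information_schema.indexes view with a table_name column; if it does not, consult the information schema reference for the equivalent view.

```sql
-- Sketch: list the indexes defined on the PlayStats table.
-- Assumes information_schema.indexes and its table_name column are available.
SELECT *
FROM information_schema.indexes
WHERE table_name = 'playstats';
```

The index created in Step 4, idx_playstats_playerid_currentscore, should appear in the result.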
Full Example Code
-- Step 1: common SELECT query, filtering on CurrentScore
SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172;
-- Step 2: cardinality of the filtered field and the table's primary index
SELECT APPROX_COUNT_DISTINCT(city_hash(CurrentScore)) FROM playstats;
SELECT primary_index FROM information_schema.tables WHERE table_name = 'playstats';
-- Step 3: cardinality of the primary index column and its ratio to the filtered field
SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID)) FROM playstats;
SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID)) / APPROX_COUNT_DISTINCT(city_hash(CurrentScore))::decimal FROM PlayStats;
-- Step 4: create the aggregating index
CREATE AGGREGATING INDEX idx_playstats_playerid_currentscore
ON PlayStats
(
PlayerID,
CurrentScore
);
-- Step 5: query using the SELECT DISTINCT pattern
SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172
AND PlayerID IN (SELECT DISTINCT PlayerID FROM PlayStats WHERE CurrentScore = 574172);
-- Step 6: compare the execution plans of the original and rewritten queries
EXPLAIN (ANALYZE) SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172;
EXPLAIN (ANALYZE) SELECT PlayerID, GameID, CurrentScore
FROM PlayStats
WHERE CurrentScore = 574172
AND PlayerID IN (SELECT DISTINCT PlayerID FROM PlayStats WHERE CurrentScore = 574172);