Optimizing Queries with Aggregating Indexes in Firebolt

FireboltAutomations · September 20, 2023, 3:46pm

Objective

Aggregating indexes in Firebolt are key tools for enhancing performance. By the end of this article, you'll learn how to identify, set up, test, and adjust aggregating indexes to optimize query performance.

For a concise high-level summary, please refer to the TL;DR section below.
For a detailed, step-by-step guide, start with the Introduction section.

TL;DR

Aggregating Indexes are independent objects in Firebolt that pre-calculate and store results of specified aggregate functions for optimized query performance.

Aggregating Indexes are self-maintaining, updating automatically when base tables are updated and utilized automatically in relevant queries.

You can create an aggregating index following these five steps:

Identify all aggregating functions in your query or queries (Refer to the "Selecting The Right Columns" section for guidance).
Identify the set of GROUP BY, WHERE and CASE WHEN and determine if including them will boost your performance based on a minimum of 5x ratio between the total row count and a unique combination of those columns (Refer to the "Selecting The Right Columns" section for guidance).
Before ingesting data, create the aggregating index. Prioritize column order based on selectivity first, then include all aggregating functions (Refer to the "Setting Up The Aggregating Index" section for guidance).
Use EXPLAIN on your query or queries to ensure the aggregating index is utilized.
If you modify your queries or add additional aggregations, either DROP and CREATE the aggregating index with the necessary tweaks or create an additional aggregating index - Firebolt will automatically select the most effective index as needed.

Introduction & Use Case

What Will You Need?

For this tutorial, you will need:

A working Firebolt account with permission to create tables, ingest data, and query the tables you have created.
A dataset - You can use your data or reproduce the tutorial step by step using Firebolt’s sample dataset. For instructions on using Firebolt's sample dataset, please see the Reference section below.

When Should You Use Aggregating Indexes?

As a rule of thumb - use aggregating index in Firebolt when you run frequent aggregate queries on large fact tables. Here are examples of both relevant and irrelevant cases for aggregating indexes:

Relevant Use Cases for Aggregating Index:

Dashboards: Often pull aggregated data from large datasets.
Repetitive Reports: Regularly generated reports with aggregate data.
Common Aggregations: On large fact tables, like SUM or COUNT.

Irrelevant Use Cases for Aggregated Index:

Ad-hoc Queries: One-time or infrequent queries without aggregates.
Small Tables: The overhead might outweigh the benefits.
Ingestion Priority: If faster data ingestion is more important than query speed.

Introducing Ultra Fast Gaming Use Case

Meet Tom, a member of the Data team at Ultra Fast Gaming Inc. Tom is responsible for related analytics as Ultra Fast Gaming ventures into online games. The Product team recently tasked Tom with setting up a monthly tracking system to monitor overall player performance across selected tournaments, excluding a car called Sunfire.

Tom constructed the following query for this purpose:

WITH WeeklyPerformance AS (    SELECT         PlayerID,        TournamentID,        SUM(CurrentScore) AS TotalScore,        AVG(CurrentSpeed) AS AvgSpeed,        COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN '1' END) AS ErrorCount,        ROW_NUMBER() OVER (PARTITION BY PlayerID ORDER BY SUM(CurrentScore) DESC) AS PlayerRank    FROM playstats    WHERE         TournamentID IN (56, 16, 98, 111, 53, 29, 49, 31, 125, 12)        AND SelectedCar NOT IN ('Sunfire')    GROUP BY 1, 2)SELECT     p.PlayerID,    p.TournamentID,    p.TotalScore,    p.AvgSpeed,    CASE         WHEN p.ErrorCount > 10000 THEN 'High Error Count'        ELSE 'Low Error Count'    END AS ErrorLevel,    SUM(CASE WHEN p.PlayerRank <= 10 THEN 1 ELSE 0 END) OVER (PARTITION BY p.TournamentID) AS Top10PlayersCountFROM WeeklyPerformance pORDER BY p.TournamentID, p.TotalScore DESC;

What Are We Trying to Solve?

On a small engine, the query runtime is fast at 0.78s when hot and 2.1s when cold. However, Tom wants to make it even faster. Since this widely used query will rarely change, it is a perfect candidate for an aggregating index.

Selecting The Right Columns

The first and most important step is selecting the right columns for aggregating indexes. Generally speaking, your aggregating index should include all the fields in the query that fall under the following four types:

Aggregation expressions fields.
Fields included in a GROUP BY clause that do not contain aggregate or window expressions (including subqueries).
Fields filtered as part of a WHERE clause.
Fields included in a CASE WHEN expression.

Only include GROUP BY, WHERE, and CASE WHEN if they significantly reduce the scanned rows within the table - the ratio of total rows in a table and the unique compositions of the GROUP BY, WHERE, and CASE WHEN should be at least 5X.

Step by Step: Selecting the Right Columns for Ultra Fast Gaming Case

Based on our description, we will now identify the relevant columns for Tom's scenario.

Determine the Key Elements for Indexing

Identify Significant Columns: In Tom's example, the important columns are:

PlayerID: Included in the GROUP BY clause.
SelectedCar: Used in the WHERE clause.
TournamentID: Appears in both GROUP BY and WHERE clauses.
ErrorCode: Used in ErrorCount, which involves conditional manipulation (CASE WHEN) and aggregation (COUNT).

Exclusions: CurrentScore is excluded as it's part of a window function (PlayerRank).
Evaluate Inclusion Criteria: Determine if these elements can reduce the amount of data the query scans. Aim for a unique-to-total row ratio of at least 5X. To do this, compare:

Total rows: SELECT COUNT(*) FROM PlayStats
Unique combinations: SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID, SelectedCar, TournamentID, ErrorCode)) FROM playstats
Note: Only include CASE WHEN when it's also part of an aggregate function (e.g., ErrorCount). Exclude it if it's used only for grouping (e.g., ErrorLevel).

Assessment: In Tom's case, the total rows were 1,036,270,144, and the unique combinations were 543,530, yielding a favorable ratio of ~1,906. This supports including these columns in the aggregating index.
Additional Considerations: If the unique-to-total row ratio was below 5X, consider splitting the index. Include only subsets of the GROUP BY, WHERE, and CASE WHEN columns. Firebolt will automatically choose the most efficient index for query execution.

Identify Aggregate Functions in the Query

In Tom's situation, SUM(CurrentScore) and AVG(CurrentSpeed) are the aggregate functions.

Setting Up The Aggregating Index

Once you've identified all relevant elements, setting up the aggregating index is straightforward (see our documentation for further reference).
However, keep in mind these a few key points:

When adding an aggregated column (for example, CurrentScore is aggregated using SUM()), you should add the whole aggregation (SUM(CurrentScore)) to the index.
The sequence of columns in an aggregating index is important because this order is the primary index for the aggregating index itself.

Selectivity is the key criterion when ordering your columns. In this context, "Selectivity" refers to the frequency with which a specific column filters your queries.
A column with high selectivity is frequently used as a filter in queries. The higher the frequency, the greater the selectivity.
Columns involved in aggregation functions should be placed last in the order.

Note that you can't modify aggregating indexes after creation. You'd need to DROP and then CREATE the new index to make changes to the index.
In most cases, it is best practice first to create the aggregating index, test it, and only then ingest the data.

Step by Step: Setting Up Aggregating Indexes for Ultra Fast Gaming Case

Next, we'll walk through the process of creating the aggregating index step by step:

As a reminder, here are the columns we decided to include in the aggregating index on the playstats table

PlayerID
SelectedCar
TournamentID
ErrorCode
CurrentScore - Since we use SUM() aggregation, we will add the whole aggregation to the index - SUM(CurrentScore)
CurrentSpeed - Since we use AVG() aggregation, we will add the whole aggregation to the index - AVG(CurrentSpeed)
ErrorCode - since it is both part of an aggregation and a CASE WHEN, we would include the whole statement - COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN 1 END). Please note - In case it was a CASE WHEN, we would only include the column name (ErrorCode, in this case).

We will now determine the columns order based on their selectivity. Consider two additional queries that Tom is working with:

Query II

SELECT   PlayerID,  CurrentLevel,  CAST("Timestamp" AS DATE) AS PlayDate,  SUM(CurrentScore) AS TotalScore,  AVG(CurrentSpeed) AS AvgSpeed,  SUM(CASE WHEN CurrentSpeed > 150 THEN 1 ELSE 0 END) AS HighSpeedCounts,FROM   playstatsWHERE   PlayerID IN (1, 2, 4, 67, 89)GROUP BY ALL;

Query III

SELECT   PlayerID,  TournamentID,  SUM(CurrentScore) AS TotalScore,  AVG(CurrentSpeed) AS AvgSpeed,  COUNT(*) AS TotalEventsFROM   playstatsWHERE   PlayerID NOT IN (67, 89, 45)AND   TournamentID NOT IN (56, 16, 98, 111, 53)GROUP BY 1, 2;

Based on the above, PlayerID has the highest selectivity and is used in all queries. TournamentID is second, being used only in the original query and Query III, and SelectedCar third, filtered only in the original query.
Please note -

To optimize the two additional queries, we could have also included CurrentLevel and the expression SUM(CASE WHEN CurrentSpeed > 150 THEN 1 ELSE 0 END) in the index.
Alternatively, if the unique combination of columns falls below a 5X ratio, another approach is to create a separate aggregating index that includes only the relevant columns. Firebolt will automatically select the most efficient index to use when necessary.

We will include the aggregated columns at the end of the aggregating index.
Here is the resulting code for Tom's aggregating index:

CREATE AGGREGATING INDEX playstats_agg_index ON playstats(    PlayerID,    TournamentID,    SelectedCar,    COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN '1' END),     SUM(CurrentScore),    AVG(CurrentSpeed));

Testing Your Aggregating Indexes and Making Further Adjustments

After setting up your aggregating index, it's time to test it. Testing should include checking:

Usage - Is your query using the aggregating index?
Performance - Is your query running faster?

Step by Step: Testing Aggregating Indexes for Ultra Fast Gaming Case

Next, let's walk through the process of testing the aggregating index step by step:

Test aggregating index usage by running the EXPLAIN command. In Tom's case, run the original query with EXPLAIN:

EXPLAINWITH   WeeklyPerformance AS (SELECT   PlayerID,...

Examine the results. You should see a row describing the use of the aggregating index. In Tom's case, the result is that playstats_agg_index was used in 6/6 columns.

[0]      \_      ...        \_[11] [StoredTable] Name: 'playstats_agg_index__FB_AGG_IDX_MV__TABLE', used 6/6 column(s) AGGREGATING_INDEX_TABLE

Run your query to see performance improve. Tom's performance improved from 0.7s to 0.02s - a Truly sub-second speed! Great job!

Conclusion

Aggregating indexes can be game changers for frequent and significant aggregate queries on large fact tables. The potential for performance improvement is significant, yet it's crucial to strategically select the columns you include and carefully determine their order. Testing plays a critical role in this process, it's essential to verify that your aggregating index is being utilized effectively and enhancing query speeds as expected. If you follow these steps and tips, you can significantly boost your query performance in Firebolt.

Reference

Setting up the Firebolt sample dataset

Run the following code in a new or existing database. This will:

Create all relevant external tables.
Create all relevant FACT and DIMENSION tables.
Ingest the data into the tables.

Allow the process to run. It should take less than 10 minutes on a small engine.

DROP TABLE IF EXISTS ex_games;CREATE EXTERNAL TABLE IF NOT EXISTS ex_games(    src TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/games.json'TYPE = (JSON PARSE_AS_TEXT = 'TRUE');DROP TABLE IF EXISTS ex_levels;CREATE EXTERNAL TABLE IF NOT EXISTS ex_levels(    LevelID INT,    GameID INT,    Level INT,    Name TEXT,    LevelType TEXT,    NextLevel INT NULL,    MinPointsToPass INT,    MaxPoints INT,     NumberOfLaps INT,    MaxPlayers INT,    MinPlayers INT,    PointsPerLap FLOAT,    MusicTrack TEXT,    SceneDetails TEXT,    MaxPlayTimeSeconds INT,    LevelIcon TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/levels.csv'TYPE = (CSV SKIP_HEADER_ROWS = 1);DROP TABLE IF EXISTS ex_players;CREATE EXTERNAL TABLE IF NOT EXISTS ex_players(    src TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/players.json'TYPE = (JSON PARSE_AS_TEXT = 'TRUE');DROP TABLE IF EXISTS ex_tournaments;CREATE EXTERNAL TABLE IF NOT EXISTS ex_tournaments(    TournamentID INT,    Name TEXT,    GameID INT,    TotalPrizeDollars INT,    StartDateTime TIMESTAMP,    EndDateTime TIMESTAMP,    RulesDefinition TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/tournaments.csv'TYPE = (CSV SKIP_HEADER_ROWS = 1);DROP TABLE IF EXISTS ex_playstats;CREATE EXTERNAL TABLE IF NOT EXISTS ex_playstats(    "GameID" INT,    "PlayerID" INT,    "Timestamp" TIMESTAMP,    "SelectedCar" TEXT,    "CurrentLevel" INT,    "CurrentSpeed" FLOAT,    "CurrentPlayTime" BIGINT,    "CurrentScore" BIGINT,    "Event" TEXT,    "ErrorCode" TEXT,    "TournamentID" INT PARTITION('.*TournamentID=(.+)\/.*')) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/playstats/*.parquet'TYPE = (PARQUET);DROP TABLE IF EXISTS ex_rankings;CREATE EXTERNAL TABLE IF NOT EXISTS ex_rankings(    "GameID" INT,    "PlayerID" INT,    "MaxLevel" INT,    "TotalScore" BIGINT,    "PlaceWon" INT,    "TournamentID" INT PARTITION('.*TournamentID=(.+)\/.*')) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/rankings/*.parquet'TYPE = (PARQUET);DROP TABLE IF EXISTS games CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS games(    GameID INT,    Title TEXT,    Abbreviation TEXT,    Series TEXT,    Version DECIMAL(10, 2),    "Description" TEXT,    Category TEXT,    LaunchDate DATE,    Author TEXT,    SupportedPlatforms ARRAY(TEXT),    GameConfiguration TEXT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, Title;DROP TABLE IF EXISTS levels CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS levels(    LevelID INT,    GameID INT,    Level INT,    Name TEXT,    LevelType TEXT,    NextLevel INT NULL,    MinPointsToPass INT,    MaxPoints INT,     NumberOfLaps INT,    MaxPlayers INT,    MinPlayers INT,    PointsPerLap FLOAT,    MusicTrack TEXT,    SceneDetails TEXT,    MaxPlayTimeSeconds INT,    LevelIcon BYTEA,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX LevelID;DROP TABLE IF EXISTS players CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS players(    PlayerID INT,    Nickname TEXT,    Email TEXT,    AgeCategory TEXT,    Platforms ARRAY(TEXT),    RegisteredOn PGDATE,    IsSubscribedToNewsletter BOOLEAN,    InternalProbabilityToWin DOUBLE,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX PlayerID, Nickname, AgeCategory, RegisteredOn;DROP TABLE IF EXISTS tournaments CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS tournaments(    TournamentID INT,    Name TEXT,    GameID INT,    TotalPrizeDollars INT,    StartDateTime TIMESTAMP,    EndDateTime TIMESTAMP,    RulesDefinition TEXT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX TournamentID;DROP TABLE IF EXISTS rankings CASCADE;CREATE FACT TABLE IF NOT EXISTS rankings(    GameID INT,    PlayerID INT,    MaxLevel INT,    TotalScore BIGINT,    PlaceWon INT,    TournamentID INT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, TournamentID, PlayerID;DROP TABLE IF EXISTS playstats CASCADE;CREATE FACT TABLE IF NOT EXISTS playstats(    GameID INT,    PlayerID INT,    "Timestamp" TIMESTAMP,    SelectedCar TEXT,    CurrentLevel INT,    CurrentSpeed FLOAT,    CurrentPlayTime BIGINT,    CurrentScore BIGINT,    Event TEXT,    ErrorCode TEXT,    TournamentID INT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, TournamentID, PlayerID, "Timestamp";INSERT INTO games SELECT     JSON_EXTRACT(src,'/GameID','INT') AS GameID,    JSON_EXTRACT(src,'/Title','TEXT') AS Title,    JSON_EXTRACT(src,'/Abbreviation','TEXT') AS Abbreviation,    JSON_EXTRACT(src,'/Series','TEXT') AS Series,    JSON_EXTRACT(src,'/Version','FLOAT')::DECIMAL(10, 2) AS Version,    JSON_EXTRACT(src,'/Description','TEXT') AS "Description",    JSON_EXTRACT(src,'/Category','TEXT') AS Category,    JSON_EXTRACT(src,'/LaunchDate','TEXT')::DATE AS LaunchDate,    JSON_EXTRACT(src,'/Author','TEXT') AS Author,    ARRAY_AGG(JSON_EXTRACT(platforms_unnest, 'Name', 'TEXT')) AS SupportedPlatforms,    JSON_EXTRACT_RAW(src,'/GameConfiguration') AS GameConfiguration,    SOURCE_FILE_NAME,     SOURCE_FILE_TIMESTAMP FROM ex_games UNNEST(JSON_EXTRACT_ARRAY_RAW(src, '/SupportedPlatforms') AS platforms_unnest)GROUP BY ALL;INSERT INTO levelsSELECT    LevelID,    GameID,    Level,    Name,    LevelType,    NextLevel,    MinPointsToPass,    MaxPoints,     NumberOfLaps,    MaxPlayers,    MinPlayers,    PointsPerLap,    MusicTrack,    SceneDetails,    MaxPlayTimeSeconds,    DECODE(REPLACE(LevelIcon,'"',''),'BASE64'),    SOURCE_FILE_NAME,    SOURCE_FILE_TIMESTAMPFROM ex_levels;INSERT INTO players SELECT     JSON_EXTRACT(src,'/playerID','INT') AS PlayerID,    JSON_EXTRACT(src,'/details/nickname','TEXT') AS Nickname,    JSON_EXTRACT(src,'/details/email','TEXT') AS Email,    JSON_EXTRACT(src,'/details/ageCategory','TEXT') AS AgeCategory,    JSON_EXTRACT(src, '/details/platforms', 'ARRAY(TEXT)') AS Platforms,    JSON_EXTRACT(src,'/details/registeredOn','TEXT')::PGDATE AS RegisteredOn,    CASE WHEN LOWER(JSON_EXTRACT_RAW(src,'/details/isSubscribedToNewsletter')::TEXT) = 'true' THEN 1::BOOLEAN ELSE 0::BOOLEAN END AS IsSubscribedToNewsletter,    JSON_EXTRACT_RAW(src,'/details/internalProbabilityToWin')::DOUBLE AS InternalProbabilityToWin,    SOURCE_FILE_NAME,     SOURCE_FILE_TIMESTAMP FROM ex_players;INSERT INTO tournaments SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_tournaments;INSERT INTO rankings SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_rankings;INSERT INTO playstats SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_playstats;

Topic		Replies	Views
Query Diagnosis and Troubleshooting Help Center	4	543	August 3, 2024
How to optimize query performance by choosing the right Primary Index in Firebolt	2	73	July 15, 2024
Tips for a successful Firebolt evaluation	4	337	August 7, 2024
Welcome to Firebolt Help Center Site Feedback	0	626	May 18, 2022
Simplifying SQL query analysis with EXPLAIN (ANALYZE)	0	282	June 20, 2024