Optimizing Queries with Aggregating Indexes in Firebolt

Objective

Firebolt's aggregating indexes are one of the main tools to increase performance. By the end of this article, you will be able to identify, set up, test, and adjust aggregating indexes to optimize query performance.

  • If you are looking for a high-level summary, have a look at the TL;DR section below
  • If you are looking for an in-depth, step-by-step guide, begin in the Introduction section

TL;DR

Aggregating Indexes are independent objects in Firebolt that pre-calculate and store results of specified aggregate functions for optimized query performance.

Aggregating Indexes are self-maintaining, updating automatically when base tables are updated and utilized automatically in relevant queries.

You can create an aggregating index following these five steps:

  1. Identify all aggregating functions in your query or queries (see "Selecting The Right Columns" section).
  2. Identify the set of GROUP BY, WHERE and CASE WHEN and determine if including them will boost your performance based on a minimum of 5x ratio between the total row count and a unique combination of those columns (see "Selecting The Right Columns" section).
  3. Create the aggregating index before you ingest the data. Ensure you order the columns based on selectivity first, then include all aggregating functions (see "Setting Up The Aggregating Index" section).
  4. Use EXPLAIN on your query or queries to ensure the aggregating index is utilized.
  5. If you change your queries or add additional aggregations, either DROP and CREATE the aggregating index with the necessary tweaks or create an additional aggregating index - Firebolt will know how to use the most useful index when needed.

Introduction & Use Case

What Will You Need?

To follow this tutorial, you will require:

  • A working Firebolt account with permission to create tables, ingest data, and query the tables you created.
  • A dataset - You can use your data or reproduce the tutorial step by step using Firebolt’s sample dataset. If you wish to use Firebolt's sample dataset, please refer to the following section.

When Should You Use Aggregating Indexes?

As a rule of thumb, use an aggregating index in Firebolt when you run frequent aggregate queries on large fact tables. Below are some examples of relevant and irrelevant cases for aggregating indexes:

  • Relevant Use Cases for Aggregating Index:
    • Dashboards: Often pull aggregated data from large datasets.
    • Repetitive Reports: Regularly generated reports with aggregate data.
    • Common Aggregations: On large fact tables, like SUM or COUNT.
  • Irrelevant Use Cases for Aggregated Index:
    • Ad-hoc Queries: One-time or infrequent queries without aggregates.
    • Small Tables: The overhead might outweigh the benefits.
    • Ingestion Priority: If faster data ingestion is more important than query speed.

Introducing Ultra Fast Gaming Use Case

Meet Tom, a Data team member at Ultra Fast Gaming Inc. Tom is responsible for related analytics as Ultra Fast Gaming ventures into online games. The Product team recently tasked Tom with setting up a monthly tracking system to monitor overall player performance across selected tournaments, excluding a car called Sunfire.

Tom constructed the following query for this purpose:

WITH WeeklyPerformance AS (    SELECT         PlayerID,        TournamentID,        SUM(CurrentScore) AS TotalScore,        AVG(CurrentSpeed) AS AvgSpeed,        COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN 1 END) AS ErrorCount,        ROW_NUMBER() OVER (PARTITION BY PlayerID ORDER BY SUM(CurrentScore) DESC) AS PlayerRank    FROM PlayStats    WHERE         TournamentID IN (56, 16, 98, 111, 53, 29, 49, 31, 125, 12)        AND SelectedCar NOT IN ('Sunfire')    GROUP BY 1, 2)SELECT     p.PlayerID,    p.TournamentID,    p.TotalScore,    p.AvgSpeed,    CASE         WHEN p.ErrorCount > 10000 THEN 'High Error Count'        ELSE 'Low Error Count'    END AS ErrorLevel,    SUM(CASE WHEN p.PlayerRank <= 10 THEN 1 ELSE 0 END) OVER (PARTITION BY p.TournamentID) AS Top10PlayersCountFROM WeeklyPerformance pORDER BY p.TournamentID, p.TotalScore DESC;

What Are We Trying to Solve?

Using the M4 type engine, and since it is Firebolt, the query runtime is fast at 0.78s when hot and 2.1s when cold. However, Tom wants to make it even faster. Since this widely-used query will rarely change, it is a perfect candidate for an aggregating index.

Selecting The Right Columns

The first and most important step is selecting the right columns for aggregating indexes. Generally speaking, your aggregating index should include all the fields in the query that fall under the following four types:

  • Aggregation expressions fields.
  • Fields included in a GROUP BY clause that do not contain aggregate or window expressions (including subqueries).
  • Fields filtered as part of a WHERE clause.
  • Fields included in a CASE WHEN expression.

Please note

You should always include aggregation expressions in the aggregating index.

Only include GROUP BY, WHERE, and CASE WHEN if they significantly reduce the scanned rows within the table - the ratio of total rows in a table and the unique compositions of the GROUP BY, WHERE, and CASE WHEN should be at least 5X.

Step by Step: Selecting the Right Columns for Ultra Fast Gaming Case

Based on our description, we will now select the relevant columns for Tom's case.

  1. Determine the Key Elements for Indexing
    1. Identify Significant Columns: In Tom's example, the important columns are:
      1. PlayerID: Included in the GROUP BY clause.
      2. SelectedCar: Used in the WHERE clause.
      3. TournamentID: Appears in both GROUP BY and WHERE clauses.
      4. ErrorCode: Used in ErrorCount, which involves conditional manipulation (CASE WHEN) and aggregation (COUNT).
    2. Exclusions: CurrentScore is excluded as it's part of a window function (PlayerRank).
    3. Evaluate Inclusion Criteria: Determine if these elements can reduce the amount of data the query scans. Aim for a unique-to-total row ratio of at least 5X. To do this, compare:
      1. Total rows: SELECT COUNT(*) FROM PlayStats
      2. Unique combinations: SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID, SelectedCar, TournamentID, ErrorCode)) FROM playstats
      3. Note: Only include CASE WHEN when it's also part of an aggregate function (e.g., ErrorCount). Exclude it if it's used only for grouping (e.g., ErrorLevel).
    4. Assessment: In Tom's case, the total rows were 1,036,270,144, and the unique combinations were 543,530, yielding a favorable ratio of ~1,906. This supports including these columns in the aggregating index.
    5. Additional Considerations: If the unique-to-total row ratio was below 5X, consider splitting the index. Include only subsets of the GROUP BY, WHERE, and CASE WHEN columns. Firebolt will automatically choose the most efficient index for query execution.
  2. Identify Aggregate Functions in the Query
    1. In Tom's situation, SUM(CurrentScore) and AVG(CurrentSpeed) are the aggregate functions.

Setting Up The Aggregating Index

Once you've identified all relevant elements, setting up the aggregating index is straightforward (see our documentation for further reference). However, remember a few key points:

  1. When adding an aggregated column (for example, CurrentScore is aggregated using SUM()), you should add the whole aggregation (SUM(CurrentScore)) to the index.
  2. The sequence of columns in an aggregating index is important because this order is the primary index for the aggregating index itself.
    • Selectivity is the key criterion when ordering your columns. In this context, "Selectivity" refers to the frequency with which a specific column filters your queries.
    • A column with high selectivity is frequently used as a filter in queries. The higher the frequency, the greater the selectivity.
    • Columns involved in aggregation functions should be placed last in the order.
  3. Note that you can't modify aggregating indexes after creation. You'd need to DROP and then CREATE the new index to make changes to the index.
  4. In most cases, it is best practice first to create the aggregating index, test it, and only then ingest the data.

Step by Step: Setting Up Aggregating Indexes for Ultra Fast Gaming Case

We will now go step by step in creating the aggregating index.

  1. As a reminder, here are the columns we decided to include in the aggregating index on the playstats table
    • PlayerID
    • SelectedCar
    • TournamentID
    • ErrorCode
    • CurrentScore - Since we use SUM() aggregation, we will add the whole aggregation to the index - SUM(CurrentScore)
    • CurrentSpeed - Since we use AVG() aggregation, we will add the whole aggregation to the index - AVG(CurrentSpeed)
    • ErrorCode - since it is both part of an aggregation and a CASE WHEN, we would include the whole statement - COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN 1 END). Please note - In case it was a CASE WHEN, we would only include the column name (ErrorCode, in this case).
  2. We will now decide on the order of columns based on their selectivity. Imagine that Tom has two additional queries:

Query II

SELECT   PlayerID,  CurrentLevel,  CAST(Timestamp AS DATE) AS PlayDate,  SUM(CurrentScore) AS TotalScore,  AVG(CurrentSpeed) AS AvgSpeed,  SUM(CASE WHEN CurrentSpeed > 150 THEN 1 ELSE 0 END) AS HighSpeedCounts,FROM   PlayStatsWHERE   PlayerID IN (1, 2, 4, 67, 89)GROUP BY ALL

Query III

SELECT   PlayerID,  TournamentID,  SUM(CurrentScore) AS TotalScore,  AVG(CurrentSpeed) AS AvgSpeed,  COUNT() AS TotalEventsFROM   PlayStatsWHERE   PlayerID NOT IN (67, 89, 45)AND   TournamentID NOT IN (56, 16, 98, 111, 53)GROUP BY 1, 2
  1. Based on the above, PlayerID has the highest selectivity and is used in all queries. TournamentID is second, being used only in the original query and Query III, and SelectedCar third, filtered only in the original query.
  2. Please note -
    1. To optimize the two additional queries, we could have also included CurrentLevel and the expression SUM(CASE WHEN CurrentSpeed > 150 THEN 1 ELSE 0 END) in the index.
    2. Alternatively, if the unique combination of columns falls below a 5X ratio, another approach is to create a separate aggregating index that includes only the relevant columns. Firebolt will automatically select the most efficient index to use when necessary.
  1. We will include the aggregated columns at the end of the aggregating index.
  2. Here is the resulting code for Tom's aggregating index:
CREATE AGGREGATING INDEX playstats_agg_index ON PlayStats (    PlayerID,    TournamentID,    SelectedCar,    COUNT(CASE WHEN ErrorCode NOT ILIKE '%NoError%' THEN 1 END),     SUM(CurrentScore),    AVG(CurrentSpeed),)

Testing Your Aggregating Indexes and Making Further Adjustments

After setting up your aggregating index, it's time to test it. Testing involves checking:

  • Usage - Is your query using the aggregating index?
  • Performance - Is your query running faster?

Step by Step: Testing Aggregating Indexes for Ultra Fast Gaming Case

We will now go step by step in testing the aggregating index.

  1. Test aggregating index usage by running the EXPLAIN command. In Tom's case, run the original query with EXPLAIN:
EXPLAINWITH   WeeklyPerformance AS (SELECT   PlayerID,...
  1. Examine the results. You should see a row describing the use of the aggregating index. In Tom's case, the result is that playstats_agg_index was used in 6/6 columns.
[0]      \_      ...        \_[11] [StoredTable] Name: 'playstats_agg_index__FB_AGG_IDX_MV__TABLE', used 6/6 column(s) AGGREGATING_INDEX_TABLE
  1. Run your query to see performance improve. Tom's performance improved from 0.7s to 0.02s - a Truly sub-second speed! Great job!

Conclusion

Aggregating indexes can be game changers for frequent and significant aggregate queries on large fact tables. The performance gain can be immense, but it's essential to be strategic about the columns you include and the order they're listed. Testing is vital: Always ensure your aggregating index is being used and delivering the speed enhancements you're after. If you follow these steps and tips, you can significantly boost your query performance in Firebolt.

Reference

Setting up the Firebolt sample dataset

  1. Run the following code in a new or existing database. This will:
    1. Create all relevant external tables
    2. Create all relevant FACT and DIMENSION tables
    3. Ingest the data into the tables
  2. Allow the process to run. It should take less than 10 minutes on 1x M8.
DROP TABLE IF EXISTS ex_games;CREATE EXTERNAL TABLE IF NOT EXISTS ex_games(    src TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/games.json'TYPE = (JSON PARSE_AS_TEXT = 'TRUE');DROP TABLE IF EXISTS ex_levels;CREATE EXTERNAL TABLE IF NOT EXISTS ex_levels(    LevelID INT,    GameID INT,    Level INT,    Name TEXT,    LevelType TEXT,    NextLevel INT NULL,    MinPointsToPass INT,    MaxPoints INT,     NumberOfLaps INT,    MaxPlayers INT,    MinPlayers INT,    PointsPerLap FLOAT,    MusicTrack TEXT,    SceneDetails TEXT,    MaxPlayTimeSeconds INT,    LevelIcon TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/levels.csv'TYPE = (CSV SKIP_HEADER_ROWS = 1);DROP TABLE IF EXISTS ex_players;CREATE EXTERNAL TABLE IF NOT EXISTS ex_players(    src TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/players.json'TYPE = (JSON PARSE_AS_TEXT = 'TRUE');DROP TABLE IF EXISTS ex_tournaments;CREATE EXTERNAL TABLE IF NOT EXISTS ex_tournaments(    TournamentID INT,    Name TEXT,    GameID INT,    TotalPrizeDollars INT,    StartDateTime TIMESTAMP,    EndDateTime TIMESTAMP,    RulesDefinition TEXT) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/tournaments.csv'TYPE = (CSV SKIP_HEADER_ROWS = 1);DROP TABLE IF EXISTS ex_playstats;CREATE EXTERNAL TABLE IF NOT EXISTS ex_playstats(    "GameID" INT,    "PlayerID" INT,    "Timestamp" TIMESTAMP,    "SelectedCar" TEXT,    "CurrentLevel" INT,    "CurrentSpeed" FLOAT,    "CurrentPlayTime" BIGINT,    "CurrentScore" BIGINT,    "Event" TEXT,    "ErrorCode" TEXT,    "TournamentID" INT PARTITION('.*TournamentID=(.+)\/.*')) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/playstats/*.parquet'TYPE = (PARQUET);DROP TABLE IF EXISTS ex_rankings;CREATE EXTERNAL TABLE IF NOT EXISTS ex_rankings(    "GameID" INT,    "PlayerID" INT,    "MaxLevel" INT,    "TotalScore" BIGINT,    "PlaceWon" INT,    "TournamentID" INT PARTITION('.*TournamentID=(.+)\/.*')) URL = 's3://firebolt-publishing-public/help_center_assets/firebolt_sample_dataset/'OBJECT_PATTERN = 'help_center_assets/firebolt_sample_dataset/rankings/*.parquet'TYPE = (PARQUET);DROP TABLE IF EXISTS games CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS games(    GameID INT UNIQUE,    Title TEXT,    Abbreviation TEXT,    Series TEXT,    Version DECIMAL(10, 2),    "Description" TEXT,    Category TEXT,    LaunchDate DATE,    Author TEXT,    SupportedPlatforms ARRAY(TEXT),    GameConfiguration TEXT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, Title;DROP TABLE IF EXISTS levels CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS levels(    LevelID INT UNIQUE,    GameID INT,    Level INT,    Name TEXT,    LevelType TEXT,    NextLevel INT NULL,    MinPointsToPass INT,    MaxPoints INT,     NumberOfLaps INT,    MaxPlayers INT,    MinPlayers INT,    PointsPerLap FLOAT,    MusicTrack TEXT,    SceneDetails TEXT,    MaxPlayTimeSeconds INT,    LevelIcon BYTEA,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX LevelID;DROP TABLE IF EXISTS players CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS players(    PlayerID INT UNIQUE,    Nickname TEXT,    Email TEXT,    AgeCategory TEXT,    Platforms ARRAY(TEXT),    RegisteredOn PGDATE,    IsSubscribedToNewsletter BOOLEAN,    InternalProbabilityToWin DOUBLE,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX PlayerID, Nickname, AgeCategory, RegisteredOn;DROP TABLE IF EXISTS tournaments CASCADE;CREATE DIMENSION TABLE IF NOT EXISTS tournaments(    TournamentID INT UNIQUE,    Name TEXT,    GameID INT,    TotalPrizeDollars INT,    StartDateTime TIMESTAMP,    EndDateTime TIMESTAMP,    RulesDefinition TEXT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX TournamentID;DROP TABLE IF EXISTS rankings CASCADE;CREATE FACT TABLE IF NOT EXISTS rankings(    GameID INT,    PlayerID INT,    MaxLevel INT,    TotalScore BIGINT,    PlaceWon INT,    TournamentID INT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, TournamentID, PlayerID;DROP TABLE IF EXISTS playstats CASCADE;CREATE FACT TABLE IF NOT EXISTS playstats(    GameID INT,    PlayerID INT,    "Timestamp" TIMESTAMP,    SelectedCar TEXT,    CurrentLevel INT,    CurrentSpeed FLOAT,    CurrentPlayTime BIGINT,    CurrentScore BIGINT,    Event TEXT,    ErrorCode TEXT,    TournamentID INT,    SOURCE_FILE_NAME TEXT,    SOURCE_FILE_TIMESTAMP TIMESTAMP) PRIMARY INDEX GameID, TournamentID, PlayerID, "Timestamp";INSERT INTO games SELECT     JSON_EXTRACT(src,'/GameID','INT') AS GameID,    JSON_EXTRACT(src,'/Title','TEXT') AS Title,    JSON_EXTRACT(src,'/Abbreviation','TEXT') AS Abbreviation,    JSON_EXTRACT(src,'/Series','TEXT') AS Series,    JSON_EXTRACT(src,'/Version','FLOAT')::DECIMAL(10, 2) AS Version,    JSON_EXTRACT(src,'/Description','TEXT') AS "Description",    JSON_EXTRACT(src,'/Category','TEXT') AS Category,    JSON_EXTRACT(src,'/LaunchDate','TEXT')::DATE AS LaunchDate,    JSON_EXTRACT(src,'/Author','TEXT') AS Author,    NEST(JSON_EXTRACT(platforms_unnest, 'Name', 'TEXT')) AS SupportedPlatforms,    JSON_EXTRACT_RAW(src,'/GameConfiguration') AS GameConfiguration,    SOURCE_FILE_NAME,     SOURCE_FILE_TIMESTAMP FROM ex_games UNNEST(JSON_EXTRACT_ARRAY_RAW(src, '/SupportedPlatforms') AS platforms_unnest)GROUP BY ALL;INSERT INTO levelsSELECT    LevelID,    GameID,    Level,    Name,    LevelType,    NextLevel,    MinPointsToPass,    MaxPoints,     NumberOfLaps,    MaxPlayers,    MinPlayers,    PointsPerLap,    MusicTrack,    SceneDetails,    MaxPlayTimeSeconds,    DECODE(REPLACE(LevelIcon,'"',''),'BASE64')FROM ex_levels;INSERT INTO players SELECT     JSON_EXTRACT(src,'/playerID','INT') AS PlayerID,    JSON_EXTRACT(src,'/details/nickname','TEXT') AS Nickname,    JSON_EXTRACT(src,'/details/email','TEXT') AS Email,    JSON_EXTRACT(src,'/details/ageCategory','TEXT') AS AgeCategory,    JSON_EXTRACT(src, '/details/platforms', 'ARRAY(TEXT)') AS Platforms,    JSON_EXTRACT(src,'/details/registeredOn','TEXT')::PGDATE AS RegisteredOn,    CASE WHEN LOWER(JSON_EXTRACT_RAW(src,'/details/isSubscribedToNewsletter')::TEXT) = 'true' THEN 1::BOOLEAN ELSE 0::BOOLEAN END AS IsSubscribedToNewsletter,    JSON_EXTRACT_RAW(src,'/details/internalProbabilityToWin')::DOUBLE AS InternalProbabilityToWin,    SOURCE_FILE_NAME,     SOURCE_FILE_TIMESTAMP FROM ex_players;INSERT INTO tournaments SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_tournaments;INSERT INTO rankings SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_rankings;INSERT INTO playstats SELECT *, SOURCE_FILE_NAME, SOURCE_FILE_TIMESTAMP FROM ex_playstats;