Optimizing SQL Queries in Firebolt Using Aggregating Indexes

Introduction

In this article, we explore how aggregating indexes can be used to optimize SQL queries in Firebolt. By leveraging these indexes, data engineers can significantly reduce query execution times and improve the responsiveness of analytics applications. By the end of this article, you will know how to use aggregating indexes in Firebolt to streamline your data querying processes.

TL;DR

  • Understanding Aggregating Indexes: What they are and why they’re useful.
  • Creating Aggregating Indexes: Step-by-step SQL guide to set up.
  • Optimizing Queries: How to apply these indexes to optimize various queries.
  • Performance Benefits: Demonstrating the tangible improvements in query performance.
  • Best Practices: Tips for maintaining optimal performance with aggregating indexes.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Data Set

Step 1: Understanding the Need for Aggregating Indexes

Before creating aggregating indexes, it helps to understand their role and impact. Aggregating indexes in Firebolt pre-aggregate data as it is loaded, so queries run faster by reading less data and performing fewer calculations at query time. They are self-maintaining and update automatically when the base table is updated.

Example SQL code:

-- Example SELECT query that can benefit from Aggregating Indexes

    SELECT PlayerID, GameID, COUNT(*) AS games_played
    FROM PlayStats
    GROUP BY PlayerID, GameID;

Step 2: Identifying the Columns to Use in the Aggregating Index

You can define an aggregating index on a column or a set of columns that you frequently query.

  • Identify all aggregating functions in your query (or queries).
  • Identify the set of GROUP BY, WHERE, and CASE WHEN columns. Including them is likely to boost performance when the total row count is at least 5x the number of unique combinations of those columns.
  • Start composing the index with low-cardinality columns that have at most a few dozen distinct values.
  • Once you get to high-cardinality columns, prioritize column order based on selectivity (the frequency with which a specific column is used as a filter in your queries).
  • Then include all aggregating functions.

Example SQL code:

-- Checking 5x ratio between the total row count and a unique combination of the selected columns.

    -- Total rows 
    SELECT COUNT(*) FROM PlayStats;

    -- Unique combinations 
    SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID, GameID)) FROM PlayStats;

    -- Combined query to check ratio
    SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID, GameID))/COUNT(*)::decimal FROM PlayStats;

-- A result below 0.2 (at least 5 rows per unique combination) indicates that the aggregating index is likely to be effective.
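
To apply the ordering guidance above (low-cardinality columns first), it can help to check each candidate column's cardinality on its own. This is a minimal sketch that uses only APPROX_COUNT_DISTINCT and the PlayStats columns from this article; the output aliases are illustrative.

```sql
-- Approximate number of distinct values per candidate key column.
SELECT
    APPROX_COUNT_DISTINCT(GameID)   AS distinct_games,
    APPROX_COUNT_DISTINCT(PlayerID) AS distinct_players
FROM PlayStats;
```

The column with fewer distinct values is the candidate to place first in the index definition.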

Step 3: Creating Aggregating Indexes

Once you have chosen the columns and aggregations, the next step is to create the index. You can create an aggregating index either before or after you ingest data into the base table.

Example SQL code:

-- Creating an aggregating index on the PlayStats table, keyed on the GameID and PlayerID columns.

    CREATE AGGREGATING INDEX idx_playstats_game_player ON PlayStats (GameID, PlayerID, COUNT(*), AVG(CurrentScore));

Step 4: Applying Aggregating Indexes to Optimize Queries

With the index in place, you can write queries that use the pre-aggregated data to speed up response times. The benefit is most noticeable on large datasets.

Example SQL code:

-- Utilizing the aggregating index to quickly fetch average scores by player.

    SELECT PlayerID, AVG(CurrentScore)
    FROM PlayStats
    GROUP BY PlayerID;
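
One way to picture why an index keyed on (GameID, PlayerID) can answer a query grouped only by PlayerID is that pre-aggregated rows can be rolled up further. The sketch below is a conceptual model in plain SQL, not a description of Firebolt's internal storage. The CTE name and the column aliases are hypothetical.

```sql
-- Conceptual model only: Firebolt maintains aggregating indexes internally.
WITH pre_aggregated AS (
    -- One row per (GameID, PlayerID) instead of one row per play session.
    SELECT
        GameID,
        PlayerID,
        COUNT(CurrentScore) AS score_count,  -- counts non-NULL scores, matching AVG
        SUM(CurrentScore)   AS score_sum
    FROM PlayStats
    GROUP BY GameID, PlayerID
)
-- Rolling the partial sums and counts up to PlayerID gives the per-player average.
SELECT
    PlayerID,
    CAST(SUM(score_sum) AS DOUBLE PRECISION) / NULLIF(SUM(score_count), 0) AS avg_score
FROM pre_aggregated
GROUP BY PlayerID;
```

The final step reads one row per player and game rather than every row in PlayStats, which is where the saving at query time comes from.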

Step 5: Analyzing the Performance Improvement

After creating aggregating indexes, analyze the performance of your queries to confirm the improvement and adjust your indexing strategy as needed. Run EXPLAIN (ANALYZE) and check the StoredTable step to verify that the query reads from the idx_playstats_game_player aggregating index rather than from the PlayStats table.

Example SQL code:

-- Ensure the Aggregating Index is indeed used.

    EXPLAIN (ANALYZE) SELECT PlayerID, AVG(CurrentScore)
    FROM PlayStats
    GROUP BY PlayerID;
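
For comparison, it can be useful to run the same check on a query that the index is not expected to serve. The sketch below assumes that a query using an aggregation absent from the index definition, here MAX(CurrentScore), falls back to the base table. Treat that as an assumption and confirm it in the plan output.

```sql
-- MAX(CurrentScore) is not part of idx_playstats_game_player, so the plan is
-- expected (assumption) to read from PlayStats rather than from the index.
EXPLAIN (ANALYZE)
SELECT PlayerID, MAX(CurrentScore)
FROM PlayStats
GROUP BY PlayerID;
```

Comparing the StoredTable step of the two plans shows which queries the current index definition covers.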

Step 6: Making Further Adjustments

Aggregating indexes can't be modified after creation. If you change your queries or add aggregations, either DROP the index and CREATE it again with the necessary changes, or create an additional aggregating index. Firebolt automatically selects the most effective index for each query.

Example SQL code:

-- Adjust your Aggregating Index based on new queries

    DROP AGGREGATING INDEX idx_playstats_game_player;
    CREATE AGGREGATING INDEX idx_playstats_game_player ON PlayStats (GameID, PlayerID, COUNT(*), AVG(CurrentScore), AVG(CurrentPlayTime));
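
The other option, keeping the existing index and adding a second one, might look like the following sketch. The index name idx_playstats_playtime is hypothetical; the table and columns come from the examples in this article.

```sql
-- Keep idx_playstats_game_player as is and add a second index for play-time queries.
CREATE AGGREGATING INDEX idx_playstats_playtime ON PlayStats (GameID, PlayerID, AVG(CurrentPlayTime));
```

Because aggregating indexes update automatically with the base table, each extra index is likely to add some work at ingest time, so weigh that against the query-time benefit.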

Full Example Code

    SELECT PlayerID, GameID, COUNT(*) AS games_played
    FROM PlayStats
    GROUP BY PlayerID, GameID;

    SELECT APPROX_COUNT_DISTINCT(city_hash(PlayerID, GameID))/COUNT(*)::decimal FROM PlayStats;

    CREATE AGGREGATING INDEX idx_playstats_game_player ON PlayStats (GameID, PlayerID, COUNT(*), AVG(CurrentScore));

    SELECT PlayerID, AVG(CurrentScore)
    FROM PlayStats
    GROUP BY PlayerID;

    EXPLAIN (ANALYZE) SELECT PlayerID, AVG(CurrentScore)
    FROM PlayStats
    GROUP BY PlayerID;

    DROP AGGREGATING INDEX idx_playstats_game_player;

    CREATE AGGREGATING INDEX idx_playstats_game_player ON PlayStats (GameID, PlayerID, COUNT(*), AVG(CurrentScore), AVG(CurrentPlayTime));