How to optimize query performance by choosing the right Primary Index in Firebolt

Introduction

Choosing the correct primary index in Firebolt is crucial for optimizing query performance by reducing the data scanned during queries. By the end of this article, you will know how to choose the optimal primary index for your specific use case.

TL;DR

  • Select columns used in highly selective WHERE clauses.

  • Start with low-cardinality columns.

  • Limit the number of columns in the primary index to essential ones.

  • Avoid complex expressions in WHERE clauses for indexed columns.

  • Use CALL RECOMMEND_DDL for automatic recommendations, based on your own queries.

  • Primary index cannot be changed after the table is created.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Sample Dataset.

Step 1: Identify Highly Selective Columns

The most critical columns for primary indexes are those frequently used in WHERE clauses with highly selective predicates. These columns allow Firebolt to prune data efficiently by reducing the number of rows scanned. Review your most frequent queries to identify your highly selective predicates.

Example SQL code:

-- Common select query, illustrates selectivity
SELECT * FROM players
WHERE registeredon >'2022-08-25';

-- Table definition, including the primary index. agecategory and registeredon are both highly selective columns
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX agecategory, registeredon;

Step 2: Start with Low-Cardinality Columns

Low-cardinality columns (columns with few distinct values) should be at the beginning of the primary index. This ensures long runs of ordered data, which enhances data pruning efficiency. As a rule of thumb, try to limit to columns with no more than a few dozen distinct values.

Example SQL code:

-- Low-Cardinality field:
SELECT APPROX_COUNT_DISTINCT(agecategory) FROM players;
-- Result: 5

-- High-Cardinality field, which was not selected for the primary key:
SELECT APPROX_COUNT_DISTINCT(playerid)/COUNT(*)::decimal FROM players;
-- Result: 1.001, meaning playerid is fully unique and therefore won't help with data pruning

Step 3: Limit the Number of Columns in the Primary Index

While it might be tempting to include many columns in the primary index, doing so can reduce the efficiency of data pruning when filtering by the later columns in the primary index. Aim to include only the most essential columns that significantly improve query performance and choose a maximum of 5 to 10 columns, depending on your data and queries pattern.

Example SQL code:

-- Keeping the primary index simple with the most crucial columns
CREATE TABLE playstats (gameid integer NULL, playerid integer NULL, stattime timestampntz NULL, selectedcar text NULL, currentlevel integer NULL, currentspeed real NULL, currentplaytime bigint NULL, currentscore bigint NULL, event text NULL, errorcode text NULL, tournamentid integer NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX gameid, tournamentid, playerid, stattime;

Step 4: Avoid Complex Expressions in WHERE Clauses

Ensure that indexed columns are used directly in WHERE clauses without complex expressions. Firebolt cannot utilize the index if the column values are transformed in the query.

Example SQL code:

-- Avoid using expressions on the primary index column
SELECT * FROM players
WHERE SPLIT_PART(agecategory, '-',2) = '28';

Step 5: Use CALL RECOMMEND_DDL for Automatic Recommendations

Firebolt provides a built-in command, called recommend_ddl, to recommend optimal primary index and partition key configurations based on your workload. As your workload changes over time, it can be beneficial to run the CALL recommend_ddl command periodically to check for better table configurations.

Example SQL code:

-- This command provides tailored recommendations for primary index to players table
CALL recommend_ddl(players, (SELECT query_text FROM information_schema.engine_query_history WHERE start_time > NOW() - INTERVAL '1 week'));

Step 6: Note that Primary Index Cannot Be Changed After Table Creation

Once a table is created with a primary index, it cannot be altered. To change the primary index, you need to drop and recreate the table.

Example SQL code:

DROP TABLE IF EXISTS players CASCADE;

-- Create table with the new primary index
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX issubscribedtonewsletter, agecategory, registeredon;

-- Ingest data to the table
COPY INTO players (
  playerid $1, 
  nickname $2, 
  email $3, 
  agecategory $4, 
  platforms $5, 
  registeredon $6, 
  issubscribedtonewsletter $7, 
  internalprobabilitytowin $8, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/' WITH PATTERN = '*'  TYPE = PARQUET;

Full Example Code

SELECT * FROM players
WHERE registeredon >'2022-08-25';


CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX agecategory, registeredon;


SELECT APPROX_COUNT_DISTINCT(agecategory) FROM players;
SELECT APPROX_COUNT_DISTINCT(playerid)/COUNT(*)::decimal FROM players;

CREATE TABLE playstats (gameid integer NULL, playerid integer NULL, stattime timestampntz NULL, selectedcar text NULL, currentlevel integer NULL, currentspeed real NULL, currentplaytime bigint NULL, currentscore bigint NULL, event text NULL, errorcode text NULL, tournamentid integer NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX gameid, tournamentid, playerid, stattime;


SELECT * FROM players
WHERE SPLIT_PART(agecategory, '-',2) = '28';

CALL recommend_ddl(players, (SELECT query_text FROM information_schema.engine_query_history WHERE start_time > NOW() - INTERVAL '1 week'));


DROP TABLE IF EXISTS players CASCADE;

CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL) 
PRIMARY INDEX issubscribedtonewsletter, agecategory, registeredon;

COPY INTO players (
  playerid $1, 
  nickname $2, 
  email $3, 
  agecategory $4, 
  platforms $5, 
  registeredon $6, 
  issubscribedtonewsletter $7, 
  internalprobabilitytowin $8, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/' WITH PATTERN = '*'  TYPE = PARQUET;