Introduction
Choosing the correct primary index in Firebolt is crucial for optimizing query performance by reducing the data scanned during queries. By the end of this article, you will know how to choose the optimal primary index for your specific use case.
TL;DR
-
Select columns used in highly selective
WHERE
clauses. -
Include join key columns for fact tables in star schemas.
-
Start with low-cardinality columns.
-
Avoid complex expressions in
WHERE
clauses for indexed columns. -
Use
CALL RECOMMEND_DDL
for automatic recommendations, based on your own queries. -
Primary index cannot be changed after the table is created.
Step-by-Step Guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Sample Dataset.
Step 1: Identify Highly Selective Columns
The most critical columns for primary indexes are those frequently used in WHERE
clauses with highly selective predicates. These columns allow Firebolt to prune data efficiently by reducing the number of rows scanned. Review your most frequent queries to identify your highly selective predicates.
Example SQL code:
-- Common select query, illustrates selectivity
SELECT * FROM players
WHERE registeredon >'2022-08-25';
-- Table definition, including the primary index. agecategory and registeredon are both highly selective columns
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX agecategory, registeredon;
Step 2: Include Join Key Columns in Fact Tables
In star schemas, including join key columns (foreign keys) in the primary index of fact tables improves query performance by enabling efficient data pruning during joins.
Example SQL code:
-- Commonly used JOIN
SELECT p.playerid, p.agecategory, ps.currentlevel, avg(currentscore) as avg_currentscore
FROM playstats ps
JOIN players p ON p.playerid = ps.playerid
WHERE p.agecategory = '21-28'
GROUP BY ALL
-- Choosing playerid as a primary index column on the FACT table
CREATE TABLE playstats (gameid integer NULL, playerid integer NULL, stattime timestampntz NULL, selectedcar text NULL, currentlevel integer NULL, currentspeed real NULL, currentplaytime bigint NULL, currentscore bigint NULL, event text NULL, errorcode text NULL, tournamentid integer NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX gameid, tournamentid, playerid, stattime;
Step 3: Start with Low-Cardinality Columns
Low-cardinality columns (columns with few distinct values) should be at the beginning of the primary index. This ensures long runs of ordered data, which enhances data pruning efficiency. As a rule of thumb, try to limit to columns with no more than a few dozen distinct values.
Example SQL code:
`-- Low-Cardinality field:`SELECT APPROX_COUNT_DISTINCT(agecategory) FROM players;
-- Result: 5
-- High-Cardinality field, which was not selected for the primary key:
SELECT APPROX_COUNT_DISTINCT(playerid)/COUNT(*)::decimal FROM players;
-- Result: 1.001, meaning playerid is fully unique and therefore won't help with data pruning
Step 4: Avoid Complex Expressions in WHERE
Clauses
Ensure that indexed columns are used directly in WHERE
clauses without complex expressions that transform the column values. Firebolt cannot utilize the index if the column values are transformed in the query.
If you anticipate needing a function in a predicate, it’s a good idea to create a virtual column to store the function’s result. You can then incorporate this virtual column into your primary index and queries. See examples on how to add a virtual column here.
Example SQL code:
-- Avoid using expressions on the primary index column
SELECT * FROM players
WHERE SPLIT_PART(agecategory, '-',2) = '28';
Step 5: Use CALL RECOMMEND_DDL
for Automatic Recommendations
Firebolt provides a built-in command, called recommend_ddl, to recommend optimal primary index and partition key configurations based on your workload. As your workload changes over time, it can be beneficial to run the CALL recommend_ddl
command periodically to check for better table configurations.
Example SQL code:
-- This command provides tailored recommendations for primary index to players table
CALL recommend_ddl(players, (SELECT query_text FROM information_schema.engine_query_history WHERE start_time > NOW() - INTERVAL '1 week'));
Step 6: Note that Primary Index Cannot Be Changed After Table Creation
Once a table is created with a primary index, it cannot be altered. To change the primary index, you need to drop and recreate the table.
Example SQL code:
DROP TABLE IF EXISTS players CASCADE;
-- Create table with the new primary index
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX issubscribedtonewsletter, agecategory, registeredon;
-- Ingest data to the table
COPY INTO players (
playerid $1,
nickname $2,
email $3,
agecategory $4,
platforms $5,
registeredon $6,
issubscribedtonewsletter $7,
internalprobabilitytowin $8,
source_file_name $SOURCE_FILE_NAME,
source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/' WITH PATTERN = '*' TYPE = PARQUET;
Full Example Code
SELECT * FROM players
WHERE registeredon >'2022-08-25';
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX agecategory, registeredon;
SELECT p.playerid, p.agecategory, ps.currentlevel, avg(currentscore) as avg_currentscore
FROM playstats ps
JOIN players p ON p.playerid = ps.playerid
WHERE p.agecategory = '21-28'
GROUP BY ALL
CREATE TABLE playstats (gameid integer NULL, playerid integer NULL, stattime timestampntz NULL, selectedcar text NULL, currentlevel integer NULL, currentspeed real NULL, currentplaytime bigint NULL, currentscore bigint NULL, event text NULL, errorcode text NULL, tournamentid integer NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX gameid, tournamentid, playerid, stattime;
SELECT APPROX_COUNT_DISTINCT(agecategory) FROM players;
SELECT APPROX_COUNT_DISTINCT(playerid)/COUNT(*)::decimal FROM players;
SELECT * FROM players
WHERE SPLIT_PART(agecategory, '-',2) = '28';
CALL recommend_ddl(players, (SELECT query_text FROM information_schema.engine_query_history WHERE start_time > NOW() - INTERVAL '1 week'));
DROP TABLE IF EXISTS players CASCADE;
CREATE DIMENSION TABLE players (playerid integer NULL, nickname text NULL, email text NULL, agecategory text NULL, platforms array(text NULL) NULL, registeredon pgdate NULL, issubscribedtonewsletter boolean NULL, internalprobabilitytowin double precision NULL, source_file_name text NULL, source_file_timestamp timestampntz NULL)
PRIMARY INDEX issubscribedtonewsletter, agecategory, registeredon;
COPY INTO players (
playerid $1,
nickname $2,
email $3,
agecategory $4,
platforms $5,
registeredon $6,
issubscribedtonewsletter $7,
internalprobabilitytowin $8,
source_file_name $SOURCE_FILE_NAME,
source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/players/' WITH PATTERN = '*' TYPE = PARQUET;