The Challenge
Performing aggregations directly during the data ingestion process in Firebolt can lead to inefficient and slow operations, especially with multi-node engines and multiple source files. On a multi-node engine, data from S3 is normally accessed in parallel, with each node processing a file. Aggregating data from the source files often involves merging all data from the files onto a single node. This removes the benefits of parallelization and requires all data to be read from S3 before an insert can begin, causing significant delays.
Solution
To optimize data ingestion:
-
Avoid direct aggregations of source data: These operations can be particularly slow and inefficient.
-
Leverage aggregating indexes: Instead of aggregating during ingestion, use Firebolt’s aggregating indexes to provide the desired aggregations when querying the base table.
Step-by-Step Guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.
In the Ultra Fast Gaming scenario, assume you want to persist an aggregation of playstats into a separate table called playstats_player_range
. Instead of ingesting raw data into that table, make use of an aggregating index on the playstats
table and insert from there.
Step 1: Ingest raw data
COPY INTO playstats (
gameid $1,
playerid $2,
stattime $3,
selectedcar $4,
currentlevel $5,
currentspeed $6,
currentplaytime $7,
currentscore $8,
event $9,
errorcode $10,
tournamentid $11,
source_file_name $SOURCE_FILE_NAME,
source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*' TYPE = PARQUET;
Step 2: Create aggregating index to support new table insert
CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
(player_id,
MAX(current_score),
MIN(current_score));
Step 3: Insert the aggregated data, benefitting from the aggregating index
CREATE TABLE playstats_player_range AS
SELECT player_id,
MAX(current_score) AS max_score,
MIN(current_score) AS min_score
FROM playstats;
Full Example Code
COPY INTO playstats (
gameid $1,
playerid $2,
stattime $3,
selectedcar $4,
currentlevel $5,
currentspeed $6,
currentplaytime $7,
currentscore $8,
event $9,
errorcode $10,
tournamentid $11,
source_file_name $SOURCE_FILE_NAME,
source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*' TYPE = PARQUET;
CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
(player_id,
MAX(current_score),
MIN(current_score));
CREATE TABLE playstats_player_range AS
SELECT player_id,
MAX(current_score) AS max_score,
MIN(current_score) AS min_score
FROM playstats;