Quick Tip: Optimize Data Ingestion by Avoiding Aggregations During the Process

The Challenge

Performing aggregations directly during the data ingestion process in Firebolt can lead to inefficient and slow operations. Aggregating raw data into a table
often involves merging all data from S3 on a single node. This removes the benefits of parallelization and requires all data to be read from S3 before an insert can begin, causing significant delays.

Solution

To optimize data ingestion:

  • Avoid direct aggregations over external tables: These operations can be particularly slow and inefficient.
  • Leverage aggregating indexes: Instead of aggregating during ingestion, use Firebolts aggregating indexes that support parallel intermediate persists, enhancing performance and scalability.

Example

In the Ultra Fast Gaming scenario, assume you want to persist an aggregation of playstats into a separate table called playstats_player_range. Instead of ingesting raw data into that table, make use of an aggregating index on the playstats table and insert from there.

-- Step 1: Ingest raw dataCOPY INTO playstats (  gameid $1,   playerid $2,   stattime $3,   selectedcar $4,   currentlevel $5,   currentspeed $6,   currentplaytime $7,   currentscore $8,   event $9,   errorcode $10,   tournamentid $11,   source_file_name $SOURCE_FILE_NAME,   source_file_timestamp $SOURCE_FILE_TIMESTAMP)FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*'  TYPE = PARQUET;-- Step 2: Create AI to support new table insertCREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats  (player_id,   MAX(current_score),   MIN(current_score));-- Step 3: Insert the aggregated data, benefitting from the AICREATE TABLE playstats_player_range AS SELECT player_id,   MAX(current_score) AS max_score,   MIN(current_score) AS min_scoreFROM playstats;