Quick Tip: Optimize Data Ingestion by Avoiding Aggregations During the Process

The Challenge

Performing aggregations directly during the data ingestion process in Firebolt can lead to inefficient and slow operations, especially with multi-node engines and multiple source files. On a multi-node engine, data from S3 is normally accessed in parallel, with each node processing a file. Aggregating data from the source files often involves merging all data from the files onto a single node. This removes the benefits of parallelization and requires all data to be read from S3 before an insert can begin, causing significant delays.

Solution

To optimize data ingestion:

  • Avoid direct aggregations of source data: These operations can be particularly slow and inefficient.

  • Leverage aggregating indexes: Instead of aggregating during ingestion, use Firebolt’s aggregating indexes to provide the desired aggregations when querying the base table.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

In the Ultra Fast Gaming scenario, assume you want to persist an aggregation of playstats into a separate table called playstats_player_range. Instead of ingesting raw data into that table, make use of an aggregating index on the playstats table and insert from there.

Step 1: Ingest raw data

COPY INTO playstats (
  gameid $1, 
  playerid $2, 
  stattime $3, 
  selectedcar $4, 
  currentlevel $5, 
  currentspeed $6, 
  currentplaytime $7, 
  currentscore $8, 
  event $9, 
  errorcode $10, 
  tournamentid $11, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*'  TYPE = PARQUET;

Step 2: Create aggregating index to support new table insert

CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
  (player_id, 
  MAX(current_score), 
  MIN(current_score));

Step 3: Insert the aggregated data, benefitting from the aggregating index

CREATE TABLE playstats_player_range AS 
SELECT player_id, 
  MAX(current_score) AS max_score, 
  MIN(current_score) AS min_score
FROM playstats;

Full Example Code

COPY INTO playstats (
  gameid $1, 
  playerid $2, 
  stattime $3, 
  selectedcar $4, 
  currentlevel $5, 
  currentspeed $6, 
  currentplaytime $7, 
  currentscore $8, 
  event $9, 
  errorcode $10, 
  tournamentid $11, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*'  TYPE = PARQUET;

CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
  (player_id, 
  MAX(current_score), 
  MIN(current_score));

CREATE TABLE playstats_player_range AS 
SELECT player_id, 
  MAX(current_score) AS max_score, 
  MIN(current_score) AS min_score
FROM playstats;