Quick Tip: Optimize Data Ingestion by Avoiding Aggregations During the Process

FireboltAutomations · November 22, 2022, 6:55am

The Challenge

Performing aggregations directly during the data ingestion process in Firebolt can lead to inefficient and slow operations, especially with multi-node engines and multiple source files. On a multi-node engine, data from S3 is normally accessed in parallel, with each node processing a file. Aggregating data from the source files often involves merging all data from the files onto a single node. This removes the benefits of parallelization and requires all data to be read from S3 before an insert can begin, causing significant delays.

Solution

To optimize data ingestion:

Avoid direct aggregations of source data: These operations can be particularly slow and inefficient.
Leverage aggregating indexes: Instead of aggregating during ingestion, use Firebolt’s aggregating indexes to provide the desired aggregations when querying the base table.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

In the Ultra Fast Gaming scenario, assume you want to persist an aggregation of playstats into a separate table called playstats_player_range. Instead of ingesting raw data into that table, make use of an aggregating index on the playstats table and insert from there.

Step 1: Ingest raw data

COPY INTO playstats (
  gameid $1, 
  playerid $2, 
  stattime $3, 
  selectedcar $4, 
  currentlevel $5, 
  currentspeed $6, 
  currentplaytime $7, 
  currentscore $8, 
  event $9, 
  errorcode $10, 
  tournamentid $11, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*'  TYPE = PARQUET;

Step 2: Create aggregating index to support new table insert

CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
  (player_id, 
  MAX(current_score), 
  MIN(current_score));

Step 3: Insert the aggregated data, benefitting from the aggregating index

CREATE TABLE playstats_player_range AS 
SELECT player_id, 
  MAX(current_score) AS max_score, 
  MIN(current_score) AS min_score
FROM playstats;

Full Example Code

COPY INTO playstats (
  gameid $1, 
  playerid $2, 
  stattime $3, 
  selectedcar $4, 
  currentlevel $5, 
  currentspeed $6, 
  currentplaytime $7, 
  currentscore $8, 
  event $9, 
  errorcode $10, 
  tournamentid $11, 
  source_file_name $SOURCE_FILE_NAME, 
  source_file_timestamp $SOURCE_FILE_TIMESTAMP)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' WITH PATTERN = '*'  TYPE = PARQUET;

CREATE AGGREGATING INDEX idx_playerstats_player_range ON playstats
  (player_id, 
  MAX(current_score), 
  MIN(current_score));

CREATE TABLE playstats_player_range AS 
SELECT player_id, 
  MAX(current_score) AS max_score, 
  MIN(current_score) AS min_score
FROM playstats;

Topic		Replies	Views
Engine Sizing for Simple Ingestion	10	304	July 16, 2024
Query Diagnosis and Troubleshooting Help Center	4	543	August 3, 2024
Tips for a successful Firebolt evaluation	4	337	August 7, 2024
Simplifying SQL query analysis with EXPLAIN (ANALYZE)	0	282	June 20, 2024
Troubleshooting CSV file ingestion Help Center	0	396	September 1, 2023