How to Optimize Data Ingestion into Firebolt

Introduction

Effective data ingestion into Firebolt is essential for maintaining optimal database performance. This article outlines best practices for structuring and managing your data ingestion processes from S3 buckets into Firebolt. By the end of this article, you will know how to refine your approach across three key areas: S3 Buckets and Files, Firebolt Engines, and Incremental Ingest.

TL;DR

  • Strategically organize S3 buckets to minimize file listing times.
  • Optimize file sizes to balance memory usage against operational speed.
  • Move files post-ingestion to prevent reprocessing and streamline ingestion.
  • Utilize compression to reduce storage needs and improve data transfer speeds.
  • Configure Firebolt engines appropriately to handle your specific data load efficiently.
  • Leverage incremental ingest techniques for effective data management.

Step-by-step guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, see Ultra Fast Gaming Data Set.

Step 1: Optimize S3 Buckets and Files

Organizing your data storage and managing file sizes effectively can significantly influence the efficiency of data ingestion into Firebolt. Here are detailed recommendations:

  • Subfolder Organization: Use logical subfolders within your S3 buckets to group similar files, which reduces the time Firebolt spends listing files during ingestion. Group files by entity type like /orders or /users, or by ingestion timeframe like /monthly or /yearly.

Example bucket structure:

    /***************************************************************************/
    /* Each file type is in its own folder, so that time is not spent          */
    /* listing all files, just those of the type to be imported.               */
    /***************************************************************************/
    COPY INTO games (
    ...)
    FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/' ...

    COPY INTO playstats (
    ...)
    FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' ...
  • File Size Management: Avoid very large files, as they require significant memory to ingest. While larger files can be handled without error, managing the memory allocations adds overhead, and large files need larger nodes to be ingested most efficiently. Conversely, too many small files increase the overhead of listing and managing them on S3.
  • Post-Ingestion File Handling: After successfully ingesting files, move them to a different folder or bucket that isn’t included in future ingestion scans. This avoids re-ingesting the same data and reduces unnecessary scanning.
  • Compression Use: Employ compressed file formats (see Supported file formats and Compression) to decrease the volume of data transferred during ingestion and to save on storage. A sketch of loading compressed files follows this list.
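As a minimal sketch of compression in practice, the statement below loads gzip-compressed CSV files. The bucket path is hypothetical, and it assumes the files carry a .gz extension and include a header row; adapt the pattern and options to your environment.

    /* Hypothetical example: ingesting gzip-compressed CSV files.             */
    /* The bucket path is a placeholder; the PATTERN restricts the listing    */
    /* to the compressed files only.                                          */
    COPY INTO playstats
    FROM 's3://my-example-bucket/gaming/csv/playstats/'
    WITH PATTERN = '*.gz' TYPE = CSV HEADER = TRUE;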

Step 2: Configure Firebolt Engines Appropriately

See Engine Sizing for Simple Ingestion for guidance on sizing engines for ingesting data. Generally speaking, more nodes ingest data faster than fewer nodes: large numbers of small files benefit from many smaller nodes, while small numbers of large files benefit from fewer, larger nodes.
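As an illustration only, the statements below create and resize a multi-node engine for an ingestion workload. The engine name and option values are placeholders, and the exact options available (TYPE, NODES) depend on your Firebolt version, so treat this as a sketch rather than canonical syntax.

    /* Hypothetical example: a dedicated engine for ingestion.                */
    /* The engine name and option values are placeholders.                    */
    CREATE ENGINE IF NOT EXISTS ingestion_engine
    WITH TYPE = 'S' NODES = 4;   -- many smaller nodes suit many small files

    /* Rescale the same engine for a few very large files. */
    ALTER ENGINE ingestion_engine SET TYPE = 'L' NODES = 2;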

Step 3: Implement Incremental Ingest Techniques

To manage data efficiently and avoid reprocessing the same information, use incremental ingestion techniques alongside the post-ingestion file moves described in Step 1. These methods ingest only data that is new or changed since the last load. See Ingestion Tip - Incremental Loads and Using Metadata virtual columns for more information.

Example SQL code:

    -- Example: ingest only files newer than what has already been loaded,
    -- using the $SOURCE_FILE_TIMESTAMP metadata virtual column.
    COPY INTO playstats (
      gameid $1,
      playerid $2,
      stattime $3,
      selectedcar $4,
      currentlevel $5,
      currentspeed $6,
      currentplaytime $7,
      currentscore $8,
      event $9,
      errorcode $10,
      tournamentid $11,
      source_file_name $SOURCE_FILE_NAME,
      source_file_timestamp $SOURCE_FILE_TIMESTAMP)
    FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/'
    WITH PATTERN = '*' TYPE = PARQUET
    WHERE $SOURCE_FILE_TIMESTAMP > (SELECT MAX(source_file_timestamp) FROM playstats);

Step 4: Avoid Aggregations on External Data

When ingesting data, it is possible to write an INSERT query that aggregates the external data before inserting it into the final table. However, aggregating raw data in this way often forces all of the data from S3 through a single node. This removes the benefits of parallelization and requires all data to be read from S3 before the insert can begin, causing significant delays. Instead, load the raw data first and aggregate it inside Firebolt, as in the sketch below. See Optimize Data Ingestion by Avoiding Aggregations for more information.
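The following sketch contrasts the two approaches. The external table ext_playstats and the aggregate table player_totals are hypothetical names used only for illustration; the raw playstats load mirrors the earlier examples.

    /* Anti-pattern (avoid): aggregating straight from an external table      */
    /* funnels all S3 data through a single node before the insert starts.    */
    -- INSERT INTO player_totals
    -- SELECT gameid, playerid, SUM(currentscore)
    -- FROM ext_playstats              -- hypothetical external table
    -- GROUP BY gameid, playerid;

    /* Recommended: first ingest the raw rows in parallel ...                 */
    COPY INTO playstats
    FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/'
    WITH PATTERN = '*' TYPE = PARQUET;

    /* ... then aggregate from the internal table, where the work is          */
    /* distributed across the engine's nodes.                                 */
    INSERT INTO player_totals          -- hypothetical aggregate table
    SELECT gameid, playerid, SUM(currentscore) AS total_score
    FROM playstats
    GROUP BY gameid, playerid;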