How to Best Organize S3 Structures for Efficient Ingestion

Introduction

Fast and efficient data ingestion into Firebolt is an important part of providing useful and timely data to users. By the end of this article, you will know how to organize and manage your S3 source data.

TL;DR

  • Strategically organize S3 folders to minimize file listing times.

  • Optimize file sizes to balance memory usage against per-file overhead.

  • Move files post-ingestion to prevent reprocessing and streamline ingestion.

Step-by-step guide

All example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with it, see: Ultra Fast Gaming Data Set

Step 1: Organize S3 Folders

Organizing your data storage can significantly influence the efficiency of data ingestion into Firebolt.

Use logical subfolders within your S3 buckets to group similar files, which reduces the time Firebolt spends listing files during ingestion. Group files by entity type like /games or /playstats, or by ingestion timeframe like /monthly or /yearly.

Example bucket structure, as referenced from COPY statements:

/***************************************************************************/
/* Each file type is in its own folder, so that time is not spent          */
/* listing all files, just those of the type to be imported                */
/***************************************************************************/
COPY INTO games (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/' ...

COPY INTO playstats (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' ...
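
The folder layout described above can also be enforced at write time, by the process that uploads files to S3. The following is a minimal Python sketch, not part of Firebolt: the function name build_key, the default root "gaming/parquet", and the year/month subfolders are illustrative assumptions.

```python
from datetime import date
from pathlib import PurePosixPath


def build_key(entity: str, ingest_date: date, filename: str,
              root: str = "gaming/parquet") -> str:
    """Return an S3 object key grouped by entity, then by year and month.

    The layout <root>/<entity>/<YYYY>/<MM>/<filename> is an assumption
    chosen for illustration; any layout that keeps one entity per folder
    serves the same purpose.
    """
    path = (PurePosixPath(root) / entity
            / f"{ingest_date:%Y}" / f"{ingest_date:%m}" / filename)
    return str(path)


# Hypothetical file name, used only to show the resulting key.
key = build_key("games", date(2024, 5, 17), "games_0001.parquet")
print(key)
```

With keys built this way, an ingestion that targets one entity and one month only needs to list the files under that folder.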

Step 2: Optimize File Sizes and Types

Avoid very large files, because they require significant memory to ingest. Firebolt can ingest them without error, but managing the memory allocations adds overhead, and larger nodes are needed to ingest them efficiently.

Also avoid large numbers of small files, because each additional file adds management overhead on S3. Avoid file counts in the several thousands.

To conclude, file sizes between 500 MB and a few GB are best, with 1 GB as the ideal size.
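
These size guidelines can be checked before files are uploaded. The following is a minimal Python sketch under stated assumptions: the upper bound of 4 GB is one reading of "a few GB", and the names classify and target_file_count are hypothetical helpers, not Firebolt functions.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

MIN_BYTES = 500 * MB    # lower bound from the guideline above
MAX_BYTES = 4 * GB      # assumption: one interpretation of "a few GB"
TARGET_BYTES = 1 * GB   # ideal size from the guideline above


def classify(size_bytes: int) -> str:
    """Label a file size as 'too_small', 'ok', or 'too_large'."""
    if size_bytes < MIN_BYTES:
        return "too_small"
    if size_bytes > MAX_BYTES:
        return "too_large"
    return "ok"


def target_file_count(total_bytes: int) -> int:
    """Return how many files to write so that each is at most about 1 GB."""
    return max(1, math.ceil(total_bytes / TARGET_BYTES))


print(classify(100 * MB))
print(target_file_count(250 * GB))
```

A writer process could use target_file_count to decide how many output files to produce for a given export, and classify to flag existing files that should be merged or split.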

Firebolt performs well across file types, but Parquet is the most efficient format for ingestion.

Step 3: Move Files After Ingestion

After files are ingested successfully, move them to a different folder or bucket that future ingestion scans do not include. This prevents reprocessing and avoids scanning files that are already ingested.
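
S3 has no native move operation, so a move is a copy followed by a delete. The following is a minimal Python sketch of that step, assuming a boto3 S3 client is passed in as client; the function name move_ingested and the example folder names are hypothetical. Note that a single copy_object call handles objects up to 5 GB; larger objects need a multipart copy, which this sketch does not cover.

```python
def move_ingested(client, bucket: str, keys, src_prefix: str, dst_prefix: str):
    """Move already-ingested objects from src_prefix to dst_prefix.

    `client` is expected to behave like a boto3 S3 client, for example
    one created with boto3.client("s3"). Returns (old_key, new_key) pairs
    for every object that was moved.
    """
    moved = []
    for key in keys:
        if not key.startswith(src_prefix):
            continue  # skip keys outside the ingestion folder
        new_key = dst_prefix + key[len(src_prefix):]
        # Copy first, then delete, so a failed copy never loses the source.
        client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        client.delete_object(Bucket=bucket, Key=key)
        moved.append((key, new_key))
    return moved
```

For example, calling move_ingested(client, "my-bucket", ingested_keys, "gaming/parquet/games/", "gaming/ingested/games/") after a successful load would leave the games folder empty for the next ingestion run. The bucket and folder names here are placeholders.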