How to Best Organize S3 Structures for Efficient Ingestion

FireboltAutomations · May 9, 2024, 11:29am

Introduction

Fast and efficient data ingestion into Firebolt is an important part of providing useful and timely data to users. By the end of this article, you will know how to organize and manage your S3 source data.

TL;DR

Strategically organize S3 folders to minimize file listing times.
Size files to balance memory usage (very large files) against per-file overhead (many small files).
Move files post-ingestion to prevent reprocessing and streamline ingestion.

Step-by-step guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with it, see the Ultra Fast Gaming Data Set post.

Step 1: Organize S3 Folders

Organizing your data storage can significantly influence the efficiency of data ingestion into Firebolt.

Use logical subfolders within your S3 buckets to group similar files, which reduces the time Firebolt spends listing files during ingestion. Group files by entity type like /games or /playstats, or by ingestion timeframe like /monthly or /yearly.

Example bucket structure, with one folder per entity type:

/***************************************************************************/
/* Each file type is in its own folder, so that time is not spent          */
/* listing all files, just those of the type to be imported                */  
/***************************************************************************/
COPY INTO games (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'...

COPY INTO playstats (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/'...
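To make the effect of per-entity folders concrete, here is a minimal Python sketch. It only simulates a prefix listing over an in-memory list of keys; the helper names `entity_prefix` and `keys_listed`, the year/month subfolder layout, and the sample keys are all hypothetical and not part of Firebolt or S3.

```python
from datetime import date
from typing import Iterable, List, Optional


def entity_prefix(entity: str, month: Optional[date] = None) -> str:
    """Build a key prefix for one entity type, optionally narrowed to a year/month."""
    prefix = f"{entity.strip('/')}/"
    if month is not None:
        # Assumed layout: <entity>/<YYYY>/<MM>/
        prefix += f"{month:%Y/%m}/"
    return prefix


def keys_listed(all_keys: Iterable[str], prefix: str) -> List[str]:
    """Simulate a prefix listing: only keys under the prefix are enumerated."""
    return [key for key in all_keys if key.startswith(prefix)]


# Hypothetical bucket contents.
keys = [
    "games/2024/05/part-0001.parquet",
    "playstats/2024/04/part-0001.parquet",
    "playstats/2024/05/part-0001.parquet",
    "playstats/2024/05/part-0002.parquet",
]

may_playstats = keys_listed(keys, entity_prefix("playstats", date(2024, 5, 1)))
print(may_playstats)
```

With this layout, an ingestion that points at the playstats folder for one month only has to list two of the four files. The narrower the prefix, the less listing work per ingestion.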

Step 2: Optimize File Sizes and Types

Avoid very large files, because they require significant memory to ingest. Firebolt can ingest larger files without error, but managing the memory allocations adds overhead, and large files need larger nodes to be ingested efficiently.

Also avoid splitting data into many small files, because every additional file adds management overhead on S3. Avoid file counts on the order of several thousand.

In short, file sizes between 500 MB and a few GB work best, with 1 GB as the ideal size.
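As a rough illustration of that guidance, the sketch below estimates how many files to write for a given data volume so that each file lands near the 1 GB target. The function name `plan_file_count` is hypothetical, and treating 1 GB as 1024³ bytes is an assumption made for the example.

```python
import math

GIB = 1024 ** 3               # assumption: 1 GB treated as 1024^3 bytes
TARGET_FILE_BYTES = 1 * GIB   # the ideal file size named above


def plan_file_count(total_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many files to write so each one is at most target_bytes."""
    if total_bytes <= 0:
        return 0
    return max(1, math.ceil(total_bytes / target_bytes))


total = 250 * GIB
count = plan_file_count(total)
print(count, "files of about", round(total / count / GIB, 2), "GiB each")
```

If the export tool writes many more files than this estimate, the files are likely smaller than the recommended range, and compacting them before ingestion may be worth considering.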

Firebolt performs well across file types, but Parquet is the most efficient format for ingestion.

Step 3: Move Files After Ingestion

After successfully ingesting files, move them to a different folder or bucket that isn’t included in future ingestion scans. This reduces unnecessary scanning of already ingested files.
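One way to sketch the move in Python is shown below. S3 has no native move operation, so a move is a copy followed by a delete. The helpers `processed_key` and `move_object` and the `incoming/` and `processed/` prefixes are hypothetical; `s3_client` is assumed to be a boto3 S3 client (or any object exposing the same `copy_object` and `delete_object` calls), which must be created separately.

```python
from typing import Any


def processed_key(key: str, source_prefix: str, processed_prefix: str) -> str:
    """Map a key under the ingestion prefix to the same path under the processed prefix."""
    if not key.startswith(source_prefix):
        raise ValueError(f"{key!r} is not under {source_prefix!r}")
    return processed_prefix + key[len(source_prefix):]


def move_object(s3_client: Any, bucket: str, key: str, new_key: str) -> None:
    """Move one object within a bucket: copy to the new key, then delete the original."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    # Delete only after the copy call has returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=key)


# Hypothetical layout: files arrive under incoming/ and are archived under processed/.
new_key = processed_key("incoming/games/part-0001.parquet", "incoming/", "processed/")
print(new_key)
```

The processed prefix should sit outside the folder that ingestion reads from, otherwise the moved files are still listed. Note that a single `copy_object` call handles objects up to 5 GB, which covers the file sizes recommended in Step 2; larger objects would need a multipart copy.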
