Introduction
Fast and efficient data ingestion into Firebolt is an important part of providing useful and timely data to users. By the end of this article, you will know how to organize and manage your S3 source data.
TL;DR
- Strategically organize S3 folders to minimize file listing times.
- Optimize file sizes to balance between memory usage and operational speed.
- Move files post-ingestion to prevent reprocessing and streamline ingestion.
Step-by-step guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Data Set
Step 1: Organize S3 Folders
Organizing your data storage can significantly influence the efficiency of data ingestion into Firebolt.
Use logical subfolders within your S3 buckets to group similar files, which reduces the time Firebolt spends listing files during ingestion. Group files by entity type, such as /games or /playstats, or by ingestion timeframe, such as /monthly or /yearly.
Example Bucket structure:
/***************************************************************************/
/* Each file type is in its own folder, so that time is not spent */
/* listing all files, just those of the type to be imported */
/***************************************************************************/
COPY INTO games (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/' ...
COPY INTO playstats (
...)
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/playstats/' ...
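To see why per-entity folders shorten listing, the following Python sketch mimics a prefix-scoped listing over an in-memory list of object keys. It is only an illustration under stated assumptions: the file names (part-0001.parquet and so on) and the helper names are hypothetical, and no call to S3 or Firebolt is made. Only the games/ and playstats/ folder names come from the example above.

```python
from collections import Counter


def top_level_folder(key: str, root: str) -> str:
    """Return the first path segment of `key` below `root`."""
    relative = key[len(root):] if key.startswith(root) else key
    return relative.split("/", 1)[0]


def keys_under(keys, prefix):
    """Mimic a prefix-scoped listing: keep only keys that start with `prefix`."""
    return [k for k in keys if k.startswith(prefix)]


# Hypothetical object keys, invented for this illustration.
ROOT = "gaming/parquet/"
keys = [
    ROOT + "games/part-0001.parquet",
    ROOT + "games/part-0002.parquet",
    ROOT + "playstats/part-0001.parquet",
    ROOT + "playstats/part-0002.parquet",
    ROOT + "playstats/part-0003.parquet",
]

# How many objects sit in each entity folder.
per_folder = Counter(top_level_folder(k, ROOT) for k in keys)

# A load that points at the games/ folder only has to consider these keys.
games_only = keys_under(keys, ROOT + "games/")

print(dict(per_folder))
print(len(games_only), "of", len(keys), "objects fall under the games/ prefix")
```

With a flat layout, every load would have to consider all five keys; with one folder per entity, the games load only touches the two keys under its own prefix.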
Step 2: Optimize File Sizes and Types
Avoid very large files, as they require significant memory to ingest. Larger files can be ingested without error, but managing the memory allocations adds overhead, and large files need larger nodes to be ingested efficiently.
Also avoid many small files, because each additional file adds management overhead on S3. Avoid file counts on the order of several thousand.
In short, file sizes between 500 MB and a few GB work best, with 1 GB as the ideal size.
Firebolt performs well across file types, but Parquet is the most efficient format for data ingestion and will provide the best performance.
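The size guidance in this step can be turned into a quick planning check before files are written to S3. The Python sketch below is a rough illustration, not a Firebolt tool: the function names are hypothetical, and the 3 GB upper bound is an assumption standing in for "a few GB". The 500 MB lower bound and the 1 GB target come from the text above.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

TARGET_BYTES = 1 * GB   # ideal file size named in this step
MIN_BYTES = 500 * MB    # lower bound named in this step
MAX_BYTES = 3 * GB      # assumption: stands in for "a few GB"


def plan_file_count(total_bytes: int, target_bytes: int = TARGET_BYTES) -> int:
    """Number of roughly equal files to write so that none exceeds the target size."""
    return max(1, math.ceil(total_bytes / target_bytes))


def classify(size_bytes: int) -> str:
    """Label a single file's size against the recommended range."""
    if size_bytes < MIN_BYTES:
        return "too small"
    if size_bytes > MAX_BYTES:
        return "too large"
    return "ok"


if __name__ == "__main__":
    total = 250 * GB
    print("files to write:", plan_file_count(total))
    for size in (10 * MB, 1 * GB, 10 * GB):
        print(size, "bytes:", classify(size))
```

Rounding the file count up keeps every file at or below the target size; for any data set larger than 1 GB, the resulting files still stay above the 500 MB lower bound.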
Step 3: Move Files After Ingestion
After successfully ingesting files, move them to a different folder or bucket that isn’t included in future ingestion scans. This reduces unnecessary scanning of already ingested files.
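One possible way to script this move is sketched below in Python. It is a sketch under assumptions rather than a prescribed method: the incoming/ and ingested/ prefixes and the function names are hypothetical, and the client argument is expected to behave like a boto3 S3 client, whose copy_object and delete_object methods are the only calls used. S3 has no native move operation, so the sketch copies each object and then deletes the original.

```python
def archived_key(key: str, source_prefix: str, archive_prefix: str) -> str:
    """Map a key under the ingestion prefix to the same relative path under the archive prefix."""
    if not key.startswith(source_prefix):
        raise ValueError(f"{key!r} is not under {source_prefix!r}")
    return archive_prefix + key[len(source_prefix):]


def move_ingested(client, bucket: str, keys, source_prefix: str, archive_prefix: str):
    """Copy each ingested object to the archive prefix, then delete the original.

    `client` is assumed to expose copy_object and delete_object with the
    keyword arguments of a boto3 S3 client.
    """
    moved = []
    for key in keys:
        dest = archived_key(key, source_prefix, archive_prefix)
        client.copy_object(
            Bucket=bucket,
            Key=dest,
            CopySource={"Bucket": bucket, "Key": key},
        )
        # Delete only after the copy call has returned without raising.
        client.delete_object(Bucket=bucket, Key=key)
        moved.append((key, dest))
    return moved


# Hypothetical usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   move_ingested(boto3.client("s3"), "my-bucket",
#                 ["incoming/games/part-0001.parquet"], "incoming/", "ingested/")
```

Keeping the relative path under the archive prefix preserves the per-entity folder layout, so archived files remain easy to locate while staying outside the prefixes that future ingestion runs scan.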