Ingestion Best Practices

Introduction

Data ingestion into Firebolt involves reading data from files stored in an S3 bucket and inserting it into a Firebolt database table. There are five basic steps (see Data Management Lifecycle for more details).

  1. Create an S3 bucket and store files for each data entity in it.
  2. Create a Firebolt database and engine in which to store the data.
  3. Define an external table to connect to each set of files for a data entity.
  4. Define a Firebolt table for each data entity.
  5. Insert into a Firebolt table from the external table.

s3 buckets and files

Data to be ingested into Firebolt will be stored in files located in an S3 bucket. Here are some guidelines:

  • Use subfolders to separate groups of like files.
    • Based on the URL in the external table definition (see Work with External Tables for more details), Firebolt will get a list of files to read from. Getting that list from S3 might take time, so having the files in separate folders can reduce the number of files to be listed.
    • The folders could be for each entity (e.g., /items or /orders), or for a timeframe (e.g., /current or /month) to be ingested, or some other logical grouping. Do not use a folder structure whose name will change with each ingest, as that will require dropping and re-creating the external table.
  • Don't make individual files too big or too small.
    • Firebolt uses memory to store the contents of a file and needs to allocate memory equal to the uncompressed size of a file. Large files (10 GB+) will require larger engines.
    • It takes time for S3 to provide a list of files to be ingested. Files that are small can result in a large number of files (tens or hundreds of thousands) being stored in S3, slowing the ingest.
  • Move files after they are ingested.
    • At some point after files are successfully ingested, they should be moved to a separate folder or bucket that is not referenced by an external table. That will reduce the number of S3 files the firebolt has to list during an ingest.
    • They are not automatically moved out of the bucket/folder. They will be ingested again unless some sort of incremental load filter is used (see Ingestion Tip - Incremental Loads and Using Metadata virtual columns).
  • Use compressed files.
    • Some of the supported file types (see Supported file formats) are already compressed (parquet, orc, etc.). For other types (csv, tsv, json), Firebolt can read files using GZIP compression (see Compression). The compression does not slow down the ingest, and there are some storage savings with the smaller files.

Firebolt engine

See Resource Planning: Ingesting Data in Firebolt for guidance on sizing engines for ingesting data.

External table

External tables are pointers to files that contain data (see Create External Table). SQL can be used to see the data in the files. Here are some guidelines:

  • Do not use external tables in analytical SELECT queries. They are not actual tables and it is slow to access data associated with them - there are no indexes and data is being read line-by-line from the associated files.
  • It is not necessary to drop/create external tables when adding/deleting files from S3. The files are dynamically accessed at the time data is selected from them.
  • Files are not automatically loaded into Firebolt when they are associated with an external table. An insert must be executed after the file is put into the S3 bucket.

Firebolt table AND inserts

A Firebolt table is a collection of data stored in a proprietary format that can be accessed using SQL. See Working with Tables for more details. Here are some guidelines: