How to run transactional DMLs in Firebolt

This article describes Firebolt's best practice for functionally replacing transactions as part of continuous ingestion pipelines.

What cases can benefit from this approach?

This approach fits cases where a user would like to run one or more DML statements in an atomic manner:

  1. Commit changes only when all DML statements execute successfully
  2. Have the ability to roll back in case something goes wrong

Advantages of this approach:

  1. Once "transaction" has completed, changes are reflected instantly and across engines
  2. This approach does not introduce any significant performance degradation in writing to table nor in reading from it.
  3. This approach is generic, and can be applied to all tables

Limitations of this approach:

  1. Requires creating two additional objects for every ingested table: a querying view and a management table
  2. May include a brief transition period, up to a few seconds, during which the last batch's data is missing (while the DELETE is running)

How to run transactional DML

There are two steps to implementing transactional DML: one-time object creation and an ongoing DML sequence.

DDLs - one-time object creation

  1. Create the {table_name}_base table, using the target table DDL plus an additional batch_id field:
CREATE TABLE IF NOT EXISTS {table_name}_base (
  {table_fields},
  batch_id INTEGER
) PRIMARY INDEX {primary_index_fields}, batch_id;
  2. Create the {table_name}_intermediate_batch_config table - a single-value config table:
CREATE TABLE IF NOT EXISTS {table_name}_intermediate_batch_config AS
SELECT -1 AS intermediate_batch_id;
  3. Create the {table_name} view, which will be used by read queries to access the target table while avoiding access to the intermediate batch (if any). We give this view the original table name, so queries can continue to interact with the same object name seamlessly:
CREATE OR REPLACE VIEW {table_name} AS
SELECT *
FROM {table_name}_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config);
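
For illustration, here is what these DDLs could look like for a hypothetical orders table (the table name, columns, and primary index below are assumptions, not part of the pattern):

-- Base table: the original orders DDL plus a batch_id column
CREATE TABLE IF NOT EXISTS orders_base (
  order_id BIGINT,
  customer_id BIGINT,
  order_ts TIMESTAMP,
  amount DOUBLE,
  batch_id INTEGER
) PRIMARY INDEX order_id, batch_id;

-- Config table: holds the single in-flight batch id, -1 when idle
CREATE TABLE IF NOT EXISTS orders_intermediate_batch_config AS
SELECT -1 AS intermediate_batch_id;

-- View with the original table name: hides the in-flight batch from readers
CREATE OR REPLACE VIEW orders AS
SELECT *
FROM orders_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM orders_intermediate_batch_config);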

DMLs

We mimic the beginning of the transaction by setting the newly ingested batch_id in the configuration table, and the end of the transaction by resetting it back to -1.

For example, here's how you would run an INSERT + DELETE:

UPDATE {table_name}_intermediate_batch_config
SET intermediate_batch_id = (SELECT COALESCE(MAX(batch_id) + 1, 1) FROM {table_name}_base);

INSERT INTO {table_name}_base
SELECT
  {fields_to_insert},
  (SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config) AS batch_id
FROM {external_table_name}
WHERE {conditions};

DELETE FROM {table_name}_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config)
  AND {conditions};

UPDATE {table_name}_intermediate_batch_config
SET intermediate_batch_id = -1;
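
Continuing the hypothetical orders example, and assuming the batch arrives via an external table named orders_ext with a date-based condition (all names and conditions here are illustrative), the sequence could look like this:

-- Begin "transaction": reserve the next batch id, hiding it from readers
UPDATE orders_intermediate_batch_config
SET intermediate_batch_id = (SELECT COALESCE(MAX(batch_id) + 1, 1) FROM orders_base);

-- Ingest the new batch, tagged with the reserved batch id
INSERT INTO orders_base
SELECT
  order_id,
  customer_id,
  order_ts,
  amount,
  (SELECT intermediate_batch_id FROM orders_intermediate_batch_config) AS batch_id
FROM orders_ext
WHERE order_ts >= '2023-01-01';

-- Remove the superseded rows, excluding the batch just inserted
DELETE FROM orders_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM orders_intermediate_batch_config)
  AND order_ts >= '2023-01-01';

-- End "transaction": expose the new batch to readers
UPDATE orders_intermediate_batch_config
SET intermediate_batch_id = -1;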

How does it work?

  1. First we UPDATE intermediate_batch_id in the configuration table to MAX(batch_id) + 1. This ensures that any query that interacts with the target table (via the view we created) from this point on will ignore this new batch_id.
  2. Next we INSERT the new batch of data into the target table, adding to each row the batch_id taken from the configuration table.
  3. Next we DELETE from the table, with an extra condition that ensures we do not delete records from the newly ingested batch.
  4. Last we UPDATE intermediate_batch_id in the configuration table back to -1, completing the "transaction". The new batch is now accessible for querying, while the old batch has been deleted.
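
If something goes wrong before the final UPDATE, you can roll back by discarding the intermediate batch and resetting the configuration table. Since reads already ignore the intermediate batch, the rollback is invisible to queries. A minimal sketch, using the same placeholders as above:

-- Roll back: discard the in-flight batch
DELETE FROM {table_name}_base
WHERE batch_id = (SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config);

-- Reset the configuration table back to the idle state
UPDATE {table_name}_intermediate_batch_config
SET intermediate_batch_id = -1;

Note that this fully undoes a failure during or after the INSERT, but if the failure occurred partway through the DELETE step, old-batch rows that were already removed cannot be restored this way.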