This article describes Firebolt's best practice for functionally replacing transactions as part of continuous ingestion pipelines.
What cases can benefit from this approach?
In cases where a user would like to run one or more DML statements atomically:
- Commit changes only when all DML statements execute successfully
- Have the ability to roll back in case something goes wrong
Advantages of this approach:
- Once the "transaction" has completed, changes are reflected instantly and across all engines
- This approach does not introduce significant performance degradation when writing to the table or reading from it
- This approach is generic and can be applied to any table
Limitations to using this approach:
- Requires the creation of two additional objects for every ingested table - a querying view and a management (configuration) table
- May include a brief transition period, up to a few seconds, in which the latest batch's data is missing (while the DELETE is running)
How to run transactional DML
There are two steps to implementing transactional DML: one-time object creation (DDLs) and an ongoing DML sequence:
DDLs - one-time object creation
- Create the {table_name}_base table - using the target table DDL plus an additional batch_id field:
CREATE TABLE IF NOT EXISTS {table_name}_base (
    {table_fields},
    batch_id INTEGER
) PRIMARY INDEX {primary_index_fields}, batch_id;
- Create {table_name}_intermediate_batch_config - a one-value config table:
CREATE TABLE IF NOT EXISTS {table_name}_intermediate_batch_config
AS SELECT -1 AS intermediate_batch_id;
- Create the {table_name} view, which will be used by read queries to access the target table while avoiding access to the intermediate batch (if any). We give this view the original table name, so queries can continue to interact with the same object name seamlessly:
CREATE OR REPLACE VIEW {table_name} AS
SELECT * FROM {table_name}_base
WHERE batch_id <>
(SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config);
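To make the placeholders concrete, here is a minimal sketch of the three objects for a hypothetical orders table with columns order_id, customer_id, amount and order_date, and a primary index on order_id (all names and types here are illustrative assumptions, not part of the pattern):
-- Base table: the original columns plus the extra batch_id column
CREATE TABLE IF NOT EXISTS orders_base (
    order_id INTEGER,
    customer_id INTEGER,
    amount DOUBLE PRECISION,
    order_date DATE,
    batch_id INTEGER
) PRIMARY INDEX order_id, batch_id;
-- Single-row configuration table, initialized to -1 (no intermediate batch)
CREATE TABLE IF NOT EXISTS orders_intermediate_batch_config
AS SELECT -1 AS intermediate_batch_id;
-- View that hides the intermediate batch; read queries keep using the name "orders"
CREATE OR REPLACE VIEW orders AS
SELECT * FROM orders_base
WHERE batch_id <>
(SELECT intermediate_batch_id FROM orders_intermediate_batch_config);
Existing read queries such as SELECT * FROM orders WHERE customer_id = 42 keep working unchanged, since the view carries the original table name.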
DMLs - ongoing transactional sequence
We mimic the beginning of the transaction by updating the configuration table with the newly ingested batch_id, and the end of the transaction by resetting it back to -1.
For example, here's how you would run an INSERT + DELETE sequence:
UPDATE {table_name}_intermediate_batch_config SET intermediate_batch_id = (SELECT COALESCE(MAX(batch_id) + 1,1) FROM {table_name}_base);
INSERT INTO {table_name}_base
SELECT
{fields_to_insert},
(SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config) AS batch_id
FROM
{external_table_name}
WHERE
{conditions};
DELETE FROM {table_name}_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM {table_name}_intermediate_batch_config) AND {conditions};
UPDATE {table_name}_intermediate_batch_config SET intermediate_batch_id = -1;
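As an illustration only, the same sequence for the hypothetical orders table sketched above could look like this (the external table name orders_external and the date-based condition are assumptions):
-- Begin the "transaction": reserve the next batch_id in the config table
UPDATE orders_intermediate_batch_config
SET intermediate_batch_id = (SELECT COALESCE(MAX(batch_id) + 1, 1) FROM orders_base);
-- Ingest the new batch, stamping every row with the reserved batch_id
INSERT INTO orders_base
SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    (SELECT intermediate_batch_id FROM orders_intermediate_batch_config) AS batch_id
FROM
    orders_external
WHERE
    order_date = '2023-01-01';
-- Remove the superseded rows, never touching the newly ingested batch
DELETE FROM orders_base
WHERE batch_id <> (SELECT intermediate_batch_id FROM orders_intermediate_batch_config)
AND order_date = '2023-01-01';
-- End the "transaction": make the new batch visible through the orders view
UPDATE orders_intermediate_batch_config SET intermediate_batch_id = -1;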
How does it work?
- First we UPDATE intermediate_batch_id within the configuration table to MAX(batch_id) + 1. This ensures that, from this point on, any query that interacts with the target table (via the view we created) will ignore this new batch id.
- Next we INSERT the new batch of data into the target table, adding to the INSERT statement the batch_id taken from the configuration table.
- Next we DELETE from the table, with an extra condition to verify that we do not delete records from the newly ingested batch.
- Last we UPDATE the configuration table's intermediate_batch_id back to -1, completing the "transaction". The new batch is now accessible for querying, while the old batch has been deleted.
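The rollback mentioned earlier follows the same idea: as long as the final UPDATE has not run, the partially ingested batch is still hidden behind the view, so it can simply be discarded. Here is a minimal sketch, again using the hypothetical orders table (note that once the DELETE of the previous data has completed there is nothing left to restore, so rollback is only meaningful before that point):
-- Rollback: discard the partially ingested (still hidden) batch ...
DELETE FROM orders_base
WHERE batch_id = (SELECT intermediate_batch_id FROM orders_intermediate_batch_config);
-- ... and reset the configuration table, leaving the previous data exposed as before
UPDATE orders_intermediate_batch_config SET intermediate_batch_id = -1;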