Perform an UPSERT in Firebolt

Overview

In many SQL databases, the UPSERT or MERGE command allows you to alter existing data by either inserting a new row or updating an existing row. You may want to perform an upsert if your new file contains a mix of brand new and old records with minor changes. This commonly occurs with customer information tables, where a single day's load has both rows representing new customers and rows representing customers who have changed their address, phone number, or email.

An UPSERT or MERGE command can be replicated in Firebolt by using staging tables, INSERT statements, and DELETE statements. This article explains the process for performing an upsert operation in Firebolt.

How to Perform an UPSERT in Firebolt

Outline of Steps

  1. Create a control table to hold the maximum date in the fact table where you are inserting/updating data, to identify which source data needs to be added.
  2. Create a staging table to contain records in your source file that need to be updated in your destination fact table.
  3. Delete any records in the fact table that also exist in the updates staging table (records that will be updated).
  4. Insert records from the updates staging table into the fact table to replace the deleted records.
  5. Insert any new records from your source file into your destination fact table.
  6. Drop the temporary date control and staging tables.

Detailed How-To

STEP 1: Create date control table

SQL:

CREATE TABLE IF NOT EXISTS control_maxdate AS (  SELECT MAX(source_file_timestamp) AS max_time  FROM fact_table);

Creating a date control table will allow you to identify which source data needs to be added to your table. In this statement, we will be looking for the maximum source_file_timestamp in your existing fact table. Materialize this value as a table at the beginning of your upsert process so that the upsert can be performed as two INSERT statements executed sequentially. This is crucial as the first step, because the maximum date present in the fact table will change after your initial INSERT, which will prevent you from successfully completing the full upsert operation otherwise.

STEP 2: Create updates staging table

SQL:

CREATE TABLE IF NOT EXISTS updates_table AS (  WITH external_table AS (    SELECT      *    FROM <your_external_table>      WHERE source_file_timestamp > (SELECT max_time FROM control_maxdate)      AND <record_id> IN (SELECT DISTINCT <record_id> FROM <your_fact_table>)      GROUP BY <record_id>   )    SELECT     f.* EXCLUDE (f.source_file_name, f.source_file_timestamp)    , e.source_file_name AS source_file_name_new    , e.source_file_timestamp AS source_file_timestamp_new  FROM <your_fact_table> f   INNER JOIN <your_external_table> e    ON f.<record_id> = e.<record_id>    WHERE f.source_file_name <> e.source_file_name);

💡 Remember to replace <record_id>, <your_fact_table> and <your_external_table> with the appropriate values for your data! 💡

This query creates a new fact table that contains only the rows that need to be updated, identifying these rows by comparing the record IDs present in the source data (an external table in Firebolt) to the record IDs present in your fact table. Any record IDs that appear in both tables represent updates.

STEP 3: Delete records in fact table

SQL:

DELETE FROM <your_fact_table>  WHERE <record_id> IN (SELECT DISTINCT <record_id> FROM updates_table); 

💡 Remember to replace <record_id> and <your_fact_table> with the appropriate values for your data! 💡

This query deletes records from your fact table that need to be updated. These will be inserted as new records in the next step.

STEP 4: Insert records from updates staging table

SQL:

INSERT INTO <your_fact_table>  SELECT *   FROM updates_table;

This query inserts the updated records from step 2 into your fact table.

STEP 5: Insert new records from source file

SQL:

INSERT INTO <your_fact_table>SELECT  *  , source_file_name  , source_file_timestampFROM <your_external_table>  WHERE source_file_timestamp > (SELECT max_time FROM control_maxdate)  AND <record_id> NOT IN (SELECT DISTINCT <record_id> FROM <your_fact_table);

💡 Remember to replace <record_id>, <your_fact_table> and <your_external_table> with the appropriate values for your data! 💡

Next, you will insert the records from your source data that are not already represented in your fact table. This query identifies these records by looking for the most recent data from your source file with unique IDs not already present in your fact table.

STEP 6: Drop date control and updates staging tables

SQL:

DROP TABLE control_maxdate;DROP TABLE updates_table;

This command deletes the date control table and the updates staging table so that you can recreate them during your next ingestion.