How to extract a value from the file name/S3 path and insert it into a column in the table

Use Case

Sometimes we will want to extract a value from the source_file_name in S3 and insert this value into a column in S3.
Let's say we have a table with column date, and we want to capture the date from the path -

"URL" = 's3://bucket-in-s3/date=2023-01-01.csv'

and insert the date "2023-01-01" into the date column.

Steps

When creating the EXTERNAL TABLE, define the column as a partition and use regex to capture the date.

  1. Define the column as PARTITION in the external table and use regex to capture the date, for example -

CREATE EXTERNAL TABLE IF NOT EXISTS example_tbl ("date" DATE NOT NULL PARTITION('.*date=([0-9]+)-([0-9]+)-([0-9]+).*')"CREDENTIALS" = ("AWS_KEY_ID" = '***' "AWS_SECRET_KEY" = '***') "OBJECT_PATTERN" = '*.csv' "TYPE" = ("CSV") "URL" = 's3://bucket-in-s3/date=2023-01-01.csv'

With this regex we can capture the date - "2023-01-01"

  1. Create a FACT table -

CREATE FACT TABLE IF NOT EXISTS "example_fact_tbl" ("date" DATE NOT NULL)PRIMARY INDEX "date"PARTITION BY "date"
  1. In the case where the file name contains "date=2023-01-01", run an insert query from the EXTERNAL table into the FACT table

INSERT into example_fact_tblselect * from example_tbl

The value in date column should be "2023-01-01".

The same can be done on any regex, for example -

"name" TEXT NOT NULL PARTITION('.*name=(firebolt|dw).*')) 

The regex above will extract paths like this -

"URL" = 's3://bucket-in-s3/name=firebolt.csv'OR"URL" = 's3://bucket-in-s3/name=dw.csv'

Note, another way to extract value from the source_file_name column is by using SPLIT_PART function, as explained in our Docs.