Use Case
Sometimes we will want to extract a value from the source_file_name in S3 and insert this value into a column in S3.
Let's say we have a table with column date
, and we want to capture the date from the path -
"URL" = 's3://bucket-in-s3/date=2023-01-01.csv'
and insert the date "2023-01-01" into the date
column.
Steps
When creating the EXTERNAL TABLE, define the column as a partition and use regex to capture the date.
Define the column as PARTITION in the external table and use regex to capture the date, for example -
CREATE EXTERNAL TABLE IF NOT EXISTS example_tbl (
"date" DATE NOT NULL PARTITION('.*date=([0-9]+)-([0-9]+)-([0-9]+).*')
"CREDENTIALS" = ("AWS_KEY_ID" = '***' "AWS_SECRET_KEY" = '***')
"OBJECT_PATTERN" = '*.csv'
"TYPE" = ("CSV")
"URL" = 's3://bucket-in-s3/date=2023-01-01.csv'
With this regex we can capture the date - "2023-01-01"
Create a FACT table -
CREATE FACT TABLE IF NOT EXISTS "example_fact_tbl" (
"date" DATE NOT NULL)
PRIMARY INDEX "date"
PARTITION BY "date"
In the case where the file name contains "date=2023-01-01", run an insert query from the EXTERNAL table into the FACT table
INSERT into example_fact_tbl
select * from example_tbl
The value in date
column should be "2023-01-01".
The same can be done on any regex, for example -
"name" TEXT NOT NULL PARTITION('.*name=(firebolt|dw).*'))
The regex above will extract paths like this -
"URL" = 's3://bucket-in-s3/name=firebolt.csv'
OR
"URL" = 's3://bucket-in-s3/name=dw.csv'
Note, another way to extract value from the source_file_name column is by using SPLIT_PART function, as explained in our Docs.