Introduction
Using Firebolt’s COPY FROM
command with Automatic Schema Discovery simplifies data ingestion by automatically mapping columns from external data files and creating the destination table to match it. It is especially useful when the source files are in a format that includes metadata, such as Parquet. For formats that do not include metadata, such as CSV
, Automatic Schema Discovery infers column types based on the data content itself. While this process aims to accurately identify data types, it operates on a “best effort” basis. Automatic Schema Discovery saves time when handling data sources with a large number of columns or unknown schemas. By the end of this article, you will know how to configure and execute COPY FROM
with auto-schema discovery.
TL;DR
-
Auto Schema Discovery is in effect when
AUTO_CREATE = TRUE
on the COPY FROM command. It is TRUE by default, so it is not necessary to specify it. -
If the source file does not include metadata (e.g.,
CSV
orTSV
), the column data types will be inferred based on an initial sample of the file. If the source file does include metadata (e.g.PARQUET
), the column data types from the metadata will be used. -
If the destination table does not exist,
COPY FROM
will create it when the data is ingested. If it does exist, the AUTO_CREATE option will be ignored. The table will not be altered, and data ingest will proceed into the existing table.
Step-by-Step Guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.
Step 1: MAKE SURE the TABLE DOES NOT EXIST
Ensure that the table into which the data will be ingested into does not exist. This ensures that the table will be created based on the data in the files.
Example SQL code:
DROP TABLE IF EXISTS games_auto;
Step 2: EXECUTE the COPY FROM
Command
Use the COPY FROM
command. AUTO_CREATE
is set to TRUE by default, but this example will explicitly define it for visibility.
Example SQL code:
-- NOTE: AUTO_CREATE is TRUE by default, explicitly defining it here for visibility
COPY games_auto
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'
WITH
TYPE = PARQUET
AUTO_CREATE = TRUE;
Step 5: Verify the Schema and Data
After executing the COPY FROM
command, verify that the data is correctly ingested, and the table has been created as expected.
Select from the information_schema.tables
table to review the definition of the table.
Example SQL code:
SELECT ddl FROM information_schema.tables
WHERE table_name = 'games_auto';
Also, query the table to explore your data and confirm it looks as expected.
Example SQL code:
SELECT * FROM games_auto LIMIT 100;
Unless the table is dropped, subsequent COPY FROM
commands will NOT use Automatic Schema Discovery, and use the table column order, names, and data types just created.
Full Example Code
DROP TABLE IF EXISTS games_auto;
COPY games_auto
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'
WITH
TYPE = PARQUET
AUTO_CREATE = TRUE;
SELECT ddl FROM information_schema.tables
WHERE table_name = 'games_auto';
SELECT * FROM games_auto LIMIT 100;