How to Use COPY FROM with Automatic Schema Discovery

Introduction

Using Firebolt’s COPY FROM command with Automatic Schema Discovery simplifies data ingestion by automatically mapping columns from external data files and creating the destination table to match it. It is especially useful when the source files are in a format that includes metadata, such as Parquet. For formats that do not include metadata, such as CSV, Automatic Schema Discovery infers column types based on the data content itself. While this process aims to accurately identify data types, it operates on a “best effort” basis. Automatic Schema Discovery saves time when handling data sources with a large number of columns or unknown schemas. By the end of this article, you will know how to configure and execute COPY FROM with auto-schema discovery.

TL;DR

  • Auto Schema Discovery is in effect when AUTO_CREATE = TRUE on the COPY FROM command. It is TRUE by default, so it is not necessary to specify it.

  • If the source file does not include metadata (e.g., CSV or TSV), the column data types will be inferred based on an initial sample of the file. If the source file does include metadata (e.g. PARQUET), the column data types from the metadata will be used.

  • If the destination table does not exist, COPY FROM will create it when the data is ingested. If it does exist, the AUTO_CREATE option will be ignored. The table will not be altered, and data ingest will proceed into the existing table.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

Step 1: MAKE SURE the TABLE DOES NOT EXIST

Ensure that the table into which the data will be ingested into does not exist. This ensures that the table will be created based on the data in the files.

Example SQL code:

DROP TABLE IF EXISTS games_auto;

Step 2: EXECUTE the COPY FROM Command

Use the COPY FROM command. AUTO_CREATE is set to TRUE by default, but this example will explicitly define it for visibility.

Example SQL code:

-- NOTE: AUTO_CREATE is TRUE by default, explicitly defining it here for visibility
COPY games_auto
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'
WITH
TYPE = PARQUET
AUTO_CREATE = TRUE;

Step 5: Verify the Schema and Data

After executing the COPY FROM command, verify that the data is correctly ingested, and the table has been created as expected.

Select from the information_schema.tables table to review the definition of the table.

Example SQL code:

SELECT ddl FROM information_schema.tables
WHERE table_name = 'games_auto';

Also, query the table to explore your data and confirm it looks as expected.

Example SQL code:

SELECT * FROM games_auto LIMIT 100;

Unless the table is dropped, subsequent COPY FROM commands will NOT use Automatic Schema Discovery, and use the table column order, names, and data types just created.

Full Example Code

DROP TABLE IF EXISTS games_auto;

COPY games_auto
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/parquet/games/'
WITH
TYPE = PARQUET
AUTO_CREATE = TRUE;

SELECT ddl FROM information_schema.tables
WHERE table_name = 'games_auto';

SELECT * FROM games_auto LIMIT 100;