How to resolve "Unable to cast" errors during CSV Ingestion

Introduction

CSV files are a common source of data for ingestion into Firebolt. Unlike other file types (e.g., Parquet), CSV files do not include data type information. For this reason, they can contain incorrect or unexpected data. This variability can cause “Unable to cast” errors when ingesting data into Firebolt. Understanding and addressing this common error ensures smoother and more efficient data management. This article will guide you through investigating and resolving this common error.

TL;DR

There are two common causes for “Unable to cast” error:

  • Data not consistent across all rows.

  • Columns are not consistent across all rows.
    Firebolt’s logging and error-handling features help identify the rows causing the errors and provide insight on how to resolve the errors.

Resolving “Unable to cast” errors may involve regenerating the source file, updating the table definition, simply ignoring the failures, or a combination of the three.

Step-by-Step Guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

Step 1: Understand possible causes

Inconsistent data across all rows:

A common cause is that the data provided in the file is not consistent across all rows. An example of this can be seen by using files based on the Ultra Fast Gaming Data Set and running these queries:

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains two rows with decimal data for totalprizedollars

The COPY FROM query will return an error: Unable to cast text '123.45' to integer

Inconsistent columns across all rows:

Another common cause is when the columns are not consistent across all rows. An example of this can be seen by running these queries:

DROP TABLE IF EXISTS tournaments_inconsistent_columns;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_columns" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_columns
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_columns.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains a row where totalprizedollars column is missing

The COPY FROM query will return an error: Unable to cast text '2022-05-11 18:12:48' to integer

Step 2: Set up error logging and handling

In the above queries, data ingestion failed. The COPY FROM command has parameters that can be set to allow valid data to be ingested, and have invalid data written out to files on an S3 folder. This allows you to look at the rows with bad data and determine the appropriate course of action. You will need an S3 folder with write permission. A child directory is created based on the time of load submission in the format YearMonthDay-HourMinuteSecond + an ID (Ex. 20220330-173205). Two types of files are written: 1) the reason (error) file: error_reasons.csv and 2) the data (row) file: rejected_rows.csv.

The following queries are updated to ingest the valid data from the first example and log the invalid rows to S3:

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE
MAX_ERRORS_PER_FILE= '100%'
ERROR_FILE = '<S3 folder that you have write access to>'
ERROR_FILE_CREDENTIALS = (AWS_KEY_ID = 'key' AWS_SECRET_KEY = 'secret key');

SELECT * FROM tournaments_inconsistent_data;
-- this will show that 4 rows were successfully ingested into the table

Step 3: Review error log and rows with errors

After running the above queries, Firebolt will create a new folder in the provided S3 location, and write two files there:

  • error_reasons_1.csv

  • rejected_rows_1.csv

  • You can use the AWS console to locate and download the files:

error_reasons_1.csv

This file contains an error message for each row that failed with the file name, the row number, and the error message. It will look like this:

"file_name","source_line_num","error_message"
"gaming/csv/invalid_formats/tournaments_inconsistent_data.csv",5,"Error while casting"
"gaming/csv/invalid_formats/tournaments_inconsistent_data.csv",6,"Error while casting"

This indicates that rows five and six of the original file could not load due to a casting error.

rejected_rows_1.csv

This file contains each row that failed validation. It will look like this:

998,The African Grand Prix,1,123.45,2022-05-11 18:12:48,2022-05-11 22:29:12,"{'MostLaps': 'The race will be won by the car that completes the most laps within the time limit', 'PitStops': 'Pit stops are allowed for refuelling and tire changes, but must be completed within a designated time limit', 'RaceTrack': 'Drivers must follow the designated race track at all times', 'AvoidCollisions': 'Each driver must make every effort to avoid collisions with other cars or track obstacles', 'Overtaking': 'Overtaking other cars is allowed, but must be done safely', 'SafetyStandards': 'Any car that fails to meet safety standards or is too damaged to continue the race will be disqualified'}"
999,The African Grand Prix,1,706.45,2022-05-11 18:12:48,2022-05-11 22:29:12,"{'MostLaps': 'The race will be won by the car that completes the most laps within the time limit', 'PitStops': 'Pit stops are allowed for refuelling and tire changes, but must be completed within a designated time limit', 'RaceTrack': 'Drivers must follow the designated race track at all times', 'AvoidCollisions': 'Each driver must make every effort to avoid collisions with other cars or track obstacles', 'Overtaking': 'Overtaking other cars is allowed, but must be done safely', 'SafetyStandards': 'Any car that fails to meet safety standards or is too damaged to continue the race will be disqualified'}"

This shows that row 5 has 123.45 as the value for totalprizedollars and that row 6 has 706.45 as the value for totalprizedollars . That column is defined as an integer column and cannot accept decimal data.

Repeat Step 2 for the tournaments_inconsistent_columns example:

This will create a new folder and error files. In rejected_rows.csv you will see 1 row:

`5,The African Grand Prix,1,2022-05-11 18:12:48,2022-05-11 22:29:12,"{'MostLaps': 'The race will be won by the car that completes the most laps within the time limit', 'PitStops': 'Pit stops are allowed for refuelling and tire changes, but must be completed within a designated time limit', 'RaceTrack': 'Drivers must follow the designated race track at all times', 'AvoidCollisions': 'Each driver must make every effort to avoid collisions with other cars or track obstacles', 'Overtaking': 'Overtaking other cars is allowed, but must be done safely', 'SafetyStandards': 'Any car that fails to meet safety standards or is too damaged to continue the race will be disqualified'}"`

From this, you can see in Row five that there is no data or placeholder for totalprizedollars, just the data for startdatetime. That data cannot be cast to integer, and results in a failure.

Step 4: Determine resolution

Resolving these errors can involve any or all of these three actions:

  • Regenerating the source file and re-ingesting the data.

  • Redefining the table and re-ingesting the data.

  • Ignoring the errors and using the data that was successfully loaded.
    In the first example (tournaments_inconsistent_data) adjusting the table definition by defining totalprizedollars as a decimal data would eliminate the casting error during load. Running these queries will accomplish that:

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" decimal NULL, /* new data type - was integer, now decimal */
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains two rows with decimal data for totalprizedollars

In the second example (tournaments_inconsistent_columns) the source file is incorrect. The process that generates the source files should be reviewed and the file should be regenerated. If that is not possible, or the missing row is unimportant, then the error can just be ignored and Firebolt will have the data that was successfully loaded.

Full Example Code:

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains two rows with decimal data for totalprizedollars

DROP TABLE IF EXISTS tournaments_inconsistent_columns;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_columns" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_columns
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_columns.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains a row where totalprizedollars column is missing

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE
MAX_ERRORS_PER_FILE= '100%'
ERROR_FILE = '<S3 folder that you have write access to>'
ERROR_FILE_CREDENTIALS = (AWS_KEY_ID = 'key' AWS_SECRET_KEY = 'secret key');

SELECT * FROM tournaments_inconsistent_data;
-- this will show that 4 rows were successfully ingested into the table

DROP TABLE IF EXISTS tournaments_inconsistent_data;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" decimal NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_data
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_data.csv'  
TYPE = CSV HEADER = TRUE;
-- this file contains two rows with decimal data for totalprizedollars
DROP TABLE IF EXISTS tournaments_inconsistent_columns;

CREATE TABLE IF NOT EXISTS
  "tournaments_inconsistent_data" (
    "tournamentid" integer NULL,
    "name" text NULL,
    "gameid" integer NULL,
    "totalprizedollars" integer NULL,
    "startdatetime" timestampntz NULL,
    "enddatetime" timestampntz NULL,
    "rulesdefinition" text NULL,
    "source_file_name" text NULL,
    "source_file_timestamp" timestampntz NULL
  ) PRIMARY INDEX "tournamentid";

COPY tournaments_inconsistent_columns
FROM 's3://firebolt-sample-datasets-public-us-east-1/gaming/csv/invalid_formats/' WITH PATTERN = 'tournaments_inconsistent_columns.csv'  
TYPE = CSV HEADER = TRUE
MAX_ERRORS_PER_FILE= '100%'
ERROR_FILE = '<S3 folder that you have write access to>'
ERROR_FILE_CREDENTIALS = (AWS_KEY_ID = 'key' AWS_SECRET_KEY = 'secret key');

SELECT * FROM tournaments_inconsistent_columns;
-- this will show that 4 rows were successfully ingested into the table