What causes "Unable to cast 'TEXT' to xxx target data type..." errors when selecting from external tables

This error can be encountered during several different activities involving an external table based on CSV or TSV files, each sketched in SQL after the list:

  1. SELECTing FROM an external table

  2. INSERTing INTO a table FROM an external table

  3. CREATing a table AS SELECT FROM an external table
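
In SQL terms, these activities take the following shapes. This is a sketch: my_external_table is the table defined in the example below, while my_internal_table and my_new_table are hypothetical placeholder names.

-- 1. Plain SELECT from the external table
SELECT * FROM my_external_table;

-- 2. INSERT INTO an internal table FROM the external table
--    (my_internal_table is a placeholder)
INSERT INTO my_internal_table SELECT * FROM my_external_table;

-- 3. CREATE TABLE AS SELECT (CTAS) FROM the external table
--    (my_new_table is a placeholder)
CREATE TABLE my_new_table AS SELECT * FROM my_external_table;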

The error means that Firebolt was unable to convert the data found for one or more columns in one or more of the rows in the file to the column data type defined in the external table DDL.

Example

/*************
 * Table DDL
 *************/
CREATE EXTERNAL TABLE my_external_table (
    c_id INT,
    c_sum LONG
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://bucket/'
OBJECT_PATTERN = '*.csv'
TYPE = (CSV)

/*************
 * FileA.csv
 *************/
100,200
1000,2000

/*************
 * FileB.csv
 *************/
100,200
1000,"abc"

/*************
 * FileC.csv
 *************/
"c_id","c_sum"
100,200
1000,2000

In the above example, the following would happen:

  • FileA.csv can be accessed with no errors

  • FileB.csv will error due to the "abc" value in column c_sum in the second row; Firebolt is unable to implicitly convert "abc" to a numeric value

  • FileC.csv will error due to the header row, which is not being skipped. If the files have header rows, the DDL TYPE should include the parameter to skip the header row(s), as shown in the corrected DDL after this list:

    • TYPE = (CSV SKIP_HEADER_ROWS=1)
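
A corrected version of the example DDL with the header-skip option added would look like the following (a sketch; the credentials and bucket are placeholders, as in the original):

CREATE EXTERNAL TABLE my_external_table (
    c_id INT,
    c_sum LONG
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://bucket/'
OBJECT_PATTERN = '*.csv'
TYPE = (CSV SKIP_HEADER_ROWS=1)

Note that SKIP_HEADER_ROWS applies to every file matched by OBJECT_PATTERN, so this assumes all of the matched files have a header row.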

Troubleshooting tip

It may be necessary to review the data in the files with a text editor to identify the data type mismatch. Most of the time, the issue will be obvious from the first few rows of the file, and the most common resolution is a simple change to the DDL to match the file contents.

However, it only takes one column in one row in the entire set of files to cause this error. If the problem is not obvious in the first few rows, it can be helpful to narrow down the offending row(s) using SELECT ... LIMIT ... OFFSET. Using the table from the example above, run the following query multiple times, increasing the OFFSET value each time, to slice up the files and track down where the problematic rows are located. SOURCE_FILE_NAME is an auto-generated column that helps identify which file the data came from (if there are multiple files).

SELECT SOURCE_FILE_NAME, COUNT(*)
FROM
    (SELECT *, SOURCE_FILE_NAME FROM my_external_table LIMIT 10000 OFFSET 0)
GROUP BY source_file_name;

....<repeat if no error>....

SELECT SOURCE_FILE_NAME, COUNT(*)
FROM
    (SELECT *, SOURCE_FILE_NAME FROM my_external_table LIMIT 10000 OFFSET 10000)
GROUP BY source_file_name;
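
Once the failing slice is identified, one way to see the offending values is to define a sibling external table over the same files with every column typed as TEXT, so that no cast can fail, and then select that slice. This is a sketch; my_external_table_text is a hypothetical name:

/*************
 * Inspect raw values without casting
 *************/
CREATE EXTERNAL TABLE my_external_table_text (
    c_id TEXT,   -- all columns as TEXT so every value loads as-is
    c_sum TEXT
)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '******')
URL = 's3://bucket/'
OBJECT_PATTERN = '*.csv'
TYPE = (CSV);

-- Rows whose c_sum is not purely numeric are the likely culprits
SELECT SOURCE_FILE_NAME, c_id, c_sum
FROM my_external_table_text
LIMIT 10000 OFFSET 10000;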