Troubleshooting CSV file ingestion

The Symptom

Sometimes when trying to ingest a CSV into an external table, the ingestion might fail. You'll see an error like this when querying the external table:

Cannot parse input: expected '|' but found '<CARRIAGE RETURN>' instead.: While handling s3 object 'some_bucket/some_folder/some_file.csv'

The above error indicates that the file delimiter in not what is defined in the table definition, or that the file contains a different number of columns than what is defined in the external table.

Troubleshooting

  1. Check the file delimiter and make sure the FIELD_DELIMITER is defined correctly for what is in your CSV.
  2. Take a look at the file and compare it column-by-column to the external table definition.
  3. If the file is too big to easily download or look at, then you can try setting up a special external table definition that lets you look at the rows in the CSV file as a single string:
DROP TABLE IF EXISTS ext_tmp;CREATE EXTERNAL TABLE ext_tmp(blob text)URL = 's3://some_bucket/somefolder/'OBJECT_PATTERN = '*' TYPE = (CSV FIELD_DELIMITER=' ');-- In this case, what looks like a space for a column delimiter is actually ascii(255).  On a windows machine, you can create this character by holding the alt key and typing 255 on your numeric keypad.-- The trick is to use a character that will not appear in the normal data of the file. You could use another character for the delimeter, such as TYPE = (CSV FIELD_DELIMITER='^')-- (blob text) is just a column named blob with type text. You could replace it with any column name you want.-- Once this external table is created, there will be one field called “blob” in each row.

Once the table is created, you can run queries like this to look at the data and compare to the original external table definition.

-- shows the header row and first row of data
SELECT * FROM ext_tmp LIMIT 2;
-- shows how many delimiters are in each row, and orders the result by the count.  This can show if there are rows with different numbers of delimiters, which is not allowed
SELECT LENGTH(SPLIT(‘|’, blob)), blob FROM ext_tmp
ORDER BY LENGTH(SPLIT(‘|’, blob)) desc;