Query Diagnosis and Troubleshooting

Overview

If a query is not running as fast as desired, there are several areas that can be investigated for bottlenecks or restrictions. This blog post describes a comprehensive approach to debugging and optimizing queries.

The sections that follow cover areas to investigate and techniques for diagnosing performance issues.

Query EXPLAIN

The EXPLAIN plan for a query shows how the data will be accessed, and can reveal opportunities to improve query performance by changing index configurations, joins, query structure, and more, before the query is ever run. From the plan, you may see that one or more of these refinements could improve performance:

  • Change a primary index to better support the query pattern (see the sketch after this list).
  • Add more filters to the WHERE and JOIN clauses, even if they seem redundant.
    • Additional criteria give Firebolt more flexibility to approach the query differently and filter as aggressively as possible. This is especially true in outer joins.
  • Create aggregating indexes.
  • Change the query to use deep filters.
  • Use hashing (CITY_HASH) on JOIN or WHERE columns.
    • Large strings can hurt performance; in general, prefer working with numbers wherever possible. If a column holds large strings (more than 20 characters), the values can be converted to a hash (either at load time or at query time), and the compact hash number can then be used in place of the large string in as many operations as possible.
    • CITY_HASH can also be used to generate a single hash out of compound columns, reducing multiple join keys to one:

        FROM table_a a INNER JOIN table_b b
          ON CITY_HASH(a.a, a.b, a.c, a.d) = CITY_HASH(b.a, b.b, b.c, b.d)
  • Materialize CTEs.
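A minimal sketch of a few of these refinements, assuming a hypothetical events fact table (all table and column names here are illustrative, and the aggregating-index syntax differs across engine versions - some require CREATE AND GENERATE AGGREGATING INDEX):

-- A primary index chosen to match the most common filter columns;
-- changing a primary index requires recreating the table.
CREATE FACT TABLE events (
    event_date  DATE,
    event_type  TEXT,
    user_key    TEXT
) PRIMARY INDEX event_date, event_type;

-- Inspect the plan before running the query.
EXPLAIN
SELECT event_type, count(*)
FROM events
WHERE event_date >= '2022-04-01'
GROUP BY event_type;

-- An aggregating index that precomputes the same aggregation.
CREATE AGGREGATING INDEX events_by_type_agg ON events (
    event_date,
    event_type,
    count(*)
);

-- Hash a long string key once at load time, so later joins and filters
-- can use the compact number instead of the string. events_hashed is a
-- hypothetical target table with columns (user_key_hash, event_type, event_date).
INSERT INTO events_hashed
SELECT
    CITY_HASH(user_key) AS user_key_hash,
    event_type,
    event_date
FROM events;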

Query statistics in query history

Query history contains information about what resources were used and how much time was spent in the various steps of query execution. This information can provide insight into where a query spent its time.

Some key metrics for a query can be retrieved from query history using a query like this:

SELECT
    start_time
    , end_time
    , status
    , duration_usec/1000000 as duration_secs
    , time_in_queue_ms
    , query_optimization_us/1000 as q_opt_ms
    , query_id
    , query_text
    , query_text_normalized_hash
    , error_message
    , round(cpu_usage_us/1000) as cpu_usage_ms
    , round(cpu_delay_us/1000) as cpu_delay_ms
    , round(total_ram_consumed/1000000000,3) as total_ram_gb
    , round(scanned_bytes/1000000000,3) as scanned_gb
    , round(scanned_bytes_cache/1000000000,3) as scanned_cache_gb
    , round(scanned_bytes_storage/1000000000,3) as scanned_storage_gb
    , scanned_rows
    , inserted_rows
    , user_id
FROM information_schema.query_history
WHERE LOWER(query_text) not like '%query_history%'
AND LOWER(query_text) not like '%running_queries%'
AND lower(query_text) like '%<an identifying text pattern for the query>%'
AND status != 'STARTED_EXECUTION'
AND start_time BETWEEN '2022-04-21 15:49:00' and '2022-04-21 15:51:00'  --UTC time range for query
ORDER BY start_time

Some of the columns most important to performance are:

  • round(cpu_delay_us/1000) as cpu_delay_ms
    • A high number here means the query waited for CPU resources. It can indicate that other queries were running at the same time and the CPU was 100% utilized for part of the query's execution.
  • round(scanned_bytes_storage/1000000000,3) as scanned_storage_gb
    • A high number here indicates that the data was not warm when the query was run. The data needed by the query was not available in memory or SSD cache, and had to be retrieved from F3, which is slower than the other two options.
  • scanned_rows
    • A high number here means that a lot of rows were not filtered out prior to retrieval. This could indicate the need for better filters on the query, adjustments to the primary index, or creation of join indexes or aggregating indexes.
  • query_optimization_us/1000 as q_opt_ms
    • A high number here indicates a complex query that takes time to compile and be prepared for execution. This could indicate the need for intermediate tables or materialized CTEs.
  • round(total_ram_consumed/1000000000,3) as total_ram_gb
    • A high number here indicates the query required a lot of memory, possibly due to aggregations or ordering data. This can reduce the memory available for concurrent queries and result in overall slower queries. It could indicate the need for better filters on the query, adjustments to the primary index, or creation of join indexes or aggregating indexes. A sketch that surfaces these symptoms across recent queries follows this list.
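To see which of these symptoms a workload is showing, the same columns can be pulled and ranked directly from query history. A minimal sketch (the time range is a placeholder, and the ORDER BY column can be swapped for any of the metrics above):

SELECT
    query_id
    , round(cpu_delay_us/1000) as cpu_delay_ms
    , round(scanned_bytes_storage/1000000000,3) as scanned_storage_gb
    , scanned_rows
    , query_optimization_us/1000 as q_opt_ms
    , round(total_ram_consumed/1000000000,3) as total_ram_gb
FROM information_schema.query_history
WHERE status != 'STARTED_EXECUTION'
AND start_time > '2022-04-21 00:00:00'
ORDER BY cpu_delay_ms DESC
LIMIT 20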

Query execution history

If a query has been executed over a long period of time and recently seems to be running slower, it can be useful to widen the time range examined in query history - perhaps to a week or a month. This can be done by searching for a text pattern in the query, or by adding query_text_normalized_hash to the WHERE clause (once you have identified the value for the query you are investigating).
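For example, once the hash for the query is known, a sketch like this tracks its key metrics by day (the hash value and time range are placeholders):

SELECT
    date_trunc('day', start_time) as day
    , count(*) as executions
    , round(avg(duration_usec/1000)) as avg_dur_ms
    , round(avg(scanned_rows)) as avg_scanned_rows
    , round(avg(total_ram_consumed/1000000000),3) as avg_ram_gb
    , round(avg(scanned_bytes_storage/1000000000),3) as avg_storage_gb
FROM information_schema.query_history
WHERE query_text_normalized_hash = '<hash value for the query>'
AND status != 'STARTED_EXECUTION'
AND start_time > '2022-03-21 00:00:00'  -- e.g. the past month
GROUP BY date_trunc('day', start_time)
ORDER BY 1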

Once the query execution information is retrieved, look for changes over time:

  • Changes in scanned_rows or total_ram_gb indicate that more data is being read than in the past, or that particular executions of the query are especially "heavy". It may be necessary to increase the size of the engine, or to consider whether new indexes or query filters would make a difference now that there is more data than in the past.
  • Changes in scanned_storage_gb can indicate that more data is being read and the SSD cache needs to be increased. It could also indicate that the warm-up method or warm-up queries are not bringing the needed data into cache when the engine is started.

Query context

Other queries running concurrently on the engine can impact a query, so it is useful to know how many other queries were executing at the same time. This can be done by showing query activity by minute for a period before and after the query execution being analyzed:

/*** use if engine version is 3.7 or later ***/
select
    minute
    , count(*) as query_count
    , round(avg(duration_usec/1000)) as avg_dur_ms
    , round(avg(time_in_queue_ms)) as avg_queue_ms
    , round(avg(query_optimization_us/1000)) as avg_q_opt_ms
    , round(avg(scanned_rows),0) as avg_read_rows
    , round(avg(inserted_rows),0) as avg_written_rows
    , round(avg(scanned_storage_mb),0) as avg_bytes_file_mb
    , round(avg(total_ram_mb)) as avg_mem_usage_mb
    , round(avg(cpu_usage_ms),0) as avg_cpu_usage_ms
    , round(avg(cpu_delay_ms)) as avg_cpu_delay_ms
    , sum(scanned_rows) as tot_rows_read
    , sum(select_count) as select_count
    , sum(insert_count) as insert_count
    , sum(cold_query_count) as cold_query_count
    , sum(ddl_count) as ddl_count
    , sum(err_count) as err_count
from (
    select
        date_trunc('minute', qh.start_time) as minute
        , query_id
        , query_text
        , status
        , duration_usec
        , time_in_queue_ms
        , query_optimization_us
        , case
            when lower(query_text) like '%insert into%' then 1
            else 0
          end insert_count
        , case
            when (lower(query_text) like '%select%'
                and lower(query_text) not like '%insert into%') then 1
            else 0
          end select_count
        , case
            when lower(query_text) like '%create dimension%'
                or lower(query_text) like '%create fact%'
                or lower(query_text) like '%drop table%' then 1
            else 0
          end ddl_count
        , case
            when scanned_bytes_storage > 0 then 1
            else 0
          end cold_query_count
        , case
            when error_message != '' then 1
            else 0
          end err_count
        , round(cpu_usage_us/1000) as cpu_usage_ms
        , round(cpu_delay_us/1000) as cpu_delay_ms
        , round(total_ram_consumed/1000000,3) as total_ram_mb
        , round(scanned_bytes/1000000,3) as scanned_mb
        , round(scanned_bytes_cache/1000000,3) as scanned_cache_mb
        , round(scanned_bytes_storage/1000000,3) as scanned_storage_mb
        , scanned_rows
        , inserted_rows
    from information_schema.query_history qh
    where lower(query_text) not like '%query_history%'
    and lower(query_text) not like '%running_queries%'
    and lower(query_text) not like '%show_indexes%'
    and lower(query_text) not like '%show_tables%'
    and status != 'STARTED_EXECUTION'
    and start_time between '2022-09-01 20:30:00' and '2022-09-05 20:50:00'
)
group by minute
order by minute
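If the slowdown is happening right now, currently executing queries can also be inspected. A minimal sketch using information_schema.running_queries (the columns available vary by version):

select
    query_id,
    start_time,
    status,
    query_text
from information_schema.running_queries
order by start_time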