How to inspect Firebolt engine resource utilization

WHY WE HAVE THIS CARD

This card helps Firebolt users understand how their workloads interact with their engines, and can help uncover potential resource bottlenecks.

The 3 main resources of every Firebolt engine are:

  1. CPU
  2. RAM
  3. Disk

This article explains how to inspect each resource's utilization within your Firebolt engine. Use these methods to better understand how your workloads are interacting with your engines, and to help discover potential bottlenecks in resource consumption.

QUERY TO SHOW RESOURCE CONSUMPTION

The following query analyzes the information_schema.query_history view to evaluate your engine's CPU and transient RAM consumption over time, broken down into intervals you provide. You can run it on any running engine. The full query is shown first, followed by a breakdown.

WITH time_window AS (
  SELECT
    '2022-10-27 09:10:00'::timestamp AS start_time,
    '2022-10-27 09:40:00'::timestamp AS end_time,
    1::INT AS "analysis_interval",
    30::INT AS "results_interval"
),
qs_time_window AS (
  SELECT
    (SELECT "analysis_interval" FROM time_window)::INT AS "analysis_interval",
    min("start_time") AS start_time,
    max("start_time") AS end_time
  FROM information_schema.query_history
  WHERE "start_time" BETWEEN (SELECT start_time FROM time_window)
    AND (SELECT end_time FROM time_window)
),
timeline AS (
  SELECT data_point
  FROM (
    SELECT
      array_distinct(
        transform(
          x -> CASE
            WHEN date_add('second', (x / "analysis_interval")::INT * "analysis_interval", start_time) < end_time
            THEN date_add('second', (x / "analysis_interval")::INT * "analysis_interval", start_time)
            ELSE end_time
          END,
          array_enumerate(split('', REPEAT('1', 86400)))
        )
      ) AS data_points
    FROM qs_time_window
  ) unnest (data_points AS data_point)
),
queries_details AS (
  SELECT
    (SELECT "analysis_interval" FROM time_window) AS "analysis_interval",
    "query_id" AS query_id,
    CASE WHEN NOT CONTAINS(nest("status"), 'ENDED_SUCCESSFULLY') THEN 0 ELSE 1 END AS finished,
    min("start_time") AS start_time,
    max("start_time") AS end_time,
    sum("duration_usec") / 1000.0 AS duration,
    sum("scanned_bytes") / 1000000000.0 AS read_gb,
    sum("total_ram_consumed") / 1000000000.0 AS ram_used_gb
  FROM information_schema.query_history
  WHERE "start_time" BETWEEN (SELECT min(data_point) FROM timeline)
    AND (SELECT max(data_point) FROM timeline)
  GROUP BY "query_id"
),
interval_results AS (
  SELECT
    data_point,
    count(*) AS total_queries,
    count(*) - sum(finished) AS "unfinished_queries",
    round(
      avg(
        CASE
          WHEN duration < 1 THEN duration
          ELSE to_float(
            date_diff(
              'second',
              (start_time),
              to_timestamp(
                CASE
                  WHEN date_add('second', "analysis_interval", data_point) < end_time
                  THEN date_add('second', "analysis_interval", data_point)
                  ELSE end_time
                END
              )
            )
          )
        END
      ),
      3
    ) AS avg_running_duration,
    round(sum(read_gb), 10) AS sum_read_gb,
    round(sum(ram_used_gb), 10) AS sum_ram_used_gb,
    nest(query_id) AS queries_ids
  FROM timeline
  CROSS JOIN queries_details
  WHERE data_point BETWEEN start_time AND end_time
  GROUP BY data_point
  ORDER BY data_point
)
SELECT
  date_add(
    'second',
    (date_diff('second', start_time, data_point) / "results_interval")::INT * "results_interval",
    start_time
  ) AS data_point,
  max(total_queries) AS max_concurrent_queries,
  max(avg_running_duration) AS max_concurrent_duration,
  max(sum_ram_used_gb) AS max_concurrent_gb_ram_used,
  max(sum_read_gb) AS max_concurrent_gb_read,
  sum(total_queries) AS total_queries,
  avg(avg_running_duration / total_queries) AS avg_duration,
  avg(sum_ram_used_gb) AS avg_gb_ram_used,
  avg(sum_read_gb) AS avg_gb_read,
  sum(unfinished_queries) AS unfinished_queries,
  flatten(nest(queries_ids)) AS queries_ids
FROM interval_results
CROSS JOIN time_window
GROUP BY 1
ORDER BY 1;

QUERY BREAKDOWN

This query uses five CTEs (time_window, qs_time_window, timeline, queries_details and interval_results) to pull results from the information_schema.query_history view for the time interval we care about. That time interval is defined by a few user-chosen parameters that can be changed in the time_window CTE.

Note that the analyzed information_schema.query_history view is truncated on every engine restart, so if your time window does not overlap the engine's current active period, no queries will be found.
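
Before choosing a time window, it can help to check which period the view currently covers. A minimal sketch, assuming only the start_time column already used in the main query:

-- Check the time range currently retained in query_history
-- (everything logged before the last engine restart is gone).
SELECT
  min("start_time") AS earliest_logged_query,
  max("start_time") AS latest_logged_query,
  count(*)          AS logged_queries
FROM information_schema.query_history;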

User-defined parameters (set in the time_window CTE; an example of adjusting them follows this list):

  • start_time - the beginning of the inspected time window. Queries are included in your result set if their start time is no earlier than the start time you provided. In the example above, we specified '2022-10-27 09:10:00'.
  • end_time - the end of the inspected time window. Queries are included in your result set if their start time is no later than the end time you provided. In the example above, we specified '2022-10-27 09:40:00'.
  • analysis_interval - analysis segmentation; determines the grouping used when calculating the metrics. For example, if set to 1, the metrics (how many queries were executed, how much RAM was consumed, etc.) are calculated per 1-second group.
  • results_interval - results segmentation; filters the timeline data points that are presented in the query results. For example, if set to 30, the results are segmented into timestamps 30 seconds apart. The results and analysis intervals are controlled separately to allow more flexibility (for example, calculating max RAM consumed at a 1-second granularity while presenting results at a 30-second granularity).
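
For example, to analyze a different window at a 1-second analysis granularity while presenting one result row per 60 seconds, only the time_window CTE needs to change. The timestamps below are placeholders; the rest of the query stays exactly as shown above:

WITH time_window AS (
  SELECT
    '2022-10-27 14:00:00'::timestamp AS start_time,   -- placeholder: start of your window
    '2022-10-27 15:00:00'::timestamp AS end_time,     -- placeholder: end of your window
    1::INT AS "analysis_interval",                    -- calculate metrics per 1 second
    60::INT AS "results_interval"                     -- present one result row per 60 seconds
),
-- qs_time_window, timeline, queries_details and interval_results follow unchanged,
-- as in the full query above.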

Returned fields:

  • data_point - timeline data point (timestamp).
  • max_concurrent_queries - max number of queries that ran concurrently, per analysis interval.
  • max_concurrent_duration - max average duration of queries, per analysis interval.
  • max_concurrent_gb_ram_used - max amount of RAM used (GB), per analysis interval.
  • max_concurrent_gb_read - max amount of data scanned (GB), per analysis interval.
  • total_queries - total number of queries executed within the provided results interval.
  • avg_duration - average duration of queries executed within the provided results interval.
  • avg_gb_ram_used - average amount of RAM used (GB) by queries executed within the provided results interval.
  • avg_gb_read - average amount of data scanned (GB) by queries executed within the provided results interval.
  • unfinished_queries - number of queries that started within the provided results interval but did not finish before the general end_time parameter.
  • queries_ids - query ids of all the queries that were executed within the provided results interval.
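
Once the results are returned, it is often useful to zoom in on the busiest intervals. A minimal sketch, assuming the output of the full query has been materialized into a table with the hypothetical name engine_utilization (for example with CREATE TABLE engine_utilization AS followed by the query above):

-- engine_utilization is a hypothetical table holding the output of the full query above.
-- List the intervals with the highest peak RAM usage, including any unfinished queries.
SELECT
  data_point,
  max_concurrent_queries,
  max_concurrent_gb_ram_used,
  unfinished_queries
FROM engine_utilization
ORDER BY max_concurrent_gb_ram_used DESC
LIMIT 10;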

INSPECT CPU UTILIZATION

The best indicator of how many CPUs were utilized at any given moment is the number of queries executed concurrently: every executed query utilizes at least one (and potentially more) CPU. For example, if max_concurrent_queries shows 10, the maximum number of CPUs utilized within the results interval is 10 or more. To inspect CPU load over time, set analysis_interval and results_interval to 1, and look for clusters of data points where max_concurrent_queries shows a high (relative to the CPUs available in the engine) and steady number of executed queries.

If you have identified specific data points that indicate a potential bottleneck, use the associated query ids to further inspect the specific queries (by querying information_schema.query_history and filtering for them).
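
A minimal sketch of that follow-up, assuming the query ids of interest have been copied from the queries_ids column (the ids below are placeholders):

-- Placeholders: replace the ids with values taken from the queries_ids column.
SELECT
  "query_id",
  "status",
  "start_time",
  "duration_usec" / 1000.0            AS duration_ms,
  "scanned_bytes" / 1000000000.0      AS scanned_gb,
  "total_ram_consumed" / 1000000000.0 AS ram_used_gb
FROM information_schema.query_history
WHERE "query_id" IN ('<query_id_1>', '<query_id_2>')
ORDER BY "start_time";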

INSPECT RAM UTILIZATION

A Firebolt engine utilizes RAM for 2 purposes:

  1. Transient - RAM utilized for on-the-fly query processing. max_concurrent_gb_ram_used is the best indicator of how much RAM was utilized at any given moment. This metric shows the maximum accumulated RAM consumed by concurrent queries within the results interval. To identify intense RAM usage over time, set analysis_interval and results_interval to 1, and look for clusters of data points where max_concurrent_gb_ram_used shows high utilization (relative to the RAM available to the engine).
    Note that final execution metrics are not logged for failing queries (including queries that have failed with an out-of-memory error). These cases are surfaced by the unfinished_queries metric.
    If you have identified specific data points that indicate a potential bottleneck, use the associated query ids to further inspect the specific queries (by querying information_schema.query_history and filtering for them); a sketch that ranks the heaviest RAM consumers in the window follows this list.
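
A minimal sketch of that kind of follow-up, ranking the heaviest RAM consumers within the inspected time window (the timestamps are placeholders and should match the window you analyzed):

-- Placeholders: align the timestamps with the time_window you analyzed.
SELECT
  "query_id",
  "status",
  "start_time",
  "total_ram_consumed" / 1000000000.0 AS ram_used_gb,
  "scanned_bytes" / 1000000000.0      AS scanned_gb
FROM information_schema.query_history
WHERE "start_time" BETWEEN '2022-10-27 09:10:00'::timestamp
  AND '2022-10-27 09:40:00'::timestamp
ORDER BY "total_ram_consumed" DESC
LIMIT 10;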

INSPECT DISK UTILIZATION

A Firebolt engine's local storage (SSD) holds every data part that was uploaded to the cache during engine start-up or query execution. To inspect cache utilization relative to the available capacity, run SHOW CACHE.
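
For example (the statement comes straight from the paragraph above; the exact shape of the returned utilization figures may differ between Firebolt versions):

-- Show SSD cache utilization for the engine you are connected to.
SHOW CACHE;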
