WHY WE HAVE THIS CARD
This card was created to help Firebolt users better understand how their workload is interacting with the engines, and can help discover potential bottlenecks in that aspect.
The 3 main resources of every Firebolt engine are:
- CPU
- RAM
- Disk
This article explains how to inspect each resource's utilization within your Firebolt engine. Use these methods to better understand how your workloads are interacting with your engines, and to help discover potential bottlenecks in resource consumption.
QUERY TO SHOW RESOURCE CONSUMPTION
The following query, which analyzes the information_schema.query_history
view, can be used to evaluate your engine CPU and transient RAM consumption over time, broken down by intervals you provide. You can run this query on any running engine. First we’ll show you the query, and then we’ll break it down.
WITH time_window AS (
SELECT
'2022-10-27 09:10:00' :: timestamp AS start_time,
'2022-10-27 09:40:00' :: timestamp AS end_time,
1 :: INT AS "analysis_interval",
30 :: INT AS "results_interval"
),
qs_time_window AS (
SELECT
(
SELECT
"analysis_interval"
FROM
time_window
):: INT AS "analysis_interval",
min("start_time") AS start_time,
max("start_time") AS end_time
FROM
information_schema.query_history
WHERE
"start_time" BETWEEN (
SELECT
start_time
FROM
time_window
)
AND (
SELECT
end_time
FROM
time_window
)
),
timeline AS (
SELECT
data_point
FROM
(
SELECT
array_distinct(
transform(
x -> CASE WHEN date_add(
'second',
(x / "analysis_interval"):: INT * "analysis_interval",
start_time
) < end_time THEN date_add(
'second',
(x / "analysis_interval"):: INT * "analysis_interval",
start_time
) ELSE end_time end,
array_enumerate(
split('', REPEAT('1', 86400))
)
)
) AS data_points
FROM
qs_time_window
) unnest (data_points AS data_point)
),
queries_details AS (
SELECT
(
SELECT
"analysis_interval"
FROM
time_window
) AS "analysis_interval",
"query_id" AS query_id,
CASE WHEN NOT CONTAINS(
nest("status"),
'ENDED_SUCCESSFULLY'
) THEN 0 ELSE 1 end AS finished,
min("start_time") AS start_time,
max("start_time") AS end_time,
sum("duration_usec") / 1000.0 AS duration,
sum("scanned_bytes") / 1000000000.0 AS read_gb,
sum("total_ram_consumed") / 1000000000.0 AS ram_used_gb
FROM
information_schema.query_history
WHERE
"start_time" BETWEEN (
SELECT
min(data_point)
FROM
timeline
)
AND (
SELECT
max(data_point)
FROM
timeline
)
GROUP BY
"query_id"
),
interval_results AS (
SELECT
data_point,
count(*) AS total_queries,
count(*) - sum(finished) AS "unfinished_queries",
round(
avg(
CASE WHEN duration < 1 THEN duration ELSE to_float(
date_diff(
'second',
(start_time),
to_timestamp(
CASE WHEN date_add(
'second', "analysis_interval", data_point
) < end_time THEN date_add(
'second', "analysis_interval", data_point
) ELSE end_time end
)
)
) end
),
3
) AS avg_running_duration,
round(
sum(read_gb), 10
) AS sum_read_gb,
round(
sum(ram_used_gb), 10
) AS sum_ram_used_gb,
nest(query_id) AS queries_ids
FROM
timeline CROSS
JOIN queries_details
WHERE
data_point BETWEEN start_time
AND end_time
GROUP BY
data_point
ORDER BY
data_point
)
SELECT
date_add(
'second',
(
date_diff('second', start_time, data_point) / "results_interval"
):: INT * "results_interval",
start_time
) AS data_point,
max(total_queries) AS max_concurrent_queries,
max(avg_running_duration) AS max_concurrent_duration,
max(sum_ram_used_gb) AS max_concurrent_gb_ram_used,
max(sum_read_gb) AS max_concurrent_gb_read,
sum(total_queries) AS total_queries,
avg(
avg_running_duration / total_queries
) AS avg_duration,
avg(sum_ram_used_gb) AS avg_gb_ram_used,
avg(sum_read_gb) AS avg_gb_read,
sum(unfinished_queries) AS unfinished_queries,
flatten(
nest(queries_ids)
) AS queries_ids
FROM
interval_results CROSS
JOIN time_window
GROUP BY 1
ORDER BY 1;
QUERY BREAKDOWN
This query uses 4 CTEs, time_window, timeline, queries_details and second_interval_results, to pull results from the information_schema.query_history
view for the time interval we care about. We can define that time interval with a few user-chosen parameters that can be changed as part of the time_window CTE.
Note that the analyzed table information_schema.query_history
is truncated every engine restart, so if your time window does not include the current active period, no queries will be found.
Optional parameters:
start_time
- the beginning of the inspected time window. Queries will be included in your result set if their start time is greater than the start time you provided. In the example above, we specified ‘2022-06-28 08:00:00'end_time
- the end of the inspected time window. Queries will be included in your result set if their end time is smaller than the end time you provided. In the example above, we specified '2022-06-28 9:00:00'analysis_interval
- analysis segmentation, determines the grouping in the results for analysis. For example, if set to 1, the metrics would be calculated grouped by 1 second (how many queries were executed, how much RAM was consumed etc.)results_interval
- results segmentation, filters the data points on the timeline that will be presented in the query results. For example, if set to 30, the results will be segmented by timestamps with 30 seconds gap between them. We control the results and analysis interval separately, to allow more flexibility (for example calculate max RAM consumed on a 1 second level, but present results on a 30 seconds level)
Returned fields:
data_point
- timeline data point (timestamp)max_concurrent_queries
- max number of queries that were run concurrently, per analysis interval.max_concurrent_duration
- max average duration of queries, per analysis interval.max_concurrent_gb_ram_used
- max amount of gb RAM used, per analysis interval.max_concurrent_gb_read
- max amount of gb scanned, per analysis interval.total_queries
- total amount of queries executed within the provided results interval.avg_duration
- average duration of queries executed within the provided results interval.avg_gb_ram_used
- average amount of RAM used by queries executed within the provided results interval.avg_gb_read
- average amount of gb scanned by queries executed within the provided results interval.unfinished_queries
- number of queries that have started within the provided results interval, but have not finished before the generalend_time
parameter.queries_ids
- query ids of all the queries that were executed within the provided results interval.
INSPECT CPU UTILIZATION
The best indicator of how many CPUs were utilized in every given moment is the number of queries executed concurrently. For every executed query there is at least one (but potentially more) CPU utilized. For example, if max_concurrent_queries
shows 10, it means that the maximum number of CPUs utilized within the results interval is 10 or more. To inspect CPU load over time, set analysis_interval
and results_interval
to 1, and look for cluster of data points where max_concurrent_queries
shows high (relatively to the available CPUs that exist in the engine) and static number of executed queries.
If you have identified specific data points that indicate a potential bottleneck, use the associated query ids to further inspect the specific queries (by querying information_schema.query_history
and filtering for them).
INSPECT RAM UTILIZATION
Firebolt engine utilizes RAM for 2 purposes:
- Transient - RAM utilized for on the fly query processing.
max_concurrent_gb_ram_used
is the best indicator of how much RAM was utilized in every given moment. This metric shows the max accumulated RAM consumed by concurrent queries within the results interval. To identify intense RAM usage over time, setanalysis_interval
andresults_interval
to 1, and look for cluster of data points wheremax_concurrent_gb_ram_used
shows high utilization (relatively to the available RAM by the engine).
Note that final execution metrics are not logged for failing queries (including queries that have failed with out of memory error). These cases can be indicated by theunfinished_queries
metric.
If you have identified specific data points that indicate a potential bottleneck, use the associated query ids to inspect further the specific queries (by queryinginformation_schema.query_history
and filtering for them).
INSPECT DISK UTILIZATION
Firebolt engine storage (SSD) stores every data part that was uploaded to cache, as part of engine start-up / queries execution. To inspect cache utilization out of available capacity run show cache
.