How to Size Firebolt Engines to Optimize Your Analytical Workloads

Introduction

Selecting the right Firebolt engine size is crucial for balancing performance, cost, and efficiency in data-intensive environments. This guide focuses on how to determine the ideal engine size for your workload requirements, giving both beginners and seasoned data engineers clear guidance for making informed decisions. By the end of this article, you will know how to size Firebolt engines effectively.

TL;DR

  • Start with an engine that matches your active dataset size.

  • Use engine observability metrics to adjust the size dynamically.

  • Consider both node type and the number of nodes for scaling.

  • Scale up (bigger nodes) when memory, disk usage, or spilling is the constraint; scale out (more nodes) when CPU is the bottleneck.

  • Add more clusters to improve query concurrency and throughput.

Step-by-step guide

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

Step 1: Evaluate Your Workload Requirements

Start by assessing the specific needs of your analytics workloads and query demands, focusing on the active dataset: the data your queries access frequently. This initial evaluation will guide you to a suitable starting point for your engine size, ensuring responsiveness and efficiency.

Example SQL code:

-- Example SQL code to assess the active dataset (compressed size in GB)
SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) AS size_of_tables_GB
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name IN ('<tables you are interested to keep in cache>');

Note: This is an example to show potential cache needs. You don’t need to store entire tables in cache; you might opt to load only specific tablets, such as data from the last month, instead of all historical data.
This applies to a single-node engine cluster. For multiple nodes, see How to Choose the SSD Size for a Firebolt Engine.
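
If you plan to keep only recent data hot instead of whole tables, you can roughly estimate its share of the cache. The following sketch is an illustration only: it assumes a fact table named playstats with a timestamp column stattime (adjust both names to your schema) and scales the compressed table size by the fraction of rows from the last month.

-- Rough cache estimate for only the last month of a table (illustrative sketch).
-- Assumes table "playstats" and timestamp column "stattime"; adjust to your schema.
WITH table_size AS (
    SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) AS size_gb
    FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name = 'playstats'
),
row_fraction AS (
    SELECT SUM(CASE WHEN stattime >= NOW() - INTERVAL '1 month' THEN 1 ELSE 0 END) * 1.0
           / COUNT(*) AS recent_fraction
    FROM playstats
)
SELECT table_size.size_gb * row_fraction.recent_fraction AS estimated_recent_gb
FROM table_size
CROSS JOIN row_fraction;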

Step 2: Choose an Appropriate Node Type

Select a node type (S, M, L, XL) whose cache size matches your active dataset’s needs. Consider not only the cache size but also how efficiently it is used, so that you avoid frequent cache evictions. To maintain optimal performance, plan for the active dataset to occupy no more than 80% of the available cache space; for even more headroom, calculate against only 60% of the cache capacity. This leaves room for unforeseen data without triggering cache eviction.
For instance, with an 1,800 GB cache: usable cache capacity = 1,800 GB × 0.6 = 1,080 GB

Example SQL code:

-- Simulate the cache requirement calculation
SELECT 'Node Type Selection',
       CASE
           WHEN size_of_tables_GB <= 1800 * 0.6 THEN 'S'
           WHEN size_of_tables_GB <= 3700 * 0.6 THEN 'M'
           WHEN size_of_tables_GB <= 7500 * 0.6 THEN 'L'
           ELSE 'XL'
       END AS recommended_node_type
FROM (
    SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) AS size_of_tables_GB
    FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name IN ('<tables you are interested to keep in cache>')
) AS active_dataset;
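
For a multi-node cluster, the active dataset is spread across the nodes’ caches, so the figure that matters is the per-node share. Below is a minimal sketch under the assumption that data distributes roughly evenly across nodes; the node count of 2 is a placeholder, and the SSD-sizing guide linked above covers this in more detail.

-- Per-node cache requirement for a multi-node cluster (sketch).
-- Assumes the active dataset spreads roughly evenly across nodes;
-- replace the node count (2) with your cluster's actual number of nodes.
SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) / 2 AS per_node_cache_gb
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name IN ('<tables you are interested to keep in cache>');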

Step 3: Use Observability Metrics to Refine Engine Size

Use information_schema.engine_metrics_history to monitor CPU, RAM, disk usage, and spilling size. These metrics help you determine whether your current engine configuration meets demand or whether you need to scale up or out.

Example SQL code:

-- Check peak engine utilization over the last day; results are shown as percentages
SELECT MAX(cpu_used)      AS max_cpu_used,
       MAX(memory_used)   AS max_memory_used,
       MAX(disk_used)     AS max_disk_used,
       MAX(spilling_size) AS max_spilling_size
FROM information_schema.engine_metrics_history
WHERE event_time >= NOW() - INTERVAL '1 day';
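
To tell sustained pressure apart from short spikes, you can also bucket the same metrics by hour. A minimal sketch, using only the columns already queried above:

-- Hourly peaks over the last day: consistently high values suggest resizing,
-- while isolated spikes may just reflect a single heavy query.
SELECT DATE_TRUNC('hour', event_time) AS hour_bucket,
       MAX(cpu_used)      AS max_cpu_used,
       MAX(memory_used)   AS max_memory_used,
       MAX(disk_used)     AS max_disk_used,
       MAX(spilling_size) AS max_spilling_size
FROM information_schema.engine_metrics_history
WHERE event_time >= NOW() - INTERVAL '1 day'
GROUP BY DATE_TRUNC('hour', event_time)
ORDER BY 1;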

Step 4: Decide on Scaling Up vs. Scaling Out

Based on the observed engine utilization metrics, make an informed decision about how to scale your Firebolt engine; a sketch that applies these rules appears after the list:

  • Scale Out: Opt for scaling out by adding more nodes when CPU utilization metrics indicate that CPU capacity is the primary bottleneck. This approach increases processing power across additional nodes, effectively distributing the load and improving overall system responsiveness.

  • Scale Up: Choose to scale up by selecting bigger nodes when metrics show high memory usage, spilling size or disk space utilization. Scaling up enhances the capacity of each individual node to handle larger datasets or more complex operations without the need for spilling to disk.
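
The sketch below turns yesterday’s peak metrics into a rough scaling hint. The 90% CPU and 80% memory/disk thresholds and the non-zero spilling check are assumptions chosen for illustration, not official Firebolt guidance; tune them to your own workloads.

-- Illustrative scaling hint; thresholds are assumptions, not official guidance.
SELECT CASE
           WHEN max_memory_used > 80 OR max_disk_used > 80 OR max_spilling_size > 0
               THEN 'Consider scaling up (bigger nodes)'
           WHEN max_cpu_used > 90
               THEN 'Consider scaling out (more nodes)'
           ELSE 'Current engine size looks sufficient'
       END AS scaling_hint
FROM (
    SELECT MAX(cpu_used)      AS max_cpu_used,
           MAX(memory_used)   AS max_memory_used,
           MAX(disk_used)     AS max_disk_used,
           MAX(spilling_size) AS max_spilling_size
    FROM information_schema.engine_metrics_history
    WHERE event_time >= NOW() - INTERVAL '1 day'
) AS peak_metrics;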

Step 5: Decide on adding more Clusters

Add More Clusters for Concurrency: If your observability metrics indicate an increasing number of queued queries, consider adding more clusters. This allows more queries to run simultaneously, thereby reducing wait times and improving throughput.

Example SQL code:

-- Count how many queries are currently SUSPENDED (queued)
SELECT COUNT(*) AS query_count_in_status_suspended
FROM information_schema.engine_running_queries
WHERE status = 'SUSPENDED';
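
For a fuller picture than the suspended count alone, the same view can be grouped by status; a persistently large SUSPENDED (queued) count over time is the signal to add clusters.

-- Breakdown of currently running queries by status.
SELECT status,
       COUNT(*) AS query_count
FROM information_schema.engine_running_queries
GROUP BY status
ORDER BY query_count DESC;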

Full Example Code

SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) AS size_of_tables_GB
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name IN ('<tables you are interested to keep in cache>');


SELECT 'Node Type Selection',
       CASE
           WHEN size_of_tables_GB <= 1800 * 0.6 THEN 'S'
           WHEN size_of_tables_GB <= 3700 * 0.6 THEN 'M'
           WHEN size_of_tables_GB <= 7500 * 0.6 THEN 'L'
           ELSE 'XL'
       END AS recommended_node_type
FROM (
    SELECT SUM(compressed_bytes / 1024.0 / 1024.0 / 1024.0) AS size_of_tables_GB
    FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name IN ('<tables you are interested to keep in cache>')
) AS active_dataset;


SELECT MAX(cpu_used)      AS max_cpu_used,
       MAX(memory_used)   AS max_memory_used,
       MAX(disk_used)     AS max_disk_used,
       MAX(spilling_size) AS max_spilling_size
FROM information_schema.engine_metrics_history
WHERE event_time >= NOW() - INTERVAL '1 day';


SELECT COUNT(*) AS query_count_in_status_suspended
FROM information_schema.engine_running_queries
WHERE status = 'SUSPENDED';