Introduction
Warming up a table in Firebolt can improve query performance by preloading data into the cache. Data will be stored in the cache as normal queries are executed, but running warm-up queries after an engine is started will result in those normal queries running faster. By the end of this article, you will know how to effectively warm up tables using the CHECKSUM function.
TL;DR
-
Use the
CHECKSUM
function to warm up data in Firebolt. -
Focus on specific columns to save space and time.
-
Use filters to target specific data ranges.
-
Run multiple warm-up queries in parallel for large datasets.
-
Monitor cache usage during the warm-up process.
Step-by-Step Guide
All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Dataset.
Step 1: Identify the tables that should be warmed up
Tables that are large (millions of rows) and frequently used are candidates for being warmed up. When the normal workload is run, those tables will take the longest to read and have the biggest impact on initial query performance.
Step 2: Determine the type of warm-up approach
Basic Warm-up Using CHECKSUM
The simplest way to warm up data in Firebolt is by running a CHECKSUM
query on the desired table. This approach uses the most cache and takes the longest to run.
Example SQL code:
-- Warm-up the entire table
SELECT CHECKSUM(*) FROM playstats;
In this example, the CHECKSUM
function is applied to the entire playstats
table, which triggers Firebolt to read all the rows and columns, and storing the data in the cache as it does so.
If the normal workload does not require reading all the data in a table, some of the following approaches may be more appropriate.
Warm-up Specific Columns only
If your table has a large number of columns, you might not need to warm up all of them. Warming up only the columns that will be frequently accessed can save space and speed up the warm-up process.
Example SQL code:
-- Warm-up specific columns
SELECT CHECKSUM(GameID, PlayerID, CurrentScore) FROM playstats;
In this query, only the GameID
, PlayerID
, and CurrentScore
columns are warmed up. This approach is more efficient if the normal workload primarily accesses these columns.
Warm-up a subset of data using filters
Using filters in your warm-up queries helps focus on specific data subsets that are frequently used by the normal workload. This can be based on user activity, date ranges, or any other relevant criteria.
Example SQL code:
-- Warm-up data for specific game levels
SELECT CHECKSUM(*) FROM playstats WHERE CurrentLevel BETWEEN 1 AND 5;
-- Warm-up data for a date rangeSELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-01-01' AND '2022-03-31';
In the first query, only the data for players in levels 1 to 5 is warmed up. This is useful if most queries are expected to target these levels. In the second query, only the data associated with a time range is warmed up. This is useful if most queries are expected to focus on a particular time range, such as the most recent quarter.
Step 3: Determine if parallel warm-up queries would be beneficial
If it is necessary to warm up a large amount of data (millions or billions of rows), running multiple warm-up queries in parallel can significantly reduce the total warm-up time. You can split the dataset into smaller chunks based on specific criteria, such as date ranges. They can be executed at the same time by using multiple tabs in the Firebolt UI, each one running a different query.
Example SQL code:
-- Warm-up data in parallel by date ranges
-- Tab 1
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-01-01' AND '2022-03-31';
-- Tab 2
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-04-01' AND '2022-06-30';
-- Tab 3
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-07-01' AND '2022-09-30';
-- Tab 4SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-10-01' AND '2022-12-31';
In this example, the data is divided into four chunks based on StatTime
. Each query can be run in a separate tab, warming up the data in parallel.
Step 4: Execute warm-up queries and monitor cache usage
While the warm-up queries are executing, the amount of cache being used will increase. You can monitor the increase by running two queries in another tab.
Example SQL code:
-- Use SHOW CACHE
SHOW CACHE;
-- This will return both a total and a percentage of cache used
-- for example: '36.35/1750.52 GB (2.08%)'
-- Use information_schema.engine_metrics_history
SELECT event_time,disk_used FROM information_schema.engine_metrics_history ORDER BY 1 DESC;
-- This will return a date/time and portion cache used
-- for example: 2024-07-02 20:33:00+00 0.021
If cache use approaches 80% after the warm-up queries have completed, this indicates cache is getting filled up. While this will not cause any errors, some data may get evicted from the cache, affecting query performance. If the warm-up queries are using a subset of columns or filtering data, it may be necessary to use fewer columns or more restrictive filters. Another option would be to either increase the number of nodes or the size of existing nodes to provide more available cache and ensure the desired data stays in cache to support normal workloads.
Full Example Code
SELECT CHECKSUM(*) FROM playstats;
SELECT CHECKSUM(GameID, PlayerID, CurrentScore) FROM playstats;
SELECT CHECKSUM(*) FROM playstats WHERE CurrentLevel BETWEEN 1 AND 5;
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-01-01' AND '2022-03-31';
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-01-01' AND '2022-03-31';
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-04-01' AND '2022-06-30';
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-07-01' AND '2022-09-30';
SELECT CHECKSUM(*) FROM playstats WHERE StatTime BETWEEN '2022-10-01' AND '2022-12-31';
SHOW CACHE;
SELECT event_time,disk_used FROM information_schema.engine_metrics_history ORDER BY 1 DESC;