Introduction
This article provides guidance for choosing the optimal engine specification for a simple ingestion case. You will most likely be performing this type of ingestion when you are loading an initial set of data for exploration or testing purposes.
For the purposes of this article, we define simple ingestion as meeting the following criteria:
-
The files to be ingested all have the same schema and will be inserted into the same table in Firebolt.
-
Data will be ingested into Firebolt using the COPY FROM command with schema inference enabled.
-
No primary or aggregating indexes have been defined on the Firebolt table.
-
Data will be bulk loaded into Firebolt one time, and no incremental loading filters have been applied.
TL;DR
-
Scale up your engine to increase overall ingestion performance. Larger engine types mean more resources which mean faster ingestion queries.
-
Add nodes to increase parallelism for cost-effective performance. By adding more, smaller nodes, you can often speed up your ingestion without a large increase in price.
-
Balance price and performance based on your ingestion requirements. Refer to the appendix with the full results of our benchmark testing to match engine specifications to your desired performance and cost.
-
Use information_schema.engine_metrics_history to further tune your performance. This table provides metrics about engine resource utilization.
Engine Sizing Recommendations
All engine sizing recommendations are based on ingestion tests that we performed on 163GB of data from the Ultra Fast Gaming sample dataset. We created 16 sets of data, each 163GB in total size, by dividing the “Playstats” table into CSV and Parquet files in 10MB, 100MB, 1GB, and 5GB sizes. We then ran a simple COPY FROM
statement with schema inference enabled to ingest the data into Firebolt.
You can use the recommendations in this section as general guidelines and best practices for engine sizing based on the results from our ingestion tests.
For more specific guidance, refer to the complete testing results in Appendix: Engine Performance Details (at the end of this article). To use the appendix, find the section that best matches your ingestion workload in terms of file size and file type, and then choose your preferred engine configuration based on the price and performance data.
We also recommend monitoring your engine resource consumption during the ingestion process using the information_schema.engine_metrics_history
and information_schema.engine_running_queries
tables. These tables provide information about CPU, RAM and cache usage that can help you decide when to change your engine configuration for better performance. For more information about using these tables, refer to Sizing Engines.
Scale up your engine to increase overall ingestion performance
Node for node, scaling up your engine (i.e. going from a S engine to a M engine) will lead to faster raw ingestion performance. Larger engines have more compute resources available to process queries, and can apply more threads to the ingestion process.
The following table shows the average performance gain associated with scaling up an engine for each tested combination of file size and type:
File Size / Type | Average Scale Up Performance Gain |
---|---|
10MB Parquet | 30% |
100MB Parquet | 30% |
1GB Parquet | 24% |
5GB Parquet | 15% |
To determine the average scale up performance gain, we compared the ingestion times for each of the following groupings:
-
1 node S engine vs. 1 node M engine
-
2 node S engine vs. 2 node M engine
-
3 node S engine vs. 3 node M engine
-
4 node S engine vs. 4 node M engine
All of these groupings compare a baseline engine against an engine with twice the available resources.
The chart below summarizes the ingestion performance for the 100MB Parquet files. As you can see, increasing the node size led to performance improvements across the board.
In general, larger engine types will lead to faster ingestion times.
You can use the information_schema.engine_metrics_history
table to assess whether to scale up your engine by paying attention to the amount of CPU used. When CPU is high, your ingestion workload might benefit from a larger engine. More information here: Sizing Engines.
Add nodes to increase parallelism for cost-effective performance
When you ingest data into Firebolt from S3, each individual file is streamed into the local memory of an engine node. Files are processed sequentially on each node. When you need to ingest multiple files, you can increase the parallelism of your ingestion process by adding additional nodes to your engine. Each node will then stream individual files into local memory for loading into Firebolt.
The following table shows the average performance gain associated with adding additional nodes to an engine for each tested combination of file size and type. “Scale out performance gain” refers to the additional performance achieved by doubling the number of nodes in an existing engine (e.g. going from a 1 node S engine to a 2 node S engine).
File Size / Type | Average Scale Out Performance Gain |
---|---|
10MB Parquet | 38% |
100MB Parquet | 40% |
1GB Parquet | 35% |
5GB Parquet | 21% |
To determine the average scale out performance gain, we compared the ingestion times for each of the following groupings:
-
1 node S engine vs. 2 node S engine
-
2 node S engine vs. 4 node S engine
-
1 node M engine vs. 2 node M engine
-
2 node M engine vs. 4 node M engine
All of these groupings compare a baseline engine against an engine with twice the available resources.
In general, adding an additional node to the engine improved ingestion time by approximately 25%.
Scaling out an engine can be a cost-effective way to improve your ingestion performance when you are processing multiple files. Compared with scaling up an engine, scaling out an engine provides almost the same performance benefit with only half of the cost increase. You can see this data in the table below:
File Size / Type | Average Scale Out Performance Gain | Average Scale Out Cost Increase | Average Scale Up Performance Gain | Average Scale Up Cost Increase |
---|---|---|---|---|
10MB Parquet | 38% | 23% | 30% | 40% |
100MB Parquet | 40% | 20% | 30% | 40% |
1GB Parquet | 35% | 30% | 24% | 52% |
5GB Parquet | 21% | 59% | 15% | 71% |
Although scaling out an engine can be more cost-effective than scaling up an engine, in some cases, scaling up an engine will improve performance more than scaling out. You can see this behavior for 100MB Parquet Files in the chart below:
As you can see, when comparing raw performance between a 1 node S engine, a 2 node S engine, and a 1 node M engine, the fastest ingestion performance comes from the 1 node M engine. However, when comparing raw performance between a 2 node S engine, a 2 node M engine, and a 4 node S engine, the fastest ingestion performance comes from the 4 node S engine.
You can use the information_schema.engine_metrics_history
table to assess whether to scale out your engine by paying attention to the amount of CPU used and the amount of RAM used. When both of these metrics are high, your ingestion workload might benefit from adding additional nodes to your engine. More information here: Sizing Engines.
Balance price and performance based on your ingestion requirements
Engine sizing for ingestion isn’t a one-size-fits-all equation, even for simple ingestion cases. Cost-sensitive organizations might prioritize spending less over having the fastest possible ingestion performance, while organizations who care most about speed might prioritize having the fastest ingestion workloads, even if they spend more money. The engine specification that the first organization chooses will be different from the engine specification that the second organization chooses.
You can see the difference in best performance and best price in the following table:
File Size / Type | Best Performance | Best Price |
---|---|---|
10MB Parquet | 4 node M engine (66 secs) | 1 node S engine (0.61 FBU) |
10MB CSV | 4 node M engine (216 secs) | 1 node S engine (1.98 FBU) |
100MB Parquet | 4 node M engine (58 secs) | 1 node S engine (0.55 FBU) |
100MB CSV | 4 node M engine (180 secs) | 2 node S engine (1.82 FBU) |
1GB Parquet | 4 node M engine (58 secs) | 1 node S engine (0.51 FBU) |
1GB CSV | 4 node M engine (170 secs) | 2 node S engine (1.78 FBU) |
5GB Parquet | 4 node M engine (133 secs) | 1 node S engine (0.56 FBU) |
5GB CSV | 4 node M engine (176 secs) | 2 node S engine (1.66 FBU) |
For each ingestion workload, the engine that provides the best performance has different specifications from the engine that provides the best price.
When choosing your engine, we recommend using the full results from our performance testing as a starting point. This data is contained in the appendix at the end of this article. To use the appendix, find the section that best matches your ingestion workload in terms of file size and file type, and then choose your preferred engine configuration based on the price and performance data.
Remember to also use the information_schema.engine_metrics_history
and information_schema.running_queries
tables to monitor your resource utilization throughout your ingestion process. Based on how your engine is consuming CPU, RAM, and cache, you might decide to scale up or add nodes. You can refer to Sizing Engines for more detailed information about using these tables.
Appendix: Engine Performance Details
10MB Parquet Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 275 | 0.61 |
M | 1 | 175 | 0.78 |
L | 1 | 155 | 1.38 |
S | 2 | 198 | 0.88 |
M | 2 | 144 | 1.28 |
S | 3 | 109 | 0.73 |
M | 3 | 78 | 1.04 |
S | 4 | 92 | 0.82 |
M | 4 | 66 | 1.17 |
100MB Parquet Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 249 | 0.55 |
M | 1 | 165 | 0.73 |
L | 1 | 146 | 1.30 |
S | 2 | 179 | 0.80 |
M | 2 | 134 | 1.19 |
S | 3 | 100 | 0.67 |
M | 3 | 66 | 0.88 |
S | 4 | 79 | 0.70 |
M | 4 | 58 | 1.03 |
1GB Parquet Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 231 | 0.51 |
M | 1 | 165 | 0.68 |
L | 1 | 146 | 1.34 |
S | 2 | 172 | 0.76 |
M | 2 | 134 | 1.29 |
S | 3 | 100 | 0.67 |
M | 3 | 66 | 1.04 |
S | 4 | 82 | 0.73 |
M | 4 | 58 | 1.12 |
5GB Parquet Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 253 | 0.56 |
M | 1 | 188 | 0.84 |
L | 1 | 145 | 1.29 |
S | 2 | 185 | 0.82 |
M | 2 | 158 | 1.40 |
S | 3 | 187 | 1.25 |
M | 3 | 164 | 2.19 |
S | 4 | 141 | 1.25 |
M | 4 | 133 | 2.36 |
10MB CSV Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 893 | 1.98 |
M | 1 | 781 | 3.47 |
L | 1 | 752 | 6.68 |
S | 2 | 459 | 2.04 |
M | 2 | 399 | 3.55 |
S | 3 | 323 | 2.15 |
M | 3 | 302 | 4.03 |
S | 4 | 247 | 2.20 |
M | 4 | 216 | 3.84 |
100MB CSV Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 843 | 1.87 |
M | 1 | 650 | 2.89 |
L | 1 | 522 | 4.64 |
S | 2 | 409 | 1.82 |
M | 2 | 293 | 2.60 |
S | 3 | 304 | 2.03 |
M | 3 | 244 | 3.25 |
S | 4 | 231 | 2.05 |
M | 4 | 180 | 3.20 |
1GB CSV Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 835 | 1.86 |
M | 1 | 580 | 2.58 |
L | 1 | 520 | 4.62 |
S | 2 | 400 | 1.78 |
M | 2 | 312 | 2.77 |
S | 3 | 316 | 2.11 |
M | 3 | 249 | 3.32 |
S | 4 | 221 | 1.96 |
M | 4 | 170 | 3.02 |
5GB CSV Files
Total data volume: 163GB
Engine Type | Engine Nodes | Ingestion Time (Sec) | Query Cost (FBU) |
---|---|---|---|
S | 1 | 750 | 1.67 |
M | 1 | 555 | 2.47 |
L | 1 | 553 | 4.92 |
S | 2 | 373 | 1.66 |
M | 2 | 380 | 3.38 |
S | 3 | 281 | 1.87 |
M | 3 | 238 | 3.17 |
S | 4 | 202 | 1.80 |
M | 4 | 176 | 3.13 |