Engine Sizing for Simple Ingestion

Introduction

This article provides guidance for choosing the optimal engine specification for a simple ingestion case. You will most likely be performing this type of ingestion when you are loading an initial set of data for exploration or testing purposes.

For the purposes of this article, we define simple ingestion as meeting the following criteria:

  • The files to be ingested all have the same schema and will be inserted into the same table in Firebolt.

  • Data will be ingested into Firebolt using the COPY FROM command with schema inference enabled.

  • No primary or aggregating indexes have been defined on the Firebolt table.

  • Data will be bulk loaded into Firebolt one time, and no incremental loading filters have been applied.

TL;DR

  • Scale up your engine to increase overall ingestion performance. Larger engine types mean more resources which mean faster ingestion queries.

  • Add nodes to increase parallelism for cost-effective performance. By adding more, smaller nodes, you can often speed up your ingestion without a large increase in price.

  • Balance price and performance based on your ingestion requirements. Refer to the appendix with the full results of our benchmark testing to match engine specifications to your desired performance and cost.

  • Use information_schema.engine_metrics_history to further tune your performance. This table provides metrics about engine resource utilization.

Engine Sizing Recommendations

All engine sizing recommendations are based on ingestion tests that we performed on 163GB of data from the Ultra Fast Gaming sample dataset. We created 16 sets of data, each 163GB in total size, by dividing the “Playstats” table into CSV and Parquet files in 10MB, 100MB, 1GB, and 5GB sizes. We then ran a simple COPY FROM statement with schema inference enabled to ingest the data into Firebolt.

You can use the recommendations in this section as general guidelines and best practices for engine sizing based on the results from our ingestion tests.

For more specific guidance, refer to the complete testing results in Appendix: Engine Performance Details (at the end of this article). To use the appendix, find the section that best matches your ingestion workload in terms of file size and file type, and then choose your preferred engine configuration based on the price and performance data.

We also recommend monitoring your engine resource consumption during the ingestion process using the information_schema.engine_metrics_history and information_schema.engine_running_queries tables. These tables provide information about CPU, RAM and cache usage that can help you decide when to change your engine configuration for better performance. For more information about using these tables, refer to Sizing Engines.

Scale up your engine to increase overall ingestion performance

Node for node, scaling up your engine (i.e. going from a S engine to a M engine) will lead to faster raw ingestion performance. Larger engines have more compute resources available to process queries, and can apply more threads to the ingestion process.

The following table shows the average performance gain associated with scaling up an engine for each tested combination of file size and type:

File Size / Type Average Scale Up Performance Gain
10MB Parquet 30%
100MB Parquet 30%
1GB Parquet 24%
5GB Parquet 15%

To determine the average scale up performance gain, we compared the ingestion times for each of the following groupings:

  • 1 node S engine vs. 1 node M engine

  • 2 node S engine vs. 2 node M engine

  • 3 node S engine vs. 3 node M engine

  • 4 node S engine vs. 4 node M engine
    All of these groupings compare a baseline engine against an engine with twice the available resources.

The chart below summarizes the ingestion performance for the 100MB Parquet files. As you can see, increasing the node size led to performance improvements across the board.

100MB Files Ingestion Medium vs Small

In general, larger engine types will lead to faster ingestion times.

You can use the information_schema.engine_metrics_history table to assess whether to scale up your engine by paying attention to the amount of CPU used. When CPU is high, your ingestion workload might benefit from a larger engine. More information here: Sizing Engines.

Add nodes to increase parallelism for cost-effective performance

When you ingest data into Firebolt from S3, each individual file is streamed into the local memory of an engine node. Files are processed sequentially on each node. When you need to ingest multiple files, you can increase the parallelism of your ingestion process by adding additional nodes to your engine. Each node will then stream individual files into local memory for loading into Firebolt.

The following table shows the average performance gain associated with adding additional nodes to an engine for each tested combination of file size and type. “Scale out performance gain” refers to the additional performance achieved by doubling the number of nodes in an existing engine (e.g. going from a 1 node S engine to a 2 node S engine).

File Size / Type Average Scale Out Performance Gain
10MB Parquet 38%
100MB Parquet 40%
1GB Parquet 35%
5GB Parquet 21%

To determine the average scale out performance gain, we compared the ingestion times for each of the following groupings:

  • 1 node S engine vs. 2 node S engine

  • 2 node S engine vs. 4 node S engine

  • 1 node M engine vs. 2 node M engine

  • 2 node M engine vs. 4 node M engine
    All of these groupings compare a baseline engine against an engine with twice the available resources.

In general, adding an additional node to the engine improved ingestion time by approximately 25%.

Scaling out an engine can be a cost-effective way to improve your ingestion performance when you are processing multiple files. Compared with scaling up an engine, scaling out an engine provides almost the same performance benefit with only half of the cost increase. You can see this data in the table below:

File Size / Type Average Scale Out Performance Gain Average Scale Out Cost Increase Average Scale Up Performance Gain Average Scale Up Cost Increase
10MB Parquet 38% 23% 30% 40%
100MB Parquet 40% 20% 30% 40%
1GB Parquet 35% 30% 24% 52%
5GB Parquet 21% 59% 15% 71%

Although scaling out an engine can be more cost-effective than scaling up an engine, in some cases, scaling up an engine will improve performance more than scaling out. You can see this behavior for 100MB Parquet Files in the chart below:

100MB Files Ingestion Scaling Comparisons (1)

As you can see, when comparing raw performance between a 1 node S engine, a 2 node S engine, and a 1 node M engine, the fastest ingestion performance comes from the 1 node M engine. However, when comparing raw performance between a 2 node S engine, a 2 node M engine, and a 4 node S engine, the fastest ingestion performance comes from the 4 node S engine.

You can use the information_schema.engine_metrics_history table to assess whether to scale out your engine by paying attention to the amount of CPU used and the amount of RAM used. When both of these metrics are high, your ingestion workload might benefit from adding additional nodes to your engine. More information here: Sizing Engines.

Balance price and performance based on your ingestion requirements

Engine sizing for ingestion isn’t a one-size-fits-all equation, even for simple ingestion cases. Cost-sensitive organizations might prioritize spending less over having the fastest possible ingestion performance, while organizations who care most about speed might prioritize having the fastest ingestion workloads, even if they spend more money. The engine specification that the first organization chooses will be different from the engine specification that the second organization chooses.

You can see the difference in best performance and best price in the following table:

File Size / Type Best Performance Best Price
10MB Parquet 4 node M engine (66 secs) 1 node S engine (0.61 FBU)
10MB CSV 4 node M engine (216 secs) 1 node S engine (1.98 FBU)
100MB Parquet 4 node M engine (58 secs) 1 node S engine (0.55 FBU)
100MB CSV 4 node M engine (180 secs) 2 node S engine (1.82 FBU)
1GB Parquet 4 node M engine (58 secs) 1 node S engine (0.51 FBU)
1GB CSV 4 node M engine (170 secs) 2 node S engine (1.78 FBU)
5GB Parquet 4 node M engine (133 secs) 1 node S engine (0.56 FBU)
5GB CSV 4 node M engine (176 secs) 2 node S engine (1.66 FBU)

For each ingestion workload, the engine that provides the best performance has different specifications from the engine that provides the best price.

When choosing your engine, we recommend using the full results from our performance testing as a starting point. This data is contained in the appendix at the end of this article. To use the appendix, find the section that best matches your ingestion workload in terms of file size and file type, and then choose your preferred engine configuration based on the price and performance data.

Remember to also use the information_schema.engine_metrics_history and information_schema.running_queries tables to monitor your resource utilization throughout your ingestion process. Based on how your engine is consuming CPU, RAM, and cache, you might decide to scale up or add nodes. You can refer to Sizing Engines for more detailed information about using these tables.

Appendix: Engine Performance Details

10MB Parquet Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 275 0.61
M 1 175 0.78
L 1 155 1.38
S 2 198 0.88
M 2 144 1.28
S 3 109 0.73
M 3 78 1.04
S 4 92 0.82
M 4 66 1.17

100MB Parquet Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 249 0.55
M 1 165 0.73
L 1 146 1.30
S 2 179 0.80
M 2 134 1.19
S 3 100 0.67
M 3 66 0.88
S 4 79 0.70
M 4 58 1.03

1GB Parquet Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 231 0.51
M 1 165 0.68
L 1 146 1.34
S 2 172 0.76
M 2 134 1.29
S 3 100 0.67
M 3 66 1.04
S 4 82 0.73
M 4 58 1.12

5GB Parquet Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 253 0.56
M 1 188 0.84
L 1 145 1.29
S 2 185 0.82
M 2 158 1.40
S 3 187 1.25
M 3 164 2.19
S 4 141 1.25
M 4 133 2.36

10MB CSV Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 893 1.98
M 1 781 3.47
L 1 752 6.68
S 2 459 2.04
M 2 399 3.55
S 3 323 2.15
M 3 302 4.03
S 4 247 2.20
M 4 216 3.84

100MB CSV Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 843 1.87
M 1 650 2.89
L 1 522 4.64
S 2 409 1.82
M 2 293 2.60
S 3 304 2.03
M 3 244 3.25
S 4 231 2.05
M 4 180 3.20

1GB CSV Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 835 1.86
M 1 580 2.58
L 1 520 4.62
S 2 400 1.78
M 2 312 2.77
S 3 316 2.11
M 3 249 3.32
S 4 221 1.96
M 4 170 3.02

5GB CSV Files

Total data volume: 163GB

Engine Type Engine Nodes Ingestion Time (Sec) Query Cost (FBU)
S 1 750 1.67
M 1 555 2.47
L 1 553 4.92
S 2 373 1.66
M 2 380 3.38
S 3 281 1.87
M 3 238 3.17
S 4 202 1.80
M 4 176 3.13