Data ingestion, like any operation in a data platform, relies on resources (available memory, CPU, and local disk) to operate efficiently. Running out of local disk or CPU is generally a fairly easy problem to deal with: even if those resources are fully consumed, they only slow the application down. Running out of memory, however, can be fatal to the ingest operation. Below are some best practices and suggestions for planning and managing resources to reduce the risk of running out during ingestion.
Suggestion #1 - Dev database
Set up a dev database for testing ingestion.
A dev database gives us a safe space to verify findings, so we don't end up with a partial ingestion into a production database. It's best to keep this database around; adding additional workloads to a database affects ingestion, so repeated testing will be required.
Suggestion #2 - Memory estimation and testing
Since memory is the primary concern, given that running out of it can fail the job, the first step is to establish how much memory an ingestion engine requires for a single job. There is some science to this memory calculation, but it is not exact.
On a simple ingestion (where data is pulled in native form from object storage into an indexed table), things are pretty simple: each node in the engine needs at least as much memory as the largest file ingested, uncompressed. So if you have a 10 GB snappy-compressed Parquet file that uncompresses to 30 GB, 30 GB is needed for the operation. That said, since memory is also used by other background processes, it's recommended to add 15% to 20%. This generally gets through basic ingest, but things become a bit unscientific after this.
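The rule above can be sketched as a quick calculation. This is an illustrative helper, not a product API; the function name and the 20% default headroom are assumptions:

```python
# Rough node-memory estimate for a simple ingest, per the guidance above.
# Assumes you know (or can sample) each file's uncompressed size.

def required_node_memory_gb(uncompressed_file_sizes_gb, headroom=0.20):
    """Largest uncompressed file plus 15-20% headroom for background processes."""
    largest = max(uncompressed_file_sizes_gb)
    return largest * (1 + headroom)

# A batch whose largest file uncompresses to 30 GB needs ~36 GB per node:
print(required_node_memory_gb([12.0, 30.0, 7.5]))  # 36.0
```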
Joins to internal tables at ingestion and maintaining aggregating indexes take more memory and CPU. How much depends on how those joins and aggregating indexes are configured and can be hard to predict - this is where the dev database comes into play. Running the dev database on a large engine, configured with approximately twice the memory you believe will be necessary, is an easy way to find out how much you're actually going to need. Going big here is a good thing, as you'll never get the metrics necessary for prod if the dev run itself runs out of memory. It's a best practice to use memory-optimized nodes to test memory limits. The process is to run a "normal" ingestion in dev, then pull the memory-used statistics from information_schema.query_history. Whatever you intend to do in prod - incremental ingestion, or a full rebuild at every ingestion - do the same in dev. This will give you solid maximum memory requirements.
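Turning those dev measurements into a prod sizing number can be sketched as follows. Assume the memory-used figures (in bytes) for the ingest queries have been exported from information_schema.query_history; the function name and 15% safety margin are illustrative assumptions:

```python
# Derive a prod memory requirement from dev-run peak memory usage.

def prod_memory_requirement_gb(memory_used_bytes_samples, safety=0.15):
    """Peak memory observed across dev ingest runs, plus a safety margin."""
    peak_gb = max(memory_used_bytes_samples) / 1024**3
    return peak_gb * (1 + safety)

# Three dev ingest runs that peaked at roughly 20, 24, and 22 GB:
samples = [20 * 1024**3, 24 * 1024**3, 22 * 1024**3]
print(round(prod_memory_requirement_gb(samples), 1))  # 27.6
```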
From there, we can start tuning for efficiency. Here as well, there are a number of factors. How many files we're ingesting per job is important. If it's a lot of small-ish files, we'll want many smaller nodes (though no smaller in memory footprint than the largest file). With many files, we can take advantage of scale-out: more nodes also means more network interfaces, increasing the aggregate bandwidth to S3 and improving performance through parallelism. There is a limit, though: if there are only 5 files to ingest during a normal batch, more than 5 nodes over-provisions the engine, and you'll end up with idle nodes.
If there are only a few files, or one big file per batch, we'll want to scale up rather than out.
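The node-count side of this can be captured in a small heuristic. The cap and function name here are assumptions for illustration, not product settings:

```python
# Choose a node count per ingest batch: scale out for many small files,
# but never provision more nodes than there are files to ingest.

def nodes_for_batch(num_files, max_nodes=10):
    """More nodes than files just leaves nodes idle."""
    return min(num_files, max_nodes)

print(nodes_for_batch(50))  # 10 - many small files: scale out to the cap
print(nodes_for_batch(5))   # 5  - only 5 files: more nodes would sit idle
```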
Suggestion #3 - Tune CPU capacity
To tune CPU capacity, again, information_schema.query_history is your friend. If you see high cpu_delay_us metrics, configure your nodes with more cores. The goal here, between scaling up and out, is to reduce the time per ingest. Scaling up increases memory and CPU available for a single task, while scaling out increases the network capacity allowing better parallelism.
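A rough way to act on those metrics is to compare CPU delay against total query time. The 10% threshold here is an illustrative assumption, not a documented cutoff:

```python
# Flag CPU pressure from query_history-style metrics: a high ratio of
# cpu_delay_us to total query duration suggests the nodes need more cores.

def cpu_bound(cpu_delay_us, duration_us, threshold=0.10):
    """True if queries spent a notable share of their runtime waiting on CPU."""
    return cpu_delay_us / duration_us > threshold

print(cpu_bound(cpu_delay_us=3_000_000, duration_us=20_000_000))  # True
print(cpu_bound(cpu_delay_us=500_000, duration_us=20_000_000))    # False
```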
As you add additional parallel ingest workloads, it's important to repeat this exercise. If an engine is tuned for one ingest workload and another is added, performance can easily suffer.
To summarize, some rules of thumb:
- Node memory size is equal to the largest file ingested, uncompressed, + 15-20%.
- For many files - scale out, using up to as many nodes as there are files.
- The number of nodes should not exceed the number of files.
- For large files - increase the node size, not the number of nodes.
- Use the memory-used statistics from information_schema.query_history to check how much memory was actually used.
- If you see high cpu_delay_us metrics in information_schema.query_history, increase your nodes CPU.
- Scaling up (a larger node) increases the memory and CPU available for a single task/file; scaling out (more nodes) increases network capacity, allowing better parallelism.