Getting Started with Firebolt Engines

TL;DR

  • Start with a dev database for testing without affecting production.
  • Estimate memory needs for your largest file plus overhead.
  • Monitor CPU delays and adjust capacity as needed.
  • Choose the right instance family for your workload.
  • Scale up before out for better performance.

General Best Practices

  • Engine Sizing: Tailor engine size to your specific needs by testing various configurations to find a cost-effective balance for all workloads.
  • Monitoring: Use information_schema.query_history to track engine utilization, including memory and CPU usage.
  • Scaling Strategies:
    • For ingestion: Size the engine by file size and file count. Use larger nodes for big files and more nodes for many small files, but do not run more nodes than you have files, since the extra nodes would have nothing to ingest.
    • For querying: Favor fewer but larger nodes to boost performance and minimize result merging tasks.
  • Separate Engines: While a single engine can handle both ingestion and analytics, separate them if you face large ingestion volumes or need to manage query performance during intensive ingestion tasks.
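
The ingestion rule above (more nodes for many small files, never more nodes than files) can be sketched as a small helper. This is only an illustration in Python, not a Firebolt API: the function name suggest_ingest_nodes, the max_nodes cap, and the files_per_node ratio are assumptions chosen for the example, and real values should come from your own tests.

```python
def suggest_ingest_nodes(file_count: int, max_nodes: int = 8, files_per_node: int = 4) -> int:
    """Suggest a node count for an ingestion engine (hypothetical heuristic).

    More files -> more nodes, but never more nodes than files.
    """
    if file_count < 1:
        raise ValueError("file_count must be at least 1")
    if max_nodes < 1 or files_per_node < 1:
        raise ValueError("max_nodes and files_per_node must be at least 1")
    # Ceiling division: one node for every `files_per_node` files.
    wanted = -(-file_count // files_per_node)
    # Cap at the assumed engine limit and at the number of files.
    return min(wanted, max_nodes, file_count)
```

With these assumed defaults, 10 files map to 3 nodes and 100 files are capped at 8 nodes. For a handful of very large files the helper returns a small node count, which matches the advice to scale node size rather than node count in that case.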

Choosing the Right Instance Family

Select an instance family based on your specific workload requirements:

  • Memory Optimized (r series): Ideal for memory-intensive operations with many joins and aggregations.
  • Compute Optimized (c series): Best for CPU-heavy tasks with extensive filtering and high concurrency needs.
  • Balanced (m series): A good middle ground for workloads requiring both CPU and memory resources.
  • Storage Optimized (i series): Choose when a large cache is necessary to maintain performance, especially when the data does not fit in memory on the other instance families.
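
As a rough illustration, the selection above can be written as a lookup from coarse workload traits to a family letter. The function name pick_instance_family, its boolean parameters, and the choice to check the cache requirement first are assumptions made for this sketch, not part of Firebolt.

```python
def pick_instance_family(memory_heavy: bool, cpu_heavy: bool, needs_large_cache: bool = False) -> str:
    """Map coarse workload traits to an instance family letter (illustrative only)."""
    if needs_large_cache:
        return "i"  # storage optimized: large cache (assumed to take priority)
    if memory_heavy and not cpu_heavy:
        return "r"  # memory optimized: many joins and aggregations
    if cpu_heavy and not memory_heavy:
        return "c"  # compute optimized: heavy filtering, high concurrency
    return "m"      # balanced: both traits, or neither
```

Treat the result as a starting point for testing rather than a final answer; measured utilization should decide the family you keep.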

Ingestion Engine Tuning

Initial Setup

  • Development Database: Start by creating a development database as a sandbox, so that test ingestions never touch your production database.

Memory Management

  • Memory Requirements: Estimate memory based on the uncompressed size of your largest file plus an extra 15-20% for operational overhead.
  • Scaling for Memory: Start with engines that have a large memory capacity, possibly double your estimated needs, to identify the maximum memory requirements safely.
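
The arithmetic behind these two bullets is simple enough to sketch. The helper below is a hypothetical example (the name estimate_ingest_memory and its parameters are not a Firebolt API); it takes the overhead as a whole-number percentage to keep the calculation in exact integers, and it uses the "double your estimate" starting point mentioned above.

```python
from __future__ import annotations


def estimate_ingest_memory(largest_file_bytes: int, overhead_pct: int = 20) -> tuple[int, int]:
    """Return (estimated_bytes, starting_bytes) for an ingestion engine.

    estimated_bytes: uncompressed size of the largest file plus overhead.
    starting_bytes:  twice the estimate, as a safe capacity for first tests.
    """
    if largest_file_bytes < 0 or overhead_pct < 0:
        raise ValueError("inputs must be non-negative")
    # Integer arithmetic, rounded down to whole bytes.
    estimated = largest_file_bytes * (100 + overhead_pct) // 100
    starting = 2 * estimated
    return estimated, starting
```

For example, a largest file of 10 GiB uncompressed with 20% overhead gives an estimate of 12 GiB, so a first test engine with roughly 24 GiB of memory leaves room to observe the real peak before sizing down.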

File and CPU Handling

  • Small Files: Leverage more nodes to take advantage of parallel processing for numerous small files.
  • Large Files: Opt for larger nodes to handle fewer large files, rather than increasing the node count.
  • CPU Tuning: Monitor the cpu_delay_us metric in information_schema.query_history. If delays are consistently high, add CPU capacity to relieve the strain from intensive operations such as joins or aggregations.
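
One way to act on the CPU tuning advice is to summarize cpu_delay_us over recent queries. The sketch below assumes the rows have already been fetched from information_schema.query_history into dictionaries with a cpu_delay_us key, and that the value is in microseconds as the suffix suggests. The function name cpu_pressure, the 100,000 µs threshold, and the 20% share are arbitrary assumptions for illustration, not Firebolt recommendations.

```python
from typing import Iterable, Mapping


def cpu_pressure(
    rows: Iterable[Mapping[str, int]],
    delay_threshold_us: int = 100_000,  # assumed cutoff for a "delayed" query
    max_share: float = 0.2,             # assumed tolerable share of delayed queries
) -> bool:
    """Return True when too many queries waited on CPU (hypothetical rule)."""
    delays = [row["cpu_delay_us"] for row in rows]
    if not delays:
        return False  # no history yet, nothing to conclude
    delayed = sum(1 for d in delays if d > delay_threshold_us)
    return delayed / len(delays) > max_share
```

If the check keeps returning True during ingestion, that is a signal to try an engine with more CPU capacity and measure again.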

Analytics Engine Tuning

Performance Optimization

  • Indexing: Define indexes that match your query patterns. Well-chosen indexes reduce the work each query does and, with it, the engine capacity you need.
  • Engine Variety: Use different engine types for different query demands to prevent performance bottlenecks while keeping data consistent across workloads.

Engine Configuration

  • Instance Families: Select from Memory Optimized, Compute Optimized, Balanced, or Storage Optimized to suit specific workload demands.
  • Node Types: Larger nodes generally offer better performance, so scale up before scaling out.

Concurrency and Isolation

  • Separate Engines: For significant, infrequent ingestion tasks or when managing workload concurrency, use distinct engines for ingestion and analytics to avoid performance dips during heavy ingestion periods.

Conclusion

To sum up, effective Firebolt engine configuration hinges on understanding your data demands and performance goals. Through strategic sizing, monitoring, and instance selection, you can build a robust engine setup that balances efficiency, cost, and scalability.