How to identify queries that need aggregating indexes

Objective

Once your Firebolt-backed analytics solution has been deployed in production, you should periodically assess the performance of the most commonly executed queries. By identifying queries that are scanning large amounts of data or taking longer to return results, you can make adjustments to your Firebolt indexes and queries to continually optimize the user experience of your analytics application.

One of the most common query types that benefit from ongoing performance analysis and optimization are queries that use aggregation functions. This article explains how to use the "information_schema.query_history" and "information_schema.indexes" tables to surface queries that would benefit from aggregating indexes.

Steps

You can determine which of your most common aggregate queries need aggregating indexes using the following three steps:

  1. Use data from "information_schema.query_history" and "information_schema.indexes" to see which queries are run most frequently, how long (on average) those queries take to return results, and whether an aggregating index already exists for the source table. The detailed walkthrough below provides an example query that you can use.
  2. Display the full text of the query that you want to optimize.
  3. Run EXPLAIN on the query that you want to optimize. This command shows the query plan, which provides guidance around how to build your aggregating index. For detailed instructions about how to build the most appropriate aggregating index for your query, refer to "Optimizing Queries with Aggregating Indexes in Firebolt".

Identify the most frequently executed aggregate queries

Not all aggregate queries necessarily require--or would benefit from--aggregation indexes. Some queries may be performing adequately by leveraging primary indexes; other queries may be so infrequently executed that a performance improvement does not justify the time spent on optimization. In general, you can rely on three guidelines to determine whether to create an aggregating index for a query:

  1. Is your query run frequently?
  2. Does the performance of the query impact a large number of end-users on a recurring basis?
  3. Does your query take significantly longer to produce results than your desired SLA?

If you answer "yes" to one or more of these questions, your query could be a good candidate for optimization.

To understand specifically how often each query is run, much time each query takes to return results, and whether an aggregating index already exists, you can query the "information_schema.query_history" and "information_schema.indexes" table. By aggregating the data in these tables, you can make a data-driven decision about where to focus your energy.

Be aware that "information_schema" tables only store data starting from the current engine startup time. For example, if you shut down your engine on 9/12/2023, and started that engine again on 9/14/2023, the "information_schema" tables will only display data starting on 9/14/2023.

You can use the following query to surface the most relevant, actionable information:

with queries_base as (select     query_id,    duration_usec/1000000 as duration_sec,     date_trunc('week',start_time) as start_week, --Modify this line to change the time frame for your query    scanned_bytes/(1024*1024*1024) as scanned_gb,    regexp_extract_all(lower(query_text_normalized), 'from ([a-z_.]+) ','i', 1) as tables_used,    array_distinct(array_concat(regexp_extract_all(lower(query_text_normalized), 'where ([a-z_.]+) ','i', 1), regexp_extract_all(lower(query_text_normalized), 'and ([a-z_.]+) ','i', 1))) as predicates_used,    case when match(lower(query_text_normalized), 'group by') = 1 then 1    else 0 end as has_aggregations,    * from information_schema.query_historywhere status = 'ENDED_SUCCESSFULLY'and match(lower(query_text_normalized), 'information_schema') = 0and match(lower(query_text_normalized), 'create|insert|drop|delete') = 0),indexes as (select table_catalog, table_name, index_name, index_definition from information_schema.indexeswhere index_type = 'aggregating'),aggregating_queries as (    select example_query_id, average_duration, average_scanned_gb, query_count, start_week, query_text_normalized, predicates_used, tables_used from (    select any_value(query_id) as example_query_id,    avg(duration_sec) as average_duration,    avg(scanned_gb) as average_scanned_gb,    count(*) as query_count,    start_week,    query_text_normalized,    tables_used,    predicates_usedfrom queries_base where has_aggregations = 1group by start_week, query_text_normalized, tables_used, predicates_used)unnest(tables_used))select q.*,     case when i.index_name is null then 'no existing aggregate index'    else 'has existing aggregate index' end as aggregate_index_status,    i.index_name as aggregate_index_name,    i.index_definitionfrom aggregating_queries qleft outer join indexes ion q.tables_used = i.table_namewhere q.start_week = date_add('week', -1, date_trunc('week', current_date())) --Also modify this line to change the time frame of your queryorder by query_count desc

By default, this query looks at the prior week of data. You can modify the commented lines in the query to change the time frame.

This query returns the following columns:

Column Name

Description

example_query_id

For the configured time period and normalized query pattern, displays a randomly selected query ID.

average_duration

For the configured time period and normalized query pattern, returns the average query duration, in seconds.

average_scanned_gb

For the configured time period and normalized query pattern, returns the average GB scanned.

query_count

For the configured time period and normalized query pattern, returns the number of queries submitted.

start_week

The week (starting on Monday) when the queries were executed.

query_text_normalized

Text of the SQL statement with sanitized values for literals.

predicates_used

An array listing the columns used in your query's WHERE statements.

tables_used

The names of the tables used by your query.

When a single query uses multiple tables, multiple rows will be returned for each normalized query pattern. A single row will contain one table name.

aggregate_index_status

This column contains either of the following two values:

  • no existing aggregate index
  • has existing aggregate index

aggregate_index_name

If "aggregate_index_status" equals "has existing aggregate index", this column will display the name of the aggregate index associated with the fact table.

If "aggregate_index_status" equals "no existing aggregate index", this column will display a NULL value.

index_definition

If "aggregate_index_status" equals "has existing aggregate index", this column will display the index definition for the aggregate index associated with the fact table.

If "aggregate_index_status" equals "no existing aggregate index", this column will display a NULL value.

Example: Using the query

As an example, let's imagine that you ran the included query on your Firebolt database. In this scenario, your end users consume your analytics dashboard on a daily basis, and your team has defined a 5 second SLA for query response times. The query returns the results shown in the following two images:

Based on these results, three query patterns jump out as candidates for aggregate index creation. These queries are highlighted in red in the image below:

Why do these three queries present the strongest opportunities for optimizing performance with aggregating indexes?

  • These queries are all in the top 10 most frequently executed queries in a single week.
  • These queries all take more than 5 seconds to return results.

Now let's compare these three queries to determine which aggregating index we should create first.

Query 1:

  • Although the first query is executed the most frequently (56 times, compared to 55 times and 28 times), it is performing the best out of our three candidates. Additionally, this query scans a relatively small amount of data (2.97GB), which implies that it is either using a small table (in which case the overhead might outweigh the benefits of an aggregating index), or already leveraging primary indexes effectively. This query might benefit from an aggregating index, but it should not be your first priority.

Query 2:

  • This query is executed very frequently in a single week. It also takes over 30 seconds to return results, and scans almost 25GB of data. Given the query frequency, average duration, and volume of scanned data, this query is a strong candidate for an aggregating index.

Query 3:

  • This query has the longest average duration and scans the most data out of our three candidates. However, query 3 is the 8th most frequently executed query in a single week. Based purely on query frequency, you should likely prioritize it below query 2.

Based on this analysis, you should likely prioritize creating aggregating indexes for these queries in the following order:

  1. Query 2
  2. Query 3
  3. Query 1

Find the full text of your selected query

Once you have selected the query that you want to create an aggregating index for, you want to display the full text of the query. The example script that we provided earlier groups queries based on normalized query text. This provides the most accurate representation of query pattern frequency. However, to begin creating an aggregating index, you need to display the non-normalized query text.

You can show the non-normalized query text by copying the value in the "example_query_id" column and pasting it into the following query:

select query_text from information_schema.query_historywhere query_id = '<your query id>'and status = 'ENDED_SUCCESSFULLY'

This SQL command displays the complete query text in the results.

Display the query plan for your selected query

Finally, you can copy the complete query text and run an EXPLAIN command to display the query plan.

explain<your query>

From this point forward, you should follow the instructions in "Optimizing Queries with Aggregating Indexes in Firebolt". This article provides detailed information about how to build and test the most effective aggregating index for your query.