Using a Column in Partition By VS in Primary Index

Introduction

Query performance improves when the data processed by the query can be pruned, reducing the amount of data the query has to read. Data can be pruned through the use of primary indexes and partitions. By the end of this article, you will know when to include a column in a partition and when to use it in a primary index.

TL;DR

  • When data is stored in a table, the data is organized into tablets. Each tablet has an ordered set of data.

  • Primary index columns are used to order the data.

  • Partition columns are used to order the data as well as to define how the data is organized into tablets.

  • Partition columns take precedence over index columns when ordering data on tablets and pruning data for queries.

  • Partitions are a data management tool, enabling the use of ALTER TABLEā€¦DROP PARTITION to quickly remove data from a table.

  • If there is no need to manage data by a column that could be used for pruning, it should be included in the Primary Index rather than used in a Partition.

Concepts

All the example SQL code uses the Ultra Fast Gaming data set. To familiarize yourself with this data set, visit this link: Ultra Fast Gaming Firebolt Sample Dataset.

Storage and Performance

When data is inserted into a table, it is grouped into chunks of data called tablets. If there are no partition columns, Firebolt attempts to create tablets that are 5-10 GB in size. The data in the tablet is then ordered by the primary index columns. When there are partition columns, Firebolt first groups the data by the partition column values and then creates tablets that only contain data for a particular partition value. Within that restriction, it then attempts to create tablets that are 5-10 GB in size. The data in the tablets is ordered first by the partition columns, and then by the primary index columns. Since the tablets are split by partition values, a table with partitions will generally have a larger number of tablets than one without partitions.

There are several guidelines in the Firebolt Documentation on best practices for when Working with partitions. The guidelines are informed by the impact of partition columns on data storage. Managing a large number of tablets incurs overhead, which can negatively impact performance.

The following example will demonstrate the effect on the number of tablets:

-- create a version of the playstats table that has a primary index 
CREATE TABLE playstats_primary (
  gameid integer NULL,
  playerid integer NULL,
  stattime timestampntz NULL,
  selectedcar text NULL,
  currentlevel integer NULL,
  currentspeed real NULL,
  currentplaytime bigint NULL,
  currentscore bigint NULL,
  event text NULL,
  errorcode text NULL,
  tournamentid integer NULL,
  source_file_name text NULL,
  source_file_timestamp timestampntz NULL
) PRIMARY INDEX playerid, tournamentid;

-- create a version of the playstats table that has a primary index and partitions
CREATE TABLE playstats_partition (
  gameid integer NULL,
  playerid integer NULL,
  stattime timestampntz NULL,
  selectedcar text NULL,
  currentlevel integer NULL,
  currentspeed real NULL,
  currentplaytime bigint NULL,
  currentscore bigint NULL,
  event text NULL,
  errorcode text NULL,
  tournamentid integer NULL,
  source_file_name text NULL,
  source_file_timestamp timestampntz NULL
) PRIMARY INDEX playerid
PARTITION BY tournamentid;

-- insert the same data into the tables
INSERT INTO playstats_primary SELECT * FROM playstats;
INSERT INTO playstats_partition SELECT * FROM playstats;

-- Check the number of tablets in each table
SELECT table_name, number_of_rows, number_of_tablets 
FROM information_schema.tables WHERE table_name ILIKE 'playstats_%';

The results of the last query show that the playstats_index table has 10 tablets. It will also show that the playstats_partition table has 157 tablets, since there are 157 different values in the tournament_id column. There will be more overhead managing 157 tablets than 10 tablets, so query performance on the playstats_partition may be slower than on the playstats_index table.

Query performance could also be impacted by the structure of the primary index. In both tables, player_id is the first column in the primary index. If a query specifies player_id as a filter, the index will be used to efficiently prune data for the query. However, it will be less efficient on the playstats_partition table because the data is ordered first by tournament_id and then by player_id. On the playstats_partition table, pruning will have to skip through each tournament_id to find the matching player_id, while on the playstats_index table, it does not have to skip.

Data Management

Having partitions allows data to be quickly removed from the table using the ALTER TABLE...DROP PARTITION. Due to the organization of the tablets, it is very efficient to locate and remove all the tablets with the value to be dropped from the database.

It is possible to avoid using partitions by using the DELETE command, which marks data as deleted. The DELETE can be very performant, especially if the columns filtering the delete are in the primary index. In this example, the DROP PARTITION will be faster than the DELETE, but they both accomplish the same thing.

-- drop all the data for tournament 115
ALTER TABLE playstats_partition DROP PARTITION 115;

-- delete all the data for tournament 115
DELETE FROM playstats_primary WHERE tournamentid = 115;

Full Example Code

CREATE TABLE playstats_primary (
  gameid integer NULL,
  playerid integer NULL,
  stattime timestampntz NULL,
  selectedcar text NULL,
  currentlevel integer NULL,
  currentspeed real NULL,
  currentplaytime bigint NULL,
  currentscore bigint NULL,
  event text NULL,
  errorcode text NULL,
  tournamentid integer NULL,
  source_file_name text NULL,
  source_file_timestamp timestampntz NULL
) PRIMARY INDEX playerid,
tournamentid;

CREATE TABLE playstats_partition (
  gameid integer NULL,
  playerid integer NULL,
  stattime timestampntz NULL,
  selectedcar text NULL,
  currentlevel integer NULL,
  currentspeed real NULL,
  currentplaytime bigint NULL,
  currentscore bigint NULL,
  event text NULL,
  errorcode text NULL,
  tournamentid integer NULL,
  source_file_name text NULL,
  source_file_timestamp timestampntz NULL
) PRIMARY INDEX playerid
PARTITION BY tournamentid;

INSERT INTO playstats_primary SELECT * FROM playstats;
INSERT INTO playstats_partition SELECT * FROM playstats;

SELECT table_name, number_of_rows,number_of_tablets FROM information_schema.tables WHERE table_name ILIKE 'playstats_%';

ALTER TABLE playstats_partition DROP PARTITION 115;

DELETE FROM playstats_primary WHERE tournamentid = 115;