How to choose between fact and dimension for your table

Why

When creating a table inside of Firebolt, users must choose the type of the table between FACT and DIMENSION. It's not always clear which type to use.

Background

Skip this section if you understand the concepts of a Fact vs a Dimension table in traditional data warehouse modeling (Star or Snowflake schemas).

A common pattern in data is one table containing a stream of measurable events.  In retail this could be a sales table, in IOT a temperature reading table, in medical a diagnosis table, etc.  These tables are called fact tables and contain things to be tracked called measures.  Additional details on the participants / items in the event, which change less frequently are placed in auxiliary tables known as dimension tables. The picture below is one frequently used to show a star schema of fact and dimension tables for a traditional sales transaction database model.

star schema (1).png

The fact table contains keys pointing to a single record in the dimension (in some models, it's a multi-part key such as a key and time).
Note the above example features a time dimension which is generally not best practice, unless utilizing a non-standard calendar, as elements of time are built into the SQL language and so this join can be avoided.

In general:
- Fact tables contain much more rows than dimension tables
- Fact tables are usually not updated, while dimensions are updated as needed
- Fact tables contain the foreign keys of the dimension tables
- Fact tables contain measures, which are often aggregated by users
- Dimension tables are used to describe the fact record

While this is a well established model, it doesn't always fit, which leads to confusion on utilizing the fact or dimension table type in Firebolt.

Questions to Consider when deciding between a Fact and Dimension

  1. Does the table contain measures that need large aggregations which could benefit from an Aggregate Index (an average of 100, or more, records rolled up and queried)?
    Aggregate Indexes are only supported on fact tables. If you need heavy aggregation within the table (not across joins), setup the table as a fact.

  2. What is the join pattern for the table?
    If other tables join to the table in question to lookup a value (for example a friendly name, a threshold etc) in a pattern where many records in the other table can join to a single record for the table in question, then it should be a dimension table unless the table is large and should be sharded (lookup operations should not be sharded).
    In multi-node engines fact tables are sharded across nodes, while dimensions are replicated. When a fact table is sharded, different ranges of the table are divided among the nodes (using the columns of the primary index).  Dimension tables are replicated, or copied, on each node.

    image.png


    For best performance, joins should not require shuffling records between nodes; therefore, fact to fact joins should be avoided.
    Dimensions, as they are replicated, take up more SSD cache space (size of dimension data being used * number of nodes). 
    This tradeoff determines the best choice of a type for tables in a multi-node engine.  In a single node engine, records are never shuffled and this is not a consideration but should be accounted for if data or users will grow and require multi-node in the future.
    Join indexes are only available on dimension tables and are a further refinement to speed up lookup operations (given
    unique column = x fetch me y column(s) from the dimension). These indexes reside in RAM (not cache) for the ultimate join performance (but only for the very specific use case of a lookup, while there are many other joins that will not benefit from a join index).