Late Arriving Dimensions
When building a dimensional model it is critical that facts have accurate foreign keys pointing back to related dimensions. However, there are some situations where the fact record is known before the related dimension record. This situation is called a late arriving dimension. Example; We load a sales amount (fact) for a product (dimension) that does not exist in the product dimension. In this situation either an unknown value, such as –1, will be placed in the fact table due to a failed lookup to the product dimension, or the looked-up key will be pointing to the wrong version of the dimension record.
Dimension with SCD1 Attributes Only
When the dimension with the late arriving data contains only SCD1 or SCD0 attributes, we can handle this problem by simply cleaning up the fact table. One possible data flow is described below.
1. Initial Fact Load: Load all facts. Facts with failed dimension lookups will get a –1 as their foreign key value.
2. Load late arriving dimension records: When the dimension records are available they are loaded normally.
3. Fact Cleanup: All facts with failed lookups should be reloaded. If the foreign key is not part of the fact table primary key such as an accumulated snapshot fact table, then the facts can be reloaded without first deleting/disabling the current record. Otherwise, the current records should be deleted/disabled prior to reloading. Often this step involves marking the records as not processed in a consolidation area such as PSA.
Dimensions with SCD2 Attributes
Late arriving dimensions records which have attribute history tracked (SCD2) present a more complex scenario. Let’s first take a quick look at a dimension record with related facts as they exists prior to a late arriving dimension situation occurring.
In this scenario we have three versions of product 456 due to changed values for the Cost attribute which has history tracked as SCD2. So far all is good. Each fact is related to the correct version of the dimension record based on the fact’s date and the date range (start to end) of each dimension record.
During the next load process a new version of product 456 is collected (green). This record is effective as of 20120214. All of the red values are now incorrect.
The new version of the dimension record needs to be embedded within the existing version’s contiguous time span. To do so we need to modify the expired date of version 2 such that it expires prior to the new record’s start date. However, by doing this one of the facts that already has a foreign key pointing to version two has become out of date. This fact needs to be updated to point to the new record.
The above image shows the corrected values. Notice that the effected dimension records are updated & the fact record’s foreign key to product has been updated to point to the new product record version.
Handling a late arriving dimension situation where the effected dimension includes SCD2 attributes requires careful consideration. A broad brush solution is to delete all related dimension and fact records and later load them chronologically. A more surgical approach is to update records in place. Yet another option is to queue failed lookups (when no dimension version exists) and retry on each subsequent load. The chosen solution is often dictated by the amount of data being handled.
Data sources behave in a variety of ways. It is up to the ETL developer to understand this behavior and to create a plan to deal with all possible scenarios. Often the ETL design becomes complex when handling late arriving dimensions. LeapFrogBI users can handle these scenarios efficiently by eliminating much of the complexity related to common tasks such as dimension & fact loading.