Data warehouses provide organizations with a knowledgebase that is relied upon by decision makers. ETL (extract, transform, load) is the process that is responsible for ensuring the data warehouse is reliable, accurate, and up to date. This post presents a design pattern that forms the foundation for ETL processes.
What are the goals?
Before jumping into the design pattern it is important to review the purpose for creating a data warehouse. The following are some of the most common reasons for creating a data warehouse.
- Persist Data: Store data for predefined period regardless of source system persistence level
- Central View: Provide a central view into the organization’s data
- Data Quality: Resolve data quality issues found in source systems
- Single Version of Truth: Overcome different versions of same object value across multiple systems
- Common Model: Simplify analytics by creating a common model
- Easy to Navigate: Provide a data model that is easy for business users to navigate
- Fast Query Performance: Overcome latency issues related to querying disparate source systems directly
- Augment Source Systems: Mechanism for managing data needed to augment source systems
With these goals in mind we can begin exploring the foundation design pattern. The stated goals require that we create a copy of source system data and store this data in our data warehouse. Theoretically, it is possible to create a single process that collect data, transforms it, and loads it into a data warehouse.
Needless to say, this type of process will have numerous issues, but one of the biggest issues is the inability to adjust the data model without re-accessing the source system which will often not have historical values stored to the level required. A change such as converting an attribute from SCD Type 1 to SCD Type 2 would often not be possible. To support model changes without loss of historical values we need a consolidation area. This post will refer to the consolidation area as the PSA or persistent staging area.
The role of PSA is to store copies of all source system record versions with little or no modifications. With a PSA in place we now have a new reliable source that can be leverage independent of the source systems. More on PSA Between PSA and the data warehouse we need to perform a number of transformations to resolve data quality issues and restructure the data to support business logic.
On the upstream side of PSA we need to collect data from source systems. The source systems may be located anywhere and are not in the direct control of the ETL system which introduces risks related to schema changes and network latency/failure. To mitigate these risks we can stage the collected data in a volatile staging area prior to loading PSA. Once the data is staged in a reliable location we can be confident that the schema is as expected and we have removed much of the network related risks.
The interval which the data warehouse is loaded is not always in sync with the interval in which data is collected from source systems. To enable these two processes to run independently we need to delineate the ETL process between PSA and transformations.
With the two phases in place, collect & load, we can now further define the tasks required in the transform layer. The first task is to simply select the records that have not been processed into the data warehouse yet. This is often accomplished by creating load status flag in PSA which defaults to a not processed value. This task is needed for each destination dimension and fact table and is referred to as dimension source (ds) or fact source (fs).
Prior to loading a dimension or fact we also need to ensure that the source data is at the required granularity level. PSA retains all versions of all records which supports loading dimension attributes with history tracked. The granularity required by dimensions is the composite of effective date and the dimension’s natural key. Fact table granularity is typically the composite of all foreign keys. This granularity check or aggregation step must be performed prior to loading the data warehouse.
With the unprocessed records selected & the granularity defined we can now load the data warehouse. Typically there will be other transformations needed to apply business logic and resolve data quality issues. The final step is to mark PSA records as processed. This is easily supported since the source records have been captured prior to performing transformations.
The above diagram describes the foundation design pattern. Of course, there are always special circumstances that will require this pattern to be altered, but by building upon this foundation we are able to provide the features required in a resilient ETL (more accurately ELT) system that can support agile data warehousing processes.