Introduction
There was an interesting discussion on Data Vault load dates in the LinkedIn Data Vault discussion groups, and on how you can go wrong by not understanding the different incarnations/types of load dates.
Load Dates (actually load time-stamps) are one of the most important metadata attributes for a Data warehouse. Determining the current "Load Date" seems very easy, until you take things like auditability, restartability, parallelism, Audit Trail/CDC loads, message/real time feeds and historic loads into account. In this blog post I'll try to describe the most common load dates used in Data warehousing and how you should capture them.
Types of Load Dates
There are several kinds of time-stamps that can be registered within a Data warehouse loading process. While not all types of Data warehouses register the same time-stamps (or call them by the same names), from a process logistics perspective there is only a limited set of dates that can be logged for a Data warehouse. Note that all actions and dates also have corresponding periods in which the process that generates the load date actually runs, but we will not discuss those here.
Load Cycle Date
In batch-oriented ETL architectures the most important date is the Load Cycle Date. It is the functional key that identifies an actual batch run. The granularity of the Load Cycle Date is usually a day, but it can under circumstances be an hour or even minutes or seconds. The Load Cycle Date does not have to represent the actual start or end date/time of the ETL process. The actual start usually lies after the Load Cycle Date, but this depends on the ETL scheduling. The Load Cycle Date is an abstraction from the actual load process date/period. It is used both to simplify querying and maintaining time-lines and as a functional/foreign key to the batch run metrics of a specific load. This way it groups together all records that were loaded during a certain batch run. See the post on the "One Column" for more details on a metadata model that uses the Load Cycle Date.
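The idea of the Load Cycle Date as an abstraction from the actual process time can be sketched as follows. This is a minimal, hypothetical illustration (function and key format are my own, not from the post): every record of one batch run shares the same functional key, regardless of when the individual inserts actually happened.

```python
from datetime import datetime, timezone

def load_cycle_date(run_start: datetime, granularity: str = "day") -> str:
    """Derive a Load Cycle Date key from a batch run's nominal start.

    The key abstracts away the actual process time: all records loaded
    in the same batch run share the same key.
    """
    if granularity == "day":
        return run_start.strftime("%Y%m%d")
    if granularity == "hour":
        return run_start.strftime("%Y%m%d%H")
    raise ValueError(f"unsupported granularity: {granularity}")

# The nightly run that actually started minutes after midnight still
# belongs to the cycle identified by its nominal date.
cycle = load_cycle_date(datetime(2013, 5, 7, 0, 12, tzinfo=timezone.utc))
# cycle == "20130507"
```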
(Source) Transaction time-stamp
The (source) Transaction (Load) Date (formally the transaction commit time-stamp) is the time-stamp at which a transaction record was processed/committed, either in a source system or in a transaction-based source extract. In most batch-oriented ETL architectures this date is not available because the source system does not keep track of all transactions. A data feed from an Audit Trail/Change Data Capture log, however, might supply transaction time-stamps that can be used as a source for a Transaction Load Date. You can use such a date to reconstruct the source system transaction timeline. While under certain conditions a Data warehouse load process should be able to rely on externally generated time-stamps of e.g. an Audit Trail, some extra precautions are required. You need to verify the external configuration and also the external time service (if applicable) that is used. Also do not forget to check daylight saving time settings. Just to be safe, check incoming transaction dates even if you have a trusted source.
(Source) Application record time-stamps
These are record insert/create and update/modify time-stamps maintained by source systems. Do not trust application-code-regulated date mutations to supply 100% accurate transaction dates. Only an external, transaction-driven mechanism like a DBMS transaction log reader or triggers should be used. Code-regulated record dates rely on programmers honoring them in all their code. In most data-centric applications there are several ways for programmers to (inadvertently) skip setting application record dates correctly, and in doing so they might ruin your Data warehouse loading process if you rely on these dates!
Message (systems) time-stamps
For message systems like message queues or message bus systems, a message (process) time-stamp is sometimes provided.
Data warehouse (insert/update) transaction time-stamp
This time-stamp registers the transaction (commit) time-stamp for either an insert or update into the Data warehouse.
Data warehouse Record Insertion Time-stamp
This is the actual time-stamp of the individual record insertion. It can usually only be registered on a DBMS by creating a column that defaults to a time function registering the current date and time.
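Such a defaulting column can be sketched with SQLite from Python's standard library (the table and column names are illustrative; other DBMSs use similar but not identical DEFAULT clauses):

```python
import sqlite3

# The column defaults to the current date/time at insert, so every row
# carries its own insertion time-stamp without the loader setting it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_bk   TEXT NOT NULL,
        dwh_insert_ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("INSERT INTO hub_customer (customer_bk) VALUES ('C-1001')")
ts = conn.execute("SELECT dwh_insert_ts FROM hub_customer").fetchone()[0]
# ts looks like 'YYYY-MM-DD HH:MM:SS', set by the DBMS, not by the ETL code
```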
Extraction/Snapshot Time-stamp
Extraction dates specify when a certain data extract (data dump) was made from a source system database. Extraction snapshot time-stamps specify which time-stamp was used to create a transactionally consistent extraction/snapshot set. These snapshot time-stamps, when reliable, are very valuable and can be used with the same caveats as the source transaction time-stamp. For loading strategies based on full extracts, using transactionally consistent snapshots should be the default. An extraction time-stamp can only equal the quality of a snapshot time-stamp when it is guaranteed that no data modification occurred during the extraction window.
ETL Process time-stamps
The (ETL) Process time-stamp(s) signify the point in time when a certain record or set of records has been processed by an ETL process into a Data warehouse. This can be an independent, ETL-(sub)process-generated time-stamp, but more often it is actually the Data warehouse transaction time-stamp of the insert/update of the Data warehouse tables. Other Process time-stamps usually lack (transactional) consistency and are only interesting for measuring a process's duration.
Real Time Process time-stamp
The (Real Time) Process time-stamp is only important in non-staged Real Time scenarios. In those circumstances there is no formal load cycle and hence no Load Cycle Date. Usually, however, this time-stamp is a synonym for another time-stamp like the message or DWH transaction time-stamp.
Historized (Expected) Load Cycle Date
This is a very special type of Load Date. It is only useful when doing historic loads or when your batch-oriented ETL process needs to be extremely robust and cannot rely on fixed "extract and load" cycles. It signifies what the original expected Load Cycle Date should have been if the ETL process had run correctly at that time. In batch-oriented loading this date is usually derived from a source extraction time-stamp, an extraction snapshot time-stamp or the available Transaction Load time-stamps. Note that a Historized Load Cycle Date based on a historical load might not have the same granularity as your standard Load Cycle Date. It might be loaded from saved monthly extracts instead of daily feeds. Naturally, we can calculate not only past but also future (expected) load cycle dates.
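The derivation from an extraction snapshot time-stamp, including the coarser granularity of a saved monthly extract, can be sketched as follows (a hypothetical helper; the key formats are illustrative, not from the post):

```python
from datetime import datetime

def historized_load_cycle_date(snapshot_ts: datetime,
                               granularity: str = "month") -> str:
    """Derive the Load Cycle Date an extract *would* have had if the ETL
    process had run correctly at that time, at the given granularity."""
    if granularity == "month":
        return snapshot_ts.strftime("%Y%m")
    if granularity == "day":
        return snapshot_ts.strftime("%Y%m%d")
    raise ValueError(f"unsupported granularity: {granularity}")

# A saved monthly extract snapshotted on 2011-03-28 is mapped to the
# March 2011 cycle, coarser than the daily key of normal loads.
cycle = historized_load_cycle_date(datetime(2011, 3, 28))
# cycle == "201103"
```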
Loading
Real Time/Message based loading
In real time loading scenarios, when other types of Load Dates are not available, we only have the Process time-stamp/Message time-stamp or the Data warehouse transaction update/insert time-stamp.
Historic Loading
Historic loads are easy to do once you have established the different types of load dates. The only extra element is that in a historic load you also have a Historized Load Cycle Date. This date represents the (approximate) Load Cycle Date that would have been correct if the Data warehouse had been available at the time of the data extract.
Restartability and Historic Loading
The Data warehouse load process should be fully restartable. However, the source extracts and stage loads are another matter. While CDC/Audit Trail solutions often have good restartability, usually backed by the restartability/transactional consistency of the underlying DBMS products, batch-oriented extractions don't match this by default. A good load process accepts that there is not always a 1:1 relationship between source extracts/stage loads and Data warehouse loads. This might not be the default situation, but due to hardware or network issues there might be several extracts waiting for one Data warehouse load to process them, or several Data warehouse loads might be required to process one supplied source extract. This many-to-many relationship reinforces the idea of adding an ETL (or source) driven Extraction/Snapshot time-stamp to the process to distinguish extracts, and of using this time-stamp to drive the Transaction Load Date and Historized (Expected) Load Cycle Date. This means a batch load can be treated as a special kind of primitive Change Data Capture. By mimicking CDC behavior you increase the robustness of the load routines. A detailed discussion of this technique is a topic for another post.
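The "several extracts waiting for one load" case can be sketched like this. It is a minimal, hypothetical illustration (file names and dict keys are my own): pending extracts are identified by their snapshot time-stamp and consumed oldest-first, which is the primitive-CDC behavior described above.

```python
from datetime import datetime

# Two extracts piled up (e.g. after a network outage), listed out of order.
pending_extracts = [
    {"file": "cust_20130506.csv", "snapshot_ts": datetime(2013, 5, 6, 23, 0)},
    {"file": "cust_20130505.csv", "snapshot_ts": datetime(2013, 5, 5, 23, 0)},
]

def process_pending(extracts):
    """Process all waiting extracts oldest-first; each record set is
    tagged with its snapshot time-stamp, which then drives the
    Transaction Load Date / Historized Load Cycle Date."""
    loaded = []
    for ext in sorted(extracts, key=lambda e: e["snapshot_ts"]):
        loaded.append((ext["file"], ext["snapshot_ts"]))
    return loaded

order = [name for name, _ in process_pending(pending_extracts)]
# the oldest extract is processed first
```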
Reconstructing transaction timelines
When you need to reconstruct (transaction) timelines in your Data warehouse (e.g. in Data Vault satellites) you can use some of the load dates we gathered. If you have several kinds of load dates it pays off to understand which load date to use. The problem is that choosing a date is always a trade-off between accuracy and consistency. While the Record Insertion Time-stamp is very accurate in relation to the Data warehouse, the (Source) Transaction time-stamp is far more consistent with the other (source) data. Personally I prefer consistency: accuracy can be offset by a technique I call 'The Traveling Now', which implements system-wide reference dates to prevent the inconsistent and inaccurate data that is currently loading from being viewed by users (I'll get back to this in another blog post), while inconsistencies need to be resolved with additional processing.
Batch Oriented Loading
For batch-oriented loading the transaction timelines are best driven by Transaction Load Dates when available. If they are not available, use the controlled Extraction Snapshot time-stamp, or the Load Cycle Date if no controlled snapshot is provided. Integrated/multi-sourced entities (e.g. Data Vault satellites that integrate common attributes over several sources) are best served by a Load Cycle Date. This is required to synchronize the transactional time(lines) across the different sources.
Historic Loads
Historic loads into an auditable DWH that itself maintains a transaction timeline take some consideration. We consider two scenarios: historic loads with the same level of auditability as your normal loads (even if it is just a subset of data and its temporal granularity differs), and historic loads whose auditability level is lower than that of your basic loads.
Auditable Historic Loads
If your historic loads mimic your basic loads, and when your historic loads contain controlled extraction (snapshot) time-stamps or (source) transaction time-stamps, you can use your basic loading routines (either based on the historical load date or on the time-stamp it is derived from) to reconstruct your transaction timeline. You might even consider an ETL routine that accepts many loads at once, as described above.
Sub level auditable Historic Loads
If historic data loads differ considerably from normal loads you might want/need to handle them differently.
For those loads a good Historized Load Cycle Date is mandatory, since it will be used to reconstruct the transactional timelines. The Load Cycle Date is useless here because it has no connection with the data to load. Note that deriving a Historized Load Cycle Date is a bit of making up data based on available extract data and metadata. Also, in this scenario the granularity of the Historized Load Cycle Date for the historic loads will probably differ from your normal loads (e.g. weeks, months or even quarters instead of days).
An alternative approach, for those who consider historical loads not to have the same level of auditability as normal loads, is to add an additional time-line, use the Historized Load Cycle Dates to govern that additional timeline, and treat your standard transactional timeline as a normal batch load (normal Load Date). This second time-line is something between a valid timeline and an additional transaction timeline, but it should keep your data fully auditable. A side effect is that unless you know how to query the historical data, this data will not show up in standard queries referencing a point in time on your new valid timeline.
Real Time Loads
Assuming no other information is available, real time or message based loads are best served by the message time-stamp (or an ETL process based time-stamp) or a DWH insert transaction time-stamp.
Metadata
What to log
My advice is to add all relevant dates: Transaction Load Date, Process Date, Historized (Expected) Load Cycle Date, Extraction/Snapshot Date and Load Cycle Date (as appropriate) to your Data warehouse when available. This setup allows you to mix different types of loading while maintaining full Load Date (metadata) transparency.
Where to log
There are three places to store time-stamps: in your metadata, in your actual data rows as data/metadata, and as time-stamps that drive your temporal registration/timeline construction (e.g. the DV satellite load date).
- Time-stamps you can register either in the metadata or in the data row
- Load Cycle Date
- Historized Load Cycle Dates
- (ETL) Process Dates
- Extraction (Snapshot) Time-stamp
- Data warehouse (insert/update) transaction time-stamp
- Time-stamps you can register only in the data rows
- DWH Insertion time-stamp
- (Source) Transaction Date
- Message (systems) Dates
- Time-stamps used for timeline/Historic Load Date reconstruction (in order of preference)
- Trusted (Source) Transaction time-stamp
- Controlled Extraction/Snapshot Time-stamp
- Historized Load Cycle Date
- Load Cycle time-stamp (not available for Real Time loading)
- Message/Real Time (process) based time-stamp
- Data warehouse (insert/update) transaction time-stamp
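The order of preference above can be sketched as a simple fallback lookup. This is a hypothetical illustration (the attribute names are my own, not a standard): given the metadata a load actually supplies, pick the most preferred time-stamp that is present.

```python
# Order of preference for timeline-driving time-stamps, most trusted first.
PREFERENCE = [
    "source_transaction_ts",     # trusted (source) transaction time-stamp
    "extraction_snapshot_ts",    # controlled extraction/snapshot time-stamp
    "historized_load_cycle_date",
    "load_cycle_date",           # not available for Real Time loading
    "message_ts",                # message/Real Time (process) time-stamp
    "dwh_transaction_ts",        # DWH (insert/update) transaction time-stamp
]

def timeline_timestamp(load_meta: dict):
    """Return the name and value of the most preferred time-stamp
    that this load supplies."""
    for name in PREFERENCE:
        if load_meta.get(name) is not None:
            return name, load_meta[name]
    raise ValueError("no usable time-stamp for timeline reconstruction")

# A batch load without source transaction time-stamps falls back to the
# controlled extraction snapshot time-stamp.
name, value = timeline_timestamp({"extraction_snapshot_ts": "2013-05-06 23:00"})
# name == "extraction_snapshot_ts"
```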
For the sake of consistency and flexibility I always advise logging the time-stamps used for timeline reconstruction separately from the timeline (attributes) themselves.
Conclusion
When logging data loads, don't skimp on logging loading time-stamps, and use the correct time-stamp to recreate your transactional timelines. Take this into consideration when designing your Data warehouse (e.g. Data Vault) ETL and metadata. Separating data and metadata so you can register the different incarnations of the Load Date and other metadata in the Data warehouse is therefore considered a best practice. The Load Cycle ID is a good start, but it does not cover all metadata. This notion is best served with an Audit ID as explained in the "One Column" post.