Monday, January 21, 2013

Logging your loads: Data warehouse timestamping:



Introduction

There was an interesting discussion on Data Vault load dates in the Linkedin Data Vault discussion groups, and how you can go wrong by not understanding the different incarnations/types of load dates.
Load Dates (actually load time-stamps) are one on the most important metadata attributes for a Data warehouse. Determining the current "Load Date" seems very easy, until you take things like auditability, restartability , parallelism, Audit Trails/CDC load, message/real time and historic loads into account. In this blog post I'll try to describe the most common load dates used in Data warehousing and how you should capture them.

Types of Load Dates

There are several kinds of time-stamps in the loading process that can be registered within a Data warehouse loading process. While not all types of Data warehouses register the same time-stamps (or call them by the same names), from a logistics process there are just a limited set of dates that can be logged for a Data warehouse. Note that all actions and dates also have corresponding periods in which the process that generates the load date actually runs, but we will not discuss them here.

Load Cycle Date

In batch-oriented ETL architectures the most important date is the Load Cycle Date. It is the functional key that identifies an actual batch run. The granularity of the Load Cycle Date is usually a day, but can under circumstances be hourly or even in minutes or seconds. The Load Cycle Date does not have to represent the actual start or end date/time of the ETL process. The actual start usually lies after the Load Cycle Date, but this depends on the ETL scheduling. The Load Cycle Date is an abstraction from the actual load process date/period. The Load Cycle Date is used both to simplifies querying and maintaining time-lines and as a functional/foreign key to the batch run metrics of a specific load. This way it groups together all records that where loaded during a certain batch run. See the post on the "One Column" for more details on a metadata model that uses the Load Cycle Date.

(Source) Transaction time-stamp

The (source) Transaction (Load) Date (formally the transaction commit time-stamp) is the time-stamp that a transaction record was processed/committed (in a source system) or transaction based source extract. In most batch-oriented ETL architectures this is date is not available because the source system does not keep track of all transactions. But in a data feed from an Audit Trail/Change Data Capture log might use the given transaction time-stamps that can be used as a source for an Transaction Load Date. You can use such a date to reconstruct the source system transaction timeline. While under certain conditions a Data warehouse load process should be able to rely on externally generated time-stamps of e.g. an Audit Trail, some extra precautions are required. You need to verify the external configuration and also the external time service (if applicable) that is used. Also do not forget to check daylight saving time settings. Just to be safe, check incoming transaction dates even if you have a trusted source.

(Source) Application record time-stamps

These are record insert/create and update/modify time-stamps maintained by source systems. Do not trust external application code regulated date mutations in supplying 100% accurate transaction dates. Only an external transaction driven mechanism like DBMS transaction log readers or triggers should be used. Code regulated record dates rely on programmers honoring this in all their code. In most data-centric applications there are several ways for programmers to (inadvertently) ignore setting application record dates correctly and in doing so might be ruining your Data warehouse loading process when you rely on these dates!

Message (systems) time-stamps

For message systems like message queues or message bus systems a message (process) date is sometimes provided.

Data warehouse (insert/update) transaction time-stamp

This time-stamp registers the transaction (commit) time-stamp for either an insert or update into the Data warehouse.

Data warehouse Record Insertion Time-stamp

This is the actual time-stamp of the individual record insertion. It can usually only registered on a DBMS by creating a column that defaults to a time function registering the current date and time.

Extraction/Snapshot Time-stamp

Extraction dates specify when a certain data extract(data dump) was made from a source system database. Extraction snapshot time-stamps specify which time-stamp was used to create a transactional consistent extraction/snapshot set. This snapshot time-stamps, when reliable, are very valuable and can be used with the same caveats as the source transaction time-stamp. For loading strategies using full extracts using transactional consistent snapshots should be the default. Extraction time-stamps can only equal the quality of a snapshot time-stamps when there is guaranteed no data modification during the extraction window.


ETL Process time-stamps?

The (ETL) Process time-stamp(s) signifies the point in time when a certain record or set of records has been processed by an ETL process into a Data warehouse. This can be an independent ETL-(sub)process generated time-stamp  but more often it is actually the Data warehouse transaction time-stamp of the insert/update of the Data warehouse tables. Other Process time-stamps usually lack (transactional) consistency and are only interesting to measure a process duration.


Real Time Process time-stamp 

The (Real Time) Process time-stamp is only important in non staged Real Time scenarios. In those circumstances there is no formal  load cycle and hence no Load Cycle Date. But usually this time-stamp is a synonym for another time-stamp like the message or DWH transaction time-stamp.


Historized (Expected) Load Cycle Date

This is a very special type of Load Date. It is only useful when doing historic loads or when your batch oriented ETL process needs to be extremely robust and cannot rely of fixed “extract and load” cycles. It signifies what the original expected Load Cycle Date should have been if the ETL process had run correctly at that time. In batch oriented loading this date is usually derived from a Source extraction time-stamps  Extraction snapshot time-stamp or the available Transaction Load time-stampsNote that the Historized Load Cycle Date based on an historical load might not have the same granularity as your standard Load Cycle Date. It might be loaded from saved monthly extracts instead of daily feeds. Naturally, we can not only calculate past but also future(expected) load cycle dates as well.

Loading

Real Time/Message based loading

In real time loading scenarios when other types of Load Dates are not available we only have the Process time-stamp/Message time-stamp or Data warehouse transaction update/insert time-stamp.

Historic Loading

Historic loads are easy to do once you have established the different types of load dates. The only extra item is that in an historic load you have also a Historized Load Cycle Date. This date presents the (approximate) Load Cycle Date that would have been correct if the data warehouse had been available at the time of the data extract.

Restartability and Historic Loading

The Data warehouse load process should be fully restartable. However, the source extracts and stage loads are another matter. While CDC/Audit trail solutions often have a good restartability, often served by the restartability/transactional consistency of the underlying DBMS products, batch-oriented extractions don't match this by default. A good load process accepts that there is not always a 1:1 between source extracts/stage loads and Data warehouse loads. This might not be the default situation, but due to hardware or network issues there might be several extracts waiting for one Data warehouse load to process, or several Data warehouse loads are required to process one supplied source extract. This many-to-many relationship reinforces the idea to add an ETL (or source) driven Extract/Snapshot time-stamp to the process to distinguish extracts, and to use this time-stamp to drive the Transaction Load Date and Historized(Expected) Load Cycle Date. This makes that a batch load can be treated as a special kind of primitive Change Data Capture. In mimicking CDC behavior you increase the robustness of the load routines. A detailed discussion of this technique is a topic for another post.

Reconstructing transaction timelines

When you need to reconstruct (transaction) timelines in your Data warehouse (e.g. in Data Vault satellites) we can  use some of the load dates we gathered. If you have several kinds of load dates it pays of to understand which load date to use. The problem is that choosing a date is always a trade-off between accuracy and consistency. While the Record Insertion Time-stamp is very accurate in relation to the Data warehouse, the (Source) Transaction time-stamp is far more consistent with the other (source) data. Personally I prefer consistency, because accuracy can be offset by a technique I call 'The Traveling Now' which implements system wide reference dates to prevent the inconsistent and inaccurate data that is currently loading being viewed by users (I'll get back to this on another blog post), while inconsistencies need to be resolved with additional processing.

Batch Oriented Loading

For batch-oriented loading the transaction timelines are best driven by Transaction Load Dates when available. If they are not available use the Controlled Extraction Snapshot time-stamp or the Load Cycle Date if no controlled snapshot is provided. Integrated/multi sourced entities (e.g. Data Vault satellites that integrate common attributes over several sources) are best served by a Load Cycle Date. This is required to synchronize the transactional time(lines) across the different sources.

Historic Loads

Historic loads into an auditable DWH that itself maintains a transaction timeline takes some consideration. We consider 2 scenario's: historic loads with the same level of auditability as your normal loads (even if it is just a subset of data and it's temporal granularity differs) and historic loads whose auditability level is lower than your basic loads.

Auditable Historic Loads

If your historic loads mimic your basic loads and when your historic loads contain controlled extraction (snapshot) time-stamps or (source) transaction time-stamps you can use your basic loading routines (either based on the historical load date or the time-stamp it is based on) to reconstruct your transaction timeline. You might even consider and ETL routine that accepts many loads at once as described above.

Sub level auditable Historic Loads

If historic data loads differ considerably from normal loads you might want/need to handle them differently.
For those loads a good Historized Load Cycle Date  is mandatory, since that will be used to reconstruct the transactional timelines.  The Load Cycle Date is useless here because they have no connection with the data to load. Note that deriving a Historized Load Cycle Date  is bit of making up data based on available extract data and metadata. Also, in this scenario the granularity of the Historized Load Cycle Date for the historic loads will probably be different from your normal loads. (e.g. weeks,months or even quarters instead of days).

An alternative approach for those that consider historical loads not to have the same level of auditability as your normal loads is to add an additional time-line and use the Historized load cycle dates to govern that additional timeline and treat your standard transactional timeline as a normal batch load (Normal Load Date). This second time-line is something between a valid timeline and an additional transaction timeline, but it should keep your data fully auditable. Another side effect is that unless you know how to query the historical data this data will not show up on standard queries referencing a point in time of your new valid timeline.

Real Time Loads

Assuming no other information is available real time or message based loads are best served with the message time-stamp (or an ETL process based timestamp) or a DWH insert transaction time-stamp.

    Metadata

    What to log

    My advice is to add all important dates: Transaction Load DateProcess Date, Historized (Expected) Load Cycle Date, extraction/Snapshot Date and Load Cycle Date(as appropriate) to your Data warehouse  when available. This setup allows to mix different types of loading while maintaining full Load Date (metadata) transparency.

    Where to log

    There are three places to store time-stamps; in your metadata, in your actual data rows as data/metadata, and as time-stamps that drive your temporal registration/timeline construction (in DV satellite load date).

    1. Time-stamps you can register either in the metadata or in the data row
      1. Load Cycle Date
      2. Historized Load Cycle Dates
      3. (ETL) Process Dates
      4. Extraction (Snapshot) Time-stamp
      5. Data warehouse (insert/update) transaction time-stamp
    2. Time-stamps you can register only in the data rows
      1. DWH Insertion time-stamp
      2. (Source) Transaction Date
      3. Message (systems) Dates
    3. Time stamps used for timeline/Historic Load Date reconstruction (in order of preference)
      1. Trusted (Source) Transaction time-stamp
      2. Controlled! Extraction/Snapshot Time-stamp
      3. Historzized Load Cycle Date
      4. Load Cycle time-stamp (not available for Real time loading)
      5. Message/Real Time (process) based time-stamp
      6. Data warehouse (insert/update) transaction time-stamp
    For the sake of consistency and flexibility I always advise to also log time-stamps used for timeline reconstruction separate from the timeline (attributes) themselves.

    Conclusion

    When logging data loads don't skimp on logging loading time-stamps and use the correct time-stamp to recreate your transactional timelines.  Take this into consideration when designing your Data warehouse (eg Data Vault) ETL and metadata. Separating data and metadata so you can register different incarnations of the Load Date and other metadata from the Data warehouse is therefore considered as a best practice. The Load Cycle ID is a good start, but it does not cover all metadata. This notion is best served with an Audit ID as explained in the  "One Column" post.
    Post a Comment