Wednesday, November 21, 2012

One (Metadata) Column To Rule Them All

One Column to rule them all,
One Column to find them,
One Column to bring them all  
and to the metadata bind them!

Introduction

There is usually ONE column that appears all over your Data warehouse. It might be called process_id, run_id, step_id or audit_id, and it usually points to a set of (logging) tables that hold all kinds of metadata for your Data warehouse platform. While this column is sometimes seen as mandatory (e.g. with Anchor Modeling or Data Vault), it is generally seen as a good idea for any Data warehouse. There is usually just one column, but sometimes there are more (a composite metadata key). While this column usually refers to the actual load/ETL metrics for load batches/actions, its implementation and definition vary from implementation to implementation.

The One (Metadata) Column pattern

I think we should try to standardize the semantics of such a column. This is easy if you have a good metadata model. The intersection of the static metadata (models, entities) and the metrics metadata (load runs and load cycles) with the ETL control metadata (ETL modules, ETL step numbers, ETL step order) should always lead you to the metadata at the lowest grain, where you can define your most detailed process information. At this grain I would do my central logging and create the One Metadata Column (as a surrogate key). The metadata column's meaning is then directly related to this logging entity and should not change during the operation of the Data warehouse. This does imply that a well designed and complete metadata model, covering all aspects (definition, operation, automation and management), is essential for a good, usable metadata column. Since the Column is defined as the surrogate key at the lowest grain of your model, there can be only one. If your metadata model does not have a lowest grain (but e.g. several disparate ones), you apparently have gaps in your model, and you need to extend/complete it before you can define your One Column.
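As a minimal sketch of what this looks like in practice (the table and column names, and the Data Vault style hub, are illustrative assumptions, not part of the pattern): each data table carries only the surrogate key of the lowest-grain logging table, here called load_action as in the example model further below.

  -- A data table carries only the surrogate key of the lowest-grain
  -- logging table; all other process metadata stays in the metadata model.
  CREATE TABLE hub_customer (
      customer_key INTEGER     NOT NULL PRIMARY KEY,
      customer_bk  VARCHAR(50) NOT NULL,
      audit_id     INTEGER     NOT NULL  -- the One (Metadata) Column
          REFERENCES load_action (load_action_id)
  );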

Such a column has many advantages, because it decouples your data and metadata to the maximum extent without losing information. It insulates the data from changes in the metadata model through the use of a surrogate key, which becomes the one point of connection between these two models.
A few disadvantages of this approach are the more complex queries needed to tie data and metadata together, and coping with real time or near real time data processes.

My preferred name for the column is audit_id, because such a column does not have to reflect a load run or a process run, nor even an ETL step (it can even point to an action within a load step).

I think this pattern is especially important with e.g. Data Vault to maintain auditability, so we not only know when, but exactly what was loaded, how and from where. For DWH Automation efforts I think a metadata model and this pattern are very important aspects to implement consistently.

Metadata Model Example

I've created a minimal metadata model as an example of a very simple model that can be used in such a pattern. The following tables are in this model (a DDL sketch follows the list):
  1. Data Objects: Here we store the source and target tables.
  2. Load Templates and Load Template Versions: Here we store which load templates/processes are defined.
  3. Load Batches: A load batch is a collection of load actions/directives.
  4. Load Cycles: Each batch is loaded at regular intervals; this is recorded in the load cycles.
  5. Load Directives: Records which entity is loaded from which source, under a certain batch and with a certain template.
  6. Load Actions: Records which load directive was executed in which cycle.
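A minimal DDL sketch of this model (the names and data types are my illustrative assumptions, not prescriptions):

  CREATE TABLE data_object (           -- 1. source and target tables
      data_object_id INTEGER      NOT NULL PRIMARY KEY,
      schema_name    VARCHAR(128) NOT NULL,
      object_name    VARCHAR(128) NOT NULL
  );

  CREATE TABLE load_template (         -- 2a. which load templates/processes exist
      load_template_id INTEGER      NOT NULL PRIMARY KEY,
      template_name    VARCHAR(128) NOT NULL
  );

  CREATE TABLE load_template_version ( -- 2b. versions of those templates
      load_template_version_id INTEGER NOT NULL PRIMARY KEY,
      load_template_id INTEGER NOT NULL REFERENCES load_template (load_template_id),
      version_no       INTEGER NOT NULL
  );

  CREATE TABLE load_batch (            -- 3. a collection of load directives
      load_batch_id INTEGER      NOT NULL PRIMARY KEY,
      batch_name    VARCHAR(128) NOT NULL
  );

  CREATE TABLE load_cycle (            -- 4. one interval in which a batch runs
      load_cycle_id INTEGER   NOT NULL PRIMARY KEY,
      load_batch_id INTEGER   NOT NULL REFERENCES load_batch (load_batch_id),
      cycle_start   TIMESTAMP NOT NULL,
      cycle_end     TIMESTAMP
  );

  CREATE TABLE load_directive (        -- 5. what to load, from where, with which template
      load_directive_id        INTEGER NOT NULL PRIMARY KEY,
      load_batch_id            INTEGER NOT NULL REFERENCES load_batch (load_batch_id),
      source_object_id         INTEGER NOT NULL REFERENCES data_object (data_object_id),
      target_object_id         INTEGER NOT NULL REFERENCES data_object (data_object_id),
      load_template_version_id INTEGER NOT NULL
          REFERENCES load_template_version (load_template_version_id)
  );

  CREATE TABLE load_action (           -- 6. lowest/widest grain: one row per executed action
      load_action_id    INTEGER   NOT NULL PRIMARY KEY,  -- the One (Metadata) Column
      load_cycle_id     INTEGER   NOT NULL REFERENCES load_cycle (load_cycle_id),
      load_directive_id INTEGER   NOT NULL REFERENCES load_directive (load_directive_id),
      started_at        TIMESTAMP NOT NULL,
      finished_at       TIMESTAMP,
      rows_processed    INTEGER
  );

Only load_action records actual executions; the other tables hold the definition and planning metadata that each action resolves against.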

The column we need here is the load action id: the surrogate key of the table with the lowest/widest granularity in the metadata model. Referring to this column, we can answer all our metadata-oriented questions around when, where, who, how and what.
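For example (again using the illustrative tables sketched above), a single query ties a data row back to its metadata; the length of the join chain also illustrates the query-complexity disadvantage mentioned earlier:

  -- When, from where, how and under which batch was this customer row loaded?
  SELECT hc.customer_bk,
         la.started_at    AS loaded_when,   -- when
         src.object_name  AS loaded_from,   -- from where
         lt.template_name AS loaded_how,    -- how
         lb.batch_name    AS loaded_under   -- under which batch
  FROM   hub_customer hc
  JOIN   load_action la            ON la.load_action_id = hc.audit_id
  JOIN   load_directive ld         ON ld.load_directive_id = la.load_directive_id
  JOIN   load_batch lb             ON lb.load_batch_id = ld.load_batch_id
  JOIN   data_object src           ON src.data_object_id = ld.source_object_id
  JOIN   load_template_version ltv ON ltv.load_template_version_id = ld.load_template_version_id
  JOIN   load_template lt          ON lt.load_template_id = ltv.load_template_id
  WHERE  hc.customer_key = 42;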


Conclusion

The One Metadata Column Pattern is a useful and often implemented pattern within Data warehousing. I think it would be better if there were a consistent way to define and implement it. I hope this post helps you do this in a more structured fashion.