Data Matters (Unseen): June 2013

Tuesday, June 11, 2013

Presentations Next generation DWH Modeling available for download

For those that want to download my own and all the other the presentations given last Thursday at the Next Generation DWH Modeling Conference, they are now available here.

Data Modeling Zone

Last fall I was at the first Data Modeling Zone in Baltimore, USA. This fall there will be 2 conferences dedicated to data modeling. One in Baltimore,USA 8-10th of October and one European one in Hanover, Germany on 23rd and 24th of September.

Content

There will be a lot of info on FCO-IM, Model transformations, Data Vault, Anchor Modeling and much more.

Presentations

I'll be doing 2 presentations on the European Zone. One on Information Modeling and implementations around Data Vault, Anchor Modeling etc. and one on the success of Data Vault and Data warehouse Automation in the Netherlands.

DMZ Promotion

For those that would like to attend the organization has set up individual promotion codes for each of the speakers - my code is EversDMZPromo. Anytime someone registers for DMZ Europe with this code on DataModelingZone.com or DataModelingZone.eu, you will receive 100 Euros off the registration price. For every two people that register using your code, one student from a local university will be allowed to register for the conference for free. The promotion code is active up until the date of the conference and can be used any number of times. Please use the code if you intend to come

Hope to see you there!

Friday, June 7, 2013

Data Vault vs Anchor Modeling: Who Is The One?

Introduction

Yesterday Data Vault and Anchor Modeling went head to head on the DWH Next generation Conference organized by BI-Podium and Vissers & van Baars recruitment. It was a great conference with over 300 attendees and lot's of sessions on Anchor Modeling and Data Vault

Anchor Modeling

Anchor Modeling has been invented by Lars Rönnbäck and Olle Regardt. It is a highly normalized anchor style modeling approach that has some aspects of 6NF. It looks somewhat similar to Data Vault, but there are a lot of 'gotcha's'. It is a closed business model driven approach with it's own simple business/information model technique and just like Data Vault uses a small sets of building blocks and adds time to the data. There have been several blogs on Anchor Modeling, by +Richard Steijn here, Hennie de Nooijer here and WorldOfIntelligence here.

Data Vault vs Anchor Modeling

Lars made a start with comparing Anchor Modeling and Data Vault here. Is there any truth to glean from the the specific items, their meaning and the score? Not really, because THE Data Vault flavour does not exist that we can compare with the tight definitions of Anchor Modeling. I found it more the Agile Data Vault style of +Hans Patrik Hultgren compared to Anchor Modeling. Many of the issues mentioned stem either from (past) best practices or lack of detailed standards on e.g. temporal data management. Most are not essential or irrevocably some differences run much deeper. Interestingly most techniques currently in use in Anchor Modeling(except the annex helper tables for bi-temporal modeling) are also used at (some) Data Vaults.

The Hidden Snag: Auditability & Adaptability

If fact, most aspects of Anchor Modeling we can apply to Data Vault or Anchor Vault as well, but auditability creates some serious issues. Anchor Models can never be guaranteed 100% audtiable in all circumstances, and proving Anchor Models are semantic equivalent with (arbitrary) source models requires some serious additional effort. This also means the transformations and loading routines have different issues that with a Data Vault, esp. on complex non auditable transformations (e.g. time line repairs). The flexibility of Data Vault flavors allows us to adapt to very different environments without losing the source data. The tightly defined Anchor Models can be easier to use due to their (internal) business model driven nature coupled with their high normalization. All this does not mean that source driven Anchor Models do not exist or are impossible. On the contrary, we see many people designing source driven Anchor Models, just as some create Business Model driven Data Vaults.

It's my party and I'll Vault how I want to.

It is clear to me that unless we address the issue of formally addressing data integration and source data models with integration modeling comparing Data Vault and Anchor Modeling is not really going to work. Instead we should focus on combining approaches. In fact my presentation on the DWH Next generation Conference showed a nice example of combining Anchor Modeling and Data Vault. As discussed in my post on data modeling styles they are just edge cases of a whole family. Since DV needs to work on a vastly larger range of architectures, it cannot be so tightly defined as Anchor Modeling. It is this flexibility that proves that when we look at Data Vault in a flexible way, it should be better 'by definition' at capturing sources in an Achor Style, while Anchor Modeling is better at handling business models in an Anchor Style modeling technique. This is exactly how I see the usage of such patterns, not either/or but used both in the right context.

An Example

Here is the example to show we can use Anchor Modeling and Data Vault together effectively.

Challenge

žIntegrate disparate Business Keys
¡Keys not integrated

¡No Master Key
¡Want a 1-1 time variant mapping between system keys
¡Internal de-duplication should be possible
¡Business Rule Driven mappings
¡Efficient and flexible implementation

Resulting Model

The result of this challenge is the following model. It contains 3 normal hubs (with corresponding sats and links not shown). It contains 3 1-1 key mapping ties (end dated links) that register which Business Keys need to be unified. It contains 1 Anchor (a Business Key-less/Empty Hub) that connects it all together. Since none of the source keys is a true business key, and we do not want to invent our own, an Anchor is an ideal construct to use here since it does not contain a business key(just a surrogate key). We assume here that we have (complex) business rules that allow us to map the different keys to the central Anchor.

žSolution

Classical Single BK Source Driven Hubs

¡No Multi key or changing key etc. so no Anchor required

žEfficient 1-1 Key Maps/Ties

¡Efficient

¡Business Rule Driven, so stable cardinality

¡Dynamic and traceable

žEmpty Hub/Anchor to tie all Hubs to

¡No “Master” Business Key required

Architecture

The Architecture we see emerging clearly uses Data Vault for capturing sources, while it will use Anchor like constructs for connecting them together in a slightly more flexible way that Data Vault does. This way both techniques are used where their strengths lies.

A Unifying Style

While advances in Data Vault and Anchor Modeling are nice, i'd like to state that choice between and adaptability of the methods/approaches to circumstances by practitioners is also important. But for this we need to understand all the detailed differences between all these modeling techniques. Especially when we want to combine them. If we want this then we're back at a generic Information modeling or logical modeling technique to unify all Data Vault and Anchor Modeling styles. While Data Vault and Anchor Modeling try to seriously simplify the construction of Data Warehousing within their won context, WE still need to evaluate the approaches in our clients context. For this we need all the detail, so we can make a informed, independent, correct decision on which techniques to use. Alas, with Anchor Modeling and Data Vault using their own nomenclature, this does NOT make it easier for experts to understand and trade-off methodologies. Only the 'workers' doing the implementation have it far easier once an architecture and methodology has been chosen and automated. This realization was one of the key drivers for the MATTER educational program and also my research in unifying Anchor Modeling and Data Vault modeling using Fact Oriented information modeling and model to model transformations using FCO-IM.

In the end, Chosen Ones only exists in stories. For us mortals, as Codd said, only correct, consistent and complete and utter information hiding will save us from having to understand all these modeling techniques. In the mean time we'll be forced to thoroughly understand all aspects of data modeling, be it abstract information models or specific implementation modeling styles.

Saturday, June 1, 2013

Repeating the Past: Repairing your history in with temporal data.

Those that forget the past are Condemned to repeat it

^ George Santayana (1905) Reason in Common Sense, volume 1 of The Life of Reason

Introduction

When you forget the past, you are condemned to repeat it. This also holds true when you model history and history of history in your database. When you recreate a transaction history in a fully temporal information system with loading data from a source that does not record one you are alas forgetting the past. While there is no way to reconstruct the pas from thin air, when the source information system maintains a valid timeline (a user controllable timeline that among others allows for backdating), you can still recreate an approximation of the past by repeating your valid timeline into your transaction timeline.

Forgetting the past

Temporal Data

Temporal Databases or temporal data modeling techniques allow you to record timelines. You can record when somebody lived at a certain address. We have 2 different classes of timelines; those that are connected to our own time, usually called logging, transaction, actual or asserted time, and time fully controllable by the end users, usually called valid, stated or user time. We talk about bi-temporal time when data is registered against 2 timelines, of each class one.

Data warehousing and logging

Data warehouses that log all data (implement a transaction/logged timeline) complement source systems that only register stated time so that they record a bi-temporal time. This is typically done with e.g. a Data Vault. Of course, a Data vault can never show the logged time for the past that it did not log, even if the source recorded a stated/valid time. However, we can create an approximation of the past with the source stated time.

Repeating the past

There are 2 strategies to repair the past, extension of the first recorded occurence into the past, and transposing the source valid timeline to the past. The first strategy is simple and can always be applied, but is a gross simplification. The second strategy only works when there is a valid timeline of the past. It assumes that this valid timeline is a 1-1 with the transaction timeline.

Example

Lets show a detailed example on how the repair the past by repeating it. we will show the second strategy, the first is left as an exercise to the reader. We start with a source table example that registers a stated and logged timeline for a certain value in the Data warehouse. We assume the source has been registering the data for some time. We also assume that for the past there is an 1-1 between the stated and logged time. This is a serious assumption, and the reconstructed past is certainly not auditable in the sense that it has been properly logged. We start processing the source data at time t3 (@Tran_hist_start).

Source Table:

LOAD DT	END DT	VALID FROM_DT	VALID TO_DT	VALUE
T3	T4	T0	T1	1
T3	-	T1	T2	3
T3	T5	T2	T3	4
T3	-	T3	-	6
T4	-	T0	T1	2
T5	-	T2	T3	5

This Source Table in a picture:

We can for example read the time points as months, with T0 being January. T1=February and so on. The Value can be seen as e.g. Status values, 1='Created',2='Open',.. 6='Destroyed'. We can read the lower boxes in the picture as: we started to record in april that from jan to feb the status was Created.In May we recorded that in the the jan to feb time frame the status was 'Open'. This is currently still true (true until end of time).

Transposed Situation

Lets transpose the valid timeline at time T3(@Tran_hist_start) and use that as a basis to reconstruct the unknown past. We assume here that in that past we don't know the future as known at T3.

Our bi temporal table will now look like this:

Table Values:

LOAD DT	END DT	VALID FROM_DT	VALID_ TO_DT	VALUE
T0	T1	T0	-	1
T1	T3	T0	T1	1
T1	T2	T1	-	3
T2	T3	T1	T2	3
T2	T3	T2	-	4

T3	T4	T0	T1	1
T3	-	T1	T2	3
T3	T5	T2	T5	4
T3	-	T3	-	6
T4	-	T0	T1	2
T5	-	T2	T3	5

a Visual representation of the data and the transformation:

Let's look at the 'SQL' Code to produe this past. We see that we create set of vertical oriented timeslices and a set of horizontal timelsieces to fill in the missing past.

Pseudo SQL Code

Vertical History:

Horizontal History

Total History:

Conclusion

Assuming the user controlled timeline (valid timeline, stated timeline) does not diverge significantly from the unknown transaction/logged timeline you can reconstruct the non logged past in a information system/data warehouse by pivoting and substituting the target transaction timeline with the source user controlled timeline for the missing past. By repeating this information we can 'repair the past'. This is especially handy when querying the past. Note that we are now repeating the past from the present because we 'forgot' to register this past in the first place. The question if we should actually persist this past or just derive this information at query time is a separate performance discussion.