Friday, November 8, 2013

One "Data Modeling" approach To Rule Them all




The Task

In a previous post on implementation (data) modeling styles I talked about needing a 'universal data modeling technique/method' that could be a starting point for all your structural model transformations (semantically equivalent derivations), be that dimensional, normalized or Anchor Style Modeling. Such a modeling technique should allow for easy transformation, it needs to be agnostic to most transformation techniques, but should still guarantee several levels of semantic equivalence. It also should allow for (database) model re-engineering and allow for data representations on any desired "level" (conceptual, logical). It should facilitate conceptual data integration and facilitate temporal aspects as well.

The magic wand of Semantic Equivalence

It is important to realize that all implementation modeling techniques like dimensional and Data Vault rely on a certain class of model transformations I call fully semantic equivalent. These class of transformations preserve (parts of) the data (they preserve the functional, join, multi value dependencies as defined in the RM) and hence provide a degree of data tracability from one transformation to the next. However, they DO make some aspects more difficult, especially maintaining all kinds of (complex) constraints. These kinds of transformations are usually biased towards certain aspects. Things like constraint minimization and standardization are best served by (highly) normalized models. Manual temporal processing and schema change impact isolation are best served by Anchor style modeling techniques like Data Vault. User access and data navigation are best served by (logical) Dimensional Models (see the OASI concept of van der Lek).

The Candidates

For me there where only 2 serious options, namely an anchored version of 6NF on the logical level,  or a Fact Based Modeling technique (FOM) from the NIAM family: FCO-IM, ORM or CogNIAM on the conceptual level. Other techniques like OWL are poor on constraint modeling and derivation and hence lack the desired easy (semantic equivalence) transformation. 6NF is an interesting candidate on the logical level since it is irreducible, time extendable and  is able to house constraints and derivations. It does not contain key-tables (anchors) by default, but we can create an anchored version (A6NF) that creates an anchor for each key that has a functional dependency. Date, Darwen an Lorentzos showed how to formally define temporal features in the relational model as well. However, apart form conceptual aspects like classification and quantification, verbalization and specialization/generalization  normalized models also do not directly abstract away from relationship implementation (which is also a bonus, but not in this scenario), which, given the myriad types of surrogation strategies used in data sources is considered a deficiency here. Abstracting away from foreign keys to relationships/roles gives rise to some additional issues but all in all FOM's can do all of this nicely (as far as it goes). The Fact Based modeling family lacks something in operators, esp temporal ones, but that can be remedied with a conceptual query language and simple conceptual temporal extension. (Another issues is that we need to facilitate robust and customizable data model reverse engineering, something current FCO-IM only has in a simple form.) I think that of all the FBM dialects FCO-IM lends itself best for understanding model transformations/derivations because it's focusing on fact types. For me this means that FCO-IM is an ideal candidate to use both as a modeling technique and as a modeling method (actually a diagramming method btw). Coupled with the fact that FCO-IM is taught and researched over here in the Netherlands means that it was an easy choice for me to make.

The One Model Methodology

The result is that I depend on FCO-IM for my transformation strategy analysis, and that it has become an important part of the MATTER program. It allows me to analyze, verbalize, display and derive implementation data model schema's in a consistent and generic way. This allows me to understand arbitrary implementation modeling styles in just one conceptual data model as a restructuring of fact (types). This way I resolve arbitrary diagramming, modeling, designing of data model schema's in a set of consistent, complete and correct fact restructuring directives.

Derivations

Transformation vs Derivations

FCO-IM is usable in describing the class of fully semantic equivalent transformations since it conceptually captures dependencies. The actual restructuring is done on the conceptual schemas of FCO-IM itself. a Model transformation first becomes (conceptual) model (schema) standardization and from there model schema derivation. We call this structural transformation or model restructuring, but in fact it is model schema derivation strategy. 

Implementations vs Definitions

As long as FCO-IM diagrams cannot be implemented directly we would actually transform an FCO-IM diagram to an an Anchorized 6NF model. This would be our idealized implementation model. from here all our implementation models become derivation models. Hence, implementation models are logical models that are logically derivable and controllable from a A6NF schema, while they are conceptually derived from a FBM model. Some implementation modeling styles that are closely related to A6NF (like normalized or Anchorized styles) lend themselves for creating central data repositories, while others (like dimensional) lend themselves to become derivated abstraction layers on top of this. Alas, in real world implementations we are usually forced to materialize some of these derived implementation modeling styles directly in databases, creating extra overhead that needs to be managed by some sort of data automation.

Implementation Modeling Styles?

Implementation data modeling styles are not 'physical' data modeling styles. Physical  is a misnomer from the ER and SQL database world. a Physical data model is what is stored inside a database like indexes and table spaces. Implementation modeling styles are logical model deviations used for logically accessing data, but also serving some non functional requirements around presentation, processing, performance and maintenance. They are artifacts created to counter the poor separation of concerns within current data toolings like SQL DBMSes, ETL tools and bad data management, poor data quality, poor temporal support etc etc. They are used for restructuring database schema's to separate and handle these concerns. In a TRDMS, a true relational database management system we would generate just 1 logical schema. Data Vaults and Dimensional models as we know them now would not be needed.

The Mission

For a lot of BI professionals however, Information modeling using FCO-IM is terra incognita and semantically equivalent model transformations (if done at all) are done using basic ER diagramming, which prohibits good standardization, transformation and (formal) derivations and hence understanding of implementation data modeling styles. ER modeling is just about visualization/drawing/noting model schema's, and do not help understanding this. To make understanding implementation data modeling styles better and more objective we need to educate professionals in this respect. Also since derivation and transformation go hand in hand, they need to understand the role of the relational model and model derivation as well, not just the (technical oriented) artifacts like Data Vaults and Dimensional Models.

Your Chance!

In December we start our MATTER FCO-IM track so BI professionals can dive deep into this aspect of information modeling. We start withg 3 days hand on FCO-IM (8-10 dec.). See for yourself how FCO-IM works and facilitates good data modeling. See BI Podium website for more info.



Tuesday, August 27, 2013

Reprising the MATTER Data Vault in-the-trenches track this fall

This fall the MATTER program will reprise the Data Vault in the trenches track. This track, meant for the more experienced Data (Vault) modelers/architects/BI specialist  focuses on the broad subject of Data Vault, Anchor Style modeling, 'Temporal data modeling' and even some Anchor Modeling. This is a 'no hold barred' track, so if you want to have thorough insights into Data Vault and other techniques this is your track. We'll deep dive in all the issues and opportunities that these kinds of model approaches have to offer.

I will personally teach the track's upcoming installments:
  1. Data Vault & Temporal Data Modeling in the trenches (11-12 sept.)
    • Focusing on (advanced) modeling constructs like key satellites, model transformation, model optimization, temporalization, model segmentation and many more issues.
  2. Data warehouse and Data Vault oriented Architecture, metadata & ETL (planned for begin oct.)
    • Focusing on Data Architecture, Raw/Rule Vault, business data models, metadata, transformation & generation & Data logistics.
  3. Advanced Data Modeling & Data Vault subjects (23-24 oct.)
    • Focusing on advanced modeling concepts like sub/super-typing (specialization/generalization), abstraction, multiple keys, virtualization, ...
See the Data Vault agenda or BI-podium website for more details. If you have questions around the track's qualifications, information or contents, don't hesitate to contact me or comment below this post and I'll get back to you.

Tuesday, June 11, 2013

Presentations Next generation DWH Modeling available for download

For those that want to download my own and all the other the presentations given last Thursday at the Next Generation DWH Modeling Conference, they are now available here.

Data Modeling Zone

Data Modeling Zone Europe



Last fall I was at the first Data Modeling Zone in Baltimore, USA. This fall there will be 2 conferences dedicated to data modeling. One in Baltimore,USA 8-10th of October and one European one in Hanover, Germany on 23rd and 24th of September.

Content

There will be a lot of info on FCO-IM, Model transformations, Data Vault, Anchor Modeling and much more.

Presentations

I'll be doing 2 presentations on the European Zone. One on Information Modeling and implementations around Data Vault, Anchor Modeling etc. and one on the success of Data Vault and Data warehouse Automation in the Netherlands.

DMZ Promotion

For those that would like to attend the organization has set up individual promotion codes for each of the speakers - my code is EversDMZPromo. Anytime someone registers for DMZ Europe with this code on DataModelingZone.com or DataModelingZone.eu, you will receive 100 Euros off the registration price. For every two people that register using your code, one student from a local university will be allowed to register for the conference for free. The promotion code is active up until the date of the conference and can be used any number of times. Please use the code if you intend to come

Hope to see you there!

Friday, June 7, 2013

Data Vault vs Anchor Modeling: Who Is The One?

Introduction

Yesterday Data Vault and Anchor Modeling went head to head on the DWH Next generation Conference organized by BI-Podium and Vissers & van Baars recruitment. It was a great conference with over 300 attendees and lot's of sessions on Anchor Modeling and Data Vault



Anchor Modeling

Anchor Modeling has been invented by Lars Rönnbäck and Olle Regardt. It is a highly normalized anchor style modeling approach that has some aspects of 6NF. It looks somewhat similar to Data Vault, but there are a lot of 'gotcha's'. It is a closed business model driven approach with it's own simple business/information model technique and just like Data Vault uses a small sets of building blocks and adds time to the data. There have been several blogs on Anchor Modeling, by +Richard Steijn here, Hennie de Nooijer here and WorldOfIntelligence here.

Data Vault vs Anchor Modeling

Lars made a start with comparing Anchor Modeling and Data Vault here. Is there any truth to glean from the the specific items, their meaning and the score? Not really, because THE Data Vault flavour does not exist that we can compare with the tight definitions of Anchor Modeling. I found it more the Agile Data Vault style of +Hans Patrik Hultgren compared to Anchor Modeling. Many of the issues mentioned stem either from (past) best practices or lack of detailed standards on e.g. temporal data management. Most are not essential or irrevocably some differences run much deeper. Interestingly most techniques currently in use in Anchor Modeling(except the annex helper tables for bi-temporal modeling) are also used at (some) Data Vaults.

The Hidden Snag: Auditability & Adaptability

If fact, most aspects of Anchor Modeling we can apply to Data Vault or Anchor Vault as well, but auditability creates some serious issues. Anchor Models can never be guaranteed 100% audtiable in all circumstances, and proving Anchor Models are semantic equivalent with (arbitrary) source models requires some serious additional effort. This also means the transformations and loading routines have different issues that with a Data Vault, esp. on complex non auditable transformations (e.g. time line repairs). The flexibility of Data Vault flavors allows us to adapt to very different environments without losing the source data. The tightly defined Anchor Models can be easier to use due to their (internal) business model driven nature coupled with their high normalization. All this does not mean that source driven Anchor Models do not exist or are impossible. On the contrary, we see many people designing source driven Anchor Models, just as some create Business Model driven Data Vaults.

It's my party and I'll Vault how I want to.

It is clear to me that unless we address the issue of formally addressing data integration and source data models with integration modeling comparing Data Vault and Anchor Modeling is not really going to work. Instead we should focus on combining approaches. In fact my presentation on the DWH Next generation Conference showed a nice example of combining Anchor Modeling and Data Vault. As discussed in my post on data modeling styles they are just edge cases of a whole family. Since DV needs to work on a vastly larger range of architectures, it cannot be so tightly defined as Anchor Modeling.  It is this flexibility that proves that when we look at Data Vault in a flexible way, it should be better 'by definition' at capturing sources in an Achor Style, while Anchor Modeling is better at handling business models in an Anchor Style modeling technique. This is exactly how I see the usage of such patterns, not either/or but used both in the right context.

An Example

Here is the example to show we can use Anchor Modeling and Data Vault together effectively.

Challenge

žIntegrate disparate Business Keys
¡Keys not integrated
¡No Master Key
¡Want a 1-1 time variant mapping between system keys
¡Internal de-duplication should be possible
¡Business Rule Driven mappings
¡Efficient and flexible implementation

Resulting Model

The result of this challenge is the following model. It contains 3 normal hubs (with corresponding sats and links not shown). It contains 3 1-1 key mapping ties (end dated links) that register which Business Keys need to be unified. It contains 1 Anchor (a Business Key-less/Empty Hub) that connects it all together. Since none of the source keys is a true business key, and we do not want to invent our own, an Anchor is an ideal construct to use here since it does not contain a business key(just a surrogate key). We assume here that we have (complex) business rules that allow us to map the different keys to the central Anchor.

žSolution

Classical Single BK Source Driven Hubs
¡No Multi key or changing key etc. so no Anchor required
žEfficient 1-1 Key Maps/Ties
¡Efficient
¡Business Rule Driven, so stable cardinality
¡Dynamic and traceable
žEmpty Hub/Anchor to tie all Hubs to
¡No “Master” Business Key required

Architecture

The Architecture we see emerging clearly uses Data Vault for capturing sources, while it will use Anchor like constructs for connecting them together in a slightly more flexible way that Data Vault does. This way both techniques are used where their strengths lies.

A Unifying Style

While advances in Data Vault and Anchor Modeling are nice, i'd like to state that choice between and adaptability of the methods/approaches to circumstances by practitioners is also important. But for this we need to understand all the detailed differences between all these modeling techniques. Especially when we want to combine them. If we want this then we're back at a generic Information modeling or logical modeling technique to unify all Data Vault and Anchor Modeling styles. While Data Vault and Anchor Modeling try to seriously simplify the construction of Data Warehousing within their won context, WE still need to evaluate the approaches in our clients context. For this we need all the detail, so we can make a informed, independent, correct decision on which techniques to use. Alas, with Anchor Modeling and Data Vault using their own nomenclature, this does NOT make it easier for experts to understand and trade-off methodologies. Only the 'workers' doing the implementation have it far easier once an architecture and methodology has been chosen and automated. This realization was one of the key drivers for the MATTER educational program and also my research in unifying Anchor Modeling and Data Vault modeling using Fact Oriented information modeling and model to model transformations using FCO-IM.

In the end, Chosen Ones only exists in stories. For us mortals, as Codd said, only correct, consistent and complete and utter information hiding will save us from having to understand all these modeling techniques. In the mean time we'll be forced to thoroughly understand all aspects of data modeling, be it abstract information models or specific implementation modeling styles.

Saturday, June 1, 2013

Repeating the Past: Repairing your history in with temporal data.

File:George Santayana.jpg
Those that forget the past are Condemned to repeat it
^ George Santayana (1905) Reason in Common Sense, volume 1 of The Life of Reason

Introduction

When you forget the past, you are condemned to repeat it. This also holds true when you model history and history of history in your database. When you recreate a transaction history in a fully temporal information system with loading data from a source that does not record one you are alas forgetting the past. While there is no way to reconstruct the pas from thin air, when the source information system maintains a valid timeline (a user controllable timeline that among others allows for backdating), you can still recreate an approximation of the past by repeating your valid timeline into your transaction timeline.

Forgetting the past

Temporal Data

Temporal Databases or temporal data modeling techniques allow you to record timelines. You can record when somebody lived at a certain address. We have 2 different classes of timelines; those that are connected to our own time, usually called logging, transaction, actual or asserted time, and time fully controllable by the end users, usually called valid, stated or user time. We talk about bi-temporal time when data is registered against 2 timelines, of each class one.

Data warehousing and logging

Data warehouses that log all data (implement a transaction/logged timeline) complement source systems that only register stated time so that they record a bi-temporal time. This is typically done with e.g. a Data Vault. Of course, a Data vault can never show the logged time for the past that it did not log, even if the source recorded a stated/valid time. However, we can create an approximation of the past with the source stated time.

Repeating the past

There are 2 strategies to repair the past, extension of the first recorded occurence into the past, and transposing the source valid timeline to the past. The first strategy is simple and can always be applied, but is a gross simplification. The second strategy only works when there is a valid timeline of the past. It assumes that this valid timeline is a 1-1 with the transaction timeline.

Example

Lets show a detailed example on how the repair the past by repeating it. we will show the second strategy, the first is left as an exercise to the reader. We start with a source table example that registers a stated and logged timeline for a certain value in the Data warehouse. We assume the source has been registering the data for some time. We also assume that for the past there is an 1-1 between the stated and logged time. This is a serious assumption, and the reconstructed past is certainly not auditable in the sense that it has been properly logged. We start processing the source data at time t3 (@Tran_hist_start).

Source Table:
LOAD DT
END DT
VALID FROM_DT
VALID TO_DT
VALUE
T3
T4
T0
T1
1
T3
-
T1
T2
3
T3
T5
T2
T3
4
T3
-
T3
-
6
T4
-
T0
T1
2
T5
-
T2
T3
5

This Source Table in a picture:

We can for example read the time points as  months, with T0 being January. T1=February and so on. The Value can be seen as e.g. Status values, 1='Created',2='Open',.. 6='Destroyed'. We can read the lower boxes in the picture as: we started to record in april that from jan to feb the status was Created.In May we recorded that in the the jan to feb time frame the status was 'Open'. This is currently still true (true until end of time).

Transposed Situation

Lets transpose the valid timeline at time T3(@Tran_hist_start) and use that as a basis to reconstruct the unknown past. We assume here that in that past we don't know the future as known at T3.
Our bi temporal table will now look like this:

Table Values:
LOAD DT
END DT
VALID FROM_DT
VALID_ TO_DT
VALUE
T0
T1
T0
-
1
T1
T3
T0
T1
1
T1
T2
T1
-
3
T2
T3
T1
T2
3
T2
T3
T2
-
4
T3
T4
T0
T1
1
T3
-
T1
T2
3
T3
T5
T2
T5
4
T3
-
T3
-
6
T4
-
T0
T1
2
T5
-
T2
T3
5


a Visual representation of the data and the transformation:

Let's look at the 'SQL' Code to produe this past. We see that we create set of vertical oriented timeslices and a set of horizontal timelsieces to fill in the missing past.

Pseudo SQL Code

Vertical History:

Horizontal History

Total History:

Conclusion

Assuming the user controlled timeline (valid timeline, stated timeline) does not diverge significantly from the unknown transaction/logged timeline you can reconstruct the non logged past in a information system/data warehouse by pivoting and substituting the target transaction timeline with the source user controlled timeline for the missing past. By repeating this information we can 'repair the past'. This is especially handy when querying the past. Note that we are now repeating the past from the present because we 'forgot' to register this past in the first place. The question if we should actually persist this past or just derive this information at query time is a separate performance discussion.