Friday, September 28, 2012

Introducing the M.A.T.T.E.R. Advanced Data warehouse educational program

What does MATTER stand for?

Intro

As some of you already know, I am initiator and instructor, but above all Program Director, of the MATTER advanced Data warehouse educational program. Since we are currently rolling out the program, I gave my first official presentation at the Data warehouse automation conference (DWHAutomation). The MATTER program was a main sponsor of this event and was very well received. There we announced our first scheduled course, on advanced Data Vault modeling (see the bi-podium website for details). We are currently setting up registration and the website and fleshing out parts of our procedures.

Why?

We think that the modern Data warehouse should be valuable; it should add clear value to organizations. This means it should be usable, but also built in an agile, quick and cost-effective way. This can only be achieved by a strong focus on automation. Automation in turn is only feasible if you rely on modeling, transformation and metadata, combined with a robust, generic and temporal-aware (ETL) subsystem, which together form an agile, automatable and flexible data warehouse architecture. We think that current efforts by vendors, consultants, clients and users deserve a strong educational program to support them. Since there was no program or course focusing on these subjects, we decided to create a whole new advanced educational program: the MATTER program.

Who?

The MATTER program is a unique collaboration between field experts/instructors and the Model-based Information Systems (M-BIS) research group at the HAN University of Applied Sciences. The commercial parties involved are the Ockham Group, DataMasters (Unseen), Occurro, Lars Rönnbäck (Anchor Modeling) and bi-podium. The program is led by Martijn Evers, Program Director, and Simone Molenaar, Program Coordinator.

What?

MATTER is an unequalled program to gain deeper insight into (agile) data warehouse architectures and methodologies. Its focus and synergy should allow you to create extremely flexible, agile, automatable and cost-effective data warehouse solutions, or to make better use of existing data warehouse automation solutions.

Program Focus

  • Modeling: Learn advanced (temporal) modeling techniques like Anchor Modeling and in-depth Data Vault. Use information modeling to increase the quality of your data models. Understand the effects of temporalization on your data modeling efforts.
  • Metadata: In-depth focus on metadata structures to support your data warehouse (automation) efforts.
  • Architecture: Focus on advanced and agile DWH architectures using Data Vault and Anchor Modeling. Realize the benefits of fact oriented ETL architectures.
  • Automation: Understand how to automate your DWH solutions. Learn the different approaches and ways to automate parts of your data warehouse’s ETL and architecture using modeling, metadata and transformation techniques.
  • Temporalization: Understand the effects of time in the database. Understand temporal concepts and their implementations. Understand how to create bi-temporal systems, both for data warehousing and for transactional systems.
  • Model Transformation: Gain deeper knowledge in modeling by understanding how to transform between modeling styles like Anchor Modeling, Dimensional Modeling and Data Vault, using a fact-oriented information modeling technique (FCO-IM: Fully Communication Oriented Information Modeling) as a transformation language.
  • Enterprise Information/Data warehouse system design: Understand how these techniques and methodologies can be used to architect modern flexible and automatable enterprise class Data warehouses or other information systems.
  • (Relational) Database Management System implementation: Learn about the background, limitations and implementation details of current mainstream SQL DBMSs with respect to these modern techniques and architectures.

How?

Program Overview

The program courses are organized into layers and tracks. Each layer builds on the previous layers and deepens the knowledge and insight of the participants.
We start with a temporal methods layer that contains three tracks, each with a clear focus on a specific method or technique: Data Vault, Anchor Modeling and generic temporal modeling. The next layer consists of an Information Modeling track using FCO-IM, where we use information modeling and transformations to derive data models such as star schemas, Anchor Models or Data Vaults. The final layer consists of a "Master class" that focuses on architecture and modeling based on the knowledge gained from information modeling and transformations.

Program Contents

The content focus per track is shown in the Matrix below:

MATTER\Track       AM    DV+Arch  TEMP  FCO-IM  Master Class
Modeling           ++    ++       +     +++     ++
Metadata           -     ++       -     +       ++
Architecture       +     ++       +     -       +++
Automation         -     ++       +     +       ++
Temporalization    +     +        ++    -       +
Transformation     +     ++       +     +++     +++
Enterprise DWH     +     ++       -     -       ++
RDBMS              +     ++       +     +       +
Certification      Yes   No       No    No      Yes
Implementation     +     ++       +     +       -
Related subjects   -     +        -     -       +++

Certification

There is an optional overall certification, whose level exceeds that of a regular Master's degree, as well as specific Anchor Modeling certifications. The certification process is done in coordination with the M-BIS research group of the HAN University of Applied Sciences. The examination consists of in-depth and interactive assignments. The program has its own exam board to oversee the examination and certification process.

The program also has regular course examinations and certifications, as well as a few specific specialist track certifications around Anchor Modeling.

Joining the program

While the program consists of several tracks, each with several courses, there is no obligation to follow all tracks or courses. It is possible to follow just individual courses or limit yourself to (parts of) a few tracks. Some dependencies exist, especially for the end certification. For those interested, we do intakes to provide customized programs for all participants. This allows for an optimal result at minimal effort and expense.

When?

The program starts in 2013, but we are already starting the Data Vault "in the trenches" track this fall. It will begin with a special extended version of the Advanced Data Vault Modeling course. For more information on the individual courses, see the bi-podium website. Individual courses will be spread out, so the full program will run for approximately 1.5 years.

There will also be a free information session on the full program for prospective students on Thursday 25 October at the HAN campus in Arnhem, from 13.00 to 17.00 hrs.

Where?

Locations still need to be announced, but we do know that most of the courses will be held either around Utrecht or at the HAN ICA Campus in Arnhem.

Wednesday, September 19, 2012

Styles of Data Vault modeling PART I: Overview

Introduction

Sometimes we see that, given the same input and process, the end result still differs between situations. This also happens when creating (data) models from requirements, and hence also with Data Vault data models. As long as this is intentional and can be explained, it does not have to pose an issue. But often it is an indication of a lack of knowledge or common understanding. It is usually quite hard to track down where the assumptions differ based on just the end result, and this is no different with the Data Vault approach. One of the main reasons for this difference lies in the somewhat loose definition of the Data Vault: it allows for some variation. This has led to several (closely related) styles of modeling a Data Vault.

In this post I'll introduce the different Data Vault modeling styles I've found to date. We will focus on the modeling part, not on the architecture part. So we are not discussing differences between source, business or raw Data Vaults, but mainly different ways to transform the same starting (modeling) scenario into a Data Vault.

The Data Vault approach

Data Vault is a modeling approach/method that assists with creating the central data repository in an Enterprise Data Warehouse. It works on the basis that some hub/spoke EDW architecture has been defined (e.g. an Inmon DW 2.0 architecture). In this architecture the central repository is there for data management, while the outlying spokes (Data marts) are used for data delivery to users. Its basic assumption is that you begin by focusing on getting source data into your EDW (not cleansed and restructured data).
Data Vault, however, forces you to adapt all your data to fit the three basic building blocks that are allowed when creating a Data Vault. This means source data must be reshaped so that it fits those basic data entity building blocks, called Hubs, Links and Satellites.
These building blocks have a very basic makeup (a structural sketch follows the list):

  • Hubs are just lists of elementary/independent key values. They implement independent keys. Since Data Vault is interested in the big picture, these are called 'business keys' and they represent common identifiers around the business (invoice number, customer number, employee number).
  • Links are associations between hubs.
  • Satellites track attributes associated with hubs and links through time.
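
To make that makeup concrete, here is a minimal structural sketch in Python. The table and column names (H_CUSTOMER, h_order_sk, load_dts and so on), and the use of surrogate keys and load metadata columns, are illustrative assumptions of mine rather than a prescribed physical design.

```python
# Minimal structural sketch of the three Data Vault building blocks.
# All names are hypothetical; only the shape of the entities matters here.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hub:
    name: str
    business_key: str                        # the independent business key it lists
    technical: List[str] = field(
        default_factory=lambda: ["hub_sk", "load_dts", "record_source"])


@dataclass
class Link:
    name: str
    hub_references: List[str]                # the hubs this association connects
    technical: List[str] = field(
        default_factory=lambda: ["link_sk", "load_dts", "record_source"])


@dataclass
class Satellite:
    name: str
    parent: str                              # the hub or link it describes
    attributes: List[str]                    # descriptive attributes tracked over time
    technical: List[str] = field(
        default_factory=lambda: ["load_dts", "load_end_dts", "record_source"])


hub_customer = Hub("H_CUSTOMER", "customer_number")
hub_order = Hub("H_ORDER", "order_number")
lnk_order_customer = Link("L_ORDER_CUSTOMER", ["h_order_sk", "h_customer_sk"])
sat_customer = Satellite("S_CUSTOMER", "H_CUSTOMER", ["name", "address", "segment"])
```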

Besides the building blocks there are also loading patterns for your central EDW repository. These follow from the definition of data entities in combination with the requirement of loading the source data unaltered into the Data Vault.
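
As an illustration of what such a loading pattern looks like, here is a small sketch of the hub load in Python: only business keys that are not yet present are inserted, existing rows are never updated, and reruns are harmless. The in-memory set stands in for the hub table; all names are simplifying assumptions.

```python
# Sketch of the basic hub loading pattern: insert-only, new business keys only.
from datetime import datetime, timezone


def load_hub(hub_rows: set, staged_keys: list, record_source: str) -> list:
    """Return (and add) hub rows for business keys not seen before."""
    existing = {key for key, _, _ in hub_rows}      # business keys already in the hub
    load_dts = datetime.now(timezone.utc)
    new_rows = [(key, load_dts, record_source)
                for key in staged_keys if key not in existing]
    hub_rows.update(new_rows)                       # existing rows are never touched
    return new_rows


h_customer = set()
load_hub(h_customer, ["C-001", "C-002"], "crm")     # first delivery: two new keys
load_hub(h_customer, ["C-002", "C-003"], "crm")     # rerun/delta: only C-003 is new
```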

Data Vault modeling Styles

However, this recipe is just a global pattern. For example, it does not work as a complete and detailed (semantically equivalent) structural data model transformation recipe if we compare it to the transformation algorithms we use within FCO-IM and other Fact Oriented Modeling techniques (FOMs).
Besides the flexibility in the definition and design of the Data Vault standard, other factors contribute to variations in the Data Vault approach:
  1. There is a long history of using similar temporal data modeling approaches with OLTP systems, like Veldwijk's HTC.
  2. Additional requirements on flexibility and reuse of data and metadata.
  3. The advent of new hardware and software like database appliances and in-memory technology.
  4. The rise of similar EDW-oriented modeling approaches like Anchor Modeling.
This has led to the rise of what can be loosely called different flavors or styles of Data Vault modeling (especially with regard to end-dating links). These can be described as varieties in the underlying structural model transformation strategies. What I present here is by no means a complete list, but it attempts to identify the most important structurally different varieties we see to date. We can explore these different varieties of the Data Vault by looking at the definition and transformation process.

Classical (aka Dimensional) Data Vault

In the Classical (I describe this as the 'Dimensional') Data Vault all links represent n-ary associations between hubs. These links combine all hub references in a source table, both the hub references that are part of the source table key and the references that are not part of any key. Because of the auditability requirement we need a "driving" key in the link (which is in fact a representation of the source table key) for proper temporal data management. Such a link can be considered a "proto" fact table, containing all dimensional references (although time-independent) but without the measurements. It will usually be pretty wide. It tries to minimize joins at the expense of duplicating rows when non-driving-key elements in the link are updated. This style of Data Vault modeling lends itself to easy staging to a dimensional layer, but link re-usability, e.g. for a Business Rule Vault, is tricky.
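
A small sketch may help. Assume a hypothetical ORDER source table whose key is the order number and which also carries non-key references to a customer and an employee; in a Classical Data Vault all of these hub references end up in one wide n-ary link, with the source table key acting as the driving key. All names below are illustrative assumptions.

```python
# Classical ("Dimensional") style: one wide n-ary link per source table.
classical_link = {
    "name": "L_ORDER",
    "hub_references": [
        "h_order_sk",        # from the source table key -> the driving key
        "h_customer_sk",     # non-key reference, rides along in the same link
        "h_employee_sk",     # non-key reference, rides along in the same link
    ],
    "driving_key": ["h_order_sk"],
}
# If the order is reassigned to another employee, a new link row is written for
# the same driving key and the previous row is end-dated via the link satellite:
# few joins when staging to dimensions, at the cost of duplicated link rows.
```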

Anchorized Data Vault a.k.a. Anchor Vault

The Anchor Vault focuses on modeling Anchors (note: these are related, but not identical, to the Anchors in Anchor Modeling). These can be independent (hubs) or dependent (links). They are purely driven by source key columns. Here we basically put each source key in its own hub or link, and its dependent attributes in a satellite. The main reasons for doing this are easier transformation, Anchor/key re-usability, and easier loading and management of the temporal aspects of your data (e.g. timeline completion and standardization). Those considering complex loading (e.g. Change Data Capture), or abundant re-use of entities for a Business Rule Vault, should consider an Anchor Vault.
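
For the same hypothetical ORDER source, an Anchorized Data Vault would roughly look as follows: the source key gets its own anchor (a hub, since it is independent) and its dependent attributes go into a satellite on that anchor. How the independent foreign keys to customer and employee are attached is exactly what distinguishes the Strong and Weak variants below; the names remain illustrative assumptions.

```python
# Anchorized style: one anchor per source key, attributes in satellites.
anchor_vault = {
    "hubs": ["H_ORDER", "H_CUSTOMER", "H_EMPLOYEE"],   # one anchor per source key
    "satellites": [
        {"parent": "H_ORDER",
         "attributes": ["order_date", "status"]},       # attributes dependent on the key
    ],
    # The independent foreign keys (order -> customer, order -> employee) are
    # modeled as links; see the Strong and Weak Anchor Vault variants below.
}
```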

Strong Anchor Vault

With the Strong Anchor Vault, independent foreign keys (foreign key elements not part of a candidate key) get their own many-to-many link. This means we transform one-to-many relationships (foreign key references) in the source into many-to-many links in the Data Vault. Naturally, many-to-many relationships also get their own link. Changes to cardinalities, higher-order relationships and relationship inheritance can be elegantly modeled this way. Because each independent foreign key must have its own link (together with its own transaction satellite), we create considerably more join paths. You also need to change your end-dating process. Note that I do not consider horizontally grouping these many-to-many links together a good idea (but elaborating on that point merits another post). Note also that "merging relationships" (actually subtyping relationships) with different cardinality becomes possible (but is not required), either over time or over similar relationships (through abstraction).
A tool that uses strong anchorization is Quipu (www.datawarehousemanagement.org).
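
Continuing the hypothetical ORDER example, a Strong Anchor Vault would give each independent foreign key its own many-to-many link with its own transaction satellite, roughly as sketched below (illustrative names, not a prescribed design).

```python
# Strong Anchor Vault: every independent foreign key becomes its own m:n link.
strong_anchor_links = [
    {"name": "L_ORDER_CUSTOMER",
     "hub_references": ["h_order_sk", "h_customer_sk"],
     "satellite": "S_L_ORDER_CUSTOMER"},    # transaction satellite for this pairing
    {"name": "L_ORDER_EMPLOYEE",
     "hub_references": ["h_order_sk", "h_employee_sk"],
     "satellite": "S_L_ORDER_EMPLOYEE"},
]
# More link tables and join paths, but the 1:n nature of the source relationship
# is not baked into the structure, so a later change to n:m needs no remodeling.
```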

Weak Anchor Vault

The Weak Anchor Vault differs from the Strong Anchor Vault by creating foreign key links (end-dated links) for independent foreign key relationships (they actually resemble satellites holding references to other hubs). These can be either singleton or grouped. Here we track changes to Anchor references over time. We can group/ungroup independent foreign keys across foreign key links. This decreases the number of join paths and the load complexity in comparison with the Strong Anchor Vault, but flexibility is slightly decreased. This and similar styles have been in use for OLTP systems for many decades to implement bi-temporal operational information systems. It has been independently advocated and implemented by several people, like René Veldwijk (Veldwijk's HTC: http://www.faapartners.com/downloads), Harm van der Lek (Generic Data Vault) and many others, under a variety of different names. BIready (www.BIready.com) is a tool that internally uses a weak Anchor Vault.
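
In a Weak Anchor Vault the same two references would instead live in an end-dated foreign key link hanging off the order anchor, looking much like a satellite that holds hub references; it can be kept as singletons or grouped, as in this sketch (again with illustrative names).

```python
# Weak Anchor Vault: end-dated foreign key links, here grouped into one entity.
weak_anchor_fk_link = {
    "name": "LS_ORDER_REFERENCES",
    "anchor": "h_order_sk",                      # the anchor whose references we track
    "referenced_hubs": ["h_customer_sk", "h_employee_sk"],
    "technical": ["load_dts", "load_end_dts", "record_source"],
}
# Changes to the customer or employee reference end-date the current row and add
# a new one: fewer join paths than the Strong variant, slightly less flexibility.
```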

Elementary Anchor Vault

The Elementary Anchor Vault refactors each element into a separate entity. This means that each attribute has its own satellite (singleton satellites and singleton foreign key links). This can be done on a Weak or Strong Anchor Vault. The effect is more or less an Anchorized and temporalized version of 6NF. It is still not an Anchor Model, because for that you would need to refactor all key attributes/relationships into their own entities.
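
Taken to its extreme on the same hypothetical ORDER example, an Elementary Anchor Vault splits everything into singletons, roughly like this (illustrative names; as noted above, a true Anchor Model would additionally refactor the key attributes and relationships themselves into their own entities).

```python
# Elementary Anchor Vault: one satellite per attribute, one link per foreign key.
elementary_anchor_vault = {
    "hubs": ["H_ORDER", "H_CUSTOMER", "H_EMPLOYEE"],
    "satellites": [
        {"parent": "H_ORDER", "attributes": ["order_date"]},        # singleton satellite
        {"parent": "H_ORDER", "attributes": ["status"]},            # singleton satellite
    ],
    "fk_links": [
        {"anchor": "h_order_sk", "references": ["h_customer_sk"]},  # singleton FK link
        {"anchor": "h_order_sk", "references": ["h_employee_sk"]},  # singleton FK link
    ],
}
```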

Related modeling techniques

There are several modeling techniques related to Data Vault. They are all concerned with temporalization and (key) separation. Non-temporal (key) information is stored in entities that are used as 'Anchors' for temporal entities. I call this class of temporal modeling techniques 'Anchor Style Temporal Data Modeling' (not to be confused with Anchor Modeling, which is just one variety).

Next
In the next posts I will discuss the different styles in more detail and show some examples.