Introduction
Sometimes we see that, given the same input and process, the end result still differs between situations. This also happens when creating (data) models from requirements, and hence also with Data Vault data models. As long as this is intentional and can be explained, it does not have to pose an issue, but often it is an indication of a lack of knowledge or common understanding. It is usually quite hard to track down where the assumptions differ based on just the end result, and this is no different with the Data Vault approach. One of the main reasons for this difference lies in the somewhat loose definition of the Data Vault: it allows for some variation. This has led to several (closely related) styles of modeling a Data Vault.
In this post I'll introduce the different modeling styles of Data Vault I've found to date. We will focus on the modeling part, not on the architecture part. So we are not discussing differences between source, business or raw Data Vaults, but mainly different ways to transform the same starting (modeling) scenario into a Data Vault.
The Data Vault approach
Data Vault is a modeling approach/method that assists with creating the central data repository in an Enterprise Data Warehouse. It works on the basis that some hub/spoke EDW architecture has been defined (e.g. an Inmon DW 2.0 architecture). In this architecture the central repository is there for data management, while the outlying spokes (data marts) are used for data delivery to users. Its basic assumption is that you begin by focusing on getting source data into your EDW (not cleansed and restructured data).
Data Vault, however, forces you to adapt all your data to fit into the three basic building blocks that are allowed when creating a Data Vault. This means source data has to be reshaped so that it fits these basic data entity building blocks, which are called Hubs, Links and Satellites.
These building blocks have a very basic makeup (a small sketch follows below):
• Hubs are just lists of elementary/independent key values; they implement independent keys. Since Data Vault is interested in the big picture, these are called 'business keys' and they represent common identifiers around the business (invoice number, customer number, employee number).
• Links are associations between hubs.
• Satellites track attributes associated with hubs and links through time.
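As a minimal sketch (the entity and column names here are my own examples, not prescribed by the Data Vault standard), the three building blocks can be pictured as simple record structures:

from dataclasses import dataclass
from datetime import datetime

# Illustrative record layouts only; names are examples, not part of the standard.

@dataclass
class HubCustomer:
    customer_number: str   # the business key (a common identifier around the business)
    load_dts: datetime     # when we first saw this key
    record_source: str     # which source system delivered it

@dataclass
class LinkCustomerInvoice:
    customer_number: str   # reference to the Customer hub
    invoice_number: str    # reference to the Invoice hub
    load_dts: datetime
    record_source: str

@dataclass
class SatCustomer:
    customer_number: str   # parent hub reference
    load_dts: datetime     # start of validity of this attribute version
    customer_name: str
    customer_address: str
    record_source: str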
Besides the building blocks there are also loading
patterns for your central EDW repository. These follow from
the definition of data entities in combination with
the requirement of loading the source data unaltered into the Data
Vault.
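A rough sketch of what such a loading pattern boils down to is shown below. This is my own simplification (real implementations typically add hash keys, hash diffs, batching and end dating): a hub load inserts business keys we have not seen before, and a satellite load inserts a new version only when the attributes have changed.

from datetime import datetime

def load_hub(hub_rows, source_keys, record_source):
    """Insert-only hub load: add business keys we have not seen before."""
    known = {row["business_key"] for row in hub_rows}
    for key in source_keys:
        if key not in known:
            hub_rows.append({"business_key": key,
                             "load_dts": datetime.now(),
                             "record_source": record_source})
            known.add(key)

def load_satellite(sat_rows, source_attrs, record_source):
    """Insert a new satellite version only when the attributes changed."""
    for key, attrs in source_attrs.items():
        versions = [r for r in sat_rows if r["business_key"] == key]
        latest = max(versions, key=lambda r: r["load_dts"], default=None)
        if latest is None or latest["attributes"] != attrs:
            sat_rows.append({"business_key": key,
                             "load_dts": datetime.now(),
                             "attributes": attrs,
                             "record_source": record_source})

hub, sat = [], []
load_hub(hub, ["C-42"], "crm")                          # new key: inserted
load_satellite(sat, {"C-42": {"name": "ACME"}}, "crm")  # new version: inserted
load_satellite(sat, {"C-42": {"name": "ACME"}}, "crm")  # unchanged: skipped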
Data Vault modeling styles
However, this recipe is just a global pattern. For example, it does not work as a complete and detailed (semantic-equivalent) structural data model transformation recipe if we compare it to the transformation algorithms we use within FCO-IM and other Fact Oriented Modeling techniques (FOMs).
Besides the flexibility in the definition and design of the Data Vault standard, other factors contribute to variations in the Data Vault approach:
- There is a long history of using similar temporal data modeling approaches with OLTP systems, like Veldwijk's HTC.
- Additional requirements on flexibility and reuse of data and metadata.
- The advent of new hardware and software like database appliances and in-memory technology.
- The rise of similar EDW-oriented modeling approaches like Anchor Modeling.
This has led to the rise of what can loosely be called different flavors or styles of Data Vault modeling (especially with regard to end dating links). These can be described as varieties in the underlying structural model transformation strategies. What I present here is by no means a complete list, but it attempts to identify the most important structurally different varieties I have seen to date. We can explore these different varieties of the Data Vault by looking at the definition and transformation process.
Classical (aka Dimensional) Data Vault
In the Classical (I describe this as the 'Dimensional') Data Vault all links represent n-ary associations between hubs. These links combine all hub references in a source table: both hub references that are part of the source table key and references that are not part of any key. Because of the auditability requirement we need a "driving" key in the link (which is in fact a representation of the source table key) for proper temporal data management. Such a link can be considered a "proto" fact table, containing all dimensional references (although time independent) but without the measurements. It will usually be pretty wide. It tries to minimize joins at the expense of duplicating rows when non-driving key elements are updated in the link. This style of Data Vault modeling lends itself to easy staging to a dimensional layer, but link re-usability for e.g. a Business Rule Vault is tricky.
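A minimal sketch of this behaviour, using invented example names (an Order source table referencing Customer and Employee hubs): one wide, insert-only link keyed by the driving key, where a change in a non-driving reference duplicates the row.

from datetime import datetime

# Illustrative only: a wide n-ary "proto fact" link combining all hub
# references of one source table (Order). The order number acts as the
# driving key; the link itself is insert-only.
link_order = []

def load_order_link(order_no, customer_no, employee_no):
    refs = {"customer_no": customer_no, "employee_no": employee_no}
    latest = max((r for r in link_order if r["order_no"] == order_no),
                 key=lambda r: r["load_dts"], default=None)
    if latest and all(latest[k] == v for k, v in refs.items()):
        return  # nothing changed for this driving key
    # any change in a non-driving reference duplicates the link row
    link_order.append({"order_no": order_no, **refs, "load_dts": datetime.now()})

# The order is reassigned to another employee: the driving key stays the
# same, so a second (duplicated) link row appears for order O-1001.
load_order_link("O-1001", "C-42", "E-7")
load_order_link("O-1001", "C-42", "E-9")
assert len(link_order) == 2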
Anchorized Data Vault a.k.a. Anchor Vault
The Anchor Vault focuses on modeling Anchors (note that these are related, but not identical, to the Anchors in Anchor Modeling). These can be independent (hubs) or dependent (links). They are purely driven by source key columns: here we basically put each source key in its own hub or link and its dependent attributes in a satellite. The main reasons for doing this are an easier transformation, Anchor/key re-usability, and easier loading and management of the temporal aspects of your data (e.g. timeline completion and standardization). Those considering complex loading (e.g. Change Data Capture), or abundant re-use of entities for a Business Rule Vault, should consider an Anchor Vault.
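As a rough sketch with invented names: a single source table ORDER(order_no PK, customer_no FK, order_date, status) is split key by key rather than into one wide link.

# Illustrative decomposition of one source table, Anchor Vault style.
# Source: ORDER(order_no PK, customer_no FK, order_date, status)

# each source key gets its own hub ...
hub_order    = [{"order_no": "O-1001"}]
hub_customer = [{"customer_no": "C-42"}]

# ... the dependent key (the foreign key reference) gets its own link ...
link_order_customer = [{"order_no": "O-1001", "customer_no": "C-42"}]

# ... and the non-key attributes go into a satellite on the Order anchor,
# versioned by load timestamp so timelines can be completed and standardized.
sat_order = [{"order_no": "O-1001", "load_dts": "2024-01-01",
              "order_date": "2023-12-31", "status": "shipped"}]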
Strong Anchor Vault
With the Strong Anchor Vault, independent foreign keys (foreign key elements that are not part of a candidate key) get their own many-to-many link. This means we transform one-to-many relationships (foreign key references) in the source into many-to-many links in the Data Vault. Naturally, many-to-many relationships also get their own link. Changes to cardinalities, higher-order relationships and relationship inheritance can be elegantly modeled this way. Because each independent foreign key must have its own link (together with its own transaction satellite), we create considerably more join paths, and you also need to change your end dating process. Note that I do not consider horizontally grouping these many-to-many links together to be a good idea (but elaborating on that point merits another post). Note also that "merging" relationships (actually subtyping relationships) with different cardinality becomes possible (but not required), either over time or over similar relationships (through abstraction).
A tool that uses strong anchorization is Quipu
(www.datawarehousemanagement.org)
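A small sketch of the difference, again with invented names: the one-to-many reference from Order to Customer is not end dated in place, but kept as a many-to-many link with its own transaction satellite recording when each pairing was observed.

from datetime import datetime

# Illustrative only: the independent foreign key Order -> Customer becomes a
# many-to-many link plus a transaction satellite, instead of an end-dated
# reference inside a wider link.
link_order_customer = []      # one row per distinct (order, customer) pair
sat_link_order_customer = []  # records when each pair was observed

def observe(order_no, customer_no):
    pair = {"order_no": order_no, "customer_no": customer_no}
    if pair not in link_order_customer:
        link_order_customer.append(pair)
    sat_link_order_customer.append({**pair, "load_dts": datetime.now()})

# A cardinality change in the source (an order later split over two
# customers) needs no structural change: the link is already many-to-many.
observe("O-1001", "C-42")
observe("O-1001", "C-43")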
Weak Anchor Vault
The Weak Anchor Vault differs from the
Strong Anchor Vault by creating foreign key links (end dated links)
for independent foreign keys relationships (they actually resemble satellites
holding references to other Hubs). These can either be singleton or grouped.
Here we track changes to Anchor references over time. We can group/ungroup independent
foreign keys across foreign key Links. This decreases the number join paths and
load complexity in comparison with the Strong Anchor Vault, but flexibility is
slightly decreased. This and similar styles have been in use for OLTP systems
for many decades to implement bi-temporal operational information
systems. It has been independently advocated and implemented by several people
like René Veldwijk (Veldwijk's HTC: http://www.faapartners.com/downloads
), Harm van der Lek (Generic Data Vault) and many others under
a variety of different names. BIready (www.BIready.com) is a tool that
internally uses a weak Anchor Vault.
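A minimal sketch with invented names: the foreign key link behaves like a satellite that holds hub references, end dated when a reference changes; several independent foreign keys can be grouped into one such link.

from datetime import datetime

# Illustrative only: an end-dated foreign key link on the Order anchor that
# tracks which customer and employee the order refers to over time.
fk_link_order = []

def set_references(order_no, customer_no, employee_no):
    """Close the open version and insert a new one when a reference changes."""
    current = [r for r in fk_link_order
               if r["order_no"] == order_no and r["end_dts"] is None]
    if current and (current[0]["customer_no"], current[0]["employee_no"]) \
            == (customer_no, employee_no):
        return  # references unchanged
    for row in current:
        row["end_dts"] = datetime.now()   # end date the previous version
    fk_link_order.append({"order_no": order_no,
                          "customer_no": customer_no,  # grouped references:
                          "employee_no": employee_no,  # fewer join paths
                          "load_dts": datetime.now(), "end_dts": None})

set_references("O-1001", "C-42", "E-7")
set_references("O-1001", "C-42", "E-9")   # old version end dated, new one opened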
Elementary Anchor Vault
The elementary Anchor Vault re-factors each
element into a separate entity. This means that each attribute has its
own satellite (singleton satellites and singleton foreign key
links). This can be done on a Weak or Strong Anchor Vault. The effect is more
or less an Anchorized and temporalized version of 6NF. It
is still not an Anchor Model because for that you need
to refactor all key attributes/relationships into their own entity.
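A tiny sketch of the effect, with invented names: instead of one Order satellite with several columns, every attribute lives in its own singleton satellite with an independent timeline.

# Illustrative only: every attribute of the Order anchor gets its own
# singleton satellite, each with an independent timeline.
sat_order_status     = [{"order_no": "O-1001", "load_dts": "2024-01-01",
                         "status": "shipped"}]
sat_order_order_date = [{"order_no": "O-1001", "load_dts": "2024-01-01",
                         "order_date": "2023-12-31"}]
# A change to the status produces a new row only in sat_order_status; the
# order_date timeline is untouched (an anchorized, temporalized 6NF shape).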
Related modeling techniques
There are several modeling techniques related to Data Vault. They are all concerned with temporalization and (key) separation: non-temporal (key) information is stored in entities that are used as 'Anchors' for temporal entities. I call this class of temporal modeling techniques 'Anchor Style Temporal Data Modeling' (not to be confused with Anchor Modeling, which is just one variety).
Next
In the
next posts I will discuss the different styles in more detail and show some
examples.