Monday, October 8, 2012

Data Vault (Business Key) Mutations Matter

How to Track complex business key mutations in a Data Vault

Introduction

In a Data Vault we connect all our (temporal) information to our business keys by creating hubs for them. The assumption here is that business keys represent stable identifiers for our data. We do know that sometimes business keys change. For this we have a SAME-AS Link that connects an initial business key to it’s successor. Together with a satellite tracking when this happened this technique is able to track the occasional business key mutation.

The Problem

When business keys not only mutate one-on-one but also merge or split you need to be able to report dependent information both on the original business keys as well on the newly merged business key without seriously restructuring your Data Vault. This is also possible using the same SAME_AS_LINK connecting source and target business keys and using a set of calculations that can be performed on a satellite that allow us to view these mutations in a consistent manner.

Scenarios

We consider 4 scenario’s:

1. Key Replacement: 1 to 1 key mutation: 1 key is replaced by one other key

2. Key Split: 1 to Many mutation. 1 key is split into several keys

3. Key Merge:Many to 1 mutation. Several keys are merged into one target key

4. Multi Key Transitions:Many to Many mutation. Several keys are replaced with several other keys

We will not explore scenario 4 here because it is rare and rather complex. Rest assured this solution can handle this scenario as well. Implementing scenario 1 is left as an exercise to the reader.

All scenario’s we can also distinguish a key preserving scenario besides the non key preserving scenarios. With Key preserving scenario the source and target key(s) overlap.

1. Key Replacement with key preservation: Key mutates into itself, Identity transform.

2. Key Split. Source key is also one of the target keys

3. Key Merge. Target key is also one of the source keys

4. Multi Key Transitions: Source key set and target key set overlap

Example

As an Example we use the Hospitals example proposed by Martin Ekeblad. In  there we have several Hospital (units) that are regularly reorganized. We assume the hospitals are frequently merged together or split into new units. We want to be able to track these actions consistently through time in a the corresponding Data Vault.

Source Data Model

We assume each hospital has a Hospital Number that is the business key identifying the hospital. We assume we have a table called Hospitals with several attributes and the business key. We also have a Hospital Mutation table recording the mergers and splits of the hospitals numbers. We also provide an insert/mutation date as basis for the Data Vault load date.

Source table Hospitals

Hospital_Number
mutation/insertion date
SU
2010-01-01
AL
2010-01-01
NU
2010-01-01
SUAL
2012-02-01
NU1
2012-02-01
NU2
2012-02-01
NU2a
2012-03-01
NU2b
2012-03-01

We also have a table with mutations of hospitals, Hospital mergers as well as splits. Here we record initial and target hospital number.

Source table Hospital_ Mutations

Initial_Hospital_Number
Target_Hospital_Number
mutation/insertion date
SU
SUAL
2012-02-01
AL
SUAL
2012-02-01
NU
NU1
2012-02-01
NU
NU2
2012-02-01
NU2
NU2a
2012-03-01
NU2
NU2b
2012-03-01

Data Vault Model





The Data Vault consists of 4 tables:
  1.  Hospital Hub containing the Hospital Number
  2. A Same As Link connecting initial and target hospitals numbers
  3. A basic satellite SLNK Hospitals recording the changing hospital numbers
  4. A derived satellite SLNK Hospital Mutations in which we will capture all changes in an uniform and consistent way using business rules.

Table Details

HUB_Hospitals


HUB_Hospital_SQN
Hospital_Number
HUB_Hosptial_LDTS
1
SU
2010-01-01
2
AL
2010-01-01
3
NU
2010-01-01
4
SUAL
2012-02-01
5
NU1
2012-02-01
6
NU2
2012-02-01
7
NU2a
2012-03-01
8
NU2b
2012-03-01

SAME-AS-LINK table SLNK_Hospitals

SLNK_Hospitals_SQN
 Initial_Hospital_SQN
New_Hospital_SQN
 SOURCE

this LINK records the from and to Hospital as recorded by the source. This link has all the source/target records  required to facilitate a full temporal consistent list of hospital mutations.

Basic satellite table SAT_SLNK_Hospitals

SLNK_Hospitals_SQN
SLNK_Hospitals_LDTS
SLNK_Hospitals_LEDTS
SOURCE

this table tracks only actual inserts from table Hospital_Mutations

Derived satellite table SAT_SLNK_Hospitals_Mutations

SLNK_Hospitals_SQN
Load Date
 Load Date End
 Proration
SOURCE
MutType
Keyrate

tracks all detailed merger and split info. Note that this is assumed here to be mostly *derived* info that is created by subsequent business rules (ETL) code.

We record the following additional fields:
  • Proration 
    • Proration column to adjust amounts that need to be prorated across different hospital units after they have been split. This column is adjustable by using complex business rules. 
  • Mutation type
    • Type of key mutation: Replace,Split,Merge 
  • Keyrate Division
    • Calculated division rate of the mutation. It is used to calculate e basic prorations 
  • We have 3 business rules: 
    • Identity Rule: Will insert records in SAME AS LINK and derived sat linking keys back to themselves 
    • Proration Rule: Will calculate proration based on splitting. We assume an even proration for all target keys 
    • Transitivity Rule: Will calculate new LINK records in the SAME-AS LINK and sat that represent the transitive closure of repeated mergers or splits. This rule will also recursively calculate proration.

Processing Steps

We will show the actual process using several steps on a concrete example. We will use the data as shown at the source tables. The processing will be shown in 3 steps.
  1. Inserting the basic keys: Identity Rule
  2. Split and Merge Hospital Units
    1. Merge of SU and AL into SUAL on (2012-01-01)
    2. Split of NU into NU1,NU2
  3. Repeated split of NU2 into NU2a, NU2b
Also we will use Business Keys in the links for this example. Also non essential load dates have been omitted as well.

Step 1. Inserting the basic keys: Identity Rule

SLNK_Hospitals

1
SU
SU
 Identity Rule
2
AL
AL
 Identity Rule
3
NU
NU
 Identity Rule

SAT_SLNK_Hospitals
SAT_SLNK_Hospitals_ Mutations 

SQN
Load Date
 Load Date End
 Prorate
SOURCE
MutationType
Keyrate
1
2010-01-01
9999-12-31
100
Identity Rule
Replace
1
2
2010-01-01
9999-12-31
100
Identity Rule
Replace
1
3
2010-01-01
9999-12-31
100
Identity Rule
Replace
1

Step 2a. Merge of SU and AL into SUAL on (2012-01-01)

SLNK_Hospitals

SLNK_Hospitals_SQN
Initial_Hospital_SQN
Target_Hospital_SQN
Source
1
SU
 SU
 Identity Rule
2
AL
 AL
 Identity Rule
3
NU
 NU
 Identity Rule
4
SU
 SUAL
 Hospital_Mutations
5
SU
 SUAL
 Hospital_Mutations
6
SUAL
 SUAL
Identity Rule

SAT_SLNK_Hospitals

SLNK_Hospitals_SQN
SLNK_Hospitals_LDTS
SLNK_Hospitals_LEDTS
SOURCE
4
2012-02-01
9999-12-31
 Hospital_Mutations
5
2012-02-01
9999-12-31
 Hospital_Mutations

SAT_SLNK_Hospitals_ Mutations

SQN
Load Date
 Load Date End
 Prorate
SOURCE
MutationType
Keyrate
1
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
2
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
3
2010-01-01
9999-12-31
100
Identity Rule
Replace
1
4
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
5
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
6
2012-02-01
9999-12-31
100
Identity Rule
Replace
1

Step 2b. Split of NU into NU1,NU2


SLNK_Hospitals

SLNK_Hospitals_SQN
Initial_Hospital_SQN
New_Hospital_SQN
Source
1
SU
SU
 Identity Rule
2
AL
AL
 Identity Rule
3
NU
 NU
 Identity Rule
4
SU
SUAL
 Hospital_Mutations
5
SU
SUAL
 Hospital_Mutations
6
SUAL
SUAL
 Identity Rule
7
NU
NU1
Hospital_Mutations
8
NU
NU2
Hospital_Mutations
9
NU1
NU1
Identity Rule
10
NU2
NU2
Identity Rule

SAT_SLNK_Hospitals

SLNK_Hospitals_SQN
SLNK_Hospitals_LDTS
SLNK_Hospitals_LEDTS
SOURCE
4
2012-02-01
9999-12-31
 Hospital_Mutations
5
2012-02-01
9999-12-31
 Hospital_Mutations
7
2012-02-01
9999-12-31
 Hospital_Mutations
8
2012-02-01
9999-12-31
 Hospital_Mutations

SAT_SLNK_Hospitals_ Mutations 

SQN
Load Date
 Load Date End
 Prorate
SOURCE
MutationType
Keyrate
1
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
2
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
3
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
4
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
5
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
6
2012-02-01
9999-12-31
100
Identity Rule
Replace
1
7
2012-02-01
9999-12-31
50
Proration Rule
Split
2
8
2012-02-01
9999-12-31
50
Proration Rule
Split
2
9
2012-02-01
9999-12-31
100
Identity Rule
Replace
1
10
2012-02-01
9999-12-31
100
Identity Rule
Replace
1

Step 3. Split of NU2 into NU2a, NU2b


SLNK_Hospitals

SLNK_Hospitals_SQN
Initial_Hospital_SQN
New_Hospital_SQN
Source
1
SU
SU
 Identity Rule
2
AL
AL
 Identity Rule
3
NU
 NU
 Identity Rule
4
SU
SUAL
 Hospital_Mutations
5
SU
SUAL
 Hospital_Mutations
6
SUAL
SUAL
 Identity Rule
7
NU
NU1
Hospital_Mutations
8
NU
NU2
Hospital_Mutations
9
NU1
NU1
Identity Rule
10
NU2
NU2
Identity Rule
11
NU2
Nu2a
Hospital_Mutations
12
NU2
Nu2b
Hospital_Mutations
13
NU2a
NU2a
Identity Rule
14
NU2b
NU2b
Identity Rule
15
NU
NU2a
 Transitivity Rule
16
NU
NU2b
 Transitivity Rule

SAT_SLNK_Hospitals

SLNK_Hospitals_SQN
SLNK_Hospitals_LDTS
SLNK_Hospitals_LEDTS
SOURCE
4
2012-02-01
9999-12-31
 Hospital_Mutations
5
2012-02-01
9999-12-31
 Hospital_Mutations
7
2012-02-01
9999-12-31
 Hospital_Mutations
8
2012-02-01
2012-03-01
 Hospital_Mutations
11
2012-03-01
9999-12-31
 Hospital_Mutations
12
2012-03-01
9999-12-31
 Hospital_Mutations

SAT_SLNK_Hospitals_ Mutations 

SQN
Load Date
 Load Date End
 Prorate
SOURCE
MutationType
Keyrate
1
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
2
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
3
2010-01-01
2012-02-01
100
Identity Rule
Replace
1
4
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
5
2012-02-01
9999-12-31
100
Hospital_Mutations
Merge
1
6
2012-02-01
9999-12-31
100
Identity Rule
Replace
1
7
2012-02-01
9999-12-31
50
Proration Rule
Split
2
8
2012-02-01
2012-03-01
50
Proration Rule
Split
2
9
2012-02-01
9999-12-31
100
Identity Rule
Replace
1
10
2012-02-01
2012-03-01
100
Identity Rule
Replace
1
11
2012-03-01
9999-12-31
50
 Proration Rule
Split
2
12
2012-03-01
9999-12-31
50
 Proration Rule
Split
2
13
2012-03-01
9999-12-31
100
Identity Rule
Replace
1
14
2012-03-01
9999-12-31
100
Identity Rule
Replace
1
15
2012-03-01
9999-12-31
25
Transitivity Rule
Split
4
16
2012-03-01
9999-12-31
25
Transitivity Rule
Split
4

Reporting Example

How to use these tables in actual reporting:
Assume we have the following table Invoices (The format, DV or dimensional is not important right now)

Table Invoices

Invoice#
Amount
 Hospital
0001
100
 NU

And we want to prorate this amount to a Hospital given the reference date of 2012-03-01

Query:
SELECT invoice#, 
       amount * sat.proration / 100 amt, 
       inv.hospital                 AS Original_Hospital, 
       slnk.new_hospital_sqn        AS Current_Hospital 
FROM   invoices inv 
       INNER JOIN slnk_hospitals slnk 
               ON slnk.initial_hospital_sqn = inv.hospital 
       INNER JOIN sat_slnk_hospitals_split_proration sat 
               ON sat.slnk_hospitals_sqn = slnk.slnk_hospitals_sqn 
WHERE  "2012-03-01" BETWEEN slnk_hospitals_ldts AND slnk_hospitals_ledts 



This query will produce:

Invoice#
Prorated Amount
 Original_Hospital_Number
 Current_Hospital_Number
0001
    50
 NU
NU1
0001
    25
 NU
NU2a
0001
    25
 NU
NU2b

An alternative, a Key Mutation Satellite

As an alternative a KEY_MUTATION_SAT can be deployed. this is more efficient but less tracable:
Example

SAT_Hospital_TRANS

SQN
Target Hopsital
Load Date
Load Date End
Proration
Source
1
SU
2010-01-01
2012-02-01
100
Identity Rule
2
AL
2010-01-01
2012-02-01
100
Identity Rule
3
NU
2010-01-01
2012-02-01
100
Identity Rule
1
SUAL
 2012-02-01
9999-12-31
100
Hospital_Mutations
1
SUAL
2012-02-01
9999-12-31
100
Hospital_Mutations
4
SUAL
2012-02-01
9999-12-31
100
Identity Rule
3
NU1
2012-02-01
9999-12-31
50
Proration Rule
3
NU2
2012-02-01
2012-03-01
50
Proration Rule
5
NU1
2012-02-01
9999-12-31
100
Identity Rule
6
NU2
2012-02-01
2012-03-01
100
Identity Rule
6
NU2a
 2012-03-01
9999-12-31
50
 Proration Rule
6
NU2b
2012-03-01
9999-12-31
50
 Proration Rule
7
NU2a
2012-03-01
9999-12-31
100
Identity Rule
8
NU2b
2012-03-01
9999-12-31
100
Identity Rule
3
NU2a
2012-03-01
9999-12-31
25
Transitivity Rule
3
NU2b
2012-03-01
9999-12-31
25
Transitivity Rule

This is exactly the same set, but now connected to the base hub and using an alternate business key
If business key splits are not expected this scenario, or a end dated one to many link can also be employed.

End dating

In key preserving scenarios when doing mergers it just means we do not need to do end dating on the original key. When splitting we need to add a new record for the key with the correct proration for the split.

Virtualizing the business rule and sparse population

We could opt for virtualizing parts of the business rules and creating a sparse mutation satellite. Esp the transitivity and identity rule are easily derivable. Just omitting them means the reporting queries become more complex. Specially when there are a LOT of business keys and the mutation rate is high.