Monday, March 4, 2013

Data Vault: Deletes MATTER


There was a short discussion on Twitter about Data Vault and the tracking of (key) deletes. Few realize that the basic/traditional Data Vault is ONLY 100% auditable when keys/records are never deleted (and never updated). I personally think Data Vault is not precise enough in formulating its auditability requirement in relation to data model transformations and the tracking of database data/transactions.


The auditability requirement can be restated using semantically equivalent data model (schema) transformations and data transactions. Being auditable means both that the model transformation is "lossless" and that all transactions are accounted for. A "lossless" model transformation is a bijective function between 2 model schemas: it means we can transform back and forth between the data models. If we want to account for all transactions, we either need to apply a (kind of) transactional replication or, when we do not need transactional consistency, we need to take full snapshots at a certain interval (say a day). If we want to add a transaction timeline to one of the models (like Data Vault does), we effectively transform record deletes and updates into record inserts. So we are required to log deletes if we want to be 100% auditable.
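A minimal sketch of that last point: on a load-date timeline, a delete is never removed from history but recorded as one more insert carrying a delete flag. The field names (`key`, `load_date`, `deleted`) are illustrative, not a Data Vault standard.

```python
from datetime import date

# Illustrative timeline: every change, including a delete, becomes an insert.
history = []

def apply_insert(key, load_date):
    history.append({"key": key, "load_date": load_date, "deleted": False})

def apply_delete(key, load_date):
    # The delete is logged as a new row with a delete flag,
    # so the full timeline stays auditable.
    history.append({"key": key, "load_date": load_date, "deleted": True})

apply_insert("CUST-1", date(2013, 3, 1))
apply_delete("CUST-1", date(2013, 3, 4))

# The key's state at any point in time can be reconstructed from the timeline.
latest = max((r for r in history if r["key"] == "CUST-1"),
             key=lambda r: r["load_date"])
print(latest["deleted"])  # True: the key is currently deleted
```

Nothing is ever updated or removed in `history`; reconstructing the source state for any date is a simple filter on `load_date`, which is exactly what makes the transformation reversible.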

Tracking deletes

While there are some very good arguments against physically deleting records in a database, it is often the case that we need to account for them. There are basically 2 techniques for tracking deletes: transaction logging (Change Data Capture, audit trails) and a full compare of snapshots. With the snapshots we only need to compare (all) the key(s) of a table (there can be more than 1!!), hence the name "key compare". With an audit trail/CDC we can not only track record/key deletes, but also track key mutations. We can fill a SAME-AS link (actually a FROM-TO link) or key Satellite directly from a CDC stream. Note that a key update is logically equivalent to a delete plus an insert, together with the knowledge of how the old key value mutates into the new one.
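The key-compare technique boils down to set differences between two full key snapshots. A small sketch, with made-up key values, assuming composite keys are stored as tuples:

```python
# Hypothetical "key compare": diff the full key snapshots of two load days.
# Keys are tuples so the same code works for composite (multi-column) keys.
snapshot_yesterday = {("CUST-1",), ("CUST-2",), ("CUST-3",)}
snapshot_today     = {("CUST-1",), ("CUST-3",)}

deleted_keys  = snapshot_yesterday - snapshot_today   # keys gone since yesterday
inserted_keys = snapshot_today - snapshot_yesterday   # keys new since yesterday

print(sorted(deleted_keys))  # [('CUST-2',)]
```

Note the limitation this illustrates: a snapshot compare sees a key update only as one delete plus one unrelated insert; an audit trail/CDC stream is needed to know that the old key mutated into the new one.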

A special case with deletes is data archiving, since archiving usually means physically deleting records while logically they might still be valid. While this post is not about Data Vault and archiving (I feel another post coming up ;) the correct way to deal with archiving is to set an archiving indicator on those records, and not to logically or physically delete them in your Data Vault/EDW. Alas, regulatory and privacy concerns make handling data archiving (both in the Data Vault as well as in the source system) a complex subject.
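In code, the indicator approach is trivial but worth spelling out: the archive event flips a flag, and the record itself is never touched. The structure and field names below are assumptions for illustration only.

```python
# Illustrative warehouse rows keyed by business key; "archived" is the indicator.
warehouse = {
    "CUST-1": {"archived": False},
    "CUST-2": {"archived": False},
}

def mark_archived(key):
    # The record stays physically and logically present in the EDW;
    # only the indicator records that the source has archived it.
    warehouse[key]["archived"] = True

mark_archived("CUST-2")
print(warehouse["CUST-2"]["archived"])  # True, but the record is still there
```

Contrast this with a logical delete: an archived record is still valid business data and should still appear in historical queries, which is exactly why it must not be treated as deleted.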

Data Vault

The basic Data Vault makes a trade-off between performance, ease of use, and full consistency and transparency. This is not an issue as long as this trade-off is made in an informed and consistent way. Alas, too many people take the 100% auditability requirement literally, but for the basic Data Vault this is just a goal, not a fact. For those who want to make 100% auditability a certainty: YOU need to make sure your model transformation is correct and all your transactions/records are accounted for. The Data Vault standards were designed for a lot of good reasons, but they were never precise enough to scientifically/mathematically guarantee 100% auditability of data and models in all circumstances.