Friday, April 12, 2013

Seeing the last of the 'Last Seen Date': Tracking Deletes

Introduction

With Data Vault data models we sometimes add a metadata column to hubs/links called the 'Last Seen Date'. It represents the last time any source delivered said hub/link record to the Data Vault (again), and is only relevant on processing full loads. It needs to be checked on each load and usually when doing the full loads, will update (almost) all hub and link records (sometimes even has multiple passes as well). An expensive, tricky and performance wise problematic action because for the rest hub/link records should not be altered in any way. IMO, it's the last thing I'd like to see in a Data Vault (That's how you should remember its name ;).

Alternative

The best alternative is using a full blown hub/link status satellite that tracks the status: inserted, deleted for a hub/link (driving key) entry. The problem lies in tracking the 'hard' deletes from a source system information because the 'vanilla' DV with just basic full loads does not address the deletes tracking issues.

Handling Deletes

There are 5 levels of 'hard' delete tracking you can use on a Data Vault

  1. None: We assume there are no 'hard'/physical deletes in a source system, only logical deletes, indicated by a status or end time attribute. Physical deletes, if done, are actually driven by archiving and do not indicate the records becoming invalid (false), but just be removed for e.g. space constraints. Archiving done for security/privacy/legal reasons might have impact on the Data Vault, but it depends if we will then register a delete in the DV (We might even have to, oh horror, remove data from our DV)
    If we do happen to stumble on source system deletes in this scenario we become non auditable, and only a Last Seen Date might give us an indication that a delete actually occurred.
  2. Driving Key: In Links that are subject to Non Key updates we can use the driving key to track dummy deletes. This happens (among others) when when a 1:M relationship is encoded in a M:N link
  3. Key Omission: If we assume source systems do deletes and we track them using full key compares between DV and source. This is a basic precaution when doing basic full loads on a DV.
  4. (Key) Deletes: We actually get a list of source (record) deletes. This is of course ideal to track deletes, but don't forget to cater for key updates/mutations as well (where possible in the source).
  5. (Key) Mutations: A Key mutation is logically just a combination of a key delete and a key insert. If we track all (key) mutations (where allowed/expected), like with a Change Data Capture driven solution we can easily track key deletes generated by key mutations as well, which saves us the work of having to do a full (key) compare. Additionally, we can now also populate a key mutation driven SAME-AS Link.
    As an alternative we can track the key updates only when we look at the source surrogate key (stored in e.g. a key satellite), given that it is stable.

Conclusion

The Last Seen Date is a expensive approximation of handling deletes. IMO the efforts can be better invested in better delete tracking. The Last Seen Date should be seen as a last ditch measure tracking deletes, not a basic metadata for a DV.