Thursday, April 18, 2013

The Rule & the Raw & the Business (Data) Vault: Data Vault Architecture and Business Rules

Introduction

Every now and then the discussion on derived data, centralization of data logic and the definition of the (Raw) Data Vault pops up. +Daniel Linstedt's Super Charge book, +Hans Patrik Hultgren's Agile Data Warehousing book and +Ronald Damhof's EDW papers (see the papers section of his LinkedIn profile) all describe varieties of the concept of storing derived data in a Data Vault oriented EDW. They have also blogged about this subject and the terminology regarding the Raw Vault; see Ronald's (and Dan's) and Hans's blog posts. In this blog post I'll discuss these views and my (and +Ronald Damhof's) take on this architecture pattern.

Centralized Logic

Often it makes sense to centralize not only the storage of data, but also some (all?) of the additional logic required to process the data before it goes out to the users/data marts/BI systems. This logic is often classified as 'Business Rules', 'Calculations' or 'Data Quality Rules'.

Business Rules?

From an information system perspective a formalized business rule is a constraint expressed on data. This means that keys and domain definitions are also business rules.
We personally prefer to talk about 2 different kinds of business rules: structural and non-structural.

Structural Business Rules

Structural business rules are rules that structure elementary information. They contain the basic building blocks for creating data models. 
  1. Elementary structure rules like domain, key, dependency, foreign key/subset constraint and basic entity (relation/fact type).
  2. Model derivation/transformation rules to create new semantically/logically equivalent (elementary) model structures.

Non Structural Business Rules

Non-structural business rules define calculations/derivations, constraints & Data Quality rules and structural repair rules. Of those the structural repair rules are the most interesting: these are rules that change/extend data models (create new structural business rules!) based on derived data/calculations. So we have:
  1. Derived data/calculations
  2. Constraints/Data Quality rules
  3. Structural repair rules

Derived Data and Business rules

We do not see a big difference between handling non-structural business rule data and raw (source system) data. A business rule is just a small (source) data model that happens to derive its data from the Data Vault and not externally. It is (normally) loaded in the same manner and into the same kind of objects as with a raw Data Vault. This means that from a metadata perspective we can add traceability information, and we have better control over the rule/data/metadata than that of externally sourced information systems, but that's it. Dan's "Business Data Vault" suggests the business is in the lead on that data (and the business rule definitions). While this is often the case, business rules can also have a more technical origin like performance (aggregation satellites), so the name is a bit suggestive. We prefer the term (Business) Rule Vault instead, to emphasize the importance of the rule processing over the exact business definition, specification and naming.
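To make this concrete, here is a minimal sketch (all names, structures and the rule id are hypothetical, not taken from any of the cited material) of a non-structural business rule loaded like any other satellite, with the rule id as its record source so the derivation stays traceable:

    # Minimal sketch: an aggregation rule loaded as an ordinary satellite.
    # All names (sat_order_raw, the rule id, the columns) are illustrative only.
    from collections import defaultdict
    from datetime import datetime, timezone

    # Raw Vault satellite rows already stored in the Data Vault
    sat_order_raw = [
        {"hub_order_key": "H1", "line_amount": 100.0},
        {"hub_order_key": "H1", "line_amount": 50.0},
        {"hub_order_key": "H2", "line_amount": 75.0},
    ]

    def apply_aggregation_rule(raw_rows, rule_id="BR_ORDER_TOTAL_V1"):
        """Derive an aggregation satellite; the rule id doubles as the record source."""
        totals = defaultdict(float)
        for row in raw_rows:
            totals[row["hub_order_key"]] += row["line_amount"]
        load_dts = datetime.now(timezone.utc)
        return [
            {
                "hub_order_key": key,
                "order_total": total,
                "load_dts": load_dts,
                "record_source": rule_id,  # traceability to the rule, not to an external system
            }
            for key, total in totals.items()
        ]

    for rec in apply_aggregation_rule(sat_order_raw):
        print(rec)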

The (Business) Rule Vault

For us all DV objects that are only/mainly driven by business rules belong to the Rule Vault. Note that since Rule Vault objects can (and often do) directly relate to/depend on entities in the Raw Vault, you should not see them as 2 completely different and independent (storage) tiers; they are in the same tier, just 2 different logical (sub)layers within the Data Vault. This separation is just a soft one. We can load the same DV entity from a rule or raw source, either in consecutive order or sometimes even at the same time. Our metadata will keep the records auditable according to their source, but assigning DV entities exclusively to the Rule or Raw Vault is not always possible or sensible, especially not over time.
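A small sketch of that last point (hub name, keys and record sources are all made up for illustration): the same hub is loaded from a raw source and from a rule, and only the record source metadata tells the records apart:

    # Minimal sketch: one hub loaded from both a raw source and a business rule.
    # Insert-only; the record_source column keeps every record auditable.
    from datetime import datetime, timezone

    hub_customer = {}  # business key -> hub record

    def load_hub(business_keys, record_source):
        """Insert-only hub load; the first arrival of a business key wins."""
        load_dts = datetime.now(timezone.utc)
        for bk in business_keys:
            hub_customer.setdefault(bk, {
                "customer_bk": bk,
                "load_dts": load_dts,
                "record_source": record_source,
            })

    load_hub(["CUST-001", "CUST-002"], record_source="CRM")                   # Raw Vault load
    load_hub(["CUST-002", "CUST-003"], record_source="RULE:DERIVE_PROSPECTS") # Rule Vault load

    for rec in hub_customer.values():
        print(rec)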

The Raw (Data) Vault

The source-driven but business-oriented (Raw) Data Vault, the 'Raw Vault' for short, is the classical Data Vault. It is organized around integration through 'business keys', but is also directly related/auditable back to the source (data). We'll forgo the discussion here on (Single) Source Data Vaults driven by the primary/surrogate key and debunk Source Data Vaults, also called Stage Data Vaults, in another post.

The (Complete) Data Vault: Rule Vault + Raw Vault

So for us the (Complete) 'Data Vault' combines the Business Rule (Data) Vault, the 'Rule Vault' for short, and the 'Raw' Data Vault, or Raw Vault.

Business (Data) Vault and Business View Perspectives

But even with the Rule Vault we still need to connect our business rule results to the business. For this we use the "Business View" concept of +Ronald Damhof. It brings together Rule and Raw Vault objects to present a consistent business view on all the data (a 'truth' in classical EDW thinking). Business Views are (conceptual) (business) data models bereft of detailed structural, temporal and other transformations/derivations. These business views can be based on available enterprise, industry or source system information models, greatly reducing overall data and business rule complexity.

Perspectives

These Business Views can be implemented through several perspectives in different structures (Dimensional, Normalized and Data Vault!) and with different temporal aspects (now, point-in-time, period). We define a Business Data Vault as a Business View perspective in Data Vault format. These Business Views sit between our Data Vault and our Data Marts as a decoupling layer. This layer is driven by information models managing our different truths as different business views. Since this layer provides a semantic abstraction, it is still appropriate even if there is no structure change from the Data Vault to the business view perspective. A Business Data Vault is still a complete Data Vault transformation from a business information model, albeit (usually) a fully virtual one! Interestingly, of all the perspectives of a Business View, the Data Vault perspective is usually the first and foremost to be implemented.
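As a rough illustration of the 'fully virtual' point (all names and structures below are hypothetical), a Business View perspective can be nothing more than a query that joins Raw Vault and Rule Vault objects on demand; in a database this would typically be a view, sketched here as a Python generator:

    # Minimal sketch: a fully virtual Business View perspective.
    # Nothing is materialized; the view is re-derived from Raw Vault and
    # Rule Vault objects on every call. All names are illustrative.
    hub_order = [
        {"hub_order_key": "H1", "order_bk": "ORD-1"},
        {"hub_order_key": "H2", "order_bk": "ORD-2"},
    ]
    sat_order_raw = [      # Raw Vault satellite (descriptive source data)
        {"hub_order_key": "H1", "customer": "ACME"},
        {"hub_order_key": "H2", "customer": "Initech"},
    ]
    sat_order_total = [    # Rule Vault satellite (business rule result)
        {"hub_order_key": "H1", "order_total": 150.0},
        {"hub_order_key": "H2", "order_total": 75.0},
    ]

    def business_view_orders():
        """Virtual Business View: join raw and rule data into one business-shaped row."""
        raw = {r["hub_order_key"]: r for r in sat_order_raw}
        rule = {r["hub_order_key"]: r for r in sat_order_total}
        for hub in hub_order:
            key = hub["hub_order_key"]
            yield {
                "order_id": hub["order_bk"],
                "customer": raw[key]["customer"],
                "order_total": rule[key]["order_total"],
            }

    for row in business_view_orders():
        print(row)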

Data (Definition) Architecture: Concerning Layers

When we look at the layers of the overall EDW Data Architecture we see a 3-layered approach: a Data Vault layer concerning storage and processing, a Business View layer concerning abstraction and consistency, and an Access layer (Data Mart/Staging Out) concerning access, aggregation and navigation.


Data Logistics Architecture: Tiering your Data

EDW Logistical Functions: Acquiring (staging), Storing (Data Vault) and Access and delivery of data.
Separate from the data layers we also have the data tiers. These are connected to the data logistics functions like acquiring, storing, propagating and data delivery. From the Data Architecture perspective we want to minimize logistics through maximizing virtualization/derivation. We can infer that at minimum we need 1 data tier (apart from e.g. a staging tier). In practice we can create many tiers in all the data layers. We can even have tiers that cross layers. Since most EDW architectures (inadvertently) combine/entwine these two architecture aspects we'll discuss the connection between tiers and layers in a future blog post.

Conclusion

Personally we don't make too much of the distinction between Raw and Rule Vault, except in the metadata. We do think a (formal) business view layer (preferably in Data Vault format) is an important, nay essential, part of any Data Vault Architecture.

©+Martijn Evers and +Ronald Damhof (w. thanks to +Tom Breur)


Friday, April 12, 2013

Seeing the last of the 'Last Seen Date': Tracking Deletes

Introduction

With Data Vault data models we sometimes add a metadata column to hubs/links called the 'Last Seen Date'. It represents the last time any source delivered said hub/link record to the Data Vault (again), and is only relevant when processing full loads. It needs to be checked on each load and, when doing full loads, will usually update (almost) all hub and link records (sometimes even in multiple passes). This is an expensive, tricky and performance-wise problematic action, because apart from this hub/link records should not be altered in any way. IMO, it's the last thing I'd like to see in a Data Vault (that's how you should remember its name ;).

Alternative

The best alternative is using a full-blown hub/link status satellite that tracks the status (inserted, deleted) for a hub/link (driving key) entry. The problem lies in tracking the 'hard' deletes from a source system, because the 'vanilla' DV with just basic full loads does not address the delete tracking issue.
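A minimal sketch of such a status satellite (names, keys and status codes are illustrative only, not a prescribed implementation):

    # Minimal sketch: an append-only hub status satellite instead of a Last Seen Date.
    # Each status change becomes a new row; the hub itself is never updated.
    from datetime import datetime, timezone

    sat_customer_status = []  # append-only status satellite

    def record_status(hub_key, status, record_source):
        """Append a status row ('I' = inserted, 'D' = deleted) for a hub key."""
        sat_customer_status.append({
            "hub_customer_key": hub_key,
            "status": status,
            "load_dts": datetime.now(timezone.utc),
            "record_source": record_source,
        })

    record_status("CUST-001", "I", "CRM")  # first seen in the source
    record_status("CUST-001", "D", "CRM")  # later detected as a hard delete

    for rec in sat_customer_status:
        print(rec)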

Handling Deletes

There are 5 levels of 'hard' delete tracking you can use on a Data Vault:

  1. None: We assume there are no 'hard'/physical deletes in a source system, only logical deletes, indicated by a status or end time attribute. Physical deletes, if done, are actually driven by archiving and do not indicate the records becoming invalid (false), but are just removed for e.g. space constraints. Archiving done for security/privacy/legal reasons might have an impact on the Data Vault, but it depends whether we will then register a delete in the DV (we might even have to, oh horror, remove data from our DV).
    If we do happen to stumble on source system deletes in this scenario we become non-auditable, and only a Last Seen Date might give us an indication that a delete actually occurred.
  2. Driving Key: In Links that are subject to non-key updates we can use the driving key to track dummy deletes. This happens (among others) when a 1:M relationship is encoded in a M:N link.
  3. Key Omission: We assume source systems do hard deletes and track them using full key compares between the DV and the source (see the sketch after this list). This is a basic precaution when doing basic full loads on a DV.
  4. (Key) Deletes: We actually get a list of source (record) deletes. This is of course ideal to track deletes, but don't forget to cater for key updates/mutations as well (where possible in the source).
  5. (Key) Mutations: A key mutation is logically just a combination of a key delete and a key insert. If we track all (key) mutations (where allowed/expected), like with a Change Data Capture driven solution, we can easily track key deletes generated by key mutations as well, which saves us the work of having to do a full (key) compare. Additionally, we can now also populate a key mutation driven SAME-AS Link.
    As an alternative we can track the key updates only when we look at the source surrogate key (stored in e.g. a key satellite), given that it is stable.
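For level 3, the key compare itself is straightforward; here is a minimal sketch (keys are hypothetical, and in practice this would usually be a set-based database operation rather than Python):

    # Minimal sketch of level 3 (key omission): detect hard deletes by comparing
    # the keys known to the Data Vault against the keys delivered in a full load.
    def detect_deleted_keys(dv_keys, full_load_keys):
        """Return business keys present in the DV but missing from the full load."""
        return set(dv_keys) - set(full_load_keys)

    dv_keys = {"CUST-001", "CUST-002", "CUST-003"}  # keys already in the hub
    full_load_keys = {"CUST-001", "CUST-003"}       # keys delivered by today's full load

    for bk in sorted(detect_deleted_keys(dv_keys, full_load_keys)):
        # here a status satellite row with status 'D' would be written
        print(f"hard delete detected for {bk}")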

Conclusion

The Last Seen Date is an expensive approximation of handling deletes. IMO the effort can be better invested in proper delete tracking. The Last Seen Date should be seen as a last-ditch measure for tracking deletes, not as basic metadata for a DV.

EDW: Building a House of Glass

Introduction

In software testing we have the boxing approaches to determine how to test our software applications: we test IT systems with a certain knowledge of the internal workings of said systems.
But also in data management and information system development (for example data warehousing) we develop/manage our data with a certain level of knowledge of the data in our (source) systems. Often the knowledge of these system internals can be used to evaluate our data (management) practices, data/information development (integration, business intelligence) and do some data quality testing on the side as well.

Data in a Box

How can we see our data (silos)?

  • As black boxes where data goes in, and (some) information/knowledge comes out through UI screens, documents, reports, lists and user (actions)?
  • As grey boxes, where you have a (reasonable?) amount of understanding of your (technical) data. You know (most of) the technical data structures, their descriptions and direct usage, but a lot of knowledge about how your data is treated by the system (detailed process rules and constraints) is either implicit knowledge or not understood at all.
  • As white/glass boxes, where data, processes and interaction are well documented (modeled!) and understood? You can interface and interact with the data in a safe and consistent way, understanding all processes and constraints that work on the data. You not only have accurate 'technical' (logical?) data modeling information including (complex) constraints, but also information on the business/semantic level (including business rules and processes), and you have a data model schema at the 'logical' level (especially important when the actual database schema has been 'denormalized' or otherwise mangled beyond recognition).

Knocking up Information Hiding

The principle of information hiding is a good practice in data modeling, but when looking at boxing we should understand the current situation with COTS information systems. Preferably we look at an information system through an information/data interface/layer representing a (logical) data model schema. Interestingly, this should be the main function of a database schema, even if in reality that is often not the case. Alas, a COTS system is usually a technical black box and is usually poorly abstracted by a functional data layer (i.e. a data layer that is a logical, accurate, coherent, consistent, constrained and complete representation of the data). So while information hiding is good practice, in current COTS systems there is usually no good (formal) abstraction data/information layer that allows us correct and consistent access to information systems. We are usually stuck with a "black hole" box, where data goes in but cannot get out easily, if at all (save through arbitrary 'extractors', UI, reports or lists).

Unboxing your applications' data



A lot of integration and information initiatives start with unboxing the murky data (models) from the source COTS information systems. When organizations purchase or design applications they usually see them as black boxes. As soon as data integration or BI initiatives are started they start delving into data models and databases trying to understand the base data. They usually work up to a grey box understanding of their systems using internal data models, because external/logical data model schemas, if available at all, are usually proprietary, incorrect, inconsistent or incomplete. From there they start developing their additional data processing (ETL, interfacing).
But most organizations who have done this still have serious issues understanding their data, because they usually miss things or don't have all the knowledge together and integrated. They are usually implicitly trying to white box their information systems, but have no formal way of doing so. Detailed information about these systems is usually scattered across process and database diagrams, application code, knowledge workers, data warehouses or data marts and IT personnel. Unboxing usually stops at some grey box level.

EDWs: Building a House of Glass


Most (enterprise) data warehouse initiatives (especially those talking about a 'single version of the truth') have an explicit or implied goal of constructing a house of glass, a glass box (globe) to see through all the data. They are actually trying to unbox their data to a white/glass box scenario. However, they still have the issue of unboxing their OLTP/source systems. A house of glass built on top of murky source system data is actually conjuring up an illusion of control and transparency founded on a marsh of misunderstood source systems. Instead of focusing on the house of truth they should be truly trying to unbox their source systems and focus on a house of (source system data) facts. A (logical) EDW housing both the (derived/implied) 'truths' as well as the source data facts is the only way to construct a true data house of glass. This way an EDW can help you not just with your BI initiatives, but also with unlocking and unboxing your data for other initiatives like Data Quality control, interfacing/integration and data migration.

Copyright Datamasters (Unseen) 2013