Friday, May 10, 2013

Data Vault Introductions

Introduction

No, this is not a post about introducing Data Vault, but a post referencing materials that introduce Data Vault. I've collected blog posts, articles and book references for you to browse to get to grips with Data Vault. If after all this info you still have questions left (or maybe even more than you started with), come back here and I'll see what I can do.

Blogs with Data Vault introductions and overviews

The following blogs have entries for introductions to Data Vault. +Hans Patrik Hultgren has written a blogpost here. Martin Ouellet did a good blogpost about Data vault on his blog+Daniel Linstedt has a good post on it here, and an additional one here. He has also a slideshare presentation here. Several blogs about Data vault from the people at lucrum can be found here. There are also some blog entries on AgileDSS describing basic Data Vault (modeling) and architecture.

Books on Data Vault

Supercharge your Data warehouse by +Daniel Linstedt,  Modeling the Agile Data Warehouse with Data Vault by +Hans Patrik Hultgren.

Introduction Papers for Data Vault.

The TDAN Articles are quite old but still readable. +Kent Graziano has a paperslideshare intro as well. +Hans Patrik Hultgren has a DV into paper as well.  Nippur has also an introduction presentation online here. My own SQLBits  presentation is also still online here! Wikipedia also has a Data Vault article+Roelant Vos has written an article about Data Vault. A presentation of his can be found here.

Courses introducing Data vault

See my Data Vault Agenda for more information. Except courses from the MATTER program most events and courses (also) target the novice audience.

Data Vault Comparisons

+Stefan Frost blogged on his toolbox blog about Data Vault vs. , +Kent Graziano did a comparison here . Hennie de Nooijer did a comparison in his blog here. Wherescape has also a comparison on data vault here (slidshare from Wherescape is here) The bibackend has a comparison here. Another comparison for Data Vault has been blogged about by +Roelant Vos  here. A dutch blog by worldofintelligence is here.

Thursday, May 2, 2013

Data Vault Model Examples

Introduction

There are several example models on Data Vault around the net. this post tries to list the ones publicly available as blog posts or as downloads.

Northwind

The SQL Server Northwind database has been used as an DV transformation example in the original TDAN Data Vault papers by Dan Linstedt. You can find the paper with Northwind here. Another blog where Northwind is used is the wherescape blog here. The pdf files of both the original and the Data Vault are also available. The Nothrwind database was an example database from Microsoft for it's SQL Server DBMS product. The original database is, like a lot of example databases from DBMS vendors, quite badly designed. For Data Vault this gives some nice issues to solve going to a Data Vault. The given example however is very basic. It ignores the use of reference tables and for examples sake solely creates a link for the orders table where a hub & link would be more apropriate. It als does not assign Data Vault sequence numbers.

Adventureworks

The basic adventureworks can be found on sourceforge. A repaired DDL version can be found on the blog of Hans Geurtsen. Others have used Adventureworks database with for example Quipu. See blogs here and here. Some info on the source database can be found here

NDC

On Sourceforge there is also a National Drug Code database example created by Dan Linstedt as well as a weblog data vault and a metrics vault.

TPC-H

I myself use the TPC-H model's as examples for Data vault transformations. The source model and information  can be found here. The different Data Vault model styles will be introduced in detail on my blog in future blog posts. (I still need to fix them up). They are also used in the MATTER program.

Other Examples

Other Examples of Data Vaults on the web are for example for mobile solutions. Other people blogged about DV models, like Stefan Frost. Simple examples used to explain Data Vault transformations can be found here and here.

Thursday, April 18, 2013

The Rule & the Raw & the Business (Data) Vault: Data Vault Architecture and Business Rules

Introduction

Every now and then the discussion on derived data, centralization of data logic and the definition of the (Raw) Data Vault pops up. Both +Daniel Linstedt's supercharge book, +Hans Patrik Hultgren's Agile Date warehousing book and +Ronald Damhof's EDW papers (see the papers section of his his linkedin profile) describe varieties of the concept of storing derived data into a Data Vault oriented EDW. They also blogged about this subject and the terminology regarding the Raw Vault. See Ronald's (and Dan's) and Hans's blog post. In this blog post I'll discuss these views and my (and +Ronald Damhof's) take on this architecture pattern.

Centralized Logic

Often it makes sense to centralize not only the storage of data, but also some (all?) of the additional logic required to process the data before it goes out to the users/data marts/bi systems. This logic is often classified as 'Business Rules', 'Calculations' or 'Data Quality Rules'.

Business Rules?

From an information system perspective a formalized business rule is a constraint expressed on data. This means that also keys and domain definitions are business rules.
We personally prefer to talk about 2 different kinds of business rules: structural and non structural.

Structural Business Rules

Structural business rules are rules that structure elementary information. They contain the basic building blocks for creating data models. 
  1. Elementary structure rules like domain, key, dependency, foreign key/subset constraint and basic entity (relation/fact type).
  2. Model derivation/transformation rules to create new semantical/logical equivalent (elementary) model structures.

Non Structural Business Rules

Non structural business rules define calculations/derivations, constraints & Data Quality rules and structural repair rules. Of those the structural repair rules are the most interesting. These are rules that change/extend date models (create new structural business rules!) based on derived data/calculations. So we have.
  1. Derived data/calculations
  2. Constraints/Data Quality rules
  3. Structural repair rules

Derived Data and Business rules

We see not a big difference in handling non-structural business rule data and raw (source system) data. A business rule is just a small (source) data model that happens to derive its data from the Data Vault and not externally. It is (normally) loaded in the same manner and into the same kind of objects as with a raw Data Vault. This means from a metadata perspective we can add traceability information, and we have better control over the rule/data/metadata that that of externally sourced information systems, but that's it. Dan's "Business Data Vault" suggests the business is in the lead on that data (and business rule definitions). While this is often the case, business rules can also have a more technical origin like performance (aggregation satellites), so this is a bit suggestive. We prefer the term (Business) Rule Vault instead to emphasize the importance of the rule processing over the exact business definition, specification and naming.

The (Business) Rule Vault

For us all DV objects that are only/mainly driven by business rules belong to the Rule Vault. Note that since Rule Vault objects can(and often do) directly relate/depend on entities in the Raw Vault you should not see them as 2 completely different and independent (storage) tiers, they are in the same tier, just 2 different logical (sub)layers within the Data Vault. This separation is just a soft one. We can load the same DV entity from a rule or raw source, either in consecutive order or sometimes even at the same time. Our metadata will keep the records auditable according to the source, but assigning DV entities exclusively to the Rule or Raw Vault is not always possible or sensible, esp not over time.

The Raw (Data) Vault

The classical source driven but business oriented (Raw)Data Vault, the 'Raw Vault' for short, is the classical Data Vault. It is organized around integration through 'business keys', but also directly related/auditable through to the source (data). We'll forego the discussion here on (Single) Source Data Vaults driven by the primary/surrogate key and debunk Source Data Vaults, also called Stage Data Vaults, in another post.

The (Complete) Data Vault: Rule Vault + Raw Vault:

So for us the (Complete) 'Data Vault' combines the Business Rule (Data) Vault, the "Rule Vault' for short, and the 'Raw' Data Vault, or Raw Vault.

Business (Data) Vault and Business View Perspectives

But even with the Rule Vault we still need to connect our business rule results to the business. For this we use the "Business View" concept of +Ronald Damhof It brings together Rule and Raw Vault objects to present a consistent business view on all the data (a 'truth' in classical EDW thinking). Business Views are (conceptual) (business) data models bereft on detailed structural, temporal and other transformations/derivations. These business views can be based on available enterprise, industry or source system information models, greatly reducing overall data and business rule complexity.

Perspectives

These Business Views can be implemented through several perspectives in different structures (Dimensional, Normalized and Data Vault!) and with different temporal aspects (now, point-in time, period). We define a Business Data Vault as a Business View perspective in Data Vault format. These Business Views sit between our Data Vault and our Data Marts as a decoupling layer. This layer is driven by information models managing our different truths as different business views. Since this layer provides a semantic abstraction it is still appropriate even if there is no structure change from the Data Vault to the business view perspective. A Business Data Vault is still a complete Data Vault transformation from a business information  model, albeit (usually) a fully virtual one!). Interestingly, of all the perspectives of a Business View, the Data Vault perspective is usually the first and foremost that is implemented.

Data (Definition) Architecture: Concerning Layers

When we look at the layers of the overall EDW Data Architecture we see a 3 layered approach, a Data Vault layer concerning storage and processing, a Business View layers concerning abstraction and consistency and an Access layer (Data Mart/Staging Out) concerning access, aggregation and navigation.


Data Logistics Architecture: Tiering your Data

EDW Logistical Functions: Acquiring(staging), Storing(Data Vault) and Access and delivery of data.
Separate from the data layers we also have the data tiers. These are connected to the data logistics functions like acquiring, storing, propagating and data delivery. From the Data Architecture perspective we want to minimize logistics through maximizing virtualization/derivation. We can infer that at minimum we need 1 data tier (apart from e.g. a staging tier). In practice we can create many tiers in all the data layers. We can even have tiers that cross layers. Since most EDW architectures (inadvertently) combine/entwine these two architecture aspects we'll discuss the connection between tiers and layers in a future blog post.

Conclusion

Personally we don't make too much of the distinction between Raw and Rule Vault, except in the metadata. We do think a (formal) business view layer (preferably in Data Vault format) is an important, nay essential, part of any Data Vault Architecture. (c)

©+Martijn Evers and +Ronald Damhof (w. thanks to +Tom Breur)


Friday, April 12, 2013

Seeing the last of the 'Last Seen Date': Tracking Deletes

Introduction

With Data Vault data models we sometimes add a metadata column to hubs/links called the 'Last Seen Date'. It represents the last time any source delivered said hub/link record to the Data Vault (again), and is only relevant on processing full loads. It needs to be checked on each load and usually when doing the full loads, will update (almost) all hub and link records (sometimes even has multiple passes as well). An expensive, tricky and performance wise problematic action because for the rest hub/link records should not be altered in any way. IMO, it's the last thing I'd like to see in a Data Vault (That's how you should remember its name ;).

Alternative

The best alternative is using a full blown hub/link status satellite that tracks the status: inserted, deleted for a hub/link (driving key) entry. The problem lies in tracking the 'hard' deletes from a source system information because the 'vanilla' DV with just basic full loads does not address the deletes tracking issues.

Handling Deletes

There are 5 levels of 'hard' delete tracking you can use on a Data Vault

  1. None: We assume there are no 'hard'/physical deletes in a source system, only logical deletes, indicated by a status or end time attribute. Physical deletes, if done, are actually driven by archiving and do not indicate the records becoming invalid (false), but just be removed for e.g. space constraints. Archiving done for security/privacy/legal reasons might have impact on the Data Vault, but it depends if we will then register a delete in the DV (We might even have to, oh horror, remove data from our DV)
    If we do happen to stumble on source system deletes in this scenario we become non auditable, and only a Last Seen Date might give us an indication that a delete actually occurred.
  2. Driving Key: In Links that are subject to Non Key updates we can use the driving key to track dummy deletes. This happens (among others) when when a 1:M relationship is encoded in a M:N link
  3. Key Omission: If we assume source systems do deletes and we track them using full key compares between DV and source. This is a basic precaution when doing basic full loads on a DV.
  4. (Key) Deletes: We actually get a list of source (record) deletes. This is of course ideal to track deletes, but don't forget to cater for key updates/mutations as well (where possible in the source).
  5. (Key) Mutations: A Key mutation is logically just a combination of a key delete and a key insert. If we track all (key) mutations (where allowed/expected), like with a Change Data Capture driven solution we can easily track key deletes generated by key mutations as well, which saves us the work of having to do a full (key) compare. Additionally, we can now also populate a key mutation driven SAME-AS Link.
    As an alternative we can track the key updates only when we look at the source surrogate key (stored in e.g. a key satellite), given that it is stable.

Conclusion

The Last Seen Date is a expensive approximation of handling deletes. IMO the efforts can be better invested in better delete tracking. The Last Seen Date should be seen as a last ditch measure tracking deletes, not a basic metadata for a DV.

EDW: Building a House of Glass

Introduction

In software testing we have the boxing approaches to determine how to test our software applications. We test it systems with a certain knowledge of the internal workings of said systems.
But also in data management and information system development  (for example data warehousing) we develop/manage our data with a certain level of knowledge of the data in our (source) systems. Often the knowledge of these system internals can be used to evaluate our data (management) practices, data/information development (integration, business intelligence) and do some data quality testing on the side as well.

Data in a Box

How can we see our data (silo's)?

  • As black boxes where data goes in, and (some) information/knowledge comes out through UI screens, documents, reports, lists and user (actions)?
  • As grey boxes, where you have a (reasonable?) amount of understanding of your (technical) data. You know (most of) the technical data structures, their descriptions and direct usage but a lot of knowledge about how your data is treated by the system (detailed process rules and constraints) is either implicit knowledge or not understood at all.
  • As white/glass boxes, where data and processes and interaction are well documented (modeled!) and understood? You can interface, interact with the data in a safe and consistent way understanding all processes and constraints that work on the data. You not only have accurate 'technical' (logical?) data modeling information including (complex) constraints, but also on the business/semantic level (including business rules and processes) and  have a data model schema at the 'logical' level (esp. important when the actual database schema has been 'denormalized' or otherwise mangled beyond recognition).

Knocking up Information Hiding

The principle of information hiding is a good practice in data modeling, but when looking at boxing we should understand the current situation with COTS information systems. Preferably we look at an information system through an information/Data interface/layer representing a (logical) data model schema. Interestingly this should be the main function of a database schema even if in reality it is often not the case. Alas, a COTS is usually technical black box system and is usually poorly abstracted with a functional data layer (ie a data layer that is a logical, accurate, coherent, consistent, constrained and complete representation of the data). So while information hiding is good practice, in current COTS systems there is usually no good (formal) abstraction data/information layer that allows us correct and consistent access to information systems. We are usually stuck with a "black hole" box (where data goes in, but cannot get out easily if at all (save through arbitrary 'extractors', UI, reports or lists).

Unboxing your applications data



A lot of integration and information initiatives start with unboxing the murky data (models) from the source COTS information systems. When organizations purchase or design applications usually they see them as black boxes. As soon as data integration or BI initiatives are started they start delving into data models and data bases trying to understand the base data. They usually work up to a grey box understanding of their systems using internal data models because external/logical data model schema's, if available at all, are usually proprietary,  incorrect, inconsistent or incomplete. From there they start developing their additional data processing (ETL, interfacing).
But most organizations who have done this still have serious issues understanding their data because they usually miss things or don't have all the knowledge together and integrated. They are usually implicitly trying to white box their information systems, but have no formal way of doing so. Detailed information about these systems is usually scattered in process and database diagrams, application code, knowledge workers, data warehouses or data marts and IT personnel. Unboxing usually stops at some grey box level.

EDW's: Building a House of Glass


Most (enterprise) data warehouse initiates (esp those talking about a 'single version of the truth') have an explicit or implied goal of constructing a house of glass, a glass box (globe) to see through all the data. They are actually trying to unbox their data to a white/glass box scenario  However, they still have the issue of unboxing their OLTP/source systems. The house of glass built on top of murky source system data is actually conjuring up an illusions of control and transparency founded on a marsh of misunderstanding source systems. Instead of focusing on the house of truth they should be focusing on truly trying to unbox their source systems and focus on a house of (source system data) facts. A (logical) EDW housing both the (derived/implied) 'truth's' as well as the source data facts is the only way to construct a true data house of glass. this way an EDW can help you not just with your BI initiatives, but also with unlocking and unboxing your data for other initiatives like Data Quality control, interfacing/integration and data migration.

Copyright Datamasters (Unseen) 2013

Saturday, March 30, 2013

Blog statistics milestone!:

Miliarium Aureum

This picture is of the golden milestone (Miliarium Aureum) of the via appia at the heart of the roman empire. At the hight of the empire this milestone was considered at the center of the whole empire. It was constructed 20BC by Augustus. All important milestones in the empire gave the distance relative to this stone.

Milestone

At the end of march my blog reached both 15.00 visits mainly in the last six months and is now over 3.000 visits a month. A nice first milestone for which I want to thank all my readers. I still have a lot of post coming up so I hope to see everyone back soon.

Wednesday, March 27, 2013

Data Vault: On The Nature of Hubs

Introduction

There are always a lot of discussions about the nature of hubs. There have been several linkedin discussions and blog-post who address the issue. Since these discussions are usually crossing all kind of levels of conceptualization I'd like to clarify on which levels these discussions take place. In this short post I'd like to shortcut these different levels of discussions around the nature of hubs.

Levels of Abstraction

I'll discern 4 levels of abstraction:

  1. The logical level, on which hubs are just (business) keys
  2. On the conceptual level in which hubs are identified by concepts and their (stated) identifiers
  3. On the meta identifier level where we design and construct and scope identifiers and their supporting (identification) processes needed to identify concepts at the conceptual level.
  4. On the ontologic level where we identify abstract concepts and ignore actual identification design.

1. Hubs as (key) transformations

Basically a HUB is an independent (logical) key (a key with no part of it dependent on another key). The only exception is when there are 2 keys and one is the surrogate key of the other, then the surrogate key does not become a hub (but optionally a keysat), but this is a minor issue since we can choose to ignore (sourced) surrogate keys altogether. When there is ONLY a surrogate key, we have an interesting issue, since a key on it's own can never be a surrogate key (because it's only a surrogate for another key), even if that was the intention. We might say it's a 'technical' key that will source a Hub when no other candidate key is found. The sources for Hubs/keys for a Data Vault(=Raw Data Vault+Rule Data Vault) are all relevant source system data models and all business data models. If there are several situations that try to model the same hub with different keys, you basically model all of the distinct keys. Some optimization/consolidation is possible when having multiple keys for the same concept but these decisions should be delegated to the correct modeling of the business information model using specialization/generalization. This idea relegates a lot of design and definition of (central) hubs to (master) data management and conceptual data model design. Since we distinguish between concepts and keys, a concept identified with a dependent key is a link by definition, but still an integration point. From a conceptual point a link can be seen as a hub as well.

2. Hubs as conceptual entities

Most people will equal hubs with conceptual entities like customer or product. The assumption is that an important master hub usually houses an important identifier like tax id, Social Security Number or product code. The discussion on which identifiers to use (or ignore) as 'master' hub  is however not a Data Vault discussion, but a business information model discussion. Here we try to find business identifiers with the correct scope and meaning. In the Data Vault we just implement (one or more) of the available model identifiers as the master key in our hub. Again, if we have several identifiers for (entity sub-types of ) one concept we can opt to use key satellites to model this in the Data Vault.

3. Hubs as Identification schemes

A lot of practitioners try to fix business key issues in the Data Vault, with the goal to create/construct or identify a hub that will house the master list of a certain entity. They are often  enticed to try to construct their own identification or consolidation scheme. Again, this is not the task of a Data Vault but a task of (master) data management. Approved matching and fixing rules can still be applied to the (Business) Rule Data Vault. These kinds of actions are usually a result of failing to find/implement a single good identifier for a conceptual entity, which in turn might lead to multiple entities encoding the same concept.

4. Hubs as (abstract) concepts

Most people trying to create the ultimate hub will end up creating ontological supertypes like 'all people' or 'all organizations'. But since there are not identification scheme's for all the 'people' they either have to invent their own (very hard) or accept that data quality will be low (duplicates abound). Again, this is not the task of a Data Vault to design these kinds of hubs (although it is natural to ask in the context of a Data Vault), but just to create or facilitate them when they have been correctly defined elsewhere. It is usually only something to define in an ontology or information model (as generalization), and usually business have no reason to sponsor these kind of endeavors in an information model when they are only interested in their own customers or vendors. So while you can define a conceptual/ontological supertype 'person', a concrete person's Hub is usually not very sensible (A derived supertype can always be constructed, of course).

Conclusion

From a formal perspective, in a Data Vault we are only interested in representing keys in an efficient and usable manner using hubs and optionally keysats (and even keylinks). Other discussions on the nature of hubs are important, but not the privilege of the Data Vault, but the providence on business information/data modeling, generalization and specialization and conceptual ontologies. The reason we see them crop up so often is because most organizations don't engage in serious business information/data modeling, which means the Data Vault/EDW designers/developers have to face a task they should ideally lay elsewhere. It is the lack of data management that makes us discuss these kinds of concepts instead of relaying them to the business (data model). Educating business and Data Vault practitioners on (conceptual/logical) data modeling is the only way to make sure these issues are tackled at the right level instead of (incorrectly) claiming that Data Vault can solve these kinds of issues while it is only an implementation pattern for a given (modeled) solution.