Wednesday, March 27, 2013

Data Vault: On The Nature of Hubs

Introduction

Discussions about the nature of hubs come up constantly; several LinkedIn threads and blog posts address the issue. Because these discussions tend to mix several levels of conceptualization, I'd like to clarify at which level each of them takes place. In this short post I untangle these different levels of discussion around the nature of hubs.

Levels of Abstraction

I distinguish four levels of abstraction:

  1. The logical level, where hubs are just (business) keys.
  2. The conceptual level, where hubs are identified by concepts and their (stated) identifiers.
  3. The meta-identifier level, where we design, construct and scope identifiers and the supporting (identification) processes needed to identify concepts at the conceptual level.
  4. The ontological level, where we identify abstract concepts and ignore actual identification design.

1. Hubs as (key) transformations

Basically, a hub is an independent (logical) key: a key of which no part depends on another key. The only exception is when there are two keys and one is the surrogate key of the other; in that case the surrogate key does not become a hub (though it can optionally become a keysat). This is a minor issue, since we can choose to ignore (sourced) surrogate keys altogether. When there is ONLY a surrogate key, we have an interesting issue: a key on its own can never be a surrogate key, because it would have to be a surrogate for another key, even if that was the intention. We might call it a 'technical' key that will source a hub when no other candidate key is found.

The sources for hubs/keys in a Data Vault (= Raw Data Vault + Rule Data Vault) are all relevant source system data models and all business data models. If several sources model the same hub with different keys, you basically model all of the distinct keys. Some optimization/consolidation is possible when there are multiple keys for the same concept, but these decisions should be delegated to the correct modeling of the business information model using specialization/generalization. This relegates much of the design and definition of (central) hubs to (master) data management and conceptual data model design. Since we distinguish between concepts and keys, a concept identified by a dependent key is a link by definition, but it is still an integration point. From a conceptual point of view, a link can therefore be seen as a hub as well.
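The key rules in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Key` record, its `depends_on` and `surrogate_for` fields, and the example key names are hypothetical and not part of any Data Vault standard or tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Key:
    """Hypothetical description of one logical key found in a source model."""
    name: str
    columns: Tuple[str, ...]
    # Names of other keys that parts of this key depend on (assumed metadata).
    depends_on: Tuple[str, ...] = ()
    # Name of the key this one is a surrogate for, if any (assumed metadata).
    surrogate_for: Optional[str] = None


def classify(key: Key) -> str:
    """Map a logical key to the Data Vault construct it would source."""
    if key.surrogate_for is not None:
        # A surrogate of another key does not become a hub of its own.
        return "keysat"
    if key.depends_on:
        # A key with a dependent part identifies a link, not a hub.
        return "link"
    # Independent key; this includes a lone 'technical' key.
    return "hub"


# Hypothetical example keys.
customer_number = Key("customer_number", ("customer_number",))
customer_id = Key("customer_id", ("customer_id",), surrogate_for="customer_number")
order_line = Key("order_line", ("order_number", "line_number"), depends_on=("order_number",))
technical_id = Key("technical_id", ("id",))

for key in (customer_number, customer_id, order_line, technical_id):
    print(key.name, "->", classify(key))
```

Returning "keysat" for a surrogate key is just one choice here; as noted, sourced surrogate keys can also be ignored altogether.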

2. Hubs as conceptual entities

Most people equate hubs with conceptual entities like customer or product. The assumption is that an important master hub usually houses an important identifier such as a tax ID, Social Security number or product code. The discussion on which identifiers to use (or ignore) for a 'master' hub is, however, not a Data Vault discussion but a business information model discussion. There we try to find business identifiers with the correct scope and meaning. In the Data Vault we just implement one (or more) of the available model identifiers as the master key in our hub. Again, if we have several identifiers for (entity sub-types of) one concept, we can opt to use key satellites to model this in the Data Vault.
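As a rough sketch of how a key satellite might hang alternative identifiers off one master key, consider the Python fragment below. It reflects one possible reading of a key satellite (a satellite on the hub that records other identifiers for the same hub key); the customer numbers, identifier types and the `resolve` helper are invented for illustration.

```python
from typing import Dict, Optional, Tuple

# Hub: the set of master keys (here a hypothetical customer number).
customer_hub = {"C001", "C002"}

# Key satellite: (identifier type, identifier value) -> master key in the hub.
# All values below are made-up sample data.
customer_keysat: Dict[Tuple[str, str], str] = {
    ("tax_id", "TX-778"): "C001",
    ("legacy_crm_id", "88231"): "C001",
    ("tax_id", "TX-912"): "C002",
}


def resolve(id_type: str, value: str) -> Optional[str]:
    """Return the master hub key for an alternative identifier, if recorded."""
    master_key = customer_keysat.get((id_type, value))
    # Only return keys that actually exist in the hub.
    return master_key if master_key in customer_hub else None


print(resolve("tax_id", "TX-778"))
print(resolve("legacy_crm_id", "88231"))
print(resolve("tax_id", "unknown"))
```

Which identifier serves as the master key is decided in the business information model; the sketch only records that decision.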

3. Hubs as identification schemes

Many practitioners try to fix business key issues in the Data Vault, with the goal of creating, constructing or identifying a hub that will house the master list of a certain entity. They are often tempted to construct their own identification or consolidation scheme. Again, this is not the task of a Data Vault but of (master) data management. Approved matching and fixing rules can still be applied in the (Business) Rule Data Vault. These kinds of actions usually result from failing to find or implement a single good identifier for a conceptual entity, which in turn might lead to multiple entities encoding the same concept.

4. Hubs as (abstract) concepts

Most people trying to create the ultimate hub will end up creating ontological supertypes like 'all people' or 'all organizations'. But since there are no identification schemes for all 'people', they either have to invent their own (very hard) or accept that data quality will be low (duplicates abound). Again, it is not the task of a Data Vault to design these kinds of hubs (although it is natural to ask for them in the context of a Data Vault), but just to create or facilitate them when they have been correctly defined elsewhere. Such a supertype is usually only something to define in an ontology or information model (as a generalization), and businesses usually have no reason to sponsor these kinds of endeavors in an information model when they are only interested in their own customers or vendors. So while you can define a conceptual/ontological supertype 'person', a concrete person hub is usually not very sensible (a derived supertype can always be constructed, of course).
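To make the remark about derived supertypes concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each hub is available as a set of keys; the hub names, key values and the `derive_supertype` function are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def derive_supertype(hubs: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Union the keys of several hubs into (hub name, key) pairs.

    No matching is attempted across hubs: without a shared identification
    scheme there is nothing to match on.
    """
    return sorted((hub_name, key) for hub_name, keys in hubs.items() for key in keys)


# Made-up sample hubs.
hubs = {
    "customer": {"C001", "C002"},
    "vendor": {"V010"},
}
person_like = derive_supertype(hubs)
print(person_like)
```

If customer C001 and vendor V010 happen to be the same real-world party, the derived supertype still holds two rows for it, which is exactly the duplicate problem a concrete 'person' hub would have to solve.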

Conclusion

From a formal perspective, in a Data Vault we are only interested in representing keys in an efficient and usable manner using hubs and, optionally, keysats (and even keylinks). Other discussions on the nature of hubs are important, but they are not the privilege of the Data Vault; they are the province of business information/data modeling, generalization and specialization, and conceptual ontologies. The reason we see them crop up so often is that most organizations don't engage in serious business information/data modeling, which means Data Vault/EDW designers and developers face a task that should ideally lie elsewhere. It is the lack of data management that makes us discuss these kinds of concepts instead of relegating them to the business (data model). Educating business and Data Vault practitioners in (conceptual/logical) data modeling is the only way to make sure these issues are tackled at the right level, instead of (incorrectly) claiming that Data Vault can solve them while it is only an implementation pattern for a given (modeled) solution.