Saturday, March 30, 2013

Blog statistics milestone!

Miliarium Aureum

This picture shows the golden milestone (Miliarium Aureum) of the Via Appia at the heart of the Roman Empire. At the height of the empire this milestone was considered the center of the whole empire. It was erected in 20 BC by Augustus. All important milestones in the empire gave their distance relative to this stone.


At the end of March my blog reached 15.000 visits in total (most of them in the last six months) and is now over 3.000 visits a month. A nice first milestone, for which I want to thank all my readers. I still have a lot of posts coming up, so I hope to see everyone back soon.

Wednesday, March 27, 2013

Data Vault: On The Nature of Hubs


There are always a lot of discussions about the nature of hubs. There have been several LinkedIn discussions and blog posts that address the issue. Since these discussions usually cross all kinds of levels of conceptualization, I'd like to clarify on which levels they take place. In this short post I'd like to disentangle these different levels of discussion around the nature of hubs.

Levels of Abstraction

I'll discern 4 levels of abstraction:

  1. The logical level, on which hubs are just (business) keys
  2. The conceptual level, on which hubs are identified by concepts and their (stated) identifiers
  3. The meta identifier level, on which we design, construct and scope identifiers and their supporting (identification) processes needed to identify concepts at the conceptual level
  4. The ontological level, on which we identify abstract concepts and ignore actual identification design

1. Hubs as (key) transformations

Basically, a hub is an independent (logical) key: a key with no part of it dependent on another key. The only exception is when there are two keys and one is the surrogate key of the other; the surrogate key then does not become a hub (but optionally a keysat). This is a minor issue, since we can choose to ignore (sourced) surrogate keys altogether. When there is ONLY a surrogate key we have an interesting situation, since a key on its own can never be a surrogate key (it is only a surrogate for another key), even if that was the intention. We might say it is a 'technical' key that will source a hub when no other candidate key is found.

The sources for hubs/keys in a Data Vault (= Raw Data Vault + Rule Data Vault) are all relevant source system data models and all business data models. If several sources try to model the same hub with different keys, you basically model all of the distinct keys. Some optimization/consolidation is possible when there are multiple keys for the same concept, but those decisions should be delegated to the correct modeling of the business information model using specialization/generalization. This relegates a lot of the design and definition of (central) hubs to (master) data management and conceptual data model design. Since we distinguish between concepts and keys, a concept identified by a dependent key is a link by definition, but still an integration point. From a conceptual point of view, a link can be seen as a hub as well.
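As a minimal sketch of this idea (an in-memory stand-in for a hub table; the table layout and column names are illustrative, not a prescribed implementation), loading a hub boils down to recording each distinct business key exactly once, no matter how many source rows carry it:

```python
from datetime import datetime, timezone

def load_hub(hub: dict, source_rows: list, key_column: str, record_source: str) -> dict:
    """Insert each distinct business key into the hub exactly once.

    `hub` maps a business key to its load metadata; a key already present
    is skipped, which is what makes the hub an insert-only list of
    independent keys rather than a copy of the source table.
    """
    for row in source_rows:
        business_key = row[key_column]
        if business_key not in hub:
            hub[business_key] = {
                "load_dts": datetime.now(timezone.utc),
                "record_source": record_source,
            }
    return hub

customer_hub = {}
rows = [{"customer_nr": "C001"}, {"customer_nr": "C002"}, {"customer_nr": "C001"}]
load_hub(customer_hub, rows, "customer_nr", "CRM")
print(sorted(customer_hub))  # ['C001', 'C002']
```

The duplicate source row for C001 lands in the hub only once; everything else about the row would live in satellites, not in the hub.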

2. Hubs as conceptual entities

Most people will equate hubs with conceptual entities like customer or product. The assumption is that an important master hub usually houses an important identifier like a tax id, Social Security Number or product code. The discussion on which identifiers to use (or ignore) as the 'master' hub is, however, not a Data Vault discussion but a business information model discussion. There we try to find business identifiers with the correct scope and meaning. In the Data Vault we just implement one (or more) of the available model identifiers as the master key in our hub. Again, if we have several identifiers for (entity sub-types of) one concept, we can opt to use key satellites to model this in the Data Vault.
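A hypothetical sketch of that last option (all names and the structure are illustrative assumptions, not standard Data Vault artifacts): the hub carries the chosen master key, while a key satellite records the alternate identifiers for the same concept:

```python
# Hub keyed on the chosen master identifier (here: an internal product code).
product_hub = {"P-1001"}

# Key satellite: alternate identifiers per hub key, e.g. for entity sub-types
# of the same concept that each carry their own identification scheme.
product_keysat = {
    "P-1001": [
        {"id_type": "EAN", "id_value": "8712345678906"},
        {"id_type": "SUPPLIER_CODE", "id_value": "SUP-42"},
    ],
}

def alternate_ids(hub_key: str) -> list:
    """Return all alternate identifier values recorded for a hub key."""
    return [entry["id_value"] for entry in product_keysat.get(hub_key, [])]

print(alternate_ids("P-1001"))  # ['8712345678906', 'SUP-42']
```

The point is that the hub stays a single list of master keys; the competing identifiers hang off it instead of forcing a second hub for the same concept.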

3. Hubs as Identification schemes

A lot of practitioners try to fix business key issues in the Data Vault, with the goal of creating, constructing or identifying a hub that will house the master list of a certain entity. They are often enticed to construct their own identification or consolidation scheme. Again, this is not the task of a Data Vault but a task of (master) data management. Approved matching and fixing rules can still be applied in the (Business) Rule Data Vault. These kinds of actions are usually the result of failing to find/implement a single good identifier for a conceptual entity, which in turn might lead to multiple entities encoding the same concept.

4. Hubs as (abstract) concepts

Most people trying to create the ultimate hub will end up creating ontological supertypes like 'all people' or 'all organizations'. But since there are no identification schemes for all 'people', they either have to invent their own (very hard) or accept that data quality will be low (duplicates abound). Again, it is not the task of a Data Vault to design these kinds of hubs (although it is natural to ask about them in the context of a Data Vault), but just to create or facilitate them once they have been correctly defined elsewhere. This is usually something to define in an ontology or information model (as a generalization), and usually businesses have no reason to sponsor these kinds of endeavors in an information model when they are only interested in their own customers or vendors. So while you can define a conceptual/ontological supertype 'person', a concrete person hub is usually not very sensible (a derived supertype can always be constructed, of course).


From a formal perspective, in a Data Vault we are only interested in representing keys in an efficient and usable manner, using hubs and optionally keysats (and even keylinks). The other discussions on the nature of hubs are important, but they are not the privilege of the Data Vault; they are the province of business information/data modeling, generalization and specialization, and conceptual ontologies. The reason we see them crop up so often is that most organizations don't engage in serious business information/data modeling, which means the Data Vault/EDW designers and developers have to face a task that should ideally lie elsewhere. It is this lack of data management that makes us discuss these kinds of concepts instead of relaying them to the business (data model). Educating business and Data Vault practitioners on (conceptual/logical) data modeling is the only way to make sure these issues are tackled at the right level, instead of (incorrectly) claiming that Data Vault can solve them while it is only an implementation pattern for a given (modeled) solution.

Wednesday, March 20, 2013

SAP PowerDesigner: The beginning of the end? PRICE UPDATE


In this post +Mirko Pawlak discusses a price hike of around 80% for Sybase products like PowerDesigner (an ER modeling tool). Since PowerDesigner is already one of the most expensive (if not THE most expensive) ER modeling tools, this means PowerDesigner licensing costs will go through the roof.

ER tools and standards

ER modeling tools and standards have always been an issue. Interoperability and standardization have never been very strong here, leading to expensive (but sometimes interesting) closed ecosystems and tooling applications. There are different notations and approaches, like Barker and Information Engineering; some tools support several notations while others stick to one. ER tools have extended the notation to different aspects of information/data modeling and databases so they could claim full support of the development cycle, even if this is not always what Chen had in mind when he developed ER modeling/diagramming. Transporting metadata from one tool to another has always been a costly endeavor, especially for anything beyond the basic metadata of entities, attributes and keys. The only vendor seriously addressing this aspect is Meta Integration, which shows how small and exclusive this market really is.

PowerDesigner vs the rest

PowerDesigner has always been one of my favorite tools, not because it was always the best, or because it did not have serious (sometimes even almost fatal) flaws, but because it is quite extensible, so you can fix a lot of issues around diagramming and design, model generation and transformation. Another good tool is Embarcadero's ER/Studio, which lacks the integrated extensibility but is a solid tool with a reasonable automation interface. CA ERwin is one of the most active tool vendors out there when it comes to integrating and partnering, for example with a Data Vault ER modeling notation and integration with BIReady. Whether this means ERwin is going to become customizable to the extent that PowerDesigner is now, I don't know.

PowerDesigner also has Enterprise Architecture, UML, XML, ETL, BPM and other notations under its roof, but product integration with 3rd party tools (except database platforms) is not that extensive. However, integration into the SAP technology stack is going to be addressed in a short time frame, allowing Business Objects, SAP HANA and other tools to integrate with PowerDesigner.

Buying into PowerDesigner

Like Mirko mentioned, if you'd like to buy (more) PowerDesigner licenses you might do it now (before the end of March). If not, plan B might just need a good dusting off. As with Business Objects, SAP customers are currently faced with a new (expensive) tool from outside the SAP realm with which they have no experience. The current (non-SAP) customer base, however, needs to face the fact that their tool will get the most attention when integrating into the SAP ecosystem. I don't mind if this means that the foundation of PowerDesigner becomes better, but I fear a lot of new developments will be specific extensions for product integration, which might give rise to more issues instead of fewer.

The future of PowerDesigner, ER modeling and database tools

The market for data and database diagramming is constantly changing. I see a commoditization of database development diagramming tools like Oracle SQL Developer or Microsoft SQL Server Data Tools. UML diagramming is also becoming a commodity. This, alas, does not imply that good data modeling is becoming a basic skill, since I see toolsets, vendors and notations ruling the data modeling space instead of good data modeling standards and foundations like the Relational Model and Fact Oriented Modeling (NIAM/FCO-IM). I also see more and more semantic tools and technologies, but their connection to the database realm is opportunistic at best. In this light the expensive CASE tools and ER tools are not winning any war. But it does mean that one of the few tools out there flexible enough to accommodate all kinds of diagramming is becoming more of a niche tool in the SAP ecosystem.


As shown at the toolpark website, PD prices will go up indeed. As discussed here by +Mirko Pawlak, it is just to match the prices of the competition (although licensing schemes and functionality are difficult to compare). For example, the old price of the basic Data Architect standalone seat (the cheapest license, btw) goes from 2.440 Euro netto to 4.600 Euro netto.

It is clear to me that market forces here are driving prices up instead of down, which means I have to conclude that there is no real competitive market, as any specialist who has done a migration from one ER tool to another can attest.

Friday, March 15, 2013

Data Vault Agenda 2013


I've decided to try to record all events where Data Vault will be a subject: a list of presentations, courses, events and seminars. It will probably be updated several times throughout the year. Please feel free to add suggestions or clarifications for this list in the comments section. I've highlighted items I'm either teaching, attending, organizing or (co-)hosting.

  • 25 - 26th Data Vault Modeling & Certification - Netherlands: genesee academy
  • 27 - 28th + 6 March MATTER DV In the Trenches: Data (Vault) Modeling I, Netherlands: BI podium website ALREADY FULL!!
  • 11 - 12th Data Modeling For Data Warehousing, Nieuwegein, Netherlands: DIKW Academy
  • 12 - 13th Data Vault Modeling & Certification - Stockholm: genesee academy
  • 18 - 19th Data Vault Modeling Seminar - Lausanne, Switzerland: genesee academy
  • 21 - 22nd Data Vault Modeling Seminar - Lausanne, Switzerland: genesee academy
  • Data Vault 2.0 Boot Camp & Certification, Saint Albans, Vermont, US:
  • 3rd April, Crash Course Data Vault, Netherlands: Centennium
  • 9 - 11th MATTER DV In the Trenches: Data (Vault) Architecture II, Netherlands: BI podium
  • 24 - 25th Data Vault Modeling & Certification - Netherlands: genesee academy
  • 28 - 2nd May DAMAI-EDW Conference in San Diego, CA, US: Enterprise Data World 2013
  • 2 - 3rd Implementing a Data Vault, Veenendaal, Netherlands: infosupport
  • 22 - 23rd Data Vault Modeling & Certification - Stockholm: genesee academy
  • 24th Data Vault Modeling (Intro), Netherlands: 5Hart
  • 3 - 5th MATTER Anchor Modeling course, Netherlands:
  • 6th Next Generation DWH Modeling - Netherlands:
  • 10 - 11th Data Vault Modeling & Certification - Netherlands: genesee academy
  • 12th Data Vault Modeling (Intro), Netherlands: 5Hart
  • 20th Crash Course Data Vault, Netherlands: Centennium
  • 21 - 22nd Data Vault Modeling & Certification - Munich: genesee academy
  • 19 - 20th Data Modeling: Operational, Data Warehousing & Business Intelligence, Colorado, US: genesee academy
  • Data Vault 2.0 Boot Camp & Certification, Saint Albans, Vermont, US:
  • 19 - 20th Implementing a Data Vault, Veenendaal, Netherlands: infosupport
  • 22 - 23rd Implementing a Data Vault, Mechelen, Belgium: infosupport
  • 22 - 23rd Data Vault Modeling & Certification - Iceland: genesee academy
  • 6th Data Vault Modeling (Intro), Netherlands: 5Hart
  • 11 - 12th MATTER DV In the Trenches: Data (Vault) Modeling I, Netherlands: BI podium website PLACES AVAILABLE!!
  • 24th Crash Course Data Vault, Netherlands: Centennium
  • 23 - 24th Data Modeling Zone Europe, Hannover, Germany:
  • Data Vault 2.0 Boot Camp & Certification, Saint Albans, Vermont, US:
  • 8 - 10th Data Modeling Zone 2013, Baltimore, US:
  • 9 - 10th Data Vault Modeling & Certification - Stockholm: genesee academy
  • 16 - 17th Data Vault Modeling & Certification - Netherlands: genesee academy
  • 21 - 22nd Data Vault Briefing - Helsinki: genesee academy
  • 23 - 24th MATTER DV In the Trenches: Data (Vault) Advanced III, Netherlands: BI podium

Monday, March 4, 2013

Data Vault: Deletes MATTER


There was a short discussion on Twitter about Data Vault and the tracking of (key) deletes. Few realize that the basic/traditional Data Vault is ONLY 100% auditable when keys/records are never deleted (and never updated). I personally think Data Vault is not precise enough in formulating its auditability requirement in relation to data model transformations and the tracking of database data/transactions.


The auditability requirement can be restated in terms of semantically equivalent data model (schema) transformations and data transactions. Being auditable means both that the model transformation is "lossless" and that all transactions are accounted for. A "lossless" model transformation is a bijective function between two model schemas: we can transform back and forth between the data models. If we want to account for all transactions, we either need to apply (a kind of) transactional replication or, when we do not need transactional consistency, we need to take full snapshots at a certain interval (say a day). If we want to add a transaction timeline to one of the models (like Data Vault does), we effectively transform record deletes and updates into record inserts. So we are required to log deletes if we want to be 100% auditable.
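The transformation of deletes and updates into inserts can be sketched as an append-only history (a minimal illustration; the row layout and the use of `None` as a delete marker are my assumptions, not a prescribed Data Vault format):

```python
from datetime import datetime, timezone
from typing import Optional

def apply_transaction(history: list, key: str, attributes: Optional[dict]) -> list:
    """Append-only timeline: every insert, update or delete becomes a new row.

    `attributes is None` encodes a (logical) delete, so the delete itself is
    recorded as an insert into the history and no transaction is ever lost.
    """
    history.append({
        "key": key,
        "attributes": attributes,  # None marks a delete
        "load_dts": datetime.now(timezone.utc),
    })
    return history

h = []
apply_transaction(h, "C001", {"name": "Acme"})       # insert
apply_transaction(h, "C001", {"name": "Acme Corp"})  # update -> new version row
apply_transaction(h, "C001", None)                   # delete -> tombstone row
print(len(h))  # 3
```

Three source transactions become three rows; skipping the third append is exactly the point where untracked deletes break 100% auditability.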

    Tracking deletes

While there are some very good arguments against physically deleting records in a database, it is often the case that we need to account for them. There are basically two techniques for tracking deletes: transaction logging (Change Data Capture, audit trails) and a full compare of snapshots. With snapshots we only need to compare (all) the key(s) of a table (there can be more than one!), hence the name "key compare". With an audit trail/CDC we can not only track record/key deletes, but also key mutations. We can fill a SAME-AS link (actually a FROM-TO link) or key satellite directly from a CDC stream. Note that a key update is logically equivalent to a delete plus an insert, together with knowing how the old key value mutates into the new one.
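The key-compare technique reduces to a set difference between two key snapshots. A minimal sketch (assuming complete snapshots; keys are shown as tuples so compound keys work the same way):

```python
def detect_deletes(previous_keys: set, current_keys: set) -> set:
    """Full-snapshot 'key compare': any key present in the previous snapshot
    but missing from the current one was deleted in the source, assuming
    both snapshots are complete extracts of the table's keys."""
    return previous_keys - current_keys

yesterday = {("C001",), ("C002",), ("C003",)}
today = {("C001",), ("C003",)}
print(detect_deletes(yesterday, today))  # {('C002',)}
```

Unlike CDC, this only tells us that C002 disappeared, not whether it was deleted outright or mutated into another key; for that, transaction logging is needed.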

A special case with deletes is data archiving, since archiving usually means physically deleting records while logically they might still be valid. While this post is not about Data Vault and archiving (I feel another post coming up ;-)), the correct way to deal with archiving is to set an archiving indicator for those records, and not to logically or physically delete them in your Data Vault/EDW. Alas, regulatory and privacy concerns make handling data archiving (both in the Data Vault and in the source system) a complex subject.

Data Vault

The basic Data Vault makes a trade-off between performance, ease of use, and full consistency and transparency. This is not an issue as long as the trade-off is made in an informed and consistent way. Alas, too many people take the 100% auditability requirement literally, but for the basic Data Vault this is just a goal, not a fact. For those who want to make 100% auditability a certainty: YOU need to make sure your model transformation is correct and all your transactions/records are accounted for. The Data Vault standards were designed for a lot of good reasons, but they were never precise enough to scientifically/mathematically guarantee 100% data/model auditability on data and models in all circumstances.