Friday, November 8, 2013

One "Data Modeling" Approach To Rule Them All

The Task

In a previous post on implementation (data) modeling styles I talked about the need for a 'universal data modeling technique/method' that could be the starting point for all your structural model transformations (semantically equivalent derivations), be they dimensional, normalized or Anchor Style Modeling. Such a modeling technique should allow for easy transformation and be agnostic to most transformation techniques, while still guaranteeing several levels of semantic equivalence. It should also allow for (database) model re-engineering and for data representations on any desired "level" (conceptual, logical). It should facilitate conceptual data integration as well as temporal aspects.

The magic wand of Semantic Equivalence

It is important to realize that all implementation modeling techniques, such as dimensional and Data Vault, rely on a certain class of model transformations I call fully semantically equivalent. This class of transformations preserves (parts of) the data (they preserve the functional, join and multivalued dependencies as defined in the relational model) and hence provides a degree of data traceability from one transformation to the next. However, they do make some aspects more difficult, especially maintaining all kinds of (complex) constraints. These kinds of transformations are usually biased towards certain aspects. Constraint minimization and standardization are best served by (highly) normalized models. Manual temporal processing and schema change impact isolation are best served by Anchor style modeling techniques like Data Vault. User access and data navigation are best served by (logical) Dimensional Models (see the OASI concept of van der Lek).
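To make the idea of a fully semantically equivalent transformation concrete, here is a minimal sketch (with hypothetical table and attribute names) of decomposing a relation along a functional dependency and joining it back losslessly — no rows gained, none lost:

```python
# Relation employee(id, name, dept, dept_location), with the
# functional dependency dept -> dept_location.
employee = {
    (1, "Ann", "Sales", "Utrecht"),
    (2, "Bob", "Sales", "Utrecht"),
    (3, "Cas", "IT", "Arnhem"),
}

# Decompose on the FD dept -> dept_location into two projections.
emp_dept = {(eid, name, dept) for eid, name, dept, _ in employee}
dept_loc = {(dept, loc) for _, _, dept, loc in employee}

# A natural join on dept reconstructs the original relation exactly,
# which is what makes this decomposition "fully semantically equivalent".
rejoined = {
    (eid, name, dept, loc)
    for eid, name, dept in emp_dept
    for d, loc in dept_loc
    if d == dept
}

assert rejoined == employee  # lossless-join, dependency-preserving
```

The same round-trip check is the acid test for any of the transformations discussed here: if the derived schema cannot be joined back to the source without loss or spurious rows, the transformation is not in this class.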

The Candidates

For me there were only two serious options: an anchored version of 6NF on the logical level, or a Fact Based Modeling (FBM) technique from the NIAM family (FCO-IM, ORM or CogNIAM) on the conceptual level. Other techniques like OWL are poor on constraint modeling and derivation and hence lack the desired easy (semantically equivalent) transformation. 6NF is an interesting candidate on the logical level since it is irreducible, time extendable and able to house constraints and derivations. It does not contain key tables (anchors) by default, but we can create an anchored version (A6NF) that creates an anchor for each key that has a functional dependency. Date, Darwen and Lorentzos showed how to formally define temporal features in the relational model as well. However, apart from lacking conceptual aspects like classification and quantification, verbalization and specialization/generalization, normalized models also do not abstract away from relationship implementation (elsewhere a bonus, but not in this scenario), which, given the myriad surrogation strategies used in data sources, is a deficiency here. Abstracting away from foreign keys to relationships/roles gives rise to some additional issues, but all in all FBM techniques can do all of this nicely (as far as it goes). The Fact Based Modeling family lacks something in operators, especially temporal ones, but that can be remedied with a conceptual query language and a simple conceptual temporal extension. (Another issue is that we need to facilitate robust and customizable data model reverse engineering, something current FCO-IM only has in a simple form.) I think that of all the FBM dialects FCO-IM lends itself best for understanding model transformations/derivations because it focuses on fact types. For me this means that FCO-IM is an ideal candidate to use both as a modeling technique and as a modeling method (actually a diagramming method, by the way).
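As a rough illustration of the anchored 6NF (A6NF) shape described above — one anchor per key plus one irreducible (key, attribute) table per functional dependency — here is a hypothetical sketch; the table names are mine, not a prescribed convention:

```python
# Source relation employee(id, name, dept).
employee = {
    (1, "Ann", "Sales"),
    (2, "Bob", "IT"),
}

# A6NF-style split: a key table ("anchor") plus one irreducible
# (key, attribute) table per functional dependency from that key.
anchor = {eid for eid, _, _ in employee}
emp_name = dict((eid, name) for eid, name, _ in employee)
emp_dept = dict((eid, dept) for eid, _, dept in employee)

# Rejoining the irreducible tables over the anchor recovers the source rows.
rejoined = {(eid, emp_name[eid], emp_dept[eid]) for eid in anchor}
assert rejoined == employee
```

Because each attribute lives in its own table keyed by the anchor, a temporal extension is a local change (add a validity interval to one attribute table) rather than a rewrite of a wide row — which is exactly the schema-change isolation argument made for Anchor style modeling.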
The fact that FCO-IM is taught and researched here in the Netherlands made it an easy choice for me.

The One Model Methodology

The result is that I depend on FCO-IM for my transformation strategy analysis, and that it has become an important part of the MATTER program. It allows me to analyze, verbalize, display and derive implementation data model schemas in a consistent and generic way. This lets me understand arbitrary implementation modeling styles within just one conceptual data model, as a restructuring of fact (types). This way I resolve arbitrary diagramming, modeling and designing of data model schemas into a set of consistent, complete and correct fact restructuring directives.


Transformation vs Derivations

FCO-IM is usable for describing the class of fully semantically equivalent transformations since it conceptually captures dependencies. The actual restructuring is done on the conceptual schemas of FCO-IM itself. A model transformation first becomes (conceptual) model (schema) standardization and from there model schema derivation. We call this structural transformation or model restructuring, but in fact it is a model schema derivation strategy.

Implementations vs Definitions

As long as FCO-IM diagrams cannot be implemented directly, we would actually transform an FCO-IM diagram into an anchorized 6NF model. This would be our idealized implementation model. From here all our implementation models become derivation models. Hence, implementation models are logical models that are logically derivable and controllable from an A6NF schema, while they are conceptually derived from an FBM model. Some implementation modeling styles that are closely related to A6NF (like normalized or anchorized styles) lend themselves to creating central data repositories, while others (like dimensional) lend themselves to becoming derived abstraction layers on top of this. Alas, in real-world implementations we are usually forced to materialize some of these derived implementation modeling styles directly in databases, creating extra overhead that needs to be managed by some sort of data automation.
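The "derivation model" idea can be sketched as follows: a denormalized, dimensional-style row computed on demand as a view over anchored tables, rather than maintained as an independent structure. All names here are hypothetical:

```python
# Central repository in A6NF style: an anchor plus attribute tables.
customer_anchor = {101, 102}
customer_name = {(101, "Ann"), (102, "Bob")}
customer_city = {(101, "Utrecht"), (102, "Arnhem")}

def dim_customer():
    """Derive a denormalized customer 'dimension' from the anchored tables.

    The dimensional structure is not stored; it is a logical derivation,
    so it can never drift out of sync with the central repository.
    """
    names, cities = dict(customer_name), dict(customer_city)
    return {(cid, names[cid], cities[cid]) for cid in customer_anchor}

assert dim_customer() == {(101, "Ann", "Utrecht"), (102, "Bob", "Arnhem")}
```

Materializing `dim_customer()` into a table is exactly the "extra overhead" mentioned above: the derived copy then needs refresh and reconciliation machinery that the pure derivation does not.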

Implementation Modeling Styles?

Implementation data modeling styles are not 'physical' data modeling styles. 'Physical' is a misnomer from the ER and SQL database world: a physical data model is what is stored inside a database, like indexes and table spaces. Implementation modeling styles are logical model derivations used for logically accessing data, but also for serving non-functional requirements around presentation, processing, performance and maintenance. They are artifacts created to counter the poor separation of concerns within current data tooling like SQL DBMSes and ETL tools, as well as bad data management, poor data quality, poor temporal support, etc. They are used for restructuring database schemas to separate and handle these concerns. In a TRDMS, a true relational database management system, we would generate just one logical schema. Data Vaults and Dimensional models as we know them now would not be needed.

The Mission

For a lot of BI professionals, however, information modeling using FCO-IM is terra incognita, and semantically equivalent model transformations (if done at all) are done using basic ER diagramming, which prohibits good standardization, transformation and (formal) derivation, and hence understanding of implementation data modeling styles. ER modeling is just about visualizing/drawing/notating model schemas and does not help this understanding. To make the understanding of implementation data modeling styles better and more objective we need to educate professionals in this respect. Also, since derivation and transformation go hand in hand, they need to understand the role of the relational model and model derivation as well, not just the (technically oriented) artifacts like Data Vaults and Dimensional Models.

Your Chance!

In December we start our MATTER FCO-IM track so BI professionals can dive deep into this aspect of information modeling. We start with 3 days of hands-on FCO-IM (8-10 Dec.). See for yourself how FCO-IM works and facilitates good data modeling. See the BI Podium website for more info.
