Thursday, October 11, 2012

Concerning Data Model Madness: what are we talking about?

There have been endless debates on how to name, identify, relate and transform the various kinds of data models we can, should, would or must have in the process of designing, developing and managing information systems (like data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models -- And How to Name them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.

One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some fact-oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect Entity Relationship techniques are, formally, diagramming techniques rather than modeling techniques. Disregarding this distinction for now, we need to ask ourselves whether there are universal rules for kinds of data models and their place within a model generation/transformation strategy.
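To make the container/operator distinction concrete, here is a minimal sketch in Python of three relational operators (selection, projection, natural join) over relations represented as sets of attribute/value pairs. The representation and the example data are my own illustrative assumptions, not any particular tool's implementation.

```python
# Minimal sketch of relational operators over relations represented as
# frozensets of (attribute, value) pairs. Names and data are illustrative.

def select(relation, predicate):
    """Relational selection: keep tuples that satisfy the predicate."""
    return {t for t in relation if predicate(dict(t))}

def project(relation, attributes):
    """Relational projection: keep only the named attributes (duplicates collapse)."""
    return {frozenset((a, v) for a, v in t if a in attributes) for t in relation}

def natural_join(r, s):
    """Natural join: combine tuples that agree on all shared attributes."""
    result = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                result.add(frozenset({**dt, **du}.items()))
    return result

employee = {
    frozenset([("emp_id", 1), ("name", "Ada"), ("dept_id", 10)]),
    frozenset([("emp_id", 2), ("name", "Grace"), ("dept_id", 20)]),
}
department = {
    frozenset([("dept_id", 10), ("dept_name", "Research")]),
    frozenset([("dept_id", 20), ("dept_name", "Engineering")]),
}

# Who works in Research? Compose the operators like a conceptual query.
research_names = project(
    select(natural_join(employee, department),
           lambda t: t["dept_name"] == "Research"),
    {"name"},
)
# research_names == {frozenset({("name", "Ada")})}
```

The point is not the implementation but the capability: a model that defines operators lets you compose queries over its containers, while a pure diagramming notation gives you no such algebra.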

I note that with e.g. FCO-IM diagrams you can go directly from a conceptual model to a "physical" DBMS schema if you want to, even a specialized one like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven, and could in theory go from ontological to physical in one step, without a conceptual or logical layer (while still being consistent and complete, by the way, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers and not overloading a layer/model with so many semantics and constructs that its usability is hampered.
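As a toy illustration of skipping layers, the following Python sketch generates Data Vault style DDL directly from a conceptual description of elementary fact types. The fact-type representation, naming conventions and column choices are my own assumptions for illustration, not FCO-IM's or Data Vault's actual transformation rules.

```python
# Toy sketch: generate Data Vault style DDL directly from elementary fact
# types. The fact-type representation and naming conventions here are
# illustrative assumptions, not a real FCO-IM/Data Vault transformation.

def hub_ddl(entity, business_key):
    """Render a hub table for an entity identified by a business key."""
    return (
        f"CREATE TABLE hub_{entity} (\n"
        f"  {entity}_hk CHAR(32) PRIMARY KEY,  -- hash key\n"
        f"  {business_key} VARCHAR(100) NOT NULL UNIQUE,\n"
        f"  load_dts TIMESTAMP NOT NULL,\n"
        f"  record_src VARCHAR(50) NOT NULL\n"
        f");"
    )

def satellite_ddl(entity, attributes):
    """Render a satellite table holding descriptive attributes over time."""
    cols = ",\n".join(f"  {a} VARCHAR(100)" for a in attributes)
    return (
        f"CREATE TABLE sat_{entity} (\n"
        f"  {entity}_hk CHAR(32) REFERENCES hub_{entity},\n"
        f"  load_dts TIMESTAMP NOT NULL,\n"
        f"{cols},\n"
        f"  PRIMARY KEY ({entity}_hk, load_dts)\n"
        f");"
    )

# Elementary fact types, conceptually: "Employee <nr> has name <name>",
# "Employee <nr> was born on <birth date>".
facts = {"employee": {"key": "employee_nr", "attrs": ["name", "birth_date"]}}

for entity, spec in facts.items():
    print(hub_ddl(entity, spec["key"]))
    print(satellite_ddl(entity, spec["attrs"]))
```

The interesting part is that nothing here requires an intermediate logical model: the conceptual fact types plus a transformation strategy are enough to produce the physical schema.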

For me this is governed by the concept of concerns: pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation; semantic concerns like identification, definition and ontology; data quality concerns like uniqueness constraints; or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns that they can model but do not deem important (though these are usually important concerns at another level/layer!).

What makes this even more difficult is that within certain kinds of data models there is also tension between notation, approach and theory (the N.A.T. principle). For example, the relational model is theoretically sound, but its formal notation of nested sets isn't visually helpful, while ER diagramming looks good but has little theoretical foundation beneath it.
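One way to see what concerns buy us: the same uniqueness concern can be rendered differently at each layer. A hypothetical sketch in Python, where the layer names, entity and renderings are all illustrative assumptions of mine:

```python
# Sketch: one uniqueness concern expressed at three different layers.
# The layer names, entity and renderings are illustrative assumptions.

concern = {"entity": "employee", "attribute": "employee_nr"}

def verbalize(c):
    """Linguistic/conceptual layer: the concern as a verbalized constraint."""
    return f"Each {c['attribute']} identifies at most one {c['entity']}."

def logical_constraint(c):
    """Logical layer: the concern as a declarative uniqueness constraint."""
    return (f"ALTER TABLE {c['entity']} "
            f"ADD CONSTRAINT uq_{c['attribute']} UNIQUE ({c['attribute']});")

def physical_index(c):
    """Physical/optimization layer: an index that enforces the constraint."""
    return (f"CREATE UNIQUE INDEX ix_{c['entity']}_{c['attribute']} "
            f"ON {c['entity']} ({c['attribute']});")

for render in (verbalize, logical_constraint, physical_index):
    print(render(concern))
```

Each layer answers a different question about the same concern: how do we say it, how do we declare it, and how do we make it fast, which is exactly why overloading one layer with all three hurts usability.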

I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic three-level approach of conceptual, logical and physical. We should be explicit about the concerns we want to tackle in each layer instead of using generic, suggestive layer names.

I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario should support the concept of separation of concerns (as defined by Dijkstra) in a suitable way, with a minimum of layers and a minimum of modeling methodologies and notations.