Monday, November 19, 2012

Implementation data modeling styles

Introduction

Business Intelligence specialists are often on the lookout for better way to solve their data modeling issues. This is especially true for Data warehouse initiatives where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use, should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These are modeling techniques focus around implementation considerations for Information system development. They are usually packed with an approach to design certain classes of information systems (like Data warehouses) or are being used in very specific OLTP system design. The techniques focus around physical design issues like performance and data model management sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping.

Implementation Data Modeling

Implementation Data Modeling techniques (also called physical data modeling techniques) come in a variety of forms. Their connection is a desire to pose modeling directives on the implemented data model to overcome several limitations of current SQLSDBMSes. While they also might address logical/conceptual considerations, they should not be treated like a conceptual or logical data model. Their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram on a certain 'logical/conceptual' level but see this post for more info on "logical" data models )

Classes of Data Modeling Styles

It would be good if we could analyze, compare and classify these techniques. This way we can assess their usefulness and understand their mechanics. Apart from the methodology/architecture discussion around information system design and modeling (e.g. for Data warehousing: Kimball, Inmon, Linstedt or a combination) there is not a lot of choice. If we ignore abstractions like Entity/Attribute/Value patterns, grouping (normalization/denormalization), key refactoring, versioning, timestamping and other specific transformation techniques we end up with nominally 3 styles of modeling that are being used right now: Anchor Style, Normalized and Dimensional. All three relate to an overall model transformation strategy (from a logical/conceptual data model), and all three can come in highly normalized varieties and some even in strongly "denormalized" (better is to talk about grouping like in FOM/NIAM/FCO-IM)  ones (and anything in between). This means that the definition of each style lies in a basic transformation strategy which then is further customized using additional transformations, especially grouping.

Grouping vs. Normalization

Normalization can be seen from 2 sides. As an algorithm to decrease unwanted dependencies going from 1NF up to 6NF, or as a grouping strategy on a highly normalized design (6NF or ONF) grouping down to 5NF or lower. Besides being a model transformation process it is also used as a name for a family of related data models using ONLY this model transformation process.

Normalization Style

Standard normalized data can be in either 1NF to 6NF (which is the highest normal form). Normalization is the simplest and most fundamental family of transformation styles. It is currently mainly used for OLTP systems.

Anchor Style Modeling

What I call Anchor style modeling (also called key table modeling) is becoming more and more accepted, especially with Data warehousing. Data Vault and Anchor Modeling are both methods that rely in this family of techniques. The basic idea is to split (groups of) keys into their own entities (called hubs, anchors or key tables) and split of all other info in additional entities (non key table entities called leafs, satellites etc.). Another property of this family is that temporal registration is never done on key tables but only on the non key table entities. From there the different techniques and standards, as well as goals, diverge leading to several related techniques that are used to design Data warehouses as well as OLTP systems in different fashions.

Dimensional Modeling

From a transformation perspective Dimensional modeling is based on mapping basic 5NF models as directed graphs to a forest of trees where the root nodes are fact tables and all other nodes being dimensions. Also, all history is kept in dimensions using basic versioning. Fact tables then do a versioned join to the dimensions.

Implementation Modeling style properties

Apart from Normalization, implementation modeling styles have been mainly invented to overcome issues in database software, e.g. current mainstream DBMSes. They make (sometimes serious) compromises on aspects like normalization and integrity against performance and maintenance. Dimensional modeling overcomes some performance and some simple temporal limitations, and also covers some limitations of BI tools (mainly pattern based query generation), while at the same time isn't hampering usage and understanding too much. Anchor Style techniques help out with serious temporalization inside DBMSes while at the same time add extra flexibility to mutations of the data model schema.

Temporalization and Schema Flexibility

In a future post I'll discuss this classification's effect on temporalization, schema management and schema flexibility in more detail.

One Model to rule Them All

Ideally we would like to have a generic 'super' (conceptual) modeling technique or style that is able to direct and control all these techniques. A base (conceptual) modeling style that allows us to generate/transform to any target physical modeling style. We could use 5NF from the Relational Model, but there are also other considerations like the fact that Anchor Modeling more related to 6NF. We will explore this notion of a super model in an upcoming post.

Conclusion

Most modeling techniques can be expressed as transformation strategies, which in a generalized form describe families of related styles of implementation data modeling techniques. Apart from their corresponding implementation approaches these 3 styles actually cover the backbone of most implementations of current information systems be it Data warehouses or OLTP systems.