Wednesday, November 21, 2012

One (Metadata) Column To Rule Them All

One Column to rule them all,
One Column to find them,
One Column to bring them all  
and to the metadata bind them!

Introduction

There is usually ONE column that appears all over your Data warehouse. It might be called process_id, run_id, step_id or audit_id, and it usually points to a set of (logging) tables that hold all kinds of metadata for your Data warehouse platform. While this column is sometimes seen as mandatory (e.g. with Anchor Modeling or Data Vault), it is generally considered a good idea for any Data warehouse. There is usually just one such column, but sometimes there are more (a composite metadata key). While this column usually refers to the actual load/ETL metrics for load batches/actions, its definition and implementation vary from implementation to implementation.

The One (Metadata) Column pattern

I think we should try to standardize the semantics of such a column. This is easy if you have a good metadata model. The intersection of the static metadata (models, entities) and the metrics metadata (load runs and load cycles) with the ETL control metadata (ETL modules, ETL step numbers, ETL step order) should always lead you to the metadata at the lowest grain, where you can define your most detailed process information. At this grain I would do my central logging and create the One Metadata Column (as a surrogate key). The metadata column's meaning is then directly related to this logging entity and should not change during the operation of the Data warehouse. This does imply that a well designed and complete metadata model, covering all aspects (definition, operation, automation and management), is essential for a usable metadata column. Since the column is defined as the surrogate key at the lowest grain of your model, there can only be one. If your metadata model does not have a single lowest grain (but e.g. several disparate ones), you apparently have gaps in your model and need to extend/complete it before you can define your One Column.

Such a column has many advantages because it decouples your data and metadata to the maximum extent without losing information. It insulates the data from changes in the metadata model through the use of a surrogate key, which becomes the single point of connection between the two models.
A few disadvantages of this approach are the more complex queries needed to tie data and metadata together, and coping with real-time or near real-time data processes.

My preference for the column name is audit_id, because such a column does not have to reflect a load run or a process run, or even an ETL step (it can even represent an action within a load step).

I would think this pattern is especially important with e.g. Data Vault to maintain auditability, so we not only know when, but exactly what was loaded, how and from where. For DWH Automation efforts I think a metadata model and this pattern are very important aspects to implement consistently.

Metadata Model Example

I've created a minimal metadata model as an example: a very simple model that can be used in such a pattern. The following tables are in this model (a sketch in code follows the list):
  1. Data Objects: Here we store source and target tables
  2. Load templates and Load template versions: Here we store which load templates/processes are defined
  3. Load Batches: A Load batch is a collection of load actions/directives
  4. Load Cycles: Each batch is loaded at regular intervals; this is recorded in the load cycles.
  5. Load directives: Records which entity is loaded from which source under a certain batch and with a certain template.
  6. Load Actions: Records which load directive was executed in which cycle.
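To make this concrete, here is a minimal sketch of these tables rendered as SQL DDL through Python's sqlite3. All table and column names are illustrative assumptions, not a prescribed standard.

```python
import sqlite3

# A hypothetical rendering of the example metadata model; all table and
# column names are illustrative, not a prescribed standard.
ddl = """
CREATE TABLE data_objects (            -- source and target tables
    data_object_id            INTEGER PRIMARY KEY,
    object_name               TEXT,
    object_role               TEXT     -- 'source' or 'target'
);
CREATE TABLE load_templates (
    load_template_id          INTEGER PRIMARY KEY,
    template_name             TEXT
);
CREATE TABLE load_template_versions (
    load_template_version_id  INTEGER PRIMARY KEY,
    load_template_id          INTEGER REFERENCES load_templates,
    version_number            INTEGER
);
CREATE TABLE load_batches (            -- collections of load directives
    load_batch_id             INTEGER PRIMARY KEY,
    batch_name                TEXT
);
CREATE TABLE load_cycles (             -- one row per interval a batch is run
    load_cycle_id             INTEGER PRIMARY KEY,
    load_batch_id             INTEGER REFERENCES load_batches,
    cycle_start               TEXT,
    cycle_end                 TEXT
);
CREATE TABLE load_directives (         -- which entity is loaded from which source
    load_directive_id         INTEGER PRIMARY KEY,
    load_batch_id             INTEGER REFERENCES load_batches,
    source_object_id          INTEGER REFERENCES data_objects,
    target_object_id          INTEGER REFERENCES data_objects,
    load_template_version_id  INTEGER REFERENCES load_template_versions
);
CREATE TABLE load_actions (            -- the lowest grain of the model
    load_action_id            INTEGER PRIMARY KEY,  -- the One Metadata Column
    load_cycle_id             INTEGER REFERENCES load_cycles,
    load_directive_id         INTEGER REFERENCES load_directives,
    action_start              TEXT,
    action_end                TEXT,
    rows_processed            INTEGER
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```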

The column we need here is the load action id: the surrogate key of the table with the lowest/widest granularity in the metadata model. By referring to this column we can answer all our metadata-oriented questions around when, where, who, how and what.
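Continuing the sqlite3 sketch above, and assuming a hypothetical target table hub_customer that carries the load_action_id, a single join path answers the when, how and from-where questions (table and column names remain illustrative; conn is the connection created in the previous sketch).

```python
# Continuing the previous sketch: a hypothetical data table whose only link
# to the metadata model is the One Metadata Column, load_action_id.
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk     TEXT PRIMARY KEY,
    customer_bk     TEXT,
    load_action_id  INTEGER REFERENCES load_actions
);
""")

# When, how and from where was a given customer row loaded?
audit_query = """
SELECT la.action_start,
       lc.cycle_start,
       lt.template_name   AS loaded_how,
       src.object_name    AS loaded_from
FROM   hub_customer h
JOIN   load_actions           la  ON la.load_action_id = h.load_action_id
JOIN   load_cycles            lc  ON lc.load_cycle_id = la.load_cycle_id
JOIN   load_directives        ld  ON ld.load_directive_id = la.load_directive_id
JOIN   load_template_versions ltv ON ltv.load_template_version_id = ld.load_template_version_id
JOIN   load_templates         lt  ON lt.load_template_id = ltv.load_template_id
JOIN   data_objects           src ON src.data_object_id = ld.source_object_id
WHERE  h.customer_bk = ?
"""
rows = conn.execute(audit_query, ("CUST-001",)).fetchall()
```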


Conclusion

The One Metadata Column pattern is a useful and often implemented pattern within Data warehousing. I think it would be better if there were a consistent way to define and implement it, and I hope this post helps you to do so in a more structured fashion.

Tuesday, November 20, 2012

Colors of The Data Vault

Introduction

Most Data Vault modelers color code their Data Vault models. Alas, the chosen colors usually differ between (groups of) modelers. Hans Hultgren was one of the first to start the color coding practice (see http://www.youtube.com/watch?v=kRoDRlj8_YU ), but others like Kasper de Graaf have used this color coding as well.

The first practice to my knowledge was the following coding:
  • Hubs: Blue
  • Links: Red
  • Satellites: Yellow/Orange or Yellow/Green or Green
  • Reference Entities: None, but I use an unobtrusive grey/purple
I also use shape coding:
  • Hubs: square/cube
  • Links: Oblong/oval
  • Sats: Rectangle/flat
  • Reference entity: circle
  • Hierarchical Link: pyramid
  • Same-As/From To link: Arrow like construct

Using Color Coding Practices

Color coding can be applied not only to the basic DV entities, but also to attributes, relationships and keys, and hence to any style (Normalized/Dimensional) and other modeling techniques (FOMs like FCO-IM). It provides an interesting visualization of modeling and transformation strategies. Among the best diagramming techniques for color coding are FOMs like FCO-IM, because they let you track your transformation rules and you only need to color code roles and entities (whereas in ER you will be color coding keys, attributes and relationships).

Color Coding Data Models

Color coding data models is in fact a way to visualize a transformation strategy. It should rely on classification metadata of the underlying data model.
I'll show here some examples of color coding data models.
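As a minimal illustration of what such classification metadata might look like, the sketch below maps entity classifications to the colors and shapes listed earlier. The dictionary structure and lookup function are hypothetical; only the color/shape values come from this post.

```python
# Hypothetical classification metadata: every model element carries a
# classification, and the diagramming layer derives its color and shape from it.
DV_CLASSIFICATION_STYLE = {
    "hub":       {"color": "blue",        "shape": "square/cube"},
    "link":      {"color": "red",         "shape": "oblong/oval"},
    "satellite": {"color": "yellow",      "shape": "rectangle/flat"},
    "reference": {"color": "grey/purple", "shape": "circle"},
}

def style_for(classification: str) -> dict:
    """Look up the diagram styling for a classified model element."""
    return DV_CLASSIFICATION_STYLE[classification]

# Example: an entity classified as a hub is rendered as a blue cube.
print(style_for("hub"))
```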

FCO-IM


In FCO-IM you color the nominalized fact types (hubs, links and cross reference tables) or all the roles of the fact type. You can also color code the UCs' role connectors.

Data Vault

Here we usually just color code the basic entities.

DV Skeleton

The DV skeleton is just the DV without the satellites.


3NF

In 3NF each and every entity can produce a hub, link and sat.


DV source system analysis

Here I show part of a detailed color coding of a 3NF source system that needs to be transformed to a Data Vault. Note that not only entities, but also keys, attributes and relationships carry extensive color coding, as well as corresponding classification metadata.

Dimensional Model

A dimensional model shows a link as the basis for a fact table, with measures coming from a satellite. Dimensions are usually hubs with their accompanying satellites.

I hope this short overview shows you how you can use color coding to enhance the usefulness of your data (vault) model diagrams.

Monday, November 19, 2012

Implementation data modeling styles

Introduction

Business Intelligence specialists are often on the lookout for better ways to solve their data modeling issues. This is especially true for Data warehouse initiatives where performance, flexibility and temporalization are primary concerns. They often wonder which approach to use: should it be Anchor Modeling, Data Vault, Dimensional or still Normalized (or NoSQL solutions, which we will not cover here)? These modeling techniques focus on implementation considerations for information system development. They usually come packaged with an approach to designing certain classes of information systems (like Data warehouses), or are used in very specific OLTP system designs. The techniques focus on physical design issues like performance and data model management, sometimes together with logical/conceptual design issues like standardization, temporalization and inheritance/subtyping.

Implementation Data Modeling

Implementation data modeling techniques (also called physical data modeling techniques) come in a variety of forms. What connects them is a desire to impose modeling directives on the implemented data model to overcome several limitations of current SQL DBMSes. While they might also address logical/conceptual considerations, they should not be treated as a conceptual or logical data model; their concern is implementation. Albeit often abstracted from specific SQL DBMS platforms, they nonetheless need to concern themselves with implementation considerations on the main SQL platforms like Oracle and Microsoft SQL Server. These techniques can be thought of as a set of transformations from a more conceptual model (usually envisaged as an ER diagram at a certain 'logical/conceptual' level, but see this post for more info on "logical" data models).

Classes of Data Modeling Styles

It would be good if we could analyze, compare and classify these techniques; this way we can assess their usefulness and understand their mechanics. Apart from the methodology/architecture discussion around information system design and modeling (e.g. for Data warehousing: Kimball, Inmon, Linstedt or a combination), there is not a lot of choice. If we ignore abstractions like Entity/Attribute/Value patterns, grouping (normalization/denormalization), key refactoring, versioning, timestamping and other specific transformation techniques, we end up with nominally three styles of modeling in use right now: Anchor style, Normalized and Dimensional. All three relate to an overall model transformation strategy (from a logical/conceptual data model), and all three can come in highly normalized varieties, some even in strongly "denormalized" ones (it is better to talk about grouping, as in FOM/NIAM/FCO-IM), and anything in between. This means that the definition of each style lies in a basic transformation strategy, which is then further customized using additional transformations, especially grouping.

Grouping vs. Normalization

Normalization can be seen from two sides: as an algorithm to remove unwanted dependencies, going from 1NF up to 6NF, or as a grouping strategy on a highly normalized design (6NF or ONF), grouping down to 5NF or lower. Besides being a model transformation process, it is also used as the name for a family of related data models that use ONLY this model transformation process.

Normalization Style

Standard normalized data can be in anything from 1NF to 6NF (the highest normal form). Normalization is the simplest and most fundamental family of transformation styles. It is currently mainly used for OLTP systems.

Anchor Style Modeling

What I call Anchor style modeling (also called key table modeling) is becoming more and more accepted, especially in Data warehousing. Data Vault and Anchor Modeling are both methods that rely on this family of techniques. The basic idea is to split (groups of) keys into their own entities (called hubs, anchors or key tables) and to split off all other information into additional entities (non-key-table entities called leaves, satellites etc.). Another property of this family is that temporal registration is never done on the key tables but only on the non-key-table entities. From there the different techniques and standards, as well as their goals, diverge, leading to several related techniques that are used to design Data warehouses as well as OLTP systems in different ways.
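A minimal sketch of the key table split, in a Data Vault flavour with hypothetical names: the business key lands in its own key table, while all other information and the temporal registration live only in the non-key table.

```python
import sqlite3

# Hypothetical anchor-style split of a Customer entity (Data Vault flavour):
# the key table (hub) holds only the business key and is never temporalized;
# descriptive attributes and the temporal registration live on the satellite.
ddl = """
CREATE TABLE hub_customer (
    customer_hk  TEXT PRIMARY KEY,      -- surrogate key
    customer_bk  TEXT NOT NULL UNIQUE   -- business key, no history here
);
CREATE TABLE sat_customer_details (
    customer_hk  TEXT REFERENCES hub_customer,
    load_dts     TEXT,                  -- temporal registration, satellite only
    name         TEXT,
    address      TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```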

Dimensional Modeling

From a transformation perspective, Dimensional modeling is based on mapping basic 5NF models, seen as directed graphs, to a forest of trees in which the root nodes are fact tables and all other nodes are dimensions. All history is kept in the dimensions using basic versioning; fact tables then do a versioned join to the dimensions.
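A compact sketch of the resulting structure, again with hypothetical names: the dimension keeps history through simple versioning (valid-from/valid-to rows), and each fact row stores the surrogate key of the dimension version that was valid when the fact occurred.

```python
import sqlite3

# Hypothetical dimensional counterpart: history lives in the dimension via
# basic versioning; the fact table performs the "versioned join" by storing
# the surrogate key of the dimension version current at the time of the fact.
ddl = """
CREATE TABLE dim_customer (
    customer_sk  INTEGER PRIMARY KEY,   -- one row per version of a customer
    customer_bk  TEXT,
    name         TEXT,
    valid_from   TEXT,
    valid_to     TEXT
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_sk  INTEGER REFERENCES dim_customer,  -- version valid at sale_date
    sale_date    TEXT,
    amount       NUMERIC
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```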

Implementation Modeling style properties

Apart from Normalization, implementation modeling styles have mainly been invented to overcome issues in database software, i.e. current mainstream DBMSes. They make (sometimes serious) compromises, trading aspects like normalization and integrity against performance and maintenance. Dimensional modeling overcomes some performance and simple temporal limitations, and also covers some limitations of BI tools (mainly pattern-based query generation), while not hampering usage and understanding too much. Anchor style techniques help out with serious temporalization inside DBMSes while at the same time adding extra flexibility for mutations of the data model schema.

Temporalization and Schema Flexibility

In a future post I'll discuss this classification's effect on temporalization, schema management and schema flexibility in more detail.

One Model to rule Them All

Ideally we would like to have a generic 'super' (conceptual) modeling technique or style that is able to direct and control all these techniques: a base (conceptual) modeling style that allows us to generate/transform to any target physical modeling style. We could use 5NF from the Relational Model, but there are other considerations, such as the fact that Anchor Modeling is more closely related to 6NF. We will explore this notion of a super model in an upcoming post.

Conclusion

Most modeling techniques can be expressed as transformation strategies, which in generalized form describe families of related styles of implementation data modeling. Apart from their corresponding implementation approaches, these three styles actually cover the backbone of most implementations of current information systems, be they Data warehouses or OLTP systems.

Friday, November 2, 2012

Housebuilding, Data Architecture and Data warehousing

Introduction

Traditional Data warehousing focuses on designing and developing information systems to support an organization's BI initiatives and BI tooling. With the advent of new techniques, technologies, architectures and approaches we see the line between Data warehousing and other classes of information systems (transaction processing, interfacing, analyzing) blurring more and more.

Building a House

Building up your IT structure is a lot like building a house, and the same goes for your data architecture. However, a data architecture is usually not a building constructed from scratch by an architect, but one made of a lot of prefab elements like rooms (applications), plumbing (interfaces) and foundations (IT infrastructure). The end result varies, but it is usually more comparable to the Winchester Mystery House or an Escher drawing than to a regular building we work or live in.
The elements we buy are usually standardized by the supplier or builder, but they still need serious adaptation before they fit into your data 'house'. Even then, the result won't look pretty, since size, look and makeup will usually differ considerably between elements from different suppliers (and even from the same supplier). 'Window dressing' will only make it palatable to the casual observer, not to the people living inside.

Data Warehousing

While a Data warehouse can be seen as just another room (application), there are some special considerations. Data warehouses are supporting information systems, more a support and bridging structure for your house. They are usually custom built, bridging gaps between standard elements and supporting new types of rooms. This makes them different from most of the other elements/rooms of your 'data house' (even if the Data warehouse itself is made up of standard elements). On this new support structure your BI applications (new rooms!) can now be built.
If our other (prefab) elements were well designed and structured to connect seamlessly, we might have no need for this new type of custom support structure, but in practice this level of design is difficult and rarely implemented.

But why focus on a support structure for certain types of rooms (applications)? A normal house has just one integrated support structure that supports all the rooms and plumbing. Given the current state of IT, this overall support structure is usually built around the (IT) organization and is almost always an exercise in custom building, not an element of your 'data house' you can buy directly off the shelf.

Building Better 'Data (Ware)houses'

Instead of looking directly for a Data warehouse as a supporting structure, we should look at supporting our total Data Architecture, which could include subjects like Data Warehousing, Federation/Virtualization, Data Storage, Business Intelligence, Data Migration or data interfacing. We should focus on building a whole 'Data house' with just one 'supporting structure', of which our familiar 'Data warehouse' is an integrated part, instead of just another element grafted onto our already baroque 'Data house' landscape.