Friday, April 12, 2013

EDW: Building a House of Glass


In software testing we use the "boxing" approaches (black box, grey box and white box testing) to determine how to test our software applications: we test IT systems with a certain level of knowledge of their internal workings.
The same holds in data management and information system development (for example data warehousing), where we develop and manage our data with a certain level of knowledge of the data in our (source) systems. That knowledge of system internals can often be used to evaluate our data (management) practices and our data/information development (integration, business intelligence), and to do some data quality testing on the side as well.

Data in a Box

How can we see our data (silos)?

  • As black boxes, where data goes in and (some) information/knowledge comes out through UI screens, documents, reports, lists and user actions.
  • As grey boxes, where you have a (reasonable?) amount of understanding of your (technical) data. You know (most of) the technical data structures, their descriptions and their direct usage, but a lot of knowledge about how your data is treated by the system (detailed process rules and constraints) is either implicit or not understood at all.
  • As white/glass boxes, where data, processes and interactions are well documented (modeled!) and understood. You can interface and interact with the data in a safe and consistent way, understanding all processes and constraints that work on the data. You not only have accurate 'technical' (logical?) data modeling information, including (complex) constraints, but also information at the business/semantic level (including business rules and processes), and you have a data model schema at the 'logical' level (especially important when the actual database schema has been 'denormalized' or otherwise mangled beyond recognition).
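
To make the three levels concrete, here is a minimal sketch in Python. It is only an illustration of the distinction drawn above, not a formal assessment method: the class name, the four yes/no questions and the classification rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SourceKnowledge:
    """What we know about one source system (hypothetical checklist)."""
    technical_schema_known: bool      # tables, columns, types and their direct usage
    constraints_documented: bool      # keys and (complex) constraints
    business_rules_documented: bool   # process rules and business semantics
    logical_model_available: bool     # a schema at the 'logical' level


def classify_box(k: SourceKnowledge) -> str:
    """Return 'black', 'grey' or 'white' for a source system."""
    if not k.technical_schema_known:
        # We only see what goes in and what comes out.
        return "black"
    if (k.constraints_documented
            and k.business_rules_documented
            and k.logical_model_available):
        # Data, processes and constraints are all documented and understood.
        return "white"
    # Technical structures are known, but rules and semantics are (partly) implicit.
    return "grey"


# A hypothetical ERP system: schema reverse-engineered, rules mostly implicit.
erp = SourceKnowledge(True, True, False, False)
print(classify_box(erp))
```

In practice the answers are rarely a clean yes or no, which is exactly why most systems end up somewhere in the grey.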

Knocking up Information Hiding

The principle of information hiding is good practice in data modeling, but when looking at boxing we should understand the current situation with COTS (commercial off-the-shelf) information systems. Preferably we look at an information system through an information/data interface layer representing a (logical) data model schema. Interestingly, this should be the main function of a database schema, even if in reality it is often not the case. Alas, a COTS system is usually a technical black box and is poorly abstracted, lacking a functional data layer (i.e. a data layer that is a logical, accurate, coherent, consistent, constrained and complete representation of the data). So while information hiding is good practice, current COTS systems usually offer no good (formal) data/information abstraction layer that gives us correct and consistent access to the information system. We are usually stuck with a "black hole" box, where data goes in but cannot get out easily, if at all (save through arbitrary 'extractors', UI, reports or lists).

Unboxing your application's data

A lot of integration and information initiatives start with unboxing the murky data (models) of the source COTS information systems. When organizations purchase or design applications, they usually see them as black boxes. As soon as data integration or BI initiatives are started, they start delving into data models and databases, trying to understand the base data. They usually work up to a grey box understanding of their systems using the internal data models, because external/logical data model schemas, if available at all, are usually proprietary, incorrect, inconsistent or incomplete. From there they start developing their additional data processing (ETL, interfacing).
But most organizations that have done this still have serious issues understanding their data, because they usually miss things or do not have all the knowledge gathered and integrated in one place. They are implicitly trying to white box their information systems, but have no formal way of doing so. Detailed information about these systems is usually scattered across process and database diagrams, application code, knowledge workers, data warehouses or data marts, and IT personnel. Unboxing therefore usually stops at some grey box level.

EDWs: Building a House of Glass

Most (enterprise) data warehouse initiatives (especially those talking about a 'single version of the truth') have an explicit or implied goal of constructing a house of glass: a glass box (globe) through which all the data can be seen. They are actually trying to unbox their data to a white/glass box scenario. However, they still have the issue of unboxing their OLTP/source systems. A house of glass built on top of murky source system data conjures up an illusion of control and transparency, founded on a marsh of misunderstood source systems. Instead of focusing on the house of truth, they should focus on truly unboxing their source systems and on building a house of (source system data) facts. A (logical) EDW housing both the (derived/implied) 'truths' and the source data facts is the only way to construct a true data house of glass. This way an EDW can help you not just with your BI initiatives, but also with unlocking and unboxing your data for other initiatives like data quality control, interfacing/integration and data migration.
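
As a rough illustration of housing facts and 'truths' side by side, the Python sketch below keeps source system facts untouched and derives a 'truth' from them through a named rule, so that every derived value can be traced back to the facts behind it. All names here (SourceFact, DerivedTruth, derive_preferred, the 'crm' and 'billing' systems) are hypothetical, and the source-priority rule is just one simple assumption chosen for the example, not a recommended EDW design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFact:
    """A value exactly as delivered by a source system (never altered)."""
    source_system: str
    record_key: str
    attribute: str
    value: str


@dataclass(frozen=True)
class DerivedTruth:
    """A derived value plus the rule and the facts it was derived from."""
    record_key: str
    attribute: str
    value: str
    rule: str
    based_on: tuple


def derive_preferred(facts, attribute, priority):
    """Per record, pick the value from the highest-priority source system."""
    rank = {system: i for i, system in enumerate(priority)}
    by_key = {}
    for fact in facts:
        if fact.attribute == attribute:
            by_key.setdefault(fact.record_key, []).append(fact)
    truths = []
    for key, candidates in by_key.items():
        # Systems missing from the priority list sort last.
        best = min(candidates, key=lambda f: rank.get(f.source_system, len(priority)))
        rule = "prefer " + " > ".join(priority)
        truths.append(DerivedTruth(key, attribute, best.value, rule, tuple(candidates)))
    return truths


facts = [
    SourceFact("crm", "customer-1", "email", "old@example.com"),
    SourceFact("billing", "customer-1", "email", "new@example.com"),
]
for truth in derive_preferred(facts, "email", ["billing", "crm"]):
    print(truth.value, "via", truth.rule, "from", len(truth.based_on), "facts")
```

The point of the design is that the conflicting source facts stay visible next to the derived value, instead of being overwritten by it.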

Copyright Datamasters (Unseen) 2013