Introduction
When you create temporal data models, either with Data Vault, Anchor modeling or any other way, you end up with several tables with timelines in them. These timelines often need to be combined for going to Data Marts or other processing. This combining is done with a temporal JOINExample
- Table Product
- Table Product Cost
- Table Product Price
- Calculating Product Revenue = Sales Price- Manufacturing Cost
Conceptually the query will look like this (using the Allen operator)
SELECT "product cost".cost Cost,
"product price".price AS Price,
"product price".price - "product cost".cost AS Revenue,
"product cost".period INTERSECTS "product price".period AS Period
FROM "product cost"
INNER JOIN "product price"
ON "product price".id = "product cost".id
WHERE "product cost".period OVERLAPS "product price".period
The issues here are twofold, how to implement this in plain SQL, and what to define for Revenue for the omissions. Here we could state that the revenue is undefined, so in this case the first period is not available. Alas, this is not always the correct result.
SQL Implementation
There are a lot of ways to construct an SQL query that implements an intersection of 2 timelines. The most simplest are checking overlap between each 2 periods. This will quickly create unwieldy queries when more timelines need to be intersected. A more sophisticated approach rewrite, given x,y,z are periods:
y OVERLAPS y OVERLAPS y to NOT EMPTY(x INTERSECTS y INTERSECTS z) and to
say a = (x INTERSECTS y INTERSECTS z) then a.start_data
max(x.start_date,y.start_date,z.start_date) to min(x.end_date,y.end_date,z.end_date) is a valid period, which means:
max(x.start_date,y.start_date,z.start_date) < min(x.end_date,y.end_date,z.end_date)
Using this functions in the following way:
max(x.start_date,y.start_date,z.start_date) to min(x.end_date,y.end_date,z.end_date) is a valid period, which means:
max(x.start_date,y.start_date,z.start_date) < min(x.end_date,y.end_date,z.end_date)
Basic Greatest/Least SQL Query
Timeline Alignment
All these kinds of queries only work correctly when the different tables have aligned their timelines (have the same start and end dates for each key). For the end date this works ou as long am 'high end date' is used. For start dates additional records need to be created with empty/default/null values to align on a 'low start date' (say 1753-01-01).
DBMS implementation
These types of queries work in Both Oracle, Teradata and SQL Server. In Oracle and Teradata these queries can be used because the functions greatest and least are defined. In SQL Server however this is not the case.
One of the issues here is that you cannot use the helper functions in indexed views, you need to substitue the case statement, and duplicate it both in the select and where clause.
All these kinds of queries only work correctly when the different tables have aligned their timelines (have the same start and end dates for each key). For the end date this works ou as long am 'high end date' is used. For start dates additional records need to be created with empty/default/null values to align on a 'low start date' (say 1753-01-01).
DBMS implementation
These types of queries work in Both Oracle, Teradata and SQL Server. In Oracle and Teradata these queries can be used because the functions greatest and least are defined. In SQL Server however this is not the case.
SQL Server (T-SQL) implementation
To create this type of query in SQL Server we need both a Least and Greatest function. we can either code a .NET assembly for this, or an user defined function. Alas the udf won't perform at all. A trick here is to rewrite a scalar udf to an inline view udf. These will get optimized and peform OK. We only need to (CROSS) APPLY them to the query to get our wanted result.One of the issues here is that you cannot use the helper functions in indexed views, you need to substitue the case statement, and duplicate it both in the select and where clause.
Helper functions
Here are the SQL Server helper functions greatest and least and their usage pattern.
- CREATE FUNCTION [dbo].[ivf_least] (@1 datetime=null,@2 datetime = null ,@3 datetime = null,@4 datetime=null,@5datetime = null)
- RETURNS TABLE WITH SCHEMABINDING
- RETURN
- (WITH coal(a1,a2,a3,a4,a5)
- AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
- SELECT case when a1<=a2 and a1<=a3 and a1<=a4 and a1<=a5 then a1
- when a2<=a3 and a2<=a4 and a2<=a5 then a2
- when a3<=a4 and a3<=a5 then a3
- when a4<a5 then a4
- else a5 end as out from coal);
- GO
- CREATE FUNCTION [dbo].[ivf_greatest] (@1 datetime=null,@2 datetime = null ,@3 datetime = null,@4 datetime=null,@5datetime = null)
- RETURNS TABLE WITH SCHEMABINDING
- RETURN
- (WITH coal(a1,a2,a3,a4,a5)
- AS (SELECT COALESCE(@1,@2,@3,@4,@5),COALESCE(@2,@3,@4,@5,@1),COALESCE(@3,@4,@5,@1,@2),COALESCE(@4,@5,@1,@2,@3),COALESCE(@5,@1,@2,@3,@4))
- SELECT case when a1>=a2 and a1>=a3 and a1>=a4 and a1>=a5 then a1
- when a2>=a3 and a2>=a4 and a2>=a5 then a2
- when a3>=a4 and a3>=a5 then a3
- when a4>a5 then a4
- else a5 end as out from coal);
- GO
Using this functions in the following way:
- CREATE TABLE SAT1 (DV_ID int,START_DT datetime,END_DT datetime) PRIMARY KEY (DV_ID,START_DT)
- GO
- CREATE TABLE SAT2 (DV_ID int,START_DT datetime,END_DT datetime) PRIMARY KEY (DV_ID,START_DT)
- GO
- select
- SAT1.DV_ID,
- GREATEST_DT.out as START_DT,
- LEAST_DT.out as END_DT
- from SAT1 inner join SAT2 on SAT1.DV_ID=SAT2.DV_ID
- cross apply ivf_greatest(SAT1.START_DT,SAT2.START_DT,NULL,NULL,NULL) as GREATEST_DT
- cross apply ivf_least(SAT1.END_DT,SAT2.END_DT,NULL,NULL,NULL) as LEAST_DT
- where
- GREATEST_DT.out < LEAST_DT.out -- assuming an [Closed,Open) period
- GO
No comments:
Post a Comment