Maintaining data historization is a very common but time-consuming task in a data warehouse environment. The usual techniques involve outer joins and some kind of change detection. This change detection must handle NULL values correctly and is possibly the trickiest part. On the other hand, SQL offers standard functionality with exactly the desired behaviour: GROUP BY, or partitioning with analytic functions. Can it be used for this task?
2. Our Company
Trivadis DOAG17: SCD2 mal anders, 29.11.2018
Trivadis is a leader in IT consulting, system integration, solution engineering and the delivery of IT services, focusing on - and -technologies, in Switzerland, Germany, Austria and Denmark. Trivadis delivers its services from these strategic business areas:
Trivadis Services takes over the corresponding operation of your IT systems.
OPERATIONS
3. [Map of Trivadis locations: Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Vienna]
With over 600 IT and subject-matter experts on site with you.
14 Trivadis branches with over 600 employees.
More than 200 Service Level Agreements.
More than 4,000 training participants.
Research and development budget: CHF 5.0 million / EUR 4.0 million.
Financially independent and sustainably profitable.
Experience from more than 1,900 projects per year for over 800 customers.
4. About Me
Senior Consultant at Trivadis GmbH, Düsseldorf
Focus on Oracle:
– Data Warehousing
– Application Development
– Application Performance
Course instructor for "Oracle 12c New Features for Developers" and "TechnoCircle Oracle 12c Release 2"
Blog: http://blog.sqlora.com
5. Agenda
1. Introduction and state of the art
2. The "new" approach
3. Use cases and performance
4. Conclusion
7. Introduction
Historization? As part of the loading process in a data warehouse.
We consider Slowly Changing Dimensions Type 2:
all changes are completely tracked, and a change in at least one of the tracked columns triggers the creation of a new version record.
The most challenging task is the change detection.
DWH_KEY | VALID_FROM | VALID_TO   | CUR_VERSION | ETL_OP | BUS_KEY | FIRST_NAME | SECOND_NAMES | LAST_NAME | HIRE_DATE  | FIRE_DATE  | SALARY
1       | 01.12.2016 | 02.12.2016 | N           | UPD    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 900000
11      | 03.12.2016 |            | Y           | INS    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 920000
6       | 02.12.2016 | 02.12.2016 | N           | UPD    | 345     | Venus      |              | Williams  | 01.11.2016 |            | 500000
10      | 03.12.2016 |            | Y           | INS    | 345     | Venus      |              | Williams  | 01.11.2016 | 01.12.2016 | 500000
2       | 01.12.2016 | 02.12.2016 | N           | UPD    | 456     | Rafael     |              | Nadal     | 01.05.2009 |            | 720000
3       | 01.12.2016 | 01.12.2016 | N           | UPD    | 789     | Serena     |              | Williams  | 01.06.2008 |            | 650000
5       | 02.12.2016 |            | Y           | INS    | 789     | Serena     | Jameka       | Williams  | 01.06.2008 |            | 650000
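A versioned dimension table of this shape could be declared as follows. This is a hedged sketch only: the column names follow the example above, but the table name and all data types are assumptions, not from the slides.

```sql
-- Hypothetical DDL for the versioned dimension shown above.
CREATE TABLE employees_hist (
  dwh_key      NUMBER       PRIMARY KEY,  -- surrogate key, one per version
  valid_from   DATE         NOT NULL,
  valid_to     DATE,                      -- NULL for the current version
  cur_version  CHAR(1)      NOT NULL,     -- 'Y' / 'N'
  etl_op       VARCHAR2(3),               -- 'INS' / 'UPD'
  bus_key      NUMBER       NOT NULL,     -- business key from the source
  first_name   VARCHAR2(50),
  second_names VARCHAR2(100),
  last_name    VARCHAR2(50),
  hire_date    DATE,
  fire_date    DATE,
  salary       NUMBER
);
```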
8. State of the Art
Typical OWB (Oracle Warehouse Builder) mapping
9. State of the Art

Source (BK_S, C1_S, C2_S):     Target (BK_T, C1_T, C2_T):
11  A  B                       11  A  BB
22  D  E                       22  D  E
44  K  L                       33  F  G
77  M  (null)                  77  M  N

Source and Target are combined with a Full Outer Join, followed by the Change Detection. The result is split into Old Versions and New Versions, recombined with UNION ALL and applied to the Target with a MERGE. The data to the left (the join input) has to be accessed twice!

The change detection must be NULL-aware, e.g. for column C2:
NVL(C2_S,'(NULL)') != NVL(C2_T,'(NULL)')
LNNVL(C2_S = C2_T) AND NVL(C2_S, C2_T) IS NOT NULL
or alternatives based on DECODE, STANDARD_HASH, SYS_OP_MAP_NONNULL, …

More on delta detection: https://danischnider.wordpress.com/2016/10/08/delta-detection-in-oracle-sql/
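Condensed into a single statement, this legacy pattern might look like the sketch below. It is an assumption of what such a mapping generates, not code from the slides: table, column and sequence names (source, target, dwh_key, seq_dim) are illustrative, a left join stands in for the full outer join (deletes are ignored for brevity), and only two tracked columns are compared.

```sql
-- Hypothetical legacy SCD2 load: outer join + NULL-aware column comparison,
-- split into "close old version" and "insert new version" via UNION ALL.
MERGE INTO target t
USING (
  WITH delta AS (
    SELECT s.bk, s.c1, s.c2, t.dwh_key AS cur_dwh_key
    FROM   source s
    LEFT JOIN target t
           ON t.bk = s.bk AND t.cur_version = 'Y'
    WHERE  t.bk IS NULL                                -- new business key
       OR  NVL(s.c1,'(NULL)') != NVL(t.c1,'(NULL)')    -- NULL-safe compare
       OR  NVL(s.c2,'(NULL)') != NVL(t.c2,'(NULL)')
  )
  SELECT cur_dwh_key AS join_key, bk, c1, c2           -- closes old version
  FROM   delta WHERE cur_dwh_key IS NOT NULL
  UNION ALL
  SELECT NULL AS join_key, bk, c1, c2                  -- inserts new version
  FROM   delta
) d
ON (t.dwh_key = d.join_key)
WHEN MATCHED THEN
  UPDATE SET t.valid_to = TRUNC(SYSDATE) - 1, t.cur_version = 'N'
WHEN NOT MATCHED THEN
  INSERT (t.dwh_key, t.bk, t.c1, t.c2, t.valid_from, t.cur_version)
  VALUES (seq_dim.NEXTVAL, d.bk, d.c1, d.c2, TRUNC(SYSDATE), 'Y');
```

Note how the delta has to be read twice (once per UNION ALL branch), which is exactly the weakness the slide points out.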
10. State of the Art
Change detection must be done with respect to NULL values:
either comparing each and every column in a complex way,
or maintaining and comparing hash diffs (common rules are needed, and re-hashing is sometimes needed after structural changes).
The full outer join may be expensive when not working with "deltas".
Splitting the join result into two data sets causes the join to be executed twice.
Another solution?
12. The "new" approach

The "new" approach is not really new; it is often used for ad hoc queries: are these two records different?

BK  C1  C2  C3  C4  …  C467  C468  C469
11  A   B   C   D   …  AA    BB    CC
11  A   B   C   D   …  AB    BB    CC

Using GROUP BY:

SELECT COUNT(*)
FROM t
GROUP BY BK, C1, C2, C3, C4, … C467, C468, C469;
13. The "new" approach

Or using an analytic function:

SELECT COUNT(*) OVER (PARTITION BY BK, C1, C2, C3, … C468, C469)
FROM t;

If the count equals 2, the two records are the same; if the count equals 1, they are different.
But what about NULLs? For GROUP BY and PARTITION BY, NULL = NULL and VALUE != NULL: NULL values are grouped together, which is exactly the desired comparison behaviour.
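This grouping behaviour can be seen in a minimal, self-contained example (the WITH data is made up for illustration):

```sql
-- Two versions of key 11 differ only in C2 (NULL vs 'B'); the two versions
-- of key 22 are identical. COUNT(*) per partition flags the difference.
WITH t (bk, c1, c2) AS (
  SELECT 11, 'A', NULL FROM dual UNION ALL
  SELECT 11, 'A', 'B'  FROM dual UNION ALL
  SELECT 22, 'D', 'E'  FROM dual UNION ALL
  SELECT 22, 'D', 'E'  FROM dual
)
SELECT bk, c1, c2,
       COUNT(*) OVER (PARTITION BY bk, c1, c2) AS cnt  -- 2 = same, 1 = different
FROM   t;
-- Key 11 yields cnt = 1 for both rows (the NULL is handled correctly),
-- key 22 yields cnt = 2.
```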
14. The "new" approach

Source (BK, C1, C2):    Target (BK, C1, C2):
11  A  B                11  A  BB
22  D  E                22  D  E
44  K  L                33  F  G
77  M  (null)           77  M  N

Source and Target (current versions) are combined with UNION ALL, adding a flag column S_T ('S' = source, 'T' = target):

S_T  BK  C1  C2
S    11  A   B
S    22  D   E
S    44  K   L
S    77  M   (null)
T    11  A   BB
T    22  D   E
T    33  F   G
T    77  M   N

A GROUP BY over BK, C1, C2 with COUNT(*) (CNT) and MIN(S_T) then performs the change detection:

BK  C1  C2      MIN(S_T)  CNT
11  A   B       S         1
22  D   E       S         2
44  K   L       S         1
77  M   (null)  S         1
11  A   BB      T         1
33  F   G       T         1
77  M   N       T         1

CNT = 2 means source and target agree, so nothing needs to be done. CNT = 1 means the record is new, changed or no longer present, and MIN(S_T) tells on which side it occurs. The result feeds the MERGE into the Target.

DEMO!
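Put together as one statement, the approach could look like the following hedged sketch. Table, column and sequence names (source, target, dwh_key, seq_dim) are illustrative assumptions; the S_T flag and CNT logic are taken from the slides.

```sql
-- Hypothetical single-statement SCD2 load using UNION ALL + GROUP BY
-- instead of a full outer join; the source is read only once.
MERGE INTO target t
USING (
  SELECT MAX(dwh_key) AS join_key,          -- set only for 'T' rows
         bk, c1, c2, MIN(s_t) AS s_t
  FROM (
    SELECT 'S' AS s_t, NULL AS dwh_key, bk, c1, c2
    FROM   source
    UNION ALL
    SELECT 'T' AS s_t, dwh_key, bk, c1, c2
    FROM   target
    WHERE  cur_version = 'Y'
  ) u
  GROUP BY bk, c1, c2                       -- NULL = NULL for GROUP BY
  HAVING COUNT(*) = 1                       -- keep only differences (CNT = 1)
) d
ON (t.dwh_key = d.join_key)
WHEN MATCHED THEN                           -- current version without an
  UPDATE SET t.valid_to    = TRUNC(SYSDATE) - 1,  -- identical source row:
             t.cur_version = 'N'                  -- close it
WHEN NOT MATCHED THEN                       -- source row without an identical
  INSERT (t.dwh_key, t.bk, t.c1, t.c2,     -- current version: insert it
          t.valid_from, t.cur_version)
  VALUES (seq_dim.NEXTVAL, d.bk, d.c1, d.c2, TRUNC(SYSDATE), 'Y');
```

Because each surviving group has exactly one row (HAVING COUNT(*) = 1), MIN(s_t) and MAX(dwh_key) simply pass the row's own values through.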
17. Use Cases and Performance

Full Data Load
Legacy: the full source data is joined with the Current Versions of the Target (Older Versions are carried along unchanged); the JOIN may be slow, the filter may be slow; partitioning?
New: Full Data UNION ALL Current Versions, then Group By; the Group By may be slow.
18. Use Cases and Performance

Delta Load
Legacy: the delta is joined with the Current Versions of the Target; the filter on current versions may be slow; partitioning?
New: Delta UNION ALL Current Versions, then Group By; the Group By may be slow.
19. Use Cases and Performance

Delta Load with pre-filter
Legacy: the delta is joined with the Current Versions, with a pre-filter Business_key IN … .
New: Delta UNION ALL Current Versions pre-filtered on the delta's business keys, then Group By, which is now fast.
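The pre-filter can be sketched like this (names are illustrative assumptions): only current versions whose business key actually occurs in the delta enter the UNION ALL, so the Group By works on a small data set.

```sql
-- Hypothetical pre-filtered variant for delta loads: the GROUP BY only sees
-- current versions of business keys that appear in the delta.
SELECT bk, c1, c2, MIN(s_t) AS s_t, COUNT(*) AS cnt
FROM (
  SELECT 'S' AS s_t, bk, c1, c2 FROM delta_source
  UNION ALL
  SELECT 'T' AS s_t, bk, c1, c2
  FROM   target
  WHERE  cur_version = 'Y'
  AND    bk IN (SELECT bk FROM delta_source)   -- the pre-filter
) u
GROUP BY bk, c1, c2;
```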
20. Use Cases and Performance
Data warehouse with Siebel CRM as a source
Order table S_ORDER: "only" 120 columns
Comparing the legacy approach vs. GROUP BY vs. analytic functions
Full staging table as a source vs. delta (with or without pre-filtering)
Approx. 6 million rows in the target table
Approx. 3 million rows in the full-load dataset
Approx. 3,000 rows in the delta-load dataset
21. Use Cases and Performance
Method                    | Delta Load, min | Full Load, min
Outer Join (legacy)       | 0:09            | 0:41
GROUP BY                  | 1:10            | 1:04
GROUP BY with pre-filter  | 0:04            | N/A
Analytic Function         | 2:12            | 4:52
Analytic with pre-filter  | 0:12            | N/A
23. Use Cases and Performance

Loading Dimensions from Core
Legacy: the Current Versions of the Core are joined with the Current Versions of the Dimension (Older Versions are carried along unchanged); the JOIN may be slow, the filter may be slow; partitioning?
New: Current Versions (Core) UNION ALL Current Versions (Dim), then Group By; the Group By may be slow.
24. Use Cases and Performance

Loading Dimensions from Core, where the source is a view
Legacy: the view's result is joined with the Current Versions of the Dimension; the JOIN may be slow, the filter may be slow; partitioning?
New: the view's Full Data UNION ALL Current Versions, then Group By; the Group By may be slow.
25. Use Cases and Performance
Loading of a dimension via a view
The view joins some "big" tables (50 GB, 40+ million rows)
and produces < 500 dimension records per day.
The loading time could be reduced by 45 percent (3 min 50 sec → 2 min).
26. Conclusion
It is simpler and faster in certain cases.
The source is queried only once, which can be significant if the source is a view.
The code can easily be generated.
It is simple to build even without generation (only a plain list of columns to copy and paste).
It is worth doing ad hoc testing with your own data.
Test it!
28. Trivadis @ DOAG 2017
#opencompany
Booth: 3rd floor, right by the escalator
We share our know-how!
Just drop by: live presentations and a document archive,
T-shirts, a prize draw and more.
We look forward to your visit.