Data Vault ReConnect Speed Presenting PM Part Three

Third set of 5x5 Speed Presenting Updates:

1) Research Data Platform - Marc Bouma
2) Same-As Struggles - Sander Robijns
3) Groups of Links - Kasper de Graaf
4) Ensemble Model & MPP - Juan-José van der Linden
5) Data Vault on SAP HANA - Remco Broekmans

Published in: Data & Analytics, Technology

Data Vault ReConnect Speed Presenting PM Part Three

  1. Presenter: Marc Bouma. Date: June 5, 2014. Company: UMC Utrecht. eMail: m.c.bouma@umcutrecht.nl
  2. Our Dot on the Horizon
     - Central point for delivering healthcare process data for medical research
     - Integrate various sources
     - Historize, trace and pseudonymize all data used
  3. Our Journey
     - Learning and adapting to Data Vault → not everybody is a modeler (Shu Ha Ri)
     - Script, code, build, try, test, throw away and start again
     - Testing overrated?
     - Architecture improvements
       → Performance issues SAS/Microsoft
       → Performance issues loading scripts
       → Automate DV load
     - From Chaos to SCRUM
  4. Our Obstacles
     - Registration for the healthcare process vs. usability for research
     - Questionnaires: sources or generic models?
     - Performance:
       → Do we really need all complete texts?
       → Do we really need 20 years of lab results?
     - The usual: conflicting interests, politics, etc.
  5. Our preliminary results
     - 2013: selecting 5 major studies as starting showcases proved difficult
     - 2014: had to choose 5 new showcases from 25 applicants
     - Started as Research Data Platform, now growing towards Enterprise Data Platform (including Education and BI)
     - Architecture now stable
  6. Lessons learned
     • Automate when possible
     • Invest in a team of skilled pioneers
     • Models rule everything
     • Adapt agility, teach agility
  7. Presenter: Sander Robijns. Date: June 5, 2014. Company: Estrenuo BVBA. eMail: sander.robijns@gmail.com. Twitter: @srobijns
  8. The Issue: No enterprise-wide business keys
  9. The Current Approach: Using recursive links on hubs to identify the same-as relationship
  10. The Struggle: Getting the facts reported under a single business key
  11. The Future Approach: Master Data Management will take away some of the struggles
  12. The Lesson Learned: Get the enterprise-wide business keys in place first using data governance
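One way to picture the struggle above is as a key-resolution step: the same-as pairs stored in the recursive link are collapsed into one representative business key, and facts are then reported under that key. The sketch below is only an illustration under assumptions: the function name `resolve_same_as`, the sample keys and the "smallest key wins" rule are all hypothetical and do not come from the presentation.

```python
def resolve_same_as(links):
    """Map every business key to one representative key (union-find)."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for left, right in links:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Assumption: the smaller key becomes the representative,
            # which keeps the outcome deterministic.
            keep, drop = sorted((root_l, root_r))
            parent[drop] = keep
    return {key: find(key) for key in parent}


# Hypothetical same-as link rows: (business key, same-as business key).
same_as_links = [("CRM-17", "ERP-0042"), ("ERP-0042", "WEB-9")]
# Hypothetical facts recorded under source-specific keys.
facts = [("CRM-17", 100), ("WEB-9", 50), ("ERP-0042", 25), ("CRM-99", 10)]

master = resolve_same_as(same_as_links)
totals = {}
for key, amount in facts:
    representative = master.get(key, key)  # keys without a same-as row stay as-is
    totals[representative] = totals.get(representative, 0) + amount
```

Chains of same-as rows (A same as B, B same as C) end up under one key, which is the part a single join on the link does not give you.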
  13. Presenter: Kasper de Graaf. Date: June 5, 2014. Company: Occurro. eMail: kasper@occurro.nl. Twitter: kdgraaf
  14. Groups of Links: context at hospital
      Imagine the following:
      • An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)
      • An operation is planned a couple of weeks in advance
      • Whenever the planning changes in the source, the complete group is sent to the EDW
  15. Groups of Links: the Data

      {Time}  operation_no  employee_no  role
      T=1     19354         John         OP1
      T=1     19354         Jane         OP2
      T=1     19354         Chris        ANA
      T=2     19354         John         OP1
      T=2     19354         Mary         ANA
      T=3     19354         Jane         OP1
      T=3     19354         Chris        ANA

      Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA)
  16. Groups of Links: the Problem
      Standard Data Vault loading routines cannot handle this situation:

      operation_no  employee_no  role  load_dts
      19354         John         OP1   T=1
      19354         Jane         OP2   T=1
      19354         Chris        ANA   T=1
      19354         Mary         ANA   T=2
      19354         Jane         OP1   T=3
  17. Groups of Links: the Problem
      End-dating the link (preferably with a validity satellite) cannot handle this problem either:

      operation_no  employee_no  role  load_dts  Active?
      19354         John         OP1   T=1       No (T=3)
      19354         Jane         OP2   T=1       Yes
      19354         Chris        ANA   T=1       No (T=3)
      19354         Mary         ANA   T=2       Yes
      19354         Jane         OP1   T=3       Yes

      BK of link used: operation_no + role
  18. Groups of Links: our solution
      1. Add a validity satellite to the link (for end-dating)
      2. Tell the metadata of the automation tool that this is a group validity satellite with BK=operation_no
      3. Whenever an existing operation_no is present in the staging layer, set all current links to Active=No
      4. Process as usual
      • Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time. There can therefore be no unique index on the BK of the validity satellite, and some cleaning up is required after loading
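Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration and not the code from the referenced solution: the function `load_group`, the dictionary layout and the `end_dts` column are assumptions, and instead of flipping a flag on an existing row the sketch appends a fresh validity row per load, so a returning member such as John/OP1 simply gets a closed row and a new active row.

```python
def load_group(validity_sat, staged_rows, load_dts):
    """Apply one staged batch to a (hypothetical) group validity satellite.

    validity_sat: list of dict rows, modified in place.
    staged_rows:  list of (operation_no, employee_no, role) tuples.
    """
    staged_ops = {op for op, _emp, _role in staged_rows}

    # Step 3: close every current row of each group present in staging.
    for row in validity_sat:
        if row["active"] and row["operation_no"] in staged_ops:
            row["active"] = False
            row["end_dts"] = load_dts

    # Step 4: process as usual, i.e. add the staged group as active rows.
    for op, emp, role in staged_rows:
        validity_sat.append({
            "operation_no": op, "employee_no": emp, "role": role,
            "load_dts": load_dts, "end_dts": None, "active": True,
        })
    return validity_sat


sat = []
load_group(sat, [(19354, "John", "OP1"), (19354, "Jane", "OP2"), (19354, "Chris", "ANA")], 1)
load_group(sat, [(19354, "John", "OP1"), (19354, "Mary", "ANA")], 2)
load_group(sat, [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")], 3)

active = {(r["employee_no"], r["role"]) for r in sat if r["active"]}
```

With the three loads from the data slide, only the members of the last group remain active, which matches the note that the operation is executed by Jane (OP1) and Chris (ANA).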
  19. Groups of Links: special thanks to …
      - St. Antonius Hospital (for having the problem)
      - Edwin Weber (for coding the solution)
      Get your copy of the solution: http://sourceforge.net/projects/pdidatavaultfw/
  20. Presenter: Juan-José van der Linden. Date: June 5, 2014. Note: DV, MPP. Company: QOSQO. eMail: juanjose.vanderlinden@qosqo.nl. Twitter: @delostilos
  21. SMP => MPP => AMPP
      - SMP: Symmetric Multiprocessing
      - MPP: Massively Parallel Processing
      - AMPP: Asymmetric MPP (SMP + MPP)
  22. Primary key => distribution key
      → hub -< satellite join
      - data redistribution
      - join local in parallel

      Hub:
      BK           SID
      Ensemble     1
      Dimensional  2

      Satellite:
      SID  LDTS        INFO
      1    2001-01-01  My first DV
      1    2014-06-05  DV Masters
      2    1997-08-02  DM manifesto

      (Diagram: rows spread over Node 1 and Node 2)
  23. Hub SID => distribution key
      → hub -< satellite join
      - join local in parallel

      Hub:
      BK           SID
      Ensemble     1
      Dimensional  2

      Satellite:
      SID  LDTS        INFO
      1    2001-01-01  First DV
      1    2014-06-05  DV Masters
      2    1997-08-02  DM manifesto

      (Diagram: rows spread over Node 1 and Node 2)
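The difference between the two distribution choices above can be sketched by counting how many satellite rows sit on a different node than their hub row. This is a toy model under assumptions: two nodes, `zlib.crc32` as a stand-in for whatever hash the MPP database really uses, and the satellite primary key taken to be (SID, LDTS). None of the names below come from the slides.

```python
import zlib

NODE_COUNT = 2  # assumption: a two-node cluster, as drawn on the slides


def node_for(dist_key):
    """Toy placement rule: hash the distribution key, take it modulo the node count."""
    return zlib.crc32(repr(dist_key).encode("utf-8")) % NODE_COUNT


hub_rows = [{"bk": "Ensemble", "sid": 1}, {"bk": "Dimensional", "sid": 2}]
sat_rows = [
    {"sid": 1, "ldts": "2001-01-01", "info": "First DV"},
    {"sid": 1, "ldts": "2014-06-05", "info": "DV Masters"},
    {"sid": 2, "ldts": "1997-08-02", "info": "DM manifesto"},
]


def rows_to_move(sat_key):
    """Count satellite rows that are not on the same node as their hub row."""
    hub_node = {h["sid"]: node_for(h["sid"]) for h in hub_rows}
    return sum(1 for s in sat_rows if node_for(sat_key(s)) != hub_node[s["sid"]])


# Satellite distributed on its full primary key: rows of one SID may scatter.
moved_on_pk = rows_to_move(lambda s: (s["sid"], s["ldts"]))
# Satellite distributed on the hub SID: every row lands next to its hub row.
moved_on_sid = rows_to_move(lambda s: s["sid"])
```

`moved_on_sid` is always zero because equal keys hash to the same node; `moved_on_pk` depends on the hash and is the number of rows a join would have to redistribute.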
  24. Link SID => distribution key
      → Default L_SID, 1:N & N:M
      - data redistribution
      - join local in parallel

      Link:
      H_MID  H_SID  L_SID
      1      A      1
      1      B      2

      Link satellite:
      L_SID  LDTS        LDTS_END    CURRENT
      1      2001-01-01  2006-01-01  N
      1      2014-06-05  9999-12-31  Y
      2      2006-01-01  2014-06-05  N

      1:N => H_MID on link satellite
      - join local in parallel
      H_MID is the ensemble identifier!

      Link:
      H_MID  H_SID  L_SID
      1      A      1
      1      B      2

      Link satellite:
      L_SID  H_MID  H_SID  LDTS        LDTS_END
      1      1      A      2001-01-01  2006-01-01
      1      1      B      2014-06-05  9999-12-31
      2      1      A      2006-01-01  2014-06-05

      (Diagram: rows spread over Node 1 and Node 2)
  25. Use the ensemble identifier if possible!
      (Diagram of an Ensemble: hub and satellite keyed on H_SID with LDTS and INFO; link and link satellite with H_SID, H_MID, "L_SID ?", "H_SID ?", LDTS and INFO)
      Distribute data efficiently to ensure good performance in an MPP database.
      - If the distribution is uneven, one node may become a bottleneck for the whole execution
      Try to minimize data movement between nodes
      - Data redistribution may occur when joining tables
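The bottleneck remark above can be made concrete with a simple skew figure: the busiest node's row count divided by the average rows per node. The sketch is an assumption-laden toy: `skew`, the modulo placement rule and the sample keys are hypothetical and do not describe any particular MPP product.

```python
from collections import Counter


def skew(dist_keys, node_count):
    """Return max rows on a node divided by the mean rows per node (1.0 = even)."""
    counts = Counter(key % node_count for key in dist_keys)  # toy placement rule
    per_node = [counts.get(node, 0) for node in range(node_count)]
    return max(per_node) / (sum(per_node) / node_count)


# Hypothetical distribution key values, one per row.
even_keys = [1, 2, 3, 4, 5, 6, 7, 8]      # many distinct values
skewed_keys = [1, 1, 1, 1, 1, 1, 2, 3]    # one value dominates

even_skew = skew(even_keys, node_count=4)
uneven_skew = skew(skewed_keys, node_count=4)
```

A value near 1.0 means the nodes share the work; a value of 3.0 means one node holds three times its fair share, so a candidate distribution key (the ensemble identifier included) is worth checking this way before committing to it.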
  26. Presenter: Remco Broekmans. Date: June 5, 2014. Note: Example for ReConnect. Company: Coarem. eMail: Remco@Coarem.nl. Twitter: RemcoBroekmans
  27. SAP #Hana is a column store #database which brings #efficiency in storage and access - #in-memory.
  28. Given its technical #architecture, SAP #Hana seems to benefit from using 1 broad Satellite per #Hub - #benefit: no need for #PIT, fewer tables
  29. Splitting #Sats by #rate-of-change is as efficient in storage as the column store. #multiple Sats are preferable if data comes from multiple sources (#write efficiency)
  30. A #referential join will only perform the join if data from the joined tables is used. Create 1 #PIT per #Hub (not as #SQL view)
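What a materialized PIT per hub holds can be sketched independently of HANA: for every hub key and snapshot date, the latest load date of each satellite that is not after the snapshot. The sketch below only illustrates that lookup; `build_pit`, the satellite names and the dates are hypothetical, and on the platform itself this would be a table filled by a load job rather than Python.

```python
from bisect import bisect_right


def build_pit(sat_load_dates, snapshots):
    """Build point-in-time rows.

    sat_load_dates: {satellite name: {hub sid: sorted list of ISO load dates}}
    snapshots:      list of ISO snapshot dates
    """
    hub_sids = sorted({sid for dates in sat_load_dates.values() for sid in dates})
    pit = []
    for sid in hub_sids:
        for snapshot in snapshots:
            row = {"sid": sid, "snapshot": snapshot}
            for sat_name, dates in sat_load_dates.items():
                ldts = dates.get(sid, [])
                # ISO dates sort as strings, so bisect finds the last LDTS <= snapshot.
                pos = bisect_right(ldts, snapshot)
                row[sat_name] = ldts[pos - 1] if pos else None
            pit.append(row)
    return pit


# Hypothetical satellites on one hub.
sat_load_dates = {
    "sat_a": {1: ["2001-01-01", "2014-06-05"]},
    "sat_b": {1: ["2006-01-01"]},
}
pit_rows = build_pit(sat_load_dates, ["2005-12-31", "2014-06-05"])
```

Each PIT row then gives the exact (SID, LDTS) pairs to join on, so queries use equality joins instead of date-range lookups.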
  31. #Lesson: DV is an #efficient way of storing data
      #Lesson: #SQL views can’t be read by Hana Studio
      #Lesson: #Hana is still evolving