TDWI Agile Data Warehouse - DV, what is the buzz about


This is the presentation I gave at TDWI EU in Munich on June 22nd, 2012. It is about a robust, agile and reliable way of deploying data warehouse environments. The majority of data warehouses in the Netherlands are now Data Vault based, which has instigated a wave of innovation among engineers and software vendors pursuing model-driven development based on pattern-based ETL, standardized modeling and a certain architectural style.

  • Nice Slide Show. Thanks for clarity of vision. By the way, folks can also be trained and become certified directly from me, by contacting me at: http://danLinstedt.com

    Or they can see more at http://LearnDataVault.com/

  1. Agile Data Warehousing: Data Vault, what is the buzz about? TDWI München, June 18, 2012. Ronald Damhof R.D.Damhof
  2. “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.” Agile Manifesto, 2001: Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas
  3. ‘Calculating risk’; Source; ‘Yield modules’; Source; ‘Customer segmentation’; ‘Semantic gap’
  4. Everybody on their own
     • Everybody mines their own data
     • Everybody enriches their own data
     • Everybody uses their own data
     • User = developer, with his self-made tools
     • Data quality is determined by the individual
     • It's a grind: limited reusability
     • Lead times unpredictable
     • No management
  5. Let's ‘order’ an information product and hire a master/expert
     • Separation between user and developer
     • The developer/expert mines the data
     • The information product is custom made
     • Data quality mostly depends on the developer/expert
     • Lead times unpredictable
     • Still not much reusability
  6. A central department that knows what information you need and assembles information products, ready for you to use
     • ‘I know what you want’ – black
     • Efficiency is the name of the game
     • At least I got something, but it does not comply, even remotely, with my needs
     • Even worse: the guild days are still there; the expert is now submerged, but still needed to get the data you actually need
     • Introduction of management: you want something? Please apply in triplicate…
  7. Stephen Denning (2011), Radical Management
     • Create information products the moment they are asked for
     • Against quality criteria that are in line with the customer's expectations
     • Empower the customer with skills and facilities to be more self-sufficient
     • Minimize ‘data’ stock as much as possible
     • Embrace new wishes and changes required by the customer
     • The customer is the most important part of the production process
  8. A modern data management environment: the ‘Supermarket’, the ‘Restaurant’, the ‘Do-it-yourself buffet’
  9. (no text)
  10. Push versus pull
     • Push characteristics
       - Mass production
       - Known specifications, operational definitions, standards
       - Repeatable, predictable, even better: uniform process
       - Part of the system that needs statistical control
       - Inventory allowed/necessary
       - Supply driven
       - Reliability over flexibility
     • Pull characteristics
       - Just in time
       - Demand driven
       - Build to order
       - Preferably no inventory
       - Flexibility over reliability
  11. Back to the issue at hand…
     • What: the ‘production process of data’
     • Where: coordination, local versus central
     • How: system engineering, systematic vs. opportunistic
     • What principles guide us: leading principles
  12. Local vs. central deployment of the information delivery process (diagram labels)
     • Recipient; End-user (local); Data function service
     • Information delivery process, top to bottom:
       - 4. Generate information products
       - 3. Enrich and cleanse data
       - 2. Register & standardize
       - 1. Get the raw, uncut data
     • Generic process (central); Data sources (internal & external)
  13. System Engineering: Systematic vs. Opportunistic (diagram labels, from manoeuvrability to sustainability)
     • Manoeuvrability (opportunistic approach)
     • Self-service development: ad-hoc development process; developer = user; self-sufficient / great degree of freedom; very broad tasks
     • Delegated development: lightweight development process; minimum of specialisation / distinction of roles; self-sufficient / limited freedom
     • IT development: development line discipline (OTAP, i.e. DTAP); developers at a distance from users; mutually dependent / within frameworks; heavy separation of function
     • Sustainability (systematic approach)
  14. Leading principles: Compliant, Adaptable, Sustainable, Decoupled, Centralized, Standardized, Effective, Industrialized
  15. Company xxx data management domain (diagram labels; areas numbered 1 to 4)
     • Sources; Source store; Enterprise Data Warehouse; Business View
     • BI apps: Reports, Analysis, Ad-hoc
     • Data, ‘What’; Function, ‘How’; ‘Where’, ‘Whom’
  16. Source to product; Sourcestore to product; Sourcestore to BV; EDW (DV). Principles: Adaptable, Sustainable, Compliant, Decoupled, Effective, Standardized, Centralized
  17. Company xxx data warehouse & Business Intelligence domain (diagram labels; areas numbered 1 to 4)
     • Sources; Source store; Enterprise Data Warehouse; Business View, Data feeds
     • BI apps: Reports, Analysis, Ad-hoc
     • Data, ‘What’; Function, ‘How’; ‘Where’, ‘Whom’
  18. Why DV? (diagram labels)
     • Administrative process; Information Delivery Process; Decision & control; PDCA
     • Process steps: Attain; Register & Standardize; Enrich; Generate; Distribute
     • Systems (internal & external); Staging; DV based Data Warehouse; Business rules; Control / Metadata
     • Push; Pull; Data products; Information products; Data & Information recipients
     • Compliance reporting; Risk Management; Performance Management; Supply chain optimization; Fraud detection; Market basket analysis
  19. Automation of Data Vault implementations (taken from Dan Linstedt's blog post: http://danlinstedt.com/datavaultcat/code-generation-for-data-vault-not-as-easy-as-you-think/)
     • Metamodel driven automation
       - Models (process, rules and data) determine the metadata; the metadata determines the automation artifacts
       - Aim is to be 100% declarative
       - Not everything can be generated; specific, tailored metadata will remain necessary
     • Metadata driven automation
       - Inputs: source model(s), target model, template design, naming conventions
       - Advanced inputs: normalization preferences, ontologies
     • Template driven automation
       - In its most basic form: documentation describing a pattern
       - More advanced: generating XML code for 2nd-generation ETL tooling
       - Example: http://www.grundsatzlich-it.nl/bi-tools-templator.html
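As a rough illustration of the template driven level on slide 19, the sketch below renders one load pattern from a few metadata entries. It is only a sketch under assumptions: the template text, the metadata keys (`hub_table`, `business_key`, `source_table`, `source_key`) and all table names are hypothetical and come neither from the slides nor from any tool mentioned there. It uses Python's standard `string.Template`.

```python
from string import Template

# Hypothetical hub-load pattern; the ${...} placeholders are filled from metadata.
# The :load_dts / :record_src markers are left in place as bind parameters.
HUB_LOAD_TEMPLATE = Template(
    "INSERT INTO ${hub_table} (${business_key}, load_dts, record_src)\n"
    "SELECT DISTINCT source.${source_key}, :load_dts, :record_src\n"
    "FROM ${source_table} AS source\n"
    "WHERE NOT EXISTS (\n"
    "    SELECT 1 FROM ${hub_table} AS hub\n"
    "    WHERE hub.${business_key} = source.${source_key}\n"
    ")"
)

# Hypothetical metadata: one entry per hub (illustrative names only).
HUB_METADATA = [
    {"hub_table": "customer_hub", "business_key": "customer_no",
     "source_table": "source_customer", "source_key": "customer_no"},
    {"hub_table": "contact_hub", "business_key": "contact_no",
     "source_table": "source_contact", "source_key": "contact_no"},
]


def generate_hub_load(meta):
    """Render the hub-load statement for one metadata entry.

    Template.substitute raises KeyError when a placeholder has no value,
    so incomplete metadata fails loudly instead of producing broken SQL.
    """
    return HUB_LOAD_TEMPLATE.substitute(meta)


if __name__ == "__main__":
    for meta in HUB_METADATA:
        print(generate_hub_load(meta))
        print()
```

The point of the sketch is the division of labour: the pattern is written once, and adding a hub means adding a metadata entry rather than writing new load code.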
  20. My PoV about (Data Vault) automation tooling
     • Generation is an aid, not a goal in itself. Do not bend the principles to fit the tool. Look for decoupling.
     • Truly understand the mechanics: handcraft it first! Invest in proper education and learning. Invest in getting-ready time. Involve your customers from the start.
     • PoC, PoC, PoC
     • Deliver, Deliver, Deliver
  21. Agility Data Vault (1): Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words: ‘Divide and Conquer to beat the Size / Complexity Dynamic’
  22. Agility Data Vault (2): Why is it that you can re-engineer your existing model and guarantee that the changes remain local? Something that is hugely beneficial in data warehouses that, by definition, grow over time.
  23. Agility Data Vault (3): Why is it that, as your (Data Vault based) data warehouse grows, your costs grow ‘merely’ in linear fashion initially, and as you approach the end state marginal growth in cost decreases exponentially?
  24. Data Vault as such is not agile; it is the development process that needs to be agile. DV merely supports the agile development process. “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.” Agile Manifesto, 2001: Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas
  25. Data Model Time Line: Historic Overview © (Linstedt, Graziano, Hultgren, The New Business Supermodel, The Business of Data Vault Modeling, 2008, p. 36)
     • Created by Dan Linstedt
     • Released in 2000
     • Formally introduced in the Netherlands in 2007
     • First DV book: The Business of Data Vault Modeling, 2008
     • First (Dutch) user group in 2010
     • Technical book from Dan Linstedt in 2011
  26. Application Architecture
  27. Top Down Approach
  28. Bottom Up Approach
  29. Bottom Up Approach
  30. Bottom Up Approach
  31. Bottom Up Approach
  32. Irony
  33. Hybrid Approach (Data Vault)
  34. (no text)
  35. Kimball or Inmon ETL versus ETL/Load Architecture (Pictures: Genesee Academy ©)
     • Kimball or Inmon ETL
       - Complex ETL
       - Truth oriented
       - Business rules before the EDW
     • ETL/Load Architecture
       - 100% of the data (within scope) 100% of the time
       - Source driven / auditable: “fact oriented”
       - Template/metadata driven
       - No business rules
  36. Classic Data Vault Application Architecture (diagram labels)
     • Business Transaction System (two shown); Staging; Data Vault; Datasets Out; Generic Business Rules; Rule Vault
     • Structure transformation; Business rule execution; Hub = business keys; Structure and value transformation
     • Principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized (two marked ‘?’)
  37. Data Vault Application Architecture
     • Central EDW
     • Business rules downstream
     • Incremental / non-destructive loading
     • 100% of the data (within scope) 100% of the time
     • Auditable / partly source driven
  38. Modeling
  39. (no text)
  40. (no text)
  41. (no text)
  42. (Pictures: Genesee Academy ©)
  43. (Pictures: Genesee Academy ©)
  44. (Pictures: Genesee Academy ©)
  45. (Pictures: Genesee Academy ©)
  46. Data Vault Constructs (Pictures: Genesee Academy ©)
  47. Data Vault Constructs (Pictures: Genesee Academy ©)
  48. Data Vault Constructs (Pictures: Genesee Academy ©)
  49. Core Components
  50. Data Vault Core Components (Pictures: Genesee Academy ©)
  51. Data Vault Core Components (Pictures: Genesee Academy ©)
  52. Hubs (Pictures: Genesee Academy ©)
  53. Hubs (Pictures: Genesee Academy ©)
  54. Hubs (Pictures: Genesee Academy ©)
  55. Satellites (Pictures: Genesee Academy ©)
  56. Satellites (Pictures: Genesee Academy ©)
  57. Links (Pictures: Genesee Academy ©)
  58. Links (Pictures: Genesee Academy ©)
  59. Loading
  60. HUB load (Pictures: Genesee Academy ©)
  61. HUB load (Pictures: Genesee Academy ©)
       INSERT INTO customer_hub (customer#, load_dts, record_src)
       SELECT source.customer#, @load_dts, @record_src
       FROM source_customer AS source
       WHERE NOT EXISTS (SELECT * FROM customer_hub AS hub
                         WHERE hub.customer# = source.customer#)
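The hub load on slide 61 can be tried out with a small, self-contained sketch. This is an illustration under assumptions, not the author's implementation: it runs on an in-memory SQLite database, the table definitions are invented for the example, the column `customer#` is renamed to `customer_no` (SQLite does not accept `#` in unquoted identifiers), and the `@load_dts` / `@record_src` variables become named bind parameters. `DISTINCT` is added so that duplicate source rows yield a single hub row.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a source table and a hub (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_customer (customer_no TEXT, cust_name TEXT);
        CREATE TABLE customer_hub (
            id          INTEGER PRIMARY KEY,   -- surrogate key
            customer_no TEXT NOT NULL UNIQUE,  -- business key
            load_dts    TEXT NOT NULL,
            record_src  TEXT NOT NULL
        );
    """)
    return conn


# Insert only business keys that the hub does not know yet.
HUB_LOAD = """
    INSERT INTO customer_hub (customer_no, load_dts, record_src)
    SELECT DISTINCT source.customer_no, :load_dts, :record_src
    FROM source_customer AS source
    WHERE NOT EXISTS (
        SELECT 1 FROM customer_hub AS hub
        WHERE hub.customer_no = source.customer_no
    )
"""


def load_hub(conn, load_dts, record_src):
    """Run the hub load once and return the number of hub rows inserted."""
    before = conn.total_changes
    conn.execute(HUB_LOAD, {"load_dts": load_dts, "record_src": record_src})
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    conn = make_db()
    conn.executemany(
        "INSERT INTO source_customer VALUES (?, ?)",
        [("C1", "Ann"), ("C2", "Bob"), ("C1", "Ann")],
    )
    print("first run inserted:", load_hub(conn, "2012-06-18", "crm"))
    print("second run inserted:", load_hub(conn, "2012-06-19", "crm"))
```

Running the load a second time against the same source inserts nothing, which is what makes the pattern incremental and non-destructive: existing hub rows, including their original load timestamp, are never touched.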
  62. Link Load: Loading a Link (Pictures: Genesee Academy ©)
  63. Link Load (Pictures: Genesee Academy ©)
       INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
       SELECT cust.id, contact.id, @load_dts, @record_src
       FROM source_table AS source
       INNER JOIN contact_hub AS contact ON contact.contact# = source.contact#
       INNER JOIN customer_hub AS cust ON cust.customer# = source.customer#
       WHERE NOT EXISTS (SELECT * FROM custcontact_link AS link
                         WHERE link.contact_id = contact.id AND link.cust_id = cust.id)
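A runnable sketch of the link load on slide 63 follows, again under assumptions: in-memory SQLite, invented table definitions, `customer#` / `contact#` renamed to `customer_no` / `contact_no`, and named bind parameters in place of the `@` variables. The business keys in the source are resolved to hub surrogate keys through the two joins, and only combinations not yet present in the link are inserted.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hubs, a link and a source table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_table (customer_no TEXT, contact_no TEXT);
        CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE);
        CREATE TABLE contact_hub  (id INTEGER PRIMARY KEY, contact_no  TEXT UNIQUE);
        CREATE TABLE custcontact_link (
            id         INTEGER PRIMARY KEY,
            cust_id    INTEGER NOT NULL REFERENCES customer_hub (id),
            contact_id INTEGER NOT NULL REFERENCES contact_hub (id),
            load_dts   TEXT NOT NULL,
            record_src TEXT NOT NULL,
            UNIQUE (cust_id, contact_id)
        );
    """)
    return conn


# Resolve business keys to hub surrogate keys, then insert unseen pairs only.
LINK_LOAD = """
    INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
    SELECT DISTINCT cust.id, contact.id, :load_dts, :record_src
    FROM source_table AS source
    INNER JOIN contact_hub  AS contact ON contact.contact_no = source.contact_no
    INNER JOIN customer_hub AS cust    ON cust.customer_no   = source.customer_no
    WHERE NOT EXISTS (
        SELECT 1 FROM custcontact_link AS link
        WHERE link.contact_id = contact.id AND link.cust_id = cust.id
    )
"""


def load_link(conn, load_dts, record_src):
    """Run the link load once and return the number of link rows inserted."""
    before = conn.total_changes
    conn.execute(LINK_LOAD, {"load_dts": load_dts, "record_src": record_src})
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    conn = make_db()
    conn.executemany("INSERT INTO customer_hub (customer_no) VALUES (?)", [("C1",), ("C2",)])
    conn.execute("INSERT INTO contact_hub (contact_no) VALUES ('K1')")
    conn.executemany(
        "INSERT INTO source_table VALUES (?, ?)",
        [("C1", "K1"), ("C2", "K1"), ("C1", "K1")],
    )
    print("first run inserted:", load_link(conn, "2012-06-18", "crm"))
    print("second run inserted:", load_link(conn, "2012-06-19", "crm"))
```

Because the joins are inner joins, a source row whose business key is missing from either hub is skipped, so in this sketch the hubs have to be loaded before the link.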
  64. Satellite Load: Loading a Satellite (Pictures: Genesee Academy ©)
  65. Satellite Load (Pictures: Genesee Academy ©)
       INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
       SELECT hub.id, @load_dts, source.cust_name, @record_src
       FROM source_customer AS source
       INNER JOIN customer_hub AS hub ON hub.customer# = source.customer#
       INNER JOIN customer_sat AS sat ON sat.hub_id = hub.id
         AND sat.load_dts = (SELECT MAX(s2.load_dts) FROM customer_sat AS s2
                             WHERE s2.hub_id = hub.id)   -- sat “is most recent”
         AND sat.name <> source.cust_name
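The satellite load on slide 65 is a delta detection: a new satellite row is written only when the incoming attribute differs from the most recent row for that hub key. The sketch below is one possible reading of it, under assumptions: in-memory SQLite, invented table definitions, `customer#` renamed to `customer_no`, named bind parameters, and "is most recent" expressed as the row with the highest `load_dts`. Unlike the slide, it uses a `LEFT JOIN` so that a hub key without any satellite row yet also gets its first row; that is a choice made for this example. NULL names are not handled.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a source table, a hub and a satellite (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source_customer (customer_no TEXT, cust_name TEXT);
        CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE);
        CREATE TABLE customer_sat (
            hub_id     INTEGER NOT NULL REFERENCES customer_hub (id),
            load_dts   TEXT NOT NULL,
            name       TEXT,
            record_src TEXT NOT NULL,
            PRIMARY KEY (hub_id, load_dts)
        );
    """)
    return conn


# Insert a row when the hub key has no satellite row yet,
# or when the name differs from the most recent satellite row.
SAT_LOAD = """
    INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
    SELECT hub.id, :load_dts, source.cust_name, :record_src
    FROM source_customer AS source
    INNER JOIN customer_hub AS hub ON hub.customer_no = source.customer_no
    LEFT JOIN customer_sat AS sat
           ON sat.hub_id = hub.id
          AND sat.load_dts = (SELECT MAX(s2.load_dts)
                              FROM customer_sat AS s2
                              WHERE s2.hub_id = hub.id)
    WHERE sat.hub_id IS NULL
       OR sat.name <> source.cust_name
"""


def load_sat(conn, load_dts, record_src):
    """Run the satellite load once and return the number of rows inserted."""
    before = conn.total_changes
    conn.execute(SAT_LOAD, {"load_dts": load_dts, "record_src": record_src})
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    conn = make_db()
    conn.executemany("INSERT INTO customer_hub (customer_no) VALUES (?)", [("C1",), ("C2",)])
    conn.executemany("INSERT INTO source_customer VALUES (?, ?)", [("C1", "Ann"), ("C2", "Bob")])
    print("initial load:", load_sat(conn, "2012-06-18", "crm"))
    conn.execute("UPDATE source_customer SET cust_name = 'Anna' WHERE customer_no = 'C1'")
    print("after a change:", load_sat(conn, "2012-06-19", "crm"))
```

Existing satellite rows are never updated; a change in the source adds a new row with a later `load_dts`, so the full history stays available for auditing.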
  66. Data Vault Loading Paradigm (Pictures: Genesee Academy ©)
  67. Top 10 Rules for Data Vault Modeling (Pictures: Genesee Academy ©)
  68. Agility Data Vault - recap (1)
     • Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words: ‘Divide and Conquer to beat the Size / Complexity Dynamic’
     • Why is it that you can re-engineer your existing model and guarantee that the changes remain local? Something that is hugely beneficial in data warehouses that, by definition, grow over time.
     • Why is it that, as your (Data Vault based) data warehouse grows, your costs grow ‘merely’ in linear fashion initially, and as you approach the end state marginal growth in cost decreases exponentially?
  69. Agility Data Vault - recap (2): Remember the Push characteristics
     • Mass production: Data Vault
     • Known specifications, operational definitions, standards: Data Vault
     • Repeatable, predictable, even better: uniform process: Data Vault
     • Part of the system that needs statistical control: Data Vault
     • Inventory allowed/necessary: Data Vault
     • Mainly supply driven: Data Vault
     • Reliability over flexibility: Data Vault
     Automation of a Data Vault ‘production process’ is just common sense
  70. Bonus Slides: Forks and mutations in DV ‘evolution’
  71. Type 1 - Classic Data Vault (diagram labels)
     • Business Transaction System (two shown); Staging; Data Vault; Datasets Out; Generic Business Rules; Rule Vault
     • Structure transformation; Business rule execution; Hub = business keys; Structure and value transformation
     • Principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized (two marked ‘?’)
  72. Type 2 - Source Data Vault (diagram labels)
     • Business Transaction System (two shown); Staging Vault (two shown); Business Data Vault; Data Marts
     • Structure transformation; Business rule execution; No integration, Hub = surrogate keys; Integration; Persisting staging in DV format; DV modelled
     • Principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized (three marked ‘?’)
  73. (Diagram labels) Source (four shown); Staging DV (two shown); Business DV; 100% semantic gap (shown twice); Still the source; Integration, cleansing, consolidation; Business rule execution upstream??; DV modelled
  74. (Diagram labels) Source (several shown); Source Staging DV (two shown); Business DV; Data Warehouse; 100% semantic gap (shown twice); Still the source; Integration, cleansing, consolidation; Business rule execution upstream??; DV modelled
  75. Wanna know more?
     • Training and certification: www.geneseeacademy.com
     • Books: ‘Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault’, D. Linstedt / K. Graziano
     • LinkedIn: Data Vault Discussions (approx. 800 members)
     • Niche non-commercial conferences: www.dwhautomation.com
     • Many blogs, articles and presentations on the World Wide Web
     • The best way to learn: try it, make some code, experience, engage
  76. Thank You. Drs. Ronald D. Damhof
     • Blog: http://prudenza.typepad.com/ and http://www.b-eye-network.com/blogs/damhof/
     • LinkedIn: http://nl.linkedin.com/in/ronalddamhof
     • Email: ronald.damhof@prudenza.nl
     • Twitter: RonaldDamhof
     • Skype: Ronald.Damhof
     • Mobile: +31(0)6 269 67 184
     • Others: Information Quality Certified Professional (IQCP); Data Vault Certified Grand Master; Certified Scrum Master; Member of the Boulder BI Brain Trust (#BBBT)
     Ronald Damhof is an independent practitioner in the field of data management and decision support. He graduated in Economics in 1995. Since 1995 he has worked as a practitioner in the field of Information Management with a focus on decision support and data management, trying hard to enhance the rigor and relevance in these fields by combining scientific research with the everyday challenges of the practitioner. Ronald is mainly hired by customers in the role of business/IT architect, auditor, coach & trainer. He blogs on B-Eye-Network.com as well as on his own blog, is a member of the prestigious BBBT, wrote several articles regarding decision support architectures and is a researcher in the field of Information Management.
     Although Ronald likes to work with theoretically grounded research and proven practices, he is not a white paper architect; "put your money where your mouth is" is his motto. He likes to see architectures live in enterprises, not just write about them. In most organizations his role often extends beyond architecture. In truly agile spirit the roles he plays depend on the context of the client; he can be a missionary (selling the value), a project manager (getting it done), a scrum master (removing impediments), a specialist (educating hardware peeps, data architects, data logistics etc.) or a leader.
