Data Vault Consortium: A Mathematical Perspective of Data Vault

Doug Needham's Data Vault presentation discussing a mathematical interpretation of links and its application to business modeling, along with volumetrics and some general principles for Data Vault best practices.

  • Kepler was the first Data Scientist because Brahe had collected and stored many years of observations (data), yet he had no way of interpreting them accurately until Kepler studied the data and came up with his laws of planetary motion. Kepler came up with an accurate model that not only explained Brahe's observations, but also predicted future ones.

    1. Data Vault Consortium Presentation by @dougneedham
    2. Who are we? CLEAR MEASURES offers a range of services and solutions designed to satisfy needs shared by firms large and small, along with the skills required to make your customized goals a reality. If your goals aren't yet defined, CLEAR MEASURES can help you define a strategy for managing, analyzing, or visualizing your data in ways that make your path easier to identify.
       • Analytics and Intelligence
       • Data Integration
       • Enterprise Architecture
       • Strategic & Project Management
       • Cloud Infrastructure
       • Database Administration
       • System Administration
       • Technology Services
    3. Who are we? All our customers have access to:
       • Capacity: Pay on demand, in 15-minute increments, not the half-day or full day you pay for a contractor.
       • Coverage: True 24x7 coverage, with in-facility staff directed from our Global Operations Center in Covington, Kentucky.
       • Cost: CLEAR MEASURES can help your team with effective costs from Rural Sourcing and Global Sourcing locations. CLEAR MEASURES' proprietary ONguard system allows for complete direction of a global workforce with U.S. oversight, focused on efficiency and repeatability.
    4. Who am I?
       • The Data Guy.
       • First job was as a Marine Corps DBA, supporting the entire Marine Corps at the main site for Systems Software Evaluation.
       • First 10 years of my career: DBA.
       • 20 years of data management.
       • Most recent decade building analytical systems.
       • Pentaho, Informatica, Business Objects, Cognos, Oracle, SQL Server, MySQL.
       • Cloud-based analytics with a large healthcare information company on Cassandra.
       • Trying to figure out where Data Science and Big Data fit together with the Data Warehouse.
    5. This is the wrong time for Data Science
       • It is also the wrong time for a Data Warehouse, Business Intelligence platform, Data Vault, Data Mining, Big Data, or any other predictive, machine-learning, analytics platform.
       • Do these projects when things are going well. Anticipate what could happen to prevent things from going poorly.
    6. When is the right time?
       • If you have multiple systems you need to integrate.
       • As you lay the foundation for Self-Service Business Intelligence.
       • To lay the foundation of a Data-as-a-Service application.
       • If you are combining data from many applications, systems, or business units, or providing data to many applications, systems, or business units that each want slightly different standard feeds.
    7. Data Science and the Data Warehouse
       • "Data Science is the application of statistical and mathematical rigor to business data." - Doug
       • I have heard it said that 80% of data science is data munging.
       • Data Vault is "100% of the data 100% of the time." - Dan L.
       • What does this mean? What does the data say? Where did the data come from? What happened to the data from the time it was captured until the time it was presented?
       • Models, statistical models specifically, are the core of Data Science.
       • Looking forward to hearing more about DV 2.0 and how it supports polyglot persistence.
    8. Data Science and the Data Warehouse
       • By the way, we have been doing this for a while.
       • Some data is predictive; all data is instructive.
       • Being able to create a statistical model, quickly run lots of data through it, observe the actual results, and compare these with the predicted results allows us to refine the statistical model.
       • Are business analysts Data Scientists? What is the main difference between the two?
       • Which one "needs" more data? Which one can actually use more data?
    9. Quick Trivia
       • Who was one of the first Data Scientists?
       • Now let's talk about storing all of this data we collect, and see if there is anything new in our understanding of the structures we are all familiar with.
    10. Data Vault
       • The integration layer of an overall data warehouse strategy.
       • There are other areas of data warehousing: Presentation, Near-Line, Archive.
       • Applications within the enterprise are the data capture mechanisms.
       • I think everyone is trying to find the best way to leverage a "Big Data" platform into the world of the Data Warehouse.
       • Data Vault is the mechanism that allows a data warehouse to evolve over time.
       • Simple, straightforward, repeatable, auditable, resilient.
    11. Modeling
       • HUBs - Business Keys
       • LNKs - Relationships
       • SATs - Contextual data
       • There are other entities in the Data Vault method; however, these are the primary entities. Everything else is functionally dependent on some combination of the above.
       • Notice the colors: Hubs one color, Links another, Sats a third. Anything else should be a separate color.
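The three primary entity types above can be sketched as plain data structures. A minimal Python illustration with hypothetical table names; in the actual method these are relational tables, not classes, so this is only a conceptual sketch:

```python
from dataclasses import dataclass

@dataclass
class Hub:
    name: str
    business_key: str        # the unique business key this hub enumerates

@dataclass
class Link:
    name: str
    hubs: list               # names of the hubs this relationship connects

@dataclass
class Satellite:
    name: str
    parent: str              # the hub or link this context describes
    payload: list            # descriptive attributes captured over time

# Hypothetical example entities: everything else in the method is
# functionally dependent on combinations of these three types.
customer = Hub("HUB_CUSTOMER", "customer_number")
order = Link("LNK_ORDER", ["HUB_CUSTOMER", "HUB_PRODUCT"])
detail = Satellite("SAT_ORDER_DETAIL", "LNK_ORDER", ["quantity", "price"])
print(detail.parent)  # LNK_ORDER
```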
    12. HUB
       • Business Keys.
       • Isolated entities that can stand alone, representing a list of unique business keys.
       • The collection of business keys for an organization is the answer to the question, "What do we do?"
       • Which business key is most important? How many edges does it have?
    13. LNK
       • Relationships.
       • Isolated entities that can stand alone, representing a list of unique relationships between business keys.
       • The collection of relationships for an organization is the answer to the question, "At what time does whom do what to whom or what?"
       • Links are actually very interesting in their own right. We will speak further about links a little later in this session.
    14. LNK
       • How many edges does a link have? The number of incoming edges a Link table has is the number of HUB_SQNs the link connects (this includes weak hubs).
       • Outgoing edges are the number of Satellites connected to the Link table.
       • What is the ratio of OE/IE?
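The edge counts above can be computed mechanically from model metadata. A minimal Python sketch; the link, hub, and satellite names are hypothetical, and in practice you would read them from your modeling tool or the database catalog:

```python
# Hypothetical metadata: for each Link, the hubs it connects (incoming
# edges, including weak hubs) and the satellites hung off it (outgoing edges).
model = {
    "LNK_ORDER":    {"hubs": ["HUB_CUSTOMER", "HUB_PRODUCT", "HUB_STORE"],
                     "sats": ["SAT_ORDER_DETAIL", "SAT_ORDER_STATUS"]},
    "LNK_SHIPMENT": {"hubs": ["HUB_ORDER", "HUB_CARRIER"],
                     "sats": ["SAT_SHIPMENT_DETAIL"]},
}

def edge_ratio(link):
    ie = len(link["hubs"])   # incoming edges: hubs the link connects
    oe = len(link["sats"])   # outgoing edges: satellites on the link
    return oe / ie

for name, link in sorted(model.items()):
    print(name, round(edge_ratio(link), 2))
```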
    15. Research in progress
    16. Details
    17. Now What?
       • Now that I have these numbers, what do I do with them?
       • This is one way to confirm the accuracy of the sequencing of your business keys in a link, in order to separate the driver business key from the dependent keys.
       • Are there any other links in the Data Vault that have a similar cosine?
    18. Now What?
       • If you have cosine similarity between links, does this mean something?
       • What is going on in the business? Is it obvious the links are related?
       • More importantly, is it not obvious why two links are similar within a margin of error?
    19. SAT
       • Contextual data.
       • Detail data, most pertinent for use in downstream systems.
       • The "payload" of the satellite is the data you want to capture.
       • The collection of context around an organization's business keys answers the question, "What do we know about what we do?"
       • Has one edge.
    20. Satellite Clustering
       • Using some simple k-means clustering with Euclidean distance calculations, you can identify divergent rates of change within a satellite.
       • This is one way to divaricate satellites coming from a single source table.
       • If you are interested in knowing more about this, let me know.
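A minimal sketch of the idea in pure Python: one-dimensional k-means over per-column change rates, so fast-moving and slow-moving columns land in separate satellites. The column names and rates are hypothetical; a real implementation would measure change frequency from satellite load history:

```python
# Hypothetical per-column change rates: the fraction of loads in which
# each satellite column actually changed value.
rates = {
    "customer_name": 0.01,
    "birth_date":    0.00,
    "email":         0.02,
    "balance":       0.95,
    "status":        0.90,
}

def kmeans_1d(points, k=2, iters=25):
    """Plain Lloyd's algorithm in one dimension (Euclidean distance)."""
    centers = [min(points), max(points)]          # deterministic seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

slow_c, fast_c = sorted(kmeans_1d(list(rates.values())))
slow = sorted(c for c, r in rates.items() if abs(r - slow_c) <= abs(r - fast_c))
fast = sorted(c for c in rates if c not in slow)
print(slow)  # candidates for a slowly changing satellite
print(fast)  # candidates for a rapidly changing satellite
```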
    21. Philosophies
       • From Dan: "100% of the data 100% of the time."
       • From Doug: "A model is not valid until 100% of the model is populated from source systems."
       • Notice I did not say 100% of the data, as Dan did.
       • During development, the assumptions built into the model have to be validated.
       • Designing a proper Data Vault model does not take very long for those versed in its abilities. Loading the model to validate the assumptions built into it is paramount to success.
    22. Philosophies
       • The second portion of this philosophy is to extract data from the Vault to an alternative system, be that a star schema, statistical research, data science, Excel, etc. Something downstream needs to be populated FROM the vault.
       • In order to know you have a valid model, data must both go in and come out accurately, according to business rules.
       • This must be done in order to say a particular phase of the development cycle is complete.
       • What does complete mean? It means this is the end of the beginning. Welcome to the world of Data Warehouse support, maintenance, and evolution.
    23. Aesthetics
       • One of the most fascinating things about a Data Vault model, to me, is that it flows quite aesthetically in accordance with the particular business processes the Data Vault is attempting to model.
       • It just makes sense to a variety of users, from technical to executive.
       • The following slide is an example of this, where we were modeling a process and something surprising came out of the modeling exercise.
    24. What do I mean by Aesthetics?
       • Can you do this with another data modeling technique?
    25. Architecture
       • A data architect understands applications are only the entry point of data into the enterprise. Data Science makes data forever useful.
    26. Volumetrics
    27. Summary
       • One of the main reasons architects are constantly studying designs is that they are continuously looking for ways not just to create something new, but to reduce new problems to ones already solved. The same can be said of mathematicians, engineers, physicists, even managers and executives.
       • The Data Vault is a repeatable pattern for database design when that database is to be used for integration of multiple systems. There are many other uses for Data Vault, of course, but this is the first principle of why the Data Vault exists.
       • As we learn from prior implementations, be they our own or someone else's, let us continuously strive not only to reduce problems to those already solved, but also to look for and discuss these repeatable patterns of Data Vault design.
    28. Final thoughts
       • With the Data Vault, the structure itself has meaning.
       • This is a feature that I believe is unique to Data Vault modeling.
       • Our email contact information:
         dneedham@clearmeasures.com
         pdokouzov@clearmeasures.com
