Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Everything has changed except us

1,098 views

Published on

The way we make decisions has changed. The data we use has changed. The techniques we can apply to data and decisions have changed. Yet what we build and how we build it has barely changed in 20 years.
 
The definition of madness is doing more of what you already do and expecting different results. The threat to the data warehouse is not from new technology that will replace the data warehouse. It is from destabilization caused by new technology as it changes the architecture, and from failure to adapt to those changes.
 
The technology that we use is problematic because it constrains and sometimes prevents necessary activities. We don’t need more technology and bigger machines. We need different technology that does different things. More product features from the same vendors won’t solve the problem.
 
The data we want to use is challenging. We can’t model and clean and maintain it fast enough. We don’t need more data modeling to solve this problem. We need less modeling and more metadata.
 
And lastly, a change in scale has occurred. It isn’t a simple problem of “big”. The problem with current workloads has been solved, despite the performance problems that many people still have today. Scale has many dimensions – important among them are the number of discrete sources and structures, the rate of change of individual structures, the rate of change in data use, the variety of uses and the concurrency of those uses.
 
In short, we need new architecture that is not focused on creating stability in data, but one that is adaptable to continuous and rapidly changing uses of data.

Published in: Data & Analytics
  • Be the first to comment

Everything has changed except us

  1. 1. Copyright Third Nature, Inc. Everything has  changed except us February, 2015 Mark Madsen www.ThirdNature.net @markmadsen
  2. 2. Copyright Third Nature, Inc. The DW group as the crazy  uncle of the organization Madness: doing more of what  you already did and expecting  different results. We’ve been struggling with  shrinking load windows,  performance problems, and  most important, inability to  quickly meet data needs, for a  decade, yet we keep doing the  same things to try to fix them.
  3. 3. Copyright Third Nature, Inc. I never said the “E” in EDW meant “everything”… What do you mean, “Just tables?”
  4. 4. Copyright Third Nature, Inc. It’s going to get a lot worse Not E E Conclusion: any methodology built on the premise that you  must know and model all the data first is untenable 
  5. 5. © Third Nature Inc.© Third Nature Inc. The good news is: we solved the bigness problem Source: Noumenal, Inc.
  6. 6. Copyright Third Nature, Inc. Now, analytics embiggens the data volume problem Many of the processing problems are O(n2) or worse, so  small data can be a problem for DB‐based platforms
  7. 7. © Third Nature Inc.© Third Nature Inc. What makes data “big”? Aside from very large amounts: Hierarchical structures Nested structures Linked structures Encoded values Non‐standard (for a database)  types Deep structure Human authored text “big” is better off being defined as “complex” or “hard to manage” Copyright Third Nature, Inc.
  8. 8. Copyright Third Nature, Inc. Datasets today: Interconnection and Dependency Dynamic models are  missing from most  data systems today.  These drive new  workloads, generate  different data, need  new techniques.  Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data, Danny Holten
  9. 9. Copyright Third Nature, Inc. It’s not the number of genes that determine complexity, it’s the interactions between them. Source: M. Pertea and S. Salzberg/Genome Biology 2010
  10. 10. Copyright Third Nature, Inc. It’s not the number of genes that determine complexity, it’s the interactions between them. Source: M. Pertea and S. Salzberg/Genome Biology 2010
  11. 11. Copyright Third Nature, Inc. Categorizing the measurement data we collect The convenient data is the  transactional data. ▪ Goes in the DW and is used, even  if it isn’t the right measurement. The inconvenient data is  observational data. ▪ It’s not neat, clean, or designed  into most systems of operation. The difficult and misleading data  is declarative data. ▪ What people say and what they  do require ground truth. We need an architecture that  supports all three categories. Copyright Third Nature, Inc.
  12. 12. Copyright Third Nature, Inc. Observations Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it. Copyright Third Nature, Inc.
  13. 13. Copyright Third Nature, Inc. Declarations
  14. 14. Copyright Third Nature, Inc. Unstructured is Not Really Unstructured Slide 14 Unstructured data isn’t  really unstructured: objects  have structure, language  has structure. Text can  contain traditional  structured data elements.  The problem is that the  content is unmodeled. Our real problem is making  implicit structure explicit. Conclusion: the data warehouse must cope with more complex data structures, storage and processing.
  15. 15. Copyright Third Nature, Inc. The creation, flow and use of data is different for  transactions and machine‐generated events Data entry Extract Cleanse Load UseStore Transactions MDM Generate Store Use UseCleanse Program Capture This runs at human speed This runs at machine speed, with slower feedback cycle
  16. 16. Copyright Third Nature, Inc. We’re moving BI from information to actuation This means  monitoring as  data flows,  detecting rather  than querying, as  well as feedback  to the sources.
  17. 17. Copyright Third Nature, Inc. The architecture we’ve been using. The general concept of a  separate architecture for BI  has been around longer, but  this paper by Devlin and  Murphy is the first formal  data warehouse architecture  and definition published. 17 “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988) Slide 17Copyright Third Nature, Inc.
  18. 18. Copyright Third Nature, Inc. Origins: in 1988 there was only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public,  enters US market ▪ Unix is mostly run by long‐haired freaks ▪ Mobile was this This is the context: scarcity of data, of system resources, of automated  systems outside core financials, of money to pay for storage.
  19. 19. Copyright Third Nature, Inc. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  20. 20. Copyright Third Nature, Inc. Data strategy means understanding the context of  data use so we can build the right infrastructure Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act Act on the process Act within the process We need to focus on what people do with information as the primary task, not on the data or the technology.
  21. 21. Copyright Third Nature, Inc. The usage models for conventional BI Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act on the process Usually days/longer timeframe Act within the process Usually real-time to daily This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP
  22. 22. Copyright Third Nature, Inc. The usage models for analytics and “big data”  Collect new data Monitor Analyze Exceptions Analyze Causes Decide Act No problem No idea Do nothing Act on the process Usually days/longer timeframe Act within the process Usually real-time to daily Analytics and big data is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions This isn’t ad-hoc, reporting, or OLAP.
  23. 23. Copyright Third Nature, Inc. As practices evolve based on new capabilities… A new level of  complexity  develops over  top of the  older, now  better  understood  processes,  leading to new  data and  analysis needs.
  24. 24. Copyright Third Nature, Inc. Growing complexity has changed our context Internal 3rd party & custom applications, logs, event  streams, hosted & external apps, 3rd party datasets… 
  25. 25. Copyright Third Nature, Inc. Enterprise architecture changes External = no data layer access SOA and REST = no data layer access Streams and messages are becoming the norm Observations and Transactions
  26. 26. Copyright Third Nature, Inc. Reality: continuous change in the DW You can’t keep up with source changes You can’t keep up with new data requests You are already scale, performance, latency limited But: Many parts of the organization need current operational data
  27. 27. Copyright Third Nature, Inc. The emerging big data market has an answer…
  28. 28. Copyright Third Nature, Inc. Centralize: that solves all problems! Creates bottlenecks Causes scale problems Enforces a single model
  29. 29. Copyright Third Nature, Inc. Data quality and definitions in a single schema are  based on the strictest requirement, reducing flexibility
  30. 30. Copyright Third Nature, Inc. The data warehouse vs business agility All the data Common, typed, tabular data The bottleneck is you
  31. 31. Copyright Third Nature, Inc. We have a design for stability. We need one for adaptability
  32. 32. Copyright Third Nature, Inc. Which is best, 3NF or dimensional? The core assumption that there can be just one big schema model on one big platform is flawed. Answer: neither. We think we can model all the data before use, but that’s a bottleneck. Current techniques for modeling and managing data are too rigid and incapable of describing all the possible relationships.
  33. 33. Copyright Third Nature, Inc. A core problem with one big schema is change
  34. 34. Copyright Third Nature, Inc. Big data answer? Schema‐on‐read! There’s a price to pay  with using “schema‐on‐ read” for everything. You won’t see the  problems with this until  you add a second  application, and a third. “One writer‐many  readers” kills schema‐on  read benefits.
  35. 35. Copyright Third Nature, Inc. Why is the choice no schema or hard schema? Simple key‐value files give you flexibility in some  areas. Tables give you flexibility in other areas. Which area do you need flexibility in and why? Programs writing data? Files Tables Programs processing data? Programs reading data? Why not flexible schemas instead of either-or?
  36. 36. Copyright Third Nature, Inc. “We can't solve problems by using the  same kind of thinking we used when  we created them.” Albert Einstein Page 37
  37. 37. Copyright Third Nature, Inc. With too much data the approach has to be inverted The process we still use: 1. Model 2. Collect 3. Analyze The new process is: 1. Collect 2. Analyze 3. Model 4. Promote This is a shift from planned design to evolutionary design for the data warehouse
  38. 38. Copyright Third Nature, Inc. Slide 39 The solution to our problems isn’t  necessarily technology, it’s architecture.
  39. 39. Copyright Third Nature, Inc. Workloads OLTP BI Analytics Access Read‐Write Read‐only Read‐mostly Predictability Predictable Unpredictable Fixed path Selectivity High Low Low Retrieval Low Low High Latency Milliseconds < seconds msecs to days Concurrency Huge Moderate 1 to huge Model 3NF, nested object Dim, denorm BWT Task size Small Large Small to huge
  40. 40. Copyright Third Nature, Inc. DATA ARCHITECTURE We’re so focused on the light switch that we’re not  talking about the light
  41. 41. Copyright Third Nature, Inc. Decoupled Data Architecture The core of the data warehouse isn’t the  database, it’s the data architecture that the  database and tools implement. We need a data architecture that is not limiting: ▪ Deals with data and schema change easily ▪ Does not always require up front modeling ▪ Does not limit the format or structure of data ▪ Assumes a full range of data latencies, from  streaming to one‐time bulk loads, both in and out, 
  42. 42. Copyright Third Nature, Inc. Food supply chain: an analogy for data Multiple contexts of use, differing quality levels
  43. 43. Integrate Manage Decouple data architecture layers Use This implies a new warehouse architecture and data modeling approaches Collect Transactions Observations Declarations
  44. 44. Copyright Third Nature, Inc. Break down the monolithic architecture
  45. 45. The technology architecture  must change, based on work  done with the data: ▪ Collection separate from ▪ Data management separate from ▪ Data delivery and use Data may live in more than  one place because it may have  more than one model, for  more than one use, using  more than one engine
  46. 46. Copyright Third Nature, Inc. Reinforcing relationships keep architectures from  changing, despite radical technology shifts Note how only one third is tech Architectural Regime MethodologyTechnology Organization Organization  defines where the  work is done and  the roles. Technology  defines what  work can be done  in a given area. Methodology  defines how  work is done  and what that  work is. Slide 49Copyright Third Nature, Inc.
  47. 47. Copyright Third Nature, Inc. Agile architectures without agile methods fail
  48. 48. Copyright Third Nature, Inc. How can you move to a more agile architecture? Start by deploying faster. Things will break. You will fix them. You will get better. So will your architecture.
  49. 49. Copyright Third Nature, Inc. The geography we have been using is out of date The box we created: • not any data, rigidly typed data • not any form, tabular rows and  columns of typed data • not any latency, persist what the  DB can keep up with • not any process, only queries The digital world was diminished  to only what’s inside the box until  we forgot the box was there.
  50. 50. Copyright Third Nature, Inc. Data infrastructure is a platform ▪ Any data – structures, forms ▪ Any latency –in motion, at rest ▪ Any process – query, algorithm, transform ▪ Any access – SQL, API, queue, file movement
  51. 51. Copyright Third Nature, Inc. Don’t follow the market Some people can’t resist  getting the next new thing  because it’s new and new is  always better. Many IT organizations are like  this, promoting a solution and  hunting for the problem that  matches it. Better to ask “What is the  problem for which this  technology is the answer?” Copyright Third Nature, Inc.
  52. 52. Copyright Third Nature, Inc. Think like an architect,  not like a consumer No more “enterprise  standard” ‐ now “what  works” The technology providers  are selling you what they  have, not what you need. Follow the goals of the  business. Translate the goals into  capabilities and match  those to the architecture  required.
  53. 53. Copyright Third Nature, Inc. “The future, according to some scientists, will be exactly like  the past, only far more expensive.” ~ John Sladek
  54. 54. Copyright Third Nature, Inc. CC Image Attributions Thanks to the people who supplied the creative commons licensed images used in this presentation: round hole square peg ‐ https://www.flickr.com/photos/epublicist/3546059144 firemen not noticing fire.jpg ‐ http://flickr.com/photos/oldonliner/1485881035/ pyramid_camel_rider.jpg ‐ http://www.flickr.com/photos/khalid‐almasoud/1528054134/ House on fire ‐ http://flickr.com/photos/oldonliner/1485881035/ glass_buildings.jpg ‐ http://www.flickr.com/photos/erikvanhannen/547701721 Circos, Hierarchical Edge Bundles:Visualization of Adjacency Relations in Hierarchical Data, Danny  Holten text composition ‐ http://flickr.com/photos/candiedwomanire/60224567/ Building demolition ‐ https://www.flickr.com/photos/gregpc/4429888820 peek_fence_dog.jpg ‐ http://www.flickr.com/photos/webwalker/114998078/ donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/ shady_puppy_sales.jpg ‐ http://www.flickr.com/photos/brizzlebornandbred/5001120150 subway dc metro  ‐ http://flickr.com/photos/musaeum/509899161/
  55. 55. Copyright Third Nature, Inc. About the Presenter Mark Madsen is president of Third  Nature, a technology research and  consulting firm focused on business  intelligence, data integration and data  management. Mark is an award‐winning  author, architect and CTO whose work  has been featured in numerous industry  publications. Over the past ten years  Mark received awards for his work from  the American Productivity & Quality  Center, TDWI, and the Smithsonian  Institute. He is an international speaker,  a contributor to Forbes Online and on  the O’Reilly Strata program committee.  For more information or to contact  Mark, follow @markmadsen on Twitter  or visit  http://ThirdNature.net 
  56. 56. About Third Nature Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you‘re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and hw it is applied rather than vendor market positions.

×