
BI isn't big data and big data isn't BI (updated)



Big data is hyped, but isn't hype. There are definite technical, process and business differences in the big data market when compared to BI and data warehousing, but they are often poorly understood or explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they differ. Armed with this knowledge, you will be better able to make design and architecture decisions. The session alternates between conceptual discussion and detailed technical exploration of data, processing and technology, and promises to be entertaining at either level.

Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.

The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.

The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.

Published in: Data & Analytics


  1. BI Isn’t Big Data, Big Data Isn’t BI. September 2015. Mark Madsen, @markmadsen. [Title graphic labels: SQL, Hadoop]
  2. © Third Nature Inc. Summary: Common uses and commodity technology lead to novel practices, which lead to different data and different technology needs, which lead to new architectures, which lead to common uses and commodity technology.
  3. Our ideas about information and how it’s used are outdated.
  4. How We Think of Users: Our design point is the passive consumer of information. Proof: methodology ▪ IT role is requirements, design, build, deploy, administer ▪ User role is run reports. Self-serve BI is not like picking the right doughnut from a box.
  5. How We Think of Users (continued): How We Want Users to Think of Us
  6. How We Think of Users (continued): What Users Really Think
  7. We think of BI as publishing, an old metaphor. Publishing has value, but may not be actionable.
  8. Planning data strategy means understanding the context of data use so we can build infrastructure. [Diagram labels: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] We need to focus on what people do with information as the primary task, not on the data or the technology.
  9. General model for organizational use of data. [Diagram labels: Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act within the process: usually real-time to daily.
  10. Origin of BI and data warehouse concepts: The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol. 27, No. 1 (1988).
  11. Origins: in 1988 there was only big hair. ▪ No real commercial email, public internet barely started ▪ Storage state of the art: 100MB, cost $10,000/GB ▪ Oracle Applications v1 GL released; SAP goes public, enters US market ▪ Unix is mostly run by long-haired freaks ▪ Mobile was this [image]. This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
  12. General model for organizational use of data. [Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe.
  13. You need to be able to support both paths. [Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act] Act on the process. Act within the process. Conventional BI, addition of EDM. Causal analysis, “data science”.
  14. The usage models for conventional BI. [Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
  15. The usage models for analytics and “big data”. [Diagram labels: Collect new data, Monitor, Analyze Exceptions, Analyze Causes, Decide, Act; No problem, No idea, Do nothing] Act on the process: usually days/longer timeframe. Act within the process: usually real-time to daily. Analytics and big data are focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn’t ad-hoc, reporting, or OLAP.
  16. When you first give people access to information that was unavailable… OH GOD I can see into forever
  17. After a while it becomes the new normal
  18. As practices evolve based on new capabilities… A new level of complexity develops over top of the older, now better understood processes, leading to new data and analysis needs.
  19. I never said the “E” in EDW meant “everything”… What do you mean, “Just doughnuts?”
  20. The data warehouse vs business agility. All the data. Common, typed, tabular data. The bottleneck is you.
  21. It’s going to get a lot worse. [Diagram labels: Not E, E] Conclusion: any methodology built on the premise that you must know and model all the data first is untenable.
  22. Old market says: There’s nothing wrong with what you have, just keep buying new products from us
  23. The emerging big data market has an answer…
  24. The data lake
  25. The data lake after a little while
  26. TANSTAAFL: When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  27. “Big data is unprecedented.” - Anyone involved with big data in even the most barely perceptible way
  28. We’ve been here before. Source: Bill Schmarzo, EMC
  29. “Big” is well supported by databases now. Source: Noumenal, Inc.
  30. Orders of magnitude: 20 years ago TB, today PB. Shifts in data availability by orders of magnitude necessitate new means of managing and using it.
  31. Analytics embiggens the data volume problem: Many of the processing problems are O(n²) or worse, so moderate data can be a problem for DB-based platforms.
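To make the quadratic-growth point on this slide concrete, here is a minimal sketch that counts the record pairs an all-pairs comparison would touch. The function name and the record counts are illustrative assumptions, not figures from the talk.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical record counts, chosen only to show how fast the work grows.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {pairwise_comparisons(n):>22,} pairs")
```

A hundredfold increase in records is roughly a ten-thousandfold increase in pairs, which is why a data set that looks moderate by storage standards can still be a heavy computational workload.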
  32. Much of the big data value comes from analytics. BI is a retrieval problem, not a computational problem. Five basic things you can do with analytics: ▪ Prediction – what is most likely to happen? ▪ Estimation – what’s the future value of a variable? ▪ Description – what relationships exist in the data? ▪ Simulation – what could happen? ▪ Prescription – what should you do?
  33. Most people do not need special technology. [Chart: number of people vs. bigness of data] The distribution of data size is about normal, yet these guys set the tone of the market today.
  34. Analytics: This is really raw data under storage. [Chart: number of jobs vs. bigness of data] Microsoft study of 174,000 analytic jobs in their cluster: median size ???
  35. Working data for analytics most often not big. [Chart: number of jobs vs. smallness of data; label: 14 GB]
  36. An (overly) Simple Division of the Problem Space (axes: computation, little to lots; data volume, little to lots) ▪ Big analytics, little data: Specialized computing, modeling problems: supercomputing, GPUs ▪ Big analytics, big data: Complex math over large data volumes requires shared-nothing architectures ▪ Little analytics, little data: The entry point; SAS, SMP databases, even OLAP cubes can work ▪ Little analytics, big data: The BI/DW space, for the most part, with work done in databases
  37. What makes data “big”? ▪ Very large amounts ▪ Hierarchical structures ▪ Nested structures ▪ Linked structures ▪ Encoded values ▪ Non-standard (for a database) types ▪ Deep structure ▪ Human authored text. “Big” is better off being defined as “complex” or “hard to manage”.
  38. Categorizing the measurement data we collect. The convenient data is the transactional data. ▪ Goes in the DW and is used, even if it isn’t the right measurement. The inconvenient data is observational data. ▪ It’s not neat, clean, or designed into most systems of operation. The difficult and misleading data is declarative data. ▪ What people say and what they do require ground truth. We need an architecture that supports all three categories.
  39. Transactions vs “big data”. The classic example of “structured data”. Transaction data includes: ▪ quantification details (date, value, count) ▪ reference data for explanation (product, customer, account) ▪ Lots of meaningful information. Reference data is usually shared across the organization, hence its importance. There are two parts: ▪ identifier to uniquely identify the subject ▪ descriptive attributes with common or standardized value domains. [Figure labels: Transaction details, Reference data]
  40. Today it’s different data: observations, not transactions. Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
  41. Big data as a type of data: Transactions vs. Events. Transactions: ▪ Each one is valuable ▪ Mutable ▪ The elements of a transaction can be aggregated easily ▪ A set of transactions does not usually have important ordering or dependency. Events: ▪ A single event often has no value, e.g. what is the value of one click in a series? Some events are extremely valuable, but this is only detectable within the context of other events. ▪ Elements of events are often not easily aggregated ▪ A set of events usually has a natural order and dependencies ▪ Immutable
  42. Example “big data”: Web tracking data
      USER_ID: 301212631165031
      SESSION_ID: 590387153892659
      VISIT_DATE: 1/10/2010 0:00
      SESSION_START_DATE: 1:41:44 AM
      PAGE_VIEW_DATE: 1/10/2010 9:59
      DESTINATION_URL: link-src-email-_-m100109-_-44IOJ1-_-shop&langId=-1&storeId=1055&URL=BECGiftListItemDisplay
      REFERRAL_NAME: (blank)
      REFERRAL_URL: oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s
      PAGE_ID: PROD_24259_CARD
      REL_PRODUCTS: PROD_24654_CARD, PROD_3648_FLOWERS
      SITE_LOCATION_NAME: VALENTINE'S DAY MICROSITE
      SITE_LOCATION_ID: SHOP-BY-HOLIDAY VALENTINES DAY
      IP_ADDRESS: (blank)
      BROWSER_OS_NAME: MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
  43. Web tracking data has a nested structure. Same record as slide 42, except REFERRAL_NAME: Direct and REFERRAL_URL: -. “Unstructured” data embedded in the logged message: complex strings.
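The "complex strings" point can be illustrated with a short sketch that unpacks one logged field into its nested key/value pairs. It uses the REFERRAL_URL value from the web tracking example, assuming the space in "italia n" is a line-wrap artifact, and assumes (not stated in the slides) that the `q` parameter carries the visitor's search terms. The function name is hypothetical.

```python
from urllib.parse import parse_qs

# REFERRAL_URL value from the web tracking example, rejoined across the wrap.
referral_url = (
    "oq=Italian&ie=UTF8&rlz=1T4ACGW_enUS386US387"
    "&q=italian+rose&fu=0&ifi=1&dtd=204&xpc=1KoLqh374s"
)


def extract_search_terms(query_string: str) -> list[str]:
    """Return the values of the 'q' parameter, or an empty list if absent."""
    params = parse_qs(query_string)  # maps each key to a list of decoded values
    return params.get("q", [])


terms = extract_search_terms(referral_url)
print(terms)
```

One column in the log turns out to hold a small record of its own, which is the sense in which the data is nested rather than flat.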
  44. The missing ingredient from most big data
  45. The creation, flow and use of data is different for transactions and machine-generated events. [Diagram labels: Data entry, Extract, Cleanse, Load, Store, Use, Transactions, MDM; Program, Generate, Capture, Store, Cleanse, Use] This runs at human speed. This runs at machine speed, with higher latency feedback cycles.
  46. We collect large volumes of text, a rare practice ten years ago. Today we can turn text into data: ▪ Categories, taxonomies ▪ Topics, genres, relationships, abstracts ▪ Sentiment, tone, opinion ▪ Words & counts, keywords, tags ▪ Entities: people, places, things, events, IDs
  47. You can store this data in an RDBMS, but…
  48. Example data: Twitter Message API Payload. Looks like: [image]. This is really just a record format much like a DB row: datetime, userID, name, location, description, message, message metadata, etc. But it’s in JSON or XML.
  49. A tweet has lots of fields, but one important one. Example: “@markmadsen Check out: From #MongoDB to #Cassandra: Why The Atlas Platform Is Migrating”. The payload is free text but has other elements: ‘To’ username, hashtag, hashtag, URL. From these things you likely want to generate or link to reference data.
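As a rough illustration of pulling those embedded elements out of the free-text payload, the sketch below uses simple regular expressions. The patterns are deliberately naive assumptions, not Twitter's own entity rules, and the URL is a hypothetical placeholder because the original link was not preserved in the slide text.

```python
import re

# Naive patterns for illustration only; real tweet parsing has more edge cases.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text: str) -> dict[str, list[str]]:
    """Split a free-text message into mentions, hashtags and URLs."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
    }


tweet = (
    "@markmadsen Check out: From #MongoDB to #Cassandra: "
    "Why The Atlas Platform Is Migrating https://example.com/atlas"  # placeholder URL
)
entities = extract_entities(tweet)
print(entities)
```

Each extracted value is a candidate key into reference data: a user record for the mention, a topic for each hashtag, a document for the URL.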
  50. Internal payload elements form a new graph. The @elements point to other records and create a deeply linked structure. You have to assemble the linked structure to see what’s really there, which means repeatedly scanning some/all of the data. The derived pattern is interesting data, sometimes more than the individual messages.
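One way to picture assembling that linked structure is a single pass over the messages that records who mentions whom. This is a minimal sketch with made-up authors and messages; the function name and the nested-dictionary representation are assumptions chosen for brevity, and a real system would likely use a graph library or a distributed job.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")


def build_mention_graph(messages):
    """Scan (author, text) pairs and count directed author -> mentioned edges."""
    graph = defaultdict(lambda: defaultdict(int))
    for author, text in messages:
        for target in MENTION.findall(text):
            graph[author][target] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {author: dict(targets) for author, targets in graph.items()}


# Hypothetical messages, invented for illustration.
messages = [
    ("alice", "@bob have you seen this?"),
    ("bob", "@alice yes, cc @carol"),
    ("alice", "@bob thanks"),
]
graph = build_mention_graph(messages)
print(graph)
```

The graph exists in none of the individual rows; it only appears after scanning the whole set, which is the extra processing step the slide describes.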
  51. There are many patterns in the data. Follower / following networks are easy – they are explicit and independent of the events. Community detection requires looking at patterns of @ communication in addition to follow relationships. What do you do with these after discovery? [Figure labels: Follower network, Conversational communities]
  52. More data: patterns emerge from lots of event data. Patterns emerge from the underlying structure of the entire dataset. The patterns are more interesting than sums and counts of the events. Web paths: clicks in a session as network node traversal. Email: traffic analysis producing a network. The event stream is a source for analysis, generating another set of data that is the source for different analysis.
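The "clicks in a session as network node traversal" idea can be sketched as a derivation step that turns an event stream into a second data set of page-to-page transitions. The event tuples, page names and function name below are hypothetical; the sketch assumes each event carries a session identifier and a sortable timestamp.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def page_transitions(events):
    """Derive (from_page, to_page) edge counts from (session, time, page) events."""
    ordered = sorted(events, key=itemgetter(0, 1))  # by session, then by time
    edges = Counter()
    for _, clicks in groupby(ordered, key=itemgetter(0)):
        pages = [page for _, _, page in clicks]
        edges.update(zip(pages, pages[1:]))  # consecutive clicks form an edge
    return edges


# Hypothetical click events, deliberately out of order.
events = [
    ("s1", 2, "cards"),
    ("s2", 1, "home"),
    ("s1", 1, "home"),
    ("s1", 3, "checkout"),
    ("s2", 2, "cards"),
]
edges = page_transitions(events)
print(edges)
```

The output is a new data set (a weighted graph of paths) that depends on event order within a session, something plain sums and counts of events would discard.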
  53. Big changes for data warehousing workloads. The results of analytic processing can, and often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point was not changing tables and ephemeral patterns.
  54. Unstructured is Not Really Unstructured. Unstructured data isn’t really unstructured: language has structure. Text can contain traditional structured data elements. The problem is that the content is unmodeled.
  56. There are really three workloads to consider, not two: 1. Operational: OLTP systems 2. Analytic: OLAP systems 3. Processing: Computational systems. Unit of focus: 1. Transaction 2. Query 3. Computation. Different problems require different platforms.
  57. Workloads
      Attribute | OLTP | BI | Analytics
      Access | Read-write | Read-only | Read-mostly
      Predictability | Predictable | Unpredictable | Fixed path
      Selectivity | High | Low | Low
      Retrieval | Low | Low | High
      Latency | Milliseconds | < seconds | msecs to days
      Concurrency | Huge | Moderate | 1 to huge
      Model | 3NF, nested object | Dim, denorm | BWT
      Task size | Small | Large | Small to huge
  58. “These do exactly the same thing”: an idea promoted by big data vendors. One is a set of technologies. One is an architecture. [Figure label: Data Warehouse]
  59. Reality: Hadoop disaggregates the database. One of the key things Hadoop does is to separate the storage, execution and API layers of a database. This allows for processing flexibility, but it does not permit one to build a reliable, high performance database across the layers. [Diagram labels: Hadoop distributed filesystem (HDFS), General-purpose data engines, Abstraction layers, Storage management]
  60. A more specific look at layers and engines. You can program to any layer you choose. Some projects already build on top of multiple others. [Diagram labels. Layers: Base storage, Storage mgmt, Engine, Abstraction layer / API, Language/API. Components: Hadoop distributed filesystem (HDFS), Storage (filetypes in HDFS, HBase, etc.), MapReduce, Tez, Cascading, Spark, Crunch, Pig, Hive, SparkSQL, Giraph, Impala, Drill, Presto, Kylin, HBase, Phoenix, native APIs, SQL, MDX]
  61. An important Hadoop + cloud computing benefit. Scalability is free – if your task requires 10 units of work, you can decide when you want results: 1 server for 10 units of time, or 10 servers for 1 unit of time. Cost is the same. Not true of the conventional IT model.
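The arithmetic behind "cost is the same" can be written out as a small sketch. It assumes perfectly parallel work, no coordination overhead, and pay-per-use pricing at a flat rate per server per unit of time; all three are simplifying assumptions for illustration, and the function and parameter names are hypothetical.

```python
def elapsed_and_cost(work_units: int, servers: int, price_per_server_unit: float = 1.0):
    """Return (elapsed time, total cost) for evenly divisible parallel work."""
    elapsed = work_units / servers  # assumes perfect parallel speedup
    cost = servers * elapsed * price_per_server_unit  # pay only for time used
    return elapsed, cost


print(elapsed_and_cost(10, servers=1))
print(elapsed_and_cost(10, servers=10))
```

Under these assumptions both calls report the same cost and differ only in elapsed time. With owned hardware the ten servers are paid for whether or not they are busy, which is why the slide says the same does not hold for the conventional IT model.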
  62. Hadoop: a summary of the magic. 1. Provides both storage and complex processing as part of the same platform 2. Makes parallel programming more accessible 3. Schemaless (just files) therefore flexible 4. Inexpensive, reliable scale-out 5. Potential for fast, scalable ingest 6. Cheaper than a database (for non-database work). The bad stuff: ▪ Not great for mutable data ▪ Mostly file-based sequential processing, or you store data many times in different datastores (locality is important) ▪ Minimal data management (today)
  63. The geography has been redefined. The box we created: • not any data, rigidly typed data • not any form, tabular rows and columns of typed data • not any latency, persist what the DB can keep up with • not any process, only queries. The digital world was diminished to only what’s inside the box until we forgot the box was there.
  64. Layered data architecture. The DW assumed a single flat model of data, DB in the center. New technology enables new ways to organize data: ▪ Raw – straight from the source ▪ Enhanced – cleaned, standardized ▪ Integrated – modeled, augmented, ~semi-persistent ▪ Derived – analytic output, pattern based sets, ephemeral. Implies a new technology architecture and data modeling approaches.
  65. Decouple the Data Architecture. The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one-time bulk
  66. Deconstructing the data warehouse. There are three things happening in a DW: ▪ Data acquisition ▪ Data management ▪ Data delivery. Isolate them from one another. [Figure label: Data Warehouse]
  67. Decouple the data architecture by stage. In reality, you are building three systems, not one. Treat them that way. [Diagram labels: Collect, Integrate, Manage, Use; Transactions, Observations, Declarations]
  68. Food supply chain: an analogy for data. Multiple contexts of use, differing quality levels.
  69. Data infrastructure is a platform ▪ Any data – structures, forms ▪ Any latency – in motion, at rest ▪ Any process – query, algorithm, transformation ▪ Any access – SQL, API, queue, file movement
  70. The evolution of DW is to a data platform, which means separating application from infrastructure. Application layer: Deliver and use (BI, data extracts, analytics, applications). Infrastructure layer: Process and analyze; Store and manage. [Diagram labels: Raw data, Enhanced data, Derived data, Multiple ingest methods, Multiple access methods] The new model also encompasses data at rest and data in motion. The platform has to do more than serve queries; it has to be read-write.
  71. Away from “one throat to choke”, back to best of breed. “The extremely specialized nature of mass production raises the costs of product change and therefore slows down innovation.” - Abernathy, 1978. Tight coupling leads to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favored over monolithic stacks, single-vendor architectures and tight coupling.
  72. Staff and skills are a problem in a build market. @BigDataBorat: Give man Hadoop cluster he gain insight for a day. Teach man build Hadoop cluster he soon leave for better job #bigdata
  73. Technology Adoption. Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?”
  74. Four core capabilities big data technologies add: 1. Unlimited scale of storage, processing ▪ Agility, faster turnaround for new data requests (but not a replacement for BI) ▪ Fewer staff to accomplish same goals 2. New data accessibility ▪ More data retained for longer period ▪ Access to data unused due to cost or processing limits ▪ Any digital information becomes usable data 3. Scalable realtime processing ▪ Brings ability to monitor and act on data as events occur 4. Arbitrary analytics ▪ Faster analysis ▪ Deeper analysis ▪ More broadly accessible analytics
  75. As a technology moves from emerging to commodity the nature of acquiring, using and managing it changes. [Diagram labels: Generate options, Innovation, Novel practice, Maximize value; Constrain choices, Adaptation, Good practice, Optimize; Standardize / minimize choice, Acquisition, Best practice, Minimize costs; phases: Innovation, Maturation, Saturation; Agile & open source* methods; 6 Sigma & process methods]
  76. Today: repeating the experience of the 80s & 90s. This is the turbulent phase of the market as it goes through rapid development, then product and service changes. The Internet combined with commodity computing is forcing a new business and IT structural evolution, already underway. [Diagram labels: Innovation, Maturation, Saturation]
  77. How we develop best practices: survival bias. We don’t need best practices, we need worst failures.
  78. Welcome to the big data revolution, more of an evolution. Be pragmatic, not dogmatic.
  79. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation: acorn_blue.jpg, wheat_field.jpg, Phone dump (Richard Barnes), ponies in field.jpg, straw men.jpg, text composition, girl on cell tokyo .jpg, hamadan people mosaic.jpg, twitter_network_bw.jpg, klein_bottle_red.jpg, donuts_4_views.jpg, subway dc metro
  80. About the Presenter. Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit
  81. About Third Nature. Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.