
Big Data and Bad Analogies

Data lakes, data exhaust, web scale, data is the new oil. Vendors are throwing new terms and analogies at us to convince us to buy their products as the market around data technologies grows. We change data persistence and transaction layers because "databases don't scale" or because data is "unstructured". If data had no structure then it wouldn't be data, it would be noise. Schema on read, schema on write, schemaless databases; they imply structure underlying the data. All data has schema, but that word may not mean what you think it means.
This presentation will describe concepts of data storage and retrieval from technology prehistory (i.e. before the 1980s) and examine the design principles behind both old and new technology for managing data because sometimes post-relational is actually pre-relational. It is important to separate what is identical to things that were tried in the past from new twists on old topics that deliver new capabilities.
Directly related to these topics are performance, scalability and the realities of what organizations do with data over time. All of these topics should guide architecture decisions to avoid the trap of creating technical debts that must be paid later, after systems are in place and change is difficult.

Published in: Internet

Big Data and Bad Analogies

  1. 1. Big Data, Bad Analogies. GOTO Copenhagen, September 2014. Mark Madsen, www.ThirdNature.net, @markmadsen
  2. 2. Copyright Third Nature, Inc. The problem with bad framing: it leads to bad assumptions about use, inappropriate features, and poor understanding of substitutability and the impacts it will have.
  3. 3. The data lake
  4. 4. The data lake after a little while
  5. 5. Data Exhaust
  6. 6. Data is the new oil
  7. 7. Reality: data is a choice
  8. 8. “There is nothing new under the sun but there are lots of old things we don't know.” Ambrose Bierce
  9. 9. Looking at past ways of organizing data
  10. 10. The Elizabethan Era: commercial printing presses
      Data management tech:
      ▪ Perfect copies
      ▪ Topical catalogs
      ▪ Font standardization
      ▪ Taxonomy ascends
      Information explosion:
      ▪ 8M books in 1500
      ▪ 200M by 1600
      ▪ Commoditization
      ▪ Overload
  11. 11. Elizabethan Era Storage and Retrieval
  12. 12. Elizabethan Era Storage and Retrieval
  13. 13. Elizabethan Era Storage and Retrieval
  14. 14. The Georgian Era: The Explosion of Natural Philosophy
  15. 15. The Victorian Era: the powered printing information explosion
      ▪ Card catalogs, cross-referencing, random access metadata
      ▪ Universal classification
      ▪ Extended information management debates
      ▪ Trading effort and flexibility for storage and retrieval
      ▪ Stereotyping
  16. 16. Melvil Dewey: Dewey Decimal System
      ▪ Top down orientation
      ▪ Static structure
      ▪ Descriptive rather than explanatory
      ▪ Taxonomic classification
  17. 17. Charles Ammi Cutter: Cutter Expansive Classification System (~1882)
      ▪ Bottom up orientation
      ▪ More flexible structure
      ▪ Explanatory, descriptive
      ▪ Faceted classification
  18. 18. So why did Dewey beat Cutter?
      ▪ Pragmatism
      ▪ Good enough wins the day
      ▪ It wasn’t solving the problem you thought it was.
      In every choice, something is lost when something is gained.
  19. 19. What lessons does this history teach us?
      1. Information requires organizing principles at multiple levels, from items to collections.
      2. Differences in scale require different principles.
      3. At a key point in the adoption cycle, emphasis shifts from management of information to its dissemination and consumption.
      First we record, then we use and share. Like transaction processing and analysis.
  20. 20. What has this to do with data and persistence? “Schema” is a broad term, a way of organizing and making something relatable and findable. “Data” (or object) is to “Database” as “Books” are to “Library”.
  21. 21. The printed became more important than the printer. The book outlived generations of presses. Just as data does now. Which means we should pay attention to the broader organization and use of data, and persistence layers.
  22. 22. Someone else always wants to use your data. [Diagram labels: Order Entry, Order Database, Customer Service, Interface Program, Inventory Database, Distribution, Interface Program, Receivables Database, Accounts Receivable, Data Warehouse, Analysts & users.]
  23. 23. Context (one company): “In an infinite universe, the one thing sentient life cannot afford to have is a sense of proportion.” – Douglas Adams
  24. 24. Context (multiple company supply chain). [Figure: a value chain diagram showing the data supply chain for cheese, with flows of plans, forecasts, orders, call-offs, and delivery confirmations between farms, the cheese processor, and the retailer. Source: IGD Food Chain Centre, February 2008.] The side effects of a single bug can be massive.
  25. 25. Aside: technical debt (what those diagrams show)
      Technical debt (tek-ni-kuhl det): the cost that accrues due to decisions made in software design and coding.
      Look at the choices and mistakes in development:
      ▪ Intentional: purposeful choices to optimize schedule, budget, satisfaction
      ▪ Unintentional: missed requirements, poor code quality, poor design
  26. 26. It is a poor carpenter who blames his tools* (*but sometimes it is the tools)
  27. 27. Technical Debt: the cost of some choices can be dealt with in the short term (e.g. the next sprint) and some only in the long term (redesign, start over).
      ▪ Short term: mostly about the application code
      ▪ Long term: mostly about architecture, design, and infrastructure
  28. 28. If you enter into decisions knowing the true nature of your coding alternatives, you will be better off.
      Quadrant axes: short term vs long term, intentional vs unintentional.
      ▪ Intentional: code choices (short term), design choices (long term)
      ▪ Unintentional: code flaws, i.e. bugs (short term), design flaws (long term)
      Legend: green, these are deliberate, the tradeoffs known; yellow, these are minor defects; red, these are the things that kill a system.
  29. 29. Technical Debt can’t be avoided (but it can be managed)
      Same quadrant (short term vs long term, intentional vs unintentional), labeled with: development methods; experience, education; agile methods; redesign.
      Sometimes you think it’s intentional: incremental design. Long term debts can only be dealt with through planning.
  30. 30. Technical Debt can’t be avoided: let’s move this to the left. What you believe about the technology underlying your system has a big influence on design choices, so the focus of this part is on architecture and design, with the hope it will help reduce or avoid long term debt. [Quadrant: short term vs long term, intentional vs unintentional.]
  31. 31. How did we get here? There’s a difference between having no past and actively rejecting it.
  32. 32. [Figure: timeline of database families, 1960s to 2010s. Labels in order: MultiValue, Hierarchical (PICK, IMS, IDS, ADABAS); CODASYL; Relational (System R (SEQUEL), SQL/DS, INGRES (QUEL), Mimer, Oracle); RDBMS, SQL standard (DB2, Teradata, Informix, Sybase, Postgres); OODBMS, ORDBMS (Versant, Objectivity, Gemstone, Informix*, Oracle*); MPP Query, NoSQL (Netezza, Paraccel, Vertica, MongoDB, CouchBase, Riak, Cassandra); NewSQL (SciDB, MonetDB, NuoDB, CitusDB); Spanner, F1.]
  33. 33. In the beginning: RMSs and pre-relational DBs
      At first, common code libraries so there was reusability for file ops. Problems:
      ▪ Portability across languages, OSs
      ▪ Queries of more than one file
      ▪ No metadata: what’s in there? Who wrote it?
      ▪ Concurrency
      The databases brought things like recoverability, durability, ACID transactions. But they were rigid, prone to breakage.
      Operations: First, Next, Prev, Last
  34. 34. What is best in life?
  35. 35. To crush the vendors, see them driven before you, and hear the lamentation of their salespeople.
  36. 36. Wrong! Loose coupling. Scalability. Reusability.
  37. 37. The miracle of pre-relational DB: schema
      ▪ Loose coupling: the physical model of data structures and physical placement are no longer a program’s responsibility
      ▪ Reusability: more than one program can access the same data, and no more custom coding for each application or OS
      ▪ Scalability: constraints of schema and typing reduce resource usage, give finer granularity for concurrent access, and allow multiple online users
  38. 38. Key schema flexibility tradeoff for data management:
      Global validation vs contextual validation
      = Strict rules vs lenient rules
      = Write rules vs read rules
  39. 39. Schema on write vs schema on read: match the shape to the hole, or match the hole to the shape. Predicate schemas for write flexibility (agility) and speed.
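The write-rules vs read-rules tradeoff on the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ORDER_SCHEMA`, `WriteValidatedStore`, and `ReadValidatedStore` are hypothetical names invented for the example, and a field-to-type mapping stands in for a real schema.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "city": str, "amount": float}


def conforms(record, schema):
    """Return True if the record has every schema field with the expected type."""
    return all(isinstance(record.get(name), typ) for name, typ in schema.items())


class WriteValidatedStore:
    """Schema on write: strict, global rules applied before anything is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        if not conforms(record, self.schema):
            raise ValueError(f"rejected on write: {record!r}")
        self.rows.append(record)

    def read(self):
        return list(self.rows)


class ReadValidatedStore:
    """Schema on read: store raw text, let each reader apply its own rules."""

    def __init__(self):
        self.blobs = []

    def write(self, record):
        self.blobs.append(json.dumps(record))  # anything serializable is accepted

    def read(self, schema):
        good, bad = [], []
        for blob in self.blobs:
            record = json.loads(blob)
            (good if conforms(record, schema) else bad).append(record)
        return good, bad
```

In this sketch the strict store never returns a malformed record, while the lenient store accepts every write quickly and pushes the decision about what to do with bad records onto every reader.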
  40. 40. “Flexibility”: a recent experience in query. The problem with many of these pre-relational databases is tight coupling between a program and data structures. The physical model leaks into the logical, with potentially career-ending effects if the DB is used for the wrong thing. And it’s happening again. It’s a poor carpenter who blames his tools. Or the users.
  41. 41. Trading consistency for performance: CAP theorem http://blog.nahurst.com/visual-guide-to-nosql-systems
  42. 42. Tradeoffs: in NoSQL the DBMS is in your code
      Services provided:
      ▪ Standard API/query layer*
      ▪ Transaction / consistency
      ▪ Query optimization
      ▪ Data navigation, joins
      ▪ Data access
      ▪ Storage management
      [Diagram: how these services are split between the application and the database for a SQL database vs a NoSQL database.]
      Anything not done by the DB becomes a developer’s task. Welcome to 1985!
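As a rough illustration of "anything not done by the DB becomes a developer's task", the sketch below takes one service from the list, transactions, and contrasts a store that provides it (SQLite via Python's standard `sqlite3` module) with a plain dictionary standing in for a bare key-value store. The `accounts` table, the account ids, and both function names are assumptions made up for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer_sql(conn, src, dst, amount):
    """The database provides atomicity: both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def transfer_kv(store, src, dst, amount):
    """With a bare key-value store, the application must undo partial work itself."""
    before = store[src]
    store[src] = before - amount
    if store[src] < 0:
        store[src] = before  # hand-written rollback
        return False
    store[dst] = store[dst] + amount
    return True
```

The hand-written version only covers the single failure it anticipates. Crashes between the two writes and concurrent writers are still unhandled, which is the work a transactional database would otherwise absorb.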
  43. 43. Simplifying ACID vs BASE. Remember: it’s a poor carpenter who blames his tools.
  44. 44. Google on eventual consistency: “F1: A Distributed SQL Database That Scales”, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013
  45. 45. Party like it’s 1985
      ▪ Fastest TPC-B benchmark in 1985 was IMS running on an IBM 370: 100 TPS, 400 iops, 30 iops/disk
      ▪ The best relational vendors could muster was 10 TPS
      ▪ 25 years later, SQLServer on an Intel box ran the TPC-B at 25,000 TPS, 100,000 iops, 300 iops/disk
      It looks like you’re running a benchmark...
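The figures on this slide are easier to compare as ratios. The short sketch below does that arithmetic using only the numbers quoted on the slide. The dictionary layout and function names are illustrative, and the "implied disks" figure is a derived estimate (total iops divided by iops per disk), not something the slide states.

```python
# Figures as quoted on the slide; the names and layout are illustrative.
IMS_1985 = {"tps": 100, "iops": 400, "iops_per_disk": 30}
SQLSERVER_25_YEARS_LATER = {"tps": 25_000, "iops": 100_000, "iops_per_disk": 300}


def growth(old, new):
    """Ratio of each metric between the two benchmark results."""
    return {key: new[key] / old[key] for key in old}


def implied_disks(system):
    """Derived estimate: how many disks the quoted iops would need."""
    return system["iops"] / system["iops_per_disk"]


def ios_per_transaction(system):
    """I/O operations spent per transaction."""
    return system["iops"] / system["tps"]
```

On these numbers throughput and total iops both grow 250-fold while a single disk improves only 10-fold, and the I/O cost per transaction stays at 4 in both cases. Under that reading, most of the gain would come from spreading the work over many more devices.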
  46. 46. Joins. Wait, there’s more than one? 1986: SQL, joins!
  47. 47. Hipster bullshit: I can’t get MySQL to scale, therefore relational databases don’t scale, therefore we must use NoSQL* for everything. (*including Hadoop and related)
  48. 48. The relational gift: declarative language + CBO
      1. Enumerate logically equivalent plans by applying equivalence rules
      2. For each logically equivalent plan, enumerate all alternative physical query plans
      3. Estimate the cost of each of the alternative physical query plans
      4. Run the plan with lowest estimated overall cost
      Logical vs physical is an important distinction. The cost-based optimizer (CBO) turns a SQL query into an optimal* execution plan for a parallel pipelined dataflow engine. Diagram: David J. DeWitt
  49. 49. A simple 3 table join: a programmer’s job?
      SELECT C.name, O.num FROM Orders O, Lines L, Customers C WHERE C.City = 'Copenhagen' AND L.status = 'X' AND O.num = L.num AND C.cid = O.cid
      ▪ Number of logical plans: 9
      ▪ Ways to join (hash, merge, nested): 3
      ▪ For each plan, there are multiple physical plans: 36
      That makes a total of 324 physical plans (9 × 36), the efficiency of which changes based on cardinality.
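To make "the efficiency changes based on cardinality" concrete, here is a toy cost estimator for the three-table join above. It is a sketch under loud assumptions: the row counts and selectivities are invented, only left-deep join orders are enumerated (6 orders, so it does not reproduce the slide's 9, 36, or 324 counts), and cost is taken to be the sum of intermediate result sizes. Real optimizers use far richer models.

```python
from itertools import permutations

# Assumed row counts after each table's own filter (hypothetical numbers).
ROWS = {"Customers": 50, "Orders": 100_000, "Lines": 20_000}

# Assumed selectivities of the two join predicates in the query.
# Customers and Lines share no predicate, so joining them directly is a cross product.
SELECTIVITY = {
    frozenset(["Customers", "Orders"]): 1 / 5_000,  # C.cid = O.cid
    frozenset(["Orders", "Lines"]): 1 / 100_000,    # O.num = L.num
}


def estimate_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows *= ROWS[table]
        for other in joined:
            rows *= SELECTIVITY.get(frozenset([other, table]), 1.0)
        joined.append(table)
        cost += rows
    return cost


def best_plan():
    """Enumerate every join order, estimate each, and return the cheapest."""
    plans = {order: estimate_cost(order) for order in permutations(ROWS)}
    return min(plans, key=plans.get), plans
```

With these made-up numbers, joining Customers to Orders first is several hundred times cheaper than starting with the Customers and Lines cross product. Change the assumed cardinalities and the winner changes, which is why the choice is better left to an optimizer than frozen into application code.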
  50. 50. Scalability? ZOMG just add nodes! “The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” – Henry Petroski. After all, your database is web scale, isn’t it?
  51. 51. Just add hardware? No amount of hardware will make incorrectly coded software run in parallel. Declarative languages make this easier by turning the problem over to the computer to resolve. Guess which runs in parallel:
      Open cursor C
      Loop (Fetch row C)
        Open cursor O
        Loop (Fetch row O)
          Open cursor L
          Output (Do-things)
        End loop
      End loop
      ...
      or:
      SELECT C.name, O.num FROM Orders O, Lines L, Customers C WHERE C.City = 'Aarhus' AND C.cid = O.cid AND O.num = L.num AND L.status = 'X'
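The two styles on this slide can be run side by side with Python's standard `sqlite3` module. The table contents below are invented for the example, and SQLite does not execute queries in parallel. The sketch shows only that the nested-cursor version fixes the join order and method in application code, while the single statement leaves both to the engine.

```python
import sqlite3

QUERY = """
SELECT C.name, O.num
FROM Orders O, Lines L, Customers C
WHERE C.city = 'Aarhus' AND C.cid = O.cid
  AND O.num = L.num AND L.status = 'X'
"""


def make_db():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customers (cid INTEGER, name TEXT, city TEXT);
        CREATE TABLE Orders (num INTEGER, cid INTEGER);
        CREATE TABLE Lines (num INTEGER, status TEXT);
        INSERT INTO Customers VALUES (1, 'Anna', 'Aarhus'), (2, 'Bo', 'Copenhagen');
        INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
        INSERT INTO Lines VALUES (10, 'X'), (11, 'Y'), (12, 'X');
    """)
    return conn


def procedural(conn):
    """Hand-written nested loops: join order and join method are fixed in code."""
    out = []
    customers = conn.execute(
        "SELECT cid, name FROM Customers WHERE city = 'Aarhus'"
    ).fetchall()
    for cid, name in customers:
        orders = conn.execute("SELECT num FROM Orders WHERE cid = ?", (cid,)).fetchall()
        for (num,) in orders:
            lines = conn.execute(
                "SELECT 1 FROM Lines WHERE num = ? AND status = 'X'", (num,)
            ).fetchall()
            for _ in lines:
                out.append((name, num))
    return out


def declarative(conn):
    """One statement: the engine decides how to execute the join."""
    return conn.execute(QUERY).fetchall()
```

Both functions return the same rows here. The difference is who owns the execution strategy: an engine with a parallel executor could spread the declarative query across nodes without any change to the calling code, whereas the loops would have to be rewritten by hand.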
  52. 52. A more realistic example: TPC-H query #8
      select o_year, sum(case when nation = 'BRAZIL' then volume else 0 end) / sum(volume)
      from (
        select YEAR(O_ORDERDATE) as o_year, L_EXTENDEDPRICE * (1 - L_DISCOUNT) as volume, n2.N_NAME as nation
        from PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1, NATION n2, REGION
        where P_PARTKEY = L_PARTKEY and S_SUPPKEY = L_SUPPKEY and L_ORDERKEY = O_ORDERKEY and O_CUSTKEY = C_CUSTKEY
          and C_NATIONKEY = n1.N_NATIONKEY and n1.N_REGIONKEY = R_REGIONKEY and R_NAME = 'AMERICA'
          and S_NATIONKEY = n2.N_NATIONKEY and O_ORDERDATE between '1995-01-01' and '1996-12-31'
          and P_TYPE = 'ECONOMY ANODIZED STEEL' and S_ACCTBAL <= constant-1 and L_EXTENDEDPRICE <= constant-2
      ) as all_nations
      group by o_year order by o_year
      22 million possible execution plans; please find the best one in 4 milliseconds.
  53. 53. Just add hardware? [Chart: speedup (Amdahl’s law) vs number of processors, 1 to 10, on a 0% to 1000% scale; two series: linear and 10% overhead.]
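The curve in this chart follows from Amdahl's law: with a serial fraction s, the speedup on n processors is 1 / (s + (1 - s) / n). The sketch below assumes the chart's "10% overhead" means a serial fraction of 0.10. The slide text does not confirm that reading, and the function name is made up for the example.

```python
def amdahl_speedup(processors, serial_fraction):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def speedup_table(max_processors, serial_fraction):
    """Pairs of (processors, speedup) for 1..max_processors."""
    return [
        (n, amdahl_speedup(n, serial_fraction))
        for n in range(1, max_processors + 1)
    ]
```

Under that assumption, ten processors yield a speedup of about 5.3 rather than 10, and no number of processors can exceed a speedup of 10, since the serial tenth of the work never shrinks.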
  54. 54. Parallel Efficiency and Platform Costs. [Figure comparing three platforms: Data Warehouse, Singularity (data warehouse + behavioral), and Hadoop. Data and access labels: structured (SQL), semi-structured (SQL++), unstructured (Java/C). System labels: enterprise-class system, low end enterprise-class system, commodity hardware system. Use labels: production data warehousing; large concurrent user-base; contextual-complex analytics; deep, seasonal, consumable data sets; structure the unstructured; detect patterns; analyze & report; discover & explore. Scale labels: 150+, 500+, and 5-10 concurrent users; 6+PB, 40+PB, 20+PB.]
  55. 55. Platform Metrics for Table Scan and Sum, Hadoop vs Teradata
  56. 56. There are really three workload classes to consider, each with its own unit of focus:
      1. Operational: OLTP systems (unit of focus: transaction)
      2. Analytic: Query systems (unit of focus: query)
      3. Scientific: Computational systems (unit of focus: computation)
      Different problems require different platforms.
  57. 57. Workloads (values listed as OLTP / BI / Analytics)
      ▪ Access: read-write / read-only / read-mostly
      ▪ Predictability: fixed path / unpredictable / all data
      ▪ Selectivity: high / low / low
      ▪ Retrieval: low / low / high
      ▪ Latency: milliseconds / <seconds / msecs to days
      ▪ Concurrency: huge / moderate / 1 to huge
      ▪ Model: 3NF, nested object / dim, denorm / BWT
      ▪ Task size: small / large / small to huge
  58. 58. A key point worth remembering: performance over size <> performance over complexity.
      ▪ OLTP performance is mostly related to transaction coordination and writing under high concurrency.
      ▪ BI performance is mostly related to data volume and query complexity.
      ▪ Analytics performance is about the intersection of these with computational complexity.
  59. 59. A history of databases in No-tation
      ▪ 1970s: NoSQL = We have no SQL
      ▪ 1980s: NoSQL = Know SQL
      ▪ 2000s: NoSQL = No SQL!
      ▪ 2005: NoSQL = Not only SQL
      ▪ 2010s: NoSQL = No, SQL! (R)DB(MS)
  60. 60. TANSTAAFL. Technologies are not perfect replacements for one another. Often not better, only different. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.
  61. 61. Unintended consequences
  62. 62. Away from “one throat to choke”, back to best of breed. Tight coupling leads to slow changes. The market is not in the tight coupling phase. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.
  63. 63. Don’t follow the market. Some people can’t resist getting the next new thing because it’s new and new is always better. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it. Better to ask “What is the problem for which this technology is the answer?”
  64. 64. How we develop best practices: survival bias. We don’t need best practices, we need worst failures.
  65. 65. Summarizing some key points
      1. Data lives longer than code. It pays to focus on data when choosing and using persistence layers.
      2. Different persistence layer approaches have different tradeoffs (like performance vs app complexity in BASE models).
      3. Don’t throw away 30 years of engineering unless you know why it was put there.
      4. Decoupling, reusability (of data) and scalability are nice things to have.
  66. 66. References (things worth reading on the way home)
      ▪ A relational model for large shared data banks, Communications of the ACM, June 1970, http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
      ▪ Column-Oriented Database Systems, Stavros Harizopoulos, Daniel Abadi, Peter Boncz, VLDB 2009 Tutorial, http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
      ▪ Nobody ever got fired for using Hadoop on a cluster, 1st International Workshop on Hot Topics in Cloud Data Processing, April 10, 2012, Bern, Switzerland
      ▪ A co-Relational Model of Data for Large Shared Data Banks, ACM Queue, 2012, http://queue.acm.org/detail.cfm?id=1961297
      ▪ A query language for multidimensional arrays: design, implementation and optimization techniques, SIGMOD, 1996
      ▪ Probabilistically Bounded Staleness for Practical Partial Quorums, Proceedings of the VLDB Endowment, Vol. 5, No. 8, http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
      ▪ Amorphous Data-parallelism in Irregular Algorithms, Keshav Pingali et al.
      ▪ MapReduce: Simplified Data Processing on Large Clusters, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
      ▪ Dremel: Interactive Analysis of Web-Scale Datasets, Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
      ▪ Spanner: Google’s Globally-Distributed Database, OSDI, 2012, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/spanner-osdi2012.pdf
      ▪ F1: A Distributed SQL Database That Scales, Proceedings of the VLDB Endowment, Vol. 6, No. 11, 2013, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf
  67. 67. About the Presenter. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business use of data and analytics. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  68. 68. About Third Nature. Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, analytics and performance management. If your question is related to BI, analytics, information strategy and data then you’re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
  69. 69. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation:
      ▪ bookshelf by spectrum.jpg - http://flickr.com/photos/santos/1704875109/
      ▪ Vatican library - http://www.flickr.com/photos/paullew/1550844955
      ▪ round hole square peg - https://www.flickr.com/photos/epublicist/3546059144
      ▪ text composition - http://flickr.com/photos/candiedwomanire/60224567/
      ▪ twitter_network_bw.jpg - http://www.flickr.com/photos/dr/2048034334/
      ▪ glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
      ▪ CAP diagram - http://blog.nahurst.com/visual-guide-to-nosql-systems
      ▪ refinery-hdr.jpg - http://www.flickr.com/photos/vermininc/2477872191/
      ▪ refinery-night.jpg - http://www.flickr.com/photos/vermininc/2485448766/
