Determine the Right Analytic Database: A Survey of New Data Technologies


There has been an explosion in database technology designed to handle big data and deep analytics from both established vendors and startups. This session will provide a quick tour of the primary technology innovations and systems powering the analytic database landscape—from data warehousing appliances and columnar databases to massively parallel processing and in-memory technology. The goal is to help you understand the strengths and limitations of these alternatives and how they are evolving so you can select technology that is best suited to your organization and needs.

Presentation from the O'Reilly Strata conference, February 2011

Published in: Technology, Business

  1. 1. Determine the Right Analytic Database: A Survey of New Data Technologies. O’Reilly Strata Conference, February 1, 2011. Mark R. Madsen, Twitter: @markmadsen. Image: Atomic Avenue #1 by Glen Orbik.
  2. 2. Key Questions ▪ What technologies are available? ▪ What are they good for? ▪ How do you decide which to use? But first: why are analytic databases available now?
  3. 3. Consequences of Commoditization: Data Volume. [Chart: data generated over time, labeled with RFID, GPS, sensors, chipping, and spimes, and a “You are here” marker.]
  4. 4. An Unexpected Consequence of Data Volumes. Sums, counts and sorted results only get you so far.
  5. 5. An Unexpected Consequence of Data Volumes. Our ability to collect data is still outpacing our ability to derive meaning from it.
  6. 6. Don’t worry about it. We’ll just buy more hardware. CPUs, memory and storage track to very similar curves.
  7. 7. RIP Moore’s Law: it nearly ground to a halt for silicon integrated circuits about four years ago.
  8. 8. Technology Has Changed (a lot) But We Haven’t. [Chart: calculations per second per $1000, log scale, 1900 to 2000, across the mechanical, relay, vacuum tube, transistor, and integrated circuit eras. Annotations: 10,000X improvement; current DW architecture and methods start here, in the mid-1980s. Data: Ray Kurzweil, 2001.]
  9. 9. Moore’s Law via the Lens of the Industry Analyst. [Chart: CPU speed over time.]
  10. 10. Moore’s Law: Power Consumption. [Chart: power use over time, with 2017 marked.]
  11. 11. Moore’s Law: Heat Generation. [Chart: CPU temperature over time, with 2017 marked.]
  12. 12. Conclusion #1: Your own nuclear reactor by 2017. [Chart: power use over time, with 2017 marked.]
  13. 13. Conclusion #2: You Will Need a New Desk in 2017. [Chart: power use over time, with 2017 marked.]
  14. 14. Problem: linear extrapolation. “If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.” Robert Cringely. [Chart: “Anything” over time, with a curve labeled “Reality”.]
  15. 15. Multicore performance is not a linear extrapolation.
  16. 16. New Technology Evolution Means New Problems. [Chart: technology maturity (time + engineering effort), log scale, 1970 to 2020, across the uniprocessor and custom CPU era, the symmetric multi-processing era, and the massively parallel era. Labels: early engineering phase; exploring, learning, inventing; investment phase; core problems solved; improving, perfecting, applying.]
  17. 17. What’s different? Parallelism. We’re not getting more CPU power, but more CPUs. There are too many CPUs relative to other resources, creating an imbalance in hardware platforms. Most software is designed for a single worker, not high degrees of parallelism, and won’t scale well.
  18. 18. Core problem: software is not designed for parallel work. Databases must be designed to permit local work with minimal global coordination and data redistribution.
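
The idea of local work with minimal global coordination can be sketched in Python. This is only an illustration under assumed data: the three partitions, the `(region, amount)` rows, and the helper names `local_aggregate` and `merge_partials` are hypothetical and do not come from any particular database. Each node aggregates its own rows, and only the small partial results cross the network.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the rows held by one node.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1), ("north", 4)],
]

def local_aggregate(rows):
    """Sum amounts per region using only the rows stored on one node."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

def merge_partials(partials):
    """Global step: combine the small per-node results into one answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return merged

partials = [local_aggregate(rows) for rows in partitions]
totals = merge_partials(partials)
print(dict(totals))
```

The raw rows never move; only one small dictionary per node does, which is the property a parallel database design tries to preserve.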
  20. 20. Storage Improvements. For data workloads, disk throughput is still key. Improvements: ▪ Spinning disks at $.05/GB ▪ Solid state disks remove some latencies, read speed of ~250MB/sec ▪ SSD capacity still rising ▪ Card storage (PCI), e.g. FusionIO at 1.5GB/sec ▪ SSD is still costly at $2/GB up to $30/GB
  21. 21. Compression Applied to Stored Data. 10x compression means 1 disk I/O can read 10x as much data, stretching your current hardware investment. But it eats CPU and memory. YMMV.
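
The compression tradeoff can be put into a rough back-of-the-envelope form. The sketch below is a simplification with assumed numbers: the function name `effective_scan_rate` and the throughput figures are hypothetical, and it models only the two limits named on the slide (disk I/O and the CPU cost of decompression).

```python
def effective_scan_rate(disk_mb_per_sec, compression_ratio, decompress_mb_per_sec):
    """Logical MB/s delivered to a query.

    Each physical MB read from disk expands to `compression_ratio` logical MB,
    but the CPU can only decompress `decompress_mb_per_sec` logical MB/s.
    """
    io_bound = disk_mb_per_sec * compression_ratio
    return min(io_bound, decompress_mb_per_sec)

# Assumed figures for illustration only.
fast_cpu = effective_scan_rate(100, 10, 2000)  # disk is the limit
slow_cpu = effective_scan_rate(100, 10, 600)   # decompression is the limit
print(fast_cpu, slow_cpu)
```

When decompression is slower than the compressed I/O stream, the 10x gain is not fully realized, which is one way to read "YMMV".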
  22. 22. Scale-up vs. Scale-out Parallelism. Uniprocessor environments required chip upgrades. SMP servers can grow to a point, then it’s a forklift upgrade to a bigger box. MPP servers grow by adding more nodes.
  23. 23. Database and Hardware Deployment Models. Three levels of software-hardware integration: ▪ Database appliance (specialized hardware and software) ▪ Preconfigured (commodity) hardware with software ▪ Software on generic hardware. Then there are the hardware-database parallel models. [Diagram: shared everything, shared disk, shared nothing.]
  24. 24. In-Memory Processing. (1) Maybe not as fast as you think. Depends entirely on the database (e.g. VectorWise). (2) So far, applied mainly to shared-nothing models. (3) Very large memories are more applicable to shared-nothing than shared-memory systems: shared-memory is box-limited, e.g. 2 TB max; shared-nothing is limited by node scaling, e.g. 16 nodes, 512GB per node = 8TB. (4) Still an expensive way to get performance.
  25. 25. Columnar Databases. Example table (ID, Name, Salary): 1, Marge Inovera, $50,000; 2, Anita Bath, $120,000; 3, Nadia Geddit, $36,000. In a row-store model these three rows would be stored in sequential order, packed into a block. In a column-store database they would be divided by columns and stored in different blocks. This is not just a change to the storage layout. It also involves changes to the execution engine and query optimizer.
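
The two layouts can be sketched in Python using the slide's three example rows. This is a toy model, not how any real engine stores data: plain lists stand in for disk blocks, and the variable names are illustrative.

```python
# The three example rows from the slide: (ID, Name, Salary).
rows = [
    (1, "Marge Inovera", 50000),
    (2, "Anita Bath", 120000),
    (3, "Nadia Geddit", 36000),
]

# Row store: one block holds whole rows, one after another.
row_block = [value for row in rows for value in row]

# Column store: each column goes into its own block.
ids, names, salaries = (list(col) for col in zip(*rows))
column_blocks = {"ID": ids, "Name": names, "Salary": salaries}

# For a query like SELECT AVG(Salary), count the values each layout must read.
values_read_row_store = len(row_block)                   # every value in the block
values_read_column_store = len(column_blocks["Salary"])  # only the Salary block
average_salary = sum(column_blocks["Salary"]) / len(rows)

print(values_read_row_store, values_read_column_store)
```

The gap between the two counts grows with the number of columns the query does not touch.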
  26. 26. Column Stores Rule the TPC‐H Benchmark
  27. 27. Columnar Advantages and Disadvantages. + Reduced I/O for queries not reading all columns + Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O + Ability to operate on compressed data, improving overall system performance + Less manual tuning - Slower inserts and updates (causing ELT and trickle-feed problems*) - Worse for small retrievals and random I/O - Uses more system memory and CPU
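
One way to see how a column store can operate on compressed data is run-length encoding, shown below as a minimal Python sketch. Run-length encoding is chosen here only as an example technique; the slides do not say which encodings a given product uses, and the column contents and function names are hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def count_equal(encoded, target):
    """Count matching rows directly on the encoded runs, without decompressing."""
    return sum(length for value, length in encoded if value == target)

# Hypothetical sorted column of state codes.
state_column = ["CA"] * 4 + ["NY"] * 3 + ["TX"] * 2
encoded = rle_encode(state_column)

print(encoded)
print(count_equal(encoded, "NY"))
```

The count touches three runs instead of nine values, which is why sorted, low-cardinality columns tend to compress and scan well.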
  28. 28. Explosion of Analytic Techniques. [Diagram: advanced analytic methods, surrounded by machine learning, visualization, statistics, GIS, information theory & IR, numerical methods, rules engines & constraint programming, and text mining & text analytics.]
  29. 29. Map-Reduce is a parallel programming framework that allows one to code more easily across a distributed computing environment, not a database. [Comic: “So how do I query the database?” “It’s not a database, it’s a key-value store!” “Ok, it’s not a database. How do I query it?” “You write a distributed mapreduce function in erlang.” “Did you just tell me to go to hell?” “I believe I did, Bob.”]
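
The programming model itself can be sketched in a few lines of single-process Python. This is not Hadoop or any real framework: the function names and the sample log lines are hypothetical, and a real implementation would run the map and reduce steps on many machines and shuffle data between them over the network.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input records.
log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = map_reduce(log_lines)
print(counts)
```

Note that the input has no schema and there is no query language: the "query" is the pair of functions, which is the contrast with a database that the slide draws.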
  30. 30. What’s Different. No database. No schema. No metadata. No query language*. Good for: ▪ Processing lots of complex or non-relational data ▪ Batch processing for very large amounts of data. * Hive, Hbase, Pig, others
  31. 31. Using MapReduce / Hadoop. Hadoop is one implementation of MapReduce. There are different variations with different performance and resource characteristics, e.g. Dryad, CGL-MR, MPI variants. Hadoop is only part of the solution. You need more for enterprise deployment. Cloudera’s distribution for Hadoop shows what a complete environment could look like. Image: Cloudera
  32. 32. How Hadoop fits into a traditional BI environment. [Diagram: source environments (databases, documents, flat files, XML, queues, ERP, applications); file loads and ETL into the data warehouse; developers with development tools and IDEs; analysts with analysis tools and BI; end users with BI and applications.]
  33. 33. NoSQL theoretically = “not only sql”, in reality… Data stores that augment or replace relational access and storage models with other methods. Different storage models: • Key-value stores • Column families • Object / document stores • Graphs. Different access models: • SQL (rarely) • programming API • get/put. Reality: mostly suck for BI & analytics. Analytic DB vendors are coming from the other direction: • Aster Data – SQL wrapped around MR • EMC (Greenplum) – MR on top of the database
  34. 34. Some realities to consider. Cheap performance? ▪ Do you have 20 blades lying around unused? ▪ How much concurrency? ▪ How much effort to write queries? Debug them? ▪ Performance comparisons: 10x slower on the same hardware? The key is the workload type and the scale of it.
  35. 35. Do you really need a rack of blades for computing? Graphics co-processors have been used for certain problems for years. Offer single-system solution to offload very large compute-intensive problems. Order of magnitude cost reduction, order of magnitude performance increase with current technology today (for compute-intensive problems). We’ve barely started with this.
  36. 36. Other Options for analytic software deployment. The basic models: 1. Separate tools and systems (MapReduce and NoSQL are a simple variation on this theme) 2. Integrated with a database 3. Embedded in a database. The primary arguments about deployment models center on whether to take data to the code or code to the data.
  37. 37. Leveraging the Database. Levels of database integration: ▪ Native DB connector ▪ External integration ▪ Internal integration ▪ Embedded. + Less data movement + Possible dev process support + Hardware / environment savings + Possible “sandboxing” support - Limitations on techniques
  38. 38. In-database Execution. You can do a lot with standards-compliant SQL. If the database has UDFs, you can code too (but it’s harder). Parallel support for UDFs varies. Some vendors build functions directly into the database (usually scalar). Iterative algorithms (ones that converge on a solution) are problematic, more so in MPP.
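
A scalar UDF can be illustrated with Python's standard `sqlite3` module. SQLite is used here only as a convenient stand-in, since it is not an analytic database and says nothing about parallel UDF support; the `score` function, its made-up weights, and the `customers` table are all hypothetical.

```python
import sqlite3

def score(income, age):
    """Hypothetical scalar scoring function; the weights are made up."""
    return 0.001 * income - 0.5 * age

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as score(income, age).
conn.create_function("score", 2, score)

conn.execute("CREATE TABLE customers (id INTEGER, income REAL, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 50000, 40), (2, 120000, 30), (3, 36000, 60)],
)

# The scoring runs inside the query, next to the data, row by row.
scored = conn.execute(
    "SELECT id, score(income, age) FROM customers ORDER BY id"
).fetchall()
print(scored)
```

Because the function sees one row at a time, this pattern suits scoring well but fits iterative algorithms poorly, which matches the caveat on the slide.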
  39. 39. What are the factors in the decision? User concurrency: one job or many. Repetition is a key element: ▪ Execute once and apply (build a response or mortality model) ▪ Many executions daily (web cross-sells). In-process or batch? ▪ Batch and use results – segment, score ▪ In-process reacts on demand – detect fraud, recommend. In-process requires thinking about how it integrates with the calling application. (SQL is sometimes not your friend.)
  41. 41. The problem of size is three problems of volume. Computations! Number of users! Amount of data!
  42. 42. H
  43. 43. Lots of H. “More” can become a qualitative rather than quantitative difference.
  44. 44. Really lots of H. “Databases are dead!” – famous last words
  45. 45. Hardware Architectures and Deployment. Compute and data sizes are the key requirements. [Chart: computations (MF, GF, TF, PF) against data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB), with zones for PC, shared everything or shared disk, shared nothing, and MR and related.]
  46. 46. Hardware Architectures and Deployment. Today’s reality, and true for a while in most businesses. [Chart: computations (MF to PF) against data volume (<10s GB to PB), with a marker: “The bulk of the market resides here!”]
  47. 47. Hardware Architectures and Deployment. Today’s reality, and true for a while in most businesses. [Chart: computations (MF to PF) against data volume (<10s GB to PB), with markers: “The bulk of the market resides here!” and “…but analytics pushes many things into the MPP zone.”]
  48. 48. The real question: why do you want a new platform? Trouble doing what you already do today ▪ Poor response times ▪ Not meeting availability deadlines. Doing more of what you do today ▪ Adding users, mining more data. Doing something new with your data ▪ Data mining, recommendations, embedded real-time process support. What’s desired is possible but limited by the cost of supporting or growing the existing environment.
  49. 49. The World According to Gartner: One Magical Quadrant. Annotations on the Magic Quadrant for Data Warehouse Database Management Systems: ▪ SQL Server 2008 R2 (PDW): official production customers? ▪ EMC / Greenplum: SQL limitations, memory / concurrency issues ▪ Ingres: OLTP database ▪ Illuminate: SQL limitations, very limited scalability ▪ Sun: MySQL for a DW, is this a joke?
  50. 50. The assumption of the warehouse as a database is gone. [Diagram: a grid of data type against data state. Non-traditional data (logs, audio, documents): parallel programming platforms for data at rest, message streams for data in motion. Traditional tabular or structured data: databases for data at rest, streaming DBs/engines for data in motion.] Copyright Third Nature, Inc.
  51. 51. Data Access Differences. Basic data access styles: ▪ Standard BI and reporting ▪ Dashboards / scorecards ▪ Operational BI ▪ Ad-hoc query and analysis ▪ Batch analytics ▪ Embedded analytics. Data loading styles: ▪ Refresh ▪ Incremental ▪ Constant
  52. 52. Evaluating ADB Options. Storage style: ▪ Files, tables, columns, cubes, KV. Storage type: ▪ Memory, disk, hybrid, compressed. Scaling model: ▪ SMP, clustered, MPP, distributed. Deployment model: ▪ Appliance, cloud, SaaS, on-premise. Data access model: ▪ SQL, MapReduce, R, languages, etc. License options: ▪ CPU, data size, subscription
  53. 53. What’s it going to cost? A small sample at list:

      | Solution | Pricing model | Price/unit | 1 TB solution | Remarks |
      |---|---|---|---|---|
      | DatAupia | Node | $19,500/2TB | $19,500 | You can’t buy a 1 TB Satori server |
      | Kickfire (out of business) | Data volume (raw) | $50,000/TB | $50,000 | Includes MySQL 5.1 Enterprise |
      | Vertica | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each |
      | ParAccel | Data volume (raw) | $100,000/TB | $200,000 | Based on 5 nodes, $20,000 each |
      | EXASOL | Data volume (active) | $1,350/GB (€1,000/GB) | $350,000* | Based on 4 nodes, $20,000 each |
      | Teradata | Node | $99,000/TB | $99,000** | Based on 2550 base configuration |

      * 1TB raw ± 200 GB active. ** Realistic configuration likely 2x this price.
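
The "1 TB solution" figures for the volume-priced products appear to be the license price plus the hardware nodes named in the remarks. That formula is an inference from the numbers, not something the slide states, and the function name below is hypothetical. A short Python check under that assumption:

```python
def one_tb_list_price(license_per_unit, units, nodes=0, node_price=0):
    """Assumed formula: software license plus commodity hardware nodes."""
    return license_per_unit * units + nodes * node_price

# Vertica / ParAccel: $100,000 per raw TB, plus 5 nodes at $20,000 each.
vertica = one_tb_list_price(100_000, 1, nodes=5, node_price=20_000)

# EXASOL: $1,350 per active GB; the footnote equates 1 TB raw with
# roughly 200 GB active. Plus 4 nodes at $20,000 each.
exasol = one_tb_list_price(1_350, 200, nodes=4, node_price=20_000)

print(vertica, exasol)
```

Under this reading the license is only part of the bill, so comparing the Price/unit column alone would understate the cost of the software-only products.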
  54. 54. Factors and Tradeoffs. The core tradeoff is not always money for performance. What else do you trade? • Load time • Trickle feeds • New ETL tools • New BI tools • Operational complexity: data integration and management, backups, hardware maintenance
  55. 55. The Path to Performance. 1. Laborware – tuning 2. Upgrade – try to solve the problem without changing out the database 3. Extend – add an ADB or Hadoop cluster to the environment to offload a specific workload 4. Replace – out with the old, in with the new
  56. 56. One Word: PoC!
  57. 57. The Future. Assuming the database market embraces MPP, you have compute power that exceeds what the DB itself needs. Why not execute the code at the data? Even without MPP, moving to in-database analytic processing is a future direction and is workable for a large number of people.
  58. 58. Thank you!
  59. 59. Image Attributions. Thanks to the people who supplied the images used in this presentation: Atomic Avenue #1 by Glen Orbik ‐ hole galaxy ‐ peru.jpg ‐ toy truck.jpg ‐ purple2.jpg ‐ ‐ ‐ ‐ kids truck peru.jpg ‐
  60. 60. What’s best for which types of problems?* Shared nothing will be best for solving large data problems, regardless of workload or concurrency. Column-stores will improve query response time problems for most traditional query and aggregation workloads. Row-stores will be better for operational BI or embedded BI. Fast storage always makes things better, but is only cost-effective for medium scale or smaller data. Compression will help everyone, but column-stores more than row-stores because of how the engines work. Map-Reduce and distributed filesystems offer advantages of a schema-less storage & analytic layer that can process into relational databases. SMP and in-memory will be better for high complexity problems under moderate data scale, shared-nothing and MR for large data scale. * The answer is always “it depends”
  61. 61. About the Presenter Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit
  62. 62. About Third Nature. Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data integration and information management. If your question is related to BI, open source, web 2.0 or data integration then you’re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.