
Demystifying Systems for Interactive and Real-time Analytics


A number of systems have been released recently for use in interactive and real-time analytics. Examples include Drill, Druid, Impala, Muppet, Shark/Spark, Storm, and Tez. It can be confusing for a practitioner to pick the best system for her specific needs. Statements like “this system is 10x better than Hive” can be misleading without understanding factors like: (i) the workload and environment where the improvement can be repeatably obtained, (ii) whether proper system tuning can change the result, and (iii) whether the results can be different under other workloads. Duke and two other research institutions are jointly conducting a large-scale experimental study with multiple systems and workloads in order to answer these questions of broad interest. The workloads used in the study represent new-generation analytics needs that cover a diverse spectrum including SQL-like queries, machine-learning analysis, graph and matrix processing, and queries running continuously over rapid data streams. The talk will use the results from this study to present the strengths and weaknesses of each system, and rigorously characterize the scenarios where each system is the right choice. Opportunities to improve the systems with new features, or by cross-pollination of features across systems, will also be presented.



  1. The BigFrame Team — Duke University, Hong Kong Polytechnic University, and HP Labs
  2. Introduction • Who am I: Shivnath Babu • Associate Prof. of Computer Science at Duke University • Chief Scientist at Unravel Data Systems • Build tools for easy system management • What is this talk about: BigFrame • BigFrame helps you benchmark big data analytics systems … • … with a benchmark created automatically by BigFrame … • … for your custom application and workload needs • First open-source release planned for August 2013
  3. Analytics System Landscape
  11. What does this mean for Big Data Practitioners?
  12. Gives them a lot of power!
  13. Even the mighty may need a little help
  14. Challenges for Practitioners (App Developers, Data Scientists): Which system should I use for the app I am developing? • Features (e.g., graph data) • Performance (e.g., claims like “System A is 50x faster than System B”) • Resource efficiency • Growth and scalability • Multi-tenancy
  15. Challenges for Practitioners (System Admins): Different parts of my app have different requirements. Compose “best of breed” systems, or use a “one size fits all” system? Managing many systems is hard!
  16. Challenges for Practitioners (CIO): What is the Total Cost of Ownership (TCO)?
  17. One Approach
  18. Useful, But …
  19. How a user uses BigFrame: the BigFrame Interface feeds the Benchmark Generator, which drives the Benchmark Driver for the system under test (e.g., HBase, Hive, MapReduce)
  20. bspec (Benchmark Specification): 1. Data for initial load 2. Data refresh pattern 3. Query streams 4. Evaluation metrics — run against systems such as HBase, Hive, MapReduce
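To make the four parts of a bspec concrete, here is a minimal sketch of one as a plain Python dict. The four component names come from the slide; the dict layout, field names, and values are illustrative assumptions, not BigFrame's actual file format.

```python
# Hypothetical bspec with the four parts named on the slide.
# Field names and values are assumptions for illustration only.
bspec = {
    "initial_load": {"tables": ["Item", "Web_sales"], "scale_gb": 100},
    "refresh_pattern": {"interval_s": 60, "rows_per_batch": 10_000},
    "query_streams": [["Q1", "Q2"], ["Q3"]],   # one list per concurrent stream
    "evaluation_metrics": ["latency", "throughput"],
}

# A driver would run the streams concurrently, and the queries
# within each stream in order:
total_queries = sum(len(stream) for stream in bspec["query_streams"])
print(total_queries)
```

The point of the structure is that a Benchmark Driver can be written once against this shape and reused for any system under test.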
  21. What does the user (want to) specify? — the BigFrame Interface
  22. The 3Vs
  23. bigif: BigFrame’s InputFormat • Data Variety: relational, text, array, graph • Data Volume: small, medium, large • Data Velocity: at rest, slow, fast • Query Variety: micro, macro • Query Volume: query concurrency & classes • Query Velocity: exploratory, continuous
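A bigif pairs the 3Vs of data with the 3Vs of queries. A minimal sketch, expressed as a Python dict: the six dimension names come from the slide, but the concrete schema and values are assumptions made for illustration, not BigFrame's real API.

```python
# Hypothetical bigif, covering the six 3V dimensions from the slide.
# Schema and values are illustrative assumptions.
bigif_spec = {
    "data": {
        "variety": ["relational", "text", "graph"],  # also: array
        "volume": "large",                           # small | medium | large
        "velocity": "fast",                          # at rest | slow | fast
    },
    "queries": {
        "variety": "macro",        # micro | macro
        "volume": {"concurrency": 8, "classes": ["BI", "ML"]},
        "velocity": "continuous",  # exploratory | continuous
    },
}

def validate(spec):
    """Check that every 3V dimension is specified for both data and queries."""
    dims = {"variety", "volume", "velocity"}
    return dims <= spec["data"].keys() and dims <= spec["queries"].keys()

print(validate(bigif_spec))
```

Given a complete bigif like this, the Benchmark Generator can pick data generators and query templates that match each dimension.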
  24. Benchmark Generation — the Benchmark Generator
  25. Application Domain Modeled Currently: e-commerce sales, promotions, and recommendations; social media sentiment & influence
  26. Application Domain Modeled Currently — entities: Item, Customer, Web_sales, Promotion, Tweets, Relationships
  27. Application Domain Modeled Currently — relational subset: Item, Web_sales, Promotion
  28. Application Domain Modeled Currently
  29. Benchmark Generation — the Benchmark Generator
  30. Use Case I: Exploratory BI • Large volumes of relational data • Mostly aggregation and few joins • Can Spark’s performance match that of an MPP database?
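The workload shape this use case describes — heavy aggregation with a single join over relational data — can be illustrated with a toy query. The table names follow the e-commerce schema from the earlier slides (Item, Web_sales); the columns and rows are made-up data, and SQLite stands in here only to make the example self-contained, not as a system under test.

```python
import sqlite3

# Toy instance of an exploratory-BI query: one join, one group-by aggregate.
# Table names follow the slides' schema; columns and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Item(item_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE Web_sales(item_id INTEGER, price REAL);
    INSERT INTO Item VALUES (1, 'books'), (2, 'toys');
    INSERT INTO Web_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")
rows = conn.execute("""
    SELECT i.category, SUM(w.price) AS revenue
    FROM Web_sales w JOIN Item i ON w.item_id = i.item_id
    GROUP BY i.category ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('books', 15.0), ('toys', 7.5)]
```

At benchmark scale, the same query shape is what separates a scan-and-aggregate engine tuned for this pattern from a general-purpose one.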
  31. Use Case II: Complex BI • Large volumes of relational data • Even larger volumes of text data • Combined analytics
  32. Use Case III: Dashboards • Large volume and velocity of relational and text data • Continuously updated dashboards
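The continuous-query pattern behind a live dashboard can be sketched as a sliding-window aggregate recomputed as each event arrives. This is a minimal illustration of the pattern only; the window size and the event stream are assumptions, and a real dashboard system (e.g., one built on Storm) would distribute this over many nodes.

```python
from collections import deque

def sliding_sums(events, window=3):
    """Yield the sum over the most recent `window` events, per arrival."""
    buf = deque(maxlen=window)   # deque drops the oldest event automatically
    for value in events:
        buf.append(value)
        yield sum(buf)           # the dashboard metric, refreshed per event

print(list(sliding_sums([1, 2, 3, 4, 5])))  # [1, 3, 6, 9, 12]
```

The benchmarking question for this use case is how metric freshness and accuracy hold up as event velocity grows.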
  33. Use Case IV: Does One Size Fit All? • A growing set of applications has to process relational, text, & graph data • Compose “best of breed” systems, or use a “one size fits all” system?
  34. Use Case V: Multi-tenancy and SLAs • Big data deployments are increasingly multi-tenant and need to meet SLAs
  35. Working with the Community • First release of BigFrame planned for August 2013 • With feedback from benchmark developers (BigBench) • Open-source with extensibility APIs • Benchmark Drivers for more systems • Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking) • Instantiate the BigFrame pipeline for more app domains
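The extensibility point for adding more systems is the Benchmark Driver. One plausible shape for that plug-in API, sketched as an abstract base class: the method names and signatures here are hypothetical, chosen for illustration, and are not BigFrame's actual interface.

```python
from abc import ABC, abstractmethod

# Hypothetical Benchmark Driver plug-in interface: one subclass per
# system under test. Method names are assumptions, not BigFrame's API.
class BenchmarkDriver(ABC):
    @abstractmethod
    def load(self, dataset_path):
        """Load the generated benchmark data into the system under test."""

    @abstractmethod
    def run_query(self, query):
        """Execute one benchmark query and return its latency in seconds."""

class EchoDriver(BenchmarkDriver):
    """Trivial driver used here only to show the plug-in pattern."""
    def load(self, dataset_path):
        self.path = dataset_path
    def run_query(self, query):
        return 0.0  # a real driver would time the query on its system

driver = EchoDriver()
driver.load("/tmp/bigframe-data")
print(driver.run_query("SELECT 1"))  # 0.0
```

A new system is then benchmarked by implementing one such subclass, with the rest of the pipeline (generator, streams, metrics) unchanged.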
  36. • “Benchmarks shape a field (for better or worse) …” (David Patterson, University of California, Berkeley) • Benchmarks meet different needs for different people: end customers, application developers, system designers, system administrators, researchers, CIOs • BigFrame helps users generate benchmarks that best meet their needs