Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop

SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks

Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.

Published in: Technology

  1. Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics on Hadoop. Ian Fyfe, Director, Product Marketing. August 2nd, 2016
  2. Agenda (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
     • Hadoop architecture vs. BI tool requirements
     • Hive on MapReduce
     • The Data Movement Work-Around
     • Hive on Tez and LLAP
     • In-Hadoop Databases: Apache HAWQ, Apache Impala
     • In-Memory: AtScale, Zoomdata
     • Conclusion and summary
  3. Hortonworks
     • The company behind Apache Hadoop
         – Hortonworks Data Platform (HDP)
         – Hortonworks Data Flow (HDF)
     • Strong partnership with Pivotal
         – Pivotal is converting Pivotal Hadoop Distribution customers to HDP
         – Hortonworks is reselling Pivotal HDB subscriptions
  4. Hadoop “Classic” Core Components
     • HDFS: a distributed file system allowing massive storage across a cluster of commodity servers
     • MapReduce: a framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
         – Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
         – Massively scalable
         – High latency
  5. Related Projects
     • Hive: a data warehouse infrastructure on top of Hadoop
         – Implements a SQL-like query language, including a JDBC driver
     • HBase: the Hadoop database
         – AH HA!
         – NoSQL database problematic for traditional BI
             • Apache Phoenix provides SQL interface
         – Best at storing large amounts of unstructured data!
         – Not optimized for aggregate (BI style) queries
  6. Unfortunately Hadoop wasn't originally designed for most BI requirements …
  7. Low-latency queries at petabyte scale. Full ANSI SQL Compliance.
  8. … but the situation is rapidly improving!
  9. Hive on MapReduce (Hive 1.0)
  10. Apache Hive 1.0
     • Facilitates querying and managing large datasets.
     • Data analysts use Hive to explore, structure and analyze that data using a SQL-like language called HiveQL
     • Hive on MapReduce
         – Massively scalable to PB range
         – Batch reporting & ETL
         – High latency: queries in minutes or hours
         – Limited ANSI SQL compliance
  11. The Result of Hive on MR … The Data Movement Work-Around
  12. The Data Movement Work-Around
     • ETL the data into traditional data marts or data warehouses
         – Diagram labels: Oracle, Vertica, Netezza, MySQL, etc.; ETL tool or code
     • Low latency queries
     • SQL compliance
     • Currency
     • Cost
     • Complexity
  13. Hive on Tez (Hive 2.0)
  14. Hive on Tez
     • Apache Tez
         – Alternate query framework to MapReduce which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data
         – Significantly faster than MapReduce
     • Many times faster than Hive on MR
     • PB scale
     • Still quite high query latency for BI tools
  15. Hive on LLAP (Hive 2.1)
  16. HDP 2.5 is a Major Milestone for Hive
     • At a High Level:
         – 2000+ features, improvements and bug fixes in Hive since HDP 2.4.
         – 600+ of these from outside of Hortonworks.
         – [Chart: 1391 from Hortonworks, 642 from Community]
     • Major Improvements:
         – Hive LLAP: Persistent query servers with intelligent in-memory caching.
         – ACID GA: Hardened and proven at scale.
         – Expanded SQL Compliance: More capable integration with BI tools.
         – Performance: Interactive query, 2x faster ETL.
         – Security: Row / Column security extending to views, Column level security for Spark.
         – Operations: LLAP integration in Ambari, new Grafana dashboards.
     • Hive 2 Highlights
         – Interactive Query with Hive LLAP
         – SQL ACID Fully Supported
         – 2x Faster ETL
  17. Hive 2 with LLAP: Architecture Overview
     • [Diagram components: SQL queries over ODBC / JDBC; HiveServer2; query coordinators (App Masters); a YARN cluster of LLAP daemons, each with executors and an in-memory cache; deep storage (HDFS, S3 + other HDFS-compatible filesystems)]
  18. Hive 2 with LLAP: Preliminary Numbers
     • [Chart: Hive 2.0 and LLAP, TPC-DS at 10 TB scale, 18 nodes. Series: Hive2.0-Tez, LLAP. Queries: q3, q7, q12, q13, q19, q21, q26, q27, q42, q43, q45, q52, q55, q60, q73, q84, q89, q91, q98.]
     • Min query time: Query 55: 2.38s
  19. Hive 2 with LLAP: Linear Scaling at 1TB: 8 nodes versus 16 nodes.
     • [Chart: Average Query Time by Concurrency. Time (s) against concurrent queries (8, 16, 32). Series: 8 Node, 16 Node.]
     • [Chart: Maximum Query Time by Concurrency. Time (s) against concurrent queries (8, 16, 32). Series: 8 Node, 16 Node.]
  20. Hive 2 with LLAP Enables Interactive Query In Seconds
     • Faster interactive query
     • Faster ETL
     • Expanded SQL compliance for BI tools (nearing SQL:2011)
     • Enterprise Readiness: granular row & column level security
     • Simplified Operations: LLAP integration with Ambari with automated dashboards
     • TB scale datasets, not PB scale
  21. Caution: Not All Hives Are Created Equal
     • Apache Hive in Hortonworks HDP
         – Supports LLAP (with HDP 2.5)
         – Supports Tez
         – Supports ORC, Atlas, Ranger
         – Supports Vectorization
         – Supports In-Memory Computation
     • Apache Hive in Cloudera CDH
         – Lacks LLAP Support
         – Lacks Tez Support
         – Lacks ORC Support
         – Lacks Vectorization Support
         – Lacks In-Memory Support
     • Note: I’ll talk about Impala later
  22. In-Hadoop Databases: Apache HAWQ
  23. Apache HAWQ (Pivotal/Hortonworks HDB)
     • Hadoop-native SQL query engine and advanced analytics MPP database that offers high-performance interactive ANSI SQL query execution and machine learning for Data Analysts & Data Scientists who want to find insights from large/complex datasets.
     • Created by Pivotal, based on Greenplum core
     • Resold as Hortonworks HDB
     • [Logo: Hortonworks HDB, powered by Apache HAWQ]
  24. HAWQ Architecture
  25. Apache Hive & HAWQ/HDB
     • Apache Hive
         – Multiple subject areas
         – Holds very detailed information
         – Scale: Multiple Petabytes
         – Integrates all data sources
         – ETL, Reporting & BI
         – Low-Mid Query Latency
     • Apache HAWQ / HDB
         – Single Subject Mart
         – Summarized information
         – Scale: 100s TB
         – Ad-hoc Analytics & Visualization
         – Machine Learning
         – Low Query Latency
     • Right Tool for the Job: Choose the right SQL engine based on your application’s needs.
  26. In-Hadoop Databases: Apache Impala (incubating)
  27. Apache Impala (Incubating)
     • Brings scalable massively parallel processing (MPP) database technology to Hadoop
         – Circumvents MapReduce
         – Directly accesses the data through a specialized distributed query engine
     • MapReduce data processing and interactive queries can be done on the same system using the same data and metadata
     • Uses metadata, ODBC driver, and SQL syntax from Apache Hive
  28. 5-User Testing Results: Industry Standard TPC-DS Queries
     • HAWQ 30% faster
     • Impala failed to complete 47% of the queries
     • * Queries that did not complete are omitted from results on both platforms
     • [Chart: query time in seconds for TPC-DS queries 1 to 99. Impala test query failures are marked as Unsupported SQL, Long running killed, or Memory Limit Exceeded.]
  29. Apache Impala
     • Fast interactive query
     • Re-uses Hive metadata and JDBC driver
     • Incomplete ANSI SQL compliance
     • User concurrency stability issues
     • TB scale, not PB scale
     • Vendor-specific security model (Cloudera)
  30. Apache HAWQ vs. Apache Impala
     • Apache HAWQ / HDB
         – Deep YARN Integration
         – Best In Class Optimizer
         – Full ANSI SQL Compliance
         – Integrated Predictive Modeling
         – Performance Advantage 30%-600% over Impala
     • Apache Impala
         – No YARN Integration: Poor Cluster Utilization
         – Incomplete SQL Support
         – No Built-in support for Predictive Modeling
         – Poor concurrency
  31. In-Memory Approach: AtScale
  32. AtScale
  33. AtScale: Turn Your Hadoop Cluster into Scale-Out OLAP Server
     • Any BI Tool
     • No Data Movement
     • Single Semantic Layer
     • Resold by Hortonworks
  34. AtScale Architecture – Leverages Auto-builds & maintains aggregates in Spark
  35. AtScale
     • Fast interactive query
     • Full ANSI SQL Compliance
     • Any BI tool
     • No data movement
     • Good user concurrency
     • It is a “middleware” layer running on an edge node that you need to maintain
  36. In-Memory + Micro-Queries Approach: Zoomdata
  37. Zoomdata: Micro-Queries + Spark
     • Patented technique for delivering fast visualization of large volumes of data
         – Immediately displays a partial or approximate rendering which then becomes more accurate over time
     • Single logical query turned into a set of micro-queries executed in parallel
         – Results from the first micro-query immediately displayed
         – As the rest of the micro-queries complete, Zoomdata’s streaming architecture updates the visualization with new data until the full result set comes into focus
         – Interact with the data while data is still being processed
     • Leverages Spark as internal in-memory database
     • Callout: Data Sharpening
  38. Zoomdata
     • Fast interactive query
     • Full ANSI SQL Compliance
     • Built-in visualization
     • No data movement
     • Good user concurrency
     • Runs on an edge node that you need to maintain
     • Not designed to work with other BI tools
  39. Conclusion
  40. The Holy Grail of sub-second queries and full SQL compliance against PB-scale datasets in Hadoop is not easy.
  41. But through a combination of innovation at the core and vendor innovation … we are getting closer.
  42. Summary Scorecard
     • Columns: Technology, Scale, Speed, SQL Compliance
     • Technologies: Apache Hive on MapReduce; Apache Hive on Tez; Apache Hive on LLAP; Hive on Tez + LLAP (HDP 2.5); Data Movement Work-Around; Apache HAWQ; Apache Impala (incubating); AtScale; Zoomdata
  43. Thank You. ifyfe@hortonworks.com, http://hortonworks.com
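
The MapReduce slide describes a problem being broken into small fragments of work that can be computed or recomputed in isolation on any node. A minimal single-process sketch of that idea in Python follows. It is an illustration only and not Hadoop's API: the sample records and the names `FRAGMENTS`, `map_fragment`, `reduce_partials`, and `run_job` are hypothetical, and the map step pre-aggregates each fragment locally, which in Hadoop terms is closer to a map followed by a combiner.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]

# Hypothetical sample data: (region, sale_amount) records split into fragments,
# standing in for file blocks spread across the nodes of a cluster.
FRAGMENTS: List[List[Record]] = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8), ("north", 1)],
]


def map_fragment(fragment: Iterable[Record]) -> Dict[str, int]:
    """Aggregate one fragment in isolation; it can be re-run on any node."""
    partial: Dict[str, int] = defaultdict(int)
    for region, amount in fragment:
        partial[region] += amount
    return dict(partial)


def reduce_partials(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Merge the per-fragment partial sums into the final totals."""
    totals: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            totals[region] += amount
    return dict(totals)


def run_job(fragments: Iterable[Iterable[Record]]) -> Dict[str, int]:
    """Run the map step over every fragment, then reduce the results."""
    return reduce_partials(map_fragment(f) for f in fragments)


if __name__ == "__main__":
    print(run_job(FRAGMENTS))
```

Because each fragment is processed independently and the merge is a plain sum, the result does not depend on the order in which fragments finish, which is what allows a failed fragment to be recomputed elsewhere.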

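
The Zoomdata slides describe turning one logical query into micro-queries run in parallel, with the visualization updated as each one completes. The following Python sketch shows the general pattern of progressive refinement for an average. It is not Zoomdata's patented implementation and does not use Spark; the partitions and the names `micro_query` and `progressive_average` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterator, List


def micro_query(partition: List[int]) -> Dict[str, int]:
    """Answer the query for one slice of the data (a stand-in for a real scan)."""
    return {"sum": sum(partition), "count": len(partition)}


def progressive_average(
    partitions: List[List[int]], max_workers: int = 4
) -> Iterator[float]:
    """Yield a running average that is refined as each micro-query completes."""
    total = 0
    count = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(micro_query, p) for p in partitions]
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            result = future.result()
            total += result["sum"]
            count += result["count"]
            if count:
                yield total / count


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for estimate in progressive_average(data):
        print(f"current estimate: {estimate:.2f}")
```

The intermediate estimates depend on which micro-query finishes first, so only the final value is deterministic; a front end would redraw on each yielded estimate until the last one arrives.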