Incredible Impala


Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.

  • Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. This allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  • Takeaway: MapReduce is going away in favor of multi-framework Hadoop. Most of the replacements are improved MapReduce. Impala is different.
  • Interactive SQL for Hadoop: responses in seconds vs. minutes or hours; 4-65x faster than Hive, with up to 100x seen. Nearly ANSI-92 standard SQL with HiveQL: CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.; ODBC/JDBC drivers; a compatible SQL interface for existing Hadoop/CDH applications. Native MPP query engine: purpose-built for low-latency queries (another application being brought to Hadoop); a separate runtime from MapReduce, which is designed for batch processing. Tightly integrated with the Hadoop ecosystem, a major design imperative and differentiator for Cloudera: a single system (no integration); native, open file formats that are compatible across the ecosystem (no copying); a single metadata model (no synchronization); a single set of hardware and system resources (better performance, lower cost); integrated, end-to-end security (no vulnerabilities). Open source: keeps with our strategy of an open platform, i.e. if it stores or processes data, it's open source; Apache-licensed; code available on GitHub.
  • Interactive BI/analytics on more data: raw, full-fidelity data, with nothing lost through aggregation or ETL/ELT; new sources and types, structured and unstructured; historical data. Asking new questions: exploration and data discovery for analytics and machine learning (finding a data set for a model requires lots of simple queries to summarize, count, and validate); hypothesis testing, avoiding having to subset and fit the data to a warehouse just to ask a single question. Data processing with tight SLAs. A cost-effective platform: minimize data movement; reduce strain on the data warehouse. Query-able storage: replace the production data warehouse for DR/active archive; store decades of data cost-effectively (for better modeling or data retention mandates) without sacrificing the capability to analyze it.
  • Now, we’ve finished scanning the RHS table and have finished building the hash table. We can now start scanning the LHS table to do the join.
  • If there’s a match, the joined row will bubble up the execution tree to the aggregation node.
  • This row doesn’t match, so it won’t bubble up.
  • Now that all the rows have been returned from the hash join node, the aggregation node can start returning rows.
  • Table B is scanned in parallel and broadcast to all impalad instances. Each impalad reads its local data blocks for A and does the join. This is a broadcast join. After the join is done, we do the aggregation. But before we can produce the final result, we need to redistribute the results of the local aggregation according to the GROUP BY expression “state” and do the final aggregation.
  • We added a redundant condition in the WHERE clause that doesn't change the query semantics or the results returned. This is mentioned transparently in both our public blog post and the published queries as the "explicit partition filter/predicate." Like window functions, this is a workaround for a feature limitation in both Impala and Hive, matching what a user would do to optimize for these systems. Please also note that this change was made for all compared systems (Impala, Hive, and "DBMS-Y") to ensure an apples-to-apples comparison.

Transcript

  • 1. 1 Impala: Modern, Open-Source SQL Engine For Hadoop Gwen Shapira @gwenshap gshapira@cloudera.com
  • 2. Agenda • Why Hadoop? • Data Processing in Hadoop • User’s view of Impala • Impala Use Cases • Impala Architecture • Performance highlights 2
  • 3. 3 In the beginning…. was the database
  • 4. For a while, the database was all we needed. 4
  • 5. Data is not what it used to be 5 [Chart: data growth from 1980 to 2012; structured data 20%, unstructured data 80%]
  • 6. Hadoop was Invented to Solve: • Large volumes of data • Data that is only valuable in bulk • High ingestion rates • Data that requires more processing • Differently structured data • Evolving data • High license costs 6
  • 7. What is Apache Hadoop? 7 Has the Flexibility to Store and Mine Any Type of Data  Ask questions across structured and unstructured data that were previously impossible to ask or solve  Not bound by a single schema Excels at Processing Complex Data  Scale-out architecture divides workloads across multiple nodes  Flexible file system eliminates ETL bottlenecks Scales Economically  Can be deployed on commodity hardware  Open source platform guards against vendor lock-in Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is…  Distributed  Fault tolerant  Scalable CORE HADOOP SYSTEM COMPONENTS
  • 8. 8 Processing Data in Hadoop
  • 9. Map Reduce • Versatile • Flexible • Scalable • High latency • Batch oriented • Java • Challenging paradigm 9
  • 10. Hive & Pig • Hive – Turn SQL into MapReduce • Pig – Turn execution plans into MapReduce • Makes MapReduce easier • But not any faster 10
  • 11. Towards a Better Map Reduce • Spark – Next generation MapReduce With in-memory caching Lazy Evaluation Fast recovery times from node failures • Tez – Next generation MapReduce. Reduced overhead, more flexibility. Currently Alpha 11
  • 12. 12 And now to something completely different!
  • 13. What is Impala? 13
  • 14. Impala Overview 14 Interactive SQL for Hadoop  Responses in seconds  Nearly ANSI-92 standard SQL with Hive SQL Native MPP Query Engine  Purpose-built for low-latency queries  Separate runtime from MapReduce  Designed as part of the Hadoop ecosystem Open Source  Apache-licensed
  • 15. Impala Overview Runs directly within Hadoop  reads widely used Hadoop file formats  talks to widely used Hadoop storage managers  runs on same nodes that run Hadoop processes High performance  C++ instead of Java  runtime code generation  completely new execution engine – No MapReduce
  • 16.  Beta version released in October 2012  General availability (v1.0) released in April 2013  Latest release (v1.2.3) released on December 23rd Impala is Production Ready
  • 17. User View of Impala: Overview • Distributed service in cluster: one Impala daemon on each node with data • Highly available: no single point of failure • Submit query to any daemon: • ODBC/JDBC • Impala CLI • Hue • Query is distributed to all nodes with relevant data • Impala uses Hive’s metadata
  • 18. User View of Impala: File Formats • There is no ‘Impala format’. • Impala supports: • Uncompressed/lzo-compressed text files • Sequence files and RCFile with snappy/gzip compression • Avro data files • Parquet columnar format (more on that later) • HBase
  • 19. User View of Impala: SQL Support • Most of SQL-92 • INSERT INTO … SELECT … • Only equi-joins; no non-equi joins, no cross products • Order By requires Limit (for now) • DDL support • SQL-style authorization via Apache Sentry (incubating) • UDFs and UDAFs are supported
  • 20. 20 Use Cases
  • 21. Impala Use Cases 21 Interactive BI/analytics on more data Asking new questions – exploration, ML Data processing with tight SLAs Query-able archive w/full fidelity Cost-effective, ad hoc query environment that offloads the data warehouse for:
  • 22. Global Financial Services Company 22 Saved 90% on incremental EDW spend & improved performance by 5x Offload data warehouse for query-able archive Store decades of data cost-effectively Process & analyze on the same system Improved capabilities through interactive query on more data
  • 23. Digital Media Company 24 20x performance improvement for exploration & data discovery Easily identify new data sets for modeling Interact with raw data directly to test hypotheses Avoid expensive DW schema changes Accelerate ‘time to answer’
  • 24. 25 Impala Architecture
  • 25. Impala Architecture • Impala daemon (impalad) – N instances • Query execution • State store daemon (statestored) – 1 instance • Provides name service and metadata distribution • Catalog daemon (catalogd) – 1 instance • Relays metadata changes to all impalads
  • 26. Impala Query Execution 27 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase SQL request 1) Request arrives via ODBC/JDBC/HUE/Shell
  • 27. Impala Query Execution 28 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase 2) Planner turns request into collections of plan fragments 3) Coordinator initiates execution on impalad(s) local to data
  • 28. Impala Query Execution 29 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase 4) Intermediate results are streamed between impalad(s) 5) Query results are streamed back to client Query results
  • 29. Query Planner 2-phase planning  Left deep tree  Partition plan to maximize data locality Join order  Before 1.2.3: Order of tables in query.  1.2.3 and above: Cost based if statistics exist Plan Operators  Scan, HashJoin, HashAggregation, Union, TopN, Exchange  All operators are fully distributed 30
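The join-ordering rule on this slide can be sketched in a few lines. This is a hypothetical toy model, not Impala's planner: given table statistics, it builds a left-deep plan with the largest table leftmost (the streamed probe side) and joins the remaining tables in decreasing size, so each hash table is built from a comparatively small input.

```python
def order_joins(tables):
    """Toy cost-based join ordering: sort by estimated row count and
    build a left-deep plan as nested tuples, e.g.
    ((big JOIN mid) JOIN small). Assumes each table is a dict with
    'name' and 'rows' keys (an illustrative schema, not Impala's)."""
    by_size = sorted(tables, key=lambda t: t["rows"], reverse=True)
    plan = by_size[0]["name"]            # largest table is streamed (LHS)
    for t in by_size[1:]:
        plan = (plan, "JOIN", t["name"])  # each RHS becomes a hash build input
    return plan
```

Without statistics (pre-1.2.3 behavior), the planner simply used the order of tables in the query, which is why collecting statistics matters.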
  • 30. 31 Query Execution Example
  • 31. Simple Example SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (id) GROUP BY state ORDER BY 2 desc LIMIT 10
  • 32. How does a database execute a query? • Left Deep Tree • Data flows from bottom to top TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 33. Wait – Why is this a left-deep tree? HashJoin Scan: t1 Scan: t3 Scan: t2 HashJoin Agg HashJoin Scan: t0
  • 34. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. • So, the RHS table (Hbase scan) is scanned first. TopN Agg Hash Join Hdfs Scan Hbase Scan Scan Hbase first
  • 35. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 36. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 37. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 38. How does a database execute a query? • Start scanning LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows TopN Agg Hash Join Hdfs Scan Hbase Scan Probe hash table and a matching row is found.
  • 39. How does a database execute a query? • Matched rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 40. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan No matching row
  • 41. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 42. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan Probe hash table and a matching row is found.
  • 43. How does a database execute a query? • Matched rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 44. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan No matching row
  • 45. How does a database execute a query? • All rows have been returned from the hash join node. Agg node can start returning rows • Rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 46. How does a database execute a query? • Rows from the aggregation node bubble up to the top-n node TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 47. How does a database execute a query? • Rows from the aggregation node bubble up to the top-n node • When all rows are returned by the agg node, the top-n node can start returning rows to the end user TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 48. Key takeaways  Data flows from bottom to top in the execution tree and finally goes to the end user  Larger tables go on the left  Collect statistics  Filter early 49
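The build/probe/aggregate walkthrough above can be summarized as a toy single-node pipeline. This is a hedged illustration of the HashJoin -> Agg -> TopN plan for the example query, not Impala's C++ engine; it assumes the HDFS table carries `id` and `revenue` and the HBase table carries `id` and `state` (the slides don't say which column lives where).

```python
from collections import defaultdict
import heapq

def execute_query(hdfs_rows, hbase_rows, limit=10):
    """Toy version of: SELECT state, SUM(revenue) FROM HdfsTbl h
    JOIN HbaseTbl b ON (id) GROUP BY state ORDER BY 2 DESC LIMIT 10.
    Data flows bottom to top: join, then aggregate, then top-n."""
    # Build phase: scan the RHS (HBase) table into a hash table keyed on id.
    hash_table = {row["id"]: row for row in hbase_rows}

    # Probe phase: stream the LHS (HDFS) table; unmatched rows are
    # discarded, matched rows bubble up into the hash aggregation.
    sums = defaultdict(float)
    for row in hdfs_rows:
        match = hash_table.get(row["id"])
        if match is None:
            continue                                 # no matching row
        sums[match["state"]] += row["revenue"]       # aggregate by state

    # TopN node: ORDER BY SUM(revenue) DESC LIMIT n.
    return heapq.nlargest(limit, sums.items(), key=lambda kv: kv[1])
```

Note how the takeaways fall out of the structure: the RHS must fit in a hash table (so smaller tables go on the right), and filtering before the join shrinks both the build and probe inputs.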
  • 49. Simpler Example SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (id) GROUP BY state
  • 50. How does an MPP database execute a query? Tbl b Scan Hash Join Tbl a Scan Exch Agg Exch Agg Agg Hash Join Tbl a Scan Tbl b Scan Broadcast Re-distribute by “state”
  • 51. How does an MPP database execute a query? A join B A join B A join B Local Agg Local Agg Local Agg Scan and Broadcast Tbl B Final Agg Final Agg Final Agg Re-distribute by “state” Local read Tbl A
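The distributed plan on these two slides can be modeled the same way. A hypothetical sketch, not Impala itself: the small table B is broadcast to every node, each node joins it against its local partition of A and pre-aggregates, and the partial sums are then redistributed by hash of the GROUP BY key so one node can finalize each group.

```python
from collections import defaultdict

def mpp_execute(a_partitions, b_rows, num_nodes):
    """Toy broadcast-join plan for SELECT state, SUM(revenue)
    FROM A JOIN B ON (id) GROUP BY state. a_partitions models the
    local HDFS blocks of A on each node; b_rows is the small table."""
    # 1) Broadcast: every node builds the same hash table of B.
    b_hash = {row["id"]: row["state"] for row in b_rows}

    # 2) Each node joins its local A blocks and pre-aggregates by state.
    local_aggs = []
    for part in a_partitions:
        local = defaultdict(float)
        for row in part:
            state = b_hash.get(row["id"])
            if state is not None:
                local[state] += row["revenue"]
        local_aggs.append(local)

    # 3) Exchange: redistribute partial sums by hash of the GROUP BY key,
    #    so all partials for a given state land on the same node.
    exchanged = [defaultdict(float) for _ in range(num_nodes)]
    for local in local_aggs:
        for state, partial in local.items():
            exchanged[hash(state) % num_nodes][state] += partial

    # 4) Final aggregation: each node merges the partials it received.
    result = {}
    for node_groups in exchanged:
        result.update(node_groups)
    return result
```

The two aggregation phases matter: without the exchange step, each node would hold only a partial sum for each state, so the "local agg" results must be redistributed before the final aggregate can be produced.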
  • 52. 53 Performance
  • 53. Impala Performance Results • Impala’s Latest Milestone: • Comparable commercial MPP DBMS speed • Natively on Hadoop • Three Result Sets: • Impala vs Hive 0.12 (Impala 6-70x faster) • Impala vs “DBMS-Y” (Impala average of 2x faster) • Impala scalability (Impala achieves linear scale) • Background • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language) • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB) • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks) • Methodical testing (multiple runs, reviewed fairness for competition, etc) • Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/ 54
  • 54. Impala vs Hive 0.12 (Lower bars are better) 55
  • 55. Impala vs “DBMS-Y” (Lower bars are better) 56
  • 56. Impala Scalability: 2x the Hardware (Expectation: Cut Response Times in Half) 57
  • 57. Impala Scalability: 2x the Hardware and 2x Users/Data (Expectation: Constant Response Times) 58 2x the Users, 2x the Hardware 2x the Data, 2x the Hardware