Cloudera Impala Overview (via Scott Leberknight)
 


Slides for presentation on Cloudera Impala by Scott Leberknight of Near Infinity

    Presentation Transcript

    • Cloudera's Impala, Scott Leberknight, 7/9/2013
    • History lesson...
    • Google Map/Reduce paper (2004); Cutting & Cafarella create Hadoop (2005)
    • Facebook creates Hive (2007)*; Google Dremel paper (2010)
    • Apache Drill proposal (August 2012); Cloudera announces Impala (October 2012); HortonWorks' Stinger (February 2013)
    • * Hive => "SQL on Hadoop": write SQL queries; translate into Map/Reduce job(s); convenient & easy; high-latency (batch processing)
    • What is Impala? In-memory, distributed SQL query engine (no Map/Reduce); native code (C++); distributed (on HDFS data nodes)
    • Why Impala? Interactive data analysis; low-latency response (roughly 4 - 100x faster than Hive); deploy on existing Hadoop clusters
    • Why Impala? (cont'd) Data stored in HDFS avoids duplicate storage, data transformation, and moving data
    • Why Impala? (cont'd) SPEED!
    • Overview: impalad daemon runs on HDFS nodes; statestored (for cluster metadata) & Hive metastore (for database metadata); queries run on "relevant" nodes; supports common HDFS file formats
    • Overview (cont'd): does not use Map/Reduce; not fault tolerant! (a query fails if any part of it fails on any node); submit queries via Hue/Beeswax, Thrift API, CLI, ODBC, JDBC
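
The "not fault tolerant" point can be pictured with a small sketch. This is not Impala's code: `run_query`, `run_fragment`, `FragmentFailed` and the node names are hypothetical, and the sketch assumes only what the slide states, namely that work is spread across nodes and the whole query fails if any one piece fails.

```python
from concurrent.futures import ThreadPoolExecutor


class FragmentFailed(Exception):
    """Raised when one node's share of the query cannot complete."""


def run_fragment(node, rows, healthy):
    # Hypothetical stand-in for the work a daemon does on its local data.
    if node not in healthy:
        raise FragmentFailed(f"fragment on {node} failed")
    return sum(rows)


def run_query(data_by_node, healthy):
    # Coordinator: fan out one fragment per node, then combine the partial results.
    # There is no retry or rescheduling, so the first failure aborts the whole query.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_fragment, node, rows, healthy)
                   for node, rows in data_by_node.items()]
        return sum(f.result() for f in futures)  # result() re-raises FragmentFailed


data = {"node1": [1, 2], "node2": [3, 4], "node3": [5]}
total = run_query(data, healthy={"node1", "node2", "node3"})
```

Calling `run_query(data, healthy={"node1", "node2"})` raises `FragmentFailed` even though two of the three fragments succeed, which is the trade-off the slide describes.
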
    • SQL Support (subset of Hive QL): SELECT; projection; UNION; INSERT OVERWRITE; INSERT INTO; ORDER BY (w/ LIMIT); aggregation; subqueries (uncorrelated); JOIN (equi-join only, subject to memory limitations)
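
The slides do not say why ORDER BY is paired with LIMIT. One plausible reading, offered here as an assumption, is that a bounded result lets an in-memory engine keep only the best N rows rather than sort the full input. The sketch below shows that idea with Python's `heapq`; `order_by_limit` and the sample rows are hypothetical.

```python
import heapq


def order_by_limit(rows, key, limit):
    # heapq.nsmallest returns the `limit` smallest rows by `key`, in sorted order,
    # without requiring the caller to sort the whole input first.
    return heapq.nsmallest(limit, rows, key=key)


customers = [("Smith", "Ann"), ("Jones", "Bob"), ("Adams", "Cy"), ("Young", "Di")]
top2 = order_by_limit(customers, key=lambda row: row[0], limit=2)
```
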
    • HBase Queries: maps HBase tables via Hive metastore mapping. HBase scan translations: row key predicates => start/stop row; non-row key predicates => SingleColumnValueFilter
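
The scan translation on this slide can be sketched as a function that splits predicates into a row range and per-column filters. Everything here is illustrative: `translate_predicates`, the `(column, op, value)` tuples and the returned dict are hypothetical and are not the HBase or Impala API. The sketch handles only `>=` and `<` on the row key; any other predicate is treated as a column filter.

```python
def translate_predicates(predicates, row_key="id"):
    """Split (column, op, value) predicates into a row range and column filters."""
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column == row_key and op == ">=":
            scan["start_row"] = value   # lower bound on the row key (inclusive)
        elif column == row_key and op == "<":
            scan["stop_row"] = value    # upper bound on the row key (exclusive)
        else:
            # Checked row by row, in the spirit of a SingleColumnValueFilter.
            scan["filters"].append((column, op, value))
    return scan


plan = translate_predicates([
    ("id", ">=", "a100"),
    ("id", "<", "a200"),
    ("state", "=", "VA"),
])
```

The range bounds let the scan skip rows outside `[a100, a200)` entirely, while the `state` filter still has to inspect each row inside that range.
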
    • (Very) Unscientific Benchmarks
    • 9 queries, run in CDH Quickstart VM; CDH 4.2, Impala 1.0; pseudo-cluster + Impala daemons; no other load on system during queries. Hardware: Macbook Pro Retina, mid 2012; Intel i7 2.6GHz quad-core processor; 16GB RAM, 4GB for VM (VMWare 5)
    • Benchmarks (cont'd): Impala vs. Hive performance; "TPC-DS" sample dataset (http://www.tpc.org/tpcds/); queries range from simple projection to multiple joins, aggregation, multiple predicates, and order by
    • Query "A" select c.c_first_name, c.c_last_name from customer c limit 50;
    • Query "B" select    c.c_first_name,    c.c_last_name,    ca.ca_city,    ca.ca_county,    ca.ca_state from customer c    join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk limit 50;
    • Query "C" select    c.c_first_name,    c.c_last_name,    ca.ca_city,    ca.ca_county,    ca.ca_state from customer c    join customer_address ca on c.c_current_addr_sk = ca.ca_address_sk where lower(c.c_last_name) like 'smi%' limit 50;
    • Query "D" select distinct cd_credit_rating from customer_demographics;
    • Query "E" select    cd_credit_rating,    count(*) from customer_demographics group by cd_credit_rating;
    • Query "F" select    c.c_first_name,    c.c_last_name,    ca.ca_city,    ca.ca_county,    ca.ca_state,    cd.cd_marital_status,    cd.cd_education_status from customer c    join customer_address ca        on c.c_current_addr_sk = ca.ca_address_sk    join customer_demographics cd        on c.c_current_cdemo_sk = cd.cd_demo_sk where    lower(c.c_last_name) like 'smi%' and    cd.cd_credit_rating in ('Unknown', 'High Risk') limit 50;
    • Query "G" select    count(c.c_customer_sk) from customer c    join customer_address ca        on c.c_current_addr_sk = ca.ca_address_sk    join customer_demographics cd        on c.c_current_cdemo_sk = cd.cd_demo_sk where    ca.ca_zip in ('20191', '20194') and    cd.cd_credit_rating in ('Unknown', 'High Risk');
    • Query "H" select    c.c_first_name,    c.c_last_name,    ca.ca_city,    ca.ca_county,    ca.ca_state,    cd.cd_marital_status,    cd.cd_education_status from customer c    join customer_address ca        on c.c_current_addr_sk = ca.ca_address_sk    join customer_demographics cd        on c.c_current_cdemo_sk = cd.cd_demo_sk where    ca.ca_zip in ('20191', '20194') and    cd.cd_credit_rating in ('Unknown', 'High Risk') limit 100;
    • Query "TPC-DS" select     i_item_id,   s_state,   avg(ss_quantity) agg1,   avg(ss_list_price) agg2,   avg(ss_coupon_amt) agg3,   avg(ss_sales_price) agg4 from store_sales join date_dim    on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) join item    on (store_sales.ss_item_sk = item.i_item_sk) join customer_demographics    on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk) join store    on (store_sales.ss_store_sk = store.s_store_sk) where   cd_gender = 'M' and   cd_marital_status = 'S' and   cd_education_status = 'College' and   d_year = 2002 and   s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD') group by   i_item_id,   s_state order by   i_item_id,   s_state limit 100;
    • (remember, unscientific...)
      Query     Hive (sec)   # M/R jobs   Impala (sec)   x Hive perf.
      A         13.8         1            0.25           54
      B         30.0         1            0.41           73
      C         33.3         1            0.42           79
      D         23.2         1            0.64           36
      E         21.6         1            0.62           35
      F         59.1         2            1.96           30
      G         78.5         3            1.56           50
      H         59.6         2            1.89           32
      TPC-DS    204.5        6            3.23           63
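
The "x Hive perf." column appears to be the Hive time divided by the Impala time; that reading is an assumption, since the slides do not define it. A quick check on three rows from the table, using a hypothetical `speedup` helper:

```python
# (Hive seconds, Impala seconds) copied from the benchmark table.
timings = {"B": (30.0, 0.41), "F": (59.1, 1.96), "TPC-DS": (204.5, 3.23)}


def speedup(hive_sec, impala_sec):
    # How many times faster Impala finished than Hive on the same query.
    return hive_sec / impala_sec


for query, (hive_sec, impala_sec) in timings.items():
    print(f"{query}: {speedup(hive_sec, impala_sec):.1f}x")
```

Because the table lists rounded timings, ratios recomputed this way can differ slightly from the slide's column for some rows.
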
    • Architecture
    • Two daemons: impalad and statestored; impalad on each HDFS data node; statestored - cluster metadata; Thrift APIs, ODBC, JDBC
    • impalad: query planning; query coordination; query execution
    • [Diagram: impalad (Query Planner, Query Coordinator, Query Executor) alongside an HDFS DataNode and HBase RegionServer]
    • Queries performed in-memory; intermediate data never hits disk!; data streamed to clients. Execution engine: C++; runtime code generation; intrinsics for optimization
    • statestored: cluster membership; acts as a cluster monitor; not a SPOF (single point of failure)
    • Metadata: Impala uses Hive metastore; daemons cache metadata; REFRESH when table definition/data change; create tables in Hive or Impala
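
Why a REFRESH is needed follows from the caching: a daemon that serves metadata from its own copy will not see changes made elsewhere until that copy is reloaded. The toy model below illustrates this; `MetadataCache` and the plain-dict metastore are hypothetical stand-ins, not Impala or Hive APIs, and the added column is an invented change.

```python
class MetadataCache:
    """Toy model of a daemon that caches table metadata from a shared metastore."""

    def __init__(self, metastore):
        self._metastore = metastore   # shared dict: table name -> list of columns
        self._cache = {}

    def columns(self, table):
        # Served from the local copy once loaded; the metastore is not re-checked.
        if table not in self._cache:
            self._cache[table] = list(self._metastore[table])
        return self._cache[table]

    def refresh(self):
        # Analogue of REFRESH: drop cached entries so the next lookup reloads them.
        self._cache.clear()


metastore = {"customer": ["c_first_name", "c_last_name"]}
daemon = MetadataCache(metastore)
daemon.columns("customer")                           # first lookup fills the cache
metastore["customer"].append("c_email_address")      # table changed outside this daemon
stale = daemon.columns("customer")                   # still the old two columns
daemon.refresh()
fresh = daemon.columns("customer")                   # now includes the new column
```
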
    • Next up - how queries work...
    • [Diagram: Client (SQL query); Hive Metastore (table/database metadata); Statestore (cluster monitoring); three impalad nodes, each with Query Planner, Query Coordinator and Query Executor alongside an HDFS DataNode and HBase RegionServer]
    • Read directly from disk: short-circuit reads; bypass HDFS DataNode (avoids overhead of HDFS API)
    • [Diagram: impalad (Query Planner, Query Coordinator, Query Executor) reads directly from the local filesystem, alongside the HDFS DataNode and HBase RegionServer]
    • Current Limitations (as of version 1.0.1): no join order optimization; no custom file formats, SerDes or UDFs; LIMIT required when using ORDER BY; joins limited by aggregate memory of cluster ("put larger table on left")
    • Current Limitations (as of version 1.0.1): no advanced data structures (arrays, maps, json, etc.); only basic DDL (otherwise do in Hive); limited file formats and compression (though probably fine for most people)
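
The advice to "put larger table on left" makes sense if the right-hand side of a join is held in memory while the left-hand side is streamed past it. The slides do not spell this out, so treat it as an assumption; the sketch below is a generic in-memory hash equi-join, not Impala's implementation, and `hash_join` plus the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    # Build phase: the whole right-hand input goes into an in-memory hash table,
    # so its size is what available memory has to accommodate.
    table = defaultdict(list)
    for row in right_rows:
        table[row[right_key]].append(row)
    # Probe phase: left-hand rows are streamed one at a time and never stored.
    for row in left_rows:
        for match in table.get(row[left_key], []):
            yield {**row, **match}


customers = [
    {"c_last_name": "Smith", "c_current_addr_sk": 1},
    {"c_last_name": "Jones", "c_current_addr_sk": 2},
    {"c_last_name": "Smyth", "c_current_addr_sk": 1},
]
addresses = [{"ca_address_sk": 1, "ca_city": "Reston"}]

joined = list(hash_join(customers, addresses, "c_current_addr_sk", "ca_address_sk"))
```

With no join order optimization, the query author decides which input lands on the build side, which is why table order in the FROM clause matters under this model.
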
    • Future... Structure types (structs, arrays, maps, json, etc.); DDL support; additional file formats & compression support; "performance"; join optimization (e.g. cost-based); UDFs (???); YARN integration; fault-tolerance (???)
    • Comparing Impala to Dremel: "Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google." - http://research.google.com/pubs/pub36632.html
    • Comparing Impala to Dremel: Impala = Dremel features circa 2010 + join support, assuming columnar data format (but, Google doesn't stand still...); Dremel is production, mature; basis for Google's BigQuery
    • Comparing Impala to Hive: Hive uses Map/Reduce -> high latency; Impala is an in-memory, low-latency query engine; Impala sacrifices fault tolerance for performance
    • Comparing Impala to Drill. Apache Drill: based on Dremel; in early stages...
    • Comparing Impala to Drill: "Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system which is available as an IaaS service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Currently, Drill is incubating at Apache." - http://incubator.apache.org/drill/drill_overview.html
    • Comparing Impala to Stinger: "The Stinger Initiative is a collection of development threads in the Hive community that will deliver 100X performance improvements as well as SQL compatibility." - http://hortonworks.com/stinger/
    • Comparing Impala to Stinger. Stinger: improve Hive performance (e.g. optimize execution plan); support for analytics (e.g. OVER clause, window functions); TEZ framework to optimize execution; columnar file format (http://hortonworks.com/stinger/)
    • Stinger Phase 1 performance... (Stinger phase 1 is really just Hive 0.11)
    • remember, these numbers are non-scientific micro-benchmarks!
    • Same 9 queries (as w/ Impala), run in HortonWorks Sandbox VM; HortonWorks Data Platform (HDP) 1.3; running pseudo-cluster; no other load on system during queries. Hardware (same as w/ Impala): Macbook Pro Retina, mid 2012; Intel i7 2.6GHz quad-core processor; 16GB RAM, 4GB for VM (VMWare 5)
    • (remember, unscientific...)
      Query     Hive (sec)   # M/R jobs   Stinger Phase 1 (sec)   # M/R jobs   x Hive perf.
      A         13.8         1            10.0                    1            1.4
      B         30.0         1            15.8                    1            1.9
      C         33.3         1            14.1                    1            2.4
      D         23.2         1            18.7                    1            1.2
      E         21.6         1            19.7                    1            1.1
      F         59.1         2            34.3                    1            1.7
      G         78.5         3            35.2                    1            2.2
      H         59.6         2            31.5                    1            1.9
      TPC-DS    204.5        6            37.2                    1            5.5
    • (remember, unscientific...)
      Query     Stinger Phase 1 (sec)   Impala (sec)   x Stinger perf.
      A         10.0                    0.25           39
      B         15.8                    0.41           38
      C         14.1                    0.42           33
      D         18.7                    0.64           29
      E         19.7                    0.62           32
      F         34.3                    1.96           18
      G         35.2                    1.56           23
      H         31.5                    1.89           17
      TPC-DS    37.2                    3.23           12
    • Impala Review: in-memory, distributed SQL query engine; integrates into existing HDFS; not Map/Reduce; focus on performance (native code); interactive data analysis; competition...
    • References
      Google Dremel - http://research.google.com/pubs/pub36632.html
      Apache Drill - http://incubator.apache.org/drill/
      TPC-DS dataset - http://www.tpc.org/tpcds/
      Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/ and http://hortonworks.com/stinger/
      Cloudera Impala resources - http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html
      Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real - http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
    • Photo Attributions
      Impala - http://www.flickr.com/photos/gerardstolk/5897570970/
      Measuring tape - http://www.morguefile.com/archive/display/24850
      Bridge frame - http://www.morguefile.com/archive/display/9699
      Balance - http://www.morguefile.com/archive/display/93433
      * All others are iStockPhoto (I paid for them...)
    • My Info: twitter.com/sleberknight; www.sleberknight.com/blog; scott dot leberknight at gmail dot com