SlideShare a Scribd company logo
Cloudera Impala:
    A Modern SQL Engine for Apache Hadoop
    Ricky Saltzer
    Tools Developer / Impala Contributor




                                           ricky@cloudera.com
                                                 @monstrado

1
Impala Overview: Goals
•   General-purpose SQL query engine:
     •   should work both for analytical and transactional workloads
     •   will support queries that take from milliseconds to hours
•   Runs directly within Hadoop:
     •   reads widely used Hadoop file formats
     •   talks to widely used Hadoop storage managers
     •   runs on same nodes that run Hadoop processes
•   High performance:
     •   C++ instead of Java
     •   runtime code generation
     •   completely new execution engine that doesn't build on MapReduce
User View of Impala: Overview
•   Runs as a distributed service in cluster: one Impala daemon on each
    node with data
•   User submits query via ODBC/Beeswax Thrift API to any of the
    daemons
•   Query is distributed to all nodes with relevant data
•   If any node fails, the query fails
•   Impala uses Hive's metadata interface, connects to Hive's metastore
•   Supported file formats:
      •   text files (GA: with compression, including lzo)
      •   sequence files with snappy/gzip compression
      •   GA: Avro data files
      •   GA: Trevni (columnar format; more on that later)
User View of Impala: SQL
•   SQL support:
      •   patterned after Hive's version of SQL
      •   limited to Select, Project, Join, Union, Subqueries, Aggregation and
          Insert
      •   only equi-joins; no non-equi joins, no cross products
      •   Order By only with Limit
      •   GA: DDL support (CREATE, ALTER)
•   Functional limitations:
      •   no custom UDFs, file formats, SerDes
      •   no beyond SQL (buckets, samples, transforms, arrays, structs, maps,
          xpath, json)
      •   only hash joins; joined table has to fit in memory:
            • beta: of single node
            • GA: aggregate memory of all (executing) nodes
User View of Impala: Apache HBase
•   HBase functionality:
     •   uses Hive's mapping of HBase table into metastore table
     •   predicates on rowkey columns are mapped into start/stop
         row
     •   predicates on other columns are mapped into
         SingleColumnValueFilters
•   HBase functional limitations:
     •   no nested-loop joins
     •   all data stored as text
Impala Architecture
• Two binaries: impalad and statestored
• Impala daemon (impalad)
     •   handles client requests and all internal requests related to
         query execution
     •   exports Thrift services for these two roles
•   State store daemon (statestored)
     •   provides name service and metadata distribution
     •   also exports a Thrift service
Impala Architecture
•   Query execution phases
     •   request arrives via odbc/beeswax Thrift API
     •   planner turns request into collections of plan fragments
     •   coordinator initiates execution on remote impalad's
     •   during execution
           • intermediate results are streamed between executors
           • query results are streamed back to client
           • subject to limitations imposed to blocking operators
             (top-n, aggregation)
Impala Architecture: Planner
• join order = FROM clause order GA target: rudimentary cost-
  based optimizer
• 2-phase planning process:
     •   single-node plan: left-deep tree of plan operators
     •   plan partitioning: partition single-node plan to maximize scan locality,
         minimize data movement
•   plan operators: Scan, HashJoin, HashAggregation, Union, TopN,
    Exchange
•   distributed aggregation: pre-aggregation in all nodes, merge
    aggregation in single node. GA: hash-partitioned aggregation:
    re-partition aggregation input on grouping columns in order to
    reduce per-node memory requirement
Impala Architecture: Planner
•   Example: query with join and aggregation
    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 desc LIMIT 10

       TopN
                                                  Agg
                          TopN
       Agg                                       Hash
       Hash                Agg                   Join
       Join                                HDFS                   HBase
                           Exch                         Exch
                                           Scan                    Scan
    HDFS      HBase     at coordinator   at DataNodes          at region servers
    Scan       Scan
Impala Architecture: Query Execution
Request arrives via odbc/beeswax Thrift API


     SQL App                               Hive
                                                     HDFS NN   Statestore
      ODBC                             Metastore
                         SQL
                       request

     Query Planner                Query Planner           Query Planner
   Query Coordinator             Query Coordinator      Query Coordinator
    Query Executor               Query Executor          Query Executor
   HDFS DN     HBase             HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's

      SQL App                     Hive
                                            HDFS NN   Statestore
       ODBC                   Metastore




     Query Planner       Query Planner           Query Planner
    Query Coordinator   Query Coordinator      Query Coordinator
     Query Executor      Query Executor         Query Executor
    HDFS DN     HBase   HDFS DN     HBase      HDFS DN    HBase
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's Query
results are streamed back to client

     SQL App                          Hive
                                                HDFS NN   Statestore
      ODBC                        Metastore

                   query
                  results

    Query Planner            Query Planner           Query Planner
   Query Coordinator        Query Coordinator      Query Coordinator
    Query Executor           Query Executor         Query Executor
   HDFS DN     HBase        HDFS DN     HBase      HDFS DN    HBase
Impala Architecture
•   Metadata handling:
     • utilizes Hive's metastore
     • caches metadata: no synchronous metastore API calls
       during query execution
     • beta: impalad's read metadata from metastore at startup
     • GA: metadata distribution through statestore
     • post-GA: HCatalog
Impala Architecture
•   Execution engine
     •   written in C++
     •   runtime code generation for "big loops"
          •   example: insert batch of rows into hash table
          •   code generation with llvm
          •   inlines all expressions; no function calls inside loop
     •   all data is copied into canonical in-memory tuple format; all
         fixed-width data is located at fixed offsets
     •   uses intrinsics/special cpu instructions for text parsing,
         crc32 computation, etc.
Impala's Statestore
•   Central system state repository
      •   name service (membership)
      •   GA: metadata
      •   GA: other scheduling-relevant or diagnostic state
•   Soft-state
      •   all impalad's register at startup
      •   impalad's re-register after losing connection
      •   Impala service continues to function in absence of statestored (but:
          with increasingly stale state)
      •   State pushed to impalad's periodically
      •   Repeated failed heartbeat means impalad evicted from cluster view
•   Thrift API for service / subscription registration
Statestore: Why not ZooKeeper?
•   ZK is not a good pub-sub system
     •   Watch API is awkward and requires a lot of client logic
     •   multiple round-trips required to get data for changes to
         node's children
     •   push model is more natural for our use case
•   Don't need all the guarantees ZK provides:
     • serializability
     • persistence
     • prefer to avoid complexity where possible

•   ZK is bad at the things we care about and good at the
    things we don't
Comparing Impala to Google Dremel
•   What is Dremel?
     •   columnar storage for data with nested structures
     •   distributed scalable aggregation on top of that
•   Columnar storage in Hadoop: joint project between Cloudera
    and Twitter
     •   new columnar format, derived from Doug Cutting's Trevni
     •   stores data in appropriate native/binary types
     •   can also store nested structures similar to Dremel's ColumnIO
•   Distributed aggregation: Impala
•   Impala plus columnar format: a superset of the published
    version of Dremel (which didn't support joins)
Comparing Impala to Hive
•   Hive: MapReduce as an execution engine
     •   High latency, low throughput queries
     •   Fault-tolerance model based on MapReduce's on-disk
         checkpointing; materializes all intermediate results
     •   Java runtime allows for easy late-binding of functionality:
         file formats and UDFs.
     •   Extensive layering imposes high runtime overhead
•   Impala:
     •   direct, process-to-process data exchange
     •   no fault tolerance
     •   an execution engine designed for low runtime overhead
Comparing Impala to Hive
•   Impala's performance advantage over Hive: no hard
    numbers, but
     •   Impala can get full disk throughput (~100MB/sec/disk);
         I/O-bound workloads often faster by 3-4x
     •   queries that require multiple map-reduce phases in Hive
         see a higher speedup
     •   queries that run against in-memory data see a higher
         speedup (observed up to 100x)
Impala Roadmap: GA – Q2 2013
•   New data formats:
     • lzo-compressed text
     • Avro
     • columnar

•   Better metadata handling:
     •   automatic metadata distribution through statestore
• Connectivity: jdbc
• Improved query execution: partitioned joins
• Further performance improvements
Impala Roadmap: GA – Q2 2013
•   Guidelines for production deployment:
     •   load balancing across impalad's
     •   resource isolation within MR cluster
•   Additional packages: RHEL 5.7, Ubuntu, Debian
Impala Roadmap: 2013
•   Improved HBase support:
     •   composite keys, Avro data in columns,
         index nested-loop joins,
         INSERT/UPDATE/DELETE
•   Additional SQL:
     • UDFs
     • SQL authorization and DDL
     • ORDER BY without LIMIT
     • window functions
     • support for structured data types
Impala Roadmap: 2013
•   Runtime optimizations:
     • straggler handling
     • join order optimization
     • improved cache management
     • data collocation for improved join performance

•   Resource management:
     • cluster-wide quotas
     • Teradata-style policies ("user x can never have more than 5
       concurrent queries running", etc.)
     • goal: run exploratory and production workloads in same
       cluster, against same data, w/o impacting production jobs
Questions
     •   My email:              Download Impala (beta)
          ricky@cloudera.com     cloudera.com/downloads

                                Learn more about Cloudera
                                 Enterprise RTQ, Powered
                                        by Impala
                                   cloudera.com/impala



     Thank you for attending!

24

More Related Content

What's hot

Ambari: Agent Registration Flow
Ambari: Agent Registration FlowAmbari: Agent Registration Flow
Ambari: Agent Registration FlowHortonworks
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
Avkash Chauhan
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
Nguyen Quang
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
MongoDB
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
Davin Abraham
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Edureka!
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
Isheeta Sanghi
 
Fuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer ShinFuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer Shin
Databricks
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
DataWorks Summit/Hadoop Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
DataWorks Summit
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 
Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption
Cloudera, Inc.
 

What's hot (20)

Ambari: Agent Registration Flow
Ambari: Agent Registration FlowAmbari: Agent Registration Flow
Ambari: Agent Registration Flow
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
Fuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer ShinFuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer Shin
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption Overview of HDFS Transparent Encryption
Overview of HDFS Transparent Encryption
 

Similar to Impala presentation

Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera, Inc.
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
ssusere05ec21
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera, Inc.
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
markgrover
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopYahoo Developer Network
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
Chicago Hadoop Users Group
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera, Inc.
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
Gwen (Chen) Shapira
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
mashoodsyed66
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 

Similar to Impala presentation (20)

Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 

More from trihug

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
trihug
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
trihug
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
trihug
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentry
trihug
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Practical pig
Practical pigPractical pig
Practical pig
trihug
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihug
trihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shain
trihug
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gatestrihug
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gatestrihug
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
trihug
 

More from trihug (11)

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentry
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Practical pig
Practical pigPractical pig
Practical pig
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shain
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gates
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
 

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Impala presentation

  • 1. Cloudera Impala: A Modern SQL Engine for Apache Hadoop Ricky Saltzer Tools Developer / Impala Contributor ricky@cloudera.com @monstrado 1
  • 2. Impala Overview: Goals • General-purpose SQL query engine: • should work both for analytical and transactional workloads • will support queries that take from milliseconds to hours • Runs directly within Hadoop: • reads widely used Hadoop file formats • talks to widely used Hadoop storage managers • runs on same nodes that run Hadoop processes • High performance: • C++ instead of Java • runtime code generation • completely new execution engine that doesn't build on MapReduce
  • 3. User View of Impala: Overview • Runs as a distributed service in cluster: one Impala daemon on each node with data • User submits query via ODBC/Beeswax Thrift API to any of the daemons • Query is distributed to all nodes with relevant data • If any node fails, the query fails • Impala uses Hive's metadata interface, connects to Hive's metastore • Supported file formats: • text files (GA: with compression, including lzo) • sequence files with snappy/gzip compression • GA: Avro data files • GA: Trevni (columnar format; more on that later)
  • 4. User View of Impala: SQL • SQL support: • patterned after Hive's version of SQL • limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert • only equi-joins; no non-equi joins, no cross products • Order By only with Limit • GA: DDL support (CREATE, ALTER) • Functional limitations: • no custom UDFs, file formats, SerDes • no beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) • only hash joins; joined table has to fit in memory: • beta: of single node • GA: aggregate memory of all (executing) nodes
  • 5. User View of Impala: Apache HBase • HBase functionality: • uses Hive's mapping of HBase table into metastore table • predicates on rowkey columns are mapped into start/stop row • predicates on other columns are mapped into SingleColumnValueFilters • HBase functional limitations: • no nested-loop joins • all data stored as text
  • 6. Impala Architecture • Two binaries: impalad and statestored • Impala daemon (impalad) • handles client requests and all internal requests related to query execution • exports Thrift services for these two roles • State store daemon (statestored) • provides name service and metadata distribution • also exports a Thrift service
  • 7. Impala Architecture • Query execution phases • request arrives via odbc/beeswax Thrift API • planner turns request into collections of plan fragments • coordinator initiates execution on remote impalad's • during execution • intermediate results are streamed between executors • query results are streamed back to client • subject to limitations imposed to blocking operators (top-n, aggregation)
  • 8. Impala Architecture: Planner • join order = FROM clause order GA target: rudimentary cost- based optimizer • 2-phase planning process: • single-node plan: left-deep tree of plan operators • plan partitioning: partition single-node plan to maximize scan locality, minimize data movement • plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange • distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node. GA: hash-partitioned aggregation: re-partition aggregation input on grouping columns in order to reduce per-node memory requirement
  • 9. Impala Architecture: Planner • Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg TopN Agg Hash Hash Agg Join Join HDFS HBase Exch Exch Scan Scan HDFS HBase at coordinator at DataNodes at region servers Scan Scan
  • 10. Impala Architecture: Query Execution Request arrives via odbc/beeswax Thrift API SQL App Hive HDFS NN Statestore ODBC Metastore SQL request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 11. Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalad's SQL App Hive HDFS NN Statestore ODBC Metastore Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 12. Impala Architecture: Query Execution Intermediate results are streamed between impalad's Query results are streamed back to client SQL App Hive HDFS NN Statestore ODBC Metastore query results Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 13. Impala Architecture • Metadata handling: • utilizes Hive's metastore • caches metadata: no synchronous metastore API calls during query execution • beta: impalad's read metadata from metastore at startup • GA: metadata distribution through statestore • post-GA: HCatalog
  • 14. Impala Architecture • Execution engine • written in C++ • runtime code generation for "big loops" • example: insert batch of rows into hash table • code generation with llvm • inlines all expressions; no function calls inside loop • all data is copied into canonical in-memory tuple format; all fixed-width data is located at fixed offsets • uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc.
  • 15. Impala's Statestore • Central system state repository • name service (membership) • GA: metadata • GA: other scheduling-relevant or diagnostic state • Soft-state • all impalad's register at startup • impalad's re-register after losing connection • Impala service continues to function in absence of statestored (but: with increasingly stale state) • State pushed to impalad's periodically • Repeated failed heartbeat means impalad evicted from cluster view • Thrift API for service / subscription registration
  • 16. Statestore: Why not ZooKeeper? • ZK is not a good pub-sub system • Watch API is awkward and requires a lot of client logic • multiple round-trips required to get data for changes to node's children • push model is more natural for our use case • Don't need all the guarantees ZK provides: • serializability • persistence • prefer to avoid complexity where possible • ZK is bad at the things we care about and good at the things we don't
  • 17. Comparing Impala to Google Dremel • What is Dremel? • columnar storage for data with nested structures • distributed scalable aggregation on top of that • Columnar storage in Hadoop: joint project between Cloudera and Twitter • new columnar format, derived from Doug Cutting's Trevni • stores data in appropriate native/binary types • can also store nested structures similar to Dremel's ColumnIO • Distributed aggregation: Impala • Impala plus columnar format: a superset of the published version of Dremel (which didn't support joins)
  • 18. Comparing Impala to Hive • Hive: MapReduce as an execution engine • High latency, low throughput queries • Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results • Java runtime allows for easy late-binding of functionality: file formats and UDFs. • Extensive layering imposes high runtime overhead • Impala: • direct, process-to-process data exchange • no fault tolerance • an execution engine designed for low runtime overhead
  • 19. Comparing Impala to Hive • Impala's performance advantage over Hive: no hard numbers, but • Impala can get full disk throughput (~100MB/sec/disk); I/O-bound workloads often faster by 3-4x • queries that require multiple map-reduce phases in Hive see a higher speedup • queries that run against in-memory data see a higher speedup (observed up to 100x)
  • 20. Impala Roadmap: GA – Q2 2013 • New data formats: • lzo-compressed text • Avro • columnar • Better metadata handling: • automatic metadata distribution through statestore • Connectivity: jdbc • Improved query execution: partitioned joins • Further performance improvements
  • 21. Impala Roadmap: GA – Q2 2013 • Guidelines for production deployment: • load balancing across impalad's • resource isolation within MR cluster • Additional packages: RHEL 5.7, Ubuntu, Debian
  • 22. Impala Roadmap: 2013 • Improved HBase support: • composite keys, Avro data in columns, index nested-loop joins, INSERT/UPDATE/DELETE • Additional SQL: • UDFs • SQL authorization and DDL • ORDER BY without LIMIT • window functions • support for structured data types
  • 23. Impala Roadmap: 2013 • Runtime optimizations: • straggler handling • join order optimization • improved cache management • data collocation for improved join performance • Resource management: • cluster-wide quotas • Teradata-style policies ("user x can never have more than 5 concurrent queries running", etc.) • goal: run exploratory and production workloads in same cluster, against same data, w/o impacting production jobs
  • 24. Questions • My email: Download Impala (beta) ricky@cloudera.com cloudera.com/downloads Learn more about Cloudera Enterprise RTQ, Powered by Impala cloudera.com/impala Thank you for attending! 24