Impala presentation ahad rana

Big Data Camp LA 2014, Impala By Ahad Rana of Factual

    Presentation Transcript

    • Impala A high level overview. By Ahad Rana Director of Engineering @ Factual
    • What is Impala? ● Real-time SQL Query Engine from Cloudera. ● Built on top of Hadoop. ● Open Source (Apache License).
    • Hadoop Ecosystem Keeps Growing/Maturing ● High Availability NameNodes ● Quorum based NameNode Journaling. ● Zookeeper integration. ● YARN support for Linux Containers/ Multi-Tenancy. ● HDFS optimizations - direct reads, mmap reads. ● Kerberos based security throughout, ACLs (in CDH5) ● HBase improvements. ● New file formats: Parquet (based on Google Dremel).
    • MapReduce: Not the Solution for Everything ● Batch oriented, high latency. ● Joins are complex. ● DSLs are complex. ● Only really accessible to developers.
    • Needed: Better Accessibility to “Big Data” ● SQL is still the lingua franca of the Business Analytics community. ● Real-time / faster query response times can have a dramatic effect on productivity. ● The previous generation of enabling technologies (like Hive) is batch oriented and too slow.
    • The Solution: MPP Query Engines on Hadoop ● That is, MPP SQL Query Engines on Hadoop. ● Proprietary vendors have been in this game for a while and have been charging a lot for custom hardware and software. ● New engines, like Impala, are game changers: ○ Built from the ground up to integrate with Hadoop - Your data is already in Hadoop - co-locate the queries next to your data! ○ Read data from existing Hadoop File Formats: Text, LZO-P, Sequence File, AVRO, etc. ○ No time-consuming ETL cycle or indexing. ○ Open Source, non-proprietary, and they run on the same commodity hardware! ○ Bypass MapReduce framework, tuned for speed. ○ All support a subset of SQL-92 similar to Hive's.
    • Benchmarks: Impala vs. Hive 0.12 Source: Cloudera ● Interactive - 6x-68x faster. ● Reports - 8x-29x faster. ● Analytics - 10x - 69x faster.
    • Impala - One of Many in a Crowded Field ● Many parallel efforts to address this need: ○ Presto (Facebook). ○ Shark (AmpLab). ○ Hive on Tez (Hortonworks). ○ Impala (Cloudera). ○ Redshift (Amazon/proprietary). ○ etc...
    • Impala vs. Other MPP Engines ● 21 Nodes (2 CPU - 12 Core, 12 disks, 384GB RAM) ● TPC-DS derived benchmark scaled to 15TB. Source: Cloudera
    • Impala Technology Highlights ● Designed with performance in mind - backend is C++. ● Operates on native types in memory wherever possible. ● Uses a JIT compiler (LLVM) to generate optimized expression evaluation code (minimal branching, no v-table calls) on the fly. ● Asynchronously streams results between the various stages of the pipeline. ● Keeps all intermediate results in memory. ● Parallelizes scans and locates them near the data (DataNodes), and reads HDFS blocks directly from disk. ● Supports Broadcast Joins (the Fact table is scanned in a partitioned manner, smaller Dimension table(s) are broadcast to all nodes). ● Also supports Partitioned Joins (both Fact and Dimension tables are partitioned on the join columns / expressions). ● Supports a variety of file formats (Text, compressed Text, Sequence File, RCFile, AVRO, HFile, Parquet) AND scans against HBase.
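The two join strategies named on this slide can be sketched in a few lines of Python. This is only an in-process illustration, not Impala's implementation: the function names, the sample rows, and the use of Python's `hash` as the shuffle function are all assumptions made for the example.

```python
from collections import defaultdict

def hash_join(fact_rows, dim_rows):
    """Join (key, value) fact rows with (key, name) dimension rows on key."""
    lookup = dict(dim_rows)
    return [(k, v, lookup[k]) for k, v in fact_rows if k in lookup]

def broadcast_join(fact_partitions, dim_rows):
    """Each node keeps its own fact partition; the whole (small)
    dimension table is copied to every node."""
    out = []
    for part in fact_partitions:  # one iteration stands in for one node
        out.extend(hash_join(part, dim_rows))
    return out

def partitioned_join(fact_rows, dim_rows, num_nodes):
    """Both tables are shuffled by hash(key) so that rows with the
    same join key end up on the same node."""
    fact_parts = defaultdict(list)
    dim_parts = defaultdict(list)
    for row in fact_rows:
        fact_parts[hash(row[0]) % num_nodes].append(row)
    for row in dim_rows:
        dim_parts[hash(row[0]) % num_nodes].append(row)
    out = []
    for node in range(num_nodes):
        out.extend(hash_join(fact_parts[node], dim_parts[node]))
    return out

# Hypothetical sample data: (store_id, amount) facts, (store_id, city) dimension.
fact = [(1, 10.0), (2, 5.5), (1, 7.25), (3, 1.0)]
dim = [(1, "LA"), (2, "SF")]

broadcast_result = broadcast_join([fact[:2], fact[2:]], dim)
partitioned_result = partitioned_join(fact, dim, num_nodes=2)
```

Both strategies return the same rows; they differ in what moves over the network. Broadcast copies the small table everywhere, while the partitioned join moves both tables once.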
    • Impala Deployment Environment ● Three binaries ○ impalad - one on each Hadoop Datanode, handles client requests, runs query execution pipelines. ○ statestored - single instance, name service (like Zookeeper), used to coordinate catalog changes, impalad availability etc. ○ catalogd - single instance, talks to Hive metastore, caches table metadata, distributes metadata to impalads, broadcasts metadata changes via statestored.
    • Impala - Life of a Query 1. Query is routed to an arbitrary impalad in the cluster. 2. Java code running in impalad creates a parse tree from the SQL statement. 3. Planner evaluates the parse tree and creates the query Plan Tree. 4. Data Sources and Sinks (Tables) are validated against the metadata catalog (via catalogd). 5. Planner uses table metadata to make optimal placement decisions, establish the sharding strategy and memory requirements per query stage, and construct Plan Fragments. 6. catalogd caches HDFS block locations for every HDFS file, and Planner attaches block locations to Query Plan metadata. 7. Planner converts Plan Fragments into Thrift objects, and passes them on to the Query Coordinator (C++). 8. Coordinator runs root Fragment locally and distributes remaining Fragments to other backends. 9. Coordinator monitors Fragment execution until completion, cancellation or failure. 10. Fragments stream data to other Fragments or to the primary Fragment (User or Table).
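Steps 8 to 10 above can be pictured with a toy aggregation query. The sketch below is a heavily simplified, single-process illustration: the node names, the data, and the functions `scan_fragment`, `root_fragment` and `coordinator` are hypothetical, and there is no Thrift, networking, monitoring or failure handling.

```python
from collections import Counter

# Hypothetical data blocks, keyed by the node that holds them locally.
node_blocks = {
    "node1": ["LA", "SF", "LA"],
    "node2": ["NY", "LA"],
    "node3": ["SF"],
}

def scan_fragment(rows):
    """Stands in for a fragment on a backend: scan local rows
    and return a partial COUNT(*) ... GROUP BY city."""
    return Counter(rows)

def root_fragment(partials):
    """Stands in for the root fragment on the coordinator:
    merge the partial aggregates as they arrive."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def coordinator(node_blocks):
    # A generator, so partial results are consumed one at a time,
    # loosely mirroring fragments streaming into the root fragment.
    partials = (scan_fragment(rows) for rows in node_blocks.values())
    return root_fragment(partials)

result = coordinator(node_blocks)
```

Here `result` maps each city to its total count across all three nodes.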
    • Life of a Query (Planning)
    • Life of a Query (Execution)
    • Simple Query Plan (Broadcast Join)
    • Simple Query Plan (Shuffle)
    • Simple Query Plan (Aggregation)
    • Performance Checklist ● Fact to Fact table joins are expensive! ● Scalar data types (INT, BIGINT) are better than generic ones (String). ● Coerce types in metadata during load. ● Push Filter Predicates downstream as much as possible (Avoid upstream IO). ● Use Impala’s Partitioning support to pre-partition tables.
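To see why pre-partitioning and filtering early matter, here is a minimal sketch that counts how many rows a scan has to read with and without a predicate on the partition column. The table layout, the dates and the `scan` function are assumptions made for the example, not Impala APIs.

```python
# Hypothetical table pre-partitioned by day; each entry stands in
# for one partition directory holding (user, clicks) rows.
partitions = {
    "2014-06-01": [("a", 3), ("b", 4)],
    "2014-06-02": [("a", 1)],
    "2014-06-03": [("c", 9), ("a", 2)],
}

def scan(partitions, day=None, user=None):
    """Return (matching rows, number of rows actually read)."""
    rows_read = 0
    matches = []
    for part_day, rows in partitions.items():
        if day is not None and part_day != day:
            continue  # partition pruning: skipped without reading any rows
        for row in rows:
            rows_read += 1
            if user is None or row[0] == user:
                matches.append(row)
    return matches, rows_read

# With a predicate on the partition column only one partition is read.
pruned, pruned_read = scan(partitions, day="2014-06-03", user="a")
# Without it, every partition has to be scanned.
full, full_read = scan(partitions, user="a")
```

In this toy table the pruned scan reads 2 rows and the unpruned scan reads all 5.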
    • Compression ● Compression means less network / disk IO. ● Always use compression. ● Choose codec wisely - make the appropriate size vs. CPU cost tradeoff. ● Snappy - Low CPU requirements, Low compression ratio. ● GZip - High CPU, High compression ratio. ● LZO - Somewhere in the middle. ● Snappy is generally a good default choice.
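The size vs. CPU tradeoff can be measured directly. Snappy and LZO are not in the Python standard library, so this sketch uses `zlib` (the DEFLATE algorithm behind GZip) at its fastest and slowest levels as stand-ins for a cheap and an expensive codec. The sample data is made up, and real ratios and timings depend heavily on your data.

```python
import time
import zlib

# Hypothetical, highly repetitive CSV-like data.
data = b"2014-06-01,los angeles,impala,42\n" * 50_000

def measure(level):
    """Compress `data` at the given zlib level; return (size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed

fast_size, fast_time = measure(1)    # stand-in for a low-CPU codec
small_size, small_time = measure(9)  # stand-in for a high-ratio codec

print(f"raw={len(data)} level1={fast_size} ({fast_time:.4f}s) "
      f"level9={small_size} ({small_time:.4f}s)")
```

Running the same measurement on a sample of your own tables is a reasonable way to pick a codec.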
    • File Formats ● Use a Columnar format (like Parquet) if possible. ● Columnar formats give you: ○ Less IO for Impala - Only read relevant column data. ○ Better compression, more efficient encoding. ● Parquet is the new preferred file format for Impala.
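The "less IO" point can be illustrated by storing the same rows in both layouts and counting how many values a single-column aggregate has to touch. The rows and function names below are hypothetical, and the in-memory lists only stand in for on-disk layout.

```python
# Hypothetical table in a row-oriented layout.
rows = [
    {"id": 1, "city": "LA", "amount": 10.0},
    {"id": 2, "city": "SF", "amount": 5.5},
    {"id": 3, "city": "LA", "amount": 7.25},
]

# The same table in a columnar layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

def sum_amount_row_layout(rows):
    """SUM(amount) over row storage: every field of every row is read."""
    values_touched = 0
    total = 0.0
    for r in rows:
        values_touched += len(r)
        total += r["amount"]
    return total, values_touched

def sum_amount_columnar(columns):
    """SUM(amount) over columnar storage: only one column is read."""
    col = columns["amount"]
    return sum(col), len(col)

row_total, row_touched = sum_amount_row_layout(rows)
col_total, col_touched = sum_amount_columnar(columns)
```

Both give the same total, but the columnar version touches 3 values instead of 9. Keeping similar values together is also what makes the better compression and encoding possible.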
    • Parquet Format Detail ● Row Group - group of rows in columnar format ○ A Row Group’s worth of data is buffered in RAM when writing. ○ Impala prefers large row groups (1GB). ○ Potentially multiple row groups per split (when reading). ● Column Chunk / Page(s) ○ Each column’s data is stored in a separate chunk. ○ Each chunk is further divided into Page(s) worth of data. ○ Each page is ~1MB in size. ○ Compression is applied at Page level. ○ Smallest level of granularity for read is the Page. ● Good for writing data in bulk, not good for small writes (Insert Values … ). ● Supports nested columns (see Google Dremel Paper). ● Map-Reduce InputFormat can materialize data back as Thrift objects.
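The chunk and page structure described above can be mimicked with a toy encoder. This is not the Parquet format: the 4-value page size, the comma-separated encoding and the `zlib` codec are assumptions chosen to keep the example small. It shows only that compression is applied per page and that a read decompresses just the page it needs.

```python
import zlib

PAGE_SIZE = 4  # values per page (real pages are ~1MB, per the slide)

def write_column_chunk(values):
    """Split one column's values into pages and compress each page."""
    pages = []
    for i in range(0, len(values), PAGE_SIZE):
        raw = ",".join(str(v) for v in values[i:i + PAGE_SIZE]).encode()
        pages.append(zlib.compress(raw))
    return pages

def read_value(pages, index):
    """Read one value by decompressing only the page that holds it."""
    page_no, offset = divmod(index, PAGE_SIZE)
    page = zlib.decompress(pages[page_no]).decode().split(",")
    return int(page[offset])

# Hypothetical integer column with ten values: 100 .. 109.
chunk = write_column_chunk(list(range(100, 110)))
```

Ten values at four per page produce three pages, and `read_value(chunk, 5)` only decompresses the second one.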
    • HBase Integration ● Impala currently only uses HBase RPCs to scan HBase tables. ● Map HBase binary key/columns to types using Hive/Impala metadata. ● Best to only use the HBase key column in predicates, otherwise you will have to scan the entire table. ● Best to do large fact table scans in Impala first and then do a late stage join to a limited number of HBase rows. ● HBase now has snapshotting support, so perhaps bulk scans are possible in the future?
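The advice to filter on the key column follows from HBase keeping rows sorted by row key. The sketch below models a table as a sorted list and compares a key-range lookup with a filter on a non-key column. The row keys, values and function names are hypothetical, and the inclusive stop key is a simplification for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table, kept sorted by row key as HBase does.
table = sorted([
    ("user#001", "LA"),
    ("user#002", "SF"),
    ("user#003", "LA"),
    ("user#004", "NY"),
])
keys = [key for key, _ in table]

def key_range_scan(start_key, stop_key):
    """Predicate on the row key: binary search to the range, read only it.
    Returns (rows, rows_read)."""
    lo = bisect_left(keys, start_key)
    hi = bisect_right(keys, stop_key)  # inclusive stop key, for simplicity
    return table[lo:hi], hi - lo

def value_filter_scan(city):
    """Predicate on a non-key column: every row has to be read."""
    matches = [row for row in table if row[1] == city]
    return matches, len(table)

by_key, key_rows_read = key_range_scan("user#002", "user#003")
by_value, value_rows_read = value_filter_scan("LA")
```

Both queries return two rows, but the key-range scan reads 2 rows while the value filter reads all 4.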
    • Impala Codebase Commentary ● Good use of best practices throughout codebase ○ Robust Comments ○ Modern C++ standards (STL, Boost, aggressive smart pointer utilization to avoid leaks) ○ Good instrumentation framework throughout codebase. ○ Smart memory management. ○ Thrift RPC and Thrift data structures for exchange of data / messages between services and layers. ○ Smart use of Java for front-end query processing/parsing, DDL, DML integration with Hive etc, and then transfer of Query Plan from Java layer to C++ via Thrift objects. ○ Very efficient use of memory in pipeline ■ Tuple (collection of typed column data, allocated using pooled memory) ■ Awareness of various distinct tuples flowing through various stages of pipeline means they can generate super efficient just-in-time code to read/write these tuples.
    • Potential Gotchas ● High, potentially unconstrained demand on Hadoop nodes ○ Containerization (in YARN) helps, but still, nodes can consume a lot of RAM and CPU. ● Allocate enough memory to support your needs, avoid swap! ● Potentially large impact on shared network IO when running inefficient queries that broadcast lots of data upstream. ● No quota management or per-user resource tracking.
    • ahad@factual.com P.S. Factual is always looking for good engineers! Thank you.