  • 1. HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS
    How HDFS is evolving to meet new needs
  • 2.
    ✛ Aaron T. Myers
      > Email: atm@cloudera.com, atm@apache.org
      > Twitter: @atm
    ✛ Hadoop PMC Member / Committer at ASF
    ✛ Software Engineer at Cloudera
    ✛ Primarily work on HDFS and Hadoop Security
  • 3.
    ✛ HDFS introduction/architecture
    ✛ Impala introduction/architecture
    ✛ New requirements for HDFS
      > Block replica / disk placement info
      > Correlated file/block replica placement
      > In-memory caching for hot files
      > Short-circuit reads, reduced copy overhead
  • 5.
    ✛ HDFS is the Hadoop Distributed File System
    ✛ Append-only distributed file system
    ✛ Intended to store many very large files
      > Block sizes usually 64MB – 512MB
      > Files composed of several blocks
    ✛ Write a file once during ingest
    ✛ Read a file many times for analysis (see the sketch after this slide)
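The write-once, block-oriented model shows up directly in the stock HDFS Java API. A minimal sketch, assuming a hypothetical path, a 128MB block size, and the default 3x replication (nothing here is taken from the deck itself):

```java
// Minimal sketch of the HDFS write-once ingest pattern with an explicit
// per-file block size and replication factor, using the standard FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/events/part-00000");   // hypothetical path
    long blockSize = 128L * 1024 * 1024;                // 128MB blocks
    short replication = 3;                              // default 3x replication

    // Write the file once during ingest...
    try (FSDataOutputStream out =
             fs.create(file, true /* overwrite */, 4096, replication, blockSize)) {
      out.writeBytes("example record\n");
    }
    // ...then read it many times for analysis (append-only, no in-place updates).
  }
}
```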
  • 6.
    ✛ HDFS was originally designed specifically for Map/Reduce
      > Each MR task typically operates on one HDFS block
      > MR tasks run co-located on HDFS nodes
      > Data locality: move the code to the data
    ✛ Each block of each file is replicated 3 times
      > For reliability in the face of machine and drive failures
      > Provides a few options for data locality during processing
  • 8.
    ✛ Each cluster has…
      > A single NameNode
        ∗ Stores file system metadata
        ∗ Stores the “Block ID” -> DataNode mapping
      > Many DataNodes
        ∗ Store the actual file data
    ✛ Clients of HDFS…
      > Communicate with the NameNode to browse the file system and get block locations for files (see the sketch after this slide)
      > Communicate directly with DataNodes to read/write files
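To make the client / NameNode / DataNode split concrete, here is a hedged sketch using the standard FileSystem API: the client asks the NameNode which hosts hold each block of a (hypothetical) file, and would then read the block data directly from those DataNodes:

```java
// Illustrative sketch: an HDFS client asks the NameNode where each block of a
// file lives; actual data transfer then goes straight to those DataNodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/events/part-00000")); // hypothetical

    // NameNode lookup: block offsets -> hosts holding a replica of each block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d len=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}
```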
  • 11.
    ✛ General-purpose SQL query engine:
      > Should work for both analytical and transactional workloads
      > Will support queries that take from milliseconds to hours
    ✛ Runs directly within Hadoop:
      > Reads widely used Hadoop file formats
      > Talks directly to HDFS (or HBase)
      > Runs on the same nodes that run Hadoop processes
  • 12.
    ✛ Uses HQL as its query language
      > Hive Query Language – what Apache Hive uses
      > Very close to complete SQL-92 compliance
    ✛ Extremely high performance
      > C++ instead of Java
      > Runtime code generation
      > Completely new execution engine that doesn't build on MapReduce
  • 13.
    ✛ Runs as a distributed service in the cluster
      > One Impala daemon on each node with data
      > Doesn’t use Hadoop Map/Reduce at all
    ✛ The user submits a query via ODBC/JDBC to any of the daemons (see the sketch after this slide)
    ✛ The query is distributed to all nodes with relevant data
    ✛ If any node fails, the query fails and is re-executed
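A hedged sketch of the client side: Impala exposes a HiveServer2-compatible endpoint, so the Hive JDBC driver is commonly used to submit a query to any impalad. The host name, the port (21050 is the usual impalad JDBC port), the auth setting, and the table are assumptions, not taken from the deck:

```java
// Hedged sketch: submitting a query to a single impalad over JDBC via the
// Hive JDBC driver. Host, port, auth setting, and table name are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT count(*) FROM web_logs")) {
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}
```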
  • 15.
    ✛ Two daemons: impalad and statestored
    ✛ Impala daemon (impalad)
      > Handles client requests
      > Handles all internal requests related to query execution
    ✛ State store daemon (statestored)
      > Provides a name service of cluster members
      > Distributes Hive table metadata
  • 16.
    ✛ Query execution phases
      > Request arrives at an impalad via ODBC/JDBC
      > The planner turns the request into a collection of plan fragments
        ∗ Plan fragments may be executed in parallel
      > The coordinator impalad initiates execution of plan fragments on remote impalad daemons
    ✛ During execution
      > Intermediate results are streamed between executors
      > Query results are streamed back to the client
  • 17.
    ✛ During execution, impalad daemons connect directly to HDFS/HBase to read/write data
  • 19.
    ✛ Impala is concerned with very low-latency queries
      > Needs to make the best use of available aggregate disk throughput
    ✛ Impala’s more efficient execution engine is far more likely to be I/O bound than Hive’s
      > Implies that for many queries the best performance improvement will come from improved I/O
    ✛ Impala query execution has no shuffle phase
      > Implies that joins between tables do not necessitate all-to-all communication
  • 20.
    ✛ Expose HDFS block replica disk location information
    ✛ Allow for explicitly co-located block replicas across files
    ✛ In-memory caching of hot tables/files
    ✛ Reduced copies during reading, short-circuit reads
  • 21.
    ✛ The problem: the NameNode knows which DataNodes blocks are on, not which disks
      > Only the DNs are aware of the block replica -> disk mapping
    ✛ Impala wants to make sure that separate plan fragments operate on data on separate disks
      > Maximize aggregate available disk throughput
  • 22.
    ✛ The solution: add a new RPC call to the DataNodes to expose which volumes (disks) replicas are stored on (see the sketch after this slide)
    ✛ During the query planning phase, impalad…
      > Determines all DNs the query’s data is stored on
      > Queries those DNs to get volume information
    ✛ During the query execution phase, impalad…
      > Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
    ✛ With this additional info, Impala is able to ensure disk reads are large and seeks are minimized
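A hedged sketch of what that extra RPC looks like from a client, based on the getFileBlockStorageLocations API added to DistributedFileSystem for this use case (HDFS-3672, Hadoop 2.x; it requires dfs.datanode.hdfs-blocks-metadata.enabled=true on the DataNodes and was removed in later releases). The path is hypothetical:

```java
// Hedged sketch of the disk-location RPC: first the usual NameNode block lookup,
// then one extra call that asks the DataNodes which volume (disk) holds each replica.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.BlockStorageLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.VolumeId;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class VolumeLocationsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    FileStatus status = dfs.getFileStatus(new Path("/data/events/part-00000")); // hypothetical
    BlockLocation[] blocks = dfs.getFileBlockLocations(status, 0, status.getLen());

    // Extra RPC to the DataNodes: per-replica volume (disk) identifiers.
    BlockStorageLocation[] storage =
        dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
    for (BlockStorageLocation loc : storage) {
      VolumeId[] volumes = loc.getVolumeIds();   // parallel to loc.getHosts()
      System.out.println(loc.getOffset() + " -> " + Arrays.toString(volumes));
    }
  }
}
```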
  • 23.
    ✛ The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN
    ✛ Local reads at full disk throughput: ~800 MB/s
    ✛ Remote reads over a 1 gigabit network: ~128 MB/s
    ✛ Ideally all reads should be done from local disks
  • 24.
    ✛ The solution: add a feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
    ✛ Gives Impala more control over how data is laid out
    ✛ Can ensure that tables/files which are joined frequently have their data co-located (see the sketch after this slide)
    ✛ Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
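The deck does not show the placement API itself; as a rough illustration only, the standard block-location call can at least verify whether two frequently joined (hypothetical) files ended up with replicas on the same set of nodes:

```java
// Hedged illustration: *check* whether two files' replicas landed on the same
// DataNodes. This uses only the standard API; it is not the placement feature itself.
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoLocationCheckSketch {
  static Set<String> hostsOf(FileSystem fs, Path p) throws Exception {
    FileStatus st = fs.getFileStatus(p);
    Set<String> hosts = new HashSet<>();
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      for (String h : b.getHosts()) {
        hosts.add(h);
      }
    }
    return hosts;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Set<String> a = hostsOf(fs, new Path("/warehouse/orders/part-0"));     // hypothetical
    Set<String> b = hostsOf(fs, new Path("/warehouse/lineitem/part-0"));   // hypothetical
    System.out.println("co-located: " + a.equals(b));
  }
}
```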
  • 25.
    ✛ The problem: Impala queries are often bottlenecked at maximum disk throughput
    ✛ Memory throughput is much higher
    ✛ Memory is getting cheaper/denser
      > Routinely seeing DNs with 48GB-96GB of RAM
    ✛ We’ve observed substantial Impala speedups when file data ends up in the OS buffer cache
  • 26.
    ✛ The solution: add a facility to HDFS to explicitly read specific HDFS files into main memory (see the sketch after this slide)
    ✛ Allows Impala to read data at full memory bandwidth (several GB/s)
    ✛ Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory
      > Don’t want an MR job to inadvertently evict data from memory via the OS buffer cache
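This facility shipped as HDFS centralized cache management (HDFS-4949, Hadoop 2.3+). A hedged sketch of pinning a hot table into DataNode memory from Java; the pool name and path are made up:

```java
// Hedged sketch: pin a hot table's files into DataNode memory via HDFS
// centralized cache management, so reads are served at memory speed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCacheSketch {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // Operator-controlled pool, so an MR job can't evict the hot data by accident.
    dfs.addCachePool(new CachePoolInfo("impala-hot-tables"));

    // Ask the DataNodes to keep every replica under this directory in memory.
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/web_logs"))   // hypothetical table directory
            .setPool("impala-hot-tables")
            .build());
    System.out.println("cache directive id: " + directiveId);
  }
}
```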
  • 27.
    ✛ The problem: a typical HDFS read must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
    ✛ All of these extraneous copies use unnecessary memory and CPU resources
  • 28.
    ✛ The solution: allow reads to be performed directly on local files, using direct buffers (see the sketch after this slide)
    ✛ Added a facility to HDFS that allows reads to completely bypass the DataNode when the client is co-located with the block replica files
    ✛ Added an API in libhdfs to supply direct byte buffers to HDFS read operations, reducing the number of copies to the bare minimum
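A hedged sketch of both pieces from a Java client: the standard short-circuit read configuration keys, plus the zero-copy read API (the Java analogue of the libhdfs direct-buffer API mentioned on the slide). The socket path and file path are assumptions, and the socket path must match the DataNode configuration:

```java
// Hedged sketch: enable short-circuit local reads (bypass the DataNode for local
// replicas) and read through the zero-copy API to avoid extra buffer copies.
import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ShortCircuitReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side short-circuit read settings; the socket path is an assumption.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

    FileSystem fs = FileSystem.get(conf);
    ElasticByteBufferPool pool = new ElasticByteBufferPool();
    try (FSDataInputStream in = fs.open(new Path("/data/events/part-00000"))) { // hypothetical
      // Zero-copy read: the returned buffer may map the local block file directly.
      ByteBuffer buf = in.read(pool, 1024 * 1024, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
      if (buf != null) {
        System.out.println("read " + buf.remaining() + " bytes without extra copies");
        in.releaseBuffer(buf);
      }
    }
  }
}
```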
  • 29.
    ✛ For simpler queries (no joins, tpch-q*) on large datasets (1TB)
      > 5-10x faster than Hive
    ✛ For complex queries on large datasets (1TB)
      > 20-50x faster than Hive
    ✛ For complex queries out of buffer cache (300GB)
      > 25-150x faster than Hive
    ✛ Due to Impala’s improved execution engine, low startup time, improved I/O, etc.