(Aaron myers) hdfs impala


Published in: Technology

  1. 1. HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS How HDFS is evolving to meet new needs
  2. 2. ✛  Aaron T. Myers >  Email: atm@cloudera.com, atm@apache.org >  Twitter: @atm ✛  Hadoop PMC Member / Committer at ASF ✛  Software Engineer at Cloudera ✛  Primarily work on HDFS and Hadoop Security 2
  3. 3. ✛  HDFS introduction/architecture ✛  Impala introduction/architecture ✛  New requirements for HDFS >  Block replica / disk placement info >  Correlated file/block replica placement >  In-memory caching for hot files >  Short-circuit reads, reduced copy overhead 3
  5. 5. ✛  HDFS is the Hadoop Distributed File System ✛  Append-only distributed file system ✛  Intended to store many very large files >  Block sizes usually 64MB – 512MB >  Files composed of several blocks ✛  Write a file once during ingest ✛  Read a file many times for analysis 5
  6. 6. ✛  HDFS was originally designed specifically for Map/Reduce >  Each MR task typically operates on one HDFS block >  MR tasks run co-located on HDFS nodes >  Data locality: move the code to the data ✛  Each block of each file is replicated 3 times >  For reliability in the face of machine and drive failures >  Provides a few options for data locality during processing 6
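The block and replication figures above translate into simple arithmetic. The following is a minimal sketch, assuming a 128 MB block size (one value inside the 64MB – 512MB range quoted on the slides) and the 3x replication mentioned above; the function names are hypothetical and not part of any HDFS API.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to hold a file; the last block may be partial."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes: int, replication: int = 3) -> int:
    """Raw cluster capacity consumed once every block is replicated."""
    return file_size_bytes * replication


file_size = 10 * GB
blocks = block_count(file_size, 128 * MB)
raw = raw_storage_bytes(file_size)
print(blocks, "blocks,", raw // GB, "GB of raw storage")
```

Under these assumptions, a job that assigns one task per block would see one independent unit of work per block of the file.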
  8. 8. ✛  Each cluster has… >  A single Name Node ∗  Stores file system metadata ∗  Stores “Block ID” -> Data Node mapping >  Many Data Nodes ∗  Store actual file data >  Clients of HDFS… ∗  Communicate with Name Node to browse file system, get block locations for files ∗  Communicate directly with Data Nodes to read/write files 8
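The division of labor described above (metadata from the Name Node, data directly from the Data Nodes) can be illustrated with a toy in-memory model. This is only a sketch: the class and method names are hypothetical, there is no networking, and it does not reflect the real HDFS client protocol.

```python
class DataNode:
    """Toy Data Node: stores the actual bytes of block replicas."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy Name Node: stores metadata only, never file contents."""

    def __init__(self):
        self.files = {}      # path -> ordered list of block IDs
        self.locations = {}  # block ID -> list of Data Node names

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)

    def add_replica(self, block_id, datanode_name):
        self.locations.setdefault(block_id, []).append(datanode_name)

    def get_block_locations(self, path):
        return [(b, list(self.locations[b])) for b in self.files[path]]


def read_file(namenode, datanodes, path):
    """Client read: ask the Name Node where blocks live, then read from Data Nodes."""
    chunks = []
    for block_id, holders in namenode.get_block_locations(path):
        # Any replica holder would do; this sketch just takes the first one.
        chunks.append(datanodes[holders[0]].read_block(block_id))
    return b"".join(chunks)


# Two Data Nodes, one file split into two blocks, each block on both nodes.
datanodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
namenode = NameNode()
namenode.add_file("/logs/day1", ["blk_1", "blk_2"])
for block_id, data in [("blk_1", b"hello "), ("blk_2", b"world")]:
    for dn in datanodes.values():
        dn.write_block(block_id, data)
        namenode.add_replica(block_id, dn.name)

contents = read_file(namenode, datanodes, "/logs/day1")
```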
  11. 11. ✛  General-purpose SQL query engine: >  Should work both for analytical and transactional workloads >  Will support queries that take from milliseconds to hours ✛  Runs directly within Hadoop: >  Reads widely used Hadoop file formats >  Talks directly to HDFS (or HBase) >  Runs on same nodes that run Hadoop processes 11
  12. 12. ✛  Uses HQL for query language >  Hive Query Language – what Apache Hive uses >  Very close to complete SQL-92 compliance ✛  Extremely high performance >  C++ instead of Java >  Runtime code generation >  Completely new execution engine that doesn't build on MapReduce 12
  13. 13. ✛  Runs as a distributed service in the cluster >  One Impala daemon on each node with data >  Doesn’t use Hadoop Map/Reduce at all ✛  User submits query via ODBC/JDBC to any of the daemons ✛  Query is distributed to all nodes with relevant data ✛  If any node fails, the query fails and is re-executed 13
  15. 15. ✛  Two daemons: impalad and statestored ✛  Impala daemon (impalad) >  Handles client requests >  Handles all internal requests related to query execution ✛  State store daemon (statestored) >  Provides name service of cluster members >  Hive table metadata distribution 15
  16. 16. ✛  Query execution phases >  Request arrives at impalad via ODBC/JDBC >  Planner turns request into collection of plan fragments ∗  Plan fragments may be executed in parallel >  Coordinator impalad initiates execution of plan fragments on remote impalad daemons ✛  During execution >  Intermediate results are streamed between executors >  Query results are streamed back to client 16
  17. 17. ✛  During execution, impalad daemons connect directly to HDFS/HBase to read/write data 17
  19. 19. ✛  Impala is concerned with very low latency queries >  Need to make best use of available aggregate disk throughput ✛  Impala’s more efficient execution engine is far more likely to be I/O bound as compared to Hive >  Implies that for many queries the best performance improvement will be from improved I/O ✛  Impala query execution has no shuffle phase >  Implies that joins between tables do not necessitate all-to-all communication 19
  20. 20. ✛  Expose HDFS block replica disk location information ✛  Allow for explicitly co-located block replicas across files ✛  In-memory caching of hot tables/files ✛  Reduced copies during reading, short-circuit reads 20
  21. 21. ✛  The problem: NameNode knows which DataNodes blocks are on, not which disks >  Only the DNs are aware of block replica -> disk map ✛  Impala wants to make sure that separate plan fragments operate on data on separate disks >  Maximize aggregate available disk throughput 21
  22. 22. ✛  The solution: add new RPC call to DataNodes to expose which volumes (disks) replicas are stored on ✛  During query planning phase, impalad… >  Determines all DNs data for query is stored on >  Queries those DNs to get volume information ✛  During query execution phase, impalad… >  Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time ✛  With this additional info, Impala is able to ensure disk reads are large, minimize seeks 22
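The per-disk queueing idea above can be sketched as follows. This is an illustrative, simplified scheduler and not Impala's actual I/O manager: it assumes each pending read is already tagged with a disk identifier (a hypothetical "datanode:volume" string), and it models time as discrete rounds instead of asynchronous I/O.

```python
from collections import defaultdict, deque


def schedule_reads(reads, max_per_disk=1):
    """Group reads into rounds with at most max_per_disk reads per disk per round.

    reads: list of (read_id, disk_id) pairs, in arrival order.
    Returns a list of rounds; each round is a list of read IDs.
    """
    queues = defaultdict(deque)
    for read_id, disk_id in reads:
        queues[disk_id].append(read_id)

    rounds = []
    while any(queues.values()):
        current = []
        for disk_id in sorted(queues):  # fixed order keeps the output deterministic
            queue = queues[disk_id]
            for _ in range(min(max_per_disk, len(queue))):
                current.append(queue.popleft())
        rounds.append(current)
    return rounds


pending = [
    ("r1", "dn1:disk0"),
    ("r2", "dn1:disk0"),
    ("r3", "dn1:disk1"),
    ("r4", "dn2:disk0"),
]
rounds = schedule_reads(pending, max_per_disk=1)
print(rounds)  # [['r1', 'r3', 'r4'], ['r2']]
```

Reads on different disks proceed together, while the second read on `dn1:disk0` waits its turn, so no disk has to seek between competing requests.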
  23. 23. ✛  The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN ✛  Local reads at full disk throughput: ~800 MB/s ✛  Remote reads in a 1 gigabit network: ~128 MB/s ✛  Ideally all reads should be done on local disks 23
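To make the gap between these two figures concrete, here is a small worked calculation. It uses the throughput numbers quoted above; the 10 GB scan size is an arbitrary assumption chosen for illustration.

```python
LOCAL_MB_PER_S = 800.0   # local disk throughput quoted above
REMOTE_MB_PER_S = 128.0  # 1 gigabit network throughput quoted above


def scan_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to stream size_mb of data at a sustained throughput."""
    return size_mb / throughput_mb_per_s


size_mb = 10 * 1024  # hypothetical 10 GB of table data
local_s = scan_seconds(size_mb, LOCAL_MB_PER_S)
remote_s = scan_seconds(size_mb, REMOTE_MB_PER_S)
slowdown = remote_s / local_s
print(f"local: {local_s:.1f} s, remote: {remote_s:.1f} s, slowdown: {slowdown:.2f}x")
```

With these inputs the remote side of a join takes roughly six times as long to read as the local side, which is why it dominates the runtime.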
  24. 24. ✛  The solution: add feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes ✛  Gives Impala more control to lay out data ✛  Can ensure that tables/files which are joined frequently have their data co-located ✛  Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet 24
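One way to picture co-located placement is a policy that derives the replica nodes from a placement group name rather than from each file individually. The sketch below is purely illustrative: the group concept, the function name, and the hashing scheme are assumptions, and the slides do not describe how HDFS actually implements this feature.

```python
import hashlib


def place_group(group: str, nodes: list, replication: int = 3) -> list:
    """Pick the replica nodes for every file in a placement group.

    Nodes are ranked by a hash of (group, node), so all files that share a
    group name land on the same set of nodes.
    """
    def rank(node):
        return hashlib.sha256(f"{group}:{node}".encode()).hexdigest()

    return sorted(nodes, key=rank)[:replication]


cluster = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]

# Two tables that are joined frequently share one (hypothetical) group name.
orders_nodes = place_group("orders_customers_join", cluster)
customers_nodes = place_group("orders_customers_join", cluster)
print(orders_nodes == customers_nodes)  # True: both tables share replica nodes
```

Because both tables resolve to the same nodes, an impalad on any of those nodes can read both sides of the join from local disks.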
  25. 25. ✛  The problem: Impala queries are often bottlenecked at maximum disk throughput ✛  Memory throughput is much higher ✛  Memory is getting cheaper/denser >  Routinely seeing DNs with 48GB-96GB of RAM ✛  We’ve observed substantial Impala speedups when file data ends up in OS buffer cache 25
  26. 26. ✛  The solution: Add facility to HDFS to explicitly read specific HDFS files into main memory ✛  Allows Impala to read data at full memory bandwidth speeds (several GB/s) ✛  Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory >  Don’t want an MR job to inadvertently evict data from memory via the OS buffer cache 26
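The difference between explicit caching and relying on the OS buffer cache can be sketched with a small cache that separates pinned entries from ordinary LRU entries. This is a toy model under stated assumptions: the class and its methods are hypothetical, it is not the HDFS caching API, and it does not handle pinned data that exceeds capacity.

```python
from collections import OrderedDict


class PinnedCache:
    """Cache in which pinned files are never evicted; everything else is LRU."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.pinned = {}          # path -> bytes, chosen explicitly by the operator
        self.lru = OrderedDict()  # path -> bytes, ordinary reads (e.g. an MR job)

    def used(self):
        return sum(len(v) for v in self.pinned.values()) + \
               sum(len(v) for v in self.lru.values())

    def pin(self, path, data):
        self.lru.pop(path, None)
        self.pinned[path] = data
        self._evict()

    def put(self, path, data):
        if path in self.pinned:
            return
        self.lru[path] = data
        self.lru.move_to_end(path)
        self._evict()

    def get(self, path):
        if path in self.pinned:
            return self.pinned[path]
        if path in self.lru:
            self.lru.move_to_end(path)
            return self.lru[path]
        return None

    def _evict(self):
        # Only unpinned entries are candidates for eviction.
        while self.used() > self.capacity and self.lru:
            self.lru.popitem(last=False)


cache = PinnedCache(capacity_bytes=10)
cache.pin("/warehouse/hot_table", b"x" * 6)
cache.put("/tmp/mr_input_a", b"y" * 3)
cache.put("/tmp/mr_input_b", b"z" * 3)  # over capacity: evicts mr_input_a only
```

After the second `put`, the hot table is still resident even though the unrelated job kept adding data, which is the behavior the OS buffer cache cannot guarantee.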
  27. 27. ✛  The problem: In a typical HDFS read, data must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc. ✛  All of these extraneous copies consume memory and CPU resources unnecessarily 27
  28. 28. ✛  The solution: Allow for reads to be performed directly on local files, use direct buffers ✛  Added facility to HDFS to allow for reads to completely bypass the DataNode when the client is co-located with block replica files ✛  Added API in libhdfs to supply direct byte buffers to HDFS read operations to reduce number of copies to bare minimum 28
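The two ideas above (read the local file directly, and read into a buffer the caller already owns) have a rough analogue in plain Python file I/O. The sketch below is an analogy only: it does not use libhdfs or any HDFS API, the helper name is hypothetical, and a temporary file stands in for a local block replica.

```python
import os
import tempfile


def read_into_buffer(path, buf):
    """Fill a caller-supplied buffer from a local file; returns bytes read.

    readinto() writes straight into the caller's memory, so no intermediate
    bytes object is allocated for each read.
    """
    view = memoryview(buf)
    total = 0
    with open(path, "rb", buffering=0) as f:  # unbuffered: no extra Python-level copy
        while total < len(view):
            n = f.readinto(view[total:])
            if not n:  # 0 at end of file
                break
            total += n
    return total


# A temporary local file stands in for a block replica on the same machine.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"block replica bytes")
    local_path = tmp.name

buf = bytearray(64)  # allocated once by the client, reusable across reads
n = read_into_buffer(local_path, buf)
data = bytes(buf[:n])
os.remove(local_path)
```

Reusing one preallocated buffer across many reads is the design point: the reader controls where the bytes land instead of receiving a fresh copy each time.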
  29. 29. ✛  For simpler queries (no joins, tpch-q*) on large datasets (1TB) >  5-10x faster than Hive ✛  For complex queries on large datasets (1TB) >  20-50x faster than Hive ✛  For complex queries out of buffer cache (300GB) >  25-150x faster than Hive ✛  Due to Impala’s improved execution engine, low startup time, improved I/O, etc. 29