(Aaron Myers) HDFS & Impala

Transcript

  • 1. HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS
    How HDFS is evolving to meet new needs
  • 2. ✛  Aaron T. Myers
      >  Email: atm@cloudera.com, atm@apache.org
      >  Twitter: @atm
    ✛  Hadoop PMC Member / Committer at ASF
    ✛  Software Engineer at Cloudera
    ✛  Primarily work on HDFS and Hadoop Security
  • 3. ✛  HDFS introduction/architecture
    ✛  Impala introduction/architecture
    ✛  New requirements for HDFS
      >  Block replica / disk placement info
      >  Correlated file/block replica placement
      >  In-memory caching for hot files
      >  Short-circuit reads, reduced copy overhead
  • 4. HDFS INTRODUCTION
  • 5. ✛  HDFS is the Hadoop Distributed File System
    ✛  Append-only distributed file system
    ✛  Intended to store many very large files
      >  Block sizes usually 64MB – 512MB
      >  Files composed of several blocks
    ✛  Write a file once during ingest
    ✛  Read a file many times for analysis (see the sketch after this slide)
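
    A minimal sketch of that write-once / read-many flow using the standard Hadoop FileSystem API; the fs.defaultFS address and the file path below are illustrative placeholders, not values from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/ingest/events.txt");   // hypothetical path

            // Write the file once during ingest (HDFS files are append-only).
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("event-1\nevent-2\n");
            }

            // Read it back as many times as analysis requires.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[128];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n));
            }
        }
    }
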
  • 6. ✛  HDFS originally designed specifically for Map/Reduce
      >  Each MR task typically operates on one HDFS block
      >  MR tasks run co-located on HDFS nodes
      >  Data locality: move the code to the data
    ✛  Each block of each file is replicated 3 times
      >  For reliability in the face of machine, drive failures
      >  Provide a few options for data locality during processing
  • 7. HDFS ARCHITECTURE
  • 8. ✛  Each cluster has…
      >  A single Name Node
        ∗  Stores file system metadata
        ∗  Stores “Block ID” -> Data Node mapping
      >  Many Data Nodes
        ∗  Store actual file data
      >  Clients of HDFS…
        ∗  Communicate with Name Node to browse file system, get block locations for files (see the sketch after this slide)
        ∗  Communicate directly with Data Nodes to read/write files
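
    To make the client / Name Node interaction concrete, the sketch below asks the Name Node where each block of a file lives before any data is read; the cluster address and path are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/data/table/part-00000"));  // hypothetical file

            // Ask the Name Node which Data Nodes hold each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
            }
        }
    }
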
  • 9. (diagram only; no slide text)
  • 10. IMPALA INTRODUCTION
  • 11. ✛  General-purpose SQL query engine:
      >  Should work both for analytical and transactional workloads
      >  Will support queries that take from milliseconds to hours
    ✛  Runs directly within Hadoop:
      >  Reads widely used Hadoop file formats
      >  Talks directly to HDFS (or HBase)
      >  Runs on same nodes that run Hadoop processes
  • 12. ✛  Uses HQL as its query language
      >  Hive Query Language – what Apache Hive uses
      >  Very close to complete SQL-92 compliance
    ✛  Extremely high performance
      >  C++ instead of Java
      >  Runtime code generation
      >  Completely new execution engine that doesn't build on MapReduce
  • 13. ✛  Runs as a distributed service in the cluster
      >  One Impala daemon on each node with data
      >  Doesn’t use Hadoop Map/Reduce at all
    ✛  User submits a query via ODBC/JDBC to any of the daemons (see the sketch after this slide)
    ✛  The query is distributed to all nodes with relevant data
    ✛  If any node fails, the query fails and is re-executed
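
    A rough sketch of the client side of that flow: a standard JDBC connection to one impalad, which coordinates the query across the cluster. The driver class, port 21050, and the table name are assumptions for this example, not details from the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcQuery {
        public static void main(String[] args) throws Exception {
            // Any impalad in the cluster can act as the coordinator for this query.
            Class.forName("org.apache.hive.jdbc.HiveDriver");  // assumed driver
            String url = "jdbc:hive2://impalad-host:21050/;auth=noSasl";  // assumed host/port

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM web_logs")) {  // hypothetical table
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }
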
  • 14. IMPALA ARCHITECTURE
  • 15. ✛  Two daemons: impalad and statestored
    ✛  Impala daemon (impalad)
      >  Handles client requests
      >  Handles all internal requests related to query execution
    ✛  State store daemon (statestored)
      >  Provides name service of cluster members
      >  Hive table metadata distribution
  • 16. ✛  Query execution phases
      >  Request arrives at an impalad via ODBC/JDBC
      >  Planner turns the request into a collection of plan fragments
        ∗  Plan fragments may be executed in parallel
      >  Coordinator impalad initiates execution of plan fragments on remote impalad daemons
    ✛  During execution
      >  Intermediate results are streamed between executors
      >  Query results are streamed back to the client
  • 17. ✛  During execution, impalad daemons connect directly to HDFS/HBase to read/write data
  • 18. HDFS IMPROVEMENTS MOTIVATED BY IMPALA
  • 19. ✛  Impala is concerned with very low latency queries
      >  Needs to make the best use of available aggregate disk throughput
    ✛  Impala’s more efficient execution engine is far more likely to be I/O bound than Hive
      >  Implies that for many queries the best performance improvement will come from improved I/O
    ✛  Impala query execution has no shuffle phase
      >  Implies that joins between tables do not necessitate all-to-all communication
  • 20. ✛  Expose HDFS block replica disk location information
    ✛  Allow for explicitly co-located block replicas across files
    ✛  In-memory caching of hot tables/files
    ✛  Reduced copies during reading, short-circuit reads
  • 21. ✛  The problem: NameNode knows which DataNodes blocks are on, not which disks
      >  Only the DNs are aware of block replica -> disk map
    ✛  Impala wants to make sure that separate plan fragments operate on data on separate disks
      >  Maximize aggregate available disk throughput
  • 22. ✛  The solution: add a new RPC call to DataNodes to expose which volumes (disks) replicas are stored on
    ✛  During the query planning phase, impalad…
      >  Determines all DNs the query’s data is stored on
      >  Queries those DNs to get volume information
    ✛  During the query execution phase, impalad…
      >  Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
    ✛  With this additional info, Impala is able to ensure disk reads are large and seeks are minimized (see the sketch after this slide)
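
    Roughly what this looks like from a client using the block-storage-location API this work led to in Hadoop 2.x; the class names and the enabling config key are stated from memory and should be treated as assumptions, as should the cluster address and file path:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.BlockStorageLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ShowVolumeLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
            // DataNode-side support for the extra RPC has to be enabled explicitly (assumed key).
            conf.setBoolean("dfs.datanode.hdfs-blocks-metadata.enabled", true);

            DistributedFileSystem dfs =
                (DistributedFileSystem) new Path("/").getFileSystem(conf);

            FileStatus status = dfs.getFileStatus(new Path("/data/table/part-00000"));  // hypothetical file
            BlockLocation[] blocks = dfs.getFileBlockLocations(status, 0, status.getLen());

            // Extra round trip to the DataNodes: which volume (disk) holds each replica?
            BlockStorageLocation[] volumes = dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
            for (BlockStorageLocation loc : volumes) {
                System.out.println("hosts=" + Arrays.toString(loc.getHosts())
                    + " volumeIds=" + Arrays.toString(loc.getVolumeIds()));
            }
        }
    }
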
  • 23. ✛  The problem: when performing a join, a single impalad may have to read both from a local file and from a remote file on another DN
    ✛  Local reads at full disk throughput: ~800 MB/s
    ✛  Remote reads on a 1 gigabit network: ~128 MB/s
    ✛  Ideally all reads should be done from local disks
  • 24. ✛  The solution: add a feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
    ✛  Gives Impala more control over how data is laid out
    ✛  Can ensure that tables/files which are joined frequently have their data co-located
    ✛  Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
  • 25. ✛  The problem: Impala queries are often bottlenecked at maximum disk throughput
    ✛  Memory throughput is much higher
    ✛  Memory is getting cheaper/denser
      >  Routinely seeing DNs with 48GB-96GB of RAM
    ✛  We’ve observed substantial Impala speedups when file data ends up in the OS buffer cache
  • 26. ✛  The solution: add a facility to HDFS to explicitly read specific HDFS files into main memory (see the sketch after this slide)
    ✛  Allows Impala to read data at full memory bandwidth (several GB/s)
    ✛  Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory
      >  Don’t want an MR job to inadvertently evict that data from memory via the OS buffer cache
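
    In later HDFS releases this became the centralized cache management feature; under that assumption, the sketch below shows how an operator-controlled cache pool and a cache directive might be created programmatically (the pool name and table path are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class PinHotTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
            DistributedFileSystem dfs =
                (DistributedFileSystem) new Path("/").getFileSystem(conf);

            // Cache pools give the cluster operator control over what stays in memory
            // (creating a pool is typically an administrative action).
            dfs.addCachePool(new CachePoolInfo("impala-hot"));  // hypothetical pool name

            // Pin a frequently joined table's directory into DataNode memory.
            long directiveId = dfs.addCacheDirective(
                new CacheDirectiveInfo.Builder()
                    .setPath(new Path("/user/hive/warehouse/dim_customers"))  // hypothetical table path
                    .setPool("impala-hot")
                    .build());
            System.out.println("added cache directive " + directiveId);
        }
    }
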
  • 27. ✛  The problem: a typical read in HDFS must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
    ✛  All of these extraneous copies use unnecessary memory and CPU resources
  • 28. ✛  The solution: allow reads to be performed directly on local files, using direct buffers (see the sketch after this slide)
    ✛  Added a facility to HDFS to allow reads to completely bypass the DataNode when the client is co-located with the block replica files
    ✛  Added an API in libhdfs to supply direct byte buffers to HDFS read operations, reducing the number of copies to the bare minimum
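
    A minimal client-side sketch combining short-circuit local reads with a direct ByteBuffer read; the config keys and domain socket path are assumptions about the deployment, and the DataNodes must be configured to match:

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShortCircuitRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");  // assumed NameNode address
            // Client-side switches for short-circuit local reads (assumed keys).
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");  // assumed socket path

            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/table/part-00000"))) {  // hypothetical file
                // Reading into a direct ByteBuffer avoids an extra copy into a heap array.
                ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024);
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
        }
    }
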
  • 29. ✛  For simpler queries (no joins, tpch-q*) on large datasets (1TB)
      >  5-10x faster than Hive
    ✛  For complex queries on large datasets (1TB)
      >  20-50x faster than Hive
    ✛  For complex queries out of buffer cache (300GB)
      >  25-150x faster than Hive
    ✛  Due to Impala’s improved execution engine, low startup time, improved I/O, etc.