Cloudera Impala

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.

As presented to Portland Big Data User Group on July 23rd 2014.
http://www.meetup.com/Hadoop-Portland/events/194930422/



  1. Cloudera Impala
     Portland Big Data User Group, July 2014
     Alex Moundalexis, @technmsg
  2. Thirty Seconds About Alex
     • Solutions Architect
         • aka consultant
         • government
         • infrastructure
     • former coder of Perl
     • former administrator
     • fan of Portland
  3. What Does Cloudera Do?
     • product
         • distribution of Hadoop components, Apache licensed
         • enterprise tooling
     • support
     • training
     • services (aka consulting)
     • community
  4. Disclaimer
     • Cloudera builds software
         • most donated to Apache
         • some closed-source
     • Cloudera “products” I reference are open source
         • Apache Licensed
         • source code is on GitHub
             • https://github.com/cloudera
  5. What This Talk Isn’t About
     • deploying
         • Puppet, Chef, Ansible, homegrown scripts, intern labor
     • sizing & tuning
         • depends heavily on data and workload
     • coding
         • unless you count XML or CSV or SQL
     • algorithms
  6. Image credit: public domain, IFCAR
  7. Image credit: CC BY-SA, Lilian De Cassai
  8. cloud·e·ra im·pal·a
     /kloudˈi(ə)rə imˈpalə/
     noun
     a modern, open source, MPP SQL query engine for Apache Hadoop.
     “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  9. The Apache Hadoop Ecosystem
     Quick and dirty, for context.
  10. Why “Ecosystem?”
     • In the beginning, just Hadoop
         • HDFS
         • MapReduce
     • Today, dozens of interrelated components
         • I/O
         • Processing
         • Specialty Applications
         • Configuration
         • Workflow
  11. HDFS
     • Distributed, highly fault-tolerant filesystem
     • Optimized for large streaming access to data
     • Based on Google File System
         • http://research.google.com/archive/gfs.html
  12. Lots of Commodity Machines
     Image: Yahoo! Hadoop cluster [OSCON ’07]
  13. MapReduce (MR)
     • Programming paradigm
     • Batch oriented, not realtime
     • Works well with distributed computing
     • Lots of Java, but other languages supported
     • Based on Google’s paper
         • http://research.google.com/archive/mapreduce.html
  14. Under the Covers
  15. You specify map() and reduce() functions.
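
The last slide says the programmer supplies only a map() and a reduce() function. A minimal sketch of that division of labor, using word count as the example, might look like the following. It is a toy, single-process simulation in Python and not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the driver only imitates the grouping step that a real framework would perform across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Toy driver (hypothetical, not a Hadoop API): map, group by key, reduce."""
    # Shuffle: collect every mapped value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce: call the reducer once per key, in sorted key order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["impala runs sql on hadoop", "hadoop runs mapreduce"]
    print(run_mapreduce(sample, map_fn, reduce_fn))
```

The two user-written functions never see the whole data set: `map_fn` handles one line at a time and `reduce_fn` handles one key at a time, which is what lets a framework spread the work over many commodity machines.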
