Introduction to Cloudera Impala

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.

As presented to Charm City Linux on March 25th 2014.
http://www.meetup.com/CharmCityLinux/events/168288632/

Published in: Technology

Transcript

  • 1. Cloudera Impala. Charm City Linux, March 2014. Alex Moundalexis, alexm+ccl@clouderagovt.com, @technmsg
  • 2. Thirty Seconds About Alex • Solutions Architect • aka consultant • government • infrastructure • former coder of Perl • former administrator • likes shiny objects
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community
  • 4. Disclaimer • Cloudera builds software • most donated to Apache • some closed-source • Cloudera “products” I reference are open source • Apache Licensed • source code is on GitHub • https://github.com/cloudera
  • 5. What This Talk Isn’t About • deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor • sizing & tuning • depends heavily on data and workload • coding • unless you count XML or CSV or SQL • algorithms
  • 6. The Apache Hadoop Ecosystem: quick and dirty, for context.
  • 7. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
  • 8. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html
  • 9. Lots of Commodity Machines (Image: Yahoo! Hadoop cluster, OSCON ’07)
  • 10. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html
  • 11. Under the Covers
  • 12. You specify map() and reduce() functions.
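
To make the map() and reduce() idea concrete, here is a minimal single-process sketch in Python. It is not from the slides and does not use the Hadoop API: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, word counting is an assumed example task, and `run_job` only imitates what the framework does between the two phases (grouping every emitted value by key).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """reduce(): combine all values that were emitted for a single key."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)  # the "shuffle": collect values per key
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    print(run_job(sample))
```

In a real cluster the map calls run in parallel on the machines that hold the data, and the grouping step moves data across the network; only the two user-supplied functions resemble what is shown here.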