Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

When and why to use Hadoop. Hadoop-able problems and use cases.


Published in: Technology


  • 1. What is Hadoop, and When Should I Consider Using It?
      Houston HUG, June 6th, 2011
      Vikram Oberoi, Cloudera
      Copyright 2011 Cloudera Inc. All rights reserved
  • 2. About me
      • Data engineer at Cloudera, present
          • Using data and Hadoop to enable more responsive support
      • Data engineer at Meebo, Aug '09 – Nov '10
          • Data infrastructure, analytics
      • CS at Stanford, '09
          • Senior project: ext3 and XFS under Hadoop MapReduce workloads
      • Data engineer at Meebo, '08
          • Built an A/B testing system
      • SDE Intern at Amazon, '07
          • R&D on item-to-item similarities
  • 3. What will I talk about?
      • What is Hadoop?
      • Typical Hadoop-able problems and use cases
      • Cloudera overview
  • 4. What is Hadoop?
  • 5. Big Data Problem: Exploding Data Volumes
      • Online
          • Web-ready devices
          • Social media
          • Digital content
      • Enterprise
          • Transactions
          • R&D data
          • Operational (control) data
      • Open data initiatives
      • 2,500 exabytes of new information in 2012 with Internet as primary driver
      • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year
      Diagram labels: Complex, Unstructured; Relational
      Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  • 6. Big Data Problem: Data Economics
      • Return on Byte (ROB) = value to be extracted from that byte / cost of storing that byte
      • If ROB is < 1, the data will be buried in the tape wasteland; thus we need cheaper active storage.
      Diagram labels: High ROB; Low ROB
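The Return on Byte ratio from slide 6 can be written out as a small calculation. This is only an illustrative sketch: the function names and the dollar figures are hypothetical and do not come from the talk; only the ratio and the "ROB < 1" threshold do.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """ROB = value to be extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def worth_keeping_active(value_extracted: float, storage_cost: float) -> bool:
    """True when ROB >= 1, i.e. the data is not headed for tape."""
    return return_on_byte(value_extracted, storage_cost) >= 1.0


# Hypothetical numbers: the same dataset, worth 50 units of value,
# stored on an expensive system (cost 200) and on a cheaper one (cost 40).
expensive = return_on_byte(50, 200)   # 0.25, below 1
cheap = return_on_byte(50, 40)        # 1.25, above 1
```

The value of the data does not change between the two cases; only the denominator does, which is the slide's argument for cheaper active storage.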
  • 7. Hadoop: A Data Platform with Unique Benefits
      • Consolidates Everything
          • Move complex and relational data into a single repository
      • Stores Inexpensively
          • Keep raw data always available
          • Use commodity hardware
      • Processes at the Source
          • Eliminate ETL bottlenecks
          • Mine data first, govern later
      Diagram labels: MapReduce; Hadoop Distributed File System (HDFS)
  • 8. Hadoop Distributed File System (HDFS): "How is data stored?"
      • Based on design of Google's GFS
      • Data stored in large files
          • Files can contain any data
      • Files separated into blocks
          • 64MB up to 256MB per block (tunable)
          • Each block replicated across a cluster (tunable, usually 3 replicas across the cluster)
          • This buys you: fault tolerance, parallelizable disk reads
      • Store whatever you want in it
          • This buys you: flexibility
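The block and replica arithmetic on slide 8 can be sketched as follows. This is plain arithmetic, not an HDFS API: `hdfs_footprint` and its parameter names are hypothetical, with defaults taken from the slide (64MB blocks, 3 replicas). It assumes a final partial block occupies only as much disk as the data it holds.

```python
import math


def hdfs_footprint(file_mb: float, block_mb: int = 64, replication: int = 3) -> dict:
    """Estimate how a file of file_mb megabytes is split and replicated."""
    if file_mb < 0 or block_mb <= 0 or replication < 1:
        raise ValueError("invalid size, block size, or replication factor")
    blocks = math.ceil(file_mb / block_mb)       # last block may be partial
    return {
        "blocks": blocks,                        # units available for parallel reads
        "block_replicas": blocks * replication,  # copies spread over the cluster
        "raw_storage_mb": file_mb * replication, # disk consumed across all replicas
    }


# A 1 GB file with the defaults from the slide:
one_gb = hdfs_footprint(1024)
# Same file with 256MB blocks: fewer, larger blocks.
one_gb_large_blocks = hdfs_footprint(1024, block_mb=256)
```

With the defaults, the 1 GB file becomes 16 blocks and 48 block replicas; losing one node costs at most one copy of any block, and up to 16 nodes can read different blocks at the same time.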
  • 9. MapReduce: "How is data processed?"
      • Framework designed for parallel processing of large disk-bound batch jobs
      • Data processed at the source
          • File 'foo' has 5 blocks, processing happens on 5 nodes
          • Parallelized disk reads -> remove disk bottleneck
      • Way to express algorithms such that they are parallelizable
      • Two functions at the core of every job:
          • Map function (group by)
          • Reduce function (perform action on group)
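The two functions described on slide 9 can be illustrated with a single-process word count. This is a minimal sketch in plain Python, not Hadoop's API: `run_job`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step that a real cluster performs across nodes is simulated here with a dictionary.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit (key, value) pairs; the key decides the group."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: perform an action on all values of one group."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Run map over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # on a cluster: one map task per block
        for key, value in map_fn(record):
            groups[key].append(value)      # the "group by" step
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_job(
    ["hadoop stores data", "hadoop processes data"],
    map_fn,
    reduce_fn,
)
```

Because each record is mapped independently and each group is reduced independently, both phases can be spread over many nodes, which is what makes an algorithm expressed this way parallelizable.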
  • 10. What is Hadoop?
      • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
      • Scalable data processing engine
          • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
          • MapReduce: fault-tolerant distributed processing
      • Key value
          • Flexible -> store data without a schema and add it later as needed
          • Affordable -> cost / TB at a fraction of traditional options
          • Broadly adopted -> a large and active ecosystem
          • Proven at scale -> dozens of petabyte+ implementations in production today
  • 11. Cloudera's Distribution Including Apache Hadoop: The Industry's Leading Hadoop Distribution
      Diagram components: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, Zookeeper
      • Open source: 100% Apache licensed and free for download
      • Simplified: Component versions & dependencies managed for you
      • Integrated: All components & functions interoperate through standard APIs
      • Reliable: Patched with fixes from future releases to improve stability
      • Supported: Employs project founders and committers for >90% of components
  • 12. Typical Hadoop-able problems
  • 13. What is common across Hadoop-able problems?
      Nature of the data
      • Complex data
      • Multiple data sources
      • Lots of it
      Nature of the analysis
      • Batch processing
      • Parallelizable
  • 14. What kinds of analyses are possible with Hadoop?
      • Text mining
      • Index building
      • Graph creation and analysis
      • Pattern recognition
      • Collaborative filtering
      • Prediction models
      • Sentiment analysis
      • Risk assessment
  • 15. Top 10 Hadoop-able Problems
      See archived webinar on
      1. Modeling True Risk
      2. Customer Churn Analysis
      3. Recommendation engines
      4. Ad Targeting
      5. Point Of Sale Transaction Analysis
      6. Analysing Network Data To Predict Failure
      7. Threat Analysis/Fraud Detection
      8. Trade Surveillance
      9. Search Quality
      10. Data "Sandbox"
  • 16. Example: Modeling True Risk
  • 17. Example: Modeling True Risk
      Solution with Hadoop
      • Source, parse and aggregate disparate data sources to build a comprehensive data picture
          • e.g. credit card records, call recordings, chat sessions, emails, banking activity
      • Structure and analyze
          • Sentiment analysis, graph creation, pattern recognition
      Typical Industry
      • Financial Services (Banks, Insurance)
  • 18. Example: Threat Analysis
  • 19. Example: Threat Analysis
      Solution with Hadoop
      • Parallel processing over huge datasets
      • Pattern recognition to identify anomalies, i.e. threats
      Typical Industry
      • Security
      • Financial Services
      • General: spam fighting, click fraud
  • 20. Example: Recommendation Engine
  • 21. Example: Recommendation Engine
      Solution with Hadoop
      • Batch processing framework
          • Allow execution in parallel over large datasets
      • Collaborative filtering
          • Collecting 'taste' information from many users
          • Utilizing information to predict what similar users like
      Typical Industry
      • Ecommerce, Manufacturing, Retail
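The collaborative filtering idea on slide 21 can be sketched with simple item co-occurrence counts. This is one basic variant chosen for illustration, not necessarily the method the talk had in mind; the function names, users, and items are all hypothetical, and the code runs in a single process rather than as a Hadoop job.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count, for every ordered item pair (a, b), how many users have both."""
    counts = Counter()
    for items in baskets.values():
        for a, b in permutations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, baskets, top_n=3):
    """Rank items the user lacks by co-occurrence with items the user has."""
    counts = cooccurrence(baskets)
    owned = set(baskets[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical 'taste' data collected from four users.
baskets = {
    "ann": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "cat": ["book", "desk", "chair"],
    "dan": ["lamp", "desk"],
}
suggestions = recommend("ann", baskets)
```

The pair counting is independent per user and the scoring is independent per item, so both steps map naturally onto the batch, parallel style of processing the slide describes.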
  • 22. Example: Analyzing Network Data to Predict Failure
  • 23. Example: Analyzing Network Data to Predict Failure
      Solution with Hadoop
      • Take the computation to the data
      • Expand the range of indexing techniques from simple scans to more complex data mining
      • Better understand how the network reacts to fluctuations
          • How previously thought discrete anomalies may, in fact, be interconnected
      • Identify leading indicators of component failure
      Typical Industry
      • Utilities, Telecommunications, Data Centers
  • 24. Example: Supporting Hadoop at Cloudera
      • Collect data from customer clusters
          • OS configs, Hadoop configs, command outputs, logs
          • Data served by HBase, used by supporters
      • Consolidate data about Hadoop in HDFS
          • Mailing lists, issue trackers, wiki pages, IRC, books
          • Customer cluster data
      • Analyze many data sources to understand Hadoop issues and deployments
          • Build tools to enable easier diagnosis or proactive support
  • 25. Cloudera overview
  • 26. Cloudera Offerings: Enabling the Enterprise Adoption of Apache Hadoop
      • PLATFORM
      • SUPPORT & APPLICATIONS
      • PROFESSIONAL SERVICES
      • TRAINING
  • 27. Contact/Resources/Questions
      • #cloudera #hadoop
      • @cloudera
      • Cloudera Groups: http://
      • Hadoop: The Definitive Guide
      • 10 Hadoop-able problems on Slideshare
      • Questions? (P.S. We're hiring SAs in Houston!)