Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera


When and why to use Hadoop. Hadoop-able problems and use cases.

Published in: Technology

  1. What is Hadoop, and When Should I Consider Using It?
     Houston HUG, June 6th, 2011
     Vikram Oberoi, Cloudera
     Copyright 2011 Cloudera Inc. All rights reserved
  2. About me
     • Data engineer at Cloudera, present
         • Using data and Hadoop to enable more responsive support
     • Data engineer at Meebo, Aug ’09 – Nov ’10
         • Data infrastructure, analytics
     • CS at Stanford, ’09
         • Senior project: ext3 and XFS under Hadoop MapReduce workloads
     • Data engineer at Meebo, ’08
         • Built an A/B testing system
     • SDE Intern at Amazon, ’07
         • R&D on item-to-item similarities
  3. What will I talk about?
     • What is Hadoop?
     • Typical Hadoop-able problems and use cases
     • Cloudera overview
  4. What is Hadoop?
  5. Big Data Problem: Exploding Data Volumes
     • Online
         • Web-ready devices
         • Social media
         • Digital content
     • Enterprise
         • Transactions
         • R&D data
         • Operational (control) data
     • Open data initiatives
     [Diagram labels: "Complex, Unstructured" and "Relational"]
     • 2,500 exabytes of new information in 2012 with Internet as primary driver
     • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
     Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  6. Big Data Problem: Data Economics
     • Return on Byte (ROB) = value to be extracted from that byte / cost of storing that byte
     • If ROB is < 1, the byte gets buried in a tape wasteland, so we need cheaper active storage.
     [Diagram labels: High ROB, Low ROB]
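The Return on Byte ratio from the slide can be written out as a small calculation. This is only a sketch: the function names and the value and cost figures are hypothetical, chosen to show how cheaper storage moves the same data from below 1 to above 1.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """Return on Byte: value extracted from the data divided by the cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def worth_keeping_active(value_extracted: float, storage_cost: float) -> bool:
    """Data with ROB below 1 costs more to store than it returns."""
    return return_on_byte(value_extracted, storage_cost) >= 1.0


# Hypothetical figures: the same data set, valued at 500 units,
# stored on an expensive tier (2000) versus a cheaper tier (250).
expensive_rob = return_on_byte(500.0, 2000.0)
cheap_rob = return_on_byte(500.0, 250.0)
print(expensive_rob, cheap_rob)
```

Under these made-up numbers the value of the data never changes; only the denominator does, which is the argument for cheaper active storage.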
  7. Hadoop: A Data Platform with Unique Benefits
     • Consolidates Everything
         • Move complex and relational data into a single repository
     • Stores Inexpensively
         • Keep raw data always available
         • Use commodity hardware
     • Processes at the Source
         • Eliminate ETL bottlenecks
         • Mine data first, govern later
     [Diagram labels: MapReduce, Hadoop Distributed File System (HDFS)]
  8. Hadoop Distributed File System (HDFS): “How is data stored?”
     • Based on design of Google’s GFS
     • Data stored in large files
         • Files can contain any data
     • Files separated into blocks
         • 64MB up to 256MB per block (tunable)
         • Each block replicated across a cluster (tunable, usually 3 replicas across the cluster)
         • This buys you: fault tolerance, parallelizable disk reads
     • Store whatever you want in it
         • This buys you: flexibility
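The block and replica arithmetic on the HDFS slide can be sketched as follows. The function name and the 1 GiB example file are assumptions for illustration; the 64 MB block size and 3 replicas are the figures quoted on the slide, and this is plain arithmetic rather than a call into any Hadoop API.

```python
MIB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MIB, replication=3):
    """Return (block count, total block replicas, raw bytes stored) for one file."""
    # Ceiling division: a partially filled final block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    replicas = blocks * replication
    # The final block only occupies as much space as the data it holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


# Hypothetical 1 GiB file with the slide's defaults.
blocks, replicas, raw_bytes = hdfs_footprint(1024 * MIB)
print(blocks, replicas, raw_bytes // MIB)
```

With these defaults the file is split into 16 blocks, which is also the upper bound on how many nodes can read different parts of it at the same time.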
  9. MapReduce: “How is data processed?”
     • Framework designed for parallel processing of large disk-bound batch jobs
     • Data processed at the source
         • File ‘foo’ has 5 blocks, processing happens on 5 nodes
         • Parallelized disk reads -> remove disk bottleneck
     • Way to express algorithms such that they are parallelizable
     • Two functions at the core of every job:
         • Map function (group by)
         • Reduce function (perform action on group)
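The two functions named on the MapReduce slide can be illustrated with a word count run in a single Python process. This is a sketch of the programming model only: `run_job` is a hypothetical stand-in for the framework's map, group-by-key, and reduce phases, and it does not use Hadoop's actual API or run anything in parallel.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit one (key, value) pair per word; the key decides the group."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: perform an action on one group (here, sum the counts)."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["hadoop stores data", "hadoop processes data"]
counts = run_job(lines, map_fn, reduce_fn)
print(counts)
```

Because `map_fn` looks at one record at a time and `reduce_fn` looks at one group at a time, each call is independent of the others, which is what lets a real cluster spread them across nodes.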
  10. What is Hadoop?
     • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
     • Scalable data processing engine
         • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
         • MapReduce: fault-tolerant distributed processing
     • Key value
         • Flexible -> store data without a schema and add it later as needed
         • Affordable -> cost / TB at a fraction of traditional options
         • Broadly adopted -> a large and active ecosystem
         • Proven at scale -> dozens of petabyte+ implementations in production today
  11. Cloudera’s Distribution Including Apache Hadoop
     The Industry’s Leading Hadoop Distribution
     [Diagram components: Hue, Hue SDK, Oozie, Hive, Pig/Hive, Flume, Sqoop, HBase, ZooKeeper]
     • Open source – 100% Apache licensed and free for download
     • Simplified – Component versions & dependencies managed for you
     • Integrated – All components & functions interoperate through standard APIs
     • Reliable – Patched with fixes from future releases to improve stability
     • Supported – Employs project founders and committers for >90% of components
  12. Typical Hadoop-able problems
  13. What is common across Hadoop-able problems?
     Nature of the data
     • Complex data
     • Multiple data sources
     • Lots of it
     Nature of the analysis
     • Batch processing
     • Parallelizable
     Copyright 2010 Cloudera Inc. All rights reserved
  14. What kinds of analyses are possible with Hadoop?
     • Text mining
     • Index building
     • Graph creation and analysis
     • Pattern recognition
     • Collaborative filtering
     • Prediction models
     • Sentiment analysis
     • Risk assessment
  15. Top 10 Hadoop-able Problems
     See archived webinar on cloudera.com
     1. Modeling True Risk
     2. Customer Churn Analysis
     3. Recommendation engines
     4. Ad Targeting
     5. Point Of Sale Transaction Analysis
     6. Analysing Network Data To Predict Failure
     7. Threat Analysis/Fraud Detection
     8. Trade Surveillance
     9. Search Quality
     10. Data “Sandbox”
  16. Example: Modeling True Risk
  17. Example: Modeling True Risk
     Solution with Hadoop
     • Source, parse and aggregate disparate data sources to build comprehensive data picture
         • e.g. credit card records, call recordings, chat sessions, emails, banking activity
     • Structure and analyze
         • Sentiment analysis, graph creation, pattern recognition
     Typical Industry
     • Financial Services (Banks, Insurance)
  18. Example: Threat Analysis
  19. Example: Threat Analysis
     Solution with Hadoop
     • Parallel processing over huge datasets
     • Pattern recognition to identify anomalies, i.e. threats
     Typical Industry
     • Security
     • Financial Services
     • General: spam fighting, click fraud
  20. Example: Recommendation Engine
  21. Example: Recommendation Engine
     Solution with Hadoop
     • Batch processing framework
         • Allow execution in parallel over large datasets
     • Collaborative filtering
         • Collecting ‘taste’ information from many users
         • Utilizing information to predict what similar users like
     Typical Industry
     • Ecommerce, Manufacturing, Retail
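The collaborative filtering idea on the recommendation slide can be sketched with item co-occurrence counts: items that many users like together are suggested to users who have only some of them. This is one simple variant, not necessarily the method the talk had in mind; the function names and the `likes` data are hypothetical, and the sketch runs in one process, whereas on Hadoop the pair counting would be the batch job run in parallel over a much larger data set.

```python
from collections import Counter
from itertools import permutations


def co_occurrence(baskets):
    """Count how often each ordered pair of items is liked by the same user."""
    pairs = Counter()
    for items in baskets.values():
        for a, b in permutations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs


def recommend(user, baskets, top_n=2):
    """Score items the user has not seen by co-occurrence with items they like."""
    pairs = co_occurrence(baskets)
    seen = set(baskets[user])
    scores = Counter()
    for (a, b), count in pairs.items():
        if a in seen and b not in seen:
            scores[b] += count
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical 'taste' data collected from four users.
likes = {
    "ann": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "cat": ["book", "desk", "chair"],
    "dan": ["lamp", "desk"],
}
print(recommend("ann", likes))
```

Counting pairs per user is independent work for each user, so it fits the map step, while summing the counts for each pair fits the reduce step.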
  22. Example: Analyzing Network Data to Predict Failure
  23. Example: Analyzing Network Data to Predict Failure
     Solution with Hadoop
     • Take the computation to the data
     • Expand the range of indexing techniques from simple scans to more complex data mining
     • Better understand how the network reacts to fluctuations
         • How anomalies previously thought to be discrete may, in fact, be interconnected
     • Identify leading indicators of component failure
     Typical Industry
     • Utilities, Telecommunications, Data Centers
  24. Example: Supporting Hadoop at Cloudera
     • Collect data from customer clusters
         • OS configs, Hadoop configs, command outputs, logs
         • Data served by HBase, used by supporters
     • Consolidate data about Hadoop in HDFS
         • Mailing lists, issue trackers, wiki pages, IRC, books
         • Customer cluster data
     • Analyze many data sources to understand Hadoop issues and deployments
         • Build tools to enable easier diagnosis or proactive support
  25. Cloudera overview
  26. Cloudera Offerings
     Enabling the Enterprise Adoption of Apache Hadoop
     [Diagram labels: PLATFORM, SUPPORT & APPLICATIONS, PROFESSIONAL SERVICES, TRAINING]
  27. Contact/Resources/Questions
     • vikram@cloudera.com
     • irc.freenode.net #cloudera #hadoop
     • @cloudera
     • Cloudera Groups: http://groups.cloudera.org
     • Hadoop: The Definitive Guide
     • 10 Hadoop-able problems on Slideshare
     • Questions? (P.S. We’re hiring SAs in Houston!)