Transcript of "Common and unique use cases for Apache Hadoop"

  1. Common and Unique Use Cases for Apache Hadoop (August 30, 2011)
  2. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  3. Exploding Data Volumes
     • Online (complex, unstructured): web-ready devices, social media, digital content, smart grids
     • Enterprise (relational): transactions, R&D data, operational (control) data
     The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year: 2,500 exabytes of new information in 2012, with the Internet as the primary driver.
     Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
  4. Origin of Hadoop: How does an elephant sneak up on you?
     (Timeline, 2002-2010: open-source web crawler project created by Doug Cutting; Google publishes the GFS and MapReduce papers; open-source MapReduce & HDFS project created by Doug Cutting; Hadoop wins the terabyte sort benchmark; runs a 4,000-node Hadoop cluster; Cloudera releases CDH3 and Cloudera Enterprise and launches SQL support for Hadoop.)
  5. What is Apache Hadoop? Open Source Storage and Processing Engine
     • Consolidates Everything: move complex and relational data into a single repository
     • Stores Inexpensively: keep raw data always available; use commodity hardware
     • Processes at the Source: eliminate ETL bottlenecks; mine data first, govern later
     (Diagram: MapReduce atop the Hadoop Distributed File System, HDFS.)
  6. What is Apache Hadoop? The Standard Way Big Data Gets Done
     • Hadoop is Flexible: structured or unstructured; schema or no schema; high volume or merely terabytes; all kinds of analytic applications
     • Hadoop is Open: 100% Apache-licensed open source
     • Hadoop is Scalable: proven at petabyte scale
     • Benefits: controls costs by storing data more affordably per terabyte than any other platform; drives revenue by extracting value from data that was previously out of reach
  7. What is Apache Hadoop? The Importance of Being Open
     • No Lock-In: investments in skills, services & hardware are preserved regardless of vendor choice
     • Community Development: Hadoop & related projects are expanding at a rapid pace
     • Rich Ecosystem: dozens of complementary software, hardware and services firms
  8. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  9. Log Processing: A Perfect Fit
     • Common uses of logs
       • Find or count events (grep)
           grep "ERROR" file
           grep -c "ERROR" file
       • Calculate metrics (performance or user behavior analysis)
           awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) print k, sums[k]/counts[k]}'
       • Investigate user sessions
           grep "USER" files … | sort | less
  10. Log Processing: A Perfect Fit
      • Shoot… too much data
        • Homegrown parallel processing is often done on a per-file basis, because it's easy (see the sketch below)
        • No parallelism on a single large file
      (Diagram: Task 0, Task 1, Task 2, each assigned one whole access_log file.)
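      To make the per-file pattern concrete, a minimal sketch (the file names and the -P 8 parallelism level are assumptions, not from the deck):

        # One grep process per file; parallelism is capped by the number
        # of files, so a single large access_log still gets one process.
        ls access_log.* | xargs -n 1 -P 8 grep -c "ERROR"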
  11. Log Processing: A Perfect Fit
      • MapReduce to the rescue!
        • Processing is done per unit of data (a streaming sketch follows)
      (Diagram: one access_log split across Task 0-3 as byte ranges 0-64MB, 64-128MB, 128-192MB, 192-256MB; each task is responsible for a unit of data.)
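      As a hedged sketch of the MapReduce version, the same grep as a map-only Hadoop Streaming job (the jar path varies by distribution, and the HDFS paths are assumptions); each map task greps its own input split in parallel:

        # Zero reducers: each mapper's output goes straight to HDFS.
        # stream.non.zero.exit.is.failure=false keeps splits with no
        # matches (where grep exits 1) from failing their tasks.
        hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
          -D mapred.reduce.tasks=0 \
          -D stream.non.zero.exit.is.failure=false \
          -input /logs/access_log \
          -output /logs/error-lines \
          -mapper 'grep ERROR'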
  12. Log Processing: A Perfect Fit
      • Network or disk are bottlenecks
        • Reading 100GB of data takes about 14 minutes over a 1GbE network connection (~125 MB/second) and about 22 minutes from a standard disk drive (~75 MB/second)
      (Diagram: a single grep reading access_log; bandwidth is limited.)
  13. Log Processing: A Perfect Fit
      • Hadoop to the rescue!
        • Eliminates the network bottleneck: data is on local disk
        • Data is read from many, many disks in parallel
      (Diagram: physical machines NodeA, NodeX, NodeY, NodeZ running Task 0-3, each over its local block: 0-64MB, 64-128MB, 128-192MB, 192-256MB.)
  14. Log Processing: A Perfect Fit
      • Hadoop currently scales to 4,000 nodes; the goal for the next release is 10,000 nodes
      • Nodes typically have 12 hard drives, and a single hard drive has throughput of about 75MB/second
      • 12 hard drives × 75 MB/second = 900 MB/second per node; × 4,000 nodes ≈ 3.4 TB/second
        • That's bytes, not bits
      • That's enough bandwidth to read 1PB (1,000 TB) in about 5 minutes
  15. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  16. Catching `Osama’: Embarrassingly Parallel
      • You have a few billion images of faces with geo-tags
        • Tremendous storage problem
        • Tremendous processing problem: bandwidth and coordination
  17. Catching `Osama’: Embarrassingly Parallel
      • Store the images in Hadoop
      • When processing, Hadoop will read the images from local disk: thousands of local disks spread throughout the cluster
      • Use a map-only job to compare input images against the `needle’ image
  18. Catching `Osama’: Embarrassingly Parallel
      (Diagram: images stored in SequenceFiles; Map Task 0 and Map Task 1 each hold a copy of the `needle’ and output the faces `matching’ it. A job-submission sketch follows.)
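      A hypothetical job submission for this pattern (face-match.jar, the com.example.FaceMatch driver class, and all paths are made up for illustration): the needle image rides the distributed cache to every task, and the map-only job scans the SequenceFile of images in parallel.

        # -files ships needle.jpg to every node via the distributed cache;
        # mappers compare each input image against it and emit only matches.
        hadoop jar face-match.jar com.example.FaceMatch \
          -files needle.jpg \
          /images/faces.seq /images/matches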
  19. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  20. Extract Transform Load (ETL): Everyone is doing it
      • One of the most common use cases I see is replacing ETL processes
      • Hadoop is a huge sink of cheap storage and processing
      • Aggregates are built in Hadoop and exported
      • Apache Hive provides SQL-like querying on raw data (see the sketch below)
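      For example, a minimal Hive sketch of the aggregate-then-export flow (the table and column names are assumptions): build a small daily rollup from the raw logs, ready for export to the warehouse.

        # Aggregate raw logs into a compact summary table with HiveQL.
        hive -e "
          INSERT OVERWRITE TABLE daily_pageviews
          SELECT url, COUNT(*) AS views
          FROM raw_access_logs
          WHERE dt = '2011-08-30'
          GROUP BY url"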
  21. Extract Transform Load (ETL): Everyone is doing it
      (Diagram: a `Real’ Time System (Website) with an Online DB feeds, via ETL, the Analytical DB of a Data Warehouse, which serves Business Intelligence Applications. Annotation on the ETL step: “much blood shed here”.)
  22. Extract Transform Load (ETL): Everyone is doing it
      (Diagram: the same pipeline with Hadoop in the middle: Online DB → import → Hadoop → export → Analytical DB.)
  23. Extract Transform Load (ETL): Everyone is doing it
      (Diagram: Apache Sqoop handles both hops: Online DB → Apache Sqoop → Hadoop → Apache Sqoop → Analytical DB. A Sqoop sketch follows.)
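      A sketch of both hops with Sqoop (the JDBC connection strings, table names, and directories are assumptions):

        # Online DB -> HDFS
        sqoop import --connect jdbc:mysql://onlinedb/site \
          --table orders --target-dir /data/raw/orders
        # HDFS -> analytical DB
        sqoop export --connect jdbc:mysql://warehouse/dw \
          --table daily_pageviews \
          --export-dir /user/hive/warehouse/daily_pageviews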
  24. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  25. Analytics in HBase: Scaling writes
      • Analytics is often simply counting things
      • Facebook chose HBase to store its massive counter infrastructure (more later)
      • How might one implement a counter infrastructure in HBase?
  26. Analytics in HBase: Scaling writes
      • A `Like’ button IMG request sends an HTTP request to Facebook servers, which increments several counters.

      User & Content Type Counters:
        User            Content     Counter
        brock@me.com    NEWS        5431
        brock@me.com    TECH        79310
        brock@me.com    SHOPPING    59
        tom@him.com     SPORTS      94214

      Individual Page Counters:
        URL                          Counter
        com.cloudera/blog/…          154
        com.cloudera/downloads/…     923621
        com.cloudera/resources/…     2138
  27. Analytics in HBase: Scaling writes
      • The host is reversed in the URL as part of the key:

      Individual Page Counters:
        URL                          Counter
        com.cloudera/blog/…          154
        com.cloudera/downloads/…     923621
        com.cloudera/resources/…     2138

      • Data is physically stored in sorted order
      • Scanning all `com.cloudera’ counters results in sequential I/O (see the shell sketch below)
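      A sketch of the pattern in the HBase shell (the table and column family names are assumptions): incr is an atomic, server-side increment, and because row keys are stored sorted, a start/stop-row scan over the reversed-host prefix reads sequentially.

hbase shell <<'EOF'
# Atomic server-side increment of one page counter
incr 'page_counters', 'com.cloudera/blog/', 'f:count', 1
# '0' sorts just after '/', so this scan covers only com.cloudera keys
scan 'page_counters', {STARTROW => 'com.cloudera/', STOPROW => 'com.cloudera0'}
EOF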
  28. Facebook Analytics: Scaling writes
      • Real-time counters of URLs shared, links “liked”, impressions generated
      • 20 billion events/day (200K events/second)
      • ~30 second latency from click to count
      • Heavy use of the incrementColumnValue API for consistent counters
      • Tried MySQL and Cassandra, settled on HBase
      http://tiny.cloudera.com/hbase-„-analytics
  29. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  30. Machine Learning: Apache Mahout (Text Clustering on Google News)
  31. Machine Learning: Apache Mahout (Collaborative Filtering on Amazon)
  32. Machine Learning: Apache Mahout (Classification in GMail)
  33. Machine Learning: Apache Mahout
      • Apache Mahout implements (a CLI sketch follows):
        • Collaborative Filtering
        • Classification
        • Clustering
        • Frequent itemset mining
      • More is coming with the integration of MapReduce.Next
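      For flavor, a hedged sketch of Mahout's item-based recommender from the command line (the paths are made up, and the exact driver flags varied across Mahout releases of this era):

        # Item-based collaborative filtering over (user,item,preference)
        # triples in HDFS, run as a MapReduce pipeline.
        mahout recommenditembased \
          --input /data/prefs.csv \
          --output /data/recommendations \
          --similarityClassname SIMILARITY_COOCCURRENCE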
  34. Agenda: What is Apache Hadoop? • Log Processing • Catching `Osama’ • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  35. Final Thoughts: Use the right tool
      • Other use cases:
        • OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
        • Building search indexes (the canonical use case)
        • Facebook Messaging
        • Cheap and deep storage, e.g. archiving emails for SOX compliance
        • Audit logging
      • Non-use cases:
        • Data processing is handled by one beefy server
        • Data requires transactions
  36. About the Presenter
      • Brock Noland
      • brock@cloudera.com
      • http://twitter.com/brocknoland
      • TC-HUG http://tch.ug