Common and unique use cases for Apache Hadoop

Transcript

  • 1. Common and Unique Use Cases for Apache Hadoop (August 30, 2011)
  • 2. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts. Copyright 2011 Cloudera Inc. All rights reserved.
  • 3. Exploding Data Volumes • Online (complex, unstructured): web-ready devices, social media, digital content, smart grids • Enterprise (relational): transactions, R&D data, operational (control) data. The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year; 2,500 exabytes of new information in 2012, with the Internet as the primary driver. Source: an IDC White Paper sponsored by EMC, "As the Economy Contracts, the Digital Universe Expands," May 2009.
  • 4. Origin of Hadoop: how does an elephant sneak up on you? Timeline, 2002-2010: open-source web crawler project created by Doug Cutting; Google publishes the GFS and MapReduce papers; open-source MapReduce & HDFS project created by Doug Cutting; Hadoop wins the terabyte sort benchmark; runs a 4,000-node Hadoop cluster; Cloudera releases CDH3 and Cloudera Enterprise; launches SQL support for Hadoop.
  • 5. What is Apache Hadoop? An open-source storage and processing engine (MapReduce plus the Hadoop Distributed File System, HDFS) • Consolidates everything: move complex and relational data into a single repository • Stores inexpensively: keep raw data always available; use commodity hardware • Processes at the source: eliminate ETL bottlenecks; mine data first, govern later.
  • 6. What is Apache Hadoop? The standard way big data gets done • Hadoop is flexible: structured or unstructured; schema or no schema; high volume or merely terabytes; all kinds of analytic applications • Hadoop is open: 100% Apache-licensed open source • Hadoop is scalable: proven at petabyte scale • Benefits: controls costs by storing data more affordably per terabyte than any other platform; drives revenue by extracting value from data that was previously out of reach.
  • 7. What is Apache Hadoop? The importance of being open • No lock-in: investments in skills, services & hardware are preserved regardless of vendor choice • Community development: Hadoop & related projects are expanding at a rapid pace • Rich ecosystem: dozens of complementary software, hardware and services firms.
  • 8. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 9. Log Processing: A Perfect Fit • Common uses of logs • Find or count events: grep "ERROR" file; grep -c "ERROR" file • Calculate metrics (performance or user behavior analysis): awk '{sums[$1]+=$2; counts[$1]+=1} END {for (k in counts) print k, sums[k]/counts[k]}' • Investigate user sessions: grep "USER" files ... | sort | less
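The per-key average computed by the awk one-liner above can be sketched in plain Python; the sample log lines and field layout (key in column 1, numeric value in column 2) are illustrative assumptions:

```python
# Minimal Python equivalent of the awk metric calculation: average the
# second field, grouped by the first field.
from collections import defaultdict

def per_key_averages(lines):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        key, value = line.split()[:2]   # assumes "KEY VALUE ..." records
        sums[key] += float(value)
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in counts}

log = ["GET 120", "GET 80", "POST 30"]   # made-up sample records
averages = per_key_averages(log)          # {'GET': 100.0, 'POST': 30.0}
```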
  • 10. Log Processing: A Perfect Fit • Shoot... too much data • Homegrown parallel processing is often done on a per-file basis, because it's easy • No parallelism within a single large file (Task 0, Task 1 and Task 2 each process a whole access_log)
  • 11. Log Processing: A Perfect Fit • MapReduce to the rescue! • Processing is done per unit of data: one access_log is split into 0-64MB, 64-128MB, 128-192MB and 192-256MB chunks handled by Tasks 0-3 • Each task is responsible for a unit of data
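The splitting scheme above can be sketched as a small function; 64 MB matches the default HDFS block size of the era, and the 256 MB file size is the example from the slide:

```python
# Divide a file into fixed-size input splits, one per map task.
def input_splits(file_size, block_size=64 * 1024 * 1024):
    # Each split is a (start, end) byte range; the last split may be short.
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

# A 256 MB access_log yields four splits, matching Tasks 0-3 above.
splits = input_splits(256 * 1024 * 1024)
```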
  • 12. Log Processing: A Perfect Fit • Network or disk is the bottleneck when a single grep reads one access_log: bandwidth is limited • Reading 100GB of data takes about 14 minutes over a 1GbE network connection and about 22 minutes from a standard disk drive
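Those timings check out as back-of-envelope arithmetic, assuming 1 GbE delivers roughly 125 MB/s of payload and a single drive sustains about 75 MB/s (the figure the deck uses later):

```python
# Verify the 100 GB read-time estimates from the slide.
data_mb = 100 * 1000            # 100 GB in decimal MB

network_min = data_mb / 125 / 60   # ~13.3 minutes over 1 GbE (slide rounds to 14)
disk_min = data_mb / 75 / 60       # ~22.2 minutes from one 75 MB/s disk
```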
  • 13. Log Processing: A Perfect Fit • Hadoop to the rescue! • Eliminates the network bottleneck: data is on local disk • Data is read from many, many disks in parallel (physical machines NodeA, NodeX, NodeY and NodeZ run Tasks 0-3 over the 0-64MB, 64-128MB, 128-192MB and 192-256MB chunks)
  • 14. Log Processing: A Perfect Fit • Hadoop currently scales to 4,000 nodes; the goal for the next release is 10,000 nodes • Nodes typically have 12 hard drives • A single hard drive has a throughput of about 75MB/second • 12 hard drives * 75 MB/second * 4,000 nodes = 3.4 TB/second (that's bytes, not bits) • That's enough bandwidth to read 1PB (1,000 TB) in 5 minutes
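The throughput claim can be verified directly; the 3.4 figure falls out when MB/s is converted to binary terabytes, and the 1 PB read time lands just under the quoted 5 minutes:

```python
# Aggregate cluster read bandwidth from the slide's assumptions.
drives, mb_per_s, nodes = 12, 75, 4000
total_mb_s = drives * mb_per_s * nodes      # 3,600,000 MB/s
tib_per_s = total_mb_s / 1024 / 1024        # ~3.4 TB/s in binary units

petabyte_mb = 1000 * 1000 * 1000            # 1 PB (1,000 TB) in decimal MB
minutes = petabyte_mb / total_mb_s / 60     # ~4.6 minutes, i.e. "in 5 minutes"
```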
  • 15. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 16. Catching 'Osama': Embarrassingly Parallel • You have a few billion images of faces with geo-tags • Tremendous storage problem • Tremendous processing problem: bandwidth and coordination
  • 17. Catching 'Osama': Embarrassingly Parallel • Store the images in Hadoop • When processing, Hadoop will read the images from local disk: thousands of local disks spread throughout the cluster • Use a map-only job to compare input images against the 'needle' image
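A map-only job of this shape can be sketched as a Hadoop Streaming mapper. Everything here is illustrative: `matches_needle` is a stand-in for real image comparison, and the record format is assumed to be one record per line:

```python
# Sketch of a map-only Streaming mapper: read records, emit only matches.
import sys

def matches_needle(record, needle):
    # Placeholder comparison; a real job would decode the image bytes
    # and run a face-matching algorithm against the needle image.
    return needle in record

def map_records(lines, needle):
    # Map-only: filter records; no reduce phase is needed.
    return [line for line in lines if matches_needle(line, needle)]

def main():
    # As a Streaming mapper: records arrive on stdin, matches go to stdout.
    for line in map_records(sys.stdin, needle="needle-image-id"):
        sys.stdout.write(line)
```

Wired into Hadoop Streaming, the reduce phase would be disabled (e.g. with the `-numReduceTasks 0` option); the exact invocation depends on the Hadoop version.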
  • 18. Catching 'Osama': Embarrassingly Parallel • Store the images in SequenceFiles • Map tasks (Map Task 0, Map Task 1, ...) each have a copy of the 'needle' • Output faces 'matching' the needle
  • 19. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 20. Extract Transform Load (ETL): Everyone Is Doing It • One of the most common use cases I see is replacing ETL processes • Hadoop is a huge sink of cheap storage and processing • Aggregates are built in Hadoop and exported • Apache Hive provides SQL-like querying on raw data
  • 21. Extract Transform Load (ETL): Everyone Is Doing It • 'Real'-time system (website): online DB -> ETL -> analytical DB -> data warehouse -> business intelligence applications • Much blood is shed in the ETL step
  • 22. Extract Transform Load (ETL): Everyone Is Doing It • With Hadoop in the middle: the online DB imports into Hadoop, and Hadoop exports to the analytical DB feeding the data warehouse and business intelligence applications
  • 23. Extract Transform Load (ETL): Everyone Is Doing It • Apache Sqoop handles both sides: importing from the online DB into Hadoop and exporting from Hadoop to the analytical DB
  • 24. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 25. Analytics in HBase: Scaling Writes • Analytics is often simply counting things • Facebook chose HBase to store its massive counter infrastructure (more later) • How might one implement a counter infrastructure in HBase?
  • 26. Analytics in HBase: Scaling Writes • A 'Like' button IMG request sends an HTTP request to Facebook servers, which increments several counters • User & content type counters: brock@me.com/NEWS = 5431, brock@me.com/TECH = 79310, brock@me.com/SHOPPING = 59, tom@him.com/SPORTS = 94214 • Individual page counters: com.cloudera/blog/... = 154, com.cloudera/downloads/... = 923621, com.cloudera/resources/... = 2138
  • 27. Analytics in HBase: Scaling Writes • Individual page counters: the host is reversed in the URL as part of the row key (com.cloudera/blog/... = 154, com.cloudera/downloads/... = 923621, com.cloudera/resources/... = 2138) • Data is physically stored in sorted order • Scanning all 'com.cloudera' counters results in sequential I/O
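The reversed-host row-key scheme can be sketched in a few lines; the `row_key` helper and the sample URLs are illustrative, but the effect is the one the slide describes: all of a domain's pages become adjacent in HBase's sorted key order, so a prefix scan over 'com.cloudera' reads sequentially:

```python
# Build a reversed-host row key: "cloudera.com/blog/x" -> "com.cloudera/blog/x".
def row_key(url):
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

# Sorting the keys (as HBase does physically) groups each domain together.
keys = sorted(row_key(u) for u in [
    "cloudera.com/blog/x",
    "example.org/a",
    "cloudera.com/downloads/y",
])
```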
  • 28. Facebook Analytics: Scaling Writes • Real-time counters of URLs shared, links "liked", impressions generated • 20 billion events/day (200K events/sec) • ~30 second latency from click to count • Heavy use of the incrementColumnValue API for consistent counters • Tried MySQL and Cassandra, settled on HBase (http://tiny.cloudera.com/hbase-„-analytics)
  • 29. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 30. Machine Learning: Apache Mahout • Text clustering on Google News
  • 31. Machine Learning: Apache Mahout • Collaborative filtering on Amazon
  • 32. Machine Learning: Apache Mahout • Classification in GMail
  • 33. Machine Learning: Apache Mahout • Apache Mahout implements collaborative filtering, classification, clustering, and frequent itemset mining • More coming with the integration of MapReduce.Next
  • 34. Agenda • What is Apache Hadoop? • Log Processing • Catching 'Osama' • Extract Transform Load (ETL) • Analytics in HBase • Machine Learning • Final Thoughts
  • 35. Final Thoughts: Use the Right Tool • Other use cases: OpenTSDB, an open, distributed, scalable time series database (TSDB); building search indexes (the canonical use case); Facebook Messaging; cheap and deep storage, e.g. archiving emails for SOX compliance; audit logging • Non-use cases: data processing that can be handled by one beefy server; data that requires transactions
  • 36. About the Presenter • Brock Noland • brock@cloudera.com • http://twitter.com/brocknoland • TC-HUG: http://tch.ug