Hadoop Summit San Diego Feb2013
Published in: Technology

  • 1. Hadoop Use Cases At
      Narayan Bharadwaj, Director, Product Management, Monitoring & Big Data
      @nadubharadwaj
  • 2. Safe harbor
      Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non- products, and utilization and selling to larger enterprise customers.
      Further information on potential factors that could affect the financial results of, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available., inc. assumes no obligation and does not intend to update these forward-looking statements.
  • 3. Agenda
      • Technology
      • Big Data use cases
      • Use case discussion
      • Q&A
  • 4. Got “Cloud Data”? 130k customers; 1 billion transactions/day; millions of users; terabytes/day
  • 5. Technology
  • 6. Big Data Ecosystem: Phoenix, Oozie
  • 7. Phoenix: “We put the SQL back in NoSQL”
      • SQL layer on HBase
      • Seamless application integration
          – Standard JDBC interface
          – DDL statement support
      • Low query latency
          – SQL query → multiple HBase scans
          – Co-processors, custom filters
          – Milliseconds for small queries
          – Seconds for tens of millions of rows
      • https://
  • 8. Contributions: @pRaShAnT1784: Prashant Kommireddi; Lars Hofhansl; @thefutureian: Ian Varley
  • 9. Data Science tools ecosystem: Apache Pig
  • 10. Big Data Use Cases
      • User behavior analysis
      • Product Metrics
      • Capacity planning
      • Monitoring intelligence
      • Query Runtime Prediction
      • Collections
      • Early Warning System
      • Collaborative Filtering
      • Search Relevancy
      Legend: Internal App, Product feature
  • 11. Product Metrics
  • 12. Product Metrics – Problem Statement
      • Track feature usage/adoption across 130k+ customers
          – E.g., Accounts, Contacts, Visualforce, Apex, …
      • Track standard metrics across all features
          – E.g., #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …
      • Track features and metrics across all channels
          – API, UI, Mobile
      • Primary audience: Executives, Product Managers
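The standard metrics named on this slide amount to a group-by over request logs. The following is a minimal Python sketch of that aggregation, not the pipeline from the talk (which generates Pig scripts and runs them on Hadoop). The record fields `feature`, `channel`, `org_id`, `user_id`, and `response_ms` are hypothetical and do not reflect the actual log schema.

```python
from collections import defaultdict

# Hypothetical log records; every field name here is an assumption.
LOGS = [
    {"feature": "Accounts", "channel": "UI", "org_id": "org1", "user_id": "u1", "response_ms": 120},
    {"feature": "Accounts", "channel": "UI", "org_id": "org1", "user_id": "u2", "response_ms": 80},
    {"feature": "Accounts", "channel": "API", "org_id": "org2", "user_id": "u3", "response_ms": 40},
    {"feature": "Contacts", "channel": "UI", "org_id": "org1", "user_id": "u1", "response_ms": 200},
]


def feature_metrics(logs):
    """Group records by (feature, channel) and compute the four standard metrics."""
    groups = defaultdict(list)
    for rec in logs:
        groups[(rec["feature"], rec["channel"])].append(rec)

    metrics = {}
    for key, recs in groups.items():
        metrics[key] = {
            "requests": len(recs),
            "unique_orgs": len({r["org_id"] for r in recs}),
            "unique_users": len({r["user_id"] for r in recs}),
            "avg_response_ms": sum(r["response_ms"] for r in recs) / len(recs),
        }
    return metrics


METRICS = feature_metrics(LOGS)
```

Keying the groups on both feature and channel keeps the API, UI, and Mobile breakdown available without a second pass over the logs.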
  • 13. Product Metrics Pipeline (diagram). Components shown: User Input (Page Layout); Collaboration (Chatter); Reports, Dashboards; Workflow; Formula Fields; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; Client Machine; Java Program; Pig script generator; Workflow; Log Pull; Hadoop; Log Files
  • 14. Visualization (Reports & Dashboards). Note: Feature Names are not displayed
  • 15. Visualization (Reports & Dashboards)
  • 16. Collaborate, Iterate (Chatter)
  • 17. User Behavior Analysis
  • 18. Problem Statement
      • How do we reduce the number of clicks on the user interface?
      • What are the top user click path sequences?
      • What are the user clusters/personas?
      • Approach:
          – Markov transitions for click paths, D3.js visuals
          – K-means (unsupervised) clustering for user groups
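A first-order Markov transition table for click paths can be estimated by counting consecutive page pairs and normalizing each row. The sketch below illustrates that idea only; the session data and page names are hypothetical, and the slides do not describe how the actual analysis was implemented.

```python
from collections import Counter, defaultdict

# Hypothetical click paths: one ordered list of page names per session.
SESSIONS = [
    ["Home", "Setup", "Users"],
    ["Home", "Setup", "Profiles"],
    ["Home", "Setup", "Users"],
    ["Home", "Reports"],
]


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive clicks."""
    counts = defaultdict(Counter)
    for path in sessions:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


PROBS = transition_probabilities(SESSIONS)
```

Rows with a single dominant transition point to click sequences that could be shortened in the user interface.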
  • 19. Markov Transitions for "Setup" pages. Note: Based on an internal Salesforce org
  • 20. K-means clustering of "Setup" pages. Note: Based on an internal Salesforce org
  • 21. Collaborative Filtering
  • 22. Collaborative Filtering – Problem Statement
      • Show similar files within an organization
          – Content-based approach
          – Community-based approach
  • 23. Popular File
  • 24. Related File
  • 25. We found this relationship using item-to-item collaborative filtering
      • Amazon published this algorithm in 2003.
          – Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.
      • At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  • 26. Example: CF on 5 files: Vision Statement, Annual Report, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report
  • 27. View History Table
                        Annual Report   Vision Statement   Dilbert Cartoon   Darth Vader Cartoon   Disk Usage Report
      Miranda (CEO)     1               1                  1                 0                     0
      Bob (CFO)         1               1                  1                 0                     0
      Susan (Sales)     0               1                  1                 1                     0
      Chun (Sales)      0               0                  1                 1                     0
      Alice (IT)        0               0                  1                 1                     1
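  • 28. Relationships between the files (graph of the five files): Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report
  • 29. Relationships between the files (graph; each edge is labeled with the number of users who viewed both files)
      Annual Report / Vision Statement: 2
      Annual Report / Dilbert Cartoon: 2
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: 3
      Vision Statement / Darth Vader Cartoon: 1
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: 3
      Dilbert Cartoon / Disk Usage Report: 1
      Darth Vader Cartoon / Disk Usage Report: 1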
  • 30. Sorted relationships for each file
      Annual Report      Vision Statement   Dilbert Cartoon    Darth Vader Cartoon   Disk Usage Report
      Dilbert (2)        Dilbert (3)        Vision Stmt. (3)   Dilbert (3)           Dilbert (1)
      Vision Stmt. (2)   Annual Rpt. (2)    Darth Vader (3)    Vision Stmt. (1)      Darth Vader (1)
                         Darth Vader (1)    Annual Rpt. (2)    Disk Usage (1)
                                            Disk Usage (1)
      The popularity problem: notice that Dilbert appears first in every other file's list. This is probably not what we want.
      The solution: divide the relationship tallies by file popularities.
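The raw tallies in this table can be reproduced from the view history on slide 27 by counting, for each pair of files, the users who viewed both. This is a minimal in-memory Python sketch, not the Hadoop implementation; the short file names and the helper names `co_view_tallies` and `ranked_neighbors` are illustrative choices.

```python
from collections import Counter
from itertools import combinations

# View history from slide 27, one entry per user.
VIEWS = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}


def co_view_tallies(views):
    """Count, for every pair of files, how many users viewed both."""
    tallies = Counter()
    for files in views.values():
        # Sorting puts each pair in alphabetical order so it is counted once.
        for pair in combinations(sorted(set(files)), 2):
            tallies[pair] += 1
    return tallies


def ranked_neighbors(tallies, target):
    """Return (file, tally) pairs related to target, highest tally first."""
    neighbors = []
    for (a, b), n in tallies.items():
        if a == target:
            neighbors.append((b, n))
        elif b == target:
            neighbors.append((a, n))
    return sorted(neighbors, key=lambda item: (-item[1], item[0]))


TALLIES = co_view_tallies(VIEWS)
VISION_LIST = ranked_neighbors(TALLIES, "Vision Statement")
```

Because every user viewed Dilbert, its tally with any other file is at least as high as that of any competitor, which is the popularity problem the slide describes.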
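  • 31. Normalized relationships between the files (graph; each edge is labeled with the normalized score)
      Annual Report / Vision Statement: .82
      Annual Report / Dilbert Cartoon: .63
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: .77
      Vision Statement / Darth Vader Cartoon: .33
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: .77
      Dilbert Cartoon / Disk Usage Report: .45
      Darth Vader Cartoon / Disk Usage Report: .58
  • 32. Sorted relationships for each file, normalized by file popularities
      Annual Report        Vision Statement     Dilbert Cartoon      Darth Vader Cartoon   Disk Usage Report
      Vision Stmt. (.82)   Annual Report (.82)  Darth Vader (.77)    Dilbert (.77)         Darth Vader (.58)
      Dilbert (.63)        Dilbert (.77)        Vision Stmt. (.77)   Disk Usage (.58)      Dilbert (.45)
                           Darth Vader (.33)    Annual Report (.63)  Vision Stmt. (.33)
                                                Disk Usage (.45)
      High relationship tallies AND similar popularity values now drive closeness.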
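The normalized scores are consistent with dividing each tally by the square root of the product of the two file popularities, as in the cosine formula in the appendix. The sketch below applies that reading to the non-zero tallies from slide 29. The popularity values are the column sums of the view history table on slide 27, and the function name `similarity` is an illustrative choice.

```python
import math

# Popularity = number of users who viewed the file (column sums, slide 27).
POPULARITY = {
    "Annual Report": 2,
    "Vision Statement": 3,
    "Dilbert": 5,
    "Darth Vader": 3,
    "Disk Usage": 1,
}

# Non-zero relationship tallies from slide 29.
TALLIES = {
    ("Annual Report", "Dilbert"): 2,
    ("Annual Report", "Vision Statement"): 2,
    ("Dilbert", "Vision Statement"): 3,
    ("Darth Vader", "Dilbert"): 3,
    ("Darth Vader", "Vision Statement"): 1,
    ("Dilbert", "Disk Usage"): 1,
    ("Darth Vader", "Disk Usage"): 1,
}


def similarity(tally, pop_a, pop_b):
    """Tally divided by the geometric mean of the two popularities."""
    return tally / math.sqrt(pop_a * pop_b)


SCORES = {
    pair: round(similarity(n, POPULARITY[pair[0]], POPULARITY[pair[1]]), 2)
    for pair, n in TALLIES.items()
}
```

Dividing by the geometric mean penalizes pairs in which one file is far more popular than the other, so Dilbert no longer dominates every list.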
  • 33. The item-to-item CF algorithm
      1) Compute file popularities
      2) Compute relationship tallies and divide by file popularities
      3) Sort and store the results
  • 34. MapReduce Overview: Map → Shuffle → Reduce (adapted from http://-framework/wiki/MapReduce)
  • 35. 1. Compute File Popularities
      <user, file> → Inverse identity map → <file, List<user>> → Reduce → <file, (user count)>
      Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  • 36. Example: File popularity for Dilbert
      (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
      → Inverse identity map → <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
      → Reduce → (Dilbert, 5)
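The popularity job can be simulated locally to check the map and reduce logic before running it on a cluster. In the sketch below, `run_mapreduce` is a hypothetical in-memory stand-in for the map, shuffle, and reduce phases, not a Hadoop API. The mapper swaps each (user, file) record and the reducer counts distinct users per file.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle, and reduce in memory (no Hadoop involved)."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: group values by key
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def invert(record):
    """Inverse identity map: (user, file) becomes (file, user)."""
    user, file = record
    yield file, user


def count_users(file, users):
    """Reduce: the popularity of a file is its number of distinct viewers."""
    yield file, len(set(users))


VIEWS = [
    ("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
    ("Chun", "Dilbert"), ("Alice", "Dilbert"),
    ("Miranda", "Annual Report"), ("Bob", "Annual Report"),
]
POPULARITY = dict(run_mapreduce(VIEWS, invert, count_users))
```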
  • 37. 2a. Compute relationship tallies - find all relationships in view history table
      <user, file> → Identity map → <user, List<file>> → Reduce → <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, …, <(file(n-1), file(n)), Integer(1)>
      Relationships have their file IDs in alphabetical order to avoid double counting.
  • 38. Example 2a: Miranda’s (CEO) file relationship votes
      (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
      → Identity map → <Miranda, {Annual Report, Vision Statement, Dilbert}>
      → Reduce → <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
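The reduce step of 2a emits one vote per unordered pair of files viewed by the same user. A minimal Python sketch of such a reducer follows; the function name `relationship_votes` is hypothetical, and sorting the file names is one way to get the alphabetical ordering the slide calls for.

```python
from itertools import combinations


def relationship_votes(user, files):
    """Emit ((file1, file2), 1) for each pair of files the user viewed.

    Sorting first keeps every pair in alphabetical order, so (A, B) and
    (B, A) can never both be emitted for the same relationship.
    """
    for pair in combinations(sorted(set(files)), 2):
        yield pair, 1


VOTES = list(relationship_votes(
    "Miranda", ["Annual Report", "Vision Statement", "Dilbert"]))
```

A user who viewed n files produces n(n-1)/2 votes, which is where the quadratic worst case mentioned in the appendix comes from.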
  • 39. 2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word
      <(file1, file2), Integer(1)> → Identity map → <(file1, file2), List<Integer(1)>> → Reduce: count and divide by popularities → <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
      Note that we emit each result twice, once for each file that belongs to a relationship.
  • 40. Example 2b: the Dilbert/Darth Vader relationship
      <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
      → Identity map → <(Dilbert, Vader), {1, 1, 1}>
      → Reduce: count and divide by popularities → <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
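The 2b reducer sums the votes for one relationship, normalizes the tally, and emits the score under both files. The sketch below assumes the normalization is the cosine form, tally / sqrt(popularity1 * popularity2), which reproduces the sqrt(3/5) in the example. The `popularity` dictionary stands in for the lookup table that the slides keep in the Hadoop distributed cache, and the function name is hypothetical.

```python
import math

# Stand-in for the (file, popularity) table from step 1.
POPULARITY = {"Dilbert": 5, "Vader": 3}


def score_relationship(pair, votes, popularity):
    """Count the votes for a pair, normalize, and emit one record per file."""
    file1, file2 = pair
    tally = sum(votes)
    score = tally / math.sqrt(popularity[file1] * popularity[file2])
    yield file1, (file2, score)
    yield file2, (file1, score)


RESULT = list(score_relationship(("Dilbert", "Vader"), [1, 1, 1], POPULARITY))
```

Emitting the score under both keys lets the next job group by a single file and see all of its neighbors.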
  • 41. 3. Sort and store results
      <file1, (file2, similarity score)> → Identity map → <file1, List<(file2, similarity score)>> → Reduce → <file1, {top n similar files}>
      Store the results in your location of choice
  • 42. Example 3: Sorting the results for Dilbert
      <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
      → Identity map → <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
      → Reduce → <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)
      → Store results
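The final reducer only needs the n best-scoring neighbors of each file, so a bounded selection suffices instead of a full sort. A minimal sketch using the standard-library `heapq` module follows. The function name `top_n` and the alphabetical tie-break are assumptions; the slides do not say how ties such as the two .77 scores are ordered.

```python
import heapq


def top_n(scored_files, n):
    """Return the names of the n highest-scoring files.

    Ties are broken alphabetically so the output is deterministic.
    """
    best = heapq.nsmallest(n, scored_files, key=lambda item: (-item[1], item[0]))
    return [name for name, _ in best]


DILBERT = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]
TOP2 = top_n(DILBERT, 2)
```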
  • 43. Appendix
      • Cosine formula and normalization trick to avoid the distributed cache:
        $\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$
      • Mahout has CF
      • Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.
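As a worked instance of the cosine formula, assume A and B are the binary view-history columns of two files, with one entry per user. This interpretation is not stated on the slide but reproduces the scores in the examples. Then A · B is the number of users who viewed both files and ‖A‖ is the square root of the popularity of A. For Dilbert (5 viewers) and Darth Vader (3 viewers, all of whom also viewed Dilbert):

```latex
\[
A \cdot B = \sum_{u} A_u B_u = 3, \qquad
\|A\| = \sqrt{\textstyle\sum_u A_u^2} = \sqrt{5}, \qquad
\|B\| = \sqrt{3}
\]
\[
\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|}
               = \frac{3}{\sqrt{5}\,\sqrt{3}}
               = \sqrt{\tfrac{3}{5}} \approx 0.77
\]
```

The second form of the formula suggests one reading of the normalization trick: if each view record is pre-scaled by 1/‖A‖ for its file, summing the products of the scaled entries yields the cosine directly, so the reducer needs no popularity lookup.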
  • 44. Narayan Bharadwaj, Monitoring, Big Data @salesforce, @nadubharadwaj