Hadoop Summit San Diego Feb2013
  • 1. Hadoop Use Cases at Salesforce.com. Narayan Bharadwaj, Director, Product Management, Monitoring & Big Data, Salesforce.com. @nadubharadwaj
  • 3. Agenda • Technology • Big Data use cases • Use case discussion • Q&A
  • 4. Got “Cloud Data”? 130k customers, 1 billion transactions/day, millions of users, terabytes/day
  • 5. Technology  
  • 6. Big Data Ecosystem: Phoenix, Oozie
  • 7. Phoenix: “We put the SQL back in NoSQL” • SQL layer on HBase • Seamless application integration – Standard JDBC interface – DDL statement support • Low query latency – SQL query → multiple HBase scans – Co-processors, custom filters – Milliseconds for small queries – Seconds for tens of millions of rows • https://github.com/forcedotcom/phoenix
  • 8. Contributions: @pRaShAnT1784: Prashant Kommireddi; Lars Hofhansl; @thefutureian: Ian Varley
  • 9. Data Science tools ecosystem: Apache Pig
  • 10. Big Data Use Cases: Product Metrics, User behavior analysis, Capacity planning, Monitoring intelligence, Query Runtime Prediction, Collections, Early Warning System, Collaborative Filtering, Search Relevancy. Legend: Internal App, Product feature
  • 11. Product Metrics
  • 12. Product Metrics – Problem Statement • Track feature usage/adoption across 130k+ customers – E.g.: Accounts, Contacts, Visualforce, Apex, … • Track standard metrics across all features – E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, … • Track features and metrics across all channels – API, UI, Mobile • Primary audience: Executives, Product Managers
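The standard metrics named on this slide can be illustrated with a small aggregation. This is a minimal plain-Python sketch, not the Pig script the pipeline generates; the record layout `(feature, org_id, user_id, response_ms)` and the sample values are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical parsed log records: (feature, org_id, user_id, response_ms).
log_records = [
    ("Accounts", "org1", "u1", 120),
    ("Accounts", "org1", "u2", 80),
    ("Accounts", "org2", "u3", 100),
    ("Contacts", "org1", "u1", 50),
]

def feature_metrics(records):
    """Group records by feature and compute the four metrics from the slide."""
    grouped = defaultdict(list)
    for feature, org_id, user_id, response_ms in records:
        grouped[feature].append((org_id, user_id, response_ms))
    metrics = {}
    for feature, rows in grouped.items():
        metrics[feature] = {
            "requests": len(rows),
            "unique_orgs": len({org for org, _, _ in rows}),
            "unique_users": len({user for _, user, _ in rows}),
            "avg_response_ms": sum(ms for _, _, ms in rows) / len(rows),
        }
    return metrics

metrics = feature_metrics(log_records)
```

In a real deployment the same grouping would run per channel (API, UI, Mobile) as well as per feature; that dimension is left out here to keep the sketch short.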
  • 13. Product Metrics Pipeline [diagram; labels: User Input (Page Layout), Collaboration (Chatter), Reports, Dashboards, Workflow, Formula Fields, Feature Metrics (Custom Object), Trend Metrics (Custom Object), API, API, Client Machine, Java Program, Pig script generator, Workflow, Log Pull, Hadoop, Log Files]
  • 14. Visualization (Reports & Dashboards) Note: Feature names are not displayed
  • 15. Visualization (Reports & Dashboards)
  • 16. Collaborate, Iterate (Chatter)
  • 17. User Behavior Analysis
  • 18. Problem Statement § How do we reduce the number of clicks on the user interface? § What are the top user click path sequences? § What are the user clusters/personas? • Approach: • Markov transitions for click paths, D3.js visuals • K-means (unsupervised) clustering for user groups
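To make the Markov-transition idea concrete, here is a minimal sketch that estimates first-order transition probabilities from click paths. The page names and sessions are hypothetical, and the function `transition_probabilities` is an illustrative name, not something taken from the talk.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from click paths."""
    counts = defaultdict(Counter)
    for path in sessions:
        # Count each consecutive (current page -> next page) pair.
        for current_page, next_page in zip(path, path[1:]):
            counts[current_page][next_page] += 1
    probabilities = {}
    for page, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[page] = {nxt: n / total for nxt, n in next_counts.items()}
    return probabilities

# Hypothetical click paths; the page names are illustrative only.
sessions = [
    ["Home", "Setup", "Users"],
    ["Home", "Setup", "Profiles"],
    ["Home", "Setup", "Users", "Profiles"],
]
probs = transition_probabilities(sessions)
```

The highest-probability transitions out of each page are the candidates for shortcuts that would reduce clicks.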
  • 19. Markov Transitions for "Setup" pages. Note: Based on an internal Salesforce org
  • 20. K-means clustering of "Setup" pages. Note: Based on an internal Salesforce org
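K-means clustering of users can be sketched in a few lines. The example below assumes each user is represented by a vector of page-visit counts; that feature choice, the sample points, and the fixed starting centroids are all assumptions for illustration, not details from the talk.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical users: (visits to user-management pages, visits to data pages).
users = [(9, 1), (8, 2), (1, 9), (2, 8)]
centroids, assignments = kmeans(users, initial_centroids=[(9, 1), (1, 9)])
```

Each resulting cluster is a candidate persona; in practice the number of clusters and the starting centroids would need tuning.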
  • 21. Collaborative Filtering
  • 22. Collaborative Filtering – Problem Statement • Show similar files within an organization – Content-based approach – Community-based approach
  • 23. Popular File
  • 24. Related File
  • 25. We found this relationship using item-to-item collaborative filtering • Amazon published this algorithm in 2003. – Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003. • At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  • 26. Example: CF on 5 files: Vision Statement, Annual Report, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report
  • 27. View History Table
        User            Annual Report   Vision Statement   Dilbert Cartoon   Darth Vader Cartoon   Disk Usage Report
        Miranda (CEO)   1               1                  1                 0                     0
        Bob (CFO)       1               1                  1                 0                     0
        Susan (Sales)   0               1                  1                 1                     0
        Chun (Sales)    0               0                  1                 1                     0
        Alice (IT)      0               0                  1                 1                     1
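  • 28. Relationships between the files [graph; nodes: Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report]
  • 29. Relationships between the files [graph; each edge is labeled with the number of users who viewed both files]
        Annual Report / Vision Statement: 2
        Annual Report / Dilbert Cartoon: 2
        Annual Report / Darth Vader Cartoon: 0
        Annual Report / Disk Usage Report: 0
        Vision Statement / Dilbert Cartoon: 3
        Vision Statement / Darth Vader Cartoon: 1
        Vision Statement / Disk Usage Report: 0
        Dilbert Cartoon / Darth Vader Cartoon: 3
        Dilbert Cartoon / Disk Usage Report: 1
        Darth Vader Cartoon / Disk Usage Report: 1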
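The edge labels can be reproduced from the view history slide by counting, for each pair of files, how many users viewed both. A minimal in-memory sketch (file names shortened; the variable names are illustrative):

```python
from collections import Counter
from itertools import combinations

# View history from the View History Table slide: user -> set of files viewed.
views = {
    "Miranda": {"Annual Report", "Vision Statement", "Dilbert"},
    "Bob": {"Annual Report", "Vision Statement", "Dilbert"},
    "Susan": {"Vision Statement", "Dilbert", "Darth Vader"},
    "Chun": {"Dilbert", "Darth Vader"},
    "Alice": {"Dilbert", "Darth Vader", "Disk Usage"},
}

tallies = Counter()
for files in views.values():
    # Sort so each unordered pair is always counted under the same key.
    for pair in combinations(sorted(files), 2):
        tallies[pair] += 1
```

Pairs that no user viewed together never appear as keys; a `Counter` returns 0 for them, matching the zero-weight edges in the graph.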
  • 30. Sorted relationships for each file
        Annual Report: Dilbert (2), Vision Stmt. (2)
        Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
        Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
        Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
        Disk Usage Report: Dilbert (1), Darth Vader (1)
        The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want.
        The solution: divide the relationship tallies by file popularities.
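  • 31. Normalized relationships between the files [graph; each edge is labeled with the normalized score]
        Annual Report / Vision Statement: .82
        Annual Report / Dilbert Cartoon: .63
        Annual Report / Darth Vader Cartoon: 0
        Annual Report / Disk Usage Report: 0
        Vision Statement / Dilbert Cartoon: .77
        Vision Statement / Darth Vader Cartoon: .33
        Vision Statement / Disk Usage Report: 0
        Dilbert Cartoon / Darth Vader Cartoon: .77
        Dilbert Cartoon / Disk Usage Report: .45
        Darth Vader Cartoon / Disk Usage Report: .58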
  • 32. Sorted relationships for each file, normalized by file popularities
        Annual Report: Vision Stmt. (.82), Dilbert (.63)
        Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
        Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
        Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
        Disk Usage Report: Darth Vader (.58), Dilbert (.45)
        High relationship tallies AND similar popularity values now drive closeness.
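The normalized scores on these slides are consistent with cosine similarity: each tally divided by the square root of the product of the two files' popularities. The sketch below works under that reading, which is inferred from the numbers rather than stated on the slide, and recomputes the scores from the view history.

```python
import math
from collections import Counter
from itertools import combinations

# View history from the View History Table slide: user -> set of files viewed.
views = {
    "Miranda": {"Annual Report", "Vision Statement", "Dilbert"},
    "Bob": {"Annual Report", "Vision Statement", "Dilbert"},
    "Susan": {"Vision Statement", "Dilbert", "Darth Vader"},
    "Chun": {"Dilbert", "Darth Vader"},
    "Alice": {"Dilbert", "Darth Vader", "Disk Usage"},
}

# Popularity: number of users who viewed each file.
popularity = Counter(f for files in views.values() for f in files)

# Tally: number of users who viewed both files of a pair.
tallies = Counter(
    pair for files in views.values() for pair in combinations(sorted(files), 2)
)

# Assumed normalization: tally / sqrt(popularity_a * popularity_b).
similarity = {
    (a, b): count / math.sqrt(popularity[a] * popularity[b])
    for (a, b), count in tallies.items()
}
```

Dilbert is viewed by all five users, so its large popularity now pulls its scores down relative to pairs such as Annual Report and Vision Statement.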
  • 33. The item-to-item CF algorithm 1) Compute file popularities 2) Compute relationship tallies and divide by file popularities 3) Sort and store the results
  • 34. MapReduce Overview: Map, Shuffle, Reduce (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
  • 35. 1. Compute File Popularities: <user, file> → Inverse identity map → <file, List<user>> → Reduce → <file, (user count)>. Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  • 36. Example: File popularity for Dilbert: (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert) → Inverse identity map → <Dilbert, {Miranda, Bob, Susan, Chun, Alice}> → Reduce → (Dilbert, 5)
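The popularity step can be imitated in plain Python to make the data flow concrete. This is a single-process sketch, not Hadoop code; the function names `map_invert`, `shuffle`, and `reduce_count` are illustrative.

```python
from collections import defaultdict

# (user, file) view records for the Dilbert example.
records = [
    ("Miranda", "Dilbert"),
    ("Bob", "Dilbert"),
    ("Susan", "Dilbert"),
    ("Chun", "Dilbert"),
    ("Alice", "Dilbert"),
]

def map_invert(user, file):
    # "Inverse identity map": swap key and value so the file becomes the key.
    yield file, user

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_count(file, users):
    # Assumes each (user, file) record appears once, so len() is the user count.
    return file, len(users)

mapped = [kv for user, file in records for kv in map_invert(user, file)]
popularity = dict(reduce_count(f, users) for f, users in shuffle(mapped).items())
```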
  • 37. 2a. Compute relationship tallies - find all relationships in view history table: <user, file> → Identity map → <user, List<file>> → Reduce → <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, …, <(file(n-1), file(n)), Integer(1)>. Relationships have their file IDs in alphabetical order to avoid double counting.
  • 38. Example 2a: Miranda’s (CEO) file relationship votes: (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert) → Identity map → <Miranda, {Annual Report, Vision Statement, Dilbert}> → Reduce → <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
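Step 2a for Miranda can be sketched the same way. Again this is a single-process imitation with illustrative function names, not the Hadoop job itself.

```python
from collections import defaultdict
from itertools import combinations

# Miranda's (user, file) view records.
records = [
    ("Miranda", "Annual Report"),
    ("Miranda", "Vision Statement"),
    ("Miranda", "Dilbert"),
]

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_emit_pairs(user, files):
    # Sorting puts each pair in alphabetical order, so (A, B) and (B, A)
    # collapse to a single key and are not double counted.
    for file_a, file_b in combinations(sorted(set(files)), 2):
        yield (file_a, file_b), 1

votes = [
    kv
    for user, files in shuffle(records).items()
    for kv in reduce_emit_pairs(user, files)
]
```

A user who viewed n files emits n(n-1)/2 votes, which is where the quadratic worst case of the algorithm comes from.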
  • 39. 2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word: <(file1, file2), Integer(1)> → Identity map → <(file1, file2), List<Integer(1)>> → Reduce: count and divide by popularities → <file1, (file2, similarity score)>, <file2, (file1, similarity score)>. Note that we emit each result twice, once for each file that belongs to a relationship.
  • 40. Example 2b: the Dilbert/Darth Vader relationship: <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)> → Identity map → <(Dilbert, Vader), {1, 1, 1}> → Reduce: count and divide by popularities → <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
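The score sqrt(3/5) equals 3 / sqrt(5 × 3): the tally of 3 divided by the square root of the product of the two popularities (Dilbert 5, Darth Vader 3, from the view history table). The reducer sketch below assumes that reading of "divide by popularities"; the function name and the in-memory `popularity` dictionary, which stands in for the distributed cache, are illustrative.

```python
import math

# Popularities from step 1; a plain dict stands in for the distributed cache.
popularity = {"Dilbert": 5, "Vader": 3}

# Output of the shuffle: one list of 1-votes per relationship.
grouped_votes = {("Dilbert", "Vader"): [1, 1, 1]}

def reduce_similarity(pair, ones, popularity):
    file_a, file_b = pair
    # Assumed normalization: tally / sqrt(popularity_a * popularity_b).
    score = sum(ones) / math.sqrt(popularity[file_a] * popularity[file_b])
    # Emit the score once per file so each file can collect its own neighbors.
    yield file_a, (file_b, score)
    yield file_b, (file_a, score)

results = [
    kv
    for pair, ones in grouped_votes.items()
    for kv in reduce_similarity(pair, ones, popularity)
]
```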
  • 41. 3. Sort and store results: <file1, (file2, similarity score)> → Identity map → <file1, List<(file2, similarity score)>> → Reduce → <file1, {top n similar files}>. Store the results in your location of choice.
  • 42. Example 3: Sorting the results for Dilbert: <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)> → Identity map → <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}> → Reduce → <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files) → Store results
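The final reducer is a top-n selection per file. A minimal sketch for the Dilbert example follows; breaking ties alphabetically is an assumption made here so that the output is deterministic, and the slide does not say how ties are resolved.

```python
# Neighbors collected for Dilbert after the shuffle: (file, similarity score).
neighbors = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]

def reduce_top_n(file, neighbors, n):
    # Highest score first; equal scores fall back to alphabetical order.
    ranked = sorted(neighbors, key=lambda item: (-item[1], item[0]))
    return file, [name for name, _ in ranked[:n]]

result = reduce_top_n("Dilbert", neighbors, n=2)
```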
  • 43. Appendix • Cosine formula and normalization trick to avoid the distributed cache: $\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$ • Mahout has CF • Asymptotic order of the algorithm is $O(M \cdot N^2)$ in the worst case, but is helped by sparsity.
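One possible reading of the normalization trick, offered as an assumption since the slide gives only the formula: the second form of the cosine says each file vector can be scaled to unit length first, after which a plain dot product yields the similarity. In the sketch below, the popularity pass attaches the weight 1/sqrt(popularity) to every view, so the later pair-tally pass needs no popularity lookup table. All names are illustrative.

```python
import math
from collections import defaultdict
from itertools import combinations

# (user, file) view records from the View History Table slide.
views = [
    ("Miranda", "Annual Report"), ("Miranda", "Vision Statement"), ("Miranda", "Dilbert"),
    ("Bob", "Annual Report"), ("Bob", "Vision Statement"), ("Bob", "Dilbert"),
    ("Susan", "Vision Statement"), ("Susan", "Dilbert"), ("Susan", "Darth Vader"),
    ("Chun", "Dilbert"), ("Chun", "Darth Vader"),
    ("Alice", "Dilbert"), ("Alice", "Darth Vader"), ("Alice", "Disk Usage"),
]

# Pass 1: group by file. The reducer sees all of a file's users at once, so it
# can attach the unit-length weight 1/sqrt(popularity) to every view it emits.
by_file = defaultdict(list)
for user, file in views:
    by_file[file].append(user)
weighted = [
    (user, (file, 1 / math.sqrt(len(users))))
    for file, users in by_file.items()
    for user in users
]

# Pass 2: group by user and sum the product of weights for every file pair.
by_user = defaultdict(list)
for user, entry in weighted:
    by_user[user].append(entry)
cosine = defaultdict(float)
for entries in by_user.values():
    for (file_a, w_a), (file_b, w_b) in combinations(sorted(entries), 2):
        cosine[(file_a, file_b)] += w_a * w_b
```

The summed products equal the tally divided by the square root of the product of the popularities, so the scores match those on the normalized-relationship slides.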
  • 44. Narayan Bharadwaj, Monitoring, Big Data @salesforce, @nadubharadwaj