Dreamforce_2012_Hadoop_Use_Cases
Published in: Technology

  1. How Salesforce.com Uses Hadoop: Some Data Science Use Cases. Narayan Bharadwaj (salesforce.com, @nadubharadwaj) and Jed Crosby (salesforce.com, @JedCrosby)
  2. Safe Harbor. Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. This document and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
  3. Agenda
     • Technology
     • Hadoop use cases
     • Use case discussion
        • Product Metrics
        • User Behavior Analysis
        • Collaborative Filtering
     • Q&A
     Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
  4. Got “Cloud Data”? 130k customers; 800 million transactions/day; millions of users; terabytes/day
  5. Technology
  6. Hadoop Overview
     - Started by Doug Cutting at Yahoo!
     - Based on two Google papers
        • Google File System (GFS): http://research.google.com/archive/gfs.html
        • Google MapReduce: http://research.google.com/archive/mapreduce.html
     - Hadoop is an open source Apache project
        • Hadoop Distributed File System (HDFS)
        • Distributed Processing Framework (MapReduce)
     - Several related projects
        • HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
  7. Our Hadoop Ecosystem: Apache Pig
  8. Contributions: Prashant Kommireddi (@pRaShAnT1784), Lars Hofhansl, Ian Varley (@thefutureian)
  9. Use Cases
  10. Hadoop Use Cases
      • Product Metrics
      • User behavior analysis
      • Capacity planning
      • Monitoring intelligence
      • Query Runtime Prediction
      • Collections
      • Early Warning System
      • Collaborative Filtering
      • Search Relevancy
      Legend: Internal App, Product feature
  11. Product Metrics
  12. Product Metrics – Problem Statement
      • Track feature usage/adoption across 130k+ customers (e.g. Accounts, Contacts, Visualforce, Apex, …)
      • Track standard metrics across all features (e.g. #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …)
      • Track features and metrics across all channels (API, UI, Mobile)
      • Primary audience: Executives, Product Managers
  13. Data Pipeline (diagram labels): Feature (What?); Fancy UI (Visualize); Feature Metadata (Instrumentation); Daily Summary (Output); Crunch it (How?); Storage & Processing
  14. Product Metrics Pipeline (diagram labels): User Input (Page Layout); Reports, Dashboards; Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; Client Machine; Java Program; Pig script generator; Workflow; Log Pull; Hadoop; Log Files
  15. Feature Metrics (Custom Object)
      Id     Feature Name    PM     Instrumentation  Metric1    Metric2    Metric3     Metric4  Status
      F0001  Accounts        John   /001             #requests  #UniqOrgs  #UniqUsers  AvgRT    Dev
      F0002  Contacts        Nancy  /003             #requests  #UniqOrgs  #UniqUsers  AvgRT    Review
      F0003  API             Eric   A                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0004  Visualforce     Roger  V                #requests  #UniqOrgs  #UniqUsers  AvgRT    Decom
      F0005  Apex            Kim    axapx            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0006  Custom Objects  Chun   /aXX             #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0008  Chatter         Jed    chcmd            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0009  Reports         Steve  R                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
  16. Feature Metrics (Custom Object)
  17. User Input (Page Layout): Formula Field, Workflow Rule
  18. User Input (Child Custom Object): Child Objects
  19. Apache Pig
  20. Basic Pig Script Construct
      -- Define UDFs
      DEFINE GFV GetFieldValue('/path/to/udf/file');
      -- Load data
      A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
      -- Filter data: keep only records whose logRecordType is 'U'
      B = FILTER A BY GFV(*, 'logRecordType') == 'U';
      -- Extract fields (aliases added so later statements can refer to them)
      C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……..
      -- Group
      G = GROUP C BY ……
      -- Compute output metrics (a nested FOREACH must end with GENERATE)
      O = FOREACH G { orgs = C.orgId; uniqueOrgs = DISTINCT orgs; GENERATE group, COUNT(uniqueOrgs); }
      -- Store or dump results
      STORE O INTO '/path/to/user/output';
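To make the filter, group, and distinct-count flow of the script concrete, here is a minimal in-memory sketch in Python. It is not the production pipeline: the record layout (plain dicts) and the `feature` grouping field are assumptions made for illustration, since the slide elides the actual grouping key.

```python
from collections import defaultdict


def summarize(records, group_key):
    """Filter to 'U' records, group them, and compute per-group metrics."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get("logRecordType") == "U":      # FILTER step
            groups[rec[group_key]].append(rec)   # GROUP step

    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "requests": len(rows),
            "unique_orgs": len({r["orgId"] for r in rows}),    # DISTINCT orgs
            "unique_users": len({r["userId"] for r in rows}),  # DISTINCT users
        }
    return summary


# Hypothetical log records; the field values are invented for the example.
records = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3"},
    {"logRecordType": "A", "feature": "Accounts", "orgId": "org3", "userId": "u4"},
    {"logRecordType": "U", "feature": "Contacts", "orgId": "org1", "userId": "u1"},
]
result = summarize(records, "feature")
```

The per-group dictionary mirrors the #Requests, #UniqueOrgs, and #UniqueUsers columns that the deck tracks for each feature.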
  21. Java Pig Script Generator (Client)
  22. Trend Metrics (Custom Object)
      Id     Date        #Requests  #Unique Orgs  #Unique Users  Avg ResponseTime
      F0001  06/01/2012  <big>      <big>         <big>          <little>
      F0002  06/01/2012  <big>      <big>         <big>          <little>
      F0003  06/01/2012  <big>      <big>         <big>          <little>
      F0001  06/02/2012  <big>      <big>         <big>          <little>
      F0002  06/02/2012  <big>      <big>         <big>          <little>
      F0003  06/03/2012  <big>      <big>         <big>          <little>
  23. Upload to Trend Metrics (Custom Object)
  24. Visualization (Reports & Dashboards)
  25. Visualization (Reports & Dashboards)
  26. Collaborate, Iterate (Chatter)
  27. Recap (diagram labels): User Input (Page Layout); Reports, Dashboards; Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; Client Machine; Java Program; Pig script generator; Workflow; Log Pull; Hadoop; Log Files
  28. User Behavior Analysis
  29. Problem Statement
      • How do we reduce the number of clicks on the user interface?
      • We need to understand the top user click paths. What are users typically trying to do?
      • What are the user clusters/personas?
      Approach:
      • Markov transitions for click paths, D3.js visuals
      • K-means (unsupervised) clustering for user groups
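As an illustration of the Markov-transition idea, the sketch below estimates first-order transition probabilities between pages from a list of click paths. The page names and paths are invented for the example, and this is a single-machine sketch rather than the Hadoop job used in the talk.

```python
from collections import Counter, defaultdict


def transition_probabilities(click_paths):
    """Estimate P(next page | current page) from observed click paths."""
    counts = defaultdict(Counter)
    for path in click_paths:
        # Each consecutive pair of pages is one observed transition.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Hypothetical click paths through "Setup" pages.
paths = [
    ["Setup", "Users", "Profiles"],
    ["Setup", "Users", "Roles"],
    ["Setup", "Objects"],
]
P = transition_probabilities(paths)
```

High-probability transitions out of a page point to the click paths worth shortening.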
  30. Markov Transitions for "Setup" Pages
  31. K-means clustering of "Setup" Pages
  32. Collaborative Filtering (Jed Crosby)
  33. Collaborative Filtering – Problem Statement
      Show similar files within an organization
      • Content-based approach
      • Community-based approach
  34. Popular File
  35. Related File
  36. We found this relationship using item-to-item collaborative filtering. Amazon published this algorithm in 2003: "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," by Gregory Linden, Brent Smith, and Jeremy York, IEEE Internet Computing, January-February 2003. At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  37. Example: CF on 5 files: Annual Report, Vision Statement, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report
  38. View History Table
                       Annual  Vision     Dilbert  Darth Vader  Disk Usage
                       Report  Statement  Cartoon  Cartoon      Report
      Miranda (CEO)    1       1          1        0            0
      Bob (CFO)        1       1          1        0            0
      Susan (Sales)    0       1          1        1            0
      Chun (Sales)     0       0          1        1            0
      Alice (IT)       0       0          1        1            1
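  39. Relationships Between the Files (graph with one node per file): Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report
  40. Relationships Between the Files (edges labeled with the number of users who viewed both files)
      Annual Report / Vision Statement: 2
      Annual Report / Dilbert Cartoon: 2
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: 3
      Vision Statement / Darth Vader Cartoon: 1
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: 3
      Dilbert Cartoon / Disk Usage Report: 1
      Darth Vader Cartoon / Disk Usage Report: 1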
  41. Sorted Relationships for Each File
      Annual Report     Vision Statement  Dilbert Cartoon   Darth Vader Cartoon  Disk Usage Report
      Dilbert (2)       Dilbert (3)       Vision Stmt. (3)  Dilbert (3)          Dilbert (1)
      Vision Stmt. (2)  Annual Rpt. (2)   Darth Vader (3)   Vision Stmt. (1)     Darth Vader (1)
                        Darth Vader (1)   Annual Rpt. (2)   Disk Usage (1)
                                          Disk Usage (1)
      The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want.
      The solution: divide the relationship tallies by file popularities.
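  42. Normalized Relationships Between the Files (edges labeled with the normalized score)
      Annual Report / Vision Statement: .82
      Annual Report / Dilbert Cartoon: .63
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: .77
      Vision Statement / Darth Vader Cartoon: .33
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: .77
      Dilbert Cartoon / Disk Usage Report: .45
      Darth Vader Cartoon / Disk Usage Report: .58
  43. Sorted relationships for each file, normalized by file popularities
      Annual Report       Vision Statement     Dilbert Cartoon      Darth Vader Cartoon  Disk Usage Report
      Vision Stmt. (.82)  Annual Report (.82)  Darth Vader (.77)    Dilbert (.77)        Darth Vader (.58)
      Dilbert (.63)       Dilbert (.77)        Vision Stmt. (.77)   Disk Usage (.58)     Dilbert (.45)
                          Darth Vader (.33)    Annual Report (.63)  Vision Stmt. (.33)
                                               Disk Usage (.45)
      High relationship tallies AND similar popularity values now drive closeness.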
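The normalized scores in the example can be reproduced with a few lines of Python. This is a single-machine sketch, assuming that "divide by file popularities" means dividing each co-view tally by the square root of the product of the two popularities (the cosine form given in the appendix). The shortened file names are labels chosen for the example.

```python
from itertools import combinations
from math import sqrt

# View history from the example table: user -> files viewed.
VIEWS = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}


def similarities(views):
    """Return {(file_a, file_b): co-views / sqrt(pop_a * pop_b)}."""
    popularity = {}
    tallies = {}
    for files in views.values():
        unique = sorted(set(files))
        for f in unique:
            popularity[f] = popularity.get(f, 0) + 1
        # Sorted order gives each unordered pair one canonical key.
        for pair in combinations(unique, 2):
            tallies[pair] = tallies.get(pair, 0) + 1
    return {
        (a, b): n / sqrt(popularity[a] * popularity[b])
        for (a, b), n in tallies.items()
    }


sims = similarities(VIEWS)
```

For instance, Annual Report and Vision Statement share 2 viewers and have popularities 2 and 3, giving 2 / sqrt(6), which rounds to the .82 shown in the table.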
  44. The Item-to-Item CF Algorithm
      1) Compute file popularities
      2) Compute relationship tallies and divide by file popularities
      3) Sort and store the results
  45. MapReduce Overview: Map, Shuffle, Reduce (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
  46. Step 1: Compute File Popularities
      Input: <user, file>
      After the inverse identity map (and shuffle): <file, List<user>>
      After the reduce: <file, (user count)>
      Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  47. Example: File popularity for Dilbert
      Input: (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
      After the inverse identity map: <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
      After the reduce: (Dilbert, 5)
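The popularity job can be imitated with a toy, in-memory MapReduce driver. The `run_mapreduce` helper below is a hypothetical stand-in for Hadoop's map, shuffle, and reduce phases, written only to show the data flow. It is not a Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)      # the "shuffle" groups by key
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def inverse_identity_map(record):
    """Swap <user, file> into <file, user>."""
    user, file = record
    yield file, user


def count_users(file, users):
    """Emit <file, number of distinct users who viewed it>."""
    yield file, len(set(users))


views = [
    ("Miranda", "Dilbert"),
    ("Bob", "Dilbert"),
    ("Susan", "Dilbert"),
    ("Chun", "Dilbert"),
    ("Alice", "Dilbert"),
]
popularity = dict(run_mapreduce(views, inverse_identity_map, count_users))
```

Counting distinct users guards against a user who viewed the same file more than once.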
  48. Step 2a: Compute Relationship Tallies: Find All Relationships in the View History Table
      Input: <user, file>
      After the identity map (and shuffle): <user, List<file>>
      After the reduce: <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>
      Relationships have their file IDs in alphabetical order to avoid double counting.
  49. Example 2a: Miranda’s (CEO) File Relationship Votes
      Input: (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
      After the identity map: <Miranda, {Annual Report, Vision Statement, Dilbert}>
      After the reduce: <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
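A short Python sketch of step 2a, run in memory rather than on Hadoop: group the files by user, then emit one vote per pair of files, with each pair in alphabetical order so that a relationship always has the same key.

```python
from collections import defaultdict
from itertools import combinations


def relationship_votes(view_records):
    """Turn <user, file> records into <(file1, file2), 1> votes."""
    files_by_user = defaultdict(set)
    for user, file in view_records:          # identity map plus shuffle
        files_by_user[user].add(file)

    votes = []
    for user in sorted(files_by_user):       # reduce, one call per user
        # Sorting first keeps every pair in alphabetical order.
        for pair in combinations(sorted(files_by_user[user]), 2):
            votes.append((pair, 1))
    return votes


miranda = [
    ("Miranda", "Annual Report"),
    ("Miranda", "Vision Statement"),
    ("Miranda", "Dilbert"),
]
votes = relationship_votes(miranda)
```

A user who viewed k files emits k(k-1)/2 votes, so heavy viewers dominate the cost of this step.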
  50. Step 2b: Tally the Relationship Votes: Just a Word Count, Where Each Relationship Occurrence Is a Word
      Input: <(file1, file2), Integer(1)>
      After the identity map (and shuffle): <(file1, file2), List<Integer(1)>>
      After the reduce (count and divide by popularities): <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
      Note that we emit each result twice, once for each file that belongs to a relationship.
  51. Example 2b: the Dilbert/Darth Vader Relationship
      Input: <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
      After the identity map: <(Dilbert, Vader), {1, 1, 1}>
      After the reduce (count and divide by popularities): <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
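Step 2b can be sketched as a word count followed by a division. The sketch assumes the popularity table from step 1 is available as a plain dict (on Hadoop it would come from the distributed cache) and that the divisor is sqrt(pop1 * pop2), which matches the sqrt(3/5) in the example: 3 / sqrt(5 * 3) = sqrt(3/5).

```python
from collections import Counter
from math import sqrt


def tally_and_normalize(votes, popularity):
    """Count votes per relationship and divide by the file popularities."""
    counts = Counter()
    for pair, one in votes:
        counts[pair] += one

    results = []
    for (file1, file2), count in sorted(counts.items()):
        score = count / sqrt(popularity[file1] * popularity[file2])
        # Emit once per file so each file can later collect its own list.
        results.append((file1, (file2, score)))
        results.append((file2, (file1, score)))
    return results


votes = [(("Dilbert", "Vader"), 1)] * 3
popularity = {"Dilbert": 5, "Vader": 3}
results = tally_and_normalize(votes, popularity)
```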
  52. Step 3: Sort and Store Results
      Input: <file1, (file2, similarity score)>
      After the identity map (and shuffle): <file1, List<(file2, similarity score)>>
      After the reduce: <file1, {top n similar files}>
      Store the results in your location of choice.
  53. Example 3: Sorting the Results for Dilbert
      Input: <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
      After the identity map: <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
      After the reduce: <Dilbert, {Darth Vader, Vision Statement}> (top 2 files)
      Store results.
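The final step is a per-file sort with a cutoff. A minimal in-memory sketch, where `n` is the number of related files to keep:

```python
from collections import defaultdict


def top_similar(pairs, n):
    """Group <file1, (file2, score)> by file1 and keep the n best matches."""
    by_file = defaultdict(list)
    for file1, (file2, score) in pairs:
        by_file[file1].append((file2, score))

    top = {}
    for file1, candidates in by_file.items():
        ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
        top[file1] = [name for name, _ in ranked[:n]]
    return top


pairs = [
    ("Dilbert", ("Annual Report", 0.63)),
    ("Dilbert", ("Vision Statement", 0.77)),
    ("Dilbert", ("Disk Usage", 0.45)),
    ("Dilbert", ("Darth Vader", 0.77)),
]
top = top_similar(pairs, 2)
```

Darth Vader and Vision Statement tie at .77, so their relative order inside the top 2 depends on the tie-breaking rule. The slide does not specify one.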
  54. Appendix
      Cosine formula and normalization trick to avoid the distributed cache:
      $\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$
      Mahout has CF.
      Asymptotic order of the algorithm is $O(M \cdot N^2)$ in the worst case, but is helped by sparsity.
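One possible reading of the normalization trick, sketched in Python: the slide gives only the formula, so the details here are an assumption. For 0/1 view vectors, the norm of a file's vector is the square root of its popularity. If every view record is first scaled by 1/sqrt(popularity), the pair reducer can sum products of these weights and obtain the cosine directly, with no popularity lookup at tally time.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def weighted_views(view_records):
    """Attach 1/sqrt(popularity) to each distinct <user, file> record."""
    records = sorted(set(view_records))      # drop repeat views
    popularity = defaultdict(int)
    for _, file in records:
        popularity[file] += 1
    return [(user, file, 1 / sqrt(popularity[file])) for user, file in records]


def cosine_from_weights(weighted):
    """Sum weight products over co-views; the sum is the cosine."""
    by_user = defaultdict(dict)
    for user, file, weight in weighted:
        by_user[user][file] = weight

    sims = defaultdict(float)
    for files in by_user.values():
        for a, b in combinations(sorted(files), 2):
            sims[(a, b)] += files[a] * files[b]
    return dict(sims)


records = [
    ("Miranda", "Annual Report"), ("Miranda", "Vision Statement"), ("Miranda", "Dilbert"),
    ("Bob", "Annual Report"), ("Bob", "Vision Statement"), ("Bob", "Dilbert"),
    ("Susan", "Vision Statement"), ("Susan", "Dilbert"), ("Susan", "Darth Vader"),
    ("Chun", "Dilbert"), ("Chun", "Darth Vader"),
    ("Alice", "Dilbert"), ("Alice", "Darth Vader"), ("Alice", "Disk Usage"),
]
sims = cosine_from_weights(weighted_views(records))
```

The scores agree with the normalized relationship table, for example 2 / sqrt(6) for Annual Report and Vision Statement.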
  55. Narayan Bharadwaj, Director, Product Management (@nadubharadwaj); Jed Crosby, Data Scientist (@JedCrosby)
