How Salesforce.com uses Hadoop

Video: http://www.youtube.com/watch?v=BT8WvQMMaV0

Hadoop is the technology of choice for processing large data sets. At salesforce.com, we service internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, we will discuss an internal use case and a product use case:

Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).

Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.

Published in: Technology

Transcript

  • 1. How Salesforce.com uses Hadoop. Narayan Bharadwaj, Data Science (@nadubharadwaj); Jed Crosby, Data Science (@JedCrosby). #forcewebinar. Follow us @forcedotcom
  • 2. Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. 
is included in our annual report on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
  • 3. Agenda
      §  Hadoop use cases
      §  Use case 1 - Product Metrics*
      §  Technology
      §  Use case 2 - Collaborative Filtering*
      §  Q&A
      *Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
  • 4. Got “Cloud Data”? 130k customers; 780 million transactions/day; millions of users; terabytes/day.
  • 5. Hadoop Overview
      §  Started by Doug Cutting at Yahoo!
      §  Based on two Google papers
          –  Google File System (GFS): http://research.google.com/archive/gfs.html
          –  Google MapReduce: http://research.google.com/archive/mapreduce.html
      §  Hadoop is an open source Apache project
          –  Hadoop Distributed File System (HDFS)
          –  Distributed Processing Framework (MapReduce)
      §  Several related projects
          –  HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
  • 6. Hadoop use cases: user behavior analysis, product metrics, capacity planning, monitoring intelligence, performance analysis, security, ad-hoc log searches, collaborative filtering, search relevancy.
  • 7. Product Metrics
  • 8. Product Metrics – Problem Statement
      §  Track feature usage/adoption across 130k+ customers
          –  E.g.: Accounts, Contacts, Visualforce, Apex, …
      §  Track standard metrics across all features
          –  E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …
      §  Track features and metrics across all channels
          –  API, UI, Mobile
      §  Primary audience: Executives, Product Managers
  • 9. Data Pipeline (diagram labels): Feature (What?); Feature Metadata (Instrumentation); Crunch it (How?); Storage & Processing; Daily Summary (Output); Fancy UI (Visualize); Collaborate & Iterate.
  • 10. Product Metrics Pipeline (diagram labels): User Input (Page Layout); Collaboration (Chatter); Reports, Dashboards; Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; API; Client Machine (Java Program, Pig script generator); Workflow; Log Pull; Hadoop; Log Files.
  • 11. Feature Metrics (Custom Object)
      Id     Feature Name    PM     Instrumentation  Metric1    Metric2    Metric3     Metric4  Status
      F0001  Accounts        John   /001             #requests  #UniqOrgs  #UniqUsers  AvgRT    Dev
      F0002  Contacts        Nancy  /003             #requests  #UniqOrgs  #UniqUsers  AvgRT    Review
      F0003  API             Eric   A                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0004  Visualforce     Roger  V                #requests  #UniqOrgs  #UniqUsers  AvgRT    Decom
      F0005  Apex            Kim    axapx            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0006  Custom Objects  Chun   /aXX             #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0008  Chatter         Jed    chcmd            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0009  Reports         Steve  R                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
  • 12. Feature Metrics (Custom Object)
  • 13. User Input (Page Layout): Formula Field, Workflow Rule
  • 14. User Input (Child Custom Object): Child Objects
  • 15. Apache Pig
  • 16. Basic Pig script construct
      -- Define UDFs
      DEFINE GFV GetFieldValue('/path/to/udf/file');
      -- Load data
      A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
      -- Filter data
      B = FILTER A BY GFV(row, 'logRecordType') == 'U';
      -- Extract fields (GFV is the only UDF defined above, so it is used for both fields)
      C = FOREACH B GENERATE GFV(*, 'orgId'), GFV(*, 'userId') ……..
      -- Group
      G = GROUP C BY ……
      -- Compute output metrics (a nested FOREACH block must end with a GENERATE)
      O = FOREACH G { orgs = C.orgId; uniqueOrgs = DISTINCT orgs; GENERATE group, COUNT(uniqueOrgs); }
      -- Store or Dump results
      STORE O INTO '/path/to/user/output';
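For readers who do not use Pig, the filter / group / distinct-count flow of the script can be sketched in plain Python. This is only an illustration under stated assumptions: the records are hypothetical and already parsed into dictionaries, the `feature` and `responseTimeMs` fields and the `feature_metrics` function are invented for the example, and only `logRecordType`, `orgId` and `userId` come from the slide.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records. 'feature' and 'responseTimeMs'
# are assumed field names; the slide only names logRecordType, orgId, userId.
LOG_RECORDS = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1", "responseTimeMs": 100},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2", "responseTimeMs": 200},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3", "responseTimeMs": 300},
    {"logRecordType": "A", "feature": "Accounts", "orgId": "org3", "userId": "u4", "responseTimeMs": 50},
    {"logRecordType": "U", "feature": "Contacts", "orgId": "org1", "userId": "u1", "responseTimeMs": 40},
]


def feature_metrics(records, record_type="U"):
    """Filter by record type, group by feature, then compute the four metrics."""
    groups = defaultdict(list)
    for rec in records:
        if rec["logRecordType"] == record_type:  # FILTER
            groups[rec["feature"]].append(rec)   # GROUP
    metrics = {}
    for feature, recs in groups.items():
        metrics[feature] = {
            "requests": len(recs),
            "uniqueOrgs": len({r["orgId"] for r in recs}),    # DISTINCT + COUNT
            "uniqueUsers": len({r["userId"] for r in recs}),
            "avgResponseTime": sum(r["responseTimeMs"] for r in recs) / len(recs),
        }
    return metrics


METRICS = feature_metrics(LOG_RECORDS)
```

The Pig version does the same work in parallel across the cluster; the in-memory version is only practical for small samples.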
  • 17. Java Pig Script Generator (Client)
  • 18. Trend Metrics (Custom Object)
      Id     Date        #Requests  #UniqueOrgs  #UniqueUsers  AvgResponseTime
      F0001  06/01/2012  <big>      <big>        <big>         <little>
      F0002  06/01/2012  <big>      <big>        <big>         <little>
      F0003  06/01/2012  <big>      <big>        <big>         <little>
      F0001  06/02/2012  <big>      <big>        <big>         <little>
      F0002  06/02/2012  <big>      <big>        <big>         <little>
      F0003  06/03/2012  <big>      <big>        <big>         <little>
  • 19. Upload to Trend Metrics (Custom Object)
  • 20. Visualization (Reports & Dashboards)
  • 21. Visualization (Reports & Dashboards)
  • 22. Collaborate, Iterate (Chatter)
  • 23. Recap (diagram labels): User Input (Page Layout); Collaboration (Chatter); Reports, Dashboards; Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; API; Client Machine (Java Program, Pig script generator); Workflow; Log Pull; Hadoop; Log Files.
  • 24. Technology
  • 25. Hadoop ecosystem: Apache Hadoop, Version=0.20.2
  • 26. Contributions: @pRaShAnT1784 (Prashant Kommireddi); Lars Hofhansl; @thefutureian (Ian Varley)
  • 27. Data Science tools ecosystem: Apache Pig, Version=0.9.1
  • 28. Collaborative Filtering
  • 29. Collaborative Filtering – Problem Statement
      §  Show similar files within an organization
          –  Content-based approach
          –  Community-based approach
  • 30. Popular File
  • 31. Related File
  • 32. We found this relationship using item-to-item collaborative filtering
      §  Amazon published this algorithm in 2003.
          –  Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.
      §  At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  • 33. Example: CF on 5 files: Vision Statement, Annual Report, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report.
  • 34. View History Table
                      Annual  Vision     Dilbert  Darth Vader  Disk Usage
                      Report  Statement  Cartoon  Cartoon      Report
      Miranda (CEO)   1       1          1        0            0
      Bob (CFO)       1       1          1        0            0
      Susan (Sales)   0       1          1        1            0
      Chun (Sales)    0       0          1        1            0
      Alice (IT)      0       0          1        1            1
  • 35. Relationships between the files (graph nodes): Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report.
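  • 36. Relationships between the files (each edge is the number of users who viewed both files)
      Annual Report – Vision Statement: 2
      Annual Report – Dilbert Cartoon: 2
      Annual Report – Darth Vader Cartoon: 0
      Annual Report – Disk Usage Report: 0
      Vision Statement – Dilbert Cartoon: 3
      Vision Statement – Darth Vader Cartoon: 1
      Vision Statement – Disk Usage Report: 0
      Dilbert Cartoon – Darth Vader Cartoon: 3
      Dilbert Cartoon – Disk Usage Report: 1
      Darth Vader Cartoon – Disk Usage Report: 1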
  • 37. Sorted relationships for each file
      Annual Report: Dilbert (2), Vision Stmt. (2)
      Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
      Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
      Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
      Disk Usage Report: Dilbert (1), Darth Vader (1)
      The popularity problem: notice that Dilbert appears first in every other file's list. This is probably not what we want. The solution: divide the relationship tallies by file popularities.
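  • 38. Normalized relationships between the files
      Annual Report – Vision Statement: .82
      Annual Report – Dilbert Cartoon: .63
      Annual Report – Darth Vader Cartoon: 0
      Annual Report – Disk Usage Report: 0
      Vision Statement – Dilbert Cartoon: .77
      Vision Statement – Darth Vader Cartoon: .33
      Vision Statement – Disk Usage Report: 0
      Dilbert Cartoon – Darth Vader Cartoon: .77
      Dilbert Cartoon – Disk Usage Report: .45
      Darth Vader Cartoon – Disk Usage Report: .58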
  • 39. Sorted relationships for each file, normalized by file popularities
      Annual Report: Vision Stmt. (.82), Dilbert (.63)
      Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
      Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
      Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
      Disk Usage Report: Darth Vader (.58), Dilbert (.45)
      High relationship tallies AND similar popularity values now drive closeness.
  • 40. The item-to-item CF algorithm
      1)  Compute file popularities
      2)  Compute relationship tallies and divide by file popularities
      3)  Sort and store the results
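The three steps can be tried out in memory, without Hadoop, on the five-file example. The sketch below is not the production job: the `item_to_item` function and the shortened file names are made up for illustration, and "divide by file popularities" is assumed to mean dividing each tally by the square root of the product of the two popularities, which is what the cosine formula in the appendix and the sqrt(3/5) value in example 2b suggest.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# View history from the example (file names shortened for readability).
VIEWS = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}


def item_to_item(views, top_n=None):
    # 1) File popularity: number of distinct users who viewed each file.
    popularity = Counter(f for files in views.values() for f in set(files))

    # 2) Relationship tallies: one vote per pair of files seen by the same user.
    #    Sorting each pair keeps (a, b) and (b, a) from being counted separately.
    tallies = Counter()
    for files in views.values():
        for pair in combinations(sorted(set(files)), 2):
            tallies[pair] += 1

    related = {f: [] for f in popularity}
    for (a, b), tally in tallies.items():
        # Assumed normalization: tally / sqrt(popularity(a) * popularity(b)).
        score = tally / sqrt(popularity[a] * popularity[b])
        related[a].append((b, score))
        related[b].append((a, score))

    # 3) Sort by descending score (name breaks ties) and optionally truncate.
    for f in related:
        related[f].sort(key=lambda item: (-item[1], item[0]))
        if top_n is not None:
            related[f] = related[f][:top_n]
    return related


RELATED = item_to_item(VIEWS)
```

Rounding the scores in `RELATED` to two decimals should match the normalized lists on slide 39. Pairs that no user viewed together simply never appear, which is how sparsity keeps the work manageable.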
  • 41. MapReduce Overview: Map, Shuffle, Reduce (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
  • 42. 1. Compute File Popularities
      Input: <user, file>
      Inverse identity map: <file, List<user>>
      Reduce: <file, (user count)>
      Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  • 43. Example: File popularity for Dilbert
      Input: (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
      Inverse identity map: <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
      Reduce: (Dilbert, 5)
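The map, shuffle and reduce phases of this step can be imitated in a few lines of Python. This is a toy single-process stand-in, not Hadoop's API: `run_map_reduce`, `inverse_identity_map` and `count_users` are hypothetical names, and the extra (Alice, Disk Usage) record is added only to show a second key.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map each record, group by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):    # map phase
            shuffled[key].append(value)      # shuffle: group values by key
    return dict(reducer(key, values) for key, values in shuffled.items())


def inverse_identity_map(record):
    """Swap <user, file> into <file, user> so the shuffle groups users by file."""
    user, file = record
    yield file, user


def count_users(file, users):
    """Reduce <file, List<user>> to <file, number of distinct users>."""
    return file, len(set(users))


VIEW_RECORDS = [
    ("Miranda", "Dilbert"),
    ("Bob", "Dilbert"),
    ("Susan", "Dilbert"),
    ("Chun", "Dilbert"),
    ("Alice", "Dilbert"),
    ("Alice", "Disk Usage"),
]

popularity = run_map_reduce(VIEW_RECORDS, inverse_identity_map, count_users)
```

In the real job the shuffle is done by the framework across machines; here a dictionary of lists plays that role.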
  • 44. 2a. Compute relationship tallies - find all relationships in view history table
      Input: <user, file>
      Identity map: <user, List<file>>
      Reduce: <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>
      Relationships have their file IDs in alphabetical order to avoid double counting.
  • 45. Example 2a: Miranda’s (CEO) file relationship votes
      Input: (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
      Identity map: <Miranda, {Annual Report, Vision Statement, Dilbert}>
      Reduce: <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
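The reduce side of step 2a amounts to emitting every unordered pair from one user's file list. A minimal sketch, assuming a hypothetical helper named `relationship_votes` and plain strings as file IDs:

```python
from itertools import combinations


def relationship_votes(files):
    """Emit one vote per unordered pair of files viewed by the same user.

    Sorting first puts the IDs of every pair in alphabetical order, so the
    same relationship always produces the same key.
    """
    ordered = sorted(set(files))
    return [(pair, 1) for pair in combinations(ordered, 2)]


MIRANDA_VOTES = relationship_votes(["Annual Report", "Vision Statement", "Dilbert"])
```

A user who viewed n files produces n(n-1)/2 votes, so users with very long view histories dominate the cost of this step.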
  • 46. 2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word
      Input: <(file1, file2), Integer(1)>
      Identity map: <(file1, file2), List<Integer(1)>>
      Reduce (count and divide by popularities): <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
      Note that we emit each result twice, one for each file that belongs to a relationship.
  • 47. Example 2b: the Dilbert/Darth Vader relationship
      Input: <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
      Identity map: <(Dilbert, Vader), {1, 1, 1}>
      Reduce (count and divide by popularities): <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
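The sqrt(3/5) in this example falls out if the tally is divided by the square root of the product of the two popularities: 3 / sqrt(5 × 3) = sqrt(3/5). The sketch below assumes that reading of "divide by popularities"; `score_relationships` is a hypothetical name and the popularity table is passed in as a plain dictionary rather than read from the distributed cache.

```python
from collections import Counter
from math import isclose, sqrt


def score_relationships(votes, popularity):
    """Count the votes per relationship, normalize, and emit one result per file."""
    tallies = Counter()
    for pair, vote in votes:
        tallies[pair] += vote

    results = []
    for (file1, file2), tally in tallies.items():
        # Assumed normalization: tally / sqrt(popularity(file1) * popularity(file2)).
        score = tally / sqrt(popularity[file1] * popularity[file2])
        results.append((file1, (file2, score)))  # emitted twice, once per file
        results.append((file2, (file1, score)))
    return results


VOTES = [(("Dilbert", "Vader"), 1)] * 3
POPULARITY = {"Dilbert": 5, "Vader": 3}
RESULTS = score_relationships(VOTES, POPULARITY)
SCORE_MATCHES_SLIDE = isclose(RESULTS[0][1][1], sqrt(3 / 5))
```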
  • 48. 3. Sort and store results
      Input: <file1, (file2, similarity score)>
      Identity map: <file1, List<(file2, similarity score)>>
      Reduce: <file1, {top n similar files}>
      Store the results in your location of choice
  • 49. Example 3: Sorting the results for Dilbert
      Input: <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
      Identity map: <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
      Reduce: <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)
      Store results
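The final reduce only needs the n best-scoring neighbours of each file, so a full sort is not strictly required. A minimal sketch using the standard library's `heapq.nlargest`; the helper name `top_n_similar` is made up for the example.

```python
import heapq


def top_n_similar(scored_files, n):
    """Return the names of the n files with the highest similarity score."""
    best = heapq.nlargest(n, scored_files, key=lambda item: item[1])
    return [name for name, _ in best]


DILBERT_SCORES = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]

TOP_2 = top_n_similar(DILBERT_SCORES, 2)
```

Darth Vader and Vision Statement tie at .77 in this example, so the order between them is arbitrary; only the membership of the top-2 set matters.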
  • 50. Appendix
      §  Cosine formula and normalization trick to avoid the distributed cache: $\cos\theta_{AB} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$
      §  Mahout has CF
      §  Asymptotic order of the algorithm is O(M*N^2) in worst case, but is helped by sparsity.
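The identity behind the normalization trick can be checked numerically on the Dilbert and Darth Vader columns of the view history table: dividing the dot product by both norms gives the same value as taking the dot product of the two vectors after each has been scaled to unit length. The sketch only verifies the formula; how the pre-normalized vectors are fed into the actual job so that the distributed cache is not needed is not described in the slides.

```python
from math import isclose, sqrt

# Columns of the view history table, one entry per user
# (Miranda, Bob, Susan, Chun, Alice).
DILBERT = [1, 1, 1, 1, 1]
DARTH_VADER = [0, 0, 1, 1, 1]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to length 1."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]


# Left-hand form: A.B / (|A| |B|)
COSINE_DIRECT = dot(DILBERT, DARTH_VADER) / (
    sqrt(dot(DILBERT, DILBERT)) * sqrt(dot(DARTH_VADER, DARTH_VADER))
)

# Right-hand form: (A/|A|) . (B/|B|), normalizing each vector up front.
COSINE_PRENORMALIZED = dot(unit(DILBERT), unit(DARTH_VADER))
```

For 0/1 view vectors the squared norm of a column is just the file's popularity, which is why the cosine reduces to tally / sqrt(popularity product).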
  • 51. Summary Hadoop Cloud Data Hadoop + Force.com = Recommendation algorithms
  • 52. @forcedotcom / #forcewebinar; Developer Force Group; facebook.com/forcedotcom; Developer Force – Force.com Community
  • 53. Upcoming Events
      §  June 26 – Mobile CodeTalk
          –  http://bit.ly/mct-wr
      §  June 27 – Painless Mobile App Development
          –  http://bit.ly/mobileapp-hp
      http://bit.ly/mdc-hp
  • 54. Q&A: http://bit.ly/hadoopsurvey. Narayan Bharadwaj (@nadubharadwaj), Jed Crosby (@JedCrosby), Prashant Kommireddi (@pRaShAnT1784), Santosh Rau (@santoshrau). @SalesforceEng
