Video: http://www.youtube.com/watch?v=BT8WvQMMaV0
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we address internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms. In this webinar, we will discuss an internal use case and a product use case:
Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
How Salesforce.com uses Hadoop
1. How Salesforce.com uses Hadoop
Narayan Bharadwaj
Data Science
@nadubharadwaj
Jed Crosby
Data Science
@JedCrosby
#forcewebinar
Follow us @forcedotcom
2. Safe Harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such
uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ
materially from the results expressed or implied by the forward-looking statements we make. All statements other than
statements of historical fact could be deemed forward-looking, including any projections of product or service availability,
subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of
management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or
technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and
delivering new functionality for our service, new products and services, our new business model, our past operating losses,
possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our
security measures, the outcome of any litigation, risks associated with completed and any possible mergers and
acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain,
and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our
limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further
information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report
on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most
recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available
on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not
currently available and may not be delivered on time or at all. Customers who purchase our services should make the
purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does
not intend to update these forward-looking statements.
3. Agenda
§ Hadoop use cases
§ Use case 1 - Product Metrics*
§ Technology
§ Use case 2- Collaborative Filtering*
§ Q&A
*Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
4. Got “Cloud Data”?
130k customers
Millions of users
780 million transactions/day
Terabytes/day
5. Hadoop Overview
§ Started by Doug Cutting at Yahoo!
§ Based on two Google papers
– Google File System (GFS): http://research.google.com/archive/gfs.html
– Google MapReduce: http://research.google.com/archive/mapreduce.html
§ Hadoop is an open source Apache project
– Hadoop Distributed File System (HDFS)
– Distributed Processing Framework (MapReduce)
§ Several related projects
– HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
6. Hadoop use cases
§ Product Metrics
§ User behavior analysis
§ Capacity planning
§ Monitoring
§ Security intelligence
§ Performance analysis
§ Ad-hoc log searches
§ Search Relevancy
§ Collaborative Filtering
8. Product Metrics – Problem Statement
§ Track feature usage/adoption across 130k+ customers
– Eg: Accounts, Contacts, Visualforce, Apex,…
§ Track standard metrics across all features
– Eg: #Requests, #UniqueOrgs, #UniqueUsers,
AvgResponseTime,…
§ Track features and metrics across all channels
– API, UI, Mobile
§ Primary audience: Executives, Product Managers
9. Data Pipeline
[Diagram of the pipeline stages:]
Feature (What?): Feature Metadata (Instrumentation)
→ Crunch it (How?): Storage & Processing
→ Output: Daily Summary
→ Visualize: Fancy UI
→ Collaborate & Iterate
10. Product Metrics Pipeline
[Diagram of the Product Metrics pipeline:]
Force.com: User Input (Page Layout), Collaboration (Chatter), Reports & Dashboards, Formula Fields, Workflow
Custom Objects: Feature Metrics, Trend Metrics (read and written via the API)
Client Machine: Java Program, Pig script generator, Workflow, Log Pull
Hadoop: Log Files
11. Feature Metrics (Custom Object)
Id Feature Name PM Instrumentation Metric1 Metric2 Metric3 Metric4 Status
F0001 Accounts John /001 #requests #UniqOrgs #UniqUsers AvgRT Dev
F0002 Contacts Nancy /003 #requests #UniqOrgs #UniqUsers AvgRT Review
F0003 API Eric A #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0004 Visualforce Roger V #requests #UniqOrgs #UniqUsers AvgRT Decom
F0005 Apex Kim axapx #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0006 Custom Objects Chun /aXX #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0008 Chatter Jed chcmd #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0009 Reports Steve R #requests #UniqOrgs #UniqUsers AvgRT Deployed
16. Basic Pig script construct
-- Define UDFs
DEFINE GFV GetFieldValue('/path/to/udf/file');
-- Load data
A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
-- Filter data
B = FILTER A BY GFV(*, 'logRecordType') == 'U';
-- Extract fields (aliases added so later steps can reference them)
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId, ...;
-- Group
G = GROUP C BY ...;
-- Compute output metrics
O = FOREACH G {
    orgs = C.orgId;
    uniqueOrgs = DISTINCT orgs;
    -- a nested FOREACH must end with GENERATE
    GENERATE group, COUNT(uniqueOrgs);
};
-- Store or Dump results
STORE O INTO '/path/to/user/output';
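For readers without a Pig environment, the same flow can be sketched in plain Python. This is only an illustration of the filter/group/distinct-count logic, not the production job; the dict-shaped log rows and the grouping key (the request URI) are assumptions.

```python
from collections import defaultdict

def unique_orgs_per_group(log_rows):
    """Mimic the Pig flow above: filter 'U' records, extract fields,
    group, and count distinct orgs per group."""
    # FILTER A BY logRecordType == 'U'
    updates = [r for r in log_rows if r.get("logRecordType") == "U"]
    # GROUP by a key (here the request URI, a hypothetical choice)
    groups = defaultdict(set)
    for r in updates:
        groups[r["uri"]].add(r["orgId"])   # DISTINCT orgs per group
    # COUNT unique orgs per group
    return {uri: len(orgs) for uri, orgs in groups.items()}

rows = [
    {"logRecordType": "U", "uri": "/001", "orgId": "o1", "userId": "u1"},
    {"logRecordType": "U", "uri": "/001", "orgId": "o1", "userId": "u2"},
    {"logRecordType": "U", "uri": "/001", "orgId": "o2", "userId": "u3"},
    {"logRecordType": "A", "uri": "/001", "orgId": "o3", "userId": "u4"},
]
print(unique_orgs_per_group(rows))  # {'/001': 2}
```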
29. Collaborative Filtering – Problem Statement
§ Show similar files within an organization
– Content-based approach
– Community-based approach
32. We found this relationship using item-to-item collaborative filtering
§ Amazon published this algorithm in 2003.
– Amazon.com Recommendations: Item-to-Item Collaborative Filtering,
by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet
Computing, January-February 2003.
§ At Salesforce, we adapted this algorithm for Hadoop,
and we use it to recommend files to view and users to
follow.
33. Example: CF on 5 files
Vision Statement
Annual Report
Dilbert Comic
Darth Vader Cartoon
Disk Usage Report
34. View History Table
                Annual  Vision     Dilbert  Darth Vader  Disk Usage
                Report  Statement  Cartoon  Cartoon      Report
Miranda (CEO)     1       1          1        0            0
Bob (CFO)         1       1          1        0            0
Susan (Sales)     0       1          1        1            0
Chun (Sales)      0       0          1        1            0
Alice (IT)        0       0          1        1            1
35. Relationships between the files
[Figure: graph of co-view relationships among the five files: Annual Report, Vision Statement, Dilbert Cartoon, Darth Vader Cartoon, Disk Usage Report]
36. Relationships between the files
[Graph: relationship tallies between each pair of files]
Annual Report – Vision Statement: 2
Annual Report – Dilbert Cartoon: 2
Annual Report – Darth Vader Cartoon: 0
Annual Report – Disk Usage Report: 0
Vision Statement – Dilbert Cartoon: 3
Vision Statement – Darth Vader Cartoon: 1
Vision Statement – Disk Usage Report: 0
Dilbert Cartoon – Darth Vader Cartoon: 3
Dilbert Cartoon – Disk Usage Report: 1
Darth Vader Cartoon – Disk Usage Report: 1
37. Sorted relationships for each file
Annual Report:       Dilbert (2), Vision Stmt. (2)
Vision Statement:    Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon:     Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report:   Dilbert (1), Darth Vader (1)
The popularity problem: notice that Dilbert appears first in every list.
This is probably not what we want.
The solution: divide the relationship tallies by file popularities.
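Concretely, the division is a cosine-style normalization: each tally is divided by the square root of the product of the two files' popularities (viewer counts). A minimal sketch, using numbers from the view history table:

```python
from math import sqrt

def normalized_score(tally, pop_a, pop_b):
    # tally / sqrt(pop_a * pop_b): penalizes universally popular files
    return tally / sqrt(pop_a * pop_b)

# Dilbert was viewed by 5 users, Darth Vader by 3, together by 3:
print(round(normalized_score(3, 5, 3), 2))  # 0.77
# Annual Report (2 viewers) and Vision Statement (3), together by 2:
print(round(normalized_score(2, 2, 3), 2))  # 0.82
```

These are exactly the .77 and .82 values shown on the normalized-relationship slides.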
38. Normalized relationships between the files
[Graph: the same relationships, normalized by file popularities]
Annual Report – Vision Statement: .82
Annual Report – Dilbert Cartoon: .63
Annual Report – Darth Vader Cartoon: 0
Annual Report – Disk Usage Report: 0
Vision Statement – Dilbert Cartoon: .77
Vision Statement – Darth Vader Cartoon: .33
Vision Statement – Disk Usage Report: 0
Dilbert Cartoon – Darth Vader Cartoon: .77
Dilbert Cartoon – Disk Usage Report: .45
Darth Vader Cartoon – Disk Usage Report: .58
39. Sorted relationships for each file, normalized by file popularities
Annual Report:       Vision Stmt. (.82), Dilbert (.63)
Vision Statement:    Annual Report (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon:     Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report:   Darth Vader (.58), Dilbert (.45)
High relationship tallies AND similar popularity values now drive closeness.
40. The item-to-item CF algorithm
1) Compute file popularities
2) Compute relationship tallies and divide by file
popularities
3) Sort and store the results
41. MapReduce Overview
[Diagram: Map → Shuffle → Reduce]
(adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
42. 1. Compute File Popularities
<user, file>
Inverse identity map
<file, List<user>>
Reduce
<file, (user count)>
Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
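The step above can be sketched in Python, with the map/shuffle/reduce plumbing simulated in-process (a sketch of the idea, not the actual Java job; the `(user, file)` tuple shape is an assumption):

```python
from collections import defaultdict

def file_popularities(views):
    """Step 1: invert <user, file> pairs and count distinct users per file."""
    by_file = defaultdict(set)        # shuffle output: <file, List<user>>
    for user, f in views:             # inverse identity map
        by_file[f].add(user)
    return {f: len(users) for f, users in by_file.items()}  # reduce

views = [("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
         ("Chun", "Dilbert"), ("Alice", "Dilbert"), ("Alice", "DiskUsage")]
print(file_popularities(views))  # {'Dilbert': 5, 'DiskUsage': 1}
```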
44. 2a. Compute relationship tallies - find all relationships in the view history table
<user, file>
Identity map
<user, List<file>>
Reduce
<(file1, file2), Integer(1)>,
<(file1, file3), Integer(1)>,
…
<(file(n-1), file(n)), Integer(1)>
Relationships have their file IDs in alphabetical order
to avoid double counting.
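A Python sketch of step 2a under the same simulated-MapReduce assumptions: group views by user, then emit one vote per co-viewed file pair, with the pair's IDs sorted so (A, B) and (B, A) never both appear.

```python
from collections import defaultdict
from itertools import combinations

def relationship_pairs(views):
    """Step 2a: for each user, emit every pair of files that user viewed."""
    by_user = defaultdict(set)
    for user, f in views:                     # identity map + shuffle
        by_user[user].add(f)
    pairs = []
    for files in by_user.values():            # reduce: all pairs per user
        for a, b in combinations(sorted(files), 2):
            pairs.append(((a, b), 1))         # sorted IDs avoid double counting
    return pairs

views = [("Susan", "Dilbert"), ("Susan", "Vader"),
         ("Chun", "Dilbert"), ("Chun", "Vader")]
print(relationship_pairs(views))
# [(('Dilbert', 'Vader'), 1), (('Dilbert', 'Vader'), 1)]
```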
46. 2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word
<(file1, file2), Integer(1)>
Identity map
<(file1, file2), List<Integer(1)>>
Reduce: count and
divide by popularities
<file1, (file2, similarity score)>, <file2, (file1, similarity score)>
Note that we emit each result twice, one for each file that belongs to a
relationship.
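Continuing the in-process Python sketch, step 2b tallies each relationship like a word count, normalizes by the popularities from step 1, and emits the score once per file in the pair:

```python
from collections import Counter
from math import sqrt

def similarity_scores(pairs, popularity):
    """Step 2b: count each (fileA, fileB) vote, divide by the files'
    popularities, and emit <file, (other file, score)> twice per pair."""
    tallies = Counter(pair for pair, _ in pairs)   # shuffle + word count
    out = {}
    for (a, b), n in tallies.items():
        score = n / sqrt(popularity[a] * popularity[b])
        out.setdefault(a, []).append((b, score))   # emit once for each
        out.setdefault(b, []).append((a, score))   # file in the pair
    return out

# Three users co-viewed Dilbert (popularity 5) and Vader (popularity 3):
pairs = [(("Dilbert", "Vader"), 1)] * 3
result = similarity_scores(pairs, {"Dilbert": 5, "Vader": 3})
print(round(result["Dilbert"][0][1], 2))  # 0.77, i.e. 3/sqrt(15) = sqrt(3/5)
```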
47. Example 2b: the Dilbert/Darth Vader relationship
<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>
Identity map
<(Dilbert, Vader), {1, 1, 1}>
Reduce: count and
divide by popularities
<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
48. 3. Sort and store results
<file1, (file2, similarity score)>
Identity map
<file1, List<(file2, similarity score)>>
Reduce
<file1, {top n similar files}>
Store the results in your location of choice
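The final step is a per-file sort and truncation, which in the same Python sketch is just:

```python
def top_n_similar(scores, n=3):
    """Step 3: for each file, sort neighbors by score and keep the top n."""
    return {f: sorted(neighbors, key=lambda kv: -kv[1])[:n]
            for f, neighbors in scores.items()}

scores = {"Vader": [("Vision", 0.33), ("Dilbert", 0.77), ("DiskUsage", 0.58)]}
print(top_n_similar(scores, n=2))
# {'Vader': [('Dilbert', 0.77), ('DiskUsage', 0.58)]}
```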
50. Appendix
§ Cosine formula and normalization trick to avoid the
distributed cache:
cos θ_AB = (A · B) / (|A| |B|) = (A / |A|) · (B / |B|)
§ Mahout has CF
§ Asymptotic order of the algorithm is O(M·N²) in the worst
case, but it is helped by sparsity.
51. Summary
Hadoop Cloud Data
Hadoop + Force.com = Recommendation algorithms
53. Upcoming Events
§ June 26 – Mobile CodeTalk
– http://bit.ly/mct-wr
§ June 27 – Painless Mobile App Development
– http://bit.ly/mobileapp-hp
http://bit.ly/mdc-hp