Salesforce.com uses Hadoop to analyze the large volumes of customer data generated by more than 130,000 customers and 800 million daily transactions, tracking product usage and customer behavior. Key use cases include analyzing product metrics to understand feature adoption, examining user behavior to improve products, and powering collaborative-filtering recommendations. The document outlines Salesforce.com's Hadoop ecosystem and the data pipelines used to collect, process, and visualize insights from petabytes of customer data.
This document discusses how Salesforce.com uses Hadoop for product metrics and analytics use cases. It describes how they collect feature usage data from log files, process the data using Pig scripts on Hadoop, and store metrics in custom objects. The metrics are then visualized in reports and dashboards, and product managers can collaborate on features using Chatter. This process helps Salesforce track adoption of features, monitor performance, and gain insights to improve products.
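The log-to-metrics step described above (collect feature usage from log files, aggregate with Pig, store the results) can be sketched in plain Python. This is an illustrative stand-in for a Pig GROUP/COUNT step, not the actual Salesforce pipeline; the log format and field names (org_id, feature) are hypothetical:

```python
from collections import Counter

# Hypothetical log format: "timestamp org_id feature_name"
# (illustrative only; the real Salesforce log schema is not public).
def aggregate_feature_usage(log_lines):
    """Count usage events per (org, feature) pair, mirroring a
    GROUP BY ... COUNT aggregation in a Pig script."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed records
        _timestamp, org_id, feature = parts
        counts[(org_id, feature)] += 1
    return counts

logs = [
    "2012-06-01T10:00:00 org1 chatter_feed",
    "2012-06-01T10:01:00 org1 chatter_feed",
    "2012-06-01T10:02:00 org2 reports",
]
print(aggregate_feature_usage(logs)[("org1", "chatter_feed")])  # 2
```

Each resulting (org, feature) count would then be written out as a row, analogous to the metric records loaded into Custom Objects for reporting.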
Hadoop is the technology of choice for processing large data sets. At salesforce.com, we serve internal and product big data use cases using a combination of Hadoop, Java MapReduce, Pig, Force.com, and machine learning algorithms.
In this webinar, you will learn about an internal use case and a product use case:
:: Product Metrics: Internally, we measure feature usage using a combination of Hadoop, Pig, and the Force.com platform (Custom Objects and Analytics).
:: Community-Based Recommendations: In Chatter, our most successful people and file recommendations are built on a collaborative filtering algorithm that is implemented on Hadoop using Java MapReduce.
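At the heart of such a collaborative filter is co-occurrence counting: items (people, files) followed by the same users are likely to interest each other's followers. A minimal sketch follows, in Python rather than Java MapReduce for brevity; the user and file names are made up:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_scores(user_items):
    """For every pair of items followed by the same user, count the
    co-occurrence. Items that frequently co-occur with things a user
    already follows become recommendation candidates."""
    pair_counts = defaultdict(int)
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

follows = {
    "alice": ["fileA", "fileB"],
    "bob":   ["fileA", "fileB", "fileC"],
    "carol": ["fileB", "fileC"],
}
scores = cooccurrence_scores(follows)
print(scores[("fileA", "fileB")])  # 2
```

In a MapReduce formulation, the mapper would emit the item pairs per user and the reducer would sum the counts; at Chatter scale this distribution is what makes the computation feasible.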
Hadoop is used at Salesforce for several big data use cases including product metrics, user behavior analysis, capacity planning, and collaborative filtering. For product metrics, Hadoop collects and analyzes log data from over 130,000 customers to track feature usage, standard metrics, and metrics across channels. It generates reports and dashboards to provide insights to executives and product managers.
Hadoop is the technology of choice for processing large data sets. Force.com provides a great metadata layer to define Hadoop jobs and store job output (Custom Objects). Force.com also comes with a great visualization layer (Reports & Dashboards) to chart and trend the output from Hadoop jobs. In this session, we will explore a real-life use case that combines these technologies to provide a compelling big data processing framework.
Video: http://www.youtube.com/watch?v=BT8WvQMMaV0
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Dreamforce_2012_Hadoop_Use_Cases
1. How Salesforce.com Uses Hadoop
Some Data Science Use Cases
Narayan Bharadwaj, salesforce.com, @nadubharadwaj
Jed Crosby, salesforce.com, @JedCrosby
2. Safe Harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be
deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any
statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new
functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of
intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we
operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new
releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization
and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of
salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. These
documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of
our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based
upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-
looking statements.
3. Agenda
• Technology
• Hadoop use cases
• Use case discussion
• Product Metrics
• User Behavior Analysis
• Collaborative Filtering
• Q&A
Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
4. Got “Cloud Data”?
130k customers 800 million transactions/day
Millions of users Terabytes/day
6. Hadoop Overview
- Started by Doug Cutting at Yahoo!
- Based on two Google papers
Google File System (GFS): http://research.google.com/archive/gfs.html
Google MapReduce: http://research.google.com/archive/mapreduce.html
- Hadoop is an open source Apache project
Hadoop Distributed File System (HDFS)
Distributed Processing Framework (MapReduce)
- Several related projects
HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
12. Product Metrics – Problem Statement
Track feature usage/adoption across 130k+ customers
Eg: Accounts, Contacts, Visualforce, Apex,…
Track standard metrics across all features
Eg: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime,…
Track features and metrics across all channels
API, UI, Mobile
Primary audience: Executives, Product Managers
13. Data Pipeline
[Diagram] Feature Metadata (Instrumentation) defines the feature — the "what?" Storage & Processing — the "how?" — crunches it into a Daily Summary (Output), which is visualized in a fancy UI.
14. Product Metrics Pipeline
[Diagram] Log files land in Hadoop. A Log Pull workflow copies them to a client machine, where a Java program (a Pig script generator) drives processing. Summary results flow through the API into Feature Metrics and Trend Metrics custom objects, which carry formula and workflow fields plus user input, and are surfaced through reports, dashboards, and page layouts.
15. Feature Metrics (Custom Object)
Id Feature Name PM Instrumentation Metric1 Metric2 Metric3 Metric4 Status
F0001 Accounts John /001 #requests #UniqOrgs #UniqUsers AvgRT Dev
F0002 Contacts Nancy /003 #requests #UniqOrgs #UniqUsers AvgRT Review
F0003 API Eric A #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0004 Visualforce Roger V #requests #UniqOrgs #UniqUsers AvgRT Decom
F0005 Apex Kim axapx #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0006 Custom Objects Chun /aXX #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0008 Chatter Jed chcmd #requests #UniqOrgs #UniqUsers AvgRT Deployed
F0009 Reports Steve R #requests #UniqOrgs #UniqUsers AvgRT Deployed
20. Basic Pig Script Construct
-- Define UDFs
DEFINE GFV GetFieldValue('/path/to/udf/file');
-- Load data
A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
-- Filter data
B = FILTER A BY GFV(*, 'logRecordType') == 'U';
-- Extract fields
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId, ……
-- Group
G = GROUP C BY ……
-- Compute output metrics
O = FOREACH G {
  orgs = C.orgId;
  uniqueOrgs = DISTINCT orgs;
  GENERATE group, COUNT(uniqueOrgs);
}
-- Store or dump results
STORE O INTO '/path/to/user/output';
29. Problem Statement
How do we reduce the number of clicks in the user interface?
Need to understand top user click paths. What are they typically trying to do?
What are the user clusters/personas?
Approach:
• Markov transition for click path, D3.js visuals
• K-means (unsupervised) clustering for user groups
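As an illustration of the first bullet, a first-order Markov transition matrix can be tallied from click sessions in a few lines. This is only a sketch — the session data and page names below are hypothetical, and the production version runs over log files on Hadoop:

```python
from collections import Counter, defaultdict

# Hypothetical click sessions (page-view sequences); the real input
# would be parsed from log files.
sessions = [
    ["home", "accounts", "contacts"],
    ["home", "accounts", "reports"],
    ["home", "reports", "dashboards"],
]

# Tally first-order Markov transitions between consecutive clicks.
transitions = defaultdict(Counter)
for s in sessions:
    for src, dst in zip(s, s[1:]):
        transitions[src][dst] += 1

# Normalize each row into transition probabilities.
probs = {
    src: {dst: n / sum(cnt.values()) for dst, n in cnt.items()}
    for src, cnt in transitions.items()
}

print(probs["home"]["accounts"])  # 2 of 3 sessions go home -> accounts
```

The resulting matrix feeds the click-path visuals (D3.js) and, as per-user feature vectors, the k-means clustering step.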
36. We found this relationship using item-to-item collaborative
filtering
Amazon published this algorithm in 2003.
Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by
Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing,
January-February 2003.
At Salesforce, we adapted this algorithm for Hadoop, and we use
it to recommend files to view and users to follow.
37. Example: CF on 5 files
Vision Statement
Annual Report
Dilbert Comic
Darth Vader Cartoon
Disk Usage Report
38. View History Table
                 Annual  Vision     Dilbert  Darth Vader  Disk Usage
                 Report  Statement  Cartoon  Cartoon      Report
Miranda (CEO)      1       1          1         0            0
Bob (CFO)          1       1          1         0            0
Susan (Sales)      0       1          1         1            0
Chun (Sales)       0       0          1         1            0
Alice (IT)         0       0          1         1            1
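The table above can be re-encoded to compute the two quantities the algorithm needs — file popularities and pairwise co-view tallies. A minimal sketch, not the production MapReduce code:

```python
# The view-history matrix from the slide, columns in the same order.
files = ["Annual Report", "Vision Statement", "Dilbert Cartoon",
         "Darth Vader Cartoon", "Disk Usage Report"]
views = {
    "Miranda": [1, 1, 1, 0, 0],
    "Bob":     [1, 1, 1, 0, 0],
    "Susan":   [0, 1, 1, 1, 0],
    "Chun":    [0, 0, 1, 1, 0],
    "Alice":   [0, 0, 1, 1, 1],
}

# Popularity of a file = number of users who viewed it.
popularity = [sum(row[i] for row in views.values()) for i in range(len(files))]

# Relationship tally between two files = number of users who viewed both.
def tally(i, j):
    return sum(row[i] * row[j] for row in views.values())

print(popularity)   # Dilbert Cartoon was viewed by all 5 users
print(tally(2, 3))  # Dilbert / Darth Vader co-views
```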
39. Relationships Between the Files
[Diagram] The five files drawn as nodes — Annual Report, Vision Statement, Dilbert Cartoon, Darth Vader Cartoon, Disk Usage Report — with edges connecting files viewed by the same users.
40. Relationships Between the Files
[Diagram: co-view tallies between file pairs]
Annual Report – Vision Statement: 2
Annual Report – Dilbert Cartoon: 2
Vision Statement – Dilbert Cartoon: 3
Vision Statement – Darth Vader Cartoon: 1
Dilbert Cartoon – Darth Vader Cartoon: 3
Dilbert Cartoon – Disk Usage Report: 1
Darth Vader Cartoon – Disk Usage Report: 1
All other pairs: 0
41. Sorted Relationships for Each File
Annual Report:       Dilbert (2), Vision Stmt. (2)
Vision Statement:    Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon:     Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report:   Dilbert (1), Darth Vader (1)
The popularity problem: notice that Dilbert appears first in every list. This is
probably not what we want.
The solution: divide the relationship tallies by file popularities.
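Concretely, each tally is divided by the square root of the product of the two files' popularities — a cosine-style normalization, consistent with the sqrt(3/5) worked example later in the deck. A small sketch using the counts from this example:

```python
import math

# Popularities and co-view tallies from the example slides.
popularity = {"Annual": 2, "Vision": 3, "Dilbert": 5, "Vader": 3, "Disk": 1}
tallies = {("Dilbert", "Vader"): 3, ("Annual", "Vision"): 2,
           ("Dilbert", "Disk"): 1}

def normalized(pair):
    # Divide the tally by sqrt(popularity_a * popularity_b).
    a, b = pair
    return tallies[pair] / math.sqrt(popularity[a] * popularity[b])

print(round(normalized(("Dilbert", "Vader")), 2))   # 0.77
print(round(normalized(("Annual", "Vision")), 2))   # 0.82
print(round(normalized(("Dilbert", "Disk")), 2))    # 0.45
```

These reproduce the .77, .82, and .45 scores on the normalized-relationships slide.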
42. Normalized Relationships Between the Files
[Diagram: normalized similarity scores between file pairs]
Annual Report – Vision Statement: .82
Annual Report – Dilbert Cartoon: .63
Vision Statement – Dilbert Cartoon: .77
Vision Statement – Darth Vader Cartoon: .33
Dilbert Cartoon – Darth Vader Cartoon: .77
Dilbert Cartoon – Disk Usage Report: .45
Darth Vader Cartoon – Disk Usage Report: .58
All other pairs: 0
43. Sorted relationships for each file, normalized by file popularities
Annual Report:       Vision Stmt. (.82), Dilbert (.63)
Vision Statement:    Annual Report (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon:     Vision Stmt. (.77), Darth Vader (.77), Annual Report (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report:   Darth Vader (.58), Dilbert (.45)
High relationship tallies AND similar popularity values now drive closeness.
44. The Item-to-Item CF Algorithm
1) Compute file popularities
2) Compute relationship tallies and divide by file popularities
3) Sort and store the results
45. MapReduce Overview
Map Shuffle Reduce
(adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
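A toy simulation of the three phases in a few lines of Python — illustrative only, since the deck's actual jobs are written as Java MapReduce on Hadoop:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    # Map: each record emits zero or more (key, value) pairs.
    mapped = chain.from_iterable(mapper(r) for r in records)
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce: fold each group of values into a final result.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count, the canonical MapReduce example.
lines = ["hadoop pig hive", "pig hadoop", "hadoop"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["hadoop"])  # 3
```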
46. 1. Compute File Popularities
<user, file>
Inverse identity map
<file, List<user>>
Reduce
<file, (user count)>
Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
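The same step, simulated outside Hadoop with a handful of hypothetical (user, file) view events:

```python
from collections import defaultdict

# Input: (user, file) view events.
events = [("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
          ("Chun", "Dilbert"), ("Alice", "Dilbert"),
          ("Miranda", "Annual Report"), ("Bob", "Annual Report")]

# Inverse identity map: emit (file, user), then shuffle groups by file.
by_file = defaultdict(set)
for user, f in events:
    by_file[f].add(user)

# Reduce: popularity = distinct viewer count per file.
popularity = {f: len(users) for f, users in by_file.items()}
print(popularity["Dilbert"])        # 5
print(popularity["Annual Report"])  # 2
```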
50. 2b. Tally the Relationship Votes − Just a Word Count, Where Each
Relationship Occurrence is a Word
<(file1, file2), Integer(1)>
Identity map
<(file1, file2), List<Integer(1)>>
Reduce: count and divide
by popularities
<file1, (file2, similarity score)>, <file2, (file1, similarity score)>
Note that we emit each result twice,
once for each file in the relationship.
51. Example 2b: the Dilbert/Darth Vader Relationship
<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>,
<(Dilbert, Vader), Integer(1)>
Identity map
<(Dilbert, Vader), {1, 1, 1}>
Reduce: count and divide
by popularities
<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
52. 3. Sort and Store Results
<file1, (file2, similarity score)>
Identity map
<file1, List<(file2, similarity score)>>
Reduce
<file1, {top n similar files}>
Store the results in your location of choice
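A sketch of the sort-and-truncate step, using the normalized scores from the example files; the top-n cutoff of 3 is an assumption for illustration:

```python
# For each file, the shuffled-together (other file, similarity) pairs.
similar = {
    "Dilbert Cartoon": [("Vision Stmt.", .77), ("Disk Usage", .45),
                        ("Darth Vader", .77), ("Annual Report", .63)],
}

TOP_N = 3  # hypothetical cutoff

# Reduce: sort by descending similarity and keep the top n.
recommendations = {
    f: sorted(pairs, key=lambda p: p[1], reverse=True)[:TOP_N]
    for f, pairs in similar.items()
}
print(recommendations["Dilbert Cartoon"][0][1])  # 0.77
```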
54. Appendix
Cosine formula and normalization trick to avoid the distributed
cache
cos(θ_AB) = (A · B) / (|A| |B|) = (A / |A|) · (B / |B|)
Mahout has CF
Asymptotic order of the algorithm is O(M·N²) in the worst case, but it is helped by sparsity.
The Google File System is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Custom objects are custom database tables that allow you to store information unique to your organization.
The WSC (Web Service Connector) tool consumes the enterprise WSDL to put summary data back into Trend Metrics.