How Salesforce.com uses Hadoop


  Narayan Bharadwaj
  Data Science
      @nadubharadwaj

  Jed Crosby
  Data Science
      @JedCrosby

  #forcewebinar
                   Follow us @forcedotcom
Safe Harbor
  Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

  This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such
  uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ
  materially from the results expressed or implied by the forward-looking statements we make. All statements other than
  statements of historical fact could be deemed forward-looking, including any projections of product or service availability,
  subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of
  management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or
  technology developments and customer contracts or use of our services.

  The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and
  delivering new functionality for our service, new products and services, our new business model, our past operating losses,
  possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our
  security measures, the outcome of any litigation, risks associated with completed and any possible mergers and
  acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain,
  and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our
  limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further
  information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report
  on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most
  recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available
  on the SEC Filings section of the Investor Information section of our Web site.

  Any unreleased services or features referenced in this or other presentations, press releases or public statements are not
  currently available and may not be delivered on time or at all. Customers who purchase our services should make the
  purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does
  not intend to update these forward-looking statements.




Agenda

 §  Hadoop use cases
 §  Use case 1 - Product Metrics*
 §  Technology
 §  Use case 2 - Collaborative Filtering*
 §  Q&A




             *Every time you see the elephant, we will attempt to
             explain a Hadoop-related concept.


Got “Cloud Data”?




 §  130k customers
 §  Millions of users
 §  780 million transactions/day
 §  Terabytes/day
Hadoop Overview

 §  Started by Doug Cutting and Mike Cafarella (as part of Apache Nutch), then developed at Yahoo!
 §  Based on two Google papers
     –  Google File System (GFS): http://research.google.com/archive/gfs.html
     –  Google MapReduce: http://research.google.com/archive/mapreduce.html


 §  Hadoop is an open source Apache project
     –  Hadoop Distributed File System (HDFS)
     –  Distributed Processing Framework (MapReduce)


 §  Several related projects
     –  HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog




Hadoop use cases


 §  Product Metrics
 §  User behavior analysis
 §  Capacity planning
 §  Monitoring intelligence
 §  Performance analysis
 §  Security
 §  Ad-hoc log searches
 §  Collaborative Filtering
 §  Search Relevancy
Product Metrics
Product Metrics – Problem Statement



 §  Track feature usage/adoption across 130k+ customers
    –  E.g.: Accounts, Contacts, Visualforce, Apex,…


 §  Track standard metrics across all features
    –  E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime,…


 §  Track features and metrics across all channels
    –  API, UI, Mobile


 §  Primary audience: Executives, Product Managers

Data Pipeline

 §  Feature (What?)
 §  Feature Metadata (Instrumentation)
 §  Storage & Processing: Crunch it (How?)
 §  Daily Summary (Output)
 §  Fancy UI (Visualize)
 §  Collaborate & Iterate
Product Metrics Pipeline

 Diagram components:
 §  User Input (Page Layout), Collaboration (Chatter), Reports, Dashboards
 §  Workflow, Formula Fields
 §  Feature Metrics (Custom Object), Trend Metrics (Custom Object)
 §  API
 §  Client Machine: Java Program, Pig script generator
 §  Workflow, Log Pull
 §  Hadoop, Log Files
Feature Metrics (Custom Object)


Id      Feature Name     PM      Instrumentation     Metric1      Metric2     Metric3      Metric4   Status
F0001   Accounts         John    /001                #requests    #UniqOrgs   #UniqUsers   AvgRT     Dev
F0002   Contacts         Nancy   /003                #requests    #UniqOrgs   #UniqUsers   AvgRT     Review
F0003   API              Eric    A                   #requests    #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0004   Visualforce      Roger   V                   #requests    #UniqOrgs   #UniqUsers   AvgRT     Decom
F0005   Apex             Kim     axapx               #requests    #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0006   Custom Objects   Chun    /aXX                #requests    #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0008   Chatter          Jed     chcmd               #requests    #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0009   Reports          Steve   R                   #requests    #UniqOrgs   #UniqUsers   AvgRT     Deployed
User Input (Page Layout)
                                                    Formula
                                                    Field




                                                      Workflow
                                                      Rule




User Input (Child Custom Object)




                                                  Child
                                                  Objects




Apache Pig
Basic Pig script construct

  -- Define UDFs
  DEFINE GFV GetFieldValue('/path/to/udf/file');

  -- Load data
  A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();

  -- Filter data
  B = FILTER A BY GFV(*, 'logRecordType') == 'U';

  -- Extract fields
  C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……..

  -- Group
  G = GROUP C BY ……

  -- Compute output metrics
  O = FOREACH G {
      orgs = C.orgId;
      uniqueOrgs = DISTINCT orgs;
      GENERATE group, COUNT(uniqueOrgs);
  }

  -- Store or dump results
  STORE O INTO '/path/to/user/output';
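
For readers who do not know Pig, the same filter / extract / group / aggregate flow can be sketched in plain Python. This is only an illustration: the record layout (dicts with `feature` and `responseTimeMs` fields) and the choice to group by `feature` are assumptions, since the slide elides the grouping key and the remaining extracted fields.

```python
from collections import defaultdict

# Hypothetical parsed log records. Only logRecordType, orgId and userId
# appear on the slide; 'feature' and 'responseTimeMs' are assumed here.
records = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1", "responseTimeMs": 120},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2", "responseTimeMs": 80},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3", "responseTimeMs": 100},
    {"logRecordType": "A", "feature": "API", "orgId": "org2", "userId": "u3", "responseTimeMs": 40},
]


def feature_metrics(rows, record_type="U"):
    """Compute the four standard metrics per feature for one record type."""
    # FILTER: keep only rows of the requested log record type.
    kept = [r for r in rows if r["logRecordType"] == record_type]

    # GROUP: bucket the surviving rows by the (assumed) grouping key.
    groups = defaultdict(list)
    for r in kept:
        groups[r["feature"]].append(r)

    # FOREACH ... GENERATE: one output row of metrics per group.
    return {
        feature: {
            "requests": len(rs),
            "unique_orgs": len({r["orgId"] for r in rs}),
            "unique_users": len({r["userId"] for r in rs}),
            "avg_response_ms": sum(r["responseTimeMs"] for r in rs) / len(rs),
        }
        for feature, rs in groups.items()
    }


metrics = feature_metrics(records)
```

The set comprehensions play the role of Pig's nested `DISTINCT`; on a cluster the grouping would be done by the shuffle rather than by an in-memory dict.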



Java Pig Script Generator (Client)
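
The slide shows the generator only as a screenshot. As a rough sketch of the idea (written in Python rather than Java, and not the actual Salesforce program), a generator could fill a script template from one Feature Metrics row. The `uri` field, the `MATCHES` filter, and the function and parameter names below are all hypothetical stand-ins for however the real tool applies a feature's instrumentation value.

```python
from string import Template

# Template mirroring the basic Pig construct from the earlier slide.
# The 'uri' field and the MATCHES-based filter are assumptions made
# for this sketch; the real generator's output is not shown in the deck.
PIG_TEMPLATE = Template("""\
DEFINE GFV GetFieldValue('$udf_path');
A = LOAD '$input_path' USING PigStorage();
B = FILTER A BY GFV(*, 'uri') MATCHES '$pattern';
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId;
G = GROUP C ALL;
O = FOREACH G {
    uniqueOrgs = DISTINCT C.orgId;
    uniqueUsers = DISTINCT C.userId;
    GENERATE '$feature_id', COUNT(C), COUNT(uniqueOrgs), COUNT(uniqueUsers);
}
STORE O INTO '$output_path/$feature_id';
""")


def generate_pig_script(feature, udf_path, input_path, output_path):
    """Render a Pig script for one row of the Feature Metrics object."""
    return PIG_TEMPLATE.substitute(
        udf_path=udf_path,
        input_path=input_path,
        output_path=output_path,
        feature_id=feature["Id"],
        # Treat the instrumentation value as a prefix to match.
        pattern=feature["Instrumentation"] + ".*",
    )


script = generate_pig_script(
    {"Id": "F0001", "Feature Name": "Accounts", "Instrumentation": "/001"},
    udf_path="/path/to/udf/file",
    input_path="/path/to/cloud/data/log/files",
    output_path="/path/to/user/output",
)
```

Driving the script from metadata is what lets product managers add a feature by editing a record instead of writing Pig by hand.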




Trend Metrics (Custom Object)



Id      Date         #Requests   #Unique Orgs   #Unique Users   Avg ResponseTime
F0001   06/01/2012   <big>       <big>          <big>           <little>
F0002   06/01/2012   <big>       <big>          <big>           <little>
F0003   06/01/2012   <big>       <big>          <big>           <little>
F0001   06/02/2012   <big>       <big>          <big>           <little>
F0002   06/02/2012   <big>       <big>          <big>           <little>
F0003   06/03/2012   <big>       <big>          <big>           <little>
Upload to Trend Metrics (Custom Object)




Visualization (Reports & Dashboards)




Collaborate, Iterate (Chatter)




Recap

 Diagram components:
 §  User Input (Page Layout), Collaboration (Chatter), Reports, Dashboards
 §  Workflow, Formula Fields
 §  Feature Metrics (Custom Object), Trend Metrics (Custom Object)
 §  API
 §  Client Machine: Java Program, Pig script generator
 §  Workflow, Log Pull
 §  Hadoop, Log Files
Technology
Hadoop ecosystem




      Apache Hadoop
      Version=0.20.2




Contributions

     @pRaShAnT1784 : Prashant Kommireddi




    Lars Hofhansl                         @thefutureian : Ian Varley




Data Science tools ecosystem




       Apache Pig
       Version=0.9.1




Collaborative Filtering
Collaborative Filtering – Problem Statement




 §  Show similar files within an organization
    –  Content-based approach
    –  Community-based approach




Popular File




Related File




We found this relationship using item-to-item collaborative filtering




 §  Amazon published this algorithm in 2003.
    –  Amazon.com Recommendations: Item-to-Item Collaborative Filtering,
       by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet
       Computing, January-February 2003.

 §  At Salesforce, we adapted this algorithm for Hadoop,
     and we use it to recommend files to view and users to
     follow.




Example: CF on 5 files

 §  Annual Report
 §  Vision Statement
 §  Dilbert Cartoon
 §  Darth Vader Cartoon
 §  Disk Usage Report
View History Table




                  Annual   Vision      Dilbert   Darth Vader   Disk Usage
                  Report   Statement   Cartoon   Cartoon       Report
 Miranda (CEO)    1        1           1         0             0
 Bob (CFO)        1        1           1         0             0
 Susan (Sales)    0        1           1         1             0
 Chun (Sales)     0        0           1         1             0
 Alice (IT)       0        0           1         1             1
Relationships between the files




 Number of users who viewed both files:

                       Annual   Vision      Dilbert   Darth Vader   Disk Usage
                       Report   Statement   Cartoon   Cartoon       Report
 Annual Report         -        2           2         0             0
 Vision Statement      2        -           3         1             0
 Dilbert Cartoon       2        3           -         3             1
 Darth Vader Cartoon   0        1           3         -             1
 Disk Usage Report     0        0           1         1             -
Sorted relationships for each file




Annual Report      Vision Statement   Dilbert Cartoon    Darth Vader Cartoon   Disk Usage Report
Dilbert (2)        Dilbert (3)        Vision Stmt. (3)   Dilbert (3)           Dilbert (1)
Vision Stmt. (2)   Annual Rpt. (2)    Darth Vader (3)    Vision Stmt. (1)      Darth Vader (1)
                   Darth Vader (1)    Annual Rpt. (2)    Disk Usage (1)
                                      Disk Usage (1)


              The popularity problem: Dilbert appears first in every other file's list, simply because everyone viewed it.
              This is probably not what we want.


              The solution: divide each relationship tally by the square root of the product of the two files' popularities.
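
The normalization can be checked in memory on the five-file example. The sketch below assumes the divisor is the square root of the product of the two files' popularities (the cosine form given in the Appendix); the variable names are illustrative and file names are shortened.

```python
from itertools import combinations
from math import sqrt

# View history table from the example, as user -> set of viewed files.
views = {
    "Miranda": {"Annual Report", "Vision Statement", "Dilbert"},
    "Bob": {"Annual Report", "Vision Statement", "Dilbert"},
    "Susan": {"Vision Statement", "Dilbert", "Darth Vader"},
    "Chun": {"Dilbert", "Darth Vader"},
    "Alice": {"Dilbert", "Darth Vader", "Disk Usage"},
}

popularity = {}  # file -> number of users who viewed it
tally = {}       # (file_a, file_b) -> number of users who viewed both

for files in views.values():
    for f in files:
        popularity[f] = popularity.get(f, 0) + 1
    # Sorting the files keeps each pair in a single canonical order.
    for a, b in combinations(sorted(files), 2):
        tally[(a, b)] = tally.get((a, b), 0) + 1

# Divide each tally by sqrt(popularity_a * popularity_b).
normalized = {
    (a, b): count / sqrt(popularity[a] * popularity[b])
    for (a, b), count in tally.items()
}
```

Rounded to two decimals, these scores match the values on the normalized slides, for example .82 for Annual Report / Vision Statement and .63 for Annual Report / Dilbert.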



Normalized relationships between the files



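                       Annual   Vision      Dilbert   Darth Vader   Disk Usage
                       Report   Statement   Cartoon   Cartoon       Report
 Annual Report         -        .82         .63       0             0
 Vision Statement      .82      -           .77       .33           0
 Dilbert Cartoon       .63      .77         -         .77           .45
 Darth Vader Cartoon   0        .33         .77       -             .58
 Disk Usage Report     0        0           .45       .58           -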
Sorted relationships for each file, normalized by file popularities




Annual Report        Vision Statement      Dilbert Cartoon       Darth Vader Cartoon   Disk Usage Report
Vision Stmt. (.82)   Annual Report (.82)   Darth Vader (.77)     Dilbert (.77)         Darth Vader (.58)
Dilbert (.63)        Dilbert (.77)         Vision Stmt. (.77)    Disk Usage (.58)      Dilbert (.45)
                     Darth Vader (.33)     Annual Report (.63)   Vision Stmt. (.33)
                                           Disk Usage (.45)


          High relationship tallies AND similar popularity values now drive closeness.
The item-to-item CF algorithm




 1)  Compute file popularities
 2)  Compute relationship tallies and divide by file
     popularities
 3)  Sort and store the results




MapReduce Overview
    Map                        Shuffle                       Reduce




      (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
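
The three phases can be imitated on one machine with a few lines of Python. This is a teaching sketch, not Hadoop's API: `map_reduce`, `word_mapper` and `count_reducer` are names invented here, and a real cluster distributes each phase across nodes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a single-process imitation of Map, Shuffle and Reduce."""
    shuffled = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values with the same key end up in the same group.
            shuffled[key].append(value)
    output = []
    # Reduce: each key is processed once, with all of its values.
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def word_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
```

Word count is the standard first example because the reducer only has to sum a list of ones.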
1. Compute File Popularities



                                       <user, file>


                                                     Inverse identity map



                                    <file, List<user>>


                                                      Reduce



                                    <file, (user count)>


 Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
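
A single-process sketch of this step, using the Dilbert views from the example as input. The `map_reduce` helper is a local stand-in for the Hadoop framework and all function names are illustrative assumptions.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Local stand-in for Hadoop: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def inverse_identity_map(record):
    # Swap <user, file> to <file, user> so the shuffle groups by file.
    user, file = record
    yield file, user


def count_users(file, users):
    # The popularity of a file is the number of distinct users who viewed it.
    yield file, len(set(users))


views = [
    ("Miranda", "Dilbert"),
    ("Bob", "Dilbert"),
    ("Susan", "Dilbert"),
    ("Chun", "Dilbert"),
    ("Alice", "Dilbert"),
]

popularity = dict(map_reduce(views, inverse_identity_map, count_users))
```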


Example: File popularity for Dilbert




  (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)



                                                   Inverse identity map



                     <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>



                                                   Reduce



                                         (Dilbert, 5)




2a. Compute relationship tallies - find all relationships in view history table



                                <user, file>

                                             Identity map


                             <user, List<file>>

                                             Reduce


                         <(file1, file2), Integer(1)>,
                         <(file1, file3), Integer(1)>,
                         …
                         <(file(n-1), file(n)), Integer(1)>


           Relationships have their file IDs in alphabetical order to avoid double counting.
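
A single-process sketch of step 2a, run on Miranda's views from the example. The `map_reduce` helper is a local stand-in for the Hadoop framework and the function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def map_reduce(records, mapper, reducer):
    """Local stand-in for Hadoop: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def identity_map(record):
    # Pass <user, file> through unchanged so the shuffle groups by user.
    user, file = record
    yield user, file


def emit_pairs(user, files):
    # Sorting first puts every pair in alphabetical order, so (a, b) and
    # (b, a) can never both be emitted for the same relationship.
    for a, b in combinations(sorted(set(files)), 2):
        yield (a, b), 1


miranda_views = [
    ("Miranda", "Annual Report"),
    ("Miranda", "Vision Statement"),
    ("Miranda", "Dilbert"),
]

votes = map_reduce(miranda_views, identity_map, emit_pairs)
```

A user who viewed n files emits n(n-1)/2 votes, which is where the quadratic worst case mentioned in the Appendix comes from.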
Example 2a: Miranda’s (CEO) file relationship votes




     (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)


                                                Identity map


              <Miranda, {Annual Report, Vision Statement, Dilbert}>

                                                 Reduce


                      <(Annual Report, Dilbert), Integer(1)>,
                      <(Annual Report, Vision Statement), Integer(1)>,
                      <(Dilbert, Vision Statement), Integer(1)>




2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word




                              <(file1, file2), Integer(1)>


                                                   Identity map


                            <(file1, file2), List<Integer(1)>>



                                                   Reduce: count and
                                                   divide by popularities


          <file1, (file2, similarity score)>, <file2, (file1, similarity score)>


  Note that we emit each result twice, one for each file that belongs to a
  relationship.
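
A single-process sketch of step 2b for the Dilbert/Vader pair. The plain dict `POPULARITY` stands in for the table that step 1 places in the distributed cache; the helper and function names are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def map_reduce(records, mapper, reducer):
    """Local stand-in for Hadoop: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Output of step 1, held here in a dict instead of the distributed cache.
POPULARITY = {"Dilbert": 5, "Vader": 3}


def identity_map(record):
    pair, vote = record
    yield pair, vote


def score_pair(pair, votes):
    a, b = pair
    # Count the votes, then divide by sqrt(popularity_a * popularity_b).
    score = sum(votes) / sqrt(POPULARITY[a] * POPULARITY[b])
    # Emit once per file so each file can later collect its own neighbours.
    yield a, (b, score)
    yield b, (a, score)


votes = [(("Dilbert", "Vader"), 1)] * 3
scores = map_reduce(votes, identity_map, score_pair)
```

Here 3 / sqrt(5 * 3) equals sqrt(3/5), roughly .77, the value shown in the slide's example.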
Example 2b: the Dilbert/Darth Vader relationship




                           <(Dilbert, Vader), Integer(1)>,
                           <(Dilbert, Vader), Integer(1)>,
                           <(Dilbert, Vader), Integer(1)>


                                                Identity map


                           <(Dilbert, Vader), {1, 1, 1}>



                                                Reduce: count and
                                                divide by popularities


            <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>




3. Sort and store results



                        <file1, (file2, similarity score)>


                                                Identity map



                     <file1, List<(file2, similarity score)>>


                                                Reduce


                          <file1, {top n similar files}>




                  Store the results in your location of choice
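
A single-process sketch of step 3, run on Dilbert's scores from the example. The alphabetical tie-break is an assumption added so that equal scores sort deterministically; the helper and function names are illustrative.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Local stand-in for Hadoop: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def identity_map(record):
    file, neighbour = record
    yield file, neighbour


def make_top_n_reducer(n):
    def top_n(file, neighbours):
        # Highest score first; ties broken by file name (an assumption).
        ranked = sorted(neighbours, key=lambda pair: (-pair[1], pair[0]))
        yield file, [name for name, _ in ranked[:n]]
    return top_n


dilbert_scores = [
    ("Dilbert", ("Annual Report", 0.63)),
    ("Dilbert", ("Vision Statement", 0.77)),
    ("Dilbert", ("Disk Usage", 0.45)),
    ("Dilbert", ("Darth Vader", 0.77)),
]

top_files = map_reduce(dilbert_scores, identity_map, make_top_n_reducer(2))
```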


Example 3: Sorting the results for Dilbert


                               <Dilbert, (Annual Report, .63)>,
                               <Dilbert, (Vision Statement, .77)>,
                               <Dilbert, (Disk Usage, .45)>,
                               <Dilbert, (Darth Vader, .77)>


                                                      Identity map


<Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>


                                                      Reduce


                  <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)




                                        Store results
Appendix




§  Cosine formula and normalization trick to avoid the
    distributed cache

              $$\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$$

§  Mahout has CF
§  Asymptotic order of the algorithm is O(M*N^2) in worst
    case, but is helped by sparsity.
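
The slide does not spell out the normalization trick. One reading, sketched below, is that the second form of the cosine formula lets each file's view vector be scaled by its own norm up front, so the pairwise step no longer needs a popularity lookup. The vectors are columns of the view history table; the function names are illustrative.

```python
from math import isclose, sqrt

# Columns of the view history table as 0/1 vectors
# (rows: Miranda, Bob, Susan, Chun, Alice).
dilbert = [1, 1, 1, 1, 1]
vader = [0, 0, 1, 1, 1]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    # For a 0/1 vector this is sqrt(popularity).
    return sqrt(dot(a, a))


def unit(a):
    n = norm(a)
    return [x / n for x in a]


# First form: dot product divided by the product of the norms.
cos_direct = dot(dilbert, vader) / (norm(dilbert) * norm(vader))

# Second form: normalize each vector first, then take the dot product.
cos_prenormalized = dot(unit(dilbert), unit(vader))
```

Both forms give sqrt(3/5) for this pair, the same score as in Example 2b.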




Summary




          Hadoop                                       Cloud Data




    Hadoop + Force.com =                        Recommendation algorithms




@forcedotcom / #forcewebinar


Developer Force Group


facebook.com/forcedotcom


Developer Force – Force.com
Community

Upcoming Events

§  June 26 – Mobile CodeTalk
   –  http://bit.ly/mct-wr


§  June 27 – Painless Mobile App
    Development
   –  http://bit.ly/mobileapp-hp




                             http://bit.ly/mdc-hp
Q&A
                     http://bit.ly/hadoopsurvey

Narayan Bharadwaj    Jed Crosby            Prashant Kommireddi   Santosh Rau
@nadubharadwaj       @JedCrosby            @pRaShAnT1784         @santoshrau

                              @SalesforceEng

How salesforce.com Uses Hadoop WebinarHow salesforce.com Uses Hadoop Webinar
How salesforce.com Uses Hadoop Webinar
 
How Salesforce.com Uses Hadoop
How Salesforce.com Uses HadoopHow Salesforce.com Uses Hadoop
How Salesforce.com Uses Hadoop
 
Dreamforce_2012_Hadoop_Use_Cases
Dreamforce_2012_Hadoop_Use_CasesDreamforce_2012_Hadoop_Use_Cases
Dreamforce_2012_Hadoop_Use_Cases
 
Hadoop + Forcedotcom = Like
Hadoop + Forcedotcom = LikeHadoop + Forcedotcom = Like
Hadoop + Forcedotcom = Like
 
Hadoop Summit San Diego Feb2013
Hadoop Summit San Diego Feb2013Hadoop Summit San Diego Feb2013
Hadoop Summit San Diego Feb2013
 
SWIMing in a Standards Soup
SWIMing in a Standards SoupSWIMing in a Standards Soup
SWIMing in a Standards Soup
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
 
Webinar september 2013
Webinar september 2013Webinar september 2013
Webinar september 2013
 
JamfNation Roadshow Frankfurt-2019 - Security & Business Intelligence
JamfNation Roadshow Frankfurt-2019 - Security & Business IntelligenceJamfNation Roadshow Frankfurt-2019 - Security & Business Intelligence
JamfNation Roadshow Frankfurt-2019 - Security & Business Intelligence
 
Open Source World : Using Web Technologies to build native iPhone and Android...
Open Source World : Using Web Technologies to build native iPhone and Android...Open Source World : Using Web Technologies to build native iPhone and Android...
Open Source World : Using Web Technologies to build native iPhone and Android...
 
Using Web Technologies to Build Native iPhone & Android Applications
Using Web Technologies to Build Native iPhone & Android ApplicationsUsing Web Technologies to Build Native iPhone & Android Applications
Using Web Technologies to Build Native iPhone & Android Applications
 
Social ent. with java on heroku
Social ent. with java on herokuSocial ent. with java on heroku
Social ent. with java on heroku
 
Social Enterprise Java Apps on Heroku Webinar
Social Enterprise Java Apps on Heroku WebinarSocial Enterprise Java Apps on Heroku Webinar
Social Enterprise Java Apps on Heroku Webinar
 
Lean product management for web2.0 by Sujoy Bhatacharjee, April
Lean product management for web2.0 by Sujoy Bhatacharjee, April Lean product management for web2.0 by Sujoy Bhatacharjee, April
Lean product management for web2.0 by Sujoy Bhatacharjee, April
 
Data Mining with SpagoBI suite
Data Mining with SpagoBI suiteData Mining with SpagoBI suite
Data Mining with SpagoBI suite
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web Applications
 
AI and ML Series - Generative Extraction and Classification of Documents in S...
AI and ML Series - Generative Extraction and Classification of Documents in S...AI and ML Series - Generative Extraction and Classification of Documents in S...
AI and ML Series - Generative Extraction and Classification of Documents in S...
 
Agados POC Report to Build/Rebuild for ERP PKG
Agados POC Report to Build/Rebuild for ERP PKG Agados POC Report to Build/Rebuild for ERP PKG
Agados POC Report to Build/Rebuild for ERP PKG
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult Steps
 
Introduction To Jira Slide Share
Introduction To Jira Slide ShareIntroduction To Jira Slide Share
Introduction To Jira Slide Share
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

How Salesforce.com uses Hadoop

  • 1. How Salesforce.com uses Hadoop Narayan Bharadwaj Data Science @nadubharadwaj Jed Crosby Data Science @JedCrosby #forcewebinar Follow us @forcedotcom
  • 2. Safe Harbor
      Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

      This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

      The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year ended January 31, 2011 and in our quarterly report on Form 10-Q for the most recent fiscal quarter ended October 31, 2011. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.

      Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
  • 3. Agenda
      - Hadoop use cases
      - Use case 1 - Product Metrics*
      - Technology
      - Use case 2 - Collaborative Filtering*
      - Q&A
      *Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
  • 4. Got “Cloud Data”?
      - 130k customers
      - 780 million transactions/day
      - Millions of users
      - Terabytes/day
  • 5. Hadoop Overview
      - Started by Doug Cutting at Yahoo!
      - Based on two Google papers
          - Google File System (GFS): http://research.google.com/archive/gfs.html
          - Google MapReduce: http://research.google.com/archive/mapreduce.html
      - Hadoop is an open source Apache project
          - Hadoop Distributed File System (HDFS)
          - Distributed Processing Framework (MapReduce)
      - Several related projects
          - HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
  • 6. Hadoop use cases
      - User behavior analysis
      - Product Metrics
      - Capacity planning
      - Monitoring intelligence
      - Performance analysis
      - Security
      - Ad-hoc log searches
      - Collaborative Filtering
      - Search Relevancy
  • 8. Product Metrics – Problem Statement
      - Track feature usage/adoption across 130k+ customers
          - e.g. Accounts, Contacts, Visualforce, Apex, …
      - Track standard metrics across all features
          - e.g. #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …
      - Track features and metrics across all channels
          - API, UI, Mobile
      - Primary audience: Executives, Product Managers
  • 9. Data Pipeline
      [Diagram; stage labels: Feature (What?), Feature Metadata (Instrumentation), Crunch it (How?), Storage & Processing, Daily Summary (Output), Fancy UI (Visualize), Collaborate & Iterate]
  • 10. Product Metrics Pipeline
      [Diagram; component labels: User Input (Page Layout), Collaboration (Chatter), Reports and Dashboards, Formula Fields, Workflow, Feature Metrics (Custom Object), Trend Metrics (Custom Object), API (twice), Client Machine (Java Program, Pig script generator, Workflow, Log Pull), Hadoop, Log Files]
  • 11. Feature Metrics (Custom Object)
      Id     Feature Name    PM     Instrumentation  Metric1    Metric2    Metric3     Metric4  Status
      F0001  Accounts        John   /001             #requests  #UniqOrgs  #UniqUsers  AvgRT    Dev
      F0002  Contacts        Nancy  /003             #requests  #UniqOrgs  #UniqUsers  AvgRT    Review
      F0003  API             Eric   A                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0004  Visualforce     Roger  V                #requests  #UniqOrgs  #UniqUsers  AvgRT    Decom
      F0005  Apex            Kim    axapx            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0006  Custom Objects  Chun   /aXX             #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0008  Chatter         Jed    chcmd            #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
      F0009  Reports         Steve  R                #requests  #UniqOrgs  #UniqUsers  AvgRT    Deployed
  • 12. Feature Metrics (Custom Object)
  • 13. User Input (Page Layout): Formula Field, Workflow Rule
  • 14. User Input (Child Custom Object): Child Objects
  • 16. Basic Pig script construct
      -- Define UDFs
      DEFINE GFV GetFieldValue('/path/to/udf/file');
      -- Load data
      A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
      -- Filter data
      B = FILTER A BY GFV(row, 'logRecordType') == 'U';
      -- Extract fields (aliases added so that C.orgId below resolves)
      C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……..
      -- Group
      G = GROUP C BY ……
      -- Compute output metrics
      O = FOREACH G {
          orgs = C.orgId;
          uniqueOrgs = DISTINCT orgs;
          -- A nested FOREACH must end in GENERATE; the slide elides it, so this line is an assumption
          GENERATE group, COUNT(uniqueOrgs);
      }
      -- Store or Dump results
      STORE O INTO '/path/to/user/output';
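As a rough illustration of what the script's stages compute, the Python sketch below runs the same filter, group, and distinct-count sequence in memory. The record layout (dicts with `logRecordType`, `orgId`, `userId`), the `feature` grouping key, and the sample values are all assumptions made for illustration; the slide elides the real grouping key and does not show the log format.

```python
from collections import defaultdict

# Hypothetical parsed log records; the real log format is not shown in the slides.
LOG_RECORDS = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2"},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3"},
    {"logRecordType": "A", "feature": "API", "orgId": "org3", "userId": "u4"},
    {"logRecordType": "U", "feature": "Contacts", "orgId": "org2", "userId": "u3"},
]


def unique_orgs_per_group(records, record_type="U", group_key="feature"):
    """Filter by record type, group by group_key, and count distinct orgIds."""
    orgs_by_group = defaultdict(set)
    for rec in records:
        if rec["logRecordType"] != record_type:          # FILTER
            continue
        orgs_by_group[rec[group_key]].add(rec["orgId"])  # GROUP + DISTINCT
    return {group: len(orgs) for group, orgs in orgs_by_group.items()}  # COUNT


metrics = unique_orgs_per_group(LOG_RECORDS)
print(metrics)
```

The set per group plays the role of Pig's nested `DISTINCT`; on a real cluster that de-duplication happens inside each reducer rather than in one process.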
  • 17. Java Pig Script Generator (Client)
  • 18. Trend Metrics (Custom Object)
      Id     Date        #Requests  #Unique Orgs  #Unique Users  Avg ResponseTime
      F0001  06/01/2012  <big>      <big>         <big>          <little>
      F0002  06/01/2012  <big>      <big>         <big>          <little>
      F0003  06/01/2012  <big>      <big>         <big>          <little>
      F0001  06/02/2012  <big>      <big>         <big>          <little>
      F0002  06/02/2012  <big>      <big>         <big>          <little>
      F0003  06/03/2012  <big>      <big>         <big>          <little>
  • 19. Upload to Trend Metrics (Custom Object)
  • 20. Visualization (Reports & Dashboards)
  • 21. Visualization (Reports & Dashboards)
  • 22. Collaborate, Iterate (Chatter)
  • 23. Recap
      [Diagram; component labels: User Input (Page Layout), Collaboration (Chatter), Reports and Dashboards, Formula Fields, Workflow, Feature Metrics (Custom Object), Trend Metrics (Custom Object), API (twice), Client Machine (Java Program, Pig script generator, Workflow, Log Pull), Hadoop, Log Files]
  • 25. Hadoop ecosystem
      Apache Hadoop, Version=0.20.2
  • 26. Contributions
      - Prashant Kommireddi (@pRaShAnT1784)
      - Lars Hofhansl
      - Ian Varley (@thefutureian)
  • 27. Data Science tools ecosystem
      Apache Pig, Version=0.9.1
  • 29. Collaborative Filtering – Problem Statement
      - Show similar files within an organization
          - Content-based approach
          - Community-based approach
  • 30. Popular File
  • 31. Related File
  • 32. We found this relationship using item-to-item collaborative filtering
      - Amazon published this algorithm in 2003.
          - Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.
      - At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  • 33. Example: CF on 5 files
      - Vision Statement
      - Annual Report
      - Dilbert Comic
      - Darth Vader Cartoon
      - Disk Usage Report
  • 34. View History Table
                       Annual Report  Vision Statement  Dilbert Cartoon  Darth Vader Cartoon  Disk Usage Report
      Miranda (CEO)    1              1                 1                0                    0
      Bob (CFO)        1              1                 1                0                    0
      Susan (Sales)    0              1                 1                1                    0
      Chun (Sales)     0              0                 1                1                    0
      Alice (IT)       0              0                 1                1                    1
  • 35. Relationships between the files
      [Graph; one node per file: Annual Report, Vision Statement, Dilbert Cartoon, Darth Vader Cartoon, Disk Usage Report]
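  • 36. Relationships between the files
      [Graph; each edge is labeled with the number of users who viewed both files, as counted from the view history table]
      - Annual Report / Vision Statement: 2
      - Annual Report / Dilbert Cartoon: 2
      - Annual Report / Darth Vader Cartoon: 0
      - Annual Report / Disk Usage Report: 0
      - Vision Statement / Dilbert Cartoon: 3
      - Vision Statement / Darth Vader Cartoon: 1
      - Vision Statement / Disk Usage Report: 0
      - Dilbert Cartoon / Darth Vader Cartoon: 3
      - Dilbert Cartoon / Disk Usage Report: 1
      - Darth Vader Cartoon / Disk Usage Report: 1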
  • 37. Sorted relationships for each file
      Annual Report     Vision Statement  Dilbert Cartoon   Darth Vader Cartoon  Disk Usage Report
      Dilbert (2)       Dilbert (3)       Vision Stmt. (3)  Dilbert (3)          Dilbert (1)
      Vision Stmt. (2)  Annual Rpt. (2)   Darth Vader (3)   Vision Stmt. (1)     Darth Vader (1)
                        Darth Vader (1)   Annual Rpt. (2)   Disk Usage (1)
                                          Disk Usage (1)
      The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want.
      The solution: divide the relationship tallies by file popularities.
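  • 38. Normalized relationships between the files
      [Graph; each edge is labeled with the tally divided by the file popularities]
      - Annual Report / Vision Statement: .82
      - Annual Report / Dilbert Cartoon: .63
      - Annual Report / Darth Vader Cartoon: 0
      - Annual Report / Disk Usage Report: 0
      - Vision Statement / Dilbert Cartoon: .77
      - Vision Statement / Darth Vader Cartoon: .33
      - Vision Statement / Disk Usage Report: 0
      - Dilbert Cartoon / Darth Vader Cartoon: .77
      - Dilbert Cartoon / Disk Usage Report: .45
      - Darth Vader Cartoon / Disk Usage Report: .58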
  • 39. Sorted relationships for each file, normalized by file popularities
      Annual Report       Vision Statement     Dilbert Cartoon      Darth Vader Cartoon  Disk Usage Report
      Vision Stmt. (.82)  Annual Report (.82)  Darth Vader (.77)    Dilbert (.77)        Darth Vader (.58)
      Dilbert (.63)       Dilbert (.77)        Vision Stmt. (.77)   Disk Usage (.58)     Dilbert (.45)
                          Darth Vader (.33)    Annual Report (.63)  Vision Stmt. (.33)
                                               Disk Usage (.45)
      High relationship tallies AND similar popularity values now drive closeness.
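The normalized values can be reproduced from the view history table with a few lines of Python. This is only an in-memory sketch, not the Hadoop implementation from the talk. The slides do not spell out the divisor; dividing each tally by the square root of the product of the two popularities reproduces the numbers shown (and matches the sqrt(3/5) in the Dilbert/Darth Vader example), so that form is assumed here. The names `VIEWS` and `similarities` are illustrative.

```python
from itertools import combinations
from math import sqrt

# View history from the example table: user -> set of files viewed.
VIEWS = {
    "Miranda": {"Annual Report", "Vision Statement", "Dilbert"},
    "Bob": {"Annual Report", "Vision Statement", "Dilbert"},
    "Susan": {"Vision Statement", "Dilbert", "Darth Vader"},
    "Chun": {"Dilbert", "Darth Vader"},
    "Alice": {"Dilbert", "Darth Vader", "Disk Usage"},
}


def similarities(views):
    """Return {(file_a, file_b): tally / sqrt(pop_a * pop_b)} for co-viewed pairs."""
    popularity = {}
    tally = {}
    for files in views.values():
        for f in files:
            popularity[f] = popularity.get(f, 0) + 1
        # Sort the files so (a, b) and (b, a) count as one relationship.
        for pair in combinations(sorted(files), 2):
            tally[pair] = tally.get(pair, 0) + 1
    return {
        (a, b): count / sqrt(popularity[a] * popularity[b])
        for (a, b), count in tally.items()
    }


scores = similarities(VIEWS)
for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{a} / {b}: {score:.2f}")
```

Pairs that no user viewed together never receive a vote, so they are simply absent from `scores` rather than stored as zeros. That sparsity is what keeps the pair count manageable on real data.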
  • 40. The item-to-item CF algorithm
      1) Compute file popularities
      2) Compute relationship tallies and divide by file popularities
      3) Sort and store the results
  • 41. MapReduce Overview
      [Diagram; stages: Map, Shuffle, Reduce]
      (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
  • 42. 1. Compute File Popularities
      <user, file>
        -> Inverse identity map ->
      <file, List<user>>
        -> Reduce ->
      <file, (user count)>
      Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  • 43. Example: File popularity for Dilbert
      (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
        -> Inverse identity map ->
      <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
        -> Reduce ->
      (Dilbert, 5)
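The popularity job can be imitated in plain Python to make the map, shuffle, and reduce phases concrete. This is a single-process stand-in, not Hadoop code: `map_reduce`, `inverse_identity_map`, and `count_users` are hypothetical names, and the shuffle is just a dictionary of lists.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def inverse_identity_map(record):
    user, file = record
    yield file, user              # swap key and value: <user, file> -> <file, user>


def count_users(file, users):
    yield file, len(set(users))   # popularity = number of distinct viewers


VIEWS = [
    ("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
    ("Chun", "Dilbert"), ("Alice", "Dilbert"), ("Alice", "Disk Usage"),
]
popularity = dict(map_reduce(VIEWS, inverse_identity_map, count_users))
print(popularity)
```

Swapping key and value in the mapper is all the "inverse identity map" does; the shuffle then delivers every viewer of a file to the same reducer call.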
  • 44. 2a. Compute relationship tallies - find all relationships in view history table
      <user, file>
        -> Identity map ->
      <user, List<file>>
        -> Reduce ->
      <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>
      Relationships have their file IDs in alphabetical order to avoid double counting.
  • 45. Example 2a: Miranda’s (CEO) file relationship votes
      (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
        -> Identity map ->
      <Miranda, {Annual Report, Vision Statement, Dilbert}>
        -> Reduce ->
      <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
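Step 2a amounts to grouping views by user and emitting every unordered pair of that user's files once. A minimal Python sketch, assuming file names stand in for file IDs (`relationship_votes` is a hypothetical helper, and the grouping dictionary plays the role of the shuffle):

```python
from collections import defaultdict
from itertools import combinations


def relationship_votes(view_records):
    """Group <user, file> records by user, then emit one vote per file pair."""
    files_by_user = defaultdict(set)
    for user, file in view_records:       # identity map + shuffle by user
        files_by_user[user].add(file)
    votes = []
    for user in sorted(files_by_user):    # reduce: one call per user
        # Sorting the file IDs puts each pair in alphabetical order,
        # so (a, b) and (b, a) can never both be emitted.
        for pair in combinations(sorted(files_by_user[user]), 2):
            votes.append((pair, 1))
    return votes


MIRANDA = [
    ("Miranda", "Annual Report"),
    ("Miranda", "Vision Statement"),
    ("Miranda", "Dilbert"),
]
votes = relationship_votes(MIRANDA)
for vote in votes:
    print(vote)
```

A user who viewed n files produces n(n-1)/2 votes, which is where the quadratic worst case noted in the appendix comes from.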
  • 46. 2b. Tally the relationship votes - just a word count, where each relationship occurrence is a word
      <(file1, file2), Integer(1)>
        -> Identity map ->
      <(file1, file2), List<Integer(1)>>
        -> Reduce: count and divide by popularities ->
      <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
      Note that we emit each result twice, once for each file that belongs to a relationship.
  • 47. Example 2b: the Dilbert/Darth Vader relationship
      <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
        -> Identity map ->
      <(Dilbert, Vader), {1, 1, 1}>
        -> Reduce: count and divide by popularities ->
      <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
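The reduce side of step 2b can be sketched as a word count followed by a division. The divisor is an assumption: the slide only says "divide by popularities", and sqrt(pop1 * pop2) is used below because it yields the sqrt(3/5) shown for Dilbert and Darth Vader (popularities 5 and 3, three shared viewers). `score_relationships` and the `POPULARITY` lookup are hypothetical names; in the talk the popularity table comes from the distributed cache.

```python
from collections import Counter
from math import sqrt


def score_relationships(votes, popularity):
    """Count votes per pair, normalize, and emit the score once per file."""
    tallies = Counter()
    for pair, one in votes:
        tallies[pair] += one
    results = []
    for (file1, file2), tally in sorted(tallies.items()):
        # Assumed normalization: tally / sqrt(pop1 * pop2), i.e. cosine similarity.
        score = tally / sqrt(popularity[file1] * popularity[file2])
        results.append((file1, (file2, score)))
        results.append((file2, (file1, score)))
    return results


# Popularities as produced by step 1 (here a plain dict instead of a cache lookup).
POPULARITY = {"Dilbert": 5, "Vader": 3}
VOTES = [(("Dilbert", "Vader"), 1)] * 3

results = score_relationships(VOTES, POPULARITY)
print(results)
```

Emitting the score under both file keys is what lets the next job group by a single file and see all of its neighbours.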
  • 48. 3. Sort and store results
      <file1, (file2, similarity score)>
        -> Identity map ->
      <file1, List<(file2, similarity score)>>
        -> Reduce ->
      <file1, {top n similar files}>
      Store the results in your location of choice.
  • 49. Example 3: Sorting the results for Dilbert
      <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
        -> Identity map ->
      <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
        -> Reduce ->
      <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)
      Store results
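The final step is a group-by-file followed by a top-n selection. A short Python sketch of that reduce logic, with `top_n_similar` as a hypothetical helper name and the scores taken from the Dilbert example:

```python
from collections import defaultdict


def top_n_similar(scored, n):
    """Group (file, (other, score)) records by file and keep the n best."""
    by_file = defaultdict(list)
    for file, (other, score) in scored:
        by_file[file].append((other, score))
    top = {}
    for file, neighbours in by_file.items():
        # Highest score first; sorted() is stable, so ties keep input order.
        ranked = sorted(neighbours, key=lambda pair: pair[1], reverse=True)
        top[file] = [other for other, _ in ranked[:n]]
    return top


DILBERT = [
    ("Dilbert", ("Annual Report", 0.63)),
    ("Dilbert", ("Vision Statement", 0.77)),
    ("Dilbert", ("Disk Usage", 0.45)),
    ("Dilbert", ("Darth Vader", 0.77)),
]
top = top_n_similar(DILBERT, n=2)
print(top)
```

Vision Statement and Darth Vader tie at .77, so which of the two is listed first depends on input order; the slide treats the result as an unordered set.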
  • 50. Appendix
      - Cosine formula and normalization trick to avoid the distributed cache:
        $\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$
      - Mahout has CF
      - Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.
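The cosine formula ties back to the tallies and popularities used earlier. The short derivation below assumes each file is represented by a 0/1 vector with one component per user, which is how the view history table reads; the slide itself does not state the representation, and the reading of the "trick" in the last line is an interpretation rather than something the slides spell out.

```latex
% A_u = 1 if user u viewed file A, else 0 (assumed binary view vectors).
\begin{align*}
A \cdot B &= \sum_u A_u B_u = \text{number of users who viewed both files (the relationship tally)} \\
\|A\|     &= \sqrt{\sum_u A_u^2} = \sqrt{\operatorname{pop}(A)} \\
\cos\theta_{AB} &= \frac{A \cdot B}{\|A\|\,\|B\|}
                 = \frac{\text{tally}(A,B)}{\sqrt{\operatorname{pop}(A)\,\operatorname{pop}(B)}} \\
\text{Dilbert / Darth Vader:}\quad
\cos\theta &= \frac{3}{\sqrt{5 \cdot 3}} = \sqrt{\tfrac{3}{5}} \approx 0.77 \\
\text{Pre-normalized form:}\quad
\cos\theta_{AB} &= \sum_u \frac{A_u}{\sqrt{\operatorname{pop}(A)}} \cdot \frac{B_u}{\sqrt{\operatorname{pop}(B)}}
\end{align*}
```

In the pre-normalized form each view carries the weight 1/sqrt(pop(file)), so the pair reducer only has to sum products of weights. Under that reading, no popularity lookup is needed at reduce time, which would explain how the distributed cache is avoided.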
  • 51. Summary
      Hadoop, Cloud Data, Hadoop + Force.com = Recommendation algorithms
  • 52. @forcedotcom / #forcewebinar
      - Developer Force Group
      - facebook.com/forcedotcom
      - Developer Force – Force.com Community
  • 53. Upcoming Events
      - June 26 – Mobile CodeTalk – http://bit.ly/mct-wr
      - June 27 – Painless Mobile App Development – http://bit.ly/mobileapp-hp
      - http://bit.ly/mdc-hp
  • 54. Q&A
      http://bit.ly/hadoopsurvey
      - Narayan Bharadwaj (@nadubharadwaj)
      - Jed Crosby (@JedCrosby)
      - Prashant Kommireddi (@pRaShAnT1784)
      - Santosh Rau (@santoshrau)
      @SalesforceEng