Hadoop and Hive at Facebook
    Data and Applications
       Dhruba Borthakur, Ding Zhou

           Wednesday, June 10, 2009 
                                    
             Santa Clara Marriott
                                 
Who generates this data?

Lots of data is generated on Facebook
»  200 million active users
»  20 million users update their statuses at least
   once each day
»  More than 850 million photos uploaded to the site
   each month
»  More than 8 million videos uploaded each month
»  More than 1 billion pieces of content (web links,
   news stories, blog posts, notes, photos, etc.)
   shared each week

 http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols
Where do we store parts of this data?


»  Hadoop/Hive Warehouse
   ›  4800 cores, 2 PetaBytes
      total size


»  Other Hadoop Clusters
   •  HDFS-Scribe cluster: 320 cores, 160 TB total size
   •  Hadoop Archival Cluster: 80 cores, 200 TB total size
   •  Test cluster: 800 cores, 150 TB total size
Data Collection using Scribe
»  Diagram components: Web Servers, Scribe MidTier, Network Storage and
   Servers, Oracle RAC, Hadoop Hive Warehouse, MySQL
Data Collection using Scribe and HDFS
»  Diagram components: Web Servers, Scribe MidTier, Realtime Hadoop
   Cluster, Oracle RAC, Hadoop Hive Warehouse, MySQL
»  Hadoop Scribe Integration
Data Archive: Move old data to cheap storage


»  Diagram components: Hadoop Warehouse, Hadoop Archive Node, Cheap NAS,
   Hadoop Archival Cluster (20 TB per node), Hive Query
»  Labeled links: distcp, NFS
»  Reference: HADOOP-5048
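The archive step can be sketched as choosing old date partitions and copying each one with distcp. This is a minimal sketch, not the deck's actual tooling. The `ds=<date>` directory layout, the table name, the cluster URIs, and the 90-day retention are all assumptions made for illustration. The only real command referenced is `hadoop distcp <source> <destination>`, and the sketch only builds the command line; it does not run it.

```python
from datetime import date, timedelta


def partitions_to_archive(partition_dates, today, keep_days):
    """Return the date partitions older than the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in partition_dates if d < cutoff)


def distcp_command(table, ds, src_root, dst_root):
    """Build (but do not run) a distcp command for one date partition.

    The ds=<date> layout and both root URIs are hypothetical.
    """
    part = f"{table}/ds={ds.isoformat()}"
    return ["hadoop", "distcp", f"{src_root}/{part}", f"{dst_root}/{part}"]


if __name__ == "__main__":
    partitions = [date(2009, 1, 1), date(2009, 3, 1), date(2009, 6, 1)]
    for ds in partitions_to_archive(partitions, date(2009, 6, 10), keep_days=90):
        # Hypothetical warehouse and archive cluster addresses.
        cmd = distcp_command("status_updates", ds,
                             "hdfs://warehouse/hive", "hdfs://archive/hive")
        print(" ".join(cmd))
```

Keeping partition selection separate from command construction makes the retention rule easy to test without touching a cluster.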
Hive User Interfaces



»  Hive shell access
»  Hive Web UI
Data Analysis at Facebook

»  Business Intelligence
   ›  Growth and monetization strategies
   ›  Product insights & decisions
   ›  Philosophy: build meta tools and provide easy access to data


»  Artificial Intelligence
   ›  Recommendation & ranking products
   ›  Advertising optimization
   ›  Text analytics
   ›  Philosophy: model inference; data preparation; model building;
BI: Build centralized reporting tools

»  Top-level site metrics
»  Bird's-eye view of user growth by country
»  Comparing certain metrics between user groups
BI: Make AdHoc reporting easy

»  Example: “Find the number of status updates
   mentioning ‘swine flu’ per day last month”

     SELECT a.date, count(1)
     FROM status_updates a
     WHERE a.status LIKE '%swine flu%'
       AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
     GROUP BY a.date
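The ad hoc query can be tried on a toy table. This is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in for Hive, so it shows the shape of the query and not Hive's execution. The sample rows are invented for illustration. SQLite's `LIKE` is case-insensitive for ASCII by default, which may differ from Hive.

```python
import sqlite3

# Toy rows, invented for illustration only.
ROWS = [
    ("2009-04-30", "swine flu in the news"),    # outside the date range
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "is swine flu overhyped?"),
    ("2009-05-01", "great weather today"),      # no keyword match
    ("2009-05-02", "got my swine flu shot"),
    ("2009-06-01", "swine flu again"),          # outside the date range
]

# Same query as on the slide, with plain ASCII quotes.
QUERY = """
SELECT a.date, count(1)
FROM status_updates a
WHERE a.status LIKE '%swine flu%'
  AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
GROUP BY a.date
"""


def daily_mentions(rows):
    """Load rows into an in-memory table and return (date, count) pairs by date."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
        conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
        return sorted(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    for day, mentions in daily_mentions(ROWS):
        print(day, mentions)
```

Storing dates as ISO `YYYY-MM-DD` strings is what makes the string range comparison in the `WHERE` clause behave like a date range.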
Build site metric dashboard in a day
»  Data collection:
   ›    Define metrics and log format (Hive schema)
   ›    Add logging to the site (Scribe logging)
   ›    Create a Hive table partitioned by date
   ›    Set up metric ETL cron job (Hive -> mysql/oracle)
»  Data visualization (using mysql)
»  Data access (adhoc query using Hive)
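The collection steps above can be sketched end to end on a small scale. This is a minimal sketch under stated assumptions: the tab-separated log format, the metric names, and the `site_metrics` table are hypothetical, a Python dictionary keyed by date stands in for a date-partitioned Hive table, and an in-memory SQLite database stands in for the MySQL/Oracle reporting store.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log format: "<date>\t<user_id>\t<event>".
LOG_LINES = [
    "2009-06-01\tu1\tstatus_update",
    "2009-06-01\tu2\tphoto_upload",
    "2009-06-01\tu1\tphoto_upload",
    "2009-06-02\tu3\tstatus_update",
]


def partition_by_date(lines):
    """Group parsed records by date, mimicking a date-partitioned table."""
    partitions = defaultdict(list)
    for line in lines:
        day, user_id, event = line.split("\t")
        partitions[day].append((user_id, event))
    return partitions


def daily_metrics(partitions):
    """Compute one row per date: (date, event count, distinct users)."""
    return [
        (day, len(records), len({user for user, _ in records}))
        for day, records in sorted(partitions.items())
    ]


def load_metrics(conn, rows):
    """Idempotent load, so a scheduled job can be re-run for the same dates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_metrics "
        "(date TEXT PRIMARY KEY, events INTEGER, active_users INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO site_metrics VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_metrics(conn, daily_metrics(partition_by_date(LOG_LINES)))
    for row in conn.execute("SELECT * FROM site_metrics ORDER BY date"):
        print(row)
```

Keying the reporting table by date lets a rerun overwrite a day's metrics instead of duplicating them.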
Build Machine Learning Products on
Hadoop/Hive
 •  Recommendation & ranking
 •  Advertising optimization
 •  Text analytics
What applications the user may like
»  Recommend apps based on
   social and demographic
   popularity

»  User-app log is huge
»  Joining user-app log with
   user demographics is difficult

»  Hive for data aggregation
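The aggregation described above amounts to a join followed by a grouped count. The sketch below does this in memory with invented data. The segment definition (age band and country), the user IDs, and the app names are assumptions for illustration; at the scale described on the slide this would be a Hive join and GROUP BY, not a Python loop.

```python
from collections import Counter, defaultdict

# Hypothetical demographics: user -> (age band, country).
USER_SEGMENT = {
    "u1": ("18-24", "US"),
    "u2": ("18-24", "US"),
    "u3": ("18-24", "US"),
    "u4": ("35-44", "IN"),
}

# Hypothetical user-app log: (user, app) events, possibly repeated.
USER_APP_LOG = [
    ("u1", "chess"), ("u1", "chess"),
    ("u2", "chess"), ("u2", "quiz"),
    ("u3", "chess"), ("u3", "quiz"),
    ("u4", "farm"),
]


def popularity_by_segment(log, user_segment):
    """Join the log with demographics; count distinct users per (segment, app)."""
    popularity = defaultdict(Counter)
    for user, app in set(log):  # each (user, app) pair counts once
        popularity[user_segment[user]][app] += 1
    return popularity


def recommend(user, log, user_segment, k=3):
    """Recommend the most popular apps in the user's segment that they lack."""
    popularity = popularity_by_segment(log, user_segment)
    already_used = {app for u, app in log if u == user}
    ranked = sorted(
        popularity[user_segment[user]].items(),
        key=lambda item: (-item[1], item[0]),  # most popular first, then by name
    )
    return [app for app, _ in ranked if app not in already_used][:k]


if __name__ == "__main__":
    print(recommend("u1", USER_APP_LOG, USER_SEGMENT))
```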
Who the user wants to connect with
»  Take existing edges and
   user feedback as labels
»  Build regression models
   based on user profile and
   local graph features

»  Too many friends of friends
»  Model trained by sampling

»  Hive for model inference
»  Hive for feature selection
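The labeling and sampling ideas above can be illustrated with one local graph feature. This is a minimal sketch and not the model from the talk: it uses only the mutual-friend count as a feature, treats existing edges as positive labels, and samples a few friends-of-friends as negatives. The toy graph, the function names, and the sampling scheme are assumptions, and the regression step itself is left out.

```python
import random

# Toy undirected friendship graph, invented for illustration.
FRIENDS = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def mutual_friends(graph, u, v):
    """Local graph feature: number of friends u and v have in common."""
    return len(graph[u] & graph[v])


def fof_candidates(graph, u):
    """Friends of friends who are not already friends with u."""
    candidates = set()
    for friend in graph[u]:
        candidates |= graph[friend]
    return candidates - graph[u] - {u}


def training_rows(graph, negatives_per_user, seed=0):
    """Build (u, v, mutual_friends, label) rows.

    Existing edges are positives; a seeded sample of friends-of-friends
    provides negatives, since enumerating all of them is too expensive.
    """
    rng = random.Random(seed)
    rows = []
    for u in sorted(graph):
        for v in sorted(graph[u]):
            rows.append((u, v, mutual_friends(graph, u, v), 1))
        candidates = sorted(fof_candidates(graph, u))
        k = min(negatives_per_user, len(candidates))
        for v in rng.sample(candidates, k):
            rows.append((u, v, mutual_friends(graph, u, v), 0))
    return rows


if __name__ == "__main__":
    for row in training_rows(FRIENDS, negatives_per_user=1):
        print(row)
```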
What users are talking about (Lexicon)
»  Market research & ad tool

»  Extract popular words from user
   content
»  Slice by age, gender, region
»  Sentiment analysis
»  Keyword association

»  Hadoop used for text analytics

»  Figure labels: "laid-off"; "Words associated with vodka"
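Word extraction, demographic slicing, and keyword association can be sketched as a small map and reduce over posts. This is a minimal in-memory sketch, not the Lexicon implementation: the posts, the age bands, the stopword list, and the tokenizer are invented for illustration. Sentiment analysis is not covered.

```python
import re
from collections import Counter

# Toy (age band, post text) pairs, invented for illustration.
POSTS = [
    ("18-24", "vodka and lemonade tonight"),
    ("18-24", "lemonade stand with friends"),
    ("25-34", "vodka tonic after work"),
    ("25-34", "laid-off today, need vodka"),
]

STOPWORDS = {"and", "with", "after", "need", "the"}


def tokenize(text):
    """Lowercase words (hyphens allowed inside a word), minus stopwords."""
    words = re.findall(r"[a-z][a-z\-]*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def map_post(segment, text):
    """Map step: emit ((segment, word), 1) once per distinct word in a post."""
    for word in set(tokenize(text)):
        yield (segment, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (segment, word) key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def associated_words(posts, keyword):
    """Count words that appear in the same post as the keyword."""
    counts = Counter()
    for _, text in posts:
        words = set(tokenize(text))
        if keyword in words:
            counts.update(words - {keyword})
    return counts


if __name__ == "__main__":
    pairs = (pair for segment, text in POSTS for pair in map_post(segment, text))
    print(reduce_counts(pairs).most_common(3))
    print(associated_words(POSTS, "vodka"))
```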
What ads the user might click on
»  Predict user-ad click-through

»  Ad click data is sparse, so
   sampling can miss information
»  Many ML algorithms are
   iterative and thus not easy to
   run on Hadoop

»  Hadoop for model training
Build ensemble ML models on Hadoop


»  Each mapper trains a number of models
»  Each model is output as an intermediate feature

»  Model selection at reducer
»  A regression model is built on selected features

»  Figure labels: Train models locally; Cross-Test models locally;
   ds1, ds2, ds3, ds4; ensembles; Models assembled by ensemble methods;
   Model inference in a second Hadoop job
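The train, cross-test, and select flow can be sketched in plain Python. This is a rough sketch under several assumptions, not the authors' method. It treats ds1 to ds4 as local folds held by one mapper, which is one reading of the figure. The models are one-feature threshold stumps chosen for brevity. The final combination is a simple average of the selected models' outputs, swapped in for the regression step the slide mentions. The data is invented.

```python
from statistics import mean

# Toy folds; each row is (features, label). Feature 0 is informative,
# feature 1 is noise. All values are invented for illustration.
FOLDS = {
    "ds1": [((0.1, 5.0), 0), ((0.9, 4.0), 1), ((0.8, 6.0), 1), ((0.2, 5.5), 0)],
    "ds2": [((0.3, 4.5), 0), ((0.7, 5.2), 1), ((0.95, 4.8), 1), ((0.05, 5.9), 0)],
    "ds3": [((0.15, 6.1), 0), ((0.85, 4.2), 1), ((0.75, 5.1), 1), ((0.25, 4.9), 0)],
    "ds4": [((0.35, 5.3), 0), ((0.65, 5.6), 1), ((0.9, 4.4), 1), ((0.1, 5.0), 0)],
}


def train_stump(rows, j):
    """Train a one-feature threshold model: mean label on each side of the mean."""
    threshold = mean(x[j] for x, _ in rows)
    overall = mean(y for _, y in rows)
    low = [y for x, y in rows if x[j] <= threshold]
    high = [y for x, y in rows if x[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "low": mean(low) if low else overall,
        "high": mean(high) if high else overall,
    }


def predict(model, x):
    """A model's output; this value plays the role of an intermediate feature."""
    return model["high"] if x[model["feature"]] > model["threshold"] else model["low"]


def mse(model, rows):
    return mean((predict(model, x) - y) ** 2 for x, y in rows)


def mapper(folds):
    """Train one model per (fold, feature); cross-test on the other folds."""
    for name, train in folds.items():
        held_out = [row for other, rows in folds.items() if other != name
                    for row in rows]
        for j in range(len(train[0][0])):
            model = train_stump(train, j)
            yield mse(model, held_out), name, model


def reducer(scored_models, keep):
    """Model selection: keep the models with the lowest cross-test error."""
    ranked = sorted(scored_models,
                    key=lambda item: (item[0], item[1], item[2]["feature"]))
    return [model for _, _, model in ranked[:keep]]


def ensemble_predict(models, x):
    """Average the selected models (a stand-in for the regression step)."""
    return mean(predict(m, x) for m in models)


if __name__ == "__main__":
    selected = reducer(mapper(FOLDS), keep=4)
    print([m["feature"] for m in selected])
    print(ensemble_predict(selected, (0.9, 5.0)))
```

Inference is a separate function over the selected models, mirroring the idea of running model inference as its own job after training.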
In summary

»  Hadoop and Hive at Facebook
    »  Support product strategy and decisions;
    »  Recommendation & ranking products;
    »  Advertising optimization;
    »  Text analytics tools;

»    So Zuckerberg’s urgent questions are answered;
»    So celebrities know where their fans are from;
»    So we know one can like vodka and lemonade at the same time;
»    It’s fun playing with the data;



                                                Dhruba Borthakur, Ding Zhou
                                                             dhruba@, dzhou@
