Facebook Hadoop Data & Applications

    Presentation Transcript

    • Hadoop and Hive at Facebook: Data and Applications. Dhruba Borthakur, Ding Zhou. Wednesday, June 10, 2009, Santa Clara Marriott.
    • Who generates this data? Lots of data is generated on Facebook »  200 million active users »  20 million users update their statuses at least once each day »  More than 850 million photos uploaded to the site each month »  More than 8 million videos uploaded each month »  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols
    • Where do we store parts of this data? »  Hadoop/Hive Warehouse ›  4800 cores, 2 PB total size »  Other Hadoop clusters ›  HDFS-Scribe cluster: 320 cores, 160 TB total size ›  Hadoop Archival Cluster: 80 cores, 200 TB total size ›  Test cluster: 800 cores, 150 TB total size
    • Data Collection using Scribe [Diagram; labels: Web Servers, Scribe MidTier, Network Storage and Servers, Hadoop Hive Warehouse, Oracle RAC, MySQL]
    • Data Collection using Scribe and HDFS [Diagram; labels: Web Servers, Scribe MidTier, Realtime Hadoop Cluster, Hadoop Scribe Integration, Hadoop Hive Warehouse, Oracle RAC, MySQL]
    • Data Archive: Move old data to cheap storage [Diagram; labels: Hadoop Warehouse, distcp, NFS, Hadoop Archive Node, Cheap NAS, Hadoop Archival Cluster, 20 TB per node, HADOOP-5048, Hive Query]
    • Hive User Interfaces »  Hive shell access »  Hive Web UI
    • Data Analysis at Facebook »  Business Intelligence ›  Growth and monetization strategies ›  Product insights & decisions ›  Philosophy: build meta tools and provide easy access to data »  Artificial Intelligence ›  Recommendation & ranking products ›  Advertising optimization ›  Text analytics ›  Philosophy: model inference; data preparation; model building;
    • BI: Build centralized reporting tools »  Top-level site metrics »  Bird's-eye view of user growth by country »  Comparing certain metrics between user groups
    • BI: Make ad hoc reporting easy »  Example: "Find the number of status updates mentioning 'swine flu' per day last month" »  SELECT a.date, count(1) FROM status_updates a WHERE a.status LIKE '%swine flu%' AND a.date >= '2009-05-01' AND a.date <= '2009-05-31' GROUP BY a.date
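The ad hoc query on this slide can be tried without a Hive installation. The sketch below is only a stand-in: it runs the same SELECT against an in-memory SQLite table through Python's `sqlite3` module, and the sample rows and the `daily_mentions` helper are hypothetical, not part of the original deck.

```python
import sqlite3

# Hypothetical sample rows; the real status_updates table lives in Hive.
ROWS = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "swine flu in the news again"),
    ("2009-05-02", "lovely weather today"),
    ("2009-05-02", "got my swine flu shot"),
    ("2009-06-01", "swine flu outside the date range"),
]


def daily_mentions(rows, phrase, start, end):
    """Count status updates per day that mention `phrase` between two dates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
    conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
    cur = conn.execute(
        "SELECT a.date, count(1) FROM status_updates a "
        "WHERE a.status LIKE ? AND a.date >= ? AND a.date <= ? "
        "GROUP BY a.date ORDER BY a.date",
        (f"%{phrase}%", start, end),
    )
    result = cur.fetchall()
    conn.close()
    return result


print(daily_mentions(ROWS, "swine flu", "2009-05-01", "2009-05-31"))
# -> [('2009-05-01', 2), ('2009-05-02', 1)]
```

The ORDER BY clause is an addition for stable output; the query on the slide does not specify an ordering.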
    • Build a site metric dashboard in a day »  Data collection: ›  Define metrics and log format (Hive schema) ›  Add logging to the site (Scribe logging) ›  Create a Hive table partitioned by date ›  Set up metric ETL cron job (Hive -> MySQL/Oracle) »  Data visualization (using MySQL) »  Data access (ad hoc query using Hive)
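The dashboard steps above (log records, a date-partitioned table, a metric ETL job into a reporting database) can be sketched in miniature. This is an illustration under stated assumptions, not the pipeline from the talk: the log format, the "daily active users" metric, the `site_metrics` table, and the use of SQLite in place of MySQL/Oracle are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log records: (date, user_id, event). The format is an assumption.
LOG = [
    ("2009-06-01", 1, "page_view"),
    ("2009-06-01", 2, "page_view"),
    ("2009-06-01", 1, "page_view"),
    ("2009-06-02", 3, "page_view"),
]


def partition_by_date(records):
    """Group records under their date, mimicking a date-partitioned table."""
    partitions = defaultdict(list)
    for ds, user_id, event in records:
        partitions[ds].append((user_id, event))
    return dict(partitions)


def daily_active_users(partitions):
    """Compute one metric per partition: distinct users seen that day."""
    return {ds: len({user for user, _ in rows}) for ds, rows in partitions.items()}


def load_metrics(conn, metrics):
    """ETL step: write the per-day metric into a reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_metrics (ds TEXT PRIMARY KEY, dau INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO site_metrics VALUES (?, ?)", sorted(metrics.items())
    )
    conn.commit()


# SQLite stands in for the MySQL/Oracle reporting database.
conn = sqlite3.connect(":memory:")
load_metrics(conn, daily_active_users(partition_by_date(LOG)))
print(conn.execute("SELECT ds, dau FROM site_metrics ORDER BY ds").fetchall())
# -> [('2009-06-01', 2), ('2009-06-02', 1)]
```

Keying the reporting table on the date and using INSERT OR REPLACE means a rerun of the job for the same day overwrites that day's row instead of duplicating it.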
    • Build Machine Learning Products on Hadoop/Hive •  Recommendation & ranking •  Advertising optimization •  Text analytics
    • What applications the user may like »  Recommend apps based on social and demographic popularity »  User-app log is huge »  Joining user-app log with user demographics is difficult »  Hive for data aggregation
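The join-then-aggregate step described on this slide can be illustrated at toy scale. The sketch below assumes a hypothetical user table keyed by user id and a hypothetical (user, app) log; at real scale this is the kind of work the slide assigns to Hive, and the in-memory dictionary lookup here only shows the shape of the computation.

```python
from collections import Counter

# Hypothetical demographics: user_id -> (age_bucket, country).
USERS = {1: ("18-24", "US"), 2: ("18-24", "US"), 3: ("25-34", "UK")}

# Hypothetical user-app log: (user_id, app_name). User 4 has no profile row.
APP_LOG = [(1, "chess"), (2, "chess"), (2, "quiz"), (3, "quiz"), (4, "chess")]


def app_popularity_by_segment(users, app_log):
    """Inner-join the app log with demographics and count uses per segment."""
    counts = Counter()
    for user_id, app in app_log:
        segment = users.get(user_id)
        if segment is None:
            continue  # no matching user row, so the log entry is dropped
        counts[(segment, app)] += 1
    return counts


def top_app(counts, segment):
    """Return the most used app within one demographic segment, if any."""
    candidates = {app: n for (seg, app), n in counts.items() if seg == segment}
    if not candidates:
        return None
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(candidates), key=candidates.get)


counts = app_popularity_by_segment(USERS, APP_LOG)
print(top_app(counts, ("18-24", "US")))
```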
    • Who the user wants to connect with »  Take existing edges and user feedback as labels »  Build regression models based on user profile and local graph features »  Too many friends of friends »  Model trained by sampling »  Hive for model inference »  Hive for feature selection
    • What users are talking about (Lexicon) »  Market research & ad tool »  Extract popular words from user content »  Slice by age, gender, region »  Sentiment analysis »  Keyword association »  Hadoop used for text analytics [Figure labels: laid-off; Words associated with vodka]
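Keyword association, as in the "words associated with vodka" figure, can be approximated by counting which words co-occur with a keyword in the same post. The sketch below is one simple way to do that, not the method Lexicon used; the posts, the stopword list, and the function names are all hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "and", "the", "with", "i", "my", "is", "at", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def associated_words(posts, keyword):
    """Count words that appear in the same post as `keyword` (once per post)."""
    counts = Counter()
    for post in posts:
        words = set(tokenize(post))
        if keyword in words:
            counts.update(words - {keyword})
    return counts


# Hypothetical posts, invented for illustration.
POSTS = [
    "Vodka and lemonade at the party",
    "I like vodka with lemonade",
    "Lemonade stand is open",
    "Vodka tonic tonight",
]

print(associated_words(POSTS, "vodka").most_common(1))
# -> [('lemonade', 2)]
```

Raw co-occurrence counts favor words that are common everywhere, so a production tool would likely normalize by each word's overall frequency.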
    • What ads the user might click on »  Predict user-ad click-through »  Ad click data is sparse, so sampling can miss information »  Many ML algorithms are iterative and thus not easy to run on Hadoop »  Hadoop for model training
    • Build ensemble ML models on Hadoop »  Each mapper trains a number of models »  Each model output as an intermediate feature »  Model selection at reducer »  A regression model is built on selected features [Diagram; labels: data shards ds1, ds2, ds3, ds4; train models locally; cross-test models locally; models assembled by ensemble methods (ensembles); model inference in a second Hadoop job]
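The map/reduce shape of this slide can be sketched in plain Python, with no Hadoop involved. Each "mapper" trains two small models on its own shard and cross-tests them on the other shards, and the "reducer" keeps the models with the lowest cross-test error. Everything here is an assumption made for illustration: the one-feature data, the two model families, and in particular the final step, which averages the selected models' predictions instead of fitting the regression model over selected features that the slide describes.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar


def fit_constant(points):
    """Baseline model that always predicts the shard's mean y."""
    return 0.0, mean(y for _, y in points)


def mse(model, points):
    """Mean squared error of a (slope, intercept) model on some points."""
    slope, intercept = model
    return mean((slope * x + intercept - y) ** 2 for x, y in points)


def mapper(shard_index, shards):
    """Train models on one shard and cross-test each on the other shards."""
    train = shards[shard_index]
    held_out = [p for i, s in enumerate(shards) if i != shard_index for p in s]
    for fit in (fit_line, fit_constant):
        model = fit(train)
        yield mse(model, held_out), model


def reducer(scored_models, keep):
    """Model selection: keep the models with the lowest cross-test error."""
    return [model for _, model in sorted(scored_models)[:keep]]


def ensemble_predict(models, x):
    """Inference step: average the selected models (a stand-in for stacking)."""
    return mean(slope * x + intercept for slope, intercept in models)


# Hypothetical noise-free data following y = 2x + 1, split into four shards.
DATA = [(float(x), 2.0 * x + 1.0) for x in range(16)]
SHARDS = [DATA[i::4] for i in range(4)]  # plays the role of ds1..ds4

scored = [pair for i in range(len(SHARDS)) for pair in mapper(i, SHARDS)]
selected = reducer(scored, keep=4)
print(ensemble_predict(selected, 10.0))
```

Because the toy data is exactly linear, the four line fits win selection over the constant baselines; with real, noisy data the selection step is where weak models would be filtered out.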
    • In summary »  Hadoop and Hive at Facebook »  Support product strategy and decisions; »  Recommendation & ranking products; »  Advertising optimization; »  Text analytics tools; »  So Zuckerberg’s urgent questions are answered; »  So celebrities know where their fans are from; »  So we know one can like vodka and lemonade at the same time; »  It’s fun playing with the data. Dhruba Borthakur, Ding Zhou dhruba@, dzhou@