Facebook Hadoop Data & Applications

    Facebook Hadoop Data & Applications - Presentation Transcript

    1. Hadoop and Hive at Facebook: Data and Applications
       Dhruba Borthakur, Ding Zhou
       Wednesday, June 10, 2009, Santa Clara Marriott
    2. Who generates this data?
       Lots of data is generated on Facebook
       » 200 million active users
       » 20 million users update their statuses at least once each day
       » More than 850 million photos uploaded to the site each month
       » More than 8 million videos uploaded each month
       » More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
       http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols
    3. Where do we store parts of this data?
       » Hadoop/Hive Warehouse
         › 4800 cores, 2 petabytes total size
       » Other Hadoop clusters
         • HDFS-Scribe cluster: 320 cores, 160 TB total size
         • Hadoop Archival Cluster: 80 cores, 200 TB total size
         • Test cluster: 800 cores, 150 TB total size
    4. Data Collection using Scribe
       [Diagram labels: Network Storage and Servers; Web Servers; Scribe MidTier; Oracle RAC; Hadoop Hive Warehouse; MySQL]
    5. Data Collection using Scribe and HDFS
       [Diagram labels: Scribe MidTier; Realtime Hadoop Cluster; Web Servers; Oracle RAC; Hadoop Hive Warehouse; Hadoop Scribe Integration; MySQL]
    6. Data Archive: Move old data to cheap storage
       [Diagram labels: Hadoop Warehouse; distcp; NFS; Hadoop Archive Node; Cheap NAS; Hadoop Archival Cluster; 20 TB per node; HADOOP-5048; Hive Query]
    7. Hive User Interfaces
       » Hive shell access
       » Hive Web UI
    8. Data Analysis at Facebook
       » Business Intelligence
         › Growth and monetization strategies
         › Product insights & decisions
         › Philosophy: build meta tools and provide easy access to data
       » Artificial Intelligence
         › Recommendation & ranking products
         › Advertising optimization
         › Text analytics
         › Philosophy: model inference; data preparation; model building
    9. BI: Build centralized reporting tools
       » Top-level site metrics
       [Figures: bird's-eye view of user growth by country; comparison of certain metrics between user groups]
    10. BI: Make ad hoc reporting easy
        » Example: "Find the number of status updates mentioning 'swine flu' per day last month"
          SELECT a.date, count(1)
          FROM status_updates a
          WHERE a.status LIKE '%swine flu%'
            AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
          GROUP BY a.date
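
        The ad hoc query on this slide can be tried out locally. The following is a minimal sketch, not the Facebook setup: it runs the same SELECT against an in-memory SQLite table as a stand-in for Hive. The function name `count_mentions_per_day` and the sample rows are made up for illustration, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

def count_mentions_per_day(rows, phrase, start, end):
    """Count status updates containing `phrase` per day within [start, end].

    rows: iterable of (date, status) tuples, dates as 'YYYY-MM-DD' strings.
    SQLite is used here only as a local stand-in for a Hive table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
    conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT a.date, COUNT(1)
        FROM status_updates a
        WHERE a.status LIKE ?
          AND a.date >= ? AND a.date <= ?
        GROUP BY a.date
        ORDER BY a.date
        """,
        (f"%{phrase}%", start, end),
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample data, not real status updates.
sample = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "swine flu is all over the news"),
    ("2009-05-02", "nice weather today"),
    ("2009-05-03", "no swine flu here"),
    ("2009-06-01", "swine flu again"),
]
print(count_mentions_per_day(sample, "swine flu", "2009-05-01", "2009-05-31"))
# → [('2009-05-01', 2), ('2009-05-03', 1)]
```

        ISO-formatted date strings compare correctly as plain text, which is why the range filter works without a date type.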
    11. Build site metric dashboard in a day
        » Data collection:
          › Define metrics and log format (Hive schema)
          › Add logging to the site (Scribe logging)
          › Create a Hive table partitioned by date
          › Set up metric ETL cron job (Hive -> MySQL/Oracle)
        » Data visualization (using MySQL)
        » Data access (ad hoc query using Hive)
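
        The collection steps above can be sketched end to end on a single machine. This is an illustration under stated assumptions, not the actual pipeline: the tab-separated log format (`date`, `user_id`, `event`), the `daily_metrics` table, and all function names are hypothetical, a dictionary keyed by date stands in for a date-partitioned Hive table, and SQLite stands in for the MySQL/Oracle reporting database.

```python
import sqlite3
from collections import defaultdict

def partition_by_date(log_lines):
    """Group 'date<TAB>user_id<TAB>event' lines by date (stand-in for date partitions)."""
    partitions = defaultdict(list)
    for line in log_lines:
        ds, user_id, event = line.rstrip("\n").split("\t")
        partitions[ds].append((user_id, event))
    return partitions

def compute_daily_metrics(partitions):
    """Return one (date, event_count, distinct_users) row per partition."""
    rows = []
    for ds in sorted(partitions):
        events = partitions[ds]
        rows.append((ds, len(events), len({user for user, _ in events})))
    return rows

def load_metrics(conn, rows):
    """Upsert metric rows into the reporting table, as a cron-driven ETL job might."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_metrics "
        "(ds TEXT PRIMARY KEY, events INTEGER, active_users INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_metrics VALUES (?, ?, ?)", rows)
    conn.commit()

# Hypothetical log lines for illustration.
log_lines = [
    "2009-06-01\tu1\tstatus_update",
    "2009-06-01\tu1\tphoto_upload",
    "2009-06-01\tu2\tstatus_update",
    "2009-06-02\tu3\tstatus_update",
]
conn = sqlite3.connect(":memory:")
load_metrics(conn, compute_daily_metrics(partition_by_date(log_lines)))
print(conn.execute("SELECT * FROM daily_metrics ORDER BY ds").fetchall())
```

        Using INSERT OR REPLACE keyed on the date makes the job safe to rerun for the same day, which is a common choice for scheduled ETL.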
    12. Build Machine Learning Products on Hadoop/Hive
        • Recommendation & ranking
        • Advertising optimization
        • Text analytics
    13. What applications the user may like
        » Recommend apps based on social and demographic popularity
        » User-app log is huge
        » Joining user-app log with user demographics is difficult
        » Hive for data aggregation
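
        The join-then-aggregate idea on this slide can be shown at toy scale. The sketch below is an assumption-laden illustration, not the production recommender: it joins a small in-memory user-app log with a demographics lookup, counts app usage per demographic segment, and suggests popular apps the user has not used yet. The segment labels, function names, and data are all hypothetical, and the social (friend-based) signal is left out.

```python
from collections import Counter, defaultdict

def popular_apps_by_segment(user_app_log, demographics):
    """Join (user_id, app) log entries with user segments and count apps per segment."""
    counts = defaultdict(Counter)
    for user_id, app in user_app_log:
        segment = demographics.get(user_id)
        if segment is None:
            continue  # skip log entries with no demographic record
        counts[segment][app] += 1
    return counts

def recommend(user_id, user_app_log, demographics, top_n=2):
    """Recommend the most popular apps in the user's segment that the user has not used."""
    used = {app for uid, app in user_app_log if uid == user_id}
    segment = demographics.get(user_id)
    ranked = popular_apps_by_segment(user_app_log, demographics).get(segment, Counter())
    return [app for app, _ in ranked.most_common() if app not in used][:top_n]

# Hypothetical data for illustration.
user_app_log = [
    ("u1", "chess"), ("u2", "chess"), ("u2", "poker"),
    ("u3", "farm"), ("u4", "chess"),
]
demographics = {"u1": "US/18-24", "u2": "US/18-24", "u3": "UK/25-34", "u4": "US/18-24"}
print(recommend("u1", user_app_log, demographics))  # → ['poker']
```

        At warehouse scale the same join and group-by would be expressed as a Hive query rather than held in memory, which is the role the slide assigns to Hive.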
    14. Who the user wants to connect with
        » Take existing edges and user feedback as labels
        » Build regression models based on user profile and local graph features
        » Too many friends of friends
        » Model trained by sampling
        » Hive for model inference
        » Hive for feature selection
    15. What users are talking about (Lexicon)
        » Market research & ad tool
        » Extract popular words from user content
        » Slice by age, gender, region
        » Sentiment analysis
        » Keyword association
        » Hadoop used for text analytics
        [Figure labels: laid-off; Words associated with vodka]
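
        Extracting popular words and slicing them by demographic maps naturally onto a map step and a reduce step. The following is a single-process sketch of that pattern, assuming a hypothetical post record with `age_group`, `gender`, `region`, and `text` fields and a tiny made-up stopword list. It does not reproduce Lexicon itself.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "with", "my", "i"}

def map_post(post):
    """Map step: emit ((segment, word), 1) for every non-stopword in a post."""
    segment = (post["age_group"], post["gender"], post["region"])
    for word in re.findall(r"[a-z']+", post["text"].lower()):
        if word not in STOPWORDS:
            yield (segment, word), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per segment and word."""
    totals = defaultdict(Counter)
    for (segment, word), n in pairs:
        totals[segment][word] += n
    return totals

def top_words(posts, segment, n=3):
    """Return the n most frequent words for one (age_group, gender, region) slice."""
    pairs = (pair for post in posts for pair in map_post(post))
    return reduce_counts(pairs)[segment].most_common(n)

# Hypothetical posts for illustration.
posts = [
    {"age_group": "18-24", "gender": "f", "region": "US", "text": "Vodka and lemonade"},
    {"age_group": "18-24", "gender": "f", "region": "US", "text": "vodka tonight"},
    {"age_group": "25-34", "gender": "m", "region": "UK", "text": "lemonade stand"},
]
print(top_words(posts, ("18-24", "f", "US"), n=1))  # → [('vodka', 2)]
```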
    16. What ads the user might click on
        » Predict user-ad click-through
        » Ad click data is sparse, so sampling can miss information
        » Many ML algorithms are iterative and thus not easy to run on Hadoop
        » Hadoop for model training
    17. Build ensemble ML models on Hadoop
        » Each mapper trains a number of models
        » Each model output as an intermediate feature
        » Model selection at reducer
        » A regression model is built on selected features
        [Diagram labels: ds1, ds2, ds3, ds4; Train models locally; Cross-test models locally; ensembles; Models assembled by ensemble methods; Model inference in a second Hadoop job]
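
        The four steps on this slide can be simulated sequentially in plain Python. This is a rough sketch under several assumptions that are not in the slides: each "mapper" fits one single-feature linear model per input feature on its own data shard, the "reducer" keeps the models with the lowest error on held-out data, and the final regression is a least-squares fit (with a small ridge term for numerical stability) over the selected models' outputs. All function names and the synthetic data are hypothetical, and no Hadoop API is involved.

```python
import random

def fit_one_feature(shard, j):
    """Least-squares fit of y ~ a * x[j] + b on one data shard."""
    n = len(shard)
    mean_x = sum(x[j] for x, _ in shard) / n
    mean_y = sum(y for _, y in shard) / n
    var_x = sum((x[j] - mean_x) ** 2 for x, _ in shard)
    cov = sum((x[j] - mean_x) * (y - mean_y) for x, y in shard)
    a = cov / var_x if var_x > 0 else 0.0
    return (j, a, mean_y - a * mean_x)

def predict(model, x):
    j, a, b = model
    return a * x[j] + b

def mapper(shard, n_features):
    """Each mapper trains several models on its local shard."""
    return [fit_one_feature(shard, j) for j in range(n_features)]

def mse(predict_fn, data):
    return sum((predict_fn(x) - y) ** 2 for x, y in data) / len(data)

def reducer(models, validation, keep):
    """Model selection: keep the models with the lowest held-out error."""
    return sorted(models, key=lambda m: mse(lambda x: predict(m, x), validation))[:keep]

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def fit_ensemble(selected, data, ridge=1e-6):
    """Regress y on the selected models' outputs (used as intermediate features)."""
    rows = [[1.0] + [predict(m, x) for m in selected] for x, _ in data]
    k = len(selected) + 1
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, (_, y) in zip(rows, data)) for i in range(k)]
    w = solve(ata, aty)
    return lambda x: w[0] + sum(wi * predict(m, x) for wi, m in zip(w[1:], selected))

# Synthetic data for illustration: y = 2*x0 - x1.
random.seed(0)
data = []
for _ in range(300):
    x = [random.random(), random.random()]
    data.append((x, 2 * x[0] - x[1]))
shards, validation = [data[:100], data[100:200]], data[200:]
models = [m for shard in shards for m in mapper(shard, 2)]
selected = reducer(models, validation, keep=3)
ensemble = fit_ensemble(selected, data[:200])
print("ensemble MSE on held-out data:", mse(ensemble, validation))
```

        In a real job the mapper and reducer would run as distributed tasks and, as the slide notes, inference would happen in a second Hadoop job. Here everything runs in one process to keep the flow visible.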
    18. In summary
        » Hadoop and Hive at Facebook
        » Support product strategy and decision
        » Recommendation & ranking products
        » Advertising optimization
        » Text analytics tools
        » So Zuckerberg's urgent questions are answered
        » So celebrities know where their fans are from
        » So we know one can like vodka and lemonade at the same time
        » It's fun playing with the data
        Dhruba Borthakur, Ding Zhou
        dhruba@, dzhou@
    © All Rights Reserved
