Joint Statistical Meetings 2008

  1. Data Analysis at Facebook. Jeff Hammerbacher, Ding Zhou*, Facebook Inc.
  2. Outline
       • How does Facebook work
       • Managing Big Data
       • Data Analysis for Business Intelligence
       • Data Analysis for “Artificial Intelligence”
       • Questions
  3. How does Facebook work?
  4. Profile page - content generation portal
  5. Newsfeed page - content consumption portal
  6. Friends page - social graph portal
  7. App page - social app platform
  8. Facebook Data
       ▪ Social Graph Data
           ▪ The Nodes:
               ▪ 100m+ users; 100+ dimensions per user (numerical, text, categorical)
               ▪ 350k registrations daily
           ▪ The Edges:
               ▪ 200+ friends per user (median)
               ▪ 20 categories of edges (fb friends, co-workers, family, etc.)
       ▪ Social Behavior Data
           ▪ Social Interactions: interactions among users, via 100+ interaction types
           ▪ Social Actions: between users and 33k+ Facebook apps, via 200+ action types
       ▪ Social Content Data
           ▪ Content of Posts, Notes, Photos, Video, etc.
  9. Managing Big Data
       ▪ Data scale [backend]:
           ▪ Over 1.3 PB raw capacity in largest cluster
           ▪ Nearly 2 TB uncompressed data per day
           ▪ Over 20 TB read/write per day
       ▪ Distributed data management:
           ▪ HDFS/Hadoop (MapReduce in Java)
           ▪ MetaStore (metadata management)
           ▪ Hive QL (query language on Hadoop + MetaStore)
       ▪ Usage:
           ▪ At least 50 engineers have run Hadoop jobs
           ▪ 3,514 jobs weekly
           ▪ 821 projections, 152 joins, 800 aggregates, 600 loaders weekly
  10. Hadoop - MapReduce in Java
       [Figure: MapReduce execution flow (Dean, J. and Ghemawat, S., 2004), illustrated with a word count over the phrase "facebook data team uses hadoop for data analysis". Mappers emit pairs such as facebook:1 and data:1; the reduced output includes data:2.]
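The word-count flow on slide 10 can be sketched in a few lines. This is a single-process Python illustration of the map, shuffle, and reduce steps, not the Java/Hadoop code the slide refers to; the function names and the way the phrase is split into three input records are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical split of the slide's phrase into three input records.
lines = ["facebook data team uses", "hadoop for", "data analysis"]
counts = word_count(lines)
```

In a real cluster the map calls run in parallel on separate input splits and the shuffle moves data between machines; here all three steps run in one process so the data flow is easy to follow.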
  11. Data Analysis for Business Intelligence
  12. Data for Business Intelligence
       ▪ General Goal:
           ▪ Support growth and monetization strategies, and product decisions
       ▪ User Behavior Studies
           ▪ NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement
           ▪ Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, and dimension reduction methods
       ▪ Experimentation Platform
           ▪ Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...)
           ▪ Columbus++*: A/B testing for the impact of site changes on site health metrics
       ▪ Reporting System
           ▪ Ad-hoc analysis done by Hive queries
       * Projects marked with an asterisk (underlined on the original slide) are ones that Ding Zhou participates in.
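The slides do not say how Columbus++ decides whether a site change moved a metric. One common choice for a rate-style metric is a two-proportion z-test, sketched below as an assumption rather than a description of the actual system; the function name, arguments, and example counts are all hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare a rate metric between control (a) and treatment (b).

    Returns the z statistic and a two-sided p-value under the
    normal approximation with a pooled standard error.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: users who returned the next day in each group.
z, p_value = two_proportion_z_test(success_a=100, n_a=1000,
                                   success_b=200, n_b=1000)
```

A drill-down by user group (country, age, gender) would repeat the same test per segment, in which case some correction for multiple comparisons is usually needed.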
  13. Columbus
       ▪ Geographical bird's-eye view of growth by country
       ▪ Comparison between user groups
  14. Data Analysis for “Artificial Intelligence” -- predicting user social behavior
  15. Who the user will interact with
       • Predict interactions between friends
       • Features are user profile and browsing history
       • Tried linear models and tree models
       • Applied for search, newsfeed, etc.
  16. Who the user hasn’t found yet
       • Missing edge prediction problem
       • Observations are friend/non-friend pairs
       • Features include profile and local graph info
       • Profile info more informative
       • Graph info supplemental if profile incomplete
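To make the feature description on slide 16 concrete, here is a minimal sketch of pair features for missing edge prediction. The slides do not name the specific features used, so common-neighbor count, Jaccard overlap, and a count of matching profile fields are illustrative assumptions, as are the toy graph and profiles.

```python
def common_neighbors(graph, u, v):
    """Local graph feature: number of friends u and v share."""
    return len(graph[u] & graph[v])


def jaccard(graph, u, v):
    """Shared friends as a fraction of all friends of either user."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def profile_matches(profiles, u, v):
    """Profile feature: fields filled in for both users with equal values."""
    shared = profiles[u].keys() & profiles[v].keys()
    return sum(1 for field in shared if profiles[u][field] == profiles[v][field])


def pair_features(graph, profiles, u, v):
    """Feature vector for one friend/non-friend pair."""
    return {
        "common_neighbors": common_neighbors(graph, u, v),
        "jaccard": jaccard(graph, u, v),
        "profile_matches": profile_matches(profiles, u, v),
    }


def suggest(graph, user):
    """Rank non-friends of `user` by shared friends, dropping zero scores."""
    candidates = set(graph) - graph[user] - {user}
    scored = [(c, common_neighbors(graph, user, c)) for c in candidates]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: (-s[1], s[0]))


# Hypothetical undirected friendship graph and sparse profiles.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
profiles = {
    "ann": {"school": "MIT", "city": "Boston"},
    "bob": {"school": "MIT"},
    "cat": {},
    "dan": {"school": "MIT", "city": "Denver"},
    "eve": {"city": "Denver"},
}
features = pair_features(graph, profiles, "ann", "dan")
suggestions = suggest(graph, "ann")
```

A classifier trained on such vectors, with existing friendships as positive pairs, would then score unseen pairs. When a profile is empty, as for "cat" here, only the graph features carry signal, which matches the slide's remark that graph info is supplemental when the profile is incomplete.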
  17. What applications the user may like*
       • 33k apps, only 0.1% of them used
       • A different recommendation problem
       • Prediction model not applicable, user preference unavailable
       • Build a prediction model to infer “user ratings”
       • User-based + item-based recommendation
       • How to combine profile, social graph, ratings?
       * Projects that Ding Zhou participates in.
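The item-based half of the approach on slide 17 can be sketched as follows. Because explicit ratings are unavailable, this example treats "has used the app" as a binary stand-in for the inferred ratings the slide mentions; that substitution, the cosine similarity measure, and the toy usage data are assumptions for illustration only.

```python
import math
from collections import defaultdict


def invert(usage):
    """Turn user -> apps into app -> users."""
    app_users = defaultdict(set)
    for user, apps in usage.items():
        for app in apps:
            app_users[app].add(user)
    return app_users


def cosine(users_a, users_b):
    """Cosine similarity between two apps viewed as binary user vectors."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def recommend(usage, user, top_n=3):
    """Score unused apps by their summed similarity to the user's apps."""
    app_users = invert(usage)
    scores = {}
    for candidate in app_users:
        if candidate in usage[user]:
            continue
        scores[candidate] = sum(
            cosine(app_users[candidate], app_users[used]) for used in usage[user]
        )
    ranked = sorted(scores, key=lambda app: (-scores[app], app))
    return ranked[:top_n]


# Hypothetical implicit feedback: which apps each user has used.
usage = {
    "u1": {"chess", "quiz"},
    "u2": {"chess", "quiz", "poker"},
    "u3": {"poker"},
    "u4": {"chess"},
}
recommended = recommend(usage, "u4")
```

A user-based variant would compute similarity between users instead of apps. How to fold profile and social graph signals into the same score is the open question the slide raises, and this sketch does not address it.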
  18. What content is interesting*
       • Newsfeed as the main content distribution channel
       • Stories generated by 100s of social actions: on the site, platform, or the Web
       • <0.1% of possible stories are shown
       • Predictions built on story features and user browsing history
       * Projects that Ding Zhou participates in.
  19. Challenges in Data
       - 100s of TBs of meaningful data available
       - 1,000s of non-trivial features
       - Sampling not always applicable (e.g. a small app has no user data)
       - Prediction requirements
           ▪ Models regularly applied to 10 billion novel samples
           ▪ Models used on-the-fly for 100k samples in 50 ms
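The two prediction requirements on slide 19 imply a per-sample scoring budget that is easy to work out. The arithmetic below uses only the figures on the slide; the assumption that the batch job runs in a single process at the online rate is purely illustrative, since the slides suggest the work is distributed.

```python
# Figures taken from the slide.
online_samples = 100_000          # samples scored on-the-fly
online_budget_s = 0.050           # 50 ms
batch_samples = 10_000_000_000    # 10 billion novel samples

# Time available per sample in the online setting.
per_sample_s = online_budget_s / online_samples

# Equivalent throughput in samples per second.
throughput = online_samples / online_budget_s

# Illustrative assumption: one process scoring the batch at the online rate.
batch_seconds = batch_samples / throughput
batch_hours = batch_seconds / 3600
```

The online budget comes to half a microsecond per sample, or two million samples per second. At that rate a single process would need roughly 5,000 seconds, about 1.4 hours, for the 10 billion sample batch.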
  20. Special Machine Learning Problems
       - Use machine learning to predict user behavior
           ▪ Labels: insufficient; inferred implicitly; imbalanced
           ▪ Features: high-dimensional; strongly correlated; noisy
       - Scale requires distributed algorithms
           ▪ In-house implementation of tree ensemble methods (bagging predictors)
           ▪ Larger training sets grant performance improvements
       - Speed and accuracy improvements underway
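The in-house tree ensemble on slide 20 is not described in detail. As a rough illustration of bagging predictors in general, the sketch below trains depth-one trees (decision stumps) on bootstrap resamples and combines them by majority vote. It runs on one machine, and the data, model count, and seed are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Pick the (feature, threshold, left, right) split with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def fit_bagging(rows, labels, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Majority vote over the ensemble."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature data: label 1 when the value exceeds 4.
rows = [[float(x)] for x in range(1, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagging(rows, labels)
low = predict_bagging(models, [1.0])
high = predict_bagging(models, [8.0])
```

Because each model is trained independently of the others, the loop in `fit_bagging` parallelizes naturally, which is one reason bagging suits the distributed setting the slide describes.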
  21. Tip of the iceberg. Questions?
  22. (c) 2004-2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
