Data Science at Facebook
Itamar Rosenn
Eric Sun
5/4/09
  • To check for inflated standard errors due to multicollinearity between controls and independent variables, we calculated
    variance inflation factors (VIFs). All VIFs are well below 4,
    indicating low collinearity between factors [31].
  • FB: most people also hear from 1 and pass to 1
  • Explain negative binomial
  • Transcript of "weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1"

    1. Data Science at Facebook
        Itamar Rosenn, Eric Sun
        5/4/09
    2. Facebook Data
        ▪ Social Graph
            ▪ 200M+ active users
            ▪ 100M+ users come to site each day
            ▪ several hundred thousand new users join each day
            ▪ hundreds of dimensions per user (numerical, categorical, text)
            ▪ average user has over 120 friends
            ▪ friendships on Facebook span many different types of relationships
        ▪ Social Behavior
            ▪ Actions: users interact with hundreds of thousands of applications, on and off the site
            ▪ Interactions: users interact directly with each other via over 100 distinct types of events
        ▪ Social Content
            ▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc.
    3. Managing Data at Scale
        ▪ Solution: Hadoop + Hive
            ▪ HDFS / Hadoop (MapReduce in Java)
            ▪ MetaStore (metadata management)
            ▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore)
        ▪ Data Scale
            ▪ More than 1PB raw capacity in largest HDFS / Hadoop cluster
            ▪ Over 2TB uncompressed data collected each day
            ▪ Dozens of TB worth of data read / written each day via Hadoop + Hive
    4. Data Science - What We Do
        Product Health Metrics, Launch Evaluations, Growth Modeling, User Churn Modeling, Production Incentives, Content Diffusion, Ad CTR Prediction, PYMK, Search Ranking, Highlights, Behavioral Analysis, Data-Driven Systems, Data Infrastructure, Hive, Hadoop
    5. Data Science – Who We Are
        Dennis Decoste, Roddy Lindsay, Alex Smith, Thomas Lento, Venky Iyer, Ravi Grover, Cameron Marlow, Lee Byron, Itamar Rosenn, Danny Ferrante, James Mayfield
    6. Maintained Relationships on Facebook
        ▪ Question: is Facebook increasing the size of people’s personal networks?
        ▪ Task:
            ▪ the types of relationships people maintain on the site
            ▪ the relative size of these groups
    7. Types of Relationships
        People you know
            ▪ Facebook friends = people you’ve met at some point in life
            ▪ Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth)
        Communication network
            ▪ Individuals with whom you communicate on a regular basis
            ▪ Includes your core support network, which may be as low as 3 people
            ▪ Kossinets and Watts observed communication network size of 10-20
        Maintained relationships
            ▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing
            ▪ This information consumption is a form of relationship management, as it can lead to direct
    8. Measuring Network Size on Facebook
        Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:
        All friends
            ▪ The largest representation of a person’s network is the set of people they have verified as friends.
        Reciprocal communication
            ▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.
        One-way communication
            ▪ The number of friends to whom the user has reached out via messages, wall posts, or comments.
        Maintained relationships
            ▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
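The four network definitions on this slide can be sketched in code. This is a minimal illustration, not the pipeline the authors used: the event tuple layout `(kind, src, dst)` and the kind names (`message`, `wall_post`, `comment`, `feed_click`, `profile_view`) are hypothetical. The sketch also assumes that "at least twice" applies only to profile visits, which the slide wording leaves ambiguous.

```python
from collections import Counter

# Hypothetical event kinds that count as direct communication.
COMMUNICATION = {"message", "wall_post", "comment"}

def network_sizes(user, friends, events):
    """Count the four network sizes for `user`.

    friends: set of the user's confirmed friends.
    events:  iterable of (kind, src, dst) tuples from the 30-day window
             (an assumed layout, not Facebook's actual log schema).
    """
    sent, received, clicked = set(), set(), set()
    profile_views = Counter()
    for kind, src, dst in events:
        if kind in COMMUNICATION:
            if src == user and dst in friends:
                sent.add(dst)          # user reached out to a friend
            elif dst == user and src in friends:
                received.add(src)      # a friend reached out to the user
        elif src == user and dst in friends:
            if kind == "feed_click":
                clicked.add(dst)       # clicked a feed story about a friend
            elif kind == "profile_view":
                profile_views[dst] += 1
    # Assumption: one feed click is enough, profile visits need two or more.
    maintained = clicked | {f for f, n in profile_views.items() if n >= 2}
    return {
        "all_friends": len(friends),
        "one_way": len(sent),
        "reciprocal": len(sent & received),
        "maintained": len(maintained),
    }
```

Under these definitions a reciprocal contact is always also a one-way contact. Maintained relationships are counted from separate signals (feed clicks and profile visits), so they are not a subset of either.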
    9. Findings
        ▪ As a function of the number of friends a user has, she passively engages with 2 to 2.5 times as many people as those with whom she directly communicates
    10. Systemic Effects
        ▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.
    11. Content Production among New Users
        ▪ Mission: Give people the power to share and make the world more open and connected.
        ▪ Question: What mechanisms lead Facebook newcomers to share content on the site?
    12. Content Production
        In new users’ first two weeks:
        ▪ 45% upload a photo
        ▪ 41% use a 3rd-party app
        ▪ 30% send a private message
        ▪ 27% compose a status update
        ▪ 22% write on a friend’s wall
    13. Content Production
        In new users’ first two weeks:
        ▪ 45% upload a photo
        ▪ 41% use a 3rd-party app
        ▪ 30% send a private message
        ▪ 27% compose a status update
        ▪ 22% write on a friend’s wall
    14. Production Incentives Hypotheses
        ▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.
        ▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content.
        ▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.
        ▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.
    15. Method
        Quantitative
            ▪ Selected two cohorts: Nov. 5, 2007 (N = 347,403) and Mar. 3, 2008 (N = 254,603)
            ▪ Observed activity in their first two weeks
            ▪ Predicted how many photos they would upload between their third and fifteenth week on Facebook
        Qualitative
            ▪ 40-minute semi-structured interviews with seven new users
            ▪ Recorded audio/video and screen
            ▪ Asked about typical uses of Facebook, content production, social norms, privacy
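The quantitative design splits each newcomer's history into an observation window and a prediction window. The sketch below shows one way to cut those windows. The function name and the exact boundaries (half-open intervals, week 3 starting on day 14, week 15 ending on day 105) are assumptions made for illustration; the slide does not give them.

```python
from datetime import datetime, timedelta

def photo_windows(signup, upload_times):
    """Split a user's photo uploads into the two windows described above.

    signup:       datetime the account was created.
    upload_times: iterable of datetimes, one per uploaded photo.
    """
    early_end = signup + timedelta(weeks=2)    # end of the observation window
    late_end = signup + timedelta(weeks=15)    # end of the prediction window
    early = sum(1 for t in upload_times if signup <= t < early_end)
    later = sum(1 for t in upload_times if early_end <= t < late_end)
    return {
        "early_uploads": early,         # control: photos uploaded in weeks 1-2
        "early_uploader": early > 0,    # flag separating the two user groups
        "later_uploads": later,         # response: photos uploaded in weeks 3-15
    }
```

Keeping the windows disjoint means the early-upload count can serve as a predictor without leaking into the response.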
    16. Features
        Independent Variables
            H1. Feedback
                ▪ Comments received
            H2. Distribution
                ▪ # of times content was viewed in Newsfeed
                ▪ # of friends who viewed content in Newsfeed
            H3. Social Learning
                ▪ Number of friends’ photos seen
            H4. Singling Out
                ▪ Number of times tagged
        Controls
            ▪ Age
            ▪ Gender
            ▪ Number of friends
            ▪ Total pages viewed
            ▪ Initial engagement with photos:
                ▪ # of photos uploaded
                ▪ # of photos viewed
                ▪ Photo tags created
                ▪ Photo comments written
    17. Results
        Model 1 – Early Uploaders (Intercept 1.2)
            Controls                                         Coefficient   % change from int.
            Age (in years)                                   -0.01         -1.0% ***
            Male (0/1)                                        0.48         +39.3% ***
            Female (0/1)                                      1.21         +131.2% ***
            Pages viewed +                                    0.24         +18.4% ***
            Photo pages viewed +                              2.80         +597.4% ***
            Photo comments made                               0.15         +11.2% ***
            Photo tags created                                0.10         +6.9% ***
            Photos uploaded                                   0.30         +22.8% ***
            Independent Vars                                 Coefficient   % change from int.
            Comments received (0/1)                           0.09         +6.2% ***
            Photo views received                              0.04         +2.6% ***
            Photo stories seen                                0.09         +6.1% ***
            Photo tags received (0/1)                         0.03         +2.1% (ns)
        Model 2 - Everyone (Intercept 1.9)
            Controls                                         Coefficient   % change from int.
            Age (in years)                                   -0.01         -0.7% ***
            Male (0/1)                                        0.84         +79.6% ***
            Female (0/1)                                      1.43         +169.8% ***
            Pages viewed +                                   -0.02         -1.6% ***
            Photo pages viewed +                              2.35         +408.3% ***
            Photo comments made                               0.24         +17.7% ***
            Photo tags created                                0.17         +12.6% ***
            Early-uploader (0/1)                              0.39         +30.6% ***
            Independent Vars                                 Coefficient   % change from int.
            Photo stories seen X early-uploader               0.15         +10.7% ***
            Photo stories seen X non-early-uploader           0.03         +2.2% ***
            Photo tags received X early-uploader (0/1)       -0.05         -3.6% (ns)
            Photo tags received X non-early-uploader (0/1)    0.10         +7.2% ***
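The slide does not say how the "% change from int." column is derived from the coefficients. Most reported pairs look consistent, up to rounding of the coefficients, with a base-2 transformation, that is 2^coefficient - 1. The sketch below checks that arithmetic. Base 2 is an inference from the numbers, not something the authors state, and `pct_change` is a name introduced here.

```python
def pct_change(coef, base=2.0):
    """Percent change implied by a coefficient on a log scale.

    Assumption: the reported percentages use base 2. This is inferred from
    the slide's numbers and is not confirmed by the source.
    """
    return (base ** coef - 1.0) * 100.0

# (coefficient, reported % change) pairs copied from the results above.
reported = [(0.48, 39.3), (1.21, 131.2), (0.84, 79.6), (1.43, 169.8)]
for coef, pct in reported:
    print(f"coef={coef:+.2f}  implied={pct_change(coef):+.1f}%  reported={pct:+.1f}%")
```

The natural-log reading (`base` set to e) gives much larger values for the same coefficients, which is why base 2 is used here. Small gaps between implied and reported values are expected, because the slide shows coefficients rounded to two decimals.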
    18. Summary of Results
        Hypothesis            Early-uploaders   Non-early-uploaders
        H1. Feedback          Support           N/A
        H2. Distribution      Modest Support    N/A
        H3. Social learning   Support           Support
        H4. Singling out      No Support        Support
        ▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.
        ▪ For new users already uploading photos feedback is associated
    19. Modeling Contagion Through Newsfeed
        ▪ How do ideas spread through a social network?
        ▪ Use Facebook Pages to model diffusion patterns
        ▪ Compare results with existing models of diffusion
        ▪ Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s interconnectedness and diffusion properties.
        ▪ Note: Research based on “old” Facebook (pre-March 2009)
        ▪ Still relevant: first empirical analysis of large-scale collisions of short chains
    20. Theory of the Influentials
        ▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)
        ▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free
        ▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)
    21. Contagion Theory
        ▪ Duncan Watts: Anyone can be an influencer.
        ▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not
        ▪ Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.
    22. How Do Ideas Spread on Facebook?
        ▪ News Feed allows for efficient diffusion of ideas
        ▪ Facebook’s Pages product is one of the most viral features of the site.
        ▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents
        Example (Chain of Length 1):
            ▪ Alice fans a Page
            ▪ Bob sees Alice’s action on his News Feed; Bob fans the Page as well
            ▪ Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well
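The Alice, Bob and Charlie example suggests one way to compute the longest chain a user starts. Treat each "follower fanned after seeing actor's story" pair as a directed edge and take the longest downstream path. The sketch below is an illustration under stated assumptions: the `(actor, follower)` tuple layout and the function name are hypothetical, and the graph is assumed acyclic because a follower fans the Page after the actor does.

```python
from collections import defaultdict
from functools import lru_cache

def max_chain_lengths(associations):
    """Longest diffusion chain started by each user.

    associations: iterable of (actor, follower) pairs, meaning the follower
    fanned the Page after seeing the actor's News Feed story. A follower may
    appear with several actors (multiple parents).
    """
    children = defaultdict(set)
    nodes = set()
    for actor, follower in associations:
        children[actor].add(follower)
        nodes.update((actor, follower))

    @lru_cache(maxsize=None)
    def depth(node):
        # A user with no followers starts a chain of length 0.
        return max((1 + depth(c) for c in children.get(node, ())), default=0)

    return {node: depth(node) for node in nodes}
```

With the slide's example, `max_chain_lengths([("alice", "bob"), ("alice", "charlie")])` gives Alice a chain of length 1 and Bob and Charlie chains of length 0. The recursion is fine for a sketch, but very long chains would need an iterative traversal to stay under Python's recursion limit.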
    23. Large-Scale Result: Large Connected Trees of Diffusion
        ▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.
    24. Large Connected Clusters
        ▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.
        ▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.
        ▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.
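The share of fans in the largest connected cluster can be computed with a union-find pass over the diffusion edges. The sketch below assumes the edges are available as `(fan_a, fan_b)` pairs and ignores direction, since connectivity is what matters here. The function name and input layout are illustrative, not taken from the talk.

```python
from collections import Counter

def largest_cluster_share(fans, edges):
    """Fraction of a Page's fans that fall in its largest connected cluster.

    fans:  collection of fan ids.
    edges: iterable of (fan_a, fan_b) diffusion links; direction is ignored.
    """
    parent = {f: f for f in fans}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    if not parent:
        return 0.0
    sizes = Counter(find(f) for f in parent)
    return max(sizes.values()) / len(parent)
```

For the Nastia Liukin example on the slide, the same ratio is 71,090 / 96,922, which rounds to the reported 73.3%.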
    25. How Do These Large Clusters Come About?
        • Are these large clusters started by “one guy”?
            ▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”
            ▪ The variability in this percentage becomes very small as # fans increases
            ▪ The average node in the biggest cluster is connected to 2.899 others.
        • Large clusters are formed when many long chains of diffusion merge together.
    26. Diffusion Chains on Facebook vs. Real Life
        • The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.
            ▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person
            ▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987)
            ▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals
    27. How are Long Diffusion Chains Created?
        • Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook
            ▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.
            ▪ If size can be predicted, we can then identify the most influential users.
    28. Data
        ▪ Data consists of all the associations (actor → follower) for a representative selection of Pages.
        ▪ Pages were at least 40 days old and had at least 5,000 fans
    29. Prediction Model
        Response: max_chain_length
        Predictors:
            ▪ gender
            ▪ log age
            ▪ log Facebook_age
            ▪ log feed_exposure (# friends who saw News Feed story)
            ▪ log friend_count
            ▪ log activity_count (wall posts + messages sent + photos added)
            ▪ log popularity (controls for News Feed exposure via Coefficient)
        Method: zero-inflated negative binomial regression
    30. Results
        • Only consistent coefficient is on feed_exposure (# friends who saw News Feed story).
            ▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain
        • Implies that friend_count is not realistically meaningful.
            ▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
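The "1% more exposure, 1% longer chain" reading follows from a log-log specification. The worked step below assumes the count part of the regression uses the usual log link, which the slides do not state. The symbols β (coefficient on log feed_exposure), c (all other terms held fixed) and x (a given exposure level) are introduced here for illustration.

```latex
\log \mathbb{E}[\text{max\_chain\_length}] = c + \beta \log(\text{feed\_exposure})
\;\Longrightarrow\;
\mathbb{E}[\text{max\_chain\_length}] \propto \text{feed\_exposure}^{\beta}

\frac{\mathbb{E}[\text{max\_chain\_length} \mid \text{feed\_exposure} = 1.01\,x]}
     {\mathbb{E}[\text{max\_chain\_length} \mid \text{feed\_exposure} = x]}
= 1.01^{\beta} \approx 1 + 0.01\,\beta
```

With β close to 1 the ratio is about 1.01, so the expected maximum chain length grows by roughly 1%.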
    31. Conclusions
        • Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.
        • The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.
        • Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: they are related only to exposure.
    32. Contact
        www.facebook.com/data
        itamar@facebook.com
        esun@facebook.com
    33. (c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0