weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1
  • To check for inflated standard errors due to multicollinearity between controls and independent variables, we calculated
    variance inflation factors (VIFs). All VIFs are well below 4,
    indicating low collinearity between factors [31].
  • FB: most people also hear from 1 and pass to 1
  • Explain negative binomial
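
The collinearity check in the first note can be sketched as follows. This is a minimal illustration, not the authors' code: it relies on the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix, and the data below is synthetic. The function names and the sample columns are assumptions made for this example.

```python
import random
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular (perfect collinearity).
    """
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def variance_inflation_factors(columns):
    """Return one VIF per predictor column."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Synthetic data (an assumption for illustration): a and b are independent,
# c is nearly a copy of a, so a and c should show inflated VIFs.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
b = [rng.gauss(0, 1) for _ in range(500)]
c = [x + 0.1 * rng.gauss(0, 1) for x in a]

low = variance_inflation_factors([a, b])
high = variance_inflation_factors([a, b, c])
print("independent predictors:", [round(v, 2) for v in low])
print("with a near-duplicate: ", [round(v, 2) for v in high])
```

With the threshold quoted in the note, predictors whose VIF stays below 4 would be treated as showing low collinearity.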
  • Transcript

    • 1. Data Science at Facebook Itamar Rosenn Eric Sun 5/4/09
    • 2. Facebook Data ▪ Social Graph ▪ 200M+ active users ▪ 100M+ users come to site each day ▪ several hundred thousand new users join each day ▪ hundreds of dimensions per user (numerical, categorical, text) ▪ average user has over 120 friends ▪ friendships on Facebook span many different types of relationships ▪ Social Behavior ▪ Actions: users interact with hundreds of thousands of applications, on and off the site ▪ Interactions: users interact directly with each other via over 100 distinct types of events ▪ Social Content ▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc...
    • 3. Managing Data at Scale ▪ Solution: Hadoop + Hive ▪ HDFS / Hadoop (MapReduce in Java) ▪ MetaStore (metadata management) ▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore) ▪ Data Scale ▪ More than 1PB raw capacity in largest HDFS / Hadoop cluster ▪ Over 2TB uncompressed data collected each day ▪ Dozens of TB worth of data read / written each day via Hadoop + Hive
    • 4. Data Science - What We Do Product Health Metrics Launch Evaluations Growth Modeling User Churn Modeling Production Incentives Content Diffusion Ad CTR Prediction PYMK Search Ranking Highlights Behavioral Analysis Data-Driven Systems Data Infrastructure Hive Hadoop
    • 5. Data Science – Who We Are Dennis Decoste Roddy Lindsay Alex Smith Thomas Lento Venky Iyer Ravi Grover Cameron Marlow Lee Byron Itamar Rosenn Danny Ferrante James Mayfield
    • 6. Maintained Relationships on Facebook ▪ Question: is Facebook increasing the size of people’s personal networks? ▪ Task: ▪ the types of relationships people maintain on the site ▪ the relative size of these groups
    • 7. Types of Relationships People you know ▪ Facebook friends = people you’ve met at some point in life ▪ Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth) Communication network ▪ Individuals with whom you communicate on a regular basis ▪ Includes your core support network, which may be as low as 3 people ▪ Kossinets and Watts observed communication network size of 10-20 Maintained relationships ▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing ▪ This information consumption is a form of relationship management, as it can lead to direct
    • 8. Measuring Network Size on Facebook Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways: All friends ▪ The largest representation of a person’s network is the set of people they have verified as friends. Reciprocal communication ▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network. One-way communication ▪ The number of friends to whom the user has reached out via messages, wall posts, or comments. Maintained relationships ▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
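
The four network definitions on slide 8 can be sketched as simple set operations. This is a hypothetical illustration: the input structures (`contacts`, `feed_clicks`, `profile_visits`) are invented for the example, and it assumes that "at least twice" applies to profile visits only, which the slide does not state unambiguously.

```python
from collections import Counter

def network_sizes(user, friends, contacts, feed_clicks, profile_visits):
    """Count the four network types for one user over an observation window.

    friends        -- set of the user's friend ids
    contacts       -- (sender, recipient) pairs for messages, wall posts, comments
    feed_clicks    -- ids whose News Feed stories the user clicked
    profile_visits -- ids of profiles the user visited, one entry per visit
    """
    sent_to = {dst for src, dst in contacts if src == user and dst in friends}
    received_from = {src for src, dst in contacts if dst == user and src in friends}
    visit_counts = Counter(v for v in profile_visits if v in friends)
    # Assumption: one story click qualifies, profile visits need two or more.
    maintained = {f for f in feed_clicks if f in friends}
    maintained |= {f for f, count in visit_counts.items() if count >= 2}
    return {
        "all_friends": len(friends),
        "reciprocal": len(sent_to & received_from),
        "one_way": len(sent_to),
        "maintained": len(maintained),
    }

sizes = network_sizes(
    user="u",
    friends={"a", "b", "c", "d", "e"},
    contacts=[("u", "a"), ("a", "u"), ("u", "b"), ("c", "u"), ("u", "x")],
    feed_clicks=["c", "c", "z"],
    profile_visits=["d", "d", "e"],
)
print(sizes)
```

In this toy input, "x" and "z" are not friends and are ignored, and "e" is excluded from the maintained set because the profile was visited only once.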
    • 9. Findings ▪ As a function of the # of friends a user has, she is passively engaging with 2 to 2.5 times more people than those with whom she directly communicates
    • 10. Systemic Effects ▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.
    • 11. Content Production among New Users ▪ Mission: Give people the power to share and make the world more open and connected. ▪ Question: What mechanisms lead Facebook newcomers to share content on the site?
    • 12. Content Production In new users’ first two weeks: ▪ 45% upload a photo ▪ 41% use a 3rd-party app ▪ 30% send a private message ▪ 27% compose a status update ▪ 22% write on a friend’s wall
    • 13. Content Production In new users’ first two weeks: ▪ 45% upload a photo ▪ 41% use a 3rd-party app ▪ 30% send a private message ▪ 27% compose a status update ▪ 22% write on a friend’s wall
    • 14. Production Incentives Hypotheses ▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content. ▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content. ▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves. ▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.
    • 15. Method Quantitative ▪ Selected two cohorts: Nov. 5, 2007 (N=347,403) Mar. 3, 2008 (N=254,603) ▪ Observed activity in their first two weeks ▪ Predicted how many photos they would upload between third and fifteenth week on Facebook Qualitative ▪ 40-minute semi-structured interviews with seven new users ▪ Recorded audio/video and screen ▪ Asked about typical uses of Facebook, content production, social norms, privacy
    • 16. Features Independent Variables H1. Feedback ▪ Comments received H2. Distribution ▪ # of times content was viewed in Newsfeed ▪ # of friends who viewed content in Newsfeed H3. Social Learning ▪ Number of friends’ photos seen ▪ H4. Singling Out ▪ Number of times tagged Controls ▪ Age ▪ Gender ▪ Number of friends ▪ Total pages viewed ▪ Initial engagement with photos: ▪ # of photos uploaded ▪ # of photos viewed ▪ Photo tags created ▪ Photo comments written
    • 17. Results
      Model 1 – Early Uploaders (Intercept 1.2)
      Controls                                         Coefficient   % change from int.
      Age (in years)                                         -0.01     -1.0% ***
      Male (0/1)                                              0.48    +39.3% ***
      Female (0/1)                                            1.21   +131.2% ***
      Pages viewed +                                          0.24    +18.4% ***
      Photo pages viewed +                                    2.80   +597.4% ***
      Photo comments made                                     0.15    +11.2% ***
      Photo tags created                                      0.10     +6.9% ***
      Photos uploaded                                         0.30    +22.8% ***
      Independent Vars                                 Coefficient   % change from int.
      Comments received (0/1)                                 0.09     +6.2% ***
      Photo views received                                    0.04     +2.6% ***
      Photo stories seen                                      0.09     +6.1% ***
      Photo tags received (0/1)                               0.03     +2.1% (ns)

      Model 2 - Everyone (Intercept 1.9)
      Controls                                         Coefficient   % change from int.
      Age (in years)                                         -0.01     -0.7% ***
      Male (0/1)                                              0.84    +79.6% ***
      Female (0/1)                                            1.43   +169.8% ***
      Pages viewed +                                         -0.02     -1.6% ***
      Photo pages viewed +                                    2.35   +408.3% ***
      Photo comments made                                     0.24    +17.7% ***
      Photo tags created                                      0.17    +12.6% ***
      Early-uploader (0/1)                                    0.39    +30.6% ***
      Independent Vars                                 Coefficient   % change from int.
      Photo stories seen X early-uploader                     0.15    +10.7% ***
      Photo stories seen X non-early-uploader                 0.03     +2.2% ***
      Photo tags received X early-uploader (0/1)             -0.05     -3.6% (ns)
      Photo tags received X non-early-uploader (0/1)          0.10     +7.2% ***
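
The slides do not say how the "% change from int." column on slide 17 is derived from the coefficients. As a hedged observation, the reported values appear consistent with 2^coefficient − 1 (up to rounding of the printed coefficients) and not with the more common exp(coefficient) − 1. The sketch below checks that reading against a few rows copied from the slide; the base-2 interpretation is an inference from the numbers, and the helper names are invented for this example.

```python
import math

def pct_change(coef, base=2.0):
    """Percent change implied by a coefficient on a log scale with the given base."""
    return (base ** coef - 1.0) * 100.0

def consistent(coef, reported_pct, base=2.0, coef_step=0.005, pct_step=0.05):
    """True if reported_pct is reachable given rounding of both printed values.

    coef is printed to 2 decimals (so +/- 0.005) and the percentage to
    1 decimal (so +/- 0.05).
    """
    lo = pct_change(coef - coef_step, base) - pct_step
    hi = pct_change(coef + coef_step, base) + pct_step
    return lo <= reported_pct <= hi

# (coefficient, reported % change) pairs copied from slide 17, Model 1.
rows = [
    (0.48, 39.3),    # Male
    (1.21, 131.2),   # Female
    (0.24, 18.4),    # Pages viewed
    (2.80, 597.4),   # Photo pages viewed
    (0.09, 6.2),     # Comments received
    (0.04, 2.6),     # Photo views received
    (0.03, 2.1),     # Photo tags received
]

for coef, reported in rows:
    print(f"coef={coef:5.2f}  base-2: {pct_change(coef):7.1f}%  "
          f"base-e: {pct_change(coef, math.e):7.1f}%  reported: {reported:7.1f}%")
```

If this reading is right, each percentage is the multiplicative effect on the expected photo count of a one-unit increase in the (possibly log2-transformed) predictor.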
    • 18. Summary of Results
      Hypothesis            Early-uploaders   Non-early-uploaders
      H1. Feedback          Support           N/A
      H2. Distribution      Modest Support    N/A
      H3. Social learning   Support           Support
      H4. Singling out      No Support        Support
      ▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production. ▪ For new users already uploading photos feedback is associated
    • 19. Modeling Contagion Through Newsfeed ▪ How do ideas spread through a social network? ▪ Use Facebook Pages to model diffusion patterns ▪ Compare results with existing models of diffusion ▪ Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s interconnectedness and diffusion properties. ▪ Note: Research based on “old” Facebook (pre-March 2009) ▪ Still relevant: first empirical analysis of large-scale collisions of short chains
    • 20. Theory of the Influentials ▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.) ▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free ▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)
    • 21. Contagion Theory ▪ Duncan Watts: Anyone can be an influencer. ▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not ▪ Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.
    • 22. How Do Ideas Spread on Facebook? ▪ News Feed allows for efficient diffusion of ideas ▪ Facebook’s Pages product is one of the most viral features of the site. ▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents Alice fans a Page Bob sees Alice’s action on his News Feed; Bob fans the Page as well Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well Chain of Length 1
    • 23. Large-Scale Result: Large Connected Trees of Diffusion ▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.
    • 24. Large Connected Clusters ▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected. ▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster. ▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.
    • 25. How Do These Large Clusters Come About? • Are these large clusters started by “one guy”? ▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.” ▪ The variability in this percentage becomes very small as # fans increases ▪ The average node in the biggest cluster is connected to 2.899 others. • Large clusters are formed when many long chains of diffusion merge together.
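
The cluster statistics on slides 24 and 25 can be sketched with a union-find pass over the diffusion edges. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: it treats a "start point" as a fan with no incoming (actor → follower) edge, and the toy fan list and edges are invented.

```python
from collections import Counter

def largest_cluster_stats(fans, edges):
    """Return (share of fans in the largest connected cluster,
               share of that cluster's fans that are start points).

    fans  -- iterable of fan ids
    edges -- (actor, follower) pairs; both ends must appear in fans
    """
    fans = list(fans)
    parent = {f: f for f in fans}

    def find(x):
        # Path-halving union-find lookup.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for actor, follower in edges:
        parent[find(actor)] = find(follower)

    sizes = Counter(find(f) for f in fans)
    root, size = sizes.most_common(1)[0]
    # Assumption: a start point is a fan nobody "passed" the Page to.
    has_parent = {follower for _, follower in edges}
    starts = [f for f in fans if find(f) == root and f not in has_parent]
    return size / len(fans), len(starts) / size

fans = ["a", "b", "c", "d", "e", "f", "g", "h"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("e", "d"), ("f", "g")]
cluster_share, start_share = largest_cluster_stats(fans, edges)
print(f"largest cluster: {cluster_share:.1%} of fans, "
      f"start points: {start_share:.1%} of that cluster")

# The Nastia Liukin figure quoted on slide 24, recomputed from its counts.
print(f"{71090 / 96922:.1%}")
```

In the toy graph, the chains started by "a" and "e" merge at "d", which is the kind of collision slide 25 describes.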
    • 26. Diffusion Chains on Facebook vs. Real Life • The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible. ▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person ▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987) ▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals
    • 27. How are Long Diffusion Chains Created? • Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook ▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page. ▪ If size can be predicted, we can then identify the most influential users.
    • 28. Data ▪ Data consists of all the associations (actor → follower) for a representative selection of Pages. ▪ Pages were at least 40 days old and had at least 5,000 fans
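
A response variable such as max_chain_length could be derived from the (actor, follower) associations roughly as follows. This is a sketch under assumptions: chain length is counted in edges (so Alice → Bob is a chain of length 1, matching slide 22), the association graph is assumed to be acyclic because followers act after actors, and the sample names are hypothetical.

```python
from collections import defaultdict

def max_chain_lengths(associations):
    """Longest downstream diffusion chain (in edges) starting at each user.

    associations -- iterable of (actor, follower) pairs forming an acyclic graph.
    Uses recursion, so very long chains would need an iterative rewrite.
    """
    children = defaultdict(list)
    users = set()
    for actor, follower in associations:
        children[actor].append(follower)
        users.update((actor, follower))

    memo = {}

    def longest(user):
        if user not in memo:
            memo[user] = max(
                (1 + longest(child) for child in children.get(user, ())),
                default=0,
            )
        return memo[user]

    return {user: longest(user) for user in users}

associations = [
    ("alice", "bob"),
    ("alice", "charlie"),
    ("bob", "dave"),
    ("charlie", "dave"),   # dave has two parents, as slide 22 allows
    ("dave", "erin"),
]
lengths = max_chain_lengths(associations)
print(lengths)
```

Under this convention, a path that involves at least four individuals corresponds to a chain length of 3 or more.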
    • 29. Prediction Model Response: max_chain_length Predictors: ▪ gender ▪ log age ▪ log Facebook_age ▪ log feed_exposure (# friends who saw News Feed story) ▪ log friend_count ▪ log activity_count (wall posts + messages sent + photos added) ▪ log popularity (controls for News Feed exposure via Coefficient) Method: zero-inflated negative binomial regression
    • 30. Results • Only consistent coefficient is on feed_exposure (# friends who saw News Feed story). ▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain • Implies that friend_count is not realistically meaningful. ▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
    • 31. Conclusions • Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains. • The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters. • Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: they are related only to exposure.
    • 32. Contact www.facebook.com/data itamar@facebook.com esun@facebook.com
    • 33. (c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
