weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1

  • To check for inflated standard errors due to multicollinearity between controls and independent variables, we calculated variance inflation factors (VIFs). All VIFs are well below 4, indicating low collinearity between factors [31].
  • FB: most people also hear from 1 and pass to 1
  • Explain negative binomial
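The VIF check described in the first note can be sketched as follows. This is only an illustration, assuming NumPy is available; the function name `variance_inflation_factors` and the synthetic columns are hypothetical and do not come from the talk. It relies on the standard identity that the VIFs equal the diagonal of the inverse of the predictors' correlation matrix.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # The j-th diagonal entry of the inverse correlation matrix is 1 / (1 - R_j^2),
    # where R_j^2 comes from regressing column j on all other columns.
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors (hypothetical data): c is mildly correlated with a.
rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = 0.5 * a + rng.normal(size=n)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

A VIF of 1 means a predictor is uncorrelated with the others; the note's threshold of 4 corresponds to an R² of 0.75 between one predictor and the rest.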


  • 1. Data Science at Facebook
    Itamar Rosenn
    Eric Sun
  • 2. Facebook Data
    Social Graph
    200M+ active users
    100M+ users come to site each day
    several hundred thousand new users join each day
    hundreds of dimensions per user (numerical, categorical, text)
    average user has over 120 friends
    friendships on Facebook span many different types of relationships
    Social Behavior
    Actions: users interact with hundreds of thousands of applications, on and off the site
    Interactions: users interact directly with each other via over 100 distinct types of events
    Social Content
    Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc...
  • 3. Managing Data at Scale
    Solution: Hadoop + Hive
    HDFS / Hadoop (MapReduce in Java)
    MetaStore (metadata management)
    HiveQL (SQL-like query language on top of Hadoop + MetaStore)
    Data Scale
    More than 1PB raw capacity in largest HDFS / Hadoop cluster
    Over 2TB uncompressed data collected each day
    Dozens of TB worth of data read / written each day via Hadoop + Hive
  • 4. Data Science - What We Do
    Behavioral Analysis
    Data-Driven Systems
    Product Health Metrics
    Launch Evaluations
    Growth Modeling
    User Churn Modeling
    Production Incentives
    Content Diffusion
    Ad CTR Prediction
    Search Ranking
    Data Infrastructure
  • 5. Data Science – Who We Are
    Dennis Decoste
    Alex Smith
    Roddy Lindsay
    Thomas Lento
    Cameron Marlow
    Ravi Grover
    Danny Ferrante
    Lee Byron
    James Mayfield
  • 6. Maintained Relationships on Facebook
    Question: is Facebook increasing the size of people’s personal networks?
    the types of relationships people maintain on the site
    the relative size of these groups
  • 7. Types of Relationships
    People you know
    Facebook friends = people you’ve met at some point in life
    Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth)
    Communication network
    Individuals with whom you communicate on a regular basis
    Includes your core support network, which may be as low as 3 people
    Kossinets and Watts observed communication network size of 10-20
    Maintained relationships
    Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing
    This information consumption is a form of relationship management, as it can lead to direct communication in the future
  • 8. Measuring Network Size on Facebook
    Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:
    All friends
    The largest representation of a person’s network is the set of people they have verified as friends.
    Reciprocal communication
    The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.
    One-way communication
    The number of friends to whom the user has reached out via messages, wall posts, or comments.
    Maintained relationships
    The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
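The four network definitions above can be sketched as set operations over a user's activity log. This is a hypothetical illustration, not Facebook's implementation: the function name, the argument layout, and the reading of "at least twice" as applying to profile visits only are all assumptions.

```python
from collections import Counter

def network_sizes(user, friends, messages, feed_clicks, profile_visits):
    """Return the four network sizes for `user` over one observation window.

    friends:        set of the user's confirmed friends
    messages:       (sender, recipient) pairs for messages, wall posts, comments
    feed_clicks:    (viewer, author) pairs for clicked Newsfeed stories
    profile_visits: (viewer, profile_owner) pairs, one per visit
    """
    sent_to = {r for s, r in messages if s == user and r in friends}
    received_from = {s for s, r in messages if r == user and s in friends}
    clicked = {a for v, a in feed_clicks if v == user and a in friends}
    visits = Counter(o for v, o in profile_visits if v == user and o in friends)
    visited_twice = {o for o, count in visits.items() if count >= 2}
    return {
        "all_friends": len(friends),
        "reciprocal": len(sent_to & received_from),
        "one_way": len(sent_to),
        "maintained": len(clicked | visited_twice),
    }

# Toy data (hypothetical users).
sizes = network_sizes(
    user="u",
    friends={"a", "b", "c", "d", "e"},
    messages=[("u", "a"), ("a", "u"), ("u", "b"), ("c", "u")],
    feed_clicks=[("u", "a"), ("u", "d")],
    profile_visits=[("u", "e"), ("u", "e"), ("u", "c")],
)
print(sizes)
```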
  • 9. Findings
    • As a function of the # of friends a user has, she is passively engaging with 2 to 2.5 times more people than those with whom she directly communicates
  • Systemic Effects
    • The stark contrast between these networks shows the effect of technologies like Newsfeed.
  • Content Production among New Users
    Mission: Give people the power to share and make the world more open and connected.
    Question: What mechanisms lead Facebook newcomers to share content on the site?
  • 10. Content Production
    In new users’ first two weeks:
    45% upload a photo
    41% use a 3rd-party app
    30% send a private message
    27% compose a status update
    22% write on a friend’s wall
  • 12. Production Incentives Hypotheses
    H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.
    H2: Newcomers whose initial content receives greater distribution will go on to produce more content.
    H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.
    H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.
  • 13. Method
    • 40-minute semi-structured interviews with seven new users
    • 14. Recorded audio/video and screen
    • 15. Asked about typical uses of Facebook, content production, social norms, privacy
    Selected two cohorts:
    Nov. 5, 2007 (N = 347,403)
    Mar. 3, 2008 (N = 254,603)
    Observed activity in their first two weeks
    Predicted how many photos they would upload between their third and fifteenth weeks on Facebook
  • 16. Features
    Independent Variables
    H1. Feedback
    Comments received
    H2. Distribution
    # of times content was viewed in Newsfeed
    # of friends who viewed content in Newsfeed
    H3. Social Learning
    Number of friends’ photos seen
    H4. Singling Out
    Number of times tagged
  • 25. Results
    Model 1 – Early Uploaders
    Model 2 – Everyone
  • 26. Summary of Results
    • We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.
    • 27. For new users already uploading photos, feedback is associated with increased content production, and distribution is marginally important.

  • Modeling Contagion Through Newsfeed
    How do ideas spread through a social network?
    Use Facebook Pages to model diffusion patterns
    Compare results with existing models of diffusion
    Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s interconnectedness and diffusion properties.
    Note: Research based on “old” Facebook (pre-March 2009)
    Still relevant: first empirical analysis of large-scale collisions of short chains
  • 28. Theory of the Influentials
    Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)
    Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free
    $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)
  • 29. Contagion Theory
    Duncan Watts: Anyone can be an influencer.
    Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not
    Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.
  • 30. How Do Ideas Spread on Facebook?
    News Feed allows for efficient diffusion of ideas
    Facebook’s Pages product is one of the most viral features of the site.
    People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents
    Alice fans a Page
    Chain of Length 1
    Bob sees Alice’s action on his News Feed; Bob fans the Page as well
    Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well
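The chain structure on this slide can be sketched as a directed graph of (actor, follower) edges in which a node may have several parents. The code below is a minimal illustration under stated assumptions: chain length is counted in edges (so Alice to Bob is a chain of length 1, as on the slide), the graph is acyclic because a follower always fans after the actor, and the users Dana and Erin are hypothetical additions to the slide's example.

```python
def max_chain_lengths(edges):
    """Map each user to the longest diffusion chain (in edges) they start."""
    children = {}
    nodes = set()
    for actor, follower in edges:
        children.setdefault(actor, []).append(follower)
        nodes.update((actor, follower))

    memo = {}

    def longest(node):
        # Longest downstream path; memoized so shared descendants are visited once.
        if node not in memo:
            memo[node] = max(
                (1 + longest(child) for child in children.get(node, [])),
                default=0,
            )
        return memo[node]

    return {node: longest(node) for node in nodes}

edges = [
    ("Alice", "Bob"),
    ("Alice", "Charlie"),
    ("Bob", "Dana"),      # Dana saw both Bob and Charlie in one Feed story,
    ("Charlie", "Dana"),  # so she has two parents.
    ("Dana", "Erin"),
]
lengths = max_chain_lengths(edges)
print(lengths)
```

The recursion is fine for small examples; a graph with very long chains would need an iterative traversal in topological order instead.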
  • 31. Large-Scale Result: Large Connected Trees of Diffusion
    Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.
  • 32. Large Connected Clusters
    Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.
    Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.
    For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.
  • 33. How Do These Large Clusters Come About?
    Are these large clusters started by “one guy”?
    No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”
    The variability in this percentage becomes very small as # fans increases
    The average node in the biggest cluster is connected to 2.899 others.
    Large clusters are formed when many long chains of diffusion merge together.
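The two statistics discussed on these slides, the share of fans in the largest connected cluster and the share of "start points" inside it, can be sketched with a union-find over the diffusion edges. This is an illustrative sketch only; the function name and toy data are hypothetical, and a "start point" is assumed here to mean a fan with no parent in the diffusion graph.

```python
def largest_cluster_stats(fans, edges):
    """Return (share of fans in the biggest cluster, share of start points in it)."""
    parent = {fan: fan for fan in fans}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_parent = set()
    for actor, follower in edges:
        parent[find(actor)] = find(follower)  # merge the two clusters
        has_parent.add(follower)

    clusters = {}
    for fan in fans:
        clusters.setdefault(find(fan), []).append(fan)

    biggest = max(clusters.values(), key=len)
    start_points = [fan for fan in biggest if fan not in has_parent]
    return len(biggest) / len(fans), len(start_points) / len(biggest)

fans = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("e", "f")]
cluster_share, start_share = largest_cluster_stats(fans, edges)
print(cluster_share, start_share)

# The slide's own example: 71,090 of 96,922 fans in one cluster.
slide_share = 71_090 / 96_922  # about 0.733, matching the 73.3% on the slide
```

In the toy data, two independent start points ("a" and "c") end up in the same cluster because their chains merge at "b", which is the mechanism the slide describes.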
  • 34. Diffusion Chains on Facebook vs. Real Life
    The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.
    In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person
    Only 38% of paths involve at least four individuals (Brown & Reingen 1987)
    On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals
  • 35. How are Long Diffusion Chains Created?
    Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook
    Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.
    If size can be predicted, we can then identify the most influential users.
  • 36. Data
    Data consists of all the associations (actor → follower) for a representative selection of Pages.
    Pages were at least 40 days old and had at least 5,000 fans
  • 37. Prediction Model
    Response: max_chain_length
    log age
    log Facebook_age
    log feed_exposure (# friends who saw News Feed story)
    log friend_count
    log activity_count (wall posts + messages sent + photos added)
    log popularity (controls for News Feed exposure via Coefficient)
    Method: zero-inflated negative binomial regression
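For readers unfamiliar with the method, here is a minimal sketch of the zero-inflated negative binomial distribution itself. It assumes the common mean/dispersion parameterization (mean `mu`, variance `mu + alpha * mu**2`) plus an extra probability `pi_zero` of a structural zero; the talk does not say which parameterization was used, and the function names and parameter values are hypothetical.

```python
import math

def nb_pmf(k, mu, alpha):
    """Negative binomial P(Y = k) with mean mu > 0 and dispersion alpha > 0."""
    r = 1.0 / alpha
    p = r / (r + mu)
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)

def zinb_pmf(k, mu, alpha, pi_zero):
    """Zero-inflated NB: with probability pi_zero the count is forced to 0."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, alpha)
    return pi_zero + base if k == 0 else base

mu, alpha, pi_zero = 2.0, 1.5, 0.3
total = sum(zinb_pmf(k, mu, alpha, pi_zero) for k in range(500))
mean = sum(k * zinb_pmf(k, mu, alpha, pi_zero) for k in range(500))
print(total, mean)
```

The negative binomial allows the variance to exceed the mean, unlike the Poisson, and the zero-inflation term accounts for the many users whose action starts no chain at all.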
  • 38. Results
    Only consistent coefficient is on feed_exposure (# friends who saw News Feed story).
    Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain_length
    Implies that friend_count is not realistically meaningful.
    After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
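The "1% more exposure, 1% longer chain" reading follows from having a log-transformed predictor in a count model with a log link: if log E[y] = b0 + beta * log(x), then E[y] is proportional to x ** beta. The helper below is a hypothetical sketch of that arithmetic and applies to the count component of the model only; it is not taken from the talk.

```python
def chain_length_ratio(beta, exposure_increase):
    """Multiplicative change in expected max_chain_length when feed_exposure
    grows by `exposure_increase` (0.01 means 1%), given coefficient `beta`."""
    return (1.0 + exposure_increase) ** beta

ratio_one_percent = chain_length_ratio(beta=1.0, exposure_increase=0.01)
ratio_doubling = chain_length_ratio(beta=1.0, exposure_increase=1.0)
print(ratio_one_percent, ratio_doubling)
```

With a coefficient of exactly 1 the relationship is proportional, so doubling exposure doubles the expected chain length; a coefficient below 1 would imply diminishing returns.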
  • 39. Conclusions
    Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.
    The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.
    Chain length is not a function of Facebook age, activity, users’ demographics, or even # of friends; it is related only to exposure.
  • 40. Contact