DataEngConf: The Science of Virality at BuzzFeed

By Adam Kelleher (Sr Data Scientist, BuzzFeed)
BuzzFeed has developed the technology to attribute pageviews to a referring user. Using these data, we can construct diffusion graphs for our articles. These graphs introduce a whole collection of new performance metrics, and their complexity opens the door for a new assortment of complications to go with them. I'll mention some past work that has been done to create similar data from old pageview events. Then, I'll work through how we process these data into graph objects (avoiding some pitfalls), and mention some of the new ways of looking at web analytics implied by these objects. I'll talk about how we can take advantage of the structure of these objects to make certain algorithms more efficient. Finally, I'll cover some of the future applications we're particularly excited about!

  1. 1. HISTORY OF VIRALITY
  2. 2. THE DATA
  3. 3. THE DATA: OLD VERSION Article being viewed User viewing article Time of pageview Referring domain
  4. 4. THE DATA: NEW VERSION Article being viewed Time of pageview Referring domain User viewing article Referring User
  5. 5. DIFFERENT PERSPECTIVE: Pageviews are a process on a graph!
  6. 6. WHAT THE GRAPH LOOKS LIKE:
  7. 7. WHAT THE PROCESS LOOKS LIKE:
  8. 8. WHAT THE DATA LOOKS LIKE:
  9. 9. WHAT CAN YOU DO WITH OLD PAGEVIEWS? (Educated) Guess!
  10. 10. CONNIE
  11. 11. OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE Probabilistic: You can infer connections that aren’t there! Error Prone: Graph statistics can be susceptible to small changes in the graph. The error gets larger as the differences in pageview times get smaller.
  12. 12. SIMPLIFIED VERSION: Observe: Guess:
  13. 13. SIMPLIFIED VERSION: Guess: Reality:
  14. 14. Check out a toy implementation here! github.com/akellehe/pyconnie
  15. 15. NEW GRAPH RECONSTRUCTION: TRIVIAL These are actually Unique Visitors …
  16. 16. LIFE IS A LITTLE MESSY… This is more like what the Pageview graph looks like
  17. 17. PROBLEM: DATA MUNGING • Lots of potential for heuristics! • How do we get promotion attribution from propagations? • Trees are important: how can we be sure we get them?
  18. 18. PROBLEM: STREAMLINING ANALYSIS • How do we work from a common set of definitions? • How do we avoid repeating analysis? • How can we streamline data visualization? EDA? • How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
  19. 19. DEFINE DATA STRUCTURES! • All data munging happens “under the hood” • Data pre-processing is unit-tested • No room for heuristics: standardization! • Hard math definitions can be consistency-checked!
  20. 20. PROPAGATION SET For one article For the site (or other set of articles, S)
  21. 21. PROPAGATION SET Pageviews to article b in time T Pageviews to the site in time T The simplest data structure. Just a representation of the raw pageview logs. Represented as a generator of UserEdge objects
  22. 22. PROPAGATION GRAPH,
  23. 23. PROPAGATION GRAPH
  24. 24. PROPAGATION GRAPH
  25. 25. INFLUENCE GRAPH Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user
  26. 26. CONSIDER:
  27. 27. PROPAGATION FOREST
  28. 28. PROPAGATION FOREST The propagation graph is great, but we’d also like a concept like unique visitors! If there is attribution ordering in the graph, we can trace content back to its source!
  29. 29. PROPAGATION FOREST: FIRST PARENT ATTRIBUTION n pageviews One UV
  30. 30. PROPAGATION FOREST gets the credit
  31. 31. RESULT: ALL GRAPHS ARE FORESTS Promotions have 0 indegree, Users have 1 indegree total edges in connected components: Trees!
  32. 32. CAREFUL FOR EDGE CASES: MISSING DATA? All connected components should be rooted at a promotion source. What happens if we lose the first edge (e.g. use the wrong T)?
  33. 33. PROPAGATION FOREST: CYCLE BREAKING Consider … Cycle is not broken by first-parent attribution Traversal algorithms go on forever!
  34. 34. PROPAGATION FOREST: CYCLE BREAKING Consider … As long as they’re not equal, they can be ordered. Then, there is a node in the cycle with an out-edge younger than its in-edge: The original pageview for that node must have been lost. Cut the in-edge (FPA!).
  35. 35. SUCCESS! Cycle-breaking + FPA = Trees! Each tree is the UV graph downstream from a promotion source: promotion attribution! Additional Benefits: Most information diffusion analyses model trees growing on graphs. Many algorithms simplify when run on trees!
  36. 36. SUPERTREE We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that? Merge all the roots (promotion sources) together into one “super-node” The whole forest becomes a SuperTree!
  37. 37. SUPERTREE: EXAMPLE
  38. 38. SUPERTREE: EXAMPLE
  39. 39. APPLICATION: LARGE SCALE DATA VIS
  40. 40. WHY IS IT SLOW? Layouts often consider repelling each node from every other: time complexity Good for a few thousand nodes
  41. 41. OPENORD: SIMULATED ANNEALING Linear main layout Quadratic settling Phase Implemented in Gephi
  42. 42. OPENORD Good for ~10k Users Slow for ~100k Users Messy! (if you skip the quadratic step!)
  43. 43. TAKE ADVANTAGE OF TREE STRUCTURE! Traverse the tree to decide where to place nodes!
  44. 44. H3 LAYOUT Each parent is in the center of a hemisphere. Children are laid out on the surface of the hemisphere They become centers of smaller hemispheres (if they’re parents) Etc.
  45. 45. A NEW IMPLEMENTATION pip install pyh3
  46. 46. WITH D3
  47. 47. MORE APPLICATIONS
  48. 48. ATTRIBUTION Instead of
  49. 49. CASCADE PREDICTION
  50. 50. GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!
  51. 51. TEST THE INFLUENTIALS HYPOTHESIS
  52. 52. IMPROVE CONTENT TARGETING
  53. 53. FINDING THE CAUSES OF VIRALITY Consider Fitting a Model: User Features, content features, context features, User pair features
  54. 54. UNDER CONSTRUCTION: Online Regression! Real-time feature weights tell which features correlate with propagation probabilities! Drives hypothesis-building!
  55. 55. THE TEAM
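
Slides 21 and 29 to 35 describe turning a stream of UserEdge objects into a forest using first-parent attribution (FPA) and cycle breaking. The following is a minimal sketch of that idea, not BuzzFeed's implementation. The field names on `UserEdge`, the function names, and the choice to drop self-referrals are all assumptions made for illustration. The sketch also reads "an out-edge younger than its in-edge" as "an out-edge with an earlier timestamp than the in-edge".

```python
from collections import namedtuple

# Hypothetical shape of one pageview record: who referred whom, and when.
UserEdge = namedtuple("UserEdge", ["referrer", "viewer", "timestamp"])


def first_parent_attribution(edges):
    """Keep only the earliest in-edge for each viewer (n pageviews, one UV)."""
    first = {}
    for edge in edges:
        if edge.referrer == edge.viewer:
            continue  # assumption: a self-referral carries no attribution
        kept = first.get(edge.viewer)
        if kept is None or edge.timestamp < kept.timestamp:
            first[edge.viewer] = edge
    return first  # maps viewer -> its single in-edge


def break_cycles(in_edge):
    """Cut one in-edge per cycle, at a node whose out-edge predates its in-edge.

    If every timestamp in a cycle is equal, no edge qualifies and the
    cycle is left in place (the slides assume distinct timestamps).
    """
    in_edge = dict(in_edge)
    done = set()
    for start in list(in_edge):
        path, on_path, node = [], set(), start
        # Walk up toward the root until we hit a root, a settled node, or a cycle.
        while node in in_edge and node not in done and node not in on_path:
            path.append(node)
            on_path.add(node)
            node = in_edge[node].referrer
        if node in on_path:
            cycle = path[path.index(node):]
            for i, child in enumerate(cycle):
                parent = cycle[(i + 1) % len(cycle)]
                # in_edge[child] is the parent's out-edge inside the cycle.
                if in_edge[child].timestamp < in_edge[parent].timestamp:
                    del in_edge[parent]  # the parent's original pageview was lost
                    break
        done.update(path)
    return in_edge


def roots(in_edge):
    """Nodes with indegree 0: promotion sources, or users whose first view is missing."""
    return {edge.referrer for edge in in_edge.values()} - set(in_edge)
```

A usage example on made-up data: with edges `fb→a (t=1)`, `a→b (t=2)`, `c→b (t=3)`, `b→c (t=4)` and a cycle `x→y (t=5)`, `y→z (t=6)`, `z→x (t=7)`, FPA drops `c→b`, and cycle breaking cuts `z→x`, leaving `x` as a root that is not a promotion source. That is the missing-data case slide 32 warns about.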

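Slide 36 describes merging all roots (promotion sources) into one super-node so that a tree algorithm can run over the whole forest at once. The sketch below illustrates that under some assumptions: the forest is given as a plain `child -> parent` dictionary, and the `SUPER_ROOT` label and both function names are hypothetical. Subtree size is used only as an example of a tree statistic.

```python
from collections import defaultdict

SUPER_ROOT = "__super_root__"  # hypothetical label for the merged super-node


def to_supertree(parent):
    """Build a children map in which every root of the forest becomes SUPER_ROOT.

    parent: dict mapping each non-root node to its parent.
    """
    forest_roots = set(parent.values()) - set(parent)
    children = defaultdict(list)
    for child, par in parent.items():
        children[SUPER_ROOT if par in forest_roots else par].append(child)
    return dict(children)


def subtree_sizes(children, root=SUPER_ROOT):
    """Count the nodes in every subtree with one iterative traversal."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)  # parents are always recorded before their children
        stack.extend(children.get(node, []))
    sizes = {}
    for node in reversed(order):  # so children are sized before their parents
        sizes[node] = 1 + sum(sizes[c] for c in children.get(node, []))
    return sizes
```

Because the roots collapse into a single node, one traversal from `SUPER_ROOT` visits every user in the forest exactly once, which is the simplification the slide is after.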