
By Adam Kelleher (Sr Data Scientist, BuzzFeed)

BuzzFeed has developed the technology to attribute pageviews to a referring user. Using these data, we can construct diffusion graphs for our articles. These graphs introduce a whole collection of new performance metrics, and their complexity opens the door for a new assortment of complications to go with them. I'll mention some past work that has been done to create similar data from old pageview events. Then, I'll work through how we process these data into graph objects (avoiding some pitfalls), and mention some of the new ways of looking at web analytics implied by these objects. I'll talk about how we can take advantage of the structure of these objects to make certain algorithms more efficient. Finally, I'll cover some of the future applications we're particularly excited about!

----- Meeting Notes (9/30/15 14:43) -----

We've reconstructed a piece of the social graph!

- 1. HISTORY OF VIRALITY
- 2. THE DATA
- 3. THE DATA: OLD VERSION • Article being viewed • User viewing article • Time of pageview • Referring domain
- 4. THE DATA: NEW VERSION • Article being viewed • Time of pageview • Referring domain • User viewing article • Referring User
- 5. DIFFERENT PERSPECTIVE: Pageviews are a process on a graph!
- 6. WHAT THE GRAPH LOOKS LIKE:
- 7. WHAT THE PROCESS LOOKS LIKE:
- 8. WHAT THE DATA LOOKS LIKE:
- 9. WHAT CAN YOU DO WITH OLD PAGEVIEWS? (Educated) Guess!
- 10. CONNIE
- 11. OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE Probabilistic: You can infer connections that aren’t there! Error Prone: Graph statistics can be susceptible to small changes in the graph. The error gets larger when differences in pageview times get smaller
- 12. SIMPLIFIED VERSION: Observe: Guess:
- 13. SIMPLIFIED VERSION: Guess: Reality:
- 14. Check out a toy implementation here! github.com/akellehe/pyconnie
- 15. NEW GRAPH RECONSTRUCTION: TRIVIAL These are actually Unique Visitors …
- 16. LIFE IS A LITTLE MESSY… This is more like what the Pageview graph looks like
- 17. PROBLEM: DATA MUNGING • Lots of potential for heuristics! • How do we get promotion attribution from propagations? • Trees are important: how can we be sure we get them?
- 18. PROBLEM: STREAMLINING ANALYSIS • How do we work from a common set of definitions? • How do we avoid repeating analysis? • How can we streamline data visualization? EDA? • How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
- 19. DEFINE DATA STRUCTURES! • All data munging happens “under the hood” • Data pre-processing is unit-tested • No room for heuristics: standardization! • Hard math definitions can be consistency-checked!
- 20. PROPAGATION SET For one article For the site (or other set of articles, S)
- 21. PROPAGATION SET Pageviews to article b in time T Pageviews to the site in time T The simplest data structure. Just a representation of the raw pageview logs. Represented as a generator of UserEdge objects
- 22. PROPAGATION GRAPH,
- 23. PROPAGATION GRAPH
- 24. PROPAGATION GRAPH
- 25. INFLUENCE GRAPH Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user
- 26. CONSIDER:
- 27. PROPAGATION FOREST
- 28. PROPAGATION FOREST The propagation graph is great, but we’d also like a concept like unique visitors! If there is attribution ordering in the graph, we can trace content back to its source!
- 29. PROPAGATION FOREST: FIRST PARENT ATTRIBUTION n pageviews One UV
- 30. PROPAGATION FOREST gets the credit
- 31. RESULT: ALL GRAPHS ARE FORESTS Promotions have 0 indegree, Users have 1 indegree, so a connected component with n nodes has n - 1 total edges: Trees!
- 32. CAREFUL FOR EDGE CASES: MISSING DATA? All connected components should be rooted at a promotion source. What happens if we lose the first edge (e.g. use the wrong T)?
- 33. PROPAGATION FOREST: CYCLE BREAKING Consider … Cycle is not broken by first-parent attribution Traversal algorithms go on forever!
- 34. PROPAGATION FOREST: CYCLE BREAKING Consider … As long as the edge times in the cycle are not equal, they can be ordered. Then, there is a node in the cycle with an out-edge younger than its in-edge: The original pageview for that node must have been lost. Cut the in-edge (FPA!).
- 35. SUCCESS! Cycle-breaking + FPA = Trees! Each tree is the UV graph downstream from a promotion source: promotion attribution! Additional Benefits: Most information diffusion analyses model trees growing on graphs. Many algorithms simplify when run on trees!
- 36. SUPERTREE We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that? Merge all the roots (promotion sources) together into one “super-node” The whole forest becomes a SuperTree!
- 37. SUPERTREE: EXAMPLE
- 38. SUPERTREE: EXAMPLE
- 39. APPLICATION: LARGE SCALE DATA VIS
- 40. WHY IS IT SLOW? Layouts often consider repelling each node from every other: O(n^2) time complexity. Good for a few thousand nodes
- 41. OPENORD: SIMULATED ANNEALING Linear main layout Quadratic settling Phase Implemented in Gephi
- 42. OPENORD Good for ~10k Users Slow for ~100k Users Messy! (if you skip the quadratic step!)
- 43. TAKE ADVANTAGE OF TREE STRUCTURE! Traverse the tree to decide where to place nodes!
- 44. H3 LAYOUT Each parent is in the center of a hemisphere. Children are laid out on the surface of the hemisphere They become centers of smaller hemispheres (if they’re parents) Etc.
- 45. A NEW IMPLEMENTATION pip install pyh3
- 46. WITH D3
- 47. MORE APPLICATIONS
- 48. ATTRIBUTION Instead of
- 49. CASCADE PREDICTION
- 50. GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!
- 51. TEST THE INFLUENTIALS HYPOTHESIS
- 52. IMPROVE CONTENT TARGETING
- 53. FINDING THE CAUSES OF VIRALITY Consider Fitting a Model: User Features, content features, context features, User pair features
- 54. UNDER CONSTRUCTION: Online Regression! Real-time feature weights tell which features correlate with propagation probabilities! Drives hypothesis-building!
- 55. THE TEAM
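
The first-parent attribution (FPA) and cycle-breaking steps on slides 29 to 34 can be sketched roughly as follows. This is a minimal illustration, not BuzzFeed's implementation. The slides name a `UserEdge` type but not its fields, so the `time`, `referrer`, and `viewer` fields here are assumptions, as are the function names `build_forest` and `root_of`. The sketch also assumes all edge times are distinct, and it cuts the in-edge of every node whose earliest out-edge precedes its first in-edge, not only nodes that sit on a cycle.

```python
from collections import namedtuple

# Assumed record layout; the slides only mention that UserEdge objects exist.
UserEdge = namedtuple("UserEdge", ["time", "referrer", "viewer"])


def build_forest(edges):
    """Return a viewer -> parent map using first-parent attribution.

    A viewer's in-edge is cut (the viewer becomes a root) when the viewer
    referred someone before its own first recorded pageview, which suggests
    the original pageview was lost.
    """
    first_in = {}   # viewer -> earliest in-edge
    first_out = {}  # referrer -> time of earliest out-edge
    for edge in sorted(edges, key=lambda e: e.time):
        if edge.referrer == edge.viewer:
            continue  # ignore self-referrals
        first_in.setdefault(edge.viewer, edge)
        first_out.setdefault(edge.referrer, edge.time)

    parent = {}
    for viewer, edge in first_in.items():
        if first_out.get(viewer, float("inf")) < edge.time:
            continue  # out-edge precedes in-edge: cut the in-edge
        parent[viewer] = edge.referrer
    return parent


def root_of(parent, user):
    """Walk parent links upward until reaching a node with no parent."""
    seen = set()
    while user in parent:
        if user in seen:
            raise ValueError("cycle detected")
        seen.add(user)
        user = parent[user]
    return user


edges = [
    UserEdge(1, "a", "b"),       # a's own first pageview is missing
    UserEdge(2, "b", "c"),
    UserEdge(3, "c", "a"),       # closes a cycle a -> b -> c -> a
    UserEdge(4, "promo", "d"),
    UserEdge(5, "d", "e"),
    UserEdge(6, "promo2", "e"),  # later parent of e, dropped by FPA
]
forest = build_forest(edges)
```

In this toy input, user `a` shares before it is ever seen viewing, so its in-edge from `c` is cut and the cycle opens into a chain rooted at `a`. User `e` keeps only its earliest referrer, `d`, so it counts once under the `promo` tree.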

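The SuperTree idea from slide 36 (merge every promotion source into one super-node so a tree algorithm can run over the whole forest) might look something like the sketch below. The forest is assumed to be a plain child -> parent dictionary, and the names `SUPER_ROOT`, `to_supertree`, and `depth` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict

SUPER_ROOT = "__super_root__"  # hypothetical label for the merged root


def to_supertree(parent):
    """Turn a child -> parent forest into a parent -> children SuperTree.

    Roots are nodes that appear as parents but never as children. All of
    them are replaced by a single SUPER_ROOT node.
    """
    roots = set(parent.values()) - set(parent.keys())
    children = defaultdict(list)
    for child, par in parent.items():
        children[SUPER_ROOT if par in roots else par].append(child)
    return dict(children)


def depth(children, node=SUPER_ROOT):
    """Length of the longest path from node down to a leaf (iterative DFS)."""
    best = 0
    stack = [(node, 0)]
    while stack:
        current, level = stack.pop()
        best = max(best, level)
        for child in children.get(current, []):
            stack.append((child, level + 1))
    return best


forest = {
    "alice": "promo:facebook",
    "bob": "alice",
    "dave": "promo:twitter",
}
supertree = to_supertree(forest)
```

Because the result is a single tree, one traversal from `SUPER_ROOT` yields a forest-wide statistic such as the maximum cascade depth, with no need to loop over the individual trees.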