By Adam Kelleher (Sr Data Scientist, BuzzFeed)
BuzzFeed has developed the technology to attribute pageviews to a referring user. Using these data, we can construct diffusion graphs for our articles. These graphs introduce a whole collection of new performance metrics, and their complexity opens the door for a new assortment of complications to go with them. I'll mention some past work that has been done to create similar data from old pageview events. Then, I'll work through how we process these data into graph objects (avoiding some pitfalls), and mention some of the new ways of looking at web analytics implied by these objects. I'll talk about how we can take advantage of the structure of these objects to make certain algorithms more efficient. Finally, I'll cover some of the future applications we're particularly excited about!
OLD GRAPH RECONSTRUCTION:
Probabilistic: You can infer connections that aren’t
Error Prone: Graph statistics can be susceptible to small changes in the graph
Gets larger when differences in pageview times get smaller
Check out a toy implementation here!
NEW GRAPH RECONSTRUCTION:
LIFE IS A LITTLE
PROBLEM: DATA MUNGING
• Lots of potential for heuristics!
• How do we get promotion attribution?
• Trees are important: how can we be sure we get them?
• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?
• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!
For one article
For the site (or other set of articles, S)
Pageviews to article b in time T
Pageviews to the site in time T
The simplest data structure. Just a representation of the raw pageview logs.
Represented as a generator of UserEdge objects
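A minimal sketch of what a generator of UserEdge objects could look like. Only the name UserEdge comes from the slides; the field names (referrer, user, article, timestamp), the row format, and the helper user_edges are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class UserEdge:
    """One attributed pageview: `referrer` led `user` to view `article`.

    All field names here are hypothetical, not the real schema.
    """
    referrer: str   # referring user ID, or a promotion source
    user: str       # the user who viewed the page
    article: str
    timestamp: float


def user_edges(log_rows: Iterable[dict]) -> Iterator[UserEdge]:
    """Lazily yield one UserEdge per raw pageview row (hypothetical helper)."""
    for row in log_rows:
        yield UserEdge(
            referrer=row["referrer"],
            user=row["user"],
            article=row["article"],
            timestamp=float(row["timestamp"]),
        )


# Toy rows standing in for raw pageview logs.
rows = [
    {"referrer": "promo:homepage", "user": "u1", "article": "b", "timestamp": 1.0},
    {"referrer": "u1", "user": "u2", "article": "b", "timestamp": 2.0},
]
edges = list(user_edges(rows))
```

Because the edges come from a generator, downstream code can stream through a large log without holding it all in memory.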
RESULT: ALL GRAPHS
Promotions have 0 indegree,
Users have 1 indegree
total edges in connected components:
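The indegree rules above can be checked mechanically. The sketch below is an assumption-laden illustration: the edge-count formula on the slide did not survive, so it substitutes the standard fact that a forest with n nodes and c connected components has n - c edges. The "promo:" prefix used to recognise promotion nodes and all function names are hypothetical.

```python
from collections import defaultdict


def indegrees(edges):
    """Count in-edges per node; nodes that only appear as sources get 0."""
    deg = defaultdict(int)
    for src, dst in edges:
        deg[src] += 0   # make sure the source is present in the result
        deg[dst] += 1
    return dict(deg)


def count_components(edges):
    """Number of weakly connected components, via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in edges:
        parent[find(src)] = find(dst)
    return len({find(n) for n in parent})


def is_promotion(node):
    # Hypothetical naming convention for promotion sources.
    return node.startswith("promo:")


def check_forest(edges):
    """True if promotions have indegree 0, users indegree 1, and
    the edge count equals nodes minus components (assumes no duplicate edges)."""
    deg = indegrees(edges)
    degrees_ok = all(d == (0 if is_promotion(n) else 1) for n, d in deg.items())
    edge_count_ok = len(edges) == len(deg) - count_components(edges)
    return degrees_ok and edge_count_ok


edges = [
    ("promo:home", "u1"),
    ("u1", "u2"),
    ("u1", "u3"),
    ("promo:email", "u4"),
]
forest_ok = check_forest(edges)
```

A check like this fits naturally into unit-tested pre-processing: a failure signals missing data or a cycle before any downstream statistic is computed.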
CAREFUL FOR EDGE CASES: MISSING DATA?
All connected components should be rooted at a
What happens if we lose the first edge (e.g. use the
Consider … Cycle is not broken by
Traversal algorithms go
As long as they’re not equal, they can be
Then, there is a node in the cycle with an out-edge younger than its in-edge: the original pageview for that node must have been lost. Cut the in-edge.
Cycle-breaking + FPA = Trees!
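The cycle-breaking rule can be sketched as follows. This is an illustration under stated assumptions, not the production code: each node has at most one in-edge, stored as node -> (referrer, timestamp); "younger" is read as "has an earlier timestamp"; and when several nodes in a cycle qualify, the first one found is cut. The function name break_cycles is hypothetical.

```python
def break_cycles(parent):
    """Return a copy of `parent` with one in-edge cut per cycle.

    `parent` maps node -> (referrer, timestamp) for the node's single in-edge.
    Cycles whose edge timestamps are all equal are left untouched.
    """
    parent = dict(parent)
    state = {}  # node -> "active" (on the current walk) or "done"
    for start in list(parent):
        path = []
        node = start
        # Walk upstream until we leave the graph or reach a visited node.
        while node in parent and node not in state:
            state[node] = "active"
            path.append(node)
            node = parent[node][0]
        if node in parent and state.get(node) == "active":
            # path[i + 1] is the referrer of path[i]; the last node is
            # referred by the first, which closes the cycle.
            cycle = path[path.index(node):]
            for i, member in enumerate(cycle):
                in_time = parent[member][1]
                # member's out-edge inside the cycle is the in-edge of cycle[i - 1].
                out_time = parent[cycle[i - 1]][1]
                if out_time < in_time:
                    del parent[member]  # original pageview was lost: cut in-edge
                    break
        for visited in path:
            state[visited] = "done"
    return parent


# u1's original pageview is missing, so its first recorded view (t=4) is
# attributed to u3, closing the cycle u1 -> u2 -> u3 -> u1.
observed = {
    "u2": ("u1", 2.0),
    "u3": ("u2", 3.0),
    "u1": ("u3", 4.0),
    "u4": ("u3", 5.0),
}
trees = break_cycles(observed)
```

After the cut, u1 has no in-edge and becomes the root of its component, so the remaining structure is a tree.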
Each tree is the UV graph downstream from a promotion source: promotion attribution!
Most information diffusion analyses model trees growing on
Many algorithms simplify when run on trees!
We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?
Merge all the roots (promotion sources) together into
The whole forest becomes a SuperTree!
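A minimal sketch of the SuperTree idea, assuming the roots are merged into a single node (the slide text is cut off at that point). The SUPER_ROOT label and both function names are hypothetical. Once the forest is a single tree, any tree routine, such as the depth calculation here, runs over the whole forest in one pass.

```python
from collections import defaultdict

SUPER_ROOT = "SUPER_ROOT"  # hypothetical label for the merged root


def merge_roots(edges):
    """Contract every root (a node with no in-edge) into SUPER_ROOT."""
    has_in_edge = {dst for _, dst in edges}
    return [(src if src in has_in_edge else SUPER_ROOT, dst) for src, dst in edges]


def depth(edges, root):
    """Longest root-to-leaf path length, computed with an explicit stack."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    best = 0
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        best = max(best, level)
        stack.extend((child, level + 1) for child in children[node])
    return best


forest = [
    ("promo:home", "u1"),
    ("u1", "u2"),
    ("promo:email", "u3"),
]
supertree = merge_roots(forest)
supertree_depth = depth(supertree, SUPER_ROOT)
```

Merging keeps the edge count unchanged, so per-edge statistics computed on the SuperTree match those of the original forest; per-root statistics need the original promotion labels kept elsewhere.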