Hadoop World 2011: Building Relational Event History Model with Hadoop - Josh Lospinoso- University of Oxford

"In this session we will look at Reveal, a statistical network analysis library built on Hadoop that uses relational event history analysis to grapple with the complexity, temporal causality, and uncertainty associated with dynamically evolving, growing, and changing networks. There are a broad range of applications for this work, from finance to social
network analysis to network security."

Published in: Technology, Business

  1. The Relational Event Analysis Toolkit. Hadoop World 2011, New York, NY. Josh Lospinoso (josh@redowlconsulting.com), Guy Filippelli (guy@redowlconsulting.com)
  2. What’s the problem? We want to say something intelligent about how individuals behave in networks:
     – General tendencies in behavior
     – Systematic deviations from general tendencies
     – Anomalous behavior & event detection
     – Key event mining
  3. …but everything depends on everything. What weighs in on how often:
     – Computers send packets to each other
     – Co-workers email each other
     – Businesses transact goods
     – Proteins interact
     These data sets are huge, jumbled, dependent messes.
  4. …but everything depends on everything. If we ignore all of this stuff (and it’s the most interesting stuff), we’ve already lost:
     – How can we explain network phenomena without context?
     – How much information do we amputate from the data?
     We can’t treat people, computers, etc. like particles.
  5. This changes everything
     • Truly revolutionary work in statistical social network analysis from academia in the past five years
     • Cheap commodity computer hardware
     • Workable frameworks for computing on this cheap hardware
  6. What influences Josh to email Guy with higher frequency?
  7. Reciprocity
  8. Similarity
  9. Similarity
  10. Similarity
  11. Jeremy normally sends 15 emails per day.
  12. Guy sends Jeremy an email tomorrow morning.
  13. Do we expect Jeremy to send more emails tomorrow?
  14. How statisticians think of the world. The world is chaotic and unpredictable, but it exhibits tendencies. We model why events occur in a network through tendencies surrounding e.g. human or computer behavior. “All models are wrong, some are useful”
  15. [Figure: three nodes A, B, C]
  16. [Figure: three nodes A, B, C] Each of the six possible emails has a 1/6 chance of being the next one (equal odds).
  17. [Figure: A emails B] Let’s say this increases the rate of emails from B to A.
  18. [Figure: three nodes A, B, C]
  19. [Figure: three nodes A, B, C] There’s now a 5/10 chance that the next email we observe is from B to A because of reciprocity.
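
The arithmetic behind the 1/6 and 5/10 figures can be sketched as competing event rates: the chance that a given email is the next one observed is its rate divided by the sum of all rates. The multiplier of 5 on the B to A rate is an assumption, chosen only because it reproduces the 5/10 on the slide; the function and variable names are hypothetical.

```python
from itertools import permutations

people = ["A", "B", "C"]
dyads = list(permutations(people, 2))  # the 6 directed sender/receiver pairs


def next_event_probabilities(rates):
    """Probability that each dyad produces the next event: rate / total rate."""
    total = sum(rates.values())
    return {dyad: rate / total for dyad, rate in rates.items()}


# Before any email: every dyad has the same rate, so each has a 1/6 chance.
baseline = {dyad: 1.0 for dyad in dyads}
p_before = next_event_probabilities(baseline)

# After A emails B: assume reciprocity multiplies the B -> A rate by 5
# (an illustrative value, not an estimate from the talk).
boosted = dict(baseline)
boosted[("B", "A")] = 5.0
p_after = next_event_probabilities(boosted)

print(p_before[("B", "A")])  # 1/6
print(p_after[("B", "A")])   # 5 / (5 + 5 * 1) = 5/10
```

The other five dyads keep rate 1, so the total rate becomes 10 and the B to A share is 5/10.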
  20. Keep in mind that we don’t pick these odds out of thin air. We hypothesize about what behaviors are important…
  21. Keep in mind that we don’t pick these odds out of thin air. We hypothesize about what behaviors are important, then let the data tell us:
     • What’s important?
     • How important is it?
     • How uncertain are we?
  22. At any moment, for 3 people, there are 6 possible directed events that could occur. We model the rates at which these events tend to occur through effects like similarity, reciprocity, the “trickle-down” effect, etc. So, we need to keep track of the context for each of these possible relationships.
  23. …but what if we want to analyze a network of 25K people?
  24. …but what if we want to analyze a network of 25K people? That’s nearly 625M events that could occur at any instant! [ 25000 x 24999 = 624,975,000 ] This is impossible without a scalable compute platform.
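
As a quick check of the counts above, the number of possible directed events among n people is n × (n − 1), since every person can send to every other person. The function name is hypothetical.

```python
def directed_dyads(n):
    """Number of ordered (sender, receiver) pairs among n people, sender != receiver."""
    return n * (n - 1)


small = directed_dyads(3)       # 6 possible directed events for 3 people
large = directed_dyads(25_000)  # 624,975,000, i.e. "nearly 625M"
print(small, large)
```

Because the count grows quadratically, doubling the network roughly quadruples the number of relationships whose context has to be tracked.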
  25. Hadoop. We can use cheap hardware to:
     • Store large amounts of data
     • Perform statistical modeling on this data
  26. Modeling and Estimation. In a simple model, suppose we want to baseline reciprocity of text messages. If I am sent a text message, does that increase the rate at which I send text messages to the sender?
  27. Modeling and Estimation. In a simple model, suppose we want to baseline reciprocity of text messages. If I am sent a text message, does that increase the rate at which I send text messages to the sender? We could elaborate this baseline to see who is bad at responding to emails!
  28. Modeling and Estimation. Define:
     • Set of three people { A, B, C }
     • Messages (s, r, t) ∈ E, where s is sender, r is receiver, t is timestamp
  29. Modeling and Estimation. Define:
     • Set of three people { A, B, C }
     • Messages (s, r, t) ∈ E, where s is sender, r is receiver, t is timestamp
     • The reciprocation function R(s, r, t) = #{ (r, s, t′) ∈ E : t′ < t }
     • The rate function λ(s, r, t) = exp{ θ1 + θ2 R(s, r, t) }
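
A minimal sketch of the two definitions above, under stated assumptions: the reciprocation function counts earlier messages sent in the opposite direction, and the rate is log-linear in that count. The names `reciprocation`, `rate`, `theta1` and `theta2` are hypothetical, and the exponential form is an assumption about the slide's rate function.

```python
import math


def reciprocation(events, s, r, t):
    """Count messages sent from r to s strictly before time t."""
    return sum(1 for (s2, r2, t2) in events if s2 == r and r2 == s and t2 < t)


def rate(events, s, r, t, theta1, theta2):
    """Assumed log-linear rate: exp(theta1 + theta2 * reciprocation count)."""
    return math.exp(theta1 + theta2 * reciprocation(events, s, r, t))


# One earlier message from A to B raises B's reciprocation count toward A to 1.
events = [("A", "B", 0.1)]
count_ba = reciprocation(events, "B", "A", 0.2)
count_ab = reciprocation(events, "A", "B", 0.2)

# With theta2 = 0 the rate is just the baseline exp(theta1).
base = rate(events, "B", "A", 0.2, math.log(0.5), 0.0)
print(count_ba, count_ab, base)
```

Under this form, exp(theta1) is the baseline rate and exp(theta2) is the factor by which each received message multiplies the rate of replying.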
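  30. Modeling and Estimation. [Figure: a sender and receiver pair with reciprocation count = 0; annotated value: .5 per day]
  31. Modeling and Estimation. [Figure: one directed pair with reciprocation count = 1, another with reciprocation count = 0]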
  32. Modeling and Estimation. We have to estimate these rate functions given an event history: A,B,.1   B,A,.2   C,B,.7   C,A,1...
  33. Modeling and Estimation. We want to maximize the probability density of our model over our data. This is called maximum likelihood estimation. Based on a few easy-to-compute derivatives, we work our way forward through the data…
  34. Modeling and Estimation. For our first event, A,B,.1, we have:
     Event History: (empty)
     R(A, B, t) = 0, R(A, C, t) = 0, R(B, A, t) = 0, R(B, C, t) = 0, R(C, A, t) = 0, R(C, B, t) = 0
  35. Modeling and Estimation. For our second event, B,A,.2, we have:
     Event History: A,B,.1
     R(A, B, t) = 0, R(A, C, t) = 0, R(B, A, t) = 1, R(B, C, t) = 0, R(C, A, t) = 0, R(C, B, t) = 0
  36. Modeling and Estimation. For our third event, C,B,.7, we have:
     Event History: A,B,.1   B,A,.2
     R(A, B, t) = 1, R(A, C, t) = 0, R(B, A, t) = 1, R(B, C, t) = 0, R(C, A, t) = 0, R(C, B, t) = 0
  37. Modeling and Estimation. For our fourth event, C,A,1., we have:
     Event History: A,B,.1   B,A,.2   C,B,.7
     R(A, B, t) = 1, R(A, C, t) = 0, R(B, A, t) = 1, R(B, C, t) = 1, R(C, A, t) = 0, R(C, B, t) = 0
  38. Modeling and Estimation. For our fourth event, C,A,1., we have:
     Event History: A,B,.1   B,A,.2   C,B,.7
     R(A, B, t) = 1, R(A, C, t) = 0, R(B, A, t) = 1, R(B, C, t) = 1, R(C, A, t) = 0, R(C, B, t) = 0
     Note that we must know the whole event history!
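
The walk through the event history can be sketched as a single forward pass that records, just before each event, how many messages each directed pair has received from its counterpart. This is an illustrative sketch, not the toolkit's code; the dyad ordering AB, AC, BA, BC, CA, CB and all names are assumptions.

```python
from collections import Counter
from itertools import permutations

HISTORY = [("A", "B", 0.1), ("B", "A", 0.2), ("C", "B", 0.7), ("C", "A", 1.0)]
DYADS = list(permutations("ABC", 2))  # AB, AC, BA, BC, CA, CB


def statistics_before_each_event(history, dyads):
    """For each event, the reciprocation count of every dyad given the prior history."""
    sent = Counter()  # running count of messages per (sender, receiver)
    rows = []
    for s, r, t in history:
        # The reciprocation count for (a, b) is the number of earlier b -> a messages.
        rows.append([sent[(b, a)] for (a, b) in dyads])
        sent[(s, r)] += 1  # only now does this event join the history
    return rows


rows = statistics_before_each_event(HISTORY, DYADS)
for (s, r, t), row in zip(HISTORY, rows):
    print(f"{s},{r},{t}: {row}")
```

Keeping a running counter is one way to carry the whole event history forward without rescanning it at every event; richer statistics would need a richer running summary.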
  39. Modeling and Estimation. Knowing all of these statistics combinations, we can maximize the likelihood function over the parameter space for the rate-function parameters θ.
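
One way to write down the likelihood being maximized is sketched here. It assumes a log-linear rate exp(theta1 + theta2 × reciprocation count), rates that stay constant between consecutive events, and an observation window starting at time 0; these are standard choices for relational event models but are assumptions here, as are all names.

```python
import math
from collections import Counter
from itertools import permutations


def log_likelihood(history, actors, theta1, theta2, t0=0.0):
    """Sum over events of log(rate of the observed dyad) minus
    (waiting time) * (total rate over all dyads)."""
    dyads = list(permutations(actors, 2))
    sent = Counter()  # messages seen so far per (sender, receiver)
    prev_t = t0
    total = 0.0
    for s, r, t in history:
        rates = {(a, b): math.exp(theta1 + theta2 * sent[(b, a)]) for (a, b) in dyads}
        total += math.log(rates[(s, r)]) - (t - prev_t) * sum(rates.values())
        sent[(s, r)] += 1
        prev_t = t
    return total


history = [("A", "B", 0.1), ("B", "A", 0.2), ("C", "B", 0.7), ("C", "A", 1.0)]
ll_flat = log_likelihood(history, "ABC", 0.0, 0.0)
ll_other = log_likelihood(history, "ABC", math.log(0.75), math.log(2.0 / 3.0))
print(ll_flat, ll_other)
```

With both parameters at zero every rate is 1, so the value is minus the total exposure, 6 dyads × 1.0 days, or about −6. The second parameter setting gives a higher log-likelihood for this toy history, which is the kind of comparison the maximization automates.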
  40. MapReduce -- Optimization. For each observation / statistics pair, we calculate the log-likelihood and its first and second derivatives (“contributions”).
     Map: Observ.; Collection → Observ.; Contributions
     Reduce: Observ.; Contributions → Null; AddedContributions
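
The map and reduce steps can be imitated in plain Python to show the shape of the computation; this is a sketch under assumptions, not the toolkit's Hadoop job. Each record pairs one observation with the reciprocation counts of all six dyads; the map step returns that record's log-likelihood, gradient and Hessian, and the reduce step adds them. The log-linear rate, the record layout and all names are hypothetical.

```python
import math
from functools import reduce

# (waiting time, index of observed dyad, counts for dyads AB, AC, BA, BC, CA, CB)
PAIRS = [
    (0.1, 0, [0, 0, 0, 0, 0, 0]),  # A,B,.1
    (0.1, 2, [0, 0, 1, 0, 0, 0]),  # B,A,.2
    (0.5, 5, [1, 0, 1, 0, 0, 0]),  # C,B,.7
    (0.3, 4, [1, 0, 1, 1, 0, 0]),  # C,A,1.
]


def map_contribution(pair, theta):
    """Log-likelihood, gradient and Hessian contributed by one observation."""
    dt, obs, counts = pair
    ll = theta[0] + theta[1] * counts[obs]
    grad = [1.0, float(counts[obs])]
    hess = [[0.0, 0.0], [0.0, 0.0]]
    for c in counts:
        lam = math.exp(theta[0] + theta[1] * c)
        x = (1.0, float(c))
        ll -= dt * lam
        for i in range(2):
            grad[i] -= dt * lam * x[i]
            for j in range(2):
                hess[i][j] -= dt * lam * x[i] * x[j]
    return ll, grad, hess


def reduce_contributions(a, b):
    """Add two contribution triples element by element."""
    grad = [a[1][i] + b[1][i] for i in range(2)]
    hess = [[a[2][i][j] + b[2][i][j] for j in range(2)] for i in range(2)]
    return a[0] + b[0], grad, hess


def newton_step(theta, grad, hess):
    """theta - H^{-1} g for the 2x2 case, solved by Cramer's rule."""
    det = hess[0][0] * hess[1][1] - hess[0][1] * hess[1][0]
    d0 = (hess[1][1] * grad[0] - hess[0][1] * grad[1]) / det
    d1 = (hess[0][0] * grad[1] - hess[1][0] * grad[0]) / det
    return (theta[0] - d0, theta[1] - d1)


theta = (0.0, 0.0)
for _ in range(25):
    ll, grad, hess = reduce(reduce_contributions,
                            (map_contribution(p, theta) for p in PAIRS))
    theta = newton_step(theta, grad, hess)
print(theta)
```

Because the contributions simply add, the map step can run on each record independently and only the small summed triple has to reach the optimizer. For this toy history the iteration settles at theta1 = log(0.75) and theta2 = log(2/3).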
  41. Example -- Enron. ~.5M messages between 150 senior managers.
     Available from http://www.cs.cmu.edu/~enron/
     Baselining for the full dataset is not yet complete. We present and interpret a smaller dataset here (between the top 10 most active users).
  42. Person                       Sent   Received   Total
      jeff.dasovich@enron.com     11566       4961   16527
      tana.jones@enron.com         9947       4416   14363
      sara.shackleton@enron.com    5849       4226   10075
      kay.mann@enron.com           6445       2098    8543
      chris.germany@enron.com      6903       1312    8215
      louise.kitchen@enron.com     1950       3645    5595
      vince.kaminski@enron.com     4146       1436    5582
      gerald.nemec@enron.com       2668       2680    5348
      mark.taylor@enron.com        2351       2951    5302
      susan.mara@enron.com         2596       2008    4604
  43. Example -- Enron
      Effect         Estimate   Std. Error
      Outdegree        -.204      (.02)
      Reciprocity       .576      (.03)
      *Due to the duration of the dataset, we use a decay function to down-weight older events: exp{ −α(T − t) }. Here, log 2 / α is called the “half life”. We use a “half life” of one week (α ≈ .1), with T in days.
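
The decay weighting can be sketched as follows, assuming the weight of an event at time t, seen from time T, is exp(−α(T − t)) with times in days. The symbol α and all function names are assumptions; the sketch only checks that a one-week half life corresponds to a decay rate of roughly .1 per day.

```python
import math


def decay_rate(half_life_days):
    """Decay rate whose half life is the given number of days: log(2) / half life."""
    return math.log(2) / half_life_days


def decay_weight(event_time, now, alpha):
    """Weight of a past event: exp(-alpha * elapsed time)."""
    return math.exp(-alpha * (now - event_time))


def decayed_reciprocation(events, s, r, now, alpha):
    """Reciprocation count in which each earlier r -> s message is down-weighted by age."""
    return sum(decay_weight(t, now, alpha)
               for (s2, r2, t) in events
               if s2 == r and r2 == s and t < now)


alpha = decay_rate(7.0)  # about 0.099 per day
events = [("A", "B", 0.0), ("A", "B", 7.0)]
value = decayed_reciprocation(events, "B", "A", 14.0, alpha)
print(alpha, value)
```

In this example a message that is one week old counts for half a message and one that is two weeks old for a quarter, so the decayed count is 0.75 rather than 2.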
  44. Example -- Enron. We can draw a baseline. So what? Now we consider what would happen if we admitted fixed effects parameters:
     – Candidate Sender Fixed Effect
     – Candidate Receiver Fixed Effect
  45. Example -- Enron Fixed Effects
      Person                     Effect        FE Position   Estimate (SE)
      jeff.dasovich@enron.com    Outdegree     Sender         .175 (.07)
      jeff.dasovich@enron.com    Outdegree     Receiver      -.04  (.03)
      jeff.dasovich@enron.com    Reciprocity   Sender        -.24  (.10)
      jeff.dasovich@enron.com    Reciprocity   Receiver       .19  (.07)
      tana.jones@enron.com       Outdegree     Sender         .14  (.06)
      tana.jones@enron.com       Outdegree     Receiver       .02  (.09)
      …
      *Estimates obtained by conducting a full Newton-Raphson step evaluated at the baseline.