Matthew S. Weber
Hai Nguyen
Rutgers University
WebSci 2015
Oxford, UK
BIG DATA,
BIG ISSUES
Big Data? Big Issues: Degradation in Longitudinal Data and Implications for Social Sciences
Big Data? Big Issues: Degradation in Longitudinal Data and Implications for Social Sciences. Paper presented at Web Science 2015, Oxford, UK.

Published in: Data & Analytics

  1. BIG DATA, BIG ISSUES. Matthew S. Weber and Hai Nguyen, Rutgers University. WebSci 2015, Oxford, UK.
  2. Datasets (columns: Dataset, Research Potential, Dates, Captures, Unique URLs)
      • Hurricane Katrina
         – Research potential: Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination
         – Dates: 2003 – 2012
         – Captures: 1,694,236
         – Unique URLs: 663,740
      • Superstorm Sandy
         – Dates: 2003 – 2012
         – Captures: 41,703,112
         – Unique URLs: 20,013,455
      • US Senate
         – Research potential: Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse
         – Dates: 109th – 112th Congresses
         – Captures: 26,965,770
         – Unique URLs: 8,674,397
      • US House
         – Captures: 51,840,777
         – Unique URLs: 12,410,014
      • Occupy Wall Street
         – Research potential: Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs
         – Dates: 2010 – 2012
         – Captures: 247,928,272
         – Unique URLs: 113,259,655
      • US Media
         – Research potential: Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns
         – Dates: 2008 – 2012
         – Captures: 1,315,132,555
         – Unique URLs: 539,184,823
  5. News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)
  6. To what degree are large-scale datasets reliable?
  13. March 16, 2008
  16. Scale out across multiple datasets:
      – US House – 2005:2013
      – US Senate – 2005:2013
      – Hurricane Katrina – 2003:2012
      – Occupy Wall Street – 2010:2012
  17. Figure: Potential vs. Actual URLs. Count of pages/URLs plotted against time t; series: Potential, Actual, Difference.
  18. Figure: Changes in Crawl Completeness. Count of pages/URLs plotted against time t for OWS, House, Senate, and Katrina; series: existing, potential. Notes on the slide: b = set a unit of time for analysis; c: choosing n periods across a total time T.
  19. In the ideal case, it would be possible to create a factor that corrects for data degradation: $e^{bt}$
      How does this help? Each of the illustrated cases fits an exponential function, with approximate values of b:
      • Senate: 0.13
      • House: 0.13
      • Katrina: 0.02
      • OWS: 0.10
  22. Challenges are not unique to these data. Courtesy of Marc Smith, NodeXL.
  23. Lessons Learned
      • Degradation is a factor in working with available large-scale data
         – In part, degradation is related to the provenance of the data
         – In turn, there is a need to record the origins of datasets (provenance)
      • Patterns of degradation prove problematic for statistical analyses
         – Ex: network analysis with snowball samples vs. whole network
      • Continued work needed to develop research guidelines as more scholars engage with this data
  24. Get in contact with us:
      – matthew.weber@rutgers.edu
      – @mediareinvented
      The Team
      – Kris Carpenter, Vinay Goel, Internet Archive
      – David Lazer, Katherine Ognyanova, Northeastern University
      – Allie Kosterich, Hai Nguyen, Rutgers University
      Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers
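
The exponential factor on slide 19 can be illustrated with a small sketch. The slides do not say how the fits were obtained or which series was fitted, so everything here is an assumption: the function names `fit_exponential_rate` and `correction_factor` are hypothetical, the fitting method (ordinary least squares on the logarithm of the values) is one common choice rather than the authors' procedure, and the input series is synthetic, generated with the b = 0.13 reported for the Senate and House cases.

```python
import math


def fit_exponential_rate(times, values):
    """Fit values ~ a * exp(b * t) by least squares on log(values).

    Returns (a, b). This is an illustrative method, not the one used
    in the slides, which do not describe their fitting procedure.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive to take logarithms")

    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n

    # Slope and intercept of the straight line log(value) = log(a) + b * t.
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("times must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_t)
    return a, b


def correction_factor(b, t):
    """The factor exp(b * t) from slide 19 for a rate b at time t."""
    return math.exp(b * t)


# Synthetic series (an assumption, not data from the study).
times = list(range(10))
synthetic = [1000.0 * math.exp(0.13 * t) for t in times]
a_hat, b_hat = fit_exponential_rate(times, synthetic)
```

On noise-free synthetic input the fit recovers the rate used to generate it; real crawl counts would be noisy, and the estimate would depend on the unit of time chosen for t.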
