Matthew S. Weber
Hai Nguyen
Rutgers University
IEEE Big Data Congress 2015
Millenium Hotel, NY, NY
Wednesday, July 1, 2015...
Internet Archives as a Tool for Research: Decay in Large Scale Archival Records

Internet Archives as a Tool for Research: Decay in Large Scale Archival Records. Paper presented at the IEEE Big Data Congress 2015, New York, NY.

Published in: Data & Analytics

  1. Matthew S. Weber, Hai Nguyen
     Rutgers University
     IEEE Big Data Congress 2015
     Millenium Hotel, NY, NY
     Wednesday, July 1, 2015
     BIG DATA, BIG ISSUES
  2. Datasets

     | Dataset | Research Potential | Dates | Captures | Unique URLs |
     | Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740 |
     | Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455 |
     | US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397 |
     | US House | | | 51,840,777 | 12,410,014 |
     | Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655 |
     | US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823 |
  3. What’s in the data?
     Link Data: Source | Destination | Date | Frequency | Content Type | Bytes | Descriptive Text
     Example record:
     – http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
     – Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
     – http://gawker.com
     – 2012-10-22
  6. News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)
  7. To what degree are large-scale datasets reliable?
  14. March 16, 2008
  17. Scale out across multiple datasets:
      – US House – 2005:2013
      – US Senate – 2005:2013
      – Hurricane Katrina – 2003:2012
      – Occupy Wall Street – 2010:2012
  18. [Figure: Potential vs. Actual URLs. Axes: t, Count of Pages, Count of URLs. Series: Potential, Actual, Difference.]
  19. [Figure: Changes in Crawl Completeness. Axes: t, Count of Pages, Count of URLs. Series: OWS, House, Senate, Katrina; existing, potential.]
      b = set a unit of time for analysis, c choosing n periods across a total time T
  20. In the ideal case, it would be possible to create a factor that corrects for data degradation: $e^{bt}$
      How does this help? Each of the illustrated cases fits against an exponential function, with b approximately:
      • Senate: 0.13
      • House: 0.13
      • Katrina: 0.02
      • OWS: 0.10
  23. Challenges are not unique to these data
      Courtesy of Marc Smith, NodeXL
  24. Lessons Learned
      • Degradation is a factor in working with available large-scale data
        – In part, degradation is related to the provenance of the data
        – In turn, there is a need to record the origins of datasets (provenance)
      • Patterns of degradation prove problematic for statistical analyses
        – Ex: network analysis with snowball samples vs. whole network
      • Continued work needed to develop research guidelines as more scholars engage with this data
  25. Get in contact with us:
      – matthew.weber@rutgers.edu
      – @mediareinvented
      The Team
      – Kris Carpenter, Vinay Goel, Internet Archive
      – David Lazer, Katherine Ognyanova, Northeastern University
      – Allie Kosterich, Hai Nguyen, Rutgers University
      Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers
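
Slide 20 names an exponential correction factor and lists a fitted rate b for each dataset. The Python sketch below shows one way such a factor could be applied and how b might be estimated. It is not the authors' method: it assumes the archive retains a fraction e^(-bt) of the potential URLs after t units of time, and the function names (`correction_factor`, `corrected_count`, `fit_rate`) are hypothetical. Only the four b values come from the slide.

```python
import math

# Decay rates b read off slide 20 (per unit of time t; the unit is not stated there).
SLIDE_RATES = {"Senate": 0.13, "House": 0.13, "Katrina": 0.02, "OWS": 0.10}


def correction_factor(b, t):
    """Return e^(b*t), the correction factor named on the slide."""
    return math.exp(b * t)


def corrected_count(observed, b, t):
    """Scale an observed URL count by e^(b*t).

    ASSUMPTION: the archive retains a fraction e^(-b*t) of the potential
    URLs after t units of time, so multiplying undoes the decay.
    """
    return observed * correction_factor(b, t)


def fit_rate(times, ratios):
    """Estimate b from completeness ratios (actual / potential).

    ASSUMPTION: ratio = e^(-b*t). Taking logs gives -ln(ratio) = b*t,
    which is fitted here by least squares through the origin.
    """
    num = sum(t * -math.log(r) for t, r in zip(times, ratios))
    den = sum(t * t for t in times)
    return num / den


if __name__ == "__main__":
    # Print the factor each dataset would need after 5 time units.
    for name, b in SLIDE_RATES.items():
        print(name, round(correction_factor(b, 5), 3))
```

Under these assumptions, a dataset with a small b (Katrina, 0.02) needs far less correction over the same period than one with a larger b (Senate or House, 0.13).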
