Matthew S. Weber
Rutgers University
@docmattweber
Presented At
130th Annual Meeting of
Big Data, Big Theory &
The Thread o...
Research support from:
NSF Award #1244727; Additional support from the NetSCI Lab @ Rutgers
From Big Data to Big Theory: Lessons Learned from Archival Internet Research.

From Big Data to Big Theory: Lessons Learned from Archival Internet Research. Paper presented at the Annual Meeting of the American Historical Association, Atlanta, GA.

Published in: Data & Analytics

  1. Matthew S. Weber, Rutgers University, @docmattweber. Presented at the 130th Annual Meeting of the American Historical Association: Big Data, Big Theory & The Thread of Recent History
  2. Credit: Flickr @ilovecology
  3. Pekin Daily Times, Pekin, IL, October 8, 2013. Credit: Pekin Daily Times
  4. [no transcribed text]
  5. What’s in the data?
     Fields: Source | Destination | Date | Frequency | Content Type | Bytes | Content
     Link data example: http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football; "Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag"; http://gawker.com; 2012-10-22
  6. [no transcribed text]
  7. News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)
  8. NJ Local News: 2007 - 2012
  9. [Chart: NJ.com Domain Analysis, 2007 to 2012. Series: Number of Pages, Avg MB. Axes: Avg. Number of Webpages; Avg. MB per Webpage.]
  10. Dataset | Research Potential | Dates | Captures | Unique URLs
      Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
      Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
      US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
      US House | | | 51,840,777 | 12,410,014
      Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
      US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
  11. To what degree are large-scale datasets reliable?
  12. [no transcribed text]
  13. [no transcribed text]
  14. [no transcribed text]
  15. [Chart: Potential vs. Actual URLs. Series: Potential, Actual, Difference. Axis labels: t; Count of Pages; Count of URLs.]
  16. [Chart: Changes in Crawl Completeness. Series: OWS, House, Senate, Katrina; line types: existing, potential. Axis labels: t; Count of Pages; Count of URLs.]
      b = set a unit of time for analysis, c choosing n periods across a total time T
  17. In the ideal case, it would be possible to create a factor that corrects for data degradation: $e^{bt}$
      How does this help? Each of the illustrated cases fits an exponential function, with approximate values of b:
      • Senate: 0.13
      • House: 0.13
      • Katrina: 0.02
      • OWS: 0.10
  18. Challenges are not unique to these data. Courtesy of Marc Smith, NodeXL
  19. [no transcribed text]
  20. Research support from: NSF Award #1244727; additional support from the NetSCI Lab @ Rutgers
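
The correction factor on slide 17 can be illustrated with a small sketch. The slides give only the factor e^(bt) and approximate values of b, not the fitting procedure, so everything below is an assumption made for illustration: the helper names `correction_factor` and `fit_exponential_rate` are hypothetical, the fit is an ordinary least-squares line through the logarithm of the series, and the input series is synthetic rather than taken from the archives.

```python
import math

# Approximate rates b as listed on slide 17 for each collection.
SLIDE_RATES = {"Senate": 0.13, "House": 0.13, "Katrina": 0.02, "OWS": 0.10}


def correction_factor(b, t):
    """Return e^(b*t), the multiplicative factor named on the slide."""
    return math.exp(b * t)


def fit_exponential_rate(times, values):
    """Estimate (a, b) in values ~ a * e^(b*t) by least squares on log(values).

    Assumption for this sketch: every value is strictly positive.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (t, value) pairs of equal length")
    logs = [math.log(v) for v in values]
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("times must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_log) for t, y in zip(times, logs))
    b = sxy / sxx
    a = math.exp(mean_log - b * mean_t)
    return a, b


# Synthetic illustration only: a made-up series generated with b = 0.13.
times = list(range(10))
synthetic = [1000.0 * math.exp(0.13 * t) for t in times]
a_hat, b_hat = fit_exponential_rate(times, synthetic)
```

On real crawl counts the series would be noisy, so the recovered b would only approximate the underlying rate; how the factor should then be applied to a given dataset is a modelling decision the slides leave open.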

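The field list on slide 5 (Source | Destination | Date | Frequency | Content Type | Bytes | Content) suggests a simple record structure for link data. The sketch below shows one way such a record might be represented and parsed. The slides do not describe the actual storage format, so the pipe-delimited line layout, the `LinkRecord` class, the `parse_link_record` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LinkRecord:
    """One hyperlink observation, using the seven fields named on slide 5."""
    source: str
    destination: str
    link_date: date
    frequency: int
    content_type: str
    size_bytes: int
    content: str


def parse_link_record(line):
    """Parse one pipe-delimited line into a LinkRecord.

    Splits into at most seven fields so that any '|' inside the trailing
    Content field is preserved.
    """
    parts = [p.strip() for p in line.split("|", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    source, destination, day, freq, ctype, size, content = parts
    return LinkRecord(
        source=source,
        destination=destination,
        link_date=date.fromisoformat(day),
        frequency=int(freq),
        content_type=ctype,
        size_bytes=int(size),
        content=content,
    )


# Placeholder record: every value here is invented for illustration.
sample = (
    "http://example.org/story | http://example.org | 2012-10-22 | 3 | "
    "text/html | 2048 | Example headline"
)
record = parse_link_record(sample)
```

Typed fields make it straightforward to aggregate by date or by source and destination pair, which is the kind of grouping hyperlink network analysis tends to need.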