Matthew S. Weber
Hai Nguyen
Rutgers University
WebSci 2015
Oxford, UK
BIG DATA,
BIG ISSUES
Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
News Media on the Web
(Weber, Ognyanova, Kosterich & Nguyen, 2015)
To what degree are large-scale datasets reliable?
March 16, 2008
• Scale out across multiple datasets:
– US House – 2005:2013
– US Senate – 2005:2013
– Hurricane Katrina – 2003:2012
– Occupy Wall Street – 2010:2012
[Figure: Potential vs. Actual URLs. Axis labels: t, Count of Pages, Count of URLs. Series: Potential, Actual, Difference.]
[Figure: Changes in Crawl Completeness. Axis labels: t, Count of Pages, Count of URLs. Series: OWS, House, Senate, Katrina.]
$b = \text{existing} / \text{potential}$
set a unit of time for analysis, c,
choosing n periods across a total time T
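One possible reading of the definition above is that b is the share of potential URLs that actually exist in the archive for each unit of time. The sketch below computes that ratio per period. The helper name `completeness_by_period` and the counts are hypothetical, chosen only for illustration; they are not taken from the slides.

```python
from typing import List, Sequence


def completeness_by_period(existing: Sequence[int],
                           potential: Sequence[int]) -> List[float]:
    """Return existing / potential for each time period (hypothetical helper)."""
    if len(existing) != len(potential):
        raise ValueError("existing and potential must cover the same periods")
    ratios = []
    for found, expected in zip(existing, potential):
        # A period with no potential URLs has no defined completeness; use NaN.
        ratios.append(found / expected if expected > 0 else float("nan"))
    return ratios


# Made-up counts for n = 4 periods; these are NOT values from the slides.
existing_counts = [900, 700, 400, 100]
potential_counts = [1000, 1000, 800, 500]
ratios = completeness_by_period(existing_counts, potential_counts)
print(ratios)
```

A ratio near 1 would indicate a nearly complete crawl for that period, and a falling ratio over successive periods would indicate degradation.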
In the ideal case, it would be possible to create a factor that corrects
for data degradation:
bt
How does this help?
Each of the illustrated cases fits against an
exponential function ~ $e^{bt}$, with fitted b:
• Senate: 0.13
• House: 0.13
• Katrina: 0.02
• OWS: 0.10
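As a rough illustration of fitting an exponential of the form e^(bt), the sketch below estimates b by ordinary least squares on the logarithm of a count series. The slides do not say which quantity was fitted or how, so this is an assumed approach, and `fit_exponential_rate` is a hypothetical name. The input series is synthetic.

```python
import math


def fit_exponential_rate(counts):
    """Estimate b in count ~ A * exp(b * t) for t = 0, 1, 2, ...

    Uses a least-squares line through (t, log(count)); the slope is b.
    All counts must be strictly positive.
    """
    ts = list(range(len(counts)))
    logs = [math.log(c) for c in counts]
    t_mean = sum(ts) / len(ts)
    log_mean = sum(logs) / len(logs)
    num = sum((t - t_mean) * (y - log_mean) for t, y in zip(ts, logs))
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den


# Synthetic series generated with b = 0.13 (an assumption for the demo,
# not the authors' data).
synthetic = [1000.0 * math.exp(0.13 * t) for t in range(10)]
b_hat = fit_exponential_rate(synthetic)
print(round(b_hat, 4))
```

On noise-free synthetic input the estimate recovers the generating rate to within floating-point error. Real crawl counts would be noisy, so the fit quality would need to be checked before using b as a correction factor.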
Challenges are not unique to these data
Courtesy of Marc Smith, NodeXL
Lessons Learned
• Degradation is a factor in working with available large-scale data
– In part, degradation is related to the provenance of the data
– In turn, there is a need to record the origins of datasets (provenance)
• Patterns of degradation prove problematic for statistical analyses
– Ex: network analysis with snowball samples vs. whole network
• Continued work needed to develop research guidelines as more
scholars engage with this data
Get in contact with us:
– matthew.weber@rutgers.edu
– @mediareinvented
The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Rutgers University
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers

Big Data? Big Issues: Degradation in Longitudinal Data and Implications for Social Sciences


Editor's Notes

  1. There are many types of large-scale data… only talking about Internet-based data… focusing on datasets that are re-used. Markus: “social scientists are used to fine-grain, well-controlled data, and that doesn’t exist on the web”
  2. 20th Century Collection = 9TB of metadata; Media Seed List = 4,891. For instance, researchers have proposed focusing archival efforts on capturing data that changes most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
  3. Driscoll and Walker (2014): for instance, a comparison of Twitter data collected via a public API and data collected from a “fire hose” provided by GNIP PowerTrack found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful…
  4. 3-month windows of time…
  5. Also looked at the size of the webpages and at estimating size… wasn’t as reliable.