Using the Web as a Data Source: Challenges for Linked Science
Carsten Keßler, Hunter College, City University of New York
http://carsten.io, @carstenkessler
Data collection
Then and now
http://www.sysomos.com/inside-twitter/twitter-friendship-data
Validity issues
Sampling bias, anyone?
"Scottish independence referendum results" by Brythones. Licensed under CC-BY-SA 4.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Scottish_independence_referendum_results.svg#/media/File:Scottish_independence_referendum_results.svg
Reproducibility issues
The ivory tower, locked up
Linked Science
…to the rescue?
Does LISC work here?
• Semantic annotation, linking, and archiving are problematic:
  • large datasets
  • in constant flux
  • legal restrictions
    • terms of service may change at any time
    • legislation required to permit archiving
Can we fix sampling bias?
• Probably not, but…
  • need to be aware of it
  • if possible, test against other datasets to check the validity of conclusions
  • can we somehow encode this, maybe as provenance?
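One way the provenance idea could look in practice is sketched below. This is only an illustration, not a method proposed in the talk: it builds a JSON-LD-style record that uses real W3C PROV terms (`prov:Entity`, `prov:Activity`, `prov:wasDerivedFrom`, `prov:wasGeneratedBy`), while the `ex:` namespace, the `ex:collectionMethod` and `ex:knownSamplingBias` properties, the function name, and all example values are hypothetical placeholders.

```python
import json

# Hypothetical namespace for illustration only; not a published vocabulary.
EX = "http://example.org/lisc#"


def describe_sample(dataset_id, source, collected_via, known_biases):
    """Return a JSON-LD-style dict recording how a sample was collected.

    The prov: terms come from the W3C PROV vocabulary; the ex: terms are
    assumptions made up for this sketch.
    """
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": EX,
        },
        "@id": dataset_id,
        "@type": "prov:Entity",
        # Where the data originally came from.
        "prov:wasDerivedFrom": {"@id": source},
        # The collection step, described as a PROV activity.
        "prov:wasGeneratedBy": {
            "@type": "prov:Activity",
            "ex:collectionMethod": collected_via,
        },
        # Free-text notes on known bias (hypothetical property).
        "ex:knownSamplingBias": list(known_biases),
    }


# Placeholder values; none of these describe a real dataset.
record = describe_sample(
    "ex:sample-1",
    "http://example.org/some-web-platform",
    "public API",
    [
        "only users with public accounts are included",
        "platform users are not representative of the general population",
    ],
)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

Publishing such a record alongside the results would at least make the known limits of the sample explicit, even when the data itself cannot be archived or shared.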
Should we enforce data publication?
• Tradeoff between:
  • Learning about interesting new research going on at those companies, without any ability to verify or reproduce results
  • Implementing and enforcing strict rules for conference and journal outlets that would force authors to provide the data used in their analysis
Thank you.
Carsten Keßler, Hunter College, City University of New York
http://carsten.io, @carstenkessler
