Data Infrastructure for Coastal and Estuarine Science

Data Infrastructures for Estuarine and Coastal Science
Anne E. Thessen
http://www.slideshare.net/athessen
annethessen@gmail.com

Photo Credit: NASA/GSFC/NOAA/USGS

Outline
• Why are we talking about data infrastructures?
• What are the challenges?
• What are the requirements?
• What parts are already available?
• How do we get there?
• PSA

Data Type             Important   Easy
Atmospheric Data      52.2%       21.6%
Climate Data          56.0%       23.3%
Oceanographic Data    42.5%       18.9%
Geophysical Data      55.5%       22.0%
Geological Data       56.3%       19.8%
Critical Zone Data    19.3%       8.2%
Hydrology Data        48.4%       20.1%
Results from EarthCube Stakeholder Alignment Survey
Why Are We Talking About Data Infrastructure?
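
The gap between the "Important" and "Easy" columns in the survey table can be worked out directly. The following is a minimal Python sketch that uses only the percentages shown in the table; the variable names are illustrative, and reading each column as a share of survey respondents is an assumption based on the column headers.

```python
# Percentages copied from the survey table: (important, easy).
survey = {
    "Atmospheric Data": (52.2, 21.6),
    "Climate Data": (56.0, 23.3),
    "Oceanographic Data": (42.5, 18.9),
    "Geophysical Data": (55.5, 22.0),
    "Geological Data": (56.3, 19.8),
    "Critical Zone Data": (19.3, 8.2),
    "Hydrology Data": (48.4, 20.1),
}

# Gap in percentage points between "Important" and "Easy" for each data type.
gaps = {
    name: round(important - easy, 1)
    for name, (important, easy) in survey.items()
}

# Rank data types from the largest gap to the smallest.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name:<20} {gap:5.1f}")
```

For these numbers the gap runs from 11.1 points (Critical Zone Data) to 36.5 points (Geological Data); for every data type, fewer than half of those who rate it important also rate it easy.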

Working with multiple data sets from many disciplines?
Working with multiple data sets within a discipline?
88.1% say it is important
23.5% say it is easy
70.7% say it is important
9.8% say it is easy
Results from EarthCube Stakeholder Alignment Survey
Why Are We Talking About Data Infrastructure?

Why Are We Talking About Data Infrastructure?
• “Data Deluge”
• Large-scale problems
• Maturation of the internet
• Increased investment (e.g., EarthCube)
• Estuarine and coastal science is interdisciplinary and has a strong sharing culture

Where Do We Start?
User Needs
Available Technology
Existing Infrastructure
Incentives

Sociological
• Data sharing
• Incentives
• Data cultures
• Science practices
Technological
• Massive heterogeneity
• Storage capacity
• Moving data around
• Efficient query
• Processing speed
• Knowledge representation

Stakeholder Assessment
Data producers
Photo Credit: The University of Nottingham Photo Credit: Kay Nietfeld/EPA
Data consumers

What is the current state of sharing?
• Data sharing varies widely by discipline
– No universal rules or agreements
– Sharing in marine science is 40%
– Other disciplines - 10% to 100%

• Data sharing varies widely by discipline
• Far more scientists say they are willing to share data than actually do
– Time to prepare
– Concerns about misuse

• Data sharing varies widely by discipline
• Far more scientists say they are willing to share data than actually do
• Lack of access to data is a major impediment

If sharing is so important, why aren’t more people doing it?
The large proportion of researchers who claim to be willing to share data and the low number of researchers who actually make their data easily available suggest that data sharing would increase substantially if the proper infrastructure were in place.

Reasons for Not Sharing
• Not enough time or funding
• No place to put the data
• No standards or policies for sharing
• Others have no need for the data
• Loss of control
• No way to get credit
• Sensitive data cannot be shared
• Errors will be exposed
• Loss of competitiveness

Social Infrastructure Requirements
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use

We need a system that can
• Share
• Preserve
• Digitize
• Automate
• Integrate
– Data
– Infrastructure

Data Set Heterogeneity
• Data format
• Data file format
• Data quality and completeness
• Physical samples

What Will We Do With the Data?
• Preserve Data
– Format migration
– Redundancy
– Self-Repair
• Serve Data
– Discoverable
– Accessible
– Usable
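
The "Redundancy" and "Self-Repair" items can be illustrated with a small fixity check. This is a minimal Python sketch under stated assumptions, not a description of any particular repository: it assumes several replicas of one file are held at hypothetical sites, treats the copy held by a strict majority as correct, and overwrites any replica whose SHA-256 checksum disagrees. The function and site names are invented for illustration.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the fixity value."""
    return hashlib.sha256(data).hexdigest()


def self_repair(replicas: dict[str, bytes]) -> list[str]:
    """Repair replicas that disagree with the majority copy.

    `replicas` maps a (hypothetical) site name to the bytes stored there.
    Returns the names of the replicas that were overwritten.
    """
    digests = {name: checksum(blob) for name, blob in replicas.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]

    # Without a strict majority there is no trustworthy copy to repair from.
    if votes <= len(replicas) // 2:
        raise ValueError("no majority copy; manual intervention needed")

    good_copy = next(
        blob for name, blob in replicas.items()
        if digests[name] == majority_digest
    )
    damaged = [name for name, d in digests.items() if d != majority_digest]
    for name in damaged:
        replicas[name] = good_copy  # restore from the majority copy
    return damaged
```

Called on three replicas where one has been altered, `self_repair` returns the name of the altered replica and leaves all three identical. A production system would also log the event and verify the restored copy.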

Technical Infrastructure Requirements
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical

Review Requirements
Sociological
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
Technological
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical

What is Available?
Repositories

What is Available?
Citation
Repositories

What is Available?
Preservation
Repositories
Citation

What is Available?
Quality Control and Usage Metrics
Repositories
Citation
Preservation
Crowd Sourcing
Web 2.0

What is Available?
Integration
Repositories
Citation
Preservation
Quality and Metrics
Web 3.0

What is Available?
Mobilization
Repositories
Citation
Preservation
Quality and Metrics
Integration

What is Available?
Access Protocols
Web Services
Data Brokers
Repositories
Citation
Preservation
Quality and Metrics
Integration
Mobilization

What is Available?
Standards
Repositories
Citation
Preservation
Quality and Metrics
Integration
Mobilization
Access

How Can it all Fit Together?
Quality and Metrics
Access
Citation
Preservation
Mobilization
Integration
Repositories
Standards

Who Should Be Doing All This Work?
• Librarians
• Data Scientists
• Informaticians
• Ontologists
• Computer Scientists
• Software Developers
• Standards Groups
Image by Michael Krigsman

Why Share Data?
• Increased recognition
• Increased economic opportunities
• Improved data set
• Improved science
• Time and money saved

Photo Credit: Emergency Cleaning Solutions

Photo Credit: The Collared Sheep

Acknowledgements
• Benjamin Fertig
• David Patterson
• Mike Kemp
• John Milliman
• Melissa Cragin
• Sayeed Choudhury
• Tim DiLauro
• Carol Palmer
• Nathan Wilson
• Alan Renear
• Ruth Duerr
• Cyndy Chandler
• Peter Fox
• Krishna Sinha
• Janet Fredericks
• Carl Lagoze
