Data Infrastructures for Estuarine
and Coastal Science
Anne E. Thessen
http://www.slideshare.net/athessen
annethessen@gmail.com
Photo Credit: NASA/ GSFC/ NOAA/ USGS
Outline
• Why are we talking about data infrastructures?
• What are the challenges?
• What are the requirements?
• What parts are already available?
• How do we get there?
• PSA
Data Type             Important   Easy
Atmospheric Data      52.2%       21.6%
Climate Data          56.0%       23.3%
Oceanographic Data    42.5%       18.9%
Geophysical Data      55.5%       22.0%
Geological Data       56.3%       19.8%
Critical Zone Data    19.3%       8.2%
Hydrology Data        48.4%       20.1%
Results from EarthCube Stakeholder Alignment Survey
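One way to read the survey table is to look at the distance between the "Important" and "Easy" columns for each data type. The sketch below computes that difference in percentage points. The percentages are copied from the table; the `access_gap` function and the idea of ranking by the gap are illustrative assumptions, not part of the survey itself.

```python
# Survey percentages copied from the table: (important, easy).
survey = {
    "Atmospheric Data": (52.2, 21.6),
    "Climate Data": (56.0, 23.3),
    "Oceanographic Data": (42.5, 18.9),
    "Geophysical Data": (55.5, 22.0),
    "Geological Data": (56.3, 19.8),
    "Critical Zone Data": (19.3, 8.2),
    "Hydrology Data": (48.4, 20.1),
}


def access_gap(important, easy):
    """Percentage-point gap between 'important' and 'easy'.

    Hypothetical helper for illustration; the survey does not define this metric.
    """
    return round(important - easy, 1)


gaps = {name: access_gap(imp, easy) for name, (imp, easy) in survey.items()}

# Print data types from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {gap:5.1f}")
```

For every data type in the table, the share of respondents who call it important is more than double the share who call it easy, which is the motivation for the rest of the talk.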
Why Are We Talking About Data
Infrastructure?
Working with multiple data sets from many disciplines?
Working with multiple data sets within a discipline?
88.1% say it is important
23.5% say it is easy
70.7% say it is important
9.8% say it is easy
Results from EarthCube Stakeholder Alignment Survey
Why Are We Talking About Data
Infrastructure?
• “Data Deluge”
• Large-scale problems
• Maturation of the internet
• Increased investment (e.g., EarthCube)
• Estuarine and coastal science has an interdisciplinary nature and a strong sharing culture
Where Do We Start?
• User Needs
• Available Technology
• Existing Infrastructure
• Incentives
Sociological
• Data sharing
• Incentives
• Data cultures
• Science practices
Technological
• Massive heterogeneity
• Storage capacity
• Moving data around
• Efficient query
• Processing speed
• Knowledge representation
Stakeholder Assessment
Data producers
Data consumers
Photo Credit: The University of Nottingham
Photo Credit: Kay Nietfeld/EPA
What is the current state of sharing?
• Data sharing varies widely by discipline
– No universal rules or agreements
– Sharing rate in marine science: 40%
– Other disciplines: 10% to 100%
• Far more scientists say they are willing to
share data than actually do
– Time to prepare
– Concerns about misuse
• Lack of access to data is a major impediment
If sharing is so important, why aren’t more people doing it?
The large proportion of researchers who claim to be willing to share data and the low numbers of researchers who actually make their data easily available suggest that data sharing would increase substantially if the proper infrastructure were in place.
Reasons for Not Sharing
• Not enough time or funding
• No place to put the data
• No standards or policies for sharing
• Others have no need for the data
• Loss of control
• No way to get credit
• Sensitive data cannot be shared
• Errors will be exposed
• Loss of competitiveness
Social Infrastructure Requirements
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
We need a system that can
• Share
• Preserve
• Digitize
• Automate
• Integrate
– Data
– Infrastructure
Data Set Size
Data Set Heterogeneity
• Data format
• Data file format
• Data quality and completeness
• Physical samples
What Will We Do With the Data?
• Preserve Data
– Format migration
– Redundancy
– Self-Repair
• Serve Data
– Discoverable
– Accessible
– Usable
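The redundancy and self-repair items above can be sketched in a few lines. This is a minimal illustration, assuming replicas are plain directories and that a SHA-256 checksum recorded at deposit time is the reference for integrity; the function names (`deposit`, `audit_and_repair`) and the file name are hypothetical and do not describe any particular repository system.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def deposit(source, replicas):
    """Redundancy: copy the file into every replica directory.

    Returns the checksum recorded at deposit time.
    """
    digest = sha256(source)
    for replica in replicas:
        replica.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, replica / source.name)
    return digest


def audit_and_repair(name, digest, replicas):
    """Self-repair: restore missing or corrupted copies from a verified one.

    Returns the list of paths that had to be repaired.
    """
    copies = [replica / name for replica in replicas]
    good = [c for c in copies if c.exists() and sha256(c) == digest]
    if not good:
        raise RuntimeError(f"no intact copy of {name} remains")
    repaired = []
    for copy in copies:
        if copy not in good:
            shutil.copy2(good[0], copy)
            repaired.append(copy)
    return repaired


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "salinity.csv"  # hypothetical data file
        source.write_text("station,salinity\nA,12.4\n")
        replicas = [root / "site_a", root / "site_b", root / "site_c"]
        digest = deposit(source, replicas)
        (replicas[1] / source.name).write_text("corrupted")  # simulate damage
        repaired = audit_and_repair(source.name, digest, replicas)
        print([path.parent.name for path in repaired])
```

Format migration is not covered here; it would need format-specific converters on top of the same checksum bookkeeping.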
Technical Infrastructure Requirements
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical
Review Requirements
Sociological
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
Technological
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical
What is Available?
• Repositories
• Citation
• Preservation
• Quality Control and Usage Metrics
– Crowd Sourcing
– Web 2.0
• Integration
– Web 3.0
• Mobilization
• Access
– Access Protocols
– Web Services
– Data Brokers
• Standards
How Can it all Fit Together?
Quality and Metrics
Access
Citation
Preservation
Mobilization
Integration
Repositories
Standards
Who Should Be Doing All This Work?
• Librarians
• Data Scientists
• Informaticians
• Ontologists
• Computer Scientists
• Software Developers
• Standards Groups
Image by Michael Krigsman
PSA
Why Share Data?
• Increased recognition
• Increased economic opportunities
• Improved data set
• Improved science
• Time and money saved
Photo Credit: Emergency Cleaning Solutions
Photo Credit: The Collared Sheep
Acknowledgements
• Benjamin Fertig
• David Patterson
• Mike Kemp
• John Milliman
• Melissa Cragin
• Sayeed Choudhury
• Tim DiLauro
• Carol Palmer
• Nathan Wilson
• Alan Renear
• Ruth Duerr
• Cyndy Chandler
• Peter Fox
• Krishna Sinha
• Janet Fredericks
• Carl Lagoze
Questions?
References
Atkins DE, Droegemeier KK, Feldman SI, Garcia-Molina H, Klein ML, Messerschmitt DG, Messina P, Ostriker JP, Wright MH.
2003. Revolutionizing science and engineering through cyberinfrastructure.
Borgman CL. 2010. Research data: who will share what, with whom, when, and why? Fifth China-North America Library
Conference 2010
Borgman CL. 2012. The conundrum of sharing research data. Journal of the American Society for Information Science and
Technology 63(6):1059-1078
Burton A, Treloar A. 2009. Designing for discovery and re-use: the ANDS data-sharing verbs approach to service decomposition.
The International Journal of Digital Curation 4.
Costello M. 2009. Motivating online publication of data. BioScience 59:418-426
Cragin MH, Palmer CL, Carlson JR, Witt M. 2010. Data sharing, small science and institutional repositories. Philosophical
Transactions of the Royal Society A 368:4023-4038
Edwards PN, Mayernik MS, Batcheller AL, Bowker GC, Borgman CL. 2011. Science friction: data, metadata and collaboration.
Social Studies of Science 41(5):667-690
Enke N, Thessen AE, Bach K, Bendix J, Seeger B, Gemeinholzer B. 2012. The User’s View on Biodiversity Data Sharing.
Ecological Informatics 11: 25-33
Field D Sansone SA, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S,
Mugabushaka AM, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J. 2009.
‘Omics data-sharing. Science 326:234-236
Froese R, Lloris D, Opitz S. 2003. Scientific data in the public domain. ACP-EU Fisheries Research Report 14:267-271.
Gleditsch NP, Strand H. 2003. Posting your data: will you be scooped or will you be famous? International Study Perspectives
4:89-97
Heidorn PB. 2008. Shedding light on the dark data in the long tail of science. Library Trends 57:280-299.
Henty M, Weaver B, Bradbury SJ, Simon P. 2008. Investigating data management practices in Australian Universities. APSR. QUT
digital repository http://eprints.qut.edu.au/14549
Hey T, Tansley S, Tolle K. 2009. The Fourth Paradigm. Microsoft Research. Redmond, WA, USA, 252 pp.
References
Key Perspectives Ltd. 2010. Data Dimensions: disciplinary differences in research data-sharing, reuse and long term viability.
DCC Scarp Synthesis Report. ISSN 1759-586X
Laogze C, Patzke K. 2011. A research agenda for data curation cyberinfrastructure. JCDL’11
Mayernik MS, DiLauro T, Duerr R, Metsger E, Thessen AE Choudhury GS. 2013. Data Conservancy provenance, context and
lineage services: key components for data preservation and curation. Data Science Journal 12:158-171
Palmer CL, Cragin MH, Heidorn PB, Smith LC. 2007. Data curation for the long tail of science: the case of environmental studies.
Digital Curation
Palmer CL, Weber NM, Cragin MH. 2011. The analytic potential of scientific data: understanding re-use value. ASIST 2011
Piwowar HA, Day RS, Fridsma DB. 2007. Sharing detailed research data is associated with increased citation rate. PLoS ONE
3:e308
Savage CJ, Vickers AJ. 2009. Empirical study of data-sharing by authors publishing in PLoS journals. PLoS ONE 4: e7078
Sinha AK, Thessen AE, Barnes CG. 2013. Geoinformatics: towards an integrative view of Earth as a system, in Bickford, M.E.,
ed., The Web of Geological Sciences: Advances, Impacts, and Interactions: Geological Society of America Special Paper 500,
p. 1-14. 10.1130/2013.2500(19)
Smith VS. 2009. Data publication: towards a database of everything. BMC Research Notes 2:113
Tenopir C, Allard S, Douglass KL, Aydinoglu AU, Wu L, Read E, Manoff M, Frame M. 2011. Data sharing by scientists: practices
and perceptions. PLoS ONE 6.6
Thessen AE, Patterson DJ. 2011. Data issues in the life sciences. ZooKeys 150:15-51
Wallis JC, Mayernik MS, Borgman CL, Pepe A. 2010. Digital libraries for scientific data discovery and reuse: from vision to
practical reality. Joint Conference on Digital Libraries 2010
Weber NM, Baker KS, Thomer AK, Chao TC, Palmer CL. 2012. Value and context in data use: domain analysis revisited.
Proceedings of the American Society for Information Science and Technology. 49(1):1-10
Whitlock MC. 2011. Data archiving in ecology and evolution: best practices. TREE 26(2):61-65

Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 

Recently uploaded (20)

Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 

Data Infrastructure for Coastal and Estuarine Science

  • 1. Data Infrastructures for Estuarine and Coastal Science Anne E. Thessen http://www.slideshare.net/athessen annethessen@gmail.com
  • 2. Photo Credit: NASA/ GSFC/ NOAA/ USGS
  • 3. Outline • Why are we talking about data infrastructures? • What are the challenges? • What are the requirements? • What parts are already available? • How do we get there? • PSA
  • 4. Why Are We Talking About Data Infrastructure? Results from EarthCube Stakeholder Alignment Survey:
        Data Type             Important   Easy
        Atmospheric Data      52.2%       21.6%
        Climate Data          56.0%       23.3%
        Oceanographic Data    42.5%       18.9%
        Geophysical Data      55.5%       22.0%
        Geological Data       56.3%       19.8%
        Critical Zone Data    19.3%        8.2%
        Hydrology Data        48.4%       20.1%
  • 5. Why Are We Talking About Data Infrastructure? Working with multiple data sets within a discipline? 88.1% say it is important; 23.5% say it is easy. Working with multiple data sets from many disciplines? 70.7% say it is important; 9.8% say it is easy. Results from EarthCube Stakeholder Alignment Survey
  • 6. Why Are We Talking About Data Infrastructure? • “Data Deluge” • Large-scale problems • Maturation of the internet • Increased investment (e.g. EarthCube) • Estuarine and coastal science has interdisciplinary nature and strong sharing culture
  • 7. User Needs Where Do We Start? Available Technology Existing Infrastructure Incentives
  • 8. Sociological Technological • Data sharing • Incentives • Data cultures • Science practices • Massive heterogeneity • Storage capacity • Moving data around • Efficient query • Processing speed • Knowledge representation
  • 9. Stakeholder Assessment Data producers Photo Credit: The University of Nottingham Photo Credit: Kay Nietfeld/EPA Data consumers
  • 10. What is the current state of sharing? • Data sharing varies widely by discipline – No universal rules or agreements – Sharing in marine science is 40% – Other disciplines - 10% to 100%
  • 11. What is the current state of sharing? • Data sharing varies widely and by discipline • Far more scientists say they are willing to share data than actually do – Time to prepare – Concerns about misuse
  • 12. What is the current state of sharing? • Data sharing varies widely and by discipline • Far more scientists say they are willing to share data than actually do • Lack of access to data is a major impediment
  • 13. If sharing is so important why aren’t more people doing it? The large proportion of researchers who claim to be willing to share data and the low numbers of researchers who actually make their data easily available suggests that data sharing would increase substantially if the proper infrastructure were in place.
  • 14. Reasons for Not Sharing • Not enough time or funding • No place to put the data • No standards or policies for sharing • Others have no need for the data • Loss of control • No way to get credit • Sensitive data cannot be shared • Errors will be exposed • Loss of competitiveness
  • 15. Social Infrastructure Requirements • Repository capability • Place conditions on access • Mechanisms for data citation and credit • Data sharing policy • Value added services • Requirements from publishers and funders • Respect for confidentiality • Ease of use
  • 16. We need a system that can • Share • Preserve • Digitize • Automate • Integrate – Data – Infrastructure
  • 18. Data Set Heterogeneity • Data format • Data file format • Data quality and completeness • Physical samples
  • 19. What Will We Do With the Data? • Preserve Data – Format migration – Redundancy – Self-Repair • Serve Data – Discoverable – Accessible – Usable
  • 20. Technical Infrastructure Requirements • Preservation • Layered service architecture • Repository functions • Accommodate heterogeneity • Bridge digital and physical
  • 21. Review Requirements Sociological • Repository capability • Place conditions on access • Mechanisms for data citation and credit • Data sharing policy • Value added services • Requirements from publishers and funders • Respect for confidentiality • Ease of use Technological • Preservation • Layered service architecture • Repository functions • Accommodate heterogeneity • Bridge digital and physical
  • 25. What is Available? Quality Control and Usage Metrics Repositories Citation Preservation Crowd Sourcing Web 2.0
  • 28. What is Available? Access Protocols Web Services Data Brokers Repositories Citation Preservation Quality and Metrics Integration Mobilization
  • 29. What is Available? Standards Repositories Citation Preservation Quality and Metrics Integration Mobilization Access
  • 30. How Can it all Fit Together? Quality and Metrics Access Citation Preservation Mobilization Integration Repositories Standards
  • 31. Who Should Be Doing All This Work? • Librarians • Data Scientists • Informaticians • Ontologists • Computer Scientists • Software Developers • Standards Groups Image by Michael Krigsman
  • 32. PSA
  • 33. Why Share Data? • Increased recognition • Increased economic opportunities • Improved data set • Improved science • Time and money saved
  • 34. Photo Credit: Emergency Cleaning Solutions
  • 35. Photo Credit: The Collared Sheep
  • 36.
  • 37. Acknowledgements • Benjamin Fertig • David Patterson • Mike Kemp • John Milliman • Melissa Cragin • Sayeed Choudhury • Tim DiLauro • Carol Palmer • Nathan Wilson • Alan Renear • Ruth Duerr • Cyndy Chandler • Peter Fox • Krishna Sinha • Janet Fredericks • Carl Lagoze

Editor's Notes

  1. Good morning! My name is Anne Thessen and I’m going to speak to you today about data infrastructures for estuarine and coastal science. First I’d like to thank the conference organizers for inviting me to speak. The slides for this talk will be posted to slideshare later today.
  2. Data are the raw material for discovery and innovation. We increase understanding of the earth as a system through the effective management of data. The nature of current problems, such as climate change and fisheries management demands a holistic approach using data collected over large space/time scales. How do we build a system that can support that type of inquiry and help scientists answer questions they haven’t even asked yet? That system is what I am going to talk about today.
  3. First I’m going to do a little convincing about why we need such a system. Then I’m going to talk about why we are thinking about data infrastructures now. Then I will discuss the challenges and talk about what a data infrastructure might look like. What exists now that we can use and build on? And then finally, how do we get there. Then I will end with a brief public service announcement.
  4. First, the convincing. This table came from a survey of scientists asking how important it was for them to access and use data sets from different disciplines and then how easy it was to do so. The survey was the EarthCube Stakeholder Alignment Survey and I will talk more about EarthCube later. The important thing to note is that around half of all respondents said these different data sets were important to access and use, but only about 20% of respondents said it was easy.
  5. Here is more data from the stakeholder alignment survey. Respondents were asked how important it was to access and use multiple data sets from within their discipline. 88% said it was important, but only 23.5% said it was easy. Then they were asked how important it was to access and use multiple data sets from many disciplines. 70% said it was important but less than 10% said it was easy. So there is a clear gap between what scientists want to do with data and what they can do with data. This isn’t to say that data sets aren’t being shared and reused and integrated, because they are. Right now the process is very manual and therefore limited in scope.
  6. Why are we talking about data infrastructures now? I argue that we are at a convergence of factors that make now the perfect time to move forward with building data infrastructure. I’m sure you’ve all heard about big data and the data deluge. Many scientists are being buried under the data they generate and are actively looking for ways to cope. As I said before, scientists are being asked to address large-scale problems that require large-scale data. We have a maturation of the internet that helps with finding and accessing data. Probably most importantly, we have a recognition from funding agencies that an infrastructure of this sort is needed and they are willing to devote resources to it. Finally, estuarine and coastal science specifically stands to greatly benefit from such an infrastructure because of its interdisciplinary nature and its already strong data sharing culture.
  7. Hopefully, I’ve convinced you that a data infrastructure for estuarine and coastal science is beneficial. The first question, of course, is where do we start? How do we start? There are many factors to take into consideration to build an infrastructure that people will actually use. We need to consider user needs, incentives to participate, what available technology we can use and, finally, what pieces of existing infrastructure we can use. All of these factors will make up the required functions, or requirements, of the data infrastructure.
  8. A data sharing infrastructure will have to accommodate sociological needs and technological needs. The needs will help define the requirements. There are many stakeholders with different needs. I will focus on the research scientist as the stakeholder in this talk.
  9. Research scientists have needs as a data producer and needs as a data consumer. Some of these needs will be synergistic. Many scientists are more likely to participate as a data producer if their needs as a data consumer are met. This is important because a data infrastructure will only be useful if scientists are willing to place their data within the infrastructure.
  10. That brings us to the issue of data sharing, which is largely a sociological issue. Let’s review the current state of sharing data. …
  11. On the other side, scientists want others to share and say that lack of access to data generated by others is a major impediment to their research and science as a whole.
  12. When we ask researchers why they don’t share their data they give reasons that fall under one or more of these categories:
  13. We can take these reasons for not sharing and use them to develop requirements for a data infrastructure. If we address the scientists’ concerns, we can have a reasonable expectation that they will participate. So, for example, to address the lack of a place to put data, the infrastructure should have repository capabilities. To address the fear of loss of control or competitiveness, the infrastructure should allow providers to place conditions on accessing their data.
  14. Now we can discuss the technological requirements. At a very basic level, we need a system that can... The first question is what kind of data are we talking about here? What will we have to accommodate?
  15. So let’s take a look at one aspect of the data landscape and that is data set size. You may have heard of the long tail of small science. That’s what this curve represents. On the left we have the small numbers of large data sets and on the right we have the many small data sets. The vast majority of research output in the US falls in this long tail. The data sets on opposite ends of this curve differ in more ways than just size. The data sets here tend to be more standardized. They are often born digital. The data sets here are very heterogeneous, may never be digitized and are difficult to find.
  16. Data sets differ in many other ways as well. Even the term “data set” is poorly defined. Data sets differ in format, file format, quality and completeness. By completeness I’m referring to metadata. Some data sets have physical samples. That adds another dimension of complexity to a data infrastructure. Here I show some examples of types of data sets that an infrastructure would have to deal with. We have handwritten, tabular data. We have an old format. We have a sediment core.
  17. Once we have a good idea of what we are dealing with, then we have to decide what we will do with the data. There are two important tasks for a data infrastructure and they are related. They are preserving data and serving data. One without the other does not work very well. There is no point in preserving data if no one can ever get at them again, and you can’t use data that no longer exist because they weren’t preserved. Just like it does no good to make all these yummy canned foods if no one ever gets to eat them. Digital data preservation is very difficult, partially because no one really knows how to do it or what it really means. Many people think they are preserving data when they actually aren’t. Several groups are trying to figure this out and some preservation actions have been identified, such as format migration, redundancy and enabling self-repair. Serving data to a user is about ensuring that data are discoverable, accessible and usable. The key here is appropriate metadata, good search and browse, and ease of use.
  18. We can take the list of things an infrastructure has to cope with and use it to develop more requirements. A data infrastructure needs to preserve data. Because not all data sets will require the same level of service, the system needs a layered service architecture. The infrastructure needs to have repository functions and be able to cope with many types of heterogeneity. The infrastructure needs to be able to bridge the digital and the physical.
  19. To recap, this is a review of all the sociological and technological requirements for a data infrastructure. Note that some of the requirements overlap each other. Some requirements, like supportive policies from publishers and funders will not come directly from the infrastructure. So the question now becomes, how do we get to where we want to go? We’ve outlined our requirements. Let’s take stock of what we already have.
  20. There are already many repositories for scientific data. I don’t have the time to discuss each one. The logos I’ve included are not meant to be comprehensive. They are just examples. Repositories differ by the type of data they take and the level of service they provide. Some data types are not served by a repository.
  21. I’m going to take a bit more time speaking about data citation because this is still relatively new. There is a movement to bring data sets to the level of publications. There are still a lot of kinks to iron out but NSF is now accepting relevant publications and products on biosketches. Repositories like figshare are assigning unique, citable identifiers to data sets in their system and projects like ImpactStory are synthesizing scientific output such as data sets and software.
  22. I’ve already talked about the difficulty of digital data preservation. Most of the work in this area is coming from the library community.
  23. Certain aspects of quality control can be automated. Many repositories, like GBIF, have algorithms for quality checking data, like latitude and longitude. Usage metrics and other types of quality can be measured using Web 2.0 and crowd sourcing solutions, such as assigning data sets stars or keeping track of how many times a data set is downloaded.
  24. Innovative ways of automated data integration are developing through advancements in Web 3.0 or semantic web technologies that focus on knowledge representation. These are the types of technologies used by IBM’s Watson.
  25. Data that are not or have not traditionally been digital can be made digital through programs like labfolder, which acts as a digital lab notebook, and Kepler, which digitally documents workflows.
  26. To avoid silos, numerous access protocols and web services have been developed for efficient access to data sets. There are even data brokers that will move data around for you.
  27. While many groups say that they have no standards for their data, there are several standards bodies actively developing and maintaining different types of data and metadata standards.
  28. Once you’ve taken stock of what already exists, then it is time to find the gaps and to figure out how it will all fit together. The specific answer is currently being hashed out. The NSF EarthCube program is doing exactly that. If you’ve never heard of EarthCube, I’ll briefly say that it is an NSF project to build data infrastructure for geoscience. You can find more at earthcube.org.
  29. You are all probably thinking, “Who will do all of this work? I certainly don’t have the time!” That’s okay because you, the research scientist, will not have to do this work. There are (read list) all very capable of bringing this system to fruition. But what you will have to do is collaborate. These people need input from practitioners to make sure that what they build will meet a need. Now that funding agencies are supporting the development of these infrastructures, there are opportunities to fund these collaborations. EarthCube has a great mechanism called memberconnect that helps folks from different disciplines who don’t normally talk to each other, find each other.
  30. Now for the last bit, the public service announcement. Much of this vision for a data infrastructure still relies on scientists sharing their data. Unfortunately, the upfront work of preparing data is a major impediment to sharing. To many, the benefits are not obvious, so I’m going to spend a little time talking about the benefits.
  31. Scientists who share their data get increased recognition in the form of more citations and professional reputation. Some see increased economic opportunities for things like selling photographs. The data set itself can be improved by having errors corrected, metadata enhanced or other bits added. We can have improved science all around by sharing data. There were some very high profile cases of research fraud that probably would not have been allowed to happen if data sharing were more common. And last but not least, when data are shared, time and money are saved.
  32. If that’s not enough to convince you, I’m going to talk a little about data hoarding. Data hoarding was a term that I came up with on the spot and the more I thought about it the more I liked it because hoarding data can be very similar to hoarding things. This is a picture of a hoarded home. This person spent a lot of time gathering things and these things may have been valuable at one time, but they were not taken care of properly and now no one can use them. This person doesn’t know what he/she has any more.
  33. Here are two pictures of offices that look similar to the hoarded home. This picture is from my office. It shows approximately a cubic meter of paper containing data. Some of it is published and some of it is not. It’s been a few years since I’ve looked at it. I’ve probably forgotten a few things about it. This is another picture of an office with a pile of paper that appears to not be organized in any way. How useful is this data? I worked very hard to fill these boxes. I show these pictures because the fate of the contents of a hoarded home and hoarded data are often the same.
  34. If I get hit by a bus, neither my husband nor my son will bother to try to figure out what is on all that paper. When I retire, my Dean, or whoever, isn’t going to sort through anything. My life’s work will go in the dumpster – just like the objects in the hoarded home. And just like our other possessions, sharing data with others helps to keep it relevant and useful. It’s when I put my data on a computer disk and throw it in a desk drawer that it begins its inevitable decay. At the end of the day, throwing away data is throwing away money.
  35. With that I will take a few questions. I do want to mention that I started a data management company last fall. I have a couple of clients and I would love to help you with your data problems. So feel free to talk to me about it or email.
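
The notes on preservation mention redundancy and self-repair as preservation actions. As a rough illustration only, the sketch below shows one way those two ideas could fit together: several copies of a data file are kept, each copy is compared against a recorded checksum, and any copy that no longer matches is overwritten from an intact one. The function names, the replica names and the sample data are all hypothetical; this is not a description of how any particular repository works.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest used here as the fixity value for a file."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas: dict, expected: str) -> list:
    """Overwrite corrupted copies with an intact copy.

    `replicas` maps a (hypothetical) storage location name to the bytes
    held there; `expected` is the checksum recorded at deposit time.
    Returns the names of the replicas that had to be repaired.
    """
    # Redundancy: any copy that still matches the recorded checksum will do.
    good = next((d for d in replicas.values() if checksum(d) == expected), None)
    if good is None:
        raise ValueError("no intact replica remains; the data cannot be repaired")

    repaired = []
    for name, data in list(replicas.items()):
        if checksum(data) != expected:
            replicas[name] = good  # self-repair from the intact copy
            repaired.append(name)
    return repaired


# Illustrative data: three copies, one of which has silently changed.
original = b"station,salinity\nA,12.5\n"
recorded = checksum(original)
copies = {
    "site_a": original,
    "site_b": b"station,salinity\nA,12.6\n",
    "site_c": original,
}
print(repair_replicas(copies, recorded))
```

The sketch only works while at least one copy is still intact, which is the practical argument for keeping more than one.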
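
The notes on quality control mention that checks on fields like latitude and longitude can be automated. A minimal sketch of such a check follows, assuming coordinates in decimal degrees. The `Record` type, the field names and the specific rules are assumptions made for illustration and are not taken from GBIF or any other repository.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical observation record with decimal-degree coordinates."""
    record_id: str
    latitude: float
    longitude: float


def coordinate_issues(rec: Record) -> list:
    """Return a list of human-readable problems found in one record."""
    issues = []
    if not -90.0 <= rec.latitude <= 90.0:
        issues.append("latitude out of range")
    if not -180.0 <= rec.longitude <= 180.0:
        issues.append("longitude out of range")
    # Exactly (0, 0) is a valid point, but is often a placeholder value.
    if rec.latitude == 0.0 and rec.longitude == 0.0:
        issues.append("coordinates are exactly (0, 0)")
    return issues


def flag_records(records: list) -> dict:
    """Map record_id to its issues, keeping only records that have any."""
    flagged = {}
    for rec in records:
        issues = coordinate_issues(rec)
        if issues:
            flagged[rec.record_id] = issues
    return flagged


# Illustrative records only.
sample = [
    Record("r1", 38.3, -76.4),
    Record("r2", 138.3, -76.4),
    Record("r3", 0.0, 0.0),
]
print(flag_records(sample))
```

Checks like these flag records for review rather than rejecting them, since a value that looks suspicious is not necessarily wrong.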