3. Outline
• Why are we talking about data
infrastructures?
• What are the challenges?
• What are the requirements?
• What parts are already available?
• How do we get there?
• PSA
4. Why Are We Talking About Data Infrastructure?
Data Type             Important   Easy
Atmospheric Data      52.2%       21.6%
Climate Data          56.0%       23.3%
Oceanographic Data    42.5%       18.9%
Geophysical Data      55.5%       22.0%
Geological Data       56.3%       19.8%
Critical Zone Data    19.3%       8.2%
Hydrology Data        48.4%       20.1%
Results from EarthCube Stakeholder Alignment Survey
5. Why Are We Talking About Data Infrastructure?
Working with multiple data sets within a discipline?
88.1% say it is important
23.5% say it is easy
Working with multiple data sets from many disciplines?
70.7% say it is important
9.8% say it is easy
Results from EarthCube Stakeholder Alignment Survey
6. Why Are We Talking About Data
Infrastructure?
• “Data Deluge”
• Large-scale problems
• Maturation of the internet
• Increased investment (e.g. EarthCube)
• Estuarine and coastal science has an interdisciplinary nature and a strong sharing culture
10. What is the current state of sharing?
• Data sharing varies widely by discipline
– No universal rules or agreements
– Sharing in marine science is 40%
– Other disciplines - 10% to 100%
11. What is the current state of sharing?
• Data sharing varies widely by discipline
• Far more scientists say they are willing to
share data than actually do
– Time to prepare
– Concerns about misuse
12. What is the current state of sharing?
• Data sharing varies widely by discipline
• Far more scientists say they are willing to
share data than actually do
• Lack of access to data is a major impediment
13. If sharing is so important why
aren’t more people doing it?
The large proportion of researchers who claim to
be willing to share data and the low numbers of
researchers who actually make their data easily
available suggest that data sharing would
increase substantially if the proper infrastructure
were in place.
14. Reasons for Not Sharing
• Not enough time or funding
• No place to put the data
• No standards or policies for sharing
• Others have no need for the data
• Loss of control
• No way to get credit
• Sensitive data cannot be shared
• Errors will be exposed
• Loss of competitiveness
15. Social Infrastructure Requirements
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
16. We need a system that can
• Share
• Preserve
• Digitize
• Automate
• Integrate
– Data
– Infrastructure
18. Data Set Heterogeneity
• Data format
• Data file format
• Data quality and completeness
• Physical samples
19. What Will We Do With the Data?
• Preserve Data
– Format migration
– Redundancy
– Self-Repair
• Serve Data
– Discoverable
– Accessible
– Usable
20. Technical Infrastructure Requirements
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical
21. Review Requirements
Sociological
• Repository capability
• Place conditions on access
• Mechanisms for data
citation and credit
• Data sharing policy
• Value added services
• Requirements from
publishers and funders
• Respect for confidentiality
• Ease of use
Technological
• Preservation
• Layered service architecture
• Repository functions
• Accommodate
heterogeneity
• Bridge digital and physical
30. How Can it all Fit Together?
Quality and Metrics
Access
Citation
Preservation
Mobilization
Integration
Repositories
Standards
31. Who Should Be Doing All This Work?
• Librarians
• Data Scientists
• Informaticians
• Ontologists
• Computer Scientists
• Software Developers
• Standards Groups
Image by Michael Krigsman
37. Acknowledgements
• Benjamin Fertig
• David Patterson
• Mike Kemp
• John Milliman
• Melissa Cragin
• Sayeed Choudhury
• Tim DiLauro
• Carol Palmer
• Nathan Wilson
• Alan Renear
• Ruth Duerr
• Cyndy Chandler
• Peter Fox
• Krishna Sinha
• Janet Fredericks
• Carl Lagoze
39. References
Atkins DE, Droegemeier KK, Feldman SI, Garcia-Molina H, Klein ML, Messerschmitt DG, Messina P, Ostriker JP, Wright MH.
2003. Revolutionizing science and engineering through cyberinfrastructure.
Borgman CL. 2010. Research data: who will share what, with whom, when, and why? Fifth China-North America Library
Conference 2010
Borgman CL. 2012. The conundrum of sharing research data. Journal of the American Society for Information Science and
Technology 63(6):1059-1078
Burton A, Treloar A. 2009. Designing for discovery and re-use: the ANDS data-sharing verbs approach to service decomposition.
The International Journal of Digital Curation 4.
Costello M. 2009. Motivating online publication of data. BioScience 59:418-426
Cragin MH, Palmer CL, Carlson JR, Witt M. 2010. Data sharing, small science and institutional repositories. Philosophical
Transactions of the Royal Society A 368:4023-4038
Edwards PN, Mayernik MS, Batcheller AL, Bowker GC, Borgman CL. 2011. Science friction: data, metadata and collaboration.
Social Studies of Science 41(5):667-690
Enke N, Thessen AE, Bach K, Bendix J, Seeger B, Gemeinholzer B. 2012. The User’s View on Biodiversity Data Sharing.
Ecological Informatics 11: 25-33
Field D, Sansone SA, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S,
Mugabushaka AM, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J. 2009.
‘Omics data-sharing. Science 326:234-236
Froese R, Lloris D, Opitz S. 2003. Scientific data in the public domain. ACP-EU Fisheries Research Report 14:267-271.
Gleditsch NP, Strand H. 2003. Posting your data: will you be scooped or will you be famous? International Studies Perspectives
4:89-97
Heidorn PB. 2008. Shedding light on the dark data in the long tail of science. Library Trends 57:280-299.
Henty M, Weaver B, Bradbury SJ, Simon P. 2008. Investigating data management practices in Australian Universities. APSR. QUT
digital repository http://eprints.qut.edu.au/14549
Hey T, Tansley S, Tolle K. 2009. The Fourth Paradigm. Microsoft Research. Redmond, WA, USA, 252 pp.
40. References
Key Perspectives Ltd. 2010. Data Dimensions: disciplinary differences in research data-sharing, reuse and long term viability.
DCC Scarp Synthesis Report. ISSN 1759-586X
Lagoze C, Patzke K. 2011. A research agenda for data curation cyberinfrastructure. JCDL’11
Mayernik MS, DiLauro T, Duerr R, Metsger E, Thessen AE, Choudhury GS. 2013. Data Conservancy provenance, context and
lineage services: key components for data preservation and curation. Data Science Journal 12:158-171
Palmer CL, Cragin MH, Heidorn PB, Smith LC. 2007. Data curation for the long tail of science: the case of environmental studies.
Digital Curation
Palmer CL, Weber NM, Cragin MH. 2011. The analytic potential of scientific data: understanding re-use value. ASIST 2011
Piwowar HA, Day RS, Fridsma DB. 2007. Sharing detailed research data is associated with increased citation rate. PLoS ONE
2(3):e308
Savage CJ, Vickers AJ. 2009. Empirical study of data-sharing by authors publishing in PLoS journals. PLoS ONE 4: e7078
Sinha AK, Thessen AE, Barnes CG. 2013. Geoinformatics: towards an integrative view of Earth as a system, in Bickford, M.E.,
ed., The Web of Geological Sciences: Advances, Impacts, and Interactions: Geological Society of America Special Paper 500,
p. 1-14. 10.1130/2013.2500(19)
Smith VS. 2009. Data publication: towards a database of everything. BMC Research Notes 2:113
Tenopir C, Allard S, Douglass KL, Aydinoglu AU, Wu L, Read E, Manoff M, Frame M. 2011. Data sharing by scientists: practices
and perceptions. PLoS ONE 6.6
Thessen AE, Patterson DJ. 2011. Data issues in the life sciences. ZooKeys 150:15-51
Wallis JC, Mayernik MS, Borgman CL, Pepe A. 2010. Digital libraries for scientific data discovery and reuse: from vision to
practical reality. Joint Conference on Digital Libraries 2010
Weber NM, Baker KS, Thomer AK, Chao TC, Palmer CL. 2012. Value and context in data use: domain analysis revisited.
Proceedings of the American Society for Information Science and Technology. 49(1):1-10
Whitlock MC. 2011. Data archiving in ecology and evolution: best practices. TREE 26(2):61-65
Editor's Notes
Good morning! My name is Anne Thessen and I’m going to speak to you today about data infrastructures for estuarine and coastal science. First I’d like to thank the conference organizers for inviting me to speak. The slides for this talk will be posted to slideshare later today.
Data are the raw material for discovery and innovation. We increase understanding of the earth as a system through the effective management of data. The nature of current problems, such as climate change and fisheries management, demands a holistic approach using data collected over large space/time scales. How do we build a system that can support that type of inquiry and help scientists answer questions they haven’t even asked yet? That system is what I am going to talk about today.
First I’m going to do a little convincing about why we need such a system. Then I’m going to talk about why we are thinking about data infrastructures now. Then I will discuss the challenges and talk about what a data infrastructure might look like. What exists now that we can use and build on? And then finally, how do we get there? Then I will end with a brief public service announcement.
First, the convincing. This table came from a survey of scientists asking how important it was for them to access and use data sets from different disciplines and then how easy it was to do so. The survey was the EarthCube Stakeholder Alignment Survey and I will talk more about EarthCube later. The important thing to note is that around half of all respondents said these different data sets were important to access and use, but only about 20% of respondents said it was easy.
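As a quick check on the "around half" and "about 20%" figures, here is a minimal sketch that averages the two columns of the slide 4 table and lists the gap for each data type. The percentages are copied from the slide; the variable and function names are illustrative only, and a simple unweighted mean is an assumption, since the survey's respondent counts per data type are not given here.

```python
# Percentages copied from the slide 4 table: (important, easy).
survey = {
    "Atmospheric": (52.2, 21.6),
    "Climate": (56.0, 23.3),
    "Oceanographic": (42.5, 18.9),
    "Geophysical": (55.5, 22.0),
    "Geological": (56.3, 19.8),
    "Critical Zone": (19.3, 8.2),
    "Hydrology": (48.4, 20.1),
}


def mean(values):
    """Unweighted arithmetic mean (an assumption: no respondent counts are available)."""
    values = list(values)
    return sum(values) / len(values)


mean_important = mean(imp for imp, _ in survey.values())
mean_easy = mean(easy for _, easy in survey.values())

# Gap in percentage points between "important" and "easy" for each data type.
gaps = {name: round(imp - easy, 1) for name, (imp, easy) in survey.items()}

print(f"mean important: {mean_important:.1f}%")
print(f"mean easy:      {mean_easy:.1f}%")
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap} point gap")
```

Sorting by gap makes it easy to see which data types have the largest mismatch between how much people want them and how easily they can get them.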
Here is more data from the stakeholder alignment survey. Respondents were asked how important it was to access and use multiple data sets from within their discipline. 88% said it was important, but only 23.5% said it was easy. Then they were asked how important it was to access and use multiple data sets from many disciplines. About 70% said it was important but less than 10% said it was easy. So there is a clear gap between what scientists want to do with data and what they can do with data. This isn’t to say that data sets aren’t being shared and reused and integrated, because they are. Right now the process is very manual and therefore limited in scope.
Why are we talking about data infrastructures now? I argue that we are at a convergence of factors that make now the perfect time to move forward with building data infrastructure. I’m sure you’ve all heard about big data and the data deluge. Many scientists are being buried under the data they generate and are actively looking for ways to cope. As I said before, scientists are being asked to address large-scale problems that require large-scale data. We have a maturation of the internet that helps with finding and accessing data. Probably most importantly, we have a recognition from funding agencies that an infrastructure of this sort is needed and they are willing to devote resources to it. Finally, estuarine and coastal science specifically stands to greatly benefit from such an infrastructure because of its interdisciplinary nature and its already strong data sharing culture.
Hopefully, I’ve convinced you that a data infrastructure for estuarine and coastal science is beneficial. The first question, of course, is where do we start? How do we start? There are many factors to take into consideration to build an infrastructure that people will actually use. We need to consider user needs, incentives to participate, the available technology and, finally, which pieces of existing infrastructure we can use. All of these factors will make up the required functions, or requirements, of the data infrastructure.
A data sharing infrastructure will have to accommodate sociological needs and technological needs. The needs will help define the requirements. There are many stakeholders with different needs. I will focus on the research scientist as the stakeholder in this talk.
Research scientists have needs as a data producer and needs as a data consumer. Some of these needs will be synergistic. Many scientists are more likely to participate as a data producer if their needs as a data consumer are met. This is important because a data infrastructure will only be useful if scientists are willing to place their data within the infrastructure.
That brings us to the issue of data sharing, which is largely a sociological issue. Let’s review the current state of sharing data. …
On the other side, scientists want others to share and say that lack of access to data generated by others is a major impediment to their research and science as a whole.
When we ask researchers why they don’t share their data they give reasons that fall under one or more of these categories:
We can take these reasons for not sharing and use them to develop requirements for a data infrastructure. If we address the scientists’ concerns, we can have a reasonable expectation that they will participate. So, for example, to address the lack of a place to put data, the infrastructure should have repository capabilities. To address the fear of loss of control or competitiveness, the infrastructure should allow providers to place conditions on accessing their data.
Now we can discuss the technological requirements. At a very basic level, we need a system that can.. The first question is what kind of data are we talking about here? What will we have to accommodate?
So let’s take a look at one aspect of the data landscape and that is data set size. You may have heard of the long tail of small science. That’s what this curve represents. On the left we have the small numbers of large data sets and on the right we have the many small data sets. The vast majority of research output in the US falls in this long tail. The data sets on opposite ends of this curve differ in more ways than just size. The data sets here tend to be more standardized. They are often born digital. The data sets here are very heterogeneous, may never be digitized and are difficult to find.
Data sets differ in many other ways as well. Even the term “data set” is poorly defined. Data sets differ in format, file format, quality and completeness. By completeness I’m referring to metadata. Some data sets have physical samples. That adds another dimension of complexity to a data infrastructure. Here I show some examples of types of data sets that an infrastructure would have to deal with. We have hand written, tabular data. We have an old format. We have a sediment core.
Once we have a good idea of what we are dealing with, then we have to decide what we will do with the data. There are two important tasks for a data infrastructure and they are related. They are preserving data and serving data. One without the other does not work very well. There is no point in preserving data if no one can ever get at it again and you can’t use data that no longer exist because they weren’t preserved. Just like it does no good to make all these yummy canned foods if no one ever gets to eat them. Digital data preservation is very difficult, partially because no one really knows how to do it or what it really means. Many people think they are preserving data when they actually aren’t. Several groups are trying to figure this out and some preservation actions have been identified such as format migration, redundancy and enabling self-repair. Serving data to a user is about ensuring that data are discoverable, accessible and usable. The key here is appropriate metadata, good search and browse and ease of use.
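To make "redundancy" and "self-repair" a little more concrete, here is a minimal sketch of one common interpretation: keep several copies of a file, record a checksum, and restore any copy that no longer matches it from a copy that still does. This is an illustration under stated assumptions, not the method of any particular repository; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_with_replicas(source: Path, replica_dirs: list[Path]) -> tuple[str, list[Path]]:
    """Redundancy: copy the file into each directory; return its digest and the copies."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(source, directory / source.name)))
    return expected, copies


def verify_and_repair(copies: list[Path], expected: str) -> list[Path]:
    """Self-repair: overwrite damaged or missing copies from an intact one.

    Returns the list of copies that had to be repaired.
    """
    good = [c for c in copies if c.exists() and sha256_of(c) == expected]
    if not good:
        raise RuntimeError("no intact copy left to repair from")
    repaired = []
    for copy in copies:
        if copy not in good:
            shutil.copy2(good[0], copy)
            repaired.append(copy)
    return repaired
```

Running `verify_and_repair` on a schedule is what turns plain backup copies into something closer to preservation: damage is noticed and fixed while at least one intact copy still exists. Format migration, the third action mentioned above, is not covered by this sketch.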
We can take the list of things an infrastructure has to cope with and use it to develop more requirements. A data infrastructure needs to preserve data. Because not all data sets will require the same level of service, the system needs a layered service architecture. The infrastructure needs to have repository functions and be able to cope with many types of heterogeneity. The infrastructure needs to be able to bridge the digital and the physical.
To recap, this is a review of all the sociological and technological requirements for a data infrastructure. Note that some of the requirements overlap each other. Some requirements, like supportive policies from publishers and funders will not come directly from the infrastructure. So the question now becomes, how do we get to where we want to go? We’ve outlined our requirements. Let’s take stock of what we already have.
There are already many repositories for scientific data. I don’t have the time to discuss each one. The logos I’ve included are not meant to be comprehensive. They are just examples. Repositories differ by the type of data they take and the level of service they provide. Some data types are not served by a repository.
I’m going to take a bit more time speaking about data citation because this is still relatively new. There is a movement to bring data sets to the level of publications. There are still a lot of kinks to iron out but NSF is now accepting relevant publications and products on biosketches. Repositories like figshare are assigning unique, citable identifiers to data sets in their system and projects like ImpactStory are synthesizing scientific output such as data sets and software.
I’ve already talked about the difficulty of digital data preservation. Most of the work in this area is coming from the library community.
Certain aspects of quality control can be automated. Many repositories, like GBIF, have algorithms for quality checking data, like latitude and longitude. Usage metrics and other types of quality can be measured using Web 2.0 and crowd sourcing solutions, such as assigning data sets stars or keeping track of how many times a data set is downloaded.
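As an illustration of the kind of check that can be automated, here is a minimal sketch of a latitude and longitude sanity test. It is a hypothetical example, not GBIF's actual algorithm: the function name, the record layout (a dictionary with `latitude` and `longitude` keys) and the specific rules are all assumptions.

```python
def check_coordinates(record: dict) -> list[str]:
    """Return a list of problems found in a record's coordinates (empty if none)."""
    lat = record.get("latitude")
    lon = record.get("longitude")
    if lat is None or lon is None:
        return ["missing coordinate"]
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return ["non-numeric coordinate"]

    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    # Exactly 0, 0 is a valid point in the ocean, but it is often a placeholder.
    if lat == 0 and lon == 0:
        problems.append("coordinates are exactly 0, 0 (possible placeholder)")
    return problems


records = [
    {"latitude": 38.3, "longitude": -76.4},
    {"latitude": 138.3, "longitude": -76.4},
    {"latitude": "abc", "longitude": 1},
]
for record in records:
    print(record, check_coordinates(record))
```

Checks like this only flag records for a person to review; deciding whether a flagged value is really an error still takes domain knowledge.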
Innovative ways of automated data integration are developing through advancements in Web 3.0 or semantic web technologies that focus on knowledge representation. These are the types of technologies used by IBM’s Watson.
Data that are not or have not traditionally been digital can be made digital through programs like labfolder, which acts as a digital lab notebook, and Kepler, which digitally documents workflows.
To avoid silos, numerous access protocols and web services have been developed for efficient access to data sets. There are even data brokers that will move data around for you.
While many groups say that they have no standards for their data, there are several standards bodies actively developing and maintaining different types of data and metadata standards.
Once you’ve taken stock of what already exists, then it is time to find the gaps and to figure out how it will all fit together. The specific answer is currently being hashed out. The NSF EarthCube program is doing exactly that. If you’ve never heard of EarthCube, I’ll briefly say that it is an NSF project to build data infrastructure for geoscience. You can find more at earthcube.org.
You are all probably thinking, “Who will do all of this work? I certainly don’t have the time!” That’s okay because you, the research scientist, will not have to do this work. There are (read list) all very capable of bringing this system to fruition. But what you will have to do is collaborate. These people need input from practitioners to make sure that what they build will meet a need. Now that funding agencies are supporting the development of these infrastructures, there are opportunities to fund these collaborations. EarthCube has a great mechanism called memberconnect that helps folks from different disciplines who don’t normally talk to each other find each other.
Now for the last bit, the public service announcement. Much of this vision for a data infrastructure still relies on scientists sharing their data. Unfortunately, the upfront work of preparing data is a major impediment to sharing. To many, the benefits are not obvious, so I’m going to spend a little time talking about the benefits.
Scientists who share their data get increased recognition in the form of more citations and a stronger professional reputation. Some see increased economic opportunities for things like selling photographs. The data set itself can be improved by having errors corrected, metadata enhanced or other bits added. We can have improved science all around by sharing data. There were some very high profile cases of research fraud that probably would not have been allowed to happen if data sharing were more common. And last but not least, when data are shared, time and money are saved.
If that’s not enough to convince, I’m going to talk a little about data hoarding. Data hoarding was a term that I came up with on the spot and the more I thought about it the more I liked it because hoarding data can be very similar to hoarding things. This is a picture of a hoarded home. This person spent a lot of time gathering things and these things may have been valuable at one time, but they were not taken care of properly and now no one can use them. This person doesn’t know what he/she has any more.
Here are two pictures of offices that look similar to the hoarded home. This picture is from my office. It shows approximately a cubic meter of paper containing data. Some of it is published and some of it is not. It’s been a few years since I’ve looked at it. I’ve probably forgotten a few things about it. This is another picture of an office with a pile of paper that appears to not be organized in any way. How useful are these data? I worked very hard to fill these boxes. I show these pictures because the fate of the contents of a hoarded home and the fate of hoarded data are often the same.
If I get hit by a bus, neither my husband nor my son will bother to try to figure out what is on all that paper. When I retire, my Dean or whoever isn’t going to sort through anything. My life’s work will go in the dumpster – just like the objects in the hoarded home. And just like our other possessions, sharing data with others helps to keep it relevant and useful. It’s when I put my data on a computer disk and throw it in a desk drawer that it begins its inevitable decay. At the end of the day, throwing away data is throwing away money.
With that I will take a few questions. I do want to mention that I started a data management company last fall. I have a couple of clients and I would love to help you with your data problems. So feel free to talk to me about it or email.