Data Infrastructure for Coastal and Estuarine Science


This talk was given to the Atlantic Estuarine Research Society at its 2014 Spring meeting in Ocean City, Maryland, USA.

Published in: Science
  • Good morning! My name is Anne Thessen and I’m going to speak to you today about data infrastructures for estuarine and coastal science. First I’d like to thank the conference organizers for inviting me to speak. The slides for this talk will be posted to slideshare later today.
  • Data are the raw material for discovery and innovation. We increase understanding of the earth as a system through the effective management of data. The nature of current problems, such as climate change and fisheries management, demands a holistic approach using data collected over large space/time scales. How do we build a system that can support that type of inquiry and help scientists answer questions they haven’t even asked yet? That system is what I am going to talk about today.
  • First I’m going to do a little convincing about why we need such a system. Then I’m going to talk about why we are thinking about data infrastructures now. Then I will discuss the challenges and talk about what a data infrastructure might look like. What exists now that we can use and build on? And then finally, how do we get there. Then I will end with a brief public service announcement.
  • First, the convincing. This table came from a survey of scientists asking how important it was for them to access and use data sets from different disciplines and then how easy it was to do so. The survey was the EarthCube Stakeholder Alignment Survey and I will talk more about EarthCube later. The important thing to note is that around half of all respondents said these different data sets were important to access and use, but only about 20% of respondents said it was easy.
  • Here are more data from the stakeholder alignment survey. Respondents were asked how important it was to access and use multiple data sets from within their discipline. 88% said it was important, but only 23.5% said it was easy. Then they were asked how important it was to access and use multiple data sets from many disciplines. 70% said it was important but less than 10% said it was easy. So there is a clear gap between what scientists want to do with data and what they can do with data. This isn’t to say that data sets aren’t being shared and reused and integrated, because they are. Right now the process is very manual and therefore limited in scope.
  • Why are we talking about data infrastructures now? I argue that we are at a convergence of factors that makes now the perfect time to move forward with building data infrastructure. I’m sure you’ve all heard about big data and the data deluge. Many scientists are being buried under the data they generate and are actively looking for ways to cope. As I said before, scientists are being asked to address large-scale problems that require large-scale data. We have a maturation of the internet that helps with finding and accessing data. Probably most importantly, we have a recognition from funding agencies that an infrastructure of this sort is needed and they are willing to devote resources to it. Finally, estuarine and coastal science specifically stands to greatly benefit from such an infrastructure because of its interdisciplinary nature and its already strong data sharing culture.
  • Hopefully, I’ve convinced you that a data infrastructure for estuarine and coastal science is beneficial. The first question of course, is where do we start? How do we start? There are many factors to take into consideration to build an infrastructure that people will actually use. We need to consider user needs, incentives to participate, what is the available technology that we can use and finally what pieces of existing infrastructure can we use? All of these factors will make up the required functions, or requirements, of the data infrastructure.
  • A data sharing infrastructure will have to accommodate sociological needs and technological needs. The needs will help define the requirements. There are many stakeholders with different needs. I will focus on the research scientist as the stakeholder in this talk.
  • Research scientists have needs as a data producer and needs as a data consumer. Some of these needs will be synergistic. Many scientists are more likely to participate as a data producer if their needs as a data consumer are met. This is important because a data infrastructure will only be useful if scientists are willing to place their data within the infrastructure.
  • That brings us to the issue of data sharing, which is largely a sociological issue. Let’s review the current state of sharing data. …
  • On the other side, scientists want others to share and say that lack of access to data generated by others is a major impediment to their research and science as a whole.
  • When we ask researchers why they don’t share their data they give reasons that fall under one or more of these categories:
  • We can take these reasons for not sharing and use them to develop requirements for a data infrastructure. If we address the scientists’ concerns, we can have a reasonable expectation that they will participate. So, for example, to address the lack of a place to put data, the infrastructure should have repository capabilities. To address the fear of loss of control or competitiveness, the infrastructure should allow providers to place conditions on accessing their data.
  • Now we can discuss the technological requirements. At a very basic level, we need a system that can.. The first question is, what kind of data are we talking about here? What will we have to accommodate?
  • So let’s take a look at one aspect of the data landscape and that is data set size. You may have heard of the long tail of small science. That’s what this curve represents. On the left we have the small numbers of large data sets and on the right we have the many small data sets. The vast majority of research output in the US falls in this long tail. The data sets on opposite ends of this curve differ in more ways than just size. The data sets here tend to be more standardized. They are often born digital. The data sets here are very heterogeneous, may never be digitized and are difficult to find.
  • Data sets differ in many other ways as well. Even the term “data set” is poorly defined. Data sets differ in format, file format, quality and completeness. By completeness I’m referring to metadata. Some data sets have physical samples. That adds another dimension of complexity to a data infrastructure. Here I show some examples of types of data sets that an infrastructure would have to deal with. We have hand written, tabular data. We have an old format. We have a sediment core.
  • Once we have a good idea of what we are dealing with, then we have to decide what we will do with the data. There are two important tasks for a data infrastructure and they are related. They are preserving data and serving data. One without the other does not work very well. There is no point in preserving data if no one can ever get at it again and you can’t use data that no longer exist because they weren’t preserved. Just like it does no good to make all these yummy canned foods if no one ever gets to eat them. Digital data preservation is very difficult, partially because no one really knows how to do it or what it really means. Many people think they are preserving data when they actually aren’t. Several groups are trying to figure this out and some preservation actions have been identified such as format migration, redundancy and enabling self-repair. Serving data to a user is about ensuring that data are discoverable, accessible and usable. The key here is appropriate metadata, good search and browse and ease of use.
  • We can take the list of things an infrastructure has to cope with and use it to develop more requirements. A data infrastructure needs to preserve data. Because not all data sets will require the same level of service, the system needs a layered service architecture. The infrastructure needs to have repository functions and be able to cope with many types of heterogeneity. The infrastructure needs to be able to bridge the digital and the physical.
  • To recap, this is a review of all the sociological and technological requirements for a data infrastructure. Note that some of the requirements overlap each other. Some requirements, like supportive policies from publishers and funders will not come directly from the infrastructure. So the question now becomes, how do we get to where we want to go? We’ve outlined our requirements. Let’s take stock of what we already have.
  • There are already many repositories for scientific data. I don’t have the time to discuss each one. The logos I’ve included are not meant to be comprehensive. They are just examples. Repositories differ by the type of data they take and the level of service they provide. Some data types are not served by a repository.
  • I’m going to take a bit more time speaking about data citation because this is still relatively new. There is a movement to bring data sets to the level of publications. There are still a lot of kinks to iron out but NSF is now accepting relevant publications and products on biosketches. Repositories like figshare are assigning unique, citable identifiers to data sets in their system and projects like ImpactStory are synthesizing scientific output such as data sets and software.
  • I’ve already talked about the difficulty of digital data preservation. Most of the work in this area is coming from the library community.
  • Certain aspects of quality control can be automated. Many repositories, like GBIF, have algorithms for quality checking data, like latitude and longitude. Usage metrics and other types of quality can be measured using Web 2.0 and crowd sourcing solutions, such as assigning data sets stars or keeping track of how many times a data set is downloaded.
  • Innovative ways of automated data integration are developing through advancements in Web 3.0 or semantic web technologies that focus on knowledge representation. These are the types of technologies used by IBM’s Watson.
  • Data that are not, or have not traditionally been, digital can be made digital through programs like labfolder, which acts as a digital lab notebook, and Kepler, which digitally documents workflows.
  • To avoid silos, numerous access protocols and web services have been developed for efficient access to data sets. There are even data brokers that will move data around for you.
  • While many groups say that they have no standards for their data, there are several standards bodies actively developing and maintaining different types of data and metadata standards.
  • Once you’ve taken stock of what already exists, then it is time to find the gaps and to figure out how it will all fit together. The specific answer is currently being hashed out. The NSF EarthCube program is doing exactly that. If you’ve never heard of EarthCube, I’ll briefly say that it is an NSF project to build data infrastructure for geoscience. You can find more at
  • You are all probably thinking, “Who will do all of this work? I certainly don’t have the time!” That’s okay because you, the research scientist, will not have to do this work. There are (read list) all very capable of bringing this system to fruition. But what you will have to do is collaborate. These people need input from practitioners to make sure that what they build will meet a need. Now that funding agencies are supporting the development of these infrastructures, there are opportunities to fund these collaborations. EarthCube has a great mechanism called memberconnect that helps folks from different disciplines who don’t normally talk to each other, find each other.
  • Now for the last bit, the public service announcement. Much of this vision for a data infrastructure still relies on scientists sharing their data. Unfortunately, the upfront work of preparing data is a major impediment to sharing. To many, the benefits are not obvious, so I’m going to spend a little time talking about the benefits.
  • Scientists who share their data get increased recognition in the form of more citations and professional reputation. Some see increased economic opportunities for things like selling photographs. The data set itself can be improved by having errors corrected, metadata enhanced or other bits added. We can have improved science all around by sharing data. There were some very high profile cases of research fraud that probably would not have been allowed to happen if data sharing were more common. And last but not least, when data are shared, time and money are saved.
  • If that’s not enough to convince, I’m going to talk a little about data hoarding. Data hoarding was a term that I came up with on the spot and the more I thought about it the more I liked it because hoarding data can be very similar to hoarding things. This is a picture of a hoarded home. This person spent a lot of time gathering things and these things may have been valuable at one time, but they were not taken care of properly and now no one can use them. This person doesn’t know what he/she has any more.
  • Here are two pictures of offices that look similar to the hoarded home. This picture is from my office. It shows approximately a cubic meter of paper containing data. Some of it is published and some of it is not. It’s been a few years since I’ve looked at it. I’ve probably forgotten a few things about it. This is another picture of an office with a pile of paper that appears to not be organized in any way. How useful is this data? I worked very hard to fill these boxes. I show these pictures because the fate of the contents of a hoarded home and hoarded data are often the same.
  • If I get hit by a bus, neither my husband nor my son will bother to try to figure out what is on all that paper. When I retire, my Dean, or whoever, isn’t going to sort through anything. My life’s work will go in the dumpster, just like the objects in the hoarded home. And just like our other possessions, sharing data with others helps to keep it relevant and useful. It’s when I put my data on a computer disk and throw it in a desk drawer that it begins its inevitable decay. At the end of the day, throwing away data is throwing away money.
  • With that I will take a few questions. I do want to mention that I started a data management company last fall. I have a couple of clients and I would love to help you with your data problems. So feel free to talk to me about it or email.
  • Data Infrastructure for Coastal and Estuarine Science
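
The notes on quality control mention that repositories can automate checks on fields like latitude and longitude. As a rough illustration of what such a check might look like, here is a minimal Python sketch. It is not the algorithm used by GBIF or any other repository; the `Record` class, its field names, and the specific rules are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical observation record; the field names are illustrative only."""
    record_id: str
    latitude: Optional[float]
    longitude: Optional[float]


def coordinate_issues(record: Record) -> List[str]:
    """Return a list of problems found in a record's coordinates (empty if none)."""
    lat, lon = record.latitude, record.longitude
    if lat is None or lon is None:
        return ["missing coordinate"]
    issues = []
    # Valid latitudes run from -90 to 90 degrees, longitudes from -180 to 180.
    if not -90.0 <= lat <= 90.0:
        issues.append("latitude out of range")
    if not -180.0 <= lon <= 180.0:
        issues.append("longitude out of range")
    # Assumed rule: an exact (0, 0) pair is flagged as a likely placeholder.
    if lat == 0.0 and lon == 0.0:
        issues.append("zero coordinate (possible placeholder)")
    return issues
```

A repository could run a function like this over every incoming record and attach the resulting flags to the record rather than rejecting it, so that the data provider keeps control over any corrections.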

    1. Data Infrastructures for Estuarine and Coastal Science, Anne E. Thessen
    2. Photo Credit: NASA/GSFC/NOAA/USGS
    3. Outline • Why are we talking about data infrastructures? • What are the challenges? • What are the requirements? • What parts are already available? • How do we get there? • PSA
    4. Why Are We Talking About Data Infrastructure?
       Data Type             Important   Easy
       Atmospheric Data      52.2%       21.6%
       Climate Data          56.0%       23.3%
       Oceanographic Data    42.5%       18.9%
       Geophysical Data      55.5%       22.0%
       Geological Data       56.3%       19.8%
       Critical Zone Data    19.3%       8.2%
       Hydrology Data        48.4%       20.1%
       Results from EarthCube Stakeholder Alignment Survey
    5. Why Are We Talking About Data Infrastructure? Working with multiple data sets within a discipline? 88.1% say it is important, 23.5% say it is easy. Working with multiple data sets from many disciplines? 70.7% say it is important, 9.8% say it is easy. Results from EarthCube Stakeholder Alignment Survey
    6. Why Are We Talking About Data Infrastructure? • “Data Deluge” • Large-scale problems • Maturation of the internet • Increased investment (e.g. EarthCube) • Estuarine and coastal science has interdisciplinary nature and strong sharing culture
    7. Where Do We Start? • User Needs • Incentives • Available Technology • Existing Infrastructure
    8. Sociological • Data sharing • Incentives • Data cultures • Science practices Technological • Massive heterogeneity • Storage capacity • Moving data around • Efficient query • Processing speed • Knowledge representation
    9. Stakeholder Assessment • Data producers • Data consumers Photo Credits: The University of Nottingham; Kay Nietfeld/EPA
    10. What is the current state of sharing? • Data sharing varies widely by discipline – No universal rules or agreements – Sharing in marine science is 40% – Other disciplines - 10% to 100%
    11. What is the current state of sharing? • Data sharing varies widely and by discipline • Far more scientists say they are willing to share data than actually do – Time to prepare – Concerns about misuse
    12. What is the current state of sharing? • Data sharing varies widely and by discipline • Far more scientists say they are willing to share data than actually do • Lack of access to data is a major impediment
    13. If sharing is so important why aren’t more people doing it? The large proportion of researchers who claim to be willing to share data and the low numbers of researchers who actually make their data easily available suggests that data sharing would increase substantially if the proper infrastructure were in place.
    14. Reasons for Not Sharing • Not enough time or funding • No place to put the data • No standards or policies for sharing • Others have no need for the data • Loss of control • No way to get credit • Sensitive data cannot be shared • Errors will be exposed • Loss of competitiveness
    15. Social Infrastructure Requirements • Repository capability • Place conditions on access • Mechanisms for data citation and credit • Data sharing policy • Value added services • Requirements from publishers and funders • Respect for confidentiality • Ease of use
    16. We need a system that can • Share • Preserve • Digitize • Automate • Integrate – Data – Infrastructure
    17. Data Set Size
    18. Data Set Heterogeneity • Data format • Data file format • Data quality and completeness • Physical samples
    19. What Will We Do With the Data? • Preserve Data – Format migration – Redundancy – Self-Repair • Serve Data – Discoverable – Accessible – Usable
    20. Technical Infrastructure Requirements • Preservation • Layered service architecture • Repository functions • Accommodate heterogeneity • Bridge digital and physical
    21. Review Requirements Sociological • Repository capability • Place conditions on access • Mechanisms for data citation and credit • Data sharing policy • Value added services • Requirements from publishers and funders • Respect for confidentiality • Ease of use Technological • Preservation • Layered service architecture • Repository functions • Accommodate heterogeneity • Bridge digital and physical
    22. What is Available? Repositories
    23. What is Available? Citation Repositories
    24. What is Available? Preservation Repositories Citation
    25. What is Available? Quality Control and Usage Metrics Repositories Citation Preservation Crowd Sourcing Web 2.0
    26. What is Available? Integration Repositories Citation Preservation Quality and Metrics Web 3.0
    27. What is Available? Mobilization Repositories Citation Preservation Quality and Metrics Integration
    28. What is Available? Access Protocols Web Services Data Brokers Repositories Citation Preservation Quality and Metrics Integration Mobilization
    29. What is Available? Standards Repositories Citation Preservation Quality and Metrics Integration Mobilization Access
    30. How Can it all Fit Together? Quality and Metrics Access Citation Preservation Mobilization Integration Repositories Standards
    31. Who Should Be Doing All This Work? • Librarians • Data Scientists • Informaticians • Ontologists • Computer Scientists • Software Developers • Standards Groups Image by Michael Krigsman
    32. PSA
    33. Why Share Data? • Increased recognition • Increased economic opportunities • Improved data set • Improved science • Time and money saved
    34. Photo Credit: Emergency Cleaning Solutions
    35. Photo Credit: The Collared Sheep
    36. Acknowledgements • Benjamin Fertig • David Patterson • Mike Kemp • John Milliman • Melissa Cragin • Sayeed Choudhury • Tim DiLauro • Carol Palmer • Nathan Wilson • Alan Renear • Ruth Duerr • Cyndy Chandler • Peter Fox • Krishna Sinha • Janet Fredericks • Carl Lagoze
    37. Questions?
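
The slide on preserving data lists redundancy and self-repair as preservation actions. One simple way to picture how the two could work together is sketched below in Python: several copies of a file are compared by checksum, and any copy that disagrees with the majority is replaced. This is only one possible reading of those terms, not the method of any particular repository or project named in the talk; the function names and the majority-vote rule are assumptions for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict


def checksum(data: bytes) -> str:
    """SHA-256 digest used as a fingerprint of a copy's content."""
    return hashlib.sha256(data).hexdigest()


def repair_copies(copies: Dict[str, bytes]) -> Dict[str, bytes]:
    """Replace copies that disagree with the majority (hypothetical repair rule).

    `copies` maps a storage location name to the bytes held there.
    Raises ValueError when no strict majority of copies agree.
    """
    digests = {name: checksum(data) for name, data in copies.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority among copies; manual review needed")
    # Take any copy that matches the majority as the trusted content.
    good = next(data for name, data in copies.items()
                if digests[name] == majority_digest)
    return {name: good for name in copies}
```

With three copies where one has been silently corrupted, the two matching copies outvote the damaged one and it is overwritten. With only two copies that disagree, the sketch refuses to guess, which is one reason redundancy usually means more than a single backup.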