Currently, these data are difficult to find, obtain, and use because people from disciplines across the natural and social sciences collect, describe, and store their data in many different ways. These data could have significant value if it were possible to connect data collectors with potential users of data and if it were easy for individuals to search for, aggregate, and maintain valuable data for the long term.
To expand a bit on the previous slide … We characterize the needs of sustainability scientists as a “long tail problem” in which scientists need diverse data from multiple sources that overlap in geographic coverage and time, but that also have gaps in location, time, resolution, and types of measurements. The data are heterogeneous and vary in format, metadata, size, and quality. One of the biggest challenges we face is supporting diverse needs for heterogeneous data. Our strategies for coping effectively with the diversity of data are based on several underlying principles of long tail phenomena: While the aggregate demand for SEAD’s service is large and growing, demand for any particular collection of data is small and focused. Therefore, the investments SEAD makes in any particular set of data have to be quite low. Decisions about which data merit investment in curation should be driven by the data’s value to the community and potential for productive use. Building on a strong foundation of existing infrastructure, collaborative relationships, and expertise, the SEAD team will be able to tackle challenging problems in the long tail with innovative, forward-looking, and outward-facing approaches.
Mention something about the 18-month prototype and that the tasks during this time-frame focus on the first 3 bullets.
Additional text for bullet 1: Provide tools and services that benefit data providers during active projects. Provide tools and services that allow data users to collaboratively curate data.
We will build usable and useful tools that scientists can take advantage of as they collect, generate, and organize data in their active projects. This Active Curation approach will be designed with a great deal of user input to make sure that the tools are lightweight, easy to learn, easy to use, and more effective than the painstaking, hand-crafted approach that many sustainability scientists use today. The Active Curation approach will make data management easier for data producers and lower the curation costs to SEAD. Another part of our strategy is to deploy a variety of social networking and social-media-inspired tools to engage the community of data producers and users. These include tools for annotation, rating, and commentary on data sets; visualizations of publication and citation networks that map the invisible college of sustainability science researchers; and social networking tools that help build network effects. We have designed our program with multiple mechanisms to encourage participation in SEAD and adoption of its approach. These include holding domain engagement workshops to surface needs and requirements, ensuring the usability of tools, and enlisting key leaders in sustainability as early adopters and promoters of SEAD. These strategies, along with support for centralized curation services, education, outreach, and training, will create a model for sustainable access to and preservation of heterogeneous data for sustainability science and other small science disciplines in the long tail.
Robert, I wanted to illustrate the long-term repository piece, but couldn’t find anything very good from previous slides. I put this in for now, but you may have something better.
GETTING THE MOST OUT OF DATANET: A PANEL DISCUSSION OF THE NSF FUNDED DATANET PARTNERSHIPS
Robert H. McDonald – SEAD – Indiana University
Catherine Fitch – TerraPop – Minnesota Population Center
Richard Marciano – Datanet Federation Consortium – University of North Carolina
Sayeed Choudhury – Data Conservancy – Johns Hopkins University
William Michener – DataOne – University of New Mexico
NSF DATANET PROGRAM - OFFICE OF CYBERINFRASTRUCTURE
NSF DATANET PROGRAM
• DataNet efforts effectively balance:
  • Production infrastructure for operational data curation services
  • Research to create next generation data cyberinfrastructure
• DataNet awards are partnerships:
  • Responsive to user communities to define their meaningful and useful scope
  • Form a coordinated network to provide national, interdisciplinary data models and infrastructure
SEAD TEAM
University of Michigan: Margaret Hedstrom (UM PI), Ann Zimmerman (Co-PI and Project Manager), George Alter, Bryan Beecher, Charles Severance, Karen Woollams, Jude Yew.
Indiana University: Beth Plale (IU PI), Katy Börner, Robert H. McDonald, Kavitha Chandrasekar, Robert Ping, Stacy Kowalczyk, Robert Light.
University of Illinois: Praveen Kumar (UIUC PI), Rob Kooper, Luigi Marini, Terry McLaren.
Rensselaer Polytechnic Institute: Jim Myers (RPI PI), Ram Prasanna Govind Krishnan, Lindsay Todd, Adam Wilson.
#OCI0940824
SEAD PARTNERSHIP
[Figure labels: Margaret Hedstrom, PI; Ann Zimmerman; Beth Plale; Katy Börner; Robert H. McDonald; Praveen Kumar; James Myers; George Alter & Bryan Beecher]
Data challenges
• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets
Provide innovative new models and tools for serving the long tail of scientific research
SEAD’S GOALS
• Provide data services that address the pressing needs of researchers working toward sustainability
• Integrate these services into a generalizable “Active and Social Curation” infrastructure well-suited to the social structure and economics of long-tail research communities
• Develop capabilities to package and migrate datasets to a federated repository infrastructure for long-term preservation
• Provide education, outreach, and training to maximize value and disseminate SEAD’s contributions to other projects and communities
SEAD’S STRATEGY
• Move data curation upstream in the data life cycle
• Involve domain scientists in setting priorities for evolution of data and services
• Use a wide variety of mechanisms to remain resilient in a dynamic research and technology environment
ACTIVE AND SOCIAL CURATION
• Engage researchers during projects, not at the end
• Use information that is automatically captured or generated through tools to reduce the costs of metadata collection and to capture its value in actionable form
• Further reduce costs by re-engineering curation processes to leverage this rich metadata and volunteered effort
ACTIVE CURATION MODEL
[Figure labels: Active Curation; Social Media; Review Workflows; Rating; Commenting; Data; Metadata]
SEAD LAYERCAKE VIEW
Services over an active content layer that is backed by/harvested into a federated archive infrastructure based on institutional resources
[Figure layers: Network of Data Producers; Web User Interface; Active Content Repository; Services Provided (Content Mining, Curation Decisions, Archival data generation, Other services); Virtual Archives; Institutional Repositories (IU, RPI, UIUC, UM, ICPSR, Data Conservancy); User Network]
ACKNOWLEDGMENTS
SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824