
Heidorn: The Path to Enlightened Solutions for Biodiversity's Dark Data, ViBRANT 2011

The Path to Enlightened Solutions for Biodiversity's Dark Data
Keynote at Scripting Life: the science behind ViBRANT http://vbrant.eu/presentations

3 Comments
  • Interesting ideas; there is also a need to assign a permanent identifier to a particular specimen, so that data pertaining to that specimen (not the genus or anything as abstract as that) can be pulled together. See, for example, http://agro.biodiver.se/2011/04/genebank-data-identifiers/ and http://dagendresen.wordpress.com/2011/04/16/dois-for-genebank-collections/
  • You are correct. I should have a slide on that. We need to overlay the audio somehow.
  • I am missing the slide about immortality of scientists. If I remember correctly it went about like this (about personal data management):
    Because scientists are immortal and publications and institutions cannot handle heterogeneous data we publish everything on the personal home page.
    The problem is: to convince people that they are not immortal.
Slide notes
  • Contributor: The University of Arizona Herbarium; Creator: Homer Leroy Shantz; From: HOMER L. SHANTZ, 1876 TO 1958: A BOTANIST IN AFRICA AND THE AMERICAS by Kathleen McConnell; Language: English; Title: azu_shantz_19310222_a_2_m; Created: 1931-02-22; Continent: North America; Country: United States; Country Then: United States; Place: Saguaro Forest; Province: Arizona; Is this item in the public domain?: No
  • Change to new front image
  • Add jobs from the Interagency Working Group report.
  • Re: Lesley Wyborn
  • Government staff, scientists, researchers, and land managers spend too much time looking for data and getting it into a useful shape. It is too difficult for data gatherers to make their data available in a useful format.

    1. The Path to Enlightened Solutions for Biodiversity's Dark Data. P. Bryan Heidorn, University of Arizona and JRS Biodiversity Foundation. 2011 Scripting Life: the science behind ViBRANT, Paris, France, 20-21 January 2011.
    2. University of Arizona today: 25°C, sunny
    3. Thesis
        • Large amounts of data remain uncurated
        • Most of that data is from small data sets and is currently largely invisible – Dark Data
        • This data should be curated locally but not by scientists alone
        • Need for long-lived institutions
    4. Cyberinfrastructure Vision
        • “The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”
            • NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3
    5. Recognition of need for data curation
        • “Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”
        • Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations
    6. Interagency Working Group on Digital Data
        • Recognition of the importance of Information
        • Recognition of the need for education
        • New work roles within traditional institutions
    7. Why Libraries and Museums
        • Long history of scholarly data management
        • Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri
        • Long-lived institutions
        • Existing overlap with museums and archives
    8. The problem
        • Recognition of the problem
        • Information is not in accessible format
        • Computer Science, Information Science and Technology have not addressed the problem
        • No training or incentive for data generators
    9. Dark data is the data that we know is/was there but we can’t see it.
        • Image: Hubble Space Telescope composite image, "ring" of dark matter in the galaxy cluster Cl 0024+17
    10. Related Ideas
        • John Porter:
            • Deep versus Wide databases
        • Swanson:
            • Undiscovered Public Knowledge
        • Science Commons:
            • Big versus Small science
    11. Power Law of Science Data
        • [Figure: data volume plotted against science projects and initiatives, with GenBank and PDB labeled.]
        • Curve: $f(x) = ax^k + o(x^k)$
        • Marked region: $f(x) = ax^k + o(x^k)$ restricted to $x < 0.20$
    12. Does NSF’s Data Follow the Power Law? I do not know, but if $1 = X bytes...
    13. 20-80 Rule: The small are big!
        Total grants: 9347; total dollars: $2,137,636,716

                            20%                    80%
        Number of grants    1869                   7478
        Total dollars       $1,199,088,125         $938,548,595
        Range               $6,892,810-$350,000    $350,000-$831
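The point of the 20-80 slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the figures are copied from the slide, while the variable and function names are invented here. It computes what share of the grant dollars sits in the largest 20% of grants and what share sits in the remaining 80%.

```python
# Figures copied from the 20-80 slide (NSF grants); names are illustrative only.
largest_20 = {"grants": 1869, "dollars": 1_199_088_125}
remaining_80 = {"grants": 7478, "dollars": 938_548_595}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Totals are recomputed from the two columns rather than taken from the slide.
total_grants = largest_20["grants"] + remaining_80["grants"]
total_dollars = largest_20["dollars"] + remaining_80["dollars"]

grant_share_largest = share(largest_20["grants"], total_grants)
dollar_share_largest = share(largest_20["dollars"], total_dollars)
dollar_share_remaining = share(remaining_80["dollars"], total_dollars)

print(f"Largest 20% of grants: {grant_share_largest:.1f}% of grants, "
      f"{dollar_share_largest:.1f}% of dollars")
print(f"Remaining 80% of grants: {dollar_share_remaining:.1f}% of dollars")
```

On these numbers the small grants together hold well over 40% of the money, which is the sense in which "the small are big". The two dollar columns sum to within a few dollars of the total stated on the slide.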
    14. Biology 2009
        • Number of grants: 1886
        • Total: $744,168,471 ≈ €550,000,000
        • Distribution: 1266 grants < $0.5 million ≈ €370,000
        • Mode: $304,691 ≈ €225,000
        • Myth of the mega-project
    15. Small data is big science
        • Because it is high volume
        • Because it is information rich – high entropy
        • While the needs of large data are understood, those of small data and integration are not
        • Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. (http://hdl.handle.net/2142/9127).
    16. Where to find dark data
        • Scientist’s backpacks and desks
        • Literature/Biodiversity Heritage Library
        • Museum Specimens
        • Field notes
        • Citizen Observations
    17. What is dark data good for?
        • Ecological Niche Modeling
        • Climate Change niche change prediction
        • Taxonomic Name Resolution
        • Literature Search Support
            • Taxonomic intelligence
            • Key-like – character searching
        • Phenology and Phenology change
        • Food-web / trophic level
    18. Problematic Transition
        • Personal Information Management vs Knowledge Organization
        • Pluralistic vs Unified (Hjørland, 2007)
    19. Contrast in Styles (White, in press)
        • Personal Information Management
            • One-Few users
            • Visual/Spatial
            • Project Oriented
        • Knowledge Organization
            • Many users
            • Language based
            • Long-term orientation
    20. New Information Disciplines
        • Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)
        • Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form
        • Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection
        • (Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)
    21. Roles
    22. Skills
    23. Library Roles
        • Life Cycle Phases
            • Plan
            • Create
            • Keep
            • Dispose
        • Data Management Function
            • Access
            • Document
            • Organize
            • Protect
    24. How to Organize at a higher level?
        • It is difficult to find what is already known
        • Clonal specimens may be stored in different museums around the world
        • DNA analysis may be conducted on one but not the other
        • Micrographs may be in a database
        • Taxonomic treatments or revisions may exist
    25. Biological Science Collections (BiSciCol) Tracker
        • [Figure: three specimens of Agave sisalana, S1: KNM (Nairobi National Museum), S2: MNHN (Muséum national d'histoire naturelle) and S3: MBG (living collection, Missouri Botanical Garden), shown with Determination, Gene Sequence, Parasitism and GenBank, with question marks between them.]
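The tracking problem on the BiSciCol slide can be pictured as a graph of identifiers with links between them. The sketch below is a hypothetical illustration, not BiSciCol's actual design or API: the identifier strings, the relations, and the helper names `relate` and `connected` are all invented here. It shows how, once links are recorded, everything known about one specimen's clones can be gathered by walking the graph.

```python
from collections import defaultdict, deque

# Undirected links between identifiers. All identifiers and relations below
# are invented for illustration; none are taken from a real database.
links = defaultdict(set)


def relate(a, b):
    """Record that two identified objects are related."""
    links[a].add(b)
    links[b].add(a)


# Three clonal specimens held by different institutions (assumed relations).
relate("S1:KNM", "S2:MNHN")
relate("S2:MNHN", "S3:MBG")
# Derived data attached to only one specimen each (hypothetical records).
relate("S2:MNHN", "genbank:sequence-1")
relate("S3:MBG", "observation:parasitism-1")


def connected(start):
    """Breadth-first walk returning every identifier reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


print(sorted(connected("S1:KNM")))
```

Starting from the Nairobi specimen, the walk reaches the gene sequence and the parasitism observation even though neither was recorded against that specimen directly. This only works if each object carries a stable identifier.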
    26. BiSciCol Tracker
    27. The Future is all about Data
        • How do we get it?
        • How do we analyze it?
        • How do we disseminate it (maps, charts, tables...)?
        • How do we keep it?
            • Provenance, Storage, Weeding
        • How do we make it sustainable?
    28. Digital/Data Curation Programs
        • University of Illinois
            • Graduate School of Library and Information Science
        • University of Arizona
            • School of Information Resources and Library Science
        • University of North Carolina
            • School of Information and Library Science
    29. Education Needs
        • Biological Information Specialist
        • Concentration in Data Curation (MSLIS)
        • Certificate of Advanced Study in Data Curation for Libraries and Scientists
        • Information and professional education in biodiversity informatics
    30. MSLIS Data Curation Concentration
        • Data Curation Educational Program (DCEP)
            • IMLS – Laura Bush 21st Century Librarian Program, RE-05-06-0036-06 (Heidorn, PI)
        • Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations
    31. Biological Information Specialists
        • At present:
        • Biologists at all degree levels self-trained in information technology
        • Information technologists at all degree levels self-trained in biology
            • (both with gaps in knowledge for many months, years)
        • Differing roles of BIS in large and small
    32. Master of Science in Biological Informatics
        • Degree Program began September 2007
        • Part of campus-wide bioinformatics masters program
        • NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)
        • Combines Biology, Bioinformatics, Computer Science core with LIS courses
    33. What does a BIS need to know?
        • Biological training and interest in solving biological research problems
        • Information skills
        • Evaluation and implementation of information systems: user based assessment and continual quality improvement for the development of tools that work and are used.
        • Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.
        • Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.
    34. UIUC bioinformatics core coursework
        • Cross-disciplinary course distribution requirement
            • Bioinformatics: Computing in Molecular Biology; Algorithms in Bioinformatics; Principles of Systematics
            • Computer Science: Algorithms; Database Systems
            • Biology: Human Genetics; Introductory Biochemistry; Macromolecular Modeling
    35. Sample of existing LIS courses
        • Information Organization and Knowledge Representation
            • LIS 551 Interfaces to Information Systems
            • LIS 590DM Document Modeling
            • LIS 590RO Representing and Organizing Information Resources
            • LIS 590ON Ontologies in Natural Science
        • Information Resources, Uses and Users
            • LIS 503 Use and Users of Information
            • LIS 522 Information Sources in the Sciences
            • LIS 590TR Information Transfer and Collaboration in Science
        • Information Systems
            • LIS 456 Information Storage and Retrieval
            • LIS 509 Building Digital Libraries
            • LIS 566 Architecture of Network Information Systems
            • LIS 590EP Electronic Publishing
        • Disciplinary Focus
            • LIS 530B Health Sciences Information Services and Resources
            • LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
            • LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)
    36. University of Arizona Graduate Certificate in Digital Records Management
        • Six Graduate Courses within MLA program
        • Focus on repositories
        • Cross over with Knowledge Representation and Metadata
    37. Workforce
        • Data Curation Workforce Summit
            • Dec 6th at IDCC Chicago
            • Identify the skill sets needed for government data curation
            • Department of Energy, US National Science Foundation, Institute of Museum and Library Services, Oak Ridge National Laboratory, USGS National Biological Information Infrastructure, CIESIN
    38. The Future is Collaboration and Data Sharing
        • Libraries
        • Museums
        • Government
        • Universities
        • NGO
        • Private Land Holders
        • Ranches
        • Farms
        To bring the best data to the major problems and opportunities of our time and the future
    39. Merci
