Dark Data In the Long Tail of Science: Examples in Biology
September 2, 2009, National Institute of Standards and Technology
P. Bryan Heidorn (NSF, University of Illinois, University of Arizona)
Introduction
Cyberinfrastructure Vision
Recognition of need for data curation
Interagency Working Group on Digital Data
New Information Disciplines
Library Skills
Economics of the long tail
Naive View of Science Data / Power Law of Science Data
Figure: Data Volume plotted against Science Projects and Initiatives; labeled examples: GenBank, PDB.
Curve: $f(x) = ax^{k} + o(x^{k})$; marked region: $x < .20$
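The power-law curve on this slide can be explored numerically. The sketch below is only an illustration under stated assumptions: it keeps the leading term a·x^k, treats x as a project's rank, and assumes a negative k so that the curve decays. The function name `head_share` and all parameter values are hypothetical and do not come from the talk.

```python
def head_share(n_projects, k, head_fraction=0.20, a=1.0):
    """Share of total data volume held by the largest projects.

    Assumes volume(rank) = a * rank**k (leading power-law term only),
    with rank 1 the largest project. k < 0 gives a decaying long tail.
    """
    volumes = [a * rank ** k for rank in range(1, n_projects + 1)]
    head_size = max(1, round(n_projects * head_fraction))
    return sum(volumes[:head_size]) / sum(volumes)


# Hypothetical inputs: 1000 projects, two different exponents.
steep = head_share(1000, k=-1.0)   # strongly skewed toward the head
flat = head_share(1000, k=0.0)     # every project the same size
print(f"head share with k=-1.0: {steep:.3f}")
print(f"head share with k= 0.0: {flat:.3f}")
```

With a flat curve the top 20% of projects hold 20% of the volume; the more negative k is, the more the head dominates.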
Does NSF’s Data Follow the Power Law? I do not know, but if $1 = X bytes…
20-80 Rule: The small are big!

                 20%                    80%               Total Grants
Number Grants    1869                   7478              9347
Total Dollars    $1,199,088,125         $938,548,595      $2,137,636,716
Range            $6,892,810-$350,000    $350,000-$831
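The arithmetic behind "the small are big" can be checked directly from the grant figures on this slide. The sketch below uses only those numbers; the dictionary layout and the function name `shares` are illustrative choices, not part of the talk.

```python
# Grant counts and dollar totals as given on the slide.
GRANTS = {
    "largest 20%": {"count": 1869, "dollars": 1_199_088_125},
    "smallest 80%": {"count": 7478, "dollars": 938_548_595},
}


def shares(groups):
    """Return {group: (share of grants, share of dollars)}."""
    total_count = sum(g["count"] for g in groups.values())
    total_dollars = sum(g["dollars"] for g in groups.values())
    return {
        name: (g["count"] / total_count, g["dollars"] / total_dollars)
        for name, g in groups.items()
    }


result = shares(GRANTS)
for name, (grant_share, dollar_share) in result.items():
    print(f"{name}: {grant_share:.1%} of grants, {dollar_share:.1%} of dollars")
```

On these figures the smallest 80% of grants account for roughly 44% of the dollars, so the tail is far from negligible.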
Hubble Space Telescope composite image: "ring" of dark matter in the galaxy cluster Cl 0024+17
Related Ideas
Why is the tail also important
Technical Solutions: Move the tail to the head (increase k)
Solutions
Institutional Solutions
Library director John Hanson told the Associated Press that a couple of dozen people are cited each year for failure to return materials or pay fines. The incident cost Dalibor about $30 for the two overdue paperbacks. It cost her mother $172 to free her.
Book and Bake Sale at the Mary E. Tippitt Memorial Library in Townsend.
Sailing yacht Maltese Falcon, owned by Tom Perkins
Organizational Solutions
Questions about the long-tail
Barriers
My Solutions
Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels
2008 Dublin Core Conference
P. Bryan Heidorn, Qin Wei, University of Illinois at Urbana-Champaign
… <co> Curtis,  </co><hdlc>  North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
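A tagged label like the one on this slide can be turned into (tag, text) pairs with a few lines of code. The sketch below is a minimal illustration, assuming each field is wrapped in a simple, non-nested `<tag>…</tag>` pair as shown; the tag names are copied verbatim from the slide, and the function name `parse_tagged_label` is hypothetical.

```python
import re

# Tagged fragment copied from the slide (leading/trailing ellipses dropped).
TAGGED = (
    "<co> Curtis,  </co><hdlc>  North American Pl </hdlc><cnl> No.</cnl>"
    "<cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa>"
    "<val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc>"
    "<col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>"
)

# One non-nested element: <tag>text</tag>, with the same tag name on both sides.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_tagged_label(text):
    """Return the fields of a tagged label as (tag, normalized text) pairs."""
    return [(tag, " ".join(value.split())) for tag, value in ELEMENT.findall(text)]


pairs = parse_tagged_label(TAGGED)
for tag, value in pairs:
    print(f"{tag:5s} {value}")
```

Keeping the result as an ordered list rather than a dictionary preserves repeated tags (here `co` appears twice) and the field order on the label.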
The problem
Why care about the specimens?
http://www.ncdc.noaa.gov/img/climate/globalwarming/ar4-fig-3-9.gif
Why care
A real-life example:  Baronia brevicornis  and its single food plant,  Acacia cochliacantha (Soberon)
B. brevicornis  Abiotic Niche using BS Garp
Natural History Specimens
Sample records
Sample OCR Output
Label Labels
Label Labels
Example Training Record
Supervised Learning Framework (diagram)
Phases: Training Phase, Application Phase
Components: Gold Classified Labels, Machine Learner, Trained Model, Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels, Human Editing
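The training and application phases of such a framework can be sketched with a very small classifier. The code below is an illustration only, not the system described in the talk: it assumes "NB" means Naive Bayes, classifies one token at a time, and uses three hand-picked features. The class name, the feature set, and the tiny "gold" sample (tokens taken from the tagged label shown earlier in the deck) are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def features(token):
    """Three simple, assumed features of a label token."""
    return [
        "lower=" + token.lower(),
        "cap=" + str(token[:1].isupper()),
        "digit=" + str(any(c.isdigit() for c in token)),
    ]


class NaiveBayesTagger:
    """Multinomial Naive Bayes with add-one smoothing over token features."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # Training phase: count features per class from gold (token, tag) pairs.
        for token, tag in examples:
            self.class_counts[tag] += 1
            for f in features(token):
                self.feature_counts[tag][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, token):
        # Application phase: pick the class with the highest log-probability.
        total = sum(self.class_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, n in self.class_counts.items():
            denom = sum(self.feature_counts[tag].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in features(token):
                score += math.log((self.feature_counts[tag][f] + 1) / denom)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


GOLD = [
    ("Polygala", "gn"), ("ambigua,", "sp"), ("Nutt.,", "sa"), ("var.", "val"),
    ("503*", "cn"), ("No.", "cnl"), ("February", "dt"), ("Legit", "col"),
]
tagger = NaiveBayesTagger().fit(GOLD)
print(tagger.predict("Polygala"), tagger.predict("504"))
```

In the framework on the slide, output like this would be the "silver" classification that a human editor then corrects.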
Herbis Experimental Data
Performances of NB and HMM
Element Identifiers
Improved Performance With Field Element Identifiers
 
Learning w/ pre categorization (diagram)
Training: Gold Labels, Categorization, Class 1 Labels / Class 2 Labels / … / Class n Labels, one Machine Learner per class, Model 1 / Model 2 / … / Model n
Application: Unclassified Labels, Categorization, Class 1 Labels / Class 2 Labels / … / Class n Labels, Machine Classification per class, Classified Labels
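The application side of pre-categorization amounts to routing each label to the model trained for its class. The sketch below shows only that routing step, under loud assumptions: the categorization rule (matching a collector name), the category names, and the stand-in models are all hypothetical placeholders rather than the method used in the talk.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]
Model = Callable[[str], Tagged]


def categorize(label_text: str) -> str:
    """Hypothetical pre-categorization rule: route by collector name."""
    return "curtiss" if "curtiss" in label_text.lower() else "general"


def classify(label_text: str, models: Dict[str, Model],
             fallback: str = "general") -> Tuple[str, Tagged]:
    """Send a label to its class-specific model, or to the fallback model."""
    category = categorize(label_text)
    model = models.get(category, models[fallback])
    return category, model(label_text)


# Stand-in models: each just reports which model handled the label.
MODELS: Dict[str, Model] = {
    "curtiss": lambda text: [("model", "specialist")],
    "general": lambda text: [("model", "general")],
}

print(classify("Legit A. H. Curtiss. February", MODELS))
print(classify("Collected near Bluff, S coast", MODELS))
```

Falling back to a general model keeps labels from unseen categories classifiable, at whatever accuracy the general model offers.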
FIG. 5. Improved Performance of Specialist Model: Specialist 100 Curtiss vs. 100 General
Machine Learning in BioGeomancer’s Locality Specification
SPNHC & NSCA 2006
P. Bryan Heidorn (1), Hong Zhang (1), Eugene Chung (2) and BGWG
(1) Graduate School of Library and Information Science, (2) Linguistics, University of Illinois
BioGeomancer Working Group (BGWG) http://203.202.1.217/bgwebsite/index.html
Participants
Example Locality Types

Record #   Specification of Location                              Locality Type
43         dario 7 mi wnw of; RIO VIEJO                           FOH; F
86         near Aleutian Islands; S of Amukta Pass                NF; FH
100        INDIAN CREEK, 11 MI. W HWY 160                         P; POH
109        TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R    P; FOH; NP
160        WALTMAN, 9 MI N, 2.5 MI W OF                           FOO
181        0.4 mi N Collinston on LA 138                          FPOH
204        Seward Peninsula; vic. Bluff, S coast                  F; NF; FS
 
FRAME
Xiaoya Tang and P. Bryan Heidorn
User query: Long leaves
Description of leaf length in texts: … Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, … Inflorescences: … spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …
Information Extraction From FNA
Original documents: … Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, …
Templates for useful information (user log analysis): Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …
Knowledge bases: PartBlade: Leaf blade, Blades, blade, …
Extraction rules:
  Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' *
  Output:: leaf {leafShape $1}
  Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase>
  Output:: leaf {bladeDimension $1}
Structured information:
  Leaf_Shape: obovate
  Leaf_Shape: orbiculate
  Blade_Dimension: 3--9 x 3--8 cm
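The two extraction rules on this slide can be approximated with ordinary regular expressions. The sketch below is a loose illustration, not the rule engine from the talk: the shape vocabulary, the unit list, and the function name `extract_leaf_facts` are assumptions, and the sample sentence is the one quoted on the slide.

```python
import re

SAMPLE = (
    "Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
    "base obtuse to broadly cuneate, margins flat"
)

# Knowledge-base entries from the slide, plus an assumed shape vocabulary.
PART_BLADE = r"(?:Leaf blade|Blades|blade)"
LEAF_SHAPES = ("obovate", "orbiculate", "ovate", "lanceolate")
SHAPE = r"\b(?:%s)\b" % "|".join(LEAF_SHAPES)

# A numeric range such as 3--9 or 1.5-3.5, and a two-axis dimension with a unit.
RANGE = r"\d+(?:\.\d+)?\s*(?:--|[—–-])\s*\d+(?:\.\d+)?"
DIMENSION = RANGE + r"\s*[×x]\s*" + RANGE + r"\s*(?:cm|mm|m)\b"


def extract_leaf_facts(text):
    """Return (field, value) pairs for leaf shape and blade dimension."""
    facts = []
    # Rule 1: shapes named in the clause right after the blade term.
    clause = re.search(PART_BLADE + r"\s+([^,]*),", text)
    if clause:
        for shape in re.findall(SHAPE, clause.group(1)):
            facts.append(("Leaf_Shape", shape))
    # Rule 2: the first "range x range unit" expression in the description.
    dimension = re.search(DIMENSION, text)
    if dimension:
        facts.append(("Blade_Dimension", dimension.group(0)))
    return facts


facts = extract_leaf_facts(SAMPLE)
for field, value in facts:
    print(field, value)
```

Word boundaries around the shape terms keep "ovate" from matching inside "obovate", which matters once the vocabulary contains overlapping terms.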
Results – System Performance

NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task

Group          NT      NTH     TSR     SSR
SEARFA         6.75    8.078   0.860   0.210
SEARF          4.50    3.598   0.568   0.053
Sig.(ANOVA)    0.005   0.005   0.000   0.011

Values for the remaining measures, in slide order (column headings for these rows are not preserved):
NSST     4.779   9.584   0.000
TST      338.8   435.2   0.72
NDVST    11.16   14.75   0.162
Education Programs
Biological Information Specialists
Master of Science in Biological Informatics
What does a BIS need to know?
UIUC bioinformatics core coursework
Sample of existing LIS courses
MSLIS Data Curation Concentration
New research directions
Example Service
JRS Biodiversity Foundation
JRS Biodiversity Foundation
JRS Biodiversity Foundation
National Science Foundation
Taxonomic Database Working Group

Dark Data In the Long Tail of Science:   Examples in Biology

  • 1. Dark Data In the Long Tail of Science:   Examples in Biology September 2, 2009 National Institute of Standards and Technology P. Bryan Heidorn NSF University of Illinois University of Arizona
  • 9. Naive View of Science Data: GenBank, PDB; $f(x) = ax^{k} + o(x^{k})$. Power Law of Science Data: $f(x) = ax^{k} + o(x^{k}) \mid x < 0.20$. Axes: Data Volume; Science Projects and Initiatives
  • 10. Does NSF’s Data Follow the Power Law? I do not know, but if $1 = X bytes...
  • 11. 20-80 Rule: The small are big!
      Share of grants | Number Grants | Total Dollars | Range
      20% | 1869 | $1,199,088,125 | $6,892,810-$350,000
      80% | 7478 | $938,548,595 | $350,000-$831
      Total Grants | 9347 | $2,137,636,716 |
  • 22. Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels. 2008 Dublin Core Conference. P. Bryan Heidorn, Qin Wei, University of Illinois at Urbana-Champaign. Example of a tagged label: … <co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
  • 27. A real-life example: Baronia brevicornis and its single food plant, Acacia cochliacantha (Soberon)
  • 28. B. brevicornis Abiotic Niche using BS GARP
  • 35. Supervised Learning Framework (diagram with a Training Phase and an Application Phase). Components: Gold Classified Labels, Machine Learner, Trained Model, Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels, Human Editing
  • 39. Improved Performance With Field Element Identifiers
  • 41. Learning with pre-categorization (diagram). Training: Gold Labels pass through Categorization into Class 1, Class 2, ... Class n Labels; a Machine Learner for each class produces Model 1, Model 2, ... Model n. Application: Unclassified Labels pass through Categorization into Class 1, Class 2, ... Class n Labels; each class goes through Machine Classification to yield Classified Labels
  • 42. FIG. 5. Improved Performance of Specialist Model (Specialist100 Curtiss vs. 100 General)
  • 43. Machine Learning in BioGeomancer’s Locality Specification. SPNHC & NSCA 2006. P. Bryan Heidorn (1), Hong Zhang (1), Eugene Chung (2) and BGWG. (1) Graduate School of Library and Information Science, (2) Linguistics, University of Illinois
  • 46. Example Locality Types
      Record # | Specification of Location | Locality Type
      204 | Seward Peninsula; vic. Bluff, S coast | F; NF; FS
      181 | 0.4 mi N Collinston on LA 138 | FPOH
      160 | WALTMAN, 9 MI N, 2.5 MI W OF | FOO
      109 | TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R | P; FOH; NP
      100 | INDIAN CREEK, 11 MI. W HWY 160 | P; POH
      86 | near Aleutian Islands; S of Amukta Pass | NF; FH
      43 | dario 7 mi wnw of; RIO VIEJO | FOH; F
  • 50. Information Extraction From FNA (diagram). Components: Original documents, Templates for useful information, Extraction Rules, Knowledge bases, Structured information, User log analysis
      Original document: Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, ...
      Knowledge base: PartBlade: Leaf blade, Blades, blade, ...
      Extraction rule: Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' * Output:: leaf {leafShape $1}
      Extraction rule: Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase> Output:: leaf {bladeDimension $1}
      Fields: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, ...
      Structured information: Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3--9 × 3--8 cm; ...
  • 51. Results – System Performance
      Group | NT | NTH | TSR | SSR
      SEARFA | 6.75 | 8.078 | 0.860 | 0.210
      SEARF | 4.50 | 3.598 | 0.568 | 0.053
      Sig. (ANOVA) | 0.005 | 0.005 | 0.000 | 0.011
      Further values, in the order listed on the slide: NSST 0.000, 9.584, 4.779; TST 0.72, 435.2, 338.8; NDVST 0.162, 14.75, 11.16
      NT: number of tasks accomplished in total; NTH: number of tasks accomplished per hour; TSR: task success rate; SSR: search success rate; NSST: number of searches to accomplish a task; TST: time spent to accomplish a task; NDVST: number of documents viewed to accomplish a task
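
The arithmetic behind the 20-80 Rule on slide 11 can be checked with a short sketch. The grant counts and dollar totals are copied from that slide; the dictionary layout and the `shares` helper are illustrative assumptions, not part of the talk.

```python
# Figures copied from slide 11: NSF grants split into a 20% group and an 80% group.
large = {"grants": 1869, "dollars": 1_199_088_125}  # the 20% of grants with the largest awards
small = {"grants": 7478, "dollars": 938_548_595}    # the remaining 80% of grants


def shares(head, tail):
    """Return the tail's share of grants and of dollars, plus the combined totals."""
    total_grants = head["grants"] + tail["grants"]
    total_dollars = head["dollars"] + tail["dollars"]
    return {
        "tail_grant_share": tail["grants"] / total_grants,
        "tail_dollar_share": tail["dollars"] / total_dollars,
        "total_grants": total_grants,
        "total_dollars": total_dollars,
    }


result = shares(large, small)
print(f"{result['tail_grant_share']:.1%} of grants account for "
      f"{result['tail_dollar_share']:.1%} of the dollars")
```

With these figures, the 80% of grants in the tail carry roughly 44% of the total dollars, which is the sense in which "the small are big". The two dollar columns sum to within a few dollars of the total stated on the slide.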

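The tagged specimen label on slide 22 can be turned back into field/value pairs with a few lines of code. This is a minimal sketch that assumes the markup is always flat `<tag>text</tag>` with no nesting; the `parse_label` function and the regular expression are illustrative and are not the extraction system described in the talk. Tag names are kept exactly as they appear on the slide, without interpreting them.

```python
import re

# Tagged fragment copied from slide 22.
LABEL = (
    "<co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> "
    "<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> "
    "<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col>"
    "<co> A. H. Curtiss.</co><dt>February</dt>"
)

# Match <tag>text</tag>, requiring the closing tag to repeat the opening tag name.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_label(tagged):
    """Return (tag, text) pairs in document order, with surrounding whitespace trimmed."""
    return [(tag, text.strip()) for tag, text in ELEMENT.findall(tagged)]


fields = parse_label(LABEL)
for tag, text in fields:
    print(f"{tag}: {text}")
```

A list of pairs is used rather than a dictionary because a tag can repeat within one label (`co` appears twice here), and a dictionary would silently keep only the last value.
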
Editor's Notes

  1. Change to new front image
  2. Add jobs from the Interagency Working Group report.
  3. Rework with new librarian image
  4. Insert Lake Victoria overlay
  5. Insert Lake Victoria overlay
  6. Insert Lake Victoria overlay