Plants, Potions and
People
Continuing Adventures in
Scientific Extraction,
Transformation and Loading
MITCH MILLER
SCIENTIFIC THINKING, LLC
VERMONT CODE CAMP 9
SEPTEMBER 16, 2017
Introduction
 Independent consultant in scientific information systems
 Many projects relate to extraction, transformation, loading of data
 Acronyms and buzzwords often cause confusion
 If this presentation contains a word you don’t understand, please ask!
Project starting points
 Given: list of medicinal plants
 Mission:
 Gather valid scientific names
 Gather IDs for each plant from 6 online resources and format URLs
 Prepare data for loading into in-house database
Background
 Target users are regulators
 Evaluate herbal products as commercial medicines
 Users must determine if each product is beneficial
 Decision process is time-constrained
 As brief as one week
 Users need a rich set of data to facilitate informed decisions
 Provide direct URLs to online resources so users don’t have to search
 Data had to be loaded into system by target date in advance of database
migration
Taxonomy
 Definition
 Theory and practice of grouping individuals into species, arranging species
into larger groups, and giving those groups names, thus producing a
classification[2]
 One of several definitions on Wikipedia
 Taxonomists apply a hierarchical set of group assignments to a sample.
 See example
Example: Aloe Vera
 https://plants.usda.gov/java/ClassificationServlet?source=display&classid=ALVE2
Taxonomy (continued)
 Taxonomists do not always agree
 Useful for researchers/regulators to know how to classify a product!
 Key term: binomial name
 Genus + species = binomial name
 Names are in Greek and Latin
 Designations/qualifiers are also in Latin
Input for this project
 Spreadsheet listing plants of interest
 Plants listed as: genus, species and author
 Legacy IDs for 2 sources
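
A minimal sketch of how one input entry might be split into genus, species and author, with the binomial name derived from the first two. It assumes each plant arrives as a single whitespace-separated string; the actual spreadsheet layout is not shown in the slides, and the names PlantName and parse_plant are hypothetical.

```python
from typing import NamedTuple


class PlantName(NamedTuple):
    """One input row: genus, species and (optional) author."""
    genus: str
    species: str
    author: str

    @property
    def binomial(self) -> str:
        # Genus + species = binomial name
        return f"{self.genus} {self.species}"


def parse_plant(entry: str) -> PlantName:
    """Split an entry such as 'Aloe vera (L.) Burm. f.' into its parts.

    Assumption: the first two tokens are genus and species, and
    everything after them is the author citation.
    """
    parts = entry.split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"expected at least genus and species: {entry!r}")
    author = parts[2] if len(parts) > 2 else ""
    # Normalise case: genus capitalised, species lower case.
    return PlantName(parts[0].capitalize(), parts[1].lower(), author)


plant = parse_plant("Aloe vera (L.) Burm. f.")
print(plant.binomial)
```

Keeping the author separate from the binomial name matters later, because the search URLs for the online resources are formed from genus and species only.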
Output
 Spreadsheets containing direct URLs for matching materials that can be
loaded, via macro, into the database
The resources
Royal Botanic Gardens, Kew
 “…global resource for plant and fungal knowledge…”
 Offers MPNS – Medicinal Plant Names Service
 “…a global resource for medicinal plant names that enables health
professionals and researchers to access information about plants and plant
products relevant to pharmacological research, health regulation, traditional
medicine and functional foods. ”
 Cross-reference for names of plants
 http://mpns.kew.org/mpns-portal/
 Entries have an ID and a database qualifier
 Both are used to locate records
ITIS
 Integrated Taxonomic Information System (ITIS)
 taxonomic information on plants, animals, fungi, and microbes
 partnership of U.S., Canadian, and Mexican agencies
 https://www.itis.gov/about_itis.html
 USGS, NOAA, EPA, ARS
 Data available as PostgreSQL dump
 Data transferred to local database and then merged into output
GRIN
 Germplasm Resources Information Network
 acquire, characterize, preserve, document, and distribute to scientists,
germplasm of all lifeforms important for food and agricultural production.
 ID was provided as input
 https://www.ars-grin.gov/
PLANTS
 The PLANTS Database provides standardized information about the
vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and
its territories
 …collaborative effort of the USDA NRCS National Plant Data Team (NPDT),
the USDA NRCS Information Technology Center (ITC), The USDA National
Information Technology Center (NITC), and many other partners.
 https://plants.usda.gov/
 Data available as .txt file
NCBI Taxonomy
 National Center for Biotechnology Information
 A division of the National Library of Medicine (NLM) at the National
Institutes of Health (NIH)
 Mission: ‘…storing and analyzing knowledge about molecular biology,
biochemistry, and genetics’
 The Taxonomy Database is a curated classification and nomenclature for
all of the organisms in the public sequence databases.
 Searchable online
Wikipedia
 Online encyclopedia
 10+ languages
 Editable by anyone
 https://www.wikipedia.org/
Tools used
 KNIME
 MS Excel
Development tool: KNIME
 Graphic, component-based programming environment
 Drag functional components from palette onto canvas to create program
 Configure most components by setting parameters
 Connect components to route data from one to another
 Run and observe data traveling down the lines
 KNIME stands for KoNstanz Information MinEr
 Pronounced “Nighm”
 Originally developed at the University of Konstanz, Germany (2004)
 Currently produced by KNIME.com AG, a company in Zurich, Switzerland
 Free version available for download
 Windows, Linux, Mac
KNIME (continued)
 Many third party components provide domain-specific functionality
 E.g., analysis/manipulation of chemical structures
 Online forums provide support
 Multiple updates per year
Power of KNIME for ETL
 Each operation handled by a specific graphical component
 E.g., read an Excel file; query a database, apply a transformation…
 Each operation can be run individually
 Data is cached indefinitely in a local file
 You can return to your workflow after a day or two and change handling steps
 Very easy to explore alternatives by creating branches and comparing the
results
 Components can write to databases, files, etc.
How each source was handled
Kew
 Inconvenient truth: IDs change from version to version
 First step: take ID from input spreadsheet, form a URL, check for data
 If no data, create a search URL using genus and species
 Submit the search URL (POST)
 Parse search results for individual record IDs
 When a valid species record is found, parse ID and names
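
The fallback logic above can be sketched in Python. The project itself was built in KNIME, so this is only an illustration of the control flow. The Kew URL formats are not given in the slides, so the two network steps are passed in as functions: record_exists and search_ids are hypothetical names standing in for "form a URL and check for data" and "submit the search and parse record IDs".

```python
from typing import Callable, List, Optional


def resolve_kew_id(
    legacy_id: Optional[str],
    genus: str,
    species: str,
    record_exists: Callable[[str], bool],
    search_ids: Callable[[str, str], List[str]],
) -> Optional[str]:
    """Return a currently valid record ID, or None if nothing matches.

    record_exists(id) and search_ids(genus, species) are assumed
    helpers; their real implementations would issue HTTP requests.
    """
    # First step: the ID from the input spreadsheet may still be valid.
    if legacy_id and record_exists(legacy_id):
        return legacy_id
    # IDs change between versions, so fall back to a name search
    # and test each candidate record in turn.
    for candidate in search_ids(genus, species):
        if record_exists(candidate):
            return candidate
    return None


# Usage with stand-in functions and made-up IDs (no network access).
current_ids = {"NEW-2"}
found = resolve_kew_id(
    "OLD-1",
    "Aloe",
    "vera",
    record_exists=lambda rid: rid in current_ids,
    search_ids=lambda g, s: ["NEW-1", "NEW-2"],
)
print(found)
```

Passing the network steps in as functions keeps the fallback order testable without contacting the service.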
GRIN
 Valid ID was present in input
 Nothing to do for GRIN!
ITIS
 Complete dataset was available for download as PostgreSQL dump file
 Dump file was loaded into a database on local machine
 Data includes TSN (taxonomic serial number) and binomial name
 Example:
https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=58#null
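
A short sketch of the merge step, assuming the local database has already been queried into a mapping from binomial name to TSN. The URL pattern follows the example above (without the trailing #null fragment); the names itis_url, merge_itis and tsn_by_name are hypothetical, and the sample name and TSN pairing are placeholders, not real ITIS data.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlencode

ITIS_REPORT = "https://www.itis.gov/servlet/SingleRpt/SingleRpt"


def itis_url(tsn: int) -> str:
    """Form the direct report URL for one taxonomic serial number."""
    query = urlencode({"search_topic": "TSN", "search_value": tsn})
    return f"{ITIS_REPORT}?{query}"


def merge_itis(
    binomials: Iterable[str], tsn_by_name: Dict[str, int]
) -> Dict[str, Optional[str]]:
    """Attach an ITIS URL to each plant name; None when no TSN is known."""
    merged: Dict[str, Optional[str]] = {}
    for name in binomials:
        tsn = tsn_by_name.get(name)
        merged[name] = itis_url(tsn) if tsn is not None else None
    return merged


# Placeholder data standing in for rows read from the local database.
tsn_by_name = {"Examplea plantus": 58}
print(merge_itis(["Examplea plantus", "Unknownia missing"], tsn_by_name))
```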
NCBI Taxonomy
 Data was queried via URL
 Example: https://www.ncbi.nlm.nih.gov/taxonomy/?term=Blepharis+edulis
 Retrieve search results
 Parse ID out of HTML
 Form direct URL using ID
 https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=328124
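
The three URL and parsing steps can be sketched as follows. The two URL patterns come from the examples above. How the ID appears in the search-result HTML is not shown in the slides, so the regular expression is an assumption: it looks for a link of the form wwwtax.cgi?id=NNN. Retrieval of the page is left out, and the function names are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlencode

SEARCH_BASE = "https://www.ncbi.nlm.nih.gov/taxonomy/"
BROWSER_BASE = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi"

# Assumption: the results page links to the taxonomy browser by ID.
ID_PATTERN = re.compile(r"wwwtax\.cgi\?id=(\d+)")


def search_url(genus: str, species: str) -> str:
    """Form the search URL; urlencode turns the space into '+'."""
    return f"{SEARCH_BASE}?{urlencode({'term': f'{genus} {species}'})}"


def extract_tax_id(html: str) -> Optional[str]:
    """Parse the first taxonomy ID out of the result HTML, if any."""
    match = ID_PATTERN.search(html)
    return match.group(1) if match else None


def direct_url(tax_id: str) -> str:
    """Form the direct browser URL from a taxonomy ID."""
    return f"{BROWSER_BASE}?id={tax_id}"


print(search_url("Blepharis", "edulis"))
```

Returning None from extract_tax_id when nothing matches lets the caller leave the URL column empty rather than writing a broken link.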
Wikipedia
 Form a URL using genus + species
 Try to retrieve the page with this URL
 No hit? Try again with just genus
 Add valid URL to output
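
A sketch of the two-attempt lookup, again in Python rather than KNIME. It assumes English Wikipedia article URLs of the form https://en.wikipedia.org/wiki/Genus_species; the slides do not state which language edition was used. The page check is passed in as page_exists, a hypothetical function that would issue the HTTP request in a real run.

```python
from typing import Callable, Optional

# Assumption: English-language edition.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_url(
    genus: str, species: str, page_exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the first URL that resolves: genus + species, then genus."""
    candidates = [
        f"{WIKI_BASE}{genus}_{species}",  # first try the full binomial name
        f"{WIKI_BASE}{genus}",            # no hit? try again with just genus
    ]
    for url in candidates:
        if page_exists(url):
            return url
    return None


# Usage with a stand-in check (no network access): only the genus page exists.
known_pages = {WIKI_BASE + "Blepharis"}
print(wikipedia_url("Blepharis", "edulis", lambda url: url in known_pages))
```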
Let’s examine the code
Conclusion
 Data were retrieved, sorted and loaded into the legacy database well in
advance of the data migration cutoff
Thank you for listening!
 Mitch Miller
 Mitch.miller@thinkscience.us
 http://thinkscience.us

Presentation from Code Camp 2017

  • 1. Plants, Potions and People: Continuing Adventures in Scientific Extraction, Transformation and Loading. Mitch Miller, Scientific Thinking, LLC. Vermont Code Camp 9, September 16, 2017.
  • 2. Introduction
      - Independent consultant in scientific information systems
      - Many projects relate to extraction, transformation and loading (ETL) of data
      - Acronyms and buzzwords often cause confusion
      - If this presentation contains a word you don’t understand, please ask!
  • 3. Project starting points
      - Given: list of medicinal plants
      - Mission:
          - Gather valid scientific names
          - Gather each plant's IDs for 6 online resources and format URLs
          - Prepare data for loading into in-house database
  • 4. Background
      - Target users are regulators
          - Evaluate herbal products as commercial medicines
      - Users must determine if each product is beneficial
      - Decision process is time-constrained
          - As brief as one week
      - Users need a rich set of data to facilitate informed decisions
      - Provide direct URLs to online resources so users don’t have to search
      - Data had to be loaded into system by target date in advance of database migration
  • 5. Taxonomy
      - Definition
          - Theory and practice of grouping individuals into species, arranging species into larger groups, and giving those groups names, thus producing a classification[2]
          - One of several definitions on Wikipedia
      - Taxonomists apply a hierarchical set of group assignments to a sample.
          - See example
  • 6. Example: Aloe Vera
      - https://plants.usda.gov/java/ClassificationServlet?source=display&classid=ALVE2
  • 7. Taxonomy (continued)
      - Taxonomists do not always agree
      - Useful for researchers/regulators to know how to classify a product!
      - Key term: binomial name
          - Genus + species = binomial name
          - Names are in Greek and Latin
          - Designations/qualifiers are also in Latin
  • 8. Input for this project
      - Spreadsheet listing plants of interest
      - Plants listed as: genus, species and author
      - Legacy IDs for 2 sources
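The input columns map directly onto the binomial name defined on the previous slide. The sketch below is illustrative only: the project itself was built in KNIME, and the class and field names here are assumptions, not part of the original workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantName:
    """One input row: genus, species and (optional) author. Hypothetical type."""
    genus: str
    species: str
    author: str = ""

    @property
    def binomial(self) -> str:
        # By convention the genus is capitalised and the species epithet is lower case.
        return f"{self.genus.strip().capitalize()} {self.species.strip().lower()}"


# Illustrative row; the author string is the usual citation for Aloe vera.
aloe = PlantName("Aloe", "vera", "(L.) Burm.f.")
print(aloe.binomial)
```

Normalising the name once, up front, keeps every later lookup (search URLs, database joins) working from the same string.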
  • 9. Output
      - Spreadsheets containing direct URLs for matching materials that can be loaded, via macro, into the database
  • 11. Royal Botanic Gardens, Kew
      - “…global resource for plant and fungal knowledge…”
      - Offers MPNS – Medicinal Plant Names Service
          - “…a global resource for medicinal plant names that enables health professionals and researchers to access information about plants and plant products relevant to pharmacological research, health regulation, traditional medicine and functional foods.”
      - Cross-reference for names of plants
      - http://mpns.kew.org/mpns-portal/
      - Entries have an ID and a database qualifier
          - Both are used to locate records
  • 12. ITIS
      - Integrated Taxonomic Information System (ITIS)
      - Taxonomic information on plants, animals, fungi, and microbes
      - Partnership of U.S., Canadian, and Mexican agencies
          - https://www.itis.gov/about_itis.html
          - USGS, NOAA, EPA, ARS
      - Data available as PostgreSQL dump
      - Data transferred to local database and then merged into output
  • 13. GRIN
      - Germplasm Resources Information Network
      - Acquire, characterize, preserve, document, and distribute to scientists germplasm of all lifeforms important for food and agricultural production
      - ID was provided as input
      - https://www.ars-grin.gov/
  • 14. PLANTS
      - The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories
      - …collaborative effort of the USDA NRCS National Plant Data Team (NPDT), the USDA NRCS Information Technology Center (ITC), the USDA National Information Technology Center (NITC), and many other partners.
      - https://plants.usda.gov/
      - Data available as .txt file
  • 15. NCBI Taxonomy
      - National Center for Biotechnology Information
          - A division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH)
          - Mission: ‘…storing and analyzing knowledge about molecular biology, biochemistry, and genetics’
      - The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases.
      - Searchable online
  • 16. Wikipedia
      - Online encyclopedia
      - 10+ languages
      - Editable by anyone
      - https://www.wikipedia.org/
  • 18. Development tool: KNIME
      - Graphic, component-based programming environment
          - Drag functional components from palette onto canvas to create program
          - Configure most components by setting parameters
          - Connect components to route data from one to another
          - Run and observe data traveling down the lines
      - KNIME stands for KoNstanz Information MinEr
          - Pronounced “Nighm”
      - Originally a product of the University of Konstanz, Germany, 2004
      - Currently produced by KNIME.com AG, a company in Zurich, Switzerland
      - Free version available for download
          - Windows, Linux, Mac
  • 19. KNIME (continued)
      - Many third party components provide domain-specific functionality
          - E.g., analysis/manipulation of chemical structures
      - Online forums provide support
      - Multiple updates per year
  • 20. Power of KNIME for ETL
      - Each operation handled by a specific graphical component
          - E.g., read an Excel file, query a database, apply a transformation…
      - Each operation can be run individually
      - Data is cached indefinitely in a local file
          - You can return to your workflow after a day or two and change handling steps
      - Very easy to explore alternatives by creating branches and comparing the results
      - Components can write to databases, files, etc.
  • 21. How each source was handled
  • 22. Kew
      - Inconvenient truth: IDs change from version to version
      - First step: take ID from input spreadsheet, form a URL, check for data
      - If no data, create a search URL using genus and species
      - Submit the search URL (POST)
      - Parse search results for individual record IDs
      - When a valid species record is found, parse ID and names
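The Kew steps amount to "try the legacy ID, then fall back to a name search". A minimal sketch of that control flow follows. The original was a KNIME workflow, so everything here is an assumption: the function names, the record shape (a dict with `id`, `rank` and `name` keys) and the two injected callables, which stand in for the real HTTP request and HTML parsing. The IDs in the demo are made up.

```python
from typing import Callable, Iterable, Optional

Record = dict  # hypothetical shape: {"id": ..., "rank": ..., "name": ...}


def resolve_record(
    legacy_id: Optional[str],
    genus: str,
    species: str,
    fetch_record: Callable[[str], Optional[Record]],
    search_ids: Callable[[str, str], Iterable[str]],
) -> Optional[Record]:
    """Return a species record, preferring the legacy ID when it still resolves."""
    # Step 1: the ID from the input spreadsheet may be stale, so check it first.
    if legacy_id:
        record = fetch_record(legacy_id)
        if record is not None:
            return record
    # Step 2: fall back to a genus + species search and inspect each hit.
    for candidate_id in search_ids(genus, species):
        record = fetch_record(candidate_id)
        if record is not None and record.get("rank") == "species":
            return record
    return None


# Stub data standing in for the live service (all IDs are invented).
_records = {
    "new-3": {"id": "new-3", "rank": "genus", "name": "Aloe"},
    "new-7": {"id": "new-7", "rank": "species", "name": "Aloe vera"},
}


def fake_fetch(record_id: str) -> Optional[Record]:
    return _records.get(record_id)


def fake_search(genus: str, species: str) -> Iterable[str]:
    return ["new-3", "new-7"]


found = resolve_record("old-1", "Aloe", "vera", fake_fetch, fake_search)
print(found)
```

Passing the fetch and search steps in as functions keeps the fallback logic testable without touching the network.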
  • 23. GRIN
      - Valid ID was present in input
      - Nothing to do for GRIN!
  • 24. ITIS
      - Complete dataset was available for download as PostgreSQL dump file
      - Dump file was loaded into a database on local machine
      - Data includes TSN (taxonomic serial number) and binomial name
      - Example: https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=58#null
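The ITIS step is a local lookup from binomial name to TSN, followed by URL formatting. The sketch below assumes a simplified table, `itis_names(tsn, binomial)`, which is hypothetical and not the real ITIS schema, and uses an in-memory SQLite database as a stand-in for the local PostgreSQL instance. The sample row is a made-up placeholder; only the URL pattern comes from the slide.

```python
import sqlite3
from typing import Optional
from urllib.parse import urlencode

ITIS_REPORT = "https://www.itis.gov/servlet/SingleRpt/SingleRpt"


def itis_url(tsn: int) -> str:
    """Format the single-report URL for a taxonomic serial number."""
    query = urlencode({"search_topic": "TSN", "search_value": tsn})
    return f"{ITIS_REPORT}?{query}"


def lookup_tsn(conn: sqlite3.Connection, binomial: str) -> Optional[int]:
    """Find the TSN for a binomial name in the (hypothetical) local table."""
    row = conn.execute(
        "SELECT tsn FROM itis_names WHERE binomial = ?", (binomial,)
    ).fetchone()
    return row[0] if row else None


# In-memory stand-in for the locally loaded dump; the row is NOT real ITIS data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE itis_names (tsn INTEGER PRIMARY KEY, binomial TEXT)")
conn.execute("INSERT INTO itis_names VALUES (?, ?)", (999999, "Examplea fictiva"))

tsn = lookup_tsn(conn, "Examplea fictiva")
if tsn is not None:
    print(itis_url(tsn))
```

Against the real dump, the same lookup would be a query on whichever table and columns hold the TSN and name.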
  • 25. NCBI Taxonomy
      - Data was queried via URL
          - Example: https://www.ncbi.nlm.nih.gov/taxonomy/?term=Blepharis+edulis
      - Retrieve search results
      - Parse ID out of HTML
      - Form direct URL using ID
          - https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=328124
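The NCBI steps can be sketched in three small functions. Both URL patterns come from the slide. The parsing rule is an assumption: it supposes the results page contains a link of the form `wwwtax.cgi?id=<number>`, which may not match the live markup. The HTML snippet is hand-written for illustration, and no network request is made.

```python
import re
from typing import Optional
from urllib.parse import urlencode

SEARCH_BASE = "https://www.ncbi.nlm.nih.gov/taxonomy/"
BROWSER_BASE = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi"

# Assumed pattern: a taxonomy browser link carrying an "id" query parameter.
ID_PATTERN = re.compile(r"wwwtax\.cgi\?[^\"'>\s]*?\bid=(\d+)")


def search_url(genus: str, species: str) -> str:
    """Build the search URL; urlencode turns the space into '+'."""
    return f"{SEARCH_BASE}?{urlencode({'term': f'{genus} {species}'})}"


def parse_tax_id(html: str) -> Optional[int]:
    """Pull the first taxonomy ID out of a results page, if any."""
    match = ID_PATTERN.search(html)
    return int(match.group(1)) if match else None


def direct_url(tax_id: int) -> str:
    """Form the direct record URL from the parsed ID."""
    return f"{BROWSER_BASE}?id={tax_id}"


# Hand-written stand-in for a results page (uses the ID shown on the slide).
sample_html = '<a href="/Taxonomy/Browser/wwwtax.cgi?id=328124">Blepharis edulis</a>'

print(search_url("Blepharis", "edulis"))
tax_id = parse_tax_id(sample_html)
if tax_id is not None:
    print(direct_url(tax_id))
```

A regular expression is brittle against markup changes; an HTML parser or NCBI's programmatic interfaces would be sturdier choices for a long-lived pipeline.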
  • 26. Wikipedia
      - Form a URL using genus + species
      - Try to retrieve the page with this URL
      - No hit? Try again with just genus
      - Add valid URL to output
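The Wikipedia fallback (species page first, then genus page) can be sketched as below. The English-language base URL and the underscore-joined title are assumptions about how the page URL was formed. The `page_exists` callable is a hypothetical stand-in for the real HTTP check, so the example runs offline.

```python
from typing import Callable, List, Optional
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # assumed language edition


def candidate_urls(genus: str, species: str) -> List[str]:
    """Most specific page first: 'Genus_species', then just 'Genus'."""
    return [
        WIKI_BASE + quote(f"{genus}_{species}"),
        WIKI_BASE + quote(genus),
    ]


def first_valid_url(
    genus: str,
    species: str,
    page_exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate URL that resolves, or None if neither does."""
    for url in candidate_urls(genus, species):
        if page_exists(url):
            return url
    return None


# Offline stub: pretend only the genus page exists.
known_pages = {WIKI_BASE + "Blepharis"}
result = first_valid_url("Blepharis", "edulis", lambda url: url in known_pages)
print(result)
```

In a live run, `page_exists` would issue an HTTP request and treat a successful response as a hit.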
  • 28. Conclusion
      - Data were retrieved, sorted and loaded into the legacy database well in advance of the data migration cutoff
  • 29. Thank you for listening!
      - Mitch Miller
      - Mitch.miller@thinkscience.us
      - http://thinkscience.us