Presentation from Code Camp 2017
1. Plants, Potions and People
Continuing Adventures in Scientific Extraction, Transformation and Loading
MITCH MILLER
SCIENTIFIC THINKING, LLC
VERMONT CODE CAMP 9
SEPTEMBER 16, 2017
2. Introduction
Independent consultant in scientific information systems
Many projects involve extraction, transformation, and loading (ETL) of data
Acronyms and buzzwords often cause confusion
If this presentation contains a word you don’t understand, please ask!
3. Project starting points
Given: list of medicinal plants
Mission:
Gather valid scientific names
Gather IDs for each plant from 6 online resources and format URLs
Prepare data for loading into in-house database
4. Background
Target users are regulators
Evaluate herbal products as commercial medicines
Users must determine if each product is beneficial
Decision process is time-constrained
As brief as one week
Users need a rich set of data to facilitate informed decisions
Provide direct URLs to online resources so users don’t have to search
Data had to be loaded into the system by a target date, ahead of a database migration
5. Taxonomy
Definition
Theory and practice of grouping individuals into species, arranging species into larger groups, and giving those groups names, thus producing a classification[2]
One of several definitions on Wikipedia
Taxonomists apply a hierarchical set of group assignments to a sample.
See example
6. Example: Aloe Vera
https://plants.usda.gov/java/ClassificationServlet?source=display&classid=ALVE2
7. Taxonomy (continued)
Taxonomists do not always agree
Useful for researchers/regulators to know how to classify a product!
Key term: binomial name
Genus + species = binomial name
Names are in Greek and Latin
Designations/qualifiers are also in Latin
8. Input for this project
Spreadsheet listing plants of interest
Plants listed as: genus, species and author
Legacy IDs for 2 sources
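A minimal sketch of how one spreadsheet row might be turned into a record with a binomial name. The column order (genus, species, author) follows the slide, but the `PlantRecord` class and the `parse_row` helper are hypothetical names used only for illustration, and the case normalization is an assumption rather than something the project is known to have done.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PlantRecord:
    """One input row: genus, species and author (hypothetical structure)."""
    genus: str
    species: str
    author: str

    @property
    def binomial(self) -> str:
        # Genus + species = binomial name. By convention the genus is
        # capitalized and the species epithet is lower case.
        return f"{self.genus.capitalize()} {self.species.lower()}"


def parse_row(row: Sequence[str]) -> PlantRecord:
    # Assumes the first three cells are genus, species, author, in that order.
    genus, species, author = (cell.strip() for cell in row[:3])
    return PlantRecord(genus, species, author)


record = parse_row([" aloe", "Vera ", "(L.) Burm. f."])
print(record.binomial)
```

Keeping the author separate from the binomial name makes it easier to match against sources that index on genus and species only.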
11. Royal Botanic Gardens, Kew
“…global resource for plant and fungal knowledge…”
Offers MPNS – Medicinal Plant Names Service
“…a global resource for medicinal plant names that enables health professionals and researchers to access information about plants and plant products relevant to pharmacological research, health regulation, traditional medicine and functional foods.”
Cross-reference for names of plants
http://mpns.kew.org/mpns-portal/
Entries have an ID and a database qualifier
Both are used to locate records
12. ITIS
Integrated Taxonomic Information System (ITIS)
taxonomic information on plants, animals, fungi, and microbes
partnership of U.S., Canadian, and Mexican agencies
https://www.itis.gov/about_itis.html
USGS, NOAA, EPA, ARS
Data available as PostgreSQL dump
Data transferred to local database and then merged into output
13. GRIN
Germplasm Resources Information Network
acquire, characterize, preserve, document, and distribute to scientists, germplasm of all lifeforms important for food and agricultural production.
ID was provided as input
https://www.ars-grin.gov/
14. PLANTS
The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories
…collaborative effort of the USDA NRCS National Plant Data Team (NPDT), the USDA NRCS Information Technology Center (ITC), The USDA National Information Technology Center (NITC), and many other partners.
https://plants.usda.gov/
Data available as .txt file
15. NCBI Taxonomy
National Center for Biotechnology Information
A division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH)
Mission: ‘…storing and analyzing knowledge about molecular biology, biochemistry, and genetics’
The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases.
Searchable online
18. Development tool: KNIME
Graphic, component-based programming environment
Drag functional components from palette onto canvas to create program
Configure most components by setting parameters
Connect components to route data from one to another
Run and observe data traveling down the lines
KNIME stands for KoNstanz Information MinEr
Pronounced “Nighm”
Originally a product of the University of Konstanz, Germany (2004)
Currently produced by KNIME.com AG, a company in Zurich, Switzerland
Free version available for download
Windows, Linux, Mac
19. KNIME (continued)
Many third party components provide domain-specific functionality
E.g., analysis/manipulation of chemical structures
Online forums provide support
Multiple updates per year
20. Power of KNIME for ETL
Each operation handled by a specific graphical component
E.g., read an Excel file; query a database, apply a transformation…
Each operation can be run individually
Data is cached indefinitely in a local file
You can return to your workflow after a day or two and change handling steps
Very easy to explore alternatives by creating branches and comparing the results
Components can write to databases, files, etc.
22. Kew
Inconvenient truth: IDs change from version to version
First step: take ID from input spreadsheet, form a URL, check for data
If no data, create a search URL using genus and species
Submit the search URL (POST)
Parse search results for individual record IDs
When a valid species record is found, parse ID and names
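The lookup order above can be sketched as a small function. This is an illustration only: the original work was built in KNIME, and the two callables `record_exists` and `search_ids` are hypothetical stand-ins for the HTTP request and HTML parsing steps, whose actual URLs and page structure are not given in the slides.

```python
from typing import Callable, List, Optional


def resolve_kew_id(
    legacy_id: Optional[str],
    genus: str,
    species: str,
    record_exists: Callable[[str], bool],
    search_ids: Callable[[str, str], List[str]],
) -> Optional[str]:
    """Return a currently valid record ID, or None if nothing is found.

    record_exists(id) -> True if a URL formed from the ID returns data.
    search_ids(genus, species) -> candidate IDs parsed from search results.
    Both are supplied by the caller (hypothetical helpers).
    """
    # First step: try the ID from the input spreadsheet.
    if legacy_id and record_exists(legacy_id):
        return legacy_id
    # IDs change between versions, so fall back to a name search.
    for candidate in search_ids(genus, species):
        if record_exists(candidate):
            return candidate
    return None


# Usage with in-memory stubs instead of real network calls.
current_ids = {"new-42"}
found = resolve_kew_id(
    "old-1", "Aloe", "vera",
    record_exists=lambda rid: rid in current_ids,
    search_ids=lambda g, s: ["old-7", "new-42"],
)
print(found)
```

Passing the network steps in as functions keeps the fallback logic testable without contacting the server.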
23. GRIN
Valid ID was present in input
Nothing to do for GRIN!
24. ITIS
Complete dataset was available for download as PostgreSQL dump file
Dump file was loaded into a database on local machine
Data includes TSN (taxonomic serial number) and binomial name
Example:
https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=58#null
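A minimal sketch of forming ITIS report URLs from TSNs and merging them into the plant list by binomial name. The URL template is copied from the example above. The in-memory dictionary `tsn_by_name` is an assumption standing in for a query against the locally loaded database, and the names in the usage example are placeholders, not real ITIS data.

```python
from typing import Dict, Iterable, List, Optional

# Template taken from the example URL on the slide.
ITIS_REPORT_URL = (
    "https://www.itis.gov/servlet/SingleRpt/SingleRpt"
    "?search_topic=TSN&search_value={tsn}#null"
)


def itis_url(tsn: int) -> str:
    """Format a direct report URL for a taxonomic serial number."""
    return ITIS_REPORT_URL.format(tsn=int(tsn))


def attach_itis_urls(
    binomials: Iterable[str], tsn_by_name: Dict[str, int]
) -> List[dict]:
    """Merge TSNs into the plant list; unmatched plants get None."""
    merged = []
    for name in binomials:
        tsn: Optional[int] = tsn_by_name.get(name)
        merged.append({
            "binomial": name,
            "itis_tsn": tsn,
            "itis_url": itis_url(tsn) if tsn is not None else None,
        })
    return merged


# Placeholder names; the name-to-TSN pairing here is invented for the demo.
rows = attach_itis_urls(["Genus alpha", "Genus beta"], {"Genus alpha": 58})
for row in rows:
    print(row)
```

Plants with no match keep an empty URL rather than being dropped, so gaps stay visible in the output.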
25. NCBI Taxonomy
Data was queried via URL
Example: https://www.ncbi.nlm.nih.gov/taxonomy/?term=Blepharis+edulis
Retrieve search results
Parse ID out of HTML
Form direct URL using ID
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=328124
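The three NCBI steps (form a search URL, parse the ID, form a direct URL) might look like the sketch below. The two URL templates come from the examples above. The regular expression is an assumption: it presumes the search results page contains a link of the form `wwwtax.cgi?id=<digits>`, which the slides do not confirm, and the HTML fragment in the usage lines is invented for the demo.

```python
import re
from typing import Optional
from urllib.parse import quote_plus

SEARCH_URL = "https://www.ncbi.nlm.nih.gov/taxonomy/?term={term}"
DIRECT_URL = "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={tax_id}"

# Assumed link pattern in the results page (not confirmed by the source).
_ID_PATTERN = re.compile(r"wwwtax\.cgi\?id=(\d+)")


def search_url(genus: str, species: str) -> str:
    # quote_plus encodes the space between genus and species as "+".
    return SEARCH_URL.format(term=quote_plus(f"{genus} {species}"))


def parse_tax_id(html: str) -> Optional[str]:
    """Return the first taxonomy ID found in the page, or None."""
    match = _ID_PATTERN.search(html)
    return match.group(1) if match else None


def direct_url(tax_id: str) -> str:
    return DIRECT_URL.format(tax_id=tax_id)


print(search_url("Blepharis", "edulis"))
sample_html = '<a href="/Taxonomy/Browser/wwwtax.cgi?id=328124">record</a>'
tax_id = parse_tax_id(sample_html)
if tax_id is not None:
    print(direct_url(tax_id))
```

Returning `None` when no ID is found lets the caller flag the plant for manual review instead of emitting a broken URL.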
26. Wikipedia
Form a URL using genus + species
Try to retrieve the page with this URL
No hit? Try again with just genus
Add valid URL to output
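The Wikipedia fallback can be sketched as follows. The English-language base URL and the underscore between genus and species are assumptions about which Wikipedia was used and how the URL was formed. `page_exists` is a hypothetical caller-supplied check (for example, an HTTP request that returns True on success), kept separate so the logic runs without network access.

```python
from typing import Callable, List, Optional

# Assumption: English Wikipedia; article titles use "_" for spaces.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def candidate_urls(genus: str, species: str) -> List[str]:
    """Most specific URL first, then the genus-only fallback."""
    return [f"{WIKI_BASE}{genus}_{species}", f"{WIKI_BASE}{genus}"]


def find_wikipedia_url(
    genus: str, species: str, page_exists: Callable[[str], bool]
) -> Optional[str]:
    for url in candidate_urls(genus, species):
        if page_exists(url):
            return url
    # Neither the species page nor the genus page was found.
    return None


# Stub: pretend only the genus page exists.
known_pages = {WIKI_BASE + "Blepharis"}
print(find_wikipedia_url("Blepharis", "edulis", lambda u: u in known_pages))
```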