
Presentation from Code Camp 2017


Collecting a set of taxonomic data on plants with medicinal uses

Published in: Technology

  1. Plants, Potions and People: Continuing Adventures in Scientific Extraction, Transformation and Loading
     Mitch Miller, Scientific Thinking, LLC
     Vermont Code Camp 9, September 16, 2017
  2. Introduction
     • Independent consultant in scientific information systems
     • Many projects relate to extraction, transformation, and loading (ETL) of data
     • Acronyms and buzzwords often cause confusion
     • If this presentation contains a word you don’t understand, please ask!
  3. Project starting points
     • Given: list of medicinal plants
     • Mission:
       • Gather valid scientific names
       • Gather IDs for each plant from 6 online resources and format URLs
       • Prepare data for loading into in-house database
  4. Background
     • Target users are regulators
     • Evaluate herbal products as commercial medicines
     • Users must determine if each product is beneficial
     • Decision process is time-constrained
       • As brief as one week
     • Users need a rich set of data to facilitate informed decisions
     • Provide direct URLs to online resources so users don’t have to search
     • Data had to be loaded into system by target date in advance of database migration
  5. Taxonomy
     • Definition
       • Theory and practice of grouping individuals into species, arranging species into larger groups, and giving those groups names, thus producing a classification[2]
       • One of several definitions on Wikipedia
     • Taxonomists apply a hierarchical set of group assignments to a sample
     • See example
  6. Example: Aloe vera
     • ALVE2
  7. Taxonomy (continued)
     • Taxonomists do not always agree
     • Useful for researchers/regulators to know how to classify a product!
     • Key term: binomial name
       • Genus + species = binomial name
     • Names are in Greek and Latin
     • Designations/qualifiers are also in Latin
  8. Input for this project
     • Spreadsheet listing plants of interest
     • Plants listed as: genus, species and author
     • Legacy IDs for 2 sources
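The input layout described above (genus, species, author) and the rule "genus + species = binomial name" can be sketched as follows. This is a minimal illustration, not the talk's KNIME workflow: the `PlantRow` class, the `parse_row` helper, and the column order are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PlantRow:
    """One spreadsheet row; the field names are hypothetical."""
    genus: str
    species: str
    author: str

    @property
    def binomial(self) -> str:
        # By convention the genus is capitalized and the species is lower case.
        return f"{self.genus.capitalize()} {self.species.lower()}"


def parse_row(cells: Sequence[str]) -> PlantRow:
    """Build a PlantRow from the first three cells, trimming stray whitespace."""
    genus, species, author = (cell.strip() for cell in cells[:3])
    return PlantRow(genus, species, author)


if __name__ == "__main__":
    row = parse_row([" aloe", "Vera ", "(L.) Burm.f."])
    print(row.binomial)
```

Normalizing case and whitespace before any lookup keeps later joins and URL construction from failing on cosmetic differences in the spreadsheet.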
  9. Output
     • Spreadsheets containing direct URLs for matching materials that can be loaded, via macro, into the database
  10. The resources
  11. Royal Botanic Gardens, Kew
     • “…global resource for plant and fungal knowledge…”
     • Offers MPNS (Medicinal Plant Names Service)
       • “…a global resource for medicinal plant names that enables health professionals and researchers to access information about plants and plant products relevant to pharmacological research, health regulation, traditional medicine and functional foods.”
     • Cross-reference for names of plants
     • Entries have an ID and a database qualifier
       • Both are used to locate records
  12. ITIS
     • Integrated Taxonomic Information System (ITIS)
     • Taxonomic information on plants, animals, fungi, and microbes
     • Partnership of U.S., Canadian, and Mexican agencies
       • USGS, NOAA, EPA, ARS
     • Data available as PostgreSQL dump
     • Data transferred to local database and then merged into output
  13. GRIN
     • Germplasm Resources Information Network
     • Acquire, characterize, preserve, document, and distribute to scientists germplasm of all lifeforms important for food and agricultural production
     • ID was provided as input
  14. PLANTS
     • The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories
     • …collaborative effort of the USDA NRCS National Plant Data Team (NPDT), the USDA NRCS Information Technology Center (ITC), the USDA National Information Technology Center (NITC), and many other partners
     • Data available as .txt file
  15. NCBI Taxonomy
     • National Center for Biotechnology Information
     • A division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH)
     • Mission: ‘…storing and analyzing knowledge about molecular biology, biochemistry, and genetics’
     • The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases
     • Searchable online
  16. Wikipedia
     • Online encyclopedia
     • 10+ languages
     • Editable by anyone
  17. Tools used
     • KNIME
     • MS Excel
  18. Development tool: KNIME
     • Graphic, component-based programming environment
       • Drag functional components from palette onto canvas to create program
       • Configure most components by setting parameters
       • Connect components to route data from one to another
       • Run and observe data traveling down the lines
     • KNIME stands for KoNstanz Information MinEr
       • Pronounced “Nighm”
     • Originally a product of the University of Konstanz, Germany (2004)
     • Currently produced by KNIME AG, a company in Zurich, Switzerland
     • Free version available for download
       • Windows, Linux, Mac
  19. KNIME (continued)
     • Many third-party components provide domain-specific functionality
       • E.g., analysis/manipulation of chemical structures
     • Online forums provide support
     • Multiple updates per year
  20. Power of KNIME for ETL
     • Each operation handled by a specific graphical component
       • E.g., read an Excel file, query a database, apply a transformation…
     • Each operation can be run individually
     • Data is cached indefinitely in a local file
       • You can return to your workflow after a day or two and change handling steps
     • Very easy to explore alternatives by creating branches and comparing the results
     • Components can write to databases, files, etc.
  21. How each source was handled
  22. Kew
     • Inconvenient truth: IDs change from version to version
     • First step: take ID from input spreadsheet, form a URL, check for data
     • If no data, create a search URL using genus and species
     • Submit the search URL (POST)
     • Parse search results for individual record IDs
     • When a valid species record is found, parse ID and names
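The Kew steps above amount to "try the legacy ID, then fall back to a search". A minimal sketch of that control flow is shown here, written in Python rather than KNIME. Everything concrete in it is an assumption: the `example.org` URL templates, the `query` form field, and the link pattern are placeholders, since the slides do not give the real MPNS addresses or page markup. The HTTP calls are passed in as `get` and `post` functions so the logic can be exercised without network access.

```python
import re
from typing import Callable, Mapping, Optional

# Hypothetical URL templates; NOT the real MPNS addresses.
RECORD_URL = "https://example.org/mpns/record/{record_id}"
SEARCH_URL = "https://example.org/mpns/search"

# Hypothetical pattern for record links inside a search-results page.
RECORD_LINK = re.compile(r'href="/mpns/record/([A-Za-z0-9-]+)"')

Getter = Callable[[str], Optional[str]]
Poster = Callable[[str, Mapping[str, str]], Optional[str]]


def resolve_kew_id(legacy_id: Optional[str], genus: str, species: str,
                   get: Getter, post: Poster) -> Optional[str]:
    """Return a record ID that currently resolves, or None if nothing matches.

    `get` and `post` return the page body, or None when there is no data.
    """
    # Step 1: the legacy ID may still be valid in the current version.
    if legacy_id and get(RECORD_URL.format(record_id=legacy_id)):
        return legacy_id

    # Step 2: fall back to a name search submitted via POST.
    html = post(SEARCH_URL, {"query": f"{genus} {species}"}) or ""

    # Step 3: take the first candidate ID whose record page has data.
    for candidate in RECORD_LINK.findall(html):
        if get(RECORD_URL.format(record_id=candidate)):
            return candidate
    return None
```

Passing the fetch functions in as parameters is a design choice for this sketch: it keeps the fallback logic separate from the transport, so a stale ID and a fresh search result can both be simulated with small stub functions.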
  23. GRIN
     • Valid ID was present in input
     • Nothing to do for GRIN!
  24. ITIS
     • Complete dataset was available for download as PostgreSQL dump file
     • Dump file was loaded into a database on local machine
     • Data includes TSN (taxonomic serial number) and binomial name
     • Example: lue=58#null
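Merging the locally loaded ITIS data into the output comes down to a join on the binomial name. The sketch below uses an in-memory SQLite database as a stand-in for the local PostgreSQL instance so it runs without a server. The table names, column names, and TSN values are all made up for illustration and are not taken from the ITIS schema or data.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in database (hypothetical tables and values)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE itis_names (tsn INTEGER PRIMARY KEY, binomial TEXT);
        CREATE TABLE input_plants (genus TEXT, species TEXT);
    """)
    # The TSN values below are invented placeholders, not real serial numbers.
    conn.executemany("INSERT INTO itis_names VALUES (?, ?)",
                     [(900001, "Aloe vera"), (900002, "Mentha spicata")])
    conn.executemany("INSERT INTO input_plants VALUES (?, ?)",
                     [("Aloe", "vera"), ("Panax", "ginseng")])
    return conn


def merge_tsn(conn: sqlite3.Connection) -> List[Tuple[str, str, Optional[int]]]:
    """Attach a TSN to each input plant; unmatched plants get None."""
    return conn.execute("""
        SELECT p.genus, p.species, t.tsn
        FROM input_plants AS p
        LEFT JOIN itis_names AS t
          ON t.binomial = p.genus || ' ' || p.species
        ORDER BY p.genus, p.species
    """).fetchall()


if __name__ == "__main__":
    for genus, species, tsn in merge_tsn(build_demo_db()):
        print(genus, species, tsn)
```

A LEFT JOIN keeps plants with no ITIS match in the result, which makes the gaps visible for manual follow-up instead of silently dropping rows.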
  25. NCBI Taxonomy
     • Data was queried via URL
     • Example:
     • Retrieve search results
     • Parse ID out of HTML
     • Form direct URL using ID
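The "parse ID out of HTML, form direct URL" step can be sketched with a regular expression. The slides do not preserve the actual query URL or the markup of the results page, so both the `id=` link pattern and the `example.org` URL template here are assumptions chosen only to show the shape of the step.

```python
import re
from typing import Optional

# Hypothetical direct-link template; NOT the real NCBI address.
DIRECT_URL = "https://example.org/taxonomy?id={tax_id}"

# Assumes the results page links to records with an "id=<digits>" parameter.
# The ';' alternative covers ampersands that are HTML-escaped as "&amp;".
ID_PATTERN = re.compile(r"[?&;]id=(\d+)")


def extract_tax_id(html: str) -> Optional[str]:
    """Return the first numeric ID found in the search-results HTML."""
    match = ID_PATTERN.search(html)
    return match.group(1) if match else None


def direct_url(html: str) -> Optional[str]:
    """Form a direct record URL from search-results HTML, if an ID is present."""
    tax_id = extract_tax_id(html)
    return DIRECT_URL.format(tax_id=tax_id) if tax_id else None
```

A regular expression is brittle against markup changes; an HTML parser would be sturdier, but the pattern keeps the sketch short and dependency-free.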
  26. Wikipedia
     • Form a URL using genus + species
     • Try to retrieve the page with this URL
     • No hit? Try again with just genus
     • Add valid URL to output
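The Wikipedia lookup is a two-step fallback: species page first, then genus page. A minimal sketch follows, assuming the English-language edition (the slides do not say which edition was used) and a caller-supplied `page_exists` function that performs the actual HTTP check.

```python
from typing import Callable, Optional
from urllib.parse import quote

# Assumption: English Wikipedia, where article titles use underscores for spaces.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(title: str) -> str:
    """Build an article URL from a title such as 'Aloe vera'."""
    return WIKI_BASE + quote(title.strip().replace(" ", "_"))


def find_wikipedia_url(genus: str, species: str,
                       page_exists: Callable[[str], bool]) -> Optional[str]:
    """Try 'Genus species' first, then fall back to 'Genus' alone."""
    for title in (f"{genus} {species}", genus):
        url = wiki_url(title)
        if page_exists(url):
            return url
    return None


if __name__ == "__main__":
    print(wiki_url("Aloe vera"))
```

In practice `page_exists` might issue a request and check for a successful status; it is left as a parameter here so the fallback order can be tested offline.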
  27. Let’s examine the code
  28. Conclusion
     • Data were retrieved, sorted and loaded into the legacy database well in advance of the data migration cutoff
  29. Thank you for listening!
     • Mitch Miller