Text Mining and Environmental Metadata Suggestion

Brief description of standardised sequence metadata and their importance in comparative/integrative analysis. Thorough description of the ENVIRONMENTS tagger. Demonstration of a browser extension able to list, on demand, the Diseases, Tissues, Environments, and Organisms identified in a selected piece of text in a web page (thanks to the contribution of Dr. Lars Juhl Jensen and members of his group).

Published in: Science

  1. 1. Text Mining and Environmental Metadata Suggestion Evangelos Pafilis Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC) Hellenic Centre for Marine Research (HCMR), Heraklio Crete, Greece pafilis@hcmr.gr, http://epafilis.info ENA – 1st Dec 2014 – EBI, UK
  2. 2. Species – Environments
  3. 3. Comparative Analysis • Location • Environment • Time Period ? Coral Reefs (image from http://theresilientearth.com/)
  4. 4. Not Trivial
  5. 5. Slide by Dr. P. Yilmaz, http://www.arb-silva.de/projects/contextual-data/
  6. 6. Essential Context Information Metadata Meta- = Μετά (“after”) => data “after” data => data describing data
  7. 7. a clear definition, that can be interpreted in many, sometimes conflicting, ways
  8. 8. a clear definition, that can be interpreted in many, sometimes conflicting, ways Essential Context Information
  9. 9. Community Standards • Standards (such as MIxS, MIMARKS); see http://gensc.org/gc_wiki/index.php/GSC_Publications for a comprehensive list of publications • Capture genomic/metagenomic and other types of sequence contextual information • Include detailed guidelines on how to annotate a sample (e.g. Yilmaz P et al. (2011) The ISME Journal 5: 1565–1567) http://gensc.org/
  10. 10. P. Yilmaz et al., Nat Biotech 29, 415–420 (2011)
  11. 11. source: http://wiki.gensc.org/index.php?title=MIMARKS
  12. 12. http://www.tomorrowstarted.com/2013/01/how-a-key-works/.html
  13. 13. • Project descriptions • Scientific-content web pages • Full text scientific articles • Literature abstracts • In-house documents
  14. 14. Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific. Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
  15. 15. Looking up terms: Intensive, learning curve
  16. 16. Literature Mining
  17. 17. processing text to extract facts of interest
  18. 18. ENVIRONMENTS
  19. 19. ENVIRONMENTS: ENVO term identification in text terrestrial, aquatic, marine, lagoon, coral reef, sediment, freshwater, soil
  20. 20. ENVIRONMENTS: ENVO term identification in text Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific. Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
  21. 21. ENVIRONMENTS: ENVO term identification in text ID: ENVO:00000150 Name: coral reef Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific. Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
  22. 22. ENVIRONMENTS: ENVO term identification in text ID: ENVO:00000150 Name: coral reef Microbes are key players in both healthy and degraded coral reefs. A combination of metagenomics, microscopy, culturing, and water chemistry were used to characterize microbial communities on four coral atolls in the Northern Line Islands, central Pacific. Source: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440039.3 (“Project Description”)
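
The term identification shown on the slides above can be illustrated with a minimal dictionary lookup. This is only a sketch and not the ENVIRONMENTS tagger's own implementation: the `tag` function, the regular-expression matching, and the two-entry dictionary are assumptions made for illustration; only the coral reef identifier (ENVO:00000150) comes from the slides.

```python
import re

# Hypothetical mini-dictionary mapping names to ENVO identifiers.
# Only the ID for "coral reef" is taken from the slides; the plural
# entry is an illustrative synonym.
DICTIONARY = {
    "coral reef": "ENVO:00000150",
    "coral reefs": "ENVO:00000150",
}


def tag(text, dictionary):
    """Return (start, end, matched text, ENVO ID) tuples; end is inclusive."""
    hits = []
    for name, envo_id in dictionary.items():
        # Whole-word, case-insensitive match of the dictionary name
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end() - 1, m.group(0), envo_id))
    return sorted(hits)


text = "Microbes are key players in both healthy and degraded coral reefs."
for hit in tag(text, DICTIONARY):
    print(hit)
```

A regular expression per name is fine for a handful of terms; a dictionary of thousands of names and synonyms would call for a faster lookup structure.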
  23. 23. ENVIRONMENTS http://environments.hcmr.gr http://environments-eol.blogspot.gr/ ● Dictionary based ● Open source ● Environment Ontology ● Fast performance: 4000 PubMed abstracts / second * ● Based on the SPECIES name recognition tagger (Pafilis et al., PLoS ONE) ● E600 gold standard: ENVO-based corpus of EOL species pages ● Recognition accuracy, mention level: F1: 82.0%; for 87.1% of the TPs the exact ID is among the predicted ones ● Submitted preprint: http://biorxiv.org/content/early/2014/11/13/011403 Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390. *: based on a single-thread run on an Intel 2.27 GHz machine with 24 GB RAM, processing a set of 536,052 abstracts
  24. 24. ENVO: source of environment descriptor names and synonyms http://environmentontology.org ~1600 terms, June 2013 biome, environmental feature, environmental material, environmental condition, … habitat, … Based on slides by Dr. Pier Luigi Buttigieg, AWI, Bremerhaven, Germany
  25. 25. ENVIRONMENTS – Improving Accuracy
      ● Increasing matches in text
        ● Orthographic variation supported, e.g. freshwater, fresh water, and fresh-water
        ● Case-insensitive matching
        ● Synonym generation to reflect the way environment descriptive terms are mentioned in text (both generic and ENVO specific)
            Action                                                              Example
            Add a variant in which non-informative words have been removed      epipelagic zone → epipelagic; estuarine biome → estuarine
            Plural form addition                                                sediment → sediments
            Adjective form addition                                             lagoon → lagoonal
      ● Preventing overmatching (i.e. avoiding increased FP)
        ● “Stopword list” (e.g. spring, well, range)
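
The synonym-generation actions and the stopword list on this slide can be sketched roughly as follows. All function names and rule sets here are hypothetical: the word lists contain only the examples given on the slide, the plural rule is deliberately naive, and the tagger's real rules are not described in the deck.

```python
# Assumption: the only non-informative words known from the slide's examples
NON_INFORMATIVE = {"zone", "biome"}
# Assumption: adjective forms come from a lookup table; the slide gives one example
ADJECTIVE_FORMS = {"lagoon": "lagoonal"}
# Example stopwords listed on the slide
STOPWORDS = {"spring", "well", "range"}


def orthographic_variants(first, second):
    """Spelling variants of a two-part compound, e.g. fresh + water."""
    return {first + second, f"{first} {second}", f"{first}-{second}"}


def without_non_informative(name):
    """Drop a trailing non-informative word: 'epipelagic zone' -> 'epipelagic'."""
    words = name.split()
    if len(words) > 1 and words[-1] in NON_INFORMATIVE:
        return " ".join(words[:-1])
    return name


def plural(name):
    """Naive plural: append 's' unless the name already ends in 's'."""
    return name if name.endswith("s") else name + "s"


def expand(name):
    """Return the name plus the generated synonym variants, lower-cased."""
    name = name.lower()  # supports case-insensitive matching
    forms = {name, without_non_informative(name), plural(name)}
    if name in ADJECTIVE_FORMS:
        forms.add(ADJECTIVE_FORMS[name])
    return forms


def keep_match(matched_text):
    """Reject matches on ambiguous everyday words to limit false positives."""
    return matched_text.lower() not in STOPWORDS


print(sorted(expand("estuarine biome")))
print(sorted(expand("lagoon")))
print(sorted(orthographic_variants("fresh", "water")))
print(keep_match("Spring"))
```

Generating variants once, when the dictionary is built, keeps the matching step itself a plain lookup.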
  26. 26. Scope: ENVO parts. Not included: species, tissues, foods. Limitations – Known Issues: negation not supported; conflicts with anatomy terms (e.g. mouth, blowhole)
  27. 27. ENVIRONMENTS – Sample Output
      File Name                        Start coord  End coord  Match text  ENVO ID
      eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000192
      eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00002297
      eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000043
      eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000000
      eol_documents_ascii_nonHTML.txt  346289845    346289853  mud flats   ENVO:00000012
      eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:01000001
      eol_documents_ascii_nonHTML.txt  346289871    346289873  mud         ENVO:00010483
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000180
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000191
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00002297
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000176
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000000
      eol_documents_ascii_nonHTML.txt  346289905    346289910  mounds      ENVO:00000477
      Tags corresponding to “Habitat” text data object: http://eol.org/data_objects/31415353 of EOL Taxon Phoenicopterus ruber (Greater Flamingo): http://eol.org/pages/913221
  28. 28. ENVIRONMENTS – Sample Output (as on the previous slide), annotated: Traversing all IS_A, PART_OF Relationships in ENVO
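
One reading of the sample output is that each match is reported once for the matched term and once for every term reachable from it through IS_A and PART_OF relationships. The sketch below shows such a traversal on a toy hierarchy. The `TOY:` identifiers and their relationships are made up for illustration and are not real ENVO terms, and the function names are hypothetical.

```python
from collections import deque

# Toy hierarchy with made-up IDs (NOT real ENVO terms):
# child -> list of (relationship, parent)
PARENTS = {
    "TOY:mud_flat": [("is_a", "TOY:flat"), ("part_of", "TOY:coast")],
    "TOY:flat": [("is_a", "TOY:feature")],
    "TOY:coast": [("is_a", "TOY:feature")],
    "TOY:feature": [],
}


def with_ancestors(term, parents):
    """Return the term plus every term reachable via is_a / part_of edges."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def output_rows(file_name, start, end, match_text, term, parents):
    """Build one output row per term: the matched term and all its ancestors."""
    return [
        (file_name, start, end, match_text, t)
        for t in sorted(with_ancestors(term, parents))
    ]


for row in output_rows("example.txt", 10, 18, "mud flats", "TOY:mud_flat", PARENTS):
    print("\t".join(str(field) for field in row))
```

The `seen` set ensures that a term reachable along several paths, such as `TOY:feature` here, is reported only once per match.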
  29. 29. Download ENVIRONMENTS • Home Page: http://environments.hcmr.gr/ • Tagger Software: http://download.jensenlab.org/environments_tagger.tar.gz
  30. 30. other forms of access
  31. 31. http://eol.org/info/discover_what
  32. 32. Interactive Curation ENVIRONMENTS ID: ENVO:00000150 Name: coral reef ACTION ES1103 http://www.ncbi.nlm.nih.gov/pubmed/18301735
  33. 33. Interactive Curation http://www.ncbi.nlm.nih.gov/pubmed/18301735
  34. 34. Interactive Curation http://www.ncbi.nlm.nih.gov/pubmed/18301735
  35. 35. Interactive Curation http://www.ncbi.nlm.nih.gov/pubmed/18301735
  36. 36. Interactive Curation http://www.ncbi.nlm.nih.gov/pubmed/18301735
  37. 37. Not only ENVO terms
  38. 38. http://www.ncbi.nlm.nih.gov/pubmed/18301735
  39. 39. What else is being identified? Ready for you to discover!
  41. 41. Summary
      • Importance of standardized metadata and annotations
      • ENVO: standardized, hierarchically organized descriptions of environment types
      • Literature, project, and other scientific-content web pages may describe the environmental context of a metagenomics sample
      • ENVIRONMENTS:
        • Dictionary-based identification of environment descriptive terms
        • Ontological community standards, e.g. ENVO, as the name source
        • Command-line application
        • Browser extensions, a user-friendly interface
          • Highly interactive
          • Can be used while browsing the web
          • Extract ENVO terms from a selected part of a web page
          • Extended for organism, disease, and tissue mention identification
  42. 42. Digging-out Information http://hartpurylrc.files.wordpress.com Photo by Dr Chatzinikolaou E
  43. 43. BioCreative (Critical Assessment of Information Extraction in Biology, www.biocreative.org): Metagenomics Track • Preparing a Metagenomics Track as part of the BioCreative 2015 challenge • Aim: improve the environmental-context annotation of sequences in major metagenomics repositories • Track coordinator: Dr. L. Hirschman, MITRE
  44. 44. Biodiversity – Genomics
      ENVIRONMENTS-EOL http://environments-eol.blogspot.com/
      • Encyclopedia of Life (EOL) http://www.eol.org
      • process EOL taxon pages
      • extract environmental context (ENVO terms)
      • EOL Taxon Page: Quick Facts, Data tab
      • integrated in TraitBank
      • large scale biological questions
      • Rubenstein Fellowship 2013, in collaboration with Jennifer Hammock, Patrick Leary, Katja Schulz, Cyndy Parr
      • Hexanchus griseus EOL page, http://eol.org/pages/212027
      SEQenv http://environments.hcmr.gr/seqenv.html
      • annotate microbial sequences with ENVO terms
      • sequence analysis, literature mining, visualization
      • GenBank isolation source, PubMed abstracts
      • sample comparison, temporal/spatial pattern analysis
      • extension: proteins, protein families, 3D visualization
      Reused: analysis of American bird habitats, http://blog.eol.org/ (NoPlaceLikeHome, in collaboration with Rob Stevenson, Carl Nordman)
  45. 45. http://jensenlab.org/ Santos A et al. (under review), preprint: http://biorxiv.org/content/early/2014/11/10/010975 Frankild S et al. (under review), preprint: http://biorxiv.org/content/early/2014/08/25/008425 Pafilis E et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390
  46. 46. Acknowledgements Thank You! HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas; NNF CPR: Lars Juhl Jensen, Sune Frankild; U Mass: Rob Stevenson; Uni Glasgow: Christopher Quince, Umer Ijaz; EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz; MM-MPI: J. Schnetzer; AWI: Dr P. Buttigieg; HITS: Dr. S. Berger; and more. Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN, NNF-CPR, EOL-BHL NESCent Research Sprint 2014, “SEQenv” Hackathons (COST ES1103). Photo: Amvrakikos Lagoons, May 2011
  47. 47. Acknowledgements Thank You! HCMR-IMBG: Christos Arvanitidis, Christina Pavloudi, Katerina Vasileiadou, Lucia Fanini, Sarah Faulwetter, Anastasis Oulas; NNF CPR: Lars Juhl Jensen, Sune Frankild; U Mass: Rob Stevenson; Uni Glasgow: Christopher Quince, Umer Ijaz; EOL: Cynthia Parr, Jennifer Hammock, Patrick Leary, Katja Schulz; MM-MPI: J. Schnetzer; AWI: Dr P. Buttigieg; and more. Funding: EOL Rubenstein Fellowship, LifeWatch Greece, MARBIGEN, NNF-CPR, EOL-BHL NESCent Research Sprint 2014, “SEQenv” Hackathons (COST ES1103). Photo: Amvrakikos Lagoons, May 2011 (id: ENVO:00000038, name: lagoon)
  48. 48. Tutorial • Start Firefox • Install the “megx-seqenv-bar.xpi”: drag and drop it, then choose “Install Now” and “Restart” • Visit a couple of PubMed abstracts or article web pages of your preference • Annotate the complete abstract • Annotate selected sentences only
