
Chemical named entity recognition and literature mark-up

Presentation by Colin Batchelor, Royal Society of Chemistry publishing, in Manchester, March 2008

Published in: Technology, Education

  1. Slide 1. Chemical named entity recognition and literature mark-up
     Colin Batchelor, Informatics Department, Royal Society of Chemistry, [email_address]
  2. Slide 2. Overview
       • Project Prospect: what we find and how we find it.
       • RDF: how should we be disseminating it?
       • Next steps: basics for a chemical ontology.
  3. Slide 9. Project Prospect: What do we find?
       • Chemical compounds
       • Chemical terms from the IUPAC Gold Book
       • Gene products: function, process, location
       • Nucleotide and polypeptide sequence terms
       • Cell types
  4. Slide 10. Project Prospect: How do we find it?
       • For compound names:
           • ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
           • ~20% PubChem
           • ~20% ChemDraw
       • For compound numbers:
           • ~70% author ChemDraw
           • ~30% editors
  5. Slide 12. RDF in an RSS reader
  6. Slide 13. RDF: how we do it now
       • Content module from RSS 1.0
       • http://web.resource.org/rss/1.0/modules/content
       • In what sense does an article “contain” pyridine or base pairs?
       • We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
  7. Slide 14. RDF: what it looks like now
       <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
         <title>[… title]</title>
         <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
         <description>[… blah]</description>
         <content:encoded>[… human-readable stuff]</content:encoded>
         [… dublin core stuff …]
         <content:items>
           <rdf:Bag>
             <rdf:li>
               <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
             </rdf:li>
             <rdf:li>
               <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
             </rdf:li>
           </rdf:Bag>
         </content:items>
       </item>
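A minimal sketch of how a feed consumer might read the entity list out of an item like the one on slide 14 and restate it with an explicit predicate, as slide 13 wishes for. The enclosing rdf:RDF element, the namespace declarations and the placeholder title are assumptions added so the fragment parses; the `MENTIONS` URI is hypothetical and does not come from the talk.

```python
import xml.etree.ElementTree as ET

# Namespaces of RDF, RSS 1.0 and the RSS 1.0 content module.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Hypothetical predicate URI, for illustration only.
MENTIONS = "http://example.org/vocab#mentions"

# The item from slide 14, wrapped in an assumed rdf:RDF root element.
FEED = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <title>placeholder title</title>
    <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>
"""


def mention_triples(xml_text):
    """Return (article, predicate, entity) triples for every content:item."""
    root = ET.fromstring(xml_text)
    about = "{%s}about" % NS["rdf"]  # expanded name of the rdf:about attribute
    triples = []
    for item in root.findall("rss:item", NS):
        article = item.get(about)
        path = "content:items/rdf:Bag/rdf:li/content:item"
        for entity in item.findall(path, NS):
            triples.append((article, MENTIONS, entity.get(about)))
    return triples


if __name__ == "__main__":
    for subject, predicate, obj in mention_triples(FEED):
        print(subject, predicate, obj)
```

The triples make the article-to-entity relationship explicit, whereas the rdf:Bag inside content:items only says that the article "contains" each entity.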
  8. Slide 15. Basics for a chemical ontology
       • Unambiguous representation of objects of chemical discourse
       • Proper parthood relations
  9. Slide 16. Basics for a chemical ontology: 1. Objects of chemical discourse
       • Must be able to represent and clearly distinguish:
           • Compounds
           • Classes of compound
           • Parts of molecules
           • Mixtures
       • Would be nice to have:
           • Disambiguation cues for the first three
  10. Slide 17. Imidazole
  11. Slide 18. An imidazole
  12. Slide 19. The imidazole side-chain/group/ring
  13. Slide 20. Can ChEBI handle this?
       • Imidazoles (!) (CHEBI:24780)
       • Imidazole (CHEBI:16069)
       • Imidazole ring not yet
       • Imidazolyl group not yet (but methyl, benzyl, etc.)
       • … and there are no disambiguation cues
  14. Slide 21. Disambiguation
       • One Sense per Discourse (Gale et al. 1992)
           • … this doesn’t hold at all
       • One Sense per Collocation (Yarowsky 1993)
           • … matches our intuitions
  15. Slide 22. Disambiguation: What a one sense per collocation feature set might look like
       • CLASS:
           • w(–1) = a, an, the, this
           • w(0) plural (bit of a cheat, as not a collocation)
       • PART:
           • w(–1) = bridging, terminal
           • w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
           • w(+1) w(+2) = “building block”, “protecting group”, “side chain”
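The feature set on slide 22 can be read as a small rule-based tagger. The sketch below is one possible reading, not the method used in Project Prospect: the word lists are those on the slide (without the "many more"), while the rule order (PART cues beat CLASS cues, so that "the imidazole side chain" comes out as a part), the fallback label COMPOUND and the trailing-"s" plural test are all assumptions.

```python
# Cue words copied from slide 22.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"),
                     ("protecting", "group"),
                     ("side", "chain")}


def word_at(tokens, i):
    """Lower-cased token at position i, or None if i is out of range."""
    return tokens[i].lower() if 0 <= i < len(tokens) else None


def classify(tokens, i):
    """Label the chemical name at tokens[i] as PART, CLASS or COMPOUND."""
    prev = word_at(tokens, i - 1)
    nxt = word_at(tokens, i + 1)
    nxt2 = word_at(tokens, i + 2)

    # Assumption: PART cues take priority over the determiner cue.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" is a (very naive) stand-in for "plural".
    if prev in CLASS_PREV or tokens[i].lower().endswith("s"):
        return "CLASS"
    # Assumption: anything without a cue is treated as a specific compound.
    return "COMPOUND"


if __name__ == "__main__":
    for text, i in [("imidazole was added", 0),
                    ("an imidazole", 1),
                    ("the imidazole side chain", 1)]:
        print(text, "->", classify(text.split(), i))
```

A trained classifier over the same features would replace the hard-coded rule order; the rules here only show how the collocation windows w(–1), w(+1) and w(+1) w(+2) are consulted.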
  16. Slide 23. Basics for a chemical ontology: 2. Parthood relations
       • Parthood in ChEBI means at least three things:
       • Is necessarily chemically part of:
           • carbonyl group part_of carbonyl compounds
  17. Slide 24. Basics for a chemical ontology: 2. Parthood relations
       • Is possibly chemically part of:
           • Lead(2+) part_of lead diacetate
           • (most lead(2+) isn’t)
           • Electron part_of muonium (!)
  18. Slide 25. Basics for a chemical ontology: 2. Parthood relations
       • Is part of a mixture:
           • Kanamycin A part_of kanamycin
  19. Slide 26. Basics for a chemical ontology: 2. Parthood relations
       • Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)
           • Carbonyl compound has_part carbonyl group
           • Lead diacetate has_part lead(2+) (?!)
           • Muonium has_part electron
           • Kanamycin has_part kanamycin A (?!)
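To make the all-some pattern of Solution 1 concrete, here is a toy check over invented individuals. The instance names and the single has_part link are made up for illustration; the sketch only shows why the direction of the relation matters: every lead diacetate instance has some lead(2+) part, but not every lead(2+) ion is part of some lead diacetate ("most lead(2+) isn’t").

```python
# Toy individuals (invented): three lead(2+) ions and one lead diacetate.
TYPE_OF = {
    "ion_1": "lead(2+)",
    "ion_2": "lead(2+)",
    "ion_3": "lead(2+)",
    "pb_acetate_1": "lead diacetate",
}

# Instance-level has_part links: whole -> set of parts.
HAS_PART = {"pb_acetate_1": {"ion_1"}}


def inverse(links):
    """Turn whole -> parts links into part -> wholes links (part_of)."""
    result = {}
    for whole, parts in links.items():
        for part in parts:
            result.setdefault(part, set()).add(whole)
    return result


def instances(type_name):
    """All toy individuals of the given type."""
    return [i for i, t in TYPE_OF.items() if t == type_name]


def all_some(x_type, links, y_type):
    """True if every instance of x_type is linked to some instance of y_type."""
    return all(
        any(TYPE_OF[target] == y_type for target in links.get(x, ()))
        for x in instances(x_type)
    )


if __name__ == "__main__":
    # has_part holds at the type level on this toy data ...
    print(all_some("lead diacetate", HAS_PART, "lead(2+)"))
    # ... but part_of does not, because ion_2 and ion_3 are free ions.
    print(all_some("lead(2+)", inverse(HAS_PART), "lead diacetate"))
```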
  20. Slide 27. Basics for a chemical ontology: 2. Parthood relations
       • Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships
           • Carbonyl compound molecule has_part carbonyl substituent
           • Muonium atom has_part electron
           • Kanamycin has_component kanamycin A
           • Lead diacetate has_component lead(2+) (?!)
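One way to picture Solution 2 is as a small data model in which the relation name itself records whether an assertion is about molecules or about samples. The four assertions are those on slide 27; the `Assertion` class, the level names and the helper function are hypothetical and only illustrate the proposed split.

```python
from dataclasses import dataclass

# Assumed mapping from relation name to the level it talks about.
LEVEL_OF_RELATION = {
    "has_part": "molecule",     # molecular-level parthood
    "has_component": "sample",  # sample-level (mixture or bulk) composition
}


@dataclass(frozen=True)
class Assertion:
    subject: str
    relation: str
    obj: str

    @property
    def level(self):
        """Level (molecule or sample) implied by the relation name."""
        return LEVEL_OF_RELATION[self.relation]


# The four example assertions from slide 27.
ASSERTIONS = [
    Assertion("carbonyl compound molecule", "has_part", "carbonyl substituent"),
    Assertion("muonium atom", "has_part", "electron"),
    Assertion("kanamycin", "has_component", "kanamycin A"),
    Assertion("lead diacetate", "has_component", "lead(2+)"),
]


def at_level(level):
    """All assertions made at the given level."""
    return [a for a in ASSERTIONS if a.level == level]


if __name__ == "__main__":
    for level in ("molecule", "sample"):
        for a in at_level(level):
            print(level, ":", a.subject, a.relation, a.obj)
```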
  21. Slide 28. Open questions
       • How do we represent the relationship between named entities and documents?
       • How do we integrate ontologies and word-sense disambiguation?
       • What is the best way of distinguishing molecules and samples?
  22. Slide 29. Acknowledgements
       • University of Cambridge: Peter Corbett
       • OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
       • www.projectprospect.org
  23. Slide 30. Open questions
       • How do we represent the relationship between named entities and documents?
       • How do we integrate ontologies and word-sense disambiguation?
       • What is the best way of distinguishing molecules and samples?
