Chemical named entity recognition and literature mark-up

Presentation by Colin Batchelor, Royal Society of Chemistry publishing, in Manchester, March 2008

  1. Slide 1. Chemical named entity recognition and literature mark-up
       Colin Batchelor, Informatics Department, Royal Society of Chemistry [email_address]
  2. Slide 2. Overview
       • Project Prospect: what we find and how we find it.
       • RDF: how should we be disseminating it?
       • Next steps: basics for a chemical ontology.
  3. Slide 9. Project Prospect: What do we find?
       • Chemical compounds
       • Chemical terms from the IUPAC Gold Book
       • Gene products: function, process, location
       • Nucleotide and polypeptide sequence terms
       • Cell types
  4. Slide 10. Project Prospect: How do we find it?
       • For compound names:
           • ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
           • ~20% PubChem
           • ~20% ChemDraw
       • For compound numbers:
           • ~70% author ChemDraw
           • ~30% editors
  5. Slide 12. RDF in an RSS reader
  6. Slide 13. RDF: how we do it now
       • Content module from RSS 1.0
           • http://web.resource.org/rss/1.0/modules/content
       • In what sense does an article "contain" pyridine or base pairs?
       • We would much rather have proper RDF predicates, e.g. "is_about", "mentions".
  7. Slide 14. RDF: what it looks like now
       <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
         <title> [… title] </title>
         <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
         <description> [… blah] </description>
         <content:encoded> [… human-readable stuff …] </content:encoded>
         [… dublin core stuff …]
         <content:items>
           <rdf:Bag>
             <rdf:li>
               <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
             </rdf:li>
             <rdf:li>
               <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
             </rdf:li>
           </rdf:Bag>
         </content:items>
       </item>
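As a rough illustration of slides 13 and 14, the sketch below reads the `content:items` bag out of an RSS 1.0 item with Python's standard library and restates each entry as a triple with an explicit predicate. The sample feed is a cut-down copy of the slide's item wrapped in an `rdf:RDF` root so that it parses; the title text and the `http://example.org/vocab#mentions` predicate URI are placeholders invented for the example, not part of the RSC feed or of any published vocabulary.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for RSS 1.0, RDF and the RSS 1.0 content module.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

# Hypothetical predicate, standing in for the "mentions" idea on slide 13.
MENTIONS = "http://example.org/vocab#mentions"

# Cut-down version of the item on slide 14 (title is a placeholder).
SAMPLE = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <title>placeholder title</title>
    <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def annotations(feed_xml):
    """Map each item's URI to the list of entity URIs in its content:items bag."""
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.findall("rss:item", NS):
        entities = item.findall("content:items/rdf:Bag/rdf:li/content:item", NS)
        result[item.get(RDF_ABOUT)] = [e.get(RDF_ABOUT) for e in entities]
    return result


def to_triples(item_to_entities, predicate=MENTIONS):
    """Restate the bag membership as N-Triples-style lines with an explicit predicate."""
    return [
        f"<{item}> <{predicate}> <{entity}> ."
        for item, entities in item_to_entities.items()
        for entity in entities
    ]


for line in to_triples(annotations(SAMPLE)):
    print(line)
```

The point of the second function is only to show what changes: the article is no longer said to "contain" an InChI, it is linked to it by a named relation whose meaning can be defined separately.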
  8. Slide 15. Basics for a chemical ontology
       • Unambiguous representation of objects of chemical discourse
       • Proper parthood relations
  9. Slide 16. Basics for a chemical ontology: 1. Objects of chemical discourse
       • Must be able to represent and clearly distinguish:
           • Compounds
           • Classes of compound
           • Parts of molecules
           • Mixtures
       • Would be nice to have:
           • Disambiguation cues for the first three
  10. Slide 17. Imidazole
  11. Slide 18. An imidazole
  12. Slide 19. The imidazole side-chain/group/ring
  13. Slide 20. Can ChEBI handle this?
       • Imidazoles (!) (CHEBI:24780)
       • Imidazole (CHEBI:16069)
       • Imidazole ring: not yet
       • Imidazolyl group: not yet (but methyl, benzyl, etc.)
       • … and there are no disambiguation cues
  14. Slide 21. Disambiguation
       • One Sense per Discourse (Gale et al. 1992)
           • … this doesn't hold at all
       • One Sense per Collocation (Yarowsky 1993)
           • … matches our intuitions
  15. Slide 22. Disambiguation: What a one sense per collocation feature set might look like
       • CLASS:
           • w(–1) = a, an, the, this
           • w(0) plural (bit of a cheat, as not a collocation)
       • PART:
           • w(–1) = bridging, terminal
           • w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
           • w(+1) w(+2) = "building block", "protecting group", "side chain"
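The feature set on slide 22 could be tried out with a few lines of rule-based code. The sketch below is not the Prospect implementation: it uses only the cue words listed on the slide, and three things are assumptions made for the example, namely that PART cues are checked before CLASS cues, that "plural" is approximated by a trailing "s", and that anything matching no cue falls back to a COMPOUND label.

```python
# Cue words copied from slide 22 (the slide notes there are "many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Label the chemical name at tokens[i] as PART, CLASS or COMPOUND."""
    words = [t.lower() for t in tokens]
    prev = words[i - 1] if i > 0 else None
    nxt = words[i + 1] if i + 1 < len(words) else None
    nxt2 = words[i + 2] if i + 2 < len(words) else None

    # Assumption: PART cues win over CLASS cues, so that
    # "the imidazole side chain" is not labelled CLASS because of "the".
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" is a crude stand-in for a plural test.
    if prev in CLASS_PREV or words[i].endswith("s"):
        return "CLASS"
    # Assumption: no cue at all means the name refers to the compound itself.
    return "COMPOUND"


print(classify("an imidazole".split(), 1))
print(classify("the imidazole side chain".split(), 1))
print(classify("imidazole was added".split(), 0))
```

The three calls correspond to slides 17 to 19: the bare name, the name with an article, and the name followed by a part word.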
  16. Slide 23. Basics for a chemical ontology: 2. Parthood relations
       • Parthood in ChEBI means at least three things:
       • Is necessarily chemically part of:
           • carbonyl group part_of carbonyl compounds
  17. Slide 24. Basics for a chemical ontology: 2. Parthood relations
       • Is possibly chemically part of:
           • Lead(2+) part_of lead diacetate
           • (most lead(2+) isn't)
           • Electron part_of muonium (!)
  18. Slide 25. Basics for a chemical ontology: 2. Parthood relations
       • Is part of a mixture:
           • Kanamycin A part_of kanamycin
  19. Slide 26. Basics for a chemical ontology: 2. Parthood relations
       • Solution 1: define relationships according to pattern: all instances of X have a relationship with some Y. (Smith et al., "Relations in biomedical ontologies", 2005)
           • Carbonyl compound has_part carbonyl group
           • Lead diacetate has_part lead(2+) (?!)
           • Muonium has_part electron
           • Kanamycin has_part kanamycin A (?!)
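The "all instances of X have a relationship with some Y" pattern from slide 26 can be checked mechanically over instance data. In the sketch below the instance names (`pb_ion_1`, `sample_1` and so on) and the links between them are invented for illustration; only the class names and the two relations come from the slides.

```python
def holds_all_some(x_class, relation, y_class, instance_of, links):
    """True if every instance of x_class is linked by `relation`
    to at least one instance of y_class.

    Note: the result is vacuously True when x_class has no instances.
    """
    xs = [inst for inst, cls in instance_of.items() if cls == x_class]
    return all(
        any(
            subj == x and rel == relation and instance_of.get(obj) == y_class
            for subj, rel, obj in links
        )
        for x in xs
    )


# Hypothetical instances: two lead(2+) ions, only one of which sits
# in a sample of lead diacetate.
instance_of = {
    "pb_ion_1": "lead(2+)",
    "pb_ion_2": "lead(2+)",
    "sample_1": "lead diacetate",
}
links = [
    ("sample_1", "has_part", "pb_ion_1"),
    ("pb_ion_1", "part_of", "sample_1"),
]

# Every lead diacetate has some lead(2+) as a part ...
print(holds_all_some("lead diacetate", "has_part", "lead(2+)", instance_of, links))
# ... but not every lead(2+) is part of some lead diacetate.
print(holds_all_some("lead(2+)", "part_of", "lead diacetate", instance_of, links))
```

This is why the slide reverses the direction of the ChEBI statements: read with the all-some pattern, "lead(2+) part_of lead diacetate" fails as soon as there is one free ion, while the has_part direction survives.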
  20. Slide 27. Basics for a chemical ontology: 2. Parthood relations
       • Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships
           • Carbonyl compound molecule has_part carbonyl substituent
           • Muonium atom has_part electron
           • Kanamycin has_component kanamycin A
           • Lead diacetate has_component lead(2+) (?!)
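One way to picture Solution 2 is as a small typed data structure in which the relation itself records the level it operates at. The sketch below encodes the four statements from slide 27; the `Relation` enum, the `Statement` tuple and the `parts_of` helper are names made up for the example and do not correspond to any existing ontology API.

```python
from enum import Enum
from typing import NamedTuple


class Relation(Enum):
    HAS_PART = "has_part"            # molecular-level relationship
    HAS_COMPONENT = "has_component"  # sample-level relationship


class Statement(NamedTuple):
    whole: str
    relation: Relation
    part: str


# The four statements from slide 27.
STATEMENTS = [
    Statement("carbonyl compound molecule", Relation.HAS_PART, "carbonyl substituent"),
    Statement("muonium atom", Relation.HAS_PART, "electron"),
    Statement("kanamycin", Relation.HAS_COMPONENT, "kanamycin A"),
    Statement("lead diacetate", Relation.HAS_COMPONENT, "lead(2+)"),
]


def parts_of(whole, relation, statements=STATEMENTS):
    """Return the parts linked to `whole` by exactly the given relation."""
    return [s.part for s in statements
            if s.whole == whole and s.relation is relation]


print(parts_of("kanamycin", Relation.HAS_COMPONENT))
print(parts_of("kanamycin", Relation.HAS_PART))
```

With a single part_of relation the two queries could not be told apart; here a question about molecular structure no longer returns mixture components.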
  21. Slide 28. Open questions
       • How do we represent the relationship between named entities and documents?
       • How do we integrate ontologies and word-sense disambiguation?
       • What is the best way of distinguishing molecules and samples?
  22. Slide 29. Acknowledgements
       • University of Cambridge: Peter Corbett
       • OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
       • www.projectprospect.org
  23. Slide 30. Open questions
       • How do we represent the relationship between named entities and documents?
       • How do we integrate ontologies and word-sense disambiguation?
       • What is the best way of distinguishing molecules and samples?