Tackling the difficult areas of chemical entity extraction


Transcript of "Tackling the difficult areas of chemical entity extraction"

  1. ACS National Meeting, Indianapolis, USA, 8th September 2013
     Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities
     Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK
  2. Text mining is big business
     2013 Bio-IT World Best Practices winner
  3. Approaches to entity recognition
     • Dictionary based
     • Grammar based
     • Machine learning
     (Figure label: LeadMine)
  4. Approaches to entity recognition
     • Dictionary based approaches are ideal for relating entities to concepts but only recognise a finite number of terms
       – Will not recognise novel compound names
     • Hence for chemistry, dictionary approaches need to be used in conjunction with another method
  5. Advantages of grammars
     • Don't require annotated corpora
     • Encode knowledge about the domain
     • Very fast recognition
     • Allow spelling correction if an entity is a near match to one recognised by the grammar
  6. Simple grammar example
     Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
     Digit : Digit1to9 | '0'
     Cid : 'CID:' Digit1to9 Digit*
     [Figure: state machine with transitions C, I, D, :, 1..9, then 0..9 repeated]
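
The Cid rule above is a regular language, so it can be sketched as a regular expression. This is only an illustration of the rule on the slide, not LeadMine's implementation; the names `CID_PATTERN` and `find_cids` are made up for the example.

```python
import re

# Direct transcription of the three rules on the slide:
#   Digit1to9 : '1'..'9'
#   Digit     : Digit1to9 | '0'
#   Cid       : 'CID:' Digit1to9 Digit*
DIGIT_1_TO_9 = "[1-9]"
DIGIT = "[0-9]"
CID_PATTERN = re.compile(f"CID:{DIGIT_1_TO_9}{DIGIT}*")


def find_cids(text):
    """Return every substring of text that the Cid rule matches."""
    return CID_PATTERN.findall(text)


print(find_cids("Aspirin is CID:2244; CID:0123 is not well formed."))
```

A leading zero is rejected because the first digit must come from Digit1to9.
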
  7. Grammar for IUPAC names
     • Grammar for complete molecules: 485 rules
       – trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
       – ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
     • Generally aims to match a superset of the nomenclature covered by IUPAC
     • Specifically, this is the superset that can theoretically be converted to structures
  8. State machine size
     [Figure: states required (axis 0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (axis 0.82 to 1.0)]
  9. Two level state machines
     • Breaks the problem into a state machine that keeps track of when concepts have to be matched, and a state machine that matches each concept, e.g. an acyclic group
       – Avoids duplication of states to match the same concept in slightly different contexts
       – Slower, as several concepts that start with the same characters may be possible at once
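
A toy sketch of the two-level idea, under heavy simplification: a top-level machine whose transitions are labelled with concept names, and a separate low-level matcher per concept. The concepts, tokens and state names here are invented for illustration and are far smaller than any real nomenclature grammar.

```python
# Low level: each concept is matched by its own small matcher.
# Token lists are illustrative only.
CONCEPTS = {
    "substituent": ["methyl", "ethyl", "chloro", "bromo"],
    "chain": ["methan", "ethan", "propan"],
    "suffix": ["e", "ol"],
}

# High level: tracks which concept may come next, not individual characters.
TOP_LEVEL = {
    "start": [("substituent", "start"), ("chain", "need_suffix")],
    "need_suffix": [("suffix", "done")],
    "done": [],
}
ACCEPTING = {"done"}


def match_concept(name, text, pos):
    """Yield every end position at which concept `name` matches at `pos`."""
    for token in CONCEPTS[name]:
        if text.startswith(token, pos):
            yield pos + len(token)


def recognises(text, state="start", pos=0):
    """Return True if the whole of `text` is accepted by the top-level machine."""
    if pos == len(text):
        return state in ACCEPTING
    return any(
        recognises(text, next_state, end)
        for concept, next_state in TOP_LEVEL[state]
        for end in match_concept(concept, text, pos)
    )


print(recognises("bromochloromethanol"))
```

At the position of "meth", both the substituent and chain matchers have to be tried, which is the kind of ambiguity the slide gives as the cost of this design.
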
  10. State machine revisited
      [Figure: states required (axis 0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (axis 0.82 to 1.0)]
  11. Grammar inheritance
      • Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar
        – Inherit rules rather than duplicate them
        – Allow overriding of rules
      pluralisedChemical : chemical 's'
      elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
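
One way to picture rule inheritance with overriding is a child rule table built from a parent table. The sketch below assumes that `_elementaryMetalAtom` refers to the inherited definition of the rule being overridden; that reading, the `inherit` helper, and the parent grammar's entries are assumptions made for the example.

```python
# Hypothetical parent grammar; the metal names are placeholders.
MOLECULE_GRAMMAR = {
    "elementaryMetalAtom": ["lithium", "sodium", "iron"],
    "chemical": ["elementaryMetalAtom"],
}


def inherit(parent, overrides):
    """Copy the parent's rules, then apply overrides.

    Inside an override, an alternative written '_name' is expanded to the
    parent's alternatives for rule 'name' (assumed meaning of the underscore).
    """
    child = dict(parent)
    for name, alternatives in overrides.items():
        expanded = []
        for alt in alternatives:
            if alt.startswith("_") and alt[1:] in parent:
                expanded.extend(parent[alt[1:]])
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


GENERIC_GRAMMAR = inherit(MOLECULE_GRAMMAR, {
    "elementaryMetalAtom": [
        "lanthanide", "lanthanoid", "transition metal",
        "transuranic element", "_elementaryMetalAtom",
    ],
})

print(GENERIC_GRAMMAR["elementaryMetalAtom"])
```

Rules that are not overridden, such as `chemical` here, are shared with the parent rather than duplicated.
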
  12. Unconventional entities #1
      • Formulae:
        – Sum formulae
          • C20H25NO6
        – Line formulae
          • CH3CH2CH2Cl (complete molecule)
          • CH2CH2 (linker)
          • CH3CH2 (substituent)
        – Salts
          • MgSO4
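
As a rough illustration of why formulae suit a grammar, the sketch below recognises strings made of element-like symbols with optional counts and tallies them. It does not check symbols against the periodic table, and it does not distinguish linkers or substituents from complete molecules; `parse_sum_formula` is a hypothetical helper.

```python
import re
from collections import Counter

# One symbol (capital letter, optional lowercase letter) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
# A formula is one or more such tokens and nothing else.
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def parse_sum_formula(text):
    """Return {symbol: count} for a formula-shaped string, else None."""
    if not FORMULA.fullmatch(text):
        return None
    counts = Counter()
    for symbol, number in TOKEN.findall(text):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


print(parse_sum_formula("C20H25NO6"))
print(parse_sum_formula("CH3CH2CH2Cl"))
```
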
  13. Unconventional entities #2
      • Peptide formulae
        – Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2
      • Oligosaccharides
        – α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol
      • Oligonucleotides
        – 3'-AATG-5'
  14. Unconventional entities #3
      • Patent numbers
        – U.S. Pat. No. 6,677,355
      • Journal references
        – (1974) J. Biol. Chem. 249, 4250-4256
      • CAS numbers
        – 90-13-1
      • InChI and SMILES
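
CAS numbers carry a check digit, which gives a recogniser a cheap way to reject look-alike digit strings. The sketch below applies the standard CAS checksum (digits weighted 1, 2, 3, ... from the right, sum taken modulo 10); whether LeadMine validates CAS numbers this way is not stated on the slide, and `is_valid_cas` is a name chosen for the example.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(text):
    """Return True if text has CAS number shape and a correct check digit."""
    match = CAS_PATTERN.fullmatch(text)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Rightmost body digit has weight 1, the next weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("90-13-1"))
```

For the slide's example 90-13-1, the weighted sum is 1×3 + 2×1 + 3×0 + 4×9 = 41, and 41 mod 10 = 1, which matches the final digit.
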
  15. Navigating
  16. Fast spelling correction
      • Historically we have used Levenshtein-like distance measures (all possible corrections)
      • Only use spelling correction when recognition fails
      • Allow a certain level of "look behind"
        – 13 characters empirically found to yield identical results
        – Speeds up spelling correction by ~80%
      • A dictionary of common English words can be used to prevent attempting spelling correction
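
A minimal sketch of the distance-based approach and the two gating rules in the bullets above: correct only when exact recognition fails, and never attempt correction on common English words. It does not implement the look-behind optimisation, and a small set of names stands in for the grammar; `KNOWN_NAMES`, `ENGLISH_WORDS` and `correct` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


KNOWN_NAMES = {"benzene", "toluene", "pyridine"}    # stand-in for the grammar
ENGLISH_WORDS = {"because", "between", "routine"}   # stand-in for the word list


def correct(word, max_distance=1):
    """Return the recognised name for word, a near match, or None."""
    if word in KNOWN_NAMES:
        return word      # recognition succeeded, no correction needed
    if word in ENGLISH_WORDS:
        return None      # common English word, do not attempt correction
    candidates = [
        name for name in sorted(KNOWN_NAMES)
        if levenshtein(word, name) <= max_distance
    ]
    return candidates[0] if candidates else None


print(correct("benzine"))
```
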
  17. Words ignored for spelling correction (gray)
  18. Exceptions to local errors
      • Whether a space is allowed may only be decidable once the suffix of a chemical name is encountered
      propyl bromochloromethanol → propylbromochloromethanol
      propyl bromochloromethanoate (unchanged)
      19 character look behind required!
  19. BioCreative IV
      • CHEMDNER (Chemical compound and drug name recognition task)
      • 10,000 annotated PubMed abstracts (3500 for training, 3500 for development and 3000 for testing)
      • Deadline for submission: This Thursday
  20. Typical annotated abstract
  21. Dictionaries… bigger is better
      • For high recall of trivial names, dictionaries with high coverage are required
      • The largest publicly available dictionary is PubChem, with over 94 million terms
      • However, most of these terms are either not useful or actually detrimental to text mining
  22. Aggressive filtering
      • "What you don't see won't hurt you"
      • Hence remove terms that are also English words or start with an English word
        – Accomplished using a large English dictionary with chemistry terms removed
      • Remove internal identifiers used by depositors
      • Remove terms that are matched by our grammars
      • Ultimate result: 94 million → less than 3 million
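
The three removal rules above can be sketched as a single predicate. Everything concrete here is a placeholder: the word list, the identifier pattern and the grammar check are invented for illustration and do not reflect the real filters.

```python
import re

ENGLISH_WORDS = {"lead", "can", "red", "all"}          # hypothetical word list
# Hypothetical shape for depositor identifiers: capitals followed by digits.
INTERNAL_ID = re.compile(r"[A-Z]{2,}[-_ ]?\d{3,}")


def grammar_matches(term):
    """Placeholder for 'already recognised by a grammar'."""
    return term.endswith("ane")


def keep_term(term):
    """Return True if a dictionary term survives all three filters."""
    words = term.lower().split()
    if not words or words[0] in ENGLISH_WORDS:
        return False   # is, or starts with, an English word
    if INTERNAL_ID.fullmatch(term):
        return False   # looks like a depositor's internal identifier
    if grammar_matches(term):
        return False   # redundant, a grammar already covers it
    return True


terms = ["aspirin", "lead", "red phosphorus", "ZINC000123", "hexane", "santalol"]
print([term for term in terms if keep_term(term)])
```
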
  23. Structure aware filtering
      • "Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals."
      • About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
  24. Entity extension
      • Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits
        – α-santalol can be recognised from santalol in the dictionary
      • Extension is bracketing aware and blocked by English words
      • Entity trimming also performed to comply with the annotation guidelines
        – 'Allura Red AC dye' → 'Allura Red AC'
  25. Entity merging
      • Adjacent entities may actually be the same entity
        – Ethyl ester → one entity
        – (+)-limonene epoxide → one entity
        BUT
        – Hexane-benzene → two entities
  26. Using an ontology to determine when terms add information
      • Genistein isoflavone → two entities
      • Glycine ester → one entity
      [Figure: Genistein showing isoflavone core structure]
  27. Abbreviation detection
      • Based on the Hearst and Schwartz algorithm
      • Detects abbreviations of the following forms:
        – Tetrahydrofuran (THF)
        – THF (tetrahydrofuran)
        – Tetrahydrofuran (THF;
        – (tetrahydrofuran, THF)
        – THF = tetrahydrofuran
      Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
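
The core of the Schwartz and Hearst method is a right-to-left scan that aligns the characters of the short form with the text before it, requiring the first character to start a word. The sketch below covers only that scan and only the "long form (SHORT)" pattern; the candidate-window limits and short-form validity checks of the published algorithm, and the other four forms on the slide, are left out. Function names are chosen for the example.

```python
import re

PARENTHESISED = re.compile(r"\(([^()]{2,10})\)")


def best_long_form(short_form, preceding_text):
    """Return the shortest tail of preceding_text that spells out short_form."""
    s = len(short_form) - 1
    pos = len(preceding_text) - 1
    while s >= 0:
        char = short_form[s].lower()
        if char.isalnum():
            # Move left until char is found; the first short-form character
            # must additionally sit at the start of a word.
            while (pos >= 0 and preceding_text[pos].lower() != char) or (
                s == 0 and pos > 0 and preceding_text[pos - 1].isalnum()
            ):
                pos -= 1
            if pos < 0:
                return None
            pos -= 1
        s -= 1
    start = preceding_text.rfind(" ", 0, pos + 1) + 1
    return preceding_text[start:]


def find_abbreviations(sentence):
    """Map each parenthesised short form to its long form, when one aligns."""
    pairs = {}
    for match in PARENTHESISED.finditer(sentence):
        short_form = match.group(1).strip()
        long_form = best_long_form(short_form, sentence[:match.start()].rstrip())
        if long_form:
            pairs[short_form] = long_form
    return pairs


print(find_abbreviations("The solid was dissolved in tetrahydrofuran (THF) and stirred."))
```
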
  28. Anti-abbreviation detection
      • Finds entities detected as abbreviations of unrecognised entities
        – Can mean a common chemical abbreviation has been redefined in the scope of the document
      current good manufacturing practice (cGMP)
      cGMP = cyclic guanosine monophosphate
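
The check described above can be sketched as a comparison between the abbreviations a document defines and a table of known chemical abbreviations. The table, the `is_chemical` stand-in and the function name are assumptions for illustration; a real system would ask its entity recogniser whether the long form is chemical.

```python
# Hypothetical table of common chemical abbreviations.
KNOWN_CHEMICAL_ABBREVIATIONS = {
    "cGMP": "cyclic guanosine monophosphate",
    "THF": "tetrahydrofuran",
}


def is_chemical(long_form):
    """Stand-in for the recogniser: only the table's long forms count."""
    return long_form.lower() in set(KNOWN_CHEMICAL_ABBREVIATIONS.values())


def redefined_abbreviations(document_pairs):
    """Return short forms that this document defines as something non-chemical.

    document_pairs maps each short form to the long form found in the document.
    """
    return {
        short_form
        for short_form, long_form in document_pairs.items()
        if short_form in KNOWN_CHEMICAL_ABBREVIATIONS and not is_chemical(long_form)
    }


pairs = {
    "cGMP": "current good manufacturing practice",
    "THF": "tetrahydrofuran",
}
print(redefined_abbreviations(pairs))
```

Short forms returned by this check would then not be tagged as chemicals elsewhere in the same document.
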
  29. Grammars used
      • Systematic molecule
      • Systematic prefix
      • Systematic generic name
      • Registry number
      • CAS number
      • Chemical formulae
      • Systematic polymer
      • Semi systematic chemical name
        – Systematic prefix + common trivial name/name from PubChem
  30. Dictionaries used
      • Noise words e.g. lead
      • Trivial polymer
      • Generic chemical terms (some from ChEBI)
      • Common abbreviations
      • Common trivial names
      • Filtered PubChem
      • Alloys
      • Allotropes
      • Minerals
  31. Making the most of the knowledge provided
      • Use training data to identify terms that are not currently recognised (a whitelist)
      • Identify terms that are often false positives (a blacklist)
      • Each false positive and false negative is placed into such a list if its inclusion increases the F-score (harmonic mean of precision and recall)
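
The inclusion test in the last bullet can be sketched from raw counts of true positives, false positives and false negatives. The counts in the usage line are made up, and `improves` is a hypothetical helper that models only the whitelist case.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def improves(tp, fp, fn, gained_tp, gained_fp):
    """True if whitelisting a term raises the F-score.

    gained_tp: previously missed mentions the term would now recover.
    gained_fp: new false hits the term would introduce.
    """
    before = f_score(tp, fp, fn)
    after = f_score(tp + gained_tp, fp + gained_fp, fn - gained_tp)
    return after > before


print(improves(80, 20, 20, gained_tp=5, gained_fp=1))
```

A blacklist candidate can be scored the same way by subtracting the false positives it removes and adding any true positives it would suppress.
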
  32. Results (on development set)
      Configuration          Precision  Recall  F-score
      Baseline               0.87       0.82    0.84
      WhiteList              0.86       0.85    0.85
      BlackList              0.88       0.80    0.84
      WhiteList + BlackList  0.87       0.83    0.85
  33. Future work
      • Typically we are focused on generating structures from the entities we recognise
        – Line formula parsing
        – Generic chemical name parsing (difficult to do in a way that the results are not tied to a particular toolkit)
      • Grammars serve as an excellent starting point for writing parsers
  34. Conclusions
      • Two level state machines allow many complicated grammars to be represented by far fewer states
      • Backtracking spelling correction can provide significant speed improvements without affecting recall
      • Check out our blog (nextmovesoftware.co.uk/blog) in a couple of weeks to find out how we did in BioCreative!
  35. Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities
      Thank you for your attention
      daniel@nextmovesoftware.com