LeadMine: A grammar and dictionary driven approach to chemical entity recognition
Daniel Lowe and Roger Sayle
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
1. Abstract
LeadMine is a system for recognizing entities, especially chemical entities,
using large grammars and dictionaries. Entities are identified without an
explicit tokenization step. To allow recognition of terms slightly outside the
coverage of these resources, spelling correction, entity extension and entity
merging are used. Recall is enhanced by abbreviation detection, and precision
is enhanced by the removal of abbreviations of non-entities. With training
data used to produce further dictionaries of terms to recognize or ignore,
LeadMine achieved 86.2% precision and 85.0% recall on an unused development
set.
1. Sayle R, Xie PH, Muresan S. Improved Chemical Text Mining of Patents with Infinite
Dictionaries and Automatic Spelling Correction. Journal of Chemical Information
and Modeling. 2011;52(1):51–62.
2. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara
R, Darsow M, Guedj M, Ashburner M. ChEBI: a database and ontology for chemical
entities of biological interest. Nucleic Acids Research. 2008;36:D344–350.
3. Schwartz A, Hearst M. A Simple Algorithm for Identifying Abbreviation Definitions
in Biomedical Text. In: Proceedings of the Pacific Symposium on Biocomputing.
Kauai; 2003. pp. 451–462.
The rules of chemical nomenclature are written as formal grammars, e.g.
alkanStem : ‘meth’ | ‘eth’ | ‘prop’ …
alkane : alkanStem ‘ane’
(485 rules are used in the systematic chemical name grammar, and many are
inherited by the derived grammars.)
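As a rough illustration of how such rules can drive recognition, the two rules above can be compiled into a regular expression. This is a minimal Python sketch, not LeadMine's implementation: the names ALKAN_STEMS, ALKANE and is_alkane are hypothetical, and only the three stems shown above are included because the remaining alternatives are elided in the source.

```python
import re

# Only the three stems shown in the text; the source elides the rest.
ALKAN_STEMS = ["meth", "eth", "prop"]

# alkane : alkanStem 'ane'
ALKANE = re.compile("(?:" + "|".join(map(re.escape, ALKAN_STEMS)) + ")ane")


def is_alkane(name):
    """Return True if the whole name is derivable from the 'alkane' rule."""
    return ALKANE.fullmatch(name.lower()) is not None


if __name__ == "__main__":
    for candidate in ["methane", "Ethane", "propane", "methanol"]:
        print(candidate, is_alkane(candidate))
```

Because the stems are listed explicitly, names built on stems missing from the list (for example butane) are rejected until the rule is extended.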
The 2.94-million-term PubChem dictionary is the primary source of
trivial names. It was produced by running a series of filters against the
~94 million synonyms provided by PubChem. The filters included
removing terms that are English words or that start with an English word.
Records for structures containing tetrasaccharides (or longer)
or hexadecapeptides (or longer) were also excluded.
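The English-word filter described above might look like the following Python sketch. It makes several assumptions: "starts with an English word" is interpreted as "the first whitespace-separated token is an English word", the function name keep_synonym is hypothetical, and the tiny word set stands in for a real English word list.

```python
def keep_synonym(term, english_words):
    """Keep a synonym unless it is, or starts with, an English word.

    Assumption: the first whitespace-separated token decides the outcome.
    """
    tokens = term.lower().split()
    if not tokens:
        return False
    return tokens[0] not in english_words


if __name__ == "__main__":
    # Toy word list for illustration only.
    english = {"impact", "red", "dye", "mix", "acid"}
    synonyms = ["aspirin", "impact", "red dye mix", "acetylsalicylic acid"]
    print([s for s in synonyms if keep_synonym(s, english)])
```

A single-token term that is itself an English word is rejected by the same test, so one check covers both conditions.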
4. LeadMine Annotation
5. Entity extension and entity merging
Entities are extended until they reach whitespace, a mismatched bracket or
an English word. Entities are then trimmed of non-essential parts. Finally,
adjacent entities are merged unless they are distinct molecules or one is an
instance of the other according to ChEBI (e.g. genistein is an isoflavone).
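The extension and merging rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions: extension stops only at whitespace (the mismatched-bracket and English-word stops are omitted), "distinct molecules" is taken to mean that both terms name specific molecules, and the MOLECULES and IS_A tables are hypothetical stand-ins for dictionary and ChEBI lookups.

```python
def extend_to_whitespace(text, start, end):
    """Grow the span [start, end) in both directions until whitespace."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


def should_merge(left, right, molecules, is_a_pairs):
    """Decide whether two adjacent entities should become one entity."""
    l, r = left.lower(), right.lower()
    if l in molecules and r in molecules:
        return False  # two distinct molecules stay separate
    if (l, r) in is_a_pairs or (r, l) in is_a_pairs:
        return False  # one is an instance of the other
    return True


# Hypothetical lookup tables for illustration only.
MOLECULES = {"hexane", "benzene", "glycine", "genistein"}
IS_A = {("genistein", "isoflavone")}

if __name__ == "__main__":
    text = "oil rich in α-Santalol was used"
    start = text.index("Santalol")
    s, e = extend_to_whitespace(text, start, start + len("Santalol"))
    print(text[s:e])
    print(should_merge("Glycine", "ester", MOLECULES, IS_A))
    print(should_merge("Hexane", "benzene", MOLECULES, IS_A))
    print(should_merge("Genistein", "isoflavone", MOLECULES, IS_A))
```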
LeadMine combines the capabilities of grammars to recognize regular
entities with the coverage of dictionaries. The results are readily
understandable and can be iteratively improved.
7. Abbreviation Detection
The Schwartz and Hearst algorithm was adapted to recognize
abbreviations of the following forms:
• Tetrahydrofuran (THF)
• THF (tetrahydrofuran)
• Tetrahydrofuran (THF;
• Tetrahydrofuran (THF,
• (tetrahydrofuran, THF)
• THF = tetrahydrofuran
A list of domain-specific abbreviations is also used for cases where the
unabbreviated form does not contain the characters of the abbreviation,
e.g. mercury (Hg) or estrone (E1).
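The following Python sketch illustrates the idea for the forms in which the abbreviation follows its definition in parentheses. It is not LeadMine's code: best_long_form is a port of the right-to-left character matching published by Schwartz and Hearst, without their limit on the size of the search window, and find_abbreviations, SHORT_IN_PARENS and DOMAIN_ABBREVIATIONS are hypothetical names. The reversed and "=" forms are not handled here.

```python
import re

# Abbreviations whose long form lacks their characters (examples from the text).
DOMAIN_ABBREVIATIONS = {"Hg": "mercury", "E1": "estrone"}

# "(SHORT" closed by ")", ";" or ",".
SHORT_IN_PARENS = re.compile(r"\(([A-Za-z0-9]{2,10})[);,]")


def best_long_form(short, preceding):
    """Match the characters of `short` right to left against `preceding`."""
    l = len(preceding) - 1
    for s in range(len(short) - 1, -1, -1):
        c = short[s].lower()
        if not c.isalnum():
            continue
        # The first character of the abbreviation must start a word.
        while l >= 0 and (
            preceding[l].lower() != c
            or (s == 0 and l > 0 and preceding[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    begin = preceding.rfind(" ", 0, l + 1) + 1
    return preceding[begin:]


def find_abbreviations(text):
    """Return (short form, long form) pairs found in `text`."""
    pairs = []
    for match in SHORT_IN_PARENS.finditer(text):
        short = match.group(1)
        preceding = text[:match.start()].rstrip()
        long_form = best_long_form(short, preceding)
        if long_form is None:
            # Fall back to the list of domain-specific abbreviations.
            known = DOMAIN_ABBREVIATIONS.get(short)
            if known and preceding.lower().endswith(known):
                long_form = preceding[-len(known):]
        if long_form:
            pairs.append((short, long_form))
    return pairs


if __name__ == "__main__":
    print(find_abbreviations("Dissolve in tetrahydrofuran (THF; 5 mL)."))
    print(find_abbreviations("Traces of mercury (Hg) remain."))
```

Character matching alone fails for "mercury (Hg)" because the long form contains no "g", which is why the fallback list is consulted.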
The training set was used to automatically identify holes in coverage and
common false positives. From these, a dictionary of terms to include
(WhiteList) and a dictionary of terms to exclude (BlackList) were derived.
The workflow was then evaluated on the development set for the task of
identifying all chemical entity mentions.
Configuration          Precision  Recall  F-score
Baseline               0.869      0.820   0.844
WhiteList              0.862      0.850   0.856
BlackList              0.882      0.803   0.841
WhiteList + BlackList  0.873      0.832   0.852
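The F-score column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The short Python check below recomputes it from the reported precision and recall values; the names f_score and RESULTS are illustrative only.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
RESULTS = {
    "Baseline": (0.869, 0.820),
    "WhiteList": (0.862, 0.850),
    "BlackList": (0.882, 0.803),
    "WhiteList + BlackList": (0.873, 0.832),
}

if __name__ == "__main__":
    for name, (p, r) in RESULTS.items():
        print(f"{name:<22} {f_score(p, r):.3f}")
```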
8. Non-entity abbreviation removal
The Schwartz and Hearst algorithm is used to find abbreviations that are
recognized entities but whose unabbreviated form is not an entity. Such
abbreviations are then ignored. For example, cGMP would normally be
recognized as cyclic guanosine monophosphate, but is ignored in:
current good manufacturing practice (cGMP)
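A minimal Python sketch of this filtering step is given below. It assumes that abbreviation pairs have already been detected and that some lookup decides whether a string is an entity; the function name remove_non_entity_abbreviations and the toy ENTITY_TERMS set are hypothetical.

```python
def remove_non_entity_abbreviations(entities, abbreviation_pairs, is_entity):
    """Drop entities that abbreviate a long form which is not an entity."""
    blocked = {
        short for short, long_form in abbreviation_pairs
        if not is_entity(long_form)
    }
    return [entity for entity in entities if entity not in blocked]


# Toy entity lookup for illustration only.
ENTITY_TERMS = {"cgmp", "thf", "tetrahydrofuran"}


def is_entity(term):
    return term.lower() in ENTITY_TERMS


if __name__ == "__main__":
    found = ["cGMP", "THF"]
    pairs = [
        ("cGMP", "current good manufacturing practice"),
        ("THF", "tetrahydrofuran"),
    ]
    print(remove_non_entity_abbreviations(found, pairs, is_entity))
```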
LeadMine works internally on a normalized string with mappings back to the
original input. Normalization allows XML tags to be ignored and means that
fewer lexical variants need to be recognized, e.g.
5` or 5’ or 5′ (backtick/quotation mark/prime) are all normalized to 5'
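One way to keep a mapping back to the original input is to record, for every character of the normalized string, the index it came from. The Python sketch below does this under simplifying assumptions: anything between "<" and ">" is treated as an XML tag, only the three prime variants above are mapped, and the names normalize, PRIME_VARIANTS and TAG are hypothetical rather than taken from LeadMine.

```python
import re

# Backtick, right single quotation mark and prime all become an apostrophe.
PRIME_VARIANTS = {"`": "'", "\u2019": "'", "\u2032": "'"}

# Assumption: anything between "<" and ">" is a tag to be skipped.
TAG = re.compile(r"<[^>]+>")


def normalize(text):
    """Return (normalized string, offsets into the original string)."""
    chars = []
    offsets = []
    i = 0
    while i < len(text):
        tag = TAG.match(text, i)
        if tag:
            i = tag.end()  # skip the tag, emit nothing
            continue
        chars.append(PRIME_VARIANTS.get(text[i], text[i]))
        offsets.append(i)
        i += 1
    return "".join(chars), offsets


if __name__ == "__main__":
    original = "5\u2032-<i>O</i>-methyl"
    normalized, offsets = normalize(original)
    print(normalized)
    # Map a match in the normalized string back to the original input.
    pos = normalized.index("O")
    print(offsets[pos], original[offsets[pos]])
```

An entity found at positions [a, b) of the normalized string can then be reported at offsets[a] through offsets[b - 1] in the original text.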
Input | Found entities | After extension/merging
α-Santalol | Santalol | α-Santalol
Allura Red AC dye | Allura Red AC dye | Allura Red AC
Glycine ester | Glycine AND ester | Glycine ester
Hexane-benzene | Hexane AND benzene | Hexane AND benzene
 | Genistein AND isoflavone | Genistein AND isoflavone
Figure legend: green = traditional dictionaries; orange = blocking dictionaries.