SlideShare a Scribd company logo
Advanced grammars for
state-of-the-art named
entity recognition (NER)
Roger Sayle and daniel lowe
NextMove Software, Cambridge, UK
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
overview
• NextMove Software’s LeadMine text-mining engine
internally uses “CaffeineFix” (.cfx) technology for
specifying and efficiently matching important terms.
• In addition to case-sensitive and case-insensitive
term matching CaffeineFix/LeadMine also support
spelling correction (fuzzy matching).
• The most common usage is to simply compile
dictionaries into binary form for fast matching.
• Advanced users, specify “regular expressions”.
• In this presentation, we go beyond REGEXPs.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
leadmine v2 entity types
1. Chemicals
2. Biomolecules
3. Anatomy
4. Cell Lines
5. Diseases
6. Symptoms
7. Mechanisms of Action
8. Species/Organisms
9. Companies
10. Named Reactions
11. Regions
12. Languages/Possessives
1.1 Dictionary Names
1.2 Systematic Names
1.3 Generic Classes
1.4 Polymers
1.5 Formulae
2.1 Proteins
2.2 Genes
2.3 E.C. Numbers
2.4 PDB Codes
3.1 Cell Types
3.2 Cytogenetic Loci
1.1.1 Abbreviations
1.1.2 CAS RN Numbers
1.1.3 Registry Numbers
1.2.1 Functional Groups
1.2.2 Elements
1.2.3 Acids
1.2.4 SMILES
1.2.5 InChIs
2.1.1 Targets
2.1.2 P450s
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
named entity normal forms
• Chemicals SMILES and/or InChI
• Proteins UniProt
• Genes Entrez GeneID/HGNC
• Targets ChEMBL
• Species/Organism NCBI Taxonomy ID
• Diseases/Symptoms ICD-10
• Named Reactions RXNO
• Mechanism of Action ATC
• Many of these can also use NLM MeSH Terms.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Example entity dictionary as dag
• Nitrogen containing heterocycles as minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyrdine, Pyridazine,
Pyrimidine, Pyrazine
• CaffeineFix supports (very large) user dictionaries.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Obo ontologies as dictionaries
• In addition to regular TSV (tab-separated value) files
for storing dictionaries, LeadMine’s obo2dict also
supports OBO ontologies, a convenient method for
tracking synonyms and foreign language forms.
[Term]
id: RXNO:0000006
name: Diels-Alder reaction
synonym: "Diels-Alder cycloaddition" EXACT []
synonym: "ディールス・アルダー反応" EXACT Japanese []
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Plural form generation
• LeadMine’s pluralize automatically generates English
plural forms from singular dictionary entries.
diels-alder couplings RXNO:0000006
diels-alder cycloadditions RXNO:0000006
diels-alder reactions RXNO:0000006
acridine syntheses RXNO:0000518
acyclic beckmann rearrangements RXNO:0000564
acyloin condensations RXNO:0000085
olefin metatheses RXNO:0000280
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Unusual entities
• ISBN, URL, PubMed SQL statement
• Roman Numerals, Date Solvent Mixture
• ColorState, Zip codes Hearst Patterns
• Katakana Unknown acid
• HELM, InChI, SMILES, v2000 Unknown antibody
• Credit Card Numbers Unknown disease
• Region Unknown INN
• Person Ordinal numbers
• Disease Cardinal numbers
• Journal de, es, fr, it, sv
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Grammars within grammars
• LeadMine grammar’s are specified constructively
effectively producing even more entity types.
• Region = City + Continent + Country + Island + Lake +
Mountain + Ocean + River + Sea + State/Province +
OtherFeature + OtherRegion.
• City = CityAlbania + CityAndorra + CityAustralia + CityAustria +
… + CityUS + …
• CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ +
CityUS_CA + CityUS_CO + …
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Pharma registry numbers
• CaffineFix v2.0 supports sets of user-defined
regular expressions as dictionaries.
• One application is specifying the format of
registry numbers, such as GSK204454A
• Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | …
• Number: d{3-7}
• Suffix: (“.” d) | [“a” .. “z”]
• RegistryNumber: Prefix [“ ” | “-”] Number [Suffix]
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Cardinal numbers
• English
– One, ten, two thousand and forty eight, ten million
• German
– Eins, Zehn, Hundert, Million, Viermillion
– Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig
• French
– Trois cents, un mille, mille neuf cent quatre-vingts dix-huit
• Italian
– Uno, due, trenta, ottocentosessantamila settecentoottantanove
• Swedish
– en miljon trehundrasjuttiåtta tusen niohundrasjuttiett
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
cas registry number grammar
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]d{2,5})|([5-9]d))-dd-d
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Cas check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated by
series term summation modulo 10.
• The last digit time 1, the previous digit times 2, the
previous digit times 3, and computing the sum
modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
4x3 + 5x7 + 6x7) mod 10 = 5.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Fsm for matching cas check digits
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Fsm for matching cas check digits
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
cas number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Roman numerals
One useful operator is NonEmpty that removes the empty string
from the set of valid matches, and requires at least one or more
characters to match.
I
II
III
IV
V
VI
VII
VIII
IX
X
XX
XXX
XL
L
LX
LXX
LXXX
XC
C
CC
CCC
CD
D
DC
DCC
DCCC
CM
M
MM
MMM
Thousands Hundreds Tens Units
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Unknown acid
• Another operators allows wildcards with exceptions,
effectively a not operator.
• An unknown acid is “[a-z’-]+ acid” where the first
word excludes:
– Stop words: a, the, and, any, is, in, was, etc.
– Common qualifiers: acceptable, preferred, etc.
– Adjectives: battery, free, inorganic, strong, etc.
– Known acids: acetic, nitric, amino, carboxylic, etc.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Unknown inn
• A variation on this theme allows LeadMine to
recognize novel (recently announced) kinase
inhibitors and antibodies based on the structure of
their INN names.
• An unknown kinase inhibitor is “[a-z]+inib” and an
unknown antibody is “[a-z]+mab” where the words
exclude previously known/reported INN names and
“colliding” English words.
april != capropril, KappaB != rozrolimupab, yuletide != exenatide,
triumvir != zanamivir, etc.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Person grammar
• The named person grammar matches:
1. [Salutation] FirstName [Initials] Surname [Suffix]
2. [Salutation] FirstName [Initials] UnknownSurname [Suffix]
3. [Salutation] UnknownFirstName [Initials] Surname [Suffix]
• where
Salutation includes Mr., Mrs., Dr., Sir, His Highness, …
FirstName includes David, John, Sarah, Tom, Angela, …
Surname includes Smith, Jones, Overington, …
UnknownFirstname excludes Big, Lake, The, Outer, etc.
UnknownSurname excludes Avenue, Bridge, Street, etc.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
List construction operator
• Another frequently used idiom, are the operators for
constructing comma separated list.
• These turn the grammar matching “X” into the
grammar matching things like “X, X, X and X”.
• More specifically:
(X [ “,” “ ”? X]* (“ and ”| “ or ” | “ and/or ” )? X
• Another variation of this allows “other”, “similar”
and “related” to the final X if the list is non-empty.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Hearst pattern grammars
• An example use of list constructions is in the
recognition of Heart Patterns.
1. X such as Y [“including”, “especially” etc.]
2. Y and other X [“and related”, “or similar” etc.]
3. such X as Y
• Where X is category or classification term;
• And Y is a list of exemplified terms.
• Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings
of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Complex object builder
• An application of the list construction operator is in
our “complex object builder” construction operator.
ComplexObjectBuilder cob;
cob.insert(“red”, “lorry”, “lorries”);
cob.insert(“yellow”, “lorry”, “lorries”);
• Allows matching not only of
“red lorry”, “red lorries”, “yellow lorry” and “yellow lorries”
• But also of…
“red and yellow lorries”, “yellow and red lorries”, etc.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
complex disease examples
• Adenomatous polyps of the colon and rectum.
• Fibroepithelia or epithelial hyperplasias.
• Inherited spinocerebellar ataxia.
• Stage II or stage III colorectal cancer.
• Inherited breast and overian cancers.
• Argentinian, Bolivian and Korean haemorrhagic
fevers.
• Dermatitis due to heat, cold, radiation, cosmetics,
fungi and shellfish.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Grammars for Safety text mining
• “May cause lung damage if swallowed”
– “may” → “can”, “could”, “may”, “might”, “will”, etc.
– “cause” → “lead to”, “result in”, “trigger”, “bring on”, …
– “lung damage” → “explosion”, “cancer”, “injury”, …
– “if” → “when”, “once”…
– “swallowed” → “heated”, “shaken”, “dried”, “ignited”…
• “Highly toxic”
– “highly” → “very”, “extremely”, “unusually”, “intensely”…
– “toxic” → “explosive”, “carcinogenic”, “poisonous”…
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
efficient protein variant naming
• CaffeineFix technology can also be applied to naming
peptides and arbitrary protein variants/mutants.
• Consider the a database of the following 11 peptides:
– CFFQNCPRG phenylpressin
– CFVRNCPTG annetocin
– CFWTSCPIG octopressin
– CYFQNCPRG argipressin
– CYFQNCPKG lypressin
– CYFRNCPIG cephalotocin
– CYIQNCPLG oxytocin
– CYIQNCPPG prol-oxytocin
– CYIQNCPRG vasotocin
– CYIQSCPIG seritocin
– CYISNCPIG isotocin
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Dag representation of sequences
These 11 peptides may be efficiently represented and
search as a “directed acyclic graph” [38 vs. 99 states]
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
entirety of uniprot/swissprot
• Using this representation, all 540546 protein
sequences in uniprot_sprot, which contains over
192M amino acids, requires 142M states (1.4Gb).
• This data structure allows close analogues to be
identified much faster than using NCBI blastp.
• For example, all 540546 sequences can be queried
against this database (i.e. all-against-all) in ~9m30s
on a single core on a laptop.
• The sequence from PDB 1CRN (crambin 46AA) is
canonically named as [L25I]P01542 in 0.002s.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
Application to precision medicine
• A more realistic example is that sequence of the
gene “spastic paraplegia4” with six mutations from
OMIM:604277 can be canonically named as
[I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0
• Run-time for this query is 0.2s.
• By comparison, blastp 2.2.29+ takes about 6s.
– With default arguments, NCBI blastp run time is 7s.
– Only 6s with –num_descriptions 1 –num_alignments 1.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
summary
• LeadMine’s .cfx files can do far more than efficiently
match very large dictionaries of terms.
• Indeed, many of the grammars used at NextMove
Software potentially match an infinite number of
terms.
• Construction of domain specific grammars can be
done in collaboration with LeadMine customers.
253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017

More Related Content

What's hot

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 
Federated SPARQL Query Processing ISWC2015 Tutorial
Federated SPARQL Query Processing ISWC2015 TutorialFederated SPARQL Query Processing ISWC2015 Tutorial
Federated SPARQL Query Processing ISWC2015 Tutorial
Muhammad Saleem
 
Federated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of DataFederated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of Data
Muhammad Saleem
 
Efficient source selection for sparql endpoint federation
Efficient source selection for sparql endpoint federationEfficient source selection for sparql endpoint federation
Efficient source selection for sparql endpoint federation
Muhammad Saleem
 
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint FederationHiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
Muhammad Saleem
 
Federated Query Formulation and Processing Through BioFed
Federated Query Formulation and Processing Through BioFedFederated Query Formulation and Processing Through BioFed
Federated Query Formulation and Processing Through BioFed
Muhammad Saleem
 
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
SAFE: Policy Aware SPARQL Query Federation Over RDF Data CubesSAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
Muhammad Saleem
 
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
Olaf Hartig
 
FedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked DataFedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked Data
aschwarte
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
Thomas Meehan
 

What's hot (10)

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Federated SPARQL Query Processing ISWC2015 Tutorial
Federated SPARQL Query Processing ISWC2015 TutorialFederated SPARQL Query Processing ISWC2015 Tutorial
Federated SPARQL Query Processing ISWC2015 Tutorial
 
Federated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of DataFederated SPARQL query processing over the Web of Data
Federated SPARQL query processing over the Web of Data
 
Efficient source selection for sparql endpoint federation
Efficient source selection for sparql endpoint federationEfficient source selection for sparql endpoint federation
Efficient source selection for sparql endpoint federation
 
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint FederationHiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation
 
Federated Query Formulation and Processing Through BioFed
Federated Query Formulation and Processing Through BioFedFederated Query Formulation and Processing Through BioFed
Federated Query Formulation and Processing Through BioFed
 
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
SAFE: Policy Aware SPARQL Query Federation Over RDF Data CubesSAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
SAFE: Policy Aware SPARQL Query Federation Over RDF Data Cubes
 
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
(An Overview on) Linked Data Management and SPARQL Querying (ISSLOD2011)
 
FedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked DataFedX - Optimization Techniques for Federated Query Processing on Linked Data
FedX - Optimization Techniques for Federated Query Processing on Linked Data
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 

Similar to Advanced grammars for state-of-the-art named entity recognition (NER)

Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...
Iconic Translation Machines
 
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
dan2097
 
Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionNextMove Software
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
greatsalvation813
 
Cartographic Resources Cataloging with RDA Workshop
Cartographic Resources Cataloging with RDA WorkshopCartographic Resources Cataloging with RDA Workshop
Cartographic Resources Cataloging with RDA Workshop
ALATechSource
 
RDA and Hebraica: Applying RDA in one cataloging community
RDA and Hebraica: Applying RDA in one cataloging communityRDA and Hebraica: Applying RDA in one cataloging community
RDA and Hebraica: Applying RDA in one cataloging community
AJL2011
 
Serials & E-Books in RDA
Serials & E-Books in RDASerials & E-Books in RDA
Serials & E-Books in RDARenette Davis
 
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
Dr. Haxel Consult
 
Introducing RDA: June 2013
Introducing RDA: June 2013Introducing RDA: June 2013
Introducing RDA: June 2013ALATechSource
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Seth Grimes
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
Seth Grimes
 
In grammars we trust: LeadMine, a knowledge driven solution
In grammars we trust: LeadMine, a knowledge driven solutionIn grammars we trust: LeadMine, a knowledge driven solution
In grammars we trust: LeadMine, a knowledge driven solution
NextMove Software
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
valuebound
 
regularexpression-180328061400.pptx hehe
regularexpression-180328061400.pptx heheregularexpression-180328061400.pptx hehe
regularexpression-180328061400.pptx hehe
anthonykharl6
 
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
reynaldogonzales13
 

Similar to Advanced grammars for state-of-the-art named entity recognition (NER) (16)

Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...Making the Old New Again - Modern Technical Provides Access to Historical Che...
Making the Old New Again - Modern Technical Provides Access to Historical Che...
 
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
Tackling the difficult areas of chemical entity extraction: Misspelt chemical...
 
Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extraction
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
 
Cartographic Resources Cataloging with RDA Workshop
Cartographic Resources Cataloging with RDA WorkshopCartographic Resources Cataloging with RDA Workshop
Cartographic Resources Cataloging with RDA Workshop
 
RDA and Hebraica: Applying RDA in one cataloging community
RDA and Hebraica: Applying RDA in one cataloging communityRDA and Hebraica: Applying RDA in one cataloging community
RDA and Hebraica: Applying RDA in one cataloging community
 
Serials & E-Books in RDA
Serials & E-Books in RDASerials & E-Books in RDA
Serials & E-Books in RDA
 
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...II-PIC 2017: Why did I miss that Patent? How value added databases of STN  he...
II-PIC 2017: Why did I miss that Patent? How value added databases of STN he...
 
Introducing RDA: June 2013
Introducing RDA: June 2013Introducing RDA: June 2013
Introducing RDA: June 2013
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
 
In grammars we trust: LeadMine, a knowledge driven solution
In grammars we trust: LeadMine, a knowledge driven solutionIn grammars we trust: LeadMine, a knowledge driven solution
In grammars we trust: LeadMine, a knowledge driven solution
 
Cas 2
Cas 2Cas 2
Cas 2
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
regularexpression-180328061400.pptx hehe
regularexpression-180328061400.pptx heheregularexpression-180328061400.pptx hehe
regularexpression-180328061400.pptx hehe
 
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
vsip.info_diccionario-para-ingenierios-espanol-ingles-ingles-espanol-2da-ed-p...
 

More from NextMove Software

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
NextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
NextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
NextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
NextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
NextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
NextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
NextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
NextMove Software
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
NextMove Software
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
NextMove Software
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
NextMove Software
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
NextMove Software
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 

Recently uploaded

Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
Richard Gill
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
AlguinaldoKong
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
ssuserbfdca9
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
Sérgio Sacani
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
anitaento25
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 

Recently uploaded (20)

Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
EY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptxEY - Supply Chain Services 2018_template.pptx
EY - Supply Chain Services 2018_template.pptx
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 

Advanced grammars for state-of-the-art named entity recognition (NER)

  • 1. Advanced grammars for state-of-the-art named entity recognition (NER) Roger Sayle and daniel lowe NextMove Software, Cambridge, UK 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 2. overview • NextMove Software’s LeadMine text-mining engine internally uses “CaffeineFix” (.cfx) technology for specifying and efficiently matching important terms. • In addition to case-sensitive and case-insensitive term matching CaffeineFix/LeadMine also support spelling correction (fuzzy matching). • The most common usage is to simply compile dictionaries into binary form for fast matching. • Advanced users, specify “regular expressions”. • In this presentation, we go beyond REGEXPs. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 3. leadmine v2 entity types 1. Chemicals 2. Biomolecules 3. Anatomy 4. Cell Lines 5. Diseases 6. Symptoms 7. Mechanisms of Action 8. Species/Organisms 9. Companies 10. Named Reactions 11. Regions 12. Languages/Possessives 1.1 Dictionary Names 1.2 Systematic Names 1.3 Generic Classes 1.4 Polymers 1.5 Formulae 2.1 Proteins 2.2 Genes 2.3 E.C. Numbers 2.4 PDB Codes 3.1 Cell Types 3.2 Cytogenetic Loci 1.1.1 Abbreviations 1.1.2 CAS RN Numbers 1.1.3 Registry Numbers 1.2.1 Functional Groups 1.2.2 Elements 1.2.3 Acids 1.2.4 SMILES 1.2.5 InChIs 2.1.1 Targets 2.1.2 P450s 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 4. named entity normal forms • Chemicals SMILES and/or InChI • Proteins UniProt • Genes Entrez GeneID/HGNC • Targets ChEMBL • Species/Organism NCBI Taxonomy ID • Diseases/Symptoms ICD-10 • Named Reactions RXNO • Mechanism of Action ATC • Many of these can also use NLM MeSH Terms. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 5. Example entity dictionary as dag • Nitrogen containing heterocycles as minimal DFA: – Pyrrole, Pyrazole, Imidazole, Pyrdine, Pyridazine, Pyrimidine, Pyrazine • CaffeineFix supports (very large) user dictionaries. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 6. Obo ontologies as dictionaries • In addition to regular TSV (tab-separated value) files for storing dictionaries, LeadMine’s obo2dict also supports OBO ontologies, a convenient method for tracking synonyms and foreign language forms. [Term] id: RXNO:0000006 name: Diels-Alder reaction synonym: "Diels-Alder cycloaddition" EXACT [] synonym: "ディールス・アルダー反応" EXACT Japanese [] 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 7. Plural form generation • LeadMine’s pluralize automatically generates English plural forms from singular dictionary entries. diels-alder couplings RXNO:0000006 diels-alder cycloadditions RXNO:0000006 diels-alder reactions RXNO:0000006 acridine syntheses RXNO:0000518 acyclic beckmann rearrangements RXNO:0000564 acyloin condensations RXNO:0000085 olefin metatheses RXNO:0000280 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 8. Unusual entities • ISBN, URL, PubMed SQL statement • Roman Numerals, Date Solvent Mixture • ColorState, Zip codes Hearst Patterns • Katakana Unknown acid • HELM, InChI, SMILES, v2000 Unknown antibody • Credit Card Numbers Unknown disease • Region Unknown INN • Person Ordinal numbers • Disease Cardinal numbers • Journal de, es, fr, it, sv 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 9. Grammars within grammars • LeadMine grammar’s are specified constructively effectively producing even more entity types. • Region = City + Continent + Country + Island + Lake + Mountain + Ocean + River + Sea + State/Province + OtherFeature + OtherRegion. • City = CityAlbania + CityAndorra + CityAustralia + CityAustria + … + CityUS + … • CityUS = CityUS_AK + CityUS_AL + CityUS_AR + CityUS_AZ + CityUS_CA + CityUS_CO + … 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 10. Pharma registry numbers • CaffineFix v2.0 supports sets of user-defined regular expressions as dictionaries. • One application is specifying the format of registry numbers, such as GSK204454A • Prefix: “A” | “AZ” | “BMY” | “GSK” | “LY” | … • Number: d{3-7} • Suffix: (“.” d) | [“a” .. “z”] • RegistryNumber: Prefix [“ ” | “-”] Number [Suffix] 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 11. Cardinal numbers • English – One, ten, two thousand and forty eight, ten million • German – Eins, Zehn, Hundert, Million, Viermillion – Vierhundertsiebenundzwanzigtausendfünfhundertvierunddreißig • French – Trois cents, un mille, mille neuf cent quatre-vingts dix-huit • Italian – Uno, due, trenta, ottocentosessantamila settecentoottantanove • Swedish – en miljon trehundrasjuttiåtta tusen niohundrasjuttiett 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 12. cas registry number grammar • Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit – e.g. 7732-18-5 • Regular Expression: (([1-9]d{2,5})|([5-9]d))-dd-d 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 13. Cas check digit calculation • More generally CaffeineFix’s finite state machines can do limited processing... • The final check digit of a CAS number is calculated by series term summation modulo 10. • The last digit time 1, the previous digit times 2, the previous digit times 3, and computing the sum modulo 10. • The CAS number for water is 7732-18-5. • The checksum 5 is calculated as (1x8 + 2x1 + 3x2 + 4x3 + 5x7 + 6x7) mod 10 = 5. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 14. Fsm for matching cas check digits 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 15. Fsm for matching cas check digits 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 16. cas number correction example • 7732-18-8? Did you mean... – 7732-18-5 – 7732-11-8 – 77328-18-8 – 7733-18-8 – 77342-18-8 – 77392-18-8 – 71732-18-8 – 76732-18-8 – 97732-18-8 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 17. Roman numerals One useful operator is NonEmpty that removes the empty string from the set of valid matches, and requires at least one or more characters to match. I II III IV V VI VII VIII IX X XX XXX XL L LX LXX LXXX XC C CC CCC CD D DC DCC DCCC CM M MM MMM Thousands Hundreds Tens Units 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 18. Unknown acid • Another operators allows wildcards with exceptions, effectively a not operator. • An unknown acid is “[a-z’-]+ acid” where the first word excludes: – Stop words: a, the, and, any, is, in, was, etc. – Common qualifiers: acceptable, preferred, etc. – Adjectives: battery, free, inorganic, strong, etc. – Known acids: acetic, nitric, amino, carboxylic, etc. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 19. Unknown inn • A variation on this theme allows LeadMine to recognize novel (recently announced) kinase inhibitors and antibodies based on the structure of their INN names. • An unknown kinase inhibitor is “[a-z]+inib” and an unknown antibody is “[a-z]+mab” where the words exclude previously known/reported INN names and “colliding” English words. april != capropril, KappaB != rozrolimupab, yuletide != exenatide, triumvir != zanamivir, etc. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 20. Person grammar • The named person grammar matches: 1. [Salutation] FirstName [Initials] Surname [Suffix] 2. [Salutation] FirstName [Initials] UnknownSurname [Suffix] 3. [Salutation] UnknownFirstName [Initials] Surname [Suffix] • where Salutation includes Mr., Mrs., Dr., Sir, His Highness, … FirstName includes David, John, Sarah, Tom, Angela, … Surname includes Smith, Jones, Overington, … UnknownFirstname excludes Big, Lake, The, Outer, etc. UnknownSurname excludes Avenue, Bridge, Street, etc. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 21. List construction operator • Another frequently used idiom, are the operators for constructing comma separated list. • These turn the grammar matching “X” into the grammar matching things like “X, X, X and X”. • More specifically: (X [ “,” “ ”? X]* (“ and ”| “ or ” | “ and/or ” )? X • Another variation of this allows “other”, “similar” and “related” to the final X if the list is non-empty. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 22. Hearst pattern grammars • An example use of list constructions is in the recognition of Heart Patterns. 1. X such as Y [“including”, “especially” etc.] 2. Y and other X [“and related”, “or similar” etc.] 3. such X as Y • Where X is category or classification term; • And Y is a list of exemplified terms. • Marti A. Hearst, “Automatic Acquisition of Hyponyms from Large Text Corpora”, Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, July 1992. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 23. Complex object builder • An application of the list construction operator is in our “complex object builder” construction operator. ComplexObjectBuilder cob; cob.insert(“red”, “lorry”, “lorries”); cob.insert(“yellow”, “lorry”, “lorries”); • Allows matching not only of “red lorry”, “red lorries”, “yellow lorry” and “yellow lorries” • But also of… “red and yellow lorries”, “yellow and red lorries”, etc. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 24. complex disease examples • Adenomatous polyps of the colon and rectum. • Fibroepithelia or epithelial hyperplasias. • Inherited spinocerebellar ataxia. • Stage II or stage III colorectal cancer. • Inherited breast and overian cancers. • Argentinian, Bolivian and Korean haemorrhagic fevers. • Dermatitis due to heat, cold, radiation, cosmetics, fungi and shellfish. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 25. Grammars for Safety text mining • “May cause lung damage if swallowed” – “may” → “can”, “could”, “may”, “might”, “will”, etc. – “cause” → “lead to”, “result in”, “trigger”, “bring on”, … – “lung damage” → “explosion”, “cancer”, “injury”, … – “if” → “when”, “once”… – “swallowed” → “heated”, “shaken”, “dried”, “ignited”… • “Highly toxic” – “highly” → “very”, “extremely”, “unusually”, “intensely”… – “toxic” → “explosive”, “carcinogenic”, “poisonous”… 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 26. efficient protein variant naming • CaffeineFix technology can also be applied to naming peptides and arbitrary protein variants/mutants. • Consider the a database of the following 11 peptides: – CFFQNCPRG phenylpressin – CFVRNCPTG annetocin – CFWTSCPIG octopressin – CYFQNCPRG argipressin – CYFQNCPKG lypressin – CYFRNCPIG cephalotocin – CYIQNCPLG oxytocin – CYIQNCPPG prol-oxytocin – CYIQNCPRG vasotocin – CYIQSCPIG seritocin – CYISNCPIG isotocin 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 27. Dag representation of sequences These 11 peptides may be efficiently represented and search as a “directed acyclic graph” [38 vs. 99 states] 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 28. entirety of uniprot/swissprot • Using this representation, all 540546 protein sequences in uniprot_sprot, which contains over 192M amino acids, requires 142M states (1.4Gb). • This data structure allows close analogues to be identified much faster than using NCBI blastp. • For example, all 540546 sequences can be queried against this database (i.e. all-against-all) in ~9m30s on a single core on a laptop. • The sequence from PDB 1CRN (crambin 46AA) is canonically named as [L25I]P01542 in 0.002s. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 29. Application to precision medicine • A more realistic example is that sequence of the gene “spastic paraplegia4” with six mutations from OMIM:604277 can be canonically named as [I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0 • Run-time for this query is 0.2s. • By comparison, blastp 2.2.29+ takes about 6s. – With default arguments, NCBI blastp run time is 7s. – Only 6s with –num_descriptions 1 –num_alignments 1. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017
  • 30. summary • LeadMine’s .cfx files can do far more than efficiently match very large dictionaries of terms. • Indeed, many of the grammars used at NextMove Software potentially match an infinite number of terms. • Construction of domain specific grammars can be done in collaboration with LeadMine customers. 253rd ACS National Meeting, San Francisco, CA, Tuesday 4th April 2017