
In grammars we trust: LeadMine, a knowledge driven solution

NextMove Software

We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources we employ spelling correction, entity extension, and merging of adjacent entities. Recall is enhanced by the use of abbreviation detection and precision is enhanced by the removal of abbreviations of non-entities. With the use of training data to produce further dictionaries of terms to recognize/ignore, our system achieved 86.2% precision and 85.0% recall on an unused development set.
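The abbreviation handling described in the abstract can be illustrated with a minimal sketch. It assumes abbreviations are introduced as "long form (ABBR)" and that some recognizer decides whether the long form is a chemical entity; the `ABBR` pattern, the `long_form` heuristic, and the `is_entity` callback are all hypothetical and are not taken from LeadMine.

```python
import re

# Assumed pattern: a parenthesised token of 2-10 characters starting with a letter.
ABBR = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

def long_form(words, abbr):
    """Shortest run of trailing words whose first word starts with abbr's first letter."""
    limit = min(len(words), len(abbr) + 2)
    for n in range(1, limit + 1):
        if words[-n][0].lower() == abbr[0].lower():
            return " ".join(words[-n:])
    return None

def abbreviation_sets(text, is_entity):
    """Return (recognize, ignore): abbreviations of entities and of non-entities."""
    recognize, ignore = set(), set()
    for match in ABBR.finditer(text):
        abbr = match.group(1)
        candidate = long_form(text[:match.start()].split(), abbr)
        if candidate is None:
            continue  # no plausible long form, so make no decision
        (recognize if is_entity(candidate) else ignore).add(abbr)
    return recognize, ignore

# Toy recognizer standing in for the grammars and dictionaries.
known = {"dimethyl sulfoxide"}
text = ("Samples were dissolved in dimethyl sulfoxide (DMSO) and pain was "
        "scored on the visual analogue scale (VAS).")
recognize, ignore = abbreviation_sets(text, lambda s: s.lower() in known)
```

Abbreviations in `recognize` would then be annotated wherever they recur (raising recall), while those in `ignore` would be suppressed (raising precision).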

BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013
In grammars we trust: LeadMine,
a knowledge driven solution
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
Approaches to entity recognition
• Dictionary based
• Grammar based
• Machine Learning
LeadMine
Normalization
Input | Normalized
œstradiol | oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime) | 5'
<p>H<sub>2</sub>O</p> | H2O
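The three normalizations in the table above can be sketched as follows. This is only an illustration covering the rows shown on the slide; the mapping tables and the `normalize` function are assumptions, not LeadMine's actual rules.

```python
import re

# Hypothetical mapping tables covering only the cases shown on the slide.
LIGATURES = {"œ": "oe", "Œ": "OE"}
PRIME_LIKE = {"`": "'", "\u2019": "'", "\u2032": "'"}  # backtick, right quote, prime
TAG = re.compile(r"</?[A-Za-z][^>]*>")  # simple HTML/XML tags such as <sub> or </p>

def normalize(text: str) -> str:
    """Strip markup tags, expand ligatures, and unify prime-like characters."""
    text = TAG.sub("", text)
    for source, target in {**LIGATURES, **PRIME_LIKE}.items():
        text = text.replace(source, target)
    return text
```

Note that stripping `<sub>` discards the subscript information, which matches the slide's `H2O` output but may not be appropriate in every context.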
Blue: Grammars
Green: Traditional dictionaries
Orange: Blocking dictionaries
Advantages of grammars
• Don’t require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near match to one recognized by the grammar
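To make the last point concrete, here is a minimal sketch of grammar-guided spelling correction. It assumes a toy grammar written as a regular expression and a brute-force search over strings one edit away from the input; both `GRAMMAR` and the search strategy are illustrative assumptions, since the slides do not describe how LeadMine implements this.

```python
import re
import string

# Toy grammar (an assumption): a chain-length stem followed by a suffix.
GRAMMAR = re.compile(r"(meth|eth|prop|but|pent|hex|hept|oct)(ane|ene|yne|anol)")

def edits1(word):
    """All strings one deletion, substitution, insertion, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | substitutes | inserts | transposes

def correct(word):
    """Return the word if the grammar accepts it, a unique near match, or None."""
    if GRAMMAR.fullmatch(word):
        return word
    candidates = {w for w in edits1(word) if GRAMMAR.fullmatch(w)}
    return candidates.pop() if len(candidates) == 1 else None
```

In this sketch `correct("propnae")` yields `"propane"`, whereas `correct("ethne")` returns `None` because ethane, ethene, and ethyne are all one edit away. Refusing to guess when the correction is ambiguous is a design choice of the sketch.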

Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 

Recently uploaded

Masterclass Unlocking Booking & Revenue Surge with Proven Strategies
Masterclass Unlocking Booking & Revenue Surge with Proven StrategiesMasterclass Unlocking Booking & Revenue Surge with Proven Strategies
Masterclass Unlocking Booking & Revenue Surge with Proven StrategiesRezStream
 
Business Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfBusiness Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfyashvardhanesecure
 
Brumby Geotrail
Brumby GeotrailBrumby Geotrail
Brumby GeotrailJaap Spee
 
Business Visa India
Business Visa IndiaBusiness Visa India
Business Visa Indianagen92928
 
MICE Presentation - Stettin Convention Bureau
MICE Presentation - Stettin Convention BureauMICE Presentation - Stettin Convention Bureau
MICE Presentation - Stettin Convention BureauMICEboard
 
Business Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfBusiness Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfyashvardhanesecure
 

Recently uploaded (8)

Masterclass Unlocking Booking & Revenue Surge with Proven Strategies
Masterclass Unlocking Booking & Revenue Surge with Proven StrategiesMasterclass Unlocking Booking & Revenue Surge with Proven Strategies
Masterclass Unlocking Booking & Revenue Surge with Proven Strategies
 
Business Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfBusiness Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdf
 
Brumby Geotrail
Brumby GeotrailBrumby Geotrail
Brumby Geotrail
 
Business Visa India
Business Visa IndiaBusiness Visa India
Business Visa India
 
MICE Presentation - Stettin Convention Bureau
MICE Presentation - Stettin Convention BureauMICE Presentation - Stettin Convention Bureau
MICE Presentation - Stettin Convention Bureau
 
DATA EMPLOYED PT HARRIS IJEN JAYA MANDIRI.pdf
DATA EMPLOYED PT HARRIS IJEN JAYA MANDIRI.pdfDATA EMPLOYED PT HARRIS IJEN JAYA MANDIRI.pdf
DATA EMPLOYED PT HARRIS IJEN JAYA MANDIRI.pdf
 
Business Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdfBusiness Visa India for Japan Citizens 1.pdf
Business Visa India for Japan Citizens 1.pdf
 
NATURE IS ETERNALLY BEAUTIFUL IN COLD DESERT LADAKH.
NATURE IS ETERNALLY BEAUTIFUL IN COLD DESERT LADAKH.NATURE IS ETERNALLY BEAUTIFUL IN COLD DESERT LADAKH.
NATURE IS ETERNALLY BEAUTIFUL IN COLD DESERT LADAKH.
 

In grammars we trust: LeadMine, a knowledge driven solution

  • 1. In grammars we trust: LeadMine, a knowledge driven solution. Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK. BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA, 8th October 2013.
  • 2. Approaches to entity recognition
      • Dictionary based
      • Grammar based
      • Machine learning
      [Slide label: LeadMine]
  • 3. [Figure: only the label “Optional” is recoverable]
  • 4. Normalization
      Input                                              Normalized
      œstradiol                                          oestradiol
      5` or 5’ or 5′ (backtick/quotation mark/prime)     5'
      <p>H<sub>2</sub>O</p>                              H2O
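The normalization table can be illustrated with a minimal Python sketch. It covers only the three cases shown on the slide; the function name `normalize` and the replacement tables are assumptions made for illustration, not LeadMine's actual implementation.

```python
import re

# Hypothetical replacement tables covering only the cases shown on the slide.
LIGATURES = {"œ": "oe", "Œ": "OE"}
PRIME_LIKE = "`\u2019\u2032"  # backtick, right single quotation mark, prime
TAG = re.compile(r"<[^>]+>")  # naive markup stripper, adequate for simple tags


def normalize(text: str) -> str:
    """Strip markup, expand ligatures and map prime-like characters to an apostrophe."""
    text = TAG.sub("", text)
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return text.translate({ord(c): "'" for c in PRIME_LIKE})
```

For example, `normalize("<p>H<sub>2</sub>O</p>")` gives `H2O`, so a dictionary only needs to hold the normalized spelling.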
  • 5. [Figure legend] Blue: grammars. Green: traditional dictionaries. Orange: blocking dictionaries.
  • 6. Advantages of grammars
      • Don’t require annotated corpora
      • Encode knowledge about the domain
      • Very fast recognition
      • Allow spelling correction if an entity is a near match to one recognized by the grammar
  • 7. Simple grammar example
      Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
      Digit : Digit1to9 | '0'
      Cid : 'CID:' Digit1to9 Digit*
      [Diagram: state machine accepting C, I, D, ':', one digit 1..9, then any number of digits 0..9]
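Because the Cid grammar is regular, it can be mirrored by a regular expression. The Python sketch below is only an analogy for how such a grammar recognizes text without a separate tokenization step; `find_cids` is a hypothetical helper, not part of LeadMine.

```python
import re

# Each grammar rule becomes a regex fragment.
DIGIT_1_TO_9 = "[1-9]"
DIGIT = "[0-9]"
CID = re.compile(f"CID:{DIGIT_1_TO_9}{DIGIT}*")


def find_cids(text: str) -> list[str]:
    """Return every substring of text that matches the Cid rule."""
    return [match.group(0) for match in CID.finditer(text)]
```

An identifier with a leading zero, such as `CID:0123`, is rejected because the first digit must come from `Digit1to9`.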
  • 8. Grammar for IUPAC names
      • Grammar for complete molecules: 485 rules
        – trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
        – ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
      • Generally aims to match a superset of the nomenclature covered by IUPAC
      • Specifically, this is the superset that can theoretically be converted to structures
  • 9. Grammar inheritance
      • Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar
        – Inherit rules rather than duplicate them
        – Allow overriding of rules
      pluralizedChemical : chemical 's'
      elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
  • 10. Dictionaries… bigger is better
      • For high recall of trivial names, dictionaries with high coverage are required
      • The largest publicly available dictionary is PubChem, with over 94 million terms
      • However, most of these terms are either not useful or actually detrimental to text mining
  • 11. Aggressive filtering
      • “what you don't see won't hurt you”
      • Hence remove terms that are also English words or start with an English word
        – Accomplished using a large English dictionary with chemistry terms removed
      • Remove internal identifiers used by depositors
      • Remove terms that are matched by our grammars
      • Ultimate result: 94 million → 2.94 million
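The three filtering rules can be sketched as a single predicate. Everything specific in this Python sketch is an assumption made for illustration: the `INTERNAL_ID` pattern, the reading of "starts with an English word" as "first whitespace-delimited token is an English word", and the `matches_grammar` callback standing in for the real grammars.

```python
import re

# Hypothetical shape of a depositor-internal identifier (letters then 6+ digits).
INTERNAL_ID = re.compile(r"[A-Z]+[-_]?\d{6,}")


def keep_term(term, english_words, matches_grammar):
    """Return True if a dictionary term survives all three filters."""
    tokens = term.lower().split()
    if not tokens:
        return False
    # Rule 1: the term is an English word or starts with one.
    if term.lower() in english_words or tokens[0] in english_words:
        return False
    # Rule 2: the term looks like an internal identifier.
    if INTERNAL_ID.fullmatch(term):
        return False
    # Rule 3: the grammars already recognize the term.
    if matches_grammar(term):
        return False
    return True
```

With an English word list containing "can" and "bright", the terms "CAN" and "bright orange dye" are dropped while "aspirin" is kept.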
  • 12. Structure-aware filtering
      • “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”
      • About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
  • 13. Entity extension
      • Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits
        – α-santalol can be recognized from santalol in the dictionary
      • Extension is bracketing aware and blocked by English words
      • Entity trimming is also performed to comply with the annotation guidelines
        – ‘Allura Red AC dye’ → ‘Allura Red AC’
  • 14. Entity merging
      • Adjacent entities may actually be part of one entity
        – Ethyl ester → one entity
        – (+)-limonene epoxide → one entity
      BUT
        – Hexane-benzene → two entities
  • 15. Using an ontology to determine when terms add information
      • Genistein isoflavone → two entities
      • Glycine ester → one entity
      [Figure: genistein showing isoflavone core structure]
  • 16. Abbreviation detection
      • Based on the Schwartz and Hearst algorithm
      • Detects abbreviations of the following forms:
        – Tetrahydrofuran (THF)
        – THF (tetrahydrofuran)
        – Tetrahydrofuran (THF;
        – Tetrahydrofuran (THF,
        – (tetrahydrofuran, THF)
        – THF = tetrahydrofuran
      Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
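The core of the cited algorithm is a right-to-left character match between the short form and the text preceding it. The Python sketch below ports that matching step as described by Schwartz and Hearst; the wrapper `abbreviations_in` is a hypothetical simplification that handles only the "long form (SHORT" patterns and omits the paper's limit on how many preceding words are searched. It is not LeadMine's implementation.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short_form against candidate from right to left.

    Returns the shortest suffix of candidate (starting at a word boundary)
    that contains the short form's characters in order, or None.
    """
    s = len(short_form) - 1
    pos = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if ch.isalnum():
            # Move left until the character matches; the first character of
            # the short form must additionally sit at the start of a word.
            while (pos >= 0 and candidate[pos].lower() != ch) or (
                s == 0 and pos > 0 and candidate[pos - 1].isalnum()
            ):
                pos -= 1
            if pos < 0:
                return None
            pos -= 1
        s -= 1
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


def abbreviations_in(sentence):
    """Find (short form, long form) pairs for 'long form (SHORT' patterns."""
    pairs = []
    for match in re.finditer(r"\(([A-Za-z0-9-]{2,10})[;,)]", sentence):
        short = match.group(1)
        long_form = find_best_long_form(short, sentence[: match.start()].rstrip())
        if long_form:
            pairs.append((short, long_form))
    return pairs
```

The remaining forms on the slide (short form first, or "THF = tetrahydrofuran") would need additional patterns that swap which side is treated as the long-form candidate.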
  • 17. Domain-specific abbreviations
      • Some abbreviations are not acronyms
      • Can use string replacements to recognize them, e.g.
        – Sodium → Na
        – Estradiol → E2
      Hence can recognize: 17α-ethinylestradiol → EE2
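One way to read this slide is that the long form is rewritten with the string replacements before it is compared with the short form. The Python sketch below assumes that reading. The `REPLACEMENTS` table holds only the two examples from the slide, and `is_subsequence` is a deliberately simplified stand-in for the real abbreviation matcher.

```python
import re

# Replacement table limited to the two examples given on the slide.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def rewrite_long_form(long_form):
    """Apply the domain-specific replacements, ignoring case."""
    for name, symbol in REPLACEMENTS.items():
        long_form = re.sub(re.escape(name), symbol, long_form, flags=re.IGNORECASE)
    return long_form


def is_subsequence(short_form, long_form):
    """True if the short form's characters occur in order in the long form."""
    remaining = iter(long_form.lower())
    return all(ch in remaining for ch in short_form.lower())


def could_abbreviate(short_form, long_form):
    """Check the short form against the rewritten long form."""
    return is_subsequence(short_form, rewrite_long_form(long_form))
```

Without the rewrite, "EE2" cannot match "17α-ethinylestradiol" because the long form contains no "2"; after the rewrite the long form becomes "17α-ethinylE2" and the match succeeds.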
  • 18. Non-entity abbreviation removal
      • Finds entities that were detected as abbreviations of unrecognized entities
        – This can mean that a common chemical abbreviation has been redefined within the scope of the document
      • Example: “current good manufacturing practice (cGMP)”, whereas usually cGMP = cyclic guanosine monophosphate [structure image]
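The removal step can be sketched as a filter over the entities found in one document. In this Python sketch the names `remove_redefined` and `is_entity` are hypothetical, and the abbreviation map is assumed to come from an earlier abbreviation detection pass.

```python
def remove_redefined(entities, abbreviations, is_entity):
    """Drop entities whose abbreviation was defined as a non-entity.

    entities: entity strings found in the document.
    abbreviations: dict mapping short form to the long form defined in the document.
    is_entity: predicate telling whether a long form is a recognized entity.
    """
    blocked = {
        short for short, long_form in abbreviations.items() if not is_entity(long_form)
    }
    return [entity for entity in entities if entity not in blocked]
```

In the slide's example, "cGMP" is defined as "current good manufacturing practice", which is not a recognized entity, so every "cGMP" hit in that document is discarded.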
  • 19. Making the most of the knowledge provided
      • Use training data to identify:
        – Terms that are not currently recognized (whitelist)
        – Terms that are often false positives (blacklist)
      • Each false positive and false negative is placed into such a list if its inclusion increased F-score (harmonic mean of precision and recall)
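The selection criterion can be sketched as a greedy loop over candidate terms. The slide states only the acceptance rule, so the Python sketch below makes up the bookkeeping: each candidate is assumed to come with counts of how many true positives and false positives it would add or remove on the training data, and candidates are tried in the order given.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def select_terms(counts, whitelist_candidates, blacklist_candidates):
    """Greedily accept a candidate only if it raises the F-score.

    counts: (tp, fp, fn) of the baseline system.
    whitelist_candidates: (term, gained_tp, added_fp) tuples.
    blacklist_candidates: (term, removed_fp, lost_tp) tuples.
    """
    tp, fp, fn = counts
    whitelist, blacklist = [], []
    for term, gained_tp, added_fp in whitelist_candidates:
        trial = (tp + gained_tp, fp + added_fp, fn - gained_tp)
        if f_score(*trial) > f_score(tp, fp, fn):
            tp, fp, fn = trial
            whitelist.append(term)
    for term, removed_fp, lost_tp in blacklist_candidates:
        trial = (tp - lost_tp, fp - removed_fp, fn + lost_tp)
        if f_score(*trial) > f_score(tp, fp, fn):
            tp, fp, fn = trial
            blacklist.append(term)
    return whitelist, blacklist, f_score(tp, fp, fn)
```

A whitelist term that recovers three missed mentions at the cost of one false positive is accepted, while one that recovers a single mention but adds five false positives is rejected.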
  • 20. CEM task results (on development set)
      Configuration           Precision   Recall   F-score
      Baseline                0.87        0.82     0.84
      WhiteList               0.86        0.85     0.86
      BlackList               0.88        0.80     0.84
      WhiteList + BlackList   0.87        0.83     0.85
  • 21. CDI task ranking
      • Uses precision of entities when running against the development set, with the results broken down by:
        – Title vs abstract?
        – Which dictionary matched?
        – Were the entity’s bounds modified?
        – Did the entity occur more than once in the document?
  • 22. Conclusions
      • Grammars complement dictionaries to allow recognition of novel entities
      • Both the coverage and quality of dictionaries are important
      • The meaning of novel abbreviations can be determined algorithmically
      • Entities can be classified based on the resource that recognized them
  • 23. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com