This document summarizes advances in automatic chemical spelling correction. It discusses using edit distances and dynamic programming to correct spelling errors. Examples are given of correcting errors in chemical names, CAS registry numbers, and protein target names. Benchmarking on patent data shows automatic correction improves recall of chemical entity recognition by around 20-40%. The techniques allow fuzzy matching of chemical structures and nomenclature rules.
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
CINF 35: Structure searching for patent information: The need for speedNextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases are also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
Presented by Roger Sayle at the 11th International Conference on Chemical Structures (ICCS) 2018.
Exponential database growth will always be a technical challenge but evolutionary strategies offer to hold off the inevitable in the short term and revolutionary strategies promise a longer-term solution.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
The molfile serves as a de facto standard for chemical information exchange. It is perhaps the most widely supported format with its core syntax being easy to understand, parse, and generate. Beyond the core syntax, more advanced features such as sgroups and enhanced stereochemistry are rarely supported, often only being partially implemented and used. Additionally, several vendors, toolkits, and service providers have added extended syntax to their molfiles to solve particular corner cases or representation problems. This talk will provide a brief summary of the less widely supported features of the molfile including sgroups and enhanced stereochemistry. Additionally, a survey of how extensions can/have been implemented and which extensions exist "in the wild". Special attention will be paid to capturing of coordination bonds for extending the InChI to handle organometallics.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes; it's much easier to follow the path of destruction by locating devastated neighborhoods, than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid) where the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group appears is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches) typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. These tools can be helpful in the analysis of regioselectivity preferences of reactions.
This talk consists of two parts. A technical part describing the recent algorithmic and methodological improvements to the namerxn software, including describing some of the more challenging of the 1000+ reactions it currently identifies. And a scientific part that investigates the regioselective preferences of some of these reactions.
CINF 35: Structure searching for patent information: The need for speedNextMove Software
Chemical databases grow larger every year. Without investing in additional hardware or improved software, the time to search these databases will in turn grow longer annually. With an ever-increasing number of pharmaceutical patents, the amount of chemical data associated with these is growing at a rate with which hardware advances alone cannot keep up.
Using automated mining of U.S. and European patents, we have extracted large collections of structural data in the form of reactions, mixtures, and exemplified compounds. Additional information such as protein targets and diseases are also extracted from each patent and associated with the structural data. We will describe how this data can be queried with natural language phrases and how these phrases are interpreted as structural queries.
Through innovations in substructure and similarity search algorithms, it is possible to search and retrieve hundreds of millions of chemical records in fractions of a second. We will demonstrate how this is achieved on a regular desktop machine using just-in-time and ahead-of-time compilation techniques.
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
Presented by Roger Sayle at the 11th International Conference on Chemical Structures (ICCS) 2018.
Exponential database growth will always be a technical challenge but evolutionary strategies offer to hold off the inevitable in the short term and revolutionary strategies promise a longer-term solution.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable "read-only" Electronic Lab Notebook.
In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. "GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction"). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
The molfile serves as a de facto standard for chemical information exchange. It is perhaps the most widely supported format with its core syntax being easy to understand, parse, and generate. Beyond the core syntax, more advanced features such as sgroups and enhanced stereochemistry are rarely supported, often only being partially implemented and used. Additionally, several vendors, toolkits, and service providers have added extended syntax to their molfiles to solve particular corner cases or representation problems. This talk will provide a brief summary of the less widely supported features of the molfile including sgroups and enhanced stereochemistry. Additionally, a survey of how extensions can/have been implemented and which extensions exist "in the wild". Special attention will be paid to capturing of coordination bonds for extending the InChI to handle organometallics.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Safalta Digital marketing institute in Noida, provide complete applications that encompass a huge range of virtual advertising and marketing additives, which includes search engine optimization, virtual communication advertising, pay-per-click on marketing, content material advertising, internet analytics, and greater. These university courses are designed for students who possess a comprehensive understanding of virtual marketing strategies and attributes.Safalta Digital Marketing Institute in Noida is a first choice for young individuals or students who are looking to start their careers in the field of digital advertising. The institute gives specialized courses designed and certification.
for beginners, providing thorough training in areas such as SEO, digital communication marketing, and PPC training in Noida. After finishing the program, students receive the certifications recognised by top different universitie, setting a strong foundation for a successful career in digital marketing.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Advances in Automatic Chemical Spelling Correction
1. Advances in Automatic Chemical
Spelling Correction
Roger Sayle and Daniel Lowe
NextMove Software
Cambridge, UK
ACS National Meeting, Philadelphia, USA 19th August 2012
2. example spelling errors
• Sample misspellings of pyrimidine in US patent
grants since the beginning of this year.
Incorrect Name US Patent No. Issue Date
pryimidine 8093264 10th January 2012
pyrmidine 8097728 17th January 2012
pyrimdine 8114996 14th February 2012
pyrimidne 8129897 6th March 2012
pyrimidinc 8148339 3rd April 2012
pyridmidine 8158627 8th May 2012
ACS National Meeting, Philadelphia, USA 19th August 2012
3. previous work
• Roger Sayle, Paul Hongxing Xie and Sorel Muresan,
“Improved Chemical Text Mining of Patents with
Infinite Dictionaries and Automatic Spelling
Correction”, Journal of Chemical Information and
Modeling, Vol. 52, No. 1, pp. 51-62, 2012.
• G.H. Kirby, M.R. Lord, J.D. Rayner, “Computer
Translation of IUPAC Systematic Organic Chemical
Nomenclature. 6. (Semi)automatic name correction,
Journal of Chemical Information and Computer
Science, Vol. 31, pp. 153-160, 1991.
ACS National Meeting, Philadelphia, USA 19th August 2012
4. STring EDIt Distance
• The Levenshtein Distance (Levenshtein 1965) is the
minimum number of edits (insertions, deletions or
substitutions) required to transform one string into
another.
• The Damerau-Levenshtein Distance is an extension of
Levenshtein Distance to include transposition of two
adjacent characters.
• The distances can be efficiently computed with
dynamic programming using the Needleman-
Wunsch-Sellers alignment algorithm (bioinformatics).
ACS National Meeting, Philadelphia, USA 19th August 2012
5. needleman-wunsch-sellers
c l o r a m p h e n i c a l
c
h
l
o
r
a
m
p
h
e
n
i
c
o
l
ACS National Meeting, Philadelphia, USA 19th August 2012
6. needleman-wunsch-sellers
c l o r a m p h e n i c a l
c
h
l
o
r
a
m
p
h
e
n
i
c
o
l
ACS National Meeting, Philadelphia, USA 19th August 2012
7. further complications
• Edit operations can have their own specific penalties.
• The latest implementation supports transpositions,
to catch spelling mistakes such as “chlorofrom”.
• Some dictionaries to match against have tens of
millions of entries, others are infinite.
• The start and end of the input isn’t known in free-
text and are assigned on the quality of the match.
• Correct nesting of parenthesis and brackets needs to
be enforced as part of the matching process.
• In summary - It’s mind bogglingly complicated.
ACS National Meeting, Philadelphia, USA 19th August 2012
8. dictionaries as automata
• Nitrogen containing heterocycles as minimal DFA:
– Pyrrole, Pyrazole, Imidazole, Pyrdine, Pyridazine,
Pyrimidine, Pyrazine
ACS National Meeting, Philadelphia, USA 19th August 2012
9. Example iupac-like grammar
• More generally, still CaffeineFix FSMs can represent
formal grammars, i.e. infinite dictionaries.
alk := “meth” | “eth” | “prop” | “but”
parent := alk “ane”
subst := “bromo” | “chloro” | “fluoro”
locant := “1” | “2” | “3” | “4” /* any digit */
prefix := [ prefix “-” ] [ loc “-” ] subst
| [ prefix ] subst
name := [ prefix [ “-” ] ] parent
ACS National Meeting, Philadelphia, USA 19th August 2012
10. Iupac-like grammar examples
• methane
• chloroethane
• 2-bromo-propane
• chloro-bromo-methane
• 1-fluoro-2-chloro-ethane
• chlorofluoromethane
• 4-bromomethane
• 1-chloro-1-chloro-1-chloro-methane
ACS National Meeting, Philadelphia, USA 19th August 2012
11. Representing grammars as dFAs
Backward edges allow matching an infinite number of words.
ACS National Meeting, Philadelphia, USA 19th August 2012
12. current iupac grammar FSM
• As of July 2012, the current CaffeineFix grammar
contains nearly 1.2 million edges.
• This grammar covers...
– 99.15% (232144/234142) names in the NCI00 database.
– 95.28% (67995/71367) names in the Maybridge catalogue.
– 95.27% (25890/48167) names in the Keyorganics catalogue.
• These figures are comparable to name-to-structure
conversion rates on these names.
ACS National Meeting, Philadelphia, USA 19th August 2012
14. low cost “frequent” edit ops
• A number of common corrections are so frequent as
to be given a lower (free) cost.
1. Deletion of whitespace.
2. Deletion of a hyphen (where not anticipated)
3. Substitution of “l” (lower case el) for “1” (one).
4. Substitution of “I” (upper case ey) for “l” (el) or “1” (one).
5. Substitution of “rn” by “m”.
6. Substitution of “1” (one) by “l” (el).
7. Substitution of “φ” by “rp” [OCR artifact].
ACS National Meeting, Philadelphia, USA 19th August 2012
15. handle with care
• Alas, introducing automatic spelling correction (fuzzy
matching) to entity recognition often requires the
introduction of white word lists to avoid problems.
– herein → heroin
– aspiring → aspirin
– cranium → uranium
– ability → abilify
• More aggressive correction leads to more problems:
– “be that the line” → methantheline
– “park on a zone” → parconazole
ACS National Meeting, Philadelphia, USA 19th August 2012
16. benchmarking and analysis
• To quantify the benefits of automatic spelling
correction to “real world” chemical text mining we
analysed the first 28 weeks of US patent grants from
2012.
• This corresponds to 145,473 documents, issued
between 3rd January 2012 and 10th July 2012.
• A total of 4,061,670 IUPAC-like systematic names
were identified, with 1,816,317 unique patent/name
pairs.
• OPSIN interprets 3,722,399 and 1,647,402.
ACS National Meeting, Philadelphia, USA 19th August 2012
17. total molecule entities
All Extracted Molecule Entities OPSIN Interpeted Molecule
Entities
434487,
97472, 87719, 12%
2% 552533, 2%
14%
Correct
Correct
Simple
Simple
D=1
D=1
3411665, 3200193
84% , 86%
Using correction retrieves ~19% more entities
and ~16% more OPSIN recognizable names.
ACS National Meeting, Philadelphia, USA 19th August 2012
18. unique molecule entries
All Extracted Molecule Entities OPSIN Interpeted Molecule
Entities
75027,
4%
310359, 67317,
4% 236330,
17% 14%
Correct Correct
Simple Simple
D=1 D=1
1430931,
79% 1343755,
82%
The effect is more pronounced with patent-cmpnd
pairs with a +27% improvement over no correction.
ACS National Meeting, Philadelphia, USA 19th August 2012
19. influence on n2s software (opsin)
Interpretation Count Fraction
Not Interpretable 116216 17.88%
Before not After 11779 1.81%
After not Before 270453 41.61%
Same Before/After 181693 27.95%
Different Before/After 69864 10.75%
Total 650005 100.00%
Although some valid names are lost by correction,
overall the effect is overwhelmingly positive.
ACS National Meeting, Philadelphia, USA 19th August 2012
20. break-down of edit operations
Edit Operation Count Fraction
Deletion 392324 47.50%
Insertion 232493 28.15%
Substitution 198438 24.03%
Transposition 2670 0.32%
Total 825,925 100.00%
ACS National Meeting, Philadelphia, USA 19th August 2012
21. drug dictionary entities
All Drug Dictionary Entities Unique Drug Dictionary Entities
71684, 30687,
6141, 0% 7% 4119, 1% 7%
Correct Correct
Simple Simple
D=1 D=1
1013711, 395482,
93% 92%
Some improvement is seen with drug dictionaries,
but there’s little benefit in fixing simple OCR issues.
ACS National Meeting, Philadelphia, USA 19th August 2012
23. non-word spelling correction
• Automatic correction technology can also be applied
to entities other than words or IUPAC-like chemical
nomenclature.
• For example, Chemical Abstract Service’s registry
numbers.
ACS National Meeting, Philadelphia, USA 19th August 2012
24. cas registry number grammar
9
0-9
-
8 -
0-9
7 -
0-9 0-9 0-9 - 0-9
10 11 12 13 14
-
0-9 6
0-9 0-9
1-4 1 3
5
0 0-9 -
5-9
0-9
2 4
-
• Two to seven digits, followed by a hyphen, two digits,
a hyphen and a final check digit
– e.g. 7732-18-5
• Regular Expression: (([1-9]d{2,5})|([5-9]d))-dd-d
ACS National Meeting, Philadelphia, USA 19th August 2012
25. Cas check digit calculation
• More generally CaffeineFix’s finite state machines
can do limited processing...
• The final check digit of a CAS number is calculated by
series term summation modulo 10.
• The last digit time 1, the previous digit times 2, the
previous digit times 3, and computing the sum
modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
4x3 + 5x7 + 6x7) mod 10 = 5.
ACS National Meeting, Philadelphia, USA 19th August 2012
27. cas number correction example
• 7732-18-8? Did you mean...
– 7732-18-5
– 7732-11-8
– 77328-18-8
– 7733-18-8
– 77342-18-8
– 77392-18-8
– 71732-18-8
– 76732-18-8
– 97732-18-8
ACS National Meeting, Philadelphia, USA 19th August 2012
28. take home message
• Adding advanced automatic chemical spelling
correction to an annotation pipeline typically
improves recall by about 20-40%.
– Andrew Hinton, “Benchmarking ChemAxon’s Name-to-Structure batch
tool on Patent Data”, 2011 ChemAxon EUGM, Budapest.
– Sorel Muresan, “Automated Spelling Correction to Improve Recall
Rates of Name-to-Structure Tools for Chemical Text Mining”, 2011
ChemAxon EUGM.
ACS National Meeting, Philadelphia, USA 19th August 2012
29. acknowledgements
• Daniel Lowe, NextMove Software.
• Sorel Muresan and Paul Hongxing Xie, AstraZeneca.
• Thank you for your time.
• Any questions?
ACS National Meeting, Philadelphia, USA 19th August 2012