Advances in Automatic Chemical
     Spelling Correction

                Roger Sayle and Daniel Lowe
                    NextMove Software
                       Cambridge, UK

 ACS National Meeting, Philadelphia, USA 19th August 2012
example spelling errors

• Sample misspellings of pyrimidine in US patent
  grants since the beginning of this year.

       Incorrect Name                  US Patent No.            Issue Date
       pryimidine                      8093264                  10th January 2012
       pyrmidine                       8097728                  17th January 2012
       pyrimdine                       8114996                  14th February 2012
       pyrimidne                       8129897                  6th March 2012
       pyrimidinc                      8148339                  3rd April 2012
       pyridmidine                     8158627                  8th May 2012


previous work

• Roger Sayle, Paul Hongxing Xie and Sorel Muresan,
  “Improved Chemical Text Mining of Patents with
  Infinite Dictionaries and Automatic Spelling
  Correction”, Journal of Chemical Information and
  Modeling, Vol. 52, No. 1, pp. 51-62, 2012.
• G.H. Kirby, M.R. Lord, J.D. Rayner, “Computer
  Translation of IUPAC Systematic Organic Chemical
  Nomenclature. 6. (Semi)automatic Name Correction”,
  Journal of Chemical Information and Computer
  Sciences, Vol. 31, pp. 153-160, 1991.
String edit distance

• The Levenshtein Distance (Levenshtein 1965) is the
  minimum number of edits (insertions, deletions or
  substitutions) required to transform one string into
  another.
• The Damerau-Levenshtein Distance is an extension of
  Levenshtein Distance to include transposition of two
  adjacent characters.
• The distances can be efficiently computed with
  dynamic programming using the Needleman-
  Wunsch-Sellers alignment algorithm (bioinformatics).
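
A minimal sketch of the dynamic-programming calculation described above, in Python. It uses unit costs and the restricted ("optimal string alignment") form of Damerau-Levenshtein distance; the function name and the `transpositions` flag are illustrative and are not taken from CaffeineFix.

```python
def edit_distance(source, target, transpositions=True):
    """Edit distance between two strings with unit costs.

    With transpositions=False this is plain Levenshtein distance.
    With transpositions=True a swap of two adjacent characters
    counts as one edit (restricted Damerau-Levenshtein).
    """
    rows, cols = len(source) + 1, len(target) + 1
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete every character of source[:i]
    for j in range(cols):
        d[0][j] = j          # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                # adjacent characters swapped
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(edit_distance("pryimidine", "pyrimidine"))
    print(edit_distance("pryimidine", "pyrimidine", transpositions=False))
```

The misspelling "pryimidine" from the patent table is one transposition away from "pyrimidine", but two substitutions away when transpositions are not allowed.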
Needleman-Wunsch-Sellers

[Figure: dynamic-programming alignment matrix with “cloramphenical” across the columns and “chloramphenicol” down the rows.]

further complications

• Edit operations can have their own specific penalties.
• The latest implementation supports transpositions,
  to catch spelling mistakes such as “chlorofrom”.
• Some dictionaries to match against have tens of
  millions of entries, others are infinite.
• The start and end of a name aren’t known in free
  text; they are assigned on the quality of the match.
• Correct nesting of parentheses and brackets needs to
  be enforced as part of the matching process.
• In summary: it’s mind-bogglingly complicated.
dictionaries as automata




• Nitrogen containing heterocycles as minimal DFA:
  – Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine,
    Pyrimidine, Pyrazine
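
A rough sketch of fuzzy matching against a dictionary stored as an automaton. It assumes a plain prefix tree (trie) rather than the minimal DFA shown on the slide, and unit-cost Levenshtein distance without transpositions; `build_trie`, `fuzzy_lookup` and the `END` marker are hypothetical names.

```python
END = "$"  # marker key for "a word ends here"; assumed absent from the words

def build_trie(words):
    """Store words in a nested-dict prefix tree (a non-minimal DFA)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def fuzzy_lookup(trie, query, max_distance):
    """Return {word: distance} for every word within max_distance of query."""
    results = {}

    def walk(node, ch, prev_row):
        # Extend the Levenshtein table by one row for the character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,           # insertion
                           prev_row[j] + 1,          # deletion
                           prev_row[j - 1] + cost))  # match/substitution
        if END in node and row[-1] <= max_distance:
            results[node[END]] = row[-1]
        # Abandon this branch once every cell exceeds the budget.
        if min(row) <= max_distance:
            for next_ch, child in node.items():
                if next_ch != END:
                    walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != END:
            walk(child, ch, first_row)
    return results

HETEROCYCLES = ["pyrrole", "pyrazole", "imidazole", "pyridine",
                "pyridazine", "pyrimidine", "pyrazine"]

if __name__ == "__main__":
    trie = build_trie(HETEROCYCLES)
    print(fuzzy_lookup(trie, "pyrimidinc", 1))
    print(fuzzy_lookup(trie, "pyrmidine", 1))
```

Sharing prefixes means one table row is computed per automaton state rather than per dictionary entry. Note that "pyrmidine" is one edit away from both "pyrimidine" and "pyridine", so a distance threshold alone does not always give a unique correction.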
Example IUPAC-like grammar

• More generally still, CaffeineFix FSMs can represent
  formal grammars, i.e. infinite dictionaries.

  alk := “meth” | “eth” | “prop” | “but”
  parent := alk “ane”
  subst := “bromo” | “chloro” | “fluoro”

  locant := “1” | “2” | “3” | “4” /* any digit */
  prefix := [ prefix “-” ] [ locant “-” ] subst
          | [ prefix ] subst
  name := [ prefix [ “-” ] ] parent


IUPAC-like grammar examples

•   methane
•   chloroethane
•   2-bromo-propane
•   chloro-bromo-methane
•   1-fluoro-2-chloro-ethane
•   chlorofluoromethane

• 4-bromomethane
• 1-chloro-1-chloro-1-chloro-methane
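
Because the toy grammar is regular, it can be sketched as an ordinary regular expression. The Python below is an illustrative translation, not CaffeineFix code; the constant names are made up, and the locant is taken to be any single digit as the grammar's comment suggests.

```python
import re

ALK = r"(?:meth|eth|prop|but)"
PARENT = ALK + "ane"
SUBST = r"(?:bromo|chloro|fluoro)"
LOCANT = r"[0-9]"

# One substituent, optionally preceded by "locant-".
ITEM = rf"(?:{LOCANT}-)?{SUBST}"
# prefix := [ prefix "-" ] [ locant "-" ] subst | [ prefix ] subst
# i.e. a first item, then any number of "-item" or directly attached subst.
PREFIX = rf"{ITEM}(?:-{ITEM}|{SUBST})*"
# name := [ prefix [ "-" ] ] parent
NAME = re.compile(rf"(?:{PREFIX}-?)?{PARENT}")

def is_name(text):
    """True if the whole string is generated by the toy grammar."""
    return NAME.fullmatch(text) is not None

EXAMPLES = [
    "methane",
    "chloroethane",
    "2-bromo-propane",
    "chloro-bromo-methane",
    "1-fluoro-2-chloro-ethane",
    "chlorofluoromethane",
    "4-bromomethane",
    "1-chloro-1-chloro-1-chloro-methane",
]

if __name__ == "__main__":
    for example in EXAMPLES:
        print(example, is_name(example))
```

The `*` in `PREFIX` plays the role of a backward edge in the automaton, which is what lets a finite machine accept an unbounded set of names. The last two examples are accepted even though they are chemically dubious: the grammar checks syntax only.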
Representing grammars as DFAs




 Backward edges allow matching an infinite number of words.

Current IUPAC grammar FSM

• As of July 2012, the current CaffeineFix grammar
  contains nearly 1.2 million edges.
• This grammar covers...
   – 99.15% (232144/234142) names in the NCI00 database.
   – 95.28% (67995/71367) names in the Maybridge catalogue.
   – 95.27% (45890/48167) names in the Keyorganics catalogue.
• These figures are comparable to name-to-structure
  conversion rates on these names.


Spelling correction examples

• lH-ben zimidazole → 1H-benzimidazole
• triphenylposhine → triphenylphosphine
• 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL-
  PYRAZOL-1-YL] BENZOIC ACID →
  4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1-
  yl]benzoic acid
• didec-2-ene → dodec-2-ene
• spiro[2.2]hexane → spiro[2.3]hexane


low cost “frequent” edit ops

• A number of common corrections are so frequent as
  to be given a lower (free) cost.
  1.   Deletion of whitespace.
  2.   Deletion of a hyphen (where not anticipated)
  3.   Substitution of “l” (lower case el) for “1” (one).
  4.   Substitution of “I” (upper case eye) for “l” (el) or “1” (one).
  5.   Substitution of “rn” by “m”.
  6.   Substitution of “1” (one) by “l” (el).
  7.   Substitution of “φ” by “rp” [OCR artifact].
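
A simplified sketch of how per-operation penalties might be wired into the edit-distance calculation. The cost tables are assumptions that cover only the single-character items in the list above (the multi-character “rn”/“m” and “φ”/“rp” cases are left out), hyphen deletion is made free everywhere rather than only where unexpected, and all names are hypothetical.

```python
# (observed character, intended character) pairs that cost nothing.
FREE_SUBSTITUTIONS = {("l", "1"), ("1", "l"), ("I", "l"), ("I", "1")}
# Observed characters that may be dropped at no cost.
FREE_DELETIONS = {" ", "-"}

def deletion_cost(ch):
    return 0 if ch in FREE_DELETIONS else 1

def substitution_cost(observed_ch, intended_ch):
    if observed_ch == intended_ch:
        return 0
    return 0 if (observed_ch, intended_ch) in FREE_SUBSTITUTIONS else 1

def weighted_distance(observed, intended):
    """Cost of turning the observed text into the intended name."""
    rows, cols = len(observed) + 1, len(intended) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + deletion_cost(observed[i - 1])
    for j in range(1, cols):
        d[0][j] = j  # insertions always cost 1 in this sketch
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + deletion_cost(observed[i - 1]),
                d[i][j - 1] + 1,
                d[i - 1][j - 1]
                + substitution_cost(observed[i - 1], intended[j - 1]),
            )
    return d[-1][-1]

if __name__ == "__main__":
    print(weighted_distance("lH-ben zimidazole", "1H-benzimidazole"))
    print(weighted_distance("triphenylposhine", "triphenylphosphine"))
```

Under these costs the OCR-style error "lH-ben zimidazole" is repaired for free, while "triphenylposhine" still needs two paid insertions. The costs are deliberately asymmetric: removing a stray space is free, adding one is not.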


handle with care

• Alas, introducing automatic spelling correction (fuzzy
  matching) to entity recognition often requires the
  introduction of white word lists to avoid problems.
   –   herein → heroin
   –   aspiring → aspirin
   –   cranium → uranium
   –   ability → abilify
• More aggressive correction leads to more problems:
   – “be that the line” → methantheline
   – “park on a zone” → parconazole
benchmarking and analysis

• To quantify the benefits of automatic spelling
  correction to “real world” chemical text mining we
  analysed the first 28 weeks of US patent grants from
  2012.
• This corresponds to 145,473 documents, issued
  between 3rd January 2012 and 10th July 2012.
• A total of 4,061,670 IUPAC-like systematic names
  were identified, with 1,816,317 unique patent/name
  pairs.
• OPSIN interprets 3,722,399 of the names and 1,647,402
  of the unique patent/name pairs.
total molecule entities

[Two pie charts; legend: Correct, Simple, D=1]

       Chart                                    Slice counts (share)
       All Extracted Molecule Entities          3411665 (84%), 552533 (14%), 97472 (2%)
       OPSIN Interpreted Molecule Entities      3200193 (86%), 434487 (12%), 87719 (2%)

           Using correction retrieves ~19% more entities
           and ~16% more OPSIN recognizable names.
unique molecule entries

[Two pie charts; legend: Correct, Simple, D=1]

       Chart                                    Slice counts (share)
       All Extracted Molecule Entities          1430931 (79%), 310359 (17%), 75027 (4%)
       OPSIN Interpreted Molecule Entities      1343755 (82%), 236330 (14%), 67317 (4%)

     The effect is more pronounced with patent-compound
     pairs, with a +27% improvement over no correction.
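
The quoted improvements can be reproduced from the pie-chart counts. The short Python check below assumes the largest slice of each chart is what is found without correction and the two smaller slices are what correction adds; that reading is an assumption, though it matches the ~19%, ~16% and +27% figures on the slides. The dictionary keys and function name are illustrative.

```python
# Slice counts as printed on the pie-chart slides, largest slice first.
CHARTS = {
    "all entities": (3411665, 552533, 97472),
    "OPSIN-interpreted entities": (3200193, 434487, 87719),
    "unique patent/name pairs": (1430931, 310359, 75027),
    "OPSIN-interpreted unique pairs": (1343755, 236330, 67317),
}

def relative_gain(uncorrected, *added_by_correction):
    """Extra entities found with correction, relative to no correction."""
    return sum(added_by_correction) / uncorrected

if __name__ == "__main__":
    for label, counts in CHARTS.items():
        print(f"{label}: {sum(counts)} total, +{relative_gain(*counts):.1%}")
```

The slice counts also sum to the totals on the benchmarking slide (4,061,670 names and 1,816,317 unique pairs; 3,722,399 and 1,647,402 for OPSIN).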
Influence on N2S software (OPSIN)
      Interpretation                                  Count         Fraction
 Not Interpretable                                      116216          17.88%
 Before not After                                           11779       1.81%
 After not Before                                          270453      41.61%
 Same Before/After                                         181693      27.95%
 Different Before/After                                     69864      10.75%
 Total                                                     650005     100.00%


 Although some valid names are lost by correction,
 overall the effect is overwhelmingly positive.
break-down of edit operations

       Edit Operation                         Count          Fraction
    Deletion                                      392324         47.50%
    Insertion                                     232493         28.15%
    Substitution                                  198438         24.03%
    Transposition                                  2670          0.32%
    Total                                        825,925       100.00%




drug dictionary entities

[Two pie charts; legend: Correct, Simple, D=1]

       Chart                                    Slice counts (share)
       All Drug Dictionary Entities             1013711 (93%), 71684 (7%), 6141 (0%)
       Unique Drug Dictionary Entities          395482 (92%), 30687 (7%), 4119 (1%)

 Some improvement is seen with drug dictionaries,
 but there’s little benefit in fixing simple OCR issues.
target dictionary entries

• Simple correction of ChEMBL protein target names
•   prostaglandin H- 2 synthase- 1 → prostaglandin H2 synthase 1
•   Alanine amino-transferase → Alanine aminotransferase
•   acetyl cholinesterase → acetylcholinesterase
•   cyclooxy-genase-2 → cyclooxygenase-2
•   MAP kinase ERK-2 → MAP kinase ERK2
•   HEC-GLCNAC-6-ST → HEC-GLCNAC6ST
•   Herne Oxygenase → Heme Oxygenase
•   Prealburnin → Prealbumin
•   p110- delta → p110delta

non-word spelling correction

• Automatic correction technology can also be applied
  to entities other than words or IUPAC-like chemical
  nomenclature.
• For example, Chemical Abstracts Service (CAS)
  registry numbers.




CAS registry number grammar

[Figure: finite state machine for the CAS registry number grammar, with states 0 to 14 and edges labelled 0-9, 1-4, 5-9 and “-”.]

    • Two to seven digits, followed by a hyphen, two digits,
      a hyphen and a final check digit
          – e.g. 7732-18-5
    • Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
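
A small Python sketch of the format check, using the slide's expression with the digit escapes written out as `\d`. The names `CAS_FORMAT` and `looks_like_cas` are illustrative. As written, the expression admits two to six leading digits, so `{2,5}` would need widening to accept longer numbers.

```python
import re

# The slide's expression, with "\d" standing for any digit.
CAS_FORMAT = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")

def looks_like_cas(text):
    """True if the whole string has the shape of a CAS registry number.

    This checks the layout only; it does not verify the check digit.
    """
    return CAS_FORMAT.fullmatch(text) is not None

if __name__ == "__main__":
    for candidate in ["7732-18-5", "50-00-0", "12-34-5", "7732-185"]:
        print(candidate, looks_like_cas(candidate))
```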

CAS check digit calculation

• More generally CaffeineFix’s finite state machines
  can do limited processing...
• The final check digit of a CAS number is calculated as
  a weighted sum of the preceding digits, modulo 10.
• The last digit before the check digit is multiplied by 1,
  the digit before it by 2, the one before that by 3, and
  so on; the check digit is the sum modulo 10.
• The CAS number for water is 7732-18-5.
• The checksum 5 is calculated as (1x8 + 2x1 + 3x2 +
  4x3 + 5x7 + 6x7) mod 10 = 5.
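
The calculation above, written as a short Python sketch. The function names are illustrative, and the input is assumed to be a hyphenated string that already has the right layout.

```python
def cas_check_digit(body_digits):
    """Check digit for the digits that precede it (hyphens removed).

    The rightmost digit is weighted 1, the next 2, and so on;
    the check digit is the weighted sum modulo 10.
    """
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10

def has_valid_check_digit(cas_number):
    """True if the final digit of e.g. '7732-18-5' matches the checksum."""
    digits = cas_number.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])

if __name__ == "__main__":
    # Water: 1*8 + 2*1 + 3*2 + 4*3 + 5*7 + 6*7 = 105, and 105 mod 10 = 5
    print(cas_check_digit("773218"))
    print(has_valid_check_digit("7732-18-5"))
    print(has_valid_check_digit("7732-18-8"))
```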
FSM for matching CAS check digits

[Figure: finite state machine for matching CAS check digits, with states numbered 0 to 90.]

CAS number correction example

• 7732-18-8? Did you mean...
  –   7732-18-5
  –   7732-11-8
  –   77328-18-8
  –   7733-18-8
  –   77342-18-8
  –   77392-18-8
  –   71732-18-8
  –   76732-18-8
  –   97732-18-8
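
The suggestions above can be reproduced by brute force. This is a sketch only, not how CaffeineFix's state machine works: it generates every string one digit-insertion, deletion or substitution away from the input and keeps those that pass the slide's format expression and the check-digit test. All names are hypothetical.

```python
import re

CAS_FORMAT = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")
DIGITS = "0123456789"

def is_valid_cas(text):
    """Right layout and a check digit that matches the weighted sum."""
    if CAS_FORMAT.fullmatch(text) is None:
        return False
    digits = text.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check

def single_edit_variants(text):
    """All strings one insertion, deletion or substitution away."""
    variants = set()
    for i in range(len(text) + 1):
        for digit in DIGITS:
            variants.add(text[:i] + digit + text[i:])      # insertion
    for i in range(len(text)):
        variants.add(text[:i] + text[i + 1:])              # deletion
        for digit in DIGITS:
            variants.add(text[:i] + digit + text[i + 1:])  # substitution
    variants.discard(text)
    return variants

def suggest(text):
    """Valid CAS numbers within one edit of the input."""
    return sorted(v for v in single_edit_variants(text) if is_valid_cas(v))

if __name__ == "__main__":
    for suggestion in suggest("7732-18-8"):
        print(suggestion)
```

For 7732-18-8 this yields the same nine candidates as the slide: three substitutions and six insertions, with no single deletion producing a valid number.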
take home message

• Adding advanced automatic chemical spelling
  correction to an annotation pipeline typically
  improves recall by about 20-40%.
  – Andrew Hinton, “Benchmarking ChemAxon’s Name-to-Structure batch
    tool on Patent Data”, 2011 ChemAxon EUGM, Budapest.
  – Sorel Muresan, “Automated Spelling Correction to Improve Recall
    Rates of Name-to-Structure Tools for Chemical Text Mining”, 2011
    ChemAxon EUGM.




acknowledgements

• Daniel Lowe, NextMove Software.
• Sorel Muresan and Paul Hongxing Xie, AstraZeneca.


• Thank you for your time.
• Any questions?




"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBCSTRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
STRAND 3 HYGIENIC PRACTICES.pptx GRADE 7 CBC
 

Advances in Automatic Chemical Spelling Correction

  • 1. Advances in Automatic Chemical Spelling Correction. Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK. ACS National Meeting, Philadelphia, USA, 19th August 2012.
  • 2. example spelling errors
      • Sample misspellings of pyrimidine in US patent grants since the beginning of this year.

        Incorrect Name    US Patent No.    Issue Date
        pryimidine        8093264          10th January 2012
        pyrmidine         8097728          17th January 2012
        pyrimdine         8114996          14th February 2012
        pyrimidne         8129897          6th March 2012
        pyrimidinc        8148339          3rd April 2012
        pyridmidine       8158627          8th May 2012
  • 3. previous work
      • Roger Sayle, Paul Hongxing Xie and Sorel Muresan, “Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction”, Journal of Chemical Information and Modeling, Vol. 52, No. 1, pp. 51-62, 2012.
      • G.H. Kirby, M.R. Lord, J.D. Rayner, “Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 6. (Semi)automatic name correction”, Journal of Chemical Information and Computer Sciences, Vol. 31, pp. 153-160, 1991.
  • 4. string edit distance
      • The Levenshtein Distance (Levenshtein 1965) is the minimum number of edits (insertions, deletions or substitutions) required to transform one string into another.
      • The Damerau-Levenshtein Distance is an extension of Levenshtein Distance to include transposition of two adjacent characters.
      • The distances can be efficiently computed with dynamic programming using the Needleman-Wunsch-Sellers alignment algorithm (bioinformatics).
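The dynamic programming idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CaffeineFix implementation: it computes the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), with every edit costing 1. The function name `osa_distance` is an assumption made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, substitutions and
    transpositions of two adjacent characters, each costing 1
    (restricted Damerau-Levenshtein / optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this sketch the patent misspelling "pryimidine" is one transposition away from "pyrimidine", and "cloramphenical" is two edits away from "chloramphenicol" (one insertion, one substitution).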
  • 5. needleman-wunsch-sellers [Figure: alignment of “cloramphenical” against “chloramphenicol”]
  • 6. needleman-wunsch-sellers [Figure: alignment of “cloramphenical” against “chloramphenicol”]
  • 7. further complications
      • Edit operations can have their own specific penalties.
      • The latest implementation supports transpositions, to catch spelling mistakes such as “chlorofrom”.
      • Some dictionaries to match against have tens of millions of entries, others are infinite.
      • The start and end of the input aren’t known in free text and are assigned based on the quality of the match.
      • Correct nesting of parentheses and brackets needs to be enforced as part of the matching process.
      • In summary - It’s mind-bogglingly complicated.
  • 8. dictionaries as automata
      • Nitrogen containing heterocycles as minimal DFA:
          – Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine, Pyrimidine, Pyrazine
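To make the idea of matching against a dictionary stored as an automaton concrete, here is a rough sketch that stores the seven heterocycle names in a trie and walks it while carrying one row of the Levenshtein table per node. A trie is used instead of a minimal DFA to keep the example short, and the traversal is a well-known generic technique, not necessarily what CaffeineFix does. The names `build_trie`, `fuzzy_lookup` and the `"$"` end marker are assumptions for this example.

```python
def build_trie(words):
    """Nested dicts keyed by character; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def fuzzy_lookup(trie, query, max_dist=1):
    """Return sorted (word, distance) pairs within max_dist edits of query."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the Levenshtein table by one row for the character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,                           # insertion
                prev_row[j] + 1,                          # deletion
                prev_row[j - 1] + (query[j - 1] != ch),   # substitution
            ))
        if "$" in node and row[-1] <= max_dist:
            results.append((node["$"], row[-1]))
        # Prune: no descendant can get below the minimum of this row.
        if min(row) <= max_dist:
            for nxt, child in node.items():
                if nxt != "$":
                    walk(child, nxt, row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != "$":
            walk(child, ch, first_row)
    return sorted(results)


HETEROCYCLES = build_trie([
    "pyrrole", "pyrazole", "imidazole", "pyridine",
    "pyridazine", "pyrimidine", "pyrazine",
])
```

Looking up "pyrimidne" finds only "pyrimidine" at distance 1, whereas "pyrmidine" is one edit from both "pyridine" and "pyrimidine", which shows why match quality alone can leave a correction ambiguous.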
  • 9. example IUPAC-like grammar
      • More generally still, CaffeineFix FSMs can represent formal grammars, i.e. infinite dictionaries.

        alk    := “meth” | “eth” | “prop” | “but”
        parent := alk “ane”
        subst  := “bromo” | “chloro” | “fluoro”
        locant := “1” | “2” | “3” | “4” /* any digit */
        prefix := [ prefix “-” ] [ locant “-” ] subst
                | [ prefix ] subst
        name   := [ prefix [ “-” ] ] parent
  • 10. IUPAC-like grammar examples
      • methane
      • chloroethane
      • 2-bromo-propane
      • chloro-bromo-methane
      • 1-fluoro-2-chloro-ethane
      • chlorofluoromethane
      • 4-bromomethane
      • 1-chloro-1-chloro-1-chloro-methane
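Because the toy grammar from the previous slide is regular, it can be written as a single regular expression, which is one way to see that a finite state machine can recognise it. The sketch below is an illustrative translation, not the CaffeineFix representation; the constant names and the helper `is_name` are assumptions, and the locant is taken to be any single digit, following the comment in the grammar.

```python
import re

ALK = r"(?:meth|eth|prop|but)"
PARENT = ALK + "ane"
SUBST = r"(?:bromo|chloro|fluoro)"
LOCANT = r"[0-9]"

# prefix := one substituent (optionally with a locant), followed by further
# substituents that are either hyphen-separated (optionally with a locant)
# or directly concatenated.
PREFIX = rf"(?:{LOCANT}-)?{SUBST}(?:-(?:{LOCANT}-)?{SUBST}|{SUBST})*"

# name := [ prefix [ "-" ] ] parent
NAME = re.compile(rf"(?:{PREFIX}-?)?{PARENT}")


def is_name(text: str) -> bool:
    """True if the whole string is a word of the toy grammar."""
    return NAME.fullmatch(text) is not None
```

All eight names on this slide are accepted, including chemically meaningless ones such as "4-bromomethane", since the grammar checks syntax only. The trailing `*` in `PREFIX` plays the role of a backward edge in the automaton: it is what lets the pattern match an unbounded number of substituents.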
  • 11. representing grammars as DFAs
      • Backward edges allow matching an infinite number of words.
  • 12. current IUPAC grammar FSM
      • As of July 2012, the current CaffeineFix grammar contains nearly 1.2 million edges.
      • This grammar covers...
          – 99.15% (232144/234142) names in the NCI00 database.
          – 95.28% (67995/71367) names in the Maybridge catalogue.
          – 95.27% (25890/48167) names in the Keyorganics catalogue.
      • These figures are comparable to name-to-structure conversion rates on these names.
  • 13. spelling correction examples
      • lH-ben zimidazole → 1H-benzimidazole
      • triphenylposhine → triphenylphosphine
      • 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL- PYRAZOL-1-YL] BENZOIC ACID → 4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1-yl]benzoic acid
      • didec-2-ene → dodec-2-ene
      • spiro[2.2]hexane → spiro[2.3]hexane
  • 14. low cost “frequent” edit ops
      • A number of common corrections are so frequent as to be given a lower (free) cost.
          1. Deletion of whitespace.
          2. Deletion of a hyphen (where not anticipated).
          3. Substitution of “l” (lower case el) for “1” (one).
          4. Substitution of “I” (upper case eye) for “l” (el) or “1” (one).
          5. Substitution of “rn” by “m”.
          6. Substitution of “1” (one) by “l” (el).
          7. Substitution of “φ” by “rp” [OCR artifact].
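One way to picture operation-specific penalties is a weighted Levenshtein distance in which the frequent corrections cost nothing. The sketch below is a simplification made for illustration: it covers only the single-character cases from the list (whitespace and hyphen deletion, and the l/I/1 confusions), treats every hyphen deletion as free rather than only unanticipated ones, and leaves out the multi-character cases such as “rn” for “m”. The names `FREE_SUBSTITUTIONS`, `FREE_DELETIONS` and `weighted_distance` are assumptions.

```python
# (observed character, intended character) pairs that cost nothing.
FREE_SUBSTITUTIONS = {("l", "1"), ("I", "l"), ("I", "1"), ("1", "l")}
# Characters that may be dropped from the observed text at no cost.
FREE_DELETIONS = {" ", "-"}


def weighted_distance(observed: str, target: str) -> int:
    """Cost of turning the observed text into the target name."""

    def deletion_cost(ch: str) -> int:
        return 0 if ch in FREE_DELETIONS else 1

    def substitution_cost(a: str, b: str) -> int:
        return 0 if a == b or (a, b) in FREE_SUBSTITUTIONS else 1

    rows, cols = len(observed) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + deletion_cost(observed[i - 1])
    for j in range(1, cols):
        d[0][j] = j  # insertions always cost 1 in this sketch
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + deletion_cost(observed[i - 1]),
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + substitution_cost(observed[i - 1], target[j - 1]),
            )
    return d[-1][-1]
```

Under these assumed costs, "lH-ben zimidazole" reaches "1H-benzimidazole" at cost 0, because both the “l” for “1” substitution and the whitespace deletion are free, while "triphenylposhine" still costs 2 to reach "triphenylphosphine".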
  • 15. handle with care
      • Alas, introducing automatic spelling correction (fuzzy matching) to entity recognition often requires the introduction of white word lists to avoid problems.
          – herein → heroin
          – aspiring → aspirin
          – cranium → uranium
          – ability → abilify
      • More aggressive correction leads to more problems:
          – “be that the line” → methantheline
          – “park on a zone” → parconazole
  • 16. benchmarking and analysis
      • To quantify the benefits of automatic spelling correction to “real world” chemical text mining we analysed the first 28 weeks of US patent grants from 2012.
      • This corresponds to 145,473 documents, issued between 3rd January 2012 and 10th July 2012.
      • A total of 4,061,670 IUPAC-like systematic names were identified, with 1,816,317 unique patent/name pairs.
      • OPSIN interprets 3,722,399 of these names and 1,647,402 of the unique patent/name pairs.
  • 17. total molecule entities
      • [Pie charts; legend: Correct, Simple, D=1]
      • All extracted molecule entities: Correct 3411665 (84%); remaining slices 552533 (14%) and 97472 (2%).
      • OPSIN interpreted molecule entities: Correct 3200193 (86%); remaining slices 434487 (12%) and 87719 (2%).
      • Using correction retrieves ~19% more entities and ~16% more OPSIN recognizable names.
  • 18. unique molecule entries
      • [Pie charts; legend: Correct, Simple, D=1]
      • All extracted molecule entities: Correct 1430931 (79%); remaining slices 310359 (17%) and 75027 (4%).
      • OPSIN interpreted molecule entities: Correct 1343755 (82%); remaining slices 236330 (14%) and 67317 (4%).
      • The effect is more pronounced with patent-compound pairs, with a +27% improvement over no correction.
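The quoted improvements follow from the pie-chart counts: the gain is the number of entities recovered only through correction, divided by the number found without correction. A small worked check, assuming the largest slice in each chart is the “Correct” (uncorrected) count and the other two slices are corrected matches; the function name `relative_gain` is hypothetical.

```python
def relative_gain(correct: int, *corrected: int) -> float:
    """Extra entities found through correction, relative to the
    number found without any correction."""
    return sum(corrected) / correct


# Total molecule entities.
gain_all = relative_gain(3411665, 552533, 97472)
gain_opsin = relative_gain(3200193, 434487, 87719)

# Unique patent/compound pairs.
gain_unique = relative_gain(1430931, 310359, 75027)

for label, gain in [
    ("all entities", gain_all),
    ("OPSIN-interpreted entities", gain_opsin),
    ("unique patent/compound pairs", gain_unique),
]:
    print(f"{label}: +{gain:.1%}")
```

The three ratios round to the ~19%, ~16% and +27% figures stated on the slides.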
  • 19. influence on N2S software (OPSIN)

        Interpretation            Count     Fraction
        Not Interpretable         116216    17.88%
        Before not After          11779     1.81%
        After not Before          270453    41.61%
        Same Before/After         181693    27.95%
        Different Before/After    69864     10.75%
        Total                     650005    100.00%

      • Although some valid names are lost by correction, overall the effect is overwhelmingly positive.
  • 20. break-down of edit operations

        Edit Operation    Count     Fraction
        Deletion          392324    47.50%
        Insertion         232493    28.15%
        Substitution      198438    24.03%
        Transposition     2670      0.32%
        Total             825925    100.00%
  • 21. drug dictionary entities
      • [Pie charts; legend: Correct, Simple, D=1]
      • All drug dictionary entities: Correct 1013711 (93%); remaining slices 71684 (7%) and 6141 (0%).
      • Unique drug dictionary entities: Correct 395482 (92%); remaining slices 30687 (7%) and 4119 (1%).
      • Some improvement is seen with drug dictionaries, but there’s little benefit in fixing simple OCR issues.
  • 22. target dictionary entries
      • Simple correction of ChEMBL protein target names
      • prostaglandin H- 2 synthase- 1 → prostaglandin H2 synthase 1
      • Alanine amino-transferase → Alanine aminotransferase
      • acetyl cholinesterase → acetylcholinesterase
      • cyclooxy-genase-2 → cyclooxygenase-2
      • MAP kinase ERK-2 → MAP kinase ERK2
      • HEC-GLCNAC-6-ST → HEC-GLCNAC6ST
      • Herne Oxygenase → Heme Oxygenase
      • Prealburnin → Prealbumin
      • p110- delta → p110delta
  • 23. non-word spelling correction
      • Automatic correction technology can also be applied to entities other than words or IUPAC-like chemical nomenclature.
      • For example, Chemical Abstracts Service’s registry numbers.
  • 24. CAS registry number grammar
      • [Figure: finite state machine accepting CAS registry numbers]
      • Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit
          – e.g. 7732-18-5
      • Regular Expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
  • 25. CAS check digit calculation
      • More generally CaffeineFix’s finite state machines can do limited processing...
      • The final check digit of a CAS number is calculated by series term summation modulo 10.
      • The last digit before the check digit is multiplied by 1, the preceding digit by 2, the one before that by 3, and so on; the check digit is the sum modulo 10.
      • The CAS number for water is 7732-18-5.
      • The checksum 5 is calculated as (1x8 + 2x1 + 3x2 + 4x3 + 5x7 + 6x7) mod 10 = 5.
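The check digit rule is short enough to write out directly. The following sketch computes it in ordinary Python rather than inside a finite state machine, so it illustrates the arithmetic only and not how CaffeineFix encodes it. The names `CAS_FORMAT`, `cas_check_digit` and `is_valid_cas` are assumptions, as is the format pattern (two to seven leading digits with no leading zero, following the prose description of the format).

```python
import re

# Assumed format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_FORMAT = re.compile(r"([1-9]\d{1,6})-(\d\d)-(\d)")


def cas_check_digit(body: str) -> int:
    """Check digit for the part before the final hyphen, e.g. "7732-18"."""
    digits = [int(ch) for ch in body if ch.isdigit()]
    # Rightmost digit is weighted 1, the next one 2, and so on.
    weighted = sum(w * d for w, d in enumerate(reversed(digits), start=1))
    return weighted % 10


def is_valid_cas(text: str) -> bool:
    """True if text has the CAS layout and a matching check digit."""
    match = CAS_FORMAT.fullmatch(text)
    if match is None:
        return False
    return cas_check_digit(match.group(1) + match.group(2)) == int(match.group(3))
```

For water, `cas_check_digit("7732-18")` reproduces the sum 8 + 2 + 6 + 12 + 35 + 42 = 105, giving check digit 5, so "7732-18-5" validates and "7732-18-8" does not.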
  • 26. FSM for matching CAS check digits [Figure: finite state machine whose states track the running checksum while the digits are read]
  • 27. CAS number correction example
      • 7732-18-8? Did you mean...
          – 7732-18-5
          – 7732-11-8
          – 77328-18-8
          – 7733-18-8
          – 77342-18-8
          – 77392-18-8
          – 71732-18-8
          – 76732-18-8
          – 97732-18-8
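Suggestions like these can be reproduced by brute force: generate every string one digit edit away from the input and keep those that are well-formed CAS numbers with a valid check digit. This is a naive sketch for illustration, not the FSM-based search the slides describe; the function names and the format pattern are assumptions, and it may return more candidates than the slide lists.

```python
import re

DIGITS = "0123456789"
# Assumed format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_FORMAT = re.compile(r"[1-9]\d{1,6}-\d\d-\d")


def is_valid_cas(text: str) -> bool:
    """Well-formed CAS number whose check digit matches."""
    if CAS_FORMAT.fullmatch(text) is None:
        return False
    *body, check = [int(ch) for ch in text if ch.isdigit()]
    total = sum(w * d for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def single_edit_candidates(text: str) -> list[str]:
    """Valid CAS numbers reachable by inserting, substituting or
    deleting exactly one character (only digits are inserted)."""
    edits = set()
    for i in range(len(text) + 1):
        for ch in DIGITS:
            edits.add(text[:i] + ch + text[i:])          # insertion
            if i < len(text):
                edits.add(text[:i] + ch + text[i + 1:])  # substitution
        if i < len(text):
            edits.add(text[:i] + text[i + 1:])           # deletion
    edits.discard(text)
    return sorted(e for e in edits if is_valid_cas(e))
```

Calling `single_edit_candidates("7732-18-8")` includes all nine suggestions shown on the slide, among them the intended "7732-18-5".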
  • 28. take home message
      • Adding advanced automatic chemical spelling correction to an annotation pipeline typically improves recall by about 20-40%.
          – Andrew Hinton, “Benchmarking ChemAxon’s Name-to-Structure batch tool on Patent Data”, 2011 ChemAxon EUGM, Budapest.
          – Sorel Muresan, “Automated Spelling Correction to Improve Recall Rates of Name-to-Structure Tools for Chemical Text Mining”, 2011 ChemAxon EUGM.
  • 29. acknowledgements
      • Daniel Lowe, NextMove Software.
      • Sorel Muresan and Paul Hongxing Xie, AstraZeneca.
      • Thank you for your time.
      • Any questions?