Imago OCR: Open-source toolkit for chemical structure image recognition


Presented at the 244th ACS National Meeting & Exposition, in the symposium
"Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents".


  1. Imago OCR: Open-source toolkit for chemical structure image recognition
     GGA Software Services LLC
  2. Project goals
     • Perform optical chemical structure recognition on a wide range of raster images:
       – different image formats
       – varying scan quality (or even photos)
       – complex structures and uncommon features
     • Provide a complete toolset for embedding the recognition engine in any other application
     14/08/2012 GGA Software Services LLC
  3. Applications
     • Automated processing of articles and patents
       – similarity analysis
     • Chemical database search (PubChem, etc.)
     • “Deep Web” indexing
       – development of a universal chemical search engine
       – conversion of human-readable data to machine-readable formats
  4. Use case: source image → imago → MOL format
     Source image:
     • BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF
     • Images from scanner/camera
     • PDF document
     MOL format (output):
     • MDL Molfile
     • SMILES (requires Indigo)
     • Rendered image (requires Indigo)
  5. Supported features
     • Multiple bonds
     • Single-up & single-down bonds
     • Bridged bonds
     • Aromatic rings
  6. Supported features
     • Superatom labels, charges, isotopes
     • Abbreviation expansion
     • R-group handling
     • Query features
  7. Engine structure
     Image loader
     → Raster level: Prefilter & Binarization
     → Primitives level: Vectorization & Separation
     → Structural level: Logical layout analyzer
     → Molecule export
  8. Preliminary filters
     • Pass-through filter
       – For rendered images (binarization only)
     • Cross-correlation based filter
       – For scanned images (quite fast)
     • Logical analysis based filter
       – For low-quality photos
       – Takes some time to process
     • Imago can auto-detect a suitable filter
  9. Cross-correlation based filter
     [Figure panels: Source image; Strong threshold; Weak threshold; Filter result]
     Filter result: the image is assembled from those weak-threshold segments whose cross-correlation (CC) value against the corresponding strong-threshold segments passes the restrictions.
  10. Logical analysis based filter
      • Removes noise (spots, light glares)
      • Suitable for out-of-focus images
      • Can process low-contrast images
      • Removes unusual artifacts
      • Deals with multicolor photos
      • Keywords: Wiener filtering, wave algorithm, weak segmentation
  11. Preliminary separation
      • Separate labels and graphics:
        – Hu moments classifier (d1)
        – Contours analysis (d2)
        – Approximation criteria (d3)
      • An object is a symbol if f(d1, d2, d3) > c0
  12. Vectorization
      • Convert pixels to a matching polyline:
        – Minimize the mean distance between the original and the vectorized structure
        – Penalty for extra segments
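The trade-off on this slide (mean distance versus a penalty for extra segments) can be illustrated with a small dynamic-programming sketch. This is not Imago's implementation: the function names, the exact cost formula and the `segment_penalty` weight are assumptions made for illustration, and the input is assumed to be a non-empty, ordered chain of pixel coordinates.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def vectorize(points, segment_penalty):
    """Choose polyline vertices among an ordered pixel chain.

    Hypothetical cost: (sum of pixel-to-segment distances) / len(points)
    plus segment_penalty for every segment used.
    """
    n = len(points)
    # best[j] = (cost of the cheapest polyline ending at points[j], previous vertex index)
    best = [(0.0, None)] + [(math.inf, None)] * (n - 1)
    for j in range(1, n):
        for i in range(j):
            error = sum(point_segment_distance(points[k], points[i], points[j])
                        for k in range(i + 1, j))
            cost = best[i][0] + error / n + segment_penalty
            if cost < best[j][0]:
                best[j] = (cost, i)
    # Walk back from the last point to recover the chosen vertices.
    vertices, j = [], n - 1
    while j is not None:
        vertices.append(points[j])
        j = best[j][1]
    return vertices[::-1]


chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
corner_kept = vectorize(chain, segment_penalty=0.1)
corner_lost = vectorize(chain, segment_penalty=10.0)
```

With a small penalty the L-shaped chain keeps its corner; with a very large penalty the sketch collapses it to a single segment, which is the behaviour the penalty term is meant to control.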
  13. Logical layout analysis
      • Mapping labels to bonds
        – Group labels into superatoms
      • Finding multiple bonds
        – Dissolving short edges
        – Connecting bridged bonds
      • Removal of clearly unrelated captions
      • Detection of aromatic rings
        – Determining the orientation of stereo bonds and aromatizing the molecule if circles are present
  14. Adaptive methods or particular cases?
      • Adaptive methods
        – Based on optimization of some function
        – Wider input class range
        – Probably better results in hard cases
      • Particular-case methods
        – Based on some criteria
        – Stability
        – Good performance
        – Easier implementation
  15. Particular-case methods
      • What is it?
      • Line? Tested line criteria: no.
      • Character? Tested against ‘A’: no. … Tested against ‘Z’: no.
      • Ring? No.
      • Unrecognizable object: ignore.
  16. Adaptive methods
      • What is it?
      • Line? Approximation: d=1.6
      • Character? Compared with ‘C’: d=6.1 … Compared with ‘L’: d=3.2
      • Ring? Approximation: d=653.3
      • The final decision depends on neighbors
  17. Decision tree
      Neighbors: bond with d=0.0 (almost surely recognized); label “C” with d=0.1
      • If the object is a bond: the segment group is recognized as bond + label with d=0.1+1.6=1.7
      • If the object is the letter ‘l’: the segment group is recognized as bond + label of two chars with d=0.0+0.1+3.2=3.3
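A minimal sketch of the comparison this slide illustrates: each interpretation of a segment group is scored by the sum of its metric values, and the smallest sum wins. The data layout and the function name are assumptions; the distances are the ones quoted on the slide.

```python
def best_interpretation(interpretations):
    """Return (name, total distance) for the interpretation with the smallest sum."""
    totals = [(name, sum(distances)) for name, distances in interpretations]
    return min(totals, key=lambda item: item[1])


candidates = [
    ("bond + label", [0.1, 1.6]),
    ("bond + label of two chars", [0.0, 0.1, 3.2]),
]
name, total = best_interpretation(candidates)
```

Here the object is treated as a bond, because 1.7 is smaller than 3.3.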
  18. Metrics
      • For symbols
        – Distance between sets of Fourier descriptors
      • For graphics
        – Distance between the approximated and the source image
      • For single-up bonds
        – f(average fill, relative size, etc.)
      • For single-down bonds
        – f(distance between segments, line thickness, etc.)
      • … (every recognition method has a metric function)
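The slides do not say how the Fourier descriptors are computed or compared, so the following is only a generic textbook-style sketch. Assumptions: a contour is an ordered, closed list of (x, y) points traversed counterclockwise, the descriptors are the magnitudes of the first few DFT coefficients divided by the magnitude of the first harmonic, and the distance is Euclidean. All names are hypothetical.

```python
import cmath
import math


def fourier_descriptors(contour, count=3):
    """Magnitudes of DFT coefficients 1..count, normalized by the first harmonic.

    Skipping coefficient 0 removes translation, dividing by |F1| removes scale,
    and taking magnitudes removes rotation and the choice of start point.
    """
    n = len(contour)
    if n <= count:
        raise ValueError("contour needs more points than descriptors")
    z = [complex(x, y) for x, y in contour]
    coeffs = [
        sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(1, count + 1)
    ]
    scale = abs(coeffs[0])
    if scale == 0.0:
        raise ValueError("degenerate contour: first harmonic is zero")
    return [abs(c) / scale for c in coeffs]


def descriptor_distance(contour_a, contour_b, count=3):
    """Euclidean distance between two descriptor vectors."""
    a = fourier_descriptors(contour_a, count)
    b = fourier_descriptors(contour_b, count)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
moved_square = [(3 * x + 5, 3 * y + 7) for x, y in square]
wide_rectangle = [(2 * x, y) for x, y in square]
```

In this sketch a scaled and shifted copy of the square is at (numerically) zero distance from the original, while the stretched rectangle is not.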
  19. Labels correction
      • Any recognized symbol can have alternatives, e.g. A (metric value of 3.2), R (4.9), P (5.0)
      • Imago keeps information about probable captions (periodic table, abbreviations)
      • Labels correction: select the combination of symbol alternatives that is probable and whose sum of metric values is minimal
      • Allows recognition of partially broken labels
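The selection rule can be sketched as a brute-force search over the alternatives of each symbol. The `KNOWN_LABELS` set below is a tiny hypothetical stand-in for Imago's caption information, and the alternatives and metric values are made up for the example.

```python
from itertools import product

# Hypothetical stand-in for the periodic table and abbreviation lists.
KNOWN_LABELS = {"C", "Cl", "Br", "OH", "NH"}


def correct_label(alternatives_per_symbol, known_labels=KNOWN_LABELS):
    """Pick the known label whose symbol alternatives have the smallest metric sum.

    alternatives_per_symbol: one list of (symbol, metric value) pairs per glyph.
    Returns (label, total) or None when no combination forms a known label.
    """
    best = None
    for combo in product(*alternatives_per_symbol):
        text = "".join(symbol for symbol, _ in combo)
        if text not in known_labels:
            continue
        total = sum(value for _, value in combo)
        if best is None or total < best[1]:
            best = (text, total)
    return best


glyphs = [
    [("G", 0.4), ("C", 0.6)],   # first glyph: 'G' scores best on its own
    [("I", 1.0), ("l", 3.2)],   # second glyph: 'I' scores best on its own
]
label = correct_label(glyphs)
```

The individually best symbols would spell "GI", which is not a known caption, so the sketch falls back to "Cl" even though its metric sum is larger.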
  20. Recognition
      • Image recognition is a search for the vectorized result that gives the minimal distance between the vectorized form and the original image
      • Can be formalized depending on the metrics
      • The search is exhaustive
        – Needs some restrictions to achieve good speed
  21. Trade-off: restricted adaptive methods
      • Limit metric values: d < 0.5 means "surely"; d > 10.0 means "impossible"
      • Limit Euclidean distances for the neighbor search (up to 100 pixels)
      • Limit the number of alternatives (not more than 10)
      • Assume the image filling rate is less than 10%
      • Assume the distance between single-down bond segments is in the range 5..10 pixels
      • Assume the symbol aspect ratio is in the range 0.5..2.0
      • Some more assumptions with “magic” constants
      • Gains speed and stability
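A minimal sketch of how the first and third limits could prune a list of alternatives before the exhaustive search. The thresholds come from the slide; the function name, the data layout and the exact pruning policy (stop at the first "sure" candidate) are assumptions.

```python
SURE_BELOW = 0.5          # d < 0.5: surely
IMPOSSIBLE_ABOVE = 10.0   # d > 10.0: impossible
MAX_ALTERNATIVES = 10     # not more than 10 alternatives


def prune_alternatives(alternatives):
    """Return the alternatives worth searching, best first.

    alternatives: list of (name, metric value) pairs.
    """
    ranked = sorted(alternatives, key=lambda item: item[1])
    if ranked and ranked[0][1] < SURE_BELOW:
        # A sure match makes the remaining alternatives unnecessary.
        return ranked[:1]
    possible = [item for item in ranked if item[1] <= IMPOSSIBLE_ABOVE]
    return possible[:MAX_ALTERNATIVES]


# Values from the "Adaptive methods" slide.
kept = prune_alternatives([("line", 1.6), ("C", 6.1), ("L", 3.2), ("ring", 653.3)])
```

The ring hypothesis (d=653.3) is discarded immediately, so later stages only weigh the line and character readings.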
  22. Configuration clusters
      • For scanned images
        – Strict limits for the adaptive methods (fast, <300 ms per image)
      • For photos and low-quality images
        – Flexible limits (less than a second per image on average)
      • For high-resolution images
        – up to 5 seconds
      • For handwritten structures
        – up to 10 seconds in complex cases
      • Imago supports auto-detection of a suitable configuration cluster
  23. Configuration cluster creation
      • Improves the recognition success rate for a specified image type:
        – different render type
        – images captured differently (scanner type, lighting conditions, etc.)
      • The process is automated
        – a test set of the target image type is required
        – takes some time
        – an application of machine learning
  24. Machine learning
      • Test set: a number of pairs (image; related MDL molfile)
      • Imago tunes the method parameters to gain the best score on the test collection
        – Metrics included
        – No information directly related to the test set (such as a character table) is stored
      • Criteria for the complete set are formed from a small subset of the same type
  25. Learning effectiveness
      • Used the Img2Structure test set with a different renderer:
      • Initial results (before training): 202/944 correct, similarity value: 74.54%
      • Trained on a set of 50 images with the new renderer
      • Trained results: 831/944 correct, similarity value: 98.33% on the whole set
  26. Comparison: overall scores 1
      • Image2Structure set from TREC 2011 Chemical IR Track (ambiguous & partial structures removed): original files

                              OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
      Absolutely correct      769 / 944    540 / 944   861 / 944
      Almost correct [1]      +31          +49         +43
      Average time            2.54s        0.20s       0.31s
      Average similarity [2]  94.57%       89.59%      98.26%

      [1] similarity value is greater than 95%
      [2] ratio of correct elements (atoms and bonds); extra and missing elements are counted too
  27. Comparison: overall scores 2
      • Image2Structure re-rendered using appropriate molfiles

                              OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
      Absolutely correct      796 / 944    604 / 944   831 / 944
      Almost correct [1]      +20          +58         +29
      Average time            4.57s        0.47s       1.24s
      Average similarity [2]  93.45%       95.38%      98.33%

      [1] similarity value is greater than 95%
      [2] ratio of correct elements (atoms and bonds); extra and missing elements are counted too
  28. Common issues resolved
      [Figure: source images compared with OSRA and Imago output for three cases: large gap; lines too close; no more symbols]
  29. Imago Library
      • API: set of methods for
        – Image loading
        – Configuration clusters setup
        – Retrieving molfile results
        – Partial processing (filtering, approximation, validation)
      • Bindings for C/C++, Java
      • Cross-platform implementation (Windows, Linux, Mac)
      • Dependencies:
        – Boost library (Boost Software License)
        – OpenCV library (BSD license)
        – Indigo (optional)
  30. Thank you for your attention!
      • Imago OCR:
      • Try the Imago recognition engine online:
  31. Appendix A. Imago: technical details
  32. Pass-through prefilter
      • Count black, white and other pixels
      • If (black + white) > t0 ∙ others,
        – recolor the others to black → the image is binarized
        – else schedule another prefilter call
      • Downscale the image accurately when it is too large (>5 Mpix)
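The pixel-counting rule above can be sketched in a few lines. Assumptions: the image is a list of rows of 8-bit grayscale values, "black" and "white" mean exactly 0 and 255, and the value of `t0` is arbitrary; the function name and the `None` return convention are hypothetical, and the downscaling step is left out.

```python
def pass_through_prefilter(pixels, t0=5.0, black=0, white=255):
    """Binarize an (almost) two-color image, or return None if it does not qualify.

    pixels: list of rows, each a list of grayscale values in 0..255.
    """
    flat = [value for row in pixels for value in row]
    n_black = sum(1 for value in flat if value == black)
    n_white = sum(1 for value in flat if value == white)
    n_other = len(flat) - n_black - n_white
    if n_black + n_white > t0 * n_other:
        # Everything that is not pure white is recolored to black.
        return [[white if value == white else black for value in row] for row in pixels]
    return None  # the caller should schedule another prefilter


rendered = [[255, 255, 0], [255, 0, 255], [255, 128, 255]]
photo = [[10, 20], [30, 255]]
binarized = pass_through_prefilter(rendered)
rejected = pass_through_prefilter(photo)
```

The rendered example has a single anti-aliased gray pixel, so it passes and that pixel becomes black; the photo-like example consists mostly of intermediate values and is handed on to another prefilter.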
  33. Cross-correlation prefilter
      • Smooth the source image → smoothed
        – Pyramidal reduce 2x, then pyramidal upsample 2x
      • Apply an adaptive threshold binarization filter to the smoothed image:
        – With threshold t0 → strong
        – With threshold t1 → weak
      • Segment the (strong, weak) images using the wavemap algorithm
      • For each weak segment, find the corresponding strong segment and calculate the intersection:
        – If the ratio of the intersection area to the original segment area is less than c0, remove this segment (bad segment)
      • If the reassembled image contains a rectangular structure R
        – crop the image to the inner dimensions of R (locate molecules)
      • Calculate the average pixel intensity for the good segments and try to add other pixels whose intensity passes this boundary (if they do not affect segment connectivity)
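The segment-filtering step can be sketched with segments represented as sets of pixel coordinates. Assumptions: "original segment area" is read as the area of the weak segment, the "corresponding" strong segment is taken to be the one with the largest overlap, and the function name is hypothetical. Smoothing, thresholding and wavemap segmentation are outside this sketch.

```python
def filter_weak_segments(weak_segments, strong_segments, c0):
    """Keep weak-threshold segments that overlap enough with a strong-threshold segment.

    Each segment is a non-empty set of (x, y) pixel coordinates.
    """
    kept = []
    for weak in weak_segments:
        # Overlap with the best-matching strong segment (0 if there is none).
        overlap = max((len(weak & strong) for strong in strong_segments), default=0)
        if overlap / len(weak) >= c0:
            kept.append(weak)
    return kept


strong = [{(0, 0), (0, 1)}]
weak = [{(0, 0), (0, 1), (0, 2)}, {(5, 5), (5, 6)}]
good = filter_weak_segments(weak, strong, c0=0.5)
```

The first weak segment shares two of its three pixels with a strong segment and is kept; the second has no strong counterpart and is treated as noise.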
  34. Separator details
      • Given a binarized set of segments, classify them into two main groups: letters and chemical bond representations
      • The classification result is based on the value of C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2
        – where (r0, r1, r2) are the submethod results
        – and (k0, k1, k2) are weight constants (configurable)
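The weighted sum is simple enough to show directly. The weights and submethod results below are made-up numbers, and the function name is hypothetical; how C is finally compared with a threshold is not specified on the slide, so the sketch stops at computing C.

```python
def separator_score(results, weights):
    """C = k0*r0 + k1*r1 + k2*r2 for submethod results r and weight constants k."""
    if len(results) != len(weights):
        raise ValueError("one weight per submethod result is required")
    return sum(k * r for k, r in zip(weights, results))


# Hypothetical values: Hu moments, contours analysis, approximation criteria.
score = separator_score(results=(1.0, 0.8, 0.9), weights=(0.5, 0.25, 0.25))
```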
  35. Separator: Hu moments
      • Hu moments usually differ for characters and bonds, so a classification tree can be computed
      • Note: some objects cannot be classified that way
      • Result: r0 = 0 for symbols, r0 = 1 for bonds
  36. Separator: contours analysis
      • Extract the outer contour of the binarized segment S:
        – approximate the chain contour using the Teh-Chin chain approximation algorithm;
        – approximate the polygon once again, taking the line thickness as the approximation parameter;
        – calculate the offsets of the contour points in clockwise steps;
        – the output is a chain of sequential vectors normalized by their perimeters
      • Compare the resulting chain to the set of patterns describing valid structures
        – The set consists of 8x8 matrices where the cell (j, k) denotes the probability of changing from the jth direction to the kth
      • The result of this method is r1: the probability of {S is a bond}
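To make the 8x8 matrix idea concrete, the sketch below builds such a matrix from one closed chain of vectors. Assumptions: directions are quantized into eight 45-degree sectors numbered from the positive x axis, the contour is closed (the last vector is followed by the first), and each row is normalized to probabilities. How Imago builds and compares its pattern matrices is not described on the slide; all names here are hypothetical.

```python
import math


def direction_index(dx, dy):
    """Quantize a vector into one of 8 directions (0 = +x, 2 = +y, 4 = -x, 6 = -y)."""
    return round(math.atan2(dy, dx) / (math.pi / 4)) % 8


def transition_matrix(vectors):
    """8x8 matrix where cell (j, k) is the observed probability of direction j being followed by k."""
    directions = [direction_index(dx, dy) for dx, dy in vectors]
    counts = [[0] * 8 for _ in range(8)]
    # Pair every direction with its successor; the contour is treated as closed.
    for j, k in zip(directions, directions[1:] + directions[:1]):
        counts[j][k] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


# Contour of a thin axis-aligned rectangle, e.g. a straight bond line.
rectangle = [(4, 0), (0, 1), (-4, 0), (0, -1)]
matrix = transition_matrix(rectangle)
```

For this rectangle every direction is followed by the one 90 degrees further, so the only non-zero cells are (0, 2), (2, 4), (4, 6) and (6, 0).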
  37. Separator: approximation criteria
      • For a given segment S, calculate its best approximation with n line segments (d0) and the closest distance to the most probable character (d1)
        – If d1 < d0 and n > n0, the segment probably represents a character
      • Check its width/height ratio and height/average_height ratio: penalty p0 if these criteria are not matched
      • Result is r2 = 1 - (d1 [+ p0]): probability of {S is a bond}
        – Result is r2 = d0: probability of {S is a bond}
  38. Bonds skeleton analysis
      • Dissolve short edges
      • Join closest vertices
      • Dissolve intermediate vertices
      • Find multiple edges
      • Connect bridged bonds
      • Shrink short bonds
      • Detect and mark suspicious edges
  39. Basic labels analysis
      • Location analysis: check against the baseline
        – Subscripts lie below the line
        – Capitals lie mostly above the line
      • Calculate distances to all possible characters
      • Adjust distances using topological features
      • Select the best candidate and calculate the recognition quality
  40. Superatoms analysis
      • Concatenate recognized characters into labels
      • Check chemical validity
      • If the validity check fails, try to find the most probable alternative using other distance map elements
      • If no such alternative is found, try to recognize the less probable characters as bonds
      • Handle R-semantics and special characters: X, Q, A
  41. Appendix B. Imago: workflow features
  42. Related continuous integration system
      [Screenshot: versions list; test sets; …; results estimation]
  43. Explanation: continuous integration
      • Some logically grounded changes may decrease the recognition rate → a convenient tracking tool is required
      • A good way to improve overall stability
      • Useful visual representation of the machine-learning progress
  44. Embedded HTML-based logging system
      [Screenshot: embedded images; variables and parameters dump; call hierarchy; performance counters]
  45. Explanation: logging system
      • Structured logs (reports) offer
        – a convenient way to detect bugs;
        – an exact visual representation of the internal processes
      • Several improvements may become evident just by looking through the logs
      • The performance decrease is comparable to that of (usual) plaintext logs
      • Stability is not affected
  46. Authors
      • Rostislav Chutkov
      • Michael Rybalkin
      • Kliton Andrea
      • Victor Smolov
      • GGA Software Services LLC