2. Problem Statement
• Take network/pathway-like images related to
embryonic stem cells (ESCs) or (induced) pluripotent
stem cells ((i)PSCs), as tagged with MeSH terms, from
the PMC database at NCBI and create dynamic
networks from them.
• Identify the text, nodes and arrows in each
image to capture the flow of the network.
3. Proposed Plan-of-Action
• Read images into the C++ code and detect text
areas to identify nodes.
• Isolate the text area and process the text using
Tesseract, an open-source OCR engine sponsored by Google
– Train Tesseract on the current data for better
accuracy.
– Test Tesseract with the current data.
– Seek semantic help using MetaMap to improve
text recognition.
4. Proposed Plan-of-Action (continued)
• Process the nodes and store them in a graph.
• Detect arrowheads – arrows ( -> ) and
inhibitors ( -|)
• Introduce Links in graph by detecting the
orientation (direction) of arrowheads.
• Create GPML files
• Process GPML files (maybe using Cytoscape)
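The linking step can be sketched as follows. This is a minimal illustration, not the code from the project: it assumes each detected arrow has already been reduced to a tail point, a head point and a type, and it attaches each end to the nearest node rectangle. The names `Box`, `Arrow`, `Link`, `distanceToBox`, `nearestNode` and `buildLinks` are hypothetical, and the nearest-rectangle rule is an assumption about how orientation could be turned into links.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// A node: recognised text plus its bounding rectangle.
struct Box { std::string label; double left, top, right, bottom; };

enum class LinkType { Activation, Inhibition };  // "->" and "-|"

// An arrow whose direction is already known (tail -> head).
struct Arrow { double tailX, tailY, headX, headY; LinkType type; };

// A directed edge between two node indices.
struct Link { std::size_t source, target; LinkType type; };

// Distance from a point to a rectangle (0 if the point is inside).
double distanceToBox(const Box& b, double x, double y) {
    const double dx = std::max({b.left - x, 0.0, x - b.right});
    const double dy = std::max({b.top - y, 0.0, y - b.bottom});
    return std::hypot(dx, dy);
}

// Index of the node whose rectangle is closest to (x, y).
std::size_t nearestNode(const std::vector<Box>& nodes, double x, double y) {
    assert(!nodes.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < nodes.size(); ++i) {
        if (distanceToBox(nodes[i], x, y) < distanceToBox(nodes[best], x, y)) best = i;
    }
    return best;
}

// One link per arrow: nearest node to the tail -> nearest node to the head.
// Arrows whose two ends resolve to the same node are skipped.
std::vector<Link> buildLinks(const std::vector<Box>& nodes,
                             const std::vector<Arrow>& arrows) {
    std::vector<Link> links;
    for (const Arrow& a : arrows) {
        const std::size_t s = nearestNode(nodes, a.tailX, a.tailY);
        const std::size_t t = nearestNode(nodes, a.headX, a.headY);
        if (s != t) links.push_back(Link{s, t, a.type});
    }
    return links;
}
```

Calling `buildLinks(nodes, arrows)` returns index pairs into `nodes`, which could then be written out alongside the node list.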
5. Detect text areas to identify nodes
• Used OpenCV in C++ to process images.
• Found image contours to detect text areas in
the image. The focus here is only on locating
text areas, not on recognizing words.
• Once the contours are detected, we draw
bounding rectangles around them.
• Images in the following slides.
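The contour-and-rectangle step can be illustrated without OpenCV. The sketch below is a library-free stand-in, not the code behind these slides: it flood-fills connected foreground regions in a small binary image and returns one bounding rectangle per region, which is roughly the result of finding contours and boxing them. The names `Rect` and `boundingBoxes` and the `'#'` foreground convention are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Axis-aligned bounding box in inclusive pixel coordinates.
struct Rect { int left, top, right, bottom; };

// Returns one bounding box per 4-connected region of '#' pixels.
// `image` is a list of equal-length rows; any other character is background.
std::vector<Rect> boundingBoxes(const std::vector<std::string>& image) {
    std::vector<Rect> boxes;
    if (image.empty()) return boxes;
    const int rows = static_cast<int>(image.size());
    const int cols = static_cast<int>(image[0].size());
    for (const std::string& row : image) {
        assert(static_cast<int>(row.size()) == cols);
    }

    std::vector<std::vector<bool> > seen(rows, std::vector<bool>(cols, false));
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};

    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            if (image[y][x] != '#' || seen[y][x]) continue;
            // New region: flood fill it and grow its box as pixels are visited.
            Rect box = {x, y, x, y};
            std::vector<std::pair<int, int> > stack;
            stack.push_back(std::make_pair(x, y));
            seen[y][x] = true;
            while (!stack.empty()) {
                const int cx = stack.back().first;
                const int cy = stack.back().second;
                stack.pop_back();
                box.left = std::min(box.left, cx);
                box.right = std::max(box.right, cx);
                box.top = std::min(box.top, cy);
                box.bottom = std::max(box.bottom, cy);
                for (int k = 0; k < 4; ++k) {
                    const int nx = cx + dx[k];
                    const int ny = cy + dy[k];
                    if (nx < 0 || ny < 0 || nx >= cols || ny >= rows) continue;
                    if (image[ny][nx] != '#' || seen[ny][nx]) continue;
                    seen[ny][nx] = true;
                    stack.push_back(std::make_pair(nx, ny));
                }
            }
            boxes.push_back(box);
        }
    }
    return boxes;
}
```

Real figures would first need thresholding, and probably some dilation so that the letters of one label merge into a single region; both are left out here.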
12. Isolate text area
Process text using Tesseract
• Tesseract takes a text area as input.
• Generates output in the form of a text file
containing identified text.
• Created a graph with each text area as a node.
• The top-left and bottom-right coordinates of
each node (rectangle) are its properties.
• Stored the graph in a CSV file.
• Images in the following slides.
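A node list of this kind could be serialised as shown below. This is a sketch under assumptions: the slides only say that each node carries its top-left and bottom-right coordinates and that the graph is stored as CSV, so the column order, the header row and the names `Node`, `csvField` and `toCsv` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One graph node: a text area and the text recognised in it.
struct Node {
    int id;
    std::string label;  // OCR output for this text area
    int left, top;      // top-left corner of the rectangle
    int right, bottom;  // bottom-right corner of the rectangle
};

// Quotes a field when it contains a comma, a quote or a newline,
// doubling any embedded quotes (the usual CSV convention).
std::string csvField(const std::string& s) {
    if (s.find_first_of(",\"\n") == std::string::npos) return s;
    std::string quoted = "\"";
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '"') quoted += '"';
        quoted += s[i];
    }
    quoted += '"';
    return quoted;
}

// Renders the nodes as CSV text, one row per node, with a header row.
std::string toCsv(const std::vector<Node>& nodes) {
    std::ostringstream out;
    out << "id,label,left,top,right,bottom\n";
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const Node& n = nodes[i];
        assert(n.left <= n.right && n.top <= n.bottom);
        out << n.id << ',' << csvField(n.label) << ','
            << n.left << ',' << n.top << ','
            << n.right << ',' << n.bottom << '\n';
    }
    return out.str();
}
```

The string returned by `toCsv` can be written to a file with `std::ofstream`. Quoting matters because OCR output may itself contain commas.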
16. Detect arrowheads – arrows ( -> ) and
inhibitors ( -|)
Approach 1 – Polygon
• Arrows can be treated as a polygon with 6
or 7 edges.
• Irregularities in the arrow outlines led to a poor
polygon approximation and thus poor arrow detection.
• Smoothing filters improved the results.
• Accuracy remained low.
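The idea behind this approach can be sketched in plain C++. The block below simplifies a closed outline with the Douglas-Peucker algorithm and then checks whether 6 or 7 vertices remain. It is an illustration only, not the project's code: the tolerance `eps`, the names `Pt`, `distToLine`, `simplify`, `approxClosedPolygon` and `looksLikeArrow`, and the choice to split the outline at its first point are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Perpendicular distance from p to the line through a and b
// (plain point distance when a and b coincide).
double distToLine(Pt p, Pt a, Pt b) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double len = std::hypot(dx, dy);
    if (len == 0.0) return std::hypot(p.x - a.x, p.y - a.y);
    return std::fabs(dx * (p.y - a.y) - dy * (p.x - a.x)) / len;
}

// Douglas-Peucker on pts[first..last]; appends the kept points to `out`,
// leaving out pts[last] so that consecutive chains can be concatenated.
void simplify(const std::vector<Pt>& pts, std::size_t first, std::size_t last,
              double eps, std::vector<Pt>& out) {
    double maxDist = 0.0;
    std::size_t split = first;
    for (std::size_t i = first + 1; i < last; ++i) {
        const double d = distToLine(pts[i], pts[first], pts[last]);
        if (d > maxDist) { maxDist = d; split = i; }
    }
    if (maxDist > eps) {
        simplify(pts, first, split, eps, out);
        simplify(pts, split, last, eps, out);
    } else {
        out.push_back(pts[first]);
    }
}

// Simplifies a closed outline by cutting it into two open chains at
// point 0 and at the point farthest from it. Point 0 is always kept,
// so the outline should start at a real corner.
std::vector<Pt> approxClosedPolygon(const std::vector<Pt>& contour, double eps) {
    assert(eps >= 0.0);
    if (contour.size() < 3) return contour;
    std::size_t farIdx = 1;
    double best = -1.0;
    for (std::size_t i = 1; i < contour.size(); ++i) {
        const double d = std::hypot(contour[i].x - contour[0].x,
                                    contour[i].y - contour[0].y);
        if (d > best) { best = d; farIdx = i; }
    }
    std::vector<Pt> a(contour.begin(), contour.begin() + farIdx + 1);
    std::vector<Pt> b(contour.begin() + farIdx, contour.end());
    b.push_back(contour[0]);  // close the loop
    std::vector<Pt> out;
    simplify(a, 0, a.size() - 1, eps, out);
    simplify(b, 0, b.size() - 1, eps, out);
    return out;
}

// The slide's heuristic: an arrow outline simplifies to 6 or 7 vertices.
bool looksLikeArrow(const std::vector<Pt>& contour, double eps) {
    const std::size_t n = approxClosedPolygon(contour, eps).size();
    return n == 6 || n == 7;
}
```

The result is sensitive to `eps`: too small and jagged outlines keep extra vertices, too large and the arrowhead collapses, which fits the irregularity problem noted above.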
17. Approach 2 – Cascade Classifiers
• Cascade classifiers learn to find objects in images
from large sets of positive and negative training samples.
• I used 600 positive images (arrows) and 1000
negative images (text, images, diagrams, other
non-arrow images) to train the classifier.
• The trained classifier is heavily overfitted and needs
further parameter tuning and more work.
• A promising approach, but the best parameters vary
from image to image, so it is not yet reliable.
• Image in the following slide.
18.
19. Current issues
• Not all algorithms give good results for all the
images. They work well on some and not very
well on others.
• Text detection can be improved by training
Tesseract more.
• Arrow detection is still not giving good results.
20. Resources
• The code is available on GitHub - here
• Images can be found on NCBI - here
• Google Tesseract OCR - here
• Cascade classifiers can be tricky to work with.
A good tutorial is - here
• Node-OpenCV - here
Editor's Notes
Images can be obtained using the query –
(Embryonic Stem Cells[Mesh] OR "Stem Cell Research"[Mesh] OR "Pluripotent Stem Cells"[Mesh]) AND (pathway OR pathways OR network OR networks) AND pubmed pmc local[sb] AND loprovpmc[sb] AND (Review[ptyp])
Site Link –
http://www.ncbi.nlm.nih.gov/pmc/?term=24866112,24643740,24470117,24442477,24383051,24284400,24232254,24171168,24104210,23932125,23921754,23899786,23852015,23841088,23801533,23715547,23673969,23668474,23653415,23598974,23504955,23492828,23485729,23420198,23419197,23413375,23401375,23370908,23337973,23287468,23256519,23239357,23229513,23207694,23166396,23165208,23146768,23126226,23092754,23070616,22960547,22952393,22901255,22870932,22805743,22784697,22749051,22743233,22710171,22674461,22655979,22580472,22560073,22548573,22516205,22472874,22453975,22427062,22382129,22330734,22266195,22212700,22205306,22194017,22030746,22009073,21954066,21952290,21903672,21881606,21865592,21845024,21801025,21800022,21793804,21727129,21727126,21636265,21633173,21604061,21547058,21506923,21498416,21490948,21466483,21448589,21412766,21371482,21325131,21197666,21194386,21193838,21183529,21164479,21159814,21082893,20975044,20974014,20920576,20890967,20809088,20716614,20688375,20651740,20632858,20624054,20621047,20599211,20592865,20483202,20478297,20374481,20144857,20016762,19967349,19855279,19782409,19542351,19530135,19480567,19450430,19389995,19332320,19184567,20232600,19111019,19097085,19028990,19022761,19022563,18983701,18848887,18676806,18588486,18565538,18466917,18429046,18393635,18375378,18370048,18272516,17991914,17981772,17412888,17380311,12930889[PMID]&report=imagesdocsum