“Semantic PDF Processing & Document Representation”

Sridhar Iyengar, IBM Distinguished Engineer at the IBM T. J. Watson Research Center, presented “Semantic PDF Processing & Document Representation” as part of the Cognitive Systems Institute Group Speaker Series.

Published in: Technology

  1. 1. Future of Cognitive Computing and AI: Semantic PDF Processing and Knowledge Representation. Sridhar Iyengar, Distinguished Engineer, Cognitive Computing Research, IBM T.J. Watson Research Center, siyengar@us.ibm.com
  2. 2. Financial Services: Query from Unstructured Data. Financial Documents (.pdf, .html, .docx…); Ingest; query: “Show me revenues for Citibank between 2009 and 2015”. [Chart: axis ticks for years 2009 to 2015 and values 0 to 150] © 2017, IBM Corporation
  3. 3. Summary: PDF understanding is hard and requires significant research breakthroughs and product innovations ▪ PDF documents are optimized for display and often do not include the metadata and structure needed for cognitive post-processing – Existing technologies and solutions are optimized for printing and viewing, not cognitive post-processing ▪ Need to handle both programmatically created documents (via MS Word, PPT…) and legacy and scanned documents (forms, handwritten notes...) ▪ Approach: define a Semantic Document Structure Model (DSM) as a consistent internal representation of document structures, to be used in future WDC services and products ▪ Currently focused on table and diagram understanding from PDF – Healthcare, Financial Services, Compliance, Legal…
  4. 4. Research Focus: IBM AI Platform for Business • Best platform for building applications that incorporate enterprise and industry knowledge • Time to value at every step of cognitive application development • Tools and methodology to support development, deployment and intuitive usage. Data: ‒ Structured and unstructured data sources ‒ Multimodal (text, visual, speech, etc.) data sources ‒ Public and private data sources. Training: ‒ Create domain models and specialize them (for conversations aligned with business process; for discovery of insights) ‒ Fast adaptation to new domains ‒ Scale from small to large amounts of training data ‒ Tuned model creation for accuracy vs. training time ‒ Incremental and automated knowledge evolution. Conversation: ‒ Tools for SMEs to train from business processes ‒ Inference engines for specific content structure. Discovery: ‒ Tools for SMEs to train from business knowledge ‒ Reason about domain knowledge (vs. lexical/syntactic). Tools & Methodology: ‒ Cognitive application lifecycle (code/data/model) ‒ Resilient deployment of cognitive models
  5. 5. Cognitive Computing (AI) Technologies Research. Diagram labels: Decision Support; People Insights; Cognitive Software and Data Life Cycle*; Reasoning and Planning; Human Computer Interaction; Conversation; Query and Retrieval*; Knowledge Extraction and Representation*; Learning*; Natural Language & Text Understanding*; Visual Comprehension*; Speech and Audio; Embodied Cognition; Cognitive Computing Platform Infrastructure; Signal Comprehension; Reasoning About Domains; Interaction Systems; Trust and Security; Semantic PDF Processing*
  6. 6. Goal: From Raw Data to Business Artifacts. Example 1: Semantic PDF Processing. [Diagram labels: .pdf; Section; fragment (×3); Line Plot; Bulleted List; Obligation] • Create document fragments by parsing out chunks • Document structure models • Reason about document chunks • Create representation for a fragment • Document fragment models • Reason about fragment constituents • Create representation for an obligation • Models for “obligation language” • Reason about list or data that refines the obligation • Hierarchical processing • Machine-learned models and reasoning at all levels • Learnability of artifacts, models • Learn how to specify reasoners
  7. 7. From Raw Data to Business Artifacts. Example 1 [Diagram labels: .pdf; Section; fragment (×3); Line Plot; Bulleted List; Obligation] • Create document fragments by parsing out chunks • Document structure models • Reason about document chunks • Create representation for a fragment • Document fragment models • Reason about fragment constituents • Create representation for an obligation • Models for “obligation language” • Reason about list or data that refines the obligation. Example 2: Semantic MPEG Processing [Diagram labels: .mp4; Scene (×2); Boy; Girl; Night; Soft Music; Candles; Romantic Scene] • Hierarchical processing • Machine-learned models and reasoning at all levels • Learnability of artifacts, models • Learn how to specify reasoners
  8. 8. Why is PDF processing hard? Complexity akin to “Natural Language Understanding” ▪ Thousands of PDF generators (drivers), each with its own rules for placing marks on paper ▪ Incredible variety in content: complex tables, images, diagrams, formulas, varying resolution in scanned content ▪ No closed-form / algorithmic solution is feasible; must resort to machine learning
  9. 9. Why is it hard? Variety of tables: 20-25 major table types in discussion with just one major customer. Complex tables: graphical lines can be misleading (is this 1, 2 or 3 tables?). Examples shown: table with visual clues only; multi-row, multi-column column headers; nested row headers; tables with textual content; table with graphic lines; table interleaved with text and charts; complex multi-row, multi-column column headers identifiable using graphical lines and visual clues
  10. 10. Why is it hard? Variety in Image, Diagram Types. [Embedded sample page: L. Lin et al., Pattern Recognition 42 (2009) 1297-1307, p. 1305, including Fig. 8, “ROC curves of the detection results for bicycle parts”] PDF rendering: ▪ .doc, .ppt rendering to .pdf keeps minimal structure formatting; it is geared towards visual fidelity ▪ Often .pdf is created by “screen scraping” or scanning or hybrid ways that do not keep structure information. Multi-modality: extremely rich information ▪ Images + Text + Tables both co-exist and form nested hierarchies, possibly with several levels. Examples shown: nested table (numeric and non-numeric + image); tabular representation of images with pictorial cross reference; images + captions + cross references and text that comments on the image
  11. 11. Two major approaches to tackling PDF Processing ▪ Unsupervised Learning and out of the box PDF processing – Works well for a large class of domains with some compromise in quality ▪ Supervised Learning with a graphical labelling tool – Potential for improved quality when many similar documents are available Both approaches can be used together
  12. 12. PDF Understanding: High Level Overview. Diagram labels: PDF Parser; Image classification; Text analytics (Programmatic PDF); TU: Table understanding (Programmatic PDF); DU: scanned tables (ST); DU: Line plots (LP); DU: flow charts (FC); DU: bubble plots (BP); …; Data integration: Linking text to diagrams, tables, serialization…
  13. 13. Learned Semantic Document Representation
  14. 14. PDF Processing Overview in WDC. WDC DCS Service: PDF Docs in; HTML, JSON, Plain Text out (text and simple table structure). https://www.ibm.com/watson/developercloud/document-conversion.html Current implementation of DCS has limited table processing capability and no support for scanned documents, diagrams, graphs, etc.
  15. 15. PDF Processing.Next Overview (2017/2018). WDC DCS.Next Service: PDF, HTML, Word Docs in; DSM-XML, JSON, Plain Text, HTML out (text, tables, diagrams, graphs...). WDC DCS Service… New PDF Tools: PDF2HTML, PDF2JSON, PDF2-DSMXML. Users: Developer (using DCS API); SME, Data Scientist (domain adaptation using ML; PDF, HTML, Word Docs for training)
  16. 16. PDF Conversion Architecture: ML-based PDF Conversion Pipeline. Inputs: Programmatic PDF, Scanned PDF, Hybrid PDF; output: HTML. Programmatic PDF extraction: PDFBox API: Parse PDF Document; Cleanse Raw PDF Data. Scanned & hybrid PDF extraction: OCR Engine API: Scan PDF Document; Cleanse Raw OCR Output. Stages shown: Composite Unit / Region Identification; Layout + Reading Order Inference; Metadata Identification; Table Identification; Table Structure Population; Chart Identification; HTML Generation. Legend: Open Source or Commercial Software; Research Extensions. • ML-based PDF conversion pipeline is source-independent • The SAME ML-based algorithms can be applied directly to data extracted from either scanned or programmatic PDF • PDF conversion ML algorithms are unsupervised; thus they achieve stated performance out of the box with NO training / tuning data required • Deployable in Cloud as a document-at-a-time processing service • Scanned PDF processing available now using Datacap OCR engine • Extension using Tesseract engine
  17. 17. PDF Conversion Downstream Analytics Example • HTML output from WCS PDF Conversion is directly consumable by downstream analytics • PDF Conversion table processing example. Flow labels: PDF; WCS PDF Conversion; HTML; Table Processing; Structured Facts from Table; Watson Knowledge Graph; NLQ Answering; Answer. Original scanned PDF table, and HTML generated from current PDF Conversion Web service:
      Bridge         Designer             Length
      Brooklyn       J. A. Roebling       1595
      Manhattan      G. Lindenthal        1470
      Queensborough  Palmer & Hornbostel  1182
    Structured facts from existing Table Processing Libraries (with appropriate customization). NLQ Answering: “Who designed Brooklyn Bridge?” Answer: J. A. Roebling …
  18. 18. Document Structure Model (Document Representation) • Define a common document structure ideal for subsequent semantic analysis • Defined per feature: Section, Bulleted Lists, Headers, Footers, Footnotes, Tables, ... • Define how section information such as title, number and nesting should be represented • Define how list information such as list items and list type should be represented
  19. 19. Document Structure Model (DSM) - Draft. Logical Data Model; Ontology Representation. [Diagram entities: Scan PDF; Prog PDF; Page; PageColumn; Paragraph; TextLine; Phrase; Token; Character; PageChart; Table; TableCell; Graphical Line; Embedded Image. Attributes shown: Color; displayOrder; rowSpan; colSpan; BoundBoxCoords; contents. [1…n] means ordered list. All objects have a BoundingBox attribute.] • Goal: define a common document structure ideal for subsequent semantic processing • Captures both raw extracted information (text, vector graphics) and inferred artifacts (tables, charts, paragraphs) • Start with PDF documents and extend to other formats such as Word and Excel • DSM schema in OWL; serializations to HTML, JSON...
  20. 20. Financial Services: Query from Unstructured Data. Financial Documents (.pdf, .html, .docx…); Ingest; query: “Show me revenues for Citibank between 2009 and 2015”. [Chart: axis ticks for years 2009 to 2015 and values 0 to 150]
  21. 21. Thank You
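
Slide 16 states that the conversion pipeline is source-independent: the same algorithms run on data extracted from either programmatic or scanned PDF. A minimal Python sketch of that idea follows. It assumes both extraction paths are normalized into one element type before any shared stage runs. The `Element` type, the stage functions, and the simple top-to-bottom, left-to-right sort that stands in for reading-order inference are all hypothetical; the slides describe ML-based stages and do not specify an implementation.

```python
import html
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Element:
    """Hypothetical normalized text element shared by both extraction paths."""
    text: str
    x: float  # left edge
    y: float  # distance from the top of the page


def cleanse(elements: List[Element]) -> List[Element]:
    """Trim whitespace and drop empty elements (stand-in for the cleanse steps)."""
    trimmed = (Element(e.text.strip(), e.x, e.y) for e in elements)
    return [e for e in trimmed if e.text]


def reading_order(elements: List[Element]) -> List[Element]:
    """Naive stand-in for reading-order inference: top to bottom, then left to right."""
    return sorted(elements, key=lambda e: (e.y, e.x))


def to_html(elements: List[Element]) -> str:
    """Emit one paragraph per element, escaping HTML special characters."""
    return "".join(f"<p>{html.escape(e.text)}</p>" for e in elements)


def convert(elements: List[Element]) -> str:
    """Shared pipeline: the same stages run wherever the elements came from."""
    return to_html(reading_order(cleanse(elements)))


# Assumed outputs of a programmatic parser and an OCR engine for the same page.
from_parser = [Element("World", 50.0, 10.0), Element("Hello", 10.0, 10.0)]
from_ocr = [Element(" Hello ", 10.0, 10.0), Element("", 0.0, 0.0), Element("World", 50.0, 10.0)]

print(convert(from_parser))
print(convert(from_ocr))
```

Both calls go through the same `convert` function; only the code that produces the `Element` list would differ between a PDF parser and an OCR engine.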
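
Slide 17 walks from an extracted table to structured facts and then to a natural-language answer. The Python sketch below shows one way such a flow could work, using the bridge table from the slide. The triple format, the function names, and the regular-expression question matcher are illustrative assumptions; they are not the Table Processing Libraries or the NLQ component that the slide refers to.

```python
import re
from typing import List, Optional, Tuple

# Rows as they appear in the slide's example table (header row first).
TABLE: List[List[str]] = [
    ["Bridge", "Designer", "Length"],
    ["Brooklyn", "J. A. Roebling", "1595"],
    ["Manhattan", "G. Lindenthal", "1470"],
    ["Queensborough", "Palmer & Hornbostel", "1182"],
]

Fact = Tuple[str, str, str]  # (subject, attribute, value); a hypothetical format


def table_to_facts(table: List[List[str]]) -> List[Fact]:
    """Treat the first column as the subject and every other column as an attribute."""
    header, *rows = table
    facts: List[Fact] = []
    for row in rows:
        subject = row[0]
        for attribute, value in zip(header[1:], row[1:]):
            facts.append((subject, attribute.lower(), value))
    return facts


def answer_who_designed(question: str, facts: List[Fact]) -> Optional[str]:
    """Answer only questions of the form 'Who designed <name> ...?'."""
    match = re.search(r"who designed (?:the )?(\w+)", question, re.IGNORECASE)
    if match is None:
        return None
    wanted = match.group(1).lower()
    for subject, attribute, value in facts:
        if subject.lower() == wanted and attribute == "designer":
            return value
    return None


facts = table_to_facts(TABLE)
print(answer_who_designed("Who designed Brooklyn Bridge?", facts))
```

A real system would need far more than a single question template, but the sketch shows why a correctly recovered table structure matters: the answer is a direct lookup once header and row cells are associated.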
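
Slide 19 lists the draft DSM entities, notes that every object carries a BoundingBox, and mentions JSON as one serialization. The Python sketch below models a small subset of those entities to make that concrete. The class layout, the snake_case field names (the slide shows `rowSpan` and `colSpan`), the containment hierarchy, and the sample cell contents are assumptions for illustration; the actual DSM schema is defined in OWL and is not reproduced here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


def enclosing(boxes: List[BoundingBox]) -> BoundingBox:
    """Smallest box that contains every box in the list."""
    return BoundingBox(
        min(b.x0 for b in boxes), min(b.y0 for b in boxes),
        max(b.x1 for b in boxes), max(b.y1 for b in boxes),
    )


@dataclass
class TableCell:
    contents: str
    bbox: BoundingBox
    row_span: int = 1  # assumed counterpart of the slide's rowSpan
    col_span: int = 1  # assumed counterpart of the slide's colSpan


@dataclass
class Table:
    cells: List[TableCell]  # ordered list, like the slide's [1…n]
    bbox: BoundingBox


@dataclass
class Paragraph:
    text: str
    bbox: BoundingBox


@dataclass
class Page:
    bbox: BoundingBox
    paragraphs: List[Paragraph] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


# Illustrative content only: a spanning header cell above two column headers.
cells = [
    TableCell("Bridge details", BoundingBox(0, 0, 200, 20), col_span=2),
    TableCell("Designer", BoundingBox(0, 20, 100, 40)),
    TableCell("Length", BoundingBox(100, 20, 200, 40)),
]
table = Table(cells, enclosing([c.bbox for c in cells]))
page = Page(BoundingBox(0, 0, 612, 792), tables=[table])

# asdict() recurses through nested dataclasses, so the page serializes directly.
serialized = json.dumps(asdict(page))
print(serialized)
```

Keeping the bounding box on every object means an inferred artifact such as a table can always be traced back to a region of the page.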
