11. DOCUMENT CLASSIFICATION
Machine learning approach
• neural network
• no hand-written rules, which are complex and difficult to maintain
General-purpose classification
• based on document type, e.g.
• language
• supplier
• splitting a stream of pages into documents
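Turning a stream of pages into documents can be sketched as a grouping step over per-page predictions. This is only an illustration: it assumes a hypothetical page classifier (not shown, and not described in the slides) that returns a document type and an "is first page" flag for each page. The function name and the tuple format are made up for the example.

```python
def split_into_documents(page_labels):
    """Group page indices into documents.

    page_labels: list of (doc_type, is_first_page) tuples, one per page,
    as a hypothetical page classifier might produce them.
    """
    documents = []
    for index, (doc_type, is_first) in enumerate(page_labels):
        # Start a new document on a predicted first page, at the start of
        # the stream, or whenever the predicted type changes.
        starts_new = (
            is_first
            or not documents
            or documents[-1]["type"] != doc_type
        )
        if starts_new:
            documents.append({"type": doc_type, "pages": []})
        documents[-1]["pages"].append(index)
    return documents


if __name__ == "__main__":
    stream = [
        ("invoice", True),
        ("invoice", False),
        ("invoice", True),
        ("contract", True),
        ("contract", False),
    ]
    for doc in split_into_documents(stream):
        print(doc["type"], doc["pages"])
```

Starting a new document on a type change as well as on a first-page flag is a design choice made here so that a single missed first-page prediction does not merge two different document types.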
15. INTELLIGENT DATA EXTRACTION
No prior training - Semantic analysis
• lists
• math relations
• keywords
• data types
• regular expressions
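A minimal sketch of how keywords, data types and regular expressions could combine to extract fields without prior training. The keyword list, the patterns and the `label: value` line format are all assumptions chosen for illustration, not the product's actual logic.

```python
import re
from datetime import date
from decimal import Decimal

# Illustrative patterns only; real documents need far more variants.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
AMOUNT_RE = re.compile(r"\b\d+\.\d{2}\b")
ID_RE = re.compile(r"\b[A-Z]{2,}-\d+\b")


def parse_date(text):
    m = DATE_RE.search(text)
    return date(int(m[1]), int(m[2]), int(m[3])) if m else None


def parse_amount(text):
    m = AMOUNT_RE.search(text)
    return Decimal(m[0]) if m else None


def parse_identifier(text):
    m = ID_RE.search(text)
    return m[0] if m else None


# Hypothetical keyword -> parser map: the keyword says which field this is,
# the parser enforces the expected data type.
KEYWORDS = {
    "invoice no": parse_identifier,
    "invoice date": parse_date,
    "total": parse_amount,
}


def extract_fields(lines):
    """Return {keyword: typed value} for every line whose label is a known keyword."""
    fields = {}
    for line in lines:
        label, _, rest = line.partition(":")
        key = label.strip().lower()
        parser = KEYWORDS.get(key)
        if parser is None:
            continue
        value = parser(rest)
        if value is not None:
            fields[key] = value
    return fields
```

A value is kept only when both the keyword and the data type agree, which is the point of combining the hints rather than relying on any single one.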
Learning capabilities - Dynamic templates
• positional hints based on layout
• structural hints – table detection
Learns while being used
18. TABLE DETECTION
Automatic detection
• Borders, colors, cell indentation
• Both relative and absolute positions
• Semantics: header labels; Math: Qty * Unit Price = LineTotal
• Data types
• Formulas - e.g. the sum of Line totals is equal to the Subtotal amount in a bill
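The two math relations named above (Qty * Unit Price = LineTotal, and the sum of line totals = Subtotal) can be used to confirm that a candidate table was read correctly. The sketch below assumes the rows have already been extracted as decimal values. The function name, the row format and the rounding tolerance are illustrative assumptions.

```python
from decimal import Decimal


def check_table(rows, subtotal, tolerance=Decimal("0.01")):
    """Check the math relations of a candidate table.

    rows: list of (qty, unit_price, line_total) Decimal tuples.
    Returns a list of problem descriptions; an empty list means the
    table is consistent within the tolerance.
    """
    problems = []
    running = Decimal("0")
    for i, (qty, unit_price, line_total) in enumerate(rows, start=1):
        # Row-level relation: Qty * Unit Price = LineTotal
        if abs(qty * unit_price - line_total) > tolerance:
            problems.append(f"row {i}: {qty} * {unit_price} != {line_total}")
        running += line_total
    # Table-level relation: sum of line totals = Subtotal
    if abs(running - subtotal) > tolerance:
        problems.append(f"subtotal: sum of line totals {running} != {subtotal}")
    return problems
```

Decimal is used instead of float so that currency amounts compare exactly. The tolerance only absorbs rounding on the printed document.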
Training