Document Recognition: A Technology Overview. Presented by: Chris Riley
What we will cover: Why Chris? What Are the Document Recognition Technologies Who Makes Them News Flash The Future Q & A
Why Chris? Professional Experience LivingAnalytics, Inc.  Artsyl Technologies, Inc. Visioneer, Inc. ABBYY IntelliKey Solutions, Inc. Regis University Deep Study in Genetic Algorithms and Real-Time Analytics What qualifies Chris to talk to me? Subject Matter Expert for: AIIM, TAWPI, DIR, Business Solution Mag. Obtained Distinguished Services Award for Market Education  When a developer turns to sales and marketing
The Technologies OCR – Optical Character Recognition ICR – Intelligent Character Recognition OMR – Optical Mark Recognition IDR – Intelligent Document Recognition Barcode Handwriting All the other ones made up for marketing purposes CAR/LAR ( Check21 ) – Courtesy and Legal Amount Recognition Assisted Capture Fixed Form Process Semi-Structured Forms Processing Unstructured Document Processing
The Technologies: OCR – Optical Character Recognition. Sample text: Ship To:
The Technologies: ICR – Intelligent Character Recognition. Sample text: Ilya
The Technologies: OMR – Optical Mark Recognition. Sample text: Card Account
The Technologies: IDR – Intelligent Document Recognition. Sample document types: Check, Invoice, Bill of Lading, EOB
The Technologies: Barcode. Sample value: 1889094476620
The Technologies: Handwriting. Sample text: * Critical *
The Technologies: Acronym Heaven. All the other ones made up for marketing purposes: CAR/LAR ( Check21 ) – Courtesy and Legal Amount Recognition, Assisted Capture, Fixed Form Processing, Semi-Structured Forms Processing, Unstructured Document Processing
The Technologies: CAR/LAR ( Check21 ) – Courtesy and Legal Amount Recognition. Sample text: 2 hundred dollars & no cents
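On a check, the courtesy amount is the numeric box and the legal amount is the written-out line, so the two reads can be compared against each other. The following is a minimal sketch of that cross-check, not any vendor's method. The function names, the tiny number vocabulary, and the "$200.00" courtesy value are assumptions made for illustration; only the legal amount string comes from the slide.

```python
import re
from decimal import Decimal

# Small vocabulary of number words (an assumption; a real engine needs far more).
UNITS = {
    "zero": 0, "no": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(text):
    """Convert a mix of digits and number words to an integer."""
    total, current = 0, 0
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        if token.isdigit():
            current += int(token)
        elif token in UNITS:
            current += UNITS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        # Any other token ("and", "cents", stray letters) is ignored.
    return total + current


def parse_legal_amount(text):
    """Parse a legal amount line such as '2 hundred dollars & no cents'."""
    dollars_part, _, cents_part = text.lower().partition("dollar")
    fraction = re.search(r"(\d+)\s*/\s*100", cents_part)
    cents = int(fraction.group(1)) if fraction else words_to_int(cents_part)
    return Decimal(words_to_int(dollars_part)) + Decimal(cents) / 100


def amounts_agree(courtesy, legal):
    """Return True when the numeric box and the written line match."""
    courtesy_value = Decimal(re.sub(r"[^\d.]", "", courtesy))
    return courtesy_value == parse_legal_amount(legal)


print(amounts_agree("$200.00", "2 hundred dollars & no cents"))
```

When the two reads disagree, a capture system would typically route the check to a human operator rather than trust either value.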
The Technologies: Assisted Capture
The Technologies: Fixed Form Processing Name: Ilya Date: 12/21/2982
The Technologies: Semi-Structured Forms Processing – Complexity is Underestimated. Sample text: Invoice No: 99044 Date: 06/09/04; Invoice No: 24567 Date: 06/09/04
The Technologies: Semi-Structured Forms. Note: many people mistake these documents for fixed forms. Sample text: Invoice No: 99044 Date: 06/09/04; Invoice No: 24567 Date: 06/09/04 (06/09/2004)
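On semi-structured documents the same fields appear in different places from one layout to the next, so they cannot be read from fixed coordinates. A minimal sketch of one alternative, keyword-anchored extraction, follows. The regular expressions, the function names, the two sample layouts, and the month-first reading of the date are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical label patterns; real invoices use many more label variants.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\s*[:.]?\s*(\d+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:.]?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE),
}


def normalize_date(raw):
    """Expand a date to MM/DD/YYYY, assuming month-first order."""
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return raw  # Leave unparseable values for manual review.


def extract_fields(page_text):
    """Find each field by its label, wherever the label sits on the page."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        if match:
            fields[name] = match.group(1)
    if "date" in fields:
        fields["date"] = normalize_date(fields["date"])
    return fields


# Two made-up layouts carrying the same fields in different positions.
layout_a = "ACME Corp\nInvoice No: 99044    Date: 06/09/04\nWidgets  10"
layout_b = "Date: 06/09/04\nRemit to: Example Supply\n\n      Invoice No: 24567"

print(extract_fields(layout_a))
print(extract_fields(layout_b))
```

A label search like this breaks down when a page carries several dates or unlabeled values, which is part of why the complexity of these documents is easy to underestimate.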
The Technologies: Semi-Structured Forms. Sample fields: Consignee, Consignor, Date, Term
The Technologies: Common Processes Full page conversion Classification Index level extraction Redaction Routing Auto Filing Re-Purposing Image Rotation
The Technologies: Full page conversion Image file to electronic data file ALL text on the page Includes: Image Pre-processing Document Analysis/Zoning Extraction Export ( Commonly PDF, DOC )
The Technologies: Classification Software tells you the document type Several Modes of document classification Image Based Contextual Scan batches of mixed documents Bill of Lading Invoice Check PO
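As a rough illustration of the contextual mode, a classifier can score the recognized text of each page against keywords typical of each document type. This is a minimal sketch only: the keyword lists and the `classify` function are hypothetical, and commercial classifiers use far richer models.

```python
import re

# Hypothetical keyword lists for the document types named on the slide.
KEYWORDS = {
    "Bill of Lading": ["bill of lading", "consignee", "consignor", "carrier"],
    "Invoice": ["invoice", "amount due", "remit to"],
    "Check": ["pay to the order of", "memo"],
    "PO": ["purchase order", "ship to"],
}


def classify(page_text):
    """Return the document type whose keywords occur most often."""
    text = re.sub(r"\s+", " ", page_text.lower())
    scores = {
        doc_type: sum(text.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"


# A mixed batch, as it might come off the scanner after full-page OCR.
batch = [
    "BILL OF LADING\nConsignee: Example Co\nConsignor: Sample Ltd",
    "Invoice No: 99044  Total amount due: 10.00",
    "Pay to the order of Example Co",
    "Lunch menu for Friday",
]
for page in batch:
    print(classify(page))
```

Pages that score zero for every type are reported as "Unknown" so they can be routed to an operator instead of being filed under a guess.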
The Technologies: Index Level Extraction Just certain required fields extracted Normalization of data Export usually to a database Invoice Number Invoice Date Total Amt Due Term
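A minimal sketch of the normalize-then-export step follows, using an in-memory SQLite database as a stand-in for whatever system receives the index data. The table name, column names, and sample record are assumptions made for illustration; the four fields are the ones listed on the slide.

```python
import sqlite3
from decimal import Decimal


def normalize_amount(raw):
    """Strip currency symbols and separators, then fix two decimal places."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return str(Decimal(cleaned).quantize(Decimal("0.01")))


def export_index(records, connection):
    """Write only the required index fields to a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS invoice_index ("
        "invoice_number TEXT, invoice_date TEXT, total_due TEXT, term TEXT)"
    )
    connection.executemany(
        "INSERT INTO invoice_index VALUES (?, ?, ?, ?)",
        [
            (
                record["invoice_number"],
                record["invoice_date"],
                normalize_amount(record["total_due"]),
                record["term"],
            )
            for record in records
        ],
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
export_index(
    [
        {
            "invoice_number": "99044",
            "invoice_date": "06/09/2004",
            "total_due": "$1,234.5",
            "term": "Net 30",
        }
    ],
    connection,
)
print(connection.execute("SELECT * FROM invoice_index").fetchall())
```

Normalizing before export means downstream systems see one amount format regardless of how each vendor prints it.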
The Technologies: How Accurate Better question is how do you determine accuracy Document Type Accuracy Field/Zone Location Accuracy Data Type Accuracy Character Accuracy
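The gap between these levels can be shown with a small calculation. The sketch below measures character accuracy with edit distance and field accuracy with exact matches. It is one common way to compute such figures, not a standard the slides prescribe, and the function names and sample values are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_accuracy(recognized, truth):
    """Share of ground-truth characters recovered, based on edit distance."""
    if not truth:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1.0 - edit_distance(recognized, truth) / len(truth))


def field_accuracy(recognized_fields, truth_fields):
    """Share of fields whose value matches the ground truth exactly."""
    correct = sum(
        1
        for name, value in truth_fields.items()
        if recognized_fields.get(name) == value
    )
    return correct / len(truth_fields)


truth = {"invoice_no": "99044", "date": "06/09/2004"}
recognized = {"invoice_no": "99O44", "date": "06/09/2004"}  # letter O for zero

print(character_accuracy("Invoice No: 99O44", "Invoice No: 99044"))
print(field_accuracy(recognized, truth))
```

One misread character leaves character accuracy at 16 of 17 on that line, yet half of the two fields are wrong, which is why the measurement level matters when comparing accuracy claims.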
The Technologies: Document Complexities By Data Capture Complexity – Hardest to Easiest EOB – Marginal success BancTec HOV ECS Student Transcriptions – no success, no money Invoices ( not a vertical ) Telecom Bills Legal Invoices Aggregate Invoices Traditional Invoice Checks Bill of Lading  Prescriptions and Transportation Documents HCFA UB Fax Cover Sheets Other typographic Fixed Forms
There really are only 4 core technology providers. It takes 50 man-years to develop OCR using the traditional approach.
Who Makes Them: Core Engines. Traditional OCR Approach ( sorted by market share ) – All European Engines: Nuance ( formerly ScanSoft ), derivative of the Caere engine: middle of the road on cost, accuracy, and speed; ABBYY: most accurate, slowest, most expensive; Océ: very fast, moderately expensive; ReadI.R.I.S: fastest, not very accurate. Specialized Engines: CharacTell, ParaScript, A2iA, Mitek. Non-Traditional Approach: NovoDynamics, TIS, Paledon. Other: a handful of open source engines ( Tesseract, OCRopus ); two handfuls of old engines ( Expervision, Caere ).
Who Makes Them: History. Ray Kurzweil, father of OCR – 1974 ( arguably; in university for some time ). Caere founded – 1976. Ray sells his engine to Xerox PAC, becomes TextBridge – 1978. ReadI.R.I.S formed by Belgian grant – 1981. Tesseract created by HP research – 1985. Expervision founded – 1987 ( UNLV becomes standards organization for OCR ). ABBYY founded – 1989 ( MIPT, Moscow Institute of Physics and Technology, the Russian MIT equivalent ). Luc Vincent invents Document Analysis ( now at Google, lead on Tesseract project ) – 1994. ScanSoft splits from Xerox – 1998. ScanSoft acquires Caere – 2000. ScanSoft becomes Nuance – 2005 ( OCR business takes a backseat ).
Who Makes Them:  Who Licenses Them EVERYONE ELSE! AnaComp Anydoc BancTec BrainWare Captaris Captivation Cardiff CVision DataCap DigiTech eCopy EMC Documentum Kofax LaserFiche LeadTools Microsoft NSi AutoStore OnBase Perceptive Imaging ReadSoft SER Top Image Systems Tower Westbrook Xerox Etc.
News Flash. Purchase consolidation: OpenText bought Captaris, which bought Océ. Legal: Nuance sues ABBYY over core OCR algorithms. If they win, there will be only one OCR engine, and the only option is new approaches.
What we will cover: Why Chris? What Are the Document Recognition Technologies Who Makes Them Buyer Beware News Flash The Future Q & A
The Future: More lawsuits. More consolidation. Full-page OCR will be a commodity. Advanced document processing will become mainstream but less required. Document classification will be the next big area of research and product solutions, the new big-ticket item. There will be a new approach to OCR. Think about what to do now that you will be gathering data rapidly.
Questions and Answers

Document Recognition Market Landscape
