Captiva’s intelligent capture solutions capture information from a wide variety of file formats and document types. The Captiva family helps you capture business-critical information from paper, fax, and electronic data sources and transform it into business-ready content suitable for processing by enterprise applications. You can automate the processing of billions of documents annually, quickly and accurately converting their contents into information that is usable for all enterprise business processes in a timely and cost-effective manner.
At a high level, most intelligent capture processes have five steps: capturing a document, identifying what type of document was captured, extracting key information from the document based on the document type, ensuring that the data has been correctly extracted and is accurate, and delivering it to business processes or content repositories.
(Note to Presenter: Click now in Slide Show mode for animation.)
Capture involves much more than simply digitizing paper documents using a high-speed scanner. Increasingly, documents must be captured from a variety of sources and from anywhere within an enterprise: branch offices, regional scanning centers, or ad hoc capture by field agents.
(Note to Presenter: Click now in Slide Show mode for animation.)
After documents are scanned and enhanced, advanced technologies are applied to identify document types. In some cases, documents are readily recognized based on physical appearance, especially structured forms. In other cases, documents are identified based on common text, as with legal contracts.
(Note to Presenter: Click now in Slide Show mode for animation.)
Different document types have very different requirements for data extraction.
Some documents require only simple indexing so that they can be found quickly within repositories. Other documents have much more advanced requirements, essentially transforming all of the information on a paper document into electronic data, as with forms, invoices, or other transactional documents.
(Note to Presenter: Click now in Slide Show mode for animation.)
It is especially critical that the data in documents that drive important business processes is validated. Inaccurate data can cause extremely costly problems if it is not found until mistakes are made or incorrect business decisions are executed. Captiva capture solutions feature database lookups and business rules to ensure that data is accurate before it is passed to the next step in the business process.
(Note to Presenter: Click now in Slide Show mode for animation.)
Finally, Captiva provides integrated delivery to content repositories and business applications, ensuring that both images and extracted data are successfully delivered to the back-end system, in the right location, with the right business processes triggered.
The intelligent capture process can be fully customized to meet individual customer requirements and includes options for distributed capture, sophisticated intelligent document recognition, and integration into larger enterprise applications that leverage service-oriented architectures.
For example, by scanning all documents within its legal department, a leading pharmaceutical company has achieved significant labor productivity improvements among its high-value staff and reduced costs for physical storage space, shipping costs, and paper suppliers. The company achieved a return on investment of 113 percent with a payback period of 18 months. After having the system in place for three years, the company has saved an estimated $17.9 million by scanning in all documents.
A key element of Captiva’s Intelligent Capture suite is its Intelligent Document Recognition (IDR) technologies. They add advanced capabilities that dramatically enhance the ability to capture, organize, and transform any type of document into usable business data. Captiva advances the technology in three key areas: classification, extraction, and validation.
[click]
Classify
As document capture is increasingly implemented across large organizations, multiple types of documents are being captured. With traditional capture, separate capture processes or time-consuming document preparation are required to capture multiple document types: bar code labels must be added to documents, or separator sheets must be inserted between documents, to help the capture system understand when a new document is starting. With Captiva’s Intelligent Classification technologies, documents can be identified based on their physical appearance, which works well for structured forms, or by the text content within the document, which is required for less structured documents such as invoices or even legal documents. Captiva’s IDR technologies dramatically reduce the work required to prepare documents for capture, significantly reducing the associated costs and speeding the transformation from paper documents into business-ready information.
[click]
Extract
Once documents are classified, data is typically extracted from them. In some cases, this is achievable using traditional capture techniques, either with data keyers or by extracting data from known areas of the document, such as the upper right-hand corner. As more and different types of documents are captured, the exact location of the information isn’t always known. It could be an invoice, where the data may be in different locations depending on the vendor, or a piece of correspondence, where the address block could be anywhere.
Captiva’s IDR technologies allow customers to combine high-speed, accurate zonal extraction for known, structured document types with highly flexible freeform extraction that finds information regardless of where it is located. Table-reading extraction also enables extraction of data from complex billing documents.
[click]
Validation
The third important component of IDR is validation. Our customers tell us that the costs of finding inaccurate information later in the process are prohibitive, so as a best practice we encourage customers to take advantage of several forms of validation to ensure that accurate information is captured and delivered to back-end systems. This makes documents more findable, makes business processes more reliable, and, of course, saves time and money. Captiva validates data against other data sources, such as ERP systems or other databases, and enables organizations to compare data against business rules to ensure that it is captured accurately and meets expectations.
Together, Captiva’s IDR technologies advance document capture far beyond traditional capture solutions, providing far more value to customers by addressing many more applications and providing better transformation of documents into usable information.
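The two forms of validation described above (a lookup against another data source and a comparison against business rules) can be illustrated with a minimal Python sketch. It is not Captiva's implementation: the `KNOWN_VENDORS` set, the function names, and the specific rules are assumptions made up for illustration, and the sample values are the invoice fields shown on the slides.

```python
from decimal import Decimal

# Hypothetical lookup table standing in for an ERP vendor master (assumption).
KNOWN_VENDORS = {"Acme Products"}


def parse_amount(text):
    """Turn an extracted amount such as '$ 6,014.81' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def validate_invoice(fields):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    # Database-lookup style check: the vendor must exist in the reference data.
    if fields.get("Vendor Name") not in KNOWN_VENDORS:
        problems.append("unknown vendor")
    # Business-rule style checks (illustrative rules, not product rules).
    if not fields.get("Invoice Number", "").isdigit():
        problems.append("invoice number is not numeric")
    try:
        subtotal = parse_amount(fields["Subtotal"])
        total = parse_amount(fields["Grand Total"])
        if total < subtotal:
            problems.append("grand total is less than subtotal")
    except (KeyError, ArithmeticError):
        # Missing field, or text that does not parse as a number.
        problems.append("amount missing or unreadable")
    return problems


invoice = {
    "Invoice Number": "10010",
    "Vendor Name": "Acme Products",
    "Subtotal": "$ 6,014.81",
    "Grand Total": "$ 6,025.88",
}
print(validate_invoice(invoice))
```

Records that return a non-empty list would be the ones routed to an operator before delivery, which is the point made above about catching inaccurate data early.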
Captiva Dispatcher benefits:
Streamline the flow of data: Captiva Dispatcher is a technical module that plugs into the InputAccel document capture platform. Dispatcher processes scanned images coming from InputAccel and can handle forms, invoices, checks, explanations of benefits (EOBs), and loan documents, whether single-page or multi-page. Based on multiple innovative technologies, Dispatcher automatically sorts the documents and extracts key business information. EMC Captiva also delivers Dispatcher as an API module that can be called from any other document capture or third-party system (for example, Dispatcher can be called by Web Services through Global 360).
Reduce scanning preparation time and cost: Some document capture systems still require bar codes or separators to detect folders, document types, and single-page or multi-page documents, which means preparation time is spent manually inserting the correct separators. Manual pre-sorting is an inefficient process prone to human error, whereas Dispatcher, using multiple classification thresholds, can automatically detect the correct document type. Dispatcher also automatically detects portrait/landscape orientation as well as upside-down or back-to-front documents.
Reduce manual data entry and typing errors: Once Dispatcher detects the document type, it extracts business information, such as invoice data for Accounts Payable automation or patient data for EOB applications. Business data is validated against specific business rules and is therefore ready to be imported into an Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), or third-party system with limited human intervention and risk of error. Automatic routing and data extraction guarantee that documents are not lost, damaged, or forgotten about, as can happen in a manual process. This in turn ensures compliance and enhances customer relationships.
Improve return on investment for transactional processes: Companies that have invested in workflow, Business Process Management (BPM), ERP, and CRM have realized that they can increase their productivity and improve their overall return on investment by adding an intelligent document capture system that can classify and route their document flow. Scanning and classifying documents at the beginning of the process and delivering the images immediately to the appropriate repository is probably the standard example of Captiva Dispatcher’s impact on the Enterprise Content Management (ECM) system.
EMC Transactional Content Management July 2008
Structured documents: document types where data is always in the same area or region of the page. This document type usually requires zonal OCR, or forms processing for highly complex forms such as mortgage applications and credit applications. Examples are address forms, health claim forms, benefit forms, and tax forms. A typical product mix to handle this document type would be InputAccel, or InputAccel with FormWare for highly complex forms, and Dispatcher to identify them.
Semi-structured documents: document types where the data required from the page is the same but varies in location from one vendor to another. This document type usually requires free-form technology to find the data in question, extract it, and validate it against other systems, eventually triggering transactions. Examples are invoices, purchase orders, shipping documents, bills of lading, and phone bills. A typical product or configuration would be InputAccel for Invoices.
Unstructured documents: document types where data or information is on the page but not always in the same area. This document type usually requires conversion of text into an electronic format such as PDF; text recognition can also be used to identify what the document is about. Examples are correspondence and letters.
Techniques:
Global Image Analysis - Dispatcher™ uses a completely automatic learning process (“fuzzy logic” approach) for unlimited document types, dynamically building a knowledge base. This method does not rely on being able to read text data from the document but instead analyses the significant structural elements of the document, making it completely language independent!
HPA - An HPA is defined manually by placing anchors on the graphical zones that are specific to a document in order to discriminate between documents.
This technology should be applied when there is high variability among documents within the same template. For example, for documents such as cheques, it is not useful to discriminate too finely by creating one template per bank if it is only necessary to identify that these documents are cheques, regardless of the issuing bank.
Keyword - Classifies documents based on the text they contain rather than their visual aspect or similarity to a template. Using dictionaries of keywords, often associated with the company’s document referential, Dispatcher™ reads the information on the document with specific OCR engines and identifies the type of incoming mail.
Text Matching - A new classification technology dedicated to unstructured documents. It is easy to implement and set up, and you can manage and control unstructured document classification on the fly. The objective is to extract the complete text and compare sentences and character sequences between documents. You can therefore easily classify legal documents that have different layouts or designs but exactly the same legal text. This approach is unique on the market today and helps our customers optimize their unstructured information processes. Mortgage, legal, HR, and even financial services applications can benefit from the Text Matching technology. This feature will be included in Dispatcher for the 5.0 release, Q2 08.
Handwritten - Handwritten documents are very different from others. Thanks to the “fuzzy logic” algorithms and the learning base, it is quite easy to distinguish the layout of a handwritten document.
Note to Presenter: View in Slide Show mode for animation, and then slowly click three times.
Batch/Doc folders classification: Documents are separated automatically based on layout analysis or specific keywords. Building on the classification technologies, Dispatcher can separate images to create document folders without separators or bar codes. The benefit is that users do not have to manually sort and prepare documents prior to scanning.
Dispatcher combines graphical analysis and text analysis to define a “master” document type as the “document breaker” of the batch:
Graphical analysis: Dispatcher refers to its learning base (i.e., the graphical analysis of the recurring information). Some documents are defined as natural separators. For example, when Dispatcher detects page 1 of Form 1, a new document set begins.
Text analysis: A patient folder or invoice number is detected during classification. Dispatcher can break a batch into document sets that each include multiple documents. As soon as Dispatcher detects a new patient folder or invoice number, it creates a new document set.
In the example above, document sets are broken out into logical sets when a document is recognized as a given template. Doc Set 1 and 2 are from the same patient, and the pages that follow the top page are attachments associated with the identified template.
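The separation logic described in these notes can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the page tuples, the `breaker_types` set, and the function name are hypothetical, and each page is assumed to have already been classified and, where possible, tagged with a folder number.

```python
def split_into_document_sets(pages, breaker_types):
    """Group classified pages into document sets without separator sheets.

    pages: list of (page_id, doc_type, folder_number_or_None) in scan order.
    breaker_types: document types treated as natural separators.
    """
    sets = []
    current_folder = None
    for page_id, doc_type, folder in pages:
        # Text analysis: a folder number different from the current one.
        new_folder = folder is not None and folder != current_folder
        # Graphical analysis: the page was recognized as a "document breaker".
        if not sets or doc_type in breaker_types or new_folder:
            sets.append([])  # start a new document set
        if folder is not None:
            current_folder = folder
        sets[-1].append(page_id)
    return sets


batch = [
    ("p1", "Form 1 page 1", "0045128"),
    ("p2", "attachment", None),
    ("p3", "Form 1 page 1", "0045128"),
    ("p4", "attachment", None),
    ("p5", "letter", "0045670"),
    ("p6", "attachment", None),
]
print(split_into_document_sets(batch, {"Form 1 page 1"}))
```

In this toy batch the first two sets share a folder number and are split only because page 1 of Form 1 acts as a breaker, while the third set starts because a new folder number appears.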
Introduce Dispatcher classification methods.
Full-Page Image-Based Analysis: For recurrent information; looks at the general layout design (fuzzy logic approach).
High Precision Anchors: For recurrent information; looks for local detail in the layout design, such as logos or a specific document area.
Handwritten Analysis: Automatic correspondence detection.
Full-Page Text-Based Analysis: For non-recurrent information; looks for keywords to classify the document type.
Other: Introduction to the upcoming fifth classification technology (Text Matching). It is a powerful method for unstructured information that compares verbiage. No thesaurus and no business knowledge are required to handle unstructured document information.
Dispatcher™ uses a completely automatic learning process (“fuzzy logic” approach) for unlimited document types, dynamically building a knowledge base. This method does not rely on being able to read text data from the document but instead analyses the significant structural elements of the document, making it completely language independent! New in 6.0: More image capacity for auto-learning up to 40,000 images for best accuracy
An HPA is defined manually by placing anchors on the graphical zones that are specific to a document in order to discriminate between documents. This technology should be applied when there is a high variability of documents within the same template. For example, in the case of documents such as cheques, it is not useful to discriminate too much by creating one template per bank if it is only necessary to identify that these documents are cheques, regardless of the issuing banks.
Keyword classification classifies documents based on the text they contain rather than their visual aspect or similarity to a template. Using dictionaries of keywords, often associated with the company’s document referential, Dispatcher™ reads the information on the document with specific OCR engines and identifies the type of incoming mail.
New in 6.0: Faster keyword classification when using “fast mode”
Text Matching is a new classification technology dedicated to unstructured documents. It is easy to implement and set up, and you can manage and control unstructured document classification on the fly. The objective is to extract the complete text and compare sentences and character sequences between documents. You can therefore easily classify legal documents that have different layouts or designs but exactly the same legal text. This approach is unique on the market today and helps our customers optimize their unstructured information processes. Mortgage, legal, HR, and even financial services applications can benefit from the Text Matching technology. This feature will be included in Dispatcher for the 5.0 release, Q4 07.
Two Major Technologies:
Template: locates which fields to capture; works well when the layout of forms is the same or where clear identifiers define the format. Used for recurring information.
Free Form Approach: based on keywords and text analysis to pick out the data. You extract the same information as with a template, but without any layout analysis. Used for non-recurring information.
IA data extraction
At a basic level, images are scanned and index operators key information into index fields based on image data. IA provides more advanced techniques, which include the following.
Zonal OCR - At setup time, an admin can specify where on a document to apply OCR (Optical Character Recognition). For example, a customer may want to extract a loan document number from a page. Rather than keying this information, IA applies OCR to read the loan number and pre-populates an index field with it. Dispatcher supports zonal OCR as well.
OCR Rubber Banding - IA supports full-page OCR. As a document is being indexed, an operator can select a certain location on a document image and extract the OCR results. For example, rubber banding around the SSN on a page takes the OCR results and inserts them into the SSN index field on screen. This provides a quick and easy way to extract data from a document without manually keying.
Dispatcher extraction capabilities - Performs both zonal OCR and free-form OCR extraction.
Free form OCR - Looks for keywords on a document image and, once it locates the word, applies the extraction rules. For example, “look for the keyword P.O. and, once located, look below P.O. to find the purchase order number”. This provides flexibility in extracting data from a semi-structured document.
Table Extraction - Supports the extraction of line item details on a document, for example an invoice. Dispatcher Table Extraction will OCR the data and, based on the setup rules defined, extract the line item details (e.g. Quantity, Description, Amount) into Dispatcher index fields.
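The difference between zonal and free-form extraction can be made concrete with a small Python sketch. It assumes an OCR engine has already produced words with page coordinates; the `Word` type, both function names, and the `x_tolerance` parameter are hypothetical and are not part of InputAccel or Dispatcher.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # top edge; larger y is further down the page


def extract_zone(words: List[Word], left, top, right, bottom) -> str:
    """Zonal extraction: read whatever falls inside a fixed rectangle."""
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    return " ".join(w.text for w in sorted(inside, key=lambda w: (w.y, w.x)))


def extract_below(words: List[Word], keyword: str,
                  x_tolerance: float = 20.0) -> Optional[str]:
    """Free-form extraction: find a keyword, then take the nearest word below it."""
    anchor = next((w for w in words if w.text.upper() == keyword.upper()), None)
    if anchor is None:
        return None
    candidates = [w for w in words
                  if w.y > anchor.y and abs(w.x - anchor.x) <= x_tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.y).text


page = [
    Word("Invoice", 10, 10), Word("P.O.", 300, 10),
    Word("10010", 12, 30), Word("PO-7731", 305, 32),
    Word("Total", 300, 200),
]
print(extract_zone(page, 0, 0, 100, 50))
print(extract_below(page, "P.O."))
```

The zonal call only works if the value always sits in the same rectangle, whereas the keyword-anchored call keeps working when the "P.O." label moves, which is the flexibility the notes attribute to free-form extraction on semi-structured documents.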
New in 6.0: New 2D barcode recognition for PDF-417 and DataMatrix
New in 6.0: Updated Nuance Scansoft OCR engine improves classification and extraction accuracy
InputAccel compatibility enhancements
The major theme of the Dispatcher 6.0 release is compatibility with InputAccel 6.0. An additional theme is new and updated recognition engines, which we will talk about later.
Common sample Dispatcher reports accessible from within InputAccel Admin Console
Dispatcher statistics can now be reported on from within the IA Admin Console, instead of having to run a separate program. A few commonly used Dispatcher reports are provided, which pull from the InputAccel database. This allows for easier reporting of both IA and Dispatcher statistics.
Custom Dispatcher reports can be developed from within InputAccel Admin Console using Crystal Reports
Since Dispatcher statistics are now stored in the IA database in addition to the separate Dispatcher database that exists today, you can use this data to develop custom Dispatcher reports using the Crystal Reports report generator included with IA 6.0. This allows for ultimate flexibility for reporting on exactly what you want.
Classification Edit and Validation user interfaces mimic IndexPlus user interface for logging in and selecting batches
The user interfaces for logging in and selecting batches for Classification Edit and Validation now look very similar to those used by Scan and Index in InputAccel.
New check reading engine for U.S. and France provides recognition of CAR, LAR, MICR/CMC7 codeline, signature presence, payee name, check number, and check date
Dispatcher also now provides a new check reading engine that reads various fields from U.S. and French checks, including CAR, LAR, MICR/CMC7 codeline, signature presence, payee name, check number, and check date. With this new check reading engine, you no longer have to define zones, fields, and keyword rules for checks, nor do you have to spend time testing different recognition engines for best results. This engine does it all for you because it specializes in reading checks.
New in 6.0: User productivity
Improvements in character repair behavior in Dispatcher Validation
Classification Edit pre-indexing interface provides consistent feel with Validation interface, including addition of character repair
Note to Presenter: View in Slide Show mode for animation.
I’d like to wrap up with a summary of what we’ve covered today.
First, we talked about the key business drivers for organizations taking on initiatives to eliminate paper and manual processes:
Paper is difficult to store and manage
Manual processes are slow, expensive, and error-prone
Information silos create compliance risk
Legacy imaging solutions are not meeting business requirements
Note to Presenter: Click now in Slide Show mode for animation.
Secondly, we’ve covered the four capabilities within intelligent capture:
Capture: Capture from anywhere within the enterprise using a variety of input methods (scanners, MFPs, e-mail)
Classify: Automatically classify all documents using sophisticated document recognition technologies
Extract and validate: Automatically extract and validate data from all documents
Delivery: Integrate with all systems throughout the enterprise
Note to Presenter: Click now in Slide Show mode for animation.
There are five reasons why customers have selected EMC for their needs:
EMC has the industry’s only complete, end-to-end offering, including document capture and classification, a complete business process suite, collaboration, enterprise report management, content archiving services, records/retention management, information rights management, and much more.
EMC is recognized by IDC, Gartner, and The 451 Group as the market leader in enterprise content management.
EMC provides a proven, scalable, fully unified architecture that has been utilized by more than 15,000 customers. The architecture allows EMC to process all content types and processes.
EMC provides both a platform and solutions approach in the areas of…
Solution examples (loan origination, new account enrollment, etc.)
Partner applications (Accounts Payable, contract management, etc.)
Partner extensions (Adobe, iLog, etc.)
EMC Captiva: The Power of Intelligent Document Recognition - Presentation Transcript
The Power of Intelligent Document Recognition Using EMC Captiva Dispatcher
Agenda Dispatcher Overview
EMC Captiva Intelligent Capture
Capture all of your paper documents and transform these documents into electronic images and business data
Support centralized and distributed scanning environments
Enable digital offices throughout your enterprise
Identify all documents and automate data capture from business documents
Provide immediate access to your documents to both individuals and processes
Invoice Number: 10010
Vendor Name: Acme Products
Purchase Date: 30 January 2008
Subtotal: $ 6,014.81
Grand Total: $ 6,025.88
Payment Terms: Net 30 Days
Capture Classify Extract Validate Deliver
Intelligent Document Recognition
Capture Classify Extract Validate Deliver
Classify: Sophisticated image- and text-based classification tools to identify documents without manual preparation
Extract: Zonal and intelligent freeform data extraction to transform all documents into electronic data
Validate: Effectively control business processes by validating data for correct recognition and accuracy
Captiva Dispatcher Benefits
Streamline the flow of data into enterprise applications
Reduce scanning preparation time and cost
Eliminate manual document preparation and data entry
Automated process to capture, classify, route, index, and extract information to provide data for business transactions and images for archiving/storage
Advanced Document Identification
Key Benefits
Reduce document preparation time
Index and route document to the appropriate business process
Structured Documents: Forms, Tax returns
Semi-Structured Documents: Invoices, Checks, POs
Unstructured Documents: Legal Contracts, Patient records
Classification techniques: Global Image Analysis, High Precision Anchors, Keyword Analysis, Handwritten detection, Text Matching Analysis
Batch Management - Innovative Techniques
Advanced document identification for batch processing
[Figure: a scanned batch broken into Doc Set 1 through Doc Set 4, labeled with Claim folder: 0045128 and Claim Folder: 0045670]
Agenda
Classification Technologies
Global Image Analysis
Automatic learning and identification of documents using graphical templates
Local Image Analysis
Zonal, graphical identification of documents
Keywords Analysis
Identification of documents based on keyword
Text Matching Analysis
Identification of documents based on text blocks
Handwritten Detection
Identification of documents based on handwriting
Classification Technologies
Standard Classification Global Image Analysis
Layout/graphical analysis to determine document type
“Fuzzy-Logic” algorithm independent of language and format
Automatic learning system to dynamically build knowledge base
Feed Dispatcher with recurring images, and document families (templates) are automatically created
Provide large image samples to increase Dispatcher efficiency
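As a rough illustration of the "feed images and families are created automatically" idea, the Python sketch below groups pages by a coarse layout signature. It is a toy nearest-template scheme, not Dispatcher's fuzzy-logic algorithm: the signature (a short list of ink densities per page region), the `TemplateLearner` class, and the `threshold` value are all assumptions made for illustration.

```python
def signature_distance(a, b):
    """Mean absolute difference between two equal-length layout signatures."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class TemplateLearner:
    """Toy auto-learning classifier: one stored signature per document family."""

    def __init__(self, threshold=0.15):
        self.threshold = threshold
        self.templates = []

    def classify(self, signature):
        """Return the index of the closest family, creating one if none is close."""
        best_index, best_distance = None, None
        for index, template in enumerate(self.templates):
            distance = signature_distance(signature, template)
            if best_distance is None or distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is not None and best_distance <= self.threshold:
            return best_index
        # Nothing similar seen before: learn this page as a new family.
        self.templates.append(list(signature))
        return len(self.templates) - 1


learner = TemplateLearner()
form_a = [0.9, 0.1, 0.1, 0.8]          # hypothetical ink density per page quadrant
letter = [0.2, 0.6, 0.6, 0.1]
form_a_rescanned = [0.85, 0.15, 0.1, 0.8]
print(learner.classify(form_a), learner.classify(letter),
      learner.classify(form_a_rescanned))
```

Because the signature uses only where ink sits on the page and never reads text, a scheme of this kind is independent of language, which is the property the slide highlights.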
High Precision Anchors Local Image Analysis
Specify local area (such as a logo or title) to determine document type
High Precision Anchors concept
Split document families into subfamilies to define a specific process
Local Image Analysis complements Global Image Analysis when documents vary within same family/template
Keyword Classification
Keyword match to determine document type
Use a full text engine to extract document information
Match the text extraction with business dictionaries to classify your information
Tune your own keywords rules using regular expressions
Classification method for free-form/non-templatized, nonrecurring documents
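A minimal Python sketch of keyword classification with regular expressions follows. The dictionary contents, the `min_hits` parameter, and the function name are hypothetical; in practice the input text would come from a full-text OCR engine, which is replaced here by a plain string built from the sample invoice fields.

```python
import re

# Hypothetical business dictionary: document type -> keyword patterns (assumption).
KEYWORD_RULES = {
    "invoice": [
        r"\binvoice\s+(number|no\.?)",
        r"\bpayment\s+terms\b",
        r"\bgrand\s+total\b",
    ],
    "purchase_order": [
        r"\bpurchase\s+order\b",
        r"\bP\.?O\.?\s+number\b",
    ],
}


def classify_by_keywords(text, rules=KEYWORD_RULES, min_hits=2):
    """Return the document type whose patterns match most often, or None."""
    scores = {}
    for doc_type, patterns in rules.items():
        scores[doc_type] = sum(
            1 for pattern in patterns if re.search(pattern, text, re.IGNORECASE)
        )
    best = max(scores, key=scores.get)
    # Require a minimum number of matching keywords before committing to a type.
    return best if scores[best] >= min_hits else None


ocr_text = ("Invoice Number 10010 Vendor Name Acme Products "
            "Payment Terms Net 30 Days Grand Total $ 6,025.88")
print(classify_by_keywords(ocr_text))
```

Requiring more than one hit is a design choice in this sketch: a single keyword such as "invoice" can appear in a cover letter, so a threshold reduces false classifications at the cost of leaving more pages unclassified.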
Text Matching Classification
Determine document type when documents have no unique layout or keywords
Use a full text OCR engine to extract and match document information
Learn a new document the first time – one image needed
Minimal configuration settings required
Can increase the classification rate on unstructured documents in Dispatcher by up to 40%
[Figure: OCR of a sample clause: "Property Insurance. Borrower shall keep the improvements now existing or hereafter erected on the Property insured against loss by fire, hazards included within the term "extended …"]
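The idea of learning a document from a single sample and then matching on character sequences can be sketched with Python's standard `difflib` module. This is a stand-in technique chosen for illustration, not the product's matching algorithm; the `TextMatcher` class, the `threshold` value, and the sample clause text are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


class TextMatcher:
    """Toy text-matching classifier: one learned reference text per document type."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.library = {}

    def learn(self, doc_type, text):
        # A single sample is enough to register a new document type.
        self.library[doc_type] = normalize(text)

    def classify(self, text):
        candidate = normalize(text)
        best_type, best_ratio = None, 0.0
        for doc_type, reference in self.library.items():
            ratio = SequenceMatcher(None, reference, candidate).ratio()
            if ratio > best_ratio:
                best_type, best_ratio = doc_type, ratio
        return best_type if best_ratio >= self.threshold else None


matcher = TextMatcher()
matcher.learn(
    "mortgage_clause",
    "Lender may require Borrower to pay a one-time charge for flood zone determination.",
)
# Same wording, different spacing and case, plus one simulated OCR misread.
scanned = "LENDER may require   Borrower to pay a one-time charge for flood zone determinatlon."
print(matcher.classify(scanned))
```

Character-level comparison of this kind is relatively expensive, which is consistent with the lower throughput the deck quotes for Text Matching compared with image-based classification.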
Technology Flow in Dispatcher
1. Global/local image classification (55% to 90%)
Recognizes a document which looks like another one seen before (global) or that contains a specific pattern, like a logo (local)
Unique software to automatically build up to 10,000 templates
Speed of classification: 20 to 50 pages/sec
2. Keywords text classification (5% to 20%)
Recognizes a document which contains a specific set of keywords
Multi-engine OCR on header and footer
Optimized reading zone: up to 2 pages/sec
3. Text Matching text classification (15% to 40%)
Recognizes a document containing a similar sequence of characters, i.e. a standard letter
Automatic learning on the fly
More CPU intensive: 0.5 pages/sec
4. Business rules
[Figure: flow from stage 1 to stage 2 to stage 3, linked by "If not" arrows and "Enhanced with" business rules; the Text Matching stage shows OCR text ("Property Insurance. Borrower shall keep …") compared with a library entry ("Lender may require Borrower")]
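The staged flow on this slide (try the fastest method first, fall through to slower text-based methods only if needed, then apply business rules) can be sketched as a simple cascade in Python. The stage functions below are trivial stand-ins with hypothetical names and inputs; only the control flow is the point.

```python
def classify_cascade(page, stages, business_rule=None):
    """Try each stage in order; return (stage_name, doc_type) for the first hit."""
    for name, classifier in stages:
        doc_type = classifier(page)
        if doc_type is not None:
            if business_rule is not None:
                # Business rules can refine the result of any stage.
                doc_type = business_rule(page, doc_type)
            return name, doc_type
    return None, None  # unclassified: would go to manual review


# Stand-in stages (assumptions), ordered from fastest to most CPU intensive.
def image_stage(page):
    return page.get("layout_match")


def keyword_stage(page):
    return "invoice" if "invoice" in page.get("text", "").lower() else None


def text_matching_stage(page):
    return "standard_letter" if page.get("text", "").startswith("Dear customer") else None


STAGES = [
    ("image", image_stage),
    ("keyword", keyword_stage),
    ("text_matching", text_matching_stage),
]


def credit_note_rule(page, doc_type):
    """Illustrative business rule: an invoice with a negative total is a credit note."""
    if doc_type == "invoice" and page.get("negative_total"):
        return "credit_note"
    return doc_type


print(classify_cascade({"text": "Invoice Number 10010"}, STAGES))
```

Ordering the stages by speed means most pages never reach the slow stage, which matches the throughput figures quoted per stage on the slide.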
Agenda
Intelligent Data Extraction
Extract critical business data based on document type
Simple indexing to extensive data extraction
Extract using zonal and free-form techniques
Enhanced extraction reduces manual costs and enables faster, more accurate business processes
Zonal Extraction Freeform Extraction
Extraction Technologies
Recognition technologies
OCR, ICR, mark sense, 1D barcoding
New in Dispatcher 6: 2D barcoding and check reading
EMC approach provides intelligent document recognition for all documents
High-speed graphic classification and zonal extraction for highly structured documents
Flexible, accurate, text-based classification and freeform data extraction for less structured documents
Unified development and administration simplifies development and maintenance
Significant cost reduction and process efficiency
Eliminate manual document sorting
Increase automated data extraction
Benefits of document classification and routing
Organizing complex documents
Enabling routing for digital mailrooms
Benefits of data extraction
Reduced costs associated with data keying and document indexing
Increased value for business process
Intelligent Capture Recap
Reasons to Choose EMC
Industry’s only complete enterprise solution
Fiscal strength and viability
Recognized enterprise market leader
Proven, scalable, unified architecture
Platform and solutions approach
Key Benefits
Complete ROI delivered within 12 months
Reduce document sorting and data entry labor costs by up to 90%
Reduce cycle times by over 75%
Save $1 per document to store paper documents electronically
Intelligent document recognition capabilities
Automatically classify all document types within an organization
Extract data from structured and unstructured document types
Validate data to ensure accurate processing
Get Involved with EMC CMA Communities
Why should you join?
Collaborate and share best practices
Shape the direction of future EMC products
Network with innovators across the globe, 24/7
Join now by going to:
community.EMC.com/go/Documentum
community.EMC.com/go/SourceOne
developer.EMC.com/Documentum
developer.EMC.com/XMLtech
community.EMC.com/community/labs/d65