Using Regular Expressions in Document Management Data Capture and Indexing

Learn how metadata (index information) can be pulled from documents using regular expressions or regex. See how regex is used to extract the index information, name files, create subfolders and more to feed your document management or EMR systems. Automated data capture is shown with ImageRamp from DocuFi, a powerful platform to capture index information from your scanned documents and drawings which integrates with today's document management and EMR systems.

Published in: Technology

Using Regular Expressions in Document Management Data Capture and Indexing

  1. 1. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing Copyright © 2010 - 2013 DocuFi. All Rights Reserved
  2. 2. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing in a Document Management Environment
  3. 3. First: What is automated data capture? Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents.
  4. 4. First: What is automated data capture or data mining? Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents. Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
  5. 5. Why automate data capture? Manual Data Capture is Expensive and Time Consuming
  6. 6. Why automate data capture? Problems with manual data entry: 1. Security may be compromised if documents are taken off premises 2. A delay is introduced if documents are taken off premises 3. Compared to automated extraction, manual indexing is slow 4. Manual indexing doesn’t scale well with large projects 5. Manual indexing has the potential to introduce errors into the data
  7. 7. Why automate data capture? Problems with manual data entry: 1. Security may be compromised if documents are taken off premises 2. A delay is introduced if documents are taken off premises 3. Compared to automated extraction, manual indexing is slow 4. Manual indexing doesn’t scale well with large projects 5. Manual indexing has the potential to introduce errors into the data and…
  8. 8. There’s a Mountain of It!
  9. 9. There’s a Mountain of It! Let’s take a look at just invoices for example…
  10. 10. There’s a Mountain of It! According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.
  11. 11. There’s a Mountain of It! According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based. Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.
  12. 12. There’s a Mountain of It! According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based. Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper. And it’s expensive: an Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at $4.84 to $20.13.
  13. 13. So if e-invoicing is not an option (as it’s not for many), what? e-invoicing: sending and receiving invoices electronically. “it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” (Aberdeen’s 2010 report)
  14. 14. And, We All Know, Time is Money
  15. 15. Don’t forget we are using invoices only as an example. But, this could apply to patient records, legal documents, purchase orders…any document.
  16. 16. Now that you know this is all about money, let’s go back to the focus of this slideshow.
  17. 17. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
  18. 18. What are Regular Expressions or regex? Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string for describing a search pattern. You could think of regular expressions as extremely powerful wildcards.
  19. 19. What’s it look like? A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ
  20. 20. What’s it look like? A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ where: ^ Start at the beginning of a string or line; \s{1,3} Find between 1 and 3 whitespace characters; [A-Z0-9]* Find any character in the ranges A-Z and 0-9, where the “*” is the instruction to find as many occurrences as possible (including none); XYZ Find the literal characters “XYZ”.
  21. 21. What’s it look like? A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ where: ^ Start at the beginning of a string or line; \s{1,3} Find between 1 and 3 whitespace characters; [A-Z0-9]* Find any character in the ranges A-Z and 0-9, where the “*” is the instruction to find as many occurrences as possible (including none); XYZ Find the literal characters “XYZ”. If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match, whereas if we had “ AZR898XY” we would not.
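A minimal sketch of the match described on this slide, written in Python (the slides do not say which regex engine ImageRamp uses, so the choice of Python's `re` module is an assumption). It uses the pattern as it is broken down above, including the `*` quantifier after the character class.

```python
import re

# Pattern from the slide: start of line, 1-3 whitespace characters,
# any run of A-Z / 0-9, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")


def matches(line: str) -> bool:
    """Return True when the line starts with text matching the pattern."""
    return PATTERN.search(line) is not None


print(matches(" AZR8987XYZ"))  # the slide's matching value
print(matches(" AZR898XY"))    # the slide's non-matching value (no "XYZ")
```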
  22. 22. Huh? Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
  23. 23. Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet. Here are some examples: Zip Codes: ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$ US Phone Number: ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$ Credit Card: (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
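As a rough illustration of how the ZIP code pattern behaves, here is a sketch in Python. One adaptation is needed and is an assumption on our part: Python's `re` module writes named groups as `(?P<name>...)`, so the group syntax is changed from the `(?<name>...)` form shown on the slide. The function name and the sample values are made up for the example.

```python
import re

# ZIP pattern from the slide, with named groups rewritten in Python syntax.
ZIP = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)


def parse_zip(value: str):
    """Return (zip5, zip4) for a valid ZIP code, or None if it does not match."""
    found = ZIP.match(value)
    if found is None:
        return None
    return found.group("zip5"), found.group("zip4")


for candidate in ["12345", "12345-6789", "00000", "12345-"]:
    print(candidate, parse_zip(candidate))
```

The named groups are what make the pattern useful for indexing: the five-digit part and the optional four-digit extension come back as separate fields.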
  24. 24. Real World Example: Here is a partial invoice where you might need to capture the “Catalogue Number”.
  25. 25. To start constructing a regular expression, we use what we know from the data in front of us and make some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number has the format of a single uppercase letter followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.
  26. 26. We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down: [A-Z] Find a character from A-Z; with no quantifier specification, “{}”, exactly 1 character is matched. \d{2} Find exactly 2 digits. - Find the literal character “-”. [A-Z]{0,1} Find a character A-Z between 0 and 1 repetitions. \d{6} Find exactly 6 digits. This is just one of several ways to write a regular expression for this example. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
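The two catalogue number patterns can be compared with a short Python sketch. The sample text and the catalogue numbers in it are invented for illustration; they are not taken from the invoice shown in the slides.

```python
import re

# Original pattern: letter, 2 digits, hyphen, optional letter, exactly 6 digits.
CATALOGUE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern: the last portion may contain 4 to 6 digits.
CATALOGUE_RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

# Hypothetical OCR text containing three catalogue numbers.
sample = "Items: B07-K123456, C15-654321, D22-4821"

strict_hits = CATALOGUE.findall(sample)
relaxed_hits = CATALOGUE_RELAXED.findall(sample)

print(strict_hits)   # the 4-digit entry is not matched
print(relaxed_hits)  # all three entries are matched
```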
  27. 27. Now how would this work in a data capture solution? We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields: Company Name, Company Number, Date, SIC Code.
  28. 28. Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
  29. 29. So where is the regex? Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data. A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
  30. 30. Let’s break it down: splitting the scan stack. First we define the regex that performs document splitting when a new Company Name is located. This is done in the Data Mining submenu under ImageRamp’s Splitting and Extraction, as shown below. (?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]* … and check the “Split if Matched” option.
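The splitting idea can be sketched outside ImageRamp in a few lines of Python. This is not ImageRamp's implementation; it only illustrates "start a new document whenever the pattern matches a page". Two assumptions are made: Python's `re` module does not accept the variable-width lookbehind used on the slide, so a capture group stands in for it, and matching is case-insensitive because the character class lists only lowercase letters. The page texts are invented.

```python
import re

# Capture group replaces the slide's lookbehind, which Python's re rejects
# because "\s*" and "\s+" make it variable-width.
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+\b([a-z0-9() ]*)", re.IGNORECASE)


def split_on_company_name(pages):
    """Group page texts into documents, starting a new one at each match."""
    documents = []
    for text in pages:
        found = COMPANY_NAME.search(text)
        if found or not documents:
            name = found.group(1).strip() if found else None
            documents.append({"company": name, "pages": []})
        documents[-1]["pages"].append(text)
    return documents


# Hypothetical OCR text for a three-page scan holding two invoices.
pages = [
    "Company Name ACME LTD\nInvoice 1",
    "continued items",
    "Company Name Widgets (UK) Ltd\nInvoice 7",
]
for doc in split_on_company_name(pages):
    print(doc["company"], len(doc["pages"]))
```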
  31. 31. Capturing the index data. Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. Here we extract the date field using the following regex in the Index Fields section of the Data Mining submenu: (?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
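A Python sketch of the same date capture follows. As before, this adapts the slide's pattern rather than reproducing it: the named group is written `(?P<Date>...)`, and the label is matched in front of the group instead of inside a lookbehind, since Python's `re` module requires lookbehinds to be fixed-width. The function name and the sample text are hypothetical.

```python
import re

# The label is matched first, then the date is captured in a named group.
DATE_FIELD = re.compile(
    r"\bDate of this return\s+\b(?P<Date>\d{2}/\d{2}/\d{4})"
)


def extract_date(text: str):
    """Return the date following the 'Date of this return' label, or None."""
    found = DATE_FIELD.search(text)
    return found.group("Date") if found else None


# Hypothetical OCR text from a scanned page.
page_text = "Company Number 01234567\nDate of this return   14/02/2012\n"
print(extract_date(page_text))
```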
  32. 32. But wait, there’s more. Information extracted through text data mining with regex can also be used to name the file and create folders. Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
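To make the placeholder idea concrete, here is a small Python sketch that expands `%regex1`, `%regex2` and so on into a folder and file name. Only the `%regexN` tokens come from the slide; the template string, the function name and the field values are assumptions for illustration, and ImageRamp's actual naming rules may differ.

```python
import re


def expand_template(template, fields):
    """Replace each %regexN token with the N-th extracted field (1-based)."""
    def substitute(match):
        index = int(match.group(1)) - 1
        return fields[index]
    return re.sub(r"%regex(\d+)", substitute, template)


# Hypothetical extracted values: CompanyName first, CompanyNo second.
extracted = ["Acme Ltd", "01234567"]
# Hypothetical template: a subfolder per company, a file per company number.
path = expand_template("%regex1/%regex2.pdf", extracted)
print(path)
```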
  33. 33. We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents. Data in the palm of your hand…not locked in your documents! and…
  34. 34. Learn more about… the power of ImageRamp and its other features including: Full text OCR to PDF; PDF rights management and encryption; Document naming, splitting, and routing based on barcodes and…; Image processing for clean up and adaptive thresholding; OCR (Optical Character Recognition); Barcode reading (1D and 2D). For more on: • Data mining PDF • Data mining scans • Invoice mining • Patient record mining • OCR mining • TIF mining • Extracting meta data • Data extraction from unstructured data • Intelligent data capture • Data extraction • Using regex to extract data • Document scanning • Extracting data • Extract meta data • Scanner software • Barcode recognition • OCR software • Capture tutorial • PDF scanning • Scanning software • Indexing • Document indexing • Automated capture • Meta data • Scan to index • Batch processing • Bulk scanning • DocuFi • ImageRamp • Data capture • Migration to document management
  35. 35. More?
  36. 36. More? Further reading on Regular Expressions: http://en.wikipedia.org/wiki/Regular_expression http://regexlib.com/ http://www.regular-expressions.info/
  37. 37. docufi.com @imageramp @docufinews
