Using Regular Expressions for Data Mining and Automated Data Capture and Indexing 
Copyright © 2010 - 2013 DocuFi. All Rights Reserved
In a Document Management Environment 
First: What is automated data capture or data mining?
Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents
Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
Why automate data capture?
Manual Data Capture is Expensive and Time Consuming
Problems with manual data entry:
1. Security may be compromised if documents are taken off premises
2. A delay is introduced if documents are taken off premises
3. Compared to automated extraction, manual indexing is slow
4. Manual indexing doesn’t scale well with large projects
5. Manual indexing has the potential to introduce errors into the data
There’s a Mountain of It!
Let’s take a look at just invoices for example…
According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.
Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.
and it’s expensive
An Aberdeen Group March 2012 publication estimates the costs of processing a single invoice from $4.84 to $20.13.
So if e-invoicing is not an option (as it’s not for many), what?
e-invoicing: sending and receiving invoices electronically
“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.”
---Aberdeen’s 2010 report
And, We All Know, Time is Money
Don’t forget we are using invoices only as an example. This could apply to patient records, legal documents, purchase orders… any document.
Now that you know this is all about money, let’s go back to the focus of this slideshow.
Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
What are Regular Expressions or regex? 
Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. 
A regular expression is essentially a special text string for describing a search pattern. You could think of regular expressions as extremely powerful wildcards.
What’s it look like?
A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ
^
Start at the beginning of a string or line
\s{1,3}
Find between 1 and 3 whitespace characters (such as spaces)
[A-Z0-9]*
Find any character in the range A-Z or 0-9; the “*” is the instruction to find as many occurrences as possible (including none).
XYZ
Find the literal characters “XYZ”
If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match, whereas if we had “ AZR898XY” we would not.
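The two sample values above can be checked with a short script. This is a minimal sketch in Python (the slides do not name a programming language), using the pattern with the “*” quantifier from the breakdown; the helper name `matches` is introduced here for illustration only.

```python
import re

# Pattern from the slide: start of line, 1 to 3 whitespace characters,
# any run of uppercase letters or digits, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")


def matches(line: str) -> bool:
    """Return True when the pattern matches at the start of the line."""
    return PATTERN.search(line) is not None


print(matches(" AZR8987XYZ"))  # True: leading space, letters/digits, then XYZ
print(matches(" AZR898XY"))    # False: the literal "XYZ" never appears
```

Note that a value with no leading whitespace, such as “AZR8987XYZ”, would not match either, because the pattern requires at least one whitespace character first.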
Huh? 
Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet. 
Here are some examples:
ZIP Codes
^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$
US Phone Number
^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$
Credit Card
(^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
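As an illustration of how one of these library patterns might be used, here is a hedged sketch of the ZIP code pattern in Python. The pattern above uses the `(?<name>...)` named-group syntax; Python's `re` module spells named groups `(?P<name>...)`, so the pattern is adapted accordingly. The function name `parse_zip` and the sample values are assumptions made for this example.

```python
import re

# ZIP code pattern from the list above, with named groups rewritten
# in Python's (?P<name>...) syntax.
ZIP_RE = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)


def parse_zip(text: str):
    """Return (zip5, zip4) for a valid ZIP code, else None.

    zip4 is None when only the five-digit part is present.
    """
    match = ZIP_RE.match(text)
    if match is None:
        return None
    return match.group("zip5"), match.group("zip4")


print(parse_zip("12345"))       # ('12345', None)
print(parse_zip("12345-6789"))  # ('12345', '6789')
print(parse_zip("00000"))       # None, rejected by the (?!00000) lookahead
```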
Real World Example
Here is a partial invoice where you might need to capture the “Catalogue Number”.
In order to start constructing a regular expression we have to use what we know from the data in front of us as well as make some assumptions. During testing we can refine the regular expression.
In this example we can assume from the document that the catalogue number has the format of a single uppercase letter followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.
We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down:
[A-Z]
Find a character from A-Z; with no quantifier such as “{}”, exactly 1 character is matched
\d{2}
Find exactly 2 digits
-
Find the literal character “-”
[A-Z]{0,1}
Find a character A-Z between 0 and 1 repetitions
\d{6}
Find exactly 6 digits
This is just one of several ways a regular expression could be written for this example. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
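The catalogue number pattern and its amended form can be tried out as follows. This is a minimal Python sketch; the sample invoice text and the catalogue numbers in it are invented for illustration, and `find_catalogue_numbers` is a hypothetical helper name.

```python
import re

# Pattern from the text: letter, 2 digits, hyphen, optional letter, 6 digits.
CATALOGUE_RE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern allowing 4 to 6 digits in the last portion.
CATALOGUE_RELAXED_RE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")


def find_catalogue_numbers(text, pattern=CATALOGUE_RE):
    """Return every catalogue number found in the text, in order."""
    return pattern.findall(text)


# Invented sample text standing in for the OCR output of an invoice.
sample = (
    "Catalogue Number: B12-K123456  Qty 4\n"
    "Catalogue Number: C07-654321  Qty 1"
)
print(find_catalogue_numbers(sample))
# ['B12-K123456', 'C07-654321']
print(find_catalogue_numbers("D33-9876"))
# []
print(find_catalogue_numbers("D33-9876", CATALOGUE_RELAXED_RE))
# ['D33-9876']
```

Because the pattern is not anchored, it will also match inside longer strings; adding word boundaries is one refinement that testing might suggest.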
Now how would this work in a data capture solution?
We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents.
As an example, we might want to extract data from a scanned file with the following 4 fields:
•Company Name
•Company Number
•Date
•SIC Code
Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
So where is the regex?
Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data.
A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
Let’s break it down: splitting the scan stack.
First we are going to define the regex that performs document splitting when a new Company Name is located. This is done in the Data Mining submenu under ImageRamp’s Splitting and Extraction, as shown below.
(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*
… and check the “Split if Matched” option.
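To make the splitting idea concrete outside of ImageRamp, here is a rough Python sketch. It is not how ImageRamp works internally. Python's `re` module does not accept the variable-length lookbehind used in the pattern above, so this sketch swaps in an ordinary capture group after the label, and it assumes case-insensitive matching. The function name `split_by_company` and the page texts are invented for illustration.

```python
import re

# Capture-group equivalent of the lookbehind pattern: the text after the
# "Company Name" label is captured in group 1. IGNORECASE is an assumption.
COMPANY_RE = re.compile(
    r"\bCompany\s*Name\s+\b([a-z0-9() ]*)", re.IGNORECASE
)


def split_by_company(pages):
    """Group page texts into documents, one per matched company name.

    Returns a list of (company_name, [page_indexes]). A page that matches
    starts a new document; a page that does not is added to the current one.
    """
    documents = []
    for index, text in enumerate(pages):
        match = COMPANY_RE.search(text)
        if match:
            documents.append((match.group(1).strip(), [index]))
        elif documents:
            documents[-1][1].append(index)
        else:
            # Pages before the first match go into an unnamed document.
            documents.append(("", [index]))
    return documents


# Invented OCR text for a three-page scan.
pages = [
    "Company Name  Acme Widgets (UK) Ltd\nInvoice 1001",
    "continued line items",
    "Company Name  Bolt Supplies\nInvoice 77",
]
print(split_by_company(pages))
# [('Acme Widgets (UK) Ltd', [0, 1]), ('Bolt Supplies', [2])]
```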
Let’s break it down: capturing the index data.
Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the Date field using the regex in the Index Fields section of the Data Mining submenu.
(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
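The Date field extraction can be sketched in Python as well. As with the splitting pattern, Python's `re` module rejects the variable-length lookbehind, so the label is matched directly and the date is taken from a named group written in Python's `(?P<Date>...)` syntax. The day/month/year order is an assumption, since the slides do not state it, and `extract_return_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

# The label is matched as ordinary text; only the date lands in the group.
DATE_RE = re.compile(
    r"\bDate of this return\s+\b(?P<Date>\d{2}/\d{2}/\d{4})"
)


def extract_return_date(text, fmt="%d/%m/%Y"):
    """Return the date after the label as a datetime.date, or None.

    fmt assumes day/month/year; pass "%m/%d/%Y" for month-first documents.
    """
    match = DATE_RE.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group("Date"), fmt).date()
    except ValueError:
        # The digits fit the pattern but do not form a real calendar date.
        return None


print(extract_return_date("Date of this return  14/03/2012"))  # 2012-03-14
print(extract_return_date("No date on this page"))             # None
```

Validating the captured text as a real date is one way to catch OCR misreads that still happen to fit the pattern.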
Information extracted through the text data mining with regex can also be used to name the file and create folders. 
Here %regex1 corresponds to the first regex field definition (CompanyName) 
and %regex2 corresponds to the second field definition (CompanyNo). 
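The placeholder idea can be imitated in a few lines of Python. This is only a sketch of the concept: the template string "%regex1/%regex2.pdf", the function name `expand_template`, and the sample values are assumptions for illustration, and the slides do not show ImageRamp's actual template syntax beyond the %regex1 and %regex2 placeholders.

```python
import re


def expand_template(template, values):
    """Replace each %regexN placeholder with the N-th extracted value.

    N is 1-based. Placeholders with no corresponding value are left as-is.
    """
    def substitute(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(values):
            return values[index]
        return match.group(0)

    return re.sub(r"%regex(\d+)", substitute, template)


# Invented values standing in for the CompanyName and CompanyNo fields.
extracted = ["Acme Widgets Ltd", "01234567"]
print(expand_template("%regex1/%regex2.pdf", extracted))
# Acme Widgets Ltd/01234567.pdf
```

In practice the extracted values would also need cleaning of characters that are not allowed in file or folder names.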
But wait, there’s more.
We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.
Data in the palm of your hand…not locked in your documents! 
and…
For more on: 
•Data Mining PDF 
•Data mining Scans 
•Invoice Mining 
•Patient Record Mining 
•OCR mining 
•TIF mining 
•Extracting meta data
•Data extraction from unstructured data
•Intelligent data capture
•Data extraction
•Using regex to extract data
•Document scanning
•Extracting data
•Extract meta data
•Scanner software
•Barcode recognition
•OCR software
•Capture tutorial
•PDF scanning
•Scanning software 
•Indexing 
•Document indexing 
•Automated capture 
•Meta data 
•Scan to index 
•Batch Processing 
•Bulk scanning 
•Docufi 
•Imageramp 
•Data capture 
•Migration to document management 
Learn more about…
the power of ImageRamp and its other features including:
•Full text OCR to PDF
•PDF rights management and encryption
•Document naming, splitting, and routing based on barcodes
•Image processing for clean up and adaptive thresholding
•OCR (Optical Character Recognition)
•Barcode reading (1D and 2D)
More?
Further reading on Regular Expressions:
http://en.wikipedia.org/wiki/Regular_expression
http://regexlib.com/
http://www.regular-expressions.info/
docufi.com 
@imageramp 
@docufinews

 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 

Using Regular Expressions in Document Management Data Capture and Indexing

  • 1. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing Copyright © 2010 - 2013 DocuFi. All Rights Reserved
  • 2. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing in a Document Management Environment
  • 4. First: What is automated data capture or data mining? Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents. Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
  • 5. Why automate data capture? Manual Data Capture is Expensive and Time Consuming
  • 7. Why automate data capture? Problems with manual data entry: 1. Security may be compromised if documents are taken off premises 2. A delay is introduced if documents are taken off premises 3. Compared to automated extraction, manual indexing is slow 4. Manual indexing doesn’t scale well with large projects 5. Manual indexing has the potential to introduce errors into the data and…
  • 9. There’s a Mountain of It! Let’s take a look at just invoices for example…
  • 12. There’s a Mountain of It! According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based. Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper. And it’s expensive: an Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at $4.84 to $20.13.
  • 13. So if e-invoicing (sending and receiving invoices electronically) is not an option (as it’s not for many), what? “it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” (Aberdeen’s 2010 report)
  • 14. And, We All Know, Time is Money
  • 15. Don’t forget we are using invoices only as an example. But, this could apply to patient records, legal documents, purchase orders…any document.
  • 16. Now that you know this is all about money, let’s go back to the focus of this slideshow.
  • 17. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
  • 18. What are Regular Expressions or regex? Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string for describing a search pattern. You could think of regular expressions as extremely powerful wildcards.
  • 21. What’s it look like? A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ where: ^ starts at the beginning of a string or line; \s{1,3} finds between 1 and 3 whitespace characters (such as spaces); [A-Z0-9]* finds any characters in the ranges A-Z and 0-9, where the “*” is the instruction to find as many occurrences as possible; XYZ finds the literal characters “XYZ”. If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match, whereas if we had “ AZR898XY” we would not.
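The two sample values from the slide can be checked with a few lines of Python. This is only a sketch: the slides do not say which regex engine is in use, and the helper name `starts_with_code` is made up for illustration. The pattern includes the "*" quantifier that the breakdown describes.

```python
import re

# Pattern from the slide, including the "*" described in the breakdown.
# re.MULTILINE lets "^" match at the start of every line, not only the string.
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ", re.MULTILINE)


def starts_with_code(text: str) -> bool:
    """Return True if any line starts with 1-3 whitespace characters,
    then a run of A-Z/0-9 characters that ends in the literal XYZ.
    (Hypothetical helper, not part of any product.)"""
    return PATTERN.search(text) is not None


print(starts_with_code(" AZR8987XYZ"))  # the matching value from the slide
print(starts_with_code(" AZR898XY"))    # the non-matching value from the slide
```

The first call reports a match and the second does not, because the second value never contains the full literal "XYZ".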
  • 22. Huh? Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
  • 23. Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet. Here are some examples: Zip Codes: ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$ US Phone Number: ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$ Credit Card: (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
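As a rough illustration of how the named groups in the ZIP code pattern separate the five-digit and four-digit parts, here is a sketch in Python. Two things are assumptions: the stray spaces in the slide text are treated as line-wrap artifacts and removed, and the group syntax is changed from `(?<name>...)` to Python's `(?P<name>...)`. The function name `parse_zip` is hypothetical.

```python
import re

# ZIP pattern from the slide, adapted to Python's (?P<name>...) group syntax.
ZIP_PATTERN = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)


def parse_zip(value: str):
    """Return the named groups for a ZIP or ZIP+4 value, or None if the
    value does not match. (Hypothetical helper for illustration.)"""
    match = ZIP_PATTERN.match(value)
    return match.groupdict() if match else None


print(parse_zip("12345-6789"))
print(parse_zip("12345"))
print(parse_zip("00000"))
```

The negative lookahead `(?!00000)` rejects the all-zero code, and the `zip4` group is simply `None` when only five digits are present.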
  • 24. Real World Example: Here is a partial invoice where you might need to capture the “Catalogue Number”.
  • 25. In order to start constructing a regular expression we have to use what we know from the data in front of us as well as making some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number has the format of a single uppercase letter, followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.
  • 26. We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down: [A-Z] finds a character from A-Z (the absence of a quantifier specification, “{}”, means we are only looking for 1 character); \d{2} finds exactly 2 digits; - finds the literal character “-”; [A-Z]{0,1} finds a character A-Z between 0 and 1 times; \d{6} finds exactly 6 digits. This is just one of several ways a regular expression for this example could be written. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
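The catalogue number pattern and its relaxed variant can be tried out with a short Python sketch. The sample invoice text and the function name `find_catalogue_numbers` are invented for illustration; the actual invoice on the slide is an image and is not reproduced here.

```python
import re

# Strict form: letter, 2 digits, hyphen, optional letter, exactly 6 digits.
STRICT = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Relaxed form from the slide: the last portion may be 4 to 6 digits.
RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")


def find_catalogue_numbers(text: str, pattern: re.Pattern = STRICT) -> list:
    """Return every substring of text that matches the given pattern.
    (Hypothetical helper for illustration.)"""
    return pattern.findall(text)


# Made-up OCR text standing in for the invoice lines.
sample = "Item B12-A123456 and C07-654321, also D33-4821"

print(find_catalogue_numbers(sample))
print(find_catalogue_numbers(sample, RELAXED))
```

Only the relaxed pattern picks up the shorter value `D33-4821`, which is the kind of refinement the slide describes making during testing.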
  • 27. Now how would this work in a data capture solution? We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields: Company Name, Company Number, Date, SIC Code.
  • 28. Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
  • 29. So where is the regex? Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data. A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
  • 30. Let’s break it down: splitting the scan stack. First we are going to define the regex to perform document splitting when a new Company Name is located, in the Data Mining submenu of ImageRamp’s Splitting and Extraction menu, as shown below: (?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]* … and check the “Split if Matched” option.
  • 31. Capturing the index data. Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the date field using the regex in the Index Fields section of the Data Mining submenu: (?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
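The split-and-index idea from the last two slides can be sketched outside ImageRamp in Python. This is not how ImageRamp works internally, which the slides do not describe. Python's `re` module does not accept the variable-length lookbehinds used on the slides, so the label text is matched directly and the value is taken from a named capture group instead. Case-insensitive matching for the company name, the sample page text, and both function names are assumptions.

```python
import re

# Label matched directly; value captured in a named group (swapped in for
# the slide's lookbehind, which Python's re module does not support).
COMPANY_FIELD = re.compile(
    r"\bCompany\s*Name\s+(?P<CompanyName>[a-z0-9() ]*)", re.IGNORECASE
)
DATE_FIELD = re.compile(
    r"\bDate of this return\s+(?P<Date>\d{2}/\d{2}/\d{4})"
)


def extract_index_fields(page_text: str) -> dict:
    """Collect whichever index fields are found in one page of OCR text."""
    fields = {}
    for pattern in (COMPANY_FIELD, DATE_FIELD):
        match = pattern.search(page_text)
        if match:
            fields.update(match.groupdict())
    return fields


def split_pages(pages: list) -> list:
    """Start a new document whenever a page contains a Company Name match."""
    documents = []
    for text in pages:
        if COMPANY_FIELD.search(text) or not documents:
            documents.append([])
        documents[-1].append(text)
    return documents


# Made-up OCR text for three scanned pages.
pages = [
    "Company Name ACME WIDGETS LIMITED\nDate of this return 14/03/2012",
    "continued page",
    "Company Name Beta (UK) Ltd\nDate of this return 01/02/2012",
]

print(extract_index_fields(pages[0]))
print([len(doc) for doc in split_pages(pages)])
```

Each page that matches the company name pattern opens a new document, so the three sample pages end up as one two-page document and one single-page document.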
  • 32. But wait, there’s more. Information extracted through text data mining with regex can also be used to name the file and create folders. Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
  • 33. and… We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents. Data in the palm of your hand…not locked in your documents!
  • 34. Learn more about the power of ImageRamp and its other features, including: •Full text OCR to PDF •PDF rights management and encryption •Document naming, splitting, and routing based on barcodes and… •Image processing for clean up and adaptive thresholding •OCR (Optical Character Recognition) •Barcode reading (1D and 2D) For more on: •Data Mining PDF •Data mining Scans •Invoice Mining •Patient Record Mining •OCR mining •TIF mining •Extracting meta data •Data extraction from unstructured data •Intelligent data capture •Data extraction •Using regex to extract data •Document scanning •Extracting data •Extract meta data •Scanner software •Barcode recognition •OCR software •Capture tutorial •Pdf scanning •Scanning software •Indexing •Document indexing •Automated capture •Meta data •Scan to index •Batch Processing •Bulk scanning •Docufi •Imageramp •Data capture •Migration to document management
  • 35. More?
  • 36. More? Further reading on Regular Expressions: http://en.wikipedia.org/wiki/Regular_expression http://regexlib.com/ http://www.regular-expressions.info/