Document Types Explained:
Structured, Semi-Structured,
and Unstructured
When you start looking for an intelligent document processing
(IDP) solution for your business, one of the first questions
vendors ask is what kind of documents you have. They expect you
to pick one of three answers - structured, semi-structured, or
unstructured. But there is no single definitive answer as to which
documents fall into which category. Let’s take a closer look.
Structured Data vs. Unstructured Data
Before we talk about documents, it is worth looking at where
these terms come from.
Historically, transactional systems stored and processed data
that lived in databases. Most of this data had a clear structure -
each data element had a type, a defined length, and in some
cases, a set of possible values. The data lived in cleanly
structured tables, as rows and columns within a database.
This is how this data looked:
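As a minimal sketch (the table and column names are hypothetical,
not taken from any real system), a table like this can be declared
with Python's built-in sqlite3 module. Each column has a type, one
has a length limit, and one accepts only a fixed set of values:

```python
import sqlite3

# Hypothetical schema: every column has a type, one has a length limit,
# and one is restricted to a fixed set of allowed values.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payments (
        payment_id INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL CHECK (length(customer) <= 40),
        amount     REAL NOT NULL CHECK (amount >= 0),
        status     TEXT NOT NULL CHECK (status IN ('PENDING', 'PAID', 'FAILED'))
    )
    """
)
conn.execute(
    "INSERT INTO payments (customer, amount, status) VALUES (?, ?, ?)",
    ("Acme Corp", 120.50, "PAID"),
)


def try_insert(customer, amount, status):
    """Return True if the row fits the structure, False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO payments (customer, amount, status) VALUES (?, ?, ?)",
            (customer, amount, status),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Rows and columns, exactly as declared above.
rows = conn.execute(
    "SELECT payment_id, customer, amount, status FROM payments"
).fetchall()
```

A row whose status is outside the allowed set is rejected by the
database itself, which is what makes this kind of data structured.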
Over time, systems started dealing with free-form text stored as
long strings of characters. This was gradually joined by images,
videos, spreadsheets, audio files, and all sorts of other
multimedia content. This data was collectively referred to as
unstructured data because it did not have any fixed format.
When you look at documents through this lens, all documents
fall into the unstructured data category. This is the first point
of confusion - structured data and unstructured data do not map
to structured documents and unstructured documents.
All documents are unstructured data! But you can further classify
documents into three categories based on how they appear:
1. Structured Documents
2. Semi-Structured Documents
3. Unstructured Documents
Structured Documents
These are documents that have a fixed format, much like their
structured data cousins. You would usually see these as forms,
payment slips, or utility bills from a provider. As long as you deal
with just one provider, you’re dealing with structured documents.
The data in these documents has fixed locations - the date will
always appear in one place, the person’s name will occupy
another fixed location, and so on.
Here is an example of how a structured document looks:
The technology for extracting data from these documents is
fairly straightforward. You can set up a template that runs OCR
and then reads specific coordinates on the document to pull out
the values of different fields.
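A minimal sketch of the coordinate-template idea follows. It
assumes the OCR step has already produced words with bounding
boxes; the Word class, the TEMPLATE regions, and the sample words
are all hypothetical stand-ins, not the output of any particular
OCR engine.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in page units
    y: float  # top edge
    w: float  # width
    h: float  # height


# Hypothetical template for one provider: field name -> (x0, y0, x1, y1) region.
TEMPLATE = {
    "date": (400, 40, 560, 70),
    "customer_name": (40, 100, 300, 130),
}


def extract(words, template):
    """Collect, per field, the OCR words whose centre falls inside the field's region."""
    result = {}
    for field, (x0, y0, x1, y1) in template.items():
        inside = [
            wd for wd in words
            if x0 <= wd.x + wd.w / 2 <= x1 and y0 <= wd.y + wd.h / 2 <= y1
        ]
        inside.sort(key=lambda wd: (wd.y, wd.x))  # rough reading order
        result[field] = " ".join(wd.text for wd in inside)
    return result


# Stand-in for OCR output; a real engine would supply these boxes.
ocr_words = [
    Word("2024-05-01", 410, 45, 90, 15),
    Word("Jane", 45, 105, 40, 15),
    Word("Doe", 90, 105, 35, 15),
    Word("Total", 45, 400, 50, 15),
]
fields = extract(ocr_words, TEMPLATE)
```

Because the regions are hard-coded, the sketch works only while
the provider keeps every field in the same place.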
Important considerations
One big challenge with structured documents is that you need to
create one template for each provider. If you are processing
utility bills, you will need a template for each variation of the
bill. This is not much of a problem in the beginning, when the
number of variations is small. But as variations increase, creating
templates for every new provider becomes more than a full-time
job.
The second problem is that templates change. Providers may
redesign the layout of a document or upgrade their
document-producing software and inadvertently start sending
completely new formats that break the template. Unfortunately,
you only find out that the layout has changed when your data
extraction stops working. Then you need to work overtime to fix
the template and get it working again.
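One possible way to notice a broken template sooner is to validate
every extracted value against the shape you expect and watch the
failure rate. The sketch below is an assumption-laden illustration:
the field names, the patterns, and the 20% threshold are all
hypothetical and would need tuning for real documents.

```python
import re

# Hypothetical per-field patterns; real rules depend on your documents.
EXPECTED = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "account_number": re.compile(r"\d{8,12}"),
}


def failed_fields(extracted):
    """Return the names of fields that are missing or do not match their pattern."""
    return sorted(
        name for name, pattern in EXPECTED.items()
        if not pattern.fullmatch(extracted.get(name, "").strip())
    )


def template_looks_broken(batch, max_failure_rate=0.2):
    """Flag a batch when too many documents fail validation (threshold is an assumption)."""
    if not batch:
        return False
    failures = sum(1 for doc in batch if failed_fields(doc))
    return failures / len(batch) > max_failure_rate
```

A sudden jump in failures for one provider suggests that the
layout has moved, even though the extraction job itself still runs.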
Semi-Structured Documents
Some documents have a fixed set of data but no fixed format for
this data. In one variation, the date appears in the top right
corner; in another, it is at the center of the document; and in
yet another, you’ll find it in the bottom left corner. An added
complication is that the same data is labeled with different
names. In one variation, a field may be called “Purchase Order
Number”, in another “PO Number”, and a few others may call it
“PO #”, “PO No.” or “Order Number”. These variations are
endless, and because of these two challenges, you cannot use a
template-based solution for these documents.
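To make the labeling problem concrete, here is a small sketch of a
hand-maintained alias list for the purchase order field. It is not
the machine learning approach this post recommends; the function
names and the canonical name po_number are hypothetical, and the
list covers only the variants mentioned above.

```python
import re

# Hand-maintained aliases for one field; the list is illustrative, not exhaustive.
PO_ALIASES = ["purchase order number", "po number", "po #", "po no.", "order number"]


def normalize(label):
    """Lowercase, trim a trailing colon, and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", label.strip().rstrip(":").strip().lower())


def canonical_field(label):
    """Map a printed label to a canonical field name, or None if it is unknown."""
    return "po_number" if normalize(label) in PO_ALIASES else None
```

A label the list has never seen, such as "P.O. Ref", comes back as
None, and every such miss means another manual update.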
Extracting data from these documents requires robust machine
learning algorithms that learn from examples rather than fixed
rules. You will also need natural language processing
capabilities to understand the context of each field.
This is how semi-structured documents look:
As you can see, these documents contain essentially the same
information, but each captures it in a totally different format.
Important considerations
Processing semi-structured documents requires a probabilistic
approach based on machine learning algorithms. Without that,
you will get good results for a few document types and
not-so-great results for a long tail of variations. You will also
need the ability to add new data points on the fly.
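In practice, a probabilistic approach usually means that each
candidate value comes with a confidence score instead of a
yes-or-no answer. The sketch below assumes a model has already
scored the candidates; the Candidate class, the scores, and the
0.85 review threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    value: str
    confidence: float  # model score in [0, 1]; hypothetical here


def choose(candidates, review_threshold=0.85):
    """Pick the highest-scoring candidate and say whether a person should review it."""
    if not candidates:
        return None, True
    best = max(candidates, key=lambda c: c.confidence)
    return best.value, best.confidence < review_threshold
```

Low-confidence results can then go to a person for correction
instead of failing outright, which is one way to handle the long
tail of variations.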
Unstructured Documents
The third category is reserved for documents that do not have
any fixed layout or fixed data points. These are free-flowing,
verbose documents, similar to this blog post, that can present
any information anywhere and in any format.
Processing these kinds of documents requires a significant
amount of configuration and customization to let the IDP
platform learn from your specific documents. This would involve
machine learning training, a custom preprocessing pipeline, and
computer vision-based recognition for visual components such
as charts, complex tables, and graphs.
Important considerations
First, processing unstructured documents requires quite a bit of
upfront investment. It would be prudent to calculate the ROI for
these implementations before you go too far. To justify the
investment, you need either a considerable volume of documents
or high business value in those documents. Second, since this
implementation involves quite a bit of customization, the
time-to-market is generally longer. You can spend anywhere from
6 months to a year implementing this type of solution. The key to
success is to split the problem into multiple phases and have
measurable success criteria for each phase.
In Summary
A majority of high-value documents are either semi-structured or
unstructured. OCR and manual corrections usually provide a
good enough return for simple, structured document processing.
However, less structured documents need far more
comprehensive technology to process. A number of vendors and
solutions for structured documents do a pretty good job of data
extraction. But as you move into semi-structured and
unstructured documents, the vendor landscape shrinks
considerably.
The variations that call for template-free extraction make it
difficult for most IDP platforms to perform well. Most businesses
are left with only one option: engaging a Systems Integrator (SI)
to build a custom implementation. These projects usually take a
very long time and often fail to deliver on accuracy and speed. A
comprehensive machine learning and AI-based IDP platform such
as Infrrd can provide the predictability and high accuracy needed
in data extraction from semi-structured and unstructured
documents.
