Private and confidential
Agenda
● Limitations with current systems that deal with
unstructured documents
● How LLMs help overcome them
● LLM limitations
● Introduction to Unstract
● Structuring long documents: comparing
approaches and impact areas
○ Without using a vector database
○ Using a vector DB with simple retrieval
○ Using a vector DB with subquestion
retrieval
Yours Truly
● Co-founder & CEO of Unstract, a
startup solving structured data
extraction problems powered by
LLMs
● Garden variety computer nerd
● Programming mainly on Linux since
1998
● Loves building things and thus
startups
● Splits time between India and the US
● Why am I an expert on data
extraction?
○ Unstract currently processes
5M+ pages per month
State of the art

PDF Challenges
● Tables
● PDF forms
● Scanned documents
○ Decent scans
○ Not-so-decent scans
● Handwritten scanned forms

Impact
● Real-world applications need to process a wide variety of documents within the same document class
● Manual extraction can slow down critical business processes and can be very expensive as well
Data Structuring: A Huge Challenge
[Diagram: spectrum of document challenges, ranging from those solved by incumbents, to those solved by using LLMs, to those available to solve using future tech, possibly by improvements to LLMs]

Market estimated at 1.1–1.8Bn, with a CAGR of 30–38% over the next 10 years.
Horizontal LLM use cases being monetized now

#1 Q&A Bots + Enterprise Search
#2 Customer Service
#3 Software Engineering
#4 Structured Data Extraction
#5 Sales and Marketing

There are several verticalized use cases where LLMs are being monetized, but horizontal use cases are very limited. Agents might have a huge role to play in the future, but it's too early to say anything now.
Copilots vs. Unstract

Copilot (Incomplete Automation!):
Unstructured docs → Copilot → structured data transferred to a human → human transfers structured data to the system

Unstract (Complete Automation!):
Unstructured docs → machine-to-machine automation → system has structured data without a human in the loop
A brief history of document extraction
Generation 1
OCR
Uses early techniques
to extract text from
images and docs
Better than manual
entry, but zero
intelligence
Generation 2
Machine Learning
Mostly uses computer
vision; little tolerance
for change in form
Way better than OCR,
but still pretty brittle
Generation 3
LLM-based
Uses natural language
to understand context
and extract information
Users don't have to
worry about input
document structure
Introducing Unstract

The open source platform where you're in full control.
LLM Providers | Vector Databases | Embeddings | Text Extractors
Two phases of processing unstructured docs
Prompt Engineering Phase
Build generic prompts that successfully extract
data across variations of the same document
type in Prompt Studio
Deployment Phase
Deploy the Prompt Studio project as an ETL
pipeline, an API endpoint, or as a Manual Review
Queue instance
Overcoming LLM Challenges in Production Use Cases

LLM Challenge (accuracy)
Uses two separate LLMs, one for extraction and one to challenge the result, until the two arrive at a consensus.

SinglePass Extraction (cost savings)
Uses an LLM to construct a single prompt and output JSON to extract, saving token costs and latency.

Summarized Extraction (cost savings)
Uses an LLM to construct a summary of the input based on user prompts, which is then used for extraction, saving tokens and latency.
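The LLM Challenge idea can be sketched as a simple consensus check. This is a minimal illustration, not Unstract's actual implementation; `extractor_llm` and `challenger_llm` are hypothetical stand-ins for calls to two separate LLM providers.

```python
def extract_with_challenge(prompt: str, extractor_llm, challenger_llm):
    # Two separate LLMs answer the same extraction prompt independently.
    # The second model acts as a challenger: agreement means consensus,
    # disagreement flags the field (e.g. for a Manual Review Queue).
    answer = extractor_llm(prompt)
    challenge = challenger_llm(prompt)
    if answer == challenge:
        return {"value": answer, "consensus": True}
    return {"value": None, "consensus": False}
```

In practice the comparison would be fuzzier than string equality (numeric tolerance, whitespace normalization), but the shape of the check is the same.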
Overview
Extracting structured information from
unstructured docs with Unstract
RAG fundamentals

[Diagram: User Query → Embedding Model → Milvus → Retrieved Chunks → LLM → LLM Response]
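As a rough sketch of the retrieval flow above, here is a toy in-memory version. A real deployment would use a proper embedding model and Milvus (or Zilliz Cloud) instead of this bag-of-words lookup; all names are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": a real pipeline would call an
    # embedding model and store the vectors in Milvus.
    return dict(Counter(text.lower().split()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    # Embed the query, rank the stored chunks by similarity, and
    # return the most similar ones (the "retrieved chunks").
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, retrieved):
    # The retrieved chunks plus the user query form the context the
    # LLM finally answers from.
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

This keeps only the chunks relevant to the query in the prompt, which is exactly why RAG is cheaper than sending the whole document.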
Using Vector DBs to process unstructured docs
● Without using a vector database
○ Useful when processing documents that
are only one or a few pages long
○ When referencing unlabeled data
● Using a vector DB with simple retrieval
● Using a vector DB with subquestion retrieval
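The first approach, going without a vector database, amounts to putting the entire document into a single prompt. A minimal sketch, assuming an `llm` callable that returns a JSON string (the function and field names are illustrative):

```python
import json

def extract_full_document(document_text, fields, llm):
    # No chunking or retrieval: the whole document rides along in the
    # prompt, so the LLM has full context. Viable for short documents
    # like bills and forms; token costs grow with document length.
    prompt = (
        "Extract the following fields as JSON: "
        + ", ".join(fields)
        + "\n\nDocument:\n"
        + document_text
    )
    return json.loads(llm(prompt))
```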
Simple vs. Subquestion Retrieval

Simple Retrieval Strategy
● Same as the RAG architecture representation we saw earlier
● Will work well if the prompt is simple
● There should be no problem if all the information needed is available as part of the returned chunks
● Could be slightly cheaper compared to the subquestion strategy

Subquestion Retrieval Strategy
● An LLM is used to form retrieval-related sub-questions from the user-supplied prompt
● In a loop, the vector DB is queried and relevant chunks are gathered
● Duplicate chunks are eliminated
● Works better when the chunks that contain the response are spread across the document
● Works better since the vector DB only has to deal with simple questions
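The subquestion strategy can be sketched as a small loop; `subquestion_llm` and `vector_search` below are hypothetical stand-ins for an LLM call that decomposes the prompt and a Milvus similarity search.

```python
def subquestion_retrieve(prompt, subquestion_llm, vector_search, top_k=2):
    # 1. An LLM decomposes the complex user prompt into simple,
    #    retrieval-friendly sub-questions.
    subquestions = subquestion_llm(prompt)

    # 2. Query the vector DB once per sub-question, in a loop.
    gathered = []
    for q in subquestions:
        gathered.extend(vector_search(q, top_k))

    # 3. Eliminate duplicate chunks while preserving order, so the
    #    final LLM prompt does not pay for the same chunk twice.
    seen, unique = set(), []
    for chunk in gathered:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```

Each sub-question is a simple query the vector DB handles well, which is why this strategy works when the answer is spread across the document.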
Summary
Strategy: Full Document
Cost: High for large documents ($2.80 in our example)
Accuracy: Very high
Notes: Good accuracy since the LLM has full context. Could be the only viable strategy when dealing with short documents like bills and forms.

Strategy: Simple Retrieval
Cost: Lowest of the lot ($0.36 in our example)
Accuracy: Average, but generally depends on the type of extraction done
Notes: Should generally be avoided unless the user knows what they're doing.

Strategy: Subquestion Retrieval
Cost: Could be slightly higher compared to the Simple Retrieval strategy ($0.50 in our example)
Accuracy: High
Notes: Complex prompts can be broken down into simple questions and sent to the vector DB.
Links
Unstract website
Unstract open source on GitHub
Documentation
LLMWhisperer (our text extraction service)
LLMWhisperer Playground
Unstract Slack Community
Zilliz Cloud (for managed Milvus)
Milvus open source on GitHub
Thank you!

Challenges in Structured Document Data Extraction at Scale with LLMs
