Private and confidential
Agenda
● Limitations with current systems that deal with
unstructured documents
● How LLMs help overcome them
● LLM limitations
● Introduction to Unstract
● Structuring long documents: comparing
approaches and impact areas
○ Without using a vector database
○ Using a vector DB with simple retrieval
○ Using a vector DB with subquestion
retrieval
Yours Truly
● Co-founder & CEO of Unstract, a
startup solving structured data
extraction problems powered by
LLMs
● Garden variety computer nerd
● Programming mainly on Linux since
1998
● Loves building things and thus
startups
● Splits time between India and the US
● Why am I an expert on data
extraction?
○ Unstract currently processes
5M+ pages per month
State of the art

PDF Challenges
● Tables
● PDF forms
● Scanned documents
○ Decent scans
○ Not-so-decent scans
● Handwritten scanned forms

Impact
● Real-world applications need to process a wide variety of documents within the same document class
● Manual extraction can slow down critical business processes and can be very expensive as well
Data Structuring: A Huge Challenge
[Diagram: spectrum of document challenges, ranging from those solved by incumbents, to those solved by using LLMs, to those available to solve using future tech, possibly by improvements to LLMs]

Market estimated at 1.1–1.8Bn, with a CAGR of 30–38% over the next 10 years.
Horizontal LLM use cases being monetized now

#1 Q&A Bots + Enterprise Search
#2 Customer Service
#3 Software Engineering
#4 Structured Data Extraction
#5 Sales and Marketing

There are several verticalized use cases where LLMs are being monetized, but horizontal use cases are very limited. Agents might have a huge role to play in the future, but it's too early to say anything now.
Copilots vs. Unstract

Copilot (Incomplete Automation!):
Unstructured docs → Copilot → structured data transferred to a human → human transfers structured data to the system

Unstract (Complete Automation!):
Unstructured docs → machine-to-machine automation → system has structured data without a human in the loop
A brief history of document extraction
Generation 1
OCR
Uses early techniques
to extract text from
images and docs
Better than manual
entry, but zero
intelligence
Generation 2
Machine Learning
Mostly uses computer
vision; little tolerance
for change in form
Way better than OCR,
but still pretty brittle
Generation 3
LLM-based
Uses natural language
to understand context
and extract information
Users don't have to
worry about input
document structure
Introducing Unstract

The open source platform where you're in full control.
LLM Providers | Vector Databases | Embeddings | Text Extractors
Two phases of processing unstructured docs
Prompt Engineering Phase
Build generic prompts that successfully extract
data across variations of the same document
type in Prompt Studio
Deployment Phase
Deploy the Prompt Studio project as an ETL
pipeline, an API endpoint, or as a Manual Review
Queue instance
Overcoming LLM Challenges in Production Use Cases

LLM Challenge (accuracy)
Uses two separate LLMs, one for extraction and one to challenge the result, until the two arrive at a consensus.

SinglePass Extraction (cost savings)
Uses an LLM to construct a single prompt and output JSON to extract, saving token costs and latency.

Summarized Extraction (cost savings)
Uses an LLM to construct a summary of the input based on user prompts, which is then used for extraction, saving tokens and latency.
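The LLM Challenge idea can be sketched as a simple consensus check. This is a minimal illustration, not Unstract's actual implementation; `extractor_llm` and `challenger_llm` are hypothetical stand-ins for calls to two separate LLM providers.

```python
def extract_with_challenge(prompt: str, extractor_llm, challenger_llm):
    # Two separate LLMs answer the same extraction prompt independently.
    # The second model acts as a challenger: agreement means consensus,
    # disagreement flags the field (e.g. for a Manual Review Queue).
    answer = extractor_llm(prompt)
    challenge = challenger_llm(prompt)
    if answer == challenge:
        return {"value": answer, "consensus": True}
    return {"value": None, "consensus": False}
```

In practice the comparison would be fuzzier than string equality (numeric tolerance, whitespace normalization), but the shape of the check is the same.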
Overview
Extracting structured information from
unstructured docs with Unstract
RAG fundamentals

[Diagram: User Query → Embedding Model → Milvus → Retrieved Chunks → LLM → LLM Response]
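As a rough sketch of the retrieval flow above, here is a toy in-memory version. A real deployment would use a proper embedding model and Milvus (or Zilliz Cloud) instead of this bag-of-words lookup; all names are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": a real pipeline would call an
    # embedding model and store the vectors in Milvus.
    return dict(Counter(text.lower().split()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=2):
    # Embed the query, rank the stored chunks by similarity, and
    # return the most similar ones (the "retrieved chunks").
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, retrieved):
    # The retrieved chunks plus the user query form the context the
    # LLM finally answers from.
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

This keeps only the chunks relevant to the query in the prompt, which is exactly why RAG is cheaper than sending the whole document.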
Using Vector DBs to process unstructured docs
● Without using a vector database
○ Useful when processing documents that
are only one or a few pages long
○ When referencing unlabeled data
● Using a vector DB with simple retrieval
● Using a vector DB with subquestion retrieval
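The first approach, going without a vector database, amounts to putting the entire document into a single prompt. A minimal sketch, assuming an `llm` callable that returns a JSON string (the function and field names are illustrative):

```python
import json

def extract_full_document(document_text, fields, llm):
    # No chunking or retrieval: the whole document rides along in the
    # prompt, so the LLM has full context. Viable for short documents
    # like bills and forms; token costs grow with document length.
    prompt = (
        "Extract the following fields as JSON: "
        + ", ".join(fields)
        + "\n\nDocument:\n"
        + document_text
    )
    return json.loads(llm(prompt))
```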
Simple vs. Subquestion Retrieval

Simple Retrieval Strategy
● Same as the RAG architecture representation we saw earlier
● Will work well if the prompt is simple
● There should be no problem if all the information needed is available as part of the returned chunks
● Could be slightly cheaper compared to the subquestion strategy

Subquestion Retrieval Strategy
● An LLM is used to form retrieval-related sub-questions from the user-supplied prompt
● In a loop, the vector DB is queried and relevant chunks are gathered
● Duplicate chunks are eliminated
● Works better when the chunks that contain the response are spread across the document
● Works better since the vector DB only has to deal with simple questions
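The subquestion strategy can be sketched as a small loop; `subquestion_llm` and `vector_search` below are hypothetical stand-ins for an LLM call that decomposes the prompt and a Milvus similarity search.

```python
def subquestion_retrieve(prompt, subquestion_llm, vector_search, top_k=2):
    # 1. An LLM decomposes the complex user prompt into simple,
    #    retrieval-friendly sub-questions.
    subquestions = subquestion_llm(prompt)

    # 2. Query the vector DB once per sub-question, in a loop.
    gathered = []
    for q in subquestions:
        gathered.extend(vector_search(q, top_k))

    # 3. Eliminate duplicate chunks while preserving order, so the
    #    final LLM prompt does not pay for the same chunk twice.
    seen, unique = set(), []
    for chunk in gathered:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```

Each sub-question is a simple query the vector DB handles well, which is why this strategy works when the answer is spread across the document.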
Summary
Strategy: Full Document
Cost: High for large documents ($2.80 in our example)
Accuracy: Very high
Notes: Good accuracy since the LLM has full context. Could be the only viable strategy when dealing with short documents like bills and forms.

Strategy: Simple Retrieval
Cost: Lowest of the lot ($0.36 in our example)
Accuracy: Average, but generally depends on the type of extraction done
Notes: Should generally be avoided unless the user knows what they're doing.

Strategy: Subquestion Retrieval
Cost: Could be slightly higher compared to the Simple Retrieval strategy ($0.50 in our example)
Accuracy: High
Notes: Complex prompts can be broken down into simple questions and sent to the vector DB.
Links
Unstract website
Unstract open source on GitHub
Documentation
LLMWhisperer (our text extraction service)
LLMWhisperer Playground
Unstract Slack Community
Zilliz Cloud (for managed Milvus)
Milvus open source on GitHub
Thank you!

Challenges in Structured Document Data Extraction at Scale with LLMs
