In my talk I will cover the task of tabular QA applied to unstructured PDF documents. I will walk you through each stage of the pipeline, from data preparation to modelling, and share valuable insights our team has gathered along the way.
[DSC Europe 23] Vladimir Ageev - From Tables to Answers: building QA System for In-Document Searches
1. Vladimir Ageev, Lead DS @EPAM
November 2023
EPAM Proprietary & Confidential.
2. We’ll cover
Tabular Question Answering case-study
• Business problem
• State of the Art
• Fine-tuning
• Productionalization
4. Product
A SaaS platform designed to integrate technical publications into engineering workflows:
• provides access to publications
• enriches the experience with features like smart search, comparison, and entity linking
5. Table Question Answering
Why?
• Hundreds of thousands of popular PDFs contain tables
• Keyword-based search might not find them
• Semantic search and general QA models do not account for table structure
6. TQA: formal task
INPUT
User query: "TruthfulQA highest % true"
Table representation:
[{
  "text": "% true",
  "row_id": 0,
  "col_id": 3
}, …]
Caption: "Table 44: Evaluation results on …"
OUTPUT
Answer coordinates:
{
  "text": "79.92",
  "operation": None,
  "cells": [{
    "col_id": 3,
    "row_id": 15
  }]
}
Assumption: table detection, parsing, and retrieval from PDFs are solved.
Task: given a query and a table, return the answer cells to highlight, e.g. row_id: 15, col_id: 3.
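The input/output contract above can be sketched as plain Python dataclasses (the `TableCell`/`TqaAnswer`/`highlight_coords` names below are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TableCell:
    # One parsed table cell, addressed by grid coordinates.
    text: str
    row_id: int
    col_id: int

@dataclass
class TqaAnswer:
    # Model output: answer text, optional aggregation operation,
    # and the cells the answer was taken from.
    text: str
    operation: Optional[str]
    cells: List[TableCell] = field(default_factory=list)

def highlight_coords(answer: TqaAnswer) -> List[Tuple[int, int]]:
    """Return (row_id, col_id) pairs to highlight in the PDF viewer."""
    return [(c.row_id, c.col_id) for c in answer.cells]

# Example from the slide: the answer "79.92" lives at row 15, column 3.
answer = TqaAnswer(text="79.92", operation=None,
                   cells=[TableCell("79.92", row_id=15, col_id=3)])
```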
10. State of TQA
What model to choose?
• SOTA: Dater – uses OpenAI GPT-3 under the hood; not secure enough for us
• TaBERT – CC-BY-NC 4.0 licensed
• OmniTab – seq2seq model; generative, no cell highlighting
• TAPEX – BART-based model; generative, no cell highlighting
• TAPAS – BERT-based model; supports aggregations and cell highlighting; MIT License
Scores on WikiTableQuestions*
*https://paperswithcode.com/sota/semantic-parsing-on-wikitablequestions
11. TAPAS: how it works
A BERT-based transformer encoder with two classification heads: cell selection and aggregation operation.
Additional positional embeddings:
• Column ID
• Row ID
• Segment: query / table
• Rank: non-comparable, or the value’s order within its column
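A minimal sketch of how TAPAS-style extra token-type IDs can be assigned to a flattened `[query; table]` sequence (function and variable names are illustrative; the real implementation lives in the `transformers` library):

```python
def assign_token_types(query_tokens, table):
    """Assign TAPAS-style segment / column / row IDs per token.

    `table` is a list of cells: (text_tokens, row_id, col_id).
    Returns three parallel lists. As in TAPAS, column/row ID 0 is
    reserved for query tokens ("not part of the table"), so real
    table coordinates are shifted by one.
    """
    segment, col_ids, row_ids = [], [], []
    for _ in query_tokens:                      # segment 0: the query
        segment.append(0); col_ids.append(0); row_ids.append(0)
    for text_tokens, row_id, col_id in table:   # segment 1: table cells
        for _ in text_tokens:
            segment.append(1)
            col_ids.append(col_id + 1)          # shift: 0 means "no cell"
            row_ids.append(row_id + 1)
    return segment, col_ids, row_ids

# Query "highest % true" plus a one-cell table at (row 15, col 3):
seg, cols, rows = assign_token_types(
    ["highest", "%", "true"], [(["79.92"], 15, 3)])
```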
16. Evaluation & fine-tuning
Question types:
• Extractive: answer + cells
• Generative: answer + cells + aggregation
• Unanswerable: none
Dataset size: ~3K tables, ~10 QA pairs per table
Annotation:
• ~3 months
• 2–5 annotators, 2 rounds:
  • separate tables – more diverse
  • several tables per document – for retrieval tests
18. Evaluation & fine-tuning
How to evaluate?
• F1 at the cell-set level
• F1 at the answer-token level
• Micro / macro averaging over question types / tables / docs
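The two offline metrics can be sketched as set-based F1 over cells and bag-of-tokens F1 over answer strings (a common SQuAD-style formulation; the talk does not give the team's exact implementation):

```python
from collections import Counter

def _f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def cell_f1(pred_cells, gold_cells):
    """F1 between predicted and gold sets of (row_id, col_id) pairs."""
    pred, gold = set(pred_cells), set(gold_cells)
    if not pred and not gold:
        return 1.0              # unanswerable question handled correctly
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    return _f1(tp / len(pred), tp / len(gold))

def token_f1(pred_answer, gold_answer):
    """Bag-of-tokens overlap F1 between answer strings."""
    pred = Counter(pred_answer.lower().split())
    gold = Counter(gold_answer.lower().split())
    tp = sum((pred & gold).values())
    if tp == 0:
        return 0.0
    return _f1(tp / sum(pred.values()), tp / sum(gold.values()))
```

Macro averaging then means averaging these per question type (or table, or document) first, while micro averaging pools all QA pairs together.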
Is 80% F1 enough? 50%? Run an impression test!
• Retrieval quality
• Correct response rate
• Overall impression (good to go?)
“Rate the service according to the following:
Very poor – the service doesn’t meet expectations.
…
Very good – the service provides a great experience.”
“Is the answer relevant? Were cells highlighted? Are the highlighted cells correct?”
19. Model/training parameters
Resources: 1x Nvidia A100 80GB
What worked for us:
Training speed-up
• Gradient checkpointing
• Tensor Cores: torch.set_float32_matmul_precision('high')
Optimization
• LR scheduling – cyclic warmup + cosine decay
Data
• Down-sample the “unanswerable” type within each batch
• Dropout
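One cycle of the warmup-plus-cosine-decay schedule can be sketched as a pure function returning an LR multiplier per step (shape only; the talk does not give the actual hyperparameters, so `warmup_steps` and `total_steps` are illustrative):

```python
import math

def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, then cosine decay towards 0.0.

    Can be plugged into e.g.
    torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=...).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay
```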
20. Performance
*Most of the tests were conducted by Vadzim Piatrou

Model                 | Macro tok-F1 | Macro cell-F1 | Extractive tok-F1 | Extractive cell-F1 | Generative tok-F1 | Generative cell-F1 | Unanswerable tok-F1 | Unanswerable cell-F1
TAPAS-Large Baseline  | 25.6 | 33.4 | 42.9 | 45.3 | 11.6 | 32.0 | 15.0 | 16.8
TAPAS-Large Finetuned | 45.2 | 63.21 | 57.8 | 59.8 | 1.3 | 57.2 | 76.0 | 76.1
TAPAS-Base Baseline   | 23.6 | 29.0 | 26.9 | 38.3 | 10.7 | 26.9 | 17.6 | 18.0

The best model is Large.
Finetuned models are up to 2x better (incl. on impression tests).
Generation capability degrades strongly.
21. Can we use LLMs for it?
A good prompt is all you need, right?
Issues we faced:
• Hallucination – the model makes up facts outside of the table context
• Difficulty understanding cell coordinates
• Producing structured output
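One common mitigation for the structured-output issue is to parse and validate the LLM's reply against the table's bounds before trusting it (a generic sketch; not the talk's actual prompt or schema):

```python
import json

def parse_llm_answer(raw, n_rows, n_cols):
    """Parse an LLM reply expected to follow the TQA output schema.

    Returns the dict if it is valid JSON with in-bounds cell coordinates,
    otherwise None (the caller can retry or fall back to TAPAS).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    cells = data.get("cells")
    if not isinstance(data.get("text"), str) or not isinstance(cells, list):
        return None
    for cell in cells:
        if not (isinstance(cell, dict)
                and 0 <= cell.get("row_id", -1) < n_rows
                and 0 <= cell.get("col_id", -1) < n_cols):
            return None
    return data

good = '{"text": "79.92", "cells": [{"row_id": 15, "col_id": 3}]}'
bad = '{"text": "79.92", "cells": [{"row_id": 99, "col_id": 3}]}'
```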
23. Llama vs TAPAS

Model                | Macro tok-F1 | Macro cell-F1 | Extractive tok-F1 | Extractive cell-F1 | Generative tok-F1 | Generative cell-F1 | Unanswerable tok-F1 | Unanswerable cell-F1
TAPAS-Base Baseline  | 23.6 | 29.0 | 26.9 | 38.3 | 10.7 | 26.9 | 17.6 | 18.0
TAPAS-Base Finetuned | 43.7 | 60.0 | 55.5 | 56.7 | 1.2 | 52.7 | 73.9 | 74.0
Llama 2              | - | - | - | -

Llama 2 is comparable to the baseline; generative questions are better.
Performance: TAPAS-Base on CPU is faster than Llama 2 4-bit on GPU.
6.5K test QA pairs take:
• TAPAS-Base: ~45 mins, ~2 sec/pair, on CPU
• Llama 2: ~40 hours, ~10–20 sec/pair, on an Nvidia A100 80GB GPU
*Most of the tests were conducted by Vadzim Piatrou
25. Question-table classifier
Good table retrieval is 80% of success:
• If we are not sure about the answer, let’s still highlight the table
• Note that TAPAS confidence is either 0 or 1
60% F1 is the best we’ve got – let’s build a classifier to decide!
Query + candidate table + TAPAS answer → features → LGBM classifier → “simple” / “hard”
It achieves about 80% F1, and the selected “simple” answers have F1 > 80%.
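A sketch of the feature-extraction step feeding such a classifier; the exact feature set used in the talk is not given, so these are plausible signals from the query, the table, and the TAPAS output (the resulting vector would then go to a gradient-boosted classifier such as LightGBM):

```python
def qt_features(query, table_cells, tapas_answer):
    """Build a feature vector for a 'simple vs hard' question-table classifier.

    `table_cells` is a list of dicts with a "text" key (the slide's table
    representation); `tapas_answer` follows the output schema.
    """
    answer_cells = tapas_answer.get("cells", [])
    query_tokens = set(query.lower().split())
    table_tokens = set()
    for cell in table_cells:
        table_tokens.update(cell["text"].lower().split())
    overlap = len(query_tokens & table_tokens) / max(1, len(query_tokens))
    return [
        len(query.split()),                              # query length
        len(table_cells),                                # table size
        len(answer_cells),                               # cells TAPAS selected
        1.0 if tapas_answer.get("operation") else 0.0,   # aggregation used?
        overlap,                                         # query/table lexical overlap
    ]

features = qt_features(
    "TruthfulQA highest % true",
    [{"text": "% true", "row_id": 0, "col_id": 3}],
    {"text": "79.92", "operation": None,
     "cells": [{"row_id": 15, "col_id": 3}]})
```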
27. Service schema
Approximate architecture – let’s cover it step by step:
client → query → search service (backed by a search index)
documents storage → custom document decomposition → tables → search index
query + candidate tables → TAPAS → QT classifier → answers → client
28. Service schema
Custom Document Decomposition: a custom engine (Tesseract-based + custom models) responsible for document decomposition:
- layout recognition (paragraph, title, section, etc.)
- table/figure detection
29. Service schema
Search: the search index plus a custom Go-based service that:
• manages search indices
• provides an API for other services (like our TQA) for collection management and their features
30. Service schema
TAPAS and the QT classifier are deployed as a REST service.
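A minimal sketch of wrapping the model behind a REST endpoint, using only the Python standard library (the actual deployment stack is not named in the talk; `answer_question` is a stand-in for TAPAS + QT-classifier inference):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer_question(query, table):
    """Stand-in for TAPAS + QT-classifier inference."""
    return {"text": "79.92", "operation": None,
            "cells": [{"row_id": 15, "col_id": 3}]}

class TqaHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/answer":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = answer_question(payload["query"], payload.get("table", []))
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep request logging quiet
        pass

def serve(port=8080):
    """Blocking entry point for the service."""
    HTTPServer(("127.0.0.1", port), TqaHandler).serve_forever()
```

In production you would typically put this behind a proper framework and batch requests to the model; the sketch only shows the request/response contract.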
33. Summary
TQA is not “solved” yet!
- models reach ~60% accuracy on open datasets
- zero-shot with open-source LLMs is not enough
Annotation for TQA takes a long time:
- you need a dedicated team (SMEs in a perfect world)
- for a small team it might take months!
- it is worth the wait: the increase in metrics can be up to 2x
Use both offline and online metrics:
- token- / cell-level F1
- measure impression
- modest accuracy might still be enough for the business