In my talk I will cover the task of tabular QA applied to unstructured PDF documents. I will walk you through each stage of the pipeline, from data preparation to modelling, and share valuable insights our team has gathered along the way.
[DSC Europe 23] Vladimir Ageev - From Tables to Answers: building QA System for In-Document Searches
1. Vladimir Ageev, Lead DS @EPAM
November 2023
EPAM Proprietary & Confidential.
2. We’ll cover
Tabular Question Answering case-study
• Business problem
• State of the Art
• Fine-tuning
• Productionalization
4. Product
A SaaS platform designed to integrate technical publications into engineering workflows:
• provides access to publications
• enriches the experience with features like smart search, comparison, and entity linking
5. Table Question Answering
Why?
• Hundreds of thousands of popular PDFs contain tables
• Keyword-based search might not find them
• Semantic search and general QA models do not account for table structure
6. TQA: formal task
INPUT
User query: "TruthfulQA highest % true"
Table representation:
[{
  "text": "% true",
  "row_id": 0,
  "col_id": 3
}, …]
Caption: "Table 44: Evaluation results on …"
OUTPUT
Answer coordinates:
{
  "text": "79.92",
  "operation": None,
  "cells": [{
    "col_id": 3,
    "row_id": 15
  }]
}
Assumption: table detection, parsing, and retrieval from PDFs are solved.
Task: given a query and a table, return the answer cells to highlight, e.g. row_id: 15, col_id: 3.
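The input/output contract above can be sketched as plain Python dataclasses (the `TableCell`/`TqaAnswer`/`highlight_coords` names below are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TableCell:
    # One parsed table cell, addressed by grid coordinates.
    text: str
    row_id: int
    col_id: int

@dataclass
class TqaAnswer:
    # Model output: answer text, optional aggregation operation,
    # and the cells the answer was taken from.
    text: str
    operation: Optional[str]
    cells: List[TableCell] = field(default_factory=list)

def highlight_coords(answer: TqaAnswer) -> List[Tuple[int, int]]:
    """Return (row_id, col_id) pairs to highlight in the PDF viewer."""
    return [(c.row_id, c.col_id) for c in answer.cells]

# Example from the slide: the answer "79.92" lives at row 15, column 3.
answer = TqaAnswer(text="79.92", operation=None,
                   cells=[TableCell("79.92", row_id=15, col_id=3)])
```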
10. State of TQA
What model to choose?
• SOTA: Dater – uses OpenAI GPT-3 under the hood; not secure enough for us
• TaBERT – CC-BY-NC 4.0 licensed
• OmniTab – seq2seq model; generative, no cell highlighting
• TAPEX – BART-based model; generative, no cell highlighting
• TAPAS – BERT-based model; supports aggregations and cell highlighting; MIT License
Scores on WikiTableQuestions*
*https://paperswithcode.com/sota/semantic-parsing-on-wikitablequestions
11. TAPAS: how it works
A BERT-based transformer encoder with two classification heads: cell selection and aggregation operation.
Additional positional embeddings:
• Column ID
• Row ID
• Segment: query / table
• Rank: non-comparable, or the value’s order within its column
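A minimal sketch of how TAPAS-style extra token-type IDs can be assigned to a flattened `[query; table]` sequence (function and variable names are illustrative; the real implementation lives in the `transformers` library):

```python
def assign_token_types(query_tokens, table):
    """Assign TAPAS-style segment / column / row IDs per token.

    `table` is a list of cells: (text_tokens, row_id, col_id).
    Returns three parallel lists. As in TAPAS, column/row ID 0 is
    reserved for query tokens ("not part of the table"), so real
    table coordinates are shifted by one.
    """
    segment, col_ids, row_ids = [], [], []
    for _ in query_tokens:                      # segment 0: the query
        segment.append(0); col_ids.append(0); row_ids.append(0)
    for text_tokens, row_id, col_id in table:   # segment 1: table cells
        for _ in text_tokens:
            segment.append(1)
            col_ids.append(col_id + 1)          # shift: 0 means "no cell"
            row_ids.append(row_id + 1)
    return segment, col_ids, row_ids

# Query "highest % true" plus a one-cell table at (row 15, col 3):
seg, cols, rows = assign_token_types(
    ["highest", "%", "true"], [(["79.92"], 15, 3)])
```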
16. Evaluation & fine-tuning
Question types:
• Extractive: answer + cells
• Generative: answer + cells + aggregation
• Unanswerable: none
Dataset size: ~3K tables, ~10 QA pairs per table
Annotation:
• ~3 months
• 2–5 annotators, 2 rounds:
  • separate tables – more diverse
  • several tables per document – for retrieval tests
18. Evaluation & fine-tuning
How to evaluate?
• F1 at the cell-set level
• F1 at the answer-token level
• Micro / macro averaging over question types / tables / docs
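The two offline metrics can be sketched as set-based F1 over cells and bag-of-tokens F1 over answer strings (a common SQuAD-style formulation; the talk does not give the team's exact implementation):

```python
from collections import Counter

def _f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def cell_f1(pred_cells, gold_cells):
    """F1 between predicted and gold sets of (row_id, col_id) pairs."""
    pred, gold = set(pred_cells), set(gold_cells)
    if not pred and not gold:
        return 1.0              # unanswerable question handled correctly
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    return _f1(tp / len(pred), tp / len(gold))

def token_f1(pred_answer, gold_answer):
    """Bag-of-tokens overlap F1 between answer strings."""
    pred = Counter(pred_answer.lower().split())
    gold = Counter(gold_answer.lower().split())
    tp = sum((pred & gold).values())
    if tp == 0:
        return 0.0
    return _f1(tp / sum(pred.values()), tp / sum(gold.values()))
```

Macro averaging then means averaging these per question type (or table, or document) first, while micro averaging pools all QA pairs together.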
Is 80% F1 enough? 50%? Run an impression test!
• Retrieval quality
• Correct response rate
• Overall impression (good to go?)
“Rate the service according to the following:
Very poor – the service doesn’t meet expectations.
…
Very good – the service provides a great experience.”
“Is the answer relevant? Were cells highlighted? Are the highlighted cells correct?”
19. Model/training parameters
Resources: 1x Nvidia A100 80GB
What worked for us:
Training speed-up
• Gradient checkpointing
• Tensor Cores: torch.set_float32_matmul_precision('high')
Optimization
• LR scheduling – cyclic warmup + cosine decay
Data
• Down-sample the “unanswerable” type within each batch
• Dropout
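One cycle of the warmup-plus-cosine-decay schedule can be sketched as a pure function returning an LR multiplier per step (shape only; the talk does not give the actual hyperparameters, so `warmup_steps` and `total_steps` are illustrative):

```python
import math

def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, then cosine decay towards 0.0.

    Can be plugged into e.g.
    torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=...).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay
```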
20. Performance
*Most of the tests were conducted by Vadzim Piatrou

Model                 | Macro tok-F1 | Macro cell-F1 | Extractive tok-F1 | Extractive cell-F1 | Generative tok-F1 | Generative cell-F1 | Unanswerable tok-F1 | Unanswerable cell-F1
TAPAS-Large Baseline  | 25.6 | 33.4 | 42.9 | 45.3 | 11.6 | 32.0 | 15.0 | 16.8
TAPAS-Large Finetuned | 45.2 | 63.21 | 57.8 | 59.8 | 1.3 | 57.2 | 76.0 | 76.1
TAPAS-Base Baseline   | 23.6 | 29.0 | 26.9 | 38.3 | 10.7 | 26.9 | 17.6 | 18.0

The best model is Large.
Finetuned models are up to 2x better (incl. on impression tests).
Generation capability degrades strongly.
21. Can we use LLMs for it?
A good prompt is all you need, right?
Issues we faced:
• Hallucination – the model makes up facts outside of the table context
• Difficulty understanding cell coordinates
• Producing structured output
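One common mitigation for the structured-output issue is to parse and validate the LLM's reply against the table's bounds before trusting it (a generic sketch; not the talk's actual prompt or schema):

```python
import json

def parse_llm_answer(raw, n_rows, n_cols):
    """Parse an LLM reply expected to follow the TQA output schema.

    Returns the dict if it is valid JSON with in-bounds cell coordinates,
    otherwise None (the caller can retry or fall back to TAPAS).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    cells = data.get("cells")
    if not isinstance(data.get("text"), str) or not isinstance(cells, list):
        return None
    for cell in cells:
        if not (isinstance(cell, dict)
                and 0 <= cell.get("row_id", -1) < n_rows
                and 0 <= cell.get("col_id", -1) < n_cols):
            return None
    return data

good = '{"text": "79.92", "cells": [{"row_id": 15, "col_id": 3}]}'
bad = '{"text": "79.92", "cells": [{"row_id": 99, "col_id": 3}]}'
```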
23. Llama vs TAPAS

Model                | Macro tok-F1 | Macro cell-F1 | Extractive tok-F1 | Extractive cell-F1 | Generative tok-F1 | Generative cell-F1 | Unanswerable tok-F1 | Unanswerable cell-F1
TAPAS-Base Baseline  | 23.6 | 29.0 | 26.9 | 38.3 | 10.7 | 26.9 | 17.6 | 18.0
TAPAS-Base Finetuned | 43.7 | 60.0 | 55.5 | 56.7 | 1.2 | 52.7 | 73.9 | 74.0
Llama 2              | - | - | - | -

Llama 2 is comparable to the baseline; generative questions are better.
Performance: TAPAS-Base on CPU is faster than Llama 2 4-bit on GPU.
6.5K test QA pairs take:
• TAPAS-Base: ~45 mins, ~2 sec/pair, on CPU
• Llama 2: ~40 hours, ~10–20 sec/pair, on an Nvidia A100 80GB GPU
*Most of the tests were conducted by Vadzim Piatrou
25. Question-table classifier
Good table retrieval is 80% of success:
• If we are not sure about the answer, let’s still highlight the table
• Note that TAPAS confidence is either 0 or 1
60% F1 is the best we’ve got – let’s build a classifier to decide!
Query + candidate table + TAPAS answer → features → LGBM classifier → “simple” / “hard”
It achieves about 80% F1, and the selected “simple” answers have F1 > 80%.
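A sketch of the feature-extraction step feeding such a classifier; the exact feature set used in the talk is not given, so these are plausible signals from the query, the table, and the TAPAS output (the resulting vector would then go to a gradient-boosted classifier such as LightGBM):

```python
def qt_features(query, table_cells, tapas_answer):
    """Build a feature vector for a 'simple vs hard' question-table classifier.

    `table_cells` is a list of dicts with a "text" key (the slide's table
    representation); `tapas_answer` follows the output schema.
    """
    answer_cells = tapas_answer.get("cells", [])
    query_tokens = set(query.lower().split())
    table_tokens = set()
    for cell in table_cells:
        table_tokens.update(cell["text"].lower().split())
    overlap = len(query_tokens & table_tokens) / max(1, len(query_tokens))
    return [
        len(query.split()),                              # query length
        len(table_cells),                                # table size
        len(answer_cells),                               # cells TAPAS selected
        1.0 if tapas_answer.get("operation") else 0.0,   # aggregation used?
        overlap,                                         # query/table lexical overlap
    ]

features = qt_features(
    "TruthfulQA highest % true",
    [{"text": "% true", "row_id": 0, "col_id": 3}],
    {"text": "79.92", "operation": None,
     "cells": [{"row_id": 15, "col_id": 3}]})
```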
27. Service schema
Approximate architecture – let’s cover it step by step:
client → query → search service (backed by a search index)
documents storage → custom document decomposition → tables → search index
query + candidate tables → TAPAS → QT classifier → answers → client
28. Service schema
Custom Document Decomposition: a custom engine (Tesseract-based + custom models) responsible for document decomposition:
- layout recognition (paragraph, title, section, etc.)
- table/figure detection
29. Service schema
Search: the search index plus a custom Go-based service that:
• manages search indices
• provides an API for other services (like our TQA) for collection management and their features
30. Service schema
TAPAS and the QT classifier are deployed as a REST service.
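A minimal sketch of wrapping the model behind a REST endpoint, using only the Python standard library (the actual deployment stack is not named in the talk; `answer_question` is a stand-in for TAPAS + QT-classifier inference):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer_question(query, table):
    """Stand-in for TAPAS + QT-classifier inference."""
    return {"text": "79.92", "operation": None,
            "cells": [{"row_id": 15, "col_id": 3}]}

class TqaHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/answer":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = answer_question(payload["query"], payload.get("table", []))
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep request logging quiet
        pass

def serve(port=8080):
    """Blocking entry point for the service."""
    HTTPServer(("127.0.0.1", port), TqaHandler).serve_forever()
```

In production you would typically put this behind a proper framework and batch requests to the model; the sketch only shows the request/response contract.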
33. Summary
TQA is not “solved” yet!
- models reach ~60% accuracy on open datasets
- zero-shot with open-source LLMs is not enough
Annotation for TQA takes a long time:
- you need a dedicated team (SMEs in a perfect world)
- for a small team it might take months!
- it is worth the wait: the increase in metrics can be up to 2x
Use both offline and online metrics:
- token- / cell-level F1
- measure impression
- modest accuracy might still be enough for the business