Companies continuously produce several documents containing valuable information for users. However, querying these documents is challenging, mainly because of the heterogeneity and volume of documents available. In this work, we investigate the challenge of developing a Big Data Question Answering system, i.e., a system that provides a unified, reliable, and accurate way to query documents through naturally asked questions. We define a set of design principles and introduce BigQA, the first software reference architecture to meet these design principles. The architecture consists of high-level layers and is independent of programming language, technology, querying and answering algorithms. BigQA was validated through a pharmaceutical case study managing over 18k documents from Wikipedia articles and FAQ about Coronavirus. The results demonstrated the applicability of BigQA to real-world applications. In addition, we conducted 27 experiments on three open-domain datasets and compared the recall results of the well-established BM25, TF-IDF, and Dense Passage Retriever algorithms to find the most appropriate generic querying algorithm. According to the experiments, BM25 provided the highest overall performance.
Design Principles and a Software Reference Architecture for Big Data Question Answering Systems (ICEIS 2023)
1. Design Principles and a Software Reference Architecture for Big Data Question Answering Systems
Leonardo Mauro Pereira Moraes (speaker), Pedro Calciolari Jardim, Cristina Dutra Aguiar
{leonardo.mauro, pedrocjardim}@usp.br, cdac@icmc.usp.br
Institute of Mathematics and Computer Sciences, University of São Paulo
ICEIS 2023, International Conference on Enterprise Information Systems
2. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Agenda
Introduction
- Motivation
- Problem statement
Proposal
- Design Principles
- BigQA Architecture
Experiments
- Real case study
- Algorithm evaluation
Conclusion
- Contributions
- Future work
4. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Introduction Motivation
● Data has been created at tremendous velocity, volume, and variety (Big Data).
● According to MIT¹, the majority of data (~80%) is unstructured information such as text and images.
● Companies produce several text documents containing valuable information for internal and external users.
○ e.g., Product reports, financial documents.
○ e.g., Documentation, product descriptions, FAQ.
However, how can we extract information from these documents?
Big Data Question Answering
¹ Tapping the power of unstructured data, by MIT
5. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Introduction Problem Statement
How to build a Big Data Question Answering system?
● [What] Design principles are requirements and design considerations that should be followed to create an application.
● [How] A reference architecture is a software template that structures the components and relations of a particular application.
7. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 1. Design Principles
[What] Design principles are requirements and design considerations.
● Considering business, data, and technical aspects.
Business (B)
Business principles refer to how the user interacts with the system.
B1. Get a proper answer.
B2. Access only allowed documents.
B3. Understand natural language.
B4. Summarize the answer.
8. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 1. Design Principles
[What] Design principles are requirements and design considerations.
● Considering business, data, and technical aspects.
Data (D)
Data principles refer to how data is manipulated within the system.
D1. Persist the raw data.
D2. Collect data from different sources.
D3. Support a variety of data formats.
D4. Support a huge volume of data.
D5. Support inserting and updating data.
9. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 1. Design Principles
[What] Design principles are requirements and design considerations.
● Considering business, data, and technical aspects.
Technical (T)
Technical principles refer to good practices to implement the system.
T1. Implement modular components.
T2. Implement flexible components.
T3. Store analytical data.
T4. Ensure security artifacts.
10. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
11. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
1. Input Layer
● Ingest documents into the system.
● Push documents to Data Lake.
● Connect to Big Data Storage.
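A minimal sketch of the Input Layer idea, assuming the Data Lake raw zone is just a local directory. The function name `ingest`, the directory layout, and the content-hash file naming are hypothetical choices for illustration, not part of BigQA.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def ingest(source: Path, raw_zone: Path) -> Path:
    """Copy one document into the raw zone unchanged and return its new path.

    Prefixing the file name with a content hash is an assumption of this
    sketch; it makes re-ingesting an identical file a no-op.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:12]
    raw_zone.mkdir(parents=True, exist_ok=True)
    target = raw_zone / f"{digest}_{source.name}"
    if not target.exists():
        shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "faq.txt"
        doc.write_text("What is the novel Coronavirus?", encoding="utf-8")
        stored = ingest(doc, Path(tmp) / "lake" / "raw")
        print(stored.name)
```

In a real deployment the raw zone would more likely be an object store or distributed file system; the point here is only that documents are pushed as-is, without transformation.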
12. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
2. Big Data Storage Layer
● Store the raw data and golden data.
● Keep and process the data.
● Connect to Input, Big Querying.
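One way to picture the step from raw data to golden data is a small cleaning function. This is a sketch under assumptions: the name `to_golden`, the `GoldenPassage` record, and the fixed-size word windows are hypothetical, and the slides do not prescribe how golden data is produced.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenPassage:
    doc_id: str    # identifier of the raw document the passage came from
    position: int  # order of the passage inside that document
    text: str      # cleaned passage text


def to_golden(doc_id: str, raw_text: str, max_words: int = 50) -> list[GoldenPassage]:
    """Normalize whitespace and split a raw document into fixed-size passages."""
    words = re.sub(r"\s+", " ", raw_text).strip().split(" ")
    if words == [""]:  # empty or whitespace-only document
        return []
    return [
        GoldenPassage(doc_id, i // max_words, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]
```

The raw document stays untouched in storage (principle D1); only the derived passages are handed to the querying side.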
13. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
3. Big Querying Layer
● Query engine responsible for processing user questions and generating answers: it retrieves documents, then generates answers.
● Connect to Communication, Big Data Storage.
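The two steps of the Big Querying Layer (retrieve documents, generate answers) can be sketched with toy components. Everything below is an illustrative assumption: token overlap stands in for a real retriever, and picking the best-matching sentence stands in for a real reader model.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Return ids of the k documents sharing the most tokens with the question."""
    q = tokens(question)
    ranked = sorted(docs, key=lambda d: len(q & tokens(docs[d])), reverse=True)
    return ranked[:k]


def read(question: str, text: str) -> str:
    """Return the sentence that overlaps most with the question (toy reader)."""
    q = tokens(question)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return max(sentences, key=lambda s: len(q & tokens(s)), default="")


def answer(question: str, docs: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    """Retrieve documents, then produce one (answer, document id) pair per document."""
    return [(read(question, docs[d]), d) for d in retrieve(question, docs, k)]
```

Returning the document id next to each answer mirrors the idea of providing a reference for every answer.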
14. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
4. Communication Layer
● Act as an interface for user questions.
● Connect to Big Querying.
15. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
5. Security Layer
● Address issues like network connection,
credentials, and data governance.
● Connect to all layers.
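As one hedged illustration of data governance (and of principle B2, access only allowed documents), documents can be filtered by role before they ever reach the retriever. The `Document` record, the role names, and the deny-by-default rule are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    # Roles allowed to read the document; empty means nobody (deny by default).
    allowed_roles: frozenset[str] = field(default_factory=frozenset)


def visible_documents(docs: list[Document], user_roles: set[str]) -> list[Document]:
    """Keep only documents that share at least one role with the user."""
    return [d for d in docs if d.allowed_roles & user_roles]
```

Filtering before retrieval, rather than after answer generation, keeps restricted text out of every downstream component.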
16. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
6. Insights Layer
● Comprise the data analysis.
● Receive data from multiple layers.
● Connect to all layers.
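A very small sketch of how other layers could report events to the Insights Layer (principle T3, store analytical data). The class `InsightsLog`, the layer names, and the in-memory list are hypothetical; a real system would persist these events in an analytical store.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InsightsLog:
    """In-memory stand-in for an analytical store fed by the other layers."""
    events: list[tuple[str, str]] = field(default_factory=list)

    def record(self, layer: str, event: str) -> None:
        """Store one (layer, event) pair reported by a layer."""
        self.events.append((layer, event))

    def events_per_layer(self) -> Counter:
        """Count how many events each layer has reported."""
        return Counter(layer for layer, _ in self.events)
```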
18. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Experiments 1. Real-world Case Study
Case Study
● How to deploy a system using the BigQA architecture.
● Real-world datasets:
○ SQuAD +18,800 docs;
○ COVID-QA +2,000 docs.
Instantiated System
● Elasticsearch as Document Store
● BM25 as Document Retriever
● RoBERTa as Document Reader
● Haystack as API
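To make the Document Retriever role concrete, here is a self-contained sketch of Okapi BM25 scoring in plain Python. It is not the Elasticsearch or Haystack implementation used in the case study; the tokenizer, the function names, and the non-negative IDF variant are assumptions, and k1 = 1.2, b = 0.75 are commonly used defaults.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercased alphanumeric tokens, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(toks) / avg_len) if avg_len else k1
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (the "1 +" inside the log); an assumption here.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

Sorting documents by these scores and keeping the top k gives the candidate set that a reader such as RoBERTa would then search for answer spans.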
19. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Experiments 1. Real-world Case Study
Query 1: What law regulates drug marketing in the pharmaceutical industry?
● Of interest to marketing and legal employees who need to know about regulatory laws.
Answer | Document | Score
Prescription Drug Marketing Act of 1987 | Pharmaceutical industry | 76.34%
Food and Drug Administration (FDA) | Pharmaceutical industry | 19.77%
Torture Regulation | Capital punishment in the United States | 11.01%
Table: Results of Query 1.
Query 2: When was the Luria-Delbrück experiment carried out?
● Of interest to microbiologists seeking information about an experiment.
Answer | Document | Score
1943 | Antibiotics | 29.89%
14 | Arnold Schwarzenegger | 6.84%
14 | Arnold Schwarzenegger | 3.06%
Table: Results of Query 2.
20. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Experiments 1. Real-world Case Study
Query 3: What is the novel Coronavirus?
● Of interest to any pharmaceutical employee and the general public.
Answer | Document | Score
SARS-CoV-2 | COVID-QA | 87.70%
Prevention for 2019 | COVID-QA | 76.78%
SARS-CoV-2 | COVID-QA | 71.66%
Table: Results of Query 3.
Discussion
● Distinct types of queries analyzing different business scenarios.
○ Drug marketing;
○ Date question;
○ Augmented case.
21. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Experiments 2. Algorithm Evaluation
Document Retriever
● Conduct 27 experiments (3 algorithms × 3 datasets × 3 values of k).
● Investigate the recall score.
Setup
● Algorithms: BM25, TF-IDF, Dense Passage Retriever (DPR).
● Open-domain datasets: SQuAD, AdversarialQA, DuoRC.
○ +25k question-document pairs.
● Vary the value of k in [3, 10, 20].
Dataset | K | BM25 | TF-IDF | DPR
SQuAD | 3 | 88.74% | 81.12% | 69.21%
SQuAD | 10 | 94.43% | 92.01% | 85.72%
SQuAD | 20 | 96.29% | 95.83% | 91.38%
AdversarialQA | 3 | 71.89% | 69.95% | 69.85%
AdversarialQA | 10 | 81.35% | 81.81% | 89.17%
AdversarialQA | 20 | 84.89% | 85.56% | 99.43%
DuoRC | 3 | 86.05% | 77.37% | 23.29%
DuoRC | 10 | 91.49% | 87.47% | 35.83%
DuoRC | 20 | 93.76% | 90.80% | 44.78%
Table: Recall scores of the investigated Document Retriever algorithms.
Results
● Best overall performance: BM25.
● On complex datasets, DPR seems better (on AdversarialQA it has the highest recall at K = 10 and K = 20).
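The recall score of a retriever can be computed as sketched below, assuming recall is measured as the share of questions whose gold document appears among the top-k retrieved documents. The function name `recall_at_k` and the input layout are hypothetical; the authors' evaluation code lives in their repository and may differ.

```python
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Share of questions whose gold document is among the top-k retrieved ids.

    ranked_ids[i] is the ranked list of document ids returned for question i,
    and gold_ids[i] is the id of the document that contains its answer.
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("one ranking is required per question")
    if not gold_ids:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)
```

Running this for each algorithm, dataset, and value of k yields one cell of a table like the one above.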
23. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Conclusion
Proposal - [What] Design Principles related to
● Business and user interaction;
● Data manipulation; and
● Technical implementation.
Proposal - [How] BigQA, the first Big Data Question Answering architecture.
● Deal with real-world and dynamic applications;
● Provide references for each answer;
● Address security and analytical needs; and
● Incorporate business rules.
Experiments
● A real-world case study.
● 27 experiments on open-domain datasets.
○ All code is available on GitHub².
Future work
● Experiments exploring the Document Reader.
● Investigate the Insights and Security layers in terms of the technologies and algorithms available.
● Instantiate new case studies.
² https://github.com/leomaurodesenv/big-qa-architecture
24. Thank you
for your attention ✌
Design Principles and a Software
Reference Architecture for Big Data
Question Answering Systems
25. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Background Private Technologies
Few private technologies
● IBM, Watson Discovery
● Amazon, Kendra
● Sinch, AskFrank
● OpenAI, ChatGPT
ChatGPT
● Very close to human intelligence.
● Faces issues common to large NLP models:
○ usually trained with static data;
○ hallucinated and misleading answers;
○ unclear how to incorporate business rules.
Our proposal
● deal with real-world and dynamic applications;
● provide references for each answer;
● can incorporate business rules.
26. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Background Related Work
Studies | Reference Architecture | Big Data | Security Artefacts | Design Principles | Question Answering | Open Domain | Case Study
(Zhu et al., 2019) ✔ ✔ ✔ ✔
(Li et al., 2020) ✔ ✔ ✔
(Ataei and Litchfield, 2021) ✔ ✔
(Cassavia and Masciari, 2021) ✔ ✔ ✔
(Yousfi et al., 2021) ✔ ✔ ✔
(Sucunuta and Riofrio, 2010) ✔ ✔ ✔ ✔
(Nielsen et al., 2010) ✔ ✔ ✔ ✔
(Zhang et al., 2013) ✔ ✔ ✔
(Novo-Loures et al., 2020) ✔ ✔ ✔
(Karpukhin et al., 2020) ✔ ✔ ✔
BigQA (our proposal) ✔ ✔ ✔ ✔ ✔ ✔ ✔
Legend: The ✔ symbol indicates the challenges addressed by each study.
Big Data Architecture
Question Answering Architecture
Table: Characteristics of the related work and proposed BigQA.