Institute of Mathematics and Computer Sciences
University of São Paulo
ICEIS 2023
International Conference on Enterprise Information Systems
Design Principles and a Software Reference Architecture
for Big Data Question Answering Systems
{leonardo.mauro, pedrocjardim}@usp.br, cdac@icmc.usp.br
Leonardo Mauro Pereira Moraes (speaker)
Pedro Calciolari Jardim, Cristina Dutra Aguiar
Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture
Agenda
Introduction
- Motivation
- Problem statement
Proposal
- Design Principles
- BigQA Architecture
Experiments
- Real case study
- Algorithm evaluation
Conclusion
- Contributions
- Future work
Introduction Motivation
● Data has been created at tremendous velocity, volume, and variety (Big Data).
● According to MIT¹, the majority of data (~80%) is unstructured information such as text and images.
Big Data Question Answering
● Companies produce many text documents containing valuable information for internal and external users.
○ e.g., product reports, financial documents.
○ e.g., documentation, product descriptions, FAQs.
However, how can we extract information from these documents?
¹ Tapping the power of unstructured data, by MIT
Introduction Problem Statement
How to build a Big Data Question Answering system?
● [What] Design principles are requirements and design considerations that should be followed to create an application.
● [How] A reference architecture is a software template that structures the components and relations for a particular application.
Proposal 1. Design Principles
[What] Design principles are requirements and design considerations.
● Considering business, data, and technical aspects.
Business (B)
Business principles refer to how the user interacts with the system.
B1. Get a proper answer.
B2. Access only allowed documents.
B3. Understand natural language.
B4. Summarize the answer.
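As an illustration of principle B2, here is a minimal sketch of filtering documents by user permissions before any retrieval happens. The `Document` class, its `allowed_groups` field, and `filter_allowed` are hypothetical names introduced for this example; the talk does not prescribe a specific access-control model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    """Hypothetical document record; the access-control field is an assumption."""
    doc_id: str
    text: str
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_allowed(documents, user_groups):
    """Keep only documents that share at least one group with the user (B2)."""
    user_groups = set(user_groups)
    return [doc for doc in documents if doc.allowed_groups & user_groups]


docs = [
    Document("d1", "Quarterly financial report.", frozenset({"finance"})),
    Document("d2", "Public product FAQ.", frozenset({"finance", "public"})),
]
visible = filter_allowed(docs, {"public"})
print([doc.doc_id for doc in visible])
```

Filtering before retrieval, rather than after answer generation, is one possible design choice: it keeps restricted text out of the query pipeline entirely.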
Data (D)
Data principles refer to how data is manipulated in the system.
D1. Persist the raw data.
D2. Collect data from different sources.
D3. Support a variety of data formats.
D4. Support a huge volume of data.
D5. Support inserting and updating data.
Technical (T)
Technical principles refer to good practices for implementing the system.
T1. Implement modular components.
T2. Implement flexible components.
T3. Store analytical data.
T4. Ensure security artifacts.
Proposal 2. BigQA Architecture
[How] A reference architecture is a software template.
1. Input Layer
● Ingest documents into the system.
● Push documents to Data Lake.
● Connect to Big Data Storage.
2. Big Data Storage Layer
● Store the raw data and golden data.
● Keep and process the data.
● Connect to the Input and Big Querying layers.
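The split between raw data and golden data can be sketched as follows. This is a minimal in-memory illustration, not the storage technology used in the talk: `DataLake`, `ingest`, and `refine` are hypothetical names, and the cleaning step (whitespace normalization) is only an assumption about what producing golden data might involve.

```python
import re


class DataLake:
    """Hypothetical in-memory store with a raw zone and a golden zone."""

    def __init__(self):
        self.raw = {}      # doc_id -> original text, kept untouched (D1)
        self.golden = {}   # doc_id -> cleaned text ready for querying

    def ingest(self, doc_id, text):
        """Insert or update a document (D5); the raw copy is always persisted."""
        self.raw[doc_id] = text
        self.golden[doc_id] = self.refine(text)

    @staticmethod
    def refine(text):
        """Assumed cleaning step: collapse whitespace and trim the ends."""
        return re.sub(r"\s+", " ", text).strip()


lake = DataLake()
lake.ingest("faq-1", "  What is   BigQA?\n\nA reference architecture. ")
print(lake.golden["faq-1"])
```

Keeping the raw copy alongside the golden one means the cleaning step can be changed later and re-run without collecting the documents again.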
3. Big Querying Layer
● Query engine responsible for processing user questions and generating answers.
○ Retrieve documents.
○ Generate answers.
● Connect to the Communication and Big Data Storage layers.
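The two steps of this layer (retrieve documents, then generate answers) can be sketched with swappable components. Everything below is a toy illustration under stated assumptions: the `Retriever` and `Reader` interfaces, `OverlapRetriever`, `FirstSentenceReader`, and `ask` are hypothetical names, and the word-overlap ranking stands in for a real retrieval algorithm.

```python
from typing import Dict, List, Protocol, Tuple


class Retriever(Protocol):
    def retrieve(self, question: str, top_k: int) -> List[str]:
        """Return the ids of the top_k most relevant documents."""
        ...


class Reader(Protocol):
    def read(self, question: str, text: str) -> str:
        """Extract an answer to the question from one document."""
        ...


class OverlapRetriever:
    """Toy retriever: ranks documents by words shared with the question."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self.documents = documents

    def retrieve(self, question: str, top_k: int) -> List[str]:
        terms = set(question.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc_id: len(terms & set(self.documents[doc_id].lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


class FirstSentenceReader:
    """Toy reader: returns the first sentence of the document as the answer."""

    def read(self, question: str, text: str) -> str:
        return text.split(".")[0].strip()


def ask(question: str, documents: Dict[str, str], retriever: Retriever,
        reader: Reader, top_k: int = 2) -> List[Tuple[str, str]]:
    """Retrieve documents, then produce one (answer, source id) pair per document."""
    doc_ids = retriever.retrieve(question, top_k)
    return [(reader.read(question, documents[doc_id]), doc_id) for doc_id in doc_ids]


corpus = {
    "pharma": "The Prescription Drug Marketing Act regulates drug marketing. It dates from 1987.",
    "sports": "Football is played worldwide. Many leagues exist.",
}
results = ask("What law regulates drug marketing?", corpus,
              OverlapRetriever(corpus), FirstSentenceReader())
print(results[0])
```

Because `ask` depends only on the two interfaces, either component can be replaced without touching the other, and each answer carries the id of the document it came from.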
4. Communication Layer
● Act as an interface for user questions.
● Connect to the Big Querying layer.
5. Security Layer
● Address issues like network connection,
credentials, and data governance.
● Connect to all layers.
6. Insights Layer
● Comprise the data analysis.
● Receive data from multiple layers.
● Connect to all layers.
Experiments 1. Real-world Case Study
Case Study
● How to deploy a system using the BigQA architecture.
● Real-world datasets:
○ SQuAD, more than 18,800 docs;
○ COVID-QA, more than 2,000 docs.
Instantiated System
● Elasticsearch as Document Store
● BM25 as Document Retriever
● RoBERTa as Document Reader
● Haystack as API
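To give a feel for what a BM25 retriever computes, here is a small pure-Python sketch of the Okapi BM25 score with a non-negative IDF variant. It is not the Elasticsearch or Haystack implementation used in the case study: `bm25_scores` is a hypothetical helper, tokenization is plain whitespace splitting, and `k1=1.2`, `b=0.75` are commonly used defaults rather than values reported in the talk.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than terms found in many documents.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length with b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "drug marketing is regulated by law",
    "the experiment was carried out in a laboratory",
    "coronavirus prevention and drug research",
]
scores = bm25_scores("drug marketing law", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])
```

The document retriever would return the highest-scoring documents, which the document reader then scans for an answer span.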
Query 1: What law regulates drug marketing in the pharmaceutical industry?
● Of interest to marketing and legal employees who need to know about regulatory laws.
Answer | Document | Score
Prescription Drug Marketing Act of 1987 | Pharmaceutical industry | 76.34%
Food and Drug Administration (FDA) | Pharmaceutical industry | 19.77%
Torture Regulation | Capital punishment in the United States | 11.01%
Table: Results of Query 1.
Query 2: When was the Luria-Delbrück experiment carried out?
● Of interest to microbiologists seeking information about an experiment.
Answer | Document | Score
1943 | Antibiotics | 29.89%
14 | Arnold Schwarzenegger | 6.84%
14 | Arnold Schwarzenegger | 3.06%
Table: Results of Query 2.
Query 3: What is the novel Coronavirus?
● Of interest to any pharmaceutical employee and the general public.
Answer | Document | Score
SARS-CoV-2 | COVID-QA | 87.70%
Prevention for 2019 | COVID-QA | 76.78%
SARS-CoV-2 | COVID-QA | 71.66%
Table: Results of Query 3.
Discussion
● Distinct types of queries analyzing different business scenarios:
○ Drug marketing;
○ Date question;
○ Augmented case.
Experiments 2. Algorithm Evaluation
Document Retriever
● Conduct 27 experiments.
● Investigate the recall score.
Setup
● Algorithms: BM25, TF-IDF, Dense Passage Retriever (DPR).
● Open-domain datasets: SQuAD, AdversarialQA, DuoRC.
○ More than 25k question-document pairs.
● Vary the value of k in [3, 10, 20].
Dataset | Algorithm | K = 3 | K = 10 | K = 20
SQuAD | BM25 | 88.74% | 94.43% | 96.29%
SQuAD | TF-IDF | 81.12% | 92.01% | 95.83%
SQuAD | DPR | 69.21% | 85.72% | 91.38%
AdversarialQA | BM25 | 71.89% | 81.35% | 84.89%
AdversarialQA | TF-IDF | 69.95% | 81.81% | 85.56%
AdversarialQA | DPR | 69.85% | 89.17% | 99.43%
DuoRC | BM25 | 86.05% | 91.49% | 93.76%
DuoRC | TF-IDF | 77.37% | 87.47% | 90.80%
DuoRC | DPR | 23.29% | 35.83% | 44.78%
Table: Recall scores of the investigated Document Retriever algorithms.
Results
● Best overall performance: BM25.
● On complex datasets, DPR seems better.
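Retriever recall at k can be read as the fraction of questions whose correct document appears among the top k retrieved documents. The sketch below assumes that definition, since the slides do not spell out the exact metric; `recall_at_k` and the toy rankings are hypothetical and unrelated to the reported numbers.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of questions whose gold document id is in the top-k results.

    retrieved: list of ranked document-id lists, one per question.
    gold: list with the correct document id for each question.
    """
    if not gold:
        return 0.0
    hits = sum(1 for ranked, answer in zip(retrieved, gold) if answer in ranked[:k])
    return hits / len(gold)


ranked_lists = [
    ["d3", "d1", "d7"],   # gold d1 is at rank 2
    ["d2", "d9", "d4"],   # gold d4 is at rank 3
    ["d5", "d6", "d8"],   # gold d0 is not retrieved
    ["d1", "d2", "d3"],   # gold d1 is at rank 1
]
gold_ids = ["d1", "d4", "d0", "d1"]

for k in (1, 2, 3):
    print(k, recall_at_k(ranked_lists, gold_ids, k))
```

Recall can only stay equal or grow as k increases, which matches the trend across the K = 3, 10, and 20 columns of the table.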
Conclusion
Proposal - [What] Design Principles related to
● Business and user interaction;
● Data manipulation; and
● Technical implementation.
Proposal - [How] BigQA, the first Big Data Question Answering architecture.
● Deals with real-world and dynamic applications;
● Provides references for each answer;
● Addresses security and analytical needs; and
● Can incorporate business rules.
Experiments
● A real-world case study.
● 27 experiments on open-domain datasets.
○ All code is available on GitHub².
Future work
● Experiments exploring the Document Reader.
● Investigate the Insights and Security layers in terms of available technologies and algorithms.
● Instantiate new case studies.
² https://github.com/leomaurodesenv/big-qa-architecture
Thank you
for your attention ✌
Background Private Technologies
Few private technologies
● IBM, Watson Discovery
● Amazon, Kendra
● Sinch, AskFrank
● OpenAI, ChatGPT
ChatGPT
● Very close to human intelligence.
● Faces common large model NLP issues:
○ usually trained with static data;
○ hallucinations and misleading answers;
○ unclear how to incorporate business rules.
Our proposal
● deal with real-world and dynamic applications;
● provide references for each answer;
● can incorporate business rules.
Background Related Work
Studies | Reference Architecture | Big Data | Security Artefacts | Design Principles | Question Answering | Open Domain | Case Study
(Zhu et al., 2019) ✔ ✔ ✔ ✔
(Li et al., 2020) ✔ ✔ ✔
(Ataei and Litchfield, 2021) ✔ ✔
(Cassavia and Masciari, 2021) ✔ ✔ ✔
(Yousfi et al., 2021) ✔ ✔ ✔
(Sucunuta and Riofrio, 2010) ✔ ✔ ✔ ✔
(Nielsen et al., 2010) ✔ ✔ ✔ ✔
(Zhang et al., 2013) ✔ ✔ ✔
(Novo-Loures et al., 2020) ✔ ✔ ✔
(Karpukhin et al., 2020) ✔ ✔ ✔
BigQA (our proposal) ✔ ✔ ✔ ✔ ✔ ✔ ✔
Legend: The ✔ symbol indicates the challenges addressed by each study.
Row groups: Big Data Architecture; Question Answering Architecture.
Table: Characteristics of the related work and proposed BigQA.

Design Principles and a Software Reference Architecture for Big Data Question Answering Systems (ICEIS 2023)

  • 1. Institute of Mathematics and Computer Sciences University of São Paulo ICEIS 2023 International Conference on Enterprise Information Systems Design Principles and a Software Reference Architecture for Big Data Question Answering Systems {leonardo.mauro, pedrocjardim}@usp.br, cdac@icmc.usp.br Leonardo Mauro Pereira Moraes (speaker) Pedro Calciolari Jardim, Cristina Dutra Aguiar
  • 2. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture Agenda Introduction - Motivation - Problem statement Proposal - Design Principles - BigQA Architecture Experiments - Real case study - Algorithm evaluation Conclusion - Contributions - Future work
  • 3. Leonardo Mauro Pereira Moraes Design Principles and a Big Data Question Answering Architecture Agenda Introduction - Motivation - Problem statement Proposal - Design Principles - BigQA Architecture Experiments - Real case study - Algorithm evaluation Conclusion - Contributions - Future work
  • 4. Introduction: Motivation
      ● Data has been created at tremendous velocity, volume and variety (Big Data).
      ● According to MIT¹, the majority of data (~80%) is unstructured information such as text and images.
      ● Companies produce many text documents containing valuable information for internal and external users.
        ○ e.g., product reports, financial documents.
        ○ e.g., documentation, product descriptions, FAQs.
      However, how can we extract information from these documents? Big Data Question Answering.
      ¹ Tapping the power of unstructured data, by MIT
  • 5. Introduction: Problem Statement
      How to build a Big Data Question Answering system?
      ● [What] Design principles are requirements and design considerations that should be followed to create an application.
      ● [How] A reference architecture is a software template that structures the components and relations for a particular application.
  • 7. Proposal: 1. Design Principles
      [What] Design principles are requirements and design considerations.
      ● Considering business, data, and technical aspects.
      Business (B): business principles refer to how the user interacts with the system.
      ● B1. Get a proper answer.
      ● B2. Access only allowed documents.
      ● B3. Understand natural language.
      ● B4. Summarize the answer.
  • 8. Proposal: 1. Design Principles
      Data (D): data principles refer to how data is manipulated within the system.
      ● D1. Persist the raw data.
      ● D2. Collect data from different sources.
      ● D3. Support a variety of data formats.
      ● D4. Support a huge volume of data.
      ● D5. Support inserting and updating data.
  • 9. Proposal: 1. Design Principles
      Technical (T): technical principles refer to good practices for implementing the system.
      ● T1. Implement modular components.
      ● T2. Implement flexible components.
      ● T3. Store analytical data.
      ● T4. Ensure security artifacts.
  • 10. Proposal: 2. BigQA Architecture
      [How] A reference architecture is a software template.
  • 11. Proposal: 2. BigQA Architecture
      1. Input Layer
      ● Ingest documents into the system.
      ● Push documents to the Data Lake.
      ● Connect to Big Data Storage.
  • 12. Proposal: 2. BigQA Architecture
      2. Big Data Storage Layer
      ● Store the raw data and golden data.
      ● Keep and process the data.
      ● Connect to Input and Big Querying.
  • 13. Proposal: 2. BigQA Architecture
      3. Big Querying Layer
      ● Query engine responsible for processing user questions and generating answers.
      ● Retrieve documents; generate answers.
      ● Connect to Communication and Big Data Storage.
  • 14. Proposal: 2. BigQA Architecture
      4. Communication Layer
      ● Act as an interface for user questions.
      ● Connect to Big Querying.
  • 15. Proposal: 2. BigQA Architecture
      5. Security Layer
      ● Address issues like network connection, credentials, and data governance.
      ● Connect to all layers.
  • 16. Proposal: 2. BigQA Architecture
      6. Insights Layer
      ● Comprise the data analysis.
      ● Receive data from multiple layers.
      ● Connect to all layers.
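The way the layers hand work to each other can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not the BigQA reference implementation: the class names, the `permissions` dictionary (a stand-in for the Security layer), and the word-overlap "query engine" are all hypothetical, and the Insights layer is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class BigDataStorage:
    """Big Data Storage layer: keeps raw data and processed ("golden") data."""
    raw: dict = field(default_factory=dict)
    golden: dict = field(default_factory=dict)

    def persist(self, doc_id, text):
        self.raw[doc_id] = text                       # raw data is kept untouched
        self.golden[doc_id] = " ".join(text.split())  # toy processing: normalize whitespace


class InputLayer:
    """Input layer: ingests documents and pushes them to storage."""

    def __init__(self, storage):
        self.storage = storage

    def ingest(self, doc_id, text):
        self.storage.persist(doc_id, text)


class BigQueryingLayer:
    """Big Querying layer: retrieves a document and returns it as the answer."""

    def __init__(self, storage):
        self.storage = storage

    def answer(self, question, allowed_ids):
        terms = set(question.lower().split())
        best_id, best_overlap = None, 0
        for doc_id in sorted(allowed_ids):
            text = self.storage.golden.get(doc_id)
            if text is None:
                continue
            # Hypothetical scoring: number of question words found in the document.
            overlap = len(terms & set(text.lower().split()))
            if overlap > best_overlap:
                best_id, best_overlap = doc_id, overlap
        if best_id is None:
            return None
        # Return the document id with the text, so the answer carries its reference.
        return best_id, self.storage.golden[best_id]


class CommunicationLayer:
    """Communication layer: interface for user questions."""

    def __init__(self, querying, permissions):
        self.querying = querying
        self.permissions = permissions  # hypothetical stand-in for the Security layer

    def ask(self, user, question):
        allowed = self.permissions.get(user, set())
        return self.querying.answer(question, allowed)
```

A caller would ingest documents through `InputLayer` and ask questions through `CommunicationLayer`; a user without permission for a document never receives it, which mirrors principle B2.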
  • 18. Experiments: 1. Real-world Case Study
      Case Study
      ● How to deploy a system using the BigQA architecture.
      ● Real-world datasets:
        ○ SQuAD, 18,800+ docs;
        ○ COVID-QA, 2,000+ docs.
      Instantiated System
      ● Elasticsearch as Document Store
      ● BM25 as Document Retriever
      ● RoBERTa as Document Reader
      ● Haystack as API
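To make the Document Retriever role concrete, here is a minimal pure-Python sketch of BM25 ranking. It is not the Elasticsearch or Haystack implementation used in the case study: the whitespace tokenizer, the function names, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs, k=3):
    """Return the indices of the top-k documents by BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In the instantiated system, the retrieved documents would then be passed to the Document Reader (RoBERTa), which extracts the answer span; that step is not sketched here.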
  • 19. Experiments: 1. Real-world Case Study
      Query 1: What law regulates drug marketing in the pharmaceutical industry?
      ● Of interest to marketing and legal employees who need to know about regulatory laws.
      Table: Results of Query 1.
        Answer | Document | Score
        Prescription Drug Marketing Act of 1987 | Pharmaceutical industry | 76.34%
        Food and Drug Administration (FDA) | Pharmaceutical industry | 19.77%
        Torture Regulation | Capital punishment in the United States | 11.01%
      Query 2: When was the Luria-Delbrück experiment carried out?
      ● Of interest to microbiologists who want information about an experiment.
      Table: Results of Query 2.
        Answer | Document | Score
        1943 | Antibiotics | 29.89%
        14 | Arnold Schwarzenegger | 6.84%
        14 | Arnold Schwarzenegger | 3.06%
  • 20. Experiments: 1. Real-world Case Study
      Query 3: What is the novel Coronavirus?
      ● Of interest to any pharmaceutical employee and the general public.
      Table: Results of Query 3.
        Answer | Document | Score
        SARS-CoV-2 | COVID-QA | 87.70%
        Prevention for 2019 | COVID-QA | 76.78%
        SARS-CoV-2 | COVID-QA | 71.66%
      Discussion
      ● Distinct types of queries analyzing different business scenarios:
        ○ Drug marketing;
        ○ Date question;
        ○ Augmented case.
  • 21. Experiments: 2. Algorithm Evaluation
      Document Retriever
      ● Conduct 27 experiments.
      ● Investigate the recall score.
      Setup
      ● Algorithms: BM25, TF-IDF, Dense Passage Retriever (DPR).
      ● Open-domain datasets: SQuAD, AdversarialQA, DuoRC.
        ○ 25k+ question-document pairs.
      ● Vary the value of k in [3, 10, 20].
      Table: Recall scores of the investigated Document Retriever algorithms.
        Dataset | Algorithm | K = 3 | K = 10 | K = 20
        SQuAD | BM25 | 88.74% | 94.43% | 96.29%
        SQuAD | TF-IDF | 81.12% | 92.01% | 95.83%
        SQuAD | DPR | 69.21% | 85.72% | 91.38%
        AdversarialQA | BM25 | 71.89% | 81.35% | 84.89%
        AdversarialQA | TF-IDF | 69.95% | 81.81% | 85.56%
        AdversarialQA | DPR | 69.85% | 89.17% | 99.43%
        DuoRC | BM25 | 86.05% | 91.49% | 93.76%
        DuoRC | TF-IDF | 77.37% | 87.47% | 90.80%
        DuoRC | DPR | 23.29% | 35.83% | 44.78%
      Results
      ● Best overall performance: BM25.
      ● On complex datasets, DPR seems better (e.g., AdversarialQA at K = 10 and K = 20).
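The recall score in this evaluation can be sketched as follows, assuming the usual retriever definition: the fraction of questions whose gold document appears among the top-k retrieved documents. The function names and data layout are hypothetical and are not taken from the authors' code; the grid helper only shows how 3 algorithms, 3 datasets, and 3 values of k yield 27 experiments.

```python
from itertools import product


def recall_at_k(retrieved, relevant, k):
    """Fraction of questions whose gold document is in the top-k results.

    retrieved: one ranked list of document ids per question.
    relevant:  one gold document id per question.
    """
    if not relevant:
        return 0.0
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold in ranked[:k])
    return hits / len(relevant)


def experiment_grid(algorithms, datasets, ks):
    """Enumerate every (algorithm, dataset, k) combination to evaluate."""
    return list(product(algorithms, datasets, ks))
```

For example, `experiment_grid(["BM25", "TF-IDF", "DPR"], ["SQuAD", "AdversarialQA", "DuoRC"], [3, 10, 20])` enumerates the 27 configurations, and each one would report a `recall_at_k` value.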
  • 23. Conclusion
      Proposal - [What] Design Principles related to
      ● Business and user interaction;
      ● Data manipulation; and
      ● Technical implementation.
      Proposal - [How] BigQA, the first Big Data Question Answering architecture.
      ● Deals with real-world and dynamic applications;
      ● Provides references for each answer;
      ● Addresses security and analytical needs; and
      ● Can incorporate business rules.
      Experiments
      ● A real-world case study.
      ● 27 experiments on open-domain datasets.
        ○ All code is available on GitHub².
      Future work
      ● Experiments exploring the Document Reader.
      ● Investigate the Insights and Security layers in terms of the technologies and algorithms available.
      ● Instantiate new case studies.
      ² https://github.com/leomaurodesenv/big-qa-architecture
  • 24. Thank you for your attention ✌ Design Principles and a Software Reference Architecture for Big Data Question Answering Systems
  • 25. Background: Private Technologies
      Few private technologies
      ● IBM, Watson Discovery
      ● Amazon, Kendra
      ● Sinch, AskFrank
      ● OpenAI, ChatGPT
      ChatGPT
      ● Very close to human intelligence.
      ● Faces common NLP issues of large models:
        ○ usually trained with static data;
        ○ hallucination and misleading answers;
        ○ not clear how to incorporate business rules.
      Our proposal
      ● deals with real-world and dynamic applications;
      ● provides references for each answer;
      ● can incorporate business rules.
  • 26. Background: Related Work
      Table: Characteristics of the related work and proposed BigQA.
      Characteristics (columns): Reference Architecture; Big Data; Security Artefacts; Design Principles; Question Answering; Open Domain; Case Study.
      Study groups: Big Data Architecture; Question Answering Architecture.
      Studies:
        (Zhu et al., 2019) ✔ ✔ ✔ ✔
        (Li et al., 2020) ✔ ✔ ✔
        (Ataei and Litchfield, 2021) ✔ ✔
        (Cassavia and Masciari, 2021) ✔ ✔ ✔
        (Yousfi et al., 2021) ✔ ✔ ✔
        (Sucunuta and Riofrio, 2010) ✔ ✔ ✔ ✔
        (Nielsen et al., 2010) ✔ ✔ ✔ ✔
        (Zhang et al., 2013) ✔ ✔ ✔
        (Novo-Loures et al., 2020) ✔ ✔ ✔
        (Karpukhin et al., 2020) ✔ ✔ ✔
        BigQA (our proposal) ✔ ✔ ✔ ✔ ✔ ✔ ✔
      Legend: The ✔ symbol indicates the challenges addressed by each study.