Information science research with large
language models: between science and fiction
Fabiano Dalpiaz
Requirements Engineering Lab
Utrecht University, the Netherlands
May 15, 2024
f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz
1. Large Language Models
@2024 Fabiano Dalpiaz
2
ChatGPT, depicted by ChatGPT 4.0 + DALL-E
Large Language Models (LLMs) in the news
Various viewpoints on LLMs
LLMs in information science
LLMs in information science research
⚠ LLM use disclaimers?
• “drafted by ChatGPT – rephrased by Quillbot –
images by MidJourney – prompts in Appendix A”?
⚠ Legal and ethical implications
⚠ Quoting ≠ paraphrasing
What’s ahead?
👉 Dedicated conference tracks about LLMs
👉 Exciting avenues for research!
LLMs in Software Engineering research
A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J.M. Zhang. "Large Language Models for Software Engineering: Survey and Open Problems." arXiv:2310.03533, 2023
ICSE’24 main track
How are YOU using LLMs in YOUR research?
Key Message 1: Accept the Evolution
Can assist us in
science fiction tasks
Large
Language
Models
are here
• As citizens
• As researchers
• As educators
They are
and will be
changing
our lives
2. Credibility in (information) science research
IS research in the small – simplified illustration
Research idea Conceptual framework Artifact construction Validation / evaluation
Paper writing
Peer review
Publication
Literature
Credibility in information science research
Interesting, this seems a
breakthrough. But…
how can I trust what the
authors claim?
PhD student Elize
How do YOU assess the credibility of a paper?
Threats to credibility – the idea
That idea is
wrong in the
first place!
Jim, the reviewer
Invalid criticism in science!
Threats to credibility – the conceptual framework
It builds on a
rejected theory
It proposes a
theory that hasn’t
been tested yet
Jim, the reviewer
Threats to credibility – the constructed artifact
Simplistic, partially
implemented
It conflicts with
the conceptual
framework
Jim, the reviewer
Threats to credibility – validation / evaluation
• The evaluation is too small
• Mislabeled: is it a case study / experiment?
• The experimental design is flawed
• Too few subjects
• The research questions are not clear
• The metrics do not match with the RQs
• Missing threats to validity
• Wrong statistical tests
• Ethical approval missing
• The source code is not available
• No replication package
• Won’t generalize
• Too small improvement over SotA
• …
Jim, the reviewer
Threats to credibility – the written paper
This claim is
factually wrong
The sentence is
ambiguous
Jim, the reviewer
Threats to credibility – peer reviewing / publication
Renowned
authors =
good?
Jim, the reviewer
Threats to credibility – peer reviewing / publication
Prestigious
venue = good?
Never heard of
this journal = bad?
Jim, the reader
Threats to credibility – literature
We propose tool Z that can be used to
classify requirements automatically,
distinguishing functional from quality
requirements.
[…]
Dalpiaz et al. [22] showed that their ML-
based approach has accuracy of 95%.
[…]
The performance of Z is superior to
that of Dalpiaz et al. [22].
I can’t find a
link to tool Z…
On which
dataset was the
95% accuracy
obtained?
What does it
mean for Z to
be superior?
Jim, the reader
Credibility in research: research methods
Credibility in research: open science badges
Artifacts evaluated - functional
“Work as intended”
https://www.acm.org/publications/policies/artifact-review-and-badging-current
Artifacts evaluated - reusable
Functional + very carefully
documented + well structured
Artifacts available
Publicly accessible in an archival
repository (with DOI)
Results reproduced
Another team obtained the same
results with the artifacts provided
by the original authors
Results replicated
Another team obtained the same
results without the author-supplied
artifacts
Problem solved? How about LLMs being USED in the
research cycle?
LLMs are already being used! (a few examples)
Literature review generator: jenni.ai Originality checker: originality.ai
Writing assistant: quillbot.com
The one-size-fits-all ChatGPT
Code generation: copilot
Will the use of LLMs affect research CREDIBILITY?
Key Message 2: Responsibility as Information Scientists
• Can be used for
many tasks
• We are using them!
LLMs in IS
Research
• Deliver research that can
be trusted
• Discern credible results
What is
up to us?
3. Deep dive on NLP tools in
Requirements Engineering (NLP4RE)
Background theory: Refinement in RE
K. Pohl. "The three dimensions of requirements engineering: a framework and its applications." Information Systems 19.3 (1994): 243-258.
Specification
Representation
opaque
fair
complete
common view
informal semi-formal formal
personal view
Initial RE
input
Desired RE output
Agreement
Refinement
path in
practice
RE research,
including NLP4RE Tools
How do NLP4RE tools work?
Processing text is
particularly suitable
for LLMs!!
Four categories of NLP4RE tools
1. Find defects /
deviations from
good practice
2. Generate models
from NL reqs
3. Infer trace links
between NL reqs
and other artifacts
4. Identify key
abstractions
from NL
documents
D.M. Berry, R. Gacitua, P. Sawyer, and S.F. Tjong. "The case for dumb requirements engineering tools." In Proceedings of REFSQ, pp. 211-217. 2012.
Tools in NLP4RE (2021-2022, before LLMs)
L. Zhao, W. Alhoshan, A. Ferrari, K.J. Letsholo, M.A.A., E.-V. Chioasca, and R.T. Batista-Navarro. Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study.
ACM Computing Surveys 54:3, 2022
Case: F/Q Requirements Classification
• Seminal classification problem that aims at identifying NFRs (or Qualities)
• Two classes: Functional and Quality
• Dozens of tools in the literature
• Keyword-based, ML & DL classifiers, zero- and few-shot learning…
Automated classification via ML
Item Labels
Req 1 F
Req 2 F
Req 3 Q
Req 4 Q
Req 5 F, Q
…
Labeled dataset D
1. Builds a model M that
describes the items in D accurately
Item Labels
Req 1 F
Req 2 F
Req 3 Q
Req 4 Q
Req 5 F, Q
…
2. Given an unseen, unlabeled
dataset D’, predicts (accurately)
the labels of the items in D’
Classification
algorithm
Item Predicted Real
Req XX F F
Req XY Q F
Req XZ F, Q F, Q
Req YZ F Q
Req XYX F F
…
An example of classification in NLP4RE
Feature engineering is key as it
determines which information the classifier
should combine to construct the model
Classification with LLMs
• No feature engineering needed!
• Immediate results via prompting
  - Zero-shot learning
  - Few-shot learning (a few labelled examples in the prompt)
• Better results via fine-tuning
  - Re-train the LLM with a labelled dataset
  - Combines the LLM knowledge with the domain-specific task
Pre-trained LLM
Domain-specific,
labelled dataset
Fine-tuned LLM
XXL general-
purpose dataset
fine-tuning
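A minimal sketch of how zero-shot and few-shot prompting differ for the F/Q classification task. The instruction wording, the function name build_few_shot_prompt, and the example requirements are illustrative assumptions, not taken from the talk. The sketch only builds the prompt text and deliberately stops short of calling any LLM API.

```python
def build_few_shot_prompt(examples, requirement):
    """Build a classification prompt; an empty `examples` list gives a zero-shot prompt."""
    lines = [
        "Classify each requirement as F (functional), Q (quality), or F,Q (both).",
        "",
    ]
    # Few-shot learning: a handful of labelled examples are placed in the prompt.
    for text, label in examples:
        lines.append(f"Requirement: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The unlabelled item comes last, so the model completes the missing label.
    lines.append(f"Requirement: {requirement}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical labelled examples and target requirement.
labelled = [
    ("The system shall export reports as PDF.", "F"),
    ("The page shall load within two seconds.", "Q"),
]
target = "The system shall encrypt all stored passwords."

zero_shot_prompt = build_few_shot_prompt([], target)
few_shot_prompt = build_few_shot_prompt(labelled, target)
```

Fine-tuning is a different mechanism: the labelled dataset updates the model itself rather than being pasted into the prompt.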
Credible research?
Iris, the
req. analyst
I need to find quality
requirements in
3,000+ requirements
from 10 projects…
Will I obtain the same
performance on my
unlabeled data?
This paper does it
automatically with
great results!
4. Are the classifier’s results credible?
The ECSER pipeline
Evaluating Classifiers in SE Research (ECSER)
• ECSER focuses on Treatment Validation
• Treatment = a classifier
• Two macro phases
• Treatment design is beyond the scope of ECSER
D. Dell'Anna, F. Basak Aydemir, F. Dalpiaz: Evaluating classifiers in SE research: The ECSER pipeline and two replication studies. Empirical Software Engineering 28(1): 3 (2023)
ECSER’s highlight #1: data and models
Training
Validation
Test
S5
ECSER’s highlight #2: p-fold cross-validation
• In SE, data originates from different projects
• p-fold cross-validation extends k-fold cross-validation with per-project splits (as opposed to random splits)
1. Given a set P of projects, take a subset S ⊂ P to train a model
2. Test the model on the remaining P \ S
3. Take another subset S’ of the same size as S
4. Train the model on S’
5. Test the model on P \ S’
6. …
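The steps above can be sketched as follows. This is an illustration under stated assumptions, not the ECSER reference implementation: it enumerates every subset S of a given size (the slide only says "take another subset"), and the names p_fold_splits, p_fold_cross_validate, fit_majority, and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def p_fold_splits(projects, train_size):
    """Yield (S, P \\ S) pairs: each subset S of `train_size` projects and its complement."""
    projects = sorted(projects)
    for train in combinations(projects, train_size):
        test = [p for p in projects if p not in train]
        yield list(train), test


def p_fold_cross_validate(data_by_project, train_size, fit, score):
    """Train on the pooled items of S and test on the pooled items of P \\ S, per split."""
    scores = []
    for train, test in p_fold_splits(data_by_project, train_size):
        train_items = [item for p in train for item in data_by_project[p]]
        test_items = [item for p in test for item in data_by_project[p]]
        model = fit(train_items)
        scores.append(score(model, test_items))
    return scores


def fit_majority(items):
    """Toy 'classifier' (assumption): always predict the most frequent training label."""
    labels = [label for _, label in items]
    return max(sorted(set(labels)), key=labels.count)


def accuracy(model, items):
    """Share of test items whose label equals the constant prediction."""
    return mean(1.0 if label == model else 0.0 for _, label in items)


# Made-up data: three projects with (requirement id, label) pairs.
toy = {
    "A": [("r1", "F"), ("r2", "F"), ("r3", "Q")],
    "B": [("r4", "Q"), ("r5", "Q")],
    "C": [("r6", "F"), ("r7", "Q")],
}
scores = p_fold_cross_validate(toy, 2, fit_majority, accuracy)
```

The point of the design is that no project contributes items to both the training and the test side of a split, which random k-fold splits cannot guarantee.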
ECSER’s highlight #3: the confusion matrix
• It provides transparency: all metrics can be derived from it, and the results can be inspected
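To illustrate why the confusion matrix is enough to derive the usual metrics, here is a small sketch. The class and method names are assumptions. The example data encodes the isFunctional view of the five predicted/real rows shown on the earlier "Automated classification via ML" slide (an item counts as functional when its labels include F).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    fn: int  # predicted negative, actually positive
    tn: int  # predicted negative, actually negative

    @classmethod
    def from_labels(cls, predicted, actual):
        """Count the four cells from paired boolean predictions and ground truth."""
        pairs = list(zip(predicted, actual))
        return cls(
            tp=sum(1 for p, a in pairs if p and a),
            fp=sum(1 for p, a in pairs if p and not a),
            fn=sum(1 for p, a in pairs if not p and a),
            tn=sum(1 for p, a in pairs if not p and not a),
        )

    def precision(self):
        denominator = self.tp + self.fp
        return self.tp / denominator if denominator else 0.0

    def recall(self):
        denominator = self.tp + self.fn
        return self.tp / denominator if denominator else 0.0

    def f1(self):
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if p + r else 0.0

    def accuracy(self):
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# isFunctional for Req XX, XY, XZ, YZ, XYX (predicted vs. real).
predicted = [True, False, True, True, True]
actual = [True, True, True, False, True]
cm = ConfusionMatrix.from_labels(predicted, actual)
```

A reader who is given only an F1 score cannot go back to the four cells, whereas a reader given the four cells can compute any metric they care about.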
ECSER’s highlight #4: overfitting and degradation
• Two metrics to analyze performance differences depending on the data splits
training set
test set
validation set
Overfitting = Test – Training
Degradation = Test – Validation
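The two definitions translate directly into code. The scores below are made-up F1 values chosen only to show the arithmetic; with these definitions, a negative value means the classifier does worse on the test set than on the other split.

```python
def overfitting(test_score, training_score):
    """Overfitting = Test - Training, as defined on the slide."""
    return test_score - training_score


def degradation(test_score, validation_score):
    """Degradation = Test - Validation, as defined on the slide."""
    return test_score - validation_score


# Hypothetical F1 scores of one classifier on the three splits.
f1_scores = {"training": 0.90, "validation": 0.80, "test": 0.70}

overfit = overfitting(f1_scores["test"], f1_scores["training"])
degrade = degradation(f1_scores["test"], f1_scores["validation"])
```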
ECSER’s highlight #5: statistical tests
• Which significance test?
• Not only the p-value: also the effect size!
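As one possible illustration of reporting both a p-value and an effect size, the sketch below compares two classifiers scored on the same test projects. The choice of an exact paired sign-flip permutation test and of Cohen's d for paired differences is an assumption made for this example; the tests ECSER actually recommends are in the figures of the original slide and are not reproduced here. The scores are made up.

```python
from itertools import product
from statistics import mean, stdev


def paired_permutation_p_value(a, b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Enumerates all 2**n sign assignments, so it is only practical for small n
    (for example, one score per test project).
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


def cohens_d_paired(a, b):
    """Effect size: mean of the paired differences divided by their standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


# Hypothetical F1 scores of two classifiers on the same five test projects.
classifier_a = [0.80, 0.82, 0.78, 0.85, 0.81]
classifier_b = [0.75, 0.80, 0.74, 0.79, 0.78]

p_value = paired_permutation_p_value(classifier_a, classifier_b)
effect_size = cohens_d_paired(classifier_a, classifier_b)
```

With only five pairs, the smallest two-sided p-value this test can produce is 2/32, which shows how a small sample can keep a consistent difference from reaching significance.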
Credible research?
Iris, the
req. analyst
I need to find quality
requirements in
3,000+ requirements
from 10 projects…
Will I obtain the same
performance on my
unlabeled data?
This paper does it
automatically with
great results!
Luckily, someone
applied ECSER!
Study design
S1. Evaluation method and data splitting
• Most of the literature uses PROMISE NFR
  - 625 requirements that pertain to 15 student projects
  - Generally, the studies only perform validation, no testing
• Our choices
  - Three algorithms (see previous slide)
  - No hyper-parameter tuning (validation, S3-S4)
  - Two binary classifiers: isFunctional and isQuality
Training
Validation
Test
S2 & S5. Training and testing the model
• Training is performed on PROMISE NFR
• Testing is performed on the remaining datasets
  - Test on Dronology, then test on DUAP, …
  - Calculate the arithmetic mean
S6. Reporting the confusion matrix
• This is simply a presentation of the raw results…
• But some aspects already stand out!
S7-S8. Performance and overfitting
• For simplicity, let’s examine F1 here
km500 fits the training set best
norbert has the best performance on the test set
ling17 has the smallest overfitting
S9. ROC Plot (for isFunctional)
norbert is the best
for most projects
ling17 tends to lead to
more false positives
km500 tends to
lead to more false
negatives
S10. Statistical tests
• Is one of these classifiers significantly better?
• The results are mixed
  - Yes, for km500 vs. norbert in the isFunctional case
  - Almost never for isQuality
Results from the first application of ECSER
• We confirm that norbert outperforms both ling17 and km500 on unseen data
  - But not in a statistical sense (small sample size?)
• The “losers” still have good properties:
  - ling17 has the smallest overfitting
  - km500 fits the training data best
Credible research? Under certain assumptions
F. Dalpiaz, D. Dell'Anna, F.B. Aydemir, S. Çevikol: Requirements Classification with Interpretable Machine Learning and Dependency Parsing. RE 2019: 142-152
Iris, the
req. analyst
Will I obtain the same
performance on my
unlabeled data?
Only if my
data resembles
Promise!
Key Message 3: Assess your results properly!
• Provides guidelines for
evaluating classifiers
• Is a step-by-step tool
The
ECSER
pipeline
• Confirms some results
• Clarifies and confutes
others
ECSER’s
application
5. Future Avenue: LLMs in
Requirements Engineering
LLM-Assisted RE: YOUR Vision
LLM-Assisted RE: A Vision
RE version 1.1
• Non-disruptive improvements in all activities where currently some automation takes place
  - Classification
  - Model derivation
  - Defect identification
  - Traceability
RE version 2.0
• Key focus on elicitation
• Breakthrough: automated analysis of conversations
• RE is mainly a human-centered activity
Elicitation is heavily centered on conversations!
NaPiRE (August 8, 2022)
http://www.re-survey.org/#/explore
Requirements
conversations
Requirements Analyst
Own ideas
Budget / project
constraints
Design
decisions Domain-specific
documentation
Elicitation
Elicitation: the root of (all) NL requirements
Requirements
conversations
Requirements Analyst
Own ideas
Budget / project
constraints
Design
decisions Domain-specific
documentation
Elicitation
Specification
Timeliness: why researching conversations now?
Increased remote work
and collaboration
Automated
transcription
(Requirements) conversations vs. specifications
2+ parties (here Analyst
and Stakeholder)
Informal: no “shall”
statements, user
stories, glossary
Relevant
information may
be sparse
Includes persuasion,
uncertainty,
misunderstandings
The many layers of (requirements) conversations
Turns and utterance units as
atomic entities
Cross-speaker interaction
defines the meaning
Traum, David R., and Elizabeth A. Hinkelman. "Conversation acts in task-oriented spoken dialogue." Computational intelligence 8.3 (1992): 575-599.
The purpose of a
conversation across
multiple turns
Tools for Conversational RE: Two Examples
Tjerk Spijkman, Fabiano Dalpiaz, and Sjaak Brinkkemper “Back to the
Roots: Linking User Stories to Requirements Elicitation Conversations”
Proceedings of the RE 2022
Tjerk Spijkman, Xavier de Bondt, Fabiano Dalpiaz, and Sjaak
Brinkkemper “Summarization of Elicitation Conversations to Locate
Requirements-Relevant Information” Proceedings of REFSQ 2023
Trace2Conv: Key Idea
Requirements
conversations
Requirements Analyst
Own ideas
Budget / project
constraints
Design
decisions Domain-specific
documentation
• Supports backward, pre-RS traceability
• Largely overlooked area of research
• Aims to find information that provides additional context to a requirement
Specification
Trace2Conv
Trace2Conv pre-LLMs
As a vendor user, I can use the password forgotten
functionality whenever I forgot or want to reset my
password, so that I always have a way to create a new
password
Short demo of Trace2Conv
Trace2Conv with LLMs
Expectations
• Complex pre-processing will be unnecessary
• Simple prompts will be able to match requirements to speaker turns well
Limitations
• Limit on the number of tokens
Summarizing a transcript: ReConSum
• Trigger: long recorded conversations, spanning multiple hours
• Can we help the analyst explore the transcript by summarizing it?
Step #1: Identify
the questions
Step #2: Filter by
question relevance
Step #3: Label by
relevance type
How to identify the questions? (Step #1)
Based on sequences of POS tags:
Wh-, yes/no, tag questions
Based on pre-trained DistilBert
(deep learning)
Combination: question if either
approach says so
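The "question if either approach says so" rule can be sketched as a logical OR of two detectors. Everything below is a simplification made for illustration: the rule-based detector looks at surface word patterns rather than true POS-tag sequences, the learned detector is passed in as a plain callable standing in for the DistilBERT-based classifier, and all names and word lists are assumptions.

```python
import re

WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}
AUX_WORDS = {
    "is", "are", "was", "were", "am", "do", "does", "did", "can", "could",
    "will", "would", "should", "shall", "may", "might", "have", "has", "had",
}
# Tag questions such as "..., don't you?" or "..., is it?".
TAG_PATTERN = re.compile(
    r",\s*(?:is|are|was|were|do|does|did|can|could|will|would|should|has|have|had)"
    r"(?:n't| not)?\s+(?:it|they|he|she|we|you|i|there)\s*\??$",
    re.IGNORECASE,
)


def rule_based_is_question(utterance):
    """Rough stand-in for the POS-tag approach: wh-, yes/no, and tag questions."""
    text = utterance.strip()
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    if words[0] in WH_WORDS or words[0] in AUX_WORDS:
        return True
    return bool(TAG_PATTERN.search(text))


def combined_is_question(utterance, learned_detector):
    """Combination rule from the slide: a question if either approach says so."""
    return rule_based_is_question(utterance) or learned_detector(utterance)


def stub_learned_detector(utterance):
    """Placeholder (assumption) for the deep learning classifier."""
    return utterance.strip().endswith("?")
```

Taking the union of two detectors can only add detections, so it tends to raise recall and may lower precision.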
How to filter relevant questions? (Step #2)
TF-IDF can be used to rank questions
with domain-specific words
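A minimal sketch of TF-IDF-based ranking, assuming that inverse document frequency is computed over the set of questions itself, so that words shared by many questions (such as "can" or "you") count for little. The slide does not say which corpus supplies the frequencies, so that choice, the function names, and the example questions are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_tf_idf(questions):
    """Return the questions sorted by their summed TF-IDF score, highest first."""
    docs = [tokenize(q) for q in questions]
    n_docs = len(docs)
    # Document frequency: in how many questions does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scored = []
    for question, doc in zip(questions, docs):
        counts = Counter(doc)
        score = 0.0
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scored.append((score, question))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [question for _, question in scored]


# Made-up questions from an elicitation conversation.
questions = [
    "Can you hear me?",
    "Can you see my screen?",
    "Can vendors reset the invoice password?",
]
ranked = rank_by_tf_idf(questions)
```

In this toy set, the question about vendors and invoices ranks first because none of its content words appear in the other two questions.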
Do our steps #1 and #2 work? (pre-LLM)
Step #1: Question identification
- Deep learning gives the best results
- Even better when combining the approaches
Step #2: Relevance detection:
- The combined pipeline achieves an F1-score around 67%
- [back to ECSER] error propagation from Step #1
We expect LLMs to improve the results, but this should be assessed rigorously (see ECSER)
Step #1: Question identification
Approach               Precision   Recall   F1-Score
Speech Acts (DL)       81.8%       91.7%    86.5%
Part of Speech tags    69.7%       77.4%    73.4%
Combination            76.8%       95.8%    85.3%

Step #2: Relevance detection
Approach               Precision   Recall   F1-Score
Speech Acts (DL)       64.4%       70.3%    67.2%
Part of Speech tags    53.8%       62.4%    57.8%
Combination            55.7%       81.7%    65.7%
Ongoing tool: distilling domain models
ChatGPT 4.0 prompts
- Guidelines from Blaha and Rumbaugh
- combine transcripts with its own knowledge
Key challenge ahead in Conversational RE?
Lack of metrics and gold standards!
Key Message 4: New avenues unlocked, but…
• Opens new avenues for the
RE discipline
• LLMs will be an enabler
Conversational RE
• No gold standards
• Unknown metrics
• Rigor is necessary!
What are
the perils?
6. Wrap-up
Take-home messages
Large
language
models
are here and can
do science fiction
stuff
are changing our
job as
researchers
need rigorous
reporting (ECSER
as an example)
unlock uncharted
territories (e.g.,
conversational
RE)
Thank you for listening! Questions?
f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz
Special credits to
- F. Başak Aydemir
- Davide Dell’Anna
- Xavier de Bondt
- Tjerk Spijkman
- Sjaak Brinkkemper

Sérgio Sacani
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 

Recently uploaded (20)

4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 

Information science research with large language models: between science and fiction

  • 1. Information science research with large language models: between science and fiction Fabiano Dalpiaz Requirements Engineering Lab Utrecht University, the Netherlands May 15, 2024 f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz
  • 2. 1. Large Language Models @2024 Fabiano Dalpiaz 2 ChatGPT, depicted by ChatGPT 4.0 + DALL-E
  • 3. Large Language Models (LLMs) in the news @2024 Fabiano Dalpiaz 3
  • 4. Various viewpoints on LLMs @2024 Fabiano Dalpiaz 4
  • 5. LLMs in information science @2024 Fabiano Dalpiaz 5
  • 6. LLMs in information science research @2024 Fabiano Dalpiaz 6 ⚠ LLM use disclaimers? • “drafted by ChatGPT – rephrased by Quillbot – images by MidJourney – prompts in Appendix A”? ⚠ Legal and ethical implications ⚠ Quoting ≠ paraphrasing What’s ahead? 👉 Dedicated conference tracks about LLMs 👉 Exciting avenues for research!
  • 7. LLMs in Software Engineering research @2024 Fabiano Dalpiaz 7 A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J.M. Zhang. “Large Language Models for Software Engineering: Survey and Open Problems.” arXiv:2310.03533, 2023 ICSE’24 main track
  • 8. How are YOU using LLMs in YOUR research? @2024 Fabiano Dalpiaz 8
  • 9. Key Message 1: Accept the Evolution @2024 Fabiano Dalpiaz 9 Large Language Models are here: • Can assist us in science fiction tasks. They are and will be changing our lives: • As citizens • As researchers • As educators
  • 10. 2. Credibility in (information) science research @2024 Fabiano Dalpiaz 10
  • 11. IS research in the small – simplified illustration @2024 Fabiano Dalpiaz 11 Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature
  • 12. Credibility in information science research @2024 Fabiano Dalpiaz 12 Interesting, this seems a breakthrough. But… how can I trust what the authors claim? PhD student Elize Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature
  • 13. How do YOU assess the credibility of a paper? @2024 Fabiano Dalpiaz 13
  • 14. Threats to credibility – the idea @2024 Fabiano Dalpiaz 14 That idea is wrong in the first place! Jim, the reviewer Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature
  • 15. Threats to credibility – the idea @2024 Fabiano Dalpiaz 14 That idea is wrong in the first place! Jim, the reviewer Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Invalid criticism in science!
  • 16. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – the conceptual framework @2024 Fabiano Dalpiaz 16 It builds on a rejected theory It proposes a theory that hasn’t been tested yet Jim, the reviewer
  • 17. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – the constructed artifact @2024 Fabiano Dalpiaz 17 Simplistic, partially implemented It conflicts with the conceptual framework Jim, the reviewer
  • 18. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – validation / evaluation @2024 Fabiano Dalpiaz 18 • The evaluation is too small • Mislabeled: is it a case study / experiment? • The experimental design is flawed • Too few subjects • The research questions are not clear • The metrics do not match with the RQs • Missing threats to validity • Wrong statistical tests • Ethical approval missing • The source code is not available • No replication package • Won’t generalize • Too small improvement over SotA • … Jim, the reviewer
  • 19. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – the written paper @2024 Fabiano Dalpiaz 19 This claim is factually wrong The sentence is ambiguous Jim, the reviewer
  • 20. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – peer reviewing / publication @2024 Fabiano Dalpiaz 20 Renowned authors = good? Jim, the reviewer
  • 21. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – peer reviewing / publication @2024 Fabiano Dalpiaz 21 Prestigious venue = good? Never heard of this journal = bad? Jim, the reader
  • 22. Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature Threats to credibility – literature @2024 Fabiano Dalpiaz 22 We propose tool Z that can be used to classify requirements automatically, distinguishing functional from quality requirements. […] Dalpiaz et al. [22] showed that their ML-based approach has accuracy of 95%. […] The performance of Z is superior to that of Dalpiaz et al. [22]. I can’t find a link to tool Z… On which dataset was the 95% accuracy obtained? What does it mean for Z to be superior? Jim, the reader
  • 23. Credibility in research: research methods @2024 Fabiano Dalpiaz 23
  • 24. Credibility in research: open science badges @2024 Fabiano Dalpiaz 24 • Artifacts evaluated - functional: “Work as intended” • Artifacts evaluated - reusable: Functional + very carefully documented + well structured • Artifacts available: Publicly accessible in an archival repository (with DOI) • Results reproduced: Another team obtained the same results with the artifacts provided by the original authors • Results replicated: Another team obtained the same results without the author-supplied artifacts. Source: https://www.acm.org/publications/policies/artifact-review-and-badging-current
  • 25. Problem solved? How about LLMs being USED in the research cycle? @2024 Fabiano Dalpiaz 25 Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature
  • 26. LLMs are already being used! (a few examples) @2024 Fabiano Dalpiaz 26 • Literature review generator: jenni.ai • Originality checker: originality.ai • Writing assistant: quillbot.com • The one-size-fits-all ChatGPT • Code generation: copilot
  • 27. Will the use of LLMs affect research CREDIBILITY? @2024 Fabiano Dalpiaz 27 Research idea Conceptual framework Artifact construction Validation / evaluation Paper writing Peer review Publication Literature
  • 28. Will the use of LLMs affect research CREDIBILITY? @2024 Fabiano Dalpiaz 28
  • 29. Key Message 2: Responsibility as Information Scientists @2024 Fabiano Dalpiaz 29 LLMs in IS Research: • Can be used for many tasks • We are using them! What is up to us? • Deliver research that can be trusted • Discern credible results
  • 30. 3. Deep dive on NLP tools in Requirements Engineering (NLP4RE) @2024 Fabiano Dalpiaz 30
  • 31. Background theory: Refinement in RE @2024 Fabiano Dalpiaz 31 K. Pohl. "The three dimensions of requirements engineering: a framework and its applications." Information Systems 19.3 (1994): 243-258. [Figure: the three dimensions. Specification: opaque, fair, complete. Representation: informal, semi-formal, formal. Agreement: personal view, common view. Annotations: initial RE input; desired RE output; refinement path in practice; RE research, including NLP4RE tools]
  • 32. How do NLP4RE tools work? @2024 Fabiano Dalpiaz 32 Processing text is particularly suitable for LLMs!!
  • 33. Four categories of NLP4RE tools @2024 Fabiano Dalpiaz 33 1. Find defects / deviations from good practice 2. Generate models from NL reqs 3. Infer trace links between NL reqs and other artifacts 4. Identify key abstractions from NL documents D.M. Berry, R. Gacitua, P. Sawyer, and S.F. Tjong. "The case for dumb requirements engineering tools." In Proceedings of REFSQ, pp. 211-217. 2012.
  • 34. Tools in NLP4RE (2021-2022, before LLMs) @2024 Fabiano Dalpiaz 34 L. Zhao, W. Alhoshan, A. Ferrari, K.J. Letsholo, M.A.A., E.-V. Chioasca, and R.T. Batista-Navarro. Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study. ACM Computing Surveys 54:3, 2022
  • 35. Case: F/Q Requirements Classification @2024 Fabiano Dalpiaz 35 • Seminal classification problem that aims at identifying NFRs (or Qualities) • Two classes: Functional and Quality • Dozens of tools in the literature • Keyword based, ML & DL classifiers, zero- and few-shot learning…
  • 36. Automated classification via ML @2024 Fabiano Dalpiaz 36 Labeled dataset D (Item: Labels): Req 1: F; Req 2: F; Req 3: Q; Req 4: Q; Req 5: F, Q; … Classification algorithm: 1. Builds a model M that describes the items in D accurately 2. Given an unseen, unlabeled dataset D’, predicts (accurately) the labels of the items in D’. Predictions on D’ (Item: Predicted / Real): Req XX: F / F; Req XY: Q / F; Req XZ: F, Q / F, Q; Req YZ: F / Q; Req XYX: F / F; …
  • 37. An example of classification in NLP4RE @2024 Fabiano Dalpiaz 37 Feature engineering is key as it determines which information the classifier should combine to construct the model
  • 38. Classification with LLMs @2024 Fabiano Dalpiaz 38 • No feature engineering needed! • Immediate results via prompting - Zero-shot learning - Few-shot learning (a few labelled examples in the prompt) • Better results via fine-tuning - Re-train the LLM with a labelled dataset - Combines the LLM knowledge with the domain-specific task [Diagram: XXL general-purpose dataset → pre-trained LLM → fine-tuning with a domain-specific, labelled dataset → fine-tuned LLM]
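The difference between zero-shot and few-shot prompting on slide 38 can be sketched as plain prompt construction. This is a minimal illustration, not the prompts used in the talk: the function name `build_prompt`, the instruction wording, and the example requirements are all assumptions, and no LLM API is called. For simplicity it asks for a single label, although the slides note that a requirement can be both F and Q.

```python
from typing import List, Tuple

# The two classes named on the slides: F (functional) and Q (quality).
LABELS = ("F", "Q")


def build_prompt(requirement: str, examples: List[Tuple[str, str]]) -> str:
    """Build a classification prompt (hypothetical wording).

    With an empty `examples` list this is a zero-shot prompt; with a few
    labelled examples it becomes a few-shot prompt.
    """
    lines = [
        "Classify the requirement as F (functional) or Q (quality).",
        "Answer with a single letter.",
        "",
    ]
    for text, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines.append(f"Requirement: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The unlabeled requirement goes last, so the model completes the label.
    lines.append(f"Requirement: {requirement}")
    lines.append("Label:")
    return "\n".join(lines)


# Made-up labelled examples, for illustration only.
few_shot_examples = [
    ("The system shall export reports as PDF.", "F"),
    ("The system shall respond within 2 seconds.", "Q"),
]

zero_shot = build_prompt("Users can reset their password.", [])
few_shot = build_prompt("Users can reset their password.", few_shot_examples)
print(few_shot)
```

Fine-tuning, by contrast, changes the model weights using a labelled dataset and is not captured by prompt construction alone.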
  • 39. Credible research? @2024 Fabiano Dalpiaz 39 Iris, the req. analyst I need to find quality requirements in 3,000+ requirements from 10 projects… Will I obtain the same performance on my unlabeled data? This paper does it automatically with great results!
  • 40. 4. Are the classifier’s results credible? The ECSER pipeline @2024 Fabiano Dalpiaz 40
  • 41. Evaluating Classifiers in SE Research (ECSER) @2024 Fabiano Dalpiaz 41 • ECSER focuses on Treatment Validation • Treatment = a classifier • Two macro phases • Treatment design is beyond the scope of ECSER D. Dell'Anna, F. Basak Aydemir, F. Dalpiaz: Evaluating classifiers in SE research: The ECSER pipeline and two replication studies. Empirical Software Engineering 28(1): 3 (2023)
  • 42. ECSER’s highlight #1: data and models @2024 Fabiano Dalpiaz 42 Training Validation Test S5
  • 43. ECSER’s highlight #2: p-fold cross-validation @2024 Fabiano Dalpiaz 43 • In SE, data originates from different projects • p-fold cross-validation extends k-fold cross-validation with per-project splits (as opposed to random splits) 1. Given a set P of projects, take a subset S ⊂ P to train a model 2. Test the model on the remaining P \ S 3. Take another subset S’ of the same size as S 4. Train the model on S’ 5. Test the model on P \ S’ 6. …
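The per-project splitting described on slide 43 can be sketched as follows. This is only an illustration under stated assumptions: the function name `p_fold_splits` and the project names are hypothetical, and enumerating every subset of a given size is a simplification (the slide only says "take another subset", so sampling some subsets would be equally consistent with it).

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Split = Tuple[Tuple[str, ...], Tuple[str, ...]]


def p_fold_splits(projects: List[str], train_size: int) -> Iterator[Split]:
    """Yield (train_projects, test_projects) pairs.

    Each training subset S contains `train_size` whole projects; the test
    set is every remaining project (P minus S). Requirements from one
    project never appear on both sides of a split.
    """
    if not 0 < train_size < len(projects):
        raise ValueError("train_size must be between 1 and len(projects) - 1")
    for train in combinations(projects, train_size):
        test = tuple(p for p in projects if p not in train)
        yield train, test


# Hypothetical project identifiers, for illustration only.
projects = ["P1", "P2", "P3", "P4"]
splits = list(p_fold_splits(projects, train_size=2))

for train, test in splits:
    print("train on", train, "-> test on", test)
```

A random k-fold split would instead mix requirements from the same project into training and test data, which is what the per-project split avoids.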
  • 44. ECSER’s highlight #3: the confusion matrix @2024 Fabiano Dalpiaz 44 • It provides transparency: it allows readers to derive all metrics and to inspect the results
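To illustrate why a reported confusion matrix is enough to derive the usual metrics, here is a small sketch for a binary classifier. The class name `ConfusionMatrix` and the counts are made up for illustration; they are not results from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (e.g., an isQuality classifier)."""

    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Made-up counts, for illustration only.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=30)
print(f"precision={cm.precision():.3f} recall={cm.recall():.3f} "
      f"f1={cm.f1():.3f} accuracy={cm.accuracy():.3f}")
```

The reverse is not possible: a single reported F1 value does not let a reader recover the four counts.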
  • 45. ECSER’s highlight #4: overfitting and degradation @2024 Fabiano Dalpiaz 45 • Two metrics to analyze performance differences depending on the data splits (training set, validation set, test set) • Overfitting = Test − Training • Degradation = Test − Validation
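The two definitions on slide 45 are simple differences of the same metric measured on different splits. A minimal sketch, with hypothetical F1 scores chosen only to show the sign convention:

```python
def overfitting(test_score: float, training_score: float) -> float:
    """Overfitting = Test - Training (negative when the model does worse on test)."""
    return test_score - training_score


def degradation(test_score: float, validation_score: float) -> float:
    """Degradation = Test - Validation."""
    return test_score - validation_score


# Hypothetical F1 scores on the three splits, for illustration only.
f1_training, f1_validation, f1_test = 0.95, 0.85, 0.70

over = overfitting(f1_test, f1_training)
degr = degradation(f1_test, f1_validation)
print(f"overfitting={over:+.2f} degradation={degr:+.2f}")
```

With these definitions, values closer to zero mean the performance seen during training or validation carries over to the test data.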
  • 46. ECSER’s highlight #5: statistical tests @2024 Fabiano Dalpiaz 46 • Which significance test? • Not only the p-value. Also, the effect size!
  • 47. Credible research? @2024 Fabiano Dalpiaz 47 Iris, the req. analyst I need to find quality requirements in 3,000+ requirements from 10 projects… Will I obtain the same performance on my unlabeled data? This paper does it automatically with great results! Luckily, someone applied ECSER!
  • 49. S1. Evaluation method and data splitting @2024 Fabiano Dalpiaz 49 • Most of the literature uses PROMISE NFR - 625 requirements that pertain to 15 student projects - Generally, the studies only perform validation, no testing • Our choices - Three algorithms (see previous slide) - No hyper-parameter tuning (validation, S3-S4) - Two binary classifiers: isFunctional and isQuality [Data splits: Training, Validation, Test]
  • 50. S2 & S5. Training and testing the model @2024 Fabiano Dalpiaz 50 • Training is performed on PROMISE NFR • Testing is performed on the remaining datasets - Test on Dronology, then test on DUAP, … - Calculate arithmetic mean
  • 51. S6. Reporting the confusion matrix @2024 Fabiano Dalpiaz 51 • This is simply a presentation of the raw results… • But some aspects already stand out!
  • 52. S7-S8. Performance and overfitting @2024 Fabiano Dalpiaz 52 • For simplicity, let’s examine F1 here • km500 fits the training set best • norbert has the best performance on the test set • ling17 has the smallest overfitting
  • 53. S9. ROC Plot (for isFunctional) @2024 Fabiano Dalpiaz 53 • norbert is the best for most projects • ling17 tends to lead to more false positives • km500 tends to lead to more false negatives
  • 54. S10. Statistical tests @2024 Fabiano Dalpiaz 54 • Is one of these classifiers significantly better? • The results are mixed - Yes, for km500 vs. norbert in the isFunctional case - Almost never for isQuality
  • 55. Results from the first application of ECSER @2024 Fabiano Dalpiaz 55 • We confirm that norbert outperforms both ling17 and km500 on unseen data - But not in a statistical sense (small sample size?) • The “losers” still have good properties: - ling17 has the smallest overfitting - km500 fits the training data best
  • 56. Credible research? Under certain assumptions @2024 Fabiano Dalpiaz 56 F. Dalpiaz, D. Dell'Anna, F.B. Aydemir, S. Çevikol: Requirements Classification with Interpretable Machine Learning and Dependency Parsing. RE 2019: 142-152 Iris, the req. analyst Will I obtain the same performance on my unlabeled data? Only if my data resembles PROMISE!
  • 57. Key Message 3: Assess your results properly! @2024 Fabiano Dalpiaz 57 The ECSER pipeline: • Provides guidelines for evaluating classifiers • Is a step-by-step tool. ECSER’s application: • Confirms some results • Clarifies and confutes others
  • 58. 5. Future Avenue: LLMs in Requirements Engineering @2024 Fabiano Dalpiaz 58
  • 59. LLM-Assisted RE: YOUR Vision @2024 Fabiano Dalpiaz 59
  • 60. LLM-Assisted RE: A Vision @2024 Fabiano Dalpiaz 60 RE version 1.1 • Non-disruptive improvements in all activities where currently some automation takes place - Classification - Model derivation - Defect identification - Traceability RE version 2.0 • Key focus on elicitation • Breakthrough: automated analysis of conversations • RE is mainly a human-centered activity
  • 61. Elicitation is heavily centered on conversations! @2024 Fabiano Dalpiaz 61 NaPiRE (August 8, 2022) http://www.re-survey.org/#/explore Requirements conversations Requirements Analyst Own ideas Budget / project constraints Design decisions Domain-specific documentation Elicitation
  • 62. Elicitation: the root of (all) NL requirements @2024 Fabiano Dalpiaz 62 Requirements conversations Requirements Analyst Own ideas Budget / project constraints Design decisions Domain-specific documentation Elicitation Specification
  • 63. Timeliness: why researching conversations now? @2024 Fabiano Dalpiaz 63 Increased remote work and collaboration Automated transcription
  • 64. (Requirements) conversations vs. specifications @2024 Fabiano Dalpiaz 64 2+ parties (here Analyst and Stakeholder) Informal: no “shall” statements, user stories, glossary Relevant information may be sparse Includes persuasion, uncertainty, misunderstandings
  • 65. The many layers of (requirements) conversations @2024 Fabiano Dalpiaz 65 Turns and utterance units as atomic entities Cross-speaker interaction defines the meaning Traum, David R., and Elizabeth A. Hinkelman. "Conversation acts in task-oriented spoken dialogue." Computational intelligence 8.3 (1992): 575-599. The purpose of a conversation across multiple turns
  • 66. Tools for Conversational RE: Two Examples @2024 Fabiano Dalpiaz 66 Tjerk Spijkman, Fabiano Dalpiaz, and Sjaak Brinkkemper. “Back to the Roots: Linking User Stories to Requirements Elicitation Conversations.” Proceedings of RE 2022. Tjerk Spijkman, Xavier de Bondt, Fabiano Dalpiaz, and Sjaak Brinkkemper. “Summarization of Elicitation Conversations to Locate Requirements-Relevant Information.” Proceedings of REFSQ 2023.
  • 67. Trace2Conv: Key Idea @2024 Fabiano Dalpiaz 67 Requirements conversations Requirements Analyst Own ideas Budget / project constraints Design decisions Domain-specific documentation Specification Trace2Conv • Supports backward, pre-RS traceability - Largely overlooked area of research • Aims to find information that provides additional context to a requirement
  • 68. Trace2Conv pre-LLMs @2024 Fabiano Dalpiaz 68 As a vendor user, I can use the password forgotten functionality whenever I forgot or want to reset my password, so that I always have a way to create a new password
  • 69. Short demo of Trace2Conv @2024 Fabiano Dalpiaz 69
  • 70. Trace2Conv with LLMs @2024 Fabiano Dalpiaz 70 • Expectations - Complex pre-processing will be unnecessary - Simple prompts will be able to match requirements to speaker turns well • Limitations - Limit on the number of tokens
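One common way to work around a token limit when matching requirements against a long transcript is to send the speaker turns in batches. The sketch below is not how Trace2Conv works; it is a hypothetical illustration. The function names are invented, and counting whitespace-separated words is only a rough stand-in for a real tokenizer.

```python
from typing import List


def approx_tokens(text: str) -> int:
    # Assumption: roughly one token per whitespace-separated word.
    # A real LLM tokenizer would give different counts.
    return len(text.split())


def batch_turns(turns: List[str], max_tokens: int) -> List[List[str]]:
    """Group consecutive speaker turns into batches within a token budget.

    Turns are never split; a single turn larger than the budget ends up
    alone in its own (oversized) batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for turn in turns:
        cost = approx_tokens(turn)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(turn)
        used += cost
    if current:
        batches.append(current)
    return batches


# Made-up conversation fragment, for illustration only.
turns = [
    "Analyst: How do users log in today?",
    "Stakeholder: They use a password, and they often forget it.",
    "Analyst: Should they be able to reset it themselves?",
    "Stakeholder: Yes, that would help a lot.",
]
batches = batch_turns(turns, max_tokens=18)
for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {len(batch)} turns")
```

Batching keeps each prompt within the limit, at the cost of losing context that spans batch boundaries.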
  • 71. Summarizing a transcript: ReConSum @2024 Fabiano Dalpiaz 71 • Trigger: long recorded conversations, spanning multiple hours • Can we help the analyst explore the transcript by summarizing it? Step #1: Identify the questions Step #2: Filter by question relevance Step #3: Label by relevance type
  • 72. How to identify the questions? (Step #1) @2024 Fabiano Dalpiaz 72 • Based on sequences of POS tags: Wh-, yes/no, tag questions • Based on pre-trained DistilBERT (deep learning) • Combination: question if either approach says so
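The "question if either approach says so" combination on slide 72 is a logical OR over two detectors. The sketch below illustrates only that combination rule. It does not use POS tags or DistilBERT: the rule-based detector relies on simple word lists as a stand-in, and `stub_model` is a placeholder for a trained classifier. All names and word lists are assumptions.

```python
from typing import Callable

WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}
AUX_WORDS = {"is", "are", "was", "were", "do", "does", "did", "can", "could",
             "will", "would", "should", "have", "has"}
TAG_ENDINGS = ("isn't it", "right", "don't you", "aren't they", "doesn't it")


def rule_based_is_question(utterance: str) -> bool:
    """Word-list heuristic for Wh-, yes/no, and tag questions.

    Simplification: uses surface words instead of POS-tag sequences.
    """
    text = utterance.strip().lower().rstrip("?.! ")
    words = text.split()
    if not words:
        return False
    if words[0] in WH_WORDS or words[0] in AUX_WORDS:
        return True  # Wh- question or yes/no question
    return text.endswith(TAG_ENDINGS)  # tag question


def stub_model(utterance: str) -> bool:
    # Placeholder for a trained classifier: flags a trailing question mark.
    return utterance.strip().endswith("?")


def combined_is_question(utterance: str,
                         model_is_question: Callable[[str], bool]) -> bool:
    """An utterance counts as a question if either detector says so."""
    return rule_based_is_question(utterance) or model_is_question(utterance)


for u in ["How do you handle refunds", "You use SAP, right?",
          "Tell me about the approval flow?", "We ship on Fridays."]:
    print(u, "->", combined_is_question(u, stub_model))
```

Combining detectors with OR can only add detections, so it tends to raise recall while precision may drop.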
  • 73. How to filter relevant questions? (Step #2) TF-IDF can be used to rank questions with domain-specific words @2024 Fabiano Dalpiaz 73
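As a rough illustration of ranking questions by TF-IDF, the sketch below treats each question as a document and scores it by the sum of its TF-IDF term weights, so that questions with rarer (here taken as more domain-specific) words rank higher. This is one possible reading of the slide, not the ReConSum implementation; the function names and the example questions are assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_tfidf(questions: List[str]) -> List[Tuple[str, float]]:
    """Rank questions by the sum of TF-IDF weights of their terms.

    TF is the relative term frequency within a question; IDF is
    log(N / document frequency) computed over the questions themselves.
    """
    docs = [tokenize(q) for q in questions]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scored = []
    for question, doc in zip(questions, docs):
        tf = Counter(doc)
        score = sum(
            (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        ) if doc else 0.0
        scored.append((question, score))
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up questions, for illustration only.
questions = [
    "How are you today?",
    "How do you approve a purchase invoice?",
    "How are you doing?",
]
ranked = rank_by_tfidf(questions)
for question, score in ranked:
    print(f"{score:.3f}  {question}")
```

In practice the IDF would more plausibly be computed against a general-language corpus, so that everyday words score low even in a short transcript.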
  • 74. Do our steps #1 and #2 work? (pre-LLM) @2024 Fabiano Dalpiaz 74 Step #1: Question identification - Deep learning gives the best results - Even better when combining the approaches. Results for step #1 (Approach: Precision / Recall / F1-Score): Speech Acts (DL): 81.8% / 91.7% / 86.5%; Part of Speech tags: 69.7% / 77.4% / 73.4%; Combination: 76.8% / 95.8% / 85.3%. Step #2: Relevance detection - The combined pipeline achieves an F1-score around 67% - [back to ECSER] error propagation from idea #1. Results for step #2 (Approach: Precision / Recall / F1-Score): Speech Acts (DL): 64.4% / 70.3% / 67.2%; Part of Speech tags: 53.8% / 62.4% / 57.8%; Combination: 55.7% / 81.7% / 65.7%. We expect LLMs to improve the results, but this should be assessed rigorously (see ECSER)
  • 75. Ongoing tool: distilling domain models ChatGPT 4.0 prompts - Guidelines from Blaha and Rumbaugh - combine transcripts with its own knowledge @2024 Fabiano Dalpiaz 75
  • 76. Key challenge ahead in Conversational RE? @2024 Fabiano Dalpiaz 76 Lack of metrics and gold standards!
  • 77. Key Message 4: New avenues unlocked, but… @2024 Fabiano Dalpiaz 77 Conversational RE: • Opens new avenues for the RE discipline • LLMs will be an enabler. What are the perils? • No gold standards • Unknown metrics • Rigor is necessary!
  • 79. Take-home messages @2024 Fabiano Dalpiaz 79 Large language models • are here and can do science fiction stuff • are changing our job as researchers • need rigorous reporting (ECSER as an example) • unlock uncharted territories (e.g., conversational RE)
  • 83. Thank you for listening! Questions? f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz Special credits to - F. Başak Aydemir - Davide Dell’Anna - Xavier de Bondt - Tjerk Spijkman - Sjaak Brinkkemper