Information science research with large language models: between science and fiction

Fabiano Dalpiaz
Requirements Engineering Lab
Utrecht University, the Netherlands
May 15, 2024
f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz

1. Large Language Models
@2024 Fabiano Dalpiaz
ChatGPT, depicted by ChatGPT 4.0 + DALL-E

Large Language Models (LLMs) in the news

Various viewpoints on LLMs

LLMs in information science

LLMs in information science research
⚠ LLM use disclaimers?
• “drafted by ChatGPT – rephrased by Quillbot –
images by MidJourney – prompts in Appendix A”?
⚠ Legal and ethical implications
⚠ Quoting ≠ paraphrasing
What’s ahead?
👉 Dedicated conference tracks about LLMs
👉 Exciting avenues for research!

LLMs in Software Engineering research
A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J.M. Zhang. "Large Language Models for Software Engineering: Survey and Open Problems." arXiv:2310.03533, 2023
ICSE’24 main track

How are YOU using LLMs in YOUR research?

Key Message 1: Accept the Evolution
Can assist us in
science fiction tasks
Large
Language
Models
are here
• As citizens
• As researchers
• As educators
They are
and will be
changing
our lives

2. Credibility in (information) science research

IS research in the small – simplified illustration
Research idea Conceptual framework Artifact construction Validation / evaluation
Paper writing
Peer review
Publication
Literature

Credibility in information science research
Interesting, this seems a
breakthrough. But…
how can I trust what the
authors claim?
PhD student Elize
Paper writing
Peer review
Publication
Literature

How do YOU assess the credibility of a paper?

Threats to credibility – the idea
That idea is
wrong in the
first place!
Jim, the reviewer
Paper writing
Peer review
Publication
Literature

Invalid criticism in science!

Paper writing
Peer review
Publication
Literature
Threats to credibility – the conceptual framework
It builds on a
rejected theory
It proposes a
theory that hasn’t
been tested yet
Jim, the reviewer

Paper writing
Peer review
Publication
Literature
Threats to credibility – the constructed artifact
Simplistic, partially
implemented
It conflicts with
the conceptual
framework
Jim, the reviewer

Paper writing
Peer review
Publication
Literature
Threats to credibility – validation / evaluation
• The evaluation is too small
• Mislabeled: is it a case study / experiment?
• The experimental design is flawed
• Too few subjects
• The research questions are not clear
• The metrics do not match with the RQs
• Missing threats to validity
• Wrong statistical tests
• Ethical approval missing
• The source code is not available
• No replication package
• Won’t generalize
• Too small improvement over SotA
• …
Jim, the reviewer

Paper writing
Peer review
Publication
Literature
Threats to credibility – the written paper
This claim is
factually wrong
The sentence is
ambiguous
Jim, the reviewer

Paper writing
Peer review
Publication
Literature
Threats to credibility – peer reviewing / publication
Renowned
authors =
good?
Jim, the reviewer

Paper writing
Peer review
Publication
Literature
Threats to credibility – peer reviewing / publication
Prestigious
venue = good?
Never heard of
this journal = bad?
Jim, the reader

Paper writing
Peer review
Publication
Literature
Threats to credibility – literature
We propose tool Z that can be used to
classify requirements automatically,
distinguishing functional from quality
requirements.
[…]
Dalpiaz et al. [22] showed that their ML-
based approach has accuracy of 95%.
[…]
The performance of Z is superior to
that of Dalpiaz et al. [22].
I can’t find a
link to tool Z…
On which
dataset was the
95% accuracy
obtained?
What does it
mean for Z to
be superior?
Jim, the reader

Credibility in research: research methods

Credibility in research: open science badges
Artifacts evaluated - functional
“Work as intended”
https://www.acm.org/publications/policies/artifact-review-and-badging-current
Artifacts evaluated - reusable
Functional + very carefully
documented + well structured
Artifacts available
Publicly accessible in an archival
repository (with DOI)
Results reproduced
Another team obtained the same
results with the artifacts provided
by the original authors
Results replicated
Another team obtained the same
results without the author-supplied
artifacts

Problem solved? How about LLMs being USED in the
research cycle?
Paper writing
Peer review
Publication
Literature

LLMs are already being used! (a few examples)
Literature review generator: jenni.ai
Originality checker: originality.ai
Writing assistant: quillbot.com
The one-size-fits-all ChatGPT
Code generation: Copilot

Will the use of LLMs affect research CREDIBILITY?
Paper writing
Peer review
Publication
Literature


Key Message 2: Responsibility as Information Scientists
• Can be used for
many tasks
• We are using them!
LLMs in IS
Research
• Deliver research that can
be trusted
• Discern credible results
What is
up to us?

3. Deep dive on NLP tools in
Requirements Engineering (NLP4RE)

Background theory: Refinement in RE
K. Pohl. "The three dimensions of requirements engineering: a framework and its applications." Information Systems 19.3 (1994): 243-258.
Specification
Representation
opaque
fair
complete
common view
informal semi-formal formal
personal view
Initial RE
input
Desired RE output
Agreement
Refinement
path in
practice
RE research,
including NLP4RE Tools

How do NLP4RE tools work?
Processing text is
particularly suitable
for LLMs!!

Four categories of NLP4RE tools
1. Find defects /
deviations from
good practice
2. Generate models
from NL reqs
3. Infer trace links
between NL reqs
and other artifacts
4. Identify key
abstractions
from NL
documents
D.M. Berry, R. Gacitua, P. Sawyer, and S.F. Tjong. "The case for dumb requirements engineering tools." In Proceedings of REFSQ, pp. 211-217. 2012.

Tools in NLP4RE (2021-2022, before LLMs)
L. Zhao, W. Alhoshan, A. Ferrari, K.J. Letsholo, M.A. Ajagbe, E.-V. Chioasca, and R.T. Batista-Navarro. Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study. ACM Computing Surveys 54:3, 2022

Case: F/Q Requirements Classification
• Seminal classification problem that aims at identifying NFRs (or Qualities)
• Two classes: Functional and Quality
• Dozens of tools in the literature
  • Keyword based, ML & DL classifiers, zero- and few-shot learning…

Automated classification via ML
Item Labels
Req 1 F
Req 2 F
Req 3 Q
Req 4 Q
Req 5 F, Q
…
Labeled dataset D
1. Builds a model M that
describes the items in D accurately
Item Labels
Req 1 F
Req 2 F
Req 3 Q
Req 4 Q
Req 5 F, Q
…
2. Given an unseen, unlabeled
dataset D’, predicts (accurately)
the labels of the items in D’
Classification
algorithm
Item Predicted Real
Req XX F F
Req XY Q F
Req XZ F, Q F, Q
Req YZ F Q
Req XYX F F
…

An example of classification in NLP4RE
Feature engineering is key as it
determines which information the classifier
should combine to construct the model

Classification with LLMs
• No feature engineering needed!
• Immediate results via prompting
  • Zero-shot learning
  • Few-shot learning (a few labelled examples in the prompt)
• Better results via fine-tuning
  • Re-train the LLM with a labelled dataset
  • Combines the LLM knowledge with the domain-specific task
Pre-trained LLM
Domain-specific,
labelled dataset
Fine-tuned LLM
XXL general-
purpose dataset
fine-tuning
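
As an illustration of the prompting route, the sketch below assembles a zero-shot or few-shot prompt for F/Q classification. It is only a sketch under assumptions: the function name `build_few_shot_prompt`, the instruction wording, and the example requirements are hypothetical, and no call to an actual LLM is made.

```python
def build_few_shot_prompt(examples, requirement):
    """Assemble a plain-text prompt; an empty `examples` list yields a zero-shot prompt."""
    parts = ["Classify each requirement as F (functional), Q (quality), or F, Q (both)."]
    for text, label in examples:
        # Each labelled example becomes one demonstration in the prompt.
        parts.append(f"Requirement: {text}\nLabel: {label}")
    # The requirement to classify comes last, with the label left open.
    parts.append(f"Requirement: {requirement}\nLabel:")
    return "\n\n".join(parts)

# Hypothetical labelled examples (not taken from any dataset).
examples = [
    ("The system shall export reports as PDF.", "F"),
    ("The system shall respond within 2 seconds.", "Q"),
]
prompt = build_few_shot_prompt(examples, "Users can reset their password securely.")
print(prompt)
```

The resulting string would be sent to whichever LLM is used; how that is done depends on the provider and is left out on purpose.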

Credible research?
Iris, the
req. analyst
I need to find quality
requirements in
3,000+ requirements
from 10 projects…
Will I obtain the same
performance on my
unlabeled data?
This paper does it
automatically with
great results!

4. Are the classifier’s results credible?
The ECSER pipeline

Evaluating Classifiers in SE Research (ECSER)
• ECSER focuses on Treatment Validation
  • Treatment = a classifier
• Two macro phases
• Treatment design is beyond the scope of ECSER
D. Dell'Anna, F. Başak Aydemir, F. Dalpiaz: Evaluating classifiers in SE research: The ECSER pipeline and two replication studies. Empirical Software Engineering 28(1): 3 (2023)

ECSER’s highlight #1: data and models
Training
Validation
Test
S5

ECSER’s highlight #2: p-fold cross-validation
• In SE, data originates from different projects
• p-fold cross-validation extends k-fold cross-validation with per-project splits (as opposed to random splits)
1. Given a set P of projects, take a subset S ⊂ P to train a model
2. Test the model on the remaining P \ S
3. Take another subset S’ of the same size as S
4. Train the model on S’
5. Test the model on P \ S’
6. …
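
The per-project splitting described above can be sketched as follows. This is a minimal illustration, not the ECSER implementation: the function name `p_fold_splits`, the `train_size` parameter, and the project names are assumptions, and the sketch simply enumerates every training subset S of the given size.

```python
from itertools import combinations

def p_fold_splits(projects, train_size):
    """Yield (train, test) lists of projects: train on S, test on the rest of P."""
    projects = sorted(projects)
    for subset in combinations(projects, train_size):
        train = list(subset)
        # The test set is the set difference between P and S.
        test = [p for p in projects if p not in subset]
        yield train, test

# Hypothetical example: four projects, training subsets of three projects.
splits = list(p_fold_splits(["P1", "P2", "P3", "P4"], train_size=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because splits follow project boundaries, requirements from the same project never appear in both the training and the test side of a fold.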

ECSER’s highlight #3: the confusion matrix
• It provides transparency: it allows deriving all metrics and inspecting the results
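
To make the transparency argument concrete, the sketch below derives common metrics from the four cells of a binary confusion matrix. The function name and the counts are made up for illustration.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive precision, recall, F1 and accuracy from a binary confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts, for illustration only.
m = metrics_from_confusion(tp=40, fp=10, fn=20, tn=30)
print(m)
```

A reader who is given only the final F1 cannot go the other way and recover the four counts, which is why reporting the matrix itself matters.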

ECSER’s highlight #4: overfitting and degradation
• Two metrics to analyze performance differences depending on the data splits
training set
test set
validation set
Overfitting = Test - Training
Degradation = Test - Validation
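
The two differences can be computed directly from the per-split scores. The sketch below follows the definitions as written on the slide; the function names and the F1 values are made up for illustration.

```python
def overfitting(test_score, training_score):
    """Overfitting = Test - Training (definition as given on the slide)."""
    return test_score - training_score

def degradation(test_score, validation_score):
    """Degradation = Test - Validation (definition as given on the slide)."""
    return test_score - validation_score

# Made-up F1 scores for one classifier on the three data splits.
scores = {"training": 0.95, "validation": 0.85, "test": 0.70}
of = overfitting(scores["test"], scores["training"])
dg = degradation(scores["test"], scores["validation"])
print(round(of, 2), round(dg, 2))
```

With these definitions, a negative value means the classifier does worse on the test set than on the split it is compared against.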

ECSER’s highlight #5: statistical tests
• Which significance test?
• Not only the p-value: also the effect size!
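
The slide does not name a specific test in its text, so the sketch below swaps in one possible choice for illustration: an exact paired sign-flip permutation test for the p-value, and Cohen's d on the paired differences as effect size. Function names and scores are assumptions, and ECSER's own guidance on test selection may point to a different test.

```python
from itertools import product
from statistics import mean, stdev

def paired_permutation_p(a, b):
    """Two-sided exact sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = total = 0
    # Enumerate every assignment of signs to the differences (2^n cases).
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = [s * d for s, d in zip(signs, diffs)]
        total += 1
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total

def cohens_d_paired(a, b):
    """Effect size: mean of the paired differences over their standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Made-up per-project F1 scores for two hypothetical classifiers.
clf_a = [0.80, 0.75, 0.90, 0.85, 0.70]
clf_b = [0.70, 0.72, 0.80, 0.80, 0.68]
p_value = paired_permutation_p(clf_a, clf_b)
effect = cohens_d_paired(clf_a, clf_b)
print(p_value, round(effect, 2))
```

In this made-up example classifier A wins on every project and the effect size is large, yet with only five projects the smallest p-value this test can produce is 2/32, which is why the two numbers should be read together.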

Credible research?
Iris, the
req. analyst
I need to find quality
requirements in
3,000+ requirements
from 10 projects…
Will I obtain the same
performance on my
unlabeled data?
This paper does it
automatically with
great results!
Luckily, someone
applied ECSER!

Study design

S1. Evaluation method and data splitting
• Most of the literature uses PROMISE NFR
  • 625 requirements that pertain to 15 student projects
  • Generally, the studies only perform validation, no testing
• Our choices
  • Three algorithms (see previous slide)
  • No hyper-parameter tuning (validation, S3-S4)
  • Two binary classifiers: isFunctional and isQuality
Training
Validation
Test
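
The choice of two binary classifiers means that each requirement's label set is turned into two independent yes/no targets. A minimal sketch of that mapping, with a hypothetical helper name and the label style of the earlier example table:

```python
def to_binary_targets(labels):
    """Map a label set such as {"F", "Q"} to the pair (isFunctional, isQuality)."""
    return ("F" in labels, "Q" in labels)

# Label sets in the style of the example table shown earlier in the deck.
dataset = {"Req 1": {"F"}, "Req 3": {"Q"}, "Req 5": {"F", "Q"}}
targets = {req: to_binary_targets(labels) for req, labels in dataset.items()}
print(targets)
```

A requirement labelled both F and Q is then a positive example for both classifiers, rather than forcing a single class.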

S2 & S5. Training and testing the model
• Training is performed on PROMISE NFR
• Testing is performed on the remaining datasets
  • Test on Dronology, then test on DUAP, …
  • Calculate arithmetic mean
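
The test procedure amounts to one score per test dataset followed by an unweighted average. A small sketch, in which the F1 values are made up and only the two dataset names mentioned above are used:

```python
from statistics import mean

# Made-up F1 scores of one classifier, one value per test dataset.
f1_per_dataset = {"Dronology": 0.70, "DUAP": 0.80}

# Arithmetic mean: every dataset counts equally, regardless of its size.
mean_f1 = mean(f1_per_dataset.values())
print(round(mean_f1, 2))
```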

S6. Reporting the confusion matrix
• This is simply a presentation of the raw results…
• But some aspects already stand out!

S7-S8. Performance and overfitting
• For simplicity, let’s examine F1 here
km500 fits best the training set
norbert has the best performance on the test set
ling17 has the smallest overfitting

S9. ROC Plot (for isFunctional)
norbert is the best
for most projects
ling17 tends to lead to
more false positives
km500 tends to
lead to more false
negatives

S10. Statistical tests
• Is one of these classifiers significantly better?
• The results are mixed
  • Yes, for km500 vs. norbert in the isFunctional case
  • Almost never for isQuality

Results from the first application of ECSER
• We confirm that norbert outperforms both ling17 and km500 on unseen data
  • But not in a statistical sense (small sample size?)
• The “losers” still have good properties:
  • ling17 has the smallest overfitting
  • km500 fits best the training data

Credible research? Under certain assumptions
F. Dalpiaz, D. Dell'Anna, F.B. Aydemir, S. Çevikol: Requirements Classification with Interpretable Machine Learning and Dependency Parsing. RE 2019: 142-152
Iris, the
req. analyst
Will I obtain the same
performance on my
unlabeled data?
Only if my
data resembles
Promise!

Key Message 3: Assess your results properly!
• Provides guidelines for
evaluating classifiers
• Is a step-by-step tool
The
ECSER
pipeline
• Confirms some results
• Clarifies and confutes
others
ECSER’s
application

5. Future Avenue: LLMs in
Requirements Engineering

LLM-Assisted RE: YOUR Vision

LLM-Assisted RE: A Vision
RE version 1.1
• Non-disruptive improvements in all activities where currently some automation takes place
  • Classification
  • Model derivation
  • Defect identification
  • Traceability
RE version 2.0
• Key focus on elicitation
• Breakthrough: automated analysis of conversations
• RE is mainly a human-centered activity

Elicitation is heavily centered on conversations!
NaPiRE (August 8, 2022)
http://www.re-survey.org/#/explore
Requirements
conversations
Requirements Analyst
Own ideas
Budget / project
constraints
Design decisions
Domain-specific documentation
Elicitation

Elicitation: the root of (all) NL requirements
Requirements
conversations
Own ideas
Budget / project
constraints
Design
documentation
Elicitation
Specification

Timeliness: why researching conversations now?
Increased remote work
and collaboration
Automated
transcription

(Requirements) conversations vs. specifications
2+ parties (here Analyst
and Stakeholder)
Informal: no “shall”
statements, user
stories, glossary
Relevant
information may
be sparse
Includes persuasion,
uncertainty,
misunderstandings

The many layers of (requirements) conversations
Turns and utterance units as
atomic entities
Cross-speaker interaction
defines the meaning
Traum, David R., and Elizabeth A. Hinkelman. "Conversation acts in task-oriented spoken dialogue." Computational intelligence 8.3 (1992): 575-599.
The purpose of a
conversation across
multiple turns

Tools for Conversational RE: Two Examples
Tjerk Spijkman, Fabiano Dalpiaz, and Sjaak Brinkkemper “Back to the
Roots: Linking User Stories to Requirements Elicitation Conversations”
Proceedings of the RE 2022
Tjerk Spijkman, Xavier de Bondt, Fabiano Dalpiaz, and Sjaak
Brinkkemper “Summarization of Elicitation Conversations to Locate
Requirements-Relevant Information” Proceedings of REFSQ 2023

Trace2Conv: Key Idea
Requirements
conversations
Own ideas
Budget / project
constraints
Design
documentation
• Supports backward, pre-RS traceability
• Largely overlooked area of research
• Aims to find information that provides additional context to a requirement
Specification
Trace2Conv

Trace2Conv pre-LLMs
As a vendor user, I can use the password forgotten
functionality whenever I forgot or want to reset my
password, so that I always have a way to create a new
password

Short demo of Trace2Conv

Trace2Conv with LLMs
• Expectations
  • Complex pre-processing will be unnecessary
  • Simple prompts will be able to match requirements to speaker turns well
• Limitations
  • Limit on the number of tokens
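
One common way to cope with a token limit is to split a long transcript into chunks of consecutive speaker turns. The sketch below is an assumption about how this could be done, not a description of Trace2Conv: the function name is hypothetical, and whitespace-separated words serve as a crude proxy for tokens, whereas a real LLM tokenizer would count differently.

```python
def chunk_turns(turns, max_tokens):
    """Group consecutive speaker turns into chunks that fit a rough token budget."""
    chunks, current, used = [], [], 0
    for turn in turns:
        cost = len(turn.split())  # crude proxy: one token per word
        # Start a new chunk when adding this turn would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(turn)
        used += cost
    if current:
        chunks.append(current)
    return chunks

# Made-up transcript fragment.
turns = [
    "Analyst: How do vendors log in",
    "Stakeholder: They use a password",
    "Analyst: And if they forget it",
    "Stakeholder: They call support",
]
chunks = chunk_turns(turns, max_tokens=12)
print(chunks)
```

Splitting on turn boundaries keeps each speaker turn intact, but context that spans chunks can still be lost, which is part of why the limit counts as a limitation.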

Summarizing a transcript: ReConSum
• Trigger: long recorded conversations, spanning multiple hours
• Can we help the analyst explore the transcript by summarizing it?
Step #1: Identify the questions
Step #2: Filter by question relevance
Step #3: Label by relevance type

How to identify the questions? (Step #1)
Based on sequences of POS tags:
Wh-, yes/no, tag questions
Based on pre-trained DistilBert
(deep learning)
Combination: question if either
approach says so
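
The "either approach says so" combination is a logical OR over two detectors. In the sketch below both detectors are stand-ins chosen for illustration: a first-word heuristic replaces the POS-tag patterns, and a trivial placeholder function replaces the pre-trained DistilBERT classifier. All names and word lists are assumptions.

```python
WH_WORDS = {"what", "why", "how", "who", "when", "where", "which"}
AUX_VERBS = {"do", "does", "did", "is", "are", "can", "could", "would", "should", "will"}

def rule_based_is_question(utterance):
    """Stand-in for the POS-tag approach: look at the first word only."""
    words = utterance.lower().rstrip("?.! ").split()
    if not words:
        return False
    return words[0] in WH_WORDS or words[0] in AUX_VERBS

def placeholder_model(utterance):
    """Placeholder for a learned classifier; here it only checks for a question mark."""
    return utterance.strip().endswith("?")

def combined_is_question(utterance, model_is_question):
    """An utterance counts as a question if either detector says so."""
    return rule_based_is_question(utterance) or model_is_question(utterance)

print(combined_is_question("How do you reset passwords", placeholder_model))
print(combined_is_question("You use single sign-on, right?", placeholder_model))
print(combined_is_question("We store invoices in the ERP.", placeholder_model))
```

Taking the union of two detectors tends to raise recall at some cost in precision, since every false positive of either detector is kept.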

How to filter relevant questions? (Step #2)
TF-IDF can be used to rank questions
with domain-specific words
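
A minimal TF-IDF ranking can be written with the standard library alone. The sketch assumes that each question is treated as its own document and that a question's score is the sum of the TF-IDF weights of its words; the slides do not specify the corpus or the weighting variant, so both are assumptions, as are the function names and the example questions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().replace("?", "").split()

def rank_by_tfidf(questions):
    """Rank questions so that those with rarer (more specific) words come first."""
    docs = [tokenize(q) for q in questions]
    n_docs = len(docs)
    # Document frequency: in how many questions each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))

    def score(doc):
        if not doc:
            return 0.0
        counts = Counter(doc)
        return sum((c / len(doc)) * math.log(n_docs / doc_freq[w])
                   for w, c in counts.items())

    return sorted(questions, key=lambda q: score(tokenize(q)), reverse=True)

# Made-up questions.
questions = [
    "How are you today?",
    "How do vendors reset the invoice password?",
    "How are things?",
]
ranked = rank_by_tfidf(questions)
print(ranked[0])
```

Words that occur in every question, such as "how" here, get an IDF of zero and do not contribute to the score.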

Do our steps #1 and #2 work? (pre-LLM)
Step #1: Question identification
- Deep learning gives the best results
- Even better when combining the approaches
Step #2: Relevance detection:
- The combined pipeline achieves an F1-score around 67%
- [back to ECSER] error propagation from Step #1
We expect LLMs to improve the results, but this should be assessed rigorously (see ECSER)
Step #1 (question identification):
Approach              Precision   Recall   F1-Score
Speech Acts (DL)      81.8%       91.7%    86.5%
Part of Speech tags   69.7%       77.4%    73.4%
Combination           76.8%       95.8%    85.3%

Step #2 (relevance detection):
Approach              Precision   Recall   F1-Score
Speech Acts (DL)      64.4%       70.3%    67.2%
Part of Speech tags   53.8%       62.4%    57.8%
Combination           55.7%       81.7%    65.7%
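
The F1-score is the harmonic mean of precision and recall. The sketch below recomputes it for the first of the two tables above; since the reported precision and recall are themselves rounded, a recomputed value can differ from the reported one by about 0.1 percentage points.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) in percent, copied from the first table.
first_table = {
    "Speech Acts (DL)": (81.8, 91.7, 86.5),
    "Part of Speech tags": (69.7, 77.4, 73.4),
    "Combination": (76.8, 95.8, 85.3),
}
for approach, (p, r, reported) in first_table.items():
    print(approach, round(f1_score(p, r), 1), "reported:", reported)
```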

Ongoing tool: distilling domain models
ChatGPT 4.0 prompts
- Guidelines from Blaha and Rumbaugh
- combine transcripts with its own knowledge

Key challenge ahead in Conversational RE?
Lack of metrics and gold standards!

Key Message 4: New avenues unlocked, but…
• Opens new avenues for the
RE discipline
• LLMs will be an enabler
Conversational RE
• No gold standards
• Unknown metrics
• Rigor is necessary!
What are
the perils?

6. Wrap-up

Take-home messages
Large
language
models
are here and can
do science fiction
stuff
are changing our
job as
researchers
need rigorous
reporting (ECSER
as an example)
unlock uncharted
territories (e.g.,
conversational
RE)

Thank you for listening! Questions?
f.dalpiaz@uu.nl @FabianoDalpiaz fabianodalpiaz
Special credits to
- F. Başak Aydemir
- Davide Dell’Anna
- Xavier de Bondt
- Tjerk Spijkman
- Sjaak Brinkkemper