Information science research with large language models: between science and fiction

Fabiano Dalpiaz · 83 slides · May 15, 2024

About This Presentation

Large language models (LLMs) are in the spotlight. Laypeople are aware of LLMs such as OpenAI’s ChatGPT and Google’s Gemini and use them on a daily basis. While companies are exploring new business opportunities, researchers have gained access to an unprecedented scientific playground that ...


Slide Content

Information science research with large language models: between science and fiction
Fabiano Dalpiaz
Requirements Engineering Lab
Utrecht University, the Netherlands
May 15, 2024
[email protected] · @FabianoDalpiaz · fabianodalpiaz

1. Large Language Models
© 2024 Fabiano Dalpiaz
ChatGPT, depicted by ChatGPT4.0 + DALL-E

Large Language Models (LLMs) in the news

Various viewpoints on LLMs

LLMs in information science

LLMs in information science research
⚠ LLM use disclaimers?
• “drafted by ChatGPT – rephrased by Quillbot – images by MidJourney – prompts in Appendix A”?
⚠ Legal and ethical implications
⚠ Quoting ≠ paraphrasing
What’s ahead?
• Dedicated conference tracks about LLMs
• Exciting avenues for research!

LLMs in Software Engineering research
A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J.M. Zhang. "Large Language Models for Software Engineering: Survey and Open Problems." arXiv:2310.03533, 2023
ICSE’24 main track

How are YOU using LLMs in YOUR research?

Key Message 1: Accept the Evolution
Large Language Models are here
• As citizens
• As researchers
• As educators
They can assist us in science-fiction tasks, and they are and will be changing our lives.

2. Credibility in (information) science research

IS research in the small – simplified illustration
Research idea → Conceptual framework → Artifact construction → Validation / evaluation → Paper writing → Peer review → Publication → Literature

Credibility in information science research
PhD student Elize: “Interesting, this seems a breakthrough. But… how can I trust what the authors claim?”

How do YOU assess the credibility of a paper?

Threats to credibility – the idea
Jim, the reviewer: “That idea is wrong in the first place!”

Invalid criticism in science!

Threats to credibility – the conceptual framework
Jim, the reviewer: “It builds on a rejected theory.” “It proposes a theory that hasn’t been tested yet.”

Threats to credibility – the constructed artifact
Jim, the reviewer: “Simplistic, partially implemented.” “It conflicts with the conceptual framework.”

Threats to credibility – validation / evaluation
Jim, the reviewer:
• The evaluation is too small
• Mislabeled: is it a case study / experiment?
• The experimental design is flawed
• Too few subjects
• The research questions are not clear
• The metrics do not match the RQs
• Missing threats to validity
• Wrong statistical tests
• Ethical approval missing
• The source code is not available
• No replication package
• Won’t generalize
• Too small an improvement over the SotA
• …

Threats to credibility – the written paper
Jim, the reviewer: “This claim is factually wrong.” “The sentence is ambiguous.”

Threats to credibility – peer reviewing / publication
Jim, the reviewer: “Renowned authors = good?”

Threats to credibility – peer reviewing / publication
Jim, the reader: “Prestigious venue = good?” “Never heard of this journal = bad?”

Threats to credibility – literature
“We propose tool Z that can be used to classify requirements automatically, distinguishing functional from quality requirements. […] Dalpiaz et al. [22] showed that their ML-based approach has accuracy of 95%. […] The performance of Z is superior to that of Dalpiaz et al. [22].”
Jim, the reader: “I can’t find a link to tool Z…” “On which dataset was the 95% accuracy obtained?” “What does it mean for Z to be superior?”

Credibility in research: research methods

Credibility in research: open science badges
https://www.acm.org/publications/policies/artifact-review-and-badging-current
Artifacts evaluated – functional: “Work as intended”
Artifacts evaluated – reusable: functional + very carefully documented + well structured
Artifacts available: publicly accessible in an archival repository (with DOI)
Results reproduced: another team obtained the same results with the artifacts provided by the original authors
Results replicated: another team obtained the same results without the author-supplied artifacts

Problem solved? How about LLMs being USED in the research cycle?

LLMs are already being used! (a few examples)
Literature review generator: jenni.ai
Originality checker: originality.ai
Writing assistant: quillbot.com
The one-size-fits-all ChatGPT
Code generation: Copilot

Will the use of LLMs affect research CREDIBILITY?


Key Message 2: Responsibility as Information Scientists
LLMs in IS research:
• Can be used for many tasks
• We are using them!
What is up to us?
• Deliver research that can be trusted
• Discern credible results

3. Deep dive on NLP tools in
Requirements Engineering (NLP4RE)

Background theory: Refinement in RE
K. Pohl. "The three dimensions of requirements engineering: a framework and its applications." Information Systems 19.3 (1994): 243-258.
[Figure: Pohl’s three dimensions of RE — Specification (opaque → fair → complete), Representation (informal → semi-formal → formal), Agreement (personal view → common view). RE moves from the initial RE input toward the desired RE output; RE research, including NLP4RE tools, supports the refinement path taken in practice.]

How do NLP4RE tools work?
Processing text is particularly suitable for LLMs!

Four categories of NLP4RE tools
1. Find defects / deviations from good practice
2. Generate models from NL requirements
3. Infer trace links between NL requirements and other artifacts
4. Identify key abstractions from NL documents
D.M. Berry, R. Gacitua, P. Sawyer, and S.F. Tjong. "The case for dumb requirements engineering tools." In Proceedings of REFSQ, pp. 211-217. 2012.

Tools in NLP4RE (2021-2022, before LLMs)
L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, and R. T. Batista-Navarro. "Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study." ACM Computing Surveys 54:3, 2022

Case: F/Q Requirements Classification
• Seminal classification problem that aims at identifying NFRs (or Qualities)
• Two classes: Functional and Quality
• Dozens of tools in the literature
• Keyword-based, ML & DL classifiers, zero- and few-shot learning…

Automated classification via ML
Labeled dataset D:
Item  | Labels
Req 1 | F
Req 2 | F
Req 3 | Q
Req 4 | Q
Req 5 | F, Q

The classification algorithm:
1. Builds a model M that describes the items in D accurately
2. Given an unseen, unlabeled dataset D’, predicts (accurately) the labels of the items in D’

Item    | Predicted | Real
Req XX  | F         | F
Req XY  | Q         | F
Req XZ  | F, Q      | F, Q
Req YZ  | F         | Q
Req XYX | F         | F
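The train-then-predict workflow above can be sketched as a toy example. This is a deliberately naive, hypothetical illustration (the keyword list and the requirements are invented, and it is not one of the surveyed tools): a hand-made "model" that labels a requirement Q when it mentions a quality-related keyword, applied to an unseen dataset D’.

```python
# Toy sketch of the classify-then-compare workflow above (assumption:
# keyword list and requirements are invented for illustration only).
QUALITY_KEYWORDS = {"secure", "fast", "usable", "available", "seconds"}

def predict(requirement: str) -> str:
    """Label a requirement Q if it mentions a quality keyword, else F."""
    words = set(requirement.lower().replace(".", "").split())
    return "Q" if words & QUALITY_KEYWORDS else "F"

# Unseen dataset D' with real labels, as in the predicted-vs-real table
unseen = [
    ("The user can create a new project.", "F"),
    ("The system shall respond within two seconds.", "Q"),
]
for text, real in unseen:
    print(text, "| predicted:", predict(text), "| real:", real)
```

A real classifier would of course learn its model from the labeled dataset D instead of using a fixed keyword list.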

An example of classification in NLP4RE
Feature engineering is key, as it determines which information the classifier should combine to construct the model.

Classification with LLMs
• No feature engineering needed!
• Immediate results via prompting
  • Zero-shot learning
  • Few-shot learning (a few labelled examples in the prompt)
• Better results via fine-tuning
  • Re-train the LLM with a labelled dataset
  • Combines the LLM’s knowledge with the domain-specific task
Pre-trained LLM (built from an XXL general-purpose dataset) + domain-specific labelled dataset → fine-tuned LLM
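Few-shot prompting, as mentioned above, amounts to assembling a prompt that interleaves a few labelled examples with the item to classify. A minimal sketch, assuming invented example requirements and wording (a real study would send this string to an LLM API and parse the completion):

```python
# Hedged sketch of few-shot prompt construction; the labelled examples
# below are assumptions made for illustration, not from the talk.
EXAMPLES = [
    ("The system shall encrypt all stored data.", "Quality"),
    ("The user can create a new project.", "Functional"),
]

def few_shot_prompt(requirement: str) -> str:
    """Build a classification prompt with labelled examples in it."""
    parts = ["Classify each requirement as Functional or Quality.", ""]
    for text, label in EXAMPLES:
        parts += [f"Requirement: {text}", f"Label: {label}", ""]
    parts += [f"Requirement: {requirement}", "Label:"]
    return "\n".join(parts)

print(few_shot_prompt("The app must respond within 2 seconds."))
```

Zero-shot learning is the same idea with the examples list left empty.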

Credible research?
Iris, the req. analyst: “I need to find quality requirements in 3,000+ requirements from 10 projects… This paper does it automatically with great results! Will I obtain the same performance on my unlabeled data?”

4. Are the classifier’s results credible?
The ECSER pipeline

Evaluating Classifiers in SE Research (ECSER)
• ECSER focuses on Treatment Validation
• Treatment = a classifier
• Two macro phases
• Treatment design is beyond the scope of ECSER
D. Dell'Anna, F. Basak Aydemir, F. Dalpiaz: "Evaluating classifiers in SE research: The ECSER pipeline and two replication studies." Empirical Software Engineering 28(1): 3 (2023)

ECSER’s highlight #1: data and models
Training / validation / test splits (step S5)

ECSER’s highlight #2: p-fold cross-validation
• In SE, data originates from different projects
• p-fold cross-validation extends k-fold cross-validation with per-project splits (as opposed to random splits)
1. Given a set P of projects, take a subset S ⊂ P to train a model
2. Test the model on the remaining P \ S
3. Take another subset S’ of the same size as S
4. Train the model on S’
5. Test the model on P \ S’
6. …
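The per-project splitting loop above can be sketched in a few lines of stdlib Python. The project names are hypothetical placeholders; each fold trains on a subset S of projects and tests on P \ S, so no project contributes to both sides of a split:

```python
# Sketch of p-fold (per-project) splits: enumerate fixed-size training
# subsets S of the project set P and test on the remainder P \ S.
from itertools import combinations

def p_fold_splits(projects, train_size):
    """Yield (train, test) project sets; splits are per-project, not random."""
    for s in combinations(sorted(projects), train_size):
        train = set(s)
        yield train, set(projects) - train

P = {"ProjA", "ProjB", "ProjC", "ProjD"}  # hypothetical project names
for train, test in p_fold_splits(P, 3):
    print(sorted(train), "->", sorted(test))
```

With |P| = 4 and |S| = 3 this yields four folds, each testing on one held-out project.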

ECSER’s highlight #3: the confusion matrix
It provides transparency: it allows one to derive all metrics and to inspect the results.
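The point that all metrics follow from the confusion matrix can be made concrete with a short derivation. The counts below are invented for illustration:

```python
# Deriving the standard metrics from a binary confusion matrix
# (tp, fp, fn, tn are invented example counts).
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)          # how many predicted positives are real
    recall = tp / (tp + fn)             # how many real positives were found
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

print(metrics(tp=40, fp=10, fn=5, tn=45))
```

Reporting the raw matrix lets readers recompute any metric and inspect the error types directly.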

ECSER’s highlight #4: overfitting and degradation
• Two metrics to analyze performance differences depending on the data splits (training, validation, and test sets):
• Overfitting = Test – Training
• Degradation = Test – Validation

ECSER’s highlight #5: statistical tests
• Which significance test?
• Not only the p-value: also the effect size!
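One common effect-size measure that can accompany a significance test is Cohen's d with a pooled standard deviation; the slide does not prescribe this particular statistic, and the per-fold F1 samples below are invented, so treat this as a sketch:

```python
# Hedged sketch: Cohen's d (pooled standard deviation) as an effect
# size between two classifiers' per-fold F1 scores (scores invented).
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference between two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

f1_classifier_a = [0.82, 0.79, 0.85, 0.81]
f1_classifier_b = [0.74, 0.70, 0.77, 0.72]
print(cohens_d(f1_classifier_a, f1_classifier_b))  # positive: A outperforms B
```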

Credible research?
Iris, the req. analyst: “I need to find quality requirements in 3,000+ requirements from 10 projects… This paper does it automatically with great results! Will I obtain the same performance on my unlabeled data? Luckily, someone applied ECSER!”

Study design

S1. Evaluation method and data splitting
• Most of the literature uses PROMISE NFR
  • 625 requirements that pertain to 15 student projects
  • Generally, the studies only perform validation, no testing
• Our choices
  • Three algorithms (see previous slide)
  • No hyper-parameter tuning (validation, S3-S4)
  • Two binary classifiers: isFunctional and isQuality
  • Training / validation / test splits

S2 & S5. Training and testing the model
• Training is performed on PROMISE NFR
• Testing is performed on the remaining datasets
  • Test on Dronology, then test on DUAP, …
  • Calculate the arithmetic mean

S6. Reporting the confusion matrix
• This is simply a presentation of the raw results…
• But some aspects already stand out!

S7-S8. Performance and overfitting
• For simplicity, let’s examine F1 here
• km500 fits the training set best
• norbert has the best performance on the test set
• ling17 has the smallest overfitting

S9. ROC Plot (for isFunctional)
• norbert is the best for most projects
• ling17 tends to lead to more false positives
• km500 tends to lead to more false negatives

S10. Statistical tests
• Is one of these classifiers significantly better? The results are mixed:
  • Yes, for km500 vs. norbert in the isFunctional case
  • Almost never for isQuality

Results from the first application of ECSER
• We confirm that norbert outperforms both ling17 and km500 on unseen data
  • But not in a statistical sense (small sample size?)
• The “losers” still have good properties:
  • ling17 has the smallest overfitting
  • km500 fits the training data best

Credible research? Under certain assumptions
F. Dalpiaz, D. Dell'Anna, F.B. Aydemir, S. Çevikol: "Requirements Classification with Interpretable Machine Learning and Dependency Parsing." RE 2019: 142-152
Iris, the req. analyst: “Will I obtain the same performance on my unlabeled data? Only if my data resembles PROMISE!”

Key Message 3: Assess your results properly!
The ECSER pipeline:
• Provides guidelines for evaluating classifiers
• Is a step-by-step tool
ECSER’s application:
• Confirms some results
• Clarifies and confutes others

5. Future Avenue: LLMs in
Requirements Engineering

LLM-Assisted RE: YOUR Vision

LLM-Assisted RE: A Vision
RE version 1.1
• Non-disruptive improvements in all activities where some automation currently takes place:
  • Classification
  • Model derivation
  • Defect identification
  • Traceability
RE version 2.0
• Key focus on elicitation
• Breakthrough: automated analysis of conversations
• RE is mainly a human-centered activity

Elicitation is heavily centered on conversations!
NaPiRE (August 8, 2022) — http://www.re-survey.org/#/explore
[Figure: inputs to elicitation — requirements conversations, the analyst’s own ideas, budget / project constraints, design decisions, and domain-specific documentation all feed the Requirements Analyst.]

Elicitation: the root of (all) NL requirements
[Figure: the same inputs — requirements conversations, own ideas, budget / project constraints, design decisions, domain-specific documentation — flow through elicitation into the specification.]

Timeliness: why researching conversations now?
• Increased remote work and collaboration
• Automated transcription

(Requirements) conversations vs. specifications
• 2+ parties (here: Analyst and Stakeholder)
• Informal: no “shall” statements, user stories, glossary
• Relevant information may be sparse
• Includes persuasion, uncertainty, misunderstandings

The many layers of (requirements) conversations
• Turns and utterance units as atomic entities
• Cross-speaker interaction defines the meaning
• The purpose of a conversation spans multiple turns
Traum, David R., and Elizabeth A. Hinkelman. "Conversation acts in task-oriented spoken dialogue." Computational Intelligence 8.3 (1992): 575-599.

Tools for Conversational RE: Two Examples
Tjerk Spijkman, Fabiano Dalpiaz, and Sjaak Brinkkemper. "Back to the Roots: Linking User Stories to Requirements Elicitation Conversations." Proceedings of RE 2022
Tjerk Spijkman, Xavier de Bondt, Fabiano Dalpiaz, and Sjaak Brinkkemper. "Summarization of Elicitation Conversations to Locate Requirements-Relevant Information." Proceedings of REFSQ 2023

Trace2Conv: Key Idea
• Supports backward, pre-RS traceability
• A largely overlooked area of research
• Aims to find information that provides additional context to a requirement
[Figure: Trace2Conv links elements of the specification back to their elicitation sources — requirements conversations, own ideas, budget / project constraints, design decisions, domain-specific documentation.]

Trace2Conv pre-LLMs
As a vendor user, I can use the password-forgotten functionality whenever I forget or want to reset my password, so that I always have a way to create a new password.

Short demo of Trace2Conv

Trace2Conv with LLMs
Expectations:
• Complex pre-processing will be unnecessary
• Simple prompts will be able to match requirements to speaker turns well
Limitations:
• Token-count limits

Summarizing a transcript: ReConSum
• Trigger: long recorded conversations, spanning multiple hours
• Can we help the analyst explore the transcript by summarizing it?
Step #1: Identify the questions
Step #2: Filter by question relevance
Step #3: Label by relevance type

How to identify the questions? (Step #1)
• Based on sequences of POS tags: wh-, yes/no, tag questions
• Based on pre-trained DistilBert (deep learning)
• Combination: it is a question if either approach says so
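A rule-based branch in the spirit of Step #1 can be sketched with surface cues instead of real POS tags (an assumption made to keep the sketch self-contained; the cue lists are illustrative, not the talk's actual rules):

```python
# Minimal rule-based question detector: a turn counts as a question if
# it ends with "?" or starts with a wh-word or a yes/no auxiliary.
WH_STARTS = ("what", "which", "who", "when", "where", "why", "how")
AUX_STARTS = ("do ", "does ", "did ", "is ", "are ", "can ", "could ",
              "shall ", "should ", "will ", "would ")

def is_question(turn: str) -> bool:
    t = turn.strip().lower()
    return t.endswith("?") or t.startswith(WH_STARTS) or t.startswith(AUX_STARTS)

print(is_question("How does the approval flow work"))  # True
print(is_question("We deploy every Friday."))          # False
```

The combination strategy from the slide would OR this heuristic with the deep-learning detector's verdict.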

How to filter relevant questions? (Step #2)
TF-IDF can be used to rank questions with domain-specific words.
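The TF-IDF ranking idea can be sketched with the standard library: score each question by the summed TF-IDF weight of its words, so questions using corpus-rare (domain-specific) terms rank above small talk. The questions below are hypothetical:

```python
# Stdlib TF-IDF sketch for ranking questions (questions are invented).
import math
from collections import Counter

def rank_by_tfidf(questions):
    """Return questions sorted by summed TF-IDF score, highest first."""
    docs = [q.lower().split() for q in questions]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency

    def score(doc):
        tf = Counter(doc)
        return sum(count * math.log(n / df[w]) for w, count in tf.items())

    return sorted(questions, key=lambda q: score(q.lower().split()), reverse=True)

questions = [
    "Can you walk me through the invoicing workflow?",
    "Shall we take a break?",
    "How does the tenant onboarding process work?",
]
print(rank_by_tfidf(questions))  # the small-talk question ranks last
```

A real pipeline would compute document frequencies over a larger corpus and likely normalize for question length.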

Do our steps #1 and #2 work? (pre-LLM)
Step #1: Question identification
• Deep learning gives the best results
• Even better when combining the approaches

Approach            | Precision | Recall | F1-Score
Speech Acts (DL)    | 81.8%     | 91.7%  | 86.5%
Part-of-Speech tags | 69.7%     | 77.4%  | 73.4%
Combination         | 76.8%     | 95.8%  | 85.3%

Step #2: Relevance detection
• The combined pipeline achieves an F1-score around 67%
• [back to ECSER] error propagation from Step #1

Approach            | Precision | Recall | F1-Score
Speech Acts (DL)    | 64.4%     | 70.3%  | 67.2%
Part-of-Speech tags | 53.8%     | 62.4%  | 57.8%
Combination         | 55.7%     | 81.7%  | 65.7%

We expect LLMs to improve the results, but this should be assessed rigorously (see ECSER).

Ongoing tool: distilling domain models
ChatGPT 4.0 prompts:
• Guidelines from Blaha and Rumbaugh
• Combine transcripts with the LLM’s own knowledge

Key challenge ahead in Conversational RE?
Lack of metrics and gold standards!

Key Message 4: New avenues unlocked, but…
Conversational RE:
• Opens new avenues for the RE discipline
• LLMs will be an enabler
What are the perils?
• No gold standards
• Unknown metrics
• Rigor is necessary!

6. Wrap-up

Take-home messages
Large language models:
• are here and can do science-fiction stuff
• are changing our job as researchers
• need rigorous reporting (ECSER as an example)
• unlock uncharted territories (e.g., conversational RE)


Thank you for listening! Questions?
[email protected] · @FabianoDalpiaz · fabianodalpiaz
Special credits to:
- F. Başak Aydemir
- Davide Dell’Anna
- Xavier de Bondt
- Tjerk Spijkman
- Sjaak Brinkkemper