Information science research with large language models: between science and fiction
Fabiano Dalpiaz
May 15, 2024
About This Presentation
Large language models (LLMs) are in the spotlight. Laypeople are aware of and are using LLMs such as OpenAI’s ChatGPT and Google’s Gemini on a daily basis. While companies are exploring new business opportunities, researchers have gained access to an unprecedented scientific playground that allows for fast experimentation with limited resources and immediate results. In this talk, using concrete examples from requirements engineering, I am going to put forward several research opportunities that are enabled by the advent of LLMs. I will show how LLMs, as a key example of modern AI, unlock research topics that were deemed too challenging until recently. Then, I will critically discuss the perils that we face when it comes to planning, conducting, and reporting on credible research results following a rigorous scientific approach. This talk will stress the inherent tension between the exciting affordances offered by this new technology, which include the ability to generate non-factual outputs (fiction), and our role and societal responsibility as information scientists.
Slide Content
Information science research with large
language models: between science and fiction
Fabiano Dalpiaz
Requirements Engineering Lab
Utrecht University, the Netherlands
May 15, 2024 · [email protected] · @FabianoDalpiaz · fabianodalpiaz
1. Large Language Models
ChatGPT, depicted by ChatGPT 4.0 + DALL-E
Large Language Models (LLMs) in the news
Various viewpoints on LLMs
LLMs in information science
LLMs in information science research
⚠ LLM use disclaimers?
• “drafted by ChatGPT, rephrased by Quillbot, images by MidJourney, prompts in Appendix A”?
⚠ Legal and ethical implications
⚠ Quoting ≠ paraphrasing
What’s ahead?
• Dedicated conference tracks about LLMs
• Exciting avenues for research!
LLMs in Software Engineering research
A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang. “Large Language Models for Software Engineering: Survey and Open Problems.” arXiv:2310.03533, 2023
ICSE’24 main track
How are YOU using LLMs in YOUR research?
Key Message 1: Accept the Evolution
Large Language Models are here, and they can assist us in science fiction tasks.
They are and will be changing our lives:
• As citizens
• As researchers
• As educators
2. Credibility in (information) science research
IS research in the small: a simplified illustration
Research idea → Conceptual framework → Artifact construction → Validation / evaluation → Paper writing → Peer review → Publication → Literature
Credibility in information science research
“Interesting, this seems a breakthrough. But… how can I trust what the authors claim?” (PhD student Elize)
How do YOU assess the credibility of a paper?
Threats to credibility: the idea
“That idea is wrong in the first place!” (Jim, the reviewer)
Invalid criticism in science!
Threats to credibility: the conceptual framework
“It builds on a rejected theory” · “It proposes a theory that hasn’t been tested yet” (Jim, the reviewer)
Threats to credibility: the constructed artifact
“Simplistic, partially implemented” · “It conflicts with the conceptual framework” (Jim, the reviewer)
Threats to credibility: validation / evaluation
Jim, the reviewer:
• The evaluation is too small
• Mislabeled: is it a case study or an experiment?
• The experimental design is flawed
• Too few subjects
• The research questions are not clear
• The metrics do not match the RQs
• Missing threats to validity
• Wrong statistical tests
• Ethical approval missing
• The source code is not available
• No replication package
• Won’t generalize
• Too small an improvement over the SotA
• …
Threats to credibility: the written paper
“This claim is factually wrong” · “The sentence is ambiguous” (Jim, the reviewer)
Threats to credibility: peer reviewing / publication
“Renowned authors = good?” (Jim, the reviewer)
Threats to credibility: peer reviewing / publication
“Prestigious venue = good? Never heard of this journal = bad?” (Jim, the reader)
Threats to credibility: literature
“We propose tool Z that can be used to classify requirements automatically, distinguishing functional from quality requirements. […] Dalpiaz et al. [22] showed that their ML-based approach has accuracy of 95%. […] The performance of Z is superior to that of Dalpiaz et al. [22].”
Jim, the reader: “I can’t find a link to tool Z…” · “On which dataset was the 95% accuracy obtained?” · “What does it mean for Z to be superior?”
Credibility in research: research methods
Credibility in research: open science badges
• Artifacts evaluated (functional): “work as intended”
• Artifacts evaluated (reusable): functional + very carefully documented + well structured
• Artifacts available: publicly accessible in an archival repository (with a DOI)
• Results reproduced: another team obtained the same results with the artifacts provided by the original authors
• Results replicated: another team obtained the same results without the author-supplied artifacts
https://www.acm.org/publications/policies/artifact-review-and-badging-current
Problem solved? How about LLMs being USED in the research cycle?
LLMs are already being used! (a few examples)
• Literature review generator: jenni.ai
• Originality checker: originality.ai
• Writing assistant: quillbot.com
• The one-size-fits-all ChatGPT
• Code generation: Copilot
Will the use of LLMs affect research CREDIBILITY?
Key Message 2: Responsibility as Information Scientists
LLMs in IS Research:
• Can be used for many tasks
• We are using them!
What is up to us?
• Deliver research that can be trusted
• Discern credible results
3. Deep dive on NLP tools in
Requirements Engineering (NLP4RE)
Background theory: Refinement in RE
K. Pohl. “The three dimensions of requirements engineering: a framework and its applications.” Information Systems 19(3), 1994: 243-258.
[Figure: Pohl’s three dimensions of RE. Specification ranges from opaque via fair to complete; Representation from informal via semi-formal to formal; Agreement from a personal view to a common view. The refinement path in practice leads from the initial RE input to the desired RE output, an area addressed by RE research, including NLP4RE tools.]
How do NLP4RE tools work?
Processing text is a task for which LLMs are particularly suitable!
Four categories of NLP4RE tools
1. Find defects / deviations from good practice
2. Generate models from NL requirements
3. Infer trace links between NL requirements and other artifacts
4. Identify key abstractions from NL documents
D. M. Berry, R. Gacitua, P. Sawyer, and S. F. Tjong. “The case for dumb requirements engineering tools.” In Proceedings of REFSQ, pp. 211-217, 2012.
Tools in NLP4RE (2021-2022, before LLMs)
L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, and R. T. Batista-Navarro. Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study. ACM Computing Surveys 54(3), 2022
Case: F/Q Requirements Classification
• Seminal classification problem that aims at identifying NFRs (or Qualities)
• Two classes: Functional and Quality
• Dozens of tools in the literature
• Keyword-based, ML & DL classifiers, zero- and few-shot learning…
Automated classification via ML
Labeled dataset D:
  Item   Labels
  Req 1  F
  Req 2  F
  Req 3  Q
  Req 4  Q
  Req 5  F, Q
  …

A classification algorithm:
1. Builds a model M that describes the items in D accurately.
2. Given an unseen, unlabeled dataset D’, predicts (accurately) the labels of the items in D’:

  Item     Predicted  Real
  Req XX   F          F
  Req XY   Q          F
  Req XZ   F, Q       F, Q
  Req YZ   F          Q
  Req XYX  F          F
  …

(A minimal code sketch follows below.)
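To make this setup concrete, here is a minimal sketch of a supervised requirements classifier; the tiny inline dataset, the TF-IDF features, and the logistic-regression model are illustrative assumptions, not the tools discussed in the talk.

```python
# Minimal sketch of ML-based requirements classification (illustrative only:
# the dataset and model choices are assumptions, not the talk's tools).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled dataset D: requirement text -> 1 = Quality (Q), 0 = Functional (F)
train_texts = [
    "The system shall encrypt all stored passwords.",
    "The user can export reports as PDF.",
    "Search results shall be returned within 2 seconds.",
    "The administrator can create new user accounts.",
]
train_labels = [1, 0, 1, 0]

# 1. Build a model M that describes the items in D.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

# 2. Predict the labels of an unseen, unlabeled dataset D'.
unseen = ["The app must remain responsive under 1,000 concurrent users."]
print(model.predict(unseen))  # e.g., [1] -> predicted Quality
```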
An example of classification in NLP4RE
Feature engineering is key, as it determines which information the classifier should combine to construct the model.
Classification with LLMs
• No feature engineering needed!
• Immediate results via prompting:
  • Zero-shot learning
  • Few-shot learning (a few labelled examples in the prompt)
• Better results via fine-tuning:
  • Re-train the LLM with a labelled dataset
  • Combines the LLM’s knowledge with the domain-specific task
Fine-tuning pipeline: a pre-trained LLM (trained on an XXL general-purpose dataset) is fine-tuned on a domain-specific, labelled dataset, yielding a fine-tuned LLM. (A prompting sketch follows below.)
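To make zero-/few-shot prompting concrete, here is a minimal sketch of a few-shot classification prompt; the call_llm function is a hypothetical stand-in for an actual LLM API, and the example requirements are invented.

```python
# Illustrative few-shot prompt for F/Q classification. `call_llm` is a
# hypothetical placeholder, not a real API; the examples are invented.
def build_prompt(requirement: str) -> str:
    return (
        "Classify each requirement as Functional (F) or Quality (Q).\n\n"
        "Requirement: The user can reset their password via email.\n"
        "Label: F\n\n"
        "Requirement: The system shall be available 99.9% of the time.\n"
        "Label: Q\n\n"
        f"Requirement: {requirement}\n"
        "Label:"
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in the LLM API of your choice here.")

# label = call_llm(build_prompt("Search shall return results within 2 seconds."))
```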
Credible research?
Iris, the requirements analyst: “I need to find quality requirements in 3,000+ requirements from 10 projects… This paper does it automatically with great results! But will I obtain the same performance on my unlabeled data?”
4. Are the classifier’s results credible?
The ECSER pipeline
Evaluating Classifiers in SE Research (ECSER)
• ECSER focuses on Treatment Validation
  • Treatment = a classifier
• Two macro phases
• Treatment design is beyond the scope of ECSER
D. Dell'Anna, F. B. Aydemir, F. Dalpiaz: Evaluating classifiers in SE research: The ECSER pipeline and two replication studies. Empirical Software Engineering 28(1): 3 (2023)
ECSER’s highlight #1: data and models
[Figure: the data is split into training, validation, and test sets (used in step S5).]
ECSER’s highlight #2: p-fold cross-validation
• In SE, data originates from different projects
• p-fold cross-validation extends k-fold cross-validation with per-project splits (as opposed to random splits):
  1. Given a set P of projects, take a subset S ⊂ P to train a model
  2. Test the model on the remaining P \ S
  3. Take another subset S’ of the same size as S
  4. Train the model on S’
  5. Test the model on P \ S’
  6. …
(A minimal sketch follows below.)
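A minimal sketch of how per-project splits differ from random splits; the project names and the exhaustive enumeration of subsets are illustrative assumptions.

```python
# Sketch of p-fold (per-project) cross-validation: train on a subset S of
# projects, test on P \ S. Project names and enumeration are illustrative.
from itertools import combinations

def p_fold_splits(projects, train_size):
    """Yield (train_projects, test_projects) pairs over per-project splits."""
    P = set(projects)
    for S in combinations(sorted(P), train_size):
        yield set(S), P - set(S)

projects = ["ProjectA", "ProjectB", "ProjectC", "ProjectD", "ProjectE"]
for train_p, test_p in p_fold_splits(projects, train_size=3):
    # Train the classifier on requirements from train_p and evaluate it on
    # requirements from test_p; finally, average the per-fold scores.
    pass
```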
ECSER’s highlight #3: the confusion matrix
• It provides transparency: it allows one to derive all metrics and to inspect the results (a sketch follows below)
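This is exactly why reporting the raw matrix matters: every metric can be recomputed from it. A minimal sketch with invented counts:

```python
# Sketch: deriving standard metrics from a binary confusion matrix.
# The counts are invented for illustration.
tp, fp, fn, tn = 80, 12, 20, 88

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"F1={f1:.2f} accuracy={accuracy:.2f}")
```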
ECSER’s highlight #4: overfitting and degradation
• Two metrics analyze performance differences across the training, validation, and test splits:
  • Overfitting = Test − Training
  • Degradation = Test − Validation
(A minimal sketch follows below.)
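A minimal sketch of the two indicators, assuming F1 is the performance metric; the scores are invented for illustration.

```python
# Sketch: ECSER's overfitting and degradation indicators from per-split
# F1 scores (the numbers are invented for illustration).
f1_training, f1_validation, f1_test = 0.95, 0.82, 0.78

overfitting = f1_test - f1_training    # large negative gap: the model overfits
degradation = f1_test - f1_validation  # drop when moving to unseen test data

print(f"overfitting={overfitting:+.2f} degradation={degradation:+.2f}")
```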
Credible research?
Iris, the requirements analyst: “I need to find quality requirements in 3,000+ requirements from 10 projects… This paper does it automatically with great results! But will I obtain the same performance on my unlabeled data?” Luckily, someone applied ECSER!
Study design
S1. Evaluation method and data splitting
• Most of the literature uses PROMISE NFR
  • 625 requirements that pertain to 15 student projects
  • Generally, the studies only perform validation, no testing
• Our choices:
  • Three algorithms (see previous slide)
  • No hyper-parameter tuning (validation, S3-S4)
  • Two binary classifiers: isFunctional and isQuality
  • Data split into training, validation, and test sets
S2 & S5. Training and testing the model
• Training is performed on PROMISE NFR
• Testing is performed on the remaining datasets
  • Test on Dronology, then test on DUAP, …
  • Calculate the arithmetic mean
S6. Reporting the confusion matrix
• This is simply a presentation of the raw results…
• But some aspects already stand out!
S7-S8. Performance and overfitting
• For simplicity, let’s examine F1 here
• km500 fits the training set best
• norbert has the best performance on the test set
• ling17 has the smallest overfitting
S9. ROC Plot (for isFunctional)
• norbert is the best for most projects
• ling17 tends to lead to more false positives
• km500 tends to lead to more false negatives
S10. Statistical tests
• Is one of these classifiers significantly better?
• The results are mixed:
  • Yes, for km500 vs. norbert in the isFunctional case
  • Almost never for isQuality
(A statistical-test sketch follows below.)
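As one possible instantiation of such a test (not necessarily the one used in the paper), McNemar's test compares two classifiers' correctness on the same test items:

```python
# Sketch: comparing two classifiers on the same items with McNemar's test.
# This is one possible choice of test; the correctness vectors are invented.
from statsmodels.stats.contingency_tables import mcnemar

# Per-item correctness (1 = correct) of classifiers A and B on one test set.
a_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b_correct = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# 2x2 table: rows = A correct/wrong, columns = B correct/wrong.
table = [[0, 0], [0, 0]]
for a, b in zip(a_correct, b_correct):
    table[1 - a][1 - b] += 1

result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.3f}")
```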
Results from the first application of ECSER
• We confirm that norbert outperforms both ling17 and km500 on unseen data
  • But not in a statistical sense (small sample size?)
• The “losers” still have good properties:
  • ling17 has the smallest overfitting
  • km500 fits the training data best
Credible research? Under certain assumptions
F. Dalpiaz, D. Dell'Anna, F. B. Aydemir, S. Çevikol: Requirements Classification with Interpretable Machine Learning and Dependency Parsing. RE 2019: 142-152
Iris, the requirements analyst: “Will I obtain the same performance on my unlabeled data? Only if my data resembles PROMISE!”
Key Message 3: Assess your results properly!
The ECSER pipeline:
• Provides guidelines for evaluating classifiers
• Is a step-by-step tool
ECSER’s application:
• Confirms some results
• Clarifies and confutes others
5. Future Avenue: LLMs in
Requirements Engineering
LLM-Assisted RE: YOUR Vision
LLM-Assisted RE: A Vision
RE version 1.1
• Non-disruptive improvements in all activities where currently some automation takes place:
  • Classification
  • Model derivation
  • Defect identification
  • Traceability
RE version 2.0
• Key focus on elicitation
• Breakthrough: automated analysis of conversations
• RE is mainly a human-centered activity
Elicitation: the root of (all) NL requirements
[Figure: during elicitation, the requirements analyst draws on requirements conversations, own ideas, budget / project constraints, design decisions, and domain-specific documentation to produce the specification.]
Timeliness: why researching conversations now?
• Increased remote work and collaboration
• Automated transcription
(Requirements) conversations vs. specifications
• 2+ parties (here Analyst and Stakeholder)
• Informal: no “shall” statements, user stories, glossary
• Relevant information may be sparse
• Includes persuasion, uncertainty, misunderstandings
The many layers of (requirements) conversations
• Turns and utterance units as atomic entities
• Cross-speaker interaction defines the meaning
• The purpose of a conversation spans multiple turns
D. R. Traum and E. A. Hinkelman. “Conversation acts in task-oriented spoken dialogue.” Computational Intelligence 8(3), 1992: 575-599.
Tools for Conversational RE: Two Examples
Tjerk Spijkman, Fabiano Dalpiaz, and Sjaak Brinkkemper. “Back to the Roots: Linking User Stories to Requirements Elicitation Conversations.” Proceedings of RE 2022.
Tjerk Spijkman, Xavier de Bondt, Fabiano Dalpiaz, and Sjaak Brinkkemper. “Summarization of Elicitation Conversations to Locate Requirements-Relevant Information.” Proceedings of REFSQ 2023.
Trace2Conv: Key Idea
[Figure: the elicitation diagram revisited, with Trace2Conv linking the specification back to the requirements conversations.]
• Supports backward, pre-RS traceability
  • A largely overlooked area of research
• Aims to find information that provides additional context to a requirement
Trace2Conv pre-LLMs
As a vendor user, I can use the password forgotten
functionality whenever I forgot or want to reset my
password, so that I always have a way to create a new
password
Short demo of Trace2Conv
Trace2Conv with LLMs
Expectations:
• Complex pre-processing will be unnecessary
• Simple prompts will be able to match requirements to speaker turns well
Limitations:
• The limit on the number of tokens (see the sketch below)
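A minimal sketch of working around the token limit by chunking the transcript before prompting; the word-based token estimate, the budget, and ask_llm are illustrative assumptions, not the Trace2Conv implementation.

```python
# Sketch: chunking a long transcript so that each prompt fits a token budget.
# The word-count estimate and `ask_llm` are illustrative assumptions.
def chunk_turns(turns, max_tokens=3000):
    """Group speaker turns into chunks whose estimated size fits the budget."""
    chunks, chunk, size = [], [], 0
    for turn in turns:
        t = len(turn.split())  # crude estimate: one token per word
        if chunk and size + t > max_tokens:
            chunks.append(chunk)
            chunk, size = [], 0
        chunk.append(turn)
        size += t
    if chunk:
        chunks.append(chunk)
    return chunks

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in an LLM API here.")

# For each chunk, one could ask the LLM which turns relate to a requirement:
# for chunk in chunk_turns(transcript_turns):
#     ask_llm("Requirement: " + req + "\nTurns:\n" + "\n".join(chunk))
```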
• Trigger: long recorded conversations, spanning multiple hours
• Can we facilitate the analyst in exploring the transcript by summarizing it?
Summarizing a transcript: ReConSum
• Step #1: Identify the questions
• Step #2: Filter by question relevance
• Step #3: Label by relevance type
How to identify the questions? (Step #1)
• Based on sequences of POS tags: wh-, yes/no, tag questions
• Based on pre-trained DistilBERT (deep learning)
• Combination: it is a question if either approach says so
(A minimal sketch follows below.)
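A minimal sketch of the POS-tag flavor of question detection using NLTK; the specific tag patterns are simplified assumptions, not the tool's actual rules.

```python
# Sketch of rule-based question detection over POS tags (simplified;
# the patterns are assumptions, not ReConSum's actual rules).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

WH_TAGS = {"WDT", "WP", "WP$", "WRB"}  # which, who, whose, where/when/how

def looks_like_question(utterance: str) -> bool:
    tokens = nltk.word_tokenize(utterance)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    if not tags:
        return False
    wh_question = tags[0] in WH_TAGS                      # "Where do we store…?"
    yes_no = tags[0] == "MD" or tags[0].startswith("VB")  # "Can the user…?"
    return wh_question or yes_no or utterance.rstrip().endswith("?")

print(looks_like_question("Can the user export the report?"))  # True
```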
How to filter relevant questions? (Step #2)
• TF-IDF can be used to rank questions with domain-specific words (see the sketch below)
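A minimal sketch of TF-IDF-based ranking; the question corpus and the sum-of-weights scoring are illustrative assumptions.

```python
# Sketch: ranking questions by the TF-IDF weight of their words
# (illustrative; in practice the vectorizer would be fitted on a larger
# domain corpus, not on three questions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "Shall we order lunch after this?",
    "How should the system validate a vendor's VAT number?",
    "Which roles can approve a purchase order?",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(questions)

# Score each question by the sum of its TF-IDF weights; rank descending.
scores = np.asarray(tfidf.sum(axis=1)).ravel()
for score, q in sorted(zip(scores, questions), reverse=True):
    print(f"{score:.2f}  {q}")
```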
Do our steps #1 and #2 work? (pre-LLM)
Step #1: Question identification
• Deep learning gives the best results
• Even better recall when combining the approaches

  Approach             Precision  Recall  F1-score
  Speech Acts (DL)     81.8%      91.7%   86.5%
  Part-of-Speech tags  69.7%      77.4%   73.4%
  Combination          76.8%      95.8%   85.3%

Step #2: Relevance detection
• The combined pipeline achieves an F1-score of around 67%
• [back to ECSER] error propagation from step #1

  Approach             Precision  Recall  F1-score
  Speech Acts (DL)     64.4%      70.3%   67.2%
  Part-of-Speech tags  53.8%      62.4%   57.8%
  Combination          55.7%      81.7%   65.7%

We expect LLMs to improve the results, but this should be assessed rigorously (see ECSER).
Ongoing tool: distilling domain models
ChatGPT 4.0 prompts:
• Guidelines from Blaha and Rumbaugh
• Combine transcripts with the LLM’s own knowledge
(An illustrative prompt follows below.)
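An illustrative prompt in this spirit; the wording and the guideline paraphrase are assumptions, not the authors' actual prompt.

```python
# Illustrative prompt for distilling a domain model from a transcript
# (an assumption in the spirit of the slide, not the authors' actual prompt).
PROMPT_TEMPLATE = """You are a requirements analyst. From the elicitation
transcript below, derive a UML domain model, following Blaha & Rumbaugh's
modeling guidelines. List:
1. Candidate classes with their attributes
2. Associations between classes, with multiplicities
Complement the transcript with general domain knowledge where needed, and
mark such additions explicitly.

Transcript:
{transcript}
"""

# prompt = PROMPT_TEMPLATE.format(transcript=open("meeting.txt").read())
```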
Key challenge ahead in Conversational RE?
Lack of metrics and gold standards!
Key Message 4: New avenues unlocked, but…
Conversational RE:
• Opens new avenues for the RE discipline
• LLMs will be an enabler
What are the perils?
• No gold standards
• Unknown metrics
• Rigor is necessary!
6. Wrap-up
Take-home messages
Large language models:
• are here and can do science fiction stuff
• are changing our job as researchers
• need rigorous reporting (ECSER as an example)
• unlock uncharted territories (e.g., conversational RE)
Thank you for listening! Questions? [email protected] · @FabianoDalpiaz · fabianodalpiaz
Special credits to
-F. Başak Aydemir
-Davide Dell’Anna
-Xavier de Bondt
-Tjerk Spijkman
-Sjaak Brinkkemper