Mutation Testing for Task-Oriented Chatbots

PabloGmezAbajo 201 views 44 slides Jun 20, 2024

About This Presentation

Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtainin...


Slide Content

www.uam.es
Mutation testing
for task-oriented chatbots
Pablo Gómez-Abajo, Sara Pérez-Soler, Pablo C. Cañizares,
Esther Guerra, Juan de Lara
{Pablo.GomezA, Sara.PerezS, Pablo.Cerro, Esther.Guerra, Juan.deLara}@uam.es
Modelling & Software Engineering Research Group
Universidad Autónoma de Madrid, Spain
18th–21st June 2024

Motivation
•Conversational agents or chatbots are increasingly used to access all sorts of services using natural language
•Like any other software, chatbots need to be tested
  •Usually by defining test scenarios
•However
  •There is currently a lack of methods to assess the quality of such test scenarios
  •The result is a high risk of buggy chatbots

What is a task-oriented chatbot?
•A task-oriented chatbot is a software application that users interact with in natural language and that is designed to solve a specific task
•e.g., booking a ticket, ordering a pizza, setting a medical appointment
•Via text or speech recognition
•In recent years, the use of chatbots has increased
…and many more
•Since 2022, we also have open-domain chatbots (ChatGPT, etc.) which engage in conversations on any topic, and which we do not cover in this work

How do chatbots work?
[Figure: chatbot workflow. The user sends a natural language (NL) phrase to the chatbot, which (1) receives the phrase, (2) matches it against its intents (intent 1 … intent n), (3) extracts the parameters, and (4) builds the chatbot response, possibly querying an external service.]
1. The user sends a natural language message to the chatbot
Utterances (user says):
Hi there!
I need to fly from Madrid to Salerno on Wednesday at 12 PM
Good bye!

2. The chatbot tries to match the message with an intention

Intent: matches the user interaction with an intention

User says                                                     Intent
Hi there!                                                     Greet
I need to fly from Madrid to Salerno on Wednesday at 12 PM    Book a flight

How is the intent matched? By providing training phrases: a set of examples that users can use to express an intention. They are required for matching inputs with intents.

Training phrases: a set of examples that users can use to express an intention
●Must be provided with the intent

Training phrase                                         Intent
Hi there!                                               Greet
Hello                                                   Greet
Hi                                                      Greet
Hey                                                     Greet
Airplane ticket from Madrid to Rome tomorrow at 1 pm    Book a flight
Flight from Madrid to Napoli on 17/06/2024 at 11:30     Book a flight
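
The matching of an utterance against training phrases can be sketched as follows. This is only an illustration: real platforms such as Dialogflow or Rasa use trained language-understanding models, whereas this sketch assumes a simple word-overlap (Jaccard) score. The names TRAINING, match_intent and the 0.2 threshold are hypothetical and do not come from the slides.

```python
import re

# Training phrases per intent, copied from the slides.
TRAINING = {
    "Greet": ["Hi there!", "Hello", "Hi", "Hey"],
    "Book a flight": [
        "Airplane ticket from Madrid to Rome tomorrow at 1 pm",
        "Flight from Madrid to Napoli on 17/06/2024 at 11:30",
    ],
}


def tokens(text):
    """Lowercased word set; a crude stand-in for a real NLU featurizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity in [0, 1] (assumed metric, not the tool's)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_intent(utterance, training=TRAINING, threshold=0.2):
    """Return the intent with the closest training phrase, or None (fallback)."""
    best_intent, best_score = None, 0.0
    for intent, phrases in training.items():
        score = max(jaccard(utterance, phrase) for phrase in phrases)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


print(match_intent("Hi there!"))
print(match_intent("I need to fly from Madrid to Salerno on Wednesday at 12 PM"))
print(match_intent("Good bye!"))
```

Because matching depends entirely on the training phrases, deleting or moving a phrase can change which intent is selected, which is what the training-phrase mutation operators exploit.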

3. The chatbot extracts information from the message or asks for missing information

User says: I need to fly from Madrid to Salerno on Wednesday at 12 PM
Intent: Book a flight

At this point, the chatbot extracts key information from the input: parameters
from: Madrid    to: Salerno    when: Wed. at 12 PM
The parameters are typed by entities: City (from, to) and Time (when)
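
A minimal sketch of this extraction step, assuming the City entity is a list of literals and the Time entity is a regular expression. The literals, the pattern and the function names are illustrative assumptions, not the definitions used by any of the chatbots in the study.

```python
import re

# Assumed City entity: a small set of literals taken from the slide examples.
CITY_ENTITY = {"Madrid", "Salerno", "Rome", "Napoli"}

# Assumed Time entity: "<weekday> at <hour>[:<minutes>] [AM|PM]".
DAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
TIME_ENTITY = re.compile(
    r"\b(?:" + DAYS + r")\s+at\s+\d{1,2}(?::\d{2})?\s*(?:AM|PM)?",
    re.IGNORECASE,
)

REQUIRED = ("from", "to", "when")


def first_city_after(keyword, utterance):
    """Return the first City literal that directly follows the keyword."""
    pattern = r"\b" + keyword + r"\s+(\w+)"
    for word in re.findall(pattern, utterance, re.IGNORECASE):
        if word in CITY_ENTITY:
            return word
    return None


def extract_params(utterance):
    """Extract the from/to/when parameters that can be found in the utterance."""
    params = {}
    for name in ("from", "to"):
        city = first_city_after(name, utterance)
        if city is not None:
            params[name] = city
    match = TIME_ENTITY.search(utterance)
    if match:
        params["when"] = match.group(0).strip()
    return params


def missing_params(params):
    """Required parameters the chatbot would still have to ask for."""
    return [name for name in REQUIRED if name not in params]


params = extract_params("I need to fly from Madrid to Salerno on Wednesday at 12 PM")
print(params, missing_params(params))
```

In this sketch, a missing required parameter is what would trigger a prompt to the user.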

4. The chatbot builds the response and sends it back to the user
●Responses to the user:
○text, images
●External service queries
○External REST API
○Database, etc.

User says: I need to fly from Madrid to Salerno on Wednesday at 12 PM
Action: The price of the ticket is 150$. Provide a card nº and billing name

Both user responses and external service queries are actions

Testing chatbots
Test case input    Test case output
Hi there!          Hi! How can I help you?

Test cases: complete conversations

We use Botium and Rasa-test as the test suites to test the chatbots

Convo file (conversation step), single test interaction:
#me
Hi there!
#bot
What day do you want to come in?

Convo file, combination of multiple tests:
#me
GREET_UTTERANCES_USER
#bot
GREET_RESPONSES_USER

Utterances (multiple user utterances):
GREET_UTTERANCES_USER
Hi there!
Hi
Hello
Hey

Responses (possible responses):
GREET_RESPONSES_USER
Hi! How can I help you?
Hello, what do you need?
Greetings! This is the flight ticket assistant Antony, how can i help you?
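
How a convo file with utterance references expands into several concrete tests can be sketched in plain Python. This does not use the Botium API; expand_convo, run_convo and the data layout are hypothetical names chosen for illustration, and the chatbot is assumed to be a function from a user message to a reply.

```python
from itertools import product

# Utterance lists copied from the slide; the names act as references.
UTTERANCES = {
    "GREET_UTTERANCES_USER": ["Hi there!", "Hi", "Hello", "Hey"],
    "GREET_RESPONSES_USER": [
        "Hi! How can I help you?",
        "Hello, what do you need?",
        "Greetings! This is the flight ticket assistant Antony, how can i help you?",
    ],
}

# The convo from the slide: one user step followed by one bot step.
CONVO = [("me", "GREET_UTTERANCES_USER"), ("bot", "GREET_RESPONSES_USER")]


def resolve(text, utterances):
    """A reference expands to its list; any other text stands for itself."""
    return utterances.get(text, [text])


def expand_convo(convo, utterances):
    """Yield one concrete test per combination of user utterances."""
    me_options = [resolve(text, utterances) for sender, text in convo if sender == "me"]
    for combo in product(*me_options):
        user_texts = iter(combo)
        yield [
            (sender, next(user_texts) if sender == "me" else resolve(text, utterances))
            for sender, text in convo
        ]


def run_convo(convo, utterances, bot):
    """Return the expanded tests whose bot reply is not an accepted response."""
    failures = []
    for case in expand_convo(convo, utterances):
        reply = None
        for sender, content in case:
            if sender == "me":
                reply = bot(content)
            elif reply not in content:
                failures.append(case)
                break
    return failures


print(len(list(expand_convo(CONVO, UTTERANCES))))
```

With the four greeting utterances above, the single convo expands into four tests, each accepting any of the three listed responses.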

… and complex conversations
User: Hi there!
Chatbot: Hi! How can I help you?
User: I need to fly from …
Chatbot: The price of the ticket …
User: I lost my baggage
Chatbot: Please, provide the flight ticket id

Mutation testing for chatbots
Example chatbot with two intents:

Intent: Order a coffee
User says:
What kinds of coffee are available?
What kinds of coffee can I order?
What can I drink here?
Tell me what drinks there are
Action: You can take an espresso or an americano

Intent: Order a wine
User says:
What kinds of wine are available?
What kinds of wine can I order?
What can I drink here?
Tell me what drinks there are
Action: You can take an Italian wine or a French wine

User: Tell me what kinds of coffee I can drink here
Intent matched: Order a coffee

Mutation of Order a coffee: keeps the two most different phrases
Semantic similarity scores shown for the training phrases: 0.512, 0.538, 0.475, 0.474

Resulting mutant:

Intent: Order a coffee
User says:
What can I drink here?
Tell me what drinks there are
Action: You can take an espresso or an americano

Intent: Order a wine
User says:
What kinds of wine are available?
What kinds of wine can I order?
What can I drink here?
Tell me what drinks there are
Action: You can take an Italian wine or a French wine

Test suite: the user says "Tell me what kinds of coffee I can drink here" and the intent matched by the mutant (Order a coffee or Order a wine) is checked
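
The "keeps the two most different phrases" mutation can be sketched as below. The sketch assumes that the four scores on the slide belong, in the order listed, to the four training phrases of Order a coffee, and that a lower score means a more different phrase; how the semantic similarity is actually computed is not shown in the slides. The function names are hypothetical.

```python
# (phrase, similarity score) pairs; the pairing by position is an assumption.
ORDER_A_COFFEE = [
    ("What kinds of coffee are available?", 0.512),
    ("What kinds of coffee can I order?", 0.538),
    ("What can I drink here?", 0.475),
    ("Tell me what drinks there are", 0.474),
]


def keep_two(phrases_with_scores, most_different=True):
    """Keep two phrases, preserving their original order.

    most_different=True keeps the two lowest-scored phrases (K2Pmin-like),
    most_different=False keeps the two highest-scored ones (K2Pmax-like).
    """
    ranked = sorted(phrases_with_scores, key=lambda pair: pair[1],
                    reverse=not most_different)
    kept = {phrase for phrase, _ in ranked[:2]}
    return [phrase for phrase, _ in phrases_with_scores if phrase in kept]


print(keep_two(ORDER_A_COFFEE))
print(keep_two(ORDER_A_COFFEE, most_different=False))
```

Under these assumptions the mutant keeps "What can I drink here?" and "Tell me what drinks there are", which are exactly the two phrases the intent shares with Order a wine.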

Mutation operators for chatbots
Emulation of common errors of chatbot developers

Operators for training phrases
DPmax     Deletes the most representative phrase of an intent
DPmin     Deletes the most different phrase of an intent
DPWP      Deletes training phrases with required parameter
DPWL      Deletes training phrases with literal
K2Pmax    Keeps the 2 most representative phrases
K2Pmin    Keeps the 2 most different phrases
MPmax     Moves the most representative phrase to the most similar intent
MPmin     Moves the most different phrase to the most different intent

Operators for intents
DIP       Deletes intent parameter
DPP       Deletes parameter prompt
SPO       Sets required parameter to optional
DFI       Deletes fallback intent

Operators for entities
CRE       Changes regular expression
DLE       Deletes literal from entity

Operators for actions
DA        Deletes actions
DPR       Deletes a parameter used in a response
SO        Swaps outputs

Operators for conversation flows
DCS       Deletes conversation step
DCB       Deletes conversation bifurcation
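
To make the catalogue more concrete, three of the operators (SPO, DFI and DLE) are sketched below over a toy chatbot model held in a Python dictionary. The dictionary layout and the function names are assumptions made for illustration; they are not the representation used by the authors' tooling.

```python
import copy

# Toy chatbot model; this structure is hypothetical.
MODEL = {
    "intents": {
        "Book a flight": {
            "fallback": False,
            "parameters": {
                "from": {"entity": "City", "required": True},
                "to": {"entity": "City", "required": True},
            },
        },
        "Fallback": {"fallback": True, "parameters": {}},
    },
    "entities": {"City": ["Madrid", "Salerno", "Rome"]},
}


def spo(model, intent, param):
    """SPO: set a required parameter to optional."""
    mutant = copy.deepcopy(model)
    mutant["intents"][intent]["parameters"][param]["required"] = False
    return mutant


def dfi(model):
    """DFI: delete the fallback intent."""
    mutant = copy.deepcopy(model)
    mutant["intents"] = {
        name: intent
        for name, intent in mutant["intents"].items()
        if not intent["fallback"]
    }
    return mutant


def dle(model, entity, literal):
    """DLE: delete a literal from an entity."""
    mutant = copy.deepcopy(model)
    mutant["entities"][entity].remove(literal)
    return mutant


mutants = [spo(MODEL, "Book a flight", "from"), dfi(MODEL), dle(MODEL, "City", "Rome")]
print(len(mutants))
```

Each function returns a modified deep copy, so every mutant differs from the original model by a single small change and the original stays intact for the next operator.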

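Mutation testing for chatbots
[Figure: architecture of the WODEL-TEST tool. (1) Parse: the Dialogflow chatbot model is parsed into a chatbot model that conforms to the CONGA meta-model. (2) Annotate: the model is annotated using Tensorflow, giving an annotated chatbot model that conforms to the annotation meta-model. (3) Mutate: the mutation operators (WODEL) produce chatbot model mutants. (4) Generate: a chatbot implementation is generated from each mutant. (5) Test: the test suites are run against the chatbot implementations, producing the mutation analysis report.]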
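
The final test step and the resulting report can be sketched as a loop over the mutants. The sketch assumes each mutant is available as a function from a user message to a reply, and that a mutant is killed as soon as one test gets a reply outside its accepted responses. It uses the standard definition of mutation score (killed mutants divided by all mutants), without discounting equivalent mutants. All names are hypothetical.

```python
def mutation_analysis(mutants, test_suite):
    """Map each mutant name to 'killed' or 'alive'.

    mutants: dict of name -> function(user message) -> reply
    test_suite: list of (user message, set of accepted replies)
    """
    report = {}
    for name, bot in mutants.items():
        killed = any(bot(message) not in accepted for message, accepted in test_suite)
        report[name] = "killed" if killed else "alive"
    return report


def mutation_score(report):
    """Fraction of mutants killed by the test suite."""
    killed = sum(1 for status in report.values() if status == "killed")
    return killed / len(report)


# Toy data: one mutant still answers correctly, the other lost its response.
TEST_SUITE = [("Hi there!", {"Hi! How can I help you?", "Hello, what do you need?"})]
MUTANTS = {
    "mutant-1": lambda message: "Hi! How can I help you?",
    "mutant-2": lambda message: "",
}

report = mutation_analysis(MUTANTS, TEST_SUITE)
print(report, mutation_score(report))
```

A mutant that stays alive points at behaviour the test suite does not check, which is how the report indicates where a test suite can be improved.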

RQ1: How applicable are the defined mutation operators?
RQ2: How effective are the defined mutation operators?
[Figure: mutation score by mutation operator (killed vs. alive mutants). Bar labels as extracted: 39%, 48%, 67%, 60%, 77%, 73%, 78%, 80%, 67%, 0%, 0%, 40%, 50%, 76%, 14%, 89%, 87%, 96%]

RQ3: How effective is the mutation testing process?
[Figure: mutation score by test suite kind (killed vs. alive mutants). Test suite kinds: Botium automatic, Botium by hand, Rasa test. Bar labels as extracted: 45%, 94%, 20%]

RQ4: How efficient is the mutation testing process?
[Figure: bar chart per chatbot, vertical axis from 0% to 35%. Chatbots: Covid19_tracer, bikeShop, e2e-bot, Spaceonova, personal-bot, yassinelamarti, Rasa-demo, 256644, h4h-chatbot, diagrams2ai, dusbot, legal-alien-chatbot, Email-WhatsApp-Integration, lankbanfinance, Data-mining. Bar labels as extracted: 0.1%, 0.2%, 0.3%, 1.0%, 1.2%, 1.4%, 1.6%, 1.6%, 1.7%, 2.6%, 4.9%, 8.4%, 12.8%, 27.5%, 34.7%]
The mutation testing process of 67% of the chatbots was completed in less than 90 minutes

Conclusions
•Technology-independent approach for mutation testing (MuT) of chatbots with
  •A catalogue of 19 mutation operators for training phrases, intents, entities, chatbot actions and conversation flows
  •Support for test scenarios from Botium and Rasa-test
•Experiment with 15 chatbots and 29 test suites
  •Positive results regarding applicability, effectiveness and efficiency
  •Room for improvement in 86% of the test suites
•Running times of MuT for chatbots are costly but acceptable
  •Less than 90 minutes for 67% of the chatbots

Future work
•Automate the detection of semantically equivalent mutants
•e.g., using confidence decrease heuristics
•Automate the synthesis of tests able to kill the alive mutants
•Adapt our approach to LLM-based agents

Thank you!
Wodel-Test
Dataset
Tool demo