Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
Added: Jun 20, 2024
www.uam.es
Mutation testing
for task-oriented chatbots
Pablo Gómez-Abajo, Sara Pérez-Soler, Pablo C. Cañizares,
Esther Guerra, Juan de Lara
{Pablo.GomezA, Sara.PerezS, Pablo.Cerro, Esther.Guerra, Juan.deLara}@uam.es
Modelling & Software Engineering Research Group
Universidad Autónoma de Madrid, Spain
18th–21st June 2024
Motivation
•Conversational agents or chatbots are increasingly used to access all sorts of services using natural language
•Like any other software, chatbots need to be tested
•Usually by defining test scenarios
•However
•There is currently a lack of methods to assess the quality of such
test scenarios
•The result is a high risk of buggy chatbots
2/25
What is a task-oriented chatbot?
•A task-oriented chatbot is a software application that users interact with in natural language and that is designed to solve a specific task
•e.g., booking a ticket, ordering a pizza, setting a medical appointment
•Via text or speech recognition
•In recent years, the use of chatbots has increased
…and many more
•Since 2022, we also have open-domain chatbots (ChatGPT, etc.) which engage in conversations
on any topic, and which we do not cover in this work
3/25
How do chatbots work?
5/25
[Diagram: (1) the user sends an NL phrase to the chatbot; (2) the chatbot matches the phrase against its intents (intent 1 … intent i … intent n); (3) it extracts params; (4) it builds the response, possibly calling an external service, and returns the chatbot response to the user]
How do chatbots work?
6/25
1. The user sends a natural language message to the chatbot
Utterances (user says)
Hi there!
I need to fly from Madrid to Salerno on Wednesday at 12 PM
Good bye!
How do chatbots work?
7/25
??
Intention?
1. The user sends a natural language
message to the chatbot
2. The chatbot tries to match the
message with an intention
How do chatbots work?
8/25
Intent: Match the user interaction with an intention
User says    Intent
Hi there!    Greet
[Figure: the user message "Hi there!" reaches the chatbot and an intent is matched]
How do chatbots work?
9/25
Intent: Match the user interaction with an intention
User says                                                     Intent
I need to fly from Madrid to Salerno on Wednesday at 12 PM    Book a flight
HOW?!
Providing training phrases: a set of examples that users can use to express an intention. Required for matching inputs with intents
How do chatbots work?
10/25
Training phrases: a set of examples that users can use to express an intention
●Must be provided with the intent
Training phrase    Intent
Hi there!          Greet
Hello              Greet
Hi                 Greet
Hey                Greet
[Figure: the user message "Hi there" is matched to an intent]
How do chatbots work?
11/25
Training phrases: a set of examples that users can use to express an intention
●Must be provided with the intent
Training phrase                                         Intent
Airplane ticket from Madrid to Rome tomorrow at 1 pm    Book a flight
Flight from Madrid to Napoli on 17/06/2024 at 11:30     Book a flight
[Figure: the user message "I need to fly" is matched to an intent]
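A minimal sketch of how training phrases can drive intent matching, using the phrases from the slides. The bag-of-words cosine similarity, the `threshold` value and the `Fallback` intent name are assumptions made for illustration; platforms such as Dialogflow or Rasa use their own trained NLU models rather than this toy scoring.

```python
import math
import re
from collections import Counter

# Training phrases per intent, taken from the slides.
TRAINING = {
    "Greet": ["Hi there!", "Hello", "Hi", "Hey"],
    "Book a flight": [
        "Airplane ticket from Madrid to Rome tomorrow at 1 pm",
        "Flight from Madrid to Napoli on 17/06/2024 at 11:30",
    ],
}


def bag_of_words(text):
    """Lower-cased word counts; a toy stand-in for a real NLU encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_intent(utterance, training=TRAINING, threshold=0.2):
    """Return (intent, score) of the closest training phrase.

    ASSUMPTION: scores below `threshold` are routed to a hypothetical
    "Fallback" intent.
    """
    query = bag_of_words(utterance)
    best_intent, best_score = "Fallback", 0.0
    for intent, phrases in training.items():
        for phrase in phrases:
            score = cosine(query, bag_of_words(phrase))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        return "Fallback", best_score
    return best_intent, best_score
```

With these phrases, `match_intent("Hi there!")` selects `Greet`, while an utterance sharing no words with any training phrase, such as "Good bye!", falls back. Removing or moving training phrases changes these scores, which is what the training-phrase mutation operators exploit.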
How do chatbots work?
13/25
3. Chatbot extracts information from the message or asks for missing information
User says                                                     Intent
I need to fly from Madrid to Salerno on Wednesday at 12 PM    Book a flight
At this point, the chatbot extracts key information from the input: parameters
from: Madrid    to: Salerno    when: Wed. at 12 PM
Entities: City (from, to), Time (when)
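Parameter extraction can be sketched as matching entity definitions against the utterance and reporting what is still missing, so that the chatbot can ask for it. The entity definitions below (a literal list for City, a regular expression for Time) and the function name are hypothetical; they only mirror the `from`, `to` and `when` parameters of the slide.

```python
import re

# ASSUMPTION: hypothetical entity definitions, not taken from any real chatbot.
CITY_LITERALS = ["Madrid", "Salerno", "Rome", "Napoli"]
CITY = "|".join(CITY_LITERALS)
TIME = (
    r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
    r" at \d{1,2}(?::\d{2})? ?(?:AM|PM)"
)

# One pattern per parameter of the "Book a flight" intent.
PARAMETERS = {
    "from": re.compile(rf"\bfrom ({CITY})\b"),
    "to": re.compile(rf"\bto ({CITY})\b"),
    "when": re.compile(rf"\bon ({TIME})", re.IGNORECASE),
}


def extract_parameters(utterance):
    """Return the parameters found and the names of those still missing."""
    found, missing = {}, []
    for name, pattern in PARAMETERS.items():
        match = pattern.search(utterance)
        if match:
            found[name] = match.group(1)
        else:
            missing.append(name)
    return found, missing
```

For the slide's utterance all three parameters are found. For "I need to fly to Rome" the `from` and `when` parameters are reported as missing, which is where a parameter prompt would be used.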
How do chatbots work?
14/25
4. Build the response and send it back to the user
●Responses to the user:
○text, images
●External service queries
○External REST API
○Database, etc.
User says                                                     Action
I need to fly from Madrid to Salerno on Wednesday at 12 PM    The price of the ticket is 150$. Provide a card nº and billing name
Both responses to the user and external service queries are actions
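As a rough illustration of an action, the sketch below combines an external service query with a response to the user. `price_service` is a hypothetical stub that returns the fixed price shown on the slide; a real chatbot would query a REST API or a database instead.

```python
def price_service(origin, destination):
    """Hypothetical stub for the external service (always returns 150)."""
    return 150


def book_flight_action(params, service=price_service):
    """Query the external service and build the response text for the user."""
    price = service(params["from"], params["to"])
    return f"The price of the ticket is {price}$. Provide a card nº and billing name"
```

Operators for actions target exactly this step, for example by deleting the action or a parameter used in the response.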
Testing chatbots
15/25
Test case input    Test case output
Hi there!          Hi! How can I help you?
… complete conversations
Testing chatbots
16/25
We use Botium and Rasa-test as the test suites to test the chatbots
convo file (conversation step), single test interaction:
#me
Hi there!
#bot
What day do you want to come in?
convo file (conversation step), combination of multiple tests:
#me
GREET_UTTERANCES_USER
#bot
GREET_RESPONSES_USER
utterances (multiple user utterances):
GREET_UTTERANCES_USER
Hi there!
Hi
Hello
Hey
responses (possible responses):
GREET_RESPONSES_USER
Hi! How can I help you?
Hello, what do you need?
Greetings! This is the flight ticket assistant Antony, how can i help you?
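The sketch below illustrates how a conversation step that references utterance lists can stand for several concrete tests. It is not Botium's implementation; it assumes that a `#me` reference expands into one test per user utterance and that a `#bot` reference accepts any of the listed responses. The `chatbot` argument is a hypothetical callable from user text to reply text.

```python
from itertools import product

UTTERANCES = {
    "GREET_UTTERANCES_USER": ["Hi there!", "Hi", "Hello", "Hey"],
    "GREET_RESPONSES_USER": [
        "Hi! How can I help you?",
        "Hello, what do you need?",
        "Greetings! This is the flight ticket assistant Antony, how can i help you?",
    ],
}

# The convo from the slide as (sender, text or utterance reference) steps.
CONVO = [("me", "GREET_UTTERANCES_USER"), ("bot", "GREET_RESPONSES_USER")]


def resolve(text):
    """Expand an utterance reference; plain text resolves to itself."""
    return UTTERANCES.get(text, [text])


def expand(convo):
    """Yield one concrete conversation per combination of user utterances."""
    options = [resolve(text) if sender == "me" else [text] for sender, text in convo]
    for combo in product(*options):
        yield [(sender, chosen) for (sender, _), chosen in zip(convo, combo)]


def run(convo, chatbot):
    """Run every expanded conversation and return the ones that fail."""
    failures = []
    for concrete in expand(convo):
        reply = None
        for sender, text in concrete:
            if sender == "me":
                reply = chatbot(text)
            elif reply not in resolve(text):
                failures.append(concrete)
                break
    return failures
```

Under these assumptions the greeting convo expands into four concrete tests, one per user utterance.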
Testing chatbots
17/25
… and complex conversations
User: Hi there!
Chatbot: Hi! How can I help you?
User: I need to fly from …
Chatbot: The price of the ticket …
User: I lost my baggage
Chatbot: Please, provide the flight ticket id
Mutation testing for chatbots
18/25
Intent: Order a coffee
User says                              Action
What kinds of coffee are available?    You can take an espresso or an americano
What kinds of coffee can I order?
What can I drink here?
Tell me what drinks there are
Intent: Order a wine
User says                              Action
What kinds of wine are available?      You can take an Italian wine or a French wine
What kinds of wine can I order?
What can I drink here?
Tell me what drinks there are
Test-suite utterance: Tell me what kinds of coffee I can drink here
[Figure: the utterance is sent to the chatbot, which matches an intent (Order a coffee, Order a wine, …), extracts params, builds the response and may call an external service]
Mutation on Order a coffee: Keeps the two most different phrases
Semantic similarity values shown: 0.512, 0.538, 0.475, 0.474
Order a coffee after the mutation
User says                              Action
What can I drink here?                 You can take an espresso or an americano
Tell me what drinks there are
[Figure: the test-suite utterance is sent to the mutated chatbot and the intent matched is checked]
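The mutation in this example can be sketched as follows. Two things are assumptions here: the Jaccard word overlap is only a toy stand-in for the semantic similarity used on the slide, and the "representativeness" of a phrase is taken to be its mean similarity to the other phrases of the same intent. The scores therefore differ from the slide's values.

```python
import re

COFFEE_PHRASES = [
    "What kinds of coffee are available?",
    "What kinds of coffee can I order?",
    "What can I drink here?",
    "Tell me what drinks there are",
]


def words(phrase):
    """Set of lower-cased words in a phrase."""
    return set(re.findall(r"[a-z]+", phrase.lower()))


def similarity(a, b):
    """Jaccard word overlap (toy stand-in for a semantic similarity)."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def representativeness(phrase, phrases):
    """ASSUMPTION: mean similarity of a phrase to the others of its intent."""
    others = [p for p in phrases if p != phrase]
    if not others:
        return 0.0
    return sum(similarity(phrase, p) for p in others) / len(others)


def k2p_min(phrases):
    """Keep the two least representative ('most different') phrases."""
    if len(phrases) <= 2:
        return list(phrases)
    ranked = sorted(phrases, key=lambda p: representativeness(p, phrases))
    keep = set(ranked[:2])
    return [p for p in phrases if p in keep]
```

On the coffee intent, this toy metric happens to keep the same two phrases as the slide ("What can I drink here?" and "Tell me what drinks there are"), the ones that the intent shares with Order a wine.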
Mutation operators for chatbots
19/25
Operators for training phrases
DP_max   Deletes the most representative phrase of an intent
DP_min   Deletes the most different phrase of an intent
DPWP     Deletes training phrases with required parameter
DPWL     Deletes training phrases with literal
K2P_max  Keeps the 2 most representative phrases
K2P_min  Keeps the 2 most different phrases
MP_max   Moves the most representative phrase to the most similar intent
MP_min   Moves the most different phrase to the most different intent
Operators for intents
DIP      Deletes intent parameter
DPP      Deletes parameter prompt
SPO      Sets required parameter to optional
DFI      Deletes fallback intent
Operators for entities
CRE      Changes regular expression
DLE      Deletes literal from entity
Operators for actions
DA       Deletes actions
DPR      Deletes a parameter used in a response
SO       Swaps outputs
Operators for conversation flows
DCS      Deletes conversation step
DCB      Deletes conversation bifurcation
Emulation of common errors of chatbot developers
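To make the operators more concrete, here is a sketch of three of them (SPO, DFI and DLE) applied to a chatbot model represented as a plain dictionary. The dictionary schema, the prompts and the function names are hypothetical; the actual tool works on models conforming to the CONGA meta-model, with operators written in WODEL.

```python
import copy

# ASSUMPTION: a hypothetical, much simplified chatbot model.
MODEL = {
    "intents": {
        "Book a flight": {
            "fallback": False,
            "parameters": {
                "from": {"entity": "City", "required": True, "prompt": "Where do you fly from?"},
                "to": {"entity": "City", "required": True, "prompt": "Where do you fly to?"},
            },
        },
        "Fallback": {"fallback": True, "parameters": {}},
    },
    "entities": {"City": ["Madrid", "Salerno"]},
}


def spo(model):
    """SPO: one mutant per required parameter, with it set to optional."""
    mutants = []
    for intent, spec in model["intents"].items():
        for name, param in spec["parameters"].items():
            if param["required"]:
                mutant = copy.deepcopy(model)
                mutant["intents"][intent]["parameters"][name]["required"] = False
                mutants.append(mutant)
    return mutants


def dfi(model):
    """DFI: one mutant per fallback intent, with that intent deleted."""
    mutants = []
    for intent, spec in model["intents"].items():
        if spec["fallback"]:
            mutant = copy.deepcopy(model)
            del mutant["intents"][intent]
            mutants.append(mutant)
    return mutants


def dle(model):
    """DLE: one mutant per entity literal, with that literal deleted."""
    mutants = []
    for entity, literals in model["entities"].items():
        for literal in literals:
            mutant = copy.deepcopy(model)
            mutant["entities"][entity].remove(literal)
            mutants.append(mutant)
    return mutants
```

Each function returns one mutant per place where the operator applies, and an empty list when the model has no matching element, which is one way to think about the applicability of an operator to a given chatbot.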
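Mutation testing for chatbots
20/25
[Architecture diagram: WODEL-TEST]
1. parse: the chatbot implementation (e.g., Dialogflow) is parsed into a chatbot model that conforms to the CONGA meta-model
2. annotate: the chatbot model is annotated (Tensorflow), giving an annotated chatbot model that conforms to the annotation meta-model
3. mutate: the mutation operators (WODEL) produce chatbot model mutants
4. generate: a chatbot implementation is generated from each mutant
5. test: the test suites are run against the chatbot implementations, producing the mutation analysis report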
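The final test step can be sketched as a loop that runs the test suite against every mutant and labels it killed or alive. Chatbots are modelled here as plain functions from user text to reply text, and the two mutants are hand-written stand-ins for generated implementations; none of this reflects the actual WODEL-TEST code.

```python
REPLIES = {"Hi": "Hello!", "Bye": "See you"}


def original(utterance):
    """Toy chatbot with a fallback reply for unknown utterances."""
    return REPLIES.get(utterance, "Sorry, I did not understand")


def so_mutant(utterance):
    """Hand-written stand-in for an SO mutant: the two outputs are swapped."""
    swapped = {"Hi": "See you", "Bye": "Hello!"}
    return swapped.get(utterance, "Sorry, I did not understand")


def dfi_mutant(utterance):
    """Hand-written stand-in for a DFI mutant: no fallback reply."""
    return REPLIES.get(utterance)


MUTANTS = {"SO": so_mutant, "DFI": dfi_mutant}


def run_suite(chatbot, suite):
    """True when every (utterance, expected reply) pair passes."""
    return all(chatbot(utterance) == expected for utterance, expected in suite)


def analyse(chatbot, mutants, suite):
    """Label each mutant 'killed' (some test fails) or 'alive' (all pass)."""
    if not run_suite(chatbot, suite):
        raise ValueError("the test suite must pass on the original chatbot")
    return {
        name: "alive" if run_suite(mutant, suite) else "killed"
        for name, mutant in mutants.items()
    }
```

With the suite `[("Hi", "Hello!")]` the SO mutant is killed but the DFI mutant stays alive, which points at a missing test for unknown utterances. Adding such a test kills both.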
RQ1: How applicable are the defined mutation operators?
RQ2: How effective are the defined mutation operators?
21/25
[Bar chart: Mutation score by mutation operator (Alive vs. Killed). Values shown: 39%, 48%, 67%, 60%, 77%, 73%, 78%, 80%, 67%, 0%, 0%, 40%, 50%, 76%, 14%, 89%, 87%, 96%]
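A mutation score per operator can be computed as the percentage of killed mutants among the mutants each operator produced. The sketch below assumes outcomes are available as (operator, status) pairs; the function name is hypothetical and how equivalent mutants are treated is left out.

```python
from collections import defaultdict


def score_by_operator(outcomes):
    """Map each operator to its percentage of killed mutants.

    `outcomes` is an iterable of (operator, status) pairs, where status is
    "killed" or "alive".
    """
    counts = defaultdict(lambda: {"killed": 0, "alive": 0})
    for operator, status in outcomes:
        counts[operator][status] += 1
    return {
        operator: 100.0 * c["killed"] / (c["killed"] + c["alive"])
        for operator, c in counts.items()
    }
```

An operator whose mutants all stay alive gets a score of 0%, which signals either weak tests for that kind of fault or mutants that are semantically equivalent to the original chatbot.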
RQ3: How effective is the mutation testing process?
22/25
[Bar chart: Mutation score by test suite kind (Alive vs. Killed). Test suite kinds: Botium automatic, Botium by hand, Rasa test. Values shown: 45%, 94%, 20%]
RQ4: How efficient is the mutation testing process?
23/25
[Bar chart, y-axis from 0% to 35%. Values shown: 0.1%, 0.2%, 0.3%, 1.0%, 1.2%, 1.4%, 1.6%, 1.6%, 1.7%, 2.6%, 4.9%, 8.4%, 12.8%, 27.5%, 34.7%. Chatbots: Covid19_tracer, bikeShop, e2e-bot, Spaceonova, personal-bot, yassinelamarti, Rasa-demo, 256644, h4h-chatbot, diagrams2ai, dusbot, legal-alien-chatbot, Email-WhatsApp-Integration, lankbanfinance, Data-mining]
The mutation testing process of 67% of the chatbots was completed in less than 90 minutes
Conclusions
•Technology-independent approach for MuT of chatbots with
•A catalogue of 19 mutation operators for
•Training phrases, intents, entities, chatbot actions and conversation flows
•Support for test scenarios from Botium and Rasa-test
•Experiment with 15 chatbots and 29 test suites
•Positive results regarding applicability, effectiveness and efficiency
•Room for improvement in 86% of the test suites
•Running times of MuT for chatbots are costly but acceptable
•Less than 90 minutes for 67% of the chatbots
24/25
Future work
•Automate the detection of semantically equivalent mutants
•e.g., using confidence decrease heuristics
•Automate the synthesis of tests able to kill the alive mutants
•Adapt our approach to LLM-based agents
25/25
Thank you!
Wodel-Test
Dataset
Tool demo