Explanatory Capabilities of Large Language Models in Prescriptive Process Monitoring
Marlon Dumas
27 slides
Sep 03, 2024
About This Presentation
Research paper presentation at the 2024 International Conference on Business Process Management (BPM).
Prescriptive process monitoring (PrPM) systems analyze ongoing business process instances to recommend real-time interventions that optimize performance. The usefulness of these systems hinges on users applying the generated recommendations. Thus, users need to understand the rationale behind these recommendations. One way to build this understanding is to enhance each recommendation with explanations. Existing approaches generate explanations consisting of static text or plots, which users often struggle to understand. Previous work has shown that dialogue systems enhance the effectiveness of explanations in recommender systems. Large Language Models (LLMs) are an emerging technology that facilitates the construction of dialogue systems. In this paper, we investigate the applicability of LLMs for generating explanations in PrPM systems. Following a design science approach, we elicit explainability questions that users may have for PrPM outputs, we design a prompting method on this basis, and we conduct an evaluation with potential users to assess their perception of the explanations and how they interact with the system. The results indicate that LLMs can help users of PrPM systems better understand the origin of the recommendations, and that the explanations have sufficient detail and fulfill users' expectations. On the other hand, users find that the explanations do not always address the "why" of a recommendation and do not let them judge whether they can trust the recommendation.
Slide Content
Explanatory Capabilities of Large Language Models in Prescriptive Process Monitoring
Kateryna Kubrak, Lana Botchorishvili, Fredrik Milani, Alexander Nolte, Marlon Dumas
University of Tartu, Narva mnt 18, 51009 Tartu, Estonia
22nd Business Process Management Conference (BPM 2024)
Introduction
Process mining helps to get more insight into the process and identify improvements. Descriptive process monitoring tells you why something has happened.
[Process map with activity frequencies: Register application → Analyze application → Approve application / Cancel application → Notify customer]
Introduction
Prescriptive process monitoring (PrPM) tells you what to do to avoid/mitigate an undesired outcome.
[Process map with activity frequencies, annotated with an alert: "Analysis will take too long, the customer will be notified later than promised." Recommendation: "Assign Specialist 1 to task Analyze application." User: "But why would I do that?" → Explanation]
Introduction
Previous work has highlighted the challenges of providing understandable and compelling explanations to business users in the context of PrPM systems [1]. Fixed-form explanations are not user-centric, since they consist of information suitable only for a specific type of user [2]. The understandability of provided explanations can be enhanced through dialogue-based systems by allowing users to ask questions from different angles [2]. Large Language Models (LLMs) are an emerging technology that could facilitate such a dialogue between a system and a user [3].
[1] Padella, A., de Leoni, M., Dogan, O., Galanti, R.: Explainable process prescriptive analytics. In: ICPM, pp. 16–23. IEEE (2022)
[2] Cambria, E., Malandri, L., Mercorio, F., Mezzanzanica, M., Nobani, N.: A survey on XAI and natural language explanations. Inf. Process. Manag. 60(1), 103111 (2023)
[3] Feldhus, N., Ravichandran, A.M., Möller, S.: Mediators: Conversational agents explaining NLP model behavior. CoRR abs/2206.06029 (2022)
Introduction
Problem Statement: Enhancing the understandability of explanations for PrPM recommendations.
Research Objective: To design and evaluate an approach for LLM-based explanations of recommendations generated by PrPM techniques.
Method
Objectives Definition
Explainable AI Question Bank (XAIQB)* — categories with prototypical questions:
- Input: What kind of data does the system learn from?
- Output: What kind of output does the system give?
- Performance: How accurate/precise/reliable are the predictions?
- How: What is the system's overall logic?
- Why: Why/how is this instance given this prediction?
- Why not: Why/how is this instance not given a different prediction?
- What if: What would the system predict for a different instance?
- How to be that: How should this instance change to get a different prediction?
- How to still be this: What is the scope of change permitted to still get the same prediction?
- Others
* Liao, Q.V., Gruen, D.M., Miller, S.: Questioning the AI: informing design practices for explainable AI user experiences. In: CHI, pp. 1–15. ACM (2020)
Objectives Definition
Mapping of questions to ways to explain and prototypical outputs (excerpt)*:
- Data — Question: What is the size of the event log? Ways to explain: number of cases in the event log; number of cases in the training and testing datasets. Prototypical output: "The event log consists of {number} of cases."
- Performance — Question: Why should I believe that the predictions are correct? Ways to explain: provide performance metrics for the models (accuracy, precision, recall). Prototypical output: "The accuracy of recommendations is on average {number}."
- How — Question: How does the system make predictions? Ways to explain: describe the differences between the techniques. Prototypical output: "The tool provides three different recommendation types: next best activity, alarm, and intervention. […] The intervention is produced using the uplift modeling package CausalLift to get the CATE and probability of outcome…"
- Output — Question: What do the different recommendation types mean? Ways to explain: describe the differences between the algorithms (the techniques they use and how the recommendations differ). Prototypical output: "An alarm is a type of recommendation that does not specify an exact action to perform at the given moment, but rather notifies that you should pay attention to the case."
* Full mapping is available in supplementary material
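The mapping above mentions that intervention recommendations come from uplift modeling, which estimates the CATE: the conditional average treatment effect, i.e., how much applying the intervention changes the probability of a good outcome. A minimal self-contained sketch of the idea on synthetic data — not the paper's actual CausalLift pipeline, and with made-up segments and probabilities — compares outcome rates with and without the intervention per case segment:

```python
import random

random.seed(0)

# Toy data: each case has a segment, a flag for whether the intervention
# was applied, and a binary outcome (1 = good outcome).
cases = []
for _ in range(4000):
    seg = random.choice(["fast", "slow"])
    treated = random.random() < 0.5
    # In this synthetic setup the intervention only helps "slow" cases.
    p_good = 0.4 + (0.3 if treated and seg == "slow" else 0.0)
    cases.append((seg, treated, 1 if random.random() < p_good else 0))

def cate(segment):
    """Estimated uplift for a segment: outcome rate treated minus untreated."""
    treat = [y for s, t, y in cases if s == segment and t]
    ctrl = [y for s, t, y in cases if s == segment and not t]
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

# A PrPM system would recommend intervening only where the uplift is positive.
recommend = {seg: cate(seg) > 0.05 for seg in ("fast", "slow")}
```

On this data, the estimated uplift for "slow" cases is close to the planted 0.3 effect, while "fast" cases show roughly zero uplift, so only "slow" cases would get the intervention recommended.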
Method
Design & Development: Prompting Method
General contextual information: Context; Data description; Task.
General conversational rules: simple language (analysts with little ML experience); specific answers, no elaboration; no more than two paragraphs of text.
Examples. Ask ChatGPT: what else?
Seo, W., Yang, C., Kim, Y.H.: ChaCha: Leveraging large language models to prompt children to share their emotions about personal events. arXiv preprint arXiv:2309.12244 (2023)
Bellan, P., Dragoni, M., Ghidini, C.: Extracting business process entities and relations from text using pre-trained language models and in-context learning. In: International Conference on Enterprise Design, Operations, and Computing, pp. 182–199. Springer (2022)
Jessen, U., Sroka, M., Fahland, D.: Chit-chat or deep talk: Prompt engineering for process mining. arXiv preprint arXiv:2307.09909 (2023)
Design & Development: Prompting Method
Prompt components with text excerpts*:
- Context: "[PrPM tool] uses three algorithms to generate prescriptions for business processes... The [PrPM tool] workflow involves: uploading an event log; defining column types; setting parameters... The key parameters are: Case Completion: an activity that marks the end of a case, e.g., 'Application completed'..."
- Data description: description of the MongoDB files collection; description of the cases collection
- General conversational rules: "When answering, use simple language for the explanations. Do not mention the database or show raw data in your responses. ..."
- Examples: "QUESTION: What is the size of the event log? ANSWER: The event log consists of <nr_of_cases> of cases. QUERY: [query example] STEPS: Run the query with function query_db to find the number of cases in this event log."
- Task: "Your role is to answer questions about [PrPM tool] recommendations and query the database for specific case or event log information."
* Full prompting method is available in supplementary material
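The components above can be assembled into one system prompt before the conversation starts. A minimal sketch, using the component names from the slide but with illustrative stand-in texts rather than the paper's full prompting method (which is in the supplementary material):

```python
# Illustrative stand-ins for the slide's prompt components.
CONTEXT = (
    "[PrPM tool] uses three algorithms to generate prescriptions "
    "for business processes..."
)
DATA_DESCRIPTION = (
    "The database has a 'files' collection and a 'cases' collection..."
)
RULES = (
    "When answering, use simple language for the explanations. "
    "Do not mention the database or show raw data in your responses."
)
EXAMPLES = (
    "QUESTION: What is the size of the event log?\n"
    "ANSWER: The event log consists of <nr_of_cases> of cases.\n"
    "STEPS: Run the query with function query_db."
)
TASK = (
    "Your role is to answer questions about [PrPM tool] recommendations "
    "and query the database for specific case or event log information."
)

def build_system_prompt():
    """Concatenate the components into one labeled system prompt."""
    parts = [
        ("Context", CONTEXT),
        ("Data description", DATA_DESCRIPTION),
        ("General conversational rules", RULES),
        ("Examples", EXAMPLES),
        ("Task", TASK),
    ]
    return "\n\n".join(f"# {name}\n{text}" for name, text in parts)

prompt = build_system_prompt()
```

The resulting string would be passed as the system message of a chat-completion call, with `query_db` exposed to the model as a callable tool so answers can be grounded in the event log database.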
Design & Development: LLM-based chat
[Screenshots of the LLM-based chat prototype]
Method
Evaluation: Setting
Goals: (1) assess users' perception of the presented explanations; (2) assess users' interaction with the chat.
- 12 process analysts
- Claims management event log
- Tasked to review a case and its recommendations and interact with the chat
- Post-task survey
- Interviews with participants
Evaluation: Results — Questions asked
Participants' questions were coded with a scheme based on the XAIQB:
- 55% in category "Output" — What do specific terms mean? (e.g., "CATE score", "intervention")
- 18% in category "Why" — Why should I prefer one recommendation over another?
- 12% in category "Others" — What documents were supplied in the claim?
- Every other category: 7% or less
Coding scheme is available in supplementary material
Evaluation: Results — Chat's answers
Characteristics used to code the answers (explanations):
- Coherency — whether the explanation is internally coherent (how well its parts fit together)
- Relevancy to the question — whether the explanation answers the question
- Completeness — whether there are gaps in the explanation
- Correctness — whether the data in the explanation is correct
- Compactness — whether the explanation is repetitive or redundant
Hoffman, R.R., Mueller, S.T., Klein, G., Litman, J.: Measures for explainable AI: explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Frontiers Comput. Sci. 5 (2023)
Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., Schlötterer, J., van Keulen, M., Seifert, C.: From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Comput. Surv. 55(13s), 295:1–295:42 (2023)
Zemla, J.C., Sloman, S., Bechlivanidis, C., Lagnado, D.A.: Evaluating everyday explanations. Psychonomic Bulletin & Review 24, 1488–1500 (2017)
Evaluation: Results — Chat's answers
- Correctness: the chat sometimes did not query the database but provided a confident answer (i.e., hallucinated). Sometimes the chat retrieved correct data from the database but matched it with the wrong question (e.g., "90.14%" was the accuracy but got reported as a probability).
- Compactness: in some interviews most explanations were compact, while in others most were not.
Evaluation: Results — Interaction
Different approaches to starting the conversation:
- Study the case and recommendations, then formulate a clarifying question
- Ask a general question about case performance
- Ask what the issue with the case is
- Ask the chat how it can help
Evaluation: Results — Survey
[Chart of survey results]
Implications for Research
Focusing on bringing more causal aspects into PrPM techniques:
- For the recommendation to amend the claim settlement, participants wanted to know what exactly to amend
- Several participants asked about the potential impact of a recommendation on the case outcome in terms of temporal or monetary value
- Several participants asked why an action was recommended
- Such questions could also be addressed by incorporating a causal component into the setup*
* Fahland, D., Fournier, F., Limonad, L., Skarbovsky, I., Swevels, A.J.E.: How well can large language models explain business processes? CoRR abs/2401.12846 (2024)
Implications for Research
Further research on explanations in the PrPM context and ways to provide them:
- An experiment could be designed to compare the understandability of LLM explanations with established methods
- Future researchers can use the questions participants asked as guidance for catering to end-user needs
Expanding the prototype to encompass process-level insights:
- Functionalities for viewing aggregated data (e.g., total active recommendations, number of recommended cases) to give analysts a broader process perspective
Implications for Research
Improving correctness of explanations:
- E.g., adding a verification layer
- An experiment comparing different LLMs on the correctness of their responses
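The verification layer suggested above could take many forms; one minimal sketch — an assumption for illustration, not the paper's design — is to re-extract the numbers an answer claims and cross-check them against freshly queried ground truth before showing the answer to the analyst. The `query_db` stub and its values below are hypothetical stand-ins for the tool's real database access:

```python
import re

def query_db(metric):
    """Hypothetical stand-in for a real database query."""
    truth = {"accuracy": "90.14%", "nr_of_cases": "1,000"}
    return truth[metric]

def verify_answer(answer, claimed_metrics):
    """Flag an answer whose numbers do not match the database.

    claimed_metrics lists the metric names the answer is supposed to
    report; each must appear in the answer with its ground-truth value.
    """
    numbers = re.findall(r"\d[\d,.]*%?", answer)
    issues = []
    for metric in claimed_metrics:
        truth = query_db(metric)
        if truth not in numbers:
            issues.append(f"{metric}: expected {truth}, not found in answer")
    return issues

ok = verify_answer("The accuracy of recommendations is 90.14%.", ["accuracy"])
bad = verify_answer("The accuracy is 85%.", ["accuracy"])
```

A check like this catches outright wrong numbers (the `bad` case returns an issue, the `ok` case none), though not the mislabeling failure from the evaluation, where a correct value was reported under the wrong name; that would need the layer to also verify which metric each number is attributed to.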
Implications for Practice
- Add template questions to the chat, thus providing guidance
- Adjust the prompt to better respond to questions from the most-asked categories; another approach is to fine-tune the model
- Add the capability for the chat to answer general case performance questions; this requires either ensuring that the LLM can calculate, e.g., cycle time based on the instructions, or ensuring the underlying tool has this information in addition to the recommendations
Thank you!
Kateryna Kubrak, [email protected]
PhD Student, Institute of Computer Science, University of Tartu
Icons by Freepik on flaticon.com