[20240909_LabSeminar_Huy]Towards Hierarchical Policy Learning for Conversational Recommendation with Hypergraph-based Reinforcement Learning.pptx

About This Presentation

Towards Hierarchical Policy Learning for Conversational Recommendation with Hypergraph-based Reinforcement Learning


Slide Content

Quang-Huy Tran, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: [email protected]. 2024-09-09. Towards Hierarchical Policy Learning for Conversational Recommendation with Hypergraph-based Reinforcement Learning. Sen Zhao et al., IJCAI 2023: The Thirty-Second International Joint Conference on Artificial Intelligence.

OUTLINE
- MOTIVATION
- METHODOLOGY
- EXPERIMENT & RESULT
- CONCLUSION

MOTIVATION: Overview and Limitations
Conversational recommendation systems (CRS):
- Dynamically learn user preferences by iteratively interacting with the user.
- Recommend the target item to the user.
Mainstream systems cast CRS as policy learning in reinforcement learning, with two essential decision procedures: when to recommend (i.e., ask or recommend) and what to talk about (i.e., the specific attribute/items).
Challenges:
- Hard to converge due to the lack of mutual influence between the two procedures during training.
- Merging both procedures into one policy complicates action selection by enlarging the action space and introducing data bias (imbalance between the numbers of items and attributes).
- Ignoring the different roles of the two decision procedures leads to a sub-optimal CRS strategy.

INTRODUCTION
Propose a novel Director-Actor Hierarchical Conversational Recommender (DAHCR):
- Intrinsic motivation allows the director to be trained from weak supervision.
- A dynamic hypergraph learns user preferences from high-order relations.
Contributions:
- Emphasize the different roles of the two decision procedures in CRS and the mutual influence between them.
- To alleviate the bad effect of model bias on the mutual influence between director and actor, model the director's options by sampling from a categorical distribution with Gumbel-softmax.

METHODOLOGY: Problem Definition
Multi-turn conversational recommendation (MCR): recommend the target item to the user by asking about attributes and recommending items within a limited number of conversation turns.
- The system is given an item set; each item is associated with a set of attributes.
- At the beginning of a conversation, user u initializes a session by specifying an attribute of the target item, which initializes the candidate item set and the candidate attribute set.
- At each turn t, the system either asks the user about an attribute or recommends a certain number of items (see the interaction-loop sketch below).
- The user accepts or rejects the proposal, and MCR updates the candidate sets accordingly.
- The conversation continues until the maximum turn T; the recommendation is successful if the target item is recommended within T turns.
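
A minimal sketch of the MCR interaction loop described above, written as plain Python. The simulator, policy, and candidate-set updates (names such as `policy.choose` and `user.likes_attribute`) are hypothetical placeholders for illustration, not the authors' code.

```python
# Hypothetical sketch of the multi-turn conversational recommendation (MCR) loop.
# `items` maps each item to its attribute set; all method names are placeholders.

def run_mcr_session(policy, user, target_item, init_attribute, items, max_turns=15, top_k=10):
    cand_items = {v for v in items if init_attribute in items[v]}   # items matching the initial attribute
    cand_attrs = set().union(*(items[v] for v in cand_items)) - {init_attribute}
    accepted, rejected = {init_attribute}, set()

    for t in range(max_turns):
        action = policy.choose(cand_items, cand_attrs, accepted, rejected)  # ask or recommend
        if action.kind == "ask":
            if user.likes_attribute(action.attribute):          # asked attribute accepted
                accepted.add(action.attribute)
                cand_items = {v for v in cand_items if action.attribute in items[v]}
            else:                                               # asked attribute rejected
                rejected.add(action.attribute)
            cand_attrs.discard(action.attribute)
        else:                                                   # recommend top_k candidate items
            recs = action.items[:top_k]
            if target_item in recs:
                return True, t + 1                              # successful recommendation
            cand_items -= set(recs)                             # drop rejected items
    return False, max_turns                                     # failed within the turn limit
```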

METHODOLOGY: Main Architecture

METHODOLOGY: DAHCR Components
- State: three components: the interactive history, the related nodes (candidate items satisfying all accepted attributes, plus the attributes associated with those candidate items), and the dynamic hypergraph.
- Option O: the director chooses an option, ask or recommend, in state s_t.
- Primitive actions A: based on the state and the director's option, the actor selects a primitive action (a candidate attribute to ask about, or candidate items to recommend).
- Transitions T: the environment moves to the next state s_{t+1} after the user's feedback.
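
The four components above can be summarised in a small container. This is an illustrative data structure only; the field names and types are my assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class Option(Enum):          # the director's high-level option
    ASK = 0
    RECOMMEND = 1

@dataclass
class DAHCRState:
    """State s_t: interactive history, related nodes, and the dynamic hypergraph."""
    history: list            # sequence of (action, feedback) pairs so far
    related_nodes: set       # user, candidate items satisfying accepted attributes, their attributes
    hypergraph: object       # incidence structure over related_nodes

@dataclass
class PrimitiveAction:
    """Actor's action a_t: an attribute to ask about, or items to recommend."""
    option: Option
    attribute: int | None = None
    items: list = field(default_factory=list)
```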

METHODOLOGY: DAHCR Components (cont.)
Extrinsic rewards R_a: special signals that guide the agent toward user-preferred actions:
- strongly positive reward for a successful recommendation
- slightly negative reward for a rejected recommendation
- slightly positive reward when the asked attribute is accepted
- slightly negative reward when the asked attribute is rejected
- strongly negative reward when the maximum turn is reached
Intrinsic motivation R_o: passed from the actor to the director to estimate the effectiveness of the director's option:
- while the user's preference is still uncertain, assign a positive reward to the option ask and a negative reward to the option rec;
- once the user's preference is certain, the rewards are reversed (positive for rec, negative for ask).
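
A hedged sketch of how the extrinsic reward and intrinsic motivation could be assigned per turn. The concrete magnitudes are illustrative placeholders, not the values used in the paper.

```python
# Illustrative reward magnitudes only; the paper's actual values may differ.
R_REC_SUCCESS = +1.0    # strongly positive: successful recommendation
R_REC_FAIL    = -0.1    # slightly negative: rejected recommendation
R_ASK_ACCEPT  = +0.01   # slightly positive: asked attribute accepted
R_ASK_REJECT  = -0.1    # slightly negative: asked attribute rejected
R_MAX_TURN    = -0.3    # strongly negative: maximum turn reached

def extrinsic_reward(option, accepted, success, max_turn_reached):
    if max_turn_reached:
        return R_MAX_TURN
    if option == "rec":
        return R_REC_SUCCESS if success else R_REC_FAIL
    return R_ASK_ACCEPT if accepted else R_ASK_REJECT

def intrinsic_motivation(option, preference_certain, r=0.1):
    """Reward the director's option: favour 'ask' while the user's preference is
    uncertain, and 'rec' once it has become certain."""
    if preference_certain:
        return +r if option == "rec" else -r
    return +r if option == "ask" else -r
```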

METHODOLOGY: DAHCR Policy Learning
State Encoder:
- Given the interaction history, an interactive state embedding is obtained.
- To learn the user's preference for specific attributes and items, a dynamic hypergraph is built over the set of related nodes; each hyperedge connects the user, an accepted attribute, and the items associated with that attribute.
- The hypergraph structure is represented by its incidence (adjacency) matrix.
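
A minimal sketch of building the hypergraph incidence matrix, assuming each hyperedge links the user node, one accepted attribute, and the candidate items carrying that attribute; the node/edge indexing convention here is my own, not the paper's.

```python
import numpy as np

def build_incidence_matrix(user, accepted_attrs, item_attrs):
    """Return the related nodes and the |nodes| x |hyperedges| incidence matrix H:
    one hyperedge per accepted attribute, connecting the user, that attribute,
    and every candidate item that carries it."""
    accepted_attrs = list(accepted_attrs)
    cand_items = [v for v, attrs in item_attrs.items()
                  if all(p in attrs for p in accepted_attrs)]
    nodes = [user] + accepted_attrs + cand_items           # related nodes
    index = {n: i for i, n in enumerate(nodes)}

    H = np.zeros((len(nodes), len(accepted_attrs)), dtype=np.float32)
    for e, attr in enumerate(accepted_attrs):
        H[index[user], e] = 1.0                             # user joins every hyperedge
        H[index[attr], e] = 1.0                             # the attribute itself
        for v in cand_items:
            if attr in item_attrs[v]:
                H[index[v], e] = 1.0                        # items carrying the attribute
    return nodes, H
```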

METHODOLOGY: DAHCR Policy Learning
State Encoder (cont.):
- Employ multi-head self-attention hypergraph neural networks over the dynamic hypergraph.
- Aggregate information from the hyperedges to refine the node representations and obtain the connectivity state.
- The final state combines the interactive state, the connectivity state, and a size feature that encodes the lengths of the candidate item set and candidate attribute set as ten-digit binary features.
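
A small sketch of the ten-digit binary size feature and the final-state composition. The slide does not specify how the three parts are fused, so plain concatenation here is an assumption.

```python
import torch

def binary_size_feature(n, bits=10):
    """Encode a candidate-set size as a ten-digit binary vector (most significant bit first)."""
    n = min(n, 2 ** bits - 1)                       # clamp to what ten bits can represent
    return torch.tensor([(n >> i) & 1 for i in reversed(range(bits))], dtype=torch.float32)

def final_state(interactive_state, connectivity_state, n_cand_items, n_cand_attrs):
    """Combine interactive state, connectivity state, and candidate-set sizes.
    Concatenation is an assumption made for this sketch."""
    size_feat = torch.cat([binary_size_feature(n_cand_items),
                           binary_size_feature(n_cand_attrs)])
    return torch.cat([interactive_state, connectivity_state, size_feat], dim=-1)
```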

METHODOLOGY: DAHCR Policy Learning
Hierarchical Action Selection Strategy:
- Design a novel dueling Q-network to conduct policy learning under the hierarchical structure.
- Assuming delayed rewards are discounted by a factor gamma, define the Q-value as the expected discounted reward of the director's option o_t and the actor's action a_t given state s_t.
- To realize a differentiable discrete sample of the director's option and alleviate the bad effect of model bias on the mutual influence between director and actor, o_t is modeled by sampling from a categorical distribution with Gumbel-softmax (see the sketch below).
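
A hedged PyTorch sketch of the two pieces named above: a dueling Q head (value plus advantage streams) and a differentiable sample of the director's option via `torch.nn.functional.gumbel_softmax`. Layer sizes and the exact network layout are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingQHead(nn.Module):
    """Dueling decomposition Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, state):
        v = self.value(state)                        # (batch, 1)
        a = self.advantage(state)                    # (batch, n_actions)
        return v + a - a.mean(dim=-1, keepdim=True)

def sample_director_option(option_logits, tau=1.0):
    """Differentiable discrete sample of the director's option (ask vs. recommend)
    from a categorical distribution using the Gumbel-softmax trick."""
    one_hot = F.gumbel_softmax(option_logits, tau=tau, hard=True)   # straight-through one-hot
    return one_hot                                   # gradients flow through the soft sample
```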

METHODOLOGY: DAHCR Policy Learning
Hierarchical Action Selection Strategy (cont.):
- The optimal Q-function attains the maximum expected reward and is obtained by optimizing the hierarchical policies of the director and the actor under the Bellman equation, where the action space of primitive actions depends on the director's option.
- Mutual influence between director and actor (why Gumbel-softmax is needed):
  - Bias caused by the director (e.g., bad options) may filter out effective actions for the actor.
  - Bias caused by the actor (e.g., false feedback) may affect the convergence of the director.
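
To make the "action space depends on the director's option" point concrete, here is a tiny sketch that restricts the actor's scoring to candidate attributes when the option is ask and to candidate items when it is recommend. Variable names and the actor Q head are illustrative placeholders.

```python
import torch

def actor_action_space(option_is_ask, cand_attr_ids, cand_item_ids):
    """The director's option selects which primitive actions the actor may score:
    candidate attributes for 'ask', candidate items for 'recommend'.
    A biased (bad) option therefore filters out potentially effective actions."""
    return cand_attr_ids if option_is_ask else cand_item_ids

def select_primitive_action(actor_q_head, state, action_ids, action_embeddings):
    """Score only the actions allowed by the current option and pick the best one."""
    q_values = actor_q_head(state, action_embeddings[action_ids])   # hypothetical Q head over candidates
    return action_ids[int(torch.argmax(q_values))]
```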

METHODOLOGY: DAHCR Policy Learning
Model Training:
- At each turn, the agent receives the intrinsic motivation for the director's option and the extrinsic reward for the actor's action; the candidate action space is updated according to the user's feedback.
- A replay buffer D stores transitions for both the director's options and the actor's actions.
- Double Q-learning is applied for training.
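
A compact sketch of the training step described above: a shared replay buffer of per-turn transitions and a double Q-learning target, where the online network selects the argmax action and the target network evaluates it. The discount factor and buffer fields are assumptions.

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """Stores one transition per turn for both the director (option, intrinsic reward)
    and the actor (primitive action, extrinsic reward)."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, option, action, r_intrinsic, r_extrinsic, next_state, done):
        self.buffer.append((state, option, action, r_intrinsic, r_extrinsic, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def double_q_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """Double Q-learning: the online network chooses the next action,
    the target network evaluates it."""
    with torch.no_grad():
        next_action = q_online(next_state).argmax(dim=-1, keepdim=True)
        next_q = q_target(next_state).gather(-1, next_action).squeeze(-1)
    return reward + gamma * (1.0 - done) * next_q
```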

EXPERIMENT AND RESULT
Experiment Settings
Datasets:
- Music artist recommendation: LastFM, LastFM*.
- Business recommendation: Yelp, Yelp*.
Baselines: Max Entropy [1], Abs Greedy [2], CRM [3], EAR [4], SCPR [5], UNICORN [6], and MCMIPL [7].
Measurements:
- Success rate (SR@t): cumulative ratio of successful recommendations by turn t.
- Average turns (AT): average number of turns over all sessions.
- hDCG@(T, K): ranking performance of the recommendations.
References:
[1] Lei, Wenqiang, et al. "Estimation-action-reflection: Towards deep interaction between conversational and recommender systems." Proceedings of the 13th International Conference on Web Search and Data Mining. 2020.
[2] Christakopoulou, Konstantina, Filip Radlinski, and Katja Hofmann. "Towards conversational recommender systems." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 815-824.
[3] Sun, Yueming, and Yi Zhang. "Conversational recommender system." The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018, pp. 235-244.
[4] Lei, Wenqiang, et al. "Estimation-action-reflection: Towards deep interaction between conversational and recommender systems." Proceedings of the 13th International Conference on Web Search and Data Mining. 2020.
[5] Lei, Wenqiang, et al. "Interactive path reasoning on graph for conversational recommendation." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.
[6] Deng, Yang, et al. "Unified conversational recommendation policy learning via graph-based reinforcement learning." Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021.
[7] Zhang, Yiming, et al. "Multiple choice questions based multi-interest policy learning for conversational recommendation." Proceedings of the ACM Web Conference 2022. 2022.
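
For reference, a small sketch of how the two simpler metrics could be computed from logged sessions, each represented as a (success, turns) pair. hDCG is omitted because its exact formula is not given on the slide; treating failed sessions as lasting max_turns in AT is an assumption of this sketch.

```python
def success_rate_at_t(sessions, t):
    """SR@t: fraction of sessions whose recommendation succeeded by turn t."""
    return sum(1 for success, turns in sessions if success and turns <= t) / len(sessions)

def average_turns(sessions, max_turns=15):
    """AT: average number of turns per session (failed sessions counted as max_turns,
    an assumption for this sketch)."""
    return sum(turns if success else max_turns for success, turns in sessions) / len(sessions)
```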

EXPERIMENT AND RESULT
Result – Overall Performance

EXPERIMENT AND RESULT
Result – Visualization and Ablation Study
Fig. Test performance at different training epochs.
Tab. Results of the ablation study.

CONCLUSION: Summary
- Propose a Director-Actor Hierarchical Conversational Recommender (DAHCR), where the director selects the most effective option (i.e., ask or recommend) and the actor accordingly chooses primitive actions that satisfy the user's preference.
- Intrinsic motivation is designed to train the director's effectiveness from weak supervision.
- A dynamic hypergraph is developed to learn user preferences from high-order relations.
- Gumbel-softmax is employed to alleviate the bad effect of model bias on the mutual influence between director and actor.