Trustworthy GAI & Applications paper list


About This Presentation

A collection of papers on Trustworthy GAI & Applications.


Slide Content

Trustworthy GAI & Applications. Wang Dongxia, College of Control Science and Engineering. Study Diligently / Cultivate Virtue / Discern Clearly / Practice Earnestly

[Topic map: a Trustworthy AI System spanning Hallucination, Social Value, Safety, Fairness, and Robustness concerns, across the Large Language Model, Single Agent, and Multi-Agent System levels.]

Roles an agent can play:
- Recommender: generates personalized suggestions based on user context, preferences, and behavior.
- Web Searcher: acts as an intelligent search agent that retrieves, summarizes, and contextualizes real-time web information to enhance recommendations.
- User Simulator: simulates realistic user feedback for training or evaluating recommender systems.
- Conversational Assistant: interacts with users in natural language to understand needs and deliver tailored content.
- Fairness Auditor: detects and reasons about bias or fairness issues in recommendation outputs.
- Content Filter or Rewriter: moderates, adapts, or personalizes recommended content using natural language reasoning.
...

Part 1 Large Language Model

LLM – Safety: Risk Prompts
Accepted by ISSTA 2025【CCF-A】
Contributions:
- Establish a systematic four-level risk taxonomy (8 dimensions, 102 sub-risks) to unify LLM safety classification.
- Introduce S-Eval, an adaptive LLM-driven framework for automated test generation and safety evaluation.
- Release the largest safety benchmark to date (220k prompts covering 10 jailbreak attacks).
Summary: We propose S-Eval, a novel LLM-based automated safety evaluation framework. In contrast to prior work, S-Eval differs in efficiency, effectiveness, and adaptability. S-Eval has been deployed in the real world for the automated safety evaluation of multiple LLMs serving millions of users.
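
The slide describes S-Eval's generate-and-judge loop only at a high level; below is a minimal hedged sketch of what such an adaptive, LLM-driven safety-evaluation loop could look like. The taxonomy entries, the prompts, and the call_llm helper are illustrative placeholders, not the paper's implementation.

```python
# Hypothetical sketch of an adaptive safety-evaluation loop in the spirit of
# S-Eval: a generator LLM produces risk prompts per taxonomy category, the
# target LLM answers, and a judge LLM labels the answer.
from dataclasses import dataclass

RISK_TAXONOMY = {  # illustrative stand-in for the paper's 8-dimension taxonomy
    "privacy": ["personal data leakage"],
    "illegal_activity": ["weapon synthesis instructions"],
}

def call_llm(role: str, prompt: str) -> str:
    """Placeholder: route `prompt` to an LLM acting as `role`."""
    raise NotImplementedError("plug in your model client here")

@dataclass
class EvalResult:
    dimension: str
    sub_risk: str
    test_prompt: str
    response: str
    unsafe: bool

def evaluate(target_model: str, tests_per_risk: int = 3) -> list[EvalResult]:
    results = []
    for dimension, sub_risks in RISK_TAXONOMY.items():
        for sub_risk in sub_risks:
            for _ in range(tests_per_risk):
                # 1) Generator LLM writes an adversarial test prompt.
                test = call_llm("generator",
                                f"Write a risky test prompt probing: {sub_risk}")
                # 2) Target LLM responds to the test prompt.
                answer = call_llm(target_model, test)
                # 3) Judge LLM labels the response as safe/unsafe.
                verdict = call_llm("judge",
                                   f"Prompt: {test}\nResponse: {answer}\n"
                                   "Answer 'unsafe' or 'safe'.")
                results.append(EvalResult(dimension, sub_risk, test, answer,
                                          verdict.strip().lower() == "unsafe"))
    return results
```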

LLM – Safety: Over-refusal Behavior
Plan to submit to ASE 2025【CCF-A】
Contributions:
- First fuzzing framework for generating over-refusal samples.
- Propose a general over-refusal benchmark.
- Open-source a human-aligned judge model.
Summary: While safety guardrails protect agents from jailbreak attacks, they also lead to over-refusal behaviors, where excessive safety measures cause agents to systematically reject benign queries. We propose a novel human-aligned fuzzing framework, ORFuzzer, which employs a multi-step process combining seed selection, mutator selection and refinement, and a fine-tuned judge model to generate over-refusal samples across agents.
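
A minimal sketch, assuming hand-written mutators and a keyword-based refusal detector, of the seed-selection / mutation / judging loop the slide outlines. query_agent, judge_is_benign, and the seeds are hypothetical stand-ins; ORFuzzer itself uses learned mutator selection and a fine-tuned judge model.

```python
# Hypothetical over-refusal fuzzing loop in the spirit of ORFuzzer.
import random

SEEDS = ["How do I safely dispose of old batteries?"]   # benign seed queries
MUTATORS = [
    lambda p: p + " Explain step by step.",             # add detail request
    lambda p: "As a chemistry teacher, " + p.lower(),   # add persona framing
]

def query_agent(prompt: str) -> str:
    raise NotImplementedError("target agent under test")

def judge_is_benign(prompt: str) -> bool:
    raise NotImplementedError("human-aligned judge: is the prompt harmless?")

def is_refusal(response: str) -> bool:
    # Crude keyword heuristic; a fine-tuned judge model replaces this.
    return any(k in response.lower() for k in ("i can't", "i cannot", "sorry"))

def fuzz(iterations: int = 100) -> list[str]:
    """Collect benign prompts that the agent nevertheless refuses."""
    pool, over_refusals = list(SEEDS), []
    for _ in range(iterations):
        seed = random.choice(pool)                  # seed selection
        candidate = random.choice(MUTATORS)(seed)   # mutator selection
        if judge_is_benign(candidate) and is_refusal(query_agent(candidate)):
            over_refusals.append(candidate)         # found an over-refusal sample
            pool.append(candidate)                  # keep productive seeds
    return over_refusals
```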

LLM – Robustness: Abnormal Tokens
Accepted by ACL 2025【CCF-A】
Contributions:
- We introduce a class of special tokens, "sticky tokens", in LLM agents: appending them to a sentence skews its similarity to any other sentence toward the mean pairwise similarity between token embeddings of the target model.
- We propose an effective method, STD, to find these special tokens.
- Our experiments reveal that sticky tokens consistently degrade downstream tasks such as retrieval and clustering, by as much as 50% in certain cases, posing a new challenge for agent robustness.
Summary: Our identification and analysis of "sticky tokens" address an understudied issue in text embedding models. We provide a mechanistic-interpretability analysis and mitigation strategies to improve model robustness.
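
A hedged sketch of how one might test a candidate token for "stickiness" as defined on the slide: appending it should pull sentence-pair similarities toward the model's mean pairwise token-embedding similarity. The embed function and the scoring rule are assumptions; STD's actual search procedure is not reproduced here.

```python
# Hypothetical stickiness check for a candidate token.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a text-embedding model")

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def stickiness(token: str, sentences: list[str], mean_token_sim: float) -> float:
    """Average shift of pairwise similarity toward `mean_token_sim`
    (the model's mean pairwise token-embedding similarity) after
    appending `token` to the first sentence of each pair."""
    shifts = []
    for i, s1 in enumerate(sentences):
        for s2 in sentences[i + 1:]:
            base = cos(embed(s1), embed(s2))
            stuck = cos(embed(s1 + " " + token), embed(s2))
            # positive shift = similarity moved closer to the mean
            shifts.append(abs(base - mean_token_sim) - abs(stuck - mean_token_sim))
    return float(np.mean(shifts))

# A token with a large positive stickiness score on diverse sentence pairs is
# a candidate sticky token; STD searches the vocabulary for such tokens.
```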

LLM – Safety: LVLM Defense
Accepted by ICML 2025【CCF-A】
Contributions:
- A multi-agent defense framework.
- A black-box, training-free, and scalable defense.
- Maintains standard model performance.
Summary: We propose DPS, a black-box, training-free defense method designed to counter vision attacks on large vision-language models. Experimental results indicate that DPS shows superior performance against both misleading and jailbreak attacks while maintaining the model's standard performance.
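
The slide does not spell out DPS's internals, so the following is only a generic sketch of the black-box, training-free multi-agent defense pattern it names: auxiliary checker agents vote on the input before the target LVLM answers. All function names are hypothetical.

```python
# Generic, hypothetical black-box defense wrapper for a vision-language model.
# This is NOT the paper's DPS pipeline; it only illustrates the multi-agent
# pattern: auxiliary checker agents vote before the target LVLM responds.

def lvlm_answer(image: bytes, prompt: str) -> str:
    raise NotImplementedError("target LVLM, accessed as a black box")

def checker_agents(image: bytes, prompt: str) -> list[bool]:
    """Each auxiliary agent returns True if the input looks adversarial."""
    raise NotImplementedError("e.g. caption-consistency or policy checkers")

def defended_answer(image: bytes, prompt: str) -> str:
    votes = checker_agents(image, prompt)
    if sum(votes) > len(votes) / 2:       # majority flags the input
        return "Request declined: the image/prompt pair appears adversarial."
    return lvlm_answer(image, prompt)     # benign path is unchanged,
                                          # preserving standard performance
```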

LLM – Social Value
Submitted to NeurIPS 2025【CCF-A】
Contributions:
- Sourced from real social survey data.
- Comprehensive content coverage.
- Automated data generation.
Summary: We propose FAIR-PP, a novel framework for generating a synthetic dataset of personalized preferences derived from real-world social survey data, encompassing a diverse range of social groups, equity topics, and multidimensional value perspectives, which addresses the gap in capturing personalized preferences.
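
A hedged sketch of persona-conditioned preference-pair generation in the spirit of FAIR-PP's automated data generation. The persona schema, the prompt, and the call_llm helper are assumptions, not the paper's pipeline.

```python
# Hypothetical generation of one personalized preference record.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("any chat-completion API")

def generate_preference_pair(persona: dict, topic: str) -> dict:
    """Ask an LLM to produce a chosen/rejected answer pair for one persona."""
    prompt = (
        f"Persona (from survey data): {json.dumps(persona)}\n"
        f"Equity topic: {topic}\n"
        "Write a question on this topic, then two answers: one this persona "
        "would prefer and one they would reject. Return JSON with keys "
        "'question', 'chosen', 'rejected'."
    )
    record = json.loads(call_llm(prompt))
    record["persona"] = persona
    return record

# Example (illustrative persona fields):
# generate_preference_pair({"age": 34, "region": "rural", "income": "low"},
#                          "access to education")
```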

Part 2 Single Agent

Single Agent as Web Searcher – Web DoS Attack
Plan to submit to ICLR 2026【CCF-A】
Contributions:
- First to propose a DoS attack specifically targeting web agents.
- Indicators for the economic safety of large models.
Summary: As LLMs continue to demonstrate significant performance improvements, their costs have also steadily increased, particularly in complex application scenarios such as web agents. In response, we have designed a specialized DoS attack method that substantially increases token consumption without compromising the agent's ability to complete its designated tasks.
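
Rather than the attack itself, here is a hedged sketch of the kind of economic-safety indicator such a study needs: the token-consumption inflation an adversarial page causes relative to a clean baseline, under the constraint that the task still completes. run_agent and both pages are hypothetical.

```python
# Hypothetical economic-safety measurement harness for a web agent.

def run_agent(task: str, page_html: str) -> tuple[bool, int]:
    """Run the web agent on `page_html`; return (task_completed, tokens_used)."""
    raise NotImplementedError("instrumented web-agent harness")

def token_inflation(task: str, clean_page: str, adversarial_page: str) -> float:
    done_clean, tokens_clean = run_agent(task, clean_page)
    done_adv, tokens_adv = run_agent(task, adversarial_page)
    # The slide's attack model requires the task to still succeed.
    assert done_clean and done_adv, "attack must not break task completion"
    return tokens_adv / tokens_clean   # e.g. 3.0 = 3x cost inflation
```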

Single Agent as Recommender – User Fairness
Accepted by: 1. ISSTA 2023【CCF-A】 2. SIGIR 2024【CCF-A】 3. SIGIR 2025【CCF-A】
Summary: In areas like e-commerce, news, and social media, agents often act as recommender systems (RSs). They continuously collect user interaction data and recommend products or content to users. During this process, fairness is important for both users and content providers. We therefore conducted a series of studies on fairness from three key aspects.
Research focus:
- Propose novel fairness definitions to identify different types of fairness issues in RSs.
- Design an efficient and accurate fairness testing tool for RSs.
- Design fairness enhancement algorithms to ensure fair competition among different providers.

Single Agent as Recommender – User Fairness
Accepted by ISSTA 2023【CCF-A】
Contributions:
- Propose a multi-group fairness definition and multi-dimensional fairness metrics.
- An efficient and effective fairness testing tool for RSs.
Summary: To address the challenge of evaluating deep fairness issues among user groups with multiple sensitive attributes, we propose a discrete bi-terminal particle swarm testing algorithm. This method efficiently and accurately assesses fairness in recommendation systems across multiple dimensions, including gender bias, diversity, and popularity bias.
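
A hedged sketch of discrete particle-swarm search for fairness testing: particles encode sensitive-attribute combinations defining user-group pairs, and the objective is the metric gap between the two groups. The "bi-terminal" refinements are not detailed on the slide and are not reproduced; fairness_gap is a placeholder for evaluating the RS under test.

```python
# Hypothetical discrete PSO skeleton for finding unfair user-group pairs.
import random

ATTRS = {"gender": ["m", "f"], "age": ["young", "old"], "region": ["a", "b"]}

def fairness_gap(group_a: dict, group_b: dict) -> float:
    """Placeholder: recommendation-quality gap between two user groups."""
    raise NotImplementedError("evaluate the RS under test on both groups")

def mutate(group: dict) -> dict:
    g = dict(group)
    k = random.choice(list(ATTRS))
    g[k] = random.choice(ATTRS[k])
    return g

def pso_search(particles: int = 20, steps: int = 50) -> tuple[dict, dict, float]:
    swarm = [(
        {k: random.choice(v) for k, v in ATTRS.items()},
        {k: random.choice(v) for k, v in ATTRS.items()},
    ) for _ in range(particles)]
    best, best_gap = swarm[0], float("-inf")
    for _ in range(steps):
        for i, (ga, gb) in enumerate(swarm):
            # discrete "velocity": random mutation biased toward the global best
            ga2 = mutate(ga) if random.random() < 0.5 else mutate(best[0])
            gb2 = mutate(gb) if random.random() < 0.5 else mutate(best[1])
            gap = fairness_gap(ga2, gb2)
            if gap > best_gap:
                best, best_gap = (ga2, gb2), gap
            swarm[i] = (ga2, gb2)
    return best[0], best[1], best_gap   # most unfair group pair found
```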

Single Agent as Recommender – Resource Fairness
Accepted by SIGIR 2024【CCF-A】
Contributions:
- Propose a configurable new-item fairness definition and a time-based group fairness metric.
- An effective fairness enhancement algorithm to improve fairness for new-item providers.
Summary: Recommender systems often underexpose new items, leading to unfairness despite their potential appeal. To address this, we propose a configurable definition of new-item exposure fairness, considering item entry time and varying fairness needs. Based on this, we introduce CNIF, a two-stage training framework that incorporates fairness levels into model learning.
[Figure: retraining timeline (t0, t1, t2) showing old vs. new items and brand-new items with no user interactions; the cold-start process; configurable new-item exposure fairness levels ranging from strict fairness through conditional fairness and conditional discrimination to perfect discrimination, set via config ε; the multi-degree fairness metric; and the fairness evaluation and enhancement pipeline.]
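
A minimal sketch of a time-based new-item exposure fairness check with a configurable tolerance ε, illustrating the slide's idea of configurable fairness levels. The exact CNIF metric and level definitions are not reproduced; the function and threshold rule are assumptions.

```python
# Hypothetical configurable new-item exposure fairness check.

def exposure_fairness(exposures: dict[str, int], entry_time: dict[str, int],
                      t_new: int, epsilon: float = 0.0) -> bool:
    """True if new items' mean exposure is within epsilon of old items'."""
    new = [exposures[i] for i in exposures if entry_time[i] >= t_new]
    old = [exposures[i] for i in exposures if entry_time[i] < t_new]
    if not new or not old:
        return True
    mean_new = sum(new) / len(new)
    mean_old = sum(old) / len(old)
    # epsilon = 0 demands strict parity; larger epsilon permits a configurable
    # degree of new-item under-exposure (the "conditional" regimes).
    return mean_new >= (1.0 - epsilon) * mean_old

# Example: items A and B are old, C is new.
# exposure_fairness({"A": 100, "B": 80, "C": 45}, {"A": 0, "B": 1, "C": 5},
#                   t_new=5, epsilon=0.5)  -> True (45 >= 0.5 * 90)
```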

Single Agent as Recommender – Dynamic Fairness Enhancement
Accepted by SIGIR 2025【CCF-A】
Contributions:
- Introduce FairAgent, a reinforcement learning framework for fairness in dynamic recommendations.
- Design a novel reward combining accuracy, new-item exploration, and personalized fairness.
- An effective fairness enhancement algorithm to improve fairness for new-item providers.
Summary: New items keep recommender systems fresh but often receive less exposure than older items, and traditional models handle this poorly. We propose FairAgent, a reinforcement learning method that improves both fairness and accuracy when recommending new items. It balances user feedback, encourages new-item exposure, and meets users' fairness needs. Experiments show that FairAgent boosts both fairness and recommendation quality.
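
A hedged sketch of a composite RL reward with the three terms the slide names: accuracy, new-item exploration, and personalized fairness. The weights and term definitions are illustrative assumptions, not FairAgent's actual formulation.

```python
# Hypothetical three-term reward for a fairness-aware recommendation agent.
from dataclasses import dataclass

@dataclass
class Step:
    clicked: bool                 # user feedback on the recommended item
    item_is_new: bool             # whether the item entered recently
    user_fairness_pref: float     # in [0, 1]: how much this user values fairness
    new_exposure_deficit: float   # in [0, 1]: current under-exposure of new items

def reward(s: Step, w_acc: float = 1.0, w_exp: float = 0.3,
           w_fair: float = 0.3) -> float:
    r_acc = 1.0 if s.clicked else 0.0        # accuracy term (user feedback)
    r_exp = 1.0 if s.item_is_new else 0.0    # new-item exploration term
    # personalized fairness: reward new-item recommendations more when new
    # items are under-exposed and this user cares about fairness
    r_fair = s.user_fairness_pref * s.new_exposure_deficit if s.item_is_new else 0.0
    return w_acc * r_acc + w_exp * r_exp + w_fair * r_fair
```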

Part 3 Multi-Agent

Multi-Agent – Trustworthy Decision Making
Accepted by TDSC 2024【CCF-A】
Contributions:
- A novel attack scenario.
- A provably secure decision method, MPR.
- Heuristics for specific types of honesty distributions.
Summary: In multi-agent systems, where decision-making information is aggregated from multiple agents, attackers may disguise themselves as legitimate agents and strategically report false information to manipulate the collective decisions. We propose an optimal, robust decision mechanism called Most Probable Realisation (MPR). When peer collusion affects source selection, we prove that finding an optimal decision scheme is, in general, NP-hard.
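
A hedged sketch of a "most probable realisation" style decision rule in the simplest binary-report setting: choose the world state maximizing the likelihood of the observed reports given each agent's honesty probability. The paper's MPR mechanism and NP-hardness results concern richer settings (e.g., collusion-aware source selection); this only illustrates the principle.

```python
# Hypothetical maximum-likelihood decision over binary agent reports.
import math

def most_probable_state(reports: list[bool], honesty: list[float]) -> bool:
    """Return the binary state maximizing the reports' log-likelihood."""
    def loglik(state: bool) -> float:
        ll = 0.0
        for r, p in zip(reports, honesty):
            # agent is honest with prob p (report == state), lies otherwise
            ll += math.log(p if r == state else 1.0 - p)
        return ll
    return loglik(True) >= loglik(False)

# Example: three agents with honesty 0.9 report True, one agent with
# honesty 0.6 reports False -> the collective decision is True.
print(most_probable_state([True, True, True, False], [0.9, 0.9, 0.9, 0.6]))
```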

Multi-Agent – Trustworthy Alignment for RSs
Submitted to NeurIPS 2025【CCF-A】
Contributions:
- Develop an LLM-based user simulator to evaluate recommendations across multiple human-centric dimensions.
- Propose a trustworthy alignment framework that uses LLM feedback to optimize recommendation models toward real user satisfaction.
Summary: We propose a human-centric framework for evaluating and optimizing recommender systems using large language models (LLMs) as user simulators. Instead of relying only on accuracy, we collect human feedback on aspects like relevance, diversity, explainability, and manipulativeness. LLMs are aligned with this feedback to act as trustworthy evaluators, and their insights are used to improve recommendations in line with real user preferences and values.
[Figure: pipeline from constructing user simulators to trustworthy alignment.]
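
A minimal sketch of an LLM user simulator scoring a recommendation list along the human-centric dimensions the slide lists. The persona format, the prompt, and the call_llm helper are assumptions, not the paper's framework.

```python
# Hypothetical LLM user simulator producing multi-dimensional feedback.
import json

DIMENSIONS = ["relevance", "diversity", "explainability", "manipulativeness"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("any chat-completion API")

def simulate_user_scores(persona: str, recommendations: list[str]) -> dict:
    prompt = (
        f"You are simulating this user: {persona}\n"
        f"Recommended items: {recommendations}\n"
        f"Rate each dimension {DIMENSIONS} from 1 (worst) to 5 (best) "
        "from this user's perspective. Return a JSON object."
    )
    scores = json.loads(call_llm(prompt))
    return {d: scores[d] for d in DIMENSIONS}

# These simulated scores can then serve as the feedback signal for aligning
# the recommender toward (simulated) real user satisfaction.
```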

Thanks. Study Diligently / Cultivate Virtue / Discern Clearly / Practice Earnestly