Paper Title | Authors | Journal Name | Year Published | Main Problem or Research Question | Main Objectives or Goals | Methods Used | Main Strengths or Contributions | Main Gaps or Limitations | Conclusion
Large Language Models can Accurately Predict Searcher
Preferences
Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra Proceedings of the 47th International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR
'24)
2024 The main problem is that much of the evaluation and tuning
of a search system relies on relevance labels—annotations
that say whether a document is useful for a given search and
searcher. Ideally these come from real searchers, but it is hard to collect this data at scale.
The main objectives are to present an alternative: use careful feedback from real searchers to select a large language model (LLM) and prompt that agree with that feedback, and then use the selected LLM to produce relevance labels at scale. The goal is to show that these LLM labels are as accurate as human labels and as useful for finding the best systems and the hardest queries.
The method used involves taking careful feedback from real
searchers and using this to select an LLM and prompt that
agrees with this feedback, which can then produce labels at
scale. The authors also discuss different labelling options
(Figure 1) and their costs and accuracy.
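To make the labelling step concrete, a minimal sketch of what such an LLM relevance judgement might look like is given below. The prompt wording, the 0-2 scale, and the call_llm helper are illustrative assumptions for exposition, not the paper's exact setup.

    # Minimal sketch of LLM-based relevance labelling (illustrative; not the paper's exact prompt).
    # call_llm is a hypothetical helper that sends a prompt to the chosen model and returns its text reply.
    def judge_relevance(query: str, passage: str, call_llm) -> int:
        prompt = (
            "You are a search quality rater.\n"
            f"Query: {query}\n"
            f"Passage: {passage}\n"
            "On a scale of 0 (not relevant) to 2 (highly relevant), "
            "how well does the passage answer the query? Reply with a single digit."
        )
        answer = call_llm(prompt)
        digits = [c for c in answer if c.isdigit()]
        return int(digits[0]) if digits else 0  # fall back to 0 if the reply contains no digit

In the paper's workflow, candidate prompts and models of this kind are compared against the gold searcher feedback, and the best-agreeing combination is kept for large-scale labelling.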
One of the main strengths is that the proposed approach uses high-quality "gold" labels from real searchers to select LLMs and prompts, and the selected LLMs and prompts are as accurate as human labellers and as useful for finding the best systems and the hardest queries. The authors also highlight that LLM performance varies unpredictably with prompt features.
One potential gap is that the approach depends on high-quality "gold" labels from real searchers, and collecting such data may not be feasible for all types of searches; this raises questions about scalability and cost in some settings.
The authors conclude that their approach of using large language models (LLMs) and prompts selected against real searcher feedback can accurately predict searcher preferences and is a viable alternative to traditional labelling approaches. They also emphasize the importance of high-quality "gold" labels for evaluating information retrieval systems, while noting that collecting such data from real searchers at scale may not be feasible or cost-effective for every type of search, which could limit the approach in certain domains or use cases.
Large Language Models are Zero-Shot Rankers for
Recommender Systems
Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao
arXiv (cs.IR)
2024 The paper aims to investigate the capacity of large language models to act as ranking models for recommender systems.
To formalize the recommendation problem as a conditional
ranking task, design prompting templates, and conduct
extensive experiments on two widely-used datasets.
The authors carefully design prompting templates and conduct experiments on two widely used datasets to study how well large language models (LLMs) solve this ranking task.
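As an illustration of the conditional ranking formulation, a hedged sketch of a prompt template follows; the wording and the build_ranking_prompt helper are assumptions for exposition, not the paper's exact template.

    # Illustrative prompt construction for zero-shot LLM ranking (not the paper's exact template).
    from typing import List

    def build_ranking_prompt(history: List[str], candidates: List[str]) -> str:
        history_block = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(history))
        candidate_block = "\n".join(f"[{chr(65 + i)}] {title}" for i, title in enumerate(candidates))
        return (
            "I have interacted with the following items, in order:\n"
            f"{history_block}\n\n"
            "Rank the candidate items below by how likely I am to choose them next. "
            "Answer with the candidate letters from most to least likely.\n"
            f"{candidate_block}"
        )

The specially designed prompting and bootstrapping strategies mentioned in the findings would then, for example, reorder candidates across repeated prompts and aggregate the resulting rankings to reduce position bias.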
The paper shows that LLMs have promising zero-shot ranking abilities but struggle to perceive the order of historical interactions and can be biased by popularity or by item position in the prompt. It also demonstrates how these issues can be alleviated using specially designed prompting and bootstrapping strategies.
The paper mentions two major limitations: (1) LLMs struggle
to perceive the order of historical interactions; (2) they can be biased by popularity or item positions in prompts. A potential
gap could be that the paper does not explore other possible
biases, such as user demographics.
The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank.
Large Language Models can Accurately Predict Searcher
Preferences
Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra Proceedings of the 47th International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR
'24)
2024 The main problem is that much of the evaluation and tuning
of a search system relies on relevance labels—annotations
that say whether a document is useful for a given search and
searcher. Ideally these come from real searchers, but it's hard
to collect this data at scale.
To present an alternative approach: use feedback from real searchers to select large language models (LLMs) and prompts that agree with that feedback, which can then produce labels at scale.
The authors take careful feedback from real searchers and
use this to select a large language model (LLM), and prompt,
that agrees with this feedback; the LLM can then produce
labels at scale.
The main strengths are that LLMs are as accurate as human
labellers and as useful for finding the best systems and
hardest queries.
A potential gap is the continued need for high-quality "gold" labels, along with the ongoing auditing, training, and monitoring that traditional approaches rely on to manage label quality.
In conclusion, this work presents an alternative labelling approach in which feedback from real searchers is used to select LLMs and prompts that agree with that feedback. The results show that the selected LLMs are as accurate as human labellers and can produce labels at scale.
When Large Language Models Meet Personalization:
Perspectives of Challenges and Opportunities
Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang,
Kai Zheng, Defu Lian, Enhong Chen
World Wide Web
2024 The paper discusses the challenges and opportunities in personalization when large language models are used.
To review the challenges in personalization and explore ways to address them with large language models: specifically, the development and challenges of existing personalization systems, the newly emerged capabilities of large language models, and potential ways of employing large language models for personalization.
Not explicitly mentioned
The paper provides a perspective on the opportunities and challenges in using large language models for personalization, highlighting their potential to revolutionize how humans interact with personalized systems.
Potential gap: while the paper discusses the potential benefits of using large language models for personalization, it does not explicitly address the limitations or challenges of implementing such systems, for example issues related to data privacy, security, and user trust in AI-driven personalized systems.
The conclusion is implicit, as the paper focuses on providing
perspectives rather than presenting concrete results.
However, it suggests that large language models have the
potential to revolutionize personalization by enabling active
user engagement and expanding its scope beyond
information filtering.
The Rise and Potential of Large Language Model Based
Agents: A Survey
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding,
Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu
Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao
Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang
Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen
Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu,
Xuanjing Huang and Tao Gui
arXiv (pre-print)
2023 The paper does not explicitly state a main problem or
research question. However, it can be inferred that the
authors aim to explore the potential of large language models
as the foundation for building artificial general intelligence
agents.
- To perform a comprehensive survey on LLM-based agents
The authors used a literature review approach to conduct their survey. They analyzed existing research papers related to large language models and artificial intelligence agents.
1. The paper provides an overview of the concept of AI agents from its philosophical origins to its development in AI. 2. It surveys the applications of LLM-based agents in single-agent scenarios, multi-agent scenarios, and human-agent cooperation, as well as agent societies.
1. The paper does not explicitly state any gaps or limitations. 2. A potential gap is the lack of explicit discussion of the challenges and difficulties in developing LLM-based AI agents.
The paper provides a comprehensive survey on LLM-based
agents, exploring their potential applications in single-agent
scenarios, multi-agent scenarios, and human-agent
cooperation. It also delves into agent societies, discussing the
behavior and personality of LLM-based agents, social
phenomena that emerge from an agent society, and insights
they offer for human society.
A Systematic Evaluation of Large Language Models of Code Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
- Evaluation of the largest existing models (Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot) across various programming languages
- Training of a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, trained on 249GB of code across 12 programming languages
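For context, code-generation evaluations of this kind commonly report pass@k on benchmarks such as HumanEval; a minimal sketch of the standard unbiased pass@k estimator is shown below (a common convention in this line of work, not code taken from the paper).

    # Unbiased pass@k estimator commonly used for code-generation benchmarks (illustrative).
    # n = samples generated per problem, c = samples that pass the unit tests, k = attempt budget.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0  # every size-k draw is guaranteed to contain a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 30 correct, budget of 10 attempts.
    print(round(pass_at_k(200, 30, 10), 4))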
PolyCoder outperforms all other models in the C programming language.
- The current state-of-the-art code LMs (e.g., Codex) are not
publicly available, leaving many questions about their model
and data design decisions. The paper does not explicitly
mention any gaps. However, a potential gap could be the lack
of evaluation on specific programming languages or domains
that may require more nuanced understanding.
The authors conclude by releasing open-source models that enable future research and application in this area. They also highlight PolyCoder's performance in the C programming language.
A systematic evaluation of large language models of code. Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Existing open-source models were evaluated for their performance on different programming languages. A new model called PolyCoder, trained exclusively on a multi-lingual corpus of code, was introduced and outperforms all other models in the C programming language.
The paper presents a systematic evaluation of large language
models of code, which helps to fill in the gaps about these
models' design decisions. A new open-source model
(PolyCoder) is introduced that achieves better performance
than existing models on some programming languages.
Although Codex itself is not open-source, it's mentioned as a reference point for comparison. However, there might be potential gaps in terms of the data and training methods used by Codex, which are not publicly available.
The paper concludes that existing open-source models can achieve results close to Codex on some programming languages, but more research is needed to understand their design decisions. A potential gap could be the lack of transparency in the training process and data selection for closed models such as Codex; further investigation into this area might help improve our understanding of these models' capabilities and limitations.
A Systematic Evaluation of Large Language Models of Code Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Evaluation of large language models (LMs) of code; pretraining on 249GB of code across 12 programming languages
Existing open-source models do achieve close results in some programming languages, although they are targeted mainly at natural language modeling. The authors release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture.
The current state-of-the-art code LMs are not publicly
available.
PolyCoder outperforms all models including Codex in the C
programming language. The authors release a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture. The lack of transparency and reproducibility in
the development process of large-scale AI models like code
LMs may hinder their adoption and improvement.
An Open Large Language Model for Code with Multi-Turn
Program Synthesis.
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn.
ACM SIGPLAN International Symposium on Machine
Programming (MAPS).
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions. The authors aim to fill in some of these
blanks through a systematic evaluation of the largest existing
models across various programming languages.
1) To evaluate the performance of large language models
(LMs) for code generation. 2) To identify an important missing piece, i.e., a large open-source model trained exclusively on a
multi-lingual corpus of code. 3) To release a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture.
The authors used existing models (Codex, GPT-J, GPT-Neo,
GPT-NeoX-20B, and CodeParrot) for evaluation across various
programming languages. They also trained a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture.
1) The authors provide a systematic evaluation of large
language models (LMs) for code generation. 2) They identify
an important missing piece in the form of a large open-source
model trained exclusively on a multi-lingual corpus of code. 3)
They release a new model, PolyCoder, with 2.7B parameters
based on the GPT-2 architecture.
Information not available
The main gap or limitation is not explicitly mentioned in the text, but a potential gap could be the lack of evaluation on more diverse programming languages or scenarios beyond those considered in the paper.
Evaluating large language models trained on code. Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
The authors release a new open-source model, PolyCoder,
which outperforms all models including Codex in the C
programming language. The evaluation of existing open-
source models provides insights into their strengths and
weaknesses across different programming languages.
Information not available
Training language models to follow instructions with human
feedback.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L.
Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal,
Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser
Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter
Welinder, Paul Christiano.
NeurIPS 2022
2022 How to align language models with user intent on a wide range of tasks by fine-tuning with human feedback?
To show an avenue for aligning language models with user
intent, and to demonstrate the effectiveness of this approach
in improving truthfulness and reducing toxic output
generation.
The authors collected a dataset of labeler-written prompts
and prompts submitted through a language model API. They
then used supervised learning to fine-tune GPT-3 using these
demonstrations, followed by reinforcement learning from
human feedback.
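The human-feedback stage relies on a reward model trained on labeler comparisons, typically with a pairwise preference loss; a minimal sketch of that loss is given below (a schematic consistent with standard RLHF practice, not the authors' code).

    # Pairwise preference loss for a reward model (schematic illustration).
    # reward_chosen / reward_rejected are scalar scores the reward model assigns to the
    # labeler-preferred and labeler-rejected responses to the same prompt.
    import math

    def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
        # Minimising this loss pushes the chosen response's reward above the rejected one's.
        return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

    print(preference_loss(1.3, 0.2))  # small loss: the preferred response already scores higher

The fine-tuned policy is then optimised against this learned reward with reinforcement learning.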
The resulting models (InstructGPT) showed improvements in
truthfulness and reductions in toxic output generation while
having minimal performance regressions on public NLP
datasets.
Although the InstructGPT models still made simple mistakes,
fine-tuning with human feedback is a promising direction for
aligning language models with human intent. However, there
may be potential limitations to this approach, such as the
need for large amounts of high-quality training data and the
possibility that humans may not always provide accurate or
consistent feedback.
The authors conclude that their results show that fine-tuning
with human feedback is a promising direction for aligning
language models with human intent. They also suggest that
further research is needed to fully understand the potential
limitations and challenges of this approach.
Using DeepSpeed and Megatron to Train Megatron-Turing
NLG 530B - a large-scale generative language model.
[Information not available] Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS '22)
2022 The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Evaluation of large language models (LMs) of code;
pretraining on 249GB of code across 12 programming
languages using GPT-2 architecture and releasing an open-
source model, PolyCoder.
The release of a new open-source model, PolyCoder, which
outperforms all other models in the C programming language. Existing open-source models achieve close results to Codex in
some programming languages.
[Information not available] The trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris
Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan
Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics
2020 Neural response generation in conversational settings is
challenging due to the one-to-many problem and noise
present in human conversations.
To develop a large, tunable neural conversational response
generation model that can generate responses close to
human performance both in terms of automatic and human
evaluation.
The authors trained their model on 147M conversation-like
exchanges extracted from Reddit comment chains over a
period spanning from 2005 through 2017. They used the
Hugging Face PyTorch transformer architecture extended for
conversational response generation.
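Since the pre-trained model is publicly released, a short usage sketch is included below; it assumes the Hugging Face checkpoint name microsoft/DialoGPT-medium and the transformers library, which are not spelled out in the summary above.

    # Single-turn response generation with the released DialoGPT checkpoint (illustrative usage).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    # The end-of-sequence token acts as the separator between speaker turns.
    input_ids = tokenizer.encode("Does money buy happiness?" + tokenizer.eos_token, return_tensors="pt")

    reply_ids = model.generate(
        input_ids,
        max_length=128,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))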
The proposed DialoGPT model achieves performance close
to human both in terms of automatic and human evaluation
in single-turn dialogue settings, generating more relevant,
contentful, and context-consistent responses than strong
baseline systems. The pre-trained model and training pipeline
are publicly released for research into neural response
generation.
Not explicitly mentioned; however, a potential gap could be
the lack of exploration on how to adapt this approach to
other conversational settings beyond single-turn dialogue.
The authors conclude that their proposed DialoGPT model
can generate more relevant and context-consistent responses than strong baseline systems in single-turn dialogue settings.
They also emphasize the importance of releasing pre-trained
models for research into neural response generation.
A Simple Language Model for Task-Oriented Dialogue Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih
Yavuz, Richard Socher
Salesforce Research
The paper proposes a simple approach to task-oriented
dialogue that uses a single causal language model trained on
all sub-tasks recast as a single sequence prediction problem.
To show that the proposed approach can solve all the sub-tasks of task-oriented dialogue, including understanding user input, deciding actions, and generating responses.
The authors propose a simple language model for task-
oriented dialogue that uses a pre-trained causal language
model (GPT-2) as a starting point. They recast each sub-task
into a single sequence prediction problem and train the
model on all sub-tasks simultaneously.
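A minimal sketch of that single-sequence recasting is shown below; the delimiter tokens and the build_training_sequence helper are illustrative assumptions rather than the paper's exact vocabulary.

    # Recasting task-oriented dialogue sub-tasks as one sequence for a causal LM (illustrative).
    def build_training_sequence(context: str, belief_state: str, actions: str, response: str) -> str:
        # The model is trained left-to-right, so at inference time it first decodes the belief
        # state, then the system actions, then the response, all conditioned on the dialogue context.
        return (
            f"<context> {context} "
            f"<belief> {belief_state} "
            f"<action> {actions} "
            f"<response> {response}"
        )

    print(build_training_sequence(
        "user: i need a cheap italian restaurant in the centre",
        "restaurant food italian, pricerange cheap, area centre",
        "restaurant inform name, restaurant offer book",
        "[restaurant_name] is a cheap italian place in the centre. would you like a table?",
    ))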
The proposed approach improves over the prior state-of-the-
art in joint goal accuracy for dialogue state tracking, and also
shows robustness to noisy annotations. It also improves main
metrics used to evaluate action decisions and response
generation in an end-to-end setting.
Not explicitly mentioned
The authors conclude that their proposed approach can solve
all the sub-tasks of task-oriented dialogue using a simple,
causal language model trained on all sub-tasks recast as a
single sequence prediction problem. They also highlight its
robustness to noisy annotations and improvements in main
metrics used to evaluate action decisions and response
generation.
A Survey on Evaluation of Large Language Models Yupeng Chang and Xu Wang (School of Artificial Intelligence, Jilin University, Changchun, China), Jindong Wang (Microsoft Research Asia, Beijing, China), Yuan Wu (School of Artificial Intelligence, Jilin University, Changchun, China), Linyi Yang (Westlake University, Hangzhou, China), Kaijie Zhu (Institute of Automation, Chinese Academy of Sciences, Beijing, China), Hao Chen (Carnegie Mellon University, Pittsburgh, USA), Xiaoyuan Yi (Microsoft Research Asia, Beijing, China), Cunxiang Wang (Westlake University, Hangzhou, China), Yidong Wang and Wei Ye (Peking University, Beijing, China), Yue Zhang (Westlake University, Hangzhou, China), Yi Chang (School of Artificial Intelligence, Jilin University, Changchun, China)
Information not available
The evaluation of large language models is becoming
increasingly critical at both task and societal levels for better
understanding their potential risks.
To provide a comprehensive review of evaluation methods
for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. To offer invaluable
insights to researchers in the realm of LLMs evaluation,
thereby aiding the development of more proficient LLMs.
The authors provided an overview from the perspective of
evaluation tasks, encompassing general natural language
processing tasks, reasoning, medical usage, ethics, education,
natural sciences, social sciences, agent applications, and other areas. They also answered the "where" and "how" questions
by diving into the evaluation methods and benchmarks.
The authors presented a comprehensive review of evaluation
methods for LLMs, which serves as a valuable resource for
researchers in this field.
One potential gap is that the paper does not explicitly discuss
the challenges faced when evaluating LLMs in specific
domains (e.g., healthcare, education) and how these
evaluations can be tailored to meet those domain-specific
needs. However, it's possible that this topic was discussed
elsewhere in the text.
The authors concluded by shedding light on several future
challenges that lie ahead in LLMs evaluation. They
emphasized the importance of continued research in this area to develop more proficient LLMs and mitigate potential risks.