Literature Map: LLM Response Evaluation

Allan Taracatac, Aug 16, 2024

About This Presentation

Literature map for large language model (LLM) response evaluation.


Slide Content

Fields for each entry: Paper Title | Authors | Journal Name | Year Published | Main Problem or Research Question | Main Objectives or Goals | Methods Used | Main Strengths or Contributions | Main Gaps or Limitations | Conclusion
Large Language Models can Accurately Predict Searcher
Preferences
Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24)
2024
The main problem is that much of the evaluation and tuning
of a search system relies on relevance labels—annotations
that say whether a document is useful for a given search and
searcher. Ideally these come from real searchers, but it is hard to collect this data at scale.
The main objectives are to discuss an alternative approach
using large language models (LLMs) and prompts to select
LLMs that agree with feedback from real searchers, which can
then produce labels at scale. The goal is to show that LLMs
are as accurate as human labellers and useful for finding the
best systems and hardest queries.
The method used involves taking careful feedback from real
searchers and using this to select an LLM and prompt that
agrees with this feedback, which can then produce labels at
scale. The authors also discuss different labelling options
(Figure 1) and their costs and accuracy.
One of the main strengths is that the proposed approach
uses high-quality "gold" labels from real searchers to select
LLMs and prompts, which are as accurate as human labellers
and useful for finding the best systems and hardest queries.
The authors also note that LLM performance varies unpredictably with prompt features.
One potential gap is that while the proposed approach uses
high-quality "gold" labels from real searchers to select LLMs
and prompts, it may not be feasible to collect this data at
scale for all types of searches. Additionally, there may be
limitations in terms of scalability and cost.
The authors conclude that their proposed approach of using large language models (LLMs) and prompts can accurately predict searcher preferences and is a viable alternative to traditional labelling approaches, and they emphasize the importance of high-quality "gold" labels for evaluating information retrieval systems. A remaining concern is scalability: collecting data from real searchers at scale may not be feasible or cost-effective for all types of searches, which could limit the applicability of this approach to certain domains or use cases.
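To make the labelling setup concrete, here is a minimal sketch (in Python) of the kind of LLM-based relevance labelling loop described above; the prompt wording and the call_llm helper are illustrative assumptions, not the authors' exact prompt or pipeline.

def build_relevance_prompt(query: str, document: str) -> str:
    # Ask the model for a graded relevance label, as a human assessor would give.
    return (
        "You are a search quality rater.\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "On a scale of 0 (not relevant) to 3 (perfectly relevant), "
        "how well does this document answer the query? Reply with a single digit."
    )

def label_at_scale(pairs, call_llm):
    # Produce relevance labels for many (query, document) pairs with one selected prompt.
    labels = []
    for query, document in pairs:
        raw = call_llm(build_relevance_prompt(query, document))  # any LLM client (assumed)
        labels.append(int(raw.strip()[0]))  # keep the leading digit as the label
    return labels

Once a prompt and model have been selected against the "gold" searcher feedback, the same loop can label far more query-document pairs than human assessors could.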
Large Language Models are Zero-Shot Rankers for
Recommender Systems
Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao
arXiv (cs.IR), 2024
The paper aims to investigate the capacity of large language
models that act as ranking models for recommender systems.
To formalize the recommendation problem as a conditional
ranking task, design prompting templates, and conduct
extensive experiments on two widely-used datasets.
Carefully designing prompting templates and conducting
experiments on two datasets to solve the ranking task by
large language models (LLMs).
The paper shows that LLMs have promising zero-shot ranking
abilities but struggle with perceiving the order of historical
interactions, can be biased by popularity or item positions in
prompts. It also demonstrates how these issues can be
alleviated using specially designed prompting and
bootstrapping strategies.
The paper mentions two major limitations: (1) LLMs struggle
to perceive the order of historical interactions; (2) they can be biased by popularity or item positions in prompts. A potential
gap could be that the paper does not explore other possible
biases, such as user demographics.
The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank.
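As an illustration of the conditional-ranking framing, the sketch below (a hypothetical template, not the paper's exact wording) turns a user's interaction history and a candidate list into a ranking prompt for an LLM.

def build_ranking_prompt(history: list[str], candidates: list[str]) -> str:
    # Conditional ranking: rank candidate items given the user's interaction history.
    history_block = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(history))
    candidate_block = "\n".join(f"[{chr(65 + i)}] {title}" for i, title in enumerate(candidates))
    return (
        "I have interacted with the following items, most recent last:\n"
        f"{history_block}\n\n"
        "Rank the candidate items below by how likely I am to engage with them next. "
        "Answer with the candidate letters in order, best first.\n"
        f"{candidate_block}"
    )

Bootstrapping strategies of the kind the paper describes can then be approximated by shuffling the candidate order across repeated calls and aggregating the returned rankings.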
Large Language Models can Accurately Predict Searcher
Preferences
Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24)
2024
The main problem is that much of the evaluation and tuning
of a search system relies on relevance labels—annotations
that say whether a document is useful for a given search and
searcher. Ideally these come from real searchers, but it's hard
to collect this data at scale.
To discuss an alternative approach using large language
models (LLMs) and prompts to select LLMs that agree with
feedback from real searchers, which can then produce labels
at scale.
The authors take careful feedback from real searchers and
use this to select a large language model (LLM), and prompt,
that agrees with this feedback; the LLM can then produce
labels at scale.
The main strengths are that LLMs are as accurate as human
labellers and as useful for finding the best systems and
hardest queries.
One potential gap is the continued need for high-quality "gold" labels: in traditional approaches label quality is managed through ongoing auditing, training, and monitoring, and the LLM-based approach still depends on such gold data to select models and prompts.
In conclusion, this work presents an alternative approach to
using LLMs and prompts to select LLMs that agree with
feedback from real searchers. The results show that LLMs are
as accurate as human labellers and can produce labels at
scale.
When Large Language Models Meet Personalization:
Perspectives of Challenges and Opportunities
Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang,
Kai Zheng, Defu Lian, Enhong Chen
World Wide Web, 2024
The paper discusses the challenges and opportunities in
personalization when large language models are used.
To review the challenges in personalization and explore ways
to address them with large language models, specifically
discussing the development and challenges of existing
personalization systems, the newly emerged capabilities of
large language models, and potential uses for making use of
large language models for personalization.
Not explicitly mentioned.
The paper provides a perspective on the opportunities and
challenges in using large language models for personalization,
highlighting their potential to revolutionize how interaction
between humans and personalized systems occurs.
Potential gap: while the paper discusses the potential benefits of using large language models for personalization, it does not explicitly address the limitations or challenges that may arise from implementing such a system, for example issues related to data privacy, security, and user trust in AI-driven personalized systems.
The conclusion is implicit, as the paper focuses on providing
perspectives rather than presenting concrete results.
However, it suggests that large language models have the
potential to revolutionize personalization by enabling active
user engagement and expanding its scope beyond
information filtering.
The Rise and Potential of Large Language Model Based
Agents: A Survey
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding,
Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu
Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao
Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang
Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen
Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu,
Xuanjing Huang and Tao Gui
arXiv (pre-print), 2023
The paper does not explicitly state a main problem or
research question. However, it can be inferred that the
authors aim to explore the potential of large language models
as the foundation for building artificial general intelligence
agents.
To perform a comprehensive survey on LLM-based agents.
The authors used a literature review approach to conduct their survey, analyzing existing research papers related to large language models and artificial intelligence agents.
The paper provides an overview of the concept of AI agents from its philosophical origins to its development in AI.
The paper does not explicitly state any gaps or limitations. A potential gap is the lack of explicit discussion of the challenges and difficulties in developing LLM-based AI agents.
The paper provides a comprehensive survey on LLM-based
agents, exploring their potential applications in single-agent
scenarios, multi-agent scenarios, and human-agent
cooperation. It also delves into agent societies, discussing the
behavior and personality of LLM-based agents, social
phenomena that emerge from an agent society, and insights
they offer for human society.
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
- Evaluation of the largest existing models (Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot) across various programming languages.
- Training of a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, trained on 249GB of code across 12 programming languages.
PolyCoder outperforms all other models in the C programming language.
- The current state-of-the-art code LMs (e.g., Codex) are not
publicly available, leaving many questions about their model
and data design decisions. The paper does not explicitly
mention any gaps. However, a potential gap could be the lack
of evaluation on specific programming languages or domains
that may require more nuanced understanding.
The authors conclude by releasing open-source models that
enable future research and application in this area. They also
highlight PolyCoder's performance in C programming
language.
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Existing open-source models were evaluated for their
performance on different programming languages. A new
model called PolyCoder was trained exclusively on a multi-
lingual corpus of code and outperformed all other models.
The paper presents a systematic evaluation of large language
models of code, which helps to fill in the gaps about these
models' design decisions. A new open-source model
(PolyCoder) is introduced that achieves better performance
than existing models on some programming languages.
Although Codex itself is not open-source, it's mentioned as a
reference point for comparison. However, there might be
potential gaps in terms of the data and training methods used by Codex, which are not publicly available.
The paper concludes that existing open-source models can
achieve close results to Codex on some programming
languages, but more research is needed to understand their
design decisions. A potential gap could be the lack of
transparency in the training process and data selection for
these large language models. The authors do not provide
detailed information about how PolyCoder was trained or
what specific features were used during its development.
Further investigation into this area might help improve our
understanding of these models' capabilities and limitations.
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn
Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS)
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Evaluation of large language models (LMs) of code; pretraining on 249GB of code across 12 programming languages.
Existing open-source models achieve close results in some programming languages, although they are targeted mainly at natural language modeling. The authors release a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture.
The current state-of-the-art code LMs are not publicly
available.
PolyCoder outperforms all models including Codex in the C
programming language. The authors release a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture. The lack of transparency and reproducibility in
the development process of large-scale AI models like code
LMs may hinder their adoption and improvement.
An Open Large Language Model for Code with Multi-Turn
Program Synthesis.
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn.
ACM SIGPLAN International Symposium on Machine
Programming (MAPS).
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions. The authors aim to fill in some of these
blanks through a systematic evaluation of the largest existing
models across various programming languages.
1) To evaluate the performance of large language models
(LMs) for code generation. 2) To identify an important missing piece, i.e., a large open-source model trained exclusively on a
multi-lingual corpus of code. 3) To release a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture.
The authors used existing models (Codex, GPT-J, GPT-Neo,
GPT-NeoX-20B, and CodeParrot) for evaluation across various
programming languages. They also trained a new model,
PolyCoder, with 2.7B parameters based on the GPT-2
architecture.
1) The authors provide a systematic evaluation of large
language models (LMs) for code generation. 2) They identify
an important missing piece in the form of a large open-source
model trained exclusively on a multi-lingual corpus of code. 3)
They release a new model, PolyCoder, with 2.7B parameters
based on the GPT-2 architecture.
Information not available.
The main gap or limitation is not explicitly mentioned in the text, but a potential gap could be the lack of evaluation on more diverse programming languages or scenarios beyond those considered in the paper.
Evaluating Large Language Models Trained on Code
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua
Hellendoorn
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS)
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
The authors release a new open-source model, PolyCoder,
which outperforms all models including Codex in the C
programming language. The evaluation of existing open-
source models provides insights into their strengths and
weaknesses across different programming languages.
Information not available
Training language models to follow instructions with human
feedback.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L.
Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal,
Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser
Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter
Welinder, Paul Christiano.
NeurIPS, 2022
How to align language models with user intent on a wide
range of tasks by fine-tuning with human feedback?
To show an avenue for aligning language models with user
intent, and to demonstrate the effectiveness of this approach
in improving truthfulness and reducing toxic output
generation.
The authors collected a dataset of labeler-written prompts
and prompts submitted through a language model API. They
then used supervised learning to fine-tune GPT-3 using these
demonstrations, followed by reinforcement learning from
human feedback.
The resulting models (InstructGPT) showed improvements in
truthfulness and reductions in toxic output generation while
having minimal performance regressions on public NLP
datasets.
Although the InstructGPT models still made simple mistakes,
fine-tuning with human feedback is a promising direction for
aligning language models with human intent. However, there
may be potential limitations to this approach, such as the
need for large amounts of high-quality training data and the
possibility that humans may not always provide accurate or
consistent feedback.
The authors conclude that their results show that fine-tuning
with human feedback is a promising direction for aligning
language models with human intent. They also suggest that
further research is needed to fully understand the potential
limitations and challenges of this approach.
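For readers who want the mechanics, the snippet below sketches the pairwise preference loss commonly used to train the reward model in an RLHF pipeline of this kind; reward_model is an assumed placeholder that maps a tokenized response to a scalar score, and the snippet is a simplification of the full three-stage procedure (supervised fine-tuning, reward modelling, reinforcement learning).

import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    # Scalar rewards for the human-preferred and dispreferred responses.
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # Maximise the margin between preferred and dispreferred responses.
    return -F.logsigmoid(r_chosen - r_rejected).mean()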
Using DeepSpeed and Megatron to Train Megatron-Turing
NLG 530B - a large-scale generative language model.
[Information not available]
Proceedings of the 6th ACM SIGPLAN International
Symposium on Machine Programming (MAPS '22)
2022
The current state-of-the-art code LMs are not publicly
available, leaving many questions about their model and data
design decisions.
To fill in some of these blanks through a systematic
evaluation of the largest existing models across various
programming languages.
Evaluation of large language models (LMs) of code;
pretraining on 249GB of code across 12 programming
languages using GPT-2 architecture and releasing an open-
source model, PolyCoder.
The release of a new open-source model, PolyCoder, which
outperforms all other models in the C programming language. Existing open-source models achieve close results to Codex in
some programming languages.
[Information not available]
Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris
Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan
Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics
2020
Neural response generation in conversational settings is
challenging due to the one-to-many problem and noise
present in human conversations.
To develop a large, tunable neural conversational response
generation model that can generate responses close to
human performance both in terms of automatic and human
evaluation.
The authors trained their model on 147M conversation-like
exchanges extracted from Reddit comment chains over a
period spanning from 2005 through 2017. They used the
Hugging Face PyTorch transformer architecture extended for
conversational response generation.
The proposed DialoGPT model achieves performance close
to human both in terms of automatic and human evaluation
in single-turn dialogue settings, generating more relevant,
contentful, and context-consistent responses than strong
baseline systems. The pre-trained model and training pipeline
are publicly released for research into neural response
generation.
Not explicitly mentioned; however, a potential gap could be
the lack of exploration on how to adapt this approach to
other conversational settings beyond single-turn dialogue.
The authors conclude that their proposed DialoGPT model
can generate more relevant and context-consistent responses than strong baseline systems in single-turn dialogue settings.
They also emphasize the importance of releasing pre-trained
models for research into neural response generation.
A Simple Language Model for Task-Oriented Dialogue
Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, Richard Socher
Salesforce Research
The paper proposes a simple approach to task-oriented
dialogue that uses a single causal language model trained on
all sub-tasks recast as a single sequence prediction problem.
To show that the proposed approach can solve all the sub-
tasks of task-oriented dialogue, including understanding
user input, deciding actions, and generating responses.
The authors propose a simple language model for task-
oriented dialogue that uses a pre-trained causal language
model (GPT-2) as a starting point. They recast each sub-task
into a single sequence prediction problem and train the
model on all sub-tasks simultaneously.
The proposed approach improves over the prior state-of-the-
art in joint goal accuracy for dialogue state tracking, and also
shows robustness to noisy annotations. It also improves main
metrics used to evaluate action decisions and response
generation in an end-to-end setting.
Not explicitly mentioned.
The authors conclude that their proposed approach can solve
all the sub-tasks of task-oriented dialogue using a simple,
causal language model trained on all sub-tasks recast as a
single sequence prediction problem. They also highlight its
robustness to noisy annotations and improvements in main
metrics used to evaluate action decisions and response
generation.
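A rough sketch of the single-sequence recasting follows; the tag names are hypothetical placeholders rather than the paper's exact serialization, but they show how belief state, actions, and response are concatenated into one training sequence for a causal language model.

def serialize_turn(context: str, belief_state: str, actions: str, response: str) -> str:
    # All sub-tasks (state tracking, action decision, response generation) become
    # one left-to-right sequence prediction problem for a causal LM such as GPT-2.
    return (
        f"<context> {context} <endofcontext> "
        f"<belief> {belief_state} <endofbelief> "
        f"<action> {actions} <endofaction> "
        f"<response> {response} <endofresponse>"
    )

At inference time the model is fed only the context segment and generates the remaining segments, decoding the belief state, actions, and response in turn.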
A Survey on Evaluation of Large Language Models
Yupeng Chang and Xu Wang (School of Artificial Intelligence, Jilin University, Changchun, China), Jindong Wang (Microsoft Research Asia, Beijing, China), Yuan Wu (School of Artificial Intelligence, Jilin University, Changchun, China), Linyi Yang (Westlake University, Hangzhou, China), Kaijie Zhu (Institute of Automation, Chinese Academy of Sciences, Beijing, China), Hao Chen (Carnegie Mellon University, Pittsburgh, USA), Xiaoyuan Yi (Microsoft Research Asia, Beijing, China), Cunxiang Wang (Westlake University, Hangzhou, China), Yidong Wang and Wei Ye (Peking University, Beijing, China), Yue Zhang (Westlake University, Hangzhou, China), Yi Chang (School of Artificial Intelligence, Jilin University, Changchun, China)
Information not available
The evaluation of large language models is becoming
increasingly critical at both task and societal levels for better
understanding their potential risks.
To provide a comprehensive review of evaluation methods
for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. To offer invaluable
insights to researchers in the realm of LLMs evaluation,
thereby aiding the development of more proficient LLMs.
The authors provided an overview from the perspective of
evaluation tasks, encompassing general natural language
processing tasks, reasoning, medical usage, ethics, education,
natural sciences, social sciences, agent applications, and other areas. They also answered the "where" and "how" questions
by diving into the evaluation methods and benchmarks.
The authors presented a comprehensive review of evaluation
methods for LLMs, which serves as a valuable resource for
researchers in this field.
One potential gap is that the paper does not explicitly discuss
the challenges faced when evaluating LLMs in specific
domains (e.g., healthcare, education) and how these
evaluations can be tailored to meet those domain-specific
needs. However, it's possible that this topic was discussed
elsewhere in the text.
The authors concluded by shedding light on several future
challenges that lie ahead in LLMs evaluation. They
emphasized the importance of continued research in this area to develop more proficient LLMs and mitigate potential risks.

DriveGPT4: Interpretable End-to-end Autonomous Driving via
Large Language Model
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K. Wong, Zhenguo Li, Hengshuang Zhao
Information not available (assumed to be a research journal)
The paper aims to extend the application of multimodal large
language models (MLLMs) to autonomous driving by
introducing DriveGPT4, an interpretable end-to-end
autonomous driving system.
To develop a novel interpretable end-to-end autonomous
driving system based on LLMs that can process multi-frame
video inputs and textual queries, facilitate the interpretation
of vehicle actions, offer pertinent reasoning, and predict low-
level vehicle control signals in an end-to-end fashion.
The authors used a bespoke visual instruction tuning dataset
specifically tailored for autonomous driving applications,
along with a mix-fine-tuning training strategy to achieve
advanced capabilities.
DriveGPT4 represents the pioneering effort to leverage LLMs
for the development of an interpretable end-to-end
autonomous driving solution. The fine-tuning of domain-
specific data enables DriveGPT4 to yield close or even
improved results in terms of autonomous driving grounding
when compared with GPT4-V.
Information not available.
Evaluations conducted on the BDD-X dataset showcase the
superior qualitative and quantitative performance of
DriveGPT4. The code and dataset will be publicly available.
The paper does not explicitly mention any gaps or limitations,
but a potential gap could be the lack of real-world testing and
validation of the proposed system in diverse scenarios and
environments. A more comprehensive evaluation of the
system's robustness and adaptability to different situations
would strengthen its overall performance and reliability.
Efficient Large-Scale Language Model Training on GPU
Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
Information not available (research journal)
Training large language models efficiently is difficult due to limited GPU memory capacity and the high number of compute operations required.
To show how tensor, pipeline, and data parallelism can be
composed to scale to thousands of GPUs; propose a novel
interleaved pipelining schedule that improves throughput by
10+% with memory footprint comparable to existing
approaches.
Composing tensor, pipeline, and data parallelism; proposing
an interleaved pipelining schedule.
The proposed approach allows for training iterations on a
model with 1 trillion parameters at 502 petaFLOP/s on 3072
GPUs (per-GPU throughput of 52% of theoretical peak).
Information not available.
Training large language models efficiently is challenging due to limited GPU memory capacity and the high number of compute operations required; the proposed approach enables efficient training by composing tensor, pipeline, and data parallelism with an interleaved pipelining schedule. The year of publication is not specified in the text. No explicit gaps or limitations are mentioned, but potential gaps include scalability issues at very large scales (e.g., tens of thousands of GPUs), limited applicability to specific types of models or tasks, and potential trade-offs between throughput and memory usage.
ERASER: A Benchmark to Evaluate Rationalized NLP Models
Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric
Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace
Not explicitly mentioned (likely a research journal or
conference proceedings)
The limitation of state-of-the-art NLP models being opaque in
terms of how they make predictions has increased interest in
designing more interpretable deep models for NLP that reveal
the 'reasoning' behind model outputs.
To propose a standardized benchmark, ERASER (Evaluating
Rationales And Simple English Reasoning), to advance
research on interpretable models in NLP and facilitate
progress on designing more interpretable NLP systems.
The authors propose several metrics that aim to capture how
well the rationales provided by models align with human
rationales and also how faithful these rationales are (i.e., the
degree to which provided rationales influenced the
corresponding predictions).
Not explicitly mentioned. However, one potential gap is that
the authors do not provide information about how they
collected human annotations of "rationales" (supporting
evidence) for each task in their benchmark.
The ERASER benchmark aims to advance research on
interpretable models in NLP and facilitate progress on
designing more interpretable NLP systems. By releasing this
benchmark, code, and documentation, the authors hope that
it will help researchers compare methods and track progress
in developing more interpretable deep models for NLP.
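To make the faithfulness metrics concrete, the sketch below implements comprehensiveness- and sufficiency-style scores of the kind ERASER proposes; predict_prob is an assumed helper returning the model's probability for a label given a token sequence.

def comprehensiveness(tokens, rationale_idx, label, predict_prob):
    # Remove the rationale tokens: a faithful rationale should noticeably hurt confidence.
    kept = [t for i, t in enumerate(tokens) if i not in set(rationale_idx)]
    return predict_prob(tokens, label) - predict_prob(kept, label)

def sufficiency(tokens, rationale_idx, label, predict_prob):
    # Keep only the rationale tokens: a sufficient rationale should preserve confidence.
    only = [tokens[i] for i in sorted(set(rationale_idx))]
    return predict_prob(tokens, label) - predict_prob(only, label)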
How Can We Know What Language Models Know?
Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig
Not specified (research journal)
The paper aims to more accurately estimate the knowledge
contained in language models by automatically discovering
better prompts.
To propose mining-based and paraphrasing-based methods
for generating high-quality and diverse prompts, as well as
ensemble methods to combine answers from different
prompts.
Mining-based and paraphrasing-based methods were used to
generate high-quality and diverse prompts. Ensemble
methods were also employed to combine answers from
different prompts.
The proposed methods can improve accuracy in estimating
the knowledge contained in language models, providing a
tighter lower bound on what LMs know.
Not explicitly mentioned The paper demonstrates that the proposed methods can
improve accuracy in estimating the knowledge contained in
language models. One potential gap is the lack of evaluation
on diverse tasks and domains. While the paper shows
promising results on a specific benchmark, it would be
interesting to see how well these methods generalize across
different tasks and domains.
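A hedged sketch of the prompt-ensembling step follows; the template format and the score_candidates helper (returning a probability for each candidate answer under a filled-in prompt) are assumptions rather than the paper's released code.

from collections import defaultdict

def ensemble_predict(subject: str, templates: list[str], score_candidates) -> str:
    # Combine answer probabilities from several mined or paraphrased prompts.
    totals = defaultdict(float)
    for template in templates:                    # e.g. "[X] was born in [Y]."
        filled = template.replace("[X]", subject)
        for candidate, prob in score_candidates(filled).items():
            totals[candidate] += prob             # simple unweighted ensemble
    return max(totals, key=totals.get)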
How Much Knowledge Can You Pack Into the Parameters of a
Language Model?
Adam Roberts, Colin Raffel, Noam Shazeer
Not specified (research journal)
Measuring the practical utility of using pre-trained language
models to answer questions without access to external
context or knowledge.
Fine-tuning pre-trained models to answer questions,
measuring the performance and scalability of this approach.
Pre-training a model (T5) on unstructured text data, fine-
tuning it for question answering tasks, releasing code and
trained models.
The paper shows that this approach scales with model size
and performs competitively with open-domain systems that
explicitly retrieve answers from an external knowledge source
when answering questions.
Not explicitly mentioned in the text. However, a potential gap
could be the lack of evaluation on diverse question types or
domains beyond trivia-style questions.
The paper demonstrates the practical utility of using pre-
trained language models to answer questions without access
to external context or knowledge.
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa
Not specified (research journal)
The paper explores whether large language models are
capable of zero-shot reasoning without any hand-crafted few-
shot examples.
To demonstrate that LLMs are decent zero-shot reasoners by
adding a simple prompt template ("Let's think step by step")
before each answer, and to show the versatility of this single
prompt across diverse reasoning tasks.
The authors used large-scale InstructGPT model (text-davinci-
002) and another off-the-shelf large model, 540B parameter
PaLM. They also employed chain-of-thought prompting
technique for eliciting complex multi-step reasoning through
step-by-step answer examples.
The paper shows that LLMs are decent zero-shot reasoners
by simply adding the prompt template "Let's think step by
step" before each answer, and achieves significant
improvements on diverse benchmark reasoning tasks without
any hand-crafted few-shot examples. The versatility of this
single prompt across very diverse reasoning tasks hints at
untapped and understudied fundamental zero-shot
capabilities of LLMs.
Information not available The paper highlights the importance of carefully exploring
and analyzing the enormous zero-shot knowledge hidden
inside LLMs before crafting fine-tuning datasets or few-shot
exemplars. It also serves as a minimal strongest zero-shot
baseline for challenging reasoning benchmarks.
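The two-stage prompting recipe is simple enough to sketch directly; call_llm is an assumed client function, and only the "Let's think step by step" trigger phrase is taken from the paper.

def zero_shot_cot(question: str, call_llm) -> str:
    # Stage 1: elicit step-by-step reasoning with the fixed trigger phrase.
    reasoning = call_llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: ask for the final answer conditioned on the generated reasoning.
    answer = call_llm(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()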
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton
Google Research (not a traditional journal)
The limits of the current generation of large language models
for program synthesis in general purpose programming
languages.
- Evaluate a collection of large language models on two new
benchmarks, MBPP and MathQA-Python, in both few-shot
and fine-tuning regimes. - Measure the ability of these
models to synthesize short Python programs from natural
language descriptions.
- Synthesis performance scales log-linearly with model size. -
The largest models can synthesize solutions to 59.6% of
problems from MBPP using few-shot learning with a well-
designed prompt. - Fine-tuning on a held-out portion of the
dataset improves performance by about 10 percentage
points across most model sizes.
- The largest models are generally unable to predict the
output of a program given a specific input.
The paper explores the limits of large language models for
program synthesis in general-purpose programming
languages. It evaluates these models on two new benchmarks
and finds that their performance scales log-linearly with
model size. However, it also highlights limitations, such as the
inability of even the best models to predict the output of a
program given a specific input.
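As a rough illustration of the few-shot setup, the sketch below assembles a synthesis prompt from demonstration (description, code) pairs; the exact prompt format and delimiters used in the paper differ.

def build_synthesis_prompt(examples, task_description: str) -> str:
    # examples: list of (description, code) pairs used as few-shot demonstrations.
    shots = "\n\n".join(f"# Task: {desc}\n{code}" for desc, code in examples)
    return f"{shots}\n\n# Task: {task_description}\n"

The model's continuation of this prompt is then executed against the benchmark's test cases to decide whether the problem counts as solved.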
Unified Language Model Pre-training for Natural Language
Understanding and Generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
Microsoft Research (Information not available)
The paper presents a new pre-trained language model that
can be fine-tuned for both natural language understanding
and generation tasks.
To propose a unified pre-trained language model (UniLM) that can be applied to both NLU and NLG tasks, achieving state-of-the-art results on various benchmarks.
UniLM achieves state-of-the-art results on five natural
language generation datasets, including improving ROUGE-L
scores for abstractive summarization tasks by 2.04 absolute
improvement. The model also compares favorably with BERT
on NLU benchmarks like GLUE and SQuAD.
Information not available.
One potential gap could be the lack of explicit evaluation of UniLM's performance in low-resource settings or with limited training data, which is an important consideration for real-
world applications where large amounts of labeled data may
not always be available.