History of the Histories: LLMs and the Art of Memory
About This Presentation
This presentation goes through the history of LLMs and how the history of past inferences within an LLM affects the likelihood of the next prediction.
Slide Content
Presented by BILAL
History of the Histories – LLMs and
the ART of Memory
LLMs: Types & Purpose
BERT
Pre-trained using Masked Language Modeling (MLM): some tokens are randomly masked, and the model predicts them. Great for understanding context in both directions (bi-directional).
GPT (Generative Pretrained Transformer)
Uses Causal Language Modeling: predicts the next token based on all previous tokens. Strong at generation tasks (e.g., ChatGPT).
T5
A transformer designed for any text-to-text task. It can be applied to summarization, translation, and Q&A by framing every task as text generation.
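A minimal sketch of the two pre-training styles in practice, assuming the Hugging Face transformers library and the publicly available bert-base-uncased and gpt2 checkpoints (illustrative choices, not from the slides):

from transformers import pipeline

# Masked Language Modeling (BERT-style): predict a hidden token using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The weather today is [MASK]."))

# Causal Language Modeling (GPT-style): predict the next tokens left to right
generate = pipeline("text-generation", model="gpt2")
print(generate("The weather today is", max_length=20))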
Histories: Importance & Challenges
Why Histories Matter
LLMs need to maintain conversational context across multiple turns to offer relevant, coherent responses. In complex applications like customer support, remembering past exchanges is crucial for user satisfaction.
Challenges
Token limits: most LLMs have a token window beyond which they lose context (e.g., GPT-4's token limit is 8k or 32k tokens); see the token-counting sketch after this list.
Balancing memory and processing speed: tracking too much history can slow down response times.
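As a concrete illustration of the token-limit challenge, a minimal sketch of counting tokens before sending a prompt, assuming OpenAI's tiktoken library; the 8k limit shown is illustrative:

import tiktoken

MAX_CONTEXT = 8192  # e.g., an 8k context window

enc = tiktoken.encoding_for_model("gpt-4")
history = "User: Hi!\nAssistant: Hello, how can I help?\n"
prompt = history + "User: What's the weather?"
if len(enc.encode(prompt)) > MAX_CONTEXT:
    print("History too long: trim or summarize older turns")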
Ways: How LLMs Maintain Histories
Sliding Window
LLMs keep the most recent n tokens (within the model's limit) and drop older ones. Useful for short interactions; see the sketch after the attention code below.
Explicit Memory Networks
Store conversation history in an external database or cache that the model queries for long-term context.
Attention Mechanism
LLMs weigh the importance of each token using self-attention (e.g., Query, Key, and Value matrices) to determine which tokens matter the most.
Self-attention mechanism (sketch):
import numpy as np
from scipy.special import softmax

def attention(query, key, value):
    # Scaled dot-product attention over Query, Key, and Value matrices
    scores = query @ key.T / np.sqrt(key.shape[-1])  # scale scores for numerical stability
    attention_weights = softmax(scores, axis=-1)
    return attention_weights @ value
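A minimal sketch of the Sliding Window approach mentioned above, assuming a Hugging Face tokenizer; the tokenizer choice and window size are illustrative, not from the slides:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice
MAX_TOKENS = 1024                                  # illustrative window size

def sliding_window(history_text, max_tokens=MAX_TOKENS):
    # Keep only the most recent tokens that fit in the window and drop older ones
    token_ids = tokenizer.encode(history_text)
    return tokenizer.decode(token_ids[-max_tokens:])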
Prediction of the Future
Context Windows: Use recent history to influence the prediction of the next token.
Embeddings: Convert tokens and histories into vectors to guide predictions.
Prediction Process:
Tokenization: Convert input text into tokens.
Embedding: Map tokens to vectors.
Contextualization: Process vectors through layers to consider historical context.
Generation: Predict the next token based on the processed context.
from transformers import AutoTokenizer, AutoModelForCausalLM  # assumed Hugging Face setup
tokenizer = AutoTokenizer.from_pretrained("gpt2")              # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_text = some_history + "The weather today is"             # some_history: prior conversation text
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=50)
Methodology: How ChatGPT Stores Histories
ChatGPT communication structure
ChatGPT follows a structured history:
{system}: Sets the role/instructions (e.g., "You are a helpful assistant").
{user}: Stores inputs from the user.
{assistant}: The responses generated by the model.
This structure ensures continuity in the conversation.
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What's the weather?"},
{"role": "assistant", "content": "The weather is sunny."}
]
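As an illustration (not part of the original slides), such a message list can be sent to a chat-completions style API. The sketch below assumes the official openai Python client and an illustrative model name:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather?"},
]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # the {assistant} reply, to be appended to the history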
Consequences of Repeated & Misused Tags
Repeating {system} tags
Conflicting or redundant instructions confuse the model.
Multiple {user} tags
The model might treat previous inputs as new queries, causing inconsistent responses.
Example
Two consecutive {system} tags could give contradictory instructions, resulting in incorrect behavior from the model; see the sketch below.
Technical Insight
When parsing, the LLM assigns weights to tags. Misplacing tags alters the conversation flow.
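A hypothetical message list (not from the slides) showing the example above, where two consecutive {system} entries carry contradictory instructions:

# Two consecutive system messages give conflicting instructions; the model cannot
# satisfy both, so its behavior becomes unpredictable.
bad_messages = [
    {"role": "system", "content": "Answer only in formal English."},
    {"role": "system", "content": "Reply casually, using slang."},
    {"role": "user", "content": "What's the weather?"},
]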
Correct Tagging in Conversations
Tips
Avoid dumping entire conversation histories in a single {user} tag, as this overloads the LLM's attention mechanisms.
Ensure history is labeled correctly:
{system} for instructions.
{user} for user queries.
{assistant} for model responses.
When marking a conversation as "This is history," don't overload the {user} tag with irrelevant past data; store it efficiently outside the token window if needed.
Use RAG (retrieval-augmented generation) for large histories.
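A minimal sketch of these tips, keeping only recent, correctly labeled turns in the prompt; the turn limit and helper name are illustrative assumptions, not from the slides:

MAX_TURNS = 6  # illustrative limit on how many past messages to keep in the prompt

def build_messages(system_prompt, past_turns, new_user_input):
    # past_turns: list of {"role": "user"/"assistant", "content": ...} dicts
    recent = past_turns[-MAX_TURNS:]  # older turns live outside the token window (e.g., retrieved via RAG)
    return ([{"role": "system", "content": system_prompt}]
            + recent
            + [{"role": "user", "content": new_user_input}])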
Predicting Text Step by Step
Token Prediction Process
Input Processing: Tokenize input text and embed tokens.
Contextual Analysis: Process tokens through transformer layers to understand context.
Prediction: Generate probabilities for the next token.
Selection: Choose the most likely next token based on probabilities.
outputs = model.generate(input_ids, max_length=50, num_return_sequences=1)
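To make the Prediction and Selection steps concrete, a minimal sketch of inspecting next-token probabilities directly, assuming the tokenizer and model loaded earlier; the greedy argmax choice is illustrative:

import torch

with torch.no_grad():
    logits = model(input_ids).logits               # shape: (batch, sequence_length, vocab_size)
next_token_logits = logits[0, -1, :]               # logits for the token that follows the context
probs = torch.softmax(next_token_logits, dim=-1)   # Prediction: probabilities over the vocabulary
next_token_id = torch.argmax(probs)                # Selection: greedy pick of the most likely token
print(tokenizer.decode([next_token_id.item()]))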
Hyperparameters in Text Generation
top_p (Nucleus Sampling)
Limits the sampling to a subset of tokens with cumulative probability p (e.g., p=0.9 keeps tokens with a cumulative probability of 90%).
max_tokens
Specifies the maximum number of tokens to generate.
temperature
Controls the randomness of predictions (higher values = more randomness).
top_k
Limits the sampling to the top k most likely tokens (e.g., k=50 keeps the top 50 tokens).
outputs = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,  # sampling must be enabled for top_p, top_k, and temperature to take effect
    top_p=0.9,
    top_k=50,
    temperature=0.7,
)
Practical Applications
Customer Support
Enables bots to handle long conversations while recalling past details (e.g., previous issues or preferences).
Code Support
Intelligent chatbots like GitHub Copilot recall coding style or user preferences over time.
Application Insight
Imagine handling a user query where a shopping bot recalls a customer's previous preferences in product recommendations.
Healthcare
Ensures LLMs can provide consistent advice by recalling a patient's history across sessions.
Conclusion
History Management
Efficient history management enhances the accuracy and performance of LLMs in practical applications.
Future Developments
Future developments in memory mechanisms (e.g., external memory) will allow LLMs to handle longer conversations seamlessly.
Takeaway
Understanding how to store and manage conversation history can significantly impact model performance and user satisfaction.