Evolution of Language Technology: From Bag of Words to Generative AI
Sponsored Talk under the aegis of IEEE Systems Council and Oslen College of Engineering and Science, FDU
Since 2022, the Generative AI revolution has taken the world by storm with its immense potential and promises.
At the heart of this transformation are large language models based on transformer architecture—particularly auto-regressive, pre-trained generative models like GPT. But this revolution didn’t happen overnight. It is the result of decades of iterative progress in the broader field of language technology, which consistently sought ways to make human-readable text understandable to machines through a series of increasingly sophisticated
techniques.
This presentation offers a retrospective techno-functional overview of that evolution, spotlighting the key
milestones that shaped the journey. Back in the 1970s, techniques like TF-IDF provided basic statistical
relevance scores based on word frequency. While not true embeddings, they offered foundational document-specific
representations. A significant leap occurred around 2011 with the rise of neural language models, eventually leading to the development of global, pre-trained word embeddings like Word2Vec (2013) and GloVe
(2014), which captured semantic relationships and marked a new era.
The year 2018 stands out as an inflection point, with the arrival of BERT from Google and GPT from OpenAI— two transformative models leveraging the attention mechanism of transformers. BERT set new benchmarks in
natural language understanding, excelling at sentence-level tasks with bidirectional context. GPT, on the other hand, opened doors to generative capabilities, surprising the world with its ability to produce human-like text,
although early versions faced issues like repetitive output.
The journey of GPT since then has focused on overcoming these limitations through advancements in decoding strategies, such as top-k and nucleus sampling, significantly improving fluency and diversity in generated outputs. This talk draws on landmark research papers and modeling strategies that have defined this multi-decade
evolution, offering attendees an engaging tour of how language technology has progressed from basic keyword counting to the generative intelligence we see today.
Bag of Words | TF-IDF | Word2Vec | BERT | GPT | Natural Language Processing
Slide Content
EVOLUTION OF LANGUAGE TECHNOLOGY: FROM BAG OF WORDS TO
GENERATIVE AI
GENERATING, EXPRESSING AND UNDERSTANDING “MEANING” FROM HUMAN SPOKEN WORDS
ABHIJIT CHOUDHARY
MS ANALYTICS GEORGIA TECH
M.S. MATHEMATICS & COMPUTING
INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI (IITG)
AGENDA:
A glimpse of the essence of Gen AI: the potential and promises of Gen AI.
Gen AI in action: when it works great and when it doesn't.
Unified view.
Role of learned embeddings: TF-IDF from the 1970s, and it's still there.
The advent of neural language modeling: the quest for pre-trained global meaning (word level).
The game-changing era of Transformers: the quest for human-like NLU and NLG.
The BERT era (bidirectional).
The GPT era (auto-regressive, generative): evolving decoding strategies.
Each generation is a topic in itself; this presentation touches on them from an evolutionary perspective.
ESSENCE OF HOW GENAI WORKS
ChatGPT can learn a thousand times better than humans.
Digital Intelligence >> Human Intelligence. (*For certain tasks)
Geoffrey Hinton, Godfather of Modern AI
The Path of Convergence
I am ordering at Chick-fil-A: a Chicken Sandwich. (t = 0)
For all time t >= 0, sample meaningful sub-sequences from a sequence of sequences, autoregressively (feed the output back into the model).
Litmus test: whenever meaning is contained in a sequence of tokens, we can build specialized Gen AI systems (agents). Examples: text, music, protein sequencing, code generation, image generation.
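Conceptually, that autoregressive loop can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the model callable is an assumed placeholder that returns a probability distribution over the vocabulary for the next token.

import random

def generate(model, tokens, max_new_tokens=20, eos="<eos>"):
    """Autoregressive sampling: at each step t >= 0, sample the next token,
    append it to the sequence, and feed the sequence back into the model."""
    for _ in range(max_new_tokens):
        dist = model(tokens)                       # assumed: {token: probability}
        next_tok = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if next_tok == eos:
            break
        tokens.append(next_tok)                    # fed back on the next iteration
    return tokens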
GENAI IN ACTION (THE GOOD)
input_sequence = "I am at Chick-fil-A. For lunch, please give me a large"
Generated sequence 1: Sure, you could try the Chick-fil-A Deluxe Sandwich meal. It comes with a large Waffle Potato Fries and a large Freshly-Brewed Iced Tea. You can also add a side salad or a fruit
Generated sequence 2: Chick-fil-A Deluxe Sandwich meal. This includes a large waffle potato fries and a large beverage of your choice. I would suggest their signature lemonade or iced tea.
Generated sequence 3: Sure, here is your order suggestion:
- Large Chick-fil-A Deluxe Sandwich
- Large Waffle Potato Fries
- Large Freshly-Brewed Iced Tea
A real-world use case showing the strength of Gen AI.
Depicts how Gen AI works, with different responses coming from different paths of completion.
Underscores Gen AI's ability to excel in structured, goal-oriented tasks, such as generating options on top of the Chick-fil-A menu, in a way that feels natural.
Could enrich the customer experience.
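As a hedged illustration of the behavior above, several distinct completions for one prompt, the same effect can be reproduced with sampling-based generation via the Hugging Face transformers library. The small public gpt2 checkpoint below is an assumption for the sketch; the talk's examples clearly came from a larger chat-tuned model.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # small public model, for illustration only

completions = generator(
    "I am at Chick-fil-A. For lunch, please give me a large",
    do_sample=True,           # sample instead of always taking the most probable token
    num_return_sequences=3,   # three different paths of completion
    max_new_tokens=30,
)
for i, c in enumerate(completions, 1):
    print(f"Generated sequence {i}: {c['generated_text']}")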
GENAI IN ACTION (THE BAD)
Is liril a palindrome?
Understandable based on Gen AI's inner workings.
An acceptable response would be to ask for more information.
Can Excel in solving Contemporary Abstract Algebra
(PhD level) but can fail in High School Data Interpretation
Gen AI systems are “predominantly” Question Answering Systems
built on very high- dimensional, unidirectional embeddings. Their
vast vector space enables them to perform nuanced tasks out of
the box.
For enterprise-grade, real-world solutions there is a need to complement them with other technologies.
Gen AI produces "delusions" with confidence; these sporadic, hard-to-find issues pose a big risk.
Cost, Privacy, Control, Size, Complexity – many reasons to look for
alternatives.
Gen AI together with companion technologies is best.
GEN AI IS GREAT BUT NOT FOOLPROOF: IT NEEDS A BALANCED LOOK
A seaplane and a snow car are marvels of technology, but a boat or a ski is better for practical use.
A UNIFIED VIEW OF LANGUAGE TECHNOLOGY (COMMON FOUNDATION)
Diagram: VOCABULARY SPACE (CORPUS) → EMBEDDING LAYER → MODELING LAYER, Y = f(X). The corpus supplies domain-specific vocabulary and language; the embedding layer maps text to vector space (evolving from frequentist, to bidirectional semantics, to generative); the modeling layer (custom or pre-trained/fine-tuned models) serves tasks such as token classification, text classification, text summarization, question answering and sentiment analysis.
The evolution of language technology is essentially the story of how embeddings have become increasingly
sophisticated—progressing from one state-of-the-art approach to the next—each generation capturing
more nuanced aspects of human language understanding.
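To make the unified view concrete, here is a minimal sketch of the embedding-layer-plus-modeling-layer pattern for one of the listed tasks, text classification. It is my own illustration using scikit-learn, not material from the slides; the toy corpus and labels are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, purely for illustration.
X_train = ["great sandwich, loved it", "cold fries, terrible service",
           "friendly staff and fast", "order was wrong again"]
y_train = ["positive", "negative", "positive", "negative"]

# EMBEDDING LAYER (text to vector space) followed by MODELING LAYER (Y = f(X)).
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

print(model.predict(["the sandwich was great"]))   # e.g. ['positive']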
TOKENIZER (TF-IDF EXAMPLE)
Diagram: VOCABULARY SPACE (CORPUS) → EMBEDDING LAYER → MODELING LAYER, Y = f(X) → TEXT CLASSIFICATION.
Common foundation: learned knowledge from training, reused to embed new data.
Look at the sales receipt and tell me which item was sold.
TF-IDF = TF × IDF assigns an importance weight to each word within a document corpus.
TF (term frequency) = how important a word is within the document (its frequency).
IDF (inverse document frequency) = log(total number of documents / number of documents containing that word).
*Rare words (small denominator) get a high weight and act as the defining signature of a document.
A TOKENIZER IS A PROGRAM THAT PROVIDES A MULTI-DIMENSIONAL REPRESENTATION OF WORDS
orders = [
"Give me a large French fry.",
"Give me a large spicy sandwich.",
…]
TOKENIZER (TF-IDF EXAMPLE CONTINUED)
new_order = "Grilled Chicken Sandwich with Waffle Fry"
LIMITATIONS:
It memorizes but doesn't generalize.
It cannot handle synonyms of the same word.
It doesn't care about word sequences.
The context of words is not handled.
The embedding matrix columns contain the TF-IDF score of each word for each document in the corpus.
A new receipt gets its vector representation from the vectorizer fitted during the training phase.
Significance: being lightweight, TF-IDF still fits certain static, corpus-driven scenarios; its limitations became the ground for the next round of sophistication. (A minimal sketch of the fit-and-transform flow follows.)
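A hedged sketch of the two TF-IDF slides above, using scikit-learn's TfidfVectorizer as a stand-in for whatever tooling the talk used; the two extra toy orders are assumptions added so the example runs end to end.

from sklearn.feature_extraction.text import TfidfVectorizer

orders = [
    "Give me a large French fry.",
    "Give me a large spicy sandwich.",
    "Grilled chicken sandwich and a waffle fry.",     # assumed extra documents
    "Large iced tea with the chicken sandwich.",
]

# Training phase: learn the vocabulary and IDF weights from the corpus.
vectorizer = TfidfVectorizer()
embedding_matrix = vectorizer.fit_transform(orders)    # shape: documents x vocabulary
print(embedding_matrix.shape)
print(vectorizer.get_feature_names_out())

# A new receipt gets its vector from the vectorizer of the training phase.
new_order = "Grilled Chicken Sandwich with Waffle Fry"
print(vectorizer.transform([new_order]).toarray().round(2))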
NEURAL LANGUAGE MODELING (INTRODUCTION)
MOTIVATION: The limitations of frequency-based algorithms initiated various research paths, primarily leveraging deep learning to learn the embedding weights of each word using neural networks.
IMPACT: Word embeddings are no longer stand-alone vectors; embeddings hold relations with related words and are co-located in embedding space, thus enabling word-level algebra via embeddings (see the sketch below).
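A minimal sketch of that word-level algebra, assuming the pre-trained 300-dimensional Google News Word2Vec vectors available through gensim's downloader (not necessarily the exact setup behind the slide):

import gensim.downloader as api

# Roughly 1.6 GB download: 300-dimensional Word2Vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# Word-level algebra: king - man + woman is closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Related words are co-located in the embedding space.
print(vectors.similarity("sandwich", "burger"))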
Each word gets a proportionate weight for each feature during training (see, for example, features such as Gender and Royal).
*www.deeplearning.ai
NEURAL LANGUAGE MODELING (UNDER THE HOOD)
Example: "I would like to order a sandwich and a large ?."
Diagram: each context word (I, would, like, a, sandwich, ...) is looked up in the embedding matrix (rows such as e_6239, e_5410, e_2398, ...), and the output layer scores candidate next words (e.g., e_234 → 0.93, e_567 → 0.03).
Pre-Training: Building on the idea from the previous slide, a feature-based embedding matrix is learned during model training,
where randomly initialized weights are iteratively refined via backpropagation. A key example is the 300-dimensional embedding
matrix trained on the Google News corpus, which captures semantic and syntactic relationships across billions of words.
In Action: The model predicts the next word using a context window of the previous n=4 words. Each word in the input sequence is
encoded into its corresponding embedding from the embedding matrix. These embeddings are passed through a feedforward neural
network with a multi-class classification layer, which outputs the most probable next word.
Sophistications: Building on these foundations, models like Word2Vec and GloVe defined the state of the art from 2013 to 2018,
paving the way for the next wave of innovations, including transformer-based architectures from BERT to various GPTs.
Diagram: window of context words → embedding vector of each word → neural network → predict the target word given the context → predicted word vector.
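A hedged, minimal PyTorch sketch of the fixed-window neural language model described above. The n = 4 context window and 300-dimensional embeddings follow the slide; the vocabulary size, hidden width and random inputs are assumptions.

import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Predict the next word from the previous n context words."""
    def __init__(self, vocab_size=10000, embed_dim=300, context=4, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # learned embedding matrix
        self.ff = nn.Sequential(
            nn.Linear(context * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),                      # multi-class classification layer
        )

    def forward(self, context_ids):                             # shape: (batch, context)
        emb = self.embed(context_ids).flatten(start_dim=1)      # concatenate the window's embeddings
        return self.ff(emb)                                     # logits over the whole vocabulary

model = FixedWindowLM()
context = torch.randint(0, 10000, (1, 4))                       # e.g. "order a sandwich and"
print(model(context).argmax(dim=-1))                            # index of the most probable next word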
SELF-ATTENTION MECHANISM
Input Sequence: “Give Me a Chicken Sandwich with a layer of American cheese on top of it.”
Enhanced embedding: the final output embedding of "Sandwich" (the query) incorporates the embeddings of all other words in the sequence, including itself, in proportion to the attention each one receives.
Under the hood: every query (input word) is projected into a query space, and all the context words are projected into key and value spaces. Self-attention is the game changer in the language-technology space, from BERT to the GPTs.
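A minimal numpy sketch of scaled dot-product self-attention as described above; the random projection matrices and toy sizes are assumptions for illustration.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Each output row mixes the value vectors of every word in the sequence,
    weighted by how much attention the query word pays to each of them."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # project into query / key / value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax: self-attention scores
    return weights @ V                                  # attention-weighted enhanced embeddings

rng = np.random.default_rng(0)
seq_len, d_model = 15, 16                               # toy sizes (one row per word of the sentence)
X = rng.normal(size=(seq_len, d_model))                 # input word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)              # (15, 16): one enhanced embedding per word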
BI-DIRECTIONAL TRANSFORMERS (BERT ERA)
Sophistication of language understanding:
"BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks…"
Under The Hood:
Pre-training Tasks: BERT is trained on
Masked Language Modeling (MLM) to
predict masked words and Next Sentence
Prediction (NSP) to understand sentence
relationships, generating 768-dimensional
embeddings. (Sophisticated Language
Understanding)
Layered Processing: Input words are
processed through 12 stacked neural
network layers, each layer refining
linguistic understanding by focusing on
different aspects of language.
Task-Specific Output: The final embedding
is fed into a classification layer, enabling
diverse downstream tasks with high
accuracy.
Diagram: input word sequence → BERT architecture → enhanced embedding of each word (1×768; 768 dimensions of language understanding). The enhanced embedding of every word includes a weighted contribution from every other contextual word; the weights are self-attention scores.
*www.chrismccormick.ai
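As a hedged illustration of BERT's masked-language-modeling objective (my own example, not from the slides), the public bert-base-uncased checkpoint, with 12 layers and 768-dimensional embeddings, can be queried through the Hugging Face transformers fill-mask pipeline:

from transformers import pipeline

# bert-base-uncased: 12 encoder layers, 768-dimensional embeddings, pre-trained with MLM and NSP.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("I would like to order a sandwich and a large [MASK]."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")    # top predictions for the masked word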
TRANSFORMERS ERA (OVERVIEW)
Timeline (2017–2023): Transformers (2017); GPT and BERT (2018); GPT-2 and RoBERTa (2019); BART & T5; GPT-3 and Big Bird (2020); ChatGPT (2022); GPT-4 (2023).
Transformers, introduced in 2017, revolutionized neural network architecture with their self-attention mechanism, enabling them to
model contextual relationships in sequential data. These models evolved into two broad categories: Generative Transformers (e.g.,
GPT) and Bi-Directional Transformers (e.g., BERT). Generative models predict text sequentially, while bi-directional models analyze
the context of words from both directions.
NOTE: In the timeline figure, models above the blue line are generative transformers (decoder-only), while models below the blue line are bi-directional transformers.
Each advancement in transformers built on its predecessor, leveraging improvements in architecture, large training data, and faster
computation. For example, GPT demonstrated the power of large datasets, which influenced BERT's success with its 12-layer
architecture. Subsequent GPT versions further refined this foundation, showcasing the transformative potential of training scale and
architectural enhancements.
*www.chrismccormick.ai
GENERATIVE TRANSFORMERS (GPT ERA)
Model Name | #Decoder Layers | Embedding Length | # Parameters | Max Sequence Length
GPT (2018) | 12 | 768 | 117M | 1,024
GPT-2 (2019) | 48 | 1600 | 1.5B | 1,024
GPT-3 (2020) | 96 | 12288 | 175B | 2,048
GPT-4 (2023)* | ~150 | 16384 | ~1.7T | 8,192
Rise of Transformers
GPT-1 (2018): Introduced with 12 decoder-only
layers, GPT-1 had limited success but inspired
future models like BERT. BERT, with a comparable
architecture of 12 encoder-only layers, achieved
immense success in Natural Language
Understanding tasks.
GPT-2 (2019): Captivated the world with its
human-like text generation capabilities. However,
its limitations, such as getting stuck in repetitive
loops, curtailed its broader adoption.
GPT-3 (2020): Expanded model size significantly
and introduced critical changes to its decoding
strategy.
These advancements were further refined in
ChatGPT, enabling the generation of long,
coherent text and marking a milestone in
language modeling.
Larger models meant bigger embedding space and more parameters – together
encoding more nuanced language understanding. *GPT-4 details are not officially disclosed; the figures shown are estimates from industry experts. (Parameter table: Priya Shree, Walmart-Tech.)
GPT DECODING STRATEGY: 1. NEXT WORD & NEXT SEQUENCE PREDICTION
Diagram: next-word prediction. The input sequence "I would like to order a sandwich and a large" passes through decoder layers 1 to 12, producing next-word scores such as Diet (0.71), French (0.69), Milk (0.685) and Chocolate (0.43). Extending each candidate gives sequence probabilities such as French Fry. (0.42 × 0.9 = 0.378), Diet Coke. (0.33 × 0.95 ≈ 0.313), Chocolate Brownie. (0.3 × 0.98 = 0.294), Milk Shake. (0.31 × 0.89 ≈ 0.276) and Bag of whole wheat. (0.25 × 0.5 × 0.45 × 0.01 ≈ 0.00056).
Greedy search, which simply predicts the most probable next word at each step, does poorly because it can miss the most probable overall sequence.
But generating a single sequence won't solve the problem either, because the model might get stuck in a loop of repeating itself.
GPT DECODING STRATEGY: 2. PROBABILITY OF SEQUENCE
1. Greedy search, which is about predicting the single most probable next word, does poorly because it can miss the most probable overall sequence.
2. But generating a single sequence won't solve the problem, because the model might get stuck in a loop of repeating itself: "I would like to order a sandwich and a large French Fry and a large French Fry and a large French Fry …". The issue was cited in earlier research papers. The answer is beam search. (A sketch of sequence probability follows.)
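A minimal sketch of how the probability of a whole continuation is computed from per-token probabilities, reusing the illustrative numbers from the diagram above:

import math

def sequence_log_prob(step_probs):
    """The probability of a continuation is the product of its per-step token
    probabilities; summing logs avoids numerical underflow for long sequences."""
    return sum(math.log(p) for p in step_probs)

# Illustrative per-token probabilities, echoing the next-word-prediction diagram.
candidates = {
    "French Fry.": [0.42, 0.9],
    "Diet Coke.": [0.33, 0.95],
    "Bag of whole wheat.": [0.25, 0.5, 0.45, 0.01],
}
for text, probs in candidates.items():
    print(f"{text:<22} P = {math.exp(sequence_log_prob(probs)):.5f}")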
GPT DECODING STRATEGY: 3. BEAM SEARCH
Diagram: beam search tree over the prompt "I would like to order a sandwich and a large" (t = 0), expanding several candidate branches in parallel at subsequent steps, e.g., French → {Fry, Toast, Cut}, Milk → {Shake, puff, cake} and diet → {coke, plan, pill}.
The model keeps track of multiple beams at each point of generation and finds the most probable sequence based on total probability.
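A minimal, from-scratch beam-search sketch (not the talk's implementation); toy_step is a hypothetical next-token distribution that echoes the branches in the diagram:

import math

def beam_search(step_fn, start, beam_width=3, steps=3):
    """Keep the beam_width most probable partial sequences at every step and
    return the candidate with the highest total log-probability."""
    beams = [(0.0, [start])]                               # (log-probability, token sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for token, prob in step_fn(seq):               # next-token options for this beam
                candidates.append((logp + math.log(prob), seq + [token]))
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams[0]

def toy_step(seq):
    """Hypothetical next-token probabilities, echoing the diagram's branches."""
    table = {
        "large": [("French", 0.42), ("Milk", 0.31), ("diet", 0.25)],
        "French": [("Fry", 0.9), ("Toast", 0.05), ("Cut", 0.05)],
        "Milk": [("Shake", 0.89), ("puff", 0.06), ("cake", 0.05)],
        "diet": [("coke", 0.8), ("plan", 0.15), ("pill", 0.05)],
    }
    return table.get(seq[-1], [("<eos>", 1.0)])

print(beam_search(toy_step, "large"))                      # most probable beam and its log-probability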
GPT DECODING STRATEGY : NUCLEUS SAMPLING
*Perplexity Paper
1. At t=0 current sequence is “I would like to order a sandwich and a large ”.
2. The probability distribution for all vocabulary tokens is updated.
3. Top-p tokens: the tokens collectively meeting the threshold p = 0.9 (["fry", "milk", "toast"]) are highlighted in aqua in the graph above.
4. A token (e.g., "toast") is randomly selected from the top-p set, added to the sequence, and steps 1–4 repeat until the sequence-completion criterion is met. (A minimal sketch follows.)
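A minimal sketch of top-p (nucleus) selection as described in steps 1–4; the probability distribution below is hypothetical, chosen so that the nucleus matches the slide's ["fry", "milk", "toast"] example:

import random

def nucleus_sample(distribution, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then sample the next token from that renormalized set."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:                        # threshold reached: stop growing the nucleus
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token probabilities after "... and a large".
next_token_probs = {"fry": 0.45, "milk": 0.30, "toast": 0.16, "plan": 0.05, "pill": 0.04}
print(nucleus_sample(next_token_probs, p=0.9))     # drawn from the top-p set {"fry", "milk", "toast"}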
REFERENCES:
1. Word2Vec Paper: Efficient Estimation of Word Representations in Vector Space (Jan 2013)
2. Transformer Paper: Attention Is All You Need (June 2017)
3. GPT-1 Paper: Improving Language Understanding by Generative Pre-Training (June 2018)
4. BERT Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (May 2019)
5. GPT-2 Paper: Language Models are Unsupervised Multitask Learners (2019)
6. GPT-3 Paper: Language Models are Few-Shot Learners (May 2020)