Artificial Intelligence, Machine Learning, and (Large) Language Models: A Quick Introduction

Hiroki Sayama, 67 slides, Oct 08, 2024

About This Presentation

This is a version updated in August 2024.


Slide Content

Artificial Intelligence, Machine
Learning, and (Large) Language
Models: A Quick Introduction
Hiroki Sayama
[email protected]

Outline
1. The Origin: Understanding “Intelligence”
2. Key Ingredient I: Statistics & Data Analytics
3. Key Ingredient II: Optimization
4. Machine Learning
5. Artificial Neural Networks
6. (Large) Language Models
7. Challenges
2

The Origin:
Understanding
“Intelligence”
3

Alan Turing and the
Turing Machine (1936)
4
https://www.felienne.com/archives/2974

Turing Test (1950), a.k.a.
“the Imitation Game”
5
https://en.wikipedia.org/wiki/Turing_test

McCulloch-Pitts Model
(1943)
6
The first formal model of
computational mechanisms of
(artificial) neurons

Basis of
Modern
Artificial
Neural
Networks
7
Multilayer perceptron
(Rosenblatt 1958)
Backpropagation
(Rumelhart, Hinton &
Williams 1986)
Deep learning
https://commons.wikimedia.org/wiki/File:Example_of_a_deep_neural_network.png

Cybernetics (1940s-80s)
8

“Cybernetics” as a
Precursor to “AI”
9
Norbert Wiener
(This is where the word “cyber-” came from!)

Good Old-Fashioned AI:
Symbolic Computation and
Reasoning
▪Herbert Simon et al.’s “Logic Theorist” (1956)
▪Functional programming, list processing (e.g.,
LISP (1955-))
▪Logic-based chatbots (e.g., ELIZA (1966))
▪Expert systems
▪Fuzzy logic (Zadeh, 1965)
10

“AI Winters”
11

Key
Ingredient I:
Statistics &
Data Analytics
12

Pattern Discovery,
Classic Way
▪Descriptive statistics
▪Distribution, correlation,
regression
▪Inferential statistics
▪Hypothesis testing, estimation,
Bayesian inference
▪Parametric / non-parametric
approaches
13
https://en.wikipedia.org/wiki/Statistics

Regression
▪Legendre, Gauss (early 1800s)
▪Representing the behavior of a
dependent variable (DV) as a
function of independent
variable(s) (IV)
▪Linear regression, polynomial
regression, logistic regression,
etc.
▪Optimization (minimization) of
errors between model and data
14
https://en.wikipedia.org/wiki/Regression_analysis
https://en.wikipedia.org/wiki/Polynomial_regression
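A hedged illustration of the error-minimization view of regression above: the sketch below fits a linear and a polynomial model to made-up data with NumPy (the data, noise level, and degrees are invented for illustration).

```python
# A minimal regression sketch on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)                               # independent variable (IV)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)   # noisy dependent variable (DV)

# Both fits are found by minimizing the squared error between model and data.
linear_coeffs = np.polyfit(x, y, deg=1)    # linear regression
cubic_coeffs = np.polyfit(x, y, deg=3)     # polynomial regression
print("linear fit (slope, intercept):", linear_coeffs)
```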

Hypothesis Testing
▪Original idea dates back to
1700s
▪Pearson, Gosset, Fisher (early
1900s)
▪Set up hypothesis(-ses) and
see how (un)likely the
observed data could be
explained by them
▪Type-I error (false positive),
Type-II error (false negative)
15
https://en.wikibooks.org/wiki/Statistics/Testing_Statistical_Hypothesis
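A minimal sketch of the hypothesis-testing workflow, assuming SciPy is available; the two synthetic groups and their means are invented for illustration.

```python
# Two-sample t-test on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

# Null hypothesis: both groups share the same mean. A small p-value means the
# observed data would be unlikely if the null were true (rejecting a true null
# is a Type-I error; failing to reject a false null is a Type-II error).
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```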

Bayesian Inference
▪Bayes & Price (1763), Laplace
(1774)
▪Probability as a degree of belief
that an event or a proposition is
true
▪Estimated likelihoods updated
as additional data are obtained
▪Empowered by Markov Chain
Monte Carlo (MCMC) numerical
integration methods (Metropolis
1953; Hastings 1970)
16
https://en.wikipedia.org/wiki/Bayes%27_theorem
https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
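A minimal sketch of Bayesian updating with a conjugate Beta prior on a coin's bias; the flips are invented, and MCMC is not needed for this simple case (it becomes useful when no closed-form update exists).

```python
# Beta-binomial Bayesian update: belief about P(heads) refined as data arrive.
alpha, beta = 1.0, 1.0                    # Beta(1, 1): uniform prior belief
observations = [1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical flips (1 = heads)

for flip in observations:
    alpha += flip                         # each head shifts belief upward
    beta += 1 - flip                      # each tail shifts belief downward

print("posterior mean of P(heads):", alpha / (alpha + beta))
```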

Key
Ingredient II:
Optimization
17

Least Squares Method
▪Legendre, Gauss (early 1800s)
▪Find the formula that minimizes
the sum of squared errors
(residuals) analytically
18
https://en.wikipedia.org/wiki/Least_squares
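A minimal sketch of the analytical least-squares solution via the normal equations, w = (XᵀX)⁻¹Xᵀy, on invented data (the numbers are illustrative).

```python
# Closed-form least squares: minimize the sum of squared residuals analytically.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=40)
y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=x.size)

X = np.column_stack([x, np.ones_like(x)])   # design matrix with intercept column
w = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations: (X^T X) w = X^T y
print("estimated slope and intercept:", w)
```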

Gradient Methods
▪Find local minimum of a
function computationally
▪Gradient descent (Cauchy
1847) and its variants
▪More than 150 years later,
this is still what modern
AI/ML/DL systems are
essentially doing!!
▪Error minimization
19
https://commons.wikimedia.org/wiki/File:Gradient_descent.gif
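A minimal sketch of gradient descent on a toy one-dimensional error surface; the function, step size, and iteration count are arbitrary, but the update rule is the same error-minimization idea modern AI/ML/DL systems rely on.

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)          # derivative of the error surface

w = 0.0                             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad_f(w)  # step against the gradient

print(f"gradient descent ends near w = {w:.4f}")
```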

Linear/Nonlinear/Integer/
Dynamic Programming
▪Extensively studied and used in
Operations Research
▪Practical optimization algorithms
under various constraints
20
https://en.wikipedia.org/wiki/Linear_programming
https://en.wikipedia.org/wiki/Integer_programming
https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm

Evolutionary Algorithms
▪Original idea by Turing (1950)
▪Genetic algorithm (Holland 1975)
▪Genetic programming (Cramer 1985, Koza 1988)
▪Differential evolution (Storn & Price 1997)
▪Neuroevolution (Stanley & Miikkulainen 2002)
21
https://becominghuman.ai/my-new-genetic-algorithm-for-time-series-f7f0df31343d https://en.wikipedia.org/wiki/Genetic_programming
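A minimal genetic-algorithm sketch in the spirit of Holland's GA: evolve bit strings toward the all-ones target ("one-max"). Population size, mutation rate, and the fitness function are invented for illustration.

```python
# Tiny genetic algorithm: selection, crossover, mutation.
import random

random.seed(0)
POP, LENGTH, GENERATIONS, MUTATION = 30, 20, 60, 0.02

def fitness(bits):
    return sum(bits)                       # "one-max": count of 1s

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]       # selection: keep the fitter half
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)  # single-point crossover
        child = a[:cut] + b[cut:]
        child = [1 - bit if random.random() < MUTATION else bit for bit in child]
        children.append(child)
    population = parents + children

print("best fitness:", fitness(max(population, key=fitness)), "out of", LENGTH)
```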

Other Population-Based
Learning & Optimization
▪Ant colony optimization
(Dorigo 1992)
▪Particle swarm optimization
(Kennedy & Eberhart 1995)
▪And various other metaphor-based metaheuristic algorithms
https://en.wikipedia.org/wiki/List_of_metaphor-based_metaheuristics
22
https://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms
https://en.wikipedia.org/wiki/Particle_swarm_optimization

Machine
Learning
23

Pattern Discovery,
Modern Way
▪Unsupervised learning
▪Find patterns in the data
▪Supervised learning
▪Find patterns in the input-output mapping
▪Reinforcement learning
▪Learn the world by taking actions and receiving
rewards from the environment
24

Unsupervised Learning
▪Clustering
▪k-means, agglomerative
clustering, DBSCAN,
Gaussian mixture, community
detection, Jarvis Patrick, etc.
▪Anomaly detection
▪Feature
extraction/selection
▪Dimension reduction
▪PCA, t-SNE, etc.
25
https://reference.wolfram.com/language/ref/FindClusters.html
https://commons.wikimedia.org/wiki/File:T-SNE_and_PCA.png
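A minimal sketch of two of the unsupervised steps listed above, assuming scikit-learn is installed: k-means clustering and PCA dimension reduction on a toy blob dataset (the dataset and parameters are illustrative).

```python
# Clustering and dimension reduction on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # clustering
X_2d = PCA(n_components=2).fit_transform(X)                              # 5-D -> 2-D

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("reduced data shape:", X_2d.shape)
```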

Supervised Learning
▪Regression
▪Linear regression, Lasso, polynomial
regression, nearest neighbors,
decision tree, random forest,
Gaussian process, gradient boosted
trees, neural networks, support vector
machine, etc.
▪Classification
▪Logistic regression, decision tree,
gradient boosted trees, naive Bayes,
nearest neighbors, support vector
machine, neural networks, etc.
▪Risk of overfitting
▪Addressed by model selection, cross-
validation, etc.
26
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
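A minimal sketch of supervised classification with scikit-learn (assumed installed), using cross-validation to check for overfitting; the dataset and tree depths are illustrative choices.

```python
# Decision-tree classification with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "unconstrained tree (overfit-prone)": DecisionTreeClassifier(random_state=0),
    "depth-3 tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on held-out folds
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```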

Reinforcement Learning
▪Environment typically
formulated as a Markov
decision process (MDP)
▪State of the world + agent’s
action
→ next state of the world +
reward
▪Monte Carlo methods
▪TD learning, Q-learning
27
https://en.wikipedia.org/wiki/Markov_decision_process
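A minimal tabular Q-learning sketch on an invented 5-state chain MDP (move left or right; reward 1 only at the right end). The environment, learning rate, and episode count are illustrative assumptions, not from the slides.

```python
# Q-learning: state + action -> next state + reward, with a value-table update.
import random

random.seed(0)
N_STATES, ALPHA, GAMMA, EPSILON = 5, 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q[state][action]

def step(state, action):
    """Move left (0) or right (1) on the chain; reward 1 at the right end."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for _ in range(200):                          # episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action choice (random when exploring or when Q-values tie)
        if random.random() < EPSILON or Q[s][0] == Q[s][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r = step(s, a)
        # Q-learning update: nudge Q(s, a) toward reward + discounted best next value
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print("learned Q-values:", [[round(q, 2) for q in row] for row in Q])
```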

Artificial
Neural
Networks
28

Hopfield Networks
▪Hopfield (1982)
▪A.k.a. “attractor networks”
▪Fully connected networks with
symmetric weights can recover
imprinted patterns from imperfect
initial conditions
▪“Associative memory”
Input Output
29
https://github.com/nosratullah/hopfieldNeuralNetwork
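A minimal Hopfield-network sketch: one ±1 pattern is imprinted with a Hebbian rule into symmetric weights, then recovered from a corrupted copy (pattern size and noise level are arbitrary illustrative choices).

```python
# Hopfield-style associative memory: imprint, corrupt, recall.
import numpy as np

rng = np.random.default_rng(0)
pattern = rng.choice([-1, 1], size=25)           # the memory to imprint

W = np.outer(pattern, pattern).astype(float)     # Hebbian, symmetric weights
np.fill_diagonal(W, 0.0)                         # no self-connections

state = pattern.copy()
flip = rng.choice(25, size=6, replace=False)     # corrupt 6 of 25 units
state[flip] *= -1

for _ in range(5):                               # repeated synchronous updates
    state = np.sign(W @ state)
    state[state == 0] = 1

print("recovered the imprinted pattern:", bool(np.array_equal(state, pattern)))
```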

Boltzmann Machines
▪Hinton & Sejnowski (1983),
Hinton & Salakhutdinov (2006)
▪Stochastic, learnable variants
of Hopfield networks
▪Restricted (bipartite) Boltzmann
machine was at the core of the
HS 2006 Science paper that
ignited the current boom of “Deep
Learning”
30
https://en.wikipedia.org/wiki/Boltzmann_machine
https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

Feed-Forward NNs and
Backpropagation
▪Multilayer perceptron
(Rosenblatt 1958)
▪Backpropagation (Werbos
1974; Rumelhart, Hinton &
Williams 1986)
▪Minimization of errors by
gradient descent method
▪Note that this is NOT how our
brain learns
▪“Vanishing gradient” problem
31
[Figure: a feed-forward network with input and output layers; computation flows forward, error correction propagates backward]
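A minimal sketch of a feed-forward network trained by backpropagation and gradient descent, in plain NumPy on the XOR task (layer sizes, learning rate, and iteration count are illustrative assumptions).

```python
# Two-layer network on XOR: forward pass, backward error propagation, weight update.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)               # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)             # forward pass, output layer
    d_out = (out - y) * out * (1 - out)    # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)     # error propagated backward
    W2 -= 1.0 * h.T @ d_out                # gradient-descent updates
    b2 -= 1.0 * d_out.sum(axis=0)
    W1 -= 1.0 * X.T @ d_h
    b1 -= 1.0 * d_h.sum(axis=0)

print("predictions after training:", out.round(3).ravel())
```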

Autoencoders
▪Rumelhart, Hinton & Williams
(1986) (again!)
▪Feed-forward ANNs that try
to reproduce the input
▪Smaller intermediate layers
→ dimension reduction,
feature learning
▪HS 2006 Science paper also
used restricted Boltzmann
machines as stacked
autoencoders
32
https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
https://doi.org/10.1126/science.1127647

Recurrent Neural
Networks
▪Hopfield (1982);
Rumelhart, Hinton &
Williams (1986) (again!!)
▪ANNs that contain
feedback loops
▪Have internal states and
can learn temporal behaviors
with arbitrarily long-term
dependencies
▪With practical problems
in vanishing or exploding
long-term gradients
33
https://commons.wikimedia.org/wiki/File:Neuronal-Networks-Feedback.png
https://en.wikipedia.org/wiki/Recurrent_neural_network
[Figure: a recurrent network with feedback weights V, “unfolded” in time into a chain of copies with hidden states h(t-1), h(t), h(t+1) and outputs o(t-1), o(t), o(t+1)]
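A minimal sketch of a single recurrent layer stepping through a sequence: the hidden state feeds back into itself, giving the network internal memory (weights are random and untrained; sizes are arbitrary).

```python
# One untrained recurrent layer processing a 5-step input sequence.
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(3, 8))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(8, 8))   # hidden -> hidden (the feedback loop)
W_ho = rng.normal(scale=0.5, size=(8, 2))   # hidden -> output

h = np.zeros(8)                             # internal state
for t in range(5):
    x_t = rng.normal(size=3)                # input at time t
    h = np.tanh(x_t @ W_xh + h @ W_hh)      # state depends on input and previous state
    o_t = h @ W_ho                          # output at time t

print("final hidden state:", h.round(2))
```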

Long Short-Term Memory
(LSTM)
▪Hochreiter & Schmidhuber
(1997)
▪An improved neural module
for RNNs that can learn long-
term dependencies
effectively
▪Vanishing gradient problem
resolved by hidden states
and error flow control
▪“The most cited NN paper of
the 20th century”
34

Reservoir Computing
▪Actively studied since 2000s
▪Use inherent behaviors of
complex dynamical systems
(usually a random RNN) as
a “reservoir” of various
solutions
▪Learning takes place only at
the readout layer (i.e., no
backpropagation needed)
▪Discrete-time, continuous-
time versions
35
https://doi.org/10.1515/nanoph-2016-0132
https://doi.org/10.1103/PhysRevLett.120.024102
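A minimal echo-state-network sketch: a fixed random recurrent reservoir is driven by a sine input, and only the linear readout is fitted (by ridge regression), with no backpropagation. Reservoir size, spectral radius, and the prediction task are illustrative assumptions.

```python
# Reservoir computing: train only the readout layer.
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 500
W_in = rng.uniform(-0.5, 0.5, size=N)
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

u = np.sin(0.1 * np.arange(T + 1))                # input signal
target = u[1:]                                    # task: predict the next value

x = np.zeros(N)
states = np.zeros((T, N))
for t in range(T):
    x = np.tanh(W_in * u[t] + W @ x)              # reservoir dynamics (never trained)
    states[t] = x

ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(N), states.T @ target)
print("readout training MSE:", float(np.mean((states @ W_out - target) ** 2)))
```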

Deep Neural Networks
▪Ideas originally around since
the beginning of ANNs
▪Became feasible and popular
in 2010s because of:
▪Huge increase in available
computational power thanks
to GPUs
▪Wide availability of training
data over the Internet
36
https://commons.wikimedia.org/wiki/File:Example_of_a_deep_neural_network.png
https://www.techradar.com/news/computing-components/graphics-cards/best-graphics-cards-1291458

Convolutional Neural
Networks
▪Fukushima (1980), Homma
et al. (1988), LeCun et al.
(1989, 1998)
▪DNNs with convolution
operations between layers
▪Layers represent spatial
(and/or temporal) patterns
▪Many great applications to
image/video/time series
analyses
37
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
https://cs231n.github.io/convolutional-networks/
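A minimal sketch of the convolution operation itself (no learning): a 3x3 kernel slides over a small image and produces local weighted sums. The image, kernel, and sizes are made up for illustration.

```python
# "Valid" 2-D convolution of an 8x8 image with a 3x3 kernel.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # a vertical-edge detector

out = np.zeros((6, 6))                          # output feature map
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print("feature-map shape:", out.shape)
```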

Adversarial Attacks and
Generative Adversarial
Networks (GAN)
38
https://arxiv.org/abs/1412.6572
https://en.wikipedia.org/wiki/Generative_adversarial_network
▪Goodfellow et al. (2014a,b)
▪DNNs are vulnerable
to adversarial attacks
▪This vulnerability is exploited to create co-
evolutionary systems of a
generator and a discriminator
https://commons.wikimedia.org/wiki/File:A-Standard-GAN-and-b-conditional-GAN-architecturpn.png

Graph Neural
Networks
▪Scarselli et al. (2008),
Kipf & Welling (2016)
▪Non-regular graph
structure used as
network topology
within each layer of
DNN
▪Applications to graph-
based data modeling,
e.g., social networks,
molecular biology, etc.
39
https://tkipf.github.io/graph-convolutional-networks/
https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
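A minimal sketch of one graph-convolution layer in the spirit of Kipf & Welling (2016): each node's features are mixed with its neighbors' through a normalized adjacency matrix. The 4-node graph, feature sizes, and random weights are invented for illustration.

```python
# One graph-convolution layer: H_next = ReLU(A_norm @ H @ W).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],      # adjacency matrix of a small 4-node graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))      # node feature vectors
W = rng.normal(size=(3, 2))      # layer weights (random, untrained here)

A_hat = A + np.eye(4)                             # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt          # symmetric normalization

H_next = np.maximum(0.0, A_norm @ H @ W)          # neighbor-mixed node features
print("next-layer node features shape:", H_next.shape)
```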

Other ANNs
▪Self-organizing map (Kohonen 1982)
▪Neural gas (Martinetz & Schulten 1991)
▪Spiking neural networks (1990s-)
▪Hierarchical Temporal Memory (2004-)
etc…
40
https://en.wikipedia.org/wiki/Self-organizing_map
https://doi.org/10.1016/j.neucom.2019.10.104
https://numenta.com/neuroscience-research/sequence-learning/

(Large)
Language
Models
41

History of “Chatbots”
▪ELIZA (Weizenbaum 1966)
▪A.L.I.C.E. (Wallace 1995)
▪Jabberwacky (Carpenter 1997)
▪Cleverbot (Carpenter 2008)
(and many others)
42
https://en.wikipedia.org/wiki/ELIZA#/media/File:ELIZA_conversation.png
http://chatbots.org/
https://www.youtube.com/watch?v=WnzlbyTZsQY (by Cornell CCSL)

Language Models
“With great power comes great _____”
43
Probability of
the next word
… depends on the context
Function P( ) can be defined as an explicit dataset,
a heuristic algorithm, a simple statistical distribution,
a (deep) neural network, or anything else
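A minimal sketch of a language model as a next-word probability function: a bigram model estimated by counting over a tiny invented corpus (real LLMs replace the counting with a deep neural network).

```python
# Bigram language model: P(next word | previous word) from counts.
from collections import Counter, defaultdict

corpus = ("with great power comes great responsibility . "
          "with great effort comes great reward .").split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev_word):
    c = counts[prev_word]
    total = sum(c.values())
    return {word: n / total for word, n in c.items()}

print(p_next("great"))   # each observed continuation gets probability 0.25 here
```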

“Large” Language Models
▪Language models meet
(1) massive amounts of data
and (2) “transformers”!
▪Vaswani et al. (2017)
▪DNNs with self-attention
mechanism for natural language
processing
▪Enhanced parallelizability
leading to shorter training time
than LSTM
▪BERT (2018) for Google
search
▪OpenAI’s GPT (2020-) and
many others
44
https://arxiv.org/abs/1706.03762
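A minimal sketch of the scaled dot-product self-attention at the heart of the transformer (Vaswani et al. 2017), softmax(QKᵀ/√d)V, in plain NumPy with random, untrained weights; sequence length and dimensions are arbitrary.

```python
# Scaled dot-product self-attention over a toy 5-token sequence.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 16                         # 5 tokens, 16-dimensional embeddings
X = rng.normal(size=(seq_len, d))          # token representations
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)              # how strongly each token attends to others
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
output = weights @ V                       # context-mixed token representations

print("attention weights shape:", weights.shape, "output shape:", output.shape)
```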

GPT/LLM
Architecture
Details
45
https://www.youtube.com/watch?v=wjZofJX0v4M
https://www.youtube.com/watch?v=eMlx5fFNoYc
3Blue1Brown
offers some great
video explanations!

46
https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt/
Getting
Larger

The New “Chatbots”
47

“ChatGPT and the Evolution
of Artificial Intelligence”
48
https://www.youtube.com/watch?v=SzbKJWKE_Ss

LLMs Becoming
Multimodal
49
Example: NExT-GPT architecture
https://medium.com/@cout.shubham/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec

Promising Applications
▪Coding aid
▪Personalized tutoring
▪Conversation partners
▪Modality conversion for people
with disabilities
▪Analysis of qualitative scientific
data
(… and many
others)
50

“Foundation” Models
▪General-purpose AI
models “that are
trained on broad
data at scale and
are adaptable to a
wide range of
downstream tasks”
−Stanford Institute for
Human-Centered Artificial
Intelligence (2021);
https://arxiv.org/abs/2108.07258
51
https://philosophyterms.com/the-library-of-babel/

Consciousness in LLMs?
52

Challenges
(Especially from Systems
Science Perspectives)
53

Various Societal
Concerns About AI
▪“Artificial General Intelligence” (AGI)
and the “existential crisis of humanity”
▪Significant job loss caused by AI
▪Fake information generated by AI
▪Biases and social (in)justice
▪Lack of transparency and over-concentration of AI
power
▪Huge energy costs of deep learning and LLMs
▪Rights of AI and machines
54

AI as a Threat to Humanity?
55

But Some Simple Tasks
Are Still Difficult for AI
▪Words, numbers, facts
▪Simple logic and
reasoning
▪Maintaining stability and plasticity
▪Catastrophic forgetting
56
https://spectrum.ieee.org/openai-dall-e-2
https://www.invistaperforms.org/getting-ahead-forgetting-curve-training/

57

58
“Hallucination”
(B.S.-ing)

Wrong Use Cases of AI
59

Contamination of AI-
Generated Data
60

Another “AI Winter”
Coming?
61

System-Level Challenge:
Idea Homogenization and
Social Fragmentation
▪Widespread use of
common AI tools may
homogenize human ideas
▪Over-consumption of
catered AI-generated
information may accelerate
echo chamber formation
and social fragmentation
▪How can we prevent these
negative outcomes?
62
(Centola et al. 2007)

System-Level Challenge:
Critical Decision Making in
the Absence of Data
63
Fall 2020: “How to
safely reopen the
campus”
How can we make
informed decisions
in a critical situation
when no prior data
are available?

System-Level
Challenge:
Open-Endedness
64
https://en.wikipedia.org/wiki/Tree_of_life_(biology)
How can we make AI able to
keep producing new things?

Are We Getting Any
Closer to the
Understanding of
True “Intelligence”?
65

Final Remarks
▪Don’t drown in the vast
ocean of methods and tools
▪Hundreds of years of history
▪Buzzwords and fads keep changing
▪Keep the big picture in mind –
focus on what your real problem
is and how you will solve it
▪Being able to think and develop
unique, original, creative
solutions is key to differentiating
your intelligence from
AI/LLMs/machines
66

Thank You
67
@hirokisayama