Artificial Intelligence, Machine
Learning, and (Large) Language
Models: A Quick Introduction
Hiroki Sayama
sayama@binghamton.edu
Outline
1. The Origin: Understanding
“Intelligence”
2. Key Ingredient I: Statistics
& Data Analytics
3. Key Ingredient II:
Optimization
4. Machine Learning
5. Artificial Neural Networks
6. (Large) Language Models
7. Challenges
2
The Origin:
Understanding
“Intelligence”
3
Alan Turing and the
Turing Machine (1936)
4
https://www.felienne.com/archives/2974
Turing Test (1950) – a.k.a.
“the Imitation Game”
5
https://en.wikipedia.org/wiki/Turing_test
McCulloch-Pitts Model
(1943)
6
The first formal model of
computational mechanisms of
(artificial) neurons
Basis of
Modern
Artificial
Neural
Networks
7
Multilayer perceptron
(Rosenblatt 1958)
Backpropagation
(Rumelhart, Hinton &
Williams 1986)
Deep learning
https://commons.wikimedia.org/wiki/File:Example_of_a_deep_neural_network.png
Cybernetics (1940s-80s)
8
“Cybernetics” as a
Precursor to “AI”
9
Norbert Wiener
(This is where the word “cyber-” came from!)
Good Old-Fashioned AI:
Symbolic Computation and
Reasoning
▪ Herbert Simon et al.’s “Logic Theorist” (1956)
▪ Functional programming, list processing (e.g.,
LISP (1955-))
▪ Logic-based chatbots (e.g., ELIZA (1966))
▪ Expert systems
▪ Fuzzy logic (Zadeh, 1965)
10
“AI Winters”
11
Key
Ingredient I:
Statistics &
Data Analytics
12
Pattern Discovery,
Classic Way
▪ Descriptive statistics
▪ Distribution, correlation,
regression
▪ Inferential statistics
▪ Hypothesis testing, estimation,
Bayesian inference
▪ Parametric / non-parametric
approaches
13
https://en.wikipedia.org/wiki/Statistics
Regression
▪ Legendre, Gauss (early 1800s)
▪ Representing the behavior of a
dependent variable (DV) as a
function of independent
variable(s) (IV)
▪ Linear regression, polynomial
regression, logistic regression,
etc.
▪ Optimization (minimization) of
errors between model and data
14
https://en.wikipedia.org/wiki/Regression_analysis
https://en.wikipedia.org/wiki/Polynomial_regression
Hypothesis Testing
▪ Original idea dates back to
1700s
▪ Pearson, Gosset, Fisher (early
1900s)
▪ Set up hypotheses and
assess how (un)likely it is that the
observed data could be
explained by them
▪ Type-I error (false positive),
Type-II error (false negative)
15
https://en.wikibooks.org/wiki/Statistics/Testing_Statistical_Hypothesis
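As a toy illustration of this workflow (my own sketch, assuming SciPy; the data values are made up), a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=30)  # made-up control group
treated = rng.normal(loc=11.0, scale=2.0, size=30)  # made-up treated group

# Null hypothesis: the two groups share the same mean.
t_stat, p_value = stats.ttest_ind(control, treated)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value means the observed difference would be unlikely if the
# null hypothesis were true. Rejecting a true null is a Type-I error
# (false positive); failing to reject a false null is a Type-II error.
```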
Bayesian Inference
▪ Bayes & Price (1763), Laplace
(1774)
▪ Probability as a degree of belief
that an event or a proposition is
true
▪ Estimated probabilities (beliefs) are
updated as additional data are obtained
▪ Empowered by Markov Chain
Monte Carlo (MCMC) numerical
integration methods (Metropolis
1953; Hastings 1970)
16
https://en.wikipedia.org/wiki/Bayes%27_theorem
https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
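A minimal sketch of Bayesian updating (my own toy example, assuming SciPy): the belief about a coin's bias is a Beta distribution whose parameters are simply incremented as flips are observed; MCMC methods become necessary when the posterior has no such closed form:

```python
from scipy import stats

# Prior belief about a coin's probability of heads: Beta(1, 1), i.e., uniform.
a, b = 1.0, 1.0

# Made-up observations arriving in batches of (heads, tails).
batches = [(7, 3), (12, 8), (55, 45)]
for heads, tails in batches:
    a += heads        # conjugate Beta-Binomial update: Bayes' rule
    b += tails        # reduces to simple counting
    posterior = stats.beta(a, b)
    print(f"after {heads}H / {tails}T: mean belief = {posterior.mean():.3f}")
```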
Key
Ingredient II:
Optimization
17
Least Squares Method
▪ Legendre, Gauss (early 1800s)
▪ Find the formula that minimizes
the sum of squared errors
(residuals) analytically
18
https://en.wikipedia.org/wiki/Least_squares
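A minimal sketch of the analytical solution for a straight-line fit, assuming NumPy and made-up data:

```python
import numpy as np

# Made-up data: y is roughly 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix [x, 1] for the line y = a*x + b.
A = np.column_stack([x, np.ones_like(x)])

# Least squares: choose (a, b) minimizing the sum of squared residuals ||A w - y||^2.
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"fitted line: y = {a:.2f} x + {b:.2f}")
```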
Gradient Methods
▪ Find local minimum of a
function computationally
▪ Gradient descent (Cauchy
1847) and its variants
▪ More than 150 years later,
this is still what modern
AI/ML/DL systems are
essentially doing!!
▪ Error minimization
19
https://commons.wikimedia.org/wiki/File:Gradient_descent.gif
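The same toy fitting problem solved numerically by gradient descent, as a minimal sketch (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

a, b = 0.0, 0.0          # initial guess for the line y = a*x + b
lr = 0.02                # learning rate (step size)

for step in range(2000):
    pred = a * x + b
    err = pred - y
    # Gradient of the mean squared error with respect to a and b.
    grad_a = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    a -= lr * grad_a     # move downhill along the gradient
    b -= lr * grad_b

print(f"gradient descent result: y = {a:.2f} x + {b:.2f}")
```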
Linear/Nonlinear/Integer/
Dynamic Programming
▪ Extensively studied and used in
Operations Research
▪ Practical optimization algorithms
under various constraints
20
https://en.wikipedia.org/wiki/Linear_programming
https://en.wikipedia.org/wiki/Integer_programming
https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
Evolutionary Algorithms
▪ Original idea by Turing (1950)
▪ Genetic algorithm (Holland 1975)
▪ Genetic programming (Cramer 1985, Koza 1988)
▪ Differential evolution (Storn & Price 1997)
▪ Neuroevolution (Stanley & Miikkulainen 2002)
21
https://becominghuman.ai/my-new-genetic-algorithm-for-time-series-f7f0df31343d https://en.wikipedia.org/wiki/Genetic_programming
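A minimal genetic-algorithm sketch (my own toy example): bit-string genomes evolve toward a simple "one-max" objective through selection and mutation only (no crossover, for brevity):

```python
import random

random.seed(42)
GENOME_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.05

def fitness(genome):
    # Toy objective: maximize the number of 1s ("one-max" problem).
    return sum(genome)

def mutate(genome):
    return [1 - g if random.random() < MUT_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]                 # truncation selection
    offspring = [mutate(random.choice(parents)) for _ in range(POP_SIZE // 2)]
    population = parents + offspring

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", GENOME_LEN)
```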
Other Population-Based
Learning & Optimization
▪ Ant colony optimization
(Dorigo 1992)
▪ Particle swarm optimization
(Kennedy & Eberhart 1995)
▪ And various other metaphor-based metaheuristic algorithms
https://en.wikipedia.org/wiki/List_of_metaphor-based_metaheuristics
22
https://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms
https://en.wikipedia.org/wiki/Particle_swarm_optimization
Machine
Learning
23
Pattern Discovery,
Modern Way
▪ Unsupervised learning
▪ Find patterns in the data
▪ Supervised learning
▪ Find patterns in the input-output mapping
▪ Reinforcement learning
▪ Learn the world by taking actions and receiving
rewards from the environment
24
Unsupervised Learning
▪ Clustering
▪ k-means, agglomerative
clustering, DBSCAN,
Gaussian mixture, community
detection, Jarvis–Patrick, etc.
▪ Anomaly detection
▪ Feature
extraction/selection
▪ Dimension reduction
▪ PCA, t-SNE, etc.
25
https://reference.wolfram.com/language/ref/FindClusters.html
https://commons.wikimedia.org/wiki/File:T-SNE_and_PCA.png
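A minimal sketch combining two of the listed techniques, k-means clustering and PCA-based dimension reduction, assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: three Gaussian blobs in 5 dimensions.
centers = rng.normal(scale=5.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X2 = PCA(n_components=2).fit_transform(X)   # dimension reduction to 2D

print("cluster sizes:", np.bincount(labels))
print("2D embedding shape:", X2.shape)
```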
Supervised Learning
▪ Regression
▪ Linear regression, Lasso, polynomial
regression, nearest neighbors,
decision tree, random forest,
Gaussian process, gradient boosted
trees, neural networks, support vector
machine, etc.
▪ Classification
▪ Logistic regression, decision tree,
gradient boosted trees, naive Bayes,
nearest neighbors, support vector
machine, neural networks, etc.
▪ Risk of overfitting
▪ Addressed by model selection, cross-
validation, etc.
26
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
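A minimal scikit-learn sketch of the supervised-learning workflow, with cross-validation used to check generalization before evaluating on held-out data (the dataset and classifier choice are arbitrary illustrative picks):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validation estimates generalization without touching the test set.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
print("cross-validation accuracy:", round(float(cv_scores.mean()), 3))

clf.fit(X_train, y_train)
print("held-out test accuracy:", round(float(clf.score(X_test, y_test)), 3))
```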
Reinforcement Learning
▪ Environment typically
formulated as a Markov
decision process (MDP)
▪ State of the world + agent’s
action
→ next state of the world +
reward
▪ Monte Carlo methods
▪ TD learning, Q-learning
27
https://en.wikipedia.org/wiki/Markov_decision_process
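A minimal tabular Q-learning sketch on a made-up 5-state chain MDP (the agent moves left or right and is rewarded only for reaching the rightmost state):

```python
import random

random.seed(0)
N_STATES, ACTIONS = 5, (0, 1)          # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

for episode in range(500):
    s = 0
    for _ in range(20):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print("learned greedy action per state:",
      [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])
```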
Artificial
Neural
Networks
28
Hopfield Networks
▪ Hopfield (1982)
▪ A.k.a. “attractor networks”
▪ Fully connected networks with
symmetric weights can recover
imprinted patterns from imperfect
initial conditions
▪ “Associative memory”
Input Output
29
https://github.com/nosratullah/hopfieldNeuralNetwork
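A minimal Hopfield-network sketch, assuming NumPy: a single pattern is imprinted with Hebbian weights and then recovered from a corrupted copy, illustrating associative memory:

```python
import numpy as np

rng = np.random.default_rng(1)
pattern = rng.choice([-1, 1], size=25)           # a made-up 25-unit pattern

# Hebbian imprinting: symmetric weights, no self-connections.
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)

# Corrupt 5 of the 25 units and let the network relax.
state = pattern.copy()
flip = rng.choice(25, size=5, replace=False)
state[flip] *= -1

for _ in range(10):                               # repeated threshold updates
    for i in rng.permutation(25):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("recovered the imprinted pattern:", bool(np.array_equal(state, pattern)))
```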
Boltzmann Machines
▪ Hinton & Sejnowski (1983),
Hinton & Salakhutdinov (2006)
▪ Stochastic, learnable variants
of Hopfield networks
▪ The restricted (bipartite) Boltzmann
machine was at the core of the
Hinton & Salakhutdinov (2006)
Science paper that ignited the
current boom of “deep learning”
30
https://en.wikipedia.org/wiki/Boltzmann_machine
https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
Feed-Forward NNs and
Backpropagation
▪ Multilayer perceptron
(Rosenblatt 1958)
▪ Backpropagation (Werbos
1974; Rumelhart, Hinton &
Williams 1986)
▪ Minimization of errors by
gradient descent method
▪ Note that this is NOT how our
brain learns
▪ “Vanishing gradient” problem
31
[Diagram: forward computation flows from input to output; error corrections propagate backward through the layers]
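A minimal NumPy sketch of a one-hidden-layer feed-forward network trained by backpropagation (gradient descent on squared error) on the XOR task; the architecture and hyperparameters are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # Forward pass (computation).
    H = sigmoid(X @ W1 + b1)
    out = sigmoid(H @ W2 + b2)
    # Backward pass (error correction): gradients of the squared error.
    d_out = (out - Y) * out * (1 - out)
    d_H = (d_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_H
    b1 -= lr * d_H.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0] (may vary with initialization)
```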
Autoencoders
▪ Rumelhart, Hinton & Williams
(1986) (again!)
▪ Feed-forward ANNs that try
to reproduce the input
▪ Smaller intermediate layers
→ dimension reduction,
feature learning
▪ The Hinton & Salakhutdinov (2006)
Science paper also used restricted
Boltzmann machines as stacked
autoencoders
32
https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
https://doi.org/10.1126/science.1127647
Recurrent Neural
Networks
▪ Hopfield (1982);
Rumelhart, Hinton &
Williams (1986) (again!!)
▪ ANNs that contain
feedback loops
▪ Have internal states and
can, in principle, learn temporal
behaviors with long-term
dependencies
▪ But suffer from practical problems
of vanishing or exploding
long-term gradients
33
https://commons.wikimedia.org/wiki/File:Neuronal-Networks-Feedback.png
https://en.wikipedia.org/wiki/Recurrent_neural_network
[Diagram: an RNN with feedback weights V, unfolded in time into hidden states h(t−1), h(t), h(t+1) and outputs o(t−1), o(t), o(t+1)]
Long Short-Term Memory
(LSTM)
▪ Hochreiter & Schmidhuber
(1997)
▪ An improved neural module
for RNNs that can learn long-
term dependencies
effectively
▪ Vanishing gradient problem
resolved by hidden states
and error flow control
▪ “The most cited NN paper of
the 20th century”
34
Reservoir Computing
▪ Actively studied since 2000s
▪ Use inherent behaviors of
complex dynamical systems
(usually a random RNN) as
a “reservoir” of various
solutions
▪ Learning takes place only at
the readout layer (i.e., no
backpropagation needed)
▪ Discrete-time, continuous-
time versions
35
https://doi.org/10.1515/nanoph-2016-0132
https://doi.org/10.1103/PhysRevLett.120.024102
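A minimal echo-state-network sketch, assuming NumPy: the random recurrent reservoir is left untrained, and only a linear readout is fitted (here by ridge regression) to predict the next value of a made-up input signal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                       # reservoir size
t = np.arange(0, 60, 0.1)
u = np.sin(t) * np.sin(0.2 * t)               # made-up input signal

W_in = rng.uniform(-0.5, 0.5, size=N)         # fixed input weights
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

# Drive the reservoir and collect its states.
x = np.zeros(N)
states = []
for ut in u:
    x = np.tanh(W @ x + W_in * ut)
    states.append(x.copy())
S = np.array(states)

# Train only the readout: predict u(t+1) from the reservoir state at t.
target = u[1:]
S_train = S[:-1]
ridge = 1e-6
W_out = np.linalg.solve(S_train.T @ S_train + ridge * np.eye(N),
                        S_train.T @ target)

pred = S_train @ W_out
print("readout training error:", float(np.mean((pred - target) ** 2)))
```

Only W_out is learned; the reservoir weights W and W_in stay fixed, so no backpropagation through time is needed.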
Deep Neural Networks
▪ Ideas originally around since
the beginning of ANNs
▪ Became feasible and popular
in 2010s because of:
▪ Huge increase in available
computational power thanks
to GPUs
▪ Wide availability of training
data over the Internet
36
https://commons.wikimedia.org/wiki/File:Example_of_a_deep_neural_network.png
https://www.techradar.com/news/computing-components/graphics-cards/best-graphics-cards-1291458
Convolutional Neural
Networks
▪ Fukushima (1980), Homma
et al. (1988), LeCun et al.
(1989, 1998)
▪ DNNs with convolution
operations between layers
▪ Layers represent spatial
(and/or temporal) patterns
▪ Many great applications to
image/video/time series
analyses
37
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
https://cs231n.github.io/convolutional-networks/
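A minimal sketch of the convolution operation itself, the layer-to-layer computation that gives CNNs their name (plain NumPy; the image and kernel are made up):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((8, 8))
image[:, 4:] = 1.0                           # made-up image: dark and bright halves
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)    # responds to vertical edges

print(conv2d(image, kernel))                 # nonzero responses along the edge
```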
Adversarial Attacks and
Generative Adversarial
Networks (GAN)
38
https://arxiv.org/abs/1412.6572
https://en.wikipedia.org/wiki/Generative_adversarial_network
▪ Goodfellow et al. (2014a,b)
▪ DNNs are vulnerable
to adversarial attacks
▪ GANs exploit this to create co-
evolutionary systems of a
generator and a discriminator
https://commons.wikimedia.org/wiki/File:A-Standard-GAN-and-b-conditional-GAN-architecturpn.png
Graph Neural
Networks
▪ Scarselli et al. (2008),
Kipf & Welling (2016)
▪ Non-regular graph
structure used as
network topology
within each layer of
DNN
▪ Applications to graph-
based data modeling,
e.g., social networks,
molecular biology, etc.
39
https://tkipf.github.io/graph-convolutional-networks/
https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
Other ANNs
▪ Self-organizing map (Kohonen 1982)
▪ Neural gas (Martinetz & Schulten 1991)
▪ Spiking neural networks (1990s-)
▪ Hierarchical Temporal Memory (2004-)
etc…
40
https://en.wikipedia.org/wiki/Self-organizing_map
https://doi.org/10.1016/j.neucom.2019.10.104
https://numenta.com/neuroscience-research/sequence-learning/
(Large)
Language
Models
41
History of “Chatbots”
▪ ELIZA (Weizenbaum 1966)
▪ A.L.I.C.E. (Wallace 1995)
▪ Jabberwacky (Carpenter 1997)
▪ Cleverbot (Carpenter 2008)
(and many others)
42
https://en.wikipedia.org/wiki/ELIZA#/media/File:ELIZA_conversation.png
http://chatbots.org/
https://www.youtube.com/watch?v=WnzlbyTZsQY (by Cornell CCSL)
Language Models
“With great power comes great _____”
43
Probability of the next word … depends on the context
The function P(next word | context) can be defined as an explicit dataset,
a heuristic algorithm, a simple statistical distribution,
a (deep) neural network, or anything else
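A minimal sketch of the simplest choice for such a function: a bigram model in which P(next word | context) is just a table of counts over a tiny made-up corpus:

```python
from collections import Counter, defaultdict

corpus = ("with great power comes great responsibility "
          "with great patience comes great reward").split()

# Count how often each word follows each context word.
counts = defaultdict(Counter)
for context, nxt in zip(corpus, corpus[1:]):
    counts[context][nxt] += 1

def P(next_word, context):
    total = sum(counts[context].values())
    return counts[context][next_word] / total if total else 0.0

print(P("power", "great"))            # 0.25
print(P("responsibility", "great"))   # 0.25
```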
“Large” Language Models
▪ Language models meet
(1) massive amount of data
and (2) “transformers”!
▪ Vaswani et al. (2017)
▪ DNNs with self-attention
mechanism for natural language
processing
▪ Enhanced parallelizability
leading to shorter training time
than LSTM
▪ BERT (2018) for Google
search
▪ OpenAI’s GPT (2020-) and
many others
44
https://arxiv.org/abs/1706.03762
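A minimal NumPy sketch of the scaled dot-product self-attention at the heart of the transformer (Vaswani et al. 2017): each token's new representation is a similarity-weighted average of all tokens' values (the dimensions and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # 5 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))  # made-up token embeddings

Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_model)      # scaled dot-product similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row

attended = weights @ V                   # every token attends to every token
print(attended.shape)                    # (5, 8)
```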
GPT/LLM
Architecture
Details
45
https://www.youtube.com/watch?v=wjZofJX0v4M
https://www.youtube.com/watch?v=eMlx5fFNoYc
3Blue1Brown
offers some great
video explanations!
Getting Larger
46
https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt/
The New “Chatbots”
47
“ChatGPT and the Evolution
of Artificial Intelligence”
48
https://www.youtube.com/watch?v=SzbKJWKE_Ss
LLMs Becoming
Multimodal
49
Example: NExT-GPT architecture
https://medium.com/@cout.shubham/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec
Promising Applications
▪ Coding aid
▪ Personalized tutoring
▪ Conversation partners
▪ Modality conversion for people
with disabilities
▪ Analysis of qualitative scientific
data
(… and many
others)
50
“Foundation” Models
▪ General-purpose AI
models “that are
trained on broad
data at scale and
are adaptable to a
wide range of
downstream tasks”
− Stanford Institute for
Human-Centered Artificial
Intelligence (2021);
https://arxiv.org/abs/2108.07258
51
https://philosophyterms.com/the-library-of-babel/
Consciousness in LLMs?
52
Challenges
(Especially from Systems
Science Perspectives)
53
Various Societal
Concerns About AI
▪ “Artificial General Intelligence” (AGI)
and the “existential crisis of humanity”
▪ Significant job loss caused by AI
▪ Fake information generated by AI
▪ Biases and social (in)justice
▪ Lack of transparency and over-concentration of AI
power
▪ Huge energy costs of deep learning and LLMs
▪ Rights of AI and machines
54
AI as a Threat to Humanity?
55
But Some Simple Tasks
Are Still Difficult for AI
▪ Words, numbers, facts
▪ Simple logic and
reasoning
▪ Maintaining stability and plasticity
▪ Catastrophic forgetting
56
https://spectrum.ieee.org/openai-dall-e-2
https://www.invistaperforms.org/getting-ahead-forgetting-curve-training/
57
58
“Hallucination”
(B.S.-ing)
Wrong Use Cases of AI
59
Contamination of AI-
Generated Data
60
Another “AI Winter”
Coming?
61
System-Level Challenge:
Idea Homogenization and
Social Fragmentation
▪ Widespread use of
common AI tools may
homogenize human ideas
▪ Over-consumption of
catered AI-generated
information may accelerate
echo chamber formation
and social fragmentation
▪ How can we prevent these
negative outcomes?
62
(Centola et al. 2007)
System-Level Challenge:
Critical Decision Making in
the Absence of Data
63
Fall 2020: “How to
safely reopen the
campus”
How can we make
informed decisions
in a critical situation
when no prior data
are available?
System-Level
Challenge:
Open-Endedness
64
https://en.wikipedia.org/wiki/Tree_of_life_(biology)
How can we make AI able to
keep producing new things?
Are We Getting Any
Closer to the
Understanding of
True “Intelligence”?
65
Final Remarks
▪ Don’t drown in the vast
ocean of methods and tools
▪ Hundreds of years of history
▪ Buzzwords and fads keep changing
▪ Keep the big picture in mind –
focus on what your real problem
is and how you will solve it
▪ Being able to think and develop
unique, original, creative
solutions is key to differentiating
your intelligence from
AI/LLMs/machines
66
Thank You
67
@hirokisayama
