SlideShare a Scribd company logo
1 of 85
Introduction to
Artificial Intelligence
Sebastian E. Kwiatkowski
(sebastian@aisummary.com)
__
HIGH-LEVEL OVERVIEW
Machine learning:
 Gives computers the ability to learn from data (as opposed to hard-coded rules)
 “AI that actually works”
Big data:
 Optimistic view: we are able to practice big data learning
 A more skeptical view: we are unable to practice small data learning
Natural language processing (NLP):
 Teaches computers an understanding of languages
 Focuses on English (and Mandarin), but many methods work on all languages
 including non-natural languages
 and different types of languages
16,000 words
spoken per person
per day
100 trillion words
spoken by humanity
per day
28 million papers
(1980-2012)
130 million books
indexed by Google
1 billion websites
on the World Wide Web
500 million videos
hosted on YouTube
WHY NATURAL LANGUAGE PROCESSING?
NLP TECHNOLOGY IN EVERYDAY LIFE
Search engines
Virtual
assistants
NLP
Machine reading
Natural language
generation (NLG)
Challenges in
Natural Language Processing
__
COMPOSITIONALITY: IRONY & SARCASM
I feel so miserable without you,
it’s almost like having you here.
It’s not that there isn’t anything
positive to say about the film.
There is.
After 92 minutes, it ends.
NORMALIZATION: 16 WAYS OF SPELLING “TOGETHER”
Observed variants (each shown with its corpus frequency on the original slide):
together, 2gether, togetha, 2getha, togather, tOgether, toqethaa, togeter,
2getter, togethor, tagether, togeda, tgthr, 2gthr, 2gtr, 2qetha
ZIPF’S LAW: STOP WORDS AND HAPAX LEGOMENA
Zipf’s Law:
- The number of occurrences of a word is roughly inversely proportional to its rank in the frequency table.
- named after the American linguist
George Kingsley Zipf (1902-1950)
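A minimal sketch of how to check this empirically, assuming some plain-text file is available under the hypothetical name corpus.txt; under Zipf's Law, rank times frequency stays roughly constant across the top-ranked words:

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    words = f.read().lower().split()

counts = Counter(words).most_common()
for rank, (word, freq) in enumerate(counts[:10], start=1):
    # Under Zipf's Law, rank * freq is roughly constant.
    print(rank, word, freq, rank * freq)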
SYNTACTIC AMBIGUITY: TIME FLIES LIKE AN ARROW
Certain insects, called “time flies”,
happen to like an arrow.
(You should) time flies
like an arrow would.
Time flies like an arrow. (You should) time flies
that are like an arrow.
SEGMENTATION: BREAKING DOWN WORDS AND SENTENCES
Sentence splitting
 Sentences can have recursive structures: “He said ‘Hi there!’ to her.”
 Full stops need to be distinguished from abbreviations (such as “Mr.” and “U.S.A.”).
Word boundaries
 Some (variations of some) writing systems don’t have explicit word boundaries
 Chinese and Japanese characters
Compound words
 Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(Cattle marking and beef labeling supervision duties delegation law)
 Grundstücksverkehrsgenehmigungszuständigkeitsübertragungsverordnung
(Regulation on the delegation of authority concerning land conveyance permissions)
What is intelligence?
__
Legg & Hutter (2007) collected some 70-odd definitions of intelligence.
Some commonalities across definitions:
• Intelligence is a property of an individual
• It concerns the interaction between the individual and the environment
• It has to do with the given set of goals that an individual is trying to attain
WHAT IS INTELLIGENCE?
WHAT IS INTELLIGENCE?
Individuals differ from one another in their ability to
understand complex ideas, to adapt effectively to the
environment, to learn from experience, to engage in
various forms of reasoning, to overcome obstacles by
taking thought.
American Psychological Association
WHAT IS INTELLIGENCE?
I define [intelligence] as your skill in achieving whatever
it is you want to attain in your life within your
sociocultural context.
Robert Sternberg
Intelligence is the ability to adapt effectively to the
environment, either by making a change in oneself or by
changing the environment or finding a new one.
Encyclopedia Britannica
WHAT IS INTELLIGENCE?
AGENT-ENVIRONMENT FRAMEWORK
[Diagram: the agent sends an action to the environment; the environment returns an observation and a reward to the agent.]
TIME STEP ACTION OBSERVATION REWARD
1 a1 o1 r1
2 a2 o2 r2
… … … …
n an on rn
AGENT-ENVIRONMENT FRAMEWORK: HISTORY OF INTERACTION
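A minimal sketch of this interaction loop, assuming hypothetical agent and environment objects with act() and step() methods:

def run_episode(agent, environment, n_steps):
    """Record the history of interaction as (action, observation, reward) triples."""
    history = []
    observation, reward = None, None
    for _ in range(n_steps):
        action = agent.act(observation, reward)         # the agent decides based on what it has perceived
        observation, reward = environment.step(action)  # the environment responds
        history.append((action, observation, reward))
    return history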
Environments
Passive environment:
actions do not affect observations
Active environment:
actions do affect observations
AGENT-ENVIRONMENT FRAMEWORK: PASSIVE AND ACTIVE ENVIRONMENTS
PASSIVE ENVIRONMENTS: ACTIONS DO NOT AFFECT OBSERVATIONS
• Constant environment: observations are constant; history is irrelevant; example: an unconditional fixed income payout
• Bernoulli scheme: observations are probabilistic; history is irrelevant; example: a coin flip
• Markov chain: observations are probabilistic; relevant history: the last perception; example: many board games
• Higher-order Markov chain: observations are probabilistic; relevant history: all perceptions; can be reduced to a Markov chain of the first order
• Passive environment (general case): observations are probabilistic; relevant history: all perceptions; example: sequence prediction problems
ACTIVE ENVIRONMENTS: ACTIONS AFFECT OBSERVATIONS
• Bandits: observations and rewards depend on actions; history is irrelevant; example: multi-armed bandit problems in online marketing
• Markov Decision Process (MDP): observations depend on actions; relevant history: the last perception; example: contextual multi-armed bandits (returning visitors)
• Higher-order MDP: observations depend on actions; relevant history: the last n perceptions; can be reduced to an MDP of the first order
• Partially Observable Markov Decision Process (POMDP): observations depend on actions but are not fully observable; relevant history: the last n perceptions; examples: negotiations, sales, teaching
Sequence prediction
__
Sequence prediction can be thought of as a passive environment.
OBSERVATION
The agent is presented
with a particular sequence.
ACTION
The task is to predict the
next item in the sequence.
REWARD
The environment evaluates
and rewards the prediction.
SEQUENCE PREDICTION AS A PASSIVE ENVIRONMENT
WHY SEQUENCE PREDICTION?
Why focus on sequence prediction tasks?
• Probably the most common type of machine learning
• Tremendous amount of progress over the last decade
• Agents in active environments depend on sequence prediction.
SEQUENCE-TO-SEQUENCE PREDICTION: NAMED ENTITY RECOGNITION
Identify all named entities in a given sentence:
Person
Title
Organization
Tim Cook is Chief Executive Officer of Apple .
B-Name I-Name O B-Title I-Title I-Title O B-Org O
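The labelling above can be represented as two aligned sequences of equal length, one tag per token. A minimal sketch:

tokens = ["Tim", "Cook", "is", "Chief", "Executive", "Officer", "of", "Apple", "."]
tags   = ["B-Name", "I-Name", "O", "B-Title", "I-Title", "I-Title", "O", "B-Org", "O"]

# B- marks the beginning of an entity, I- its continuation, O everything else.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")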
SEQUENCE-TO-SEQUENCE PREDICTION: SPEECH RECOGNITION
“Recognize speech” vs. “wreck a nice beach”
CLASSIFICATION AS SEQUENCE PREDICTION
This is the most common task in all of machine learning.
Input: sequence of data
Output: predicted class (a sequence of length 1)
Example 1: Medical testing
Given the medical history, imaging results, etc.: does the patient suffer from illness X?
Example 2: Customer segmentation
Given his/her purchase history: what segment of customers does a particular user belong to?
Example 3: Biological classification
Edible or poisonous food? Prey or predator? Trustworthy or dishonest?
SEQUENCE PREDICTION: NUMBER COMPLETION TASK
3, 9, 27, 81, ?
What is the next number in the sequence?
Simple solution:
f(x) = 3^x (“Start with 3, then multiply the predecessor by 3”)
f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
f(5) = 243
More complex solution:
f(x) = −15 + 32x − 18x^2 + 4x^3
f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
But: f(5) = 195
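Both candidate functions fit the visible part of the sequence and only disagree on the next element. A quick check:

def simple(x):
    return 3 ** x                                    # "start with 3, multiply the predecessor by 3"

def more_complex(x):
    return -15 + 32 * x - 18 * x ** 2 + 4 * x ** 3   # cubic fitted to the same four points

print([simple(x) for x in range(1, 6)])        # [3, 9, 27, 81, 243]
print([more_complex(x) for x in range(1, 6)])  # [3, 9, 27, 81, 195]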
A very short history of
Solomonoff Induction
__
Principle of indifference: Keep all hypotheses that are consistent with the data.
• Attributed to Epicurus (341 – 270 BC)
• Source: On the Nature of Things by the Roman poet Lucretius (99 BC - 55 BC)
• Keynes popularized the name principle of indifference.
EPICURUS & LUCRETIUS (FEAT. KEYNES): PRINCIPLE OF INDIFFERENCE
Ockham’s Razor: Prefer the simplest theory
• The principle is attributed to William Ockham (1287 – 1347).
• The term “Ockham’s razor” was coined by the Belgian theologian Libert Froidmont
in the 17th century.
• “Entities must not be multiplied beyond necessity.”
• This formulation comes from a commentary on the work of the Scottish theologian Duns
Scotus (1266-1308).
OCKHAM & SCOTUS: OCKHAM’S RAZOR
SCOTUS’S RAZOR?
I propose this general [principle], well known to Aristotle:
the Fifteenth conclusion:
Plurality is never to be posited without necessity.
Since therefore there is no apparent necessity of positing more essential orders
than the two already spoken of, they are the only ones.
Duns Scotus
Bayes’ Theorem: Assign priors to theories and update your beliefs based on the evidence.
P(theory | data) = P(data | theory) · P(theory) / P(data)
(The left-hand side is the posterior; P(theory) is the prior.)
• It occurs in an essay written by Rev. Thomas Bayes (1701?-1761).
• The essay was edited and posthumously published by Richard Price (1723-1791).
BAYES & PRICE: BAYES’ THEOREM
Kolmogorov complexity: the complexity of a theory is the length of the
shortest program that computes that theory
• Andrey Kolmogorov was a Russian mathematician (1903-1987).
• Ockham’s Razor does not tell us how to measure simplicity.
• Simplicity can now be defined as low Kolmogorov complexity.
• Theories that predict patterns have low Kolmogorov complexity.
• Theories that predict random data have high complexity.
KOLMOGOROV: KOLMOGOROV COMPLEXITY
Solomonoff induction (SI): Keep all theories consistent with the data (Epicurus), but
assign priors (Bayes) based on Kolmogorov complexity (Kolmogorov) in a way that
favors simpler theories (Ockham)
P(sequence S) = Σ over any program p that computes S of 2^(−length(p))
• Solomonoff was an American mathematician (1926-2009).
• SI is incomputable, but we can use it as a guiding principle.
SOLOMONOFF: SOLOMONOFF INDUCTION
Meaning acquisition
__
Machines cannot operate directly on natural language.
Neither can humans.
We need a method that maps words to vectors.
More formally, each word w will be represented as a d-dimensional vector vw.
WORD VECTORS
v_w = [ v_1^w, v_2^w, ..., v_d^w ]
WORD VECTOR: EXAMPLE
Liberty
[ −0.33689 1.1738 −0.047928 0.46625 0.67902 −0.15174 −0.46996 −0.44411 0.33238
−0.078593 −0.017934 0.16921 −0.22687 −0.4116 −0.64984 −0.032479 −0.17657 0.19539
−0.51467 −0.1865 −0.068173 0.55336 −0.90075 −0.54647 −0.37622 −1.1421 −0.27839
−0.18665 0.66295 −0.60268 1.4837 0.37457 −0.6572 −0.62025 −0.78689 −0.96868
−0.077115 −1.1386 −0.01644 0.098453 0.24518 −1.2068 0.28002 0.51562 −0.85232
To obtain word vectors, we need three things:
REQUIREMENTS FOR WORD VECTORS
a corpus
(collection of
relevant documents)
a method
to generate word
vectors from the
data
criteria
by which we can
evaluate the method
and the vectors
HUMAN CIVILIZATION
What sort of corpus would
you send to help the aliens
understand human
civilization?
A typical pre-processing pipeline:
CORPUS PRE-PROCESSING
SENTENCE
SPLITTING
The first step is to
split the corpus into
sentences.
TOKENIZATION
Each sentence is
then split into a
list of tokens.
FILTERING
Words that occur
only a few times are
discarded.
CORPUS PRE-PROCESSING: EXAMPLE
Raw sentence: Washington, D.C. is the capital of the United States.
Tokenization: Washington | , | D.C. | is | the | capital | of | the | United | States | .
Lowercase: washington | , | d.c. | is | the | capital | of | the | united | states | .
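A minimal sketch of such a pipeline in pure Python (a real system would use a proper sentence splitter and tokenizer; the regular expressions here are deliberately naive, and the frequency threshold of 5 is an arbitrary choice):

import re
from collections import Counter

def preprocess(corpus, min_count=5):
    # 1. Sentence splitting (naive: split on ., ! or ? followed by whitespace).
    sentences = re.split(r"(?<=[.!?])\s+", corpus)
    # 2. Tokenization (naive: words and punctuation as separate tokens), lowercased.
    tokenized = [re.findall(r"\w+|[^\w\s]", s.lower()) for s in sentences]
    # 3. Filtering: discard tokens that occur fewer than min_count times.
    counts = Counter(tok for sent in tokenized for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in tokenized]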
WHAT CONSTITUTES A GOOD SOLUTION?
MEANINGFULNESS
Word vectors should relate to
dictionary definitions and common
sense.
USEFULNESS
The vectors are a means to an end.
They should help solve
higher-order NLP problems.
DOMAIN GENERALITY
The vectors should work in
different domains: from news and
science to email and fiction.
LANGUAGE
INDEPENDENCE
The same method should work for
different languages.
COMPUTATIONAL
EFFICIENCY
The method should be able to
obtain vectors from millions of
sentences.
COMPOSITIONALITY
We should be able to combine
word vectors to represent the
meaning of phrases, sentences and
documents.
Easy meaning acquisition
__
THE DISTRIBUTIONAL HYPOTHESIS
The day-to-day practice of playing language games recognizes customs
and rules. It follows that a text in such established usage may contain
sentences such as ‘Don’t be such an ass!’, ‘You silly ass!’, ‘What an ass
he is!’ In these examples, the word ass is in familiar and habitual
company, commonly collocated with you silly-, he is a silly-, don’t be
such an-. You shall know a word by the company it keeps! One of the
meanings of ass is its habitual collocation with such other words as
those above quoted.
This passage appeared in a 1957 essay by the English linguist John Rupert Firth.
Using the Distributional Hypothesis, we can create vectors based on co-occurrence counts.
From things that have happened and from things as they exist and from
all things that you know and all those you cannot know, you make
something through your invention that is not a representation but a
whole new thing truer than anything true and alive, and you make it
alive, and if you make it well enough, you give it immortality.
“,”: 3, all: 1, and: 3, cannot: 1, enough: 1, give: 1, if: 1, it: 3, know: 2,
make: 3, something: 1, that: 1, things: 1, those: 1
[ 3, 1, 3, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 1 ]
CO-OCCURRENCE COUNTS
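A minimal sketch of co-occurrence counting with a symmetric context window (the window size of 5 is an arbitrary choice):

from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=5):
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Count every word within `window` positions of the current word.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts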
EASY MEANING ACQUISITION: EASY, BUT NOT SIMPLE
Advantages
• Co-occurrence counts are domain-
general and language-independent.
• It’s an easy and fast method of obtaining
word vectors.
• Co-occurrence counts are easy to
interpret.
Disadvantages
• If |V| is the size of the vocabulary, we
end up with a |V| × |V| matrix.
• The estimated number of English words is
around 1 million.
• Co-occurrence counts are easy, but they
are not simple. They do not compress the
data.
There’s got to be a better way!
Simple meaning
acquisition
__
BUILDING BLOCKS
Gordon Allport (1897-1967) extracted almost 18,000 terms that describe
personality from a then-current dictionary of more than 400,000 entries.
Over time, a consensus has emerged: 5-6 factors explain a considerable
amount of variance in personality.
Extraversion
Positive loadings: outgoing, talkative, vocal
Negative loadings: withdrawn, quiet, shy
Openness to experience
Positive loadings: intellectual, creative, innovative
Negative loadings: shallow, unimaginative, conventional
The “Big Five” or “Big Six” break down personality into (building) blocks.
WORD VECTOR MODELS
BENGIO ET AL.
(2003)
pioneering work
MIKOLOV ET AL.
(2013)
“word2vec”,
most popular
approach
PENNINGTON ET
AL. (2014)
“GloVe”,
competes with
word2vec
The focus here is on a new model by Li et al. (2016), named “Context
guided N-gram Representation”, which is inspired by word2vec.
N-GRAMS
• An n-gram is a sequence of n items.
• Sequences of 1, 2 or 3 items are called unigrams, bigrams and trigrams, respectively.
• Many bigrams and trigrams are extremely rare.
Hydrogen is the most abundant element in the
universe.
Unigrams
hydrogen
most
abundant
element
universe
.
Bigrams
hydrogen-most
most-abundant
abundant-element
element-universe
universe-.
Trigrams
hydrogen-most-abundant
most-abundant-element
abundant-element-universe
element-universe-.
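A minimal sketch that produces the unigrams, bigrams and trigrams of a (filtered) token list like the one above:

def ngrams(tokens, n):
    """Return all n-grams of a token list as '-'-joined strings."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["hydrogen", "most", "abundant", "element", "universe", "."]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, e.g. 'hydrogen-most'
print(ngrams(tokens, 3))  # trigrams, e.g. 'hydrogen-most-abundant'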
CONTEXT OF AN N-GRAM
• Assume that “most abundant” is the current bigram:
Hydrogen is the most abundant element in the
universe.
Unigrams
hydrogen
element
Bigrams
hydrogen-most
abundant-element
Trigrams
hydrogen-most-abundant
most-abundant-element
• These are all the n-grams of the context:
KEY INSIGHT
• All word vectors are initialized randomly.
• Pairs of n-grams differ in how frequently they co-occur. Some n-grams do not co-occur at all.
• Probability of n-grams w1 and w2 co-occurring: P(w1, w2) = (number of co-occurrences of w1 and w2) / (total number of co-occurrences)
• This specifies a probability distribution that we can sample from.
We can change the vectors such that they predict the probability that a
particular n-gram occurs at least once in a particular context.
FIRST STRATEGY
Using this insight, we can start to formulate an algorithm:
1. Sample a pair from the distribution specified by P(w1, w2).
2. Ask the probabilistic model how likely it is that this particular pair appears in the corpus at least once.
3. Update the probabilistic model if the answer is too low.
4. Repeat.
The starting point for the probabilistic model is the sigmoid function.
THE SIGMOID FUNCTION
The sigmoid function s(x) is one of the most
important functions in all of AI:
s(x) = 1 / (1 + e^(−x))
Desirable properties:
• “squashes” any input to the range between 0 and 1
• The derivative is easy to compute: ds/dx = s(x) · (1 − s(x))
PROBABILISTIC MODEL
Let x denote the input to the sigmoid function.
For our purpose, x is the result of the dot product of the two vectors w and c that
represent an n-gram and a context, respectively.
The probability is high (low) when the result of the dot product of the two vectors
is large (small).
x = c · w = Σ_{i=1}^{d} c_i w_i = c_1 w_1 + c_2 w_2 + … + c_d w_d
s(x) = 1 / (1 + e^(−x))
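A minimal sketch of this probability model: the dot product of an n-gram vector and a context vector, squashed by the sigmoid (the dimensionality of 50 and the random initialization are arbitrary choices):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_probability(w, c):
    """Probability that the n-gram (vector w) occurs in the context (vector c)."""
    return sigmoid(np.dot(c, w))

rng = np.random.default_rng(0)
w, c = rng.normal(size=50), rng.normal(size=50)   # randomly initialized 50-dimensional vectors
print(pair_probability(w, c))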
NEGATIVE SAMPLING
Problem: There is a trivial solution.
If we set all vectors equal to each other such that every dot product is around 40, the sigmoid saturates and the
predicted probability is essentially 1 for each pair …
… and the machine hasn’t learned anything.
Solution: In addition to n-gram/context pairs that do occur in the corpus, we create
a set of negative examples, i.e., pairs that do not occur.
FINAL STRATEGY
New and final strategy:
1. Sample one positive pair and k = 5 negative pairs.
2. For each pair, ask the model for the probability of the pair being a positive (negative) example.
3. Update the vectors such that the probability increases for the positive pair and decreases for the negative pairs.
4. Repeat.
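A minimal sketch of step 1, assuming a co-occurrence dictionary mapping (n-gram, context) pairs to counts and a vocabulary list of n-grams; positive pairs are sampled proportionally to their counts, negative pairs are random pairs that never co-occur (the update step uses the logistic loss and gradient descent introduced below):

import random

def sample_pairs(cooccurrence, vocabulary, k=5):
    pairs, weights = zip(*cooccurrence.items())            # {(ngram, context): count, ...}
    positive = random.choices(pairs, weights=weights)[0]   # sample proportionally to co-occurrence counts
    negatives = []
    while len(negatives) < k:
        candidate = (random.choice(vocabulary), random.choice(vocabulary))
        if candidate not in cooccurrence:                   # keep only pairs that do not co-occur
            negatives.append(candidate)
    return positive, negatives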
Logistic loss
__
DEFINITION
Reinforcement learning: maximize the reward
Predictions: minimize loss (“cost”, “error”, “empirical risk”)
A loss function is a measure of the distance between the predicted values and the actual values (“targets”).
The logistic loss function is one of the most important loss functions.
The target can be either 1 (“did occur”) or 0 (“did not occur”).
If the target equals 1: -log(prediction)
If the target equals 0: -log(1-prediction)
loss(prediction, target) = −[ target · log(prediction) + (1 − target) · log(1 − prediction) ]
EXAMPLES
(using base-10 logarithms)
Good prediction
Target: 1
Prediction: 90%
Loss: −log(0.9) ≈ 0.046
This is a good prediction.
Consequently, the loss is small.
Mediocre prediction
Target: 1
Prediction: 40%
Loss: −log(0.4) ≈ 0.398
Only 40% is assigned to the correct
class, so the loss is noticeably larger.
Bad prediction
Target: 1
Prediction: 10%
Loss: −log(0.1) = 1
This prediction is inaccurate
and the loss, therefore, is high.
loss(prediction, target) = −[ target · log(prediction) + (1 − target) · log(1 − prediction) ]
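A minimal sketch of the logistic loss; base-10 logarithms are used here only to reproduce the numbers above, the natural logarithm is more common in practice:

import math

def logistic_loss(prediction, target):
    # Negative log probability assigned to the true class.
    return -(target * math.log10(prediction) + (1 - target) * math.log10(1 - prediction))

print(logistic_loss(0.9, 1))  # ~0.046  (good prediction)
print(logistic_loss(0.4, 1))  # ~0.398  (mediocre prediction)
print(logistic_loss(0.1, 1))  # 1.0     (bad prediction)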
LIKELIHOOD
The likelihood function returns the probability of the data for a given parameter:
L(parameter | data) = P(data | parameter) = ∏_{i=1}^{n} P(data point_i | parameter)
Example: L(p_H = 0.5 | HH) = P(HH | p_H = 0.5) = 0.25
LOG LIKELIHOOD
In practice, it is convenient to use the log likelihood:
log L(parameter | data) = log ∏_{i=1}^{n} P(data point_i | parameter) = Σ_{i=1}^{n} log P(data point_i | parameter)
• This reduces the computational cost through a transformation from multiplication to addition.
• Using the log likelihood helps avoid underflow problems.
MAXIMUM LIKELIHOOD APPROACH
The model parameters, i.e., the word vectors, can be estimated based on the maximum likelihood:
θ* = argmax_θ Σ_{i=1}^{n} log P(data point_i | θ)
Maximizing f(x) is equivalent to minimizing −f(x):
θ* = argmin_θ [ −Σ_{i=1}^{n} log P(data point_i | θ) ]
For a random variable with two outcomes, the logistic loss is the negative log likelihood.
Thus, minimizing the logistic loss is equivalent to the maximum likelihood approach.
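A minimal sketch of the coin-flip example: the likelihood of HH at p_H = 0.5 is 0.25, and scanning candidate parameters for the data HHT shows the likelihood is maximized near p_H = 2/3:

def likelihood(p_heads, flips):
    result = 1.0
    for flip in flips:                  # independent flips: multiply the probabilities
        result *= p_heads if flip == "H" else 1 - p_heads
    return result

print(likelihood(0.5, "HH"))            # 0.25, as in the example above
candidates = [i / 10 for i in range(11)]
print(max(candidates, key=lambda p: likelihood(p, "HHT")))  # 0.7, close to the true maximum at 2/3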
Gradient Descent
__
GRADIENT
For a differentiable multivariable function f(x1, …, xn),
the gradient is a vector whose components are the partial derivatives of f:
∇f = [ ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ]
DIRECTIONAL DERIVATIVE
The rate of change of the function f in the direction of a unit vector û is called the directional derivative:
D_û f(x) = ∇f(x) · û
By the geometric definition of the dot product:
D_û f(x) = |∇f| · |û| · cos(angle)
Since |û| = 1:
D_û f(x) = |∇f| · cos(angle)
The right-hand side is maximal when the angle is 0°, since cos(0°) = 1.
Conclusion: The RHS is maximal when the unit vector points in the same direction as the gradient.
INTERPRETATION OF THE GRADIENT
The gradient ∇f points in the direction of the steepest ascent.
The negative of the gradient, −∇f, points in the direction of steepest descent:
D_û f(x) = |∇f| · cos(angle) is minimal when the unit vector
is antiparallel to the gradient.
The latter fact is the basis for the Gradient Descent algorithm.
PURPOSE
• This is one of the simplest and most powerful algorithms in artificial intelligence.
• A description of Gradient Descent was first published by Peter Debye in 1909.
• Debye pointed out that the idea occurred in a note by Riemann in 1863.
• The purpose of Gradient Descent is to find parameters that minimize a function value.
• The core idea is to update the parameters using the negative of the gradient weighted by a learning rate γ :
θupdated = θcurrent – γ 𝛁f(x)
GRADIENT DESCENT
1. Initialize the parameters randomly.
2. Calculate the gradient at the current point x.
3. Update the current parameters using the following rule: θ_updated = θ_current − γ ∇f(x)
4. Repeat until you are satisfied with the current parameters or GD gets stuck in a bad local minimum.
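A minimal sketch of gradient descent on a simple one-dimensional function, f(x) = (x − 3)^2, whose minimum is at x = 3 (the starting point, learning rate and iteration count are arbitrary choices):

def gradient(x):
    return 2 * (x - 3)                        # derivative of f(x) = (x - 3)**2

x = 10.0                                      # step 1: initialize the parameter
learning_rate = 0.1
for _ in range(100):                          # step 4: repeat
    x = x - learning_rate * gradient(x)       # steps 2-3: compute the gradient and update
print(x)                                      # ≈ 3.0, the minimizer of f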
A SIMPLE EXAMPLE
Source: hackernoon.com
INCREASED LEARNING RATE
Source: hackernoon.com
A MORE REALISTIC EXAMPLE
Source: Analytics Vidhya
From words to documents
__
DOCUMENT VECTORS
Vector representations can be calculated at different levels: characters, words, phrases, sentences and documents.
Vectors that capture the meaning of documents are known as document vectors or document embeddings.
Document embeddings often have the same dimensions as word vectors.
The easiest method is to take an average of word vectors. This works surprisingly well in some cases, but can be
improved upon.
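A minimal sketch of this averaging baseline, assuming word_vectors is a dictionary that maps words to NumPy arrays:

import numpy as np

def average_document_vector(tokens, word_vectors):
    """Baseline document embedding: the mean of the word vectors of the known tokens."""
    vectors = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    return np.mean(vectors, axis=0) if vectors else None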
What’s the next simplest approach?
Li et al. (2016b) apply the strategy just described for word vectors to documents.
For each document d:
1. Extract all unigrams, bigrams and trigrams
2. Initialize a random document vector
As before, we generate two types of examples:
Positive example
a pair consisting of a document
d and an n-gram that occurs in d
Negative example
a fictitious document/n-gram
pair that did not occur.
PREDICTING WORDS IN A DOCUMENT
We use the same tools that we’ve applied to words:
• Logistic loss to measure the prediction error
• Negative sampling to train the model on positive and negative examples
• Gradient descent to update the vectors
TOOLBOX
DOCUMENT CLASSIFICATION
Classification:
“Given a certain input, which class does this input belong to?”
Binary classification:
“Does this input belong to the class A or class B?”
Sentiment analysis:
“Does this document express a positive or negative sentiment?”
Document classification:
A document classifier is a function:
• Input: a document vector
• Output: a probability distribution over the set of possible classes
Weights:
To classify something we need to “weigh” the evidence: one weight is assigned to each dimension.
LOGISTIC REGRESSION
There are many different classification models:
logistic regression neural networks random forests support vector machines …
Logistic regression (LR) is a good place to start.
Another new concept? No! LR is essentially the sigmoid function plus weights and a bias.
LR(input) = sigmoid(weights · input + bias)
LR predicts:
• the positive class when the result is at least 0.5
• the negative class otherwise.
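A minimal sketch of such a classifier over document vectors:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_regression(document_vector, weights, bias):
    """Probability of the positive class for one document vector."""
    return sigmoid(np.dot(weights, document_vector) + bias)

def predict(document_vector, weights, bias):
    return "positive" if logistic_regression(document_vector, weights, bias) >= 0.5 else "negative"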
FITTING THE MODEL
Suppose we have a data set in which every document is annotated with the class it belongs to.
Example: (“Best movie I’ve ever seen”, positive), (“What a disappointing movie!”, negative)
We start with randomly initialized model weights and update them iteratively.
In each iteration, we go through every document in the dataset.
For each document, we run gradient descent and update the weights each time.
Recent applications
__
INFORMATION EXTRACTION: PHARMACEUTICAL APPLICATIONS
Drug repositioning:
• Almost 12,500 disease categories in the ICD-10
• Close to 1,500 drugs have been approved by the FDA
Pharmacovigilance:
• 6.5% of admissions are related to adverse drug reactions
• Cost to the NHS is estimated to be £466m
Big data:
• PubMed/Medline:
• 26 million biomedical abstracts
• Twitter:
• 200 billion tweets per year
• Health-specific social media
MACHINE TRANSLATION
• Use of parallel corpora
• Google’s Neural Machine Translation System:
Side-by-side comparison on 500 examples:
Machine: 4.46
Human: 4.82
• Translation from standard English to Basic English:
• For beginners, children, the mentally handicapped, …
• Translation from one legal system to another:
• Cost of extending legal assets will drop
• Further extension of the division of labor:
(a) effective outsourcing of communication-intensive jobs
(b) more cultural exchange
TEXT-TO-SPEECH
Voices.com estimate:
• $15 billion industry size
• $2,000 average cost for a national TV ad
Amazon Polly:
• 1 million characters, ~ 23h speech duration: $4.00
• 47 voices, 24 languages
DeepMind WaveNet:
• Naturalness ratings on a scale of 1-5:
• US English: 4.21 (vs. 4.55 for human speech)
• Mandarin: 4.08 (vs. 4.21)
Speech Synthesis Markup Language (SSML):
• comparable to other markup languages such as HTML
• settings for emphasis, pitch, speaking rate, volume, etc.
CONVERSATIONAL AGENTS
• Open domain vs. closed-domain
• Current agents are vague and non-committal
• “I don’t know.”, “yeah”, “sure”.
• Some objective functions promote diversity.
• Persona-based agents aim for speaker consistency:
Q: Where are you from?
A: I’m from England.
Q: In which city do you live now?
A: I live in London.
• Recently, agents were trained on transcripts from “Friends” and “The Big Bang Theory”.
• Speakers are represented by embeddings:
(a) Helps infer answers to questions that are not addressed in the current conversation.
(b) Embeddings for Rachel, Ross, Emily, etc.
Thank you!
__

More Related Content

Similar to Introduction to Artificial Intelligence

NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptOlusolaTop
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information RetrievalRoelof Pieters
 
Lecture 2: Language
Lecture 2: LanguageLecture 2: Language
Lecture 2: LanguageDavid Evans
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...Errors of Artificial Intelligence, their Correction and Simplicity Revolution...
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...Alexander Gorban
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 
CMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics ICMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics Ibutest
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Fuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introductionFuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introductionNagasuri Bala Venkateswarlu
 
ODSC London 2018
ODSC London 2018ODSC London 2018
ODSC London 2018Kfir Bar
 
SoftComputing.pdf
SoftComputing.pdfSoftComputing.pdf
SoftComputing.pdfktosri
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2Karthik Murugesan
 
David Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIDavid Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIBayes Nets meetup London
 

Similar to Introduction to Artificial Intelligence (20)

NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Deep Learning for Information Retrieval
Deep Learning for Information RetrievalDeep Learning for Information Retrieval
Deep Learning for Information Retrieval
 
Lecture 2: Language
Lecture 2: LanguageLecture 2: Language
Lecture 2: Language
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...Errors of Artificial Intelligence, their Correction and Simplicity Revolution...
Errors of Artificial Intelligence, their Correction and Simplicity Revolution...
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
CMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics ICMSC 723: Computational Linguistics I
CMSC 723: Computational Linguistics I
 
Academic Course: 04 Introduction to complex systems and agent based modeling
Academic Course: 04 Introduction to complex systems and agent based modelingAcademic Course: 04 Introduction to complex systems and agent based modeling
Academic Course: 04 Introduction to complex systems and agent based modeling
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Fuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introductionFuzzy mathematics:An application oriented introduction
Fuzzy mathematics:An application oriented introduction
 
ODSC London 2018
ODSC London 2018ODSC London 2018
ODSC London 2018
 
SoftComputing.pdf
SoftComputing.pdfSoftComputing.pdf
SoftComputing.pdf
 
artficial intelligence
artficial intelligenceartficial intelligence
artficial intelligence
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Optimization
OptimizationOptimization
Optimization
 
L2 Thinking
L2 ThinkingL2 Thinking
L2 Thinking
 
David Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIDavid Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AI
 
Swarm intel
Swarm intelSwarm intel
Swarm intel
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Introduction to Artificial Intelligence

  • 1. Introduction to Artificial Intelligence Sebastian E. Kwiatkowski (sebastian@aisummary.com) __
  • 2. HIGH-LEVEL OVERVIEW Machine learning:  Gives computers the ability to learn from data (as opposed to hard-coded rules)  “AI that actually works” Big data:  Optimistic view: we are able to practice big data learning  A more skeptical view: we are unable to practice small data learning Natural language processing (NLP):  Teaches computers an understanding of languages  Focuses on English (and Mandarin), but many methods works on all languages  including non-natural languages  and different types of languages
  • 3. 16,000 words spoken per person per day 100 trillion words words spoken by humanity per day 28 million papers (1980-2012) 130 million books indexed by Google 1 billion websites on the World Wide Web 500 million videos hosted on YouTube WHY NATURAL LANGUAGE PROCESSING?
  • 4. NLP TECHNOLOGY IN EVERYDAY LIFE Search engines Virtual assistants NLP Machine reading Natural language generation (NLG)
  • 6. COMPOSITIONALITY: IRONY & SARCASM I feel so miserable without you, it’s almost like having you here. It’s not that there isn’t anything positive to say about the film. There is. After 92 minutes, it ends.
  • 7. NORMALIZATION: 16 WAYS OF SPELLING “TOGETHER” 2gtr18, 6.29,2qetha 46,49, 94,together178, 1266, 2gthr10, togeda250,2gether togetha tgthr 2getha togather tOgether toqethaa togeter 2getter togethor tagether 6326, 919, 20, 207, 57, 10,
  • 8. ZIPF’S LAW: STOP WORDS AND HAPAX LEGOMENA Zipf’s Law: - The number of occurrences of a word is inversely proportional to its word rank frequency. - named after the American linguist George Kingsley Zipf (1902-1950)
  • 9. SYNTACTIC AMBIGUITY: TIME FLIES LIKE AN ARROW Certain insects, called “time flies”, happen to like an arrow. (You should) time flies like an arrow would. Time flies like an arrow. (You should) time flies that are like an arrow.
  • 10. SEGMENTATION: BREAKING DOWN WORDS AND SENTENCES Sentence splitting  Sentences can have recursive structures: “He said ‘Hi there!’ to her.”  Full stops need to be distinguished from abbreviations such as (“Mr.” and “U.S.A.”). Word boundaries  Some (variations of some) writing systems don’t have explicit word boundaries  Chinese and Japanese characters Compound words  Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (Cattle marking and beef labeling supervision duties delegation law)  Grundstücksverkehrsgenehmigungszuständigkeitsübertragungsverordnung (Regulation on the delegation of authority concerning land conveyance permissions)
  • 12. Legg & Hutter (2007) collected some 70 odd definitions of intelligence Some commonalities across definitions: • Intelligence is a property of an individual • It concerns the interaction between the individual and the environment • It has to do with the given set of goals that an individual is trying to attain WHAT IS INTELLIGENCE?
  • 13. WHAT IS INTELLIGENCE? Individuals differ from one another in their ability to understand complex ideas, to adapt effectively to the environment, to learn from experience, to engage in various forms of reasoning, to overcome obstacles by taking thought. American Psychology Association
  • 14. WHAT IS INTELLIGENCE? I define [intelligence] as your skill in achieving whatever it is you want to attain in your life within your sociocultural context. Robert Sternberg
  • 15. Intelligence is the ability to adapt effectively to the environment, either by making a change in oneself or by changing the environment or finding a new one. Encyclopedia Britannica WHAT IS INTELLIGENCE?
  • 17. observation action reward agent environment TIME STEP ACTION OBSERVATION REWARD 1 a1 o1 r1 2 a2 o2 r2 … … … … n an on rn AGENT-ENVIRONMENT FRAMEWORK: HISTORY OF INTERACTION
  • 18. Environments Passive environment: actions do not affect observations Active environment: actions do affect observations AGENT-ENVIRONMENT FRAMEWORK: PASSIVE AND ACTIVE ENVIRONMENTS
  • 19. PASSIVE ENVIRONMENTS: ACTIONS DO NOT AFFECT OBSERVATIONS TYPE ACTIONS OBSERVATIONS REWARDS HISTORY EXAMPLE Constant environment no yes Markov chain Bernoulli scheme Higher-order Markov chain Passive environment constant probabilistic depend on actions irrelevant irrelevant last perception all perceptions all perceptions unconditional fixed income payout can be reduced to a Markov chain of the first order coin flip many board games sequence prediction problems no no no probabilistic probabilistic probabilistic
  • 20. ACTIVE ENVIRONMENTS: ACTIONS AFFECT OBSERVATIONS TYPE ACTIONS OBSERVATIONS REWARDS HISTORY EXAMPLE Bandits Markov Decision Process (MDP) Higher-order MDP Partially Observable Markov Decision Process yes depend on actions depend on actions depend on actions not fully observable depend on actions irrelevant last perception last n perception last n perception multi-armed bandit problems in online marketing contextual multi-armed bandits (returning visitors) can be reduced to MDPs of the first order negotiations, sales, teaching yes yes yes
  • 22. Sequence prediction can be thought of as a passive environment. OBSERVATION The agent is presented with a particular sequence. ACTION The task is to predict the next item in the sequence. REWARD The environment evaluates and rewards the prediction. SEQUENCE PREDICTION AS A PASSIVE ENVIRONMENT
  • 23. WHY SEQUENCE PREDICTION? Why focus on sequence prediction tasks? • Probably the most common type of machine learning • Tremendous amount of progress over the last decade • Agents in active environments depend on sequence prediction.
  • 24. SEQUENCE-TO-SEQUENCE PREDICTION: NAMED ENTITY RECOGNITION Identify all named entities in a given sentence: Person Title Organization Tim Cook is Chief Executive Officer of Apple . B-Name I-Name O B-Title I-Title I-Title O B-Org O
  • 25. SEQUENCE-TO-SEQUENCE PREDICTION: SPEECH RECOGNITION vs. Recognize speech Wreck a nice beach
  • 26. Example 2 CLASSIFICATION AS SEQUENCE PREDICTION This is the most common task in all of machine learning. Input: sequence of data Example 3 Medicaltesting: Giventhemedicalhistory, imagingresults,etc.: doesthepatientsuffer fromillnessX? Customersegmentation: Givenhis/herpurchasehistory: whatsegmentofcustomers doesaparticularuser belongto? Biologicalclassification: Edibleorpoisonousfood? Preyorpredator? Trustworthyordishonest? Output: predicted class (a sequence of length 1) Example 1 Example 3
  • 27. Simple solution SEQUENCE PREDICTION: NUMBER COMPLETION TASK 3, 9, 27, 81, What is the next number in the sequence? f(x)=x3 “Startwith 3,thenmultiply thepredecessor by3” f(1) =3,f(2) =9,f(3)= 27,f(4) =81 f(5) =243 f(x)= -15 + 32x – 18x2 + 4x3 f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81 But: f(5) = 195 ? More complex solution
  • 28. A very short history of Solomonoff Induction __
  • 29. Principle of indifference: Keep all hypotheses that are consistent with the data. • Attributed to Epicurus (341 – 270 BC) • Source: On the Nature of Things by the Roman poet Lucretius (99 BC - 55 BC) • Keynes popularized the name principle of indifference. EPICURUS & LUCRETIUS (FEAT. KEYNES): PRINCIPLE OF INDIFFERENCE
  • 30. Ockham’s Razor: Prefer the simplest theory • The principle is attributed to William Ockham (1287 – 1347). • The term “Ockham’s razor” was coined by the Belgian theologian Libert Froidmont in the 17th century. • “Entities must not be multiplied beyond necessity”. • This formulation is from a commentary about the Scottish theologian Duns Scotus (1266-1308). OCKHAM & SCOTUS: OCKHAM’S RAZOR
  • 31. SCOTUS’S RAZOR? I propose this general [principle], well known to Aristotle: the Fifteenth conclusion: Plurality is never to be posited without necessity. Since therefore there is no apparent necessity of positing more essential orders than the two already spoken of, they are the only ones. Duns Scotus
  • 32. Bayes’ Theorem: Assign priors to theories and update your beliefs based on the evidence. 𝑃 𝑡ℎ𝑒𝑜𝑟𝑦 𝑑𝑎𝑡𝑎 = 𝑃 𝑑𝑎𝑡𝑎 𝑡ℎ𝑒𝑜𝑟𝑦 ∗ 𝑃(𝑡ℎ𝑒𝑜𝑟𝑦) 𝑃(𝑑𝑎𝑡𝑎) • It occurs in an essay written by Rev. Thomas Bayes (1701? -1761). • The essay was edited and posthumously published by Richard Price (1723-1791). posterior prior BAYES & PRICE: BAYES’ THEOREM
  • 33. Kolmogorov complexity: the complexity of a theory is the length of the shortest program that computes that theory • Andrey Kolmogorov was a Russian mathematician (1903-1987). • Ockham’s Razor does not tell us how to measure simplicity. • Simplicity can now be defined as low Kolmogorov complexity. • Theories that predict patterns have low Kolmogorov complexity. • Theories that predict random data have high complexity. KOLMOGOROV: KOLMOGOROV COMPLEXITY
  • 34. Solomonoff induction (SI): Keep all theories consistent with the data (Epicurus), but assign priors (Bayes) based on Kolmogorov complexity (Kolmogorov) in a way that favors simpler theories (Ockham) 𝑃 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒 𝑆 = 𝑎𝑛𝑦 𝑝𝑟𝑜𝑔𝑟𝑎𝑚 𝑡ℎ𝑎𝑡 𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑠 𝑆 2−𝑙𝑒𝑛𝑔𝑡ℎ 𝑝𝑟𝑜𝑔𝑟𝑎𝑚 • Solomonoff was an American mathematician (1926-2009). • SI is incomputable, but we can use it as a guiding principle. SOLMONOFF: SOLOMONOFF INDUCTION
  • 36. Machines cannot operate directly on natural language. Neither do humans. We need a method that maps words to vectors. More formally, each word w will be represented as a d-dimensional vector vw. WORD VECTORS 𝑣 𝑤 = [ 𝑣1 𝑤 , 𝑣2 𝑤 , ..., 𝑣 𝑑 𝑤 ]
  • 37. WORD VECTOR: EXAMPLE Liberty [ −0.33689 1.1738 −0.047928 0.46625 0.67902 −0.15174 −0.46996 −0.44411 0.33238 −0.078593 −0.017934 0.16921 −0.22687 −0.4116 −0.64984 −0.032479 −0.17657 0.19539 −0.51467 −0.1865 −0.068173 0.55336 −0.90075 −0.54647 −0.37622 −1.1421 −0.27839 −0.18665 0.66295 −0.60268 1.4837 0.37457 −0.6572 −0.62025 −0.78689 −0.96868 −0.077115 −1.1386 −0.01644 0.098453 0.24518 −1.2068 0.28002 0.51562 −0.85232
  • 38. To obtain word vectors, we need three things: REQUIREMENTS FOR WORD VECTORS a corpus (collection of relevant documents) a method to generate word vectors from the data criteria by which we can evaluate the method and the vectors
  • 39. HUMAN CIVILIZATION What sort of corpus would you send to help the aliens understand human civilization?
  • 40. A typical pre-processing pipeline: CORPUS PRE-PROCESSING SENTENCE SPLITTING The first step is to split the corpus into Sentences. TOKENIZATION Each sentence is then split into a list of tokens. FILTERING Words that occur only a few times are discarded. SENTENCE SPLITTING The first step is to split the corpus Into sentences.
  • 41. CORPUS PRE-PROCESSING: EXAMPLE Raw sentence Washington, D.C. is the capital of the United States. Lowercase washington , d.c. is the capital of the united states . Washington , D.C. is the capital of the United States . Tokenization
  • 42. WHAT CONSTITUTES A GOOD SOLUTION? MEANINGFULNESS Word vectors should relate to dictionary definitions and common sense. USEFULNESS The vectors are a means to an end. They should help solve higher-order NLP problems. DOMAIN GENERALITY The vectors should work in different domains: from news and science to email and fiction. LANGUAGE INDEPENDENCE The same method should work for different languages. COMPUTATIONAL EFFICIENCY The method should be able to obtain vectors from millions of sentences. COMPOSITIONALITY We should be able to combine word vectors to represent the meaning of phrases, sentences and documents.
  • 44. THE DISTRIBUTIONAL HYPOTHESIS The day-to-day practice of playing language games recognizes customs and rules. It follows that a text in such established usage may contain sentences such as ‘Don’t be such an ass!’, ‘You silly ass!’, ‘What an ass he is!’ In these examples, the word ass is in familiar and habitual company, commonly collocated with you silly-, he is a silly-, don’t be such an-. You shall know a word by the company it keeps! One of the meanings of ass is its habitual collocation with such other words as those above quoted. In 1957, an essay by the English linguish John Rupert Firth was published that contained the following passage.
  • 45. Using the Distributional Hypothesis, we can create vectors based on co-occurrence counts. From things that have happened and from things as they exist and from all things that you know and all those you cannot know, you make something through your invention that is not a representation but a whole new thing truer than anything true and alive, and you make it alive, and if you make it well enough, you give it immortality. “,”: 3, all: 1, and: 3, cannot: 1, enough: 1, give: 1, if: 1, it: 3, know: 2, make: 3, something: 1, that: 1, things: 1, those: 1 [ 3, 1, 3, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 1 ] CO-OCCURRENCE COUNTS
  • 46. EASY MEANING ACQUISITION: EASY, BUT NOT SIMPLE Advantages • Co-occurrence counts are domain- general and language-independent. • It’s an easy and fast method of obtaining word vectors. • Co-occurrence counts are easy to interpret. Disadvantages • If |V| is the size of the vocabulary, we end up with a VxV matrix. • The estimated number of English words is around 1 million. • Co-occurrence counts are easy, but they are not simple. They do not compress the data. There’s got to be a better way!
  • 48. BUILDING BLOCKS Gordon Allport (1897-1967) extracted almost 18,000 terms that describe personality from a then-current dictionary of more than 400,000 entries. Over time, a consensus has emerged: 5-6 factors explain a considerable amount of variance in personality. Extraversion Positive loadings: outgoing, talkative, vocal Negative loadings: withdrawn, quiet, shy Openness to experience Positive loadings: intellectual, creative, innovative Negative loadings: shallow, unimaginative, conventional The “Big Five” or “Big Six” break down personality into (building) blocks.
  • 49. WORD VECTOR MODELS BENGIO ET AL. (2003) pioneering work MIKOLOV ET AL. (2013) “word2vec”, most popular approach PENNINGTON ET AL. (2014) “Glove”, competes with word2vec The focus here is on a new model by Li et al. (2016), named “Context guided N-gram Representation” which is inspired by word2vec.
  • 50. N-GRAMS • An n-gram is a sequence of n items. • Sequence of 1, 2 or 3 items are called unigrams, bigrams and trigrams, respectively. • Many bigrams and trigrams are extremely rare. Hydrogen is the most abundant element in the universe. Unigrams hydrogen most abundant element universe . Bigrams hydrogen-most most-abundant abundant-element element-universe universe-. Trigrams hydrogen-most-abundant most-abundant-element abundant-element-universe element-universe-.
  • 51. CONTEXT OF AN N-GRAM • Assume that “most abundant” is the current bigram: Hydrogen is the most abundant element in the universe. Unigrams hydrogen element Bigrams hydrogen-most abundant-element Trigrams hydrogen-most-abundant most-abundant-element • These are all the n-grams of the context:
  • 52. KEY INSIGHT • All word vectors are initialized randomly. • Pairs of n-grams differ in how frequently they co-occur. Some n-grams do not co-occur at all. • Probability of n-grams w1 and w2 co-occurring: 𝑃 𝑤1, 𝑤2 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑜−𝑜𝑐𝑐𝑢𝑟𝑟𝑒𝑛𝑐𝑒𝑠 𝑡𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑜−𝑜𝑐𝑐𝑢𝑟𝑟𝑒𝑛𝑐𝑒𝑠 • This specifies a probability distribution that we can sample form. We can change the vectors such that they predict the probability that a particular n-gram occurs at least once in a particular context.
  • 53. FIRST STRATEGY Using this insight, we can start to formulate an algorithm: Sample a pair from the distribution specified by P(w1, w2) Ask the probabilistic model how likely it is that this particular appears in the corpus at least once Update probabilistic model if the answer is too low. Repeat The starting point for the probabilistic model is the sigmoid function. 1 2 3 4
  • 54. THE SIGMOID FUNCTION The sigmoid function s(x) is one of the most important functions in all of AI: 𝑠 𝑥 = 1 1 + 𝑒−𝑥 Desirable properties: • “squashes” any input to the range between 0 and 1 • The derivative is easy to compute: 𝑑𝑠 𝑑𝑥 = 𝑠 𝑥 (1 − 𝑠 𝑥 )
  • 55. PROBABILISTIC MODEL Let x denote the input to the sigmoid function. For our purpose, x is the result of the dot product of the two vectors w and c that represent an n-gram and a context, respectively. The probability is high (low) when the result of the dot product of the two vectors is large (small). x = c ∙ w = 𝐢=𝟏 𝐝 𝐜𝐢 𝐰𝐢 = c 𝟏w 𝟏 + c 𝟐w 𝟐 + … + c 𝐝w 𝐝 𝑠 𝑥 = 1 1 + 𝑒−𝑥
• 56. NEGATIVE SAMPLING Problem: There is a trivial solution. If we set all vectors equal to each other so that every dot product is large (say, around 40), the predicted probability will be close to 1 for every pair … … and the machine hasn’t learned anything. Solution: In addition to n-gram/context pairs that do occur in the corpus, we create a set of negative examples, i.e., pairs that do not occur.
• 57. FINAL STRATEGY New and final strategy: 1 Sample one positive pair and k = 5 negative pairs 2 For each pair, ask the model for the probability of the pair being a positive (negative) example 3 Update the vectors such that the probability increases for the positive pair and decreases for the negative pairs 4 Repeat
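A minimal sketch of one training step with negative sampling, under several simplifying assumptions: toy vectors, a fixed learning rate, and negatives drawn uniformly from the vocabulary. None of these details are prescribed by the slides.

    import math, random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    def train_step(ngram_vecs, context_vecs, positive_pair, vocabulary, k=5, lr=0.1):
        # positive_pair: an (ngram, context) pair observed in the corpus (label 1)
        # negatives: k pairs with the same context but randomly drawn n-grams (label 0);
        # for simplicity, a negative may occasionally coincide with the positive n-gram
        ngram, context = positive_pair
        samples = [(ngram, 1.0)] + [(random.choice(vocabulary), 0.0) for _ in range(k)]
        for n, label in samples:
            w, c = ngram_vecs[n], context_vecs[context]
            p = sigmoid(dot(w, c))                 # model's current probability for this pair
            error = p - label                      # derivative of the logistic loss w.r.t. the score
            ngram_vecs[n] = [wi - lr * error * ci for wi, ci in zip(w, c)]
            context_vecs[context] = [ci - lr * error * wi for wi, ci in zip(w, c)]

    vocab = ["hydrogen", "most-abundant", "element", "universe"]
    ngram_vecs = {v: [random.uniform(-0.5, 0.5) for _ in range(3)] for v in vocab}
    context_vecs = {v: [random.uniform(-0.5, 0.5) for _ in range(3)] for v in vocab}
    train_step(ngram_vecs, context_vecs, ("most-abundant", "element"), vocab)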
• 59. DEFINITION Reinforcement learning: maximize the reward Prediction: minimize the loss (“cost”, “error”, “empirical risk”) A loss function is a measure of the distance between the predicted values and the actual values (“targets”). The logistic loss function is one of the most important loss functions. The target can be either 1 (“did occur”) or 0 (“did not occur”). If the target equals 1: -log(prediction) If the target equals 0: -log(1 − prediction) loss(prediction, target) = −[ target · log(prediction) + (1 − target) · log(1 − prediction) ]
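A minimal sketch of the logistic loss. The base-10 logarithm is used here to match the worked examples on the next slide; in practice the natural log (math.log) is more common, and the two differ only by a constant factor.

    import math

    def logistic_loss(prediction, target, log=math.log10):
        # target is 1 ("did occur") or 0 ("did not occur")
        return -(target * log(prediction) + (1 - target) * log(1 - prediction))

    print(round(logistic_loss(0.9, 1), 4))   # 0.0458  -- good prediction, small loss
    print(round(logistic_loss(0.4, 1), 4))   # 0.3979  -- mediocre prediction
    print(round(logistic_loss(0.1, 1), 4))   # 1.0     -- bad prediction, large loss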
• 60. EXAMPLES (base-10 logarithms) Good prediction Target: 1 Prediction: 90% Loss: -log(0.9) ≈ 0.0458 This is a good prediction. Consequently, the loss is small. Mediocre prediction Target: 1 Prediction: 40% Loss: -log(0.4) ≈ 0.398 The model puts only 40% probability on the correct class, so the loss is noticeably larger. Bad prediction Target: 1 Prediction: 10% Loss: -log(0.1) = 1 This prediction is inaccurate and the loss, therefore, is high.
• 61. LIKELIHOOD The likelihood function returns the probability of the data for a given parameter: L(parameter | data) = P(data | parameter) = ∏ (i = 1 to n) P(data point_i | parameter) Example: L(p_H = 0.5 | HH) = P(HH | p_H = 0.5) = 0.5 × 0.5 = 0.25
• 62. LOG LIKELIHOOD In practice, it is convenient to use the log likelihood: log L(parameter | data) = log ∏ (i = 1 to n) P(data point_i | parameter) = Σ (i = 1 to n) log P(data point_i | parameter) • This reduces the computational cost through a transformation from multiplication to addition. • Using the log likelihood helps avoid underflow problems.
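A minimal sketch of the coin example from the previous slide, computed with the log likelihood; the data ("HH") and the parameter value are taken from that example.

    import math

    def log_likelihood(p_heads, observations):
        # observations: sequence of 'H' / 'T' outcomes
        # sum of per-observation log probabilities instead of a product of probabilities
        return sum(math.log(p_heads if o == "H" else 1 - p_heads) for o in observations)

    ll = log_likelihood(0.5, "HH")
    print(ll)                 # log(0.25) ~ -1.386
    print(math.exp(ll))       # 0.25, the likelihood from the previous slide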
• 63. MAXIMUM LIKELIHOOD APPROACH The model parameters, i.e., the word vectors, can be estimated by maximum likelihood: θ* = arg max_θ Σ (i = 1 to n) log P(data point_i | θ) Maximizing a function f is equivalent to minimizing −f: θ* = arg min_θ [ −Σ (i = 1 to n) log P(data point_i | θ) ] For a random variable with two outcomes, the logistic loss is the negative log likelihood. Thus, minimizing the logistic loss is equivalent to the maximum likelihood approach.
• 65. GRADIENT For a differentiable multivariable function f(x1, …, xn), the gradient is a vector whose components are the partial derivatives of f: ∇f = [ ∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn ]
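As a small worked illustration (not from the slides), the gradient of f(x, y) = x² + y² is [2x, 2y], which can be checked numerically with finite differences:

    def f(x, y):
        return x**2 + y**2          # analytic gradient: [2x, 2y]

    def numerical_gradient(f, x, y, h=1e-6):
        # central finite differences approximate the partial derivatives
        df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
        df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
        return [df_dx, df_dy]

    print(numerical_gradient(f, 1.0, 2.0))   # ~[2.0, 4.0], matching [2x, 2y]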
• 66. DIRECTIONAL DERIVATIVE The rate of change of the function f in the direction of a unit vector û is called the directional derivative: D_û f(x) = ∇f(x) · û Using the cosine formula for the dot product: D_û f(x) = |∇f| · |û| · cos(angle) Since |û| = 1: D_û f(x) = |∇f| · cos(angle) The right-hand side is maximal when the angle is 0°, since cos(0°) = 1. Conclusion: The RHS is maximal when the unit vector points in the same direction as the gradient.
• 67. INTERPRETATION OF THE GRADIENT The gradient ∇f points in the direction of steepest ascent. The negative of the gradient, −∇f, points in the direction of steepest descent: D_û f(x) = |∇f| · cos(angle) is minimal when the unit vector is antiparallel to the gradient. The latter fact is the basis for the Gradient Descent algorithm.
• 68. PURPOSE • This is one of the simplest and most powerful algorithms in artificial intelligence. • A description of the method of steepest descent was published by Peter Debye in 1909. • Debye pointed out that the idea had already occurred in a note by Riemann in 1863. • The purpose of Gradient Descent is to find parameters that minimize a function value. • The core idea is to update the parameters using the negative of the gradient weighted by a learning rate γ: θ_updated = θ_current − γ · ∇f(θ_current)
• 69. GRADIENT DESCENT 1 Initialize the parameters randomly 2 Calculate the gradient at the current parameters 3 Update the parameters using the rule θ_updated = θ_current − γ · ∇f(θ_current) 4 Repeat until you are satisfied with the current parameters or GD gets stuck in a bad local minimum
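A minimal sketch of Gradient Descent on the function f(x, y) = x² + y² from the earlier gradient illustration; the starting point, learning rate and iteration count are arbitrary choices.

    def gradient(x, y):
        # gradient of f(x, y) = x^2 + y^2
        return [2 * x, 2 * y]

    def gradient_descent(start, learning_rate=0.1, steps=100):
        x, y = start
        for _ in range(steps):
            gx, gy = gradient(x, y)
            x -= learning_rate * gx      # theta_updated = theta_current - gamma * gradient
            y -= learning_rate * gy
        return x, y

    print(gradient_descent((3.0, -4.0)))   # close to (0, 0), the minimum of f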
  • 70. A SIMPLE EXAMPLE Source: hackernoon.com
  • 72. A MORE REALISTIC EXAMPLE Source: Analytics Vidhya
  • 73. From words to documents __
• 74. DOCUMENT VECTORS Vector representations can be calculated at different levels: characters, words, phrases, sentences, documents. Vectors that capture the meaning of documents are known as document vectors or document embeddings. Document embeddings often have the same dimensions as word vectors. The easiest method is to take an average of word vectors. This works surprisingly well in some cases, but can be improved upon. What’s the next simplest approach?
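A minimal sketch of the averaging baseline mentioned above: the document vector is the element-wise mean of the word vectors. The toy vectors and tokenization are illustrative assumptions.

    def average_word_vectors(tokens, word_vectors):
        # element-wise mean of the vectors of all tokens we have a vector for
        known = [word_vectors[t] for t in tokens if t in word_vectors]
        dim = len(next(iter(word_vectors.values())))
        if not known:
            return [0.0] * dim
        return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

    word_vectors = {"great": [0.9, 0.1], "movie": [0.2, 0.7]}
    print(average_word_vectors(["great", "movie"], word_vectors))   # approximately [0.55, 0.4]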
  • 75. Li et al. (2016b) apply the strategy just described for word vectors to documents. For each document d: 1. Extract all unigrams, bigrams and trigrams 2. Initialize a random document vector As before, we generate two types of examples: Positive example a pair consisting of a document d and an n-gram that occurs in d Negative example a fictitious document/n-gram pair that did not occur. PREDICTING WORDS IN A DOCUMENT
• 76. TOOLBOX We use the same tools that we’ve applied to words: • Logistic loss to measure the prediction error • Negative sampling to train the model on positive and negative examples • Gradient descent to update the vectors
• 77. DOCUMENT CLASSIFICATION Classification: “Given a certain input, which class does this input belong to?” Binary classification: “Does this input belong to class A or to class B?” Sentiment analysis: “Does this document express a positive or a negative sentiment?” Document classification: A document classifier is a function: • Input: a document vector • Output: a probability distribution over the set of possible classes Weights: To classify something, we need to “weigh” the evidence: one weight is assigned to each dimension.
• 78. LOGISTIC REGRESSION There are many different classification models: logistic regression, neural networks, random forests, support vector machines, … Logistic regression (LR) is a good place to start. Another new concept? No! LR is essentially the sigmoid function plus weights and a bias: LR(input) = sigmoid(weights · input + bias) LR predicts: • the positive class when the result is at least 0.5 • the negative class otherwise.
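A minimal sketch of logistic regression as a prediction function over a document vector; the 0.5 threshold follows the slide, while the specific weights, bias and input values are made up for illustration.

    import math

    def logistic_regression(document_vector, weights, bias):
        # sigmoid(weights . input + bias) -> probability of the positive class
        score = sum(w * x for w, x in zip(weights, document_vector)) + bias
        return 1.0 / (1.0 + math.exp(-score))

    doc_vec = [0.55, 0.4]                 # e.g. an averaged document vector
    weights = [1.5, -0.8]
    bias = 0.1
    prob = logistic_regression(doc_vec, weights, bias)
    print(prob, "positive" if prob >= 0.5 else "negative")   # ~0.65 -> positive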
• 79. FITTING THE MODEL Suppose we have a data set in which every document is annotated with the class it belongs to. Example: (“Best movie I’ve ever seen”, positive) (“What a disappointing movie!”, negative) We start with randomly initialized model weights and update them iteratively. In each iteration, we go through every document in the dataset. For each document, we take a gradient descent step and update the weights.
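A minimal sketch of this fitting loop for logistic regression with the logistic loss. The gradient update, learning rate, epoch count and toy document vectors are illustrative assumptions rather than the slides' exact procedure (and the weights start at zero instead of random values, for a reproducible toy example).

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def fit(dataset, dim, learning_rate=0.5, epochs=100):
        # dataset: list of (document_vector, label) with label 1 = positive, 0 = negative
        weights, bias = [0.0] * dim, 0.0
        for _ in range(epochs):
            for doc_vec, label in dataset:
                prediction = sigmoid(sum(w * x for w, x in zip(weights, doc_vec)) + bias)
                error = prediction - label            # gradient of the logistic loss w.r.t. the score
                weights = [w - learning_rate * error * x for w, x in zip(weights, doc_vec)]
                bias -= learning_rate * error
        return weights, bias

    dataset = [([0.9, 0.1], 1), ([0.1, 0.8], 0)]      # made-up document vectors with labels
    weights, bias = fit(dataset, dim=2)
    print(weights, bias)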
• 81. INFORMATION EXTRACTION: PHARMACEUTICAL APPLICATIONS Drug repositioning: • Almost 12,500 disease categories in the ICD-10 • Close to 1,500 drugs have been approved by the FDA Pharmacovigilance: • 6.5% of hospital admissions are related to adverse drug reactions • The cost to the NHS is estimated to be £466m Big data: • PubMed/Medline: 26 million biomedical abstracts • Twitter: 200 billion tweets per year • Health-specific social media
  • 82. MACHINE TRANSLATION • Use of parallel corpora • Google’s Neural Machine Translation System: Side-by-side comparison on 500 examples: Machine: 4.46 Human: 4.82 • Translation from standard English to Basic English: • For beginners, children, the mentally handicapped, … • Translation from one legal system to another: • Cost of extending legal assets will drop • Further extension of the division of labor: (a) effective outsourcing of communication-intensive jobs (b) more cultural exchange
• 83. TEXT-TO-SPEECH Voices.com estimate: • $15 billion industry size • $2,000 average cost for a national TV ad Amazon Polly: • 1 million characters (~23 hours of speech): $4.00 • 47 voices, 24 languages DeepMind WaveNet: • Naturalness ratings on a scale of 1-5: • US English: 4.21 (vs. 4.55 for human speech) • Mandarin: 4.08 (vs. 4.21) Speech Synthesis Markup Language (SSML): • comparable to other markup languages such as HTML • settings for emphasis, pitch, speaking rate, volume, etc.
• 84. CONVERSATIONAL AGENTS • Open-domain vs. closed-domain • Current agents are vague and non-committal: “I don’t know.”, “yeah”, “sure”. • Some objective functions promote diversity. • Persona-based agents aim for speaker consistency: Q: Where are you from? A: I’m from England. Q: In which city do you live now? A: I live in London. • Recently, agents were trained on transcripts from “Friends” and “The Big Bang Theory”. • Speakers are represented by embeddings: (a) Helps infer answers to questions that are not present in the current discussion. (b) Embeddings for Rachel, Ross, Emily, etc.