- High-level overview
- Challenges in natural language processing
- What is intelligence?
- Sequence prediction
- A very short history of Solomonoff induction
- Meaning acquisition
- Logistic loss
- Gradient descent
- Applications
2. HIGH-LEVEL OVERVIEW
Machine learning:
Gives computers the ability to learn from data (as opposed to hard-coded rules)
“AI that actually works”
Big data:
Optimistic view: we are able to practice big data learning
A more skeptical view: we are unable to practice small data learning
Natural language processing (NLP):
Teaches computers an understanding of languages
Focuses on English (and Mandarin), but many methods work on all languages
including non-natural languages
and different types of languages
3. WHY NATURAL LANGUAGE PROCESSING?
16,000 words spoken per person per day
100 trillion words spoken by humanity per day
28 million papers (1980-2012)
130 million books indexed by Google
1 billion websites on the World Wide Web
500 million videos hosted on YouTube
4. NLP TECHNOLOGY IN EVERYDAY LIFE
Search engines
Virtual assistants
Machine reading
Natural language generation (NLG)
6. COMPOSITIONALITY: IRONY & SARCASM
I feel so miserable without you,
it’s almost like having you here.
It’s not that there isn’t anything
positive to say about the film.
There is.
After 92 minutes, it ends.
8. ZIPF’S LAW: STOP WORDS AND HAPAX LEGOMENA
Zipf’s Law:
- The number of occurrences of a
word is inversely proportional to its
rank in the frequency table.
- named after the American linguist
George Kingsley Zipf (1902-1950)
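A minimal sketch of how the rank-frequency relationship can be inspected. The toy token list and the helper name `rank_frequency` are assumptions made for illustration; a real check would use a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

# Toy corpus built so that count is roughly proportional to 1 / rank.
toy_tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["zipf"] * 3 + ["hapax"] * 1

table = rank_frequency(toy_tokens)
for rank, word, count in table:
    # Under Zipf's Law, rank * count stays roughly constant.
    print(rank, word, count, rank * count)

# Hapax legomena are words that occur exactly once.
hapax_legomena = [word for _, word, count in table if count == 1]
```

Stop words sit at the top of such a table, hapax legomena at the bottom.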
9. SYNTACTIC AMBIGUITY: TIME FLIES LIKE AN ARROW
Time flies like an arrow.
- Certain insects, called “time flies”, happen to like an arrow.
- (You should) time flies like an arrow would.
- (You should) time flies that are like an arrow.
10. SEGMENTATION: BREAKING DOWN WORDS AND SENTENCES
Sentence splitting
Sentences can have recursive structures: “He said ‘Hi there!’ to her.”
Full stops that end a sentence need to be distinguished from full stops in abbreviations such as “Mr.” and “U.S.A.”.
Word boundaries
Some (variations of some) writing systems don’t have explicit word boundaries
Chinese and Japanese characters
Compound words
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(Cattle marking and beef labeling supervision duties delegation law)
Grundstücksverkehrsgenehmigungszuständigkeitsübertragungsverordnung
(Regulation on the delegation of authority concerning land conveyance permissions)
12. Legg & Hutter (2007) collected some 70-odd definitions of intelligence
Some commonalities across definitions:
• Intelligence is a property of an individual
• It concerns the interaction between the individual and the environment
• It has to do with the given set of goals that an individual is trying to attain
WHAT IS INTELLIGENCE?
13. WHAT IS INTELLIGENCE?
Individuals differ from one another in their ability to
understand complex ideas, to adapt effectively to the
environment, to learn from experience, to engage in
various forms of reasoning, to overcome obstacles by
taking thought.
American Psychological Association
14. WHAT IS INTELLIGENCE?
I define [intelligence] as your skill in achieving whatever
it is you want to attain in your life within your
sociocultural context.
Robert Sternberg
15. Intelligence is the ability to adapt effectively to the
environment, either by making a change in oneself or by
changing the environment or finding a new one.
Encyclopedia Britannica
WHAT IS INTELLIGENCE?
18. Environments
Passive environment:
actions do not affect observations
Active environment:
actions do affect observations
AGENT-ENVIRONMENT FRAMEWORK: PASSIVE AND ACTIVE ENVIRONMENTS
19. PASSIVE ENVIRONMENTS: ACTIONS DO NOT AFFECT OBSERVATIONS
TYPE ACTIONS OBSERVATIONS REWARDS HISTORY EXAMPLE
Constant
environment
no
yes
Markov chain
Bernoulli scheme
Higher-order Markov
chain
Passive
environment
constant
probabilistic
depend on actions
irrelevant
irrelevant
last perception
all perceptions
all perceptions
unconditional fixed
income payout
can be reduced to a
Markov chain of the first
order
coin flip
many board
games
sequence prediction
problems
no
no
no
probabilistic
probabilistic
probabilistic
20. ACTIVE ENVIRONMENTS: ACTIONS AFFECT OBSERVATIONS
TYPE | ACTIONS | OBSERVATIONS | REWARDS | HISTORY | EXAMPLE
Bandits | yes | depend on actions | depend on actions | irrelevant | multi-armed bandit problems in online marketing
Markov Decision Process (MDP) | yes | depend on actions | depend on actions | last perception | contextual multi-armed bandits (returning visitors)
Higher-order MDP | yes | depend on actions | depend on actions | last n perceptions | can be reduced to MDPs of the first order
Partially Observable Markov Decision Process | yes | not fully observable | depend on actions | last n perceptions | negotiations, sales, teaching
22. Sequence prediction can be thought of as a passive environment.
OBSERVATION
The agent is presented
with a particular sequence.
ACTION
The task is to predict the
next item in the sequence.
REWARD
The environment evaluates
and rewards the prediction.
SEQUENCE PREDICTION AS A PASSIVE ENVIRONMENT
23. WHY SEQUENCE PREDICTION?
Why focus on sequence prediction tasks?
• Probably the most common type of machine learning
• Tremendous amount of progress over the last decade
• Agents in active environments depend on sequence prediction.
24. SEQUENCE-TO-SEQUENCE PREDICTION: NAMED ENTITY RECOGNITION
Identify all named entities in a given sentence:
Person
Title
Organization
Tim Cook is Chief Executive Officer of Apple .
B-Name I-Name O B-Title I-Title I-Title O B-Org O
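The tag sequence above follows the BIO scheme: B- begins an entity, I- continues it, and O marks tokens outside any entity. A minimal sketch of how such tags can be turned back into entity spans; the function name `decode_bio` is an assumption for illustration.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" (outside) or an inconsistent I- tag closes the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = "Tim Cook is Chief Executive Officer of Apple .".split()
tags = "B-Name I-Name O B-Title I-Title I-Title O B-Org O".split()
entities = decode_bio(tokens, tags)
print(entities)
```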
26. CLASSIFICATION AS SEQUENCE PREDICTION
This is the most common task in all of machine learning.
Input: sequence of data
Output: predicted class (a sequence of length 1)
Examples:
Medical testing:
Given the medical history, imaging results, etc.: does the patient suffer from illness X?
Customer segmentation:
Given his/her purchase history: what segment of customers does a particular user belong to?
Biological classification:
Edible or poisonous food? Prey or predator? Trustworthy or dishonest?
27. SEQUENCE PREDICTION: NUMBER COMPLETION TASK
3, 9, 27, 81, ?
What is the next number in the sequence?
Simple solution
$f(x) = 3^x$
“Start with 3, then multiply the predecessor by 3”
f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
f(5) = 243
More complex solution
$f(x) = -15 + 32x - 18x^2 + 4x^3$
f(1) = 3, f(2) = 9, f(3) = 27, f(4) = 81
But: f(5) = 195
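Both candidate rules, the exponential and the cubic polynomial, reproduce the four observed numbers and only disagree on the fifth. A short check (the function names are chosen here for illustration):

```python
def simple_hypothesis(x):
    """Start with 3, then multiply the predecessor by 3."""
    return 3 ** x

def complex_hypothesis(x):
    """A cubic polynomial that passes through the same four points."""
    return -15 + 32 * x - 18 * x ** 2 + 4 * x ** 3

observed = [3, 9, 27, 81]
for name, f in [("simple", simple_hypothesis), ("complex", complex_hypothesis)]:
    fits = [f(x) for x in range(1, 5)] == observed
    print(name, "fits the data:", fits, "| prediction for x = 5:", f(5))
```

The data alone cannot decide between the two; some additional principle is needed to rank them.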
29. Principle of indifference: Keep all hypotheses that are consistent with the data.
• Attributed to Epicurus (341 – 270 BC)
• Source: On the Nature of Things by the Roman poet Lucretius (99 BC - 55 BC)
• Keynes popularized the name principle of indifference.
EPICURUS & LUCRETIUS (FEAT. KEYNES): PRINCIPLE OF INDIFFERENCE
30. Ockham’s Razor: Prefer the simplest theory
• The principle is attributed to William Ockham (1287 – 1347).
• The term “Ockham’s razor” was coined by the Belgian theologian Libert Froidmont
in the 17th century.
• “Entities must not be multiplied beyond necessity”.
• This formulation is from a commentary about the Scottish theologian Duns
Scotus (1266-1308).
OCKHAM & SCOTUS: OCKHAM’S RAZOR
31. SCOTUS’S RAZOR?
I propose this general [principle], well known to Aristotle:
the Fifteenth conclusion:
Plurality is never to be posited without necessity.
Since therefore there is no apparent necessity of positing more essential orders
than the two already spoken of, they are the only ones.
Duns Scotus
32. Bayes’ Theorem: Assign priors to theories and update your beliefs based on the evidence.
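$P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{P(\text{data})}$
Here P(theory | data) is the posterior and P(theory) is the prior.
• It occurs in an essay written by Rev. Thomas Bayes (1701? -1761).
• The essay was edited and posthumously published by Richard Price (1723-1791).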
BAYES & PRICE: BAYES’ THEOREM
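A small worked example of such a belief update over two competing theories. The two coin theories, their priors and the observed data are assumptions invented for this sketch.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem to a discrete set of theories."""
    # P(data) is the sum of P(data | theory) * P(theory) over all theories.
    evidence = sum(priors[t] * likelihoods[t] for t in priors)
    return {t: priors[t] * likelihoods[t] / evidence for t in priors}

# Prior beliefs before seeing any data (assumed to be equal here).
priors = {"fair coin": 0.5, "biased coin": 0.5}

# P(data | theory) for observing two heads in a row,
# where the biased coin is assumed to land heads 90% of the time.
likelihoods = {"fair coin": 0.5 ** 2, "biased coin": 0.9 ** 2}

beliefs = posterior(priors, likelihoods)
print(beliefs)
```

After two heads, the biased-coin theory is more probable than the fair-coin theory, although neither has been ruled out.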
33. Kolmogorov complexity: the complexity of a theory is the length of the
shortest program that computes that theory
• Andrey Kolmogorov was a Russian mathematician (1903-1987).
• Ockham’s Razor does not tell us how to measure simplicity.
• Simplicity can now be defined as low Kolmogorov complexity.
• Theories that predict patterns have low Kolmogorov complexity.
• Theories that predict random data have high complexity.
KOLMOGOROV: KOLMOGOROV COMPLEXITY
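Kolmogorov complexity itself cannot be computed, but the length of a compressed file gives a rough upper bound. The sketch below uses Python's `zlib` as a stand-in compressor; this substitution is an assumption of the example, not part of the definition.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length of the zlib-compressed data: a crude upper bound on complexity."""
    return len(zlib.compress(data, 9))

# A highly regular sequence of 10,000 bytes.
patterned = b"ab" * 5000

# 10,000 pseudo-random bytes (fixed seed so the run is reproducible).
# Strictly speaking, a seeded generator is itself a short program;
# zlib simply cannot discover it.
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(10000))

print("patterned:", compressed_length(patterned))
print("noisy:    ", compressed_length(noisy))
```

The patterned sequence shrinks to a small fraction of its size, while the noisy one barely compresses at all.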
34. Solomonoff induction (SI): Keep all theories consistent with the data (Epicurus), but
assign priors (Bayes) based on Kolmogorov complexity (Kolmogorov) in a way that
favors simpler theories (Ockham)
$P(\text{sequence } S) = \sum_{\text{any program that computes } S} 2^{-\text{length}(\text{program})}$
• Solomonoff was an American mathematician (1926-2009).
• SI is incomputable, but we can use it as a guiding principle.
SOLOMONOFF: SOLOMONOFF INDUCTION
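A toy illustration of the idea, not Solomonoff induction itself (which is incomputable). It assumes a tiny hand-picked set of candidate "programs" written as Python expressions, and uses the character length of each expression as a stand-in for program length; both choices are simplifications made for this sketch.

```python
# Candidate "programs": Python expressions in x (hypothetical, hand-picked).
CANDIDATES = [
    "3**x",
    "-15+32*x-18*x**2+4*x**3",
    "2*x+1",
]

def run(source, x):
    """Evaluate one candidate expression at position x."""
    return eval(source, {"__builtins__": {}}, {"x": x})

observed = [3, 9, 27, 81]

# Epicurus: keep every candidate that reproduces the observations.
consistent = [src for src in CANDIDATES
              if [run(src, x) for x in range(1, 5)] == observed]

# Ockham / Kolmogorov: weight 2^(-length), so shorter candidates count more.
weights = {src: 2.0 ** -len(src) for src in consistent}
total = sum(weights.values())

# Bayes: the normalized weights give a distribution over the next item.
next_item = {}
for src, weight in weights.items():
    prediction = run(src, 5)
    next_item[prediction] = next_item.get(prediction, 0.0) + weight / total

print(next_item)
```

Both consistent candidates keep a nonzero probability, but almost all of the weight goes to the shorter one and therefore to 243.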
36. Machines cannot operate directly on natural language.
Neither do humans.
We need a method that maps words to vectors.
More formally, each word w will be represented as a d-dimensional vector $v^w$.
WORD VECTORS
$v^w = [v^w_1, v^w_2, \dots, v^w_d]$
38. To obtain word vectors, we need three things:
REQUIREMENTS FOR WORD VECTORS
a corpus
(collection of
relevant documents)
a method
to generate word
vectors from the
data
criteria
by which we can
evaluate the method
and the vectors
40. A typical pre-processing pipeline:
CORPUS PRE-PROCESSING
SENTENCE
SPLITTING
The first step is to
split the corpus into
sentences.
TOKENIZATION
Each sentence is
then split into a
list of tokens.
FILTERING
Words that occur
only a few times are
discarded.
41. CORPUS PRE-PROCESSING: EXAMPLE
Raw sentence: Washington, D.C. is the capital of the United States.
Tokenization: Washington | , | D.C. | is | the | capital | of | the | United | States | .
Lowercase: washington | , | d.c. | is | the | capital | of | the | united | states | .
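A minimal sketch of the tokenization and lowercasing steps for this sentence. The regular expression is an assumption chosen for this example: it keeps dotted abbreviations such as "D.C." together and splits off other punctuation. Production tokenizers handle many more cases.

```python
import re

# Order matters: try dotted abbreviations first, then words, then punctuation.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)

def lowercase(tokens):
    """Map every token to lower case."""
    return [token.lower() for token in tokens]

raw = "Washington, D.C. is the capital of the United States."
tokens = tokenize(raw)
lowered = lowercase(tokens)
print(tokens)
print(lowered)
```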
42. WHAT CONSTITUTES A GOOD SOLUTION?
MEANINGFULNESS
Word vectors should relate to
dictionary definitions and common
sense.
USEFULNESS
The vectors are a means to an end.
They should help solve
higher-order NLP problems.
DOMAIN GENERALITY
The vectors should work in
different domains: from news and
science to email and fiction.
LANGUAGE
INDEPENDENCE
The same method should work for
different languages.
COMPUTATIONAL
EFFICIENCY
The method should be able to
obtain vectors from millions of
sentences.
COMPOSITIONALITY
We should be able to combine
word vectors to represent the
meaning of phrases, sentences and
documents.
44. THE DISTRIBUTIONAL HYPOTHESIS
The day-to-day practice of playing language games recognizes customs
and rules. It follows that a text in such established usage may contain
sentences such as ‘Don’t be such an ass!’, ‘You silly ass!’, ‘What an ass
he is!’ In these examples, the word ass is in familiar and habitual
company, commonly collocated with you silly-, he is a silly-, don’t be
such an-. You shall know a word by the company it keeps! One of the
meanings of ass is its habitual collocation with such other words as
those above quoted.
This passage is from an essay by the English linguist John Rupert Firth,
published in 1957.
45. Using the Distributional Hypothesis, we can create vectors based on co-occurrence counts.
From things that have happened and from things as they exist and from
all things that you know and all those you cannot know, you make
something through your invention that is not a representation but a
whole new thing truer than anything true and alive, and you make it
alive, and if you make it well enough, you give it immortality.
“,”: 3, all: 1, and: 3, cannot: 1, enough: 1, give: 1, if: 1, it: 3, know: 2,
make: 3, something: 1, that: 1, things: 1, those: 1
[ 3, 1, 3, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 1 ]
CO-OCCURRENCE COUNTS
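A sketch of how co-occurrence counts for one target word can be collected. The window size of 2 and the short pre-tokenized excerpt are assumptions for illustration; the sketch does not attempt to reproduce the exact counts shown above.

```python
from collections import Counter

def cooccurrence_counts(tokens, target, window=2):
    """Count the tokens that appear within `window` positions of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

tokens = ("you make it alive , and if you make it well enough , "
          "you give it immortality").split()
counts = cooccurrence_counts(tokens, "you", window=2)

# Fix an order for the vocabulary to turn the counts into a vector.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]
print(vocabulary)
print(vector)
```

Every word gets a vector of this kind, with one dimension per vocabulary entry.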
46. EASY MEANING ACQUISITION: EASY, BUT NOT SIMPLE
Advantages
• Co-occurrence counts are domain-
general and language-independent.
• It’s an easy and fast method of obtaining
word vectors.
• Co-occurrence counts are easy to
interpret.
Disadvantages
• If |V| is the size of the vocabulary, we
end up with a |V| × |V| matrix.
• The estimated number of English words is
around 1 million.
• Co-occurrence counts are easy, but they
are not simple. They do not compress the
data.
There’s got to be a better way!
48. BUILDING BLOCKS
Gordon Allport (1897-1967) extracted almost 18,000 terms that describe
personality from a then-current dictionary of more than 400,000 entries.
Over time, a consensus has emerged: 5-6 factors explain a considerable
amount of variance in personality.
Extraversion
Positive loadings: outgoing, talkative, vocal
Negative loadings: withdrawn, quiet, shy
Openness to experience
Positive loadings: intellectual, creative, innovative
Negative loadings: shallow, unimaginative, conventional
The “Big Five” or “Big Six” break down personality into (building) blocks.
49. WORD VECTOR MODELS
BENGIO ET AL.
(2003)
pioneering work
MIKOLOV ET AL.
(2013)
“word2vec”,
most popular
approach
PENNINGTON ET
AL. (2014)
“GloVe”,
competes with
word2vec
The focus here is on a new model by Li et al. (2016), named “Context
guided N-gram Representation” which is inspired by word2vec.
50. N-GRAMS
• An n-gram is a sequence of n items.
• Sequences of 1, 2 or 3 items are called unigrams, bigrams and trigrams, respectively.
• Many bigrams and trigrams are extremely rare.
Hydrogen is the most abundant element in the
universe.
Unigrams
hydrogen
most
abundant
element
universe
.
Bigrams
hydrogen-most
most-abundant
abundant-element
element-universe
universe-.
Trigrams
hydrogen-most-abundant
most-abundant-element
abundant-element-universe
element-universe-.
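The lists above can be generated with a few lines of code. Note that the example drops "is", "the" and "in" before building n-grams; the tiny stop-word set below is an assumption that mirrors this.

```python
# Assumption: a minimal stop-word list that matches the example sentence.
STOP_WORDS = {"is", "the", "in"}

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by hyphens."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["hydrogen", "is", "the", "most", "abundant",
          "element", "in", "the", "universe", "."]
content = [token for token in tokens if token not in STOP_WORDS]

unigrams = ngrams(content, 1)
bigrams = ngrams(content, 2)
trigrams = ngrams(content, 3)
print(unigrams)
print(bigrams)
print(trigrams)
```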
51. CONTEXT OF AN N-GRAM
• Assume that “most abundant” is the current bigram:
Hydrogen is the most abundant element in the
universe.
• These are all the n-grams of the context:
Unigrams
hydrogen
element
Bigrams
hydrogen-most
abundant-element
Trigrams
hydrogen-most-abundant
most-abundant-element
52. KEY INSIGHT
• All word vectors are initialized randomly.
• Pairs of n-grams differ in how frequently they co-occur. Some n-grams do not co-occur at all.
• Probability of n-grams w1 and w2 co-occurring:
𝑃 𝑤1, 𝑤2 =
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑜−𝑜𝑐𝑐𝑢𝑟𝑟𝑒𝑛𝑐𝑒𝑠
𝑡𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑜−𝑜𝑐𝑐𝑢𝑟𝑟𝑒𝑛𝑐𝑒𝑠
• This specifies a probability distribution that we can sample from.
We can change the vectors such that they predict the probability that a
particular n-gram occurs at least once in a particular context.
53. FIRST STRATEGY
Using this insight, we can start to formulate an algorithm:
1. Sample a pair from the distribution specified by P(w1, w2)
2. Ask the probabilistic model how likely it is that this particular pair appears in the corpus at least once
3. Update the probabilistic model if the answer is too low.
4. Repeat
The starting point for the probabilistic model is the sigmoid function.
54. THE SIGMOID FUNCTION
The sigmoid function s(x) is one of the most
important functions in all of AI:
$s(x) = \frac{1}{1 + e^{-x}}$
Desirable properties:
• “squashes” any input to the range between 0 and 1
• The derivative is easy to compute:
$\frac{ds}{dx} = s(x) \, (1 - s(x))$
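A minimal implementation of the sigmoid function and its derivative. The branch on the sign of x is an implementation detail added here to avoid overflow for very negative inputs; it does not change the function.

```python
import math

def sigmoid(x):
    """Squash any real number into the range between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow in exp() for very negative x.
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_derivative(x):
    """The derivative can be written in terms of the function value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```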
55. PROBABILISTIC MODEL
Let x denote the input to the sigmoid function.
For our purpose, x is the result of the dot product of the two vectors w and c that
represent an n-gram and a context, respectively.
The probability is high (low) when the result of the dot product of the two vectors
is large (small).
$x = c \cdot w = \sum_{i=1}^{d} c_i w_i = c_1 w_1 + c_2 w_2 + \dots + c_d w_d$
$s(x) = \frac{1}{1 + e^{-x}}$
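In code, the model is a dot product followed by the sigmoid. The three-dimensional vectors below are made-up values for illustration.

```python
import math

def dot(c, w):
    """Dot product of a context vector c and an n-gram vector w."""
    return sum(c_i * w_i for c_i, w_i in zip(c, w))

def pair_probability(c, w):
    """Probability assigned by the model to the (context, n-gram) pair."""
    return 1.0 / (1.0 + math.exp(-dot(c, w)))

context = [1.0, 2.0, 0.5]

# Vectors pointing in a similar direction: large dot product, high probability.
aligned = pair_probability(context, [0.8, 1.5, 0.4])

# Vectors pointing in opposite directions: negative dot product, low probability.
opposed = pair_probability(context, [-0.8, -1.5, -0.4])

print(aligned, opposed)
```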
56. NEGATIVE SAMPLING
Problem: There is a trivial solution.
If we set all vectors equal to each other such that the dot product equals ~ 40, the
probability will be 1 for each pair …
… and the machine hasn’t learned anything.
Solution: In addition to n-gram/context pairs that do occur in the corpus, we create
a set of negative examples, i.e., pairs that do not occur.
57. FINAL STRATEGY
New and final strategy:
1. Sample one positive pair and k = 5 negative pairs
2. For each pair, ask the model for the probability of the pair being a positive (negative) example
3. Update the vectors such that the probability increases for the positive pair and decreases for the negative pairs
4. Repeat
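The strategy can be sketched as a small training loop. Everything concrete here is an assumption made for the example: the toy vocabulary, the positive pairs, the vector dimension, the learning rate and the number of iterations. The update rule is a plain gradient step on the logistic loss and is not claimed to match any particular published implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def train_step(w, c, target, learning_rate):
    """One gradient step on the logistic loss for a single pair (in place)."""
    error = sigmoid(dot(w, c)) - target  # derivative w.r.t. the dot product
    for i in range(len(w)):
        w_i, c_i = w[i], c[i]
        w[i] -= learning_rate * error * c_i
        c[i] -= learning_rate * error * w_i

rng = random.Random(0)
dim, k, learning_rate = 8, 5, 0.1
noise_pool = ["banana", "guitar", "tuesday", "violin", "carpet"]
vocabulary = ["most-abundant", "hydrogen", "element"] + noise_pool

# All vectors are initialized randomly.
vectors = {t: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for t in vocabulary}
contexts = {t: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for t in vocabulary}

positive_pairs = [("most-abundant", "hydrogen"), ("most-abundant", "element")]

for _ in range(500):
    ngram, context = rng.choice(positive_pairs)           # 1 positive pair
    train_step(vectors[ngram], contexts[context], 1.0, learning_rate)
    for noise in rng.sample(noise_pool, k):               # k negative pairs
        train_step(vectors[ngram], contexts[noise], 0.0, learning_rate)

def probability(ngram, context):
    return sigmoid(dot(vectors[ngram], contexts[context]))

print(probability("most-abundant", "hydrogen"))
print(probability("most-abundant", "banana"))
```

Because the negative pairs pull probabilities down, setting all vectors equal is no longer a solution.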
59. DEFINITION
Reinforcement learning: maximize the reward
Predictions: minimize loss (“cost”, “error”, “empirical risk”)
A loss function is a measure of the distance between the predicted values and the actual values (“targets”).
The logistic loss function is one of the most important loss functions.
The target can be either 1 (“did occur”) or 0 (“did not occur”).
If the target equals 1: -log(prediction)
If the target equals 0: -log(1-prediction)
loss(prediction, target) = −[target log(prediction) + (1 − target) log(1 − prediction)]
60. EXAMPLES
Good prediction
Target: 1
Prediction: 90%
Loss: -log(0.9) ≈ 0.0458 (base-10 logarithm)
This is a good prediction.
Consequently, the loss is small.
Mediocre prediction
Target: 0
Prediction: 40%
Loss: -log(1-0.4) ≈ 0.2218
This loss is a function of the
counter-probability of 60%.
Bad prediction
Target: 1
Prediction: 20%
Loss: -log(0.2) ≈ 0.699
This prediction is inaccurate
and the loss, therefore, is high.
loss(prediction, target) = −[target log(prediction) + (1 − target) log(1 − prediction)]
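A direct implementation of the logistic loss. The loss values quoted on this slide correspond to base-10 logarithms, so the sketch defaults to `math.log10`; machine learning libraries usually use the natural logarithm, which only rescales the loss by a constant factor.

```python
import math

def logistic_loss(prediction, target, log=math.log10):
    """Logistic loss for one prediction in (0, 1) and a target of 0 or 1."""
    # Predictions of exactly 0 or 1 are not handled: log(0) is undefined.
    return -(target * log(prediction) + (1 - target) * log(1 - prediction))

print(logistic_loss(0.9, 1))  # -log10(0.9), approximately 0.0458
print(logistic_loss(0.4, 0))  # -log10(0.6), approximately 0.2218
print(logistic_loss(0.2, 1))  # -log10(0.2), approximately 0.699
```

Only one of the two terms is active at a time: the first when the target is 1, the second when the target is 0.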
61. LIKELIHOOD
The likelihood function returns the probability of the data for a given
parameter.
$L(\text{parameter} \mid \text{data}) = P(\text{data} \mid \text{parameter}) = \prod_{i=1}^{n} P(\text{data point}_i \mid \text{parameter})$
Example: $L(p_H = 0.5 \mid HH) = P(HH \mid p_H = 0.5) = 0.25$
62. LOG LIKELIHOOD
In practice, it is convenient to use the log likelihood:
$\log L(\text{parameter} \mid \text{data}) = \log \prod_{i=1}^{n} P(\text{data point}_i \mid \text{parameter}) = \sum_{i=1}^{n} \log P(\text{data point}_i \mid \text{parameter})$
• This reduces the computational cost through a transformation from multiplication to addition.
• Using the log likelihood helps avoid underflow problems.
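The underflow problem is easy to reproduce with coin flips. The sequence of 2,000 heads is an artificial example chosen to make the effect visible.

```python
import math

def likelihood(p_heads, flips):
    """Product of the probabilities of the individual flips."""
    result = 1.0
    for flip in flips:
        result *= p_heads if flip == "H" else 1.0 - p_heads
    return result

def log_likelihood(p_heads, flips):
    """Sum of the log probabilities of the individual flips."""
    return sum(math.log(p_heads if flip == "H" else 1.0 - p_heads)
               for flip in flips)

print(likelihood(0.5, "HH"))           # 0.25

long_run = "H" * 2000
print(likelihood(0.5, long_run))       # underflows to 0.0
print(log_likelihood(0.5, long_run))   # a finite negative number
```

The product becomes too small to represent as a floating-point number, while the sum of logarithms stays well within range.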
63. MAXIMUM LIKELIHOOD APPROACH
The model parameters θ, i.e., the word vectors, can be estimated based on the maximum likelihood:
$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log P(\text{data point}_i \mid \theta)$
A maximization problem w.r.t. f(x) is equivalent to a minimization problem w.r.t. −f(x):
$\theta^* = \arg\min_\theta \left[ -\sum_{i=1}^{n} \log P(\text{data point}_i \mid \theta) \right]$
For a random variable with two outcomes, the logistic loss is the negative log likelihood.
Thus, minimizing the logistic loss is equivalent to the maximum likelihood approach.
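A small numerical check of this equivalence for a coin with unknown probability of heads. The flip sequence and the grid of candidate parameters are assumptions for illustration; a grid search stands in for a proper optimizer.

```python
import math

def log_likelihood(p_heads, flips):
    return sum(math.log(p_heads if flip == "H" else 1.0 - p_heads)
               for flip in flips)

def negative_log_likelihood(p_heads, flips):
    return -log_likelihood(p_heads, flips)

flips = "HHTHHTHH"  # 6 heads, 2 tails

# Candidate parameters 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

theta_max = max(grid, key=lambda p: log_likelihood(p, flips))
theta_min = min(grid, key=lambda p: negative_log_likelihood(p, flips))
print(theta_max, theta_min)
```

Both searches return the same parameter, which is also the observed share of heads.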
65. GRADIENT
For a differentiable multivariable function f(x1, …, xn),
the gradient is a vector whose components are the partial derivatives of f:
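$\nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]$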
66. DIRECTIONAL DERIVATIVE
The rate of change of the function f in the direction of a unit vector û is called the directional derivative:
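$D_{\hat{u}} f(x) = \nabla f(x) \cdot \hat{u}$
By the geometric form of the dot product (a consequence of the Law of Cosines):
$D_{\hat{u}} f(x) = |\nabla f| \, |\hat{u}| \cos(\text{angle})$
Since $|\hat{u}| = 1$:
$D_{\hat{u}} f(x) = |\nabla f| \cos(\text{angle})$
The right-hand side is maximal when the angle is 0°, since cos(0°) = 1.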
Conclusion: The RHS is maximal when the unit vector points in the same direction as the gradient.
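The conclusion can be checked numerically by trying unit vectors in many directions. The function f(x, y) = x² + 3y and the evaluation point are arbitrary choices made for this sketch.

```python
import math

def gradient_f(x, y):
    """Gradient of f(x, y) = x**2 + 3*y, computed by hand."""
    return (2.0 * x, 3.0)

def directional_derivative(point, angle):
    """Dot product of the gradient with the unit vector at the given angle."""
    unit = (math.cos(angle), math.sin(angle))
    grad = gradient_f(*point)
    return grad[0] * unit[0] + grad[1] * unit[1]

point = (1.0, 2.0)  # the gradient here is (2, 3)

# Try one unit vector per degree and keep the direction of fastest increase.
angles = [math.radians(k) for k in range(360)]
best_angle = max(angles, key=lambda a: directional_derivative(point, a))

gradient_angle = math.atan2(3.0, 2.0)
print(math.degrees(best_angle), math.degrees(gradient_angle))
```

The best direction found by the search agrees with the direction of the gradient up to the one-degree resolution of the grid.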
67. INTERPRETATION OF THE GRADIENT
The gradient 𝛁 f points in the direction of the steepest ascent.
The negative of the gradient, -𝛁f, points in the direction of steepest descent:
$D_{\hat{u}} f(x) = |\nabla f| \cos(\text{angle})$ is minimal when the unit vector
is antiparallel to the gradient.
The latter fact is the basis for the Gradient Descent algorithm.
68. PURPOSE
• This is one of the simplest and most powerful algorithms in artificial intelligence.
• A description of Gradient Descent was first published by Peter Debye in 1909.
• Debye pointed out that the idea occurred in a note by Riemann in 1863.
• The purpose of Gradient Descent is to find parameters that minimize a function value.
• The core idea is to update the parameters using the negative of the gradient weighted by a learning rate γ :
$\theta_{\text{updated}} = \theta_{\text{current}} - \gamma \nabla f(x)$
69. GRADIENT DESCENT
1. Initialize the parameters randomly
2. Calculate the gradient at the current point x
3. Update the current parameters using the following rule:
   $\theta_{\text{updated}} = \theta_{\text{current}} - \gamma \nabla f(x)$
4. Repeat until you are satisfied with the current parameters or GD gets stuck in a bad local minimum
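A compact sketch of the algorithm on a function with a known minimum. The test function, the learning rate of 0.1 and the fixed number of steps are assumptions; the starting point is fixed rather than random so that the run is reproducible.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    theta = list(start)
    for _ in range(steps):
        grad = gradient(theta)
        # Update rule: theta_updated = theta_current - learning_rate * gradient
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

def grad_f(theta):
    """Gradient of f(x, y) = (x - 3)**2 + (y + 1)**2, minimal at (3, -1)."""
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]

theta = gradient_descent(grad_f, start=[0.0, 0.0])
print(theta)
```

This function has a single minimum, so the method cannot get stuck; with the loss functions used for word vectors that guarantee does not hold.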
74. DOCUMENT VECTORS
Vector representations can be calculated at different levels: characters, words, phrases, sentences, documents.
Vectors that capture the meaning of documents are known as document vectors or document embeddings.
Document embeddings often have the same dimensions as word vectors.
The easiest method is to take an average of word vectors. This works surprisingly well in some cases, but can be improved upon.
What’s the next simplest approach?
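The averaging method can be sketched in a few lines. The two-dimensional word vectors are made-up values; skipping words without a vector is one possible convention, assumed here.

```python
def average_vectors(word_vectors, tokens):
    """Average the vectors of all tokens that have a word vector."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        raise ValueError("no token in the document has a word vector")
    dim = len(known[0])
    return [sum(vector[i] for vector in known) / len(known) for i in range(dim)]

word_vectors = {
    "best": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "ever": [0.5, 0.5],
}

# "seen" has no vector in this toy dictionary and is skipped.
document_vector = average_vectors(word_vectors, ["best", "movie", "ever", "seen"])
print(document_vector)
```

The result has the same number of dimensions as the word vectors. Word order is lost, which is one reason the average can be improved upon.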
75. Li et al. (2016b) apply the strategy just described for word vectors to documents.
For each document d:
1. Extract all unigrams, bigrams and trigrams
2. Initialize a random document vector
As before, we generate two types of examples:
Positive example
a pair consisting of a document
d and an n-gram that occurs in d
Negative example
a fictitious document/n-gram
pair that did not occur.
PREDICTING WORDS IN A DOCUMENT
76. We use the same tools that we’ve applied to words:
• Logistic loss to measure the prediction error
• Negative sampling to train the model on positive and negative examples
• Gradient descent to update the vectors
TOOLBOX
77. DOCUMENT CLASSIFICATION
Classification:
“Given a certain input, which class does this input belong to?”
Binary classification:
“Does this input belong to the class A or class B?”
Sentiment analysis:
“Does this document express a positive or negative sentiment?”
Document classification:
A document classifier is a function:
• Input: a document vector
• Output: a probability distribution over the set of possible classes
Weights:
To classify something we need to “weigh” the evidence: one weight is assigned to each dimension.
78. LOGISTIC REGRESSION
There are many different classification models:
logistic regression neural networks random forests support vector machines …
Logistic regression (LR) is a good place to start.
Another new concept? No! LR is essentially the sigmoid function plus weights and a bias.
LR(input) = sigmoid(weights · input + bias)
LR predicts:
• the positive class when the result is at least 0.5
• the negative class otherwise.
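The formula translates directly into code. The weights, the bias and the two input vectors are made-up numbers for illustration.

```python
import math

def logistic_regression(weights, bias, inputs):
    """Sigmoid of the weighted sum of the inputs plus a bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Positive class if the probability is at least 0.5, negative otherwise."""
    probability = logistic_regression(weights, bias, inputs)
    return "positive" if probability >= 0.5 else "negative"

weights, bias = [2.0, -1.0], -0.5

print(predict(weights, bias, [1.0, 0.5]))  # weighted sum is 1.0
print(predict(weights, bias, [0.0, 1.0]))  # weighted sum is -1.5
```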
79. FITTING THE MODEL
Example: (“Best movie I’ve ever seen”,
positive)
(“What a disappointing movie!”,
negative)
Suppose we have a data set in which every document is annotated with the class it belongs to.
We start with randomly initialized model weights and update them iteratively.
In each iteration, we go through every document in the dataset.
For each document, we take one gradient descent step and update the weights.
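The fitting procedure can be sketched as follows. The four two-dimensional "document vectors" and their labels are invented for this example, as are the learning rate and the number of passes over the data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(dataset, dim, learning_rate=0.5, epochs=200, seed=0):
    """Fit logistic regression with one gradient step per document."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # random start
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in dataset:
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            error = sigmoid(z) - target  # gradient of the logistic loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical document vectors with labels: 1 = positive, 0 = negative.
dataset = [
    ([0.9, 0.1], 1),
    ([0.8, 0.3], 1),
    ([0.2, 0.9], 0),
    ([0.1, 0.7], 0),
]

weights, bias = fit(dataset, dim=2)

def probability(inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

for inputs, target in dataset:
    print(inputs, target, round(probability(inputs), 3))
```

Updating after every single document is usually called stochastic gradient descent.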
81. INFORMATION EXTRACTION: PHARMACEUTICAL APPLICATIONS
Drug repositioning:
• Almost 12,500 disease categories in the ICD-10
• Close to 1,500 drugs have been approved by the FDA
Pharmacovigilance:
• 6.5% of admissions are related to adverse drug reactions
• Cost to the NHS is estimated to be £466m
Big data:
• PubMed/Medline:
• 26 million biomedical abstracts
• Twitter:
• 200 billion tweets per year
• Health-specific social media
82. MACHINE TRANSLATION
• Use of parallel corpora
• Google’s Neural Machine Translation System:
Side-by-side comparison on 500 examples:
Machine: 4.46
Human: 4.82
• Translation from standard English to Basic English:
• For beginners, children, the mentally handicapped, …
• Translation from one legal system to another:
• Cost of extending legal assets will drop
• Further extension of the division of labor:
(a) effective outsourcing of communication-intensive jobs
(b) more cultural exchange
83. TEXT-TO-SPEECH
Voices.com estimate:
• $15 billion industry size
• $2,000 average cost for a national TV ad
Amazon Polly:
• 1 million characters, ~ 23h speech duration: $4.00
• 47 voices, 24 languages
DeepMind WaveNet:
• Naturalness ratings on a scale of 1-5:
• US English: 4.21 (vs. 4.55 for human speech)
• Mandarin: 4.08 (vs. 4.21)
Speech Synthesis Markup Language (SSML):
• comparable to other markup languages such as HTML
• settings for emphasis, pitch, speaking rate, volume, etc.
84. CONVERSATIONAL AGENTS
• Open-domain vs. closed-domain
• Current agents are vague and non-committal
• “I don’t know.”, “yeah”, “sure”.
• Some objective functions promote diversity.
• Persona-based agents aim for speaker consistency:
Q: Where are you from?
A: I’m from England.
Q: In which city do you live now?
A: I live in London.
• Recently, agents were trained on transcripts from “Friends” and “The Big Bang Theory”.
• Speakers are represented by embeddings:
(a) Helps infer answers to questions that are not present in the current discussion.
(b) Embeddings for Rachel, Ross, Emily, etc.