3. Dr Haithem Afli - Background
§ Computer Science Lecturer at CIT (J102)
- NLP, Data Analytics and ML
§ Science Foundation Ireland Funded Investigator
- Leader of ADAPT@CIT research group
Research Interests:
- Natural Language Processing
- Social Media and UGC Analysis
- Machine Translation
- Data Analytics
§ Lecturing Experience (10+ years)
- 5 years in France
- 3 years in DCU
- 2 years in CIT
05/05/2020 3
4. Overview
§ Introduction to NLP
§ Language Modeling and its applications
§ The Golden Age of AI
§ Language Technologies
§ Ethical Issues (the case of Dialog systems)
haithem.afli@cit.ie 4
5. Language
§ "Language, a system of conventional spoken, manual
(signed), or written symbols by means of
which human beings, as members of a social group and
participants in its culture, express themselves.
§ The functions of language include communication, the
expression of identity, play, imaginative expression,
and emotional release." (Britannica)
7. If you think the language
industry is new, think again!
Rosetta Stone (British Museum)
8. Natural Language:
An age-old industry?
§ For as far back as we can see, humans have needed to
communicate, so the origin of the language industry is closely
intertwined with the need for communication itself
The Tower of Babel and The House of Wisdom in Baghdad (Bait-al-Hikma)
9. The importance of Language
Processing in modern history
Media agencies and translators interpreted the word as "treat with silent contempt" or "not take
into account" (to ignore), that is, as a categorical rejection by the Prime Minister.
The Americans understood that there would never be a diplomatic end to the war and
were naturally annoyed by what they considered the arrogant tone used in the Japanese
translation of the Prime Minister's response. International news agencies reported to the
world that in the eyes of the Japanese government the ultimatum was "not worthy of
comment."
11. http://sma.adaptcentre.ie/ge16/#!/
Social Media Analysis
Haithem Afli, Sorcha McGuire, and Andy Way. 2017. Sentiment translation for low resourced languages: Experiments on Irish general election tweets. In 18th International Conference on Computational Linguistics and Intelligent Text Processing.
12. Information Extraction & Sentiment Analysis
§ nice and compact to carry!
§ since the camera is small and light, I won't need to carry
around those heavy, bulky professional cameras either!
§ the camera feels flimsy, is plastic and very light in weight
you have to be very delicate in the handling of this
camera
Attribute discussed in these reviews: size and weight
Attributes:
- zoom
- affordability
- size and weight
- flash
- ease of use
14. UGC Machine Translation - Braziliator
The 2014 FIFA World Cup was the biggest event yet for Twitter, with 672 million tweets
Top 3 languages: English, Portuguese, Spanish
Requested translation from Twitter (words): 6,459,830 5,141,360 4,847,590
Grand total from all World Cup matches: 85,047,110
• Source → Target traffic:
• EN → ES 13,614,450 (EN to all languages: 50,545,460)
• ES → EN 5,569,200 (ES to all languages: 10,609,420)
• PT → EN 1,831,750 (PT to all languages: 4,230,880)
16. UI: Sentiment pitch
Final: Germany 1-0 Argentina
3rd Place: Netherlands 3-0 Brazil
Semi-final: Argentina 0-0 Netherlands (Argentina won 4-2 on penalties)
Semi-final: Germany 7-1 Brazil
UGC Machine Translation - Braziliator
18. Now if we return to HAL 9000
https://www.youtube.com/watch?v=ARJ8cAGm6JE
19. HAL: What's needed?
§ Speech recognition and synthesis
§ Knowledge of the English words involved
§ What they mean
§ How groups of words form a sentence
§ How can we define a language?
22. What is a language?
Can we define a language mathematically?
Deterministic Definition:
A language is the set of all the sentences we can
say.
Probabilistic Definition:
A language is the probability distribution over all
possible sentences
→ Statistical Language Model
25. Statistical Language Model
• How can we estimate the probability of a sentence in a
specific language?
• Unlike estimating the probability distribution of a die, we
cannot exhaust all the possible sentences in a
limited sample
• Idea
- Break all sentences down into limited substrings (n-grams)
- Estimate the probability of a sentence from these substrings
- If a sentence has many plausible substrings, then it
might be a reasonable sentence
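The n-gram idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-sentence corpus, the whitespace tokenisation, and the helper names `ngrams` and `plausible_fraction` are hypothetical and do not come from the lecture.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens (the n-grams) as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def plausible_fraction(sentence, known_ngrams, n):
    """Fraction of the sentence's n-grams that were seen in the sample."""
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        return 0.0
    return sum(g in known_ngrams for g in grams) / len(grams)


# Toy sample standing in for "limited samples" of a language (assumption).
corpus = ["the king met the boy", "the boy met the king"]
known_bigrams = Counter(g for s in corpus for g in ngrams(s.split(), 2))

# A sentence never seen as a whole can still be built from seen bigrams.
print(plausible_fraction("the king met the king", known_bigrams, 2))
# A scrambled sentence shares far fewer bigrams with the sample.
print(plausible_fraction("king the boy boy", known_bigrams, 2))
```

The score here is a crude count of familiar substrings, not yet a probability; a statistical language model replaces it with probabilities estimated from the counts.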
26. Simplest Language Model
• Simplest way to break down a sentence
- split it into words
• Thus, the simplest language model: $p(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} p(w_i)$
• Here the probability of a sentence is just the
product of the probabilities of the words in the
sentence
• This model is called the unigram language model
27. Word Frequency
• p(w) is word frequency
Type              Occurrences  Rank
the               3789654      1st
he                2098762      2nd
[...]
king              57897        1,356th
boy               56975        1,357th
[...]
stringyfy         5            34,589th
[...]
transducionalify  1            123,567th
$p(w) = \frac{\text{occurrences of } w}{\text{number of tokens}}$
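A minimal sketch of a unigram model built from word frequencies, assuming a toy token list and no smoothing. The function names `unigram_probs` and `sentence_log_prob` are hypothetical, and log-probabilities are used only to avoid numerical underflow on long sentences.

```python
import math
from collections import Counter


def unigram_probs(tokens):
    """Estimate p(w) = occurrences of w / number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def sentence_log_prob(sentence, probs):
    """Sum of log p(w) over the words; -inf if any word is unseen (no smoothing)."""
    log_p = 0.0
    for word in sentence.lower().split():
        p = probs.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy corpus of 11 tokens (assumption, not data from the slides).
corpus_tokens = "the king saw the boy and the boy saw the king".split()
probs = unigram_probs(corpus_tokens)

print(probs["the"])                                   # 4 occurrences out of 11 tokens
print(math.exp(sentence_log_prob("the king", probs)))  # p(the) * p(king)
print(sentence_log_prob("the queen", probs))           # unseen word
```

Because the model ignores word order, "the king" and "king the" receive the same probability, which is the main weakness that higher-order n-grams address.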
31. 1990s-2010s: Statistical Machine Translation
• Question: How do we learn the translation model?
• First, we need a large amount of parallel data
(e.g. pairs of human-translated French/English sentences)
• Break it down further: we actually want to consider $P(x, a \mid y)$,
where a is the alignment, i.e. word-level correspondence
between French sentence x and English sentence y
32. How does SMT Work?
Fei-Fei Li
34. HAL: What's needed?
§ Dialog
§ It is polite to respond, even if you're planning to kill someone.
§ It is polite to pretend to want to be cooperative (I'm afraid, I
can't…)
35. Conversational Agents
AKA Dialog Agents
§ Phone-based Personal Assistants
SIRI, Alexa, Cortana, Google Assistant
§ Talking to your car
§ Communicating with robots
§ Clinical uses for mental health
§ Chatting for fun
36. Two classes of systems
1. Chatbots
2. (Goal-based) Dialog agents
- SIRI, interfaces to cars, robots,
- booking flights or restaurants
The word "chatbots" is sometimes used in the popular
press for both. We'll use it only for #1.
39. Eliza: Weizenbaum (1966)
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
...
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU?
My father
YOUR FATHER
40. You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I'M AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU
COME HERE
How does it work?
Eliza uses this one weird trick
42. That trick: be a Rogerian psychologist
§ Draw the patient out by reflecting the patient's
statements back at them
§ Rare type of conversation in which one can
"assume the pose of knowing almost
nothing of the real world"
43. Eliza pattern/transform rules
(0 YOU 0 ME) [pattern]
→
(WHAT MAKES YOU THINK I 3 YOU) [transform]
0 means Kleene star *
The 3 is the constituent # in pattern
You hate me
WHAT MAKES YOU THINK I HATE YOU
Dan Jurafsky
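The pattern/transform rule above can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not Weizenbaum's implementation: `compile_pattern` and `apply_rule` are hypothetical helper names, each pattern element becomes one numbered regex group so that "3" refers to the third constituent, and ELIZA's keyword ranking and pronoun swapping are left out.

```python
import re


def compile_pattern(pattern):
    """Turn an ELIZA-style pattern such as '0 YOU 0 ME' into a regex.

    Each element becomes one capture group, so group numbers line up with
    constituent numbers: '0' matches any run of text, a word matches itself.
    """
    parts = []
    for element in pattern.split():
        if element == "0":
            parts.append(r"(.*?)")
        else:
            parts.append(r"(\b" + re.escape(element) + r"\b)")
    return re.compile(r"^\s*" + r"\s*".join(parts) + r"\s*$", re.IGNORECASE)


def apply_rule(pattern, transform, utterance):
    """Return the transformed reply, or None if the pattern does not match."""
    cleaned = utterance.strip().rstrip(".!?")
    match = compile_pattern(pattern).match(cleaned)
    if match is None:
        return None
    reply = []
    for token in transform.split():
        # A number in the transform copies that constituent of the input.
        reply.append(match.group(int(token)).upper() if token.isdigit() else token)
    return " ".join(reply)


rule = ("0 YOU 0 ME", "WHAT MAKES YOU THINK I 3 YOU")
print(apply_rule(*rule, "You hate me"))
print(apply_rule(*rule, "I think you really dislike me."))
print(apply_rule(*rule, "Hello there"))
```

A fuller system would keep a list of such rules, try them in priority order, and fall back to a generic reply when none matches.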
44. Some implications
§ People became deeply emotionally involved with the
program
§ Weizenbaum tells the story of his secretary who would ask
Weizenbaum to leave the room when she talked with ELIZA
§ When he suggested that he might want to store all the ELIZA
conversations for later analysis, people immediately pointed
out the privacy implications
§ Suggesting that they were having quite private conversations with
ELIZA
§ Anthropomorphism and the Heider-Simmel Illusion
§ https://www.youtube.com/watch?v=8FIEZXMUM2I
45. Components of current SIRI-style architectures
[Figure: block diagram. Components: Input from User, Speech Recognition, Word Sequence, NL Understanding, Semantic Interpretation, Inferred User Intent, Dialog Management, Missing Elements? (incomplete: Clarifying Question, Elicitation; complete: Action Selection, Best Outcome), Speech Synthesis, Output to User. Supporting resources: Interaction Model, Interaction Context, World Knowledge, LPM Training.]
Figure from Jerome Bellegarda
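The dialog-management branch in the figure (ask a clarifying question while elements are missing, select an action once the intent is complete) can be sketched as a small slot-filling step. Everything here is hypothetical: the `book_restaurant` intent, its slot names, the `REQUIRED_SLOTS` table and the `dialog_step` function are illustrative assumptions, not part of any real assistant.

```python
# Hypothetical table of intents and the slots each one needs (assumption).
REQUIRED_SLOTS = {
    "book_restaurant": ["place", "time", "party_size"],
}


def dialog_step(intent, slots):
    """Decide the next move for an inferred intent and the slots filled so far.

    Returns ("clarify", question) while something is missing, otherwise
    ("act", description of the selected action).
    """
    required = REQUIRED_SLOTS[intent]
    missing = [name for name in required if name not in slots]
    if missing:
        # Incomplete: elicit the first missing element from the user.
        question = f"What {missing[0].replace('_', ' ')} would you like?"
        return ("clarify", question)
    # Complete: all elements are known, so an action can be selected.
    arguments = ", ".join(f"{name}={slots[name]!r}" for name in required)
    return ("act", f"{intent}({arguments})")


print(dialog_step("book_restaurant", {"place": "Agra"}))
print(dialog_step("book_restaurant", {"place": "Agra", "time": "19:30", "party_size": 2}))
```

In a real system the slots would come from the semantic interpretation stage and the question would be passed on to speech synthesis; this sketch only shows the incomplete/complete decision.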
46. NLP in the Golden Age of AI
NLP has an AI aspect to it.
§ We're often dealing with ill-defined problems
§ We don't often come up with exact solutions/algorithms
§ We can't let either of those facts get in the way of making progress
49. The Rise of Natural Language Processing
(NLP), and How it is Changing the Way we
Retrieve Information
The 'creator' of Bitcoin, Satoshi Nakamoto, is
the world's most elusive billionaire. Very few
people outside of the Department of
Homeland Security know Satoshi's real
name. Satoshi has taken great care to keep
his identity secret employing the latest
encryption and obfuscation methods in his
communications.
Despite these efforts, Satoshi Nakamoto reportedly gave
investigators the only tool they needed to find him:
his own words. Using NLP, the NSA (and everyone!)
was able to compare texts to determine the authorship
of a particular work.
More info: https://tech.slashdot.org/story/17/08/28/1725232/how-the-nsa-identified-satoshi-nakamoto
50. Timeline of (modern) AI
Graph from The University Of Queensland Brain Institute
The first AI Winter
The second AI Winter
Including CIT MSc in AI
https://www.cit.ie/course/CRKARIN9
51. The first AI winter
By 1964, the National Research Council (NRC)
had become concerned about the lack of progress
and formed the Automatic Language Processing
Advisory Committee (ALPAC) to look into the
problem.
They concluded, in a famous 1966 report, that
machine translation was more expensive, less
accurate and slower than human translation.
After spending some 20 million dollars, the NRC
ended all support.
Image from Wikipedia
52. The second AI winter
In 1984, John McCarthy criticized expert systems because they lacked common sense
and knowledge about their own limitations.
Schwarz, Director of DARPA ISTO from 1987 to 1989, concluded that AI research has
always had
"… very limited success in particular areas, followed immediately by failure to reach the
broader goal at which these initial successes seem at first to hint…".
→ Decrease in funding for AI research.
→ Many AI companies closed their doors.
→ Attendance at the AAAI conference, which attracted over 6000
visitors in 1986, quickly decreased to just 2000
by 1991.
53. The survivors
The Deep Learning Godfathers
Turing Award given for:
• "The conceptual and engineering breakthroughs that have made deep neural
networks a critical component of computing."
55. 2014: Generative Adversarial
Networks
§ The neural network at the top is the discriminator, and its task
is to distinguish the training set's real information from the
generator's creations.
§ In the simplest GAN structure, the generator starts with random data
and learns to transform this noise into information that matches
the distribution of the real data.
56. Do you know this person?
https://thispersondoesnotexist.com/
60. DeepFake
§ The development of deepfakes has taken place to a large extent in two
settings: research at academic institutions, and development by amateurs
in online communities.
61. GAN
Applications of GANs
→ GANs for Image Editing
→ Using GANs for Security
(SSGAN: Secure Steganography Based on GAN)
→ De-aging Robert De Niro!
(Martin Scorsese spent millions of Netflix's money
to digitally de-age De Niro, Pacino, and Pesci so they could portray these men throughout
different parts of their lives.)
68. Why else is natural language understanding difficult?
non-standard English:
Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥
segmentation issues:
the New York-New Haven Railroad (two candidate segmentations)
idioms:
- dark horse
- get cold feet
- lose face
- throw in the towel
neologisms:
- unfriend
- Retweet
- bromance
But that's what makes it fun!
69. Making progress on this problem…
§ The task is difficult! What tools do we need?
- Knowledge about language
- Knowledge about the world
- A way to combine knowledge sources
§ How we generally do this:
- Probabilistic models built from language data
- P("maison" → "house") high
- P("L'avocat général" → "the general avocado") low
§ Luckily, rough text features can often do half the job.
Dan Jurafsky and James H. Martin
→ Pre-trained models
76. Addressing the commonsense problem
Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li and Tian Gao. Does It Make Sense?
And Why? A Pilot Study for Sense Making and Explanation.
77. Language Technology
mostly solved:
- Spam detection: "Let's go to Agra!" (not spam); "Buy V1AGRA …" (spam)
- Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV
- Named entity recognition (NER): Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC]
making good progress:
- Sentiment analysis: "Best roast chicken in San Francisco!"; "The waiter ignored us for 20 minutes."
- Coreference resolution: "Carter told Mubarak he shouldn't run again."
- Word sense disambiguation (WSD): "I need new batteries for my mouse."
- Parsing: "I can see Alcatraz from the window!"
- Machine translation (MT): "The 13th Shanghai International Film Festival…" / "13 …"
- Information extraction (IE): "You're invited to our dinner party, Friday May 27 at 8:30" → Party, May 27, add
still really hard:
- Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
- Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
- Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
- Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
78. Real Success: IBM's Watson
§ Won Jeopardy on February 16, 2011!
WILLIAM WILKINSON'S
"AN ACCOUNT OF THE PRINCIPALITIES OF
WALLACHIA AND MOLDOVIA"
INSPIRED THIS AUTHOR'S
MOST FAMOUS NOVEL
Bram Stoker
79. Real Success: Watson on Jeopardy
§ https://www.youtube.com/watch?v=WFR3lOm_xhE
80. Ethical Issues in Dialog System Design
§ Machine learning systems replicate biases that occurred in
the training data.
§ Microsoft's Tay chatbot
§ Went live on Twitter in 2016
§ Taken offline 16 hours later
§ In that time it had started posting racial slurs, conspiracy
theories, and personal attacks
§ Learned from user interactions (Neff and Nagy 2016)
The Twitter profile picture of Tay
82. Ethical Issues in Dialog System Design
§ Machine learning systems replicate biases that occurred in
the training data.
§ Dialog datasets
§ Henderson et al. (2017) examined standard datasets (Twitter, Reddit,
movie dialogs)
§ Found examples of hate speech, offensive language, and bias
§ Both in the original training data, and in the output of chatbots trained
on the data.
83. Ethical Issues in Dialog System Design: Privacy
§ Remember this was noticed in the days of Weizenbaum
§ Agents may record sensitive data
§ (e.g. "Computer, turn on the lights [answers the phone 'Hi, yes, my
password is...']"),
§ Which may then be used to train a seq2seq conversational
model.
§ Henderson et al. (2017) showed they could recover such
information by giving a seq2seq model keyphrases (e.g.,
"password is")
84. Ethical Issues in Dialog System Design: Gender
equality
§ Dialog agents are overwhelmingly given female names,
perpetuating the female servant stereotype (Paolino, 2017).
§ Responses from commercial dialog agents when users use
sexually harassing language (Fessler 2017):
Speech and Language Processing (3rd ed. draft)
Dan Jurafsky and James H. Martin
85. Addressing real-world challenges
§ AI Technologies
- Natural Language Processing (NLP)
- Social Media and UGC Analysis
- Computer Vision (CV)
- Machine/Deep Learning (ML-DL)
§ Applications
- Digital Humanities
- Fintech
- Digital Health and Life-science
- Social Science and Psychology
- Security and Cybersecurity