This document provides an overview of statistical natural language processing (NLP). It begins by introducing the speaker, Mona Diab, and her research interests in NLP. It then discusses the growing volume of digital data being produced and the potential for machines to process and understand human language. Language, however, is complex and ambiguous, and good NLP solutions require both linguistic and machine-learning knowledge. The document outlines the goals and challenges of NLP, including resolving ambiguity, and gives examples of NLP applications and techniques, such as probabilistic models built from language data.
2. Who am I?
• Prof in CS department working on issues
of big data, data science, natural
language processing
• mtdiab@gwu.edu
• Check out my research @
– www.seas.gwu.edu/~mtdiab
• NLP lab @gw
– Care4lang1.seas.gwu.edu
3. “Every 2 days we produce as much
information as we did from the beginning of
time till 2003”
“Big Data refers to our ability to make use
of the ever-increasing volumes of data.”
“…everything we do is increasingly leaving a
digital trace (or data), which we (and others)
can use and analyze.”
Bernard Marr
4. The Dream
• It’d be great if machines could
• Process our email (usefully)
• Translate languages accurately
• Help us manage, summarize, and
aggregate information
• Use speech as a UI (when
needed)
• Talk to us / listen to us
• But they can’t:
• Language is complex, ambiguous,
flexible, and subtle
• Good solutions need linguistics
and machine learning
knowledge
Slide courtesy of Heng Ji
5. Heterogeneous Big Data
Lockheed Martin: 3,000 workers to furlough amid #USGovernmentShutdown

The Patient Protection and Affordable Care Act (PPACA),[1] commonly called the Affordable Care Act (ACA) or Obamacare, is a United States federal statute signed into law by President Barack Obama on March 23, 2010.

The U.S. Congress, still in partisan deadlock over Republican efforts to halt President Barack Obama's healthcare reforms, was on the verge of shutting down most of the U.S. government starting on Tuesday morning.

NSF and NIST are temporarily closed because the Government entered a period of partial shutdown.

President Obama's 70-minute White House meeting late Wednesday afternoon with congressional leaders, including House Speaker John Boehner, did nothing to help end the impasse.
6. Mystery
• What’s now impossible for computers (and any other
species) to do is effortless for humans
8. What is NLP?
• Fundamental goal: deep understanding of broad language use
• not just string processing or keyword matching!
9. What is NLP/CL?
• NLP: Natural Language Processing
– Is the field of making computers process natural language
• Does process entail understand?
• CL: Computational Linguistics
– Is the field of using computers to understand (natural)
language
• Natural Language?
– Refers to the language spoken by people, e.g. English,
Japanese, Swahili, as opposed to artificial languages, like C++,
Java, etc.
10. What is NLP?
• Computers using and processing natural language input (data)
and producing useful information, could be natural language
output/or structured data
• Software that can recognize, analyze and generate text and
speech
• Typically NLP refers to processing unstructured data – text in
free form (unstructured text)
• In contrast, structured data refers to information in “tables”
– Typically allows numerical range and exact match (for text)
queries, e.g., Salary < 60000 AND Manager = Smith, should
return Turner, Ian
Employee      Manager         Salary
Smith, John   David, Richard  $80,000
Turner, Ian   Smith, John     $59,000
Huang, Chang  Smith, John     $69,000
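As a toy illustration of the structured query above, the employee table and the Salary/Manager query can be sketched in Python (the field names here are hypothetical, just mirroring the table):

```python
# Toy sketch (not from the slides): the structured query
# "Salary < 60000 AND Manager = Smith" over the employee table.
employees = [
    {"employee": "Smith, John",  "manager": "David, Richard", "salary": 80000},
    {"employee": "Turner, Ian",  "manager": "Smith, John",    "salary": 59000},
    {"employee": "Huang, Chang", "manager": "Smith, John",    "salary": 69000},
]

def query(rows, max_salary, manager):
    """Numeric-range + exact-match query, the kind structured data supports."""
    return [r["employee"] for r in rows
            if r["salary"] < max_salary and r["manager"] == manager]

print(query(employees, 60000, "Smith, John"))  # ['Turner, Ian']
```

Unstructured free text supports no such exact-match query, which is what makes NLP necessary.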
11. Unstructured (text) vs. structured
(database) data in 1996
[Bar chart comparing data volume and market cap for unstructured vs. structured data]
12. Unstructured (text) vs. structured
(database) data
[Bar chart comparing data volume and market cap for unstructured vs. structured data]
13. Goals of NLP/CL
• Model Human Language Processing
• Analyze Human Language
• Facilitate Human Language Communication
via Automated Tools
15. Computers Lack Knowledge!
• Computers “see” text in English/Arabic/French
the same way you saw the previous slide!
• People have no trouble understanding language
– Common sense knowledge
– Reasoning capacity
– Experience
• However, Computers have
– No common sense knowledge
– No reasoning capacity
Unless we teach them!
16. Why Should You Care?
• An enormous amount of knowledge is now
available in machine readable form as
natural language text
• Conversational agents are becoming an
important form of human-computer
communication
• Much of human-human communication is
now mediated by computers
• Very cool stuff! And with lots of commercial
interest.
Adapted from Speech and Language Processing - Jurafsky and Martin
17. Why NLP?
• Applications for
processing large
amounts of texts
(BIG DATA)
require NLP
expertise
• Classify text into categories
• Index and search large texts
• Automatic machine translation
• Speech understanding
– Understand phone conversations
• Information extraction
– Extract useful information from
resumes
• Automatic summarization
– Condense 1 book into 1 page
• Question answering
• Knowledge acquisition
• Text generation / dialogs
19. Why is NLP intriguing?
• NLP has an AI aspect to it
– We’re often dealing with ill-defined problems
– We don’t often come up with exact solutions/
algorithms
– We can’t let either of those facts get in the
way of making progress
20. NLP in CS taxonomy
Computers
• Artificial Intelligence
– Robotics
– Search
– Natural Language Processing
• Information Retrieval
• Machine Translation
• Language Analysis (Semantics, Parsing)
• Algorithms
• Databases
• Networking
21. The Challenge
• Language is complex with infinite
possible constructions
• Good news is that there are patterns as
the symbol set is finite, but the
patterns are latent
• Abundance of raw data
22. Why is NLP hard? Some Headlines…
• Police Begin Campaign To Run Down Jaywalkers
• Iraqi Head Seeks Arms
• Enraged Cow Injures Farmer With Ax
• Teacher Strikes Idle Kids
• Squad Helps Dog Bite Victim
• Red Tape Holds Up New Bridges
• Hospitals Are Sued by 7 Foot Doctors
• Court to Try Shooting Defendant
• Local High School Dropouts Cut in Half
23. How can a machine understand
these differences?
• Get the cat with the gloves.
24. Ambiguous Spoken Example
I made her duck
• I cooked waterfowl for her
• I cooked the waterfowl that belongs to
her
• I created the ceramic duck she owns
• I caused her to quickly lower her head
• And more….
25. Example … continued!
I made her duck
• Speech recognition: “I” vs. “Eye”, “made” vs. “maid”
• Word Sense Disambiguation: “made” = cook vs. create
• Part of Speech Tagging: “duck” = verb vs. noun
• Syntactic parsing
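The fan-out of readings can be sketched with a toy, hand-built lexicon (all entries hypothetical, just to show how per-word alternatives multiply):

```python
# Toy sketch: each ambiguous word multiplies the number of candidate
# analyses of "I made her duck". The lexicon below is hand-made.
from itertools import product

lexicon = {
    "made": ["cook", "create", "cause"],                       # word senses
    "her":  ["dative (for her)", "possessive (belonging to her)"],
    "duck": ["noun (waterfowl)", "verb (lower the head)"],
}

def candidate_readings(words):
    """Cartesian product of the per-word alternatives."""
    return list(product(*(lexicon[w] for w in words)))

readings = candidate_readings(["made", "her", "duck"])
print(len(readings))  # 3 * 2 * 2 = 12 candidate analyses
```

Not all 12 combinations are coherent, which is exactly the disambiguation problem the later slides address.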
26. Linguistics
• The scientific study of human
language
• How the mind comes up with language
27. Levels of Language Description
• 6 basic levels (more or less explicitly present in most theories):
– and beyond (pragmatics/logic/...)
– meaning (semantics)
– (surface) syntax
– morphology
– phonology
– phonetics/orthography
• Each level has an input and output representation
– output from one level is the input to the next (upper)
level
– sometimes levels might be skipped (merged) or split
28. The Steps in NLP
Discourse
Pragmatics
Semantics
Syntax
Morphology
**We can go up and down, and
combine steps too!!
**Every step is equally complex
29. The View: Ambiguity
• All 6 levels of linguistic knowledge require
resolving ambiguity
• Ambiguity results from the existence of
multiple possibilities for each level
30. Ambiguity
• Computational linguists are obsessed with ambiguity
• Ambiguity is a fundamental problem of computational
linguistics
• Resolving ambiguity is a crucial goal
38. Making progress on this problem…
• The task is difficult! What tools do we
need?
– Knowledge about language
– Knowledge about the world
– A way to combine knowledge sources
• How we generally do this:
– probabilistic models built from language data
• P(“maison” → “house”) high
• P(“L’avocat général” → “the general avocado”) low
– Luckily, rough text features can often do half
the job.
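A rough sketch of such a probabilistic preference, with made-up probabilities standing in for ones estimated from language data:

```python
# Hypothetical translation probabilities, in the spirit of the slide:
# a model trained on data should score "lawyer" far above "avocado"
# for French "avocat" in "L'avocat général".
p_translation = {
    ("maison", "house"):   0.92,
    ("maison", "home"):    0.07,
    ("avocat", "lawyer"):  0.80,
    ("avocat", "avocado"): 0.20,
}

def score(pairs):
    """Product of per-word translation probabilities (a crude model)."""
    p = 1.0
    for src, tgt in pairs:
        p *= p_translation.get((src, tgt), 1e-6)  # tiny floor for unseen pairs
    return p

print(score([("avocat", "lawyer")]) > score([("avocat", "avocado")]))  # True
```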
39. CL Toolkit
• Knowledge of Linguistics, i.e. what NLPers call features!!
• State Machines
– Finite state automata, transducers
• Formal Rule Systems
– Regular Grammars, Context Free Grammars
• Logic
– First order logic, predicate calculus
• Probability Theory
– Associating probabilities with the previous machinery
• Machine Learning Tools
– Learning automatically from representations; plays a very important role in cases where
we don’t have good explanations of why things happen the way they do
• Performance Metrics
– Well defined evaluation metrics for different tasks
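As a minimal sketch of the state-machine entry in the toolkit, here is a finite state automaton for the "sheep talk" language (b a a+ !) from the Jurafsky and Martin textbook cited earlier; the dictionary encoding is just one convenient way to write it:

```python
# A deterministic finite state automaton for /baa+!/ ("sheep talk").
TRANSITIONS = {  # (state, symbol) -> next state
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any further a's
    (3, "!"): 4,
}
ACCEPT = {4}

def accepts(s):
    """Run the automaton; reject on any missing transition."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPT

print(accepts("baaa!"), accepts("ba!"))  # True False
```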
41. Models and Algorithms
• By models we mean the formalisms that
are used to capture the various kinds of
linguistic knowledge we need.
• Algorithms are then used to manipulate
the knowledge representations needed
to tackle the task at hand.
42. Models
• Finite state machines
• Linguistic Rules
• Markov models
• Alignment
• Vector space model of word and
document meaning
• Logical formalisms
• Network models
43. Algorithms
• Rule-based
– Symbolic Parsers and morphological
analyzers
– Finite state automata
• Probabilistic/statistical
– Learned from observation of (labeled) data
– Predicting new data based on old
– Machine learning
44. Algorithms
• Many of the algorithms that we’ll study will turn out to
be transducers; algorithms that take one kind of
structure as input and output another
• Unfortunately, ambiguity makes this process difficult
• This leads us to employ algorithms that are designed to
handle ambiguity of various kinds
• State-space search paradigm: To manage the problem
of making choices during processing when we lack
the information needed to make the right choice
45. Machine Learning
Machine learning based classifiers that are
trained to make decisions based on (implicitly
or explicitly modeled) features from context
Simple Classifiers:
Naïve Bayes
Logistic Regression (MaxEnt)
Decision Trees
Neural Networks
Sequence Models:
Hidden Markov Models
Maximum Entropy Markov Models
Conditional Random Fields
Recurrent Neural Networks (RNNs, LSTMs)
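A minimal sketch of the first classifier on the list, Naïve Bayes, trained on a tiny made-up sentiment dataset with add-one smoothing (the data and labels are invented for illustration):

```python
# Toy Naive Bayes text classifier with add-one (Laplace) smoothing.
from collections import Counter
import math

train = [
    ("the movie rocked loved it", "pos"),
    ("simply wonderful deep film", "pos"),
    ("too robotic gets on my nerves", "neg"),
    ("boring flat and robotic", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Return the class with the highest log posterior."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        logp = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)

print(predict("loved the movie"))     # pos
print(predict("that robotic actor"))  # neg
```

The same count-and-smooth recipe underlies many of the sequence models above, with transition probabilities added.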
46. Approaching the challenge
• Divide & Conquer
– Break the problem into smaller problems
• Throw state of the art techniques at
the smaller problems
• Keep your fingers crossed!!
47. NLP Categories
• Applications
• Word counters (wc in UNIX)
• Spell Checkers, grammar checkers
• Predictive Text on mobile handsets
• Machine Translation (MT)
• Information Retrieval (IR)
• Automatic Speech Recognition (ASR)
• Optical Character Recognition (OCR)
• Automatic Summarization, Speech Synthesis, etc.
• Enabling Technologies
– Tokenization
– Part-of-Speech Tagging
– Syntactic Parsing
– Lemmatization
– Word Sense Disambiguation, etc.
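The first enabling technology, tokenization, can be roughly sketched with a single regular expression (real tokenizers handle many more edge cases, e.g. contractions, abbreviations, URLs):

```python
# Crude tokenizer: word-like runs, or single non-space punctuation marks.
import re

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Computers can't (yet) understand us."))
```

Note how even this tiny example makes a debatable choice: "can't" is split into three tokens, and a different tokenizer might keep it whole.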
48. • Alan Turing was a pioneering British
computer scientist, mathematician,
logician, and cryptanalyst. He is widely
considered the Father of Computer
Science.
• The movie The Imitation Game is about him.
• The Turing test is a test of a machine's ability to exhibit
intelligent behavior equivalent to, or indistinguishable from, that
of a human. Turing proposed that a human evaluator would judge
natural language conversations between a human and a machine
that is designed to generate human-like responses.
Turing Test
Courtesy of Nizar Habash
49. Current Real-World Applications
• Search: very large corpora, e.g. Google
• Information Extraction: relevant information to a task
• Sentiment analysis: restaurant or movie reviews
• Summarizing very large amounts of text or speech: e.g.
your email, the news, voicemail
• Translating between one language and another: e.g.
Google Translate, Babelfish
• Dialogue systems: e.g. chatbots, Amtrak’s ‘Julie’
• Question answering: e.g. IBM’s Watson Jeopardy!,
DARPA who/what/where…, Ask Jeeves
• Even more: speech processing, common sense
knowledge, text categorization, web monitoring, etc.
52. Machine Translation
• Basic types of Machine Translation
– Text to Text Machine Translations
– Speech to Speech Machine Translations
• To date, majority of approaches have
targeted rich language pairs (with lots of
automated resources) – No Swahili-German
systems
• Current approaches are statistical,
learning from existing translations (parallel
data collections)
• Reasonable performance due to significant
funding
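The "learning from existing translations" idea can be sketched as relative-frequency estimation over word-aligned pairs (the aligned data below is invented for illustration):

```python
# Toy sketch: estimate word translation probabilities by counting
# how often each (source, target) pair occurs in aligned parallel text.
from collections import Counter

aligned_pairs = [  # hypothetical (French word, English word) alignments
    ("maison", "house"), ("maison", "house"), ("maison", "home"),
    ("chat", "cat"), ("chat", "cat"),
]

counts = Counter(aligned_pairs)
src_totals = Counter(src for src, _ in aligned_pairs)

def p_translate(src, tgt):
    """Relative frequency: count(src, tgt) / count(src)."""
    return counts[(src, tgt)] / src_totals[src]

print(p_translate("maison", "house"))  # 2/3
```

Real statistical MT systems learn the alignments themselves (they are not given), but the estimation step is this same counting idea at scale.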
56. Blog Analytics
• Data-mining of blogs, discussion forums,
message boards, user groups, and other
forms of user generated media
– Product marketing information
– Political opinion tracking
– Social network analysis
– Buzz analysis (what’s hot, what topics are
people talking about right now).
57. Livejournal.com:
I, me, my on or after Sep 11, 2001
[Line graph of first-person-singular pronoun use over time, from Pennebaker’s slides]
Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693.
58. September 11 LiveJournal.com study:
We, us, our
[Line graph of first-person-plural pronoun use over time, from Pennebaker’s slides]
Cohn, Mehl, Pennebaker. 2004. Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science 15, 10: 687-693.
59. Sentiment Analysis
• Movie Review Mining
– User1: The Matrix rocked, I simply loved it….
– User2: Really, that Keanu Reeves gets on my nerves,
he is too robotic
– User1: it was way deep, it obviously went over your
head!
– User2: I think it GOT INTO ur head :)
• What do you think User1 and User2’s
sentiments are toward the movie?
– User1
– User2
• What do you think the sentiment of User2
toward User1 is?
61. What about positive restaurant reviews?
Sex, Drugs, and Dessert
• sexy food
• seductively seared foie
gras
— addicted to pepper shooters
— garlic noodles… my drug of
choice
— the fries are like crack
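A naïve lexicon-based scorer (the word lists here are made up) shows exactly why these reviews are hard: out of context, "drug" and "crack" look negative even when the reviewer means praise.

```python
# Toy lexicon-based sentiment scorer: count positive minus negative words.
POSITIVE = {"loved", "rocked", "wonderful", "sexy"}
NEGATIVE = {"robotic", "nerves", "drug", "crack", "flimsy"}

def polarity(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(polarity("the matrix rocked i simply loved it"))  # 2 -> positive
print(polarity("the fries are like crack"))             # -1, though the reviewer means praise
```

Context-aware models are needed to recover the intended positive reading.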
62. Let’s Zoom in
• Beyond Sentiment
– Perspective Identification
• Why would you need that?
– Polling
– Detection of bias
• How would you go about doing it?
– Sentiment analysis
– Opinion mining
– Cognitive state (belief)
63. Dialogue Systems
• User: Can you book me a flight from New York
City to Washington?
• System: Is your destination Washington
state?
• User: No I mean Washington DC
• System: When would you like to fly?
• Example: Amtrak’s Julie system
64. Tell me what's bothering you.
http://www.masswerk.at/elizabot/eliza.html
65. Eliza
• Eliza is a mock Rogerian psychotherapist
– Rogerian psychotherapy, or person-centered therapy, is a
form of talk psychotherapy developed by psychologist Carl
Rogers in the 1940s & 1950s.
• The original program was described by
Joseph Weizenbaum in 1966.
• Primitive natural language processing using simple
scripts and regular expressions
• Chatterbots or chatbots
• Eliza effect
– the tendency to unconsciously assume computer behaviors
are analogous to human behaviors.
• Turing test
66. • Men are all alike.
• IN WHAT WAY?
• They're always bugging us about something or other.
• CAN YOU THINK OF A SPECIFIC EXAMPLE?
• Well, my boyfriend made me come here.
• YOUR BOYFRIEND MADE YOU COME HERE
• He says I'm depressed much of the time.
• I AM SORRY TO HEAR YOU ARE DEPRESSED
• It's true. I am unhappy
• DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
• I need some help, that much seems certain.
• WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
• Perhaps I could learn to get along with my mother.
• TELL ME MORE ABOUT YOUR FAMILY
• My mother takes care of me.
• WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
• My father.
• YOUR FATHER
• You are like my father in some ways.
Original Chatting Program
Eliza
67. Eliza-style regular expressions
Step 1: replace first person with second person references
s/\bI('m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use additional regular expressions to generate replies
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 3: use scores to rank possible transformations
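The sed-style rules above can be re-expressed as runnable Python (a small sketch; the real Eliza script has many more rules plus the ranking step):

```python
# Minimal Eliza-style responder: reflect pronouns, then apply the
# first matching transformation rule.
import re

REFLECT = [(r"\bI('m| am)\b", "YOU ARE"), (r"\bmy\b", "YOUR"), (r"\bmine\b", "YOURS")]
RULES = [
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza(utterance):
    # Step 1: swap first-person references to second person.
    for pat, rep in REFLECT:
        utterance = re.sub(pat, rep, utterance, flags=re.IGNORECASE)
    # Step 2: apply the first transformation rule that matches.
    for pat, rep in RULES:
        m = re.match(pat, utterance, flags=re.IGNORECASE)
        if m:
            return m.expand(rep)  # fills in \1 if the rule captured a group
    return "TELL ME MORE"

print(eliza("I am depressed"))      # I AM SORRY TO HEAR YOU ARE depressed
print(eliza("Men are all alike."))  # IN WHAT WAY
```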
68. • Let’s chat with Mitsuku!
• http://www.mitsuku.com
• Loebner prize winner 2013,
runner up 2015
– Modern form of the Turing test
for Artificial Intelligence
Mitsuku
Slide courtesy of Nizar Habash
70. Question Answering: IBM’s
Watson
• Won Jeopardy on February 16, 2011!
70
WILLIAM WILKINSON’S
“AN ACCOUNT OF THE PRINCIPALITIES OF
WALLACHIA AND MOLDAVIA”
INSPIRED THIS AUTHOR’S
MOST FAMOUS NOVEL
Bram Stoker
76. Information Retrieval
• Very successful enterprise: Google, Bing,
Yahoo, Altavista
• General model: given a huge collection of texts
(document collection), given a query
– Task: find specific documents that are relevant to
the given query
– How: Create an index, like the index in a book to
look up the information, predominant approaches
include vector space models
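The vector space approach can be sketched with raw term-count vectors and cosine similarity (no tf-idf weighting or inverted index here; the documents are invented):

```python
# Toy vector space retrieval: bag-of-words vectors ranked by cosine similarity.
from collections import Counter
import math

docs = {
    "d1": "government shutdown halts NSF and NIST",
    "d2": "the camera is small and light",
    "d3": "congress in deadlock over the shutdown",
}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query):
    """Return the name of the most similar document."""
    q = Counter(query.lower().split())
    vecs = {name: Counter(text.lower().split()) for name, text in docs.items()}
    return max(vecs, key=lambda name: cosine(q, vecs[name]))

print(search("government shutdown"))  # d1
```

A real engine adds tf-idf weighting, so that frequent words like "the" stop dominating the scores, and an inverted index so the comparison scales to billions of documents.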
77. Information Extraction
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan,
we’ve now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris

Create new Calendar entry
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
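A rough sketch of pulling those calendar fields out of the email text with regular expressions (real IE systems use learned models, not just hand-written patterns):

```python
# Toy information extraction: grab the room and time span from the email.
import re

email = ("we've now scheduled the curriculum meeting. "
         "It will be in Gates 159 tomorrow from 10:00-11:30.")

where = re.search(r"in ([A-Z]\w* \d+)", email).group(1)
start, end = re.search(r"from (\d{1,2}:\d{2})-(\d{1,2}:\d{2})", email).groups()
print(where, start, end)  # Gates 159 10:00 11:30
```

Resolving "tomorrow" to Jan-16-2012 requires the email's date header plus temporal reasoning, which is exactly the kind of context a pattern alone cannot supply.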
78. Information Extraction
• nice and compact to carry!
• since the camera is small and light, I won't
need to carry around those heavy, bulky
professional cameras either!
• the camera feels flimsy, is plastic and very
light in weight; you have to be very delicate
in the handling of this camera
[Diagram mapping the reviews to the “size and weight” attribute, out of:
zoom, affordability, size and weight, flash, ease of use]
81. Reminder of who I am :)
• Prof in CS department working on issues
of big data, data science, natural
language processing
• mtdiab@gwu.edu
• Check out my research @
– www.seas.gwu.edu/~mtdiab
• NLP lab @gw
– Care4lang1.seas.gwu.edu