An Introduction to
Machine Learning
and
Natural Language Processing
Tools
Vivek Srikumar, Mark Sammons
(Some slides from Nick Rizzolo)
The Famous People Classifier
f( ) = Politician
f( ) = Athlete
f( ) = Corporate Mogul
Outline
 An Overview of NLP Resources
 Our NLP Application: The Fame classifier
 The Curator
 Edison
 Learning Based Java
 Putting everything together
An overview of NLP resources
What can NLP do for me?
NLP resources (An incomplete list)
 Cognitive Computation Group resources
 Tokenization/Sentence Splitting
 Part Of Speech
 Chunking
 Named Entity Recognition
 Coreference
 Semantic Role Labeling
 Others
 Stanford parser and dependencies
 Charniak Parser
Tokenization and Sentence Segmentation
 Given a document, find the sentence and token
boundaries
The police chased Mr. Smith of Pink Forest,
Fla. all the way to Bethesda, where he lived.
Smith had escaped after a shoot-out at his
workplace, Machinery Inc.
 Why?
 Word counts may be important features
 Words may themselves be the object you want to classify
 “lived.” and “lived” should give the same information
 Different analyses need to align if you want to leverage
multiple annotators from different sources/tasks
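A minimal sketch of why tokenization matters for word counts, in plain Python. The regular expression is a deliberately naive assumption and not the Illinois tokenizer; it splits trailing punctuation off words, so “lived.” and “lived” yield the same token, and it records character offsets so that different annotations can be aligned.

```python
import re
from collections import Counter

# Naive illustrative pattern (an assumption, not a real tokenizer):
# a word, optionally with internal hyphens or periods ("shoot-out"),
# or a single punctuation character.
TOKEN = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")

def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]

def word_counts(text):
    """Count lower-cased word tokens, ignoring punctuation tokens."""
    return Counter(tok.lower() for tok, _, _ in tokenize(text) if tok[0].isalnum())

print(tokenize("he lived."))
print(word_counts("Smith lived in Bethesda. He lived.")["lived"])
```

This sketch treats every period as punctuation, so abbreviations such as “Mr.” or “Fla.” are split as well; handling those cases is part of what a real sentence splitter has to do.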
Part of Speech (POS)
 Allows simple abstraction for pattern detection
 Disambiguate a target, e.g.
“make (a cake)” vs. “make (of car)”
 Specify more abstract patterns,
e.g. Noun Phrase: ( DT JJ* NN )
 Specify context in abstract way
 e.g. “DT boy VBX” for “actions boys do”
 This expression will catch “a boy cried”, “some boy ran”, …
POS   DT   NN   VBD    PP  DT   JJ       NN
Word  The  boy  stood  on  the  burning  deck

POS   DT   NN   VBD    PP  DT   JJ       NN
Word  A    boy  rode   on  a    red      bicycle
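The “DT boy VBX” pattern can be sketched as a scan over (word, POS) pairs. This is an illustrative plain-Python sketch; the function name `actions_boys_do` is hypothetical, and “VBX” is read here as “any tag starting with VB”.

```python
# Tagged sentences copied from the slide, as (word, POS) pairs.
SENTENCES = [
    [("The", "DT"), ("boy", "NN"), ("stood", "VBD"), ("on", "PP"),
     ("the", "DT"), ("burning", "JJ"), ("deck", "NN")],
    [("A", "DT"), ("boy", "NN"), ("rode", "VBD"), ("on", "PP"),
     ("a", "DT"), ("red", "JJ"), ("bicycle", "NN")],
]

def actions_boys_do(tagged):
    """Return verbs matched by the pattern: DT, the word "boy", VB*."""
    verbs = []
    # Slide a window of three consecutive (word, POS) pairs over the sentence.
    for (_, pos1), (word2, _), (word3, pos3) in zip(tagged, tagged[1:], tagged[2:]):
        if pos1 == "DT" and word2.lower() == "boy" and pos3.startswith("VB"):
            verbs.append(word3)
    return verbs

for sentence in SENTENCES:
    print(actions_boys_do(sentence))
```

Because the pattern is stated over tags rather than words, the same three-element window matches both “The boy stood” and “A boy rode”.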
Chunking
 Identifies phrase-level constituents in sentences
[NP Boris] [ADVP regretfully] [VP told] [NP his wife] [SBAR that]
[NP their child] [VP could not attend] [NP night school]
[PP without] [NP permission] .
 Useful for filtering: identify e.g. only noun phrases, or
only verb phrases
 Groups modifiers with heads
 Useful for e.g. Mention Detection
 Used as source of features, e.g. distance (abstracts
away determiners, adjectives, for example), sequence,…
 More efficient to compute than full syntactic parse
 Applications in e.g. Information Extraction – getting (simple)
information about concepts of interest from text documents
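Filtering by chunk type can be sketched by reading the bracketed notation used in the example above. This is a plain-Python illustration, assuming chunks are written as `[LABEL words]`; the helper names are hypothetical and no real chunker is called.

```python
import re

# Matches one bracketed chunk such as "[NP night school]".
CHUNK = re.compile(r"\[(\w+)\s+([^\]]+)\]")

def parse_chunks(bracketed):
    """Turn bracketed chunker output into (label, text) pairs."""
    return [(label, text.strip()) for label, text in CHUNK.findall(bracketed)]

def chunks_with_label(chunks, label):
    """Keep only the text of chunks carrying the given label."""
    return [text for chunk_label, text in chunks if chunk_label == label]

example = ("[NP Boris] [ADVP regretfully] [VP told] [NP his wife] "
           "[SBAR that] [NP their child] [VP could not attend] "
           "[NP night school] [PP without] [NP permission] .")

chunks = parse_chunks(example)
print(chunks_with_label(chunks, "NP"))
print(chunks_with_label(chunks, "VP"))
```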
Named Entity Recognition
 Identifies and classifies strings of characters
representing proper nouns
[PER Neil A. Armstrong] , the 38-year-old civilian
commander, radioed to earth and the mission control
room here: “[LOC Houston] , [ORG Tranquility]
Base here; the Eagle has landed.”
 Useful for filtering documents
 “I need to find news articles about organizations in which
Bill Gates might be involved…”
 Disambiguate tokens: “Chicago” (team) vs. “Chicago” (city)
 Source of abstract features
 E.g. “Verbs that appear with entities that are Organizations”
 E.g. “Documents that have a high proportion of
Organizations”
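An abstract feature such as “documents that have a high proportion of Organizations” could be computed roughly as follows. This is a plain-Python sketch: the mentions are written by hand in place of real recognizer output, and the 0.5 threshold is an arbitrary assumption.

```python
from collections import Counter

def entity_type_proportions(mentions):
    """Map each entity type to its share of all mentions in a document."""
    counts = Counter(label for _, label in mentions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

def is_organization_heavy(mentions, threshold=0.5):
    """True when at least `threshold` of the mentions are organizations."""
    return entity_type_proportions(mentions).get("ORG", 0.0) >= threshold

# Mentions from the slide's example, as (text, type) pairs.
doc = [("Neil A. Armstrong", "PER"), ("Houston", "LOC"), ("Tranquility", "ORG")]

print(entity_type_proportions(doc))
print(is_organization_heavy(doc))
```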
Coreference
 Identify all phrases that refer to each entity of interest –
i.e., group mentions of concepts
[Neil A. Armstrong] , [the 38-year-old civilian
commander], radioed to [earth]. [He] said the
famous words, “[the Eagle] has landed.”
 The Named Entity recognizer only gets us part-way…
 …if we ask, “what actions did Neil Armstrong perform?”,
we will miss many instances (e.g. “He said…”)
 Coreference resolver abstracts over different ways of
referring to the same person
 Useful in feature extraction, information extraction
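The “what actions did Neil Armstrong perform?” question can be sketched with mention clusters. In this plain-Python illustration both the clusters and the (subject, verb) pairs are written by hand; a coreference resolver and a parser would normally supply them.

```python
# Mentions from the slide's example, grouped by the entity they refer to.
CLUSTERS = [
    ["Neil A. Armstrong", "the 38-year-old civilian commander", "He"],
    ["earth"],
    ["the Eagle"],
]

# Hand-written (subject mention, verb) pairs for the same text.
ACTIONS = [
    ("Neil A. Armstrong", "radioed"),
    ("He", "said"),
    ("the Eagle", "has landed"),
]

def cluster_of(mention, clusters):
    """Return the cluster containing `mention`, or a singleton if none does."""
    for cluster in clusters:
        if mention in cluster:
            return cluster
    return [mention]

def actions_of(entity, actions, clusters):
    """Collect verbs whose subject corefers with `entity`."""
    same_entity = set(cluster_of(entity, clusters))
    return [verb for subject, verb in actions if subject in same_entity]

print(actions_of("Neil A. Armstrong", ACTIONS, CLUSTERS))  # with coreference
print(actions_of("Neil A. Armstrong", ACTIONS, []))        # names only
```

Without the clusters, only the action attached to the literal name is found, which is the gap the slide describes.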
Parsers
 Identify the grammatical structure of a sentence
[Figure: full parse and dependency parse of “John hit the ball”; the dependency edges are labeled subject, object and modifier]
Parsers reveal the grammatical relationships between words and phrases
Semantic Role Labeler
 SRL reveals relations and arguments in the sentence
(where relations are expressed as verbs)
 Cannot abstract over variability of expressing the
relations – e.g. kill vs. murder vs. slay…
The fame classifier
Enough NLP. Let’s make our $$$ with
The Famous People Classifier
f( ) = Politician
f( ) = Athlete
f( ) = Corporate Mogul
The NLP version of the fame classifier
Each person is represented by:
All sentences in the news in which the string Barack Obama occurs
All sentences in the news in which the string Roger Federer occurs
All sentences in the news in which the string Bill Gates occurs
Our goal
 Find famous athletes, corporate moguls and politicians
Athlete
• Michael Schumacher
• Michael Jordan
• …
Politician
• Bill Clinton
• George W. Bush
• …
Corporate Mogul
• Warren Buffett
• Larry Ellison
• …
Let’s brainstorm
 What NLP resources could we use for this task?
Remember, we start off with just raw text from a news website
One solution
 Let us label entities using features defined on mentions
 Identify mentions using the named entity recognizer
 Define features based on the words, parts of speech and
dependency trees
 Train a classifier
All sentences in the news in which the
string Barack Obama occurs
Where to get it: Machine Learning
[Diagram: Data, Feature Functions and a Learning Algorithm, producing the labels “politics”, “sports” and “business”]
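The pieces named on this slide (data, feature functions, a learning algorithm, output labels) can be sketched end to end in plain Python. The deck does not say which learning algorithm is used; a multiclass perceptron is swapped in here purely for illustration, and the three training sentences are hypothetical toy data.

```python
from collections import Counter, defaultdict

def features(text):
    """Feature function: lower-cased bag of words."""
    return Counter(text.lower().split())

def score(weights, label, feats):
    """Dot product of one label's weight vector with the features."""
    return sum(weights[label][f] * v for f, v in feats.items())

def predict(weights, labels, feats):
    # max() keeps the first label among ties, so list order breaks ties.
    return max(labels, key=lambda label: score(weights, label, feats))

def train(data, labels, epochs=5):
    """Multiclass perceptron: on a mistake, promote gold and demote the guess."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in data:
            feats = features(text)
            guess = predict(weights, labels, feats)
            if guess != gold:
                for f, v in feats.items():
                    weights[gold][f] += v
                    weights[guess][f] -= v
    return weights

LABELS = ["politics", "sports", "business"]
# Hypothetical toy training data, invented for this sketch.
DATA = [
    ("the senator won the election", "politics"),
    ("the striker won the match", "sports"),
    ("the company bought a rival firm", "business"),
]

weights = train(DATA, LABELS)
print(predict(weights, LABELS, features("the senator lost the election")))
```

The feature function and the learner are separate functions on purpose: either one can be replaced without touching the other.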
A second look at the solution
 Identify mentions using the named entity recognizer
 Define features based on the words, parts of speech and
dependency trees
 Train a classifier
University of Illinois: Sentence and Word Splitter, Part-of-speech Tagger, Named Entity Recognizer
Stanford University: Dependency Parser (and the NLP pipeline)
These tools can be downloaded from the websites.
Are we done? If not, what’s missing?
We need to put the pieces together
The infrastructure
The Curator
• A common interface for different NLP annotators
• Caches their results
Edison
• Library for NLP representation in Java
• Helps with extracting complex features
Learning Based Java
• A Java library for machine learning
• Provides a simple language to define classifiers and
perform inference with them
The infrastructure
 Each infrastructure module has specific interfaces that
the user is expected to use
 The Curator specifies the interface for accessing
annotations from the NLP tools
 Edison fixes the representation for the NLP annotation
 Learning Based Java requires training data to be
presented to it using an interface called Parser
Curator
A place where NLP annotations live
Big NLP
 NLP tools are quite sophisticated
 The more complex, the bigger the memory requirement
 NER: 1G; Coref: 1G; SRL: 4G ….
 If you use tools from different sources, they may be…
 In different languages
 Using different data structures
 If you run a lot of experiments on a single corpus, it
would be nice to cache the results
 …and for your colleagues, nice if they can access that
cache.
 Curator is our solution to these problems.
Curator
[Diagram: the Curator in front of the NER, SRL, POS and Chunker annotators, together with a Cache]
What does the Curator give you?
 Supports distributed NLP resources
 Central point of contact
 Single set of interfaces
 Code generation in many programming languages (using
Thrift)
 Programmatic interface
 Defines set of common data structures used for interaction
 Caches processed data
 Enables highly configurable NLP pipeline
Overhead:
 Annotation is all at the level of character offsets:
Normalization/mapping to token level required
 Need to wrap tools to provide requisite data structures
Getting Started With the Curator
http://cogcomp.cs.illinois.edu/curator
 Installation:
 Download the curator package and uncompress the archive
 Run bootstrap.sh
 The default installation comes with the following
annotators (Illinois, unless mentioned):
 Sentence splitter and tokenizer
 POS tagger
 Shallow Parser
 Named Entity Recognizer
 Coreference resolution system
 Stanford parser
Basic Concept
 Different NLP annotations can be defined in terms of a
few simple data structures:
1. Record: A big container to store all annotations of a text
2. Span: A span of text (defined in terms of characters) along
with a label (A single token, or a single POS tag)
3. Labeling: A collection of Spans (POS tags for the text)
4. Trees and Forests (Parse trees)
5. Clustering: A collection of Labelings (Co-reference)
Go here for more information:
http://cogcomp.cs.illinois.edu/trac/wiki/CuratorDataStructures
Example of a Labeling
The tree fell.
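A Labeling for this sentence might look roughly like the sketch below. These are illustrative Python dataclasses; the field names are assumptions and not the Curator's actual Thrift-generated definitions. Spans are expressed in character offsets, as described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Span:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span's last character
    label: str

@dataclass
class Labeling:
    spans: List[Span]

@dataclass
class Record:
    text: str
    labelings: Dict[str, Labeling] = field(default_factory=dict)

    def covered_text(self, span):
        """Return the substring of the raw text that a span covers."""
        return self.text[span.start:span.end]

record = Record("The tree fell.")
# The key "pos" is a hypothetical name for the part-of-speech Labeling.
record.labelings["pos"] = Labeling([
    Span(0, 3, "DT"), Span(4, 8, "NN"), Span(9, 13, "VBD"), Span(13, 14, "."),
])

for span in record.labelings["pos"].spans:
    print(record.covered_text(span), span.label)
```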
Edison
Representing NLP objects and extracting features
Edison
 An NLP data representation and feature extraction
library
 Helps manage and use different annotations of text
 Doesn’t the Curator do everything we need?
 Curator is a service that abstracts away different annotators
 Edison is a Curator client
 And more…
Representation of NLP annotations
 All NLP annotations are called Views
 A View is just a labeled directed graph
 Nodes are labeled collections of tokens, called
Constituents
 Labeled edges between nodes are called Relations
 All Views related to some text are contained in a
TextAnnotation
Example of Views: Part of speech
A tree fell
 Part of speech view is a degenerate graph
 No edges because there are no relations
 This kind of View is represented by a subclass called
TokenLabelView
 Note that constituents are token based, not character
based
constituents: A/DT, tree/NN, fell/VBD
Example of Views: Shallow Parse
A tree fell
 Shallow parse view is also a degenerate graph
 No edges because there are no relations
 This kind of View is represented by a subclass called
SpanLabelView
constituents: [A tree] Noun Phrase, [fell] Verb Phrase
Example of Views: DependencyTree
A tree fell
 A subclass of View called TreeView
[Figure: dependency tree over “A tree fell”; the tokens are the Constituents, and the edges labeled mod and subj are the Relations]
More about Views
 View represents a generic graph of Constituents and
Relations
 Its subclasses denote specializations suited to specific
structures
 TokenLabelView
 SpanLabelView
 TreeView
 PredicateArgumentView
 CoreferenceView
 Each view allows us to query its constituents
 Useful for defining features!
Features
 Complex features using this library
 Examples
 POS tag for a token
 All POS tags within a span
 All tokens within a span that have a specific POS tag
 All chunks contained within a parse constituent
 All chunks contained in the largest NP that covers a token
 All co-referring mentions to chunks contained in the largest NP
that covers this token
 All incoming dependency edges to a constituent
 Enables quick feature engineering
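A few of the feature queries listed above can be sketched with toy classes. The class names echo the Edison terms used in these slides, but this is plain Python and not the Edison API; the view names and the head-to-dependent direction of the dependency edges are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class Constituent:
    label: str
    start: int  # index of the first token covered
    end: int    # index one past the last token covered

@dataclass(frozen=True)
class Relation:
    label: str
    source: Constituent
    target: Constituent

@dataclass
class View:
    constituents: List[Constituent] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def within(self, start, end):
        """Constituents lying entirely inside the token span [start, end)."""
        return [c for c in self.constituents if start <= c.start and c.end <= end]

    def incoming(self, constituent):
        """Relations whose target is the given constituent."""
        return [r for r in self.relations if r.target == constituent]

@dataclass
class TextAnnotation:
    tokens: List[str]
    views: Dict[str, View] = field(default_factory=dict)

    def text_of(self, constituent):
        return " ".join(self.tokens[constituent.start:constituent.end])

ta = TextAnnotation(["A", "tree", "fell"])
ta.views["POS"] = View([Constituent("DT", 0, 1), Constituent("NN", 1, 2),
                        Constituent("VBD", 2, 3)])
ta.views["SHALLOW_PARSE"] = View([Constituent("NP", 0, 2), Constituent("VP", 2, 3)])

# Dependency view: one constituent per token, edges from head to dependent.
a, tree, fell = (Constituent(tok, i, i + 1) for i, tok in enumerate(ta.tokens))
ta.views["DEPENDENCY"] = View([a, tree, fell],
                              [Relation("subj", fell, tree), Relation("mod", tree, a)])

def pos_tags_within(ta, span):
    """All POS tags within a span."""
    return [c.label for c in ta.views["POS"].within(span.start, span.end)]

def tokens_with_pos_within(ta, span, pos):
    """All tokens within a span that have a specific POS tag."""
    return [ta.text_of(c) for c in ta.views["POS"].within(span.start, span.end)
            if c.label == pos]

noun_phrase = ta.views["SHALLOW_PARSE"].constituents[0]
print(pos_tags_within(ta, noun_phrase))
print(tokens_with_pos_within(ta, noun_phrase, "NN"))
print([r.label for r in ta.views["DEPENDENCY"].incoming(tree)])
```

Because every view indexes the same token positions, a constituent from one view (the noun phrase) can be used directly to query another view (the POS tags).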
Getting started with Edison
http://cogcomp.cs.uiuc.edu/software/edison
 How to use Edison:
1. Download the latest version of Edison and its dependencies
from the website
2. Add all the jars to your project
3. ????
4. Profit
 A Maven repository is also available. See the edison
page for more details
Demo 1
 Basic Edison example, where we will
1. Create a TextAnnotation object from raw text
2. Add a few views from the curator
3. Print them on the terminal
http://cogcomp.cs.uiuc.edu/software/edison/FirstCuratorExample.html
Demo 2
 Second Edison example, where we will
1. Create a TextAnnotation object from raw text
2. Add a few views from the curator
3. Print all the constituents in the named entity view
Let’s recall our goal
 Let us label entities using features defined on mentions
 Identify mentions using the named entity recognizer
 Define features based on the words, parts of speech and
dependency trees
 Train a classifier
All sentences in the news in which the
string Barack Obama occurs
Demo 3
 Reading the Fame classifier data and adding views
 Feature functions
 What would be good features for the fame classification
task?
The US President Barack Obama said that he ….
President Barack Obama recently visited France.
Features for Barack Obama
• US: 1
• President: 2
• said: 1
• visited: 1
• France: 1
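The counts above can be reproduced with a simple word-count feature function. This is a plain-Python sketch, not the code from the demo; the stop list is an assumption chosen so that the output mirrors the slide, and the mention's own words are skipped.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption, picked to match the slide's features).
STOPWORDS = {"the", "that", "he", "recently"}

def context_features(sentences, mention):
    """Count words that co-occur with `mention`, skipping the mention itself."""
    mention_tokens = set(mention.lower().split())
    counts = Counter()
    for sentence in sentences:
        if mention not in sentence:
            continue
        for token in re.findall(r"[A-Za-z]+", sentence):
            if token.lower() in mention_tokens or token.lower() in STOPWORDS:
                continue
            counts[token] += 1
    return counts

# The two example sentences from the slide, kept verbatim.
SENTENCES = [
    "The US President Barack Obama said that he ….",
    "President Barack Obama recently visited France.",
]

print(dict(context_features(SENTENCES, "Barack Obama")))
```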
Learning Based Java
Writing classifiers
What is Learning Based Java?
 A modeling language for learning and inference
 Supports
 Programming using learned models
 High level specification of features and constraints between
classifiers
 Inference with constraints
 The learning operator
 Classifiers are functions defined in terms of data
 Learning happens at compile time
What does LBJ do for you?
 Abstracts away the feature representation, learning and
inference
 Allows you to write learning based programs
 Application developers can reason about the application
at hand
Our application
[Diagram: the same pipeline as before (Data, Feature Functions, Learning Algorithm, output labels “politics”, “sports”, “business”), annotated with “Curator and Edison” and “Learning Based Java”]
Demo 4
 The fame classifier itself
1. The features
2. The classifier
3. Compiling to train the classifier
The Fame classifier
Putting the pieces together
Recall our solution
 Let us label entities using features defined on mentions
 Identify mentions using the named entity recognizer
 Define features based on the words, parts of speech and
dependency trees
 Train a classifier
All sentences in the news in which the
string Barack Obama occurs
The infrastructure
 Curator
 Provides access to the POS tagger, NER and the Stanford
Dependency parser
 Caches all annotations
 Edison
 NLP representation in our program
 Feature extraction
 Learning Based Java
 The machine learning
Final demo
 Let’s see this in action
Links
 Cogcomp Software:
http://cogcomp.cs.illinois.edu/page/software
 Support:
illinois-ml-nlp-users@cs.uiuc.edu
 Download the slides and the code from
http://cogcomp.cs.illinois.edu/page/tutorial.201008
Running the test code on a Unix Machine
Step 1: Train the classifier
$ ./compileLBJ entityFame.lbj
Step 2: Compile the other java files with
$ ant
Step 3: Test the classifier:
$ ./test.sh data/test

More Related Content

Similar to Day2-Slides.ppt pppppppppppppppppppppppppp

Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxAlyaaMachi
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extractionGabriel Hamilton
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendEgor Pushkin
 
Sk t academy lecture note
Sk t academy lecture noteSk t academy lecture note
Sk t academy lecture noteSusang Kim
 
Multi-language Content Discovery Through Entity Driven Search
Multi-language Content Discovery Through Entity Driven SearchMulti-language Content Discovery Through Entity Driven Search
Multi-language Content Discovery Through Entity Driven SearchAlessandro Benedetti
 
Introduction to Application Profiles
Introduction to Application ProfilesIntroduction to Application Profiles
Introduction to Application ProfilesDiane Hillmann
 
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...CITE
 
Pycon India 2018 Natural Language Processing Workshop
Pycon India 2018   Natural Language Processing WorkshopPycon India 2018   Natural Language Processing Workshop
Pycon India 2018 Natural Language Processing WorkshopLakshya Sivaramakrishnan
 
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...KristiLBurns
 
Web 3 Expert System
Web 3 Expert SystemWeb 3 Expert System
Web 3 Expert Systemguest4513a7
 
Web 3 Expert System
Web 3 Expert SystemWeb 3 Expert System
Web 3 Expert SystemMediabistro
 
The RDA Vocabularies: What They Are, How They Work
The RDA Vocabularies: What They Are, How They WorkThe RDA Vocabularies: What They Are, How They Work
The RDA Vocabularies: What They Are, How They WorkDiane Hillmann
 
Diane Hillmann: RDA Vocabularies in the Semantic Web
Diane Hillmann: RDA Vocabularies in the Semantic WebDiane Hillmann: RDA Vocabularies in the Semantic Web
Diane Hillmann: RDA Vocabularies in the Semantic WebALATechSource
 
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...Iman Mirrezaei
 

Similar to Day2-Slides.ppt pppppppppppppppppppppppppp (20)

Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
Nltk
NltkNltk
Nltk
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptx
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon Comprehend
 
Sk t academy lecture note
Sk t academy lecture noteSk t academy lecture note
Sk t academy lecture note
 
Multi-language Content Discovery Through Entity Driven Search
Multi-language Content Discovery Through Entity Driven SearchMulti-language Content Discovery Through Entity Driven Search
Multi-language Content Discovery Through Entity Driven Search
 
Introduction to Application Profiles
Introduction to Application ProfilesIntroduction to Application Profiles
Introduction to Application Profiles
 
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
Multiple Methods and Techniques in Analyzing Computer-Supported Collaborative...
 
Text summarization
Text summarizationText summarization
Text summarization
 
NLP todo
NLP todoNLP todo
NLP todo
 
Pycon India 2018 Natural Language Processing Workshop
Pycon India 2018   Natural Language Processing WorkshopPycon India 2018   Natural Language Processing Workshop
Pycon India 2018 Natural Language Processing Workshop
 
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
leewayhertz.com-Named Entity Recognition NER Unveiling the value in unstructu...
 
Web 3 Expert System
Web 3 Expert SystemWeb 3 Expert System
Web 3 Expert System
 
Web 3 Expert System
Web 3 Expert SystemWeb 3 Expert System
Web 3 Expert System
 
01.ppt
01.ppt01.ppt
01.ppt
 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
 
The RDA Vocabularies: What They Are, How They Work
The RDA Vocabularies: What They Are, How They WorkThe RDA Vocabularies: What They Are, How They Work
The RDA Vocabularies: What They Are, How They Work
 
Diane Hillmann: RDA Vocabularies in the Semantic Web
Diane Hillmann: RDA Vocabularies in the Semantic WebDiane Hillmann: RDA Vocabularies in the Semantic Web
Diane Hillmann: RDA Vocabularies in the Semantic Web
 
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
 

Recently uploaded

IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 

Recently uploaded (20)

IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 

Day2-Slides.ppt pppppppppppppppppppppppppp

  • 1. An Introduction to Machine Learning and Natural Language Processing Tools Vivek Srikumar, Mark Sammons (Some slides from Nick Rizzolo)
  • 2. The Famous People Classifier f( ) = Politician f( ) = Athlete f( ) = Corporate Mogul
  • 3. Outline  An Overview of NLP Resources  Our NLP Application: The Fame classifier  The Curator  Edison  Learning Based Java  Putting everything together
  • 4. An overview of NLP resources What can NLP do for me?
  • 5. NLP resources (An incomplete list)  Cognitive Computation Group resources  Tokenization/Sentence Splitting  Part Of Speech  Chunking  Named Entity Recognition  Coreference  Semantic Role Labeling  Others  Stanford parser and dependencies  Charniak Parser Page 5
  • 6. Tokenization and Sentence Segmentation  Given a document, find the sentence and token boundaries The police chased Mr. Smith of Pink Forest, Fla. all the way to Bethesda, where he lived. Smith had escaped after a shoot-out at his workplace, Machinery Inc.  Why?  Word counts may be important features  Words may themselves be the object you want to classify  “lived.” and “lived” should give the same information  different analyses need to align if you want to leverage multiple annotators from different sources/tasks Page 6
  • 7. Part of Speech (POS)  Allows simple abstraction for pattern detection  Disambiguate a target, e.g. “make (a cake)” vs. “make (of car)”  Specify more abstract patterns, e.g. Noun Phrase: ( DT JJ* NN )  Specify context in abstract way  e.g. “DT boy VBX” for “actions boys do”  This expression will catch “a boy cried”, “some boy ran”, …  Tagged examples (word/POS): The/DT boy/NN stood/VBD on/PP the/DT burning/JJ deck/NN; A/DT boy/NN rode/VBD on/PP a/DT red/JJ bicycle/NN
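The noun-phrase pattern ( DT JJ* NN ) can be sketched as a scan over a tag sequence. This assumes the tags have already been produced by a POS tagger; the class and method names are hypothetical, and the tagged sentence is the first example from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds word spans whose tags match DT JJ* NN.
final class PosPatternMatcher {
    public static List<String> nounPhrases(String[] words, String[] tags) {
        List<String> phrases = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            if (tags[i].equals("DT")) {
                int j = i + 1;
                // Skip over zero or more adjectives (JJ*).
                while (j < tags.length && tags[j].equals("JJ")) {
                    j++;
                }
                // The pattern only matches if a noun closes the span.
                if (j < tags.length && tags[j].equals("NN")) {
                    phrases.add(String.join(" ", Arrays.copyOfRange(words, i, j + 1)));
                    i = j + 1;
                    continue;
                }
            }
            i++;
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"The", "boy", "stood", "on", "the", "burning", "deck"};
        String[] tags = {"DT", "NN", "VBD", "PP", "DT", "JJ", "NN"};
        System.out.println(nounPhrases(words, tags)); // [The boy, the burning deck]
    }
}
```

Because the match is on tags rather than words, the same code also finds “A boy” and “a red bicycle” in the second example sentence.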
  • 8. Chunking  Identifies phrase-level constituents in sentences [NP Boris] [ADVP regretfully] [VP told] [NP his wife] [SBAR that] [NP their child] [VP could not attend] [NP night school] [PP without] [NP permission] .  Useful for filtering: identify e.g. only noun phrases, or only verb phrases  Groups modifiers with heads  Useful for e.g. Mention Detection  Used as source of features, e.g. distance (abstracts away determiners, adjectives, for example), sequence,…  More efficient to compute than full syntactic parse  Applications in e.g. Information Extraction – getting (simple) information about concepts of interest from text documents
  • 9. Named Entity Recognition  Identifies and classifies strings of characters representing proper nouns [PER Neil A. Armstrong] , the 38-year-old civilian commander, radioed to earth and the mission control room here: “[LOC Houston] , [ORG Tranquility] Base here; the Eagle has landed.”  Useful for filtering documents  “I need to find news articles about organizations in which Bill Gates might be involved…”  Disambiguate tokens: “Chicago” (team) vs. “Chicago” (city)  Source of abstract features  E.g. “Verbs that appear with entities that are Organizations”  E.g. “Documents that have a high proportion of Organizations”
  • 10. Coreference  Identify all phrases that refer to each entity of interest – i.e., group mentions of concepts [Neil A. Armstrong] , [the 38-year-old civilian commander], radioed to [earth]. [He] said the famous words, “[the Eagle] has landed”.  The Named Entity recognizer only gets us part-way…  …if we ask, “what actions did Neil Armstrong perform?”, we will miss many instances (e.g. “He said…”)  Coreference resolver abstracts over different ways of referring to the same person  Useful in feature extraction, information extraction
  • 11. Parsers  Identify the grammatical structure of a sentence  Example sentence: John hit the ball  [Figure: full parse and dependency parse of the sentence, with edges labeled subject, object and modifier]  Parsers reveal the grammatical relationships between words and phrases
  • 12. Semantic Role Labeler  SRL reveals relations and arguments in the sentence (where relations are expressed as verbs)  Cannot abstract over variability of expressing the relations – e.g. kill vs. murder vs. slay…
  • 13. Enough NLP. Let’s make our $$$ with the fame classifier
  • 14. The Famous People Classifier f( ) = Politician f( ) = Athlete f( ) = Corporate Mogul
  • 15. The NLP version of the fame classifier  Each person is represented by text:  Barack Obama: all sentences in the news in which the string “Barack Obama” occurs  Roger Federer: all sentences in the news in which the string “Roger Federer” occurs  Bill Gates: all sentences in the news in which the string “Bill Gates” occurs
  • 16. Our goal  Find famous athletes, corporate moguls and politicians  Athlete: Michael Schumacher, Michael Jordan, …  Politician: Bill Clinton, George W. Bush, …  Corporate Mogul: Warren Buffett, Larry Ellison, …
  • 17. Let’s brainstorm  What NLP resources could we use for this task? Remember, we start off with just raw text from a news website
  • 18. One solution  Let us label entities using features defined on mentions  Identify mentions using the named entity recognizer  Define features based on the words, parts of speech and dependency trees  Train a classifier  Input: all sentences in the news in which the string Barack Obama occurs
  • 19. Where to get it: Machine Learning Feature Functions Learning Algorithm Data → “politics” → “sports” → “business”
  • 20. A second look at the solution  Identify mentions using the named entity recognizer  Define features based on the words, parts of speech and dependency trees  Train a classifier  University of Illinois: Sentence and Word Splitter, Part-of-speech Tagger, Named Entity Recognizer  Stanford University: Dependency Parser (and the NLP pipeline)  These tools can be downloaded from the websites. Are we done? If not, what’s missing?
  • 21. We need to put the pieces together
  • 22. The infrastructure The Curator • A common interface for different NLP annotators • Caches their results Edison • Library for NLP representation in Java • Helps with extracting complex features Learning Based Java • A Java library for machine learning • Provides a simple language to define classifiers and perform inference with them
  • 23. The infrastructure  Each infrastructure module has specific interfaces that the user is expected to use  The Curator specifies the interface for accessing annotations from the NLP tools  Edison fixes the representation for the NLP annotation  Learning Based Java requires training data to be presented to it using an interface called Parser
  • 24. Curator A place where NLP annotations live
  • 25. Big NLP  NLP tools are quite sophisticated  The more complex, the bigger the memory requirement  NER: 1G; Coref: 1G; SRL: 4G ….  If you use tools from different sources, they may be…  In different languages  Using different data structures  If you run a lot of experiments on a single corpus, it would be nice to cache the results  …and for your colleagues, nice if they can access that cache.  Curator is our solution to these problems.
  • 27. What does the Curator give you?  Supports distributed NLP resources  Central point of contact  Single set of interfaces  Code generation in many programming languages (using Thrift)  Programmatic interface  Defines set of common data structures used for interaction  Caches processed data  Enables highly configurable NLP pipeline Overhead:  Annotation is all at the level of character offsets: Normalization/mapping to token level required  Need to wrap tools to provide requisite data structures
  • 28. Getting Started With the Curator http://cogcomp.cs.illinois.edu/curator  Installation:  Download the curator package and uncompress the archive  Run bootstrap.sh  The default installation comes with the following annotators (Illinois, unless mentioned):  Sentence splitter and tokenizer  POS tagger  Shallow Parser  Named Entity Recognizer  Coreference resolution system  Stanford parser
  • 29. Basic Concept  Different NLP annotations can be defined in terms of a few simple data structures: 1. Record: A big container to store all annotations of a text 2. Span: A span of text (defined in terms of characters) along with a label (A single token, or a single POS tag) 3. Labeling: A collection of Spans (POS tags for the text) 4. Trees and Forests (Parse trees) 5. Clustering: A collection of Labelings (Co-reference) Go here for more information: http://cogcomp.cs.illinois.edu/trac/wiki/CuratorDataStructures
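The Span and Labeling structures described above can be sketched as plain Java classes. This is an illustration only: `SpanSketch`, `LabelingSketch` and their fields are hypothetical names, not the Thrift-generated Curator classes, and the offsets are worked out by hand for the sentence “The tree fell.”

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Span: a labeled range of character offsets into the raw text.
final class SpanSketch {
    final int start;    // inclusive character offset
    final int end;      // exclusive character offset
    final String label; // e.g. a POS tag

    SpanSketch(int start, int end, String label) {
        this.start = start;
        this.end = end;
        this.label = label;
    }

    String coveredText(String rawText) {
        return rawText.substring(start, end);
    }
}

// Hypothetical Labeling: a named collection of Spans over one text.
final class LabelingSketch {
    final String source;
    final List<SpanSketch> spans = new ArrayList<>();

    LabelingSketch(String source) {
        this.source = source;
    }

    LabelingSketch add(int start, int end, String label) {
        spans.add(new SpanSketch(start, end, label));
        return this;
    }
}

final class CuratorStructuresDemo {
    // POS labeling for "The tree fell." using character offsets.
    static LabelingSketch posLabeling() {
        return new LabelingSketch("pos")
                .add(0, 3, "DT")
                .add(4, 8, "NN")
                .add(9, 13, "VBD")
                .add(13, 14, ".");
    }

    public static void main(String[] args) {
        String text = "The tree fell.";
        for (SpanSketch s : posLabeling().spans) {
            System.out.println(s.coveredText(text) + "/" + s.label);
        }
    }
}
```

Because spans are character-based, a client that wants token-level annotations has to map offsets back to tokens, which is the overhead the Curator slides mention.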
  • 30. Example of a Labeling The tree fell.
  • 31. Edison Representing NLP objects and extracting features
  • 32. Edison  An NLP data representation and feature extraction library  Helps manage and use different annotations of text  Doesn’t the Curator do everything we need?  Curator is a service that abstracts away different annotators  Edison is a Curator client  And more…
  • 33. Representation of NLP annotations  All NLP annotations are called Views  A View is just a labeled directed graph  Nodes are labeled collections of tokens, called Constituents  Labeled edges between nodes are called Relations  All Views related to some text are contained in a TextAnnotation
  • 34. Example of Views: Part of speech  Example sentence: A tree fell  Part of speech view is a degenerate graph  No edges because there are no relations  This kind of View is represented by a subclass called TokenLabelView  Note that constituents are token based, not character based  Constituents: A/DT, tree/NN, fell/VBD
  • 35. Example of Views: Shallow Parse  Example sentence: A tree fell  Shallow parse view is also a degenerate graph  No edges because there are no relations  This kind of View is represented by a subclass called SpanLabelView  Constituents: [Noun Phrase A tree], [Verb Phrase fell]
  • 36. Example of Views: Dependency Tree  Example sentence: A tree fell  A subclass of View called TreeView  Constituents: A, tree, fell  Relations: mod, subj
  • 37. More about Views  View represents a generic graph of Constituents and Relations  Its subclasses denote specializations suited to specific structures  TokenLabelView  SpanLabelView  TreeView  PredicateArgumentView  CoreferenceView  Each view allows us to query its constituents  Useful for defining features!
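The idea of a View as a labeled directed graph that can be queried can be sketched as follows. This is not Edison's API: `ViewSketch` and its methods are hypothetical, and the edge directions in the example (head to dependent, so subj runs from “fell” to “tree” and mod from “tree” to “A”) are an assumption about the dependency figure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a View: constituents are token-based nodes,
// relations are labeled directed edges between them.
final class ViewSketch {
    static final class Constituent {
        final String label;
        final int startToken; // inclusive token index
        final int endToken;   // exclusive token index

        Constituent(String label, int startToken, int endToken) {
            this.label = label;
            this.startToken = startToken;
            this.endToken = endToken;
        }
    }

    static final class Relation {
        final String label;
        final Constituent source;
        final Constituent target;

        Relation(String label, Constituent source, Constituent target) {
            this.label = label;
            this.source = source;
            this.target = target;
        }
    }

    final List<Constituent> constituents = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    Constituent addConstituent(String label, int startToken, int endToken) {
        Constituent c = new Constituent(label, startToken, endToken);
        constituents.add(c);
        return c;
    }

    void addRelation(String label, Constituent source, Constituent target) {
        relations.add(new Relation(label, source, target));
    }

    // Query: all constituents whose token range includes the given token.
    List<Constituent> constituentsCoveringToken(int token) {
        List<Constituent> out = new ArrayList<>();
        for (Constituent c : constituents) {
            if (c.startToken <= token && token < c.endToken) {
                out.add(c);
            }
        }
        return out;
    }

    // Query: all edges that point at the given constituent.
    List<Relation> incomingRelations(Constituent c) {
        List<Relation> out = new ArrayList<>();
        for (Relation r : relations) {
            if (r.target == c) {
                out.add(r);
            }
        }
        return out;
    }

    // Dependency view for the tokens "A tree fell" (assumed edge directions).
    static ViewSketch dependencyExample() {
        ViewSketch view = new ViewSketch();
        Constituent a = view.addConstituent("A", 0, 1);
        Constituent tree = view.addConstituent("tree", 1, 2);
        Constituent fell = view.addConstituent("fell", 2, 3);
        view.addRelation("subj", fell, tree);
        view.addRelation("mod", tree, a);
        return view;
    }

    public static void main(String[] args) {
        ViewSketch view = dependencyExample();
        Constituent tree = view.constituentsCoveringToken(1).get(0);
        for (Relation r : view.incomingRelations(tree)) {
            System.out.println(r.source.label + " -" + r.label + "-> " + r.target.label);
        }
    }
}
```

A part-of-speech or shallow-parse view would use the same structure with an empty relation list, which is what the slides call a degenerate graph.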
  • 38. Features  Complex features using this library  Examples  POS tag for a token  All POS tags within a span  All tokens within a span that have a specific POS tag  All chunks contained within a parse constituent  All chunks contained in the largest NP that covers a token  All co-referring mentions to chunks contained in the largest NP that covers this token  All incoming dependency edges to a constituent  Enables quick feature engineering
  • 39. Getting started with Edison http://cogcomp.cs.uiuc.edu/software/edison  How to use Edison: 1. Download the latest version of Edison and its dependencies from the website 2. Add all the jars to your project 3. ???? 4. Profit  A Maven repository is also available. See the Edison page for more details
  • 40. Demo 1  Basic Edison example, where we will 1. Create a TextAnnotation object from raw text 2. Add a few views from the curator 3. Print them on the terminal http://cogcomp.cs.uiuc.edu/software/edison/FirstCuratorExample.html
  • 41. Demo 2  Second Edison example, where we will 1. Create a TextAnnotation object from raw text 2. Add a few views from the curator 3. Print all the constituents in the named entity view
  • 42. Let’s recall our goal  Let us label entities using features defined on mentions  Identify mentions using the named entity recognizer  Define features based on the words, parts of speech and dependency trees  Train a classifier  Input: all sentences in the news in which the string Barack Obama occurs
  • 43. Demo 3  Reading the Fame classifier data and adding views  Feature functions  What would be good features for the fame classification task?  Example sentences: “The US President Barack Obama said that he ….” “President Barack Obama recently visited France.”  Features for Barack Obama: US: 1, President: 2, said: 1, visited: 1, France: 1
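One way to arrive at the feature counts listed for Barack Obama is to count nouns and verbs that co-occur with the mention. The slides do not say how the features were selected, so the noun/verb filter is an assumption chosen because it reproduces those counts; the POS tags are assigned by hand, and `ContextWordFeatures` is a hypothetical name rather than part of Edison or LBJ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical feature function: counts nouns and verbs in sentences that
// mention an entity, skipping the tokens of the mention itself.
final class ContextWordFeatures {
    public static Map<String, Integer> extract(
            List<String[]> words, List<String[]> tags, Set<String> mentionTokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int s = 0; s < words.size(); s++) {
            String[] w = words.get(s);
            String[] t = tags.get(s);
            for (int i = 0; i < w.length; i++) {
                // Assumed filter: keep only noun (NN*) and verb (VB*) tags.
                boolean contentWord = t[i].startsWith("NN") || t[i].startsWith("VB");
                if (contentWord && !mentionTokens.contains(w[i])) {
                    counts.merge(w[i], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // The two example sentences, with hand-assigned tags (an assumption).
    // The elided tail of the first sentence is left out.
    static Map<String, Integer> slideExample() {
        List<String[]> words = new ArrayList<>();
        List<String[]> tags = new ArrayList<>();
        words.add(new String[] {"The", "US", "President", "Barack", "Obama", "said", "that", "he"});
        tags.add(new String[] {"DT", "NNP", "NNP", "NNP", "NNP", "VBD", "IN", "PRP"});
        words.add(new String[] {"President", "Barack", "Obama", "recently", "visited", "France", "."});
        tags.add(new String[] {"NNP", "NNP", "NNP", "RB", "VBD", "NNP", "."});
        Set<String> mention = new HashSet<>(Arrays.asList("Barack", "Obama"));
        return extract(words, tags, mention);
    }

    public static void main(String[] args) {
        // Prints {France=1, President=2, US=1, said=1, visited=1}
        System.out.println(slideExample());
    }
}
```

Each count would become one feature value for the entity; a real system would take the tags from the POS view instead of hard-coding them.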
  • 45. What is Learning Based Java?  A modeling language for learning and inference  Supports  Programming using learned models  High level specification of features and constraints between classifiers  Inference with constraints  The learning operator  Classifiers are functions defined in terms of data  Learning happens at compile time
  • 46. What does LBJ do for you?  Abstracts away the feature representation, learning and inference  Allows you to write learning based programs  Application developers can reason about the application at hand
  • 47. Our application Feature Functions Learning Algorithm Data → “politics” → “sports” → “business” Curator and Edison Learning Based Java
  • 48. Demo 4  The fame classifier itself 1. The features 2. The classifier 3. Compiling to train the classifier
  • 49. The Fame classifier Putting the pieces together
  • 50. Recall our solution  Let us label entities using features defined on mentions  Identify mentions using the named entity recognizer  Define features based on the words, parts of speech and dependency trees  Train a classifier  Input: all sentences in the news in which the string Barack Obama occurs
  • 51. The infrastructure  Curator  Provides access to the POS tagger, NER and the Stanford Dependency parser  Caches all annotations  Edison  NLP representation in our program  Feature extraction  Learning Based Java  The machine learning
  • 52. Final demo  Let’s see this in action
  • 53. Links  Cogcomp Software: http://cogcomp.cs.illinois.edu/page/software  Support: illinois-ml-nlp-users@cs.uiuc.edu  Download the slides and the code from http://cogcomp.cs.illinois.edu/page/tutorial.201008
  • 54. Running the test code on a Unix Machine  Step 1: Train the classifier: $ ./compileLBJ entityFame.lbj  Step 2: Compile the other Java files: $ ant  Step 3: Test the classifier: $ ./test.sh data/test

Editor's Notes

  1. How hard can it be? Periods, spaces, and capitalization all give strong cues, but are still ambiguous