This document provides an introduction to natural language processing (NLP) through a presentation given by Rutu Mulkar-Mehta. The presentation covers understanding natural language, common NLP tasks like text categorization and sentiment analysis, and challenges like ambiguity. It also discusses part-of-speech tagging and linguistic resources. The overall goal is to introduce attendees to the field of NLP and some of its applications.
Intro to NLP
1. Introduction to Natural Language Processing
Rutu Mulkar-Mehta, PhD
Founder and Data Scientist @Ticary
@RutuMulkar
2. Co-hosted Meetup
Data Science Dojo: http://www.meetup.com/data-science-dojo
Natural Language Processing: http://www.meetup.com/Natural-Language-Processing-Meetup/
4. About Me
• Founder and Data Scientist at Ticary
• Background:
  – PhD in Natural Language Processing
  – Computer Science
• Worked on applying NLP to:
  – Healthcare
  – SEO (Search Engine Optimization)
  – Other Stuff: Sentiment Analysis, Question Answering, Natural Language Understanding ++
5. Agenda
• Understanding Natural Language
• Introduction to different NLP Problems
• Part of Speech tagging
• Linguistic Resources
7. Some Example Sentences
• Children make delicious snacks
• I saw the Grand Canyon flying to New York
• Stolen painting found by the tree
• Two sentences:
  – Monkeys like bananas when they wake up.
  – Monkeys like bananas when they are ripe.
8. Why is NLP Hard?
Brazil crowds attend funeral of late candidate Campos
More than 100,000 people in Brazil have paid their last respects to the late presidential candidate, Eduardo Campos, who died in a plane crash on Wednesday.
They attended a funeral Mass and filled the streets of the city of Recife to follow the passage of his coffin.
Later this week, Mr. Campos's Socialist Party is expected to appoint former Environment Minister Marina Silva as a replacement candidate.
Mr. Campos's jet crashed in bad weather in Santos, near Sao Paulo. Investigators are still trying to establish the exact causes of the crash, which killed six other people.
11. Why is NLP Hard?
• To understand the current event, you need to understand several other concepts:
  – Current Event
  – Background Event
  – Property
  – References to other events
  – Pronouns
12. NLP TASKS
What can we solve with Natural Language Processing?
13. NLP Tasks
• Text Categorization
• Sentiment Analysis
• Information Extraction
• Information Retrieval
• Question Answering
• Text Summarization
• Machine Translation
14. Text Categorization
Input Document
What is the document about:
• sports: 0.2%
• politics: 2%
• entertainment: 96%
• religion: …
• finance: …
15. Text Classification
[Word clouds of finance.yahoo.com and sports.yahoo.com; make your own wordle using wordle.net]
Vocabulary used in one genre of text is different from vocabulary used in another genre
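The idea that each genre has its own vocabulary can be sketched as a tiny word-count classifier. This is only an illustrative sketch: the training sentences, the equal weighting of genres, and the add-one smoothing are assumptions made here, not something shown in the talk.

```python
import math
from collections import Counter

# Toy training data, invented for illustration (not from the talk).
TRAINING = {
    "finance": ["stocks fell as the market closed", "bank shares and bond yields rose"],
    "sports": ["the team won the final game", "the striker scored a late goal"],
}

def train(training):
    """Count how often each word appears in each genre."""
    return {
        genre: Counter(word for doc in docs for word in doc.lower().split())
        for genre, docs in training.items()
    }

def classify(text, counts):
    """Return a probability per genre using word counts with add-one smoothing."""
    vocab = {word for c in counts.values() for word in c}
    log_scores = {}
    for genre, c in counts.items():
        total = sum(c.values())
        log_scores[genre] = sum(
            math.log((c[word] + 1) / (total + len(vocab)))
            for word in text.lower().split()
        )
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp = {genre: math.exp(score - top) for genre, score in log_scores.items()}
    norm = sum(exp.values())
    return {genre: value / norm for genre, value in exp.items()}

counts = train(TRAINING)
probs = classify("the market and bond yields", counts)
print(max(probs, key=probs.get), probs)
```

The output has the same shape as the slide's "sports: 0.2%, politics: 2%, entertainment: 96%" example: one score per category, with the largest taken as the answer.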
18. Sentiment Analysis
• What are people saying?
  – Twitter
  – Reviews
  – Blogs
  – Emails
• Can be for:
  – Products
  – Companies
  – Movies
  – Books
19. Sentiment Analysis: Possible Features
• Important keywords and key phrases:
  – POS: dazzling, brilliant, phenomenal
  – NEG: hideous, awful, unwatchable
• Emoticons
  – POS :-)
  – NEG :-(
• Ontologies
  – WordNet: https://wordnet.princeton.edu/
  – SentiWordNet: http://sentiwordnet.isti.cnr.it/
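The keyword and emoticon features above can be turned into a very small lexicon-based scorer. The word lists are the ones on the slide; the scoring rule (plus one per positive cue, minus one per negative cue) is an assumption made for this sketch.

```python
# Cue words and emoticons taken from the slide.
POSITIVE = {"dazzling", "brilliant", "phenomenal", ":-)"}
NEGATIVE = {"hideous", "awful", "unwatchable", ":-("}

def sentiment_score(text):
    """Return (#positive cues - #negative cues); above 0 leans positive."""
    score = 0
    for token in text.lower().split():
        token = token.strip(".,!?\"'")  # drop surrounding punctuation, keep emoticons
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

print(sentiment_score("A dazzling, brilliant film :-)"))
print(sentiment_score("Awful and unwatchable."))
```

A scorer like this sees no cue at all in a sentence such as "The acting was great ... but I hated the movie", which is why richer features are needed in practice.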
20. Challenges
• People express opinions in complex ways
  – “The acting was great and the plots were intense and mesmerizing, but I hated the movie”
• Sarcasm, humor and other expressions
  – “It was a great movie for a Sunday nap. I only fell asleep twice, but it was very restful”
23. Information Extraction
Input Document
What are the key pieces of information?
• Location:
• Time:
• People:
• …
Extracting Named Entities from Documents
25. Other ways for IE: Hypernyms (type of)
colors such as red, blue and …
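The "X such as A, B and C" phrasing can be matched with a regular expression to pull out (item, type) pairs. This is a minimal sketch assuming single-word items and that one surface pattern; the function name `hypernym_pairs` is made up for illustration.

```python
import re

# "<type> such as <item>, <item> and <item>" with single-word items.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")

def hypernym_pairs(sentence):
    """Return (item, type) pairs found by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        items = re.split(r", | and | or ", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs

print(hypernym_pairs("I like colors such as red, blue and green."))
```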
26. Other ways for IE: Synonyms
Find different relations between 2 concepts:
Microsoft bought Farecast
29. Information Retrieval
What are the documents relevant to the query?
[Diagram: a query matched against five input documents]
31. Information Retrieval
Q) Which documents are most relevant to a given query?
A) Similar vocabulary between query and document?
Quantify similarity based on maximum overlap
  – Cosine Similarity
  – Jaccard Similarity
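Both similarity measures can be computed directly from the words of the query and the document. The sketch below assumes whitespace tokenization and raw word counts (no TF-IDF weighting); the example query and documents are invented.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine of the angle between the two word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[word] * cb[word] for word in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

query = "cheap flights"
documents = ["cheap flights to new york", "weather in new york"]
ranked = sorted(documents, key=lambda doc: cosine(query, doc), reverse=True)
print(ranked)
```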
32. Information Retrieval
Q) If you rewrite the query, will that give you more precise results?
A) Yes! It is called “Query Expansion”
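One simple form of query expansion adds known synonyms to the original query terms. The synonym table below is hypothetical and hard-coded for the sketch; a real system might draw it from a resource such as WordNet.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"car": ["automobile"], "cheap": ["inexpensive", "budget"]}

def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by any synonyms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for alternative in synonyms.get(term, []):
            if alternative not in expanded:
                expanded.append(alternative)
    return expanded

print(expand_query("cheap car"))
```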
33. Commercial Search Tools
• Lucene
  – http://lucene.apache.org/
• ElasticSearch
  – https://www.elastic.co/
Underlying technology in most of these is the same, with some variations
Meetup about this topic scheduled for early 2016
35. Question Answering - Closed
Input Data Source
Questions:
• What event happened?
• When did the event happen?
• Why did the event happen?
• How long was the event?
• How did the event happen?
41. Types of Text Summarization
• Keyword Summaries
  – Extract significant Keywords from text
  – Easy to implement
  – Hard to understand by end user
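A keyword summary can be approximated by counting words and keeping the most frequent ones. The stopword list and the use of raw frequency as "significance" are assumptions of this sketch; the sample text paraphrases the news story used earlier in the deck.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "of", "in", "to", "and", "is", "was", "on", "who"}

def keywords(text, k=3):
    """Return up to k of the most frequent non-stopword tokens."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    counts = Counter(word for word in words if word and word not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

text = "Campos died in a plane crash. The crash killed six other people. Campos was a candidate."
print(keywords(text, k=2))
```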
42. Types of Text Summarization
• Sentence/Phrase Extraction
  – Extract relevant sentences
  – Medium-Hard to implement
  – Easy for end user to understand
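Sentence extraction can be sketched by scoring each sentence by the average frequency of its words and keeping the top scorers. The scoring rule and the naive sentence splitter (which would wrongly split after abbreviations such as "Mr.") are assumptions made for this example.

```python
import re
from collections import Counter

def extract_summary(text, n=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]

print(extract_summary("Cats sleep. Cats purr and cats sleep. Dogs bark.", n=1))
```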
43. Types of Text Summarization
• Natural Language Understanding and Generation
  – Understand meaning of text
  – Generate sentences from meaning of original text
  – Hard to implement
  – Easy for end user
E.g. President of University of Missouri resigned after graduate student hunger strike and class cancellations by faculty
46. Why is MT Hard?
• It is not a 1 to 1 translation
  – In the previous example 4 words in English translate into 2 in Spanish
• Grammar is different in different languages
  – SOV (Subject – Object – Verb)
    • “She him loves” (Hindi, Japanese)
  – SVO (Subject – Verb – Object)
    • “She loves him” (English, Mandarin)
47. Machine Translation
• Waygoapp
• Instantly translates Chinese, Japanese and Korean
• Simply point and translate
• Offline
http://waygoapp.com/
49. Example
All the gobulins were gramzies. It was grimbleton.
What are the underlined words?
gobulins
• Noun
gramzies
• Noun or Adjective
grimbleton
• Noun or Adjective
50. Why is the example important?
We can get a sense of what the word means, based on how it is used in language.
51. Nouns
• E.g. cat, car, computer, tree
• Variations:
  – Number: singular, plural
    • one car, two cars
  – Gender: masculine, feminine, neuter
  – Case: nominative, genitive, accusative, dative
52. Pronouns
• E.g. she, ourselves, mine
• Vary in
  – Person
  – Gender
    • his, her
  – Number
  – Case: nominative, accusative, possessive, 2nd possessive
  – Reflexive and Anaphoric Forms:
    • herself, each other
54. Adjectives
• Describe Properties
  – sunny, beautiful, calm
• Attributive and predicative properties
• Agreement
  – in gender, number
• Comparative and superlative forms
  – derivative and periphrastic
• Positive form
56. Other POS tags
• Adverbs
  – happily
• Prepositions
  – of, on, in
• Particles
  – ran a bill vs ran up a bill
57. Morphological Analysis
• Sleeps = sleep + v + 3rd Person + Singular
• If we have a good enough grammar with all of these rules, we have a good shot at understanding the syntax of language
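The "Sleeps = sleep + v + 3rd Person + Singular" decomposition can be imitated with a small lexicon and a couple of suffix rules. The lexicon, the rules, and the feature labels other than those on the slide are hypothetical; real morphological analyzers cover far more cases.

```python
# Hypothetical mini-lexicon of verb stems, for illustration only.
VERB_STEMS = {"sleep", "walk", "talk"}

def analyze(word):
    """Return [stem, category, features...] or None if no rule applies."""
    w = word.lower()
    if w.endswith("s") and w[:-1] in VERB_STEMS:
        return [w[:-1], "v", "3rd Person", "Singular"]
    if w.endswith("ed") and w[:-2] in VERB_STEMS:
        return [w[:-2], "v", "Past"]
    if w in VERB_STEMS:
        return [w, "v"]
    return None

print("Sleeps = " + " + ".join(analyze("Sleeps")))
```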
59. Automatic Taggers
• Almost all the POS taggers use the Penn Treebank list of tags
• https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  – Nouns:
    • NN (house), NNS (houses), NNP (White House), NNPS
  – Verbs:
    • VB (say), VBD (said), VBG (saying), VBN, VBP, VBZ
  – Adjectives:
    • JJ (good), JJR (better), JJS (best)
  – Adverbs: RB, RBR, RBS
  – Prepositions: IN
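Because the Penn Treebank tags listed above share prefixes within a word class, tagger output can be collapsed into coarse categories with a short lookup. The tagged sentence below is hand-written for illustration, not produced by a tagger, and `coarse_category` is a made-up helper name.

```python
# Tag prefixes from the slide mapped to coarse word classes.
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse word class."""
    if tag == "IN":
        return "Preposition"
    return PREFIX_TO_CLASS.get(tag[:2], "Other")

# Hand-tagged example (an assumption, not real tagger output).
tagged = [("The", "DT"), ("White", "NNP"), ("House", "NNP"),
          ("said", "VBD"), ("houses", "NNS"), ("are", "VBP"), ("better", "JJR")]

for word, tag in tagged:
    print(word, tag, coarse_category(tag))
```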
61. POS Tagging and Parsing
• Stanford Core NLP
  – http://nlp.stanford.edu:8080/corenlp/
• NLTK
  – Natural Language Toolkit
  – You need to provide your own training data, and train models for NLTK to be effective
62. Other Linguistic Features of Interest
– We want to get nouns and verbs into a root form
  E.g.
  • am, are, is → be
  • car, cars, car’s → car
– Two approaches:
  • Stemming
  • Lemmatization
63. Stemming and Lemmatization
• Lemmatization
  – use of a vocabulary
  – morphological analysis of words
  – returns the base or dictionary form of a word
  – base form is known as the lemma
  – e.g. am, are, is → be
• Stemming
  – crude heuristic process
  – chops off the ends of words
  – in the hope of reducing words to a common base form
  – e.g. Marked → Mark, Marker → Mark
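The contrast between the two approaches can be sketched in a few lines: the stemmer blindly chops suffixes, while the lemmatizer looks words up in a vocabulary. The suffix list and lemma table are toy assumptions; this is not the Porter stemmer or any library's lemmatizer.

```python
# Crude suffix list and a tiny lemma dictionary, both illustrative assumptions.
SUFFIXES = ("ing", "ed", "er", "s")
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}

def stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    w = word.lower()
    return LEMMAS.get(w, w)

for word in ["Marked", "Marker", "is", "cars"]:
    print(word, stem(word), lemmatize(word))
```

Note that the stemmer cannot map "is" to "be": only the vocabulary-based lookup knows that irregular form.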
64. Parsing Resources
• NLTK
  – python, low accuracy, fast
  – http://www.nltk.org/
• Stanford Core NLP
  – java, high accuracy, slow
  – http://nlp.stanford.edu/software/corenlp.shtml
• SpaCy
  – python, medium accuracy, fast
  – https://spacy.io/
65. Other Resources: Ontologies
• WordNet
  – groups words when they have the same meaning
  – represents hierarchical links between groups
  – E.g. car is the same thing as an automobile
• SentiWordNet
  – WordNet + Sentiment
• ConceptNet
  – broader relationships than WordNet
  – E.g. bread is typically found near a toaster.
• FrameNet
  – Frames represent concepts and their associated roles
67. Semantics and Word Collocations
• It is important to know which words occur together
  – Strong Beer vs Powerful Beer
  – Big Sister vs Large Sister
• Two approaches have been used
  – Semantics: ontologies and word meanings
  – Statistics: word collocations and probabilities
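The statistical approach can be illustrated with pointwise mutual information (PMI), one common way to score how strongly two adjacent words go together. PMI is a choice made for this sketch rather than something named in the talk, and the toy corpus is invented.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI of the bigram (w1, w2); -inf if the pair never occurs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w1, w2)] == 0:
        return float("-inf")
    p_xy = bigrams[(w1, w2)] / (len(tokens) - 1)
    p_x = unigrams[w1] / len(tokens)
    p_y = unigrams[w2] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus, invented for illustration.
tokens = "strong beer and strong tea but powerful computer and powerful engine".split()
print(pmi(tokens, "strong", "beer"))
print(pmi(tokens, "powerful", "beer"))
```

On a real corpus, a pair like "strong beer" would score well above "powerful beer", matching the intuition on the slide.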