ML Applications: 1st Session
Introduction to Natural Language Processing (NLP)
Alia Hamwi
What is NLP?
• Natural Language Processing (NLP) is a field in Artificial Intelligence
(AI) devoted to creating computers that use natural language as input
and/or output.
What is NLP?
• The field of NLP involves making computers perform useful tasks
with the natural languages humans use. The input and output of an
NLP system can be:
• Speech
• Written Text
NLP Applications
• Data mining and analytics of weblogs, microblogs, discussion forums,
user reviews, and other forms of user-generated media.
NLP Applications
• Conversational agents, which combine:
• Speech recognition/synthesis
• Question answering
• From the web and from structured information sources (Freebase, DBpedia, etc.)
• Command identification for agent-like abilities:
• Create/edit calendar entries
• Reminders
• Directions
• Invoking/interacting with other apps
NLP Applications
• Translation
• Google
• DIRA (from English to Egyptian Dialect)
DIRA (From English to Egyptian Dialect)
https://aclanthology.org/I13-2004.pdf
NLP Applications
• Classifiers: classify a set of documents into categories (e.g., email spam
filters)
• Information Retrieval: find documents relevant to a given query (e.g., search
engines)
• Summarization: Produce a readable summary, e.g., news about oil
today.
• Spelling checkers, grammar checkers, auto-filling, and more
Linguistic Levels of Analysis/Ambiguity
• Phonology ‫الصوتي‬
• Speech audio signal to phonemes sounds / letters / pronunciation
• Ambiguity (two/too, سائد/صائد, “I scream” / “ice cream”)
• Morphology ‫الصرفي‬
• the structure of words.
• Inflection (e.g. “I”, “my”, “me”; “eat”, “eats”, “ate”, “eaten”)
• Derivation (e.g. “teach”, “teacher”; “كتب”, “كاتب”; “friend”, “friendly”)
• Ambiguity (‫كوارث‬, Unionized)
Linguistic Levels of Analysis/Ambiguity
• Syntax ‫القواعدي‬
• Grammar: how word sequences are structured
• Part-of-speech (noun, verb, adjective, preposition, etc.)
• Phrase structure (e.g. noun phrase, verb phrase)
• Ambiguity
Linguistic Levels of Analysis
• Semantics الدلالي
• Meaning of a word
• Ambiguity (“board”, “book”, “عين”)
• Dialogue
• Meaning and interrelations between sentences
Common NLP Tasks
• Word tokenization
• Sentence boundary detection
• Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
• Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location, organization; domain
terminologies)
• Parsing
• to identify the syntactic structure of a sentence
• Semantic analysis
• to derive the meaning of a sentence
NLP Task: Part-of-Speech (POS) Tagging
• POS tagging is the process of assigning a POS (lexical class) marker to
each word in a sentence (and to all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
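The example above can be reproduced with a toy lookup tagger in a few lines of Python. The tiny lexicon below is illustrative only; a real tagger (e.g., an HMM or neural tagger) disambiguates words like “lead” (noun vs. verb) from context rather than from a dictionary:

```python
# Toy lexicon-lookup tagger reproducing the slide's example.
# A real POS tagger resolves ambiguity from context; this sketch
# simply looks each word up in a tiny hand-made lexicon.
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}

def pos_tag(sentence):
    # Tag out-of-lexicon words as UNK rather than guessing.
    return " ".join(f"{w}/{LEXICON.get(w, 'UNK')}" for w in sentence.split())

print(pos_tag("the lead paint is unsafe"))
# the/Det lead/N paint/N is/V unsafe/Adj
```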
NLP Task: Named Entity Recognition (NER)
• NER processes a text to identify the named entities in a sentence
• e.g. “U.N. official Ekeus heads for Baghdad.”
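To see why NER is a real task and not a string match, here is a deliberately naive capitalization heuristic in Python (not a real NER system; real systems use sequence models trained on annotated data):

```python
def toy_ner(sentence):
    """Naive NE spotter: flag capitalized tokens, skipping the
    sentence-initial word unless it is an all-caps acronym."""
    entities = []
    for i, tok in enumerate(sentence.split()):
        if i == 0 and not tok.isupper():
            continue  # "The" at sentence start is capitalized but not an NE
        word = tok.strip(".,;!?")
        if word and word[0].isupper():
            entities.append(word)
    return entities

print(toy_ner("U.N. official Ekeus heads for Baghdad."))
# ['U.N', 'Ekeus', 'Baghdad']
```

Even on this example the heuristic clips the final period of “U.N.” and it would miss any entity that is not capitalized; this fragility is exactly why trained NER models are used instead.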
NLP Task: Parsing and Dependency Parsing
• Shallow (or partial) parsing identifies the (base) syntactic phrases in a
sentence.
• After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
[NP He] [V saw] [NP the big dog]
[PER Bill Gates] founded [ORG Microsoft].
Dependency relations, written relation(head, dependent):
nsubj(founded, Bill Gates)
dobj(founded, Microsoft)
NLP Task: Information Extraction
• Identify specific pieces of information (data) in an unstructured or
semi-structured text
• Transform unstructured information in a corpus of texts or web
pages into a structured database (or templates)
• Applied to various types of text, e.g.:
• Newspaper articles
• Scientific articles
• Web pages
• etc.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production in January 1990 with production of 20,000
iron and “metal wood” clubs a month.
Template filling (ACTIVITY-1):
• Activity: PRODUCTION
• Company: “Bridgestone Sports Taiwan Co.”
• Product: “iron and ‘metal wood’ clubs”
• Start Date: DURING January 1990
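For this one sentence, the template could be filled with a couple of hand-written regular expressions. The patterns below are ad hoc to this example (real IE systems use far more robust, often learned, extractors):

```python
import re

text = ("The joint venture, Bridgestone Sports Taiwan Co., capitalized at "
        "20 million new Taiwan dollars, will start production in January 1990 "
        "with production of 20,000 iron and \u201cmetal wood\u201d clubs a month.")

# Ad hoc patterns written for this one sentence only.
company = re.search(r"joint venture, ([^,]+),", text)
start_date = re.search(r"start production in (\w+ \d{4})", text)

# Fill the template slots from the matches.
template = {
    "Activity": "PRODUCTION",
    "Company": company.group(1) if company else None,
    "Start Date": start_date.group(1) if start_date else None,
}
print(template)
```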
NLP Pipeline
• Typical stages: Data Collection → Text Cleaning → Preprocessing → Feature
Engineering/Text Representation → Modeling & Evaluation
NLP Pipeline: Data Collection
• Ideal Setting: We have everything needed.
• Labels and Annotations
• Very often we are dealing with less-than-ideal scenarios
• Initial datasets with limited annotations/labels
• Initial datasets labeled based on regular expressions or heuristics
• Public datasets (cf. Google Dataset Search or Kaggle)
• Scrape data
NLP Pipeline: Text Cleaning
• Extracting raw texts from the input data
• HTML
• PDF
• Relevant vs. irrelevant information
• non-textual information
• markup
• metadata
• Encoding format
NLP Pipeline: Preprocessing
• Sentence segmentation
• Word tokenization
• Frequent preprocessing
• Stopword removal
• Stemming and/or lemmatization
• Digit/punctuation removal
• Case normalization
• Arabic: remove diacritics
• Remove redundant spaces
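A minimal version of these preprocessing steps can be sketched in pure Python. Stemming/lemmatization and Arabic diacritic removal are omitted here because they need a library such as NLTK, and the stopword list is a tiny illustrative stand-in:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # tiny stand-in list

def preprocess(text):
    text = text.lower()                   # case normalization
    text = re.sub(r"\d+", " ", text)      # digit removal
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
    tokens = text.split()                 # tokenization; also collapses redundant spaces
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The lead paint is unsafe!  Tested in 1990."))
# ['lead', 'paint', 'unsafe', 'tested']
```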
NLP Pipeline: Feature Engineering / Text Representation
• Feature Engineering for Classical ML
• Bag-of-words representations
• Domain-specific word frequency lists
• Handcrafted features based on domain-specific knowledge
• Feature Engineering for DL
• DL directly takes the texts as inputs to the model.
• The DL model is capable of learning features from the texts (e.g.,
embeddings)
• The price is that the model is often less interpretable.
NLP Pipeline: Bag of Words Model (Binary)
• The bag-of-words model is the simplest way (i.e., easiest to automate) to
vectorize texts into binary representations.
NLP Pipeline: Bag of Words Model (Count)
• The bag-of-words model is the simplest way (i.e., easiest to automate) to
vectorize texts into numeric (count) representations.
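Both the binary and the count variants can be sketched in a few lines of pure Python (the two-document corpus below is made up for illustration):

```python
from collections import Counter

docs = ["the lead paint is unsafe", "the paint is fresh"]  # toy corpus

# Vocabulary: sorted union of all words across documents.
vocab = sorted({w for d in docs for w in d.split()})

def bow_binary(doc):
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]  # presence/absence

def bow_count(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]  # raw frequency

print(vocab)               # ['fresh', 'is', 'lead', 'paint', 'the', 'unsafe']
print(bow_count(docs[0]))  # [0, 1, 1, 1, 1, 1]
print(bow_binary(docs[1])) # [1, 1, 0, 1, 1, 0]
```

Note that both vectors discard word order entirely, which is exactly the limitation discussed on the next slide.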
NLP Pipeline: Bag of Words Model
• Issues with Bag-of-Words Text Representation
• Word order is ignored.
• Raw absolute frequency counts of words do not necessarily represent the
meaning of the text properly.
NLP Pipeline: TF-IDF Model
• TF-IDF model is an extension of the bag-of-words model, whose main
objective is to adjust the raw frequency counts by considering the
dispersion of the words in the corpus.
• Dispersion refers to how evenly each word/term is distributed across
different documents of the corpus.
• Interaction between Word Raw Frequency Counts and Dispersion:
• Given a high-frequency word:
• If the word is widely dispersed across different documents of the corpus (i.e., high dispersion)
• it is more likely to be semantically general.
• If the word is mostly centralized in a limited set of documents in the corpus (i.e., low
dispersion)
• it is more likely to be topic-specific.
• Dispersion rates of words can be used as weights for the importance of
word frequency counts.
NLP Pipeline: TF-IDF Model
TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of term t in
document d and IDF(t) = log(N / df(t)) down-weights terms that appear in many
of the N documents in the corpus.
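A worked sketch in pure Python, using one common TF/IDF variant (relative term frequency and log N/df; many other weighting variants exist, and the toy corpus is made up):

```python
import math

docs = [["the", "lead", "paint"], ["the", "paint", "is", "fresh"], ["the", "dog"]]

def tf(term, doc):
    return doc.count(term) / len(doc)  # relative frequency in the document

def idf(term, corpus):
    df = sum(1 for d in corpus if term in d)  # number of docs containing the term
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" occurs in every document (high dispersion) -> weight 0
print(tfidf("the", docs[0], docs))             # 0.0
# "lead" occurs in one document (low dispersion) -> positive weight
print(round(tfidf("lead", docs[0], docs), 3))  # 0.366
```

This reproduces the slide's point numerically: the widely dispersed word “the” gets zero weight, while the topic-specific word “lead” keeps a positive one.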
NLP Pipeline: Modeling & Evaluation
• More details in Weeks 2, 3, and 6 of the program.
Further Resources
• Deep Learning for NLP in Python – DataCamp
https://learn.datacamp.com/skill-tracks/deep-learning-for-nlp-in-python
• Natural Language Processing Specialization – Coursera
https://www.coursera.org/specializations/natural-language-processing
• Speech and Language Processing – Book
https://web.stanford.edu/~jurafsky/slp3/
Any Questions?
Thank You
