ML Applications: 1st Session
Introduction to Natural Language Processing (NLP)
Alia Hamwi
What is NLP?
• Natural Language Processing (NLP) is a field in Artificial Intelligence
(AI) devoted to creating computers that use natural language as input
and/or output.
What is NLP?
• The field of NLP involves making computers perform useful tasks
with the natural languages humans use. The input and output of an
NLP system can be:
• Speech
• Written Text
NLP Applications
• Data mining and analytics of weblogs, microblogs, discussion forums,
user reviews, and other forms of user-generated media.
NLP Applications
• Conversational agents, which combine:
• Speech recognition/synthesis
• Question answering
• From the web and from structured information sources (Freebase, DBpedia, etc.)
• Command identification for agent-like abilities:
• Create/edit calendar entries
• Reminders
• Directions
• Invoking/interacting with other apps
NLP Applications
• Translation
• Google
• DIRA (from English to Egyptian Dialect)
DIRA (From English to Egyptian Dialect)
https://aclanthology.org/I13-2004.pdf
NLP Applications
• Classifiers: classify a set of documents into categories (e.g., email spam
filters)
• Information Retrieval: find documents relevant to a given query (e.g., search
engines)
• Summarization: Produce a readable summary, e.g., news about oil
today.
• Spelling checkers, grammar checkers, auto-filling, and more
Linguistic Levels of Analysis/Ambiguity
• Phonology ‫الصوتي‬
• Speech audio signal to phonemes sounds / letters / pronunciation
• Ambiguity (two/too, سائد/صائد, “I scream” / “ice cream”)
• Morphology ‫الصرفي‬
• the structure of words.
• Inflection (e.g. “I”, “my”, “me”; “eat”, “eats”, “ate”, “eaten”)
• Derivation (e.g. “teach”, “teacher”; “كتب”, “كاتب”; “friend”, “friendly”)
• Ambiguity (‫كوارث‬, Unionized)
Linguistic Levels of Analysis/Ambiguity
• Syntax ‫القواعدي‬
• Grammar: how word sequences are structured
• Part-of-speech (noun, verb, adjective, preposition, etc.)
• Phrase structure (e.g. noun phrase, verb phrase)
• Ambiguity
Linguistic Levels of Analysis
• Semantics الدلالي
• Meaning of a word
• Ambiguity (“board”, “book”, “عين”)
• Dialogue
• Meaning and interrelations between sentences
Common NLP Tasks
• Word tokenization
• Sentence boundary detection
• Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
• Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location, organization; domain
terminologies)
• Parsing
• to identify the syntactic structure of a sentence
• Semantic analysis
• to derive the meaning of a sentence
NLP Task: Part-of-Speech (POS) Tagging
• POS tagging is the process of assigning a POS (lexical class) marker to
each word in a sentence (and to all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
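The example above can be reproduced with a toy lookup tagger in a few lines of Python. The tiny lexicon below is illustrative only; a real tagger (e.g., an HMM or neural tagger) disambiguates words like “lead” (noun vs. verb) from context rather than from a dictionary:

```python
# Toy lexicon-lookup tagger reproducing the slide's example.
# A real POS tagger resolves ambiguity from context; this sketch
# simply looks each word up in a tiny hand-made lexicon.
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}

def pos_tag(sentence):
    # Tag out-of-lexicon words as UNK rather than guessing.
    return " ".join(f"{w}/{LEXICON.get(w, 'UNK')}" for w in sentence.split())

print(pos_tag("the lead paint is unsafe"))
# the/Det lead/N paint/N is/V unsafe/Adj
```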
NLP Task: Named Entity Recognition (NER)
• NER processes a text to identify the named entities in a sentence
• e.g. “U.N. official Ekeus heads for Baghdad.”
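To see why NER is a real task and not a string match, here is a deliberately naive capitalization heuristic in Python (not a real NER system; real systems use sequence models trained on annotated data):

```python
def toy_ner(sentence):
    """Naive NE spotter: flag capitalized tokens, skipping the
    sentence-initial word unless it is an all-caps acronym."""
    entities = []
    for i, tok in enumerate(sentence.split()):
        if i == 0 and not tok.isupper():
            continue  # "The" at sentence start is capitalized but not an NE
        word = tok.strip(".,;!?")
        if word and word[0].isupper():
            entities.append(word)
    return entities

print(toy_ner("U.N. official Ekeus heads for Baghdad."))
# ['U.N', 'Ekeus', 'Baghdad']
```

Even on this example the heuristic clips the final period of “U.N.” and it would miss any entity that is not capitalized; this fragility is exactly why trained NER models are used instead.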
NLP Task: Parsing and Dependency Parsing
• Shallow (or partial) parsing identifies the (base) syntactic phrases in a
sentence.
• After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
[NP He] [V saw] [NP the big dog]
[PER Bill Gates] founded [ORG Microsoft].
Dependency relations, written relation(head, dependent):
nsubj(founded, Bill Gates)
dobj(founded, Microsoft)
NLP Task: Information Extraction
• Identify specific pieces of information (data) in an unstructured or
semi-structured text
• Transform unstructured information in a corpus of texts or web
pages into a structured database (or templates)
• Applied to various types of text, e.g.:
• Newspaper articles
• Scientific articles
• Web pages
• etc.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new
Taiwan dollars, will start production in January 1990 with production of 20,000
iron and “metal wood” clubs a month.
Template filling (ACTIVITY-1):
• Activity: PRODUCTION
• Company: “Bridgestone Sports Taiwan Co.”
• Product: “iron and ‘metal wood’ clubs”
• Start Date: DURING January 1990
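For this one sentence, the template could be filled with a couple of hand-written regular expressions. The patterns below are ad hoc to this example (real IE systems use far more robust, often learned, extractors):

```python
import re

text = ("The joint venture, Bridgestone Sports Taiwan Co., capitalized at "
        "20 million new Taiwan dollars, will start production in January 1990 "
        "with production of 20,000 iron and \u201cmetal wood\u201d clubs a month.")

# Ad hoc patterns written for this one sentence only.
company = re.search(r"joint venture, ([^,]+),", text)
start_date = re.search(r"start production in (\w+ \d{4})", text)

# Fill the template slots from the matches.
template = {
    "Activity": "PRODUCTION",
    "Company": company.group(1) if company else None,
    "Start Date": start_date.group(1) if start_date else None,
}
print(template)
```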
NLP Pipeline
• Typical stages: Data Collection → Text Cleaning → Preprocessing → Feature
Engineering/Text Representation → Modeling & Evaluation
NLP Pipeline: Data Collection
• Ideal Setting: We have everything needed.
• Labels and Annotations
• Very often we are dealing with less-than-ideal scenarios
• Initial datasets with limited annotations/labels
• Initial datasets labeled based on regular expressions or heuristics
• Public datasets (cf. Google Dataset Search or Kaggle)
• Scrape data
NLP Pipeline: Text Cleaning
• Extracting raw texts from the input data
• HTML
• PDF
• Relevant vs. irrelevant information
• non-textual information
• markup
• metadata
• Encoding format
NLP Pipeline: Preprocessing
• Sentence segmentation
• Word tokenization
• Frequent preprocessing
• Stopword removal
• Stemming and/or lemmatization
• Digit/punctuation removal
• Case normalization
• Arabic: remove diacritics
• Remove redundant spaces
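A minimal version of these preprocessing steps can be sketched in pure Python. Stemming/lemmatization and Arabic diacritic removal are omitted here because they need a library such as NLTK, and the stopword list is a tiny illustrative stand-in:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # tiny stand-in list

def preprocess(text):
    text = text.lower()                   # case normalization
    text = re.sub(r"\d+", " ", text)      # digit removal
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
    tokens = text.split()                 # tokenization; also collapses redundant spaces
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The lead paint is unsafe!  Tested in 1990."))
# ['lead', 'paint', 'unsafe', 'tested']
```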
NLP Pipeline: Feature Engineering / Text Representation
• Feature Engineering for Classical ML
• Bag-of-words representations
• Domain-specific word frequency lists
• Handcrafted features based on domain-specific knowledge
• Feature Engineering for DL
• DL directly takes the texts as inputs to the model.
• The DL model is capable of learning features from the texts (e.g.,
embeddings)
• The price is that the model is often less interpretable.
NLP Pipeline: Bag of Words Model (Binary)
• The bag-of-words model is the simplest way (i.e., easiest to automate) to
vectorize texts into binary representations.
NLP Pipeline: Bag of Words Model (Count)
• The bag-of-words model is the simplest way (i.e., easiest to automate) to
vectorize texts into numeric (count) representations.
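Both the binary and the count variants can be sketched in a few lines of pure Python (the two-document corpus below is made up for illustration):

```python
from collections import Counter

docs = ["the lead paint is unsafe", "the paint is fresh"]  # toy corpus

# Vocabulary: sorted union of all words across documents.
vocab = sorted({w for d in docs for w in d.split()})

def bow_binary(doc):
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]  # presence/absence

def bow_count(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]  # raw frequency

print(vocab)               # ['fresh', 'is', 'lead', 'paint', 'the', 'unsafe']
print(bow_count(docs[0]))  # [0, 1, 1, 1, 1, 1]
print(bow_binary(docs[1])) # [1, 1, 0, 1, 1, 0]
```

Note that both vectors discard word order entirely, which is exactly the limitation discussed on the next slide.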
NLP Pipeline: Bag of Words Model
• Issues with Bag-of-Words Text Representation
• Word order is ignored.
• Raw absolute frequency counts of words do not necessarily represent the
meaning of the text properly.
NLP Pipeline: TF-IDF Model
• TF-IDF model is an extension of the bag-of-words model, whose main
objective is to adjust the raw frequency counts by considering the
dispersion of the words in the corpus.
• Dispersion refers to how evenly each word/term is distributed across
different documents of the corpus.
• Interaction between Word Raw Frequency Counts and Dispersion:
• Given a high-frequency word:
• If the word is widely dispersed across different documents of the corpus (i.e., high dispersion)
• it is more likely to be semantically general.
• If the word is mostly centralized in a limited set of documents in the corpus (i.e., low
dispersion)
• it is more likely to be topic-specific.
• Dispersion rates of words can be used as weights for the importance of
word frequency counts.
NLP Pipeline: TF-IDF Model
TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of term t in
document d and IDF(t) = log(N / df(t)) down-weights terms that appear in many
of the N documents in the corpus.
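A worked sketch in pure Python, using one common TF/IDF variant (relative term frequency and log N/df; many other weighting variants exist, and the toy corpus is made up):

```python
import math

docs = [["the", "lead", "paint"], ["the", "paint", "is", "fresh"], ["the", "dog"]]

def tf(term, doc):
    return doc.count(term) / len(doc)  # relative frequency in the document

def idf(term, corpus):
    df = sum(1 for d in corpus if term in d)  # number of docs containing the term
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" occurs in every document (high dispersion) -> weight 0
print(tfidf("the", docs[0], docs))             # 0.0
# "lead" occurs in one document (low dispersion) -> positive weight
print(round(tfidf("lead", docs[0], docs), 3))  # 0.366
```

This reproduces the slide's point numerically: the widely dispersed word “the” gets zero weight, while the topic-specific word “lead” keeps a positive one.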
NLP Pipeline: Modeling & Evaluation
• More details in Weeks 2, 3, and 6 of the program.
Further Resources
• Deep Learning for NLP in Python – DataCamp
https://learn.datacamp.com/skill-tracks/deep-learning-for-nlp-in-python
• Natural Language Processing Specialization – Coursera
https://www.coursera.org/specializations/natural-language-processing
• Speech and Language Processing – Book
https://web.stanford.edu/~jurafsky/slp3/
Any Questions?
Thank You
