MACHINE-DRIVEN TEXT ANALYSIS
An Introduction to NLP in
Python
Massimo Schenone
Senior Consultant
What is NLP?
Natural Language Processing
NLP is all about creating systems that process or “understand”
human language in order to perform certain tasks
NLP USE IN BUSINESS
Creditworthiness assessment
Neural machine translation
Hiring and recruitment
Chatbots
Sentiment analysis
Advertising
Market intelligence
Healthcare
A LITTLE BIT OF HISTORY
1950: A. Turing, "Computing Machinery and Intelligence"
1954: Georgetown-IBM experiment, automatic translation of Russian sentences into English
1960s: SHRDLU, restricted "blocks worlds"
1964-66: ELIZA, simulated conversation
1970s: “conceptual ontologies” as computational models
Late 80s-90s: statistical revolution
Mid 2000s: machine learning
2010s: deep learning
Today: the ImageNet moment of NLP
A changing field
Rules-based vs statistical approach
✔
Until the end of the 80s, NLP systems were designed by hand-coding
a set of rules, an approach that is rarely robust to natural
language variation
✔
Machine-learning paradigm: using statistical inference to
automatically learn such rules through the analysis of large
corpora of typical real-world examples
PROs:
• focus on the most common cases
• robust to unfamiliar or erroneous input
• more accurate simply by supplying more input data
CONs:
• data availability
• precision and accuracy
NLP Tools
● Context-free grammars
● Regular expressions
● Tokenization
● Parse trees
● N-grams
● Linear algebra
● Statistical inference
● Neural nets
● Word embeddings
● Machine and deep learning
Common use Python libraries
nltk: very broad NLP library
spaCy: parse trees, tokenizer, opinionated
gensim: topic modeling and similarity
fasttext: text classification and representation learning
sklearn: general purpose Python ML library
fastai: built on top of PyTorch
TF.Text: a collection of text related classes and ops ready
to use with TensorFlow 2.0
GETTING STARTED
The Corpora
A corpus usually contains raw text (in ASCII or
UTF-8) and any metadata associated with the text.
> import nltk
> from nltk.corpus import words
> from nltk.corpus import reuters
> from nltk.corpus import brown
> brown.categories()
['adventure', 'belles_lettres', 'editorial',
'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news',
'religion', 'reviews', 'romance',
'science_fiction']
Tokenization
The process of breaking a text down into tokens
Types are unique tokens present in
a corpus. The set of all types in a
corpus is its vocabulary.
Words can be distinguished as
content words and stopwords.
The first step of the pipeline,
just after cleaning
Tokenizing text in Python
Pure Python, spaCy or NLTK can be used
In spaCy tokenization is
done by applying rules
specific to each language
[≠ text.split()]
NLTK features a tweet
tokenizer which
preserves #hashtags,
@handles and smileys
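A minimal sketch of both approaches (the model name and sample sentences are illustrative):

    import spacy
    from nltk.tokenize import TweetTokenizer

    nlp = spacy.load("en_core_web_sm")
    print([token.text for token in nlp("Let's go to N.Y.!")])
    # ['Let', "'s", 'go', 'to', 'N.Y.', '!']

    print(TweetTokenizer().tokenize("@pycon this talk is great! #nlp :-)"))
    # ['@pycon', 'this', 'talk', 'is', 'great', '!', '#nlp', ':-)']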
Lemmatization
Lemmas are root forms of words.
spaCy extracts lemmas using a predefined
dictionary (for English, derived from WordNet)
Stemming: the poor man’s lemmatization,
truncates the word to its stem (arguing →
argu)
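A quick spaCy sketch (the sample phrase is illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for token in nlp("she was arguing with the geese"):
        print(token.text, "→", token.lemma_)
    # e.g. was → be, arguing → argue, geese → goose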
WordNet
A large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of
cognitive synonyms (synsets), each expressing a distinct concept.
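A small NLTK sketch (requires nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("car")[:3]:
        print(synset.name(), "-", synset.definition())
    # car.n.01 - a motor vehicle with four wheels; ...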
Grammatical analysis
spaCy provides a variety of linguistic annotations
to give you insights into a text’s grammatical
structure
The loaded statistical models enable spaCy
to predict linguistic annotations – for
example, whether a word is a verb or a
noun (part-of-speech or POS tagging)
or whether a noun is the subject of a
sentence, or the object
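A minimal POS-tagging sketch (the sentence is illustrative):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    for token in nlp("She sells seashells on the seashore"):
        print(token.text, token.pos_, token.dep_)
    # She PRON nsubj
    # sells VERB ROOT
    # ...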
Dependency parsing
displaCy is one of the features that makes spaCy a
nice tool
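A sketch of rendering a dependency parse (the sentence comes from the spaCy docs):

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
    # renders inline in a notebook; use displacy.serve(doc, style="dep") from a script
    displacy.render(doc, style="dep")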
Named Entity Recognition (NER)
Labelling named “real-world” objects, like persons,
companies or locations.
The spaCy pretrained model performs pretty well (at least in English).
Again you can use displaCy to get a beautiful visualization of the NE annotated
sentence.
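A minimal sketch, reusing the pipeline and displacy import from the sketch above (sentence from the spaCy docs):

    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Apple ORG / U.K. GPE / $1 billion MONEY

    displacy.render(doc, style="ent")  # highlights the entities in context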
PUTTING IT ALL TOGETHER
TEXT REPRESENTATION
WHY REPRESENTATION IS IMPORTANT
A text representation scheme must facilitate the
extraction of features
The semantics (meaning) of a sentence emerges from the
following four steps:
● Break the sentence into lexical units
● Derive the meaning of each unit
● Understand syntactic (grammatical) structure of the sentence
● Understand the context in which the sentence appears
TEXT REPRESENTATION IS NOT EASY
Images and sounds have a natural digital representation scheme;
for text there is no obvious one.
HOW TO FEED A STATISTICAL MODEL?
Machines do not understand text; they are good at crunching
numbers
Statistics and linear algebra work with numbers
Machine learning algorithms assume that all features used to
represent an observation are numeric
Text representation is the conversion from raw text to a
suitable numerical form
Legacy techniques
● One-hot encoding
● Bag of words
● N-gram
● TF-IDF
One-hot encoding
Every element is zero except the one corresponding
to a specific word
import numpy as np

# word_dict maps each vocabulary word to its index
def one_hot(word, word_dict):
    vector = np.zeros(len(word_dict))
    vector[word_dict[word]] = 1
    return vector

print(one_hot("paris", word_dict))
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
No information about words relations
Must pre-determine vocabulary size
Size of input vector scales with size of vocabulary
“Out-of-Vocabulary” (OOV) problem
Bag of words
A vector representation of a text produced by
simply adding up all the one-hot encoded vectors:
# Sum the one-hot encoded vectors of every word in the text
bow = np.zeros(vocabulary_size)
for word in text_words:
    hot_word = one_hot(word, word_dict)
    bow += hot_word

print(bow)
[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]

bow[word_dict["paris"]]
1.0
The vector simply contains the number of
times each word appears in our document.
Orderless
No notion of similarity
N-gram model
A contiguous sequence of n items from a given
sample of text
Vocabulary = set of all n-grams in corpus
No notion of similarity
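A minimal NLTK sketch (the sample sentence is illustrative):

    from nltk.util import ngrams

    tokens = "the car is driven on the road".split()
    print(list(ngrams(tokens, 2)))
    # [('the', 'car'), ('car', 'is'), ('is', 'driven'), ('driven', 'on'), ...]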
Collocations
A sequence of words that occur together unusually often
nltk.collocations can help identify
phrases that act like single words
In the example below, bi-grams are paired with a
"more likely to occur" score
Term Frequency
Intuitively, we expect the frequency with which a given word is mentioned
should correspond to the relevance of that word for the piece of text we are
considering.
Are very frequent words
really meaningful!?
→ stopwords
Stopwords
Words that are very frequent but not meaningful
Remove the most common 100 words
Use spaCy predefined stopwords
Use nltk predefined stopwords
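Minimal sketches of both options (the word list is illustrative):

    from nltk.corpus import stopwords              # requires nltk.download('stopwords')
    from spacy.lang.en.stop_words import STOP_WORDS

    words = "the car is driven on the road".split()
    nltk_stops = set(stopwords.words("english"))
    print([w for w in words if w not in nltk_stops])  # ['car', 'driven', 'road']
    print([w for w in words if w not in STOP_WORDS])  # same idea with spaCy's list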
TF–IDF
Reflects how important a word is to a document in a corpus
Idea: importance increases proportionally to the frequency of a word in the document; but is
inversely proportional to the frequency of the word in the corpus
The tf–idf is the product of two statistics, term frequency and inverse document frequency.
TF–IDF
A toy example may help make this clear
D1: “The car is driven on the road.”
D2: “The truck is driven on the highway.”
len(D1) = len(D2) = 7
TF-IDF(t) = TF(t) * log(N / DF(t))
N=2
The TF-IDF of words common to both documents is zero: they are not significant.
TF–IDF
Let’s now code TF-IDF in Python from scratch
Compute the TF score for each word
in the corpus, by document
Compute the IDF score of every
word in the corpus
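The original code is not reproduced here; a minimal from-scratch sketch over the toy corpus above could look like this:

    import math

    corpus = ["the car is driven on the road",
              "the truck is driven on the highway"]
    docs = [text.split() for text in corpus]

    def tf(word, doc):
        return doc.count(word) / len(doc)           # term frequency within a document

    def idf(word, docs):
        df = sum(1 for doc in docs if word in doc)  # document frequency
        return math.log(len(docs) / df)

    for doc in docs:
        print({word: round(tf(word, doc) * idf(word, docs), 3) for word in doc})
    # words shared by both documents get a score of 0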
TF–IDF
TF-IDF implementation using sklearn
Under the hood two methods are executed:
fit: learn vocabulary and idf from training set
transform: transform documents to
document-term matrix
Note: terms that appear in every document are not suppressed entirely (their score is 0.6, not 0!?): sklearn smooths the IDF by default, so even terms with zero raw IDF keep a non-zero weight.
TF–IDF
TF-IDF implementation using sklearn and stopwords
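A minimal sketch of both sklearn variants (get_feature_names_out requires sklearn >= 1.0; older versions use get_feature_names):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["The car is driven on the road.",
              "The truck is driven on the highway."]

    vectorizer = TfidfVectorizer()        # add stop_words='english' for the stopword variant
    X = vectorizer.fit_transform(corpus)  # fit + transform in one call
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))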
VECTOR MODELS
VECTOR SPACE MODELS
Represent text units as vectors of numbers
Each dimension corresponds to a separate term (words, keywords, phrases).
D = (t1, t2, …, tn)
If a term occurs in the document, its value in the
vector is non-zero (TF-IDF weighting)
We can choose other features to fill the vector
Vector operations can be used to compare documents
with queries
COSINE SIMILARITY
Measure of similarity between two non-zero vectors
Relevance rankings of documents in a keyword search can be calculated by comparing
the cosine of angles between each document vector and the original query vector
The resulting similarity ranges from −1 meaning
exactly opposite, to 1 meaning exactly the same
For text matching, the attribute vectors A and B are
usually the term frequency vectors of the documents.
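The formula is cos(θ) = (A · B) / (‖A‖ ‖B‖); a one-function numpy sketch:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707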
WORD EMBEDDINGS
The state-of-the-art text representation
An approach that provides a dense vector representation of words, capturing something about
their meaning
The central idea of word embedding training is that similar words are typically surrounded by the
same “context” words in normal use
“You shall know a word by the company it keeps” – Firth, J.R. (1957)
WORD EMBEDDINGS
Popular Word Embedding Algorithms
Skip-Gram
Continuous Bag of Words (CBOW)
Word2Vec (2013) by Google: implements both the CBOW and Skip-Gram training techniques
“Global Vectors for Word Representation” (GloVe), from a team at Stanford University
fastText, created by Facebook's AI group
WORD EMBEDDINGS IN ACTION
Word vectors let you import knowledge from raw text
into your model
● We can represent words as vectors of numbers
● We can easily calculate how similar vectors are to each other
● We can add and subtract word embeddings and arrive at interesting results
● The most famous example is the formula: “king” - “man” + “woman”
Coloring the cells based on their values, we can easily compare two or more vectors
WORD EMBEDDINGS WITH SPACY
spaCy comes shipped with pre-trained models!
spaCy’s small models (all packages
that end in _sm) don’t ship with word
vectors, and only include context-sensitive
tensors (POS, NER, etc.)
You can still use the similarity()
methods to compare documents
and tokens, but the result won’t be
as good
en_vectors_web_lg model provides
300-dimensional GloVe vectors for
over 1 million terms of English
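A minimal sketch (any model with vectors works, e.g. en_core_web_md; the sentences are illustrative):

    import spacy

    nlp = spacy.load("en_vectors_web_lg")  # or en_core_web_md / en_core_web_lg
    doc1 = nlp("I like salty fries and hamburgers.")
    doc2 = nlp("Fast food tastes very good.")
    print(doc1.similarity(doc2))           # document-level similarity from word vectors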
WORD EMBEDDINGS WITH SPACY
(continued)
We now need to find the closest vector in the vocabulary to the result
of “king - man + woman”
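A commonly used sketch, assuming the large-vectors model; scanning the whole vocabulary is slow but simple:

    import numpy as np
    import spacy

    nlp = spacy.load("en_vectors_web_lg")
    king, man, woman = (nlp.vocab[w].vector for w in ("king", "man", "woman"))
    target = king - man + woman

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # rank lowercase, vector-bearing lexemes by similarity to the target vector
    closest = sorted((lex for lex in nlp.vocab if lex.has_vector and lex.is_lower),
                     key=lambda lex: cosine(lex.vector, target), reverse=True)
    print([lex.text for lex in closest[:5]])  # 'queen' is expected near the top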
SOME HINTS ABOUT RECENT
TRENDS IN NLP
DEEP LEARNING IN ONE SLIDE
FASTTEXT
An open-source NLP library developed by Facebook AI
(2016)
Its goal is to provide word embeddings and text classification efficiently through its shallow neural
network implementation
Its accuracy is on par with deep neural networks, while requiring far less
time to train
Models can be saved and later reduced in size to fit even on mobile devices
FASTTEXT
Word representation
Training:
Word vector:
The dimension (dim) controls the size of the vectors (default 100)
The subwords are all the substrings contained in a word between
the minimum size (minn) and the maximal size (maxn), 3 to 6 characters by default
Parameters:
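A hedged sketch with the fasttext Python API (data.txt is a placeholder for your training file, one text per line):

    import fasttext

    model = fasttext.train_unsupervised("data.txt", model="skipgram",
                                        dim=100,         # size of the vectors
                                        minn=3, maxn=6)  # subword sizes
    print(model.get_word_vector("king"))  # a 100-dimensional vector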
FASTTEXT
Testing your model
A simple way to check the quality of a word vector is to look at its nearest neighbors
This gives an intuition of the type of semantic information the vectors are able to capture
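For example (reusing the model trained above; scores are illustrative):

    print(model.get_nearest_neighbors("king"))
    # [(0.78, 'kings'), (0.73, 'queen'), ...]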
A USE CASE: SENTIMENT ANALYSIS
Identify and extract opinions within a given text
import re

# pos and neg are assumed lists of opinion words (e.g. from an opinion lexicon)
def extract_words(text):
    return re.findall(r"[a-z']+", text)

# +1 for positive words, -1 for negative ones
valence = {}
for word in pos:
    valence[word.lower()] = 1
for word in neg:
    valence[word.lower()] = -1

def sentiment(text, valence):
    words = extract_words(text.lower())
    word_count = 0
    score = 0
    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1
    return score / word_count

texts = ["I'm very happy",
         "The product is pretty annoying, and I hate it",
         "I'm sad"]
for text in texts:
    print(sentiment(text, valence))
1.0
-0.3333333333333333
-1.0
The simplest approach is counting positive and negative
words using opinion lists, but ...
Modifier words ("very", "much", "not", "pretty",
"somewhat") change the meaning
→ We need real valued weights
Sentiment analysis from scratch
NLTK’s VADER
A lexicon and rule-based sentiment analysis tool that is
specifically attuned to sentiments expressed in social
media
Positive + Neutral + Negative = 1
-1 < Compound < +1
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
score = analyzer.polarity_scores(sentence)
VADER analyses sentiments primarily
based on certain key points:
● Punctuation (cool!)
● Capitalization (GREAT)
● Degree modifiers (very good)
● Conjunctions (good but I hate it)
VADER: QUICK AND DIRTY
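A minimal end-to-end sketch (the sentences are illustrative):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")
    analyzer = SentimentIntensityAnalyzer()
    for sentence in ["The movie was GREAT!", "The movie was good but I hate the ending"]:
        print(sentence, analyzer.polarity_scores(sentence))
    # each result has 'neg', 'neu', 'pos' (summing to 1) and 'compound' in [-1, +1]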
USING FASTTEXT
We use the sentiment140 dataset, which contains 1,600,000 tweets annotated with two keys:
0 = negative, 4 = positive
fastText uses the __label__ prefix to distinguish labels from words
Data preprocessing and cleaning
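A hedged preprocessing sketch (it assumes the standard sentiment140 CSV layout; heavier cleaning such as stripping URLs and @mentions is omitted):

    import pandas as pd

    # in sentiment140, column 0 is the target (0/4) and column 5 the tweet text
    df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                     encoding="latin-1", header=None, usecols=[0, 5],
                     names=["target", "text"])
    df["label"] = df["target"].map({0: "__label__negative", 4: "__label__positive"})

    with open("tweets.train", "w") as out:
        for label, text in zip(df["label"], df["text"]):
            out.write(f"{label} {text.lower()}\n")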
USING FASTTEXT
Training the classifier
Using bigrams
Also increasing the learning rate
Increasing epochs
Default parameters: epoch=5, lr=0.1, wordNgrams=1
Precision Recall
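A hedged sketch of those runs (the training file name follows the preprocessing sketch above):

    import fasttext

    # defaults: epoch=5, lr=0.1, wordNgrams=1
    model = fasttext.train_supervised("tweets.train")

    # bigrams, more epochs and a higher learning rate
    model = fasttext.train_supervised("tweets.train",
                                      epoch=25, lr=1.0, wordNgrams=2)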
USING FASTTEXT
Evaluate the model predictions
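model.test returns the number of examples along with precision and recall at 1 (a held-out file is assumed):

    n, precision, recall = model.test("tweets.valid")
    print(n, precision, recall)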
USING FASTTEXT
Classify tweets
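For example (the label and probability shown are illustrative):

    print(model.predict("this is a wonderful day"))
    # (('__label__positive',), array([0.93]))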
CONCLUSIONS
Large volumes of data are crucial to the success of a machine learning
project, but having clean, high-quality data is just as important (ImageNet
moment of NLP)
Sometimes rules-based approaches (still) work better, e.g. VADER
Don’t fall in love with tools or algorithms, feel free to build the best possible
environment to process your data and get the job done!
A BIG THANKS!
Massimo Schenone
Some online stuff
● Introduction to Information Retrieval
https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
● Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
● Natural Language Processing From Scratch
https://2017.pygotham.org/talks/natural-language-processing-from-scratch/
● Advanced NLP with spaCy
https://course.spacy.io
● Sentiment analysis
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
● The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/