Natural language processing (NLP) involves developing systems that can process and understand human language. This document discusses NLP tools and techniques for representing text numerically so it can be analyzed by machine learning algorithms. It covers topics like tokenization, part-of-speech tagging, named entity recognition, vector space models, term frequency-inverse document frequency (TF-IDF) weighting, and word embeddings, which represent words as dense vectors of numbers. Popular Python libraries for NLP and text analysis are also introduced.
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages and
especially the inherent ambiguity of language make MT a very difficult problem. Traditional
approaches to MT have relied on humans supplying linguistic knowledge in the form of rules
to transform text in one language to another. Given the vastness of language, this is a highly
knowledge intensive task. Statistical MT is a radically different approach that automatically
acquires knowledge from large amounts of training data. This knowledge, which is typically
in the form of probabilities of various language features, is used to guide the translation
process. This report provides an overview of MT techniques, and looks in detail at the basic
statistical model.
A presentation from my thesis defense on text summarization. It discusses existing state-of-the-art models along with the efficiency of Abstract Meaning Representation (AMR) for text summarization, and shows how AMRs can be used with seq2seq models. We also discuss other techniques such as Byte Pair Encoding (BPE) and its effectiveness for the task, and examine how data augmentation with POS tags and AMRs affects summarization with seq2seq learning.
Abstractive text summarization is nowadays one of the most important research topics in NLP. However, a deep understanding of what it is and how it works requires a series of foundational pieces of knowledge that build on top of each other. This presentation therefore gives the audience an overview of sequence-to-sequence models and the various versions of attention developed over the past few years. In addition, natural language generation (NLG) is reviewed with a focus on decoder techniques and their associated problems, as a supporting factor in the success of automatic summarization. Finally, abstractive text summarization is presented together with potential approaches to open issues in recent research papers.
Intent Classifier with Facebook fastText
Facebook Developer Circle, Malang
22 February 2017
These are the slides for a Facebook Developer Circle meetup, aimed at beginners.
Word embedding, vector space model, language modelling, neural language model, Word2Vec, GloVe, fastText, ELMo, BERT, DistilBERT, RoBERTa, SBERT, Transformer, attention
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r... (Marcin Junczys-Dowmunt)
Video recording of full talk: http://lectures.ms.mff.cuni.cz/view.php?rec=255
There has been an increasing interest in using statistical machine translation (SMT) for the task of Grammatical Error Correction (GEC) for English-as-a-Second-Language (ESL) learners. Two of the three highest-scoring systems of the CoNLL-2014 Shared Task were SMT-based. The currently highest-scoring result published for the CoNLL-2014 test set has been achieved by a system combination of the five best CoNLL-2014 submissions built with MEMT (a tool for MT system combination). In this talk, we demonstrate how a single SMT-based system can match and outperform the result of the mentioned combined system. Furthermore, this system outperforms any other published results (including our own CoNLL-2014 submission) for a single system by a margin of several percent F-score when the same training data is used. These results are achieved by adapting current state-of-the-art methods for phrase-based SMT specifically to the problem of GEC.
We report on the effects of:
- Parameter tuning for GEC
- Introducing GEC-specific dense and sparse features
- Using large-scale data
Natural Language Processing: L01 introduction (ananth)
This presentation introduces the course Natural Language Processing (NLP) by enumerating a number of applications, course positioning, challenges presented by Natural Language text and emerging approaches to topics like word representation.
A brief overview of Natural Language Processing using the Python module NLTK. The code for the demonstrations can be found at the GitHub link given in the references slide.
Automatic text summarization is the process of reducing the text content of a document while retaining its
important points. Generally, there are two approaches to automatic text summarization:
extractive and abstractive. The process of extractive text summarization can be divided into two
phases: pre-processing and processing. In this paper, we discuss some of the extractive
summarization approaches used by researchers, along with the features used in the extractive
summarization process. We also present the available linguistic preprocessing tools, with their features,
that are used for automatic text summarization, and discuss the tools and parameters useful for evaluating the
generated summaries. Moreover, we explain our proposed lexical chain
analysis approach, with sample generated lexical chains, for extractive automatic text summarization,
and provide the evaluation results of our system-generated summary. The proposed lexical chain
analysis approach can also be applied to other text mining problems such as topic classification, sentiment
analysis, and summarization.
Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
Getting started on your natural language processing project? First you'll need to extract some features from your corpus. Frequency counts, syntax parsing, and word vectors are good ones to start with.
Presentation of "Challenges in transfer learning in NLP" from the Madrid Natural Language Processing Meetup event, May 2019.
https://www.meetup.com/es-ES/Madrid-Natural-Language-Processing-meetup/
Practical related work in repository: https://github.com/laraolmos/madrid-nlp-meetup
Masterclass: Natural Language Processing in Trading with Terry Benzschawel & ... (QuantInsti)
Terry Benzschawel (Founder and Principal at Benzschawel Scientific, LLC) and Ishan Shah (AVP, Content & Research at QuantInsti) helm this Masterclass about Natural Language Processing in Trading.
In this session, Terry and Ishan discuss:
- How is Natural Language Processing applied in financial markets?
- Calculating Daily Sentiment Score on Quantra learning portal
- Comparing different word embedding methods and their pros and cons
- How does Quantra learning portal provide a unique learning experience?
Summary:
This introductory webinar describes the use of Natural Language Processing (NLP) techniques in the context of building trading strategies for 1-day horizons for the corporate bond market and equities markets using news headlines.
We describe various methods for converting text into numerical representations and for extracting sentiment scores from those embeddings.
Looking ahead to the course, we find that approaches using the latest advances in NLP, which take news headlines directly as inputs instead of news-headline sentiments, are better suited to predicting future returns in credit indices.
About us:
Quantra is an online education portal that specializes in Algorithmic and Quantitative trading. Quantra offers various bite-sized, self-paced and interactive courses that are perfect for busy professionals, seeking implementable knowledge in this domain.
Useful links and some bonus content:
[Blog] Step By Step Guide - http://bit.ly/StepByStepGuideNLP
[Course] Natural Language Processing in Trading - http://bit.ly/QuantraNLPT
[Courses] All Quantra courses - http://bit.ly/AllQuantraCourses
Find more info on - https://quantra.quantinsti.com/
Like us on Facebook: https://www.facebook.com/goquantra/
Follow us on Twitter: https://twitter.com/GoQuantra
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
Machine translation from English to Hindi (Rajat Jain)
Machine translation is a part of natural language processing. The suggested algorithm is a word-based algorithm; we have implemented translation from English to Hindi.
Submitted by:
Garvita Sharma, 10103467, B3
Rajat Jain, 10103571, B6
2. What is NLP?
Natural Language Processing
NLP is all about creating systems that process or “understand”
human language in order to perform certain tasks
3. NLP USE IN BUSINESS
Creditworthiness assessment
Neural machine translation
Hiring and recruitment
Chatbots
Sentiment analysis
Advertising
Market intelligence
Healthcare
4. LITTLE BIT OF HISTORY
1950: A. Turing, "Computing Machinery and Intelligence"
1954: Georgetown-IBM experiment, automatic translation of Russian sentences into English
1964-66: ELIZA, simulated conversation
1960s-70s: SHRDLU, restricted "blocks worlds"
Late 80s-90s: statistical revolution
Mid 2000s: "conceptual ontologies" as computational models; machine learning
2010s: deep learning
Today: the ImageNet moment of NLP
5. A changing field
Rules-based vs statistical approach
✔ Until the end of the 80s, NLP systems were designed by hand-coding
a set of rules: this is rarely robust to natural language variation
✔ Machine-learning paradigm: using statistical inference to
automatically learn such rules through the analysis of large
corpora of typical real-world examples
PROs:
• focus on the most common cases
• robust to unfamiliar or erroneous input
• more accurate simply by supplying more input data
CONs:
• data availability
• precision and accuracy
6. NLP Tools
● Context-free grammars
● Regular expressions
● Tokenization
● Parse trees
● N-grams
● Linear algebra
● Statistical inference
● Neural nets
● Word embeddings
● Machine and deep learning
7. Commonly used Python libraries
nltk: very broad NLP library
spaCy: parse trees, tokenizer, opinionated
gensim: topic modeling and similarity
fasttext: text classification and representation learning
sklearn: general purpose Python ML library
fastai: built on top of PyTorch
TF.Text: a collection of text related classes and ops ready
to use with TensorFlow 2.0
9. The Corpora
A corpus usually contains raw text (in ASCII or
UTF-8) and any metadata associated with the text.
> import nltk
> from nltk.corpus import words
> from nltk.corpus import reuters
> from nltk.corpus import brown
> brown.categories()
['adventure', 'belles_lettres', 'editorial',
'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news',
'religion', 'reviews', 'romance',
'science_fiction']
10. Tokenization
The process of breaking a text down into tokens
Types are unique tokens present in
a corpus. The set of all types in a
corpus is its vocabulary.
Words can be distinguished as
content words and stopwords.
The first step of the pipeline,
just after cleaning
11. Tokenizing text in Python
Pure Python, spaCy or NLTK can be used
In spaCy tokenization is
done by applying rules
specific to each language
[≠ text.split()]
NLTK features a tweet
tokenizer which
preserves #hashtags,
@handles and smileys
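The code screenshots from this slide are not reproduced here; as a rough stand-in, here is a pure-Python regex sketch of tokenization (far simpler than spaCy's per-language rules or NLTK's TweetTokenizer, and only an approximation of them):

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercases and keeps word-internal
    apostrophes, so "Don't" stays one token (unlike text.split())."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def tweet_tokenize(text):
    """Tweet-style variant: additionally preserves #hashtags and @handles."""
    return re.findall(r"[#@]?[a-z0-9_]+(?:'[a-z]+)?", text.lower())

print(tokenize("Don't split contractions, please!"))
# ["don't", 'split', 'contractions', 'please']
print(tweet_tokenize("@nlp_fan loves #python!"))
# ['@nlp_fan', 'loves', '#python']
```

Real tokenizers handle many more cases (URLs, emoji, hyphenation); this only illustrates why tokenization is more than `text.split()`.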
12. Lemmatization
Lemmas are root forms of words.
spaCy extracts lemmas using per-language rules and
lookup tables (for English, derived from the WordNet
lexical database)
Stemming: the poor man's lemmatization; it
truncates the word to its stem (arguing → argu)
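To see how crude stemming is compared to dictionary-based lemmatization, here is a toy suffix-stripping stemmer (a sketch for illustration only, not the real Porter algorithm):

```python
def stem(word):
    """Toy stemmer: strip a common inflectional suffix, keeping at
    least three characters of stem. Real stemmers (e.g. Porter) apply
    many ordered rewrite rules instead of this single pass."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("arguing"))  # argu
print(stem("cars"))     # car
```

Unlike a lemmatizer, this cannot map irregular forms ("went" → "go"); it only chops endings.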
13. WordNet
A large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of
cognitive synonyms (synsets), each expressing a distinct concept.
14. Grammatical analysis
spaCy provides a variety of linguistic annotations
to give you insights into a text’s grammatical
structure
The loaded statistical models enable spaCy
to predict linguistic annotations – for
example, whether a word is a verb or a
noun (part-of-speech or POS tagging),
or whether a noun is the subject of a
sentence or the object
16. Named Entity Recognition (NER)
Labelling named “real-world” objects, like persons,
companies or locations.
The spaCy pretrained model performs pretty well (at least in English).
Again you can use displaCy to get a beautiful visualization of the NE annotated
sentence.
19. WHY REPRESENTATION IS IMPORTANT
Text representation scheme must facilitate the
extraction of the features
The semantics (meaning) of a sentence comes from these
four steps:
● Break the sentence into lexical units
● Derive the meaning of each unit
● Understand the syntactic (grammatical) structure of the sentence
● Understand the context in which the sentence appears
20. TEXT REPRESENTATION IS NOT EASY
Images and sounds have a natural digital representation scheme
For text there is no obvious way.
21. HOW TO FEED A STATISTICAL MODEL?
Machines do not understand text, they are good at crunching
numbers
Statistics and linear algebra work with numbers
Machine learning algorithms assume that all features used to
represent an observation are numeric
Text representation is the conversion from raw text to a
suitable numerical form
23. One-hot encoding
Every element is zero except the one corresponding
to a specific word
import numpy as np

def one_hot(word, word_dict):
    # word_dict maps each vocabulary word to its index
    vector = np.zeros(len(word_dict))
    vector[word_dict[word]] = 1
    return vector

print(one_hot("paris", word_dict))
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
No information about words relations
Must pre-determine vocabulary size
Size of input vector scales with size of vocabulary
“Out-of-Vocabulary” (OOV) problem
24. Bag of words
A vector representation of a text produced by
simply adding up all the one-hot encoded vectors:
bow = np.zeros(vocabulary_size)
for word in text_words:
    hot_word = one_hot(word, word_dict)
    bow += hot_word

print(bow)
[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1.
1. 1. 1.]
bow[word_dict["paris"]]
1.0
The vector simply contains the number of
times each word appears in our document.
Orderless
No notion of similarity
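A self-contained version of the bag-of-words idea, using a hypothetical three-word vocabulary in place of the slide's larger word_dict:

```python
from collections import Counter

def bag_of_words(text_words, word_dict):
    """Bag-of-words as word counts over a fixed vocabulary.
    word_dict maps word -> index; out-of-vocabulary words are
    silently dropped (the OOV problem from the previous slide)."""
    bow = [0] * len(word_dict)
    for word, count in Counter(text_words).items():
        if word in word_dict:
            bow[word_dict[word]] = count
    return bow

# toy vocabulary for illustration
word_dict = {"paris": 0, "is": 1, "nice": 2}
print(bag_of_words("paris is nice and paris is big".split(), word_dict))
# [2, 2, 1]
```

Note that word order is lost entirely: "paris is nice" and "nice is paris" map to the same vector.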
25. N-gram model
A contiguous sequence of n items from a given
sample of text
Vocabulary = set of all n-grams in corpus
No notion of similarity
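The definition above translates directly into a one-line n-gram extractor in plain Python:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the car is driven on the road".split(), 2))
# [('the', 'car'), ('car', 'is'), ('is', 'driven'),
#  ('driven', 'on'), ('on', 'the'), ('the', 'road')]
```

The n-gram vocabulary grows quickly with n, which is why bigrams and trigrams are the most common choices in practice.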
26. Collocations
A sequence of words that occur together unusually often
nltk.collocations can help identifying
phrases that act like single words
In the example bi-grams are paired with a
"more likely to occur" score
27. Term Frequency
Intuitively, we expect the frequency with which a given word is mentioned
should correspond to the relevance of that word for the piece of text we are
considering.
Very frequent words are
really meaningful!?
→ stopwords
28. Stopwords
Words that are very frequent but not meaningful
Remove the most common 100 words
Use spaCy predefined stopwords
Use nltk predefined stopwords
29. TF–IDF
Reflects how important a word is to a document in a corpus
Idea: importance increases proportionally to the frequency of a word in the document; but is
inversely proportional to the frequency of the word in the corpus
The tf–idf is the product of two statistics, term frequency and inverse document frequency.
30. TF–IDF
A toy example may help to clearly understand
D1: “The car is driven on the road.”
D2: “The truck is driven on the highway.”
len(D1) = len(D2) = 7
TF-IDF (t) = TF(t) * log(N/DF)
N=2
TF-IDF of common words is zero, they are not significant.
31. TF–IDF
Let’s now code TF-IDF in Python from scratch
Compute the TF score for each word
in the corpus, by document
Compute the IDF score of every
word in the corpus
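The code screenshots are not reproduced here; a minimal pure-Python sketch of the same computation, using the toy documents D1 and D2 from the previous slide and the formula TF-IDF(t) = TF(t) * log(N/DF):

```python
import math

D1 = "the car is driven on the road".split()
D2 = "the truck is driven on the highway".split()
docs = [D1, D2]
N = len(docs)

vocab = sorted(set(D1) | set(D2))
# document frequency: in how many documents each term appears
df = {t: sum(t in d for d in docs) for t in vocab}

def tf_idf(term, doc):
    """TF as relative frequency within the document,
    weighted by log(N / DF) across the corpus."""
    tf = doc.count(term) / len(doc)
    return tf * math.log(N / df[term])

print(tf_idf("car", D1))  # > 0: "car" appears only in D1
print(tf_idf("the", D1))  # 0.0: "the" appears in both documents
```

This reproduces the toy result: terms common to every document get log(2/2) = 0 and are judged not significant.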
32. TF–IDF
TF-IDF implementation using sklearn
Under the hood two functions are executed:
fit: learn vocabulary and idf from training set
transform: transform documents to
document-term matrix
Note: terms with zero idf don't get suppressed entirely; their score here is 0.6, because sklearn smooths the idf by default (smooth_idf=True).
35. VECTOR SPACE MODELS
Represent text units as vectors of numbers
Each dimension corresponds to a separate term (words, keywords, phrases).
D = (t1, t2, …, tn)
If a term occurs in the document, its value in the
vector is non-zero (TF-IDF weighting)
We can choose other features to fill the vector
Vector operations can be used to compare documents
with queries
36. COSINE SIMILARITY
Measure of similarity between two non-zero vectors
Relevance rankings of documents in a keyword search can be calculated by comparing
the cosine of angles between each document vector and the original query vector
The resulting similarity ranges from −1 meaning
exactly opposite, to 1 meaning exactly the same
For text matching, the attribute vectors A and B are
usually the term frequency vectors of the documents.
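The cosine formula above, cos(θ) = (A · B) / (|A| |B|), can be coded directly:

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors:
    dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine([1, 0], [-1, 0]))  # -1.0 (exactly opposite)
```

Applied to term-frequency or TF-IDF vectors, this is the ranking function used to compare documents with queries.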
37. WORD EMBEDDINGS
The state-of-the-art text representation
An approach to provide a dense vector representation of words that capture something about
their meaning
The central idea of word embedding training is that similar words are typically surrounded by the
same “context” words in normal use
“You shall know a word by the company it keeps” – Firth, J.R. (1957)
38. WORD EMBEDDINGS
Popular Word Embedding Algorithms
Skip-Gram
Continuous Bag of Words (CBOW)
Word2Vec (2013) by Google - during training uses CBOW and Skip-Gram techniques together
“Global Vectors for Word Representations” (GloVe) from a team at Stanford University
fasttext created by Facebook AI group
39. WORD EMBEDDINGS IN ACTION
Word vectors let you import knowledge from raw text
into your model
● We can represent words as vectors of numbers
● We can easily calculate how similar vectors are to each other
● We can add and subtract word embeddings and arrive at interesting results
● The most famous example is the formula: “king” - “man” + “woman”
Coloring the cells based on their values we can easily compare two or more vectors
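A toy illustration of the “king” - “man” + “woman” arithmetic, using hand-made 3-dimensional vectors (the numbers are invented for illustration, not real trained embeddings):

```python
import math

# Invented toy vectors; dimensions loosely mean
# (royalty, masculinity, femininity).
vectors = {
    "king":  [0.95, 0.90, 0.05],
    "queen": [0.95, 0.05, 0.90],
    "man":   [0.10, 0.90, 0.05],
    "woman": [0.10, 0.05, 0.90],
    "apple": [0.05, 0.10, 0.10],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# king - man + woman, component-wise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# nearest vocabulary word by cosine similarity
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)  # queen
```

With real embeddings the search runs over a vocabulary of hundreds of thousands of 300-dimensional vectors, but the arithmetic is the same.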
40. WORD EMBEDDINGS WITH SPACY
spaCy comes shipped with pre-trained models!
spaCy’s small models (all packages
that end in _sm) don’t ship with word
vectors, and only include
context-sensitive tensors (POS, NER, etc)
You can still use the similarity()
methods to compare documents
and tokens, but the result won’t be
as good
en_vectors_web_lg model provides
300-dimensional GloVe vectors for
over 1 million terms of English
41. WORD EMBEDDINGS WITH SPACY
(continued)
We now need to find the closest vector in the vocabulary to the result
of “king - man + woman”
44. FASTTEXT
An open source NLP library developed by Facebook AI
(2016)
Its goal is to provide word embedding and text classification efficiently through its shallow neural
network implementation
The accuracy of this library is on par with deep neural networks, while requiring far less
time to train
Models can be saved and later reduced in size to even fit on mobile devices
45. FASTTEXT
Word representation
Training:
Word vector:
The dimension (dim) controls the size of the vectors (default 100)
The subwords are all the substrings contained in a word between
the minimum size (minn) and the maximal size (maxn). [3-6]
Parameters:
46. FASTTEXT
Testing your model
A simple way to check the quality of a word vector is to look at its nearest neighbors
This gives an intuition of the type of semantic information the vectors are able to capture
48. Sentiment analysis from scratch
Identify and extract opinions within a given text

The simplest approach is counting positive and negative
words using opinion lists, but modifier words ("very",
"much", "not", "pretty", "somewhat") change the meaning
→ we need real-valued weights

valence = {}
for word in pos:
    valence[word.lower()] = 1
for word in neg:
    valence[word.lower()] = -1

def sentiment(text, valence):
    words = extract_words(text.lower())
    word_count = 0
    score = 0
    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1
    # note: assumes at least one opinion word matched
    return score / word_count

texts = ["I'm very happy",
         "The product is pretty annoying, and I hate it",
         "I'm sad"]

for text in texts:
    print(sentiment(text, valence))

1.0
-0.3333333333333333
-1.0
49. NLTK’s VADER
A lexicon and rule-based sentiment analysis tool that is
specifically attuned to sentiments expressed in social
media
Positive + Neutral + Negative = 1
-1 < Compound < +1
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
score = analyzer.polarity_scores(sentence)
VADER analyses sentiments primarily
based on certain key points:
● Punctuation (cool!)
● Capitalization (GREAT)
● Degree modifiers (very good)
● Conjunctions (good but I hate it)
51. USING FASTTEXT
We use the sentiment140 dataset, which contains 1,600,000 tweets annotated with 2 labels:
0 = negative, 4 = positive
fastText uses the __label__ prefix to distinguish labels from words
Data preprocessing and cleaning
52. USING FASTTEXT
Training the classifier
Using bigrams
Increasing also learning rate
Increasing epochs
Default parameters: epoch=5, lr=0.1, wordNgrams=1
Evaluation: precision and recall
55. CONCLUSIONS
Large volumes of data are crucial to the success of a machine learning
project, but having clean, high-quality data is just as important (ImageNet
moment of NLP)
Sometimes rules-based approaches (still) work better, e.g. VADER
Don’t fall in love with tools or algorithms, feel free to build the best possible
environment to process your data and get the job done!
57. Some online stuff
● Introduction to Information Retrieval
https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
● Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
● Natural Language Processing From Scratch
https://2017.pygotham.org/talks/natural-language-processing-from-scratch/
● Advanced NLP with spaCy
https://course.spacy.io
● Sentiment analysis
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
● The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/