This document provides an overview and introduction to representation learning of text, specifically word vectors. It discusses older techniques like bag-of-words and n-grams, and then introduces modern distributed representations like word2vec's CBOW and Skip-Gram models as well as the GloVe model. The document covers how these models work, are evaluated, and techniques to speed them up like hierarchical softmax and negative sampling.
Continuous representations of words and documents, recently referred to as word embeddings, have demonstrated large advances in many natural language processing tasks.
In this presentation we will provide an introduction to the most common methods of learning these representations, as well as the methods used to build them before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we will present the continuous bag-of-words model (CBOW), one of the most successful models for word embeddings and one of the core models in word2vec, and take a brief look at other models that build representations for other tasks, such as knowledge base embeddings.
Finally, we will motivate the potential of using such embeddings for many tasks that could be of importance for the group, such as semantic similarity, document clustering and retrieval.
3. Introduction
Examples of NLP tasks:
Easy
• Spell Checking
• Keyword Search
• Finding Synonyms
Medium
• Parsing information from websites, documents, etc.
4. Hard
• Machine Translation (e.g. Translate Chinese text to English)
• Semantic Analysis (What is the meaning of query statement?)
• Co-reference (e.g. What does "he" or "it" refer to given a document?)
• Question Answering (e.g. Answering Jeopardy questions).
The first and arguably most important common denominator across
all NLP tasks is how we represent text as input to our models.
5. • Machines do not understand text.
• We need a numeric representation.
• This is an integral part of any NLP pipeline.
• Unlike images (RGB matrix), text has no obvious numeric representation.
Legacy Techniques*
• Bag of words
• N-gram
• TF-IDF
* Details in appendix
6. Bottom Line
• More often than not, how rich your input representation is has a huge bearing
on the quality of your downstream ML models.
• For NLP, archaic techniques treat words as atomic symbols, so every pair of
distinct words is equally far apart.
• They have no notion of either syntactic or semantic similarity
between parts of language.
• This is one of the chief reasons for the poor or mediocre performance of
NLP-based models.
But this has changed dramatically in the past few years
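A small sketch of the "atomic symbols" point above, assuming a hypothetical three-word vocabulary and one-hot encoding as the atomic representation. The helper names are illustrative only.

```python
import numpy as np

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary


def one_hot(word, vocab):
    """Return the one-hot vector for `word` over `vocab`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec


def distance(a, b):
    """Euclidean distance between the one-hot vectors of two words."""
    return float(np.linalg.norm(one_hot(a, vocab) - one_hot(b, vocab)))


# Any two distinct words are sqrt(2) apart, so "cat" is no closer
# to "dog" than it is to "car".
print(distance("cat", "dog"), distance("cat", "car"))
```

The representation carries no similarity signal at all, which is the gap that the distributional and distributed representations in the following slides address.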
8. Distributional representations
• Linguistic aspect.
• Based on co-occurrence/ context
• Distributional hypothesis: linguistic units with similar distributions
have similar meanings.
• The distributional property is usually induced from document or
context or textual vicinity (like sliding window).
9. Distributed representations
• Compact, dense and low dimensional representation.
• Differs from distributional representations as the constraint is to seek
efficient dense representation, not just to capture the co-occurrence
similarity.
• Each single component of vector representation does not have any
meaning of its own.
• The interpretable features (for example, word contexts in case of
word2vec) are hidden and distributed among uninterpretable vector
components.
10. • Embedding: a mapping from a space with one dimension per linguistic
unit (word, character, phrase, sentence, document) to a continuous vector
space of much lower dimension.
“You shall know a word by the company it keeps” - J R Firth
• One of the most successful ideas of modern statistical NLP
12. Co-occurrence with SVD
• Define a word using the words in its context.
• Words that co-occur
• Building a co-occurrence matrix M.
Context = previous word and next word
Corpus = {“I like deep learning.”, “I like NLP.”, “I enjoy flying.”}
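The co-occurrence matrix for this toy corpus can be sketched as follows. The tokenization (whitespace split, with the final "." treated as its own token) and the function names are assumptions made for illustration; the slide does not specify them.

```python
import numpy as np

corpus = ["I like deep learning.", "I like NLP.", "I enjoy flying."]


def tokenize(sentence):
    # Assumption: split on whitespace and keep "." as a separate token.
    return sentence.replace(".", " .").split()


def cooccurrence(corpus, window=1):
    """Count how often each pair of words appears within `window` positions."""
    sentences = [tokenize(s) for s in corpus]
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[index[w], index[s[j]]] += 1
    return vocab, index, M


vocab, index, M = cooccurrence(corpus)
print(vocab)
print(M)
```

With `window=1` the context is exactly the previous and next word, so the row for "I" records two co-occurrences with "like" and one with "enjoy".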
13. • Imagine we do this for a large corpus of text
• The row vector x_dog describes the usage of the word dog in the corpus
• It can be seen as the coordinates of a point in n-dimensional Euclidean space R^n
• Reduce dimensions using the SVD of M
14. • Given a matrix M of dimensionality m × n, construct an m × k matrix, where k << n
• M = U Σ V^T
• U is an m × m orthogonal matrix (U U^T = I)
• Σ is an m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ1 ≥ σ2 ≥ · · · ≥ σr ≥ 0, where r = min(m, n)) [the σi are known as singular values]
• V is an n × n orthogonal matrix (V V^T = I)
• We construct M’ s.t. rank(M’) = k
• We compute M’ = U Σ’ V^T, where Σ’ keeps the k largest singular values of Σ and sets the rest to zero
• k is chosen to capture the desired percentage of variance
• Then the submatrix formed by the first k columns of U (one row per word) is our desired word embedding matrix.
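A minimal sketch of the truncation step using NumPy's `np.linalg.svd`. The 4 × 4 matrix is hypothetical, and measuring "variance captured" as the share of squared singular values is one common convention assumed here, not something the slide specifies.

```python
import numpy as np


def svd_embeddings(M, k):
    """Return the first k columns of U (one row per word) and all singular values."""
    U, S, Vt = np.linalg.svd(M.astype(float), full_matrices=False)
    return U[:, :k], S


def variance_captured(S, k):
    """Share of the squared singular values kept by the top-k components."""
    return float((S[:k] ** 2).sum() / (S ** 2).sum())


# Hypothetical symmetric co-occurrence counts, for illustration only.
M = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])

E, S = svd_embeddings(M, k=2)
print(E.shape, variance_captured(S, 2))
```

In practice k would be raised until `variance_captured` reaches the desired percentage.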
16. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence (Rohde et al. 2005)
17. Pros & Cons
+ Simple method
+ Captures some sense (though weak) of similarity between words.
- Matrix is extremely sparse.
- Quadratic cost to train (perform SVD)
- Drastic imbalance in frequencies can adversely impact quality of
embeddings.
- Adding new words is expensive.
Take home: we worked with statistics of the corpus rather than with the
corpus directly. This idea recurs in GloVe.
19. Language Models
• Filter out good sentences from bad ones.
• Good = semantically and syntactically correct.
• Model this via the probability of a given sequence of n words:
Pr (w1, w2, ….., wn)
• S1 = “the cat jumped over the dog”, Pr(S1) ~ 1
• S2 = “jumped over the the cat dog”, Pr(S2) ~ 0
22. BiGram Model
• Objective: given wi-1, predict wi
• Training data: given a sequence of n words < w1, w2, ….., wn >, extract bi-gram
pairs (wi-1 , wi)
• Knowns:
• input – output training examples: (wi-1 , wi)
• Vocab of the training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model them as a matrix E of size |V| × d, where
d = embedding dimension, usually a hyperparameter.
• Model: shallow net
24. Steps
• Feed the index of wi-1 as input to the network.
• Use the index to look up the embedding matrix.
• Perform an affine transform on the word embedding to get a score vector.
• Compute the probability for each word.
• Set the 1-hot vector of wi as the target.
• Set loss = cross-entropy between the probability vector and the target vector.
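The steps of this bigram network can be sketched as a single forward pass in NumPy. The sizes, the random initialization and the names `E`, `W`, `b` are assumptions for illustration; training (backpropagation) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # hypothetical vocabulary size and embedding dimension
E = rng.normal(size=(V, d))    # embedding matrix, one row per word
W = rng.normal(size=(d, V))    # weights of the affine layer
b = np.zeros(V)                # bias of the affine layer


def softmax(scores):
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def bigram_loss(prev_idx, target_idx):
    v = E[prev_idx]                   # look up the embedding of w_{i-1}
    scores = v @ W + b                # affine transform: one score per word
    probs = softmax(scores)           # probability for each word
    # Cross-entropy against a 1-hot target reduces to -log of the target's probability.
    return probs, float(-np.log(probs[target_idx]))


probs, loss = bigram_loss(prev_idx=1, target_idx=3)
print(probs, loss)
```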
27. Remarks
● Per word, we have 2 vectors:
1. As a row in the embedding layer (E)
2. As a column in the weights layer (used for the affine transformation)
● It’s common to take the average of the 2 vectors.
● It’s common to normalise the vectors: divide each by its norm.
● An alternative way to compute ŷi is from counts: $\hat{y}_i = \#(w_i, w_{i-1}) \,/\, \sum_{j \in V} \#(w_j, w_{i-1})$, where #(a, b) is the number of times a directly follows b.
● Use the co-occurrence matrix to compute these counts.
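The count-based alternative mentioned in the remarks can be sketched with `collections.Counter`. The toy sentence and function name are hypothetical; the estimate is the plain relative frequency, with no smoothing.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Estimate Pr(current | previous) as count(previous, current) / count(previous, *)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])   # how often each word starts a bigram
    return {
        (prev, cur): n / prev_counts[prev]
        for (prev, cur), n in pair_counts.items()
    }


tokens = "the cat sat on the mat".split()   # hypothetical toy corpus
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")], probs[("cat", "sat")])
```

Here "the" is followed once by "cat" and once by "mat", so each gets probability one half.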
28. I learn best with toy code,
that I can play with.
- Andrew Trask
jupyter notebook 1
30. CBOW
• Continuous Bag of words.
• Proposed by Mikolov et al. in 2013
• Conceptually, very similar to Bi-gram model
• The bigram model had 2 key drawbacks:
1. The context was very small: we took only wi-1 while predicting wi
2. The context of a word includes the following words, not only the preceding ones.
31. • “the brown cat jumped over the dog”
Context = the, brown, cat, over, the, dog
Target = jumped
• Context window = k words on either side of the word to be
predicted.
• Pr (w1, w2, ….., wn) = ∏ Pr(wc | wc−k, . . . , wc−1, wc+1, . . . , wc+k)
• W = total number of unique windows
• Each window is a sliding block of 2k+1 words
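A short sketch of how the sliding window yields (context, target) training examples for the sentence above. Truncating the window at sentence boundaries is an assumption; the slide does not say how edges are handled.

```python
def cbow_windows(tokens, k):
    """Return (context, target) pairs using up to k words on each side of the target."""
    examples = []
    for c, target in enumerate(tokens):
        # Assumption: windows are simply truncated at the sentence edges.
        context = tokens[max(0, c - k):c] + tokens[c + 1:c + 1 + k]
        examples.append((context, target))
    return examples


tokens = "the brown cat jumped over the dog".split()
examples = cbow_windows(tokens, k=3)
print(examples[3])
```

For the target "jumped" with k = 3 this reproduces the context listed on the slide.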
32. CBOW Model
• Objective : given wc−k, . . . , wc−1, wc+1, . . . , wc+k , predict wc
• Training data: given sequence of n words < w1, w2, ….., wn >, for each window
extract context and target (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Knowns:
• input – output training examples : (wc−k, . . . , wc−1, wc+1, . . . , wc+k ; wc )
• Vocab of training corpus (V) = ∪(wi)
• Unknowns: word embeddings. Model them as a matrix E of size |V| × d, where
d = embedding dimension, usually a hyperparameter.
34. Steps
• Feed the indexes of (x(c−k) , ... , x(c−1) , x(c+1) , ... , x(c+k)) for the input context of size k.
• Use the indexes to look up the embedding matrix.
• Average these vectors to get vˆ = (vc−k + . . . + vc−1 + vc+1 + . . . + vc+k) / 2k
• Perform an affine transform on vˆ to get a score vector.
• Turn the scores into probabilities for each word.
• Set the 1-hot vector of wc as the target.
• Set loss = cross-entropy between the probability vector and the target vector.
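The CBOW steps can be sketched as one forward pass. The matrix names, sizes and random initialization are assumptions for illustration, and the bias of the affine layer is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 7, 4                    # hypothetical vocabulary size and embedding dimension
E = rng.normal(size=(V, d))    # input embeddings, one row per word
W = rng.normal(size=(d, V))    # affine layer (bias omitted in this sketch)


def cbow_loss(context_idx, target_idx):
    v_hat = E[context_idx].mean(axis=0)   # average the context embeddings
    scores = v_hat @ W                    # one score per vocabulary word
    scores = scores - scores.max()        # stabilise the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs, float(-np.log(probs[target_idx]))   # cross-entropy with 1-hot target


# Context word indexes on both sides of the target word with index 3.
probs, loss = cbow_loss(context_idx=[0, 1, 2, 4, 5, 6], target_idx=3)
print(loss)
```

The only structural difference from the bigram network is the averaging step over several context embeddings.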
36. Skip-Gram model
• 2nd model proposed by Mikolov et al. in 2013
• Turns CBOW on its head.
• CBOW = given context, predict the target word
• Skip Gram = given target, predict context
• “the brown cat jumped over the dog”
Target = jumped
Context = the, brown, cat, over, the, dog
37. • Objective: given wc, predict wc−k, . . . , wc−1, wc+1, . . . , wc+k
• Training data: given a sequence of n words < w1, w2, ….., wn >, for each window
extract target and context pairs (wc, wc−k), . . . , (wc, wc−1), (wc, wc+1), . . . , (wc, wc+k)
• Knowns:
• input – output training examples: (wc, wc−k), . . . , (wc, wc−1), (wc, wc+1), . . . , (wc, wc+k)
• Vocab of the training corpus (V) = ∪ (wi)
• Unknowns: word embeddings. Model them as a matrix E of size |V| × d, where
d = embedding dimension, usually a hyperparameter.
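Extracting the (target, context) pairs for skip-gram can be sketched as below. As in any such sketch, edge handling is an assumption: windows are truncated at the start and end of the sentence.

```python
def skipgram_pairs(tokens, k):
    """Return (center, context) pairs for every word within k positions of the center."""
    pairs = []
    for c, center in enumerate(tokens):
        for j in range(max(0, c - k), min(len(tokens), c + k + 1)):
            if j != c:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the brown cat jumped over the dog".split()
pairs = skipgram_pairs(tokens, k=3)
print([ctx for center, ctx in pairs if center == "jumped"])
```

Each window therefore contributes up to 2k training examples rather than one, in contrast to CBOW.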
39. Steps
• Feed the index of xc.
• Use the index to look up the embedding matrix and get vc.
• Perform an affine transform on vc to get a score vector.
• Turn the scores into probabilities for each word.
• Set the 1-hot vector of each context word in turn as the target.
• Set loss = cross-entropy between the probability vector and the target vector.
40. Maths behind the scene
• Optimization objective J = − log Pr(wc−k, . . . , wc−1, wc+1, . . . , wc+k | wc)
• Use gradient descent to update all relevant word vectors uc and wj.
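A sketch of how this objective can be evaluated for one window, assuming (as is standard for word2vec, though not stated on the slide) that the context words are treated as conditionally independent given the center word, so the joint log-probability becomes a sum of log-softmax terms. The names `U` and `v_c` are illustrative.

```python
import numpy as np


def skipgram_loss(v_c, U, context_idx):
    """J = -sum_j log softmax(U @ v_c)[j] over the context word indexes j."""
    scores = U @ v_c                                    # one score per vocabulary word
    shifted = scores - scores.max()                     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum()) # log-softmax
    return float(-log_probs[context_idx].sum())


# Sanity check: with all-zero output vectors every word is equally likely,
# so each context word contributes log(|V|) to the loss.
U = np.zeros((4, 3))
v_c = np.ones(3)
print(skipgram_loss(v_c, U, [0, 2]), 2 * np.log(4))
```

Gradient descent then adjusts `v_c` and the rows of `U` to lower this value.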
42. • How to quantitatively evaluate the quality of word vectors?
• Intrinsic Evaluation :
• Word Vector Analogies
• Extrinsic Evaluation :
• Downstream NLP task
43. Intrinsic Evaluation
• Specific Intermediate subtasks
• Easy to compute.
• Analogy completion:
• a:b :: c:?, e.g. man:woman :: king:?
• $d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\lVert x_b - x_a + x_c \rVert \, \lVert x_i \rVert}$
• Evaluate word vectors by how well their cosine similarity after addition
captures intuitive semantic and syntactic analogy questions
• The input words are discarded from the search.
• Problem: what if the information is there but not linear?
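Analogy completion can be sketched as a nearest-neighbour search by cosine similarity. The 2-d vectors below are hand-picked, hypothetical values chosen so that the analogy works; real embeddings are learned and much higher dimensional.

```python
import numpy as np

# Hypothetical toy vectors, for illustration only.
vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is most cosine-similar to x_b - x_a + x_c."""
    query = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):            # discard the input words from the search
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best


print(analogy("man", "woman", "king", vectors))
```

Skipping the input words matters in practice, because the query vector often lies closest to one of them.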
45. Extrinsic Evaluation
• Real task at hand
• Ex: Sentiment analysis.
• Not very robust.
• End result is a function of whole process and not just embeddings.
• Process:
• Data pipelines
• Algorithm(s)
• Fine tuning
• Quality of dataset
47. Bottleneck
• Recall that to calculate the probability we use softmax. The denominator is
a sum across the entire vocab.
• Further, this is calculated for every window.
• Too expensive.
• A single update of the parameters requires iterating over |V|. Our vocab
is usually in the millions.
48. To approximate the probability, don’t use the entire vocab.
There are 2 popular lines of attack to achieve this:
• Modify the structure of the softmax
• Hierarchical Softmax
• Sampling techniques: don’t use the entire vocabulary to compute the sum
• Negative sampling
49. Hierarchical Softmax
● Arrange the words in the vocab as leaf units of a balanced binary tree.
● |V| leaves, |V| − 1 internal nodes
● Each leaf node has a unique path from the root to the leaf
● Probability of a word (leaf node Lw) = probability of the path from the root node to leaf Lw
● No output vector representation for words, unlike softmax.
● Instead, every internal node has a d-dimensional vector associated with it: v’n(w, j)
● n(w, j) means the j-th unit on the path from the root to the word w
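50. ● Product of probabilities over the nodes in the path
● Each probability is computed using sigmoid
● $p(w) = \prod_{j=1}^{L(w)-1} \sigma\big( [\![\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} h \big)$, where L(w) is the number of nodes on the path to w, ch(n) is the left child of n, and [[x]] is 1 if x is true and −1 otherwise
● Inside it we check whether the (j+1)-th node on the path is the left child of the j-th node or not
● v’n(w, j)^T h: vector product between the vector on the hidden layer and the vector for the inner node under consideration.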
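51. Example
● p(w = w2)
● We start at the root and navigate to leaf w2
[Figure: example binary tree with the path from the root to leaf w2; the equations for p(w = w2) were not recoverable]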
52. ● Cost per word: reduced from O(|V|) to O(log |V|)
● In practice, use a Huffman tree
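A minimal sketch of hierarchical softmax on a hypothetical balanced tree with four words and three inner nodes. The tree layout, the `paths` encoding (+1 for a left turn, −1 for a right turn) and the random vectors are all assumptions for illustration; a real implementation would build a Huffman tree from word frequencies.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical tree: inner node 0 is the root, 1 its left child, 2 its right child.
# Each path lists (inner node index, +1 if the path goes left, -1 if it goes right).
paths = {
    "w0": [(0, +1), (1, +1)],
    "w1": [(0, +1), (1, -1)],
    "w2": [(0, -1), (2, +1)],
    "w3": [(0, -1), (2, -1)],
}


def hs_probability(word, h, inner_vectors):
    """Multiply the sigmoid branch probabilities along the path from root to leaf."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * (inner_vectors[node] @ h))
    return float(p)


rng = np.random.default_rng(0)
inner_vectors = rng.normal(size=(3, 5))   # one d-dimensional vector per inner node
h = rng.normal(size=5)                    # hidden-layer vector

total = sum(hs_probability(w, h, inner_vectors) for w in paths)
print(total)
```

Because sigmoid(x) + sigmoid(−x) = 1 at every inner node, the leaf probabilities sum to one without ever normalising over the whole vocabulary, and each word touches only about log2 |V| inner nodes.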
53. Negative Sampling
●Given (w, c) : word and context
●Let P(D=1|w,c) be probability that (w, c) came from the corpus data.
●P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data.
● Let's model P(D=1|w,c) with a sigmoid: $P(D=1 \mid w, c, \theta) = \sigma(u_w^T v_c) = \frac{1}{1 + e^{-u_w^T v_c}}$
●Objective function (J):
○ maximize P(D=1|w,c) if (w, c) is in the corpus data.
○ maximize P(D=0|w,c) if (w, c) is not in the corpus data.
●We take a simple maximum likelihood approach of these two probabilities.
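54. ● θ denotes the parameters of the model; in our case U and V, the input and output word vectors.
● $\theta = \arg\max_\theta \prod_{(w,c) \in D} P(D=1 \mid w,c,\theta) \prod_{(w,c) \in \tilde{D}} P(D=0 \mid w,c,\theta)$, where D is the corpus data and D̃ is the set of pairs that are not in it.
● Taking the log of both sides: $\theta = \arg\max_\theta \sum_{(w,c) \in D} \log P(D=1 \mid w,c,\theta) + \sum_{(w,c) \in \tilde{D}} \log \left( 1 - P(D=1 \mid w,c,\theta) \right)$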
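55. ● Now, maximizing the log likelihood = minimizing the negative log likelihood:
● $J = -\sum_{(w,c) \in D} \log \sigma(u_w^T v_c) - \sum_{(w,c) \in \tilde{D}} \log \sigma(-u_w^T v_c)$
● D̃ is a “false” or negative corpus with wrong sentences, e.g. "jumped cat dog the the over"
● Generate D̃ on the fly by randomly sampling negatives from the vocab.
● For skip-gram, our new objective function for observing the context word w_{c−m+j} given the center word w_c would be: $-\log \sigma(u_{c-m+j}^T v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T v_c)$
● Compare with the regular softmax loss for skip-gram: $-u_{c-m+j}^T v_c + \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$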
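56. ● Likewise for CBOW, our new objective function for observing the center word u_c given the context vector $\hat{v} = \frac{v_{c-m} + \dots + v_{c-1} + v_{c+1} + \dots + v_{c+m}}{2m}$ would be: $-\log \sigma(u_c^T \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \hat{v})$
● Compare with the regular softmax loss for CBOW: $-u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})$
● In the above formulation, {ũ_k | k = 1 . . . K} are sampled from Pn(w).
● Best Pn(w) = unigram distribution raised to the power of 3/4
● Usually K = 20-30 works well.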
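A rough sketch of the negative-sampling loss for one (center, context) pair, together with the 3/4-power noise distribution. The counts, dimensions and K below are hypothetical, the vectors are random rather than trained, and the function names are made up for illustration. As a simplification, the sampled negatives are not checked against the true context word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, renormalised to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def neg_sampling_loss(v_c, u_o, U_neg):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c), one row of U_neg per negative."""
    positive = -np.log(sigmoid(u_o @ v_c))
    negative = -np.sum(np.log(sigmoid(-(U_neg @ v_c))))
    return float(positive + negative)

rng = np.random.default_rng(0)
counts = np.array([900, 90, 9, 1])          # hypothetical unigram counts
P_n = noise_distribution(counts)
V, d, K = len(counts), 8, 3
U = rng.normal(scale=0.1, size=(V, d))      # output vectors
v_c = rng.normal(scale=0.1, size=d)         # center-word (input) vector
neg_ids = rng.choice(V, size=K, p=P_n)      # K negatives drawn from P_n
loss = neg_sampling_loss(v_c, U[1], U[neg_ids])
```

The loss touches only K + 1 output vectors instead of all |V|, and the 3/4 power raises the sampling probability of rare words relative to their raw unigram frequency.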
58. Global matrix factorization methods
● Use co-occurrence counts
● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)
+ Fast training
+ Efficient usage of statistics
+ Captures word similarity
- Do badly on analogy tasks
- Disproportionate importance given to large counts
59. Local context window method
● Use window to determine context of a word
● Ex: Skip-gram/CBOW (Mikolov et al), NNLM (Bengio et al), HLBL (Mnih & Hinton), Collobert & Weston
+ Capture word similarity.
+ Also perform better on analogy tasks
- Slow down with increase in corpus size
- Inefficient usage of statistics
60. Combining the best of both worlds
● The GloVe model tries to combine the two major model families:
○ Global matrix factorization (co-occurrence counts)
○ Local context window (context comes from window)
= Co-occurrence counts with context distance
61. Co-occurrence counts with context distance
● Uses context distance : weight each word in context window using its
distance from the center word
● This ensures nearby words have more influence than far off ones.
● Sentence -> “I like NLP”
○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5
○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0
○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0
● Corpus C: I like NLP. I like cricket.
Co-occurrence matrix for C
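The distance-weighted counts can be sketched as follows. This assumes a 1/distance weight, which matches the 1.0 and 0.5 values in the example, a hypothetical window of 2, and whitespace tokenisation with each sentence handled separately. `weighted_cooccurrence` is an illustrative name.

```python
from collections import defaultdict

def weighted_cooccurrence(sentences, window=2):
    """Count co-occurrences, weighting a pair at distance d by 1/d."""
    X = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[(word, tokens[j])] += 1.0 / abs(i - j)
    return dict(X)

X = weighted_cooccurrence(["I like NLP"])
X_corpus = weighted_cooccurrence(["I like NLP", "I like cricket"])
```

For the corpus C, the pair (I, like) occurs at distance 1 in both sentences, so its entry in `X_corpus` is 2.0, while (I, NLP) and (I, cricket) each get 0.5.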
62. Issues with Co-occurrence Matrix
● Long tail distribution
● Frequent words contribute disproportionately
(use weight function to fix this)
● Use Log for normalization
● Avoid log 0 : Add 1 to each Xij
63. Intuition for Glove
●Think of matrix factorization algorithms used in recommendation systems.
●Latent Factor models
○ Find features that describe the characteristics of rated objects.
○ Item characteristics and user preferences are described using vectors which are called factor
vectors
○ Assumption: Ratings can be inferred from a model put together from a smaller number of
parameters
64. Latent Factor models
● Dot product estimates the user’s interest in the item: $\hat{r}_{ui} = q_i^T p_u$
○ where q_i : factor vector for item i
○ p_u : factor vector for user u
○ r̂_ui : estimated user interest
● How do we compute the vectors for items and users?
65. Matrix Factorization
● r_ui : known rating of user u for item i
● Predicted rating: $\hat{r}_{ui} = q_i^T p_u$
● Similarly, the GloVe model tries to model the co-occurrence counts with the
following equation: $w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$
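66. Weighting function
● $f(x) = (x / x_{max})^{\alpha}$ if $x < x_{max}$, and $1$ otherwise
● Properties of f(x)
○ vanishes at 0, i.e. f(0) = 0
○ non-decreasing
○ f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted
● Empirically α = 0.75, xmax = 100 works best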
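As a quick check of the listed properties, here is a small sketch of the weighting function, using the standard GloVe form with the α = 0.75 and xmax = 100 values quoted above. `glove_weight` is an illustrative name.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha below x_max, and 1 from x_max upward."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

# Evaluate at a few increasing co-occurrence counts.
samples = [glove_weight(x) for x in (0, 1, 10, 50, 100, 1000)]
```

The values start at 0, never decrease, and are capped at 1, so very large counts cannot dominate the loss.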
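67. Loss Function
● $J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$
● Scalable.
● Fast training
○ Training time doesn’t depend on the corpus size
○ Always fitting to a |V| x |V| matrix.
● Good performance with small corpus, and small vectors.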
68. Training
● Input:
○ Xij (|V| x |V| matrix) : co-occurrence matrix
● Parameters
○ W (|V| x |D| matrix) & W̃ (|V| x |D| matrix) :
■ wi and w̃j are representations of the i-th & j-th words from the W and W̃ matrices respectively.
○ bi (|V| x 1) column vector : variable for incorporating biases in terms
○ bj (1 x |V|) row vector : variable for incorporating biases in terms
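With the input and parameters laid out as above, the weighted least-squares loss can be sketched in NumPy. This is an assumption-laden illustration rather than the deck's own code: the counts are random, the biases are stored as plain 1-d arrays, and zero entries of X are skipped (relying on f(0) = 0) rather than adding 1 to every count.

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss summed over the non-zero entries of X."""
    i, j = np.nonzero(X)                        # zero counts get weight f(0) = 0
    x = X[i, j]
    f = np.minimum(1.0, (x / x_max) ** alpha)   # weighting function f(X_ij)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return float(np.sum(f * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(0)
V, D = 4, 3
X = rng.integers(0, 5, size=(V, V)).astype(float)   # hypothetical co-occurrence counts
W = rng.normal(scale=0.1, size=(V, D))
W_tilde = rng.normal(scale=0.1, size=(V, D))
b = np.zeros(V)
b_tilde = np.zeros(V)
loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

Each evaluation only visits entries of the |V| x |V| matrix, which is why training time does not grow with the corpus once the counts have been collected.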
69. Quick Experiment
● Train on Wikipedia data
● |V| = 2000
● Window size = 3
● Iterations = 10000
● D = 50
● Learn two representations for each word in |V|.
● reg = 0.01
● Use momentum optimizer with momentum=0.9.
76. Objective
● Given a collection of N high-dimensional objects x1, x2, …. xN.
● How can we get a feel for how these objects are (relatively) arranged ?
77. Introduction
● Build a map (low dimension) s.t. distances between points reflect “similarities” in
the data :
●Minimize some objective function that measures the discrepancy between
similarities in the data and similarities in the map
83. t-SNE
●We have measure of similarity of data points in High Dimension
●We have measure of similarity of data points in Low Dimension
●We need a distance measure between the two.
●Once we have distance measure, all we want is : to minimize it
84. One possible choice - KL divergence
● It’s a measure of how one probability distribution diverges from a second
expected probability distribution
85. KL divergence applied to t-SNE
Objective function (C): $C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$
● We want nearby points in high-D to remain nearby in low-D
○ If they do not, then
■ pij will be large (because the points are nearby)
■ but qij will be small (because the points are far away)
■ This will result in a larger penalty
■ In contrast, if both pij and qij are large : lower penalty
86. KL divergence applied to t-SNE
●Likewise, we want far away points in high-D to remain (relatively) far away in
low-D
○ If they do not, then
■ pij will be small (because the points are far away)
■ but qij will be large (because the points are nearby)
■ This will result in a lower penalty
● t-SNE mainly preserves local similarity structure of the data
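The asymmetry of the penalty can be seen numerically from a single term p · log(p / q) of the KL sum. The similarity values below are hypothetical and chosen only to contrast the two failure cases. Note that an individual term may be negative even though the full KL divergence never is.

```python
import numpy as np

def kl_term(p, q):
    """Contribution p * log(p / q) of one pair (i, j) to the KL divergence."""
    return p * np.log(p / q)

# Hypothetical similarity values for a single pair of points.
near_mapped_far = kl_term(p=0.8, q=0.1)   # nearby in high-D, far apart in the map
far_mapped_near = kl_term(p=0.1, q=0.8)   # far apart in high-D, nearby in the map
both_large = kl_term(p=0.8, q=0.8)        # nearby in both spaces
```

Pulling true neighbours apart costs far more than squeezing distant points together, which is one way to see why t-SNE favours local structure.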
88. Why a Student t-Distribution ?
●t-SNE tries to retain local structure of this data in the map
●Result : dissimilar points have to be modelled as far apart in the map
● Hinton has shown that the Student t-distribution is very similar to the Gaussian
distribution
● Local structures are preserved
● Global structure is lost
89. Deciding the effective number of neighbours
● We need to decide the radii in different parts of the space, so that we can keep
the effective number of neighbours about constant.
● A big radius leads to a high entropy for the distribution over neighbors of i.
● A small radius leads to a low entropy.
● So decide what entropy you want and then find the radius that produces that
entropy.
● It's easier to specify 2^entropy
○ This is called the perplexity
○ It is the effective number of neighbors.
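A sketch of "decide what entropy you want, then find the radius": perplexity is computed as 2 raised to the entropy in bits, and a bisection on the Gaussian radius sigma finds the value that hits a target perplexity for one point. The distances are random and hypothetical, and `find_sigma` is an illustrative helper, not the routine of any particular t-SNE implementation.

```python
import numpy as np

def perplexity(p):
    """2 ** (Shannon entropy in bits) of a discrete distribution p."""
    p = p[p > 0]
    return float(2.0 ** (-np.sum(p * np.log2(p))))

def neighbour_probs(sq_dists, sigma):
    """Gaussian neighbour distribution of one point, given squared distances to the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return w / w.sum()

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, steps=60):
    """Bisection on a log scale: perplexity grows with sigma."""
    for _ in range(steps):
        mid = np.sqrt(lo * hi)
        if perplexity(neighbour_probs(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

rng = np.random.default_rng(0)
sq_dists = rng.uniform(0.5, 4.0, size=50) ** 2   # hypothetical distances from point i
sigma = find_sigma(sq_dists, target=10.0)
achieved = perplexity(neighbour_probs(sq_dists, sigma))
```

A uniform distribution over k neighbours has perplexity exactly k, which is the sense in which perplexity is an effective number of neighbours.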
91. Hyper parameters really matter: Playing with perplexity
● projected 100 data points clearly separated in two different clusters with tSNE
● Applied tSNE with different values of perplexity
● With perplexity=2, local variations in the data dominate
● With perplexity in range(5-50) as suggested in paper, plots still capture some structure in the data
92. Hyper parameters really matter: Playing with #iterations
● Perplexity set to 30.0
● Applied tSNE with different number of iterations
● Takeaway : different datasets may require different number of iterations
93. Cluster sizes can be misleading
● Uses tSNE to plot two clusters with different standard deviation
● Bottom line: we cannot see cluster sizes in t-SNE plots
94. Distances in t-SNE plots
● At lower perplexity clusters look equidistant
● At perplexity=50, tSNE captures some notion of global geometry in the data
● 50 data points in each sub cluster
95. Distances in t-SNE plots
● tSNE is not able to capture global geometry even at perplexity=50.
● key take away : well separated clusters may not mean anything in tSNE.
● 200 data points in each sub cluster
96. Random noise doesn’t always look random
● For this experiment, we generated random points from gaussian distribution
● Plots with lower perplexity, showing misleading structures in the data
97. You can see some shapes sometimes
● Axis aligned gaussian distribution
● For certain values of perplexity, long clusters look almost correct.
● tSNE tends to expand regions which are denser
99. At heart they are all the same!
● It has been shown that in essence GloVe and word2vec are no different
from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them
DSM)
●GloVe ⋍ PCA/LSA is straightforward (both factorize global counts
matrix)
●word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015)
●They show that in essence word2vec also factorizes word context matrix
(PMI)
100. ● Despite this “equality” of algorithms, word2vec is still known to do better
on several tasks.
●Why ?
○Levy et al. 2015 show : magic lies in Hyperparameters
102. Pre-processing
●Dynamic Context window
○ In DSM, context window: unweighted & constant size.
○ Glove & SGNS - give more weightage to closer terms
○ SGNS - even the window size can be dynamic and take a value between 1 & max of windowsize.
●Subsampling frequent words
○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than
some threshold t, with probability $p = 1 - \sqrt{t / f}$
●Deleting rare words
○ In SGNS, rare words are also deleted before creating context windows.
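Two of the pre-processing tricks above are small enough to sketch directly. The threshold t = 1e-5 is an assumed typical value, f is taken to be a relative frequency, and the function names are made up for illustration.

```python
import math
import random

def discard_probability(f, t=1e-5):
    """p = 1 - sqrt(t / f) for words with relative frequency f above t, else 0."""
    return 1.0 - math.sqrt(t / f) if f > t else 0.0

def dynamic_window(max_window, rng):
    """Sample the actual window size uniformly from 1..max_window."""
    return rng.randint(1, max_window)

p_frequent = discard_probability(0.01)   # a very frequent word is mostly dropped
p_rare = discard_probability(1e-6)       # below the threshold: always kept

rng = random.Random(0)
sizes = [dynamic_window(5, rng) for _ in range(100)]
```

Sampling the window size uniformly means the closest words are included in every draw while distant ones are included only sometimes, which is how SGNS gives more weight to closer terms.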
103. Post-processing
●Adding context vectors
○ Glove adds word vectors and the context vectors for the final representation.
●Vector normalization
○ All vectors can be normalized to unit length
104. Key Take Home
●Hyperparameters vs Algorithms
○ Hyperparameter settings are often more important than the algorithm choice
○ No single algorithm consistently outperforms the other ones
●Hyperparameters vs more data
○ Training on a larger corpus helps on some tasks
○ In many cases, tuning hyperparameters is more beneficial
105. References
Idea of word vectors is not new.
• Learning representations by back-propagating errors (Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
•Sebastian Ruder’s 3 part Blog series
•Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher
•word2vec Parameter Learning Explained by X Rong
110. Bag of Words
• Vocab = set of all the words in corpus
• Document = Words in document w.r.t vocab with multiplicity
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
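The two count vectors can be reproduced with a few lines of Python. This is a minimal sketch that assumes lower-casing and whitespace tokenisation, with the vocab order taken from the slide.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence (case-insensitive)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]
s1 = bag_of_words("The cat sat on the hat", vocab)
s2 = bag_of_words("The dog ate the cat and the hat", vocab)
```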
111. Pros & Cons
+ Quick and Simple
- Too simple
- Orderless
- No notion of syntactic/semantic similarity
112. N-gram model
• Vocab = set of all n-grams in corpus
• Document = n-grams in document w.r.t vocab with multiplicity
For bigram:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and,
and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
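The same counting works for n-grams once the sentence is cut into overlapping word tuples. This sketch again assumes lower-casing and whitespace tokenisation, and `ngram_vector` is an illustrative name.

```python
from collections import Counter

def ngrams(sentence, n=2):
    """List the overlapping word n-grams of a sentence."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vector(sentence, vocab, n=2):
    """Count how often each vocabulary n-gram occurs in the sentence."""
    counts = Counter(ngrams(sentence, n))
    return [counts[g] for g in vocab]

vocab = ["the cat", "cat sat", "sat on", "on the", "the hat",
         "the dog", "dog ate", "ate the", "cat and", "and the"]
s1 = ngram_vector("The cat sat on the hat", vocab)
s2 = ngram_vector("The dog ate the cat and the hat", vocab)
```

Both sentences end in "the hat", so that bigram is counted once in each vector.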
113. Pros & Cons
+ Tries to incorporate order of words
- Very large vocab set
- No notion of syntactic/semantic similarity
114. Term Frequency–Inverse Document Frequency (TF-IDF)
• Captures importance of a word to a document in a corpus.
• Importance increases proportionally to the number of times a word appears in the
document; but is offset by the frequency of the word in the corpus.
• TF(t) = (Number of times term t appears in a document) / (Total number of terms
in the document).
• IDF(t) = log (Total number of documents / Number of documents with term t in
it).
• TF-IDF (t) = TF(t) * IDF(t)
115. Example
• Document D1 contains 100 words.
• cat appears 3 times in D1
• TF(cat) = 3 / 100
= 0.03
• Corpus contains 10 million documents
• cat appears in 1000 documents
• IDF(cat) = log10 (10,000,000 / 1,000)
= 4
• TF-IDF (cat) = 0.03 * 4
= 0.12
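The arithmetic of the example can be checked with a short script. The base-10 logarithm is an assumption, chosen because it is the base for which log(10,000) equals 4.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's words that are this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 log (an assumption)."""
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(3, 100)
idf_cat = idf(10_000_000, 1_000)
tfidf_cat = tf_cat * idf_cat
```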
116. Pros & Cons
• Pros:
• Easy to compute
• Has some basic metric to extract the most descriptive terms in a document
• Thus, can easily compute the similarity between 2 documents using it
• Cons:
• Based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents, etc.
• Thus, TF-IDF is only useful as a lexical level feature. (presence/absence)
• Cannot capture semantics (unlike topic models, word embeddings)
117. ● Positive Pointwise Mutual Information (PPMI): PMI is a common measure for the strength of
association between two words. It is defined as the log ratio between the joint probability of two
words w and c and the product of their marginal probabilities:
a. $PMI(w, c) = \log \frac{P(w, c)}{P(w) P(c)}$
b. $PPMI(w, c) = \max(PMI(w, c), 0)$
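A compact sketch of turning a word-by-context count matrix into PPMI values. The counts are hypothetical, and setting the PMI of unseen pairs to 0 is a convention assumed here, since log 0 is undefined.

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-by-context co-occurrence count matrix."""
    p_wc = counts / counts.sum()                  # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)         # marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)         # marginals P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                  # -inf / nan from zero counts become 0
    return np.maximum(pmi, 0.0)

counts = np.array([[4.0, 0.0],
                   [0.0, 4.0]])                   # hypothetical counts
M = ppmi(counts)
```

In this toy matrix each word appears only with its own context, so the diagonal entries are log 2 and the off-diagonal entries are clipped to 0.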