- The document discusses neural word embeddings, which represent words as dense real-valued vectors in a continuous vector space. This allows words with similar meanings to have similar vector representations.
- It describes how neural network language models like skip-gram and CBOW can be used to efficiently learn these word embeddings from unlabeled text data in an unsupervised manner. Techniques like hierarchical softmax and negative sampling help reduce computational complexity.
- The learned word embeddings exhibit meaningful syntactic and semantic relationships between words and support analogy and similarity tasks, even though no such supervision was provided during training.
4. Neural Word Embedding
● Continuous vector space representation
  o Words are represented as dense real-valued vectors in R^d.
● Distributed word representation ↔ word embedding
  o Embed an entire vocabulary into a relatively low-dimensional linear space whose dimensions are latent continuous features.
● A classical n-gram model works in terms of discrete units
  o There is no inherent relationship between n-grams.
● In contrast, word embeddings capture regularities and relationships between words (illustrated in the sketch below).
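For illustration, a minimal numpy sketch with invented 4-dimensional vectors (real embeddings are learned from text and typically have 100-1000 dimensions); related words end up with high cosine similarity:

```python
import numpy as np

# Toy dense word vectors (invented values; a real model learns these).
vec = {
    "king":  np.array([0.8, 0.3, 0.1, 0.9]),
    "queen": np.array([0.7, 0.4, 0.1, 0.8]),
    "apple": np.array([0.1, 0.9, 0.7, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, lower otherwise."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["king"], vec["queen"]))  # high: related words
print(cosine(vec["king"], vec["apple"]))  # lower: unrelated words
```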
5. Syntactic & Semantic Relationships
Regularities are observed as a constant offset vector between pairs of words sharing the same relationship (probed in the sketch below).
● Gender relation: KING − QUEEN ≈ MAN − WOMAN
● Singular/plural relation: KING − KINGS ≈ QUEEN − QUEENS
● Other relations:
  o Language: France − French ≈ Spain − Spanish
  o Past tense: Go − Went ≈ Capture − Captured
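The offset regularity can be probed with simple vector arithmetic; a sketch, assuming `vec` is a trained embedding lookup such as the toy dict above:

```python
import numpy as np

def analogy(vec, a, b, c):
    """Solve a : b :: c : ? by the nearest neighbor of vec[b] - vec[a] + vec[c]."""
    target = vec[b] - vec[a] + vec[c]
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: (vec[w] @ target) /
                             (np.linalg.norm(vec[w]) * np.linalg.norm(target)))

# With well-trained embeddings: analogy(vec, "man", "king", "woman") -> "queen"
```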
8. Language Models (LM)
● Different models exist for estimating continuous representations of words:
  ○ Latent Semantic Analysis (LSA)
  ○ Latent Dirichlet Allocation (LDA)
  ○ Neural Network Language Model (NNLM)
9. Feed-Forward NNLM
● Consists of input, projection, hidden and output layers.
● The N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. Ex: A = (1,0,...,0), B = (0,1,...,0), ..., Z = (0,0,...,1) in R^26 (see the sketch below).
● The NNLM becomes computationally complex between the projection (P) and hidden (H) layers
  ○ For N = 10, size of P = 500-2000, size of H = 500-1000.
  ○ The hidden layer is used to compute a probability distribution over all the words in the vocabulary V.
● Hierarchical softmax to the rescue.
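In code, 1-of-V coding followed by the projection layer is just a row lookup in the projection matrix; a sketch with illustrative sizes:

```python
import numpy as np

V, d = 26, 5                      # vocabulary size, embedding dimensionality
P = np.random.randn(V, d)         # projection matrix (one d-dim row per word)

def one_hot(i, V):
    """1-of-V coding: all zeros except a single 1 at position i."""
    x = np.zeros(V)
    x[i] = 1.0
    return x

i = 2                             # index of some word in the vocabulary
assert np.allclose(one_hot(i, V) @ P, P[i])  # 1-of-V coding == row lookup
```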
10. Recurrent NNLM
● No projection layer; consists of input, hidden and output layers only.
● No need to specify the context length, unlike the feed-forward NNLM.
● What is special in the RNN model?
  ○ A recurrent matrix that connects the hidden layer to itself.
  ○ This allows the network to form a short-term memory
    ■ Information from the past is represented by the hidden layer.
● RNN-embedded vectors achieved state-of-the-art results on a relational similarity identification task.
[Figure: the RNN model]
11. Recurrent NNLM
● w(t): input word at time t
● y(t): output layer; produces a probability distribution over words
● s(t): hidden layer
● U: each column represents a word
[Figure: four-gram neural net language model architecture (Bengio 2001)]
● The RNN is trained with SGD and backpropagation to maximize the log likelihood (the recurrence is sketched below).
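A minimal numpy sketch of the recurrence (the names U, W, V loosely follow the slide; the exact nonlinearities vary across implementations):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, h = 26, 10                       # vocabulary size, hidden size
U = rng.normal(size=(h, V_size))         # input-to-hidden (columns ~ words)
W = rng.normal(size=(h, h))              # recurrent matrix: hidden -> hidden
V = rng.normal(size=(V_size, h))         # hidden-to-output

def step(w_t, s_prev):
    """One RNN-LM step: new hidden state and distribution over next words."""
    s_t = 1.0 / (1.0 + np.exp(-(U @ w_t + W @ s_prev)))    # sigmoid
    z = V @ s_t
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax
    return s_t, y_t

s = np.zeros(h)                          # empty short-term memory
w = np.zeros(V_size)
w[3] = 1.0                               # 1-of-V input word
s, y = step(w, s)                        # y: prob. dist. over the vocabulary
```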
12. Bringing Efficiency
● The computational complexity of NNLMs is high.
● Removing the hidden layer yields up to a 1000x speed-up:
  ○ Continuous bag-of-words model
  ○ Continuous skip-gram model
● The full softmax can be replaced by:
  ○ Hierarchical softmax (Morin and Bengio)
  ○ Hinge loss (Collobert and Weston)
  ○ Noise contrastive estimation (Mnih et al.)
13. Continuous Bag-of-Words Model (CBOW)
● Predicts the current word based on the context.
● The non-linear hidden layer is removed.
● The projection layer is shared for all words (not just the projection matrix).
● All words get projected into the same position (their vectors are averaged).
● Naming reason: the order of words in the history does not influence the projection.
● Best performance was obtained with a log-linear classifier taking four future and four history words at the input (see the sketch below).
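A sketch of the CBOW forward pass under these simplifications (toy sizes; a full softmax is used here for clarity, even though the real model replaces it for efficiency):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 1000, 100
P = rng.normal(size=(V, d))       # shared input projection (one row per word)
O = rng.normal(size=(V, d))       # output word vectors

def cbow_probs(context_ids):
    """Predict the current word from averaged context vectors."""
    h = P[context_ids].mean(axis=0)   # all context words share one position
    z = O @ h
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()

p = cbow_probs([5, 17, 42, 99])   # four history + four future words in practice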
14. Continuous Skip-gram Model
● Predicts the surrounding words given the current word.
● Objective: maximize classification of a word based on another word in the same sentence, i.e. maximize the average log probability over the training words.
● Defines p(w_{t+j} | w_t) using the softmax function (both formulas are given below).
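The two formulas, as given in Mikolov et al. (2013), are the average log probability

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

and the basic skip-gram softmax

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the input and output vector representations of $w$, $c$ is the context window size, and $W$ is the number of words in the vocabulary.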
16. Hierarchical Softmax for Efficient Computation
● The full-softmax formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5 to 10^7 terms).
● With hierarchical softmax, the cost is reduced to roughly log(W) evaluations per word.
17. Hierarchical Softmax
● Uses a binary tree (Huffman code) representation of the output layer, with the W words as its leaves.
  o A random walk assigns probabilities to words.
● Instead of evaluating W output nodes, only about log(W) nodes are evaluated to calculate the probability distribution.
● Each word w can be reached by an appropriate path from the root of the tree (the resulting probability is given below):
  o n(w, j): the j-th node on the path from the root to w
  o L(w): the length of this path
  o n(w, 1) = root and n(w, L(w)) = w
  o ch(n): an arbitrary fixed child of an inner node n
  o [x] = 1 if x is true and [x] = −1 otherwise
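With this notation, the hierarchical softmax of the paper defines

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\Big( \big[ n(w, j{+}1) = \mathrm{ch}(n(w,j)) \big] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big), \qquad \sigma(x) = \frac{1}{1+e^{-x}}$$

so each prediction costs one sigmoid per node on the path from the root to w.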
18. Negative Sampling
● Noise contrastive estimation (NCE)
  o A good model should be able to differentiate data from noise by means of logistic regression.
  o An alternative to the hierarchical softmax.
  o Introduced by Gutmann and Hyvärinen and applied to language modeling by Mnih and Teh.
● NCE approximates the log probability of the softmax.
● Negative sampling is defined by an objective that replaces log p(w_O | w_I) in the skip-gram formulation (given below).
● Task: distinguish the target word w_O from draws from the noise distribution.
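The negative-sampling objective that replaces each log p(w_O | w_I) term, as defined in the paper, is

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

with k negative samples drawn from a noise distribution P_n(w); the paper reports that the unigram distribution raised to the 3/4 power works best.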
19. Subsampling of Frequent Words
● The most frequent words provide less information than rare words:
  o Co-occurrences of "France" and "Paris" are informative.
  o Co-occurrences of "France" and "the" are less informative.
● A simple subsampling approach counters this imbalance: each word w_i in the training set is discarded with probability

  P(w_i) = 1 − sqrt(t / f(w_i)),

  where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^−5 (sketched in code below).
● This aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
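A sketch of the discard rule (the relative frequencies are invented for illustration):

```python
import math
import random

t = 1e-5
freq = {"the": 0.05, "france": 1e-4, "paris": 5e-5}   # toy relative frequencies

def keep(word):
    """Discard with probability 1 - sqrt(t / f(w)); words with f(w) <= t always survive."""
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq[word]))
    return random.random() >= p_discard

tokens = ["the", "france", "paris", "the", "the"]
kept = [w for w in tokens if keep(w)]   # "the" is dropped most of the time
```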
21. Automatic Learning by the Skip-gram Model
● No supervised information about what a capital city means is given.
● But the model is still capable of:
  o Automatic organization of concepts
  o Learning implicit relationships
[Figure: PCA projection of 100-dimensional skip-gram vectors]
23. Learning Phrases
● To learn phrase vectors:
  o First find words that appear frequently together, and infrequently in other contexts.
  o Replace them with unique tokens. Ex: "New York Times" -> New_York_Times
● Phrases are formed based on the unigram and bigram counts; the discounting coefficient δ prevents forming too many phrases consisting of very infrequent words (the score is given below).
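The score used to form phrases, as given in the paper, is

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

Bigrams scoring above a chosen threshold are merged into single tokens, and the paper runs 2-4 passes over the data with decreasing thresholds so that longer phrases can form.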
25. Phrase Skip-gram Results
● Accuracies of the skip-gram models on the phrase analogy dataset:
  o Using different hyperparameters.
  o Models trained on approximately one billion words from the news dataset.
● The size of the training data matters:
  o HS-Huffman (dimensionality = 1000) trained on 33 billion words reaches an accuracy of 72%.
26. Additive Compositionality
● It is possible to meaningfully combine words by an element-wise addition of their vector representations.
  ○ A word vector represents the distribution of the contexts in which the word appears.
● Vector values are related logarithmically to the probabilities computed by the output layer.
  ○ The sum of two word vectors is therefore related to the product of the two context distributions.
29. Comments
● The reduction of computational complexity is impressive.
● Works with unsupervised/unlabelled data.
● The vector representation can be extended to larger pieces of text:
  o Paragraph Vector (Le and Mikolov, 2014)
● Applicable to many NLP tasks:
  o Tagging
  o Named entity recognition
  o Translation
  o Paraphrasing
Note: Neg-k denotes negative sampling with k negative samples.
The vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as an AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. Thus, if "Volga River" appears frequently in the same sentence together with the words "Russian" and "river", the sum of these two word vectors will result in a feature vector that is close to the vector of "Volga River".
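A minimal sketch of this additive composition (assuming `vec` is a trained embedding lookup such as the toy dict used earlier, and that a phrase token like Volga_River exists in the vocabulary):

```python
import numpy as np

def nearest(vec, query, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    return max((w for w in vec if w not in exclude),
               key=lambda w: (vec[w] @ query) /
                             (np.linalg.norm(vec[w]) * np.linalg.norm(query)))

# With trained phrase embeddings, the expectation is:
# nearest(vec, vec["russian"] + vec["river"]) -> "Volga_River"
```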