This document provides an overview of representation learning techniques for natural language processing (NLP). It begins by introducing the speakers and the objective of the workshop: a deep dive into state-of-the-art text representation techniques and how to apply them to solve NLP problems. The workshop covers four modules: 1) archaic techniques, 2) word vectors, 3) sentence/paragraph/document vectors, and 4) character vectors. It emphasizes that representation learning is key to NLP because it transforms raw text into a numeric form that machine learning models can understand.
Monthly AI Tech Talks in Toronto 2019-08-28
https://www.meetup.com/aittg-toronto
The talk will cover the end-to-end details, including contextual and linguistic feature extraction, vectorization, n-grams, topic modeling, and named entity resolution, which are based on concepts from mathematics, information retrieval, and natural language processing. We will also dive into more advanced feature engineering strategies, such as word2vec, GloVe, and fastText, that leverage deep learning models.
In addition, attendees will learn how to combine NLP features with numeric and categorical features and analyze the feature importance from the resulting models.
The following Python libraries will be used to demonstrate the aforementioned feature engineering techniques: spaCy, Gensim, fastText, and Keras.
https://www.meetup.com/aittg-toronto/events/261940480/
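As a small taste of the vectorization and n-gram topics listed above, here is a minimal, library-free sketch of word n-gram extraction. The talk itself uses spaCy and Gensim; this plain-Python version is purely illustrative.

```python
# Minimal sketch: extracting word n-grams, one of the vectorization
# building blocks mentioned in the talk description.
def ngrams(text, n=2):
    """Return the list of word n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("feature engineering for natural language processing", 2))
```

In practice these n-grams would be fed into a count- or TF-IDF-based vectorizer before modeling.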
Part 1 of the Deep Learning Fundamentals Series, this session discusses the use cases and scenarios surrounding Deep Learning and AI; reviews the fundamentals of artificial neural networks (ANNs) and perceptrons; discusses the basics of optimization, beginning with the cost function, gradient descent, and backpropagation; and covers activation functions (including Sigmoid, TanH, and ReLU). The demos included in these slides run on Keras with a TensorFlow backend on Databricks.
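The cost-function-and-gradient-descent idea the session outlines can be sketched in a few lines. This is a toy example on a hand-picked quadratic cost, not the session's actual demo; the learning rate and step count are illustrative.

```python
# Gradient descent on a simple quadratic cost J(w) = (w - 3)^2,
# whose gradient is dJ/dw = 2 * (w - 3).
def gradient_descent(lr=0.1, steps=100):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)   # gradient of the cost at the current w
        w -= lr * grad       # update rule: w := w - lr * gradient
    return w

print(round(gradient_descent(), 4))  # converges toward the minimum at w = 3
```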
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural... | Simplilearn
This Deep Learning presentation will help you understand what Deep Learning is, why we need Deep Learning, what a neural network is, applications of Deep Learning, what a perceptron is, implementing logic gates using a perceptron, and the types of neural networks. At the end of the video, you will be introduced to TensorFlow along with a use case implementation on recognizing hand-written digits. Deep Learning is inspired by the integral function of the human brain specific to artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. Deep Learning applies advanced computing power and special types of neural networks to large amounts of data to learn, understand, and identify complicated patterns. We will also cover neural networks and how they work in this Deep Learning tutorial video. This Deep Learning tutorial is ideal for professionals with a beginner to intermediate level of experience. Now, let us dive deep into this topic and understand what Deep Learning actually is.
Below topics are explained in this Deep Learning presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. What is Neural network?
4. What is Perceptron?
5. Implementing logic gates using Perceptron
6. Types of Neural networks
7. Applications of Deep Learning
8. Working of Neural network
9. Introduction to TensorFlow
10. Use case implementation using TensorFlow
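Topic 5 above (logic gates with a perceptron) fits in a few lines. The weights and bias below are hand-picked for illustration rather than learned, which is enough to show that a single perceptron can realize the AND gate.

```python
# A single perceptron: output 1 if the weighted sum plus bias is
# positive, else 0. With w1 = w2 = 1 and bias = -1.5, this is AND.
def perceptron(x1, x2, w1=1.0, w2=1.0, bias=-1.5):
    return 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron(a, b))
```

Changing the bias to -0.5 turns the same unit into an OR gate; XOR, famously, needs more than one layer.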
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks, and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
( Python Training: https://www.edureka.co/python )
This Edureka Python Numpy tutorial (Python Tutorial Blog: https://goo.gl/wd28Zr) explains what exactly is Numpy and how it is better than Lists. It also explains various Numpy operations with examples.
Check out our Python Training Playlist: https://goo.gl/Na1p9G
This tutorial helps you to learn the following topics:
1. What is Numpy?
2. Numpy v/s Lists
3. Numpy Operations
4. Numpy Special Functions
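The "Numpy v/s Lists" point above comes down to vectorization: arrays support element-wise arithmetic directly, while plain lists need an explicit loop. A minimal sketch (assuming NumPy is installed):

```python
import numpy as np

# The same doubling operation, list-style vs. array-style.
prices = [10, 20, 30]
doubled_list = [p * 2 for p in prices]           # list: explicit loop
doubled_array = (np.array(prices) * 2).tolist()  # array: vectorized op
print(doubled_list, doubled_array)
```

On large data, the vectorized form is both shorter and substantially faster, since the loop runs in compiled code.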
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ... | Simplilearn
This presentation on Machine Learning will help you understand what clustering is, K-Means clustering, a flowchart to understand K-Means clustering along with a demo showing clustering of cars into brands, what logistic regression is, the logistic regression curve, the sigmoid function, and a demo on how to classify a tumor as malignant or benign based on its features. Machine Learning algorithms can help computers play chess, perform surgeries, and get smarter and more personal. K-Means and logistic regression are two widely used Machine Learning algorithms which we are going to discuss in this video. Logistic regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps to predict the probability of an event by fitting data to a logit function. It is also called logit regression. K-means clustering is an unsupervised learning algorithm. In this case, you don't have labeled data, unlike in supervised learning. You have a set of data that you want to group into clusters, meaning objects that are similar in nature and characteristics need to be put together. This is what k-means clustering is all about. Now, let us get started and understand K-Means clustering and logistic regression in detail.
Below topics are explained in this Machine Learning tutorial part -2 :
1. Clustering
- What is clustering?
- K-Means clustering
- Flowchart to understand K-Means clustering
- Demo - Clustering of cars based on brands
2. Logistic regression
- What is logistic regression?
- Logistic regression curve & Sigmoid function
- Demo - Classify a tumor as malignant or benign based on features
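The logistic regression curve mentioned in topic 2 is the sigmoid function, which squashes any real-valued score into the (0, 1) range so it can be read as a probability. A minimal sketch:

```python
import math

# The logistic (sigmoid) function behind the regression curve:
# sigmoid(z) = 1 / (1 + e^(-z)), mapping any real z into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0), round(sigmoid(4), 3), round(sigmoid(-4), 3))
```

A classifier then thresholds this probability, e.g. predicting "malignant" when sigmoid(z) > 0.5.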
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at: https://www.simplilearn.com/
This presentation on Recurrent Neural Networks will help you understand what a neural network is, the popular neural networks, why we need a recurrent neural network, what a recurrent neural network is, how an RNN works, the vanishing and exploding gradient problem, and what LSTM is; you will also see a use case implementation of LSTM (Long Short-Term Memory). Neural networks used in Deep Learning consist of different layers connected to each other and work on the structure and functions of the human brain. They learn from huge volumes of data and use complex algorithms to train a neural net. A recurrent neural network works on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of the layer. Now let's dive into this presentation and understand what an RNN is and how it actually works.
Below topics are explained in this recurrent neural networks tutorial:
1. What is a neural network?
2. Popular neural networks?
3. Why recurrent neural network?
4. What is a recurrent neural network?
5. How does an RNN work?
6. Vanishing and exploding gradient problem
7. Long short term memory (LSTM)
8. Use case implementation of LSTM
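The recurrence described above ("saving the output of a layer and feeding it back to the input") can be sketched in a few lines. This is a single-unit toy with fixed, hand-picked weights, not a trained Keras model.

```python
import math

# Minimal RNN step: the hidden state h carries the previous step's
# output back into the next step's computation.
def rnn_forward(inputs, w_x=0.5, w_h=0.5):
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # new state depends on old state
        states.append(h)
    return states

print([round(s, 3) for s in rnn_forward([1.0, 0.0, 0.0])])
```

Note how the influence of the first input decays at each step; this shrinking signal is exactly the vanishing gradient problem that LSTM's gating mechanism was designed to address.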
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you'll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks, and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
And according to payscale.com, the median salary for engineers with deep learning skills tops $120,000 per year.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms.
Learn more at: https://www.simplilearn.com/
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T... | Simplilearn
This presentation on "Supervised and Unsupervised Learning" will help you understand what machine learning is, the types of machine learning, what supervised machine learning is, the types of supervised machine learning, what unsupervised learning is, the types of unsupervised learning, and the differences between supervised and unsupervised machine learning. In supervised learning, the model learns from labeled data, whereas in unsupervised learning, the model trains itself on unlabeled data. Now, let us get started and understand supervised and unsupervised learning and how they differ from each other.
Below are the topics explained in this supervised and unsupervised learning in Machine Learning presentation-
1. What is Machine Learning
- Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
2. Supervised Learning
- Types of Supervised Learning
3. Unsupervised Learning
- Types of Unsupervised Learning
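The supervised/unsupervised contrast above fits in a toy example: supervised learning copies labels from labeled neighbors, while unsupervised learning groups unlabeled points by similarity. The data, labels, and cluster centers below are all invented for illustration.

```python
# Supervised: learn from (value, label) pairs via nearest neighbor.
labeled = [(1.0, "small"), (1.2, "small"), (9.0, "large")]
# Unsupervised: no labels, only raw values to be grouped.
unlabeled = [0.9, 1.1, 8.8, 9.3]

def nearest_label(x, data):
    # supervised 1-NN: copy the label of the closest training point
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

def cluster(points, centers=(1.0, 9.0)):
    # unsupervised: assign each point to the index of its nearest center
    return [min(range(len(centers)), key=lambda i: abs(centers[i] - p))
            for p in points]

print(nearest_label(1.3, labeled), cluster(unlabeled))
```

Real k-means would also update the centers iteratively; here they are fixed to keep the contrast between the two settings visible.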
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
Why learn Machine Learning?
Machine Learning is taking over the world, and with that, there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning concepts and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire a thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems
Learn more at: https://www.simplilearn.com/
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori... | Simplilearn
This Deep Learning presentation will help you understand what Deep Learning is, why we need Deep Learning, and the applications of Deep Learning, along with a detailed explanation of Neural Networks and how these Neural Networks work. Deep Learning is inspired by the integral function of the human brain specific to artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. This Deep Learning tutorial is ideal for professionals with beginner to intermediate levels of experience. Now, let us dive deep into this topic and understand what Deep Learning actually is.
Below topics are explained in this Deep Learning Presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. Applications of Deep Learning
4. What is Neural Network?
5. Activation Functions
6. Working of Neural Network
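The activation functions in topic 5 each map a neuron's weighted sum to its output, but with different ranges. A minimal sketch of two common choices (values here are just sample inputs):

```python
import math

# ReLU: zero for negative inputs, identity for positive ones.
def relu(z):
    return max(0.0, z)

# TanH: squashes inputs into the range (-1, 1).
def tanh(z):
    return math.tanh(z)

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), round(tanh(z), 3))
```

The choice matters in practice: ReLU's non-saturating positive side is one reason it trains deep networks faster than sigmoid or tanh in hidden layers.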
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks, and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Representation Learning of Vectors of Words and Phrases | Felipe Moraes
A talk about representation learning using word vectors such as Word2Vec and Paragraph Vector. It also introduces neural network language models (NNLMs) and shows some applications of NNLMs, such as sentiment analysis and information retrieval.
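The core idea behind the word vectors in this talk is that words become dense vectors whose cosine similarity reflects relatedness. The 3-dimensional vectors below are invented for illustration; real Word2Vec embeddings have hundreds of dimensions and are learned from corpora.

```python
import math

# Toy "embeddings": hand-made vectors for three words.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.10, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words end up closer than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```

With Gensim, the same query would be `model.wv.similarity("king", "queen")` on a trained `Word2Vec` model.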
Presentation on Neural Networks in Tensorflow. Code available at https://github.com/nfmcclure/tensorflow_cookbook . Presentation for Open Source Bridge, Portland, 2016.
This slide summarizes the GPT models and compares GPT-1, GPT-2, and GPT-3.
GPT stands for Generative Pre-Training of a language model and is implemented based on the decoder structure of the Transformer model.
(24th May, 2021)
Dataset Preparation
Abstract: This PDSG workshop introduces basic concepts on preparing a dataset for training a model. Concepts covered are data wrangling, replacing missing values, categorical variable conversion, and feature scaling.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
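Two of the preparation steps in the abstract, replacing missing values and feature scaling, can be sketched without any libraries. The toy column below is invented; a real workshop would use pandas or scikit-learn for this.

```python
# A toy numeric column with one missing value.
raw = [10.0, None, 30.0, 20.0]

# Step 1: replace missing values with the column mean.
present = [v for v in raw if v is not None]
mean = sum(present) / len(present)                # imputation value
filled = [mean if v is None else v for v in raw]

# Step 2: min-max feature scaling into [0, 1].
lo, hi = min(filled), max(filled)
scaled = [(v - lo) / (hi - lo) for v in filled]
print(filled, [round(s, 2) for s in scaled])
```

Categorical variable conversion (the remaining step) typically means one-hot encoding each category into its own 0/1 column.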
This presentation about Scikit-learn will help you understand what Scikit-learn is, what we can achieve using Scikit-learn, and a demo on how to use Scikit-learn in Python. Scikit-learn is a powerful, modern machine learning library for Python. It's a great tool for fully and semi-automated advanced data analysis and information extraction. There are many reasons why Scikit-learn is a preferred machine learning tool. It has efficient tools to identify and organize problems, such as whether they fit a supervised or unsupervised learning model. It contains many free and open datasets. It has a rich set of built-in libraries for learning and predicting. It provides model support for every problem type. It also has built-in functions such as pickle for model persistence. It is supported by a huge open-source community and vendor base. Now, let us get started and understand Scikit-learn in detail.
Below topics are explained in this Scikit-Learn presentation:
1. What is Scikit-learn?
2. What we can achieve using Scikit-learn
3. Demo
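The model-persistence point above uses Python's standard pickle module. Here is the dump/load pattern in miniature, with a plain dict standing in for a fitted scikit-learn estimator (the calls are identical for a real model):

```python
import pickle

# Stand-in for a fitted model: in a real demo this would be, e.g.,
# a trained scikit-learn estimator.
model = {"coef": [0.5, -1.2], "intercept": 0.1}

blob = pickle.dumps(model)     # serialize the "model" to bytes
restored = pickle.loads(blob)  # restore it later for prediction
print(restored == model)
```

For files, the same pattern is `pickle.dump(model, f)` / `pickle.load(f)`; only unpickle data you trust, since loading can execute arbitrary code.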
Simplilearn’s Python Training Course is an all-inclusive program that will introduce you to the Python development language and expose you to the essentials of object-oriented programming, web development with Django and game development. Python has surpassed Java as the top language used to introduce U.S. students to programming and computer science. This course will give you hands-on development experience and prepare you for a career as a professional Python programmer.
What is this course about?
The All-in-One Python course enables you to become a professional Python programmer. Any aspiring programmer can learn Python from the basics and go on to master web development & game development in Python. Gain hands-on experience creating a flappy bird game clone & website functionalities in Python.
What are the course objectives?
By the end of this online Python training course, you will be able to:
1. Internalize the concepts & constructs of Python
2. Learn to create your own Python programs
3. Master Python Django & advanced web development in Python
4. Master PyGame & game development in Python
5. Create a flappy bird game clone
The Python training course is recommended for:
1. Any aspiring programmer can take up this bundle to master Python
2. Any aspiring web developer or game developer can take up this bundle to meet their training needs
Learn more at https://www.simplilearn.com/mobile-and-software-development/python-development-training
Deep Learning - Overview of my work II | Mohamed Loey
Keywords: deep learning, machine learning, MNIST, CIFAR-10, Residual Network, AlexNet, VGGNet, GoogLeNet, NVIDIA. Deep learning (DL) is a hierarchical network structure that simulates the structure of the human brain to extract features from internal and external input data.
This is a basic introduction to the pandas library; you can use it when teaching this library in a machine learning introduction. This slide will help students with no coding background understand the basics of pandas.
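A starter sketch of the pandas basics such a slide typically introduces: building a DataFrame from a dict of columns, then filtering rows with a boolean mask. The names and scores below are made up, and pandas must be installed.

```python
import pandas as pd

# Build a DataFrame: each dict key becomes a column.
df = pd.DataFrame({"name": ["Ada", "Bo", "Cy"], "score": [91, 78, 85]})

# Boolean indexing: keep only the rows where score > 80.
high = df[df["score"] > 80]
print(list(high["name"]))
```

The same mask-based selection pattern extends naturally to multiple conditions, e.g. `df[(df["score"] > 80) & (df["name"] != "Ada")]`.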
Beyond the Symbols: A 30-minute Overview of NLP | MENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Simplilearn
This presentation on Machine Learning will help you understand what is clustering, K-Means clustering, flowchart to understand K-Means clustering along with demo showing clustering of cars into brands, what is logistic regression, logistic regression curve, sigmoid function and a demo on how to classify a tumor as malignant or benign based on its features. Machine Learning algorithms can help computers play chess, perform surgeries, and get smarter and more personal. K-Means & logistic regression are two widely used Machine learning algorithms which we are going to discuss in this video. Logistic Regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps to predict the probability of an event by fitting data to a logit function. It is also called logit regression. K-means clustering is an unsupervised learning algorithm. In this case, you don't have labeled data unlike in supervised learning. You have a set of data that you want to group into and you want to put them into clusters, which means objects that are similar in nature and similar in characteristics need to be put together. This is what k-means clustering is all about. Now, let us get started and understand K-Means clustering & logistic regression in detail.
Below topics are explained in this Machine Learning tutorial part -2 :
1. Clustering
- What is clustering?
- K-Means clustering
- Flowchart to understand K-Means clustering
- Demo - Clustering of cars based on brands
2. Logistic regression
- What is logistic regression?
- Logistic regression curve & Sigmoid function
- Demo - Classify a tumor as malignant or benign based on features
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars.This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
Learn more at: https://www.simplilearn.com/
This presentation on Recurrent Neural Network will help you understand what is a neural network, what are the popular neural networks, why we need recurrent neural network, what is a recurrent neural network, how does a RNN work, what is vanishing and exploding gradient problem, what is LSTM and you will also see a use case implementation of LSTM (Long short term memory). Neural networks used in Deep Learning consists of different layers connected to each other and work on the structure and functions of the human brain. It learns from huge volumes of data and used complex algorithms to train a neural net. The recurrent neural network works on the principle of saving the output of a layer and feeding this back to the input in order to predict the output of the layer. Now lets deep dive into this presentation and understand what is RNN and how does it actually work.
Below topics are explained in this recurrent neural networks tutorial:
1. What is a neural network?
2. Popular neural networks?
3. Why recurrent neural network?
4. What is a recurrent neural network?
5. How does an RNN work?
6. Vanishing and exploding gradient problem
7. Long short term memory (LSTM)
8. Use case implementation of LSTM
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you'll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data and prepare you for your new role as deep learning scientist.
Why Deep Learning?
It is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
And according to payscale.com, the median salary for engineers with deep learning skills tops $120,000 per year.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms. Those who complete the course will be able to:
Learn more at: https://www.simplilearn.com/
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T... - Simplilearn
This presentation on "Supervised and Unsupervised Learning" will help you understand what machine learning is, the types of machine learning, what supervised machine learning is and its types, what unsupervised learning is and its types, and the differences between supervised and unsupervised machine learning. In supervised learning, the model learns from labeled data, whereas in unsupervised learning, the model trains itself on unlabeled data. Now, let us get started and understand supervised and unsupervised learning and how they differ from each other.
Below are the topics explained in this supervised and unsupervised learning in Machine Learning presentation-
1. What is Machine Learning
- Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
2. Supervised Learning
- Types of Supervised Learning
3. Unsupervised Learning
- Types of Unsupervised Learning
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
Why learn Machine Learning?
Machine Learning is taking over the world, and with that there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
By the end of this Machine Learning course, you will be able to:
1. Master supervised, unsupervised and reinforcement learning concepts and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire a thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems
Learn more at: https://www.simplilearn.com/
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori... - Simplilearn
This Deep Learning presentation will help you understand what deep learning is, why we need it, and its applications, along with a detailed explanation of neural networks and how they work. Deep learning is inspired by the workings of the human brain, specifically artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. This deep learning tutorial is ideal for professionals with beginner to intermediate levels of experience. Now, let us dive deep into this topic and understand what deep learning actually is.
Below topics are explained in this Deep Learning Presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. Applications of Deep Learning
4. What is Neural Network?
5. Activation Functions
6. Working of Neural Network
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Representation Learning of Vectors of Words and Phrases - Felipe Moraes
A talk about representation learning using word vectors such as Word2Vec and Paragraph Vector. It also introduces neural network language models (NNLMs) and shows some applications of NNLMs, such as sentiment analysis and information retrieval.
Presentation on Neural Networks in Tensorflow. Code available at https://github.com/nfmcclure/tensorflow_cookbook . Presentation for Open Source Bridge, Portland, 2016.
I summarized the GPT models in these slides and compared GPT-1, GPT-2, and GPT-3.
GPT stands for Generative Pre-Training of a language model; it is implemented based on the decoder structure of the Transformer model.
(24th May, 2021)
Dataset Preparation
Abstract: This PDSG workshop introduces basic concepts on preparing a dataset for training a model. Concepts covered are data wrangling, replacing missing values, categorical variable conversion, and feature scaling.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
This presentation about Scikit-learn will help you understand what Scikit-learn is, what we can achieve using Scikit-learn, and includes a demo on how to use Scikit-learn in Python. Scikit-learn is a powerful, modern machine learning library for Python. It's a great tool for fully and semi-automated advanced data analysis and information extraction. There are a lot of reasons why Scikit-learn is a preferred machine learning tool. It has efficient tools to identify and organize problems, such as whether a problem fits a supervised or an unsupervised learning model. It contains many free and open data sets. It has a rich set of built-in libraries for learning and predicting. It provides model support for every problem type. It also has built-in functions, such as pickle, for model persistence. It is supported by a huge open source community and vendor base. Now, let us get started and understand Scikit-learn in detail.
Below topics are explained in this Scikit-Learn presentation:
1. What is Scikit-learn?
2. What we can achieve using Scikit-learn
3. Demo
Simplilearn’s Python Training Course is an all-inclusive program that will introduce you to the Python development language and expose you to the essentials of object-oriented programming, web development with Django and game development. Python has surpassed Java as the top language used to introduce U.S. students to programming and computer science. This course will give you hands-on development experience and prepare you for a career as a professional Python programmer.
What is this course about?
The All-in-One Python course enables you to become a professional Python programmer. Any aspiring programmer can learn Python from the basics and go on to master web development & game development in Python. Gain hands-on experience creating a flappy bird game clone & website functionalities in Python.
What are the course objectives?
By the end of this online Python training course, you will be able to:
1. Internalize the concepts & constructs of Python
2. Learn to create your own Python programs
3. Master Python Django & advanced web development in Python
4. Master PyGame & game development in Python
5. Create a flappy bird game clone
The Python training course is recommended for:
1. Any aspiring programmer who wants to master Python
2. Any aspiring web developer or game developer looking to meet their training needs
Learn more at https://www.simplilearn.com/mobile-and-software-development/python-development-training
Deep Learning - Overview of my work II - Mohamed Loey
Covers deep learning, machine learning, MNIST, CIFAR-10, residual networks, AlexNet, VGGNet, GoogLeNet and Nvidia hardware. Deep learning (DL) uses hierarchically structured networks that simulate the structure of the human brain to extract features from input data.
This is a basic introduction to the pandas library; you can use it to teach the library in an introductory machine learning course. These slides help students with no coding background understand the basics of pandas.
Beyond the Symbols: A 30-minute Overview of NLP - MENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0 - Plain Concepts
In this session we will explore Recurrent Neural Networks (RNNs) - a type of neural network specially designed to process sequences - and their applications to time series and text processing (NLP). To make the session even more interesting, all the code will be developed using the latest version of TensorFlow 2.0; the implementation of the models will be used to discuss the major changes with respect to the 1.x versions of the deep learning framework, and it will leverage MLflow within Azure Databricks as a development platform and for model serving.
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Young Seok Kim
Review of paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ArXiv link: https://arxiv.org/abs/1810.04805
YouTube Presentation: https://youtu.be/GK4IO3qOnLc
(Slides are written in English, but the presentation is done in Korean)
Visual-Semantic Embeddings: some thoughts on Language - Roelof Pieters
Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks has led to new approaches and new state of the art results in many natural language processing tasks. One such exciting - and most recent - trend can be seen in multimodal approaches fusing techniques and models of natural language processing (NLP) with that of computer vision.
The talk aims to give an overview of the NLP part of this trend. It starts with a short overview of the challenges in creating deep networks for language, what makes for a "good" language model, and the specific requirements of semantic word spaces for multi-modal embeddings.
Introduction to Text Mining and Topic Modelling - David Paule
A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).
Want to know more? Visit my website davidpaule.es
Building a Neural Machine Translation System From Scratch - Natasha Latysheva
Human languages are complex, diverse and riddled with exceptions – translating between different languages is therefore a highly challenging technical problem. Deep learning approaches have proved powerful in modelling the intricacies of language, and have surpassed all statistics-based methods for automated translation. This session begins with an introduction to the problem of machine translation and discusses the two dominant neural architectures for solving it – recurrent neural networks and transformers. A practical overview of the workflow involved in training, optimising and adapting a competitive neural machine translation system is provided. Attendees will gain an understanding of the internal workings and capabilities of state-of-the-art systems for automatic translation, as well as an appreciation of the key challenges and open problems in the field.
The NLP muppets revolution! @ Data Science London 2019
video: https://skillsmatter.com/skillscasts/13940-a-deep-dive-into-contextual-word-embeddings-and-understanding-what-nlp-models-learn
event: https://www.meetup.com/Data-Science-London/events/261483332/
Similar to Representation Learning of Text for NLP
Continuous Learning Systems: Building ML systems that learn from their mistakes - Anuj Gupta
Won't it be great to have ML models that can update their "learning" as and when they make mistakes and corrections are provided in real time? In this talk we look at a concrete business use case that warrants such a system. We take a deep dive into the use case and how we went about building a continuously learning system for text classification: the approaches we took and the results we got.
A deep dive into the world of word vectors. We will cover the bigram model, skip-gram, CBOW, and GloVe. Starting from the simplest models, we will journey through key results and ideas in this area.
In this talk we explore how to build Machine Learning systems that can learn "continuously" from their mistakes (feedback loop) and adapt to an evolving data distribution.
The youtube link to video of the talk is here:
https://www.youtube.com/watch?v=VtBvmrmMJaI
Representation Learning of Text for NLP
1. Representation Learning of Text for NLP
Anuj Gupta
Satyam Saxena
@anujgupta82, @Satyam8989
anujgupta82@gmail.com, satyamiitj89@gmail.com
2. About Us
Anuj is a senior ML researcher at Freshworks, working in the areas of NLP, machine learning and deep learning. Earlier he was heading ML efforts at Airwoot (now acquired by Freshdesk) and Droom.
Speaker at prestigious forums like PyData, Fifth Elephant, ICDCN, PODC, IIT Delhi, IIIT Hyderabad and special interest groups like DLBLR.
@anujgupta82 | anujgupta82@gmail.com
Satyam is an ML researcher at Freshworks. An IIT alumnus, his interests lie in NLP, machine learning and deep learning. Prior to this, he was part of the ML group at Cisco.
Speaker at forums like ICAT, and special interest groups like DLBLR, IIT Jodhpur.
@Satyam8989 | satyamiitj89@gmail.com
3. Objective of this Workshop
• Deep dive into state-of-the-art techniques for representing text data.
• By the end of this workshop, you will have gained a deeper understanding of the key ideas, maths and code powering these techniques.
• You will be able to apply these techniques to solve NLP problems of your interest.
• Help you achieve higher accuracies.
• Help you achieve deeper understanding.
• Target audience: data science teams, industry practitioners, researchers and enthusiasts in the area of NLP.
• This will be a very hands-on workshop.
4. I learn best with toy code that I can play with. But unless you know key concepts, you can't code. In this workshop, we will do both.
5. Outline
The workshop is divided into 4 modules. We will cover modules 1 and 2 before the lunch break, and modules 3 and 4 after lunch. The GitHub repo has folders for each of the 4 modules containing the respective notebooks.
• Module 1: Archaic techniques
• Module 2: Word vectors
• Module 3: Sentence/paragraph/document vectors
• Module 4: Character vectors
7. Resurrect your dead friend as an AI
Luka - Eugenia lost her friend Roman in an accident. Determined not to lose his memory, she gathered all the texts Roman had sent over his short life and made a chatbot - a program that responds automatically to text messages. Now, whenever she is missing Roman, Eugenia sends the chatbot a message and Roman's words respond.
10. Topics
• Introduction to NLP
• Examples of various NLP tasks
• Archaic Techniques
• Using pretrained embeddings
Key Learning outcomes:
• Basics of NLP
• One hot encoding
• Bag of words
• N-gram
• TF-IDF
• Why these techniques are bad
• How can you use pretrained embeddings
• Issues with using pre-trained word embeddings
11. What is NLP
• Concerned with programming computers to fruitfully process large amounts of natural language data.
• It sits at the intersection of computer science, artificial intelligence and computational linguistics.
17. NLP pipeline
Raw text → Preprocessing → Tokenization to get language units → Mathematical representation of each language unit → Build train/test data → Train model using training data → Test the model on test data
The first and arguably most important common denominator across all NLP tasks is: how we represent text as input to our models.
18. Why is text representation important?
• Machines do not understand text; we need a numeric representation.
• Unlike images (an RGB matrix), for text there is no obvious numeric representation.
• It is an integral part of any NLP pipeline.
19. What & why representation learning
• Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks.
• It is part of feature engineering/learning.
• It gets rid of "hand-designed" features and representations.
• Unsupervised feature learning obviates manual feature engineering.
21. Legacy Techniques
• One-hot encoding
• Bag of words
• N-grams
• TF-IDF
22. One-hot encoding
• Map each word to a unique ID.
• Typical vocabulary sizes will vary between 10k and 250k.
23. • Use the word ID to get a basic representation of the word, via one-hot encoding of the ID.
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID.
• E.g., for vocabulary size D = 10, the one-hot vector of the word w with ID = 4 is e(w) = [0 0 0 1 0 0 0 0 0 0].
• A one-hot encoding makes no assumption about word similarity: all words are equally similar to/different from each other.
• This is a natural representation to start with, though a poor one.
25. Drawbacks
• Size of the input vector scales with the size of the vocabulary.
• Must pre-determine the vocabulary size; cannot scale to large or infinite vocabularies (Zipf's law!).
• Computationally expensive: a large input vector results in far too many parameters to learn.
• The "out-of-vocabulary" (OOV) problem: how would you handle unseen words in the test set? One solution is to have an "UNKNOWN" symbol that represents low-frequency or unseen words.
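The "UNKNOWN" workaround can be sketched as follows (the toy vocabulary and ID assignments are made up for illustration):

```python
# Any word not seen during training is mapped to a shared UNK ID,
# so the model never encounters an out-of-vocabulary ID at test time.

vocab = {"UNK": 0, "the": 1, "cat": 2, "sat": 3}

def word_to_id(word):
    return vocab.get(word, vocab["UNK"])

# "on" and "hat" are out-of-vocabulary, so both collapse to UNK (0):
print([word_to_id(w) for w in "the cat sat on the hat".split()])
# [1, 2, 3, 0, 1, 0]
```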
26. • No relationship between words: each word is an independent unit vector.
• D("cat", "refrigerator") = D("cat", "cats"); D("spoon", "knife") = D("spoon", "dog").
• In the ideal world, relationships between word vectors would reflect relationships between words, and features of word embeddings would reflect features of words.
• These vectors are sparse, and hence vulnerable to overfitting: with sparse vectors most computations go to zero, so the resulting loss function has very few parameters to update.
27. Bag of Words
•Vocab = set of all the words in corpus
•Document = Words in document w.r.t vocab with multiplicity
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2 : { 3, 1, 0, 0, 1, 1, 1, 1}
27
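The example above can be reproduced in plain Python (vocabulary built in first-seen order, matching the slide):

```python
s1 = "the cat sat on the hat".split()
s2 = "the dog ate the cat and the hat".split()

# Vocabulary = set of all words in the corpus, kept in first-seen order.
vocab = list(dict.fromkeys(s1 + s2))

def bow(tokens, vocab):
    """Count of each vocab word in the document (with multiplicity)."""
    return [tokens.count(w) for w in vocab]

print(vocab)           # ['the', 'cat', 'sat', 'on', 'hat', 'dog', 'ate', 'and']
print(bow(s1, vocab))  # [2, 1, 1, 1, 1, 0, 0, 0]
print(bow(s2, vocab))  # [3, 1, 0, 0, 1, 1, 1, 1]
```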
28. Pros & Cons
+ Quick and Simple
-Too simple
-Orderless
-No notion of syntactic/semantic similarity
28
29. N-gram model
•Vocab = set of all n-grams in corpus
•Document = n-grams in document w.r.t vocab with multiplicity
For bigram:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat”
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and,
and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1}
29
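The same counting works for bigrams, as a plain-Python sketch. (Note that the bigram "the hat" does occur in sentence 2, so its count there is 1.)

```python
def bigrams(tokens):
    """All adjacent word pairs, in order."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

s1 = "the cat sat on the hat".split()
s2 = "the dog ate the cat and the hat".split()

vocab = list(dict.fromkeys(bigrams(s1) + bigrams(s2)))  # first-seen order
vec1 = [bigrams(s1).count(g) for g in vocab]
vec2 = [bigrams(s2).count(g) for g in vocab]
print(vocab[:3])  # ['the cat', 'cat sat', 'sat on']
print(vec1)       # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(vec2)       # [1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```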
30. Pros & Cons
+ Tries to incorporate order of words
-Very large vocab set
-No notion of syntactic/semantic similarity
30
31. Term Frequency–Inverse Document Frequency (TF-IDF)
•Captures importance of a word to a document in a corpus.
•Importance increases proportionally to the number of times a word
appears in the document; but is inversely proportional to the frequency of
the word in the corpus.
•TF(t) = (Number of times term t appears in a document) / (Total number of
terms in the document).
•IDF(t) = log (Total number of documents / Number of documents with term
t in it).
•TF-IDF (t) = TF(t) * IDF(t)
31
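The three formulas translate directly to Python (a minimal sketch using the natural log; real implementations often add smoothing or use other log bases):

```python
import math

def tf(term, doc):
    """TF = (times term appears in the document) / (total terms in the document)."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """IDF = log(total documents / documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = ["the cat sat on the hat".split(),
          "the dog ate the cat and the hat".split()]
# "the" appears in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "sat" appears only in the first document, so it scores higher there.
print(tf_idf("sat", corpus[0], corpus))
```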
33. Pros & Cons
•Pros:
•Easy to compute
•Has some basic metric to extract the most descriptive terms in a document
•Thus, can easily compute the similarity between 2 documents using it
•Disadvantages:
•Based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents, etc.
•Thus, TF-IDF is only useful as a lexical level feature.
•Cannot capture semantics (unlike topic models, word embeddings)
33
35. Bottom Line
• More often than not, how rich your input representation is has a huge bearing on
the quality of your downstream ML models.
• For NLP, archaic techniques treat words as atomic symbols. Thus every 2 words
are equally apart.
• They don’t have any notion of either syntactic or semantic similarity between
parts of language.
• This is one of the chief reasons for poor/mediocre performance of NLP based
models.
But this has changed dramatically in past few years
35
38. Topics
•Word level language models
•tSNE : Visualizing word-embeddings
•Demo of word vectors.
Key Learning outcomes:
• Key ideas behind word vectors
• Maths powering their formulation
• Bigram, SkipGram, CBOW
• Train your own word vectors
• Visualize word embeddings
• GloVe
• How GloVe differs from word2vec
• Evaluating word vectors
• tSNE
• How tSNE differs from PCA
38
40. Distributional representations
•Linguistic aspect.
•Based on co-occurrence/ context
•Distributional hypothesis: linguistic units with similar distributions
have similar meanings.
•Meaning is defined by the context in which a word appears. This is
‘connotation’.
•This is in contrast with ‘denotation’, the literal meaning of a word.
“Rock” literally means a stone, but can also refer to a person who is solid and stable: “Anthill rocks”
•The distributional property is usually induced from document or context
or textual vicinity (like sliding window).
40
41. Distributed representations
•Compact, dense and low dimensional representation.
•Differs from distributional representations as the constraint is to seek
efficient dense representation, not just to capture the co-occurrence
similarity.
•Each single component of vector representation does not have any
meaning of its own. Meaning is smeared across all dimensions.
•The interpretable features (for example, word contexts in case of
word2vec) are hidden and distributed among uninterpretable vector
components.
41
42. •Embedding: a mapping from a space with one dimension per linguistic
unit (character, morpheme, word, phrase, paragraph, sentence, document)
to a continuous vector space of much lower dimension.
•For the rest of this presentation, “meaning” of linguistic unit is represented
by a vector of real numbers.
42
43. Using pre-trained word embeddings
• Most popular - Google’s word2vec, Stanford’s GloVe
• Use it as a dictionary - query with the word, and use the vector
returned.
• Sentence (S) - “The cat sat on the table”
• Challenges:
• Representing sentence/document/paragraph.
• sum
• Average of the word vectors.
• Weighted mean
43
44. • Handling Out Of Vocabulary (OOV) words.
• Transfer learning (i.e. fine tuning on data).
44
45. For the rest of this presentation we will see various techniques to
build/train our own embeddings
45
48. John Rupert Firth
“You shall know a word by the company it keeps”
-1957
•English linguist
•Most famous quote in NLP (probably)
•Modern interpretation: Co-occurrence is a good
indicator of meaning
•One of the most successful ideas of modern
statistical NLP
48
49. Co-occurrence with SVD
•Define a word using the words in its context.
•Words that co-occur
•Building a co-occurrence matrix M.
Context = previous word and
next word
Corpus ={“I like deep learning.”
“I like NLP.”
“I enjoy flying.”}
49
50. • Imagine we do this for a large corpus of text
• The row vector x_dog describes the usage of the word dog in the corpus
• It can be seen as the coordinates of a point in n-dimensional Euclidean space R^n
• Reduce dimensions using the SVD of M
50
51. • Given a matrix of m × n dimensionality, construct an m × k matrix, where k << n
• M = U Σ V^T
• U is an m × m orthogonal matrix (U U^T = I)
• Σ is an m × n diagonal matrix, with diagonal values ordered from largest to smallest (σ_1 ≥ σ_2 ≥ · · · ≥ σ_r ≥ 0, where r = min(m, n)) [the σ_i's are known as singular values]
• V is an n × n orthogonal matrix (V V^T = I)
• We construct M' s.t. rank(M') = k
• We compute M' = U Σ' V^T, where Σ' = Σ keeping only the k largest singular values
• k captures the desired percentage of variance
• Then, the submatrix U_{|V|, k} (the first k columns of U) is our desired word embedding matrix.
51
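A plain-Python sketch of building the co-occurrence matrix for the toy corpus above (context = previous and next word, as on the earlier slide); the SVD step itself would typically be done with a numerical library such as `numpy.linalg.svd`:

```python
from collections import defaultdict

corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]

# counts[(w, c)] = how often c appears immediately before or after w.
counts = defaultdict(int)
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):           # context = previous and next word
            if 0 <= j < len(toks):
                counts[(w, toks[j])] += 1

print(counts[("I", "like")])   # 2  ("I like" occurs in two sentences)
print(counts[("like", "deep")])  # 1
# The resulting |V| x |V| matrix M would then be factorized, e.g. with
# numpy.linalg.svd(M), keeping the top-k singular values for the embeddings.
```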
53. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
53
54. Pros & Cons
+ Simple method
+ Captures some sense (though weak) of similarity between words.
- Matrix is extremely sparse.
- Quadratic cost to train (perform SVD)
- Drastic imbalance in frequencies can adversely impact quality of
embeddings.
- Adding new words is expensive.
Take home : we worked with statistics of the corpus rather than working with
the corpus directly. This will recur in GloVe
54
56. Language Models
•Filter out good sentences from bad ones.
•Good = semantically and syntactically correct.
•Modeled via the probability of a given sequence of n words: Pr(w_1, w_2, ..., w_n)
•S_1 = “the cat jumped over the dog”, Pr(S_1) ~ 1
•S_2 = “jumped over the the cat dog”, Pr(S_2) ~ 0
56
59. BiGram Model
•Objective : given w_i, predict w_{i+1}
• Training data: given a sequence of n words < w_1, w_2, ..., w_n >, extract bi-gram pairs (w_{i-1}, w_i)
• Knowns:
• input - output training examples : (w_{i-1}, w_i)
• Vocab of training corpus (V) = ∪(w_i)
• Unknowns: word embeddings. Model as a matrix E of size |V| × d. d = embedding
dimensions. Usually a hyperparameter.
• Model : shallow net
59
61. • Feed the index of w_{i-1} as input to the network.
• Use the index to look up the embedding matrix.
• Perform an affine transform on the word embedding to get a score vector.
• Compute a probability for each word.
• Set the 1-hot vector of w_i as the target.
• Set loss = cross-entropy between the probability vector and the target vector.
Steps
61
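The steps above can be sketched as a toy forward pass in plain Python (random toy weights, bias omitted; a real model would learn E and W by backpropagation):

```python
import math, random

V, d = 5, 3                       # toy vocab size and embedding dimension
random.seed(0)
E = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]  # embeddings
W = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(d)]  # output weights

def forward(prev_id, target_id):
    h = E[prev_id]                                    # 1. embedding lookup for w_{i-1}
    scores = [sum(h[k] * W[k][j] for k in range(d))   # 2. affine transform -> scores
              for j in range(V)]
    exps = [math.exp(s) for s in scores]              # 3. softmax -> probabilities
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_id])                # 4. cross-entropy vs 1-hot target w_i

loss = forward(prev_id=1, target_id=2)
```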
64. ● Per word, we have 2 vectors :
1. As a row in the embedding layer (E)
2. As a column in the weights layer (used for the affine transformation)
● It's common to take the average of the 2 vectors.
● It's common to normalise the vectors (divide by the norm).
● An alternative way to compute ŷ_i : #(w_i, w_{i-1}) / Σ_j #(w_j, w_{i-1}), j ∈ V
● Use the co-occurrence matrix to compute these counts.
Remarks
64
65. I learn best with toy code,
that I can play with.
- Andrew Trask
jupyter notebook 1
65
67. CBOW
•Continuous Bag of Words.
•Proposed by Mikolov et al. in 2013
•Conceptually, very similar to the Bi-gram model
•In the bigram model, there were 2 key drawbacks:
1. The context was very small: we took only w_{i-1} while predicting w_i
2. Context is not just preceding words, but following words too.
67
68. ● “the brown cat jumped over the dog”
Context = the, brown, cat, over, the, dog
Target = jumped
● Context window = k words on either side of the word to be predicted.
● Pr(w_1, w_2, ..., w_n) = ∏ Pr(w_c | w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k})
● W = total number of unique windows
● Each window is a sliding block of 2k+1 words
68
69. CBOW Model
•Objective : given w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k}, predict w_c
• Training data: given a sequence of n words < w_1, w_2, ..., w_n >, for each window
extract context and target (w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k} ; w_c)
• Knowns:
• input - output training examples : (w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k} ; w_c)
• Vocab of training corpus (V) = ∪(w_i)
• Unknowns: word embeddings. Model as a matrix E of size |V| × d. d = embedding
dimensions. Usually a hyperparameter.
69
71. • Feed the indexes of (x_{c−k}, ..., x_{c−1}, x_{c+1}, ..., x_{c+k}) for the input context of size k.
• Use the indexes to look up the embedding matrix.
• Average these vectors to get v̂ = (v_{c−k} + ... + v_{c−1} + v_{c+1} + ... + v_{c+k}) / 2k
• Perform an affine transform on v̂ to get a score vector.
• Turn the scores into probabilities for each word.
• Set the 1-hot vector of w_c as the target.
• Set loss = cross-entropy between the probability vector and the target vector.
Steps
71
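The averaging step above can be sketched as follows (hypothetical toy embeddings; a real model would look these rows up in E):

```python
def average_context(embeddings, context_ids):
    """v-hat = average of the context word vectors (2k of them for window size k)."""
    vecs = [embeddings[i] for i in context_ids]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 2-dimensional embeddings for a window of k = 2 (so 2k = 4 context words).
E = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [3.0, 1.0]}
v_hat = average_context(E, [0, 1, 2, 3])
print(v_hat)  # [1.25, 0.75]
```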
73. Skip-Gram model
• 2nd model proposed by Mikolov et al. in 2013
• Turns CBOW on its head.
• CBOW = given context, predict the target word
• Skip Gram = given target, predict context
• “the brown cat jumped over the dog”
Target = jumped
Context = the, brown, cat, over, the, dog
73
74. •Objective : given w_c, predict w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k}
• Training data: given a sequence of n words < w_1, w_2, ..., w_n >, for each window
extract target and context pairs (w_c, w_{c−k}), ..., (w_c, w_{c−1}), (w_c, w_{c+1}), ..., (w_c, w_{c+k})
• Knowns:
• input - output training examples : (w_c, w_{c−k}), ..., (w_c, w_{c−1}), (w_c, w_{c+1}), ..., (w_c, w_{c+k})
• Vocab of training corpus (V) = ∪(w_i)
• Unknowns: word embeddings. Model as a matrix E of size |V| × d. d = embedding
dimensions. Usually a hyperparameter.
74
76. • Feed the index of x_c (the centre word).
• Use the index to look up the embedding matrix and get v_c.
• Perform an affine transform on v_c to get a score vector.
• Turn the scores into probabilities for each word.
• Set the 1-hot vector of each context word as a target.
• Set loss = cross-entropy between the probability vector and each target vector.
Steps
76
77. Maths behind the scene
•Optimization objective J = − log Pr(w_{c−k}, ..., w_{c−1}, w_{c+1}, ..., w_{c+k} | w_c)
•Gradient descent to update all relevant word vectors u_c and w_j.
77
79. •How to quantitatively evaluate the quality of word vectors?
•Intrinsic Evaluation :
• Word Vector Analogies
•Extrinsic Evaluation :
• Downstream NLP task
79
80. Intrinsic Evaluation
•Specific Intermediate subtasks
•Easy to compute.
•Analogy completion:
• a:b :: c:? d =
man:woman :: king:?
• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?
80
82. Extrinsic Evaluation
•Real task at hand
•Ex: Sentiment analysis.
•Not very robust.
•End result is a function of whole process and not just embeddings.
•Process:
• Data pipelines
• Algorithm(s)
• Fine tuning
• Quality of dataset
82
84. Bottleneck
•Recall, to calculate probability, we use softmax. The denominator is
sum across entire vocab.
•Further, this is calculated for every window.
•Too expensive.
•A single update of the parameters requires iterating over |V|. Our vocab
is usually in the millions.
84
85. To approximate the probability, don't use the entire vocab.
There are 2 popular lines of attack to achieve this:
•Modify the structure of the softmax
• Hierarchical Softmax
•Sampling techniques : don’t use entire vocabulary to compute the sum
• Negative sampling
85
86. ● Arrange the words in the vocab as leaf units of a
balanced binary tree.
● |V| leaves ⇒ |V| − 1 internal nodes
● Each leaf node has a unique path from the root to
the leaf
● Probability of a word (leaf node L_w) =
probability of the path from the root node to leaf L_w
● No output vector representation for words,
unlike softmax.
● Instead every internal node has a d-dimensional
vector associated with it - v'_{n(w, j)}
Hierarchical Softmax
n(w, j) means the j-th unit on the path from the root to the
word w
87. ● Product of probabilities over the nodes in the path
● Each probability is computed using a sigmoid
● Inside it we check : is the (j+1)-th node on the path the left child of the j-th
node or not
● v'_{n(w, j)}^T h : vector product between the vector on the hidden layer and the vector for the
inner node in consideration.
88. Example
● p(w = w_2)
● We start at the root and navigate to leaf w_2, multiplying the sigmoid
probability at each internal node along the path.
89. ● Cost: O(|V|) to O(log |V| )
● In practice, use Huffman tree
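A sketch of building such a Huffman tree with the standard-library `heapq`, tracking only each word's path length; frequent words end up with short paths from the root, which is what makes the expected cost O(log |V|) or better:

```python
import heapq, itertools

def huffman_code_lengths(freqs):
    """Return {word: depth of its leaf} in a Huffman tree built from frequencies."""
    counter = itertools.count()  # tie-breaker so heap tuples always compare
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)   # two least-frequent subtrees...
        f2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}  # ...merge them
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

lengths = huffman_code_lengths({"the": 100, "cat": 10, "sat": 9, "flying": 1})
# "the" gets the shortest path from the root; rare "flying" gets the longest.
print(lengths["the"], lengths["flying"])  # 1 3
```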
90. Negative Sampling
● Given (w, c) : word and context
● Let P(D=1|w,c) be probability that (w, c) came from the corpus data.
● P(D=0|w,c) = probability that (w, c) didn’t come from the corpus data.
● Let's model P(D=1|w,c) with a sigmoid:
● Objective function (J):
○ maximize P(D=1|w,c) if (w, c) is in the corpus data.
○ maximize P(D=0|w,c) if (w, c) is not in the corpus data.
● We take a simple maximum likelihood approach of these two probabilities.
91. θ is the set of parameters of the model - in our case U and V, the input and output word vectors.
(Log taken on both sides.)
92. ● Now, maximizing the log likelihood = minimizing the negative log likelihood.
● D̃ is a “false” or negative “corpus” of wrong sentences - “jumped cat dog the the over”
● Generate D̃ on the fly by randomly sampling negatives from the word bank.
● For skip-gram, our new objective function for observing the context word w_{c−m+j} given
the center word w_c would be :
regular softmax loss for skip-gram
93. ● Likewise for CBOW, our new objective function for observing the center
word u_c given the context vector :
● In the above formulation, {ũ_k | k = 1 ... K} are sampled from P_n(w).
● The best P_n(w) = the unigram distribution raised to the power of 3/4
● Usually K = 20-30 works well.
regular softmax loss for CBOW
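Sampling negatives from the unigram distribution raised to 3/4 can be sketched as follows (toy counts; raising to 3/4 flattens the distribution so rare words are sampled more often than their raw frequency would suggest):

```python
import random

def negative_sampler(word_counts, power=0.75, seed=0):
    """Return a function that draws negatives from unigram^power."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]  # P_n(w) ∝ count^0.75
    rng = random.Random(seed)
    def sample(k):
        return rng.choices(words, weights=weights, k=k)
    return sample

sample = negative_sampler({"the": 1000, "cat": 50, "flying": 2})
negatives = sample(20)   # e.g. K = 20 negatives per training pair
print(len(negatives))    # 20
```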
96. Global matrix factorization methods
● Use co-occurrence counts
● Ex: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret &
Collobert)
+ Fast training
+ Efficient usage of statistics
+ Captures word similarity
- Do badly on analogy tasks
- Disproportionate importance given to large counts
96
97. Local context window method
● Use window to determine context of a word
● Ex: Skip-gram/CBOW ( Mikolov et al), NNLM(Bengio et al), HLBL, (Collobert &
Weston)
+ Capture word similarity.
+ Also perform better on analogy tasks
- Slow down with increase in corpus size
- Inefficient usage of statistics
97
98. Combining the best of both worlds
● Glove model tries to combine the two major model families :-
○ Global matrix factorization (co-occurrence counts)
○ Local context window (context comes from window)
= Co-occurrence counts with context distance
98
99. Co-occurrence counts with context distance
● Uses context distance : weight each word in context window using its distance
from the center word
● This ensures nearby words have more influence than far off ones.
● Sentence -> “I like NLP”
○ Co-occurrence for I -> like : 1.0 & I -> NLP : 0.5
○ Co-occurrence for like -> I : 1.0 & like -> NLP : 1.0
○ Co-occurrence for NLP -> I : 0.5 & NLP -> like : 1.0
● Corpus C: I like NLP. I like cricket.
Co-occurrence matrix for C
99
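A sketch reproducing the slide's distance-weighted counts (each context word contributes 1/distance to the count):

```python
from collections import defaultdict

def weighted_cooccurrence(sentences, window=2):
    """GloVe-style counts: each context word contributes 1/distance."""
    X = defaultdict(float)
    for sent in sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    X[(w, toks[j])] += 1.0 / abs(i - j)
    return X

X = weighted_cooccurrence(["I like NLP", "I like cricket"])
print(X[("I", "like")])  # 2.0  (distance 1, in both sentences)
print(X[("I", "NLP")])   # 0.5  (distance 2, once)
```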
100. Issues with Co-occurrence Matrix
● Long tail distribution
● Frequent words contribute disproportionately
(use weight function to fix this)
● Use Log for normalization
● Avoid log 0 : add 1 to each X_ij
100
101. Intuition for Glove
● Think of matrix factorization algorithms used in recommendation systems.
● Latent Factor models
○ Find features that describe the characteristics of rated objects.
○ Item characteristics and user preferences are described using vectors which are called factor
vectors z
○ Assumption: Ratings can be inferred from a model put together from a smaller number of
parameters
101
102. Latent Factor models
● The dot product estimates the user's interest in the item
○ where q_i : factor vector for item i
p_u : factor vector for user u
r̂_ui : estimated user interest
● How to compute the vectors for items and users ?
102
103. Matrix Factorization
● r_ui : known rating of user u for item i
● Predicted rating : r̂_ui = q_i^T p_u
● Similarly, the GloVe model tries to model the co-occurrence counts with the
following equation :
103
104. Weighting function
● Properties of f(X) :
○ vanishes at 0 i.e. f(0) = 0
○ monotonically increasing
○ f(x) should be relatively small for large values of x
● Empirically α = 0.75, x_max = 100 works best
104
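A sketch of the standard GloVe weighting function with the empirical values above, f(x) = (x/x_max)^α for x < x_max, else 1:

```python
def f(x, x_max=100, alpha=0.75):
    """GloVe weighting: vanishes at 0, increases monotonically, caps large counts."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(f(0))    # 0.0 -- pairs that never co-occur contribute nothing
print(f(10))   # ~0.178
print(f(500))  # 1.0 -- very frequent co-occurrences don't dominate
```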
105. Loss Function
● Scalable.
● Fast training
○ Training time doesn’t depend on the corpus size
○ Always fitting to a |V| x |V| matrix.
● Good performance with small corpus, and small vectors.
105
106. ● Input :
○ X_ij (|V| x |V| matrix) : co-occurrence matrix
● Parameters :
○ W (|V| x |D| matrix) & W̃ (|V| x |D| matrix) :
■ w_i and w̃_j are the representations of the i-th & j-th words from the W and W̃ matrices respectively.
○ b_i (|V| x 1) column vector : variable for incorporating biases in the terms
○ b̃_j (1 x |V|) row vector : variable for incorporating biases in the terms
Training
106
107. ● Train on Wikipedia data
● |V| = 2000
● Window size = 3
● Iterations = 10000
● D = 50
● Learn two representations for each word in |V|.
● reg = 0.01
● Use momentum optimizer with momentum=0.9.
● Takes less than 15 minutes
Quick Experiment
107
115. Artworks mapped using machine learning (t-SNE)
https://artsexperiments.withgoogle.com/tsnemap/#47.68,1025.98,361.43,51.29,0.00,271.67
116. Objective
● Given a collection of N high-dimensional objects x1, x2, …. xN.
● How can we get a feel for how these objects are (relatively) arranged ?
116
117. Introduction
● Build map(low dimension) s.t. distances between points reflect “similarities” in
the data :
● Minimize some objective function that measures the discrepancy between
similarities in the data and similarities in the map
117
124. t-SNE
● We have measure of similarity of data points in High Dimension
● We have measure of similarity of data points in Low Dimension
● We need a distance measure between the two.
● Once we have distance measure, all we want is : to minimize it
124
125. One possible choice - KL divergence
● It’s a measure of how one probability distribution diverges from a second
expected probability distribution
125
126. KL divergence applied to t-SNE
Objective function (C)
● We want nearby points in high-D to remain nearby in low-D
○ In case they are not:
■ p_ij will be large (because the points are nearby)
■ but q_ij will be small (because the points are far away)
■ This results in a larger penalty
■ In contrast, if both p_ij and q_ij are large : lower penalty
Ex : Let p_ij = 0.8 & q_ij = 0.1, Loss = log(0.8/0.1) = 2.08
Let p_ij = 0.8 & q_ij = 0.7, Loss = log(0.8/0.7) = 0.13 126
127. KL divergence applied to t-SNE
● Likewise, we want far away points in high-D to remain (relatively) far away in
low-D
○ In case they are not:
■ p_ij will be small (because the points are far away)
■ but q_ij will be large (because the points are nearby)
■ This results in a lower penalty
● t-SNE mainly preserves the local similarity structure of the data
127
129. Why a Student t-Distribution ?
● t-SNE tries to retain the local structure of the data in the map
● Result : dissimilar points have to be modelled as far apart in the map
● Hinton has shown that the Student t-distribution is very similar to a Gaussian,
but with heavier tails
● Local structures are preserved
● Global structure is lost
129
130. Deciding the effective number of neighbours
● We need to decide the radii in different parts of the space, so that we can keep
the effective number of neighbours about constant.
● A big radius leads to a high entropy for the distribution over neighbors of i.
● A small radius leads to a low entropy.
● So decide what entropy you want and then find the radius that produces that
entropy.
● It's easier to specify 2^entropy
○ This is called the perplexity
○ It is the effective number of neighbors.
130
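The perplexity = 2^entropy relation can be sketched directly (a uniform distribution over 8 neighbours gives perplexity exactly 8, i.e. 8 effective neighbours):

```python
import math

def perplexity(probs):
    """Perplexity = 2^entropy of the neighbour distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

print(perplexity([1/8] * 8))             # 8.0 -- uniform over 8 neighbours
print(perplexity([0.7, 0.1, 0.1, 0.1]))  # < 4 -- a peaked distribution has
                                         # fewer "effective" neighbours
```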
132. Hyper parameters really matter: Playing with perplexity
● projected 100 data points clearly separated in two different clusters with tSNE
● Applied tSNE with different values of perplexity
● With perplexity=2, local variations in the data dominate
● With perplexity in range(5-50) as suggested in paper, plots still capture some structure in the data
132
133. Hyper parameters really matter: Playing with #iterations
● Perplexity set to 30.0
● Applied tSNE with different number of iterations
● Takeaway : different datasets may require different number of iterations
133
134. Cluster sizes can be misleading
● Used tSNE to plot two clusters with different standard deviations
● Bottom line : we cannot judge cluster sizes in t-SNE plots
134
135. Distances in t-SNE plots
● At lower perplexity clusters look equidistant
● At perplexity=50, tSNE captures some notion of global geometry in the data
● 50 data points in each sub cluster
135
136. Distances in t-SNE plots
● tSNE is not able to capture global geometry even at perplexity=50.
● key take away : well separated clusters may not mean anything in tSNE.
● 200 data points in each sub cluster
136
137. Random noise doesn’t always look random
● For this experiment, we generated random points from gaussian distribution
● Plots with lower perplexity, showing misleading structures in the data
137
138. You can see some shapes sometimes
● Axis-aligned gaussian distribution
● For certain values of perplexity, the long cluster looks almost correct.
● tSNE tends to expand regions which are denser
138
142. At heart they are all the same !!
● It has been shown that, in essence, GloVe and word2vec are no different
from traditional methods like PCA, LSA etc. (Levy et al. 2015 call them
DSMs)
● GloVe ⋍ PCA/LSA is straightforward (both factorize global counts matrix)
● word2vec ⋍ PCA/LSA is non-trivial (Levy et al. 2015)
● They show that in essence word2vec also factorizes word context matrix
(PMI)
142
143. ● Despite this “equality” of algorithm, word2vec is still known to do better
on several tasks.
● Why ?
○ Levy et al. 2015 show : magic lies in Hyperparameters
143
144. Hyperparameters
● Pre-processing
○ Dynamic context window
○ Subsampling frequent words
○ Deleting rare words
● Post-processing
○ Adding context words
○ Vector normalization
144
145. Pre-processing
● Dynamic Context window
○ In DSM, context window: unweighted & constant size.
○ Glove & SGNS - give more weightage to closer terms
○ SGNS - even the window size can be dynamic, taking a value between 1 & the max window size.
● Subsampling frequent words
○ SGNS dilutes frequent words by randomly removing words whose frequency f is higher than
some threshold t, with probability
● Deleting rare words
○ In SGNS, rare words are also deleted before creating context windows.
○ Levy et al. (2015) show this didn't have a significant impact.
145
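The subsampling rule can be sketched as follows (using the commonly cited form P(discard w) = 1 − sqrt(t/f); the released word2vec code uses a slight variant of this formula):

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of dropping a word whose corpus frequency is `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_prob(0.05))  # a very frequent word ("the"): dropped ~98.6% of the time
print(discard_prob(1e-6))  # 0.0 -- rare words are never dropped
```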
146. Post-processing
● Adding context vectors
○ Glove adds word vectors and the context vectors for the final representation.
■
● Vector normalization
○ All vectors can be normalized to unit length
146
147. Key Take Home
● Hyperparameters vs Algorithms
○ Hyper parameter settings is more important than the algorithm choice
○ No single algorithm consistently outperforms the other ones
● Hyperparameters vs more data
○ Training on larger corpus helps on some tasks
○ In many cases, tuning hyperparameters is more beneficial
147
148. References
Idea of word vectors is not new.
• Learning representations by back-propagating errors (Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
• Sebastian Ruder’s 3 part Blog series
• Lecture 2-4, CS 224d “Deep Learning for NLP” by Richard Socher
• word2vec Parameter Learning Explained by X Rong
148
153. •Document level language models
Key Learning outcomes:
• Combining word vectors
• Key ideas behind document vectors
• DM, DBOW
• How are they similar/different from
word vectors
• Drawbacks of these approaches
• Skip-Thought vectors
• RNNs: LSTM, GRU
• Architecture of skip-thought vectors
153
Module 3
156. Sentence Representation
Task : Train a ML model for sentiment classification.
Problem :
Given a sentence, predict its sentiment.
Solution:
1) Represent the sentence in mathematical format
2) Train a model on data - sentence, label
How do you represent the sentence ? We want a representation that captures the
semantics of the sentence.
156
157. We already have word vectors.
Can we use these to come up with a way to represent the sentence ?
Eg :- “the cat sat on the table”
We have vectors for “the”, “cat”, “sat”, “on”, “the” & “table”.
How can we use the vectors for words to get vector for sentence ?
157
158. Possible Solutions
Sentence (S) - “The cat sat on the table”
Concatenation : Our sentence is one word followed by another. So, its
representation can be the word vectors for every word in the sentence, in the same
order.
S_v = [wv_The wv_cat wv_sat wv_on wv_the wv_table]
Each word is represented by a d-dimensional vector, so a sentence with k words
has k × d dimensions.
Problem : Different sentences in the corpus will have different lengths. Most ML
models work with fixed-length input.
158
159. Mean of word vectors:
Take the (optionally weighted) average of the word vectors:
S_v = (1/k) Σ_i wv_i
159
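The averaging above can be sketched in plain Python (hypothetical toy word vectors; uniform weights by default, but any per-word weighting such as TF-IDF could be passed in):

```python
def sentence_vector(word_vectors, tokens, weights=None):
    """Weighted average of word vectors: one fixed-length vector per sentence."""
    weights = weights or {t: 1.0 for t in tokens}
    vecs = [[weights[t] * x for x in word_vectors[t]] for t in tokens]
    total = sum(weights[t] for t in tokens)
    return [sum(col) / total for col in zip(*vecs)]

wv = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
print(sentence_vector(wv, ["the", "cat", "sat"]))  # [0.666..., 0.666...]
```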
160. Fallacies
● Different sentences with same words but different ordering will give same
vector.
○ “are you good” vs “you are good”
● Negation - opposite meaning but very similar words
○ “I do want a car” vs “I don’t want a car”
If the word vectors for “do” and “don't” are close by, then their
sentence vectors will also be close by. If these 2 sentences are in opposite
classes, we are in trouble.
● Sentence vector generated via simple operations on word vectors - often do
not capture syntactic and semantics properties.
160
161. Motivation
● Build vector representation at sentence/paragraph/document level such that it
has the following properties :
○ Syntactic properties:
■ Ordering of words
○ Semantic properties:
■ Sentences that have the same meaning should come together.
■ Capturing negation.
○ Provide fixed length representation for variable length text.
161
162. Solution
● Doc2Vec*
○ Distributed Memory (DM)
○ Distributed Bag Of Words (DBOW)
● We will study these 2 methods to learn a representation for text at paragraph
level. However, this is applicable directly at sentence and document level too.
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
162
163. Distributed Memory (DM)
● We saw that word2vec uses context words to predict the target word.
● In the distributed memory model, we simply extend the above idea - we use the
paragraph vector along with the context word vectors to predict the next word.
● S = “The cat sat on the table”
● (S_v, wv_The, wv_cat, wv_sat) → wv_on
163
164. Architecture
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
[Figure: paragraph-vector (Para2vec) matrix D of size |N| × d_dv alongside the word2vec matrix W of size |V| × d_w]
164
165. Details
● Each document is represented by a d_dv-dimensional vector.
● Each word is represented by a d_w-dimensional vector.
● Index the vectors for document d and words w_1, w_2 & w_3 (i.e. The, cat & sat)
● These vectors are then combined (concatenated/averaged) for predicting the next
word (w_4) in the document.
165
166. Details
● Objective of word vector model.
● Prediction is obtained through multi class classification.
● Each y_i is the un-normalized log-probability for output word i.
● where U, b are the softmax parameters. h is constructed by a concatenation
or average of word vectors extracted from W.
● Cross entropy loss function is used to learn the representation of the word
and each document vector. 166
167. Generating representation at test time
Sentence : “I got back home.”
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
167
168. Distributed Bag of Words (DBOW)
● We saw that word2vec's skip-gram uses the target word to predict the context words.
● In the distributed bag of words model, we extend that idea - we use the
paragraph vector alone to predict words sampled from the paragraph.
● S = “The cat sat on the table”
(S_v) → (wv_The, wv_cat, wv_sat, wv_on)
168
169. * Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
Architecture
● Words and the ordering of the words
uniquely define a paragraph.
● Reversing this : a paragraph uniquely
defines the words and their ordering
present in the paragraph.
● Thus, given a paragraph representation,
we should be able to predict the words in
the paragraph
● This is precisely what DBOW does.
169
170. DBOW
● Each document is represented by a d_dv-dimensional vector.
● The softmax layer outputs a |V|-dimensional vector (this is nothing but a probability
distribution over words).
● Essentially, we are trying to learn a document representation which can
predict the words in any window of the document.
170
171. Details
● Random windows are sampled from each document.
● Document vector is used to make a prediction for words in this window.
● Cross entropy loss function is used to learn the representation of the word
and each document vector.
171
172. Generating representation at test time
Sentence : “I got back home.”
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
172
173. Evaluation
• Paragraph vector + 9 words to predict the 10th word
• Input: concatenation of 400-dim DBOW and DM vectors.
• Predicts test-set paragraph vectors from frozen train-set word vectors
Stanford IMDB movie review data set
* Le, Quoc; et al. "Distributed Representations of Sentences and Documents"
173
177. Drawbacks
● Inference needs to be performed at test time to generate the vector
representation of a sentence in the test corpus.
● This scales poorly for applications which incorporate large amounts of text.
177
179. Hacker’s way for quick implementation
Gensim notebook
TensorFlow implementation
179
181. Motivation
● Although various techniques exist for generating sentence and paragraph
vectors, there is a lack of a generalized framework for sentence encoding.
● Encode a sentence based on its neighbours (encode a sentence and try to
generate the two neighbouring sentences in the decoding layer).
● Doc2vec requires explicit inference to be performed in order to generate the vector
representation of a sentence at test time.
181
182. Introduction to skip-thoughts
● word2vec's skip-gram model applied at the sentence level.
● Instead of using a word to predict its surrounding words, use a sentence to
predict its surrounding sentences.
● Corpus : I got back home. I could see the cat on the steps. This was
strange.
s_{i-1} : I got back home.
s_i : I could see the cat on the steps.
s_{i+1} : This was strange.
182
183. Introduction to skip-thoughts
● We need an ML model that can (sequentially) consume variable-length sentences
● And after consumption, use the knowledge gained from the whole sentence to
predict the neighbouring sentences
● FFNs and CNNs can neither consume sequential text nor persist information
183
184. RNN
● Motivation: How do humans understand language
○ “How are you ? Lets go for a coffee ? ...”
● As we read from left to right, we don’t understand each word in isolation,
completely throwing away previous words. We understand each word in
conjunction with our understanding from previous words.
● Traditional neural networks (FFNs, CNNs) can not reason based on
understanding from previous words - no information persistence.
184
185. RNN
● RNNs are designed to do exactly this - they have loops in them, allowing
information to persist.
● In the above diagram, A looks at input x_t and produces hidden state h_t. A loop
allows information to be passed from one step of the network to the next.
Thus, x_0 to x_{t-1} are used while consuming x_t.
Image borrowed from Christopher Olah’s blog
185
186. ● To better understand the loop in RNN, let us unroll it.
Time
● The chain depicts information(state) being passed from one step to another.
● Popular RNNs: LSTM, GRU
Image borrowed from Christopher Olah’s blog
186
189. In CNNs, parameters are shared across space. In RNNs, parameters are shared across time.
189
190. Architecture of RNN
● All RNNs have a chain of repeating modules of neural network.
● In basic RNNs, this repeating module will have a very simple structure, such
as a single tanh layer.
Image borrowed from Christopher Olah’s
190
191. Image borrowed from suriyadeepan’s
The state consists of a single “hidden” vector h
191
192. The Dark side
● RNNs have difficulty dealing with long-range dependencies.
● “Nitin says Ram is an awesome person to talk to, you should definitely meet
him”.
● In theory they can “summarize all the information until time t with hidden state ht”.
● In practice, this is far from true.
192
193. ● This is primarily due to deficiencies in the training algorithm - BPTT (Back
Propagation Through Time)
● Gradients are computed via chain rule. So either the gradients become:
○ Too small (Vanishing gradients)
■ Multiplying n of these small gradients (<1) results in even smaller gradient.
○ Too big (Exploding gradients)
■ Multiplying n of these large gradients (>1) results in even larger gradient.
193
194. LSTM
● LSTMs are specifically designed to handle long term dependencies.
● The way they do it is using cell memory: The LSTM does have the ability to
remove or add information to the cell state, carefully regulated by structures
called “gates”.
● Gates control what information is to be added or deleted.
194
195. ● “forget gate” decides what information to throw from cell state.
● It looks at ht-1 and xt, and outputs a number between 0 and 1 for each number
in the cell state Ct-1. A 1 represents “completely keep this” while a 0
represents “completely get rid of this.”
Image borrowed from Christopher Olah’s
195
196. ● “input gate” decides which values in cell state to update.
● tanh layer creates candidate values which may be added to the state
Image borrowed from Christopher Olah’s
196
197. ● “forget gate” & “input gate” come together to update cell state.
Image borrowed from Christopher Olah’s
197
198. ● “output gate” decides the output.
Image borrowed from Christopher Olah’s
198
199. ● There are many variants.
● Each variant has some gates that control what is stored/deleted.
● At the heart of any LSTM implementation are these gate equations.
● By making the memory-cell update additive, LSTMs circumvent the problem of
vanishing gradients.
● For exploding gradients - use gradient clipping.
199
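The gate equations referred to above follow the standard LSTM formulation; below is a minimal NumPy sketch of one time step (the stacked-parameter layout and sizes are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM time step. W is (4H, D), U is (4H, H), b is (4H,):
# the four blocks hold the forget, input, output, and candidate parameters.
def lstm_step(x, h_prev, c_prev, W, U, b):
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # pre-activations for all four blocks
    f = sigmoid(z[0:H])               # forget gate: what to erase from c
    i = sigmoid(z[H:2 * H])           # input gate: what to write to c
    o = sigmoid(z[2 * H:3 * H])       # output gate: what to expose as h
    g = np.tanh(z[3 * H:4 * H])       # candidate values
    c = f * c_prev + i * g            # additive cell-state update
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 2, 3
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
```

The additive form of the `c` update is exactly what lets gradients flow without repeated shrinking.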
200. GRU
● GRU units are a simplification of LSTM units.
● A gated recurrent unit has 2 gates.
● A GRU has no separate internal memory cell.
● A GRU does not use a second nonlinearity when computing the output.
200
201. Details
● Reset Gate
○ Decides how to combine the new input with the previous memory.
● Update Gate
○ Decides how much of the previous memory should be kept around.
201
202. LSTM & GRU Benefits
● Remember over longer temporal durations.
● Vanilla RNNs struggle to remember over long durations.
● Able to let feedback flow at different strengths depending on the inputs.
202
204. Encoding
● Let x1, x2, …, xN be the words in sentence si, where N is the number of words.
● The encoder produces an output representation at time step t, which is the
representation of the sequence x1, x2, …, xt.
● The hidden state at the final time step, hi^N, is the output representation of the
entire sentence.
204
206. Decoding
● The decoder conditions on the encoder output hi.
● One decoder is used for next sentence, while another decoder is used for the
previous sentence.
● Decoders share the vocabulary V, but learn the other parameters separately.
206
208. Details
● Given h^t_{i+1}, the probability of word w^t_{i+1} given the previous t − 1 words and the
encoder vector is
P(w^t_{i+1} | w^{<t}_{i+1}, hi) ∝ exp( v_{w^t_{i+1}} · h^t_{i+1} )
● where v_{w^t_{i+1}} denotes the row of V corresponding to the word w^t_{i+1}.
● A similar computation is performed for the previous sentence si-1.
208
209. Objective Function
● Given a tuple (si-1, si, si+1), the objective is the sum of the log-probabilities for
the forward (si+1) and backward (si-1) sentences conditioned on the encoder
representation:
Σt log P(w^t_{i+1} | w^{<t}_{i+1}, hi) + Σt log P(w^t_{i-1} | w^{<t}_{i-1}, hi)
● The total objective is the above summed over all such training tuples.
209
219. Drawbacks
● Until now we built language models at word/sentence/paragraph/document
level.
● There are a couple of major problems with them:
○ Out Of Vocabulary (OOV) - how to handle missing words ?
○ Low frequency count - Zipf’s Law tells us that in any natural language corpus a majority of
the vocabulary word types will either be absent or occur in low frequency.
○ Blind to subword information - “event”, “eventfully”, “uneventful”, “uneventfully” should have
structurally related embeddings.
219
○ Each word vector is independent - so you may have vectors for “run”, “ran”, “running” but there is
no (clean) way to use them to obtain a vector for “runs”. Poor estimate of unseen words.
○ Storage space - we have to store a large number of word vectors. English Wikipedia contains 60 million
sentences with 6 billion tokens, of which ~20 million are unique words. This is typically countered
by capping the vocabulary size.
○ Generative models: Imagine you feed k words/sentences to the model, and ask it to predict (k+1)st
word/sentence.
■ How well is such a model likely to do ?
■ Badly
■ Why ?
■ Large output space.
■ Need massive data to learn a meaningful distribution.
■ We need a small output space.
220
221. Way forward
● Construct vector representation from smaller pieces:
○ Morphemes:
■ Meaningful morphological unit of a language that cannot be further divided (e.g. for
‘incoming’ morphemes are : in, come, ing)
■ Ideal primitive. By definition they are minimal meaning bearing units of a language.
■ Given a word, breaking it into morphemes is non-trivial.
■ Requires morphological tagger as preprocessing step (Botha and Blunsom 2014; Luong,
Socher, and Manning 2013)
○ Characters:
■ Fundamental unit
■ Easy to identify
■ How characters compose to give meaning is not very clear: “less”, “lesser”, “lessen”,
“lesson”
■ Most languages have a relatively small character set -
● this gives a simple yet effective tool to handle any language.
● And handle OOV
221
● For the rest of this presentation, we will treat text as a sequence of characters
- feeding 1 character at a time to our model.
● For this we need models that are capable of consuming and processing
sequences - which FFNs and CNNs are not.
● RNN - Recurrent Neural Networks
○ LSTM
○ GRU
222
● Imagine we are working with the English language.
● Roughly ~70 unique characters.
● Easiest character embedding - 1-hot vectors in a 70-dimensional space.
● Every 2 characters are equally distant (equally near). Is there any use of such an
embedding? YES
Simplest char2vec
223
224. Unreasonable effectiveness of RNN*
● Blog by Andrej Karpathy in 2015
● Demonstrated the power of character level language models.
● Central problem: Given k (contiguous) characters (from a text corpus),
predict the (k+1)st character.
● Very very interesting results
* karpathy.github.io/2015/05/21/rnn-effectiveness/
224
227. char2vec : Toy Example
Example training
sequence: “hello”
Vocabulary: [h,e,l,o]
227
228. Let’s implement it !
● Take input text (say Shakespeare’s novels), and using a sliding window of
length (k+1) slice the raw text into contiguous chunks of (k+1) characters.
● Split each chunk into (X, y) pairs where the first k characters become X and the
(k+1)th character is the y. This becomes our training data.
228
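The sliding-window slicing described above can be sketched as follows (`make_pairs` is a hypothetical helper name):

```python
# Slide a window of length k+1 over the text: the first k characters
# become the input X, the (k+1)th character becomes the target y.
def make_pairs(text, k):
    return [(text[i:i + k], text[i + k]) for i in range(len(text) - k)]

pairs = make_pairs("hello world", 4)
# pairs[0] == ("hell", "o")
```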
229. ● Map each character to a unique id
● Say we have d unique characters in our corpus
● Each character is a vector of d dimensions in 1-hot format
● A sequence of k characters is : 2d tensor of k x d
● Dataset X is : 3d tensor of m sequences, each of k x d
● Y is 2d tensor : m x d. Why ?
● Feed X, y to RNN
● It takes one row at each time step.
(figure: each character is a d-dimensional one-hot column; a sequence is a k × d matrix; the dataset X stacks m such matrices)
229
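The tensor shapes above (X: m × k × d, y: m × d) can be built as follows (a toy sketch on a tiny corpus; the helper name is an assumption):

```python
import numpy as np

# Build X (m sequences of k one-hot rows) and y (m one-hot targets).
def one_hot_dataset(pairs, char_to_id):
    d = len(char_to_id)
    k = len(pairs[0][0])
    m = len(pairs)
    X = np.zeros((m, k, d))
    y = np.zeros((m, d))
    for s, (seq, target) in enumerate(pairs):
        for t, ch in enumerate(seq):
            X[s, t, char_to_id[ch]] = 1.0
        y[s, char_to_id[target]] = 1.0
    return X, y

text = "hello"                                   # toy corpus, k = 2
char_to_id = {c: i for i, c in enumerate(sorted(set(text)))}
pairs = [(text[i:i + 2], text[i + 2]) for i in range(len(text) - 2)]
X, y = one_hot_dataset(pairs, char_to_id)
# X.shape == (3, 2, 4) and y.shape == (3, 4): "hello" has 4 unique chars
```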
230. ● We will use keras
● A super simple library on top of TF/Theano
● Meant for both beginners and advanced users.
● Exceptionally useful for quick prototyping.
● Super popular on kaggle
Almost there ….
230
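A minimal Keras sketch of the kind of model this setup feeds into; the layer sizes and window length are illustrative assumptions, not taken from the notebook:

```python
# An LSTM consumes k one-hot characters; a softmax layer predicts
# a distribution over the d-character vocabulary for the next character.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

k, d = 40, 70                        # window length, character vocabulary
model = Sequential([
    Input(shape=(k, d)),             # one k x d one-hot sequence at a time
    LSTM(128),                       # consumes the sequence step by step
    Dense(d, activation="softmax"),  # distribution over the next character
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

Training is then just `model.fit(X, y, ...)` on the (X, y) tensors built earlier.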
232. Some more awesome applications of char2vec
Writing with machine
DeepDrumpf
232
233. Similar idea applied via CNN
● Similarly, Zhang et al. applied a CNN instead of an RNN directly to 1-hot
character vectors.
“Text Understanding from Scratch” Xiang Zhang, Yann LeCun
“Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Junbo Zhao, Yann
LeCun
233
234. Dense char2vec
● 1-hot encoding of characters is fairly straightforward and very useful.
● But people have shown that learning a dense character-level representation can
work even better (improved results, or similar results with fewer parameters).
● It also results in fewer parameters in the input layer and its subsequent layer
(though not by much) (# of edges between the embedding layer and the next layer).
● Simplest way to learn dense character vectors?
234
235. CBOW & SkipGram
● Original CBOW and Skip-Gram were based on words.
● Use the same architecture, but at the character level, i.e.
○ CBOW = given characters in context, predict the target character
○ Skip Gram = given target character, predict characters in context
235
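The character-level skip-gram pairing can be sketched as follows (a toy illustration; in real training these pairs would be fed to a word2vec-style model):

```python
# For each target character, emit (target, context) pairs within
# a +/- window around it.
def char_skipgram_pairs(text, window=1):
    pairs = []
    for i, target in enumerate(text):
        for j in range(max(0, i - window), min(len(text), i + window + 1)):
            if j != i:
                pairs.append((target, text[j]))
    return pairs

pairs = char_skipgram_pairs("cat")
# [("c", "a"), ("a", "c"), ("a", "t"), ("t", "a")]
```

For CBOW, the roles flip: the context characters predict the target character.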
We have provided the notebook for character-level skip-gram.
Notebook for character-level CBOW: take-home assignment!
236
237. How good is the embedding ?
● Word vectors or document vectors are evaluated using both intrinsic and
extrinsic evaluation.
● Character vectors have only extrinsic evaluation.
● It makes no sense to say something like r : s :: a : b.
● Even from a human perspective, a character has no meaning on its own.
● Building character embeddings is relatively cheap, hence most task-specific
architectures have this component built into them.
Man : King :: Woman : Queen (intrinsic) Sentiment analysis (extrinsic)
237
238. Tweet2Vec*
● Twitter - informal language, slang, spelling errors,
abbreviations, a new and ever-evolving vocabulary,
and special characters.
● For most Twitter corpora, the size of the vocabulary is
~30-50% of the number of documents.
● Cannot use word-level approaches - very large
vocabulary size.
● Not only does this make it practically infeasible, it also
affects the quality of word vectors. Why?
* Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et
al.
238
239. Task
● Given a tweet, predict its hashtag.
● “Shoutout to @xfoml Project in rob wittig talk #ELO17”
● Super easy to collect a dataset.
239
240. Designing N/W
● raw characters → character embeddings → bi-directional GRU
● Why a bi-directional GRU (BGRU)?
○ Language is not just a forward sequence.
○ “He went to ___?___”
○ “He went to ___?___ to buy groceries”
○ It is both past words and future words that determine the missing word.
○ A BGRU exploits this - it has 2 independent GRU networks. One consumes text in the
forward direction while the other consumes it in the backward direction.
240
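A Keras sketch of such a network; the vocabulary size, embedding width, tweet length, and hashtag count are all illustrative assumptions:

```python
# raw character ids -> char embeddings -> bidirectional GRU -> hashtag scores
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, GRU, Bidirectional, Dense

model = Sequential([
    Input(shape=(140,)),                # character ids of one tweet
    Embedding(70, 32),                  # raw chars -> 32-dim char embeddings
    Bidirectional(GRU(64)),             # forward + backward GRU, concatenated
    Dense(1000, activation="softmax"),  # one score per candidate hashtag
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

`Bidirectional` wraps the GRU in the two independent forward/backward networks described above and concatenates their final states.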
242. Loss function
● The final tweet embedding is used to produce a score for every hashtag.
● Scores are converted to probabilities using softmax.
● This gives a distribution over hashtags.
● This is compared against the true distribution.
● Cross entropy is used to measure the gap between the 2 distributions.
● This is the loss function (J).
242
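The score → softmax → cross-entropy pipeline above can be sketched in NumPy (toy values, 3 hashtags):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(p_true, p_pred):
    return -float(np.sum(p_true * np.log(p_pred + 1e-12)))

scores = np.array([2.0, 1.0, 0.1])      # one score per hashtag
p = softmax(scores)                     # distribution over hashtags
true = np.array([1.0, 0.0, 0.0])        # the correct hashtag (one-hot)
loss = cross_entropy(true, p)           # the loss J to minimize
```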
246. Char CNN
● Convolutional Neural Nets (CNNs)* have been super successful in the area of
vision.
● A CNN treats an image as a signal in the spatial domain.
● Can it be applied to text? Yes
○ Text = stream of characters
○ Since characters come one after another - this is a signal in the time domain
○ Embedding matrix as input matrix
* LeNet-5 by Yann LeCun
Pixels spread in space. Position of each pixel
is fixed. Changing that will change the image
Characters spread in time. Position (1d) of
each character is fixed. Changing that will
change the sentence.
246
247. Basics of CNN
● Input : Image
● Image is nothing but a signal in space.
● Represented by matrix with values (RGB)
● Each value ~ intensity of the red, green and blue signals respectively.
Tweet2Vec: Character-Based Distributed Representations for Social Media - Dhingra et al.
247
249. ● In simplest terms: given 2 signals x() and h(), convolution combines the
2 signals:
(x ∗ h)(t) = ∫ x(τ) h(t − τ) dτ
● In the discrete space:
(x ∗ h)[n] = Σk x[k] h[n − k]
● For our case the image is x().
● h() is called the filter/kernel/feature detector - a well known concept in the world
of image processing.
Convolution
249
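The discrete convolution sum can be written out directly in Python (a naive sketch; NumPy's `np.convolve` computes the same thing):

```python
# Discrete convolution y[n] = sum_k x[k] * h[n - k].
def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

result = convolve([1, 2, 3], [0, 1])
# == [0.0, 1.0, 2.0, 3.0]: convolving with [0, 1] shifts the signal by one step
```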
250. ● Ex: Filters for edge
detection, blurring,
sharpen, etc
● It is usually a small
matrix - 3x3, 5x5, 5x7
etc
● There are well known
predefined filters
https://en.wikipedia.org/wiki/Kernel_(image_processing)
250
251. Filter:
1 0 1
0 1 0
1 0 1
Image:
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
Convolved value for the top-left 3×3 patch:
1*1 + 1*0 + 1*1 +
0*0 + 1*1 + 1*0 +
0*1 + 0*0 + 1*1 = 4
● A convolved feature is nothing but taking a part of the image and applying the
filter over it - taking pairwise products and adding them.
251
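The worked example above can be checked in NumPy (the `conv2d` helper is a naive valid-mode convolution, written for clarity rather than speed):

```python
import numpy as np

# Slide the kernel over the image, multiply elementwise, and sum
# (cross-correlation, as CNNs actually compute it).
def conv2d(image, kernel):
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
fmap = conv2d(image, kernel)
# fmap[0, 0] == 4.0, matching the worked example; fmap.shape == (3, 3)
```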
252. ● A convolved feature map is nothing but the result of sliding the filter over the
entire image and applying convolution at each step, as shown in the diagram below:
1 0 1
0 1 0
1 0 1
https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cn
Filter
252
253. ● Image processing over the past few decades has
built many filters for specific tasks.
● In DL (CNNs), rather than using predefined filters,
we learn the filters.
● We start with small random values and update
them using gradients.
● Stride: by how much we shift the filter.
? ? ?
? ? ?
? ? ?
253
254. ● It’s a simple technique for down-sampling.
● In CNNs, downsampling, or “pooling”, layers are often placed after
convolutional layers.
● They are used mainly to reduce the feature map dimensionality for
computational efficiency. This in turn can improve actual performance.
● Pooling takes disjoint chunks of the image (typically 2×2) and aggregates
them into a single value.
● Average, max, min, etc. The most popular is max-pooling.
Pooling
https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
254
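Max-pooling over disjoint 2×2 chunks can be sketched as follows (a toy example with made-up feature-map values):

```python
import numpy as np

# Aggregate each disjoint size x size chunk into its maximum value.
def max_pool(fmap, size=2):
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 3, 1, 4]])
pooled = max_pool(fmap)
# [[4, 2], [3, 5]]: a 4x4 map downsampled to 2x2
```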
255. Putting it all together
https://adeshpande3.github.io
255
256. Character-Aware Neural Language Models*
● Uses subword information through a character-level CNN.
● The output from the CNN is then fed to an RNN (LSTM).
● We learn an embedding for each character.
● A word w is then nothing but the embeddings of its constituent characters.
● For each word, we apply convolution on its character embeddings to obtain
features.
● These are then fed to the RNN via highway layers.
* “Character-Aware Neural Language Models” Y kim 2015 256
257. Details
● C - vocabulary of characters. D - dimensionality
of character embeddings. R - the |C| × D matrix of character
embeddings.
● Let wk = [c1, …, cl]. Let Ck be the D × l character-level
representation of wk.
● Apply a filter H (of width w) to Ck to obtain a feature map fk of
length l − w + 1, given by:
fk[i] = tanh(⟨Ck[:, i : i + w − 1], H⟩ + b)
(figure: character embeddings c1 … cl stacked into Ck, convolved with H into fk)
257
258. Details
● To capture the most important feature, we take the max over time:
yk = max_i fk[i]
● yk is the feature corresponding to filter H.
● Likewise, we apply h filters H1, …, Hh. Then yk = [yk^1, …, yk^h].
● These are then fed to the RNN.
258
259. ● Rather than feeding yk directly to the LSTM, we pass it via a Highway network*.
● Highway network:
○ Basic idea: carry some of the input directly to the output. We learn “what parts to carry”.
○ FFN: linear transformation of the input followed by a nonlinearity (usually g = tanh),
with y as the input: g(W_H y + b_H)
○ One layer of a Highway network does the following:
z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y, with t = σ(W_T y + b_T)
○ t is called the transform gate; (1 − t) is called the carry gate.
* “Training Very Deep Networks” Srivastava et al. 2015 259
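The highway-layer computation can be sketched in NumPy (parameter names follow the transform/carry-gate description above; the example drives the transform gate toward 0 so the input is carried through unchanged):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One highway layer: t * g(W_H y + b_H) + (1 - t) * y,
# with transform gate t = sigmoid(W_T y + b_T).
def highway_layer(y, W_H, b_H, W_T, b_T):
    t = sigmoid(W_T @ y + b_T)          # transform gate
    g = np.tanh(W_H @ y + b_H)          # plain FFN transformation
    return t * g + (1.0 - t) * y        # (1 - t) is the carry gate

y = np.array([0.5, -1.0, 2.0])
Z = np.zeros((3, 3))
# A strongly negative transform bias drives t -> 0,
# so the layer simply carries its input through.
out = highway_layer(y, Z, np.zeros(3), Z, np.full(3, -100.0))
```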
262. References
● Tweet2vec:
○ “Character-based Neural Embeddings for Tweet Clustering” - Vakulenko et al.
○ “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network” - Sakaguchi et al.
● Basics of CNN
○ https://adeshpande3.github.io
262
263. ● CNN on text:
○ https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f
○ https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623
○ Seminal paper - “Convolutional Neural Networks for Sentence Classification”, Y. Kim
○ “Text Understanding from Scratch”, Xiang Zhang, Yann LeCun
○ “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Junbo Zhao, Yann LeCun
○ “Character-Aware Neural Language Models”, Y. Kim
● Character Embeddings:
○ “Character-level Convolutional Networks for Text Classification”, Xiang Zhang, Junbo Zhao, Yann LeCun
○ “Character-Aware Neural Language Models”, Y. Kim
○ “Exploring the Limits of Language Modeling”, Google Brain team
○ “Finding Function in Form: Compositional Character Models for Open Vocabulary Word
Representation”, Wang Ling et al.
○ “Learning Character-level Representations for Part-of-Speech Tagging”, Santos et al.
263
264. Summary
● We learnt various ways to build representation at :
○ Word level
○ sentence/paragraph/document level
○ character level
● We discussed the key architectures used in representation learning and the
fundamental ideas behind them.
● The core idea throughout: context units and target units.
● We also saw the strengths and weaknesses of each of these ideas.
● There is no single representation that is “best”.
264
265. ● Start with pretrained embeddings. This serves as a baseline.
● Use rigorous evaluation - both intrinsic and extrinsic.
● If you have a lot of data, fine-tuning pretrained embeddings can improve
performance on the extrinsic task.
● If your dataset is small - worth trying GloVe. Don’t try fine-tuning.
● Embeddings and task are closely tied. An embedding that works beautifully for
NER might fail miserably for sentiment analysis.
○ “It was a great movie”
○ “Such a boring movie”
If you are training word vectors and in your corpus “great” and “boring” appear in similar contexts, their
vectors will be close in embedding space. Thus, they may be difficult to separate.
265
266. ● Hyperparameters matter: many a time they are the key distinguisher.
● Character embeddings are usually task-specific. Thus, they often tend to do
better.
● However, character embeddings can be expensive to train.
● The building blocks are the same: new architectures can be built using the same
principles.
● State of the art (for practitioners) - fastText from Facebook.
○ Trains embeddings for character n-grams.
○ Character n-grams(“beautiful”) : {“bea”, “eau”, “aut”, ………}
266
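The character n-gram extraction above can be sketched as follows (a trigram-only toy; real fastText also adds word-boundary markers `<`/`>`, the whole word itself, and a range of n):

```python
# Character trigrams of a word, fastText-style subunits.
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

grams = char_ngrams("beautiful")
# ['bea', 'eau', 'aut', 'uti', 'tif', 'ifu', 'ful']
```

A word's vector is then the sum of its n-gram vectors, which is what lets fastText handle OOV words.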
• Please star the repo.
• Run the notebooks. Play and experiment with them. Break them.
• If you come across any bug, please open an issue on our GitHub repo.
• Want to contribute to this repo? Great! Please contact us.
• https://github.com/anujgupta82/Representation-Learning-for-NLP
Thank You 267