Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483
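Presenter: Yunjey Choi (최윤제), master's student at Korea University
Yunjey Choi majored in computer science at Korea University and is currently a master's student studying machine learning. Choi enjoys coding and sharing what has been understood with others. After a year of studying deep learning with TensorFlow, Choi is now studying Generative Adversarial Networks with PyTorch, and has implemented several papers in TensorFlow and published a PyTorch tutorial on GitHub.
Overview:
The Generative Adversarial Network (GAN), first proposed by Ian Goodfellow in 2014, is a generative model that estimates the distribution of real data through adversarial training. GANs have recently become one of the most popular research areas, and numerous related papers appear every day.
Is it hard to keep up with the endless stream of GAN papers? Don't worry. Once you fully understand the basic GAN, newly published papers become easy to follow.
In this talk I aim to share everything I know about GANs. It should be useful for those who are completely new to GANs, those curious about the theory behind GANs, and those wondering how GANs can be applied.
Talk video: https://youtu.be/odpjk7_tGY0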
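Hello, this is the Deep Learning Paper Reading Group. The paper review video uploaded today covers the paper titled 'Transformer Interpretability Beyond Attention Visualization'.
The Transformer is one of the models mentioned most often in our paper review videos so far. Beyond NLP, it has been used as a state-of-the-art network in many areas, including image processing. This paper studies in depth, using a variety of methods, how a Transformer makes decisions in the image processing domain, with a particular focus on the self-attention module.
For today's paper review, 김채현 of the Fundamental team provided a detailed review.
Thank you in advance for your interest!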
https://youtu.be/XCED5bd2WT0
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
A comprehensive tutorial on Convolutional Neural Networks (CNNs) that covers the motivation behind CNNs and deep learning in general, followed by a description of the components of a typical CNN layer. It explains the theory behind the variants used in practice and gives a big picture of the whole network by putting everything together.
Next, there's a discussion of the various state-of-the-art frameworks being used to implement CNNs to tackle real-world classification and regression problems.
Finally, the implementation of CNNs is demonstrated by implementing the paper 'Age and Gender Classification Using Convolutional Neural Networks' by Hassner (2015).
I have implemented various gradient-descent-based optimizers (gradient descent, momentum, Adam, etc.) using only NumPy, without a deep learning framework such as TensorFlow.
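As a rough illustration of what such NumPy-only optimizers can look like, here is a minimal sketch. It is not the deck author's code: the toy objective f(w) = 0.5 * ||w||^2, the function names, and the hyperparameter defaults are all assumptions chosen for the example.

```python
import numpy as np

def grad(w):
    # Gradient of the assumed toy objective f(w) = 0.5 * ||w||^2, which is w itself.
    return w

def sgd(w, lr=0.1, steps=200):
    # Plain gradient descent: step against the gradient.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.1, beta=0.9, steps=200):
    # Momentum: accumulate an exponentially decaying sum of past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    # Adam: bias-corrected first and second moment estimates of the gradient.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, -2.0])
for opt in (sgd, momentum, adam):
    print(opt.__name__, np.linalg.norm(opt(w0.copy())))
```

All three move the parameters toward the minimizer at the origin; they differ only in how the raw gradient is turned into an update.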
Focal loss was proposed to address problems of one-stage networks (YOLO, SSD, etc.) that arise when extreme class imbalance and differences in difficulty coexist: for example, the fundamental problem that # of hard positives (objects) << # of easy negatives (background), or the need to detect large and small objects at the same time. I summarized experimental results that apply focal loss to classification problems with extremely imbalanced classes (for example, 1:10 or 1:100). In short, the experiments show that the results are very sensitive to hyperparameter settings and that, when used well, focal loss can give good results without data-level sampling methods or special classifier-level measures for handling class imbalance.
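For readers unfamiliar with focal loss, the sketch below shows the usual binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name, the default values gamma=2.0 and alpha=0.25, and the sample inputs are assumptions for illustration and are not taken from the experiments summarized above.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-example binary focal loss.

    p: predicted probability of the positive class, y: labels in {0, 1}.
    gamma and alpha are the hyperparameters the results are sensitive to.
    """
    p = np.clip(p, eps, 1 - eps)                  # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class weighting
    # (1 - p_t)^gamma down-weights easy, well-classified examples.
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.1])  # one easy positive, one hard positive
y = np.array([1, 1])
print(binary_focal_loss(p, y))
```

With gamma=0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check when tuning.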
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
This material is an in-depth study report on Recurrent Neural Networks (RNNs).
Material mainly from the Deep Learning book (the "Bible"), http://www.deeplearningbook.org/
Topics: Briefing, Theory Proof, Variation, Gated RNN Intuition, Real World Application
Application (CNN+RNN on SVHN)
Also a video (in Chinese):
https://www.youtube.com/watch?v=p6xzPqRd46w
In this presentation we discuss the convolution operation, the architecture of a convolutional neural network, and different layers such as pooling. This presentation draws heavily from A. Karpathy's Stanford course CS231n.
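To make the two operations named above concrete, here is a minimal NumPy sketch of a "valid" 2D convolution (implemented, as in most deep learning libraries, as cross-correlation without kernel flipping) and non-overlapping max pooling. The function names and the toy input are assumptions for illustration, not material from the deck.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image with stride 1 and no padding.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    # Non-overlapping max pooling; rows/columns that do not fit are dropped.
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
print(conv2d_valid(image, kernel).shape)  # (3, 3)
print(max_pool2d(image))
```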
Introduction to Deep Learning, Keras, and TensorFlow Sri Ambati
This meetup was recorded in San Francisco on Jan 9, 2019.
Video recording of the session can be viewed here: https://youtu.be/yG1UJEzpJ64
Description:
This fast-paced session starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next, we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful. If time permits, we'll look at the UAT, CLT, and the Fixed Point Theorem. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)
Oswald's Bio:
Oswald Campesato is an education junkie: a former Ph.D. Candidate in Mathematics (ABD), with multiple Master's and 2 Bachelor's degrees. In a previous career, he worked in South America, Italy, and the French Riviera, which enabled him to travel to 70 countries throughout the world.
He has worked in American and Japanese corporations and start-ups, as C/C++ and Java developer to CTO. He works in the web and mobile space, conducts training sessions in Android, Java, Angular 2, and ReactJS, and he writes graphics code for fun. He's comfortable in four languages and aspires to become proficient in Japanese, ideally sometime in the next two decades. He enjoys collaborating with people who share his passion for learning the latest cool stuff, and he's currently working on his 15th book, which is about Angular 2.
An overview of gradient descent optimization algorithms Hakky St
These slides, prepared for a study meeting, summarize a paper on gradient descent.
The slides are based on the following paper, which is a very good source of information about gradient descent.
https://arxiv.org/abs/1609.04747
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ... Simplilearn
This K-Means clustering algorithm presentation takes you through an introduction to machine learning, the types of clustering algorithms, k-means clustering, and how K-Means clustering works, and finally explains K-Means clustering with a real-life use case. This Machine Learning algorithm tutorial video is ideal for beginners who want to learn how K-Means clustering works.
Below topics are covered in this K-Means Clustering Algorithm presentation:
1. Types of Machine Learning?
2. What is K-Means Clustering?
3. Applications of K-Means Clustering
4. Common distance measure
5. How does K-Means Clustering work?
6. K-Means Clustering Algorithm
7. Demo: k-Means Clustering
8. Use case: Color compression
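The "How does K-Means Clustering work?" topic can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions (random initialization from the data points, Euclidean distance, a hypothetical `kmeans` helper), not the code used in the presentation's demo.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels = kmeans(X, k=2)
print(labels)
```

For the color compression use case, the same routine would be applied to an (n_pixels, 3) array of RGB values, replacing each pixel with its cluster center.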
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that, there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems.
- - - - - - -
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S... Simplilearn
This presentation on the Convolutional Neural Network (CNN) tutorial will help you understand what a convolutional neural network is, how a CNN recognizes images, and what the layers in a convolutional neural network are; at the end, you will see a use case implementation using a CNN. A CNN is a feed-forward neural network that is generally used to analyze visual images by processing data with a grid-like topology. A CNN is also known as a "ConvNet". Convolutional networks can also perform optical character recognition to digitize text and make natural-language processing possible on analog and hand-written documents. CNNs can also be applied to sound when it is represented visually as a spectrogram. Now, let's dive deep into this presentation to understand what a CNN is and how it actually works.
The topics below are explained in this CNN (Convolutional Neural Network) presentation:
1. Introduction to CNN
2. What is a convolutional neural network?
3. How CNN recognizes images?
4. Layers in convolutional neural network
5. Use case implementation using CNN
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data and prepare you for your new role as deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this TensorFlow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
And according to payscale.com, the median salary for engineers with deep learning skills tops $120,000 per year.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms. Those who complete the course will be able to:
Learn more at: https://www.simplilearn.com/
GANs are the hottest new topic in the ML arena; however, they present a challenge for researchers and engineers alike. Their design and, most importantly, their code implementation have been causing headaches for ML practitioners, especially when moving to production.
The talk starts from the very basics of what a GAN is, moves through a TensorFlow implementation using the most cutting-edge APIs available in the framework, and ends with production-ready serving at scale using Google Cloud ML Engine.
Slides for the talk: https://www.pycon.it/conference/talks/deep-diving-into-gans-form-theory-to-production
Github repo: https://github.com/zurutech/gans-from-theory-to-production
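InfoGAN: Interpretable Representation Learning by Information Maximizing Gene... 홍배 김
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Material prepared by Takato Horii, a doctoral student at Osaka University
Using GANs, which excel as generative models of data, to "simply" acquire, through unsupervised learning,
features that represent the information in an image in an "easy to understand" way
* The same principle as transforming from a physical space, where features are entangled, into an eigenspace, where they are mutually independent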
In this paper, a novel architecture for an RNS-based 1D Lifting Integer Wavelet Transform (IWT) is introduced. The advantages of the Residue Number System (RNS) based Lifting Scheme over the RNS-based Filter Bank and the non-binary IWT are discussed. The traditional predict and update stages of the binary Lifting Scheme (LS) for the Discrete Wavelet Transform (DWT) incur large carry propagation delay, power consumption and complexity. As a result, non-binary number systems are becoming popular in the field of Digital Signal Processing (DSP) due to their efficient performance. This paper also proposes a new fixed-number ROM-based RNS division circuit. The proposed architecture has been validated on a Xilinx Virtex-5 FPGA platform, and the corresponding results and reports are presented here.
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTION ijwmn
Multiple-input multiple-output (MIMO) systems play an important role in recent wireless
communication. The complexity of the different system models challenges researchers to strike a good
balance between complexity and performance. Lattice reduction techniques and the Lenstra-Lenstra-Lovász (LLL)
algorithm offer more to investigate and can contribute to complexity reduction.
In this paper, we modify the LLL algorithm to reduce the number of computation operations by
exploiting the structure of the upper triangular matrix without “big” performance degradation. Basically,
the first columns of the upper triangular matrix contain many zeroes, so the algorithm performs several
operations with very limited benefit. We present a performance and complexity study, and our
proposal shows that we can gain in terms of complexity while the performance results remain almost the
same.
Slides for the paper titled "Structured pruning of LSTMs via Eigenanalysis and Geometric Median for Mobile Multimedia and Deep Learning Applications", by N. Gkalelis and V. Mezaris, presented at the 22nd IEEE Int. Symposium on Multimedia (ISM), Dec. 2020.
Transformer Mods for Document Length Inputs Sujit Pal
The Transformer architecture is responsible for many state-of-the-art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost, an O(n^2) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n^2) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document level operations using one of these approaches.
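To see where the O(n^2) cost comes from, the following minimal NumPy sketch computes single-head scaled dot-product self-attention. It is an illustration under assumptions (one head, random projection matrices, hypothetical names), not code from the presentation or from the HuggingFace library.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X has shape (n, d_model): one row per input token.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # The score matrix has one entry per pair of tokens, hence shape (n, n).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)  # (6, 6): doubling n quadruples this matrix
```

The (n, n) weight matrix is the term that grows quadratically in both time and memory, which is what the modified attention mechanisms try to avoid materializing in full.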
A STUDY OF METHODS FOR TRAINING WITH DIFFERENT DATASETS IN IMAGE CLASSIFICATION ADEIJ Journal
This research developed a training method of Convolutional Neural Network model with multiple datasets to achieve good performance on both datasets. Two different methods of training with two characteristically different datasets with identical categories, one with very clean images and one with real-world data, were proposed and studied. The model used for the study was a neural network derived from ResNet. Mixed training was shown to produce the best accuracies for each dataset when the dataset is mixed into the training set at the highest proportion, and the best combined performance when the real-world dataset was mixed in at a ratio of around 70%. This ratio produced a top-1 combined performance of 63.8% (no mixing produced 30.8%) and a top-3 combined performance of 83.0% (no mixing produced 55.3%). This research also showed that iterative training has a worse combined performance than mixed training due to the issue of fast forgetting.
Recognition of handwritten digits using rbf neural network eSAT Journals
Abstract: Pattern recognition is required in many fields for different purposes. Methods based on Radial Basis Function (RBF) neural networks have been found to be very successful in pattern classification problems. Training a neural network is in general a challenging nonlinear optimization problem. Several algorithms have been proposed for choosing the RBF neural network prototypes and training the network. In this paper, an RBF neural network trained with the decoupled Kalman filter method is proposed for handwritten digit recognition applications. The efficacy of the proposed method is tested on handwritten digits of different fonts, and it is found to recognize the digits successfully. Keywords: Neural network, RBF neural network, Decoupled Kalman filter training, Zoning method
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Area, Delay and Power Comparison of Adder TopologiesVLSICS Design
Adders form an almost obligatory component of every contemporary integrated circuit. The prerequisite of the adder is that it is primarily fast and secondarily efficient in terms of power consumption and chip area. This paper presents the pertinent choice for selecting the adder topology with the tradeoff between delay, power consumption and area. The adder topology used in this work are ripple carry adder, carry lookahead adder, carry skip adder, carry select adder, carry increment adder, carry save adder and carry bypass adder. The module functionality and performance issues like area, power dissipation and propagation delay are analyzed at 0.12µm 6metal layer CMOS technology using microwind tool.
PERFORMANCE AND COMPLEXITY ANALYSIS OF A REDUCED ITERATIONS LLL ALGORITHMIJCNCJournal
Multiple-input multiple-output (MIMO) systems are playing an increasing and interesting role in the recent
wireless communication. The complexity and the performance of the systems are driving the different
studies and researches. Lattices Reduction techniques bring more resources to investigate the complexity
and performances of such systems.
In this paper, we look to modify a fixed complexity verity of the LLL algorithm to reduce the computation
operations by reducing the number of iterations without important performance degradation. Our proposal
shows that we can achieve a good performance results while avoiding extra iteration that doesn’t bring
much performance.
Implementation of an arithmetic logic using area efficient carry lookahead adderVLSICS Design
An arithmetic logic unit acts as the basic building blocks or cell of a central processing unit of a computer.
And it is a digital circuit comprised of the basic electronics components, which is used to perform various
function of arithmetic and logic and integral operations further the purpose of this work is to propose the
design of an 8-bit ALU which supports 4-bit multiplication. Thus, the functionalities of the ALU in this
study consist of following main functions like addition also subtraction, increment, decrement, AND, OR,
NOT, XOR, NOR also two complement generation Multiplication. And the functions with the adder in the
airthemetic logic unit are implemented using a Carry Look Ahead adder joined by a ripple carry approach.
The design of the following multiplier is achieved using the Booths Algorithm therefore the proposed ALU
can be designed by using verilog or VHDL and can also be designed on Cadence Virtuoso platform.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
How world-class product teams are winning in the AI era by CEO and Founder, P...
Recent Progress in RNN and NLP
1. Recent Progress
in RNN and NLP
Tohoku University
Inui and Okazaki Lab.
Sosuke Kobayashi
⼩林 颯介
2. • Revised presentation slides from
2016/6/22 NLP-DL MTG@Preferred Networks and
2016/6/30 Inui and Okazaki Lab. Talk
• Overview of basic progress in RNN from late 2014
• Attention is not included.
cf. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention
• Papers available only on arXiv (not formally published) are marked with ” ”
• Reference:
https://docs.google.com/document/d/1nmkidNi_MsRPbB65kHsmyMfGqmaQ0r5dW518J8k_aeI/edit?usp=sharing
( https://goo.gl/kE6GCM )
Note
3. • Basic RNN
• RNN’s Unit
• Benchmarking various RNNs
• Connections in RNNs
• RNN and Tree
• Regularizations and Tricks for RNN’s Learning
• Decoding
Agenda
4. • Benchmarking RNN’s various units
• Variants of LSTM or GRU
• Examinations of gates in LSTM
• Initialization trick of LSTM
• High performance by simple units
• Visualization and analysis
1. Unit and Benchmark
5. LSTM and GRU
• LSTM [Hochreiter&Schmidhuber97] • GRU [Cho+14]
(Biases are omitted.)
6. • Search for better unit structures, starting from LSTM and GRU, by
mutating computation graphs
• Arith.: Calculation with noise tokens
• XML: Character-based prediction of XML tags
• PTB: Language modeling
Discovered Units [Jozefowicz+15]
7. • Better units are similar to GRU
• Due to bias from the
search algorithm?
• MUT1: Update gate is
controlled by only x (not h).
It looks reasonable for Arith.
(Note: please shift h_t to h_(t-1).)
[Jozefowicz+15]
GRU:
Discovered Units
8. • What happens if LSTM’s input, forget, or output gate is removed?
• What if LSTM’s forget gate bias is initialized to +1?
≒ Initially keeping about 73% of the cell’s values (sigmoid(1) ≈ 0.73)
• Initializing the forget gate with a positive bias is good
([Gers+2000] also said so.)
• Dropout improves LSTM, but not GRU, in language
modeling
• Gate importance: f >> i > o.
Examination of LSTM [Jozefowicz+15]
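The 73% figure follows from the logistic function: with a forget-gate bias of +1 and weights that contribute roughly zero at initialization (an assumption of this sketch), the gate value is sigmoid(1). The helper name `retained_after` is hypothetical and only illustrates how quickly the cell state decays under different biases.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function used by LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))

def retained_after(bias: float, steps: int) -> float:
    """Fraction of the initial cell value left after `steps` steps,
    assuming the forget gate stays at sigmoid(bias) (illustrative only)."""
    return sigmoid(bias) ** steps

for bias in (0.0, 1.0, 2.0):
    print(f"bias {bias:+.1f}: gate = {sigmoid(bias):.3f}, "
          f"kept after 10 steps = {retained_after(bias, 10):.4f}")
```

With a zero bias the gate starts at 0.5, so the memory halves at every step; a positive bias slows this decay from the first update.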
9. • “LSTM: A Search Space Odyssey.” Cool title.
• Thorough examination of LSTM components
• Gates, peephole, tanh before output,
forget gate = 1 – input gate (like GRU)
• Full gate recurrence: gates are also controlled by
the gates’ values at the previous step
• Peephole is not important,
forget gate is important,
f=1-i is good and reduces the number of parameters
• Recommendation: use a common (vanilla) LSTM
[Greff+15]
Examination of LSTM
10. • Structurally Constrained Recurrent Network (SCRN)
[Mikolov+15]
• RNN with a simple context cell
computed by a weighted sum
• IRNN [Le+15]
• Simple RNN with the
recurrent matrix initialized to the identity matrix
and ReLU instead of tanh
• Effects of diagonal and orthogonal matrices in RNN
[Henaff+16]
Other Devised Units
Q is diagonal
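A minimal sketch of the IRNN idea in plain Python: a simple RNN step with ReLU whose recurrent matrix starts as the identity. The sizes, the zeroed input weights (used here only to isolate the recurrent path), and all function names are assumptions for illustration, not the authors' code.

```python
def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def irnn_step(W_hh, W_xh, b, h_prev, x):
    """h_t = ReLU(W_hh h_{t-1} + W_xh x_t + b)."""
    pre = [r + i + c for r, i, c in zip(matvec(W_hh, h_prev), matvec(W_xh, x), b)]
    return [max(0.0, p) for p in pre]

hidden, inputs = 3, 2
W_hh = identity(hidden)                          # identity initialization
b = [0.0] * hidden                               # zero bias
W_xh = [[0.0] * inputs for _ in range(hidden)]   # zeroed only for this demo

h = [0.5, 2.0, 0.0]
for _ in range(100):
    h = irnn_step(W_hh, W_xh, b, h, [1.0, -1.0])
print(h)  # a non-negative state is carried forward unchanged
```

At initialization the recurrent path copies a non-negative hidden state from step to step, which is why such a unit can hold information over long spans before training changes the weights.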
11. • Minimal Gated Unit; MGU
[Zhou+, 16]
Other GRU-like Units
• Simple Gated Unit; SGU
[Gao+, 16]
• Deep SGU; DSGU
[Gao+, 16]
12. • Multiplicative Integration [Wu+16]
• Improvement by combining multiplication with addition in
RNN, changing
$\phi(Wx + Uz + b)$
into
$\phi(Wx \odot Uz + b)$.
• Similarly in LSTM and GRU
• Improves performance on many tasks
• Will this become common in the near future ...?
Multiplicative Integration
[Excerpt from [Wu+16]]
Most recent RNN designs, derived from vanilla RNNs, LSTMs and GRUs, share a common computational building block, referred to as the additive building block:
$\phi(Wx + Uz + b)$ (1)
where $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$ are state vectors coming from different information sources, $W \in \mathbb{R}^{d \times n}$ and $U$ are state-to-state transition matrices, and $b$ is a bias vector.
Multiplicative Integration instead uses the Hadamard product "$\odot$" to fuse $Wx$ and $Uz$:
$\phi(Wx \odot Uz + b)$ (2)
A more general formulation adds bias vectors $\beta_1$ and $\beta_2$ to $Wx$ and $Uz$:
$\phi((Wx + \beta_1) \odot (Uz + \beta_2) + b)$
where $\beta_1, \beta_2 \in \mathbb{R}^d$. This formulation contains the first order terms of the additive building block, i.e., $\beta_1 \odot Uh_{t-1} + \beta_2 \odot Wx_t$. To make it more flexible, another bias vector $\alpha \in \mathbb{R}^d$ gates the term $Wx \odot Uz$:
$\phi(\alpha \odot Wx \odot Uz + \beta_1 \odot Uz + \beta_2 \odot Wx + b)$
The number of parameters is about the same as for the additive building block, since the new parameters ($\alpha$, $\beta_1$ and $\beta_2$) are negligible compared to the total number of parameters.
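The additive and multiplicative building blocks can be compared side by side in a few lines. This is a sketch under stated assumptions: the nonlinearity is taken to be tanh, the matrices and vectors are made-up toy values, and the function names `additive_block` and `mi_block` are hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_block(W, U, b, x, z):
    """phi(Wx + Uz + b) with phi = tanh (assumption)."""
    return [math.tanh(p + q + c) for p, q, c in zip(matvec(W, x), matvec(U, z), b)]

def mi_block(W, U, b, x, z, alpha, beta1, beta2):
    """phi(alpha*Wx*Uz + beta1*Uz + beta2*Wx + b), all products elementwise."""
    Wx, Uz = matvec(W, x), matvec(U, z)
    return [math.tanh(a * p * q + b1 * q + b2 * p + c)
            for a, b1, b2, p, q, c in zip(alpha, beta1, beta2, Wx, Uz, b)]

W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]
x, z = [1.0, 2.0], [0.5, -1.0]
ones, zeros = [1.0, 1.0], [0.0, 0.0]

# alpha = 1, beta1 = beta2 = 0 gives the pure Hadamard form phi(Wx * Uz + b)
plain_mi = mi_block(W, U, b, x, z, alpha=ones, beta1=zeros, beta2=zeros)
# alpha = 0, beta1 = beta2 = 1 falls back to the additive block
as_additive = mi_block(W, U, b, x, z, alpha=zeros, beta1=ones, beta2=ones)
print(plain_mi, as_additive, additive_block(W, U, b, x, z))
```

The general form only adds three vectors of size d, so the additive block is a special case reachable by setting alpha to zero.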
13. • Visualization of a character-based language model
• One cell acquired the function of tracking where quotes (apostrophes) open and
close
• But many other cells are not interpretable
Visualization
Figure 2: Several examples of cells with interpretable activa[tions]
Color shows the value of tanh(cell):
Red = -1 <---> Blue = +1
[Karpathy+15]
14. Word Ablation [Kádár+16]
• Analyzing a GRU’s output with an omission score
when encoding an image caption
• The model predicting the
image’s vector
(CNN output)
focuses on nouns
• The language model
focuses more evenly
$\mathrm{omission}(i, S) = 1 - \mathrm{cosine}(h_{end}(S), h_{end}(S_{\setminus i}))$
(12)
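The omission score can be sketched as follows: encode the full sentence, encode it again with word i removed, and take one minus the cosine similarity of the two final states. The encoder below is a hypothetical stand-in (a sum of made-up word vectors), not the GRU from the paper; only `omission_scores` reflects the formula.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def omission_scores(words, encode):
    """omission(i, S) = 1 - cosine(encode(S), encode(S without word i))."""
    full = encode(words)
    return [1.0 - cosine(full, encode(words[:i] + words[i + 1:]))
            for i in range(len(words))]

# Hypothetical toy vectors standing in for a trained encoder's final state.
TOY_VECTORS = {"a": [0.1, 0.0], "pizza": [0.0, 2.0], "on": [0.1, 0.1],
               "the": [0.1, 0.0], "table": [1.0, 0.0]}

def toy_encode(words):
    return [sum(TOY_VECTORS[w][k] for w in words) for k in range(2)]

caption = ["a", "pizza", "on", "the", "table"]
scores = omission_scores(caption, toy_encode)
top_word = caption[scores.index(max(scores))]
print(top_word)
```

Any function mapping a word list to a vector can be passed as `encode`, so the same loop applies to a real recurrent encoder.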
15. • Removing the word ‘pizza’ from the caption removes just the pizza from the
retrieved image (images are searched from the dataset)
Word Ablation [Kádár+16]
16. • Mean omission scores over the dataset, grouped by POS tag:
the model for images focuses on
NN > JJ > VB, CD > ...
Word Ablation [Kádár+16]
17. • Connections of RNNs
• Tree structure and RNN (LSTM)
• Tree-based Composition by Shift-reduce
2. Connections and Trees
18. • Clockwork RNN. Combination of different RNNs that
process steps at different intervals [Koutník+14, Liu+15,
(Chung+16)]
• Gated Feedback RNN. Feeds outputs back into lower layers
through gates [Chung+15]
• Depth-Gated LSTM, Highway LSTM: Cells are connected
to the upper layer’s cell through a gate [Yao+15, Chen+15]
• The k-th layer’s input comes from the (k-1)-th layer’s input and output
[Zhou+16]
• Hierarchical RNN
[Serban+2015]
Connections in Multi-RNNs
19. [Figures from “Pixel Recurrent Neural Networks”]
Figure 2. Left: To generate pixel x_i one conditions on all the previously generated pixels left and above of x_i. Center: Illustration of a Row LSTM with a kernel of size 3. The dependency field of the Row LSTM does not reach pixels further away on the sides of the image. Right: Illustration of the two directions of the Diagonal BiLSTM. The dependency field of the Diagonal BiLSTM covers the entire available context in the image.
Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 × 1.
• Grid LSTM. Each axis has its own LSTM, for multi-dimensional
applications [Kalchbrenner+15]
• RNN for DAG, (image) pixel
[Shuai+15, Zhu+16, Oord+16]
• Structural complexity of RNN models [Zhang+16]
[Figure from the Grid LSTM paper (under review as a conference paper at ICLR 2016); panels: Standard LSTM block, 1d Grid LSTM Block, 2d Grid LSTM block, 3d Grid LSTM Block]
Figure 1: Blocks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 and 3 dimensions. The dashed lines indicate identity transformations. The standard LSTM block does not have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has the memory vector m1 applied along the vertical dimension.
[...] Grid LSTM with two dimensions is analogous to the Stacked LSTM, but it adds cells along the depth dimension [...]
Connections in Multi-RNNs
20. • Tree-LSTM [Tai+15] Applies LSTM along the directed edges
(child to parent) of a tree structure. The most cited “Tree-LSTM”
• S-LSTM [Zhu+15] Adds peephole and removes input x
• LSTM-RecursiveNN [Le+15] Controls forget and input
gates with untied matrices for each cell and output (h).
Input gate is applied before tanh.
• Top-down TreeLSTM [Zhang+16]
Sentence generation from the
root of a dependency tree
Tree-LSTM
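As an illustration of applying an LSTM along child-to-parent edges, here is a dependency-free sketch of the Child-Sum Tree-LSTM variant from [Tai+15]: the input, output and update gates see the sum of the children's hidden states, while each child gets its own forget gate. Parameter shapes, the random initialization, and the `(x, children)` tuple encoding of a tree are assumptions made for this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vs):
    return [sum(t) for t in zip(*vs)]

def mul(u, v):
    return [a * b for a, b in zip(u, v)]

def init_params(d_in, d_hid, seed=0):
    """One (W, U, b) triple per gate: i, f, o and update u."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    return {g: (rand(d_hid, d_in), rand(d_hid, d_hid), [0.0] * d_hid) for g in "ifou"}

def gate(params, name, x, h, act):
    W, U, b = params[name]
    return [act(v) for v in add(matvec(W, x), matvec(U, h), b)]

def tree_lstm_node(params, x, children):
    """children: list of (h, c) pairs from child nodes; returns this node's (h, c)."""
    d = len(params["i"][2])
    h_sum = add(*[h for h, _ in children]) if children else [0.0] * d
    i = gate(params, "i", x, h_sum, sigmoid)
    o = gate(params, "o", x, h_sum, sigmoid)
    u = gate(params, "u", x, h_sum, math.tanh)
    c = mul(i, u)
    for h_k, c_k in children:
        f_k = gate(params, "f", x, h_k, sigmoid)  # one forget gate per child
        c = add(c, mul(f_k, c_k))
    h = mul(o, [math.tanh(v) for v in c])
    return h, c

def encode(params, tree):
    """tree = (input_vector, [subtrees]); children are encoded before parents."""
    x, kids = tree
    return tree_lstm_node(params, x, [encode(params, k) for k in kids])

params = init_params(d_in=2, d_hid=3)
leaf_a, leaf_b = ([0.0, 1.0], []), ([0.5, 0.5], [])
h_root, c_root = encode(params, ([1.0, 0.0], [leaf_a, leaf_b]))
print(h_root)
```

Because the children's states are summed, this variant does not depend on child order, which suits dependency trees with a variable number of dependents.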
[Excerpt from the first page of [Tai+15]: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1556–1566, Beijing, China, July 26-31, 2015.]
[...] networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).
1 Introduction
Most models for distributed representations of phrases and sentences—that is, models where real-valued vectors are used to represent meaning—fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).
Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.
Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., “cats climb trees” vs. “trees climb cats”). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.
Due to their capability for processing arbitrary-length sequences, recurrent neural networks [...]
[Figure from the Top-down TreeLSTM paper [Zhang+16]] Figure 4: Generation of left and right dependents of node w0 according to LDTREELSTM.
21. [Excerpt from [Eriguchi+16]]
Figure 2: Attentional Encoder-Decoder model.
[...] $d_j$ is calculated as the summation vector weighted by $\alpha_j(i)$:
$d_j = \sum_{i=1}^{n} \alpha_j(i) h_i$. (6)
To incorporate the attention mechanism into the decoding process, the context vector is used for the j-th word prediction by putting an additional hidden layer $\tilde{s}_j$:
$\tilde{s}_j = \tanh(W[s_j; d_j] + b)$, (7)
Figure 3: Proposed model: Tree-to-sequence attentional NMT model.
[...] We propose a novel tree-based encoder in order to explicitly take the syntactic structure into consideration in the NMT model. We focus on the phrase structure of a sentence and construct a sentence vector from phrase vectors in a bottom-up fashion. [...]
• Tree-based and Sequential Encoder for Attention.
[Eriguchi+16]
• Tree-LSTM composition
whose leaf nodes are outputs
of a seq-LSTM
• “The cutest approach!”,
Kyunghyun Cho said at SedMT, NAACL16.
• The underlying seq-LSTM makes nodes
more context-aware and less ambiguous.
+[Bowman+16]
Combination of Tree and Seq
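The attention step used on top of such an encoder, a weighted sum of encoder states followed by a tanh layer over the concatenated decoder state and context vector, can be sketched in plain Python. The dot-product scoring used to obtain the weights, the toy numbers, and the function names are assumptions; the slide does not show how the weights are computed. The list of encoder states may hold sequence states, tree node states, or both.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alphas, encoder_states):
    """d_j = sum_i alpha_j(i) * h_i."""
    dim = len(encoder_states[0])
    return [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]

def attentional_hidden(W, b, s_j, d_j):
    """s~_j = tanh(W [s_j; d_j] + b), where [;] is concatenation."""
    concat = s_j + d_j
    return [math.tanh(sum(w * v for w, v in zip(row, concat)) + bias)
            for row, bias in zip(W, b)]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy h_i
s_j = [0.5, -0.5]                                       # toy decoder state
# Dot-product scoring is an assumption made for this sketch.
scores = [sum(a * b for a, b in zip(s_j, h)) for h in encoder_states]
alphas = softmax(scores)
d_j = context_vector(alphas, encoder_states)
W = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
b = [0.0, 0.0]
s_tilde = attentional_hidden(W, b, s_j, d_j)
print(alphas, d_j, s_tilde)
```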
22. [Figure from [Dyer+16]] Figure 5: Neural architecture for defining a distribution over a_t given representations of the stack (S_t), output buffer (T_t) and history of actions (a_<t). Details of the composition architecture of the NP, the action history LSTM, and the other elements of the stack are not shown. This architecture corresponds to the generator state at line 7 of Figure 4.
[...] of the forward and reverse LSTMs are concatenated, passed through an affine transformation and a tanh nonlinearity to become the subtree embedding. Because each of the child node embeddings (u, v, w in Fig. 6) is computed similarly (if it corresponds to an internal node), this composition function is a kind of recursive neural network.
4.4 Discriminative Parsing Model
A discriminative parsing model can be obtained by replacing the embedding of T_t at each time step with an embedding of the input buffer B_t. To train this model, the conditional likelihood of each sequence of actions given the input string is maximized.
• Generation by a sequence of actions from
{GEN(word), REDUCE, NT(non-terminal symbol)}.
Features for action decisions are LSTM outputs over (1)
terminals, (2) stack, and (3) action history.
Recurrent Neural Network Grammars
ber of output nodes in a parse tree as a function of
the number of input words, stating the runtime com-
plexity of the parsing algorithm as a function of the
input size requires further assumptions. Assuming
our fixed constraint on maximum depth, it is linear.
3.5 Comparison to Other Models
Our generation algorithm algorithm differs from
previous stack-based parsing/generation algorithms
in two ways. First, it constructs rooted tree struc-
tures top down (rather than bottom up), and sec-
ond, the transition operators are capable of directly
generating arbitrary tree structures rather than, e.g.,
assuming binarized trees, as is the case in much
prior work that has used transition-based algorithms
to produce phrase-structure trees (Sagae and Lavie,
2005; Zhang and Clark, 2011; Zhu et al., 2013).
4 Generative Model
RNNGs use the generator transition set just pre-
sented to define a joint distribution on syntax trees
(y) and words (x). This distribution is defined as a
sequence model over generator transitions that is pa-
rameterized using a continuous space embedding of
the algorithm state at each time step (ut); i.e.,
p(x, y) =
|a(x,y)|
Y
t=1
p(at | a<t)
=
|a(x,y)|
Y
t=1
exp r>
at
ut + bat
P
a02AG(Tt,St,nt) exp r>
a0 ut + ba0
,
resentations of them we use recurrent neural net-
works to “encode” their contents (Cho et al., 2014).
Since the output buffer and history of actions are
only appended to and only contain symbols from a
finite alphabet, it is straightforward to apply a stan-
dard RNN encoding architecture. The stack (S) is
more complicated for two reasons. First, the ele-
ments of the stack are more complicated objects than
symbols from a discrete alphabet: open nontermi-
nals, terminals, and full trees, are all present on the
stack. Second, it is manipulated using both push and
pop operations. To efficiently obtain representations
of S under push and pop operations, we use stack
LSTMs (Dyer et al., 2015).
4.1 Syntactic Composition Function
When a REDUCE operation is executed, the parser
pops a sequence of completed subtrees and/or to-
kens (together with their vector embeddings) from
the stack and makes them children of the most recent
open nonterminal on the stack, “completing” the
constituent. To compute an embedding of this new
subtree, we use a composition function based on
bidirectional LSTMs, which is illustrated in Fig. 6.
Figure 6: Syntactic composition function based on bidirectional LSTMs (the label NP is read together with the children u, v, w to produce the composed vector x).
[Dyer+16]
Input: The hungry cat meows .
Step  Stack                                      Buffer                          Action
0                                                The | hungry | cat | meows | .  NT(S)
1     (S                                         The | hungry | cat | meows | .  NT(NP)
2     (S | (NP                                   The | hungry | cat | meows | .  SHIFT
3     (S | (NP | The                             hungry | cat | meows | .        SHIFT
4     (S | (NP | The | hungry                    cat | meows | .                 SHIFT
5     (S | (NP | The | hungry | cat              meows | .                       REDUCE
6     (S | (NP The hungry cat)                   meows | .                       NT(VP)
7     (S | (NP The hungry cat) | (VP             meows | .                       SHIFT
8     (S | (NP The hungry cat) | (VP meows       .                               REDUCE
9     (S | (NP The hungry cat) | (VP meows)      .                               SHIFT
10    (S | (NP The hungry cat) | (VP meows) | .                                  REDUCE
11    (S (NP The hungry cat) (VP meows) .)
Figure 2: Top-down parsing example.
Stack_t                  Terms_t  Open NTs_t  Action  Stack_{t+1}         Terms_{t+1}  Open NTs_{t+1}
S                        T        n           NT(X)   S | (X              T            n + 1
S                        T        n           GEN(x)  S | x               T | x        n
S | (X | τ1 | ... | τℓ   T        n           REDUCE  S | (X τ1 ... τℓ)   T            n − 1
Figure 3: Generator transitions. Symbols defined as in Fig. 1 with the addition of T representing the history of generated terminals.
Step  Stack                                      Terminals                       Action
0                                                                                NT(S)
1     (S                                                                         NT(NP)
2     (S | (NP                                                                   GEN(The)
3     (S | (NP | The                             The                             GEN(hungry)
4     (S | (NP | The | hungry                    The | hungry                    GEN(cat)
5     (S | (NP | The | hungry | cat              The | hungry | cat              REDUCE
6     (S | (NP The hungry cat)                   The | hungry | cat              NT(VP)
7     (S | (NP The hungry cat) | (VP             The | hungry | cat              GEN(meows)
8     (S | (NP The hungry cat) | (VP meows       The | hungry | cat | meows      REDUCE
9     (S | (NP The hungry cat) | (VP meows)      The | hungry | cat | meows      GEN(.)
10    (S | (NP The hungry cat) | (VP meows) | .  The | hungry | cat | meows | .  REDUCE
11    (S (NP The hungry cat) (VP meows) .)       The | hungry | cat | meows | .
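The generator transitions traced above can be replayed with a small stack machine. This is only a sketch: the function name `run_generator` and the `("open", ...)` / `("done", ...)` tags are hypothetical, and REDUCE builds a bracketed string where the actual model would compute a bi-LSTM embedding of the new constituent.

```python
def run_generator(actions):
    """Replay NT / GEN / REDUCE actions and return (stack, terminals)."""
    stack, terminals = [], []
    for name, arg in actions:
        if name == "NT":
            # Push an open non-terminal such as "(S".
            stack.append(("open", arg))
        elif name == "GEN":
            # Generate a word: push it and record it in the terminal history.
            stack.append(("done", arg))
            terminals.append(arg)
        elif name == "REDUCE":
            # Pop completed items back to the most recent open non-terminal.
            children = []
            while stack[-1][0] != "open":
                children.append(stack.pop()[1])
            label = stack.pop()[1]
            subtree = "(" + " ".join([label] + children[::-1]) + ")"
            stack.append(("done", subtree))
        else:
            raise ValueError(f"unknown action: {name}")
    return stack, terminals


actions = [
    ("NT", "S"), ("NT", "NP"),
    ("GEN", "The"), ("GEN", "hungry"), ("GEN", "cat"), ("REDUCE", None),
    ("NT", "VP"), ("GEN", "meows"), ("REDUCE", None),
    ("GEN", "."), ("REDUCE", None),
]
stack, terminals = run_generator(actions)
print(stack[0][1])
```

After the final REDUCE the stack holds a single completed tree and the terminal history holds the generated sentence.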
• The REDUCE action composes a new chunk vector
with a bi-LSTM and pushes it back onto the stack
(input sequence: “NP→the→hungry→cat”).
Tohoku University, Inui and Okazaki Lab.
Sosuke Kobayashi
23. • Joint learning of shift-reduce parsing and sentence-level
classification, using shift-reduce-based tree-LSTM
composition. On REDUCE, the top two chunks on the stack are
composed by a tree-LSTM.
• Speedy tree composition (like a recurrent NN).
SPINN [Bowman+16]
[Figure (a): The SPINN model unrolled for two transitions during the processing of the sentence “the cat sat down”. ‘Tracking’, ‘transition’, and ‘composition’ are neural network layers. Gray arrows indicate connections which are blocked by a gating function.]
[Figure (b): The fully unrolled SPINN for “the cat sat down”, with neural network layers omitted for clarity. Over t = 0, ..., 7 = T the words move from the buffer to the stack and are composed into (the cat) (sat down), which is output to the model for the semantic task.]
Stack-augmented Parser-Interpreter
Neural Network
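The buffer/stack behaviour of SPINN on “the cat sat down” can be sketched as a plain shift-reduce loop. The names `run_shift_reduce` and `compose` are hypothetical, and `compose` simply pairs its arguments where the model would apply a tree-LSTM to two vectors; the tracking and transition layers are left out.

```python
def run_shift_reduce(tokens, transitions, compose):
    """Apply SHIFT/REDUCE transitions; REDUCE merges the top two stack items."""
    buffer = list(reversed(tokens))  # pop() yields the next unread word
    stack = []
    for op in transitions:
        if op == "SHIFT":
            stack.append(buffer.pop())
        elif op == "REDUCE":
            right = stack.pop()
            left = stack.pop()
            stack.append(compose(left, right))
        else:
            raise ValueError(f"unknown transition: {op}")
    return stack


def compose(left, right):
    # Stand-in for the tree-LSTM composition: just record the structure.
    return (left, right)


tokens = ["the", "cat", "sat", "down"]
transitions = ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
result = run_shift_reduce(tokens, transitions, compose)
print(result)
```

The seven transitions correspond to t = 1, ..., 7 in the unrolled figure; a sentence of N words always needs 2N - 1 transitions to reach a single binary tree.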
24. • Repeated attention with an LSTM
encodes the input vectors as a set
[Vinyals+15]
• (End-to-end) Memory Networks [Sukhbaatar+15]
Sentence encoding by a weighted sum (position encoding).
Earlier words' vectors have larger weights at smaller dimensions.
Later words' vectors have larger weights at larger dimensions.
• e.g., when the sentence length J is 10 and the vector dimension d is 20, the weight l_kj = (1 - j/J) - (k/d)(1 - 2j/J) of word j at dimension k is:
1st word at 1st dim: (1-1/10)-(1/20)(1-2*1/10) = 0.86
1st word at 20th dim: (1-1/10)-(20/20)(1-2*1/10) = 0.1
10th word at 1st dim: (1-10/10)-(1/20)(1-2*10/10) = 0.05
10th word at 20th dim: (1-10/10)-(20/20)(1-2*10/10) = 1.0
(Encoders without C/RNN)
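The four worked values on this slide can be checked with a short NumPy sketch of the position-encoding weights. The function name `position_encoding` and its argument names are hypothetical; word positions and dimensions are both 1-based, as in the quoted formula. The expression is symmetric in the two ratios, so the numbers come out the same whichever ratio is taken to index the word.

```python
import numpy as np


def position_encoding(num_words, dim):
    """Weights (1 - j/J) - (k/d)(1 - 2j/J) for word j (1..J) and dimension k (1..d)."""
    j = np.arange(1, num_words + 1)[None, :] / num_words  # shape (1, J)
    k = np.arange(1, dim + 1)[:, None] / dim              # shape (d, 1)
    return (1 - j) - k * (1 - 2 * j)                      # shape (d, J)


pe = position_encoding(num_words=10, dim=20)
# Rows are dimensions, columns are word positions (0-based array indices).
print(pe[0, 0], pe[19, 0], pe[0, 9], pe[19, 9])
```

A sentence vector would then be the sum over words of each word embedding multiplied element-wise by its column of `pe`.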
tor with the structure $l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)$ (assuming 1-based indexing),
ng the number of words in the sentence, and d is the dimension of the embedding. This
presentation, which we call position encoding (PE), means that the order of the words
mi. The same representation is used for questions, memory inputs and memory outputs.
Encoding: Many of the QA tasks require some notion of temporal context, i.e. in
ample of Section 2, the model needs to understand that Sam is in the bedroom after
simplistic nature of the QA language). The same representation is used for the
d answer a. Two versions of the data are used, one that has 1000 training problems
second larger one with 10,000 per task.
Details
wise stated, all experiments used a K = 3 hops model with the adjacent weight sharing
all tasks that output lists (i.e. the answers are multiple words), we take each possible
of possible outputs and record them as a separate answer vocabulary word.
presentation: In our experiments we explore two different representations for
s. The first is the bag-of-words (BoW) representation that takes the sentence
2, ..., xin}, embeds each word and sums the resulting vectors: e.g. $m_i = \sum_j A x_{ij}$ and
j. The input vector u representing the question is also embedded as a bag of words:
. This has the drawback that it cannot capture the order of the words in the sentence,
ortant for some tasks.
propose a second representation that encodes the position of words within the
s takes the form: $m_i = \sum_j l_j \cdot A x_{ij}$, where · is an element-wise multiplication. $l_j$ is a
4.2 ATTENTION MECHANISMS
Neural models with memories coupled to differentiable addressing mechanism have been success-
fully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bah-
danau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al.,
2015). Since we are interested in associative memories we employed a “content” based attention.
This has the property that the vector retrieved from our memory would not change if we randomly
shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular,
our process block based on an attention mechanism uses the following:
$$q_t = \mathrm{LSTM}(q^*_{t-1}) \qquad (3)$$
$$e_{i,t} = f(m_i, q_t) \qquad (4)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})} \qquad (5)$$
$$r_t = \sum_i a_{i,t} m_i \qquad (6)$$
$$q^*_t = [q_t \; r_t] \qquad (7)$$
Figure 1: The Read-Process-and-Write model.
where i indexes through each memory vector $m_i$ (typically equal to the cardinality of X), $q_t$ is a query vector which allows us to read $r_t$ from the memories, f is a function that computes a single scalar from $m_i$ and $q_t$ (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. $q^*_t$ is the state which this LSTM evolves, and is formed by concatenating the query $q_t$ with the resulting attention readout $r_t$. t is the index which indicates how many “processing steps” are being carried to compute the state to be fed to the decoder. Note that permuting $m_i$ and $m_{i'}$ has no effect on the read vector $r_t$.
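The permutation-invariance claim for the read vector can be checked numerically. The sketch below covers only the content-based read of eqs. (4)-(6), with a dot product assumed for f; the LSTM that produces the query is omitted, and the name `attention_read` is hypothetical.

```python
import numpy as np


def attention_read(memories, query):
    """Content-based read: softmax over dot-product scores, then a weighted sum."""
    scores = memories @ query                 # e_i = f(m_i, q), here a dot product
    scores = scores - scores.max()            # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ memories                 # r = sum_i a_i * m_i


rng = np.random.default_rng(0)
memories = rng.normal(size=(5, 4))            # five memory vectors of size 4
query = rng.normal(size=4)

read = attention_read(memories, query)
shuffled = memories[rng.permutation(5)]       # same set, different order
read_shuffled = attention_read(shuffled, query)
print(np.allclose(read, read_shuffled))
```

Because each weight depends only on the content of its own memory vector and the query, reordering the memories reorders the terms of the sum without changing it.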
4.3 READ, PROCESS, WRITE
Our model, which naturally handles input sets, has three components (the exact equations and im-
plementation will be released in an appendix prior to publication):
• A reading block, which simply embeds each element $x_i \in X$ using a small neural network onto a memory vector $m_i$ (the same neural network is used for all i).
• A process block, which is an LSTM without inputs or outputs performing T steps of computation over the memories $m_i$. This LSTM keeps updating its state by reading $m_i$ repeatedly using the attention mechanism described in the previous section. At the end of this block, its hidden state $q^*_T$ is an embedding which is permutation invariant to the inputs. See eqs. (3)-(7) for more details.
25. • Regularizations
• Dropout in RNN
• Batch Normalization in RNN
• Other Regularizations
• Multi-task learning and pre-training of encoder(-decoder)
3. Learning Tricks
26. • Dropout [Hinton+12, Srivastava+14]
Drop nodes with probability p and scale the remaining ones by 1/(1-p).
• RNNs (mainly LSTM and GRU) need some tricks
• At upward (inter-layer) connections [Zaremba+14]
• At update terms
[Semeniuta+16]
• Use one consistent dropout mask within one sequence (its effect
is still unclear) [Semeniuta+16, Gal15]
• Zoneout. Stochastically preserve
the previous c and h [Krueger+16]
• Word dropout. Stochastically use a zero/mean/<unk> vector as
a word vector. [Iyyer+15, Dai&Le15, Dyer+15, Bowman+15]
Dropout
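Two of the tricks above, the 1/(1-p) scaling and dropout applied to the update term with one mask per sequence, can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes: the weight layout (one matrix `W` producing the i, f, o, g blocks) and all function names are hypothetical, not taken from any of the cited papers' code.

```python
import numpy as np


def inverted_dropout_mask(shape, p, rng):
    """Zero each unit with probability p; scale survivors by 1/(1-p)."""
    keep = rng.random(shape) >= p
    return keep / (1.0 - p)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step_update_dropout(x, h_prev, c_prev, W, b, mask):
    """One LSTM step with dropout on the cell update: c = f*c_prev + i*d(g)."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * (mask * g)   # the memory c_prev itself is never dropped
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
n_in, n_hid, p = 2, 3, 0.25
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)

# Per-sequence sampling: draw the mask once and reuse it at every time step.
mask = inverted_dropout_mask(n_hid, p, rng)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step_update_dropout(x, h, c, W, b, mask)
print(h)
```

Drawing a fresh mask inside the loop would give per-step sampling instead; at test time the mask is simply replaced by ones.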
[Cropped excerpts from Semeniuta+16; only the equations are recoverable.]
LSTM: $h_t = o_t * f(c_t)$, where $i_t$, $f_t$, $o_t$ are the input, forget and output gates at step t, $g_t$ is the vector of cell updates, and ∗ is element-wise multiplication.
Dropout applied to the cell update vector: $c_t = f_t * c_{t-1} + i_t * d(g_t)$
In contrast, Moon et al. (2015) apply dropout directly to the cell values and use per-sequence sampling: $c_t = d(f_t * c_{t-1} + i_t * g_t)$
GRU: dropout applied to the hidden state update vector: $h_t = (1 - z_t) * h_{t-1} + z_t * d(g_t)$
27. • Batch Normalization [Ioffe+15]: a normalize, scale and shift
function at intermediate layers. Popular for CNNs (DNNs).
• RNNs need tricks [Cooijmans+16, Laurent+15]
• Mean and variance values for normalization are
kept separately for each time step
• Don’t insert it in the cell’s recurrence
• Initialize the scale value low (0.1)
to prevent sigmoid/tanh saturation initially.
Batch Normalization
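The normalize, scale and shift steps, together with the low initial scale of 0.1, can be sketched in NumPy. The function name `batch_norm` and the batch shape are hypothetical; this covers training-mode statistics for a single time step only, and an RNN following the tricks above would keep one set of statistics per time step.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 8))   # batch of 64, 8 features

gamma = np.full(8, 0.1)   # low initial scale keeps sigmoid/tanh inputs small
beta = np.zeros(8)
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

With the scale at 0.1 the normalized pre-activations have a standard deviation of about 0.1, so a following tanh or sigmoid starts out in its near-linear region.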
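hidden-to-hidden transformations. We introduce the batch-normalizing transform BN( · ; γ, β)
into the LSTM as follows:
$$\begin{pmatrix} \tilde{f}_t \\ \tilde{i}_t \\ \tilde{o}_t \\ \tilde{g}_t \end{pmatrix} = \mathrm{BN}(W_h h_{t-1}; \gamma_h, \beta_h) + \mathrm{BN}(W_x x_t; \gamma_x, \beta_x) + b \qquad (6)$$
$$c_t = \sigma(\tilde{f}_t) \odot c_{t-1} + \sigma(\tilde{i}_t) \odot \tanh(\tilde{g}_t) \qquad (7)$$
$$h_t = \sigma(\tilde{o}_t) \odot \tanh(\mathrm{BN}(c_t; \gamma_c, \beta_c)) \qquad (8)$$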
network by discarding the absolute scale of activations.
We want to preserve the information in the network, by
normalizing the activations in a training example relative
to the statistics of the entire training data.
3 Normalization via Mini-Batch
Statistics
Since the full whitening of each layer’s inputs is costly
and not everywhere differentiable, we make two neces-
sary simplifications. The first is that instead of whitening
the features in layer inputs and outputs jointly, we will
normalize each scalar feature independently, by making it
have the mean of zero and the variance of 1. For a layer
with d-dimensional input $x = (x^{(1)} \ldots x^{(d)})$, we will normalize each dimension
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
where the expectation and variance are computed over the
training data set. As shown in (LeCun et al., 1998b), such
normalization speeds up convergence, even when the fea-
tures are not decorrelated.
Note that simply normalizing each input of a layer may
change what the layer can represent. For instance, nor-
B = {x1...
Let the normalized values be x1.
formations be y1...m. We refer to
BNγ,β : x1...m →
as the Batch Normalizing Trans
Transform in Algorithm 1. In the
added to the mini-batch variance
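Input: Values of x over a mini-batch
Parameters to be learned: γ, β
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$
$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$
$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$
$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$$
Algorithm 1: Batch Normalizing Transform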
28. • Norm-stabilizer [Krueger+15]
• Penalize the difference between the norms of hidden
vectors at successive time steps
• Temporal Coherence Loss [Jonschkowski&Brock15]
• Penalize the difference between the hidden vectors
at successive time steps.
Regularization for Successiveness
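The difference between the two penalties is easy to see on a hidden state that rotates without changing length. The sketch below assumes the hidden states h_0, ..., h_T are stacked as rows; the squared-difference form of the temporal coherence term and both function names are assumptions made for illustration.

```python
import numpy as np


def norm_stabilizer(h, beta):
    """beta * mean_t (||h_t|| - ||h_{t-1}||)^2 over rows h_0..h_T."""
    norms = np.linalg.norm(h, axis=1)
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)


def temporal_coherence(h, beta):
    """beta * mean_t ||h_t - h_{t-1}||^2 (assumed squared-difference form)."""
    return beta * np.mean(np.sum((h[1:] - h[:-1]) ** 2, axis=1))


# A hidden state that rotates but keeps unit norm.
h = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ns = norm_stabilizer(h, beta=0.5)
tc = temporal_coherence(h, beta=0.5)
print(ns, tc)
```

The norm-stabilizer leaves this trajectory unpenalized, while the temporal coherence loss penalizes it because the vectors themselves change.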
Overfitting in machine learning is addressed by restricting the space o
considered. This can be accomplished by reducing the number of par
with an inductive bias for simpler models, such as early stopping.
can be achieved by incorporating more sophisticated prior knowledg
activations on a reasonable path can be difficult, especially across lo
in mind, we devise a regularizer for the state representation learned
RNNs, that aims to encourage stability of the path taken through repr
we propose the following additional cost term for Recurrent Neural N
$$\beta \frac{1}{T} \sum_{t=1}^{T} \left( \|h_t\|_2 - \|h_{t-1}\|_2 \right)^2$$
Where $h_t$ is the vector of hidden activations at time-step t, and β is a h
amounts of regularization. We call this penalty the norm-stabilizer, as
norms of the hiddens to be stable (i.e. approximately constant acros
coherence” penalty of Jonschkowski & Brock (2015), our penalty
representation to remain constant, only its norm.
In the absence of inputs and nonlinearities, a constant norm would imp
to-hidden transition matrix for simple RNNs (SRNNs). However, in t
sition matrix, inputs and nonlinearities can still change the norm of
instability. This makes targeting the hidden activations directly a mo
ing norm stability. Stability becomes especially important when we
sequences at test time than those seen during training (the “training h
arXiv:1511.08400v
29. • Auto-(sentence-)encoder and language modeling as pre-training
for sentence classification. (But joint learning does
not work well.) [Dai&Le15]
• Multi-task learning for encoder-decoder.
The coefficients of the tasks’ losses are very important.
(Multi-language translation, parsing, image captioning,
auto-encoder, skip-thought vectors) [Luong+15]
Multi-task Learning
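One simple way to realize task coefficients is to allocate parameter updates to tasks in proportion to mixing ratios. The sketch below is hypothetical throughout: the task names, the ratios, and the helper `make_task_sampler` are illustrative choices, and no training step is actually performed.

```python
import random


def make_task_sampler(mixing_ratios, seed=0):
    """Return a sampler that picks a task in proportion to its mixing ratio."""
    total = sum(mixing_ratios.values())
    tasks = list(mixing_ratios)
    probs = [mixing_ratios[t] / total for t in tasks]
    rng = random.Random(seed)

    def sample():
        return rng.choices(tasks, weights=probs, k=1)[0]

    return sample, dict(zip(tasks, probs))


sample_task, probs = make_task_sampler(
    {"translation": 1.0, "parsing": 0.1, "autoencoder": 0.1}
)
# Each entry says which task's mini-batch would drive the next update.
schedule = [sample_task() for _ in range(1000)]
print({t: schedule.count(t) for t in probs})
```

Changing the ratios shifts how often the shared encoder is updated by each task, which is why their choice matters so much in practice.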
Published as a conference paper at ICLR 2016
[Figure 2 labels: English (unsupervised), German (translation), Tags (parsing); English]
Figure 2: One-to-many Setting – one encoder, multiple decoders. This scheme is useful for either multi-target translation as in Dong et al. (2015) or between different tasks. Here, English and German imply sequences of words in the respective languages. The α values give the proportions of parameter updates that are allocated for the different tasks.
for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for ma-
chine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencoders
or a related sequence of English words for the skip-thought objective (Kiros et al., 2015).
3.2 MANY-TO-ONE SETTING
This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of mul-
tiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, for
example, when our tasks include machine translation and image caption generation (Vinyals et al.,
2015b). In addition, from a machine translation perspective, this setting can benefit from a large
amount of monolingual data on the target side, which is a standard practice in machine translation
system and has also been explored for neural MT by Gulcehre et al. (2015).
[Figure 3 labels: English (unsupervised), Image (captioning), German (translation); English]
Figure 3: Many-to-one setting – multiple encoders, one decoder. This scheme is handy for tasks in which only the decoders can be shared.
3.3 MANY-TO-MANY SETTING
Lastly, as the name describes, this category is the most general one, consisting of multiple encoders
[Figure 4 labels: German (translation), English (unsupervised), German (unsupervised); English]
Figure 4: Many-to-many setting – multiple encoders, multiple decoders. We consider t
in a limited context of machine translation to utilize the large monolingual corpora i
source and the target languages. Here, we consider a single translation task and two un
autoencoder tasks.
consist of ordered sentences, e.g., paragraphs. Unfortunately, in many applications th
machine translation, we only have sentence-level data where the sentences are unordered.
that, we split each sentence into two halves; we then use one half to predict the other hal
30. • Lightening the computation of the softmax output
• Copy mechanism
• Character-based
• Global Optimization of Decoding
4. Decoding
31. • Softmax over a large vocabulary (many classes) has high time
and space complexity.
Lighten (replace) it with
• Sampled Softmax
• Class-factored Softmax
• Hierarchical Softmax
• BlackOut
• Noise Contrastive Estimation (NCE)
• Self-normalization
• Negative Sampling
Lighten Softmax
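As one example from this list, a class-factored softmax scores a word as p(class) × p(word | class), so each prediction touches only the class logits and the logits of one class's members. The NumPy sketch below uses a toy vocabulary of six words in two classes; the class assignment, weight shapes, and function names are all hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def class_factored_prob(h, word, word2class, class_words, W_class, W_word):
    """p(word | h) = p(class(word) | h) * p(word | class(word), h)."""
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]
    members = class_words[c]
    p_in_class = softmax(W_word[members] @ h)   # normalize only within the class
    return p_class * p_in_class[members.index(word)]


rng = np.random.default_rng(0)
class_words = [[0, 1, 2], [3, 4, 5]]            # toy assignment of words to classes
word2class = [0, 0, 0, 1, 1, 1]
h = rng.normal(size=4)                          # hidden state from the decoder
W_class = rng.normal(size=(2, 4))
W_word = rng.normal(size=(6, 4))

total = sum(
    class_factored_prob(h, w, word2class, class_words, W_class, W_word)
    for w in range(6)
)
print(total)
```

With roughly sqrt(V) classes of sqrt(V) words each, the cost per prediction drops from O(V) to about O(sqrt(V)), while the factored probabilities still sum to one over the vocabulary.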
32. • Copy function from the source sentence
• [Gulcehre+16] calculates an attention distribution over the source
sentence’s LSTM outputs
(Pointer Networks [Vinyals+15]).
The output is a sigmoid-gated weighted sum of
the common vocabulary output probability distribution and
the copy probability distribution.
• [Gu+16]
Similar, but with a more complicated
structure
Copy Mechanism
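The gated mixture of generating and copying can be sketched in NumPy. The sketch assumes that source tokens have ids in the same (extended) vocabulary as the output distribution and that the gate weights the generation side; the function name and all numbers are hypothetical, and in a real model the gate would come from a sigmoid over the decoder state.

```python
import numpy as np


def mix_generate_and_copy(p_vocab, attention, source_ids, gate):
    """gate * p_vocab + (1 - gate) * (attention mass scattered onto source word ids)."""
    p_copy = np.zeros_like(p_vocab)
    np.add.at(p_copy, source_ids, attention)   # repeated source words accumulate
    return gate * p_vocab + (1.0 - gate) * p_copy


p_vocab = np.array([0.1, 0.2, 0.3, 0.3, 0.1])   # softmax over a 5-word vocabulary
attention = np.array([0.5, 0.2, 0.3])           # attention over 3 source tokens
source_ids = np.array([4, 2, 4])                # word id 4 occurs twice in the source
gate = 0.7                                      # e.g. a sigmoid output in the model

p_final = mix_generate_and_copy(p_vocab, attention, source_ids, gate)
print(p_final, p_final.sum())
```

Since both inputs are probability distributions and the gate lies in [0, 1], the mixture is again a valid distribution, and a word that appears in the source gains probability mass from the attention.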
[Figure from Gu+16: (a) Attention-based Encoder-Decoder (RNNSearch), with source “hello , my name is Tony Jebara .” and output “hi , Tony Jebara”; (b) Generate-Mode & Copy-Mode, where Prob(“Jebara”) = Prob(“Jebara”, g) + Prob(“Jebara”, c); (c) State Update, using the embedding for “Tony” and the selective read for “Tony”.]
33. forms and their meanings is non-trivial (de Saussure, 1916). While some compositional relationships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively regular effects, many words with lexical similarities convey different meanings, such as the word pairs lesson ⟺ lessen and coarse ⟺ course.
3 C2W Model
Our compositional character to word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. The input of the C2W model (illustrated on bottom) is a single word type w, and what we wish to obtain is a d-dimensional vector used to represent w. This model shares the same input and output of a word lookup table (illustrated on top), allowing it to easily replace them in any network.
As input, we define an alphabet of characters C. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word w is decomposed into a sequence of characters c1, . . . , cm, where m is the length of w. Each ci
[Figure 1 of the excerpt: a word lookup table (top) and the C2W model (bottom), which maps the characters “c a t s” through a character lookup table and a Bi-LSTM; both output the embeddings for the word "cats".]
• The input/output unit is a character, not a predefined word
• Used for language modeling, input features for various tasks, and
machine translation decoding. LM: [Sutskever+11, Graves13,
Ling+15a, Kim+15], MT: [Chung+16, Ling+15b, Costa-jussa&Fonollosa16, Luong+16]
• Combination of words and characters
[Kang+11,Józefowicz+16,Miyamoto&Cho16]
• (Not only RNN composition,
but also CNN)
• Good for handling morphology
and the rare word problem
Character-based
34. Figure 1: Illustration of the Scheduled Sampling approach,
• Reinventing the wheel (of non-NN research)?
(Even so, these are useful and good next steps.)
• Use the model’s prediction as the next input during training (while
usually only the true input is used) [Bengio+15]
• Similar to DAgger [Daumé III16; Blog]
• Use a dynamic oracle [Daumé III16; Blog]
[Ballesteros+16,
Goldberg&Nivre13]
Global Decoding
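Feeding the model's own prediction back in during training can be sketched as a coin flip at every decoder step. The helper names below are hypothetical, `epsilon` is the probability of using the true previous token, and the linear decay is one simple schedule for lowering it as training proceeds.

```python
import random


def choose_next_input(true_token, predicted_token, epsilon, rng):
    """With probability epsilon feed the true token, otherwise the model's prediction."""
    return true_token if rng.random() < epsilon else predicted_token


def linear_decay(step, total_steps, floor=0.0):
    """Epsilon falls linearly from 1.0 at step 0 to `floor`."""
    return max(floor, 1.0 - step / total_steps)


rng = random.Random(0)
always_true = [choose_next_input("gold", "pred", 1.0, rng) for _ in range(20)]
always_pred = [choose_next_input("gold", "pred", 0.0, rng) for _ in range(20)]
mixed = [choose_next_input("gold", "pred", linear_decay(50, 100), rng) for _ in range(20)]
print(mixed)
```

Early in training the decoder mostly sees true inputs, as in ordinary teacher forcing; later it increasingly has to recover from its own mistakes, which is the condition it faces at test time.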
35. • REINFORCE to optimize BLEU/ROUGE [Ranzato+15]
• Minimum Risk Training [Shen+15, Ayana+16]
• Optimization for beam search [Wiseman&Rush16]
Global Decoding
In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to the
problem of sequence generation we cast our problem in the reinforcement learning (RL) frame-
work (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, which
interacts with the external environment (the words and the context vector it sees as input at every
time step). The parameters of this agent defines a policy, whose execution results in the agent pick-
ing an action. In the sequence generation setting, an action refers to predicting the next word in
the sequence at each time step. After taking an action the agent updates its internal state (the hid-
den units of RNN). Once the agent has reached the end of a sequence, it observes a reward. We
can choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin
& Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometric
mean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, we
consider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, we
have a training set of optimal sequences of actions. During training we choose actions according to
the current policy and only observe a reward at the end of the sequence (or after maximum sequence
length), by comparing the sequence of actions from the current policy against the optimal action
sequence. The goal of training is to find the parameters of the agent that maximize the expected
reward. We define our loss as the negative expected reward:
$$L_\theta = -\sum_{w^g_1, \ldots, w^g_T} p_\theta(w^g_1, \ldots, w^g_T)\, r(w^g_1, \ldots, w^g_T) = -\mathbb{E}_{[w^g_1, \ldots, w^g_T] \sim p_\theta}\, r(w^g_1, \ldots, w^g_T), \qquad (9)$$
where $w^g_n$ is the word chosen by our model at the n-th time step, and r is the reward associated
with the generated sequence. In practice, we approximate this expectation with a single sample
from the distribution of actions implemented by the RNN (right hand side of the equation above
and Figure 9 of Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever,
2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partial
derivatives and their interpretation. The derivatives w.r.t. parameters are:
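$$\frac{\partial L_\theta}{\partial \theta} = \sum_t \frac{\partial L_\theta}{\partial o_t} \frac{\partial o_t}{\partial \theta} \qquad (10)$$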
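A single-sample REINFORCE estimate of the gradient with respect to the softmax inputs can be sketched in NumPy. The sketch makes several simplifying assumptions that are not in the excerpt: the logits at each step are fixed rather than conditioned on earlier words, no baseline is subtracted, and the reward is a hypothetical token-overlap score standing in for BLEU or ROUGE.

```python
import numpy as np


def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def reinforce_logit_gradients(logits, sampled, reward):
    """Gradient of -reward * sum_t log p(w_t) with respect to the logits o_t."""
    probs = softmax_rows(logits)
    one_hot = np.eye(logits.shape[1])[sampled]
    return reward * (probs - one_hot)


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))        # T = 4 steps, vocabulary of 5 words
probs = softmax_rows(logits)

# Draw one sequence from the current policy.
sampled = np.array([rng.choice(5, p=p) for p in probs])

# Hypothetical sequence-level reward: fraction of positions matching a reference.
reference = np.array([1, 3, 0, 2])
reward = float(np.mean(sampled == reference))

grads = reinforce_logit_gradients(logits, sampled, reward)
print(reward, grads.shape)
```

A gradient-descent step with these values raises the probability of the sampled words in proportion to the reward, so highly rewarded samples are reinforced and zero-reward samples leave the policy unchanged.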
Published as a conference paper at ICLR 2016
[Figure 3 diagram: an RNN unrolled over time steps h1, h2, h3, with XENT losses on the outputs and top-k predictions fed forward as the next inputs.]
Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence
(here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. How-
ever, in the remaining steps the input to each module is a sparse vector whose non-zero entries are
the k largest probabilities of the distribution predicted at the previous time step. Errors are back-
propagated through these inputs as well.
While this algorithm is a simple way to expose the model to its own predictions, the loss function
optimized is still XENT at each time step. There is no explicit supervision at the sequence level
while training the model.
3.2 SEQUENCE LEVEL TRAINING
We now introduce a novel algorithm for sequence level training, which we call Mixed Incremental
Cross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, and
oss L using a two-step pro-
ass, we compute candidate
n violations (sequences with
backward pass, we back-
ugh the seq2seq RNNs. Un-
ining, the first-step requires
case beam search) to find
[Figure: beam-search candidates at each time step (axis: Time Step) for the sequence “a red dog runs quickly today”.]
36. • Better RNN units and connections have been proposed;
however, their impact is small for now
(compared to the gaps between “vanilla and LSTM” or
“1-layer and multi-layer”).
• More analysis is needed, both in general and for each task
• Designing models around a (reasonable) idea may yield good
results, e.g., tree composition
• Regularization methods and learning tricks have increased
• Other training or inference algorithms for decoding are
required
Summary