Deep Learning seminar presentation, Max Planck Institute for Informatics.
Based on the papers "Memory Networks" (Weston et al., ICLR 2015), "End-To-End Memory Networks" (Sukhbaatar et al., 2015), and "Neural Turing Machines" (Graves et al., 2014).
Memory Networks, Neural Turing Machines, and Question Answering
1. Memory Networks, Neural Turing Machines, and Question Answering
Akram El-Korashy
Max Planck Institute for Informatics
November 30, 2015
Deep Learning Seminar. Papers by Weston et al. (ICLR 2015), Graves et al. (2014), and Sukhbaatar et al. (2015).
2. Outline
1 Introduction
  Intuition and resemblance to human cognition
  What does it look like?
2 QA Experiments, End-to-End
  Architecture - MemN2N
  Training
  Baselines and Results
3 QA Experiments, Strongly Supervised
  Architecture - MemNN
  Training
  Results
4 NTM code induction experiments
3. Why memory?
Human working memory is a capacity for short-term storage of information and its rule-based manipulation...
Therefore, an NTM¹ resembles a working memory system, as it is designed to solve tasks that require applying approximate rules to "rapidly-created variables".
¹ Neural Turing Machine. I will use the term interchangeably with Memory Networks, depending on which paper I am citing.
4. Why memory? Why not RNNs and LSTM?
The memory in these models is the state of the network, which is latent (i.e., hidden; no explicit access) and inherently unstable over long timescales. [Sukhbaatar2015]
Unlike a standard network, an NTM interacts with a memory matrix using selective read and write operations that can focus on (almost) a single memory location. [Graves2014]
5-6. Why memory networks? How about attention models with RNN encoders/decoders?
The memory model is indeed analogous to the attention mechanisms introduced for machine translation.
Main differences:
In a memory network model, the query can be made over multiple sentences, unlike in machine translation.
The memory model makes several hops on the memory before producing an output.
The network architecture of the memory scoring is a simple linear layer, as opposed to the sophisticated gated architecture of previous work.
7. Why memory? What's the main usage?
Memory as non-compact storage: explicitly update memory slots m_i at test time by making use of a "generalization" component that determines "what" is to be stored from input x, and "where" to store it (choosing among the memory slots).
Storing stories for Question Answering: given a story (i.e., a sequence of sentences), training the output component of the memory network can learn scoring functions (i.e., similarity) between query sentences and the memory slots holding earlier sentences.
8-17. Overview of a memory model
A memory model that is trained only end-to-end.
The trained model takes a set of inputs x_1, ..., x_n to be stored in the memory, a query q, and outputs an answer a.
Each of the x_i, q, a contains symbols coming from a dictionary with V words.
All x are written to memory up to a fixed buffer size; the model then finds a continuous representation for the x and q.
The continuous representation is processed via multiple hops to output a. This allows back-propagation of the error signal through multiple memory accesses back to the input during training.
A, B, C are embedding matrices (of size d × V) used to convert the inputs to the d-dimensional vectors m_i (and c_i, and the query state u).
A match is computed between u and each memory m_i by taking the inner product followed by a softmax: p_i = Softmax(u^T m_i).
The response vector o from the memory is the weighted sum o = Σ_i p_i c_i.
The final prediction (answer to the query) is computed with the help of a weight matrix W as â = Softmax(W(o + u)) (sketched in code below).
Figure: A single layer, and a three-layer memory model. [Sukhbaatar2015]
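To make the single-layer flow concrete, here is a minimal NumPy sketch of the forward pass just described. It assumes bag-of-words sentence vectors; the sizes, random parameters and data are illustrative stand-ins, not trained values from the paper.

```python
# Minimal single-layer MemN2N forward pass (illustrative sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, n = 177, 20, 10                      # V = 177 as on the slides; d, n are choices

A = rng.normal(scale=0.1, size=(d, V))     # input memory embedding
B = rng.normal(scale=0.1, size=(d, V))     # question embedding
C = rng.normal(scale=0.1, size=(d, V))     # output memory embedding
W = rng.normal(scale=0.1, size=(V, d))     # final prediction matrix

x = rng.integers(0, 2, size=(n, V)).astype(float)  # BoW vectors of the n story sentences
q = rng.integers(0, 2, size=V).astype(float)       # BoW vector of the question

m = x @ A.T                    # memory vectors m_i = A x_i
c = x @ C.T                    # output vectors  c_i = C x_i
u = B @ q                      # internal state  u = B q

p = softmax(m @ u)             # p_i = Softmax(u^T m_i)
o = p @ c                      # o = sum_i p_i c_i
a_hat = softmax(W @ (o + u))   # distribution over the V candidate answer words
```

Stacking such layers, with the weight tying described below, yields the multi-hop model in the right half of the figure.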
18. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
19-24. Synthetic QA tasks, supporting subset
There are a total of 20 different types of tasks that test different forms of reasoning and deduction.
Note that for each question, only some subset of the statements contains information needed for the answer; the others are essentially irrelevant distractors (e.g., the first sentence in the first example).
In the Memory Networks of Weston et al., this supporting subset was explicitly indicated to the model during training. In what is called end-to-end training of memory networks, this information is no longer provided.
A task is a set of example problems. A problem is a set of I sentences x_i where I ≤ 320, a question q and an answer a.
The vocabulary is of size V = 177! Two versions of the data are used: one with 1,000 training problems per task, and one with 10,000 per task.
Figure: A given QA task consists of a set of statements, followed by a question whose answer is typically a single word. [Sukhbaatar2015]
25. Architecture - MemN2N: Model Architecture
K = 3 hops were used.
Adjacent weight sharing was used to ease training and reduce the number of parameters.
Adjacent weight tying (a code sketch follows the list):
1 The output embedding of a layer is the input embedding of the layer above: A^(k+1) = C^k.
2 The answer prediction matrix is the final output embedding: W^T = C^K.
3 The question embedding is the input embedding of the first layer: B = A^1.
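A follow-up sketch of how adjacent tying wires a K = 3 hop model: with A^(k+1) = C^k, the K hops need only K + 1 distinct embedding matrices, and W falls out of the last one. As before, sizes and random data are illustrative.

```python
# K-hop MemN2N with adjacent weight tying (illustrative sketch).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, K = 177, 20, 3
x = rng.integers(0, 2, size=(10, V)).astype(float)   # BoW story sentences
q = rng.integers(0, 2, size=V).astype(float)         # BoW question

# emb[k] plays the role of A^(k+1) for k < K, and emb[K] is C^K (= W^T).
emb = [rng.normal(scale=0.1, size=(d, V)) for _ in range(K + 1)]

u = emb[0] @ q                       # B = A^1, so the question reuses A^1
for k in range(K):
    m = x @ emb[k].T                 # memory vectors via A^k
    c = x @ emb[k + 1].T             # output vectors via C^k = A^(k+1)
    p = softmax(m @ u)               # attention over memories
    u = u + p @ c                    # u^(k+1) = u^k + o^k
a_hat = softmax(emb[K].T @ u)        # W = (C^K)^T closes the stack
```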
26-27. Architecture - MemN2N: Sentence Representation, Temporal Encoding
Two different sentence representations: bag-of-words (BoW) and Position Encoding (PE).
BoW embeds each word and sums the resulting vectors: m_i = Σ_j A x_ij.
PE encodes the position of each word using a column vector l_j, where l_kj = (1 − j/J) − (k/d)(1 − 2j/J) and J is the number of words in the sentence (sketched below).
Temporal Encoding: modify the memory vector with a special matrix that encodes temporal information:² m_i = Σ_j A x_ij + T_A(i), where T_A(i) is the i-th row of a special temporal matrix T_A.
All the T matrices are learned during training. They are subject to the same sharing constraints as those between A and C.
² There isn't enough detail on what constraints this matrix should be subject to, if any.
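A small sketch of the PE weights given by this formula. The final comment, where word embeddings are weighted element-wise by l_j before summing, is how [Sukhbaatar2015] applies PE; the slide leaves that step implicit, so treat it as an assumption here.

```python
# Position Encoding weights l_kj = (1 - j/J) - (k/d)(1 - 2j/J).
import numpy as np

def position_encoding(J, d):
    """Return a (J, d) matrix whose row j-1 holds the vector l_j."""
    j = np.arange(1, J + 1)[:, None]   # word positions 1..J
    k = np.arange(1, d + 1)[None, :]   # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

L = position_encoding(J=6, d=20)       # weights for a 6-word sentence
# With PE, a sentence embedding becomes m_i = sum_j l_j * (A x_ij)
# (element-wise product per word); temporal encoding would then add
# the learned row T_A(i) on top, as on the slide.
```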
28. Training: Loss function and learning parameters
The embedding matrices A, B and C, as well as W, are jointly learned.
The loss function is a standard cross-entropy between â and the true label a.
Stochastic gradient descent is used with a learning rate of η = 0.01, with annealing.
29. Training: Parameters and Techniques
RN: learn time invariance by injecting random noise to regularize T_A.
LS (linear start): remove all softmaxes except for the answer prediction layer; re-apply them when the validation loss stops decreasing. (LS learning rate of η = 0.005 instead of 0.01 for normal training.)
LW: layer-wise, RNN-like weight tying, as opposed to adjacent weight tying.
BoW or PE: sentence representation.
joint: training on all 20 tasks jointly vs. independently. [Sukhbaatar2015]
30-35. Baselines and Results
Take-home message: more memory hops give improved performance.
Take-home message: joint training on the various tasks sometimes helps.
Figure: All variants of the end-to-end trained memory model comfortably beat the weakly supervised baseline methods. [Sukhbaatar2015]
36. Baselines and Results: Set of Supporting Facts
Figure: Instances of successful prediction of the supporting sentences.
37. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
38. Architecture - MemNN: IGOR
The memory network consists of a memory m and four learned components (a skeleton sketch follows the list):
1 I (input feature map): converts the incoming input to the internal feature representation.
2 G (generalization): updates old memories given the new input.
3 O (output feature map): produces a new output, given the new input and the current memory state.
4 R (response): converts the output into the desired response format.
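A structural skeleton of these four components, as a hypothetical sketch: the memory is a plain list, and the I, G, O, R bodies are placeholder implementations (a pass-through featurizer, append-only storage, dot-product scoring with k = 1, additive response features), not the learned modules of the paper.

```python
# Skeleton of the MemNN I/G/O/R decomposition (placeholder bodies).
import numpy as np

class MemNN:
    def __init__(self, d):
        self.d = d
        self.memory = []                       # the memory m (a list of slots)

    def I(self, raw_input):                    # input feature map
        return np.asarray(raw_input, dtype=float)

    def G(self, features):                     # generalization: here, append a new slot
        self.memory.append(features)

    def O(self, features):                     # output: best supporting memory (k = 1)
        scores = [float(features @ m_i) for m_i in self.memory]
        return self.memory[int(np.argmax(scores))]

    def R(self, features, supporting):         # response from input + supporting memory
        return features + supporting

net = MemNN(d=4)
net.G(net.I([1, 0, 0, 0]))                     # store two "sentences"
net.G(net.I([0, 1, 0, 0]))
x = net.I([1, 0, 0, 1])                        # a "question"
print(net.R(x, net.O(x)))
```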
39. Architecture - MemNN: Model Flow
The core of inference lies in the O and R modules. The O module produces output features by finding k supporting memories given x (toy sketch below).
For k = 1, the highest scoring supporting memory is retrieved: o_1 = O_1(x, m) = argmax_{i=1,...,N} s_O(x, m_i).
For k = 2, a second supporting memory is additionally computed: o_2 = O_2(x, m) = argmax_{i=1,...,N} s_O([x, m_{o_1}], m_i).
In the single-word response setting, where W is the set of all words in the dictionary, r = argmax_{w ∈ W} s_R([x, m_{o_1}, m_{o_2}], w).
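A toy sketch of this O/R flow with k = 2. It assumes dot-product stand-ins for s_O and s_R, and approximates the feature combination [x, m_{o_1}] by vector addition; all names and data are illustrative.

```python
# MemNN inference with k = 2 supporting memories (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 8
vocab = ["kitchen", "garden", "milk", "football"]

memories = rng.normal(size=(N, d))          # m_1..m_N, already featurized
word_vecs = rng.normal(size=(len(vocab), d))
x = rng.normal(size=d)                      # featurized question I(x)

def s(a, b):                                # stand-in scoring function s_O / s_R
    return float(a @ b)

# O module: first argmax over memories, second argmax conditioned on the first
# ([x, m_o1] approximated here as x + m_o1).
o1 = max(range(N), key=lambda i: s(x, memories[i]))
o2 = max(range(N), key=lambda i: s(x + memories[o1], memories[i]))

# R module: single-word response, scored against every dictionary word.
r = max(range(len(vocab)), key=lambda w: s(x + memories[o1] + memories[o2], word_vecs[w]))
print(vocab[r])
```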
40. Training: Max-margin, SGD
Supporting-sentence annotations are available as part of the training data. Thus, the scoring functions are trained by minimizing a margin ranking loss over the model parameters U_O and U_R using SGD.
Figure: For a given question x with true response r and supporting sentences m_{o_1}, m_{o_2} (i.e., k = 2), this expression is minimized over parameters U_O and U_R, where f̄, f̄′ and r̄ range over all choices other than the correct labels, and γ is the margin.
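The expression itself is only an image in the original slides; reconstructed from the Memory Networks paper (Weston et al., 2015), the margin ranking loss being minimized is:

```latex
\sum_{\bar{f} \neq m_{o_1}} \max\!\left(0,\; \gamma - s_O(x, m_{o_1}) + s_O(x, \bar{f})\right)
+ \sum_{\bar{f}' \neq m_{o_2}} \max\!\left(0,\; \gamma - s_O([x, m_{o_1}], m_{o_2}) + s_O([x, m_{o_1}], \bar{f}')\right)
+ \sum_{\bar{r} \neq r} \max\!\left(0,\; \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r})\right)
```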
41-43. Results: large-scale QA
Figure: Results on a QA dataset with 14M statements.
Hashing techniques for efficient memory scoring (a toy sketch of the word hash follows the list):
Idea: hash the inputs I(x) into buckets, and score only the memories m_i lying in the same buckets.
Word hash: one bucket per dictionary word, containing all sentences that contain this word.
Cluster hash: run K-means to cluster the word vectors (U_O)_i, giving K buckets; hash a sentence to all buckets to which its words belong.
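A toy sketch of the word-hash idea on raw strings; in the real system the buckets would index featurized memories, and only the candidate set would be scored with s_O.

```python
# Word hashing: bucket sentences by word, so a query only scores
# memories that share at least one word with it (illustrative sketch).
from collections import defaultdict

sentences = ["joe went to the kitchen",
             "fred picked up the milk",
             "joe travelled to the office"]

buckets = defaultdict(set)                  # word -> indices of sentences containing it
for i, s in enumerate(sentences):
    for w in s.split():
        buckets[w].add(i)

query = "where is the milk"
candidates = set().union(*(buckets[w] for w in query.split() if w in buckets))
# Only `candidates` would be scored with s_O, instead of all N memories.
print(sorted(candidates))
```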
44. Results: simulation QA
Figure: The task is a simple simulation of 4 characters, 3 objects and 5 rooms, with characters moving around, picking up and dropping objects. (Similar to the 10k dataset of MemN2N.)
45. Results: simulation QA - sample test results
Figure: Sample test-set predictions (in red) for the simulation, in the setting where inputs are word-based, answers are sentences, and an LSTM is used as the R component of the MemNN.
46. Plan
1 Introduction
2 QA Experiments, End-to-End
3 QA Experiments, Strongly Supervised
4 NTM code induction experiments
47. Architecture
A more sophisticated memory "controller".
Figure: Content-addressing is implemented by learning similarity measures, analogously to MemNN. Additionally, the controller simulates location-based addressing by implementing a rotational shift of a weighting.
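A small sketch of that rotational shift, implemented as the circular convolution of the weighting with a shift kernel as in Graves et al. (2014); the particular distributions are illustrative.

```python
# Location-based addressing: rotate a weighting by convolving it with
# a shift distribution over offsets {-1, 0, +1} (illustrative sketch).
import numpy as np

w = np.array([0.1, 0.7, 0.1, 0.05, 0.05])   # current weighting over 5 memory slots
s = np.array([0.0, 0.9, 0.1])                # shift kernel: P(shift = -1, 0, +1)

shifted = np.zeros_like(w)
for i in range(len(w)):
    for k, off in enumerate((-1, 0, 1)):
        shifted[i] += w[(i - off) % len(w)] * s[k]   # circular convolution

print(shifted, shifted.sum())                # still a distribution (sums to 1)
```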
48-49. NTM learns a Copy task ... on which LSTM fails
Figure: The networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomized between 1 and 20. An NTM with an LSTM controller was used; a plain LSTM fails on the same task.
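A sketch of how such copy-task training pairs can be generated, following the caption's description (eight-bit random vectors, lengths 1 to 20) and assuming the paper's extra delimiter channel; the exact input/target layout here is an illustrative choice.

```python
# Copy-task data generation (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def copy_task_example(max_len=20, bits=8):
    T = int(rng.integers(1, max_len + 1))
    seq = rng.integers(0, 2, size=(T, bits)).astype(float)
    # Inputs carry an extra channel whose single 1 marks end of presentation.
    inputs = np.zeros((2 * T + 1, bits + 1))
    inputs[:T, :bits] = seq
    inputs[T, bits] = 1.0                    # delimiter flag
    # Targets: reproduce the sequence after the delimiter, from blank inputs.
    targets = np.zeros((2 * T + 1, bits))
    targets[T + 1:, :] = seq
    return inputs, targets

x, y = copy_task_example()
```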
50-51. Summary
Intuition of memory networks vs. standard neural network models.
MemNN is successful through strongly supervised learning on QA tasks.
MemN2N uses more realistic end-to-end training, and remains competitive on the same tasks.
NTMs can learn simple memory copy and recall tasks from input-memory, output-memory training data.
Thank you!
52. References
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. End-To-End Memory Networks. 2015.
Jason Weston, Sumit Chopra, Antoine Bordes. Memory Networks. ICLR 2015.
Alex Graves, Greg Wayne, Ivo Danihelka. Neural Turing Machines. 2014.
Nando de Freitas. Deep Learning at Oxford 2015.