A lexisearch algorithm for the Bottleneck Traveling Salesman Problem (CSCJournals)
The Bottleneck Traveling Salesman Problem (BTSP) is a variation of the well-known Traveling Salesman Problem in which the objective is to minimize the maximum lap (arc length) in the tour of the salesman. In this paper, a lexisearch algorithm using the adjacency representation of a tour is developed for obtaining an exact optimal solution to the problem. A comparative study is then carried out to show the efficiency of the algorithm against an existing exact algorithm on randomly generated and TSPLIB instances of different sizes.
EVEN GRACEFUL LABELLING OF A CLASS OF TREES (Fransiskeran)
A labelling or numbering of a graph G with q edges is an assignment of labels to the vertices of G that induces for each edge uv a label depending on the vertex labels f(u) and f(v). A labelling is called a graceful labelling if there exists an injective function f: V(G) → {0, 1, 2, ..., q} such that for each edge xy the induced edge label |f(x) - f(y)| is distinct. In this paper, we prove that a class of Tn trees is even graceful.
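To make the definition concrete, here is a small worked example (our own illustration, not one of the Tn trees from the paper): a graceful labelling of the path P4 with q = 3 edges.

```latex
% Graceful labelling of the path P_4 (v_1 - v_2 - v_3 - v_4), q = 3 edges.
% Vertex labels (an injection into {0, 1, 2, 3}):
%   f(v_1) = 0, f(v_2) = 3, f(v_3) = 1, f(v_4) = 2
% Induced edge labels:
%   |f(v_1) - f(v_2)| = 3, |f(v_2) - f(v_3)| = 2, |f(v_3) - f(v_4)| = 1,
% which are pairwise distinct, so f is a graceful labelling of P_4.
\[
  f(v_1)=0,\; f(v_2)=3,\; f(v_3)=1,\; f(v_4)=2
  \quad\Longrightarrow\quad
  \bigl\{\,|f(u)-f(v)| : uv \in E(P_4)\,\bigr\} = \{1,2,3\}.
\]
```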
A data structure is a way of collecting and organising data so that we can perform operations on it effectively. Data structures are about rendering data elements in terms of some relationship, for better organization and storage. For example, suppose we have a player's name, "Virat", and age, 26. Here "Virat" is of String data type and 26 is of integer data type.
We can organize this data as a record, such as a Player record, and then collect and store player records in a file or database as a data structure. For example: "Dhoni" 30, "Gambhir" 31, "Sehwag" 33.
In simple language, data structures are structures programmed to store ordered data, so that various operations can be performed on them easily.
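As a minimal sketch of the Player record idea above (the field names are only for illustration), here is the record and a small collection of records in Python:

```python
from dataclasses import dataclass

# A Player record groups a String-typed name with an integer-typed age,
# mirroring the "Virat", 26 example above.
@dataclass
class Player:
    name: str
    age: int

# A simple collection of player records, like the file/database example above.
players = [Player("Dhoni", 30), Player("Gambhir", 31), Player("Sehwag", 33)]

# Operations can now be performed on the organized data, e.g. sort by age.
print(sorted(players, key=lambda p: p.age))
```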
In this article, we first generalize a few notions such as (α)-soft compatible maps, (β)-soft compatible maps, soft compatible maps of type (I), and soft compatible maps of type (II) in soft metric spaces, and then give an account of the comparison of these soft compatible maps. Finally, we demonstrate the utility of these new concepts by proving a common fixed point theorem for four soft continuous self maps on a complete soft metric space.
We propose a new stochastic first-order algorithmic framework to solve stochastic composite nonconvex optimization problems that covers both finite-sum and expectation settings. Our algorithms rely on the SARAH estimator and consist of two steps: a proximal gradient step and an averaging step, making them different from existing nonconvex proximal-type algorithms. The algorithms only require an average smoothness assumption on the nonconvex objective term, and an additional bounded variance assumption when applied to expectation problems. They work with both constant and adaptive step-sizes, while allowing single samples and mini-batches. In all these cases, we prove that our algorithms can achieve the best-known complexity bounds. One key ingredient of our methods is new constant and adaptive step-sizes that help achieve the desired complexity bounds while improving practical performance. Our constant step-size is much larger than in existing methods, including proximal SVRG schemes, in the single-sample case. We also specialize the algorithm to the non-composite case, where it covers existing state-of-the-art methods in terms of complexity bounds. Our update also allows one to trade off between step-sizes and mini-batch sizes to improve performance. We test the proposed algorithms on two composite nonconvex problems and on neural networks using several well-known datasets.
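The following is a minimal sketch, not the authors' implementation, of the two-step update described above: a SARAH gradient estimator, a proximal gradient step, and an averaging step, here assuming an l1 regularizer and one inner epoch; the constants eta (step-size), gamma (averaging weight), and the toy Lasso-type problem are placeholders.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_sarah_epoch(x, grad_f, n, eta, gamma, lam, rng):
    # SARAH estimator initialized with the full gradient at the snapshot.
    v = np.mean([grad_f(x, i) for i in range(n)], axis=0)
    for _ in range(n):
        x_hat = prox_l1(x - eta * v, eta * lam)      # proximal gradient step
        x_next = (1.0 - gamma) * x + gamma * x_hat   # averaging step
        i = rng.integers(n)                          # single-sample variant
        v = grad_f(x_next, i) - grad_f(x, i) + v     # SARAH recursion
        x = x_next
    return x

# Toy usage: (1/n) sum_i (a_i^T x - b_i)^2 + lam * ||x||_1.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
grad_f = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]
x = prox_sarah_epoch(np.zeros(10), grad_f, 50, eta=0.05, gamma=0.9, lam=0.1, rng=rng)
print(np.round(x, 3))
```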
The EM algorithm is explained slowly and clearly, starting from Jensen's inequality.
If you read this, you will get a reasonable understanding of how LDA is trained with variational methods.
This is excerpted from Andrew Ng's old lecture notes; the fact that I still go back and consult something I first saw five years ago makes me realize once again what a great lecture it was.
In this paper, the assignment problem with crisp, fuzzy, and intuitionistic fuzzy numbers as cost coefficients is investigated. In the conventional assignment problem, the cost is always certain. This paper develops an approach to solve a mixed intuitionistic fuzzy assignment problem where costs are given as real, fuzzy, and intuitionistic fuzzy numbers. The ranking procedure of Annie Varghese and Sunny Kuriakose [4] is used to transform the mixed intuitionistic fuzzy assignment problem into a crisp one so that the conventional method may be applied to solve the assignment problem. The method is illustrated by a numerical example. The proposed method is very simple and easy to understand. Numerical examples show that an intuitionistic fuzzy ranking method offers an effective tool for handling an intuitionistic fuzzy assignment problem.
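A minimal sketch of the overall workflow described above: rank each (possibly fuzzy) cost into a crisp score, then solve the resulting crisp assignment problem with the conventional Hungarian-type method. The ranking function and the cost matrix below are placeholders, not the Annie Varghese and Sunny Kuriakose ranking or the paper's example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def crisp_rank(cost):
    # Placeholder ranking: a triangular fuzzy number (a, b, c) is ranked by
    # its centroid; a plain real number is returned unchanged.
    if isinstance(cost, tuple):
        return sum(cost) / len(cost)
    return float(cost)

# Mixed cost matrix: some entries crisp, some triangular fuzzy numbers.
costs = [
    [4.0, (2.0, 3.0, 4.0), 7.0],
    [(5.0, 6.0, 7.0), 3.0, 2.0],
    [6.0, 8.0, (1.0, 2.0, 3.0)],
]
crisp = np.array([[crisp_rank(c) for c in row] for row in costs])

rows, cols = linear_sum_assignment(crisp)   # optimal assignment on the ranks
print(list(zip(rows, cols)), crisp[rows, cols].sum())
```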
Black holes and white rabbits: metaphor identification with visual features (Sumit Maharjan)
E. Shutova, D. Kiela, and J. Maillard. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170, San Diego, California, June 2016. Association for Computational Linguistics.
Abstract:
Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow users to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation partial. In our work, we first study the relationship between the textual and visual aspects in multimodal posts from three major social media platforms, i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to quantify the extent to which images are perceived as necessary by human annotators. Moreover, we propose two different computational frameworks to detect sarcasm that integrate the textual and visual modalities. The first approach exploits visual semantics trained on an external dataset, and concatenates the semantic features with state-of-the-art textual features. The second method adapts a visual neural network initialized with parameters trained on ImageNet to multimodal sarcastic posts. Results show the positive effect of combining modalities for the detection of sarcasm across platforms and methods.
Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07) (Márton Miháltz)
A brief survey of deep learning/neural network methods currently used in NLP: recurrent networks (LSTM, GRU), recursive networks, convolutional networks, hybrid architectures, and attention models. We will look at specific papers in the literature, targeting sentiment analysis, text classification, and other tasks.
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
This material is an in-depth study report on the Recurrent Neural Network (RNN).
Material mainly from the Deep Learning book, http://www.deeplearningbook.org/
Topics: Briefing, Theory Proof, Variation, Gated RNN Intuition, Real World Application
Application (CNN+RNN on SVHN)
Also a video (In Chinese)
https://www.youtube.com/watch?v=p6xzPqRd46w
The slides for the equation derivation of recurrent neural networks (RNN), back-propagation through time, and Sequence-to-sequence (Seq2Seq) models in image/video captioning tasks. Used for a group paper reading at the University of Sydney.
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F... (inventy)
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
Common Fixed Point Theorems Using Random Implicit Iterative Schemes (inventy)
Digital Signal Processing [ECEG-3171] - Ch1_L07 (Rediet Moges)
This Digital Signal Processing lecture material is the property of the author (Rediet M.). It is not for publication, nor is it to be sold or reproduced.
In this presentation we describe the formulation of the HMM model as consisting of hidden states that generate the observables. We introduce the 3 basic problems: finding the probability of a sequence of observations given the model, the decoding problem of finding the hidden states given the observations and the model, and the training problem of determining the model parameters that generate the given observations. We discuss the Forward, Backward, Viterbi, and Forward-Backward algorithms.
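As a minimal sketch of the first of the three problems above, here is the Forward algorithm for computing P(observations | model); the transition matrix A, emission matrix B, and initial distribution pi below are toy values, not from the presentation.

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward algorithm: P(O | model) for an HMM with S states.
    A: (S, S) transitions, B: (S, O) emissions, pi: (S,) initial distribution,
    obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                  # total probability of the sequence

# Tiny example with 2 hidden states and 3 observation symbols.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward([0, 1, 2], A, B, pi))
```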
Statement of stochastic programming problems (SSA KPI)
AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 1.
More info at http://summerschool.ssa.org.ua
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non-causal methods while working directly on the raw waveform.
Index Terms: speech enhancement, speech denoising, neural networks, raw waveform
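Below is a minimal sketch (not the authors' model) of the general idea named in the abstract: an encoder-decoder on the raw waveform with a skip connection from encoder to decoder. Layer sizes, depths, causal padding, and the real model's recurrent and gated layers are omitted; all dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyWaveEnhancer(nn.Module):
    """Toy waveform encoder-decoder with one skip connection."""
    def __init__(self, channels=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(1, channels, 4, stride=4), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(channels, channels * 2, 4, stride=4), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose1d(channels * 2, channels, 4, stride=4), nn.ReLU())
        self.dec1 = nn.ConvTranspose1d(channels, 1, 4, stride=4)

    def forward(self, noisy):            # noisy: (batch, 1, time)
        e1 = self.enc1(noisy)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2) + e1           # skip connection from the encoder
        return self.dec1(d2)              # enhanced waveform estimate

x = torch.randn(2, 1, 16384)              # a batch of raw waveform segments
print(TinyWaveEnhancer()(x).shape)         # same shape as the input
```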
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Welcome to ViralQR, your best QR code generator (ViralQR)
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since our inception, we have successfully served many clients by offering QR codes for their marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and excellent features, which help businesses create QR codes.
Our Services
At ViralQR, we offer a comprehensive suite of services that caters to your needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform. From there, one can easily subscribe and experience the full power of dynamic QR codes. The subscription plans are priced flexibly so that virtually every business can afford to benefit from our service.
Why choose us?
ViralQR provides services for marketing, advertising, catering, retail, and similar industries. QR codes can be placed on flyers, packaging, merchandise, and banners, and can even substitute for cash and cards in a restaurant or coffee shop. By integrating QR codes into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
ViralQR subscribers receive detailed analytics and tracking tools that give a clear view of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
Thank you for choosing ViralQR; we offer nothing but the best QR code services to meet the needs of diverse businesses!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards more flexible and future-proof PHP development.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Accelerate your Kubernetes clusters with Varnish Caching
Pointing the Unknown Words
1. Pointing the Unknown Words (ACL 2016)
Caglar Gulcehre (Université de Montréal), Sungjin Ahn (Université de Montréal), Ramesh Nallapati (IBM T.J. Watson Research), Bowen Zhou (IBM T.J. Watson Research), Yoshua Bengio (Université de Montréal, CIFAR Senior Fellow)
21 Aug 2016

Abstract (as shown on the slide, truncated): The problem of rare and unknown words is an important issue that can potentially affect the performance of many NLP systems, including both the traditional count-based and the deep learning models. We propose a novel way to deal with the rare and unseen words for the neural network models using attention. Our model uses two softmax layers in order to predict the [...]

[Body text from the paper's first page, also shown on the slide:] ... softmax output layer where each of the output dimensions corresponds to a word in a predefined word shortlist. Because computing a high-dimensional softmax is computationally expensive, in practice the shortlist is limited to the top-K most frequent words in the training corpus. All other words are then replaced by a special word, called the unknown word (UNK). The shortlist approach has two fundamental problems. The first problem, which is known as the rare word problem, is that some of the words [...]
2. [Figure 1 from the paper]
French: Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne. (Copy / Copy / Copy)
Figure 1: An example of how copying can happen for machine translation. Common words that appear both in source and the target can directly be copied from input to source. The rest of the unknown in the target can be copied from the input after being translated with a dictionary.
3. [Shortlist softmax of the RNN decoder, illustrated with the example target sequence "killed a man yesterday . [eos]" generated word by word, together with the following equations from the paper:]

p_t = \mathrm{softmax}(W_{hp} h_t + b_{hp})   (16)
h_t = \overrightarrow{\mathrm{RNN}}_{t' \prec t}(x_{w_{t'}})   (17)
\mathrm{softmax}(s)_i = \frac{\exp(s_i)}{\sum_{s_j \in s} \exp(s_j)}   (18)

Here p_t is the distribution over the output vocabulary at step t, h_t is the decoder RNN state, and W_{hp} \in \mathbb{R}^{V \times N} and b_{hp} are the output weight matrix and bias, with V the vocabulary size and N the hidden dimension.
4. [The same shortlist softmax equations (16)-(18) repeated, with bullets contrasting the vocabulary size V and the source length T and introducing the Pointer Softmax.]
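To make equations (16)-(18) above concrete, here is a tiny numpy sketch of the shortlist (vocabulary) softmax with an UNK entry; W_hp, b_hp, h_t, and the shortlist itself are made-up placeholders, not values from the paper.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())               # eq. (18), numerically stabilized
    return e / e.sum()

K, N = 5, 8                                # shortlist size V = K, hidden size N
rng = np.random.default_rng(0)
W_hp, b_hp = rng.normal(size=(K, N)), np.zeros(K)
shortlist = ["<unk>", "the", "a", "man", "yesterday"]   # index 0 is UNK

h_t = rng.normal(size=N)                   # decoder hidden state, eq. (17)
p_t = softmax(W_hp @ h_t + b_hp)           # eq. (16): distribution over shortlist
print(shortlist[int(p_t.argmax())], np.round(p_t, 3))
```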
5. [Attention mechanism of Bahdanau et al. (ICLR 2015); the slide shows the decoder diagram with source words x1..xT, annotations h1..hT, attention weights αt,1..αt,T, and decoder states st-1, st, together with the following text and equations.]

Figure 1 (Bahdanau et al.): The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).

p(y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)   (2)
p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)   (3)
p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i),  with  s_i = f(s_{i-1}, y_{i-1}, c_i)   (4)
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j   (5)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)   (6)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. Unlike the existing encoder-decoder approach, the probability in (4) is conditioned on a distinct context vector c_i for each target word y_i. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence, and the context vector c_i is computed as a weighted sum of these annotations. The score e_{ij} measures how well the inputs around position j and the output at position i match, based on the decoder hidden state s_{i-1} (just before emitting y_i). (Here t = i.) [Bahdanau+15]

3 Neural Machine Translation Model with Attention: As the baseline neural machine translation system, we use the model proposed by (Bahdanau et al., 2014) that learns to (soft-)align and translate jointly. We refer to this model as NMT. The encoder of the NMT is a bidirectional RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sequence x = (x_1, ..., x_T) in the left-to-right direction, resulting in a sequence of hidden states (→h_1, ..., →h_T). The backward RNN reads x in the reversed direction and outputs (←h_1, ..., ←h_T). We then concatenate the hidden states of the forward and backward RNNs at each time step and obtain a sequence of annotation vectors (h_1, ..., h_T) where h_j = [→h_j || ←h_j]. Here, || denotes the concatenation operator. Thus, each annotation vector h_j encodes information about the j-th word with respect to all the other surrounding words.
c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention
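Below is a minimal numpy sketch of equations (4)-(6) above: computing the attention weights α_ij and the context vector c_i from the annotations h_j and the previous decoder state s_{i-1}. The scoring function a(.,.) here is a single-layer MLP stand-in for the alignment model; all sizes and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, d_h, d_s, d_a = 6, 4, 4, 8             # source length and toy dimensions
H = rng.normal(size=(Tx, d_h))             # annotations h_1..h_Tx (bi-RNN states)
s_prev = rng.normal(size=d_s)              # previous decoder state s_{i-1}

W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)

# e_ij = a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()    # eq. (6): attention weights
c_i = alpha @ H                                      # eq. (5): context vector
print(np.round(alpha, 3), np.round(c_i, 3))
```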
6. [The same attention figure and equations (2)-(6) from Bahdanau et al. as on the previous slide, this time with bullets highlighting the attention weights α.]
7. [The shortlist softmax equations (16)-(18) from the earlier slides repeated, with bullets contrasting the vocabulary size V and the source length T.]
8. [Figure 1 (the copying example) repeated:
French: Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne. (Copy / Copy / Copy)]
10. Table 4: Generated summaries from NMT with PS. Boldface words are the words copied from the source.
Source #1: china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1: china 's tang <unk>, sets world weightlifting record
NMT+PS #1: china 's tang gonghong wins women 's weightlifting weightlifting title at asian games
Source #2: owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
[The slide also shows the anonymized form of example #1: "<v1> 's <v2> <v3> set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday ." with target "<v1> 's <v2> <v3>, sets world weightlifting record".]
12. [Bullet: "gonghong" vs. <unk>]
The experimental results comparing the Pointer Softmax with the NMT model are displayed in Table 1 for the UNK pointers data and in Table 2 for the entity pointers data. As our experiments show, pointer softmax improves over the baseline NMT on both UNK data and entities data. Our hope was that the improvement would be larger for the entities data since the incidence of pointers was much greater. However, it turns out this is not the case, and we suspect the main reason is anonymization of entities, which removed data-sparsity by converting all entities to integer-ids that are shared across all documents. We believe that on de-anonymized data, our model could help more, since the issue of data-sparsity is more acute in this case.

Table 1: Results on Gigaword Corpus when pointers are used for UNKs in the training data, using Rouge-F1 as the evaluation metric.
                 Rouge-1   Rouge-2   Rouge-L
NMT + lvt         34.87     16.54     32.27
NMT + lvt + PS    35.19     16.66     32.51

[The Table 4 examples are repeated on this slide: #1 and #2 as above, with Target #2 "advertising : nbc retreats from liquor commercials" and NMT+PS #2 "nbc says it is ending a three-month-old experiment", plus Source #3 "a senior trade union official here wednesday called on ghana 's government to be " mindful of the plight " of the ordinary people in the country in its decisions on tax increases .", Target #3 "tuc official, on behalf of ordinary ghanaians", and NMT+PS #3 "ghana 's government urged to be mindful of the plight".]
13. Table 2: Results on anonymized Gigaword Corpus when pointers are used for entities, using Rouge-F1 as the evaluation metric.
                 Rouge-1   Rouge-2   Rouge-L
NMT + lvt         34.89     16.78     32.37
NMT + lvt + PS    35.11     16.76     32.55

[Table 1, the discussion of entity anonymization, and the Table 4 examples from the previous slides are shown again on this slide.]
14. [Summary bullets over Figure 1 (the copying example) repeated:
French: Guillaume et Cesar ont une voiture bleue a Lausanne.
English: Guillaume and Cesar have a blue car in Lausanne. (Copy / Copy / Copy)
Figure 1: An example of how copying can happen for machine translation. Common words that appear both in source and the target can directly be copied from input to source. The rest of the unknown in the target can be copied from the input after being translated with a dictionary.]
15. •
•
•
•
15
of the gradients exceed 1 (Pascanu et al., 2012).
Table 5: Europarl Dataset (EN-FR)
BLEU-4
NMT 20.19
NMT + PS 23.76
For each target word yt, we first check if the same word yt appears in the source sentence. If it does not, we then check whether a translated version of the word exists in the source sentence by using a look-up table between the source and the target language. If the word is in the source sentence, we then use the location of the word in the source as the target. Otherwise, we check if one of the English senses from the cross-language dictionary of the French word is in the source. If it is in the source sentence, then we use the location of that word as our translation. Otherwise, we just use the argmax of lt as the target. For the switching network dt, we observed that using a two-layered MLP with the noisy-tanh activation function (Gulcehre et al., 2016) and a residual connection from the lower layer (He et al., 2015) worked well.
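A minimal sketch of this target-construction heuristic, assuming tokenized source sentences, a bilingual look-up table, and a cross-language sense dictionary; all names here are illustrative rather than taken from the paper's code.

```python
def build_copy_target(y_t, source_tokens, lookup_table, sense_dict, l_t):
    """Choose the pointer target for one gold target word y_t.

    y_t           -- the gold target-language word
    source_tokens -- list of source-sentence tokens
    lookup_table  -- maps a target-language word to its source-language translation
    sense_dict    -- maps a target-language word to several source-language senses
    l_t           -- attention (location softmax) scores over source positions

    Returns the source position to point at.
    """
    # 1. The word itself appears in the source: point at it directly.
    if y_t in source_tokens:
        return source_tokens.index(y_t)

    # 2. A translated version of the word appears in the source.
    translated = lookup_table.get(y_t)
    if translated in source_tokens:
        return source_tokens.index(translated)

    # 3. One of the dictionary senses of the word appears in the source.
    for sense in sense_dict.get(y_t, []):
        if sense in source_tokens:
            return source_tokens.index(sense)

    # 4. Fall back to the most-attended source position (argmax of l_t).
    return max(range(len(source_tokens)), key=lambda i: l_t[i])
```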
In Table 5, we provide the results of NMT with the pointer softmax, and we observe about a 3.6 BLEU score improvement over our baseline.
Figure 4: A comparison of the validation learning curves.
Figure 2: A depiction of the neural machine translation architecture with attention. At each timestep, the model generates the attention distribution lt. We use lt and the encoder's hidden states to obtain the context ct. The decoder uses ct to predict a vector of probabilities for the words wt by using the vocabulary softmax.
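The sketch below illustrates the attention step the caption describes, assuming a Bahdanau-style additive scoring function (the model builds on Bahdanau et al., 2014); the array and parameter names are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(encoder_states, decoder_state, W_h, W_s, v):
    """One attention step as depicted in Figure 2.

    encoder_states -- (T, d_h) hidden states h_1..h_T of the BiRNN encoder
    decoder_state  -- (d_s,) previous decoder hidden state
    W_h, W_s, v    -- parameters of an additive (Bahdanau-style) scoring function

    Returns the attention distribution l_t over source positions and the
    context vector c_t.
    """
    # Score every source position against the current decoder state.
    scores = np.tanh(encoder_states @ W_h + decoder_state @ W_s) @ v  # (T,)
    l_t = softmax(scores)              # attention distribution l_t
    c_t = l_t @ encoder_states         # context c_t as a weighted sum of states
    return l_t, c_t
```

The decoder would then combine c_t with its own state to produce the vocabulary softmax over w_t.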
4 The Pointer Softmax
In this section, we introduce our method, called the pointer softmax (PS), to deal with rare and unknown words. The pointer softmax is applicable to many NLP tasks, because it addresses the limitations that unknown words pose for neural networks. It can be used in parallel with
other existing techniques such as the large vocabu-
lary trick (Jean et al., 2014). Our model learns two
key abilities jointly to make the pointing mech-
anism applicable in more general settings: (i) to
predict whether it is required to use the pointing mechanism or not at each time step, and (ii) to point to any location of the context sequence, whose length can vary widely across examples. To accomplish this, we introduce a switching network
to the model. The switching network, which is
a multilayer perceptron in our experiments, takes
the representation of the context sequence (similar
to the input annotation in NMT) and the previous
hidden state of the output RNN as its input. It out-
puts a binary variable zt which indicates whether
to use the shortlist softmax (when zt = 1) or the
location softmax (when zt = 0). Note that if the
word that is expected to be generated at each time-
step is neither in the shortlist nor in the context se-
quence, the switching network selects the shortlist
softmax, and then the shortlist softmax predicts
UNK. The details of the pointer softmax model can
be seen in Figure 3 as well.
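A minimal sketch of one output step of this mechanism, with the switching network collapsed to a single layer for brevity (the paper uses a multilayer perceptron); parameter names are illustrative assumptions, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_softmax_step(c_t, s_prev, l_t, shortlist_logits, W_c, W_s, b):
    """One output step of the pointer softmax (illustrative sketch).

    c_t              -- context vector summarizing the source (from attention)
    s_prev           -- previous hidden state of the output RNN
    l_t              -- location softmax (attention distribution over source positions)
    shortlist_logits -- logits of the shortlist vocabulary softmax
    W_c, W_s, b      -- parameters of the switching network (one layer here; the
                        paper uses a multilayer perceptron)

    Returns the probability of using the shortlist softmax and the two scaled
    output distributions.
    """
    # Switching network: decides between shortlist softmax and location softmax.
    p_shortlist = sigmoid(c_t @ W_c + s_prev @ W_s + b)   # p(z_t = 1)

    # Scale the two distributions by the switch probability; at generation time
    # one would take z_t from p_shortlist and use the corresponding distribution.
    word_dist = p_shortlist * softmax(shortlist_logits)    # generate from shortlist
    copy_dist = (1.0 - p_shortlist) * l_t                  # point at a source position
    return p_shortlist, word_dist, copy_dist
```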
Figure 3: A depiction of the Pointer Softmax (PS). The decoder chooses between the vocabulary softmax and the pointer distribution lt over the source sequence via the switch zt.
For the summarization task on the Gigaword dataset, the pointer softmax was able to improve the results even when it is used together with the large-vocabulary trick. In the case of neural machine translation, we observed that training with the pointer softmax also improved the convergence speed of the model. For French to English machine translation on the Europarl corpora, we observe that using the pointer softmax can also improve the training convergence of the model.
References
[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[Bengio and Senécal2008] Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Neural Networks, IEEE Transactions on, 19(4):713–722.
[Bordes et al.2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
[Cheng and Lapata2016] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
[Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Chung et al.2014] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
[Gulcehre et al.2016] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. arXiv preprint arXiv:1603.00391.
[Gutmann and Hyvärinen2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.
[He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
[Jean et al.2014] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
[Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Luong et al.2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
[Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246–252. Citeseer.
[Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
[Pascanu et al.2013] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
[Schuster and Paliwal1997] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.
[Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[Theano Development Team2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.
[Tomasello et al.2007] Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child development, 78(3):705–722.
[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682.
[Zeiler2012] Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
7 Acknowledgments
We would also like to thank the developers of Theano (http://deeplearning.net/software/theano/) for developing such a powerful tool.