Monotonic Multihead Attention, Ma, Xutai, et al. "Monotonic Multihead Attention." International Conference on Learning Representations. 2020. review by June-Woo Kim
Monotonic Multihead Attention review
1. Monotonic Multihead Attention
Presented by: June-Woo Kim
ABR lab, Department of Artificial Intelligence,
Kyungpook National University
2, Dec. 2020.
Xutai Ma et al.
Facebook, Johns Hopkins Univ.
ICLR 2020
3. Abstract
• This paper extends previous models for monotonic attention to the multi-head attention used in Transformers [1],
yielding “Monotonic Multi-head Attention”.
• The proposed method is a relatively straightforward extension of the previous Hard [2] and Infinite Lookback [3]
monotonic attention models.
• This paper achieves better latency-quality tradeoffs in simultaneous Machine Translation tasks in two language pairs.
• Also, this paper is a meaningful contribution to the task of simultaneous Machine Translation.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
4. Background: Simultaneous Translation
• Start translating before reading the whole source sentence.
• Applications
• Video Subtitle Translation.
• International Conferences.
• Personal Translation Assistant.
• Goal: We do not want to wait for the completion of the full sentence before we start translation!
5. Background: Simultaneous Translation
• Same grammatical structure case: simple!
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Spanish:
• Una estrategia republicana para contrarrestar la reelección de Obama. (SVO)
• S: subject, O: object, V: verb
Action English Spanish
Read A Republican strategy
Write Una estrategia republicana
Read to counter
Write para contrarrestar
Read the re-election of Obama
Write la reelección de Obama
6. Background: Simultaneous Translation
• Different grammatical structure case: more complex
• English:
• A republican strategy to counter the re-election of Obama. (SVO)
• Korean:
• Obama 재선을 대응하기 위한 공화당 전략. (SOV)
• S: subject, O: object, V: verb
Action English Korean
Read A Republican strategy
Write 공화당 전략
Read to counter
Read the re-election of Obama
Write 대응하다
Write Obama 재선
7. Background on the problem: Current Approaches
1. Fixed Policy – Weaker performance
• Wait-If-* [4] policy. (Rule-based)
• Wait-k [5] policy. (Use 𝑘 tokens)
• Incremental decoding [6]. (Incremental learning)
2. Reinforcement Learning – Less stable learning process.
• Markov chain [7] (Markov chain with RL)
• Make decisions on when to translate from the interaction [8]. (Used pre-trained offline model to teach agent)
• Continuous rewards policy [9]. (Rewards policy gradient for online alignments)
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
[6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation." Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
[7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine translation." Proceedings of the 2014 Conference on empirical methods in natural language processing
(EMNLP). 2014.
[8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint arXiv:1610.00388 (2016).
[9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
8. Background on the problem: Current Approaches
3. Monotonic Attention (MA) – State of the art.
• Hard Monotonic Attention [2]. (First introduction of concept of MA)
• Monotonic Chunkwise Attention (MoChA) [10]. (Let the model attend to a chunk of encoder states)
• Monotonic Infinite Lookback Attention (MILk) [3]. (Introduced infinite lookback to improve the quality)
• Let’s take a closer look at this mechanism!
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
9. Background: Sequence-to-sequence
• Recently, the “Sequence-to-sequence” (Seq2Seq) [11] framework has facilitated the use of RNNs on sequence
transduction problems such as machine translation and speech recognition.
• Encoder: input sequence is processed with some networks. (e.g., RNNs, CNNs, FC-layers, hybrid, etc.)
• Decoder: produce the target sequence with output of the encoder. (almost RNNs)
• This often results in the model having difficulty generalizing to sequences longer than those seen during training.
[11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
Figure reference: https://brunch.co.kr/@kakao-it/155
10. Background: Attention mechanisms in Seq2seq
• Attention mechanism in Seq2seq
• Encoder: produces a sequence of hidden states which correspond to entries in the input sequence.
• Decoder: is allowed to refer back to any of the encoder states as it produces its output.
• In Seq2seq with attention, the encoder produces a sequence of hidden states which correspond to entries in the input
sequence!
• There are several effective attention mechanisms for Seq2seq:
• Bahdanau attention [12]. (additive attention)
• Luong attention [13]. (dot-product attention)
• Multi-head attention [1]. (Scaled dot-product attention with multiple heads; each head performs scaled dot-product attention on its own projected subspace.)
[12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
11. Background: Soft attention
• The encoder RNN processes the given input sequence $x = \{x_1, \dots, x_T\}$ to produce a sequence of hidden states $h = \{h_1, \dots, h_T\}$.
• We refer to $h$ as the “memory” to emphasize its connection to memory-augmented neural networks.
• The decoder RNN produces an output sequence $y = \{y_1, \dots, y_U\}$, conditioned on the memory.
12. Background: Soft attention (additive attention)
$$e_{i,j} = a(s_{i-1}, h_j)$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}$$
$$c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$$
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
$$y_i = g(s_i, c_i)$$
• When computing 𝑦𝑖, a soft attention-based decoder uses a learnable nonlinear function 𝑎(∙) to produce a scalar value 𝑒𝑖,𝑗
for each entry ℎ𝑗 in the memory based on ℎ𝑗 and the decoder’s state at the previous timestep 𝑠𝑖−1.
• 𝑎(∙) is a single-layer neural network using a 𝑡𝑎𝑛ℎ nonlinearity, but other functions such as a simple dot product between 𝑠𝑖−1 and ℎ𝑗
have been used.
• 𝑐𝑖 is the weighted sum of ℎ.
• Decoder updates its state to 𝑠𝑖 based on 𝑠𝑖−1 and 𝑐𝑖 and produces 𝑦𝑖.
• $f(\cdot)$ is an RNN (one or more LSTM or GRU layers) and $g(\cdot)$ is a learnable nonlinear function which maps the decoder state to the output space.
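As a concrete illustration of the equations above, here is a minimal, hedged sketch of one decoder step of additive soft attention; the class and dimension names (AdditiveAttention, hidden_dim) are illustrative and not taken from the paper.

```python
# Minimal sketch of additive (Bahdanau-style) soft attention for one decoder step.
# Assumes encoder states and decoder state share the same hidden size.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects s_{i-1}
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects h_j
        self.v = nn.Linear(hidden_dim, 1, bias=False)             # energy readout a(.)

    def forward(self, s_prev, memory):
        # s_prev: (batch, hidden_dim) previous decoder state s_{i-1}
        # memory: (batch, T, hidden_dim) encoder states h_1..h_T
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(memory)))  # e_{i,j}
        alpha = torch.softmax(e.squeeze(-1), dim=-1)               # attention weights alpha_{i,:}
        c = torch.bmm(alpha.unsqueeze(1), memory).squeeze(1)       # context vector c_i
        return c, alpha

# usage: c_i, alpha_i = AdditiveAttention(512)(s_prev, memory)
```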
13. Problem definitions in Attention mechanism in Seq2seq
• Problem
• A common criticism of soft attention is that the model must perform a pass over the entire input sequence when producing
each element of the output sequence.
• This results in the decoding process having complexity 𝑂(𝑇𝑈), where 𝑇 and 𝑈 are the input and output sequence lengths
respectively.
• Furthermore, because the entire sequence must be processed prior to outputting any symbols, soft attention cannot be used
in “online” settings where output sequence elements are produced when the input has only been partially observed.
14. Background: Monotonic Attention (MA)
• Soft attention mechanisms perform a pass over the entire input sequence when producing each element in the
output sequence.
• Authors [2] proposed an end-to-end differentiable method for learning monotonic alignments which, at test time,
enables computing attention online and in linear time.
(Figure: conventional soft attention, e.g., additive or dot-product, vs. monotonic attention.)
[2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International Conference on Machine Learning. 2017.
15. Background: MA
• If, for some output timestep $i$, $z_{i,j} = 0$ for all $j \in \{t_{i-1}, \dots, T\}$ (no memory entry is selected), we can simply set $c_i$ to a vector of zeros.
$$e_{i,j} = a(s_{i-1}, h_j) \;\;\longrightarrow\;\; e_{i,j} = \mathrm{MonotonicEnergy}(s_{i-1}, h_j)$$
$$p_{i,j} = \sigma(e_{i,j})$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_{i,j})$$
• where $a(\cdot)$ is a learnable deterministic “energy function” and $\sigma(\cdot)$ is the logistic sigmoid function.
• Note that the above model only needs $h_k,\ k \in \{1, \dots, j\}$, when attending at position $j$, so attention can be computed online.
• The time complexity is $O(\max(T, U))$.
• Through this mechanism, MA explicitly processes the input sequence in a left-to-right order and makes a hard assignment of $c_i$ to one particular encoder state, denoted $h_{t_i}$.
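To make the test-time behavior concrete, a hedged sketch of one hard monotonic attention step follows: scan forward from the previously attended index and stop at the first position whose selection probability exceeds 0.5. Here energy_fn stands in for the learned MonotonicEnergy function; all names are illustrative.

```python
# Sketch of hard monotonic attention at inference: greedy left-to-right scan.
import numpy as np

def hard_monotonic_step(energy_fn, s_prev, memory, t_prev):
    """Return the context vector c_i and the new attention index t_i."""
    for j in range(t_prev, len(memory)):
        p = 1.0 / (1.0 + np.exp(-energy_fn(s_prev, memory[j])))  # p_{i,j} = sigmoid(e_{i,j})
        if p > 0.5:
            return memory[j], j                   # hard assignment: c_i = h_{t_i}
    return np.zeros_like(memory[0]), len(memory)  # nothing selected: zero context
```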
16. Background: MA
• This model cannot be trained with back-propagation because of the sampling step.
• Instead, during training the model attends to the expected value of $h_j$ (inspired by soft alignments) and tries to induce discreteness into $p_{i,j}$.
• $\alpha_{i,j}$ defines the probability that input timestep $j$ is attended to at output timestep $i$:
$$\alpha_{i,j} = P_i(h_j\ \text{used}) = P_i(h_j\ \text{used} \mid h_j\ \text{checked})\, P_i(h_j\ \text{checked})$$
$$P_i(h_j\ \text{used} \mid h_j\ \text{checked}) = p_{i,j}$$
$$P_i(h_j\ \text{checked}) = P_i(h_{j-1}\ \text{checked},\ h_{j-1}\ \text{not used}) + P_{i-1}(h_j\ \text{used})$$
17. Background: MA
• Stepwise probability at source step $j$, target step $i$:
$$p_{i,j} \begin{cases} \ge 0.5: & \text{write the } i\text{-th target token (attend to } h_j\text{)} \\ < 0.5: & \text{read the } (j+1)\text{-th source token (move one step forward)} \end{cases}$$
• Expected alignment (monotonic attention):
$$\alpha_{i,:} = p_{i,:} \cdot \mathrm{cumprod}(1 - p_{i,:}) \cdot \mathrm{cumsum}\!\left(\frac{\alpha_{i-1,:}}{\mathrm{cumprod}(1 - p_{i,:})}\right) \qquad (1)$$
where $\mathrm{cumprod}(x) = \left[1,\ x_1,\ x_1 x_2,\ \dots,\ \prod_{i=1}^{|x|-1} x_i\right]$ and $\mathrm{cumsum}(x) = \left[x_1,\ x_1 + x_2,\ \dots,\ \sum_{i=1}^{|x|} x_i\right]$.
• The resulting process computes at most $\max(T, U)$ terms $p_{i,j}$, i.e., it runs in linear time.
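A hedged NumPy sketch of Equation (1) follows, computing the expected alignment for one output step from the stepwise probabilities and the previous alignment; the epsilon clamp is an addition for numerical stability, and the names are illustrative.

```python
# Expected monotonic alignment alpha_i from p_i and alpha_{i-1} (Eq. 1).
import numpy as np

def expected_alignment(p_i, alpha_prev, eps=1e-10):
    # p_i, alpha_prev: (T,) arrays for one output step i
    one_minus_p = 1.0 - p_i
    # exclusive cumulative product: [1, (1-p_1), (1-p_1)(1-p_2), ...]
    excl_cumprod = np.cumprod(np.concatenate(([1.0], one_minus_p[:-1])))
    alpha_i = p_i * excl_cumprod * np.cumsum(alpha_prev / np.clip(excl_cumprod, eps, None))
    return alpha_i
```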
18. Problem definition: MA
• Problem definitions
• Although this MA achieves online linear time decoding, the decoder can only attend to one encoder state.
• This limitation can diminish translation quality as there may be insufficient information for reordering.
• Moreover, back-propagation is not available through the hard monotonic attention decisions.
19. Background: Monotonic Chunkwise Attention (MoChA)
• Hard monotonic alignment is too strong a constraint!
• Using only one vector $h_{t_i}$ as the context vector $c_i$ is too restrictive.
• A novel solution to this problem is Monotonic Chunkwise Attention (MoChA) [10].
• This allows the model to use soft attention over fixed-size chunks (e.g., size $w$) of memory ending at input time step $t_i$ for each output time step $i$:
$$u_{i,k} = \mathrm{ChunkEnergy}(s_{i-1}, h_k) = v^T \tanh(W_s s_{i-1} + W_h h_k + b)$$
$$c_i = \sum_{k=v}^{t_i} \frac{\exp(u_{i,k})}{\sum_{l=v}^{t_i} \exp(u_{i,l})}\, h_k, \qquad v = t_i - w + 1$$
[10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on Learning Representations. 2018.
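A hedged sketch of the chunkwise context above: soft attention over a window of $w$ encoder states ending at the selected index $t_i$; function and variable names are illustrative.

```python
# MoChA-style chunkwise context for one output step.
import numpy as np

def chunkwise_context(chunk_energy, memory, t_i, w):
    # chunk_energy: (T,) energies u_{i,k}; memory: (T, d) encoder states; t_i: selected index
    start = max(0, t_i - w + 1)                 # v = t_i - w + 1, clipped at the start
    u = chunk_energy[start:t_i + 1]
    weights = np.exp(u - u.max())
    weights /= weights.sum()                    # softmax restricted to the chunk
    return weights @ memory[start:t_i + 1]      # context vector c_i
```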
20. Problem definition: Two major limitations of the hard monotonic mechanism.
• Problem definitions in conventional methods.
• Not enough context in the context vector.
• Assumption of strict monotonicity in input and output alignments.
• RNN-based networks.
21. Proposed architecture: Monotonic Multihead Attention (MMA)
• The Transformer [1] architecture has recently become the state-of-the-art for machine translation [14].
• An important feature of the Transformer is the use of a separate multihead attention module at each layer.
• Thus, this paper proposes a new approach, Monotonic Multihead Attention (MMA), which combines the expressive
power of multihead attention and the low latency of monotonic attention.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
[14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
22. Related works: Transformer
• Given queries 𝑄, keys 𝐾 and values 𝑉, multihead attention Multihead(𝑄, 𝐾, 𝑉) is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O$$
$$\text{where } \mathrm{head}_h = \mathrm{Attention}(Q W_h^Q,\ K W_h^K,\ V W_h^V)$$
• The attention function is the scaled dot-product attention, defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
• Now we can see that multihead attention allows each decoder layer to have multiple heads, where each head can compute a different attention distribution.
• No RNNs or CNNs in this network.
• Parallel computation, so it is fast.
• Better performance on large datasets.
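A hedged sketch of the multi-head scaled dot-product attention defined above; the projection matrices are passed in explicitly and the shapes/names are illustrative.

```python
# Multi-head scaled dot-product attention (no masking, for illustration only).
import torch

def multihead_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    # Q: (B, T_q, d_model); K, V: (B, T_k, d_model); W_*: (d_model, d_model)
    B, T_q, d_model = Q.shape
    d_k = d_model // num_heads

    def project(x, W):
        # project, then split into heads: (B, H, T, d_k)
        return (x @ W).view(B, -1, num_heads, d_k).transpose(1, 2)

    q, k, v = project(Q, W_q), project(K, W_k), project(V, W_v)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5               # (B, H, T_q, T_k)
    attn = torch.softmax(scores, dim=-1)                        # one distribution per head
    out = (attn @ v).transpose(1, 2).reshape(B, T_q, d_model)   # concatenate the heads
    return out @ W_o
```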
23. Proposed architecture: MMA
• For a transformer with $L$ decoder layers and $H$ attention heads per layer, the paper defines the selection process of the $h$-th head of the encoder-decoder attention in the $l$-th decoder layer as:
$$e_{i,j}^{l,h} = \frac{m_j W_{l,h}^K \left(s_{i-1} W_{l,h}^Q\right)^T}{\sqrt{d_k}}$$
$$p_{i,j}^{l,h} = \mathrm{Sigmoid}\!\left(e_{i,j}^{l,h}\right)$$
$$z_{i,j}^{l,h} \sim \mathrm{Bernoulli}\!\left(p_{i,j}^{l,h}\right)$$
where $W_{l,h}^K$ and $W_{l,h}^Q$ are the input projection matrices, $m$ is the memory (encoder states), and $d_k$ is the dimension of each attention head.
24. Proposed architecture: MMA
• Independent stepwise selection probability for layer $l$, head $h$:
$$p_{i,j}^{l,h} \begin{cases} < 0.5: & \text{layer } l \text{, head } h \text{ moves one step forward (keeps reading)} \\ \ge 0.5: & \text{layer } l \text{, head } h \text{ stops reading} \end{cases}$$
• Inference algorithm
• A source token is read if the fastest head decides to read.
• A target token is written only once all the heads have finished reading.
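A hedged, pseudocode-style sketch of this read/write policy follows; p_fn stands in for the model's per-head stepwise probability, and everything here is illustrative rather than the paper's actual implementation.

```python
# MMA inference policy: every head keeps reading while p < 0.5 and new source
# tokens are read on demand; the target token is written once all heads stop.
def mma_read_write(p_fn, positions, num_read, num_source):
    finished = [False] * len(positions)
    while not all(finished):
        for h in range(len(positions)):
            if finished[h]:
                continue
            j = positions[h]
            if j >= num_source:
                finished[h] = True          # no source left: head must stop
            elif j >= num_read:
                num_read += 1               # READ: the fastest head needs a new token
            elif p_fn(h, j) >= 0.5:
                finished[h] = True          # head h selects position j and stops reading
            else:
                positions[h] = j + 1        # head h moves one step forward
    return positions, num_read              # all heads finished: WRITE the next target token
```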
25. Proposed architecture: MMA
• MMA-H
• Hard alignment.
• Potential for streaming.
• MMA-IL
• Infinite lookback.
• Good translation quality.
26. MMA-H
• For MMA-H(ard), this paper uses Equation 1 (slide 17) to calculate the expected alignment for each layer and each head, given $p_{i,j}^{l,h}$.
• Each attention head in MMA-H attends to one encoder state.
• However, there are multiple heads in each layer.
• Therefore, compared with the previous MA-based models, the MMA model is able to attend to different positions at once.
• Even with the hard alignment variant (MMA-H), this model is still able to preserve the history information by setting
heads to past states.
27. MMA-IL
• For MMA-IL, authors calculate the Softmax energy for each head as follows:
$$u_{i,j}^{l,h} = \mathrm{SoftEnergy} = \frac{m_j W_{l,h}^K \left(s_{i-1} W_{l,h}^Q\right)^T}{\sqrt{d_k}}$$
• This allows the decoder to access encoder states from the beginning of the source sequence.
• Each attention head in MMA-IL can attend to all previous encoder states.
• So it is slower than MMA-H, but its translation quality is better.
• MMA models use unidirectional encoders: the encoder self-attention can only attend to previous states, which is
also required for simultaneous translation.
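A hedged sketch contrasting MMA-IL's context with MoChA's fixed window: soft attention over all encoder states up to the selected index $t_i$; names are illustrative.

```python
# Infinite-lookback context: softmax over the whole prefix of the memory.
import numpy as np

def infinite_lookback_context(soft_energy, memory, t_i):
    # soft_energy: (T,) energies u_{i,j}; memory: (T, d) encoder states
    u = soft_energy[:t_i + 1]
    w = np.exp(u - u.max())
    w /= w.sum()
    return w @ memory[:t_i + 1]   # context built from h_1 .. h_{t_i}
```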
29. Compare MMA-H to MMA-IL
• MMA-H
• This model is faster than MMA-IL and better suited for streaming.
• MMA-IL
• Better translation quality.
• Thus, MMA-IL allows the model to leverage more information for translation, but MMA-H may be better suited for
streaming systems with stricter efficiency requirements!
30. Experiments
• Datasets
• IWSLT 2015 English-Vietnamese.
• WMT 2015 German-English.
• Latency Metrics
• Average Proportion (AP) [4].
• Average Lagging (AL) [5].
• Differentiable Average Lagging (DAL) [3].
• Quality Metrics
• BLEU score.
[3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint arXiv:1606.02012 (2016).
[5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 2019.
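For reference, a hedged sketch of Average Lagging (AL) as defined in the STACL paper [5]: the average number of source tokens the system lags behind an ideal wait-0 translator, computed up to the first target step whose read count covers the whole source. Function and variable names are illustrative.

```python
# Average Lagging for one sentence pair.
def average_lagging(g, src_len, tgt_len):
    # g[t-1]: number of source tokens read when writing target token t
    gamma = tgt_len / src_len
    tau = next(t for t, read in enumerate(g, start=1) if read >= src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```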
32. Results: Latency-quality for MILk and MMA on IWSLT15 En-Vi and WMT15 De-EN.
• The BLEU and latency scores on the test set are generated by setting a latency range and selecting the checkpoint with best
BLEU score on the validation set.
• Black dashed line indicates the unidirectional offline transformer model with greedy search.
• While MMA-IL tends to have a decrease in quality as the latency decreases, MMA-H has a small gain in quality as latency
decreases: a larger latency does not necessarily mean an increase in source information available to the model.
• In fact, the large latency is from the outlier attention heads, which skip the entire source sentence and point to the end of the
sentence.
33. Results
• Note that latency increases with the number of attention heads.
• With 6 layers, the best performance is reached with 16 heads.
34. Conclusion
Summary
• This paper proposed two variants of the monotonic multihead attention model for simultaneous machine translation.
• Introduced two new targeted loss terms for latency control.
• Achieved better latency-quality trade-offs than the previous state-of-the-art model.
35. Reference
• [1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
• [2] Raffel, Colin, et al. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." International
Conference on Machine Learning. 2017.
• [3] Arivazhagan, Naveen, et al. "Monotonic Infinite Lookback Attention for Simultaneous Machine Translation."
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
• [4] Cho, Kyunghyun, and Masha Esipova. "Can neural machine translation do simultaneous translation?." arXiv preprint
arXiv:1606.02012 (2016).
• [5] Ma, Mingbo, et al. "STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using
Prefix-to-Prefix Framework." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
2019.
• [6] Dalvi, Fahim, et al. "Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine
Translation." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
• [7] Grissom II, Alvin, et al. "Don’t until the final verb wait: Reinforcement learning for simultaneous machine
translation." Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). 2014.
• [8] Gu, Jiatao, et al. "Learning to translate in real-time with neural machine translation." arXiv preprint
arXiv:1610.00388 (2016).
36. Reference
• [9] Luo, Yuping, et al. "Learning online alignments with continuous rewards policy gradient." 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
• [10] Chiu, Chung-Cheng, and Colin Raffel. "Monotonic Chunkwise Attention." International Conference on
Learning Representations. 2018.
• [11] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
Advances in neural information processing systems. 2014.
• [12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to
align and translate." arXiv preprint arXiv:1409.0473 (2014).
• [13] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based
neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
• [14] Barrault, Loïc, et al. "Findings of the 2019 conference on machine translation (wmt19)." Proceedings of the
Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019.
37. Appendix: Conventional attention mechanism
• Bahdanau attention (additive attention)
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_{t-1}, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$\mathrm{Score}(s_{t-1}, h_j) = v^T \tanh(W_a s_{t-1} + U_a h_j)$$
where $s_t$ is the decoder hidden state, $c_t$ is the context vector, and $y_{t-1}$ is the input at the current time step.
38. Appendix: Conventional attention mechanism
• Luong attention (dot-product attention):
$$c_t = \sum_{j=1}^{T_x} a_{tj} h_j = H a_t$$
$$a_t = \mathrm{Softmax}\!\left(\left[\mathrm{Score}(s_t, h_j)\right]_{j=1}^{T_x}\right) \in \mathbb{R}^{T_x}$$
$$y_t = \mathrm{Softmax}(W_y \tilde{s}_t + b_y)$$
$$\tilde{s}_t = \tanh(W_{ss} s_t + W_{cs} c_t + b_s)$$
The differences from Bahdanau attention are:
- It uses the current state $s_t$ instead of $s_{t-1}$.
- It uses $c_t$ to form the attentional state $\tilde{s}_t$, from which the output $y_t$ is produced.
The computation path in Luong attention is simpler because the part that produces the output and the part that performs the recursive RNN operation can be separated.
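A hedged sketch of a Luong-style dot-product attention step using the current decoder state $s_t$, to contrast with the additive example earlier; names are illustrative.

```python
# Dot-product (Luong-style) attention for one decoder step.
import torch

def luong_dot_attention(s_t, memory):
    # s_t: (batch, d) current decoder state; memory: (batch, T, d) encoder states
    scores = torch.bmm(memory, s_t.unsqueeze(-1)).squeeze(-1)   # Score(s_t, h_j) = s_t . h_j
    a_t = torch.softmax(scores, dim=-1)
    c_t = torch.bmm(a_t.unsqueeze(1), memory).squeeze(1)        # context vector c_t
    return c_t, a_t
```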
Simultaneous translation is very useful in many applications.
Offline model: best performance with 3 layers and 2 heads per layer (6 heads in total).
MMA-H: improves with 1 layer and more heads.
MMA-IL: behaves similarly to the offline model; best configuration uses 6 layers (24 heads in total).
Best latency-quality performance: MMA-IL with 6 layers and 16 heads per layer (96 heads in total).