This material was prepared for a lab seminar on the "Transformer", which underlies recent NLP × deep learning research. I have tried to be accurate in citing the reference materials, but please point out any errors.
The Transformer is an established architecture in natural language processing that builds on self-attention within a deep learning framework.
This presentation was delivered under the mentorship of Mr. Mukunthan Tharmakulasingam (University of Surrey, UK), as a part of the ScholarX program from Sustainable Education Foundation.
GPT-2: Language Models are Unsupervised Multitask Learners (Young Seok Kim)
Review of the paper "Language Models are Unsupervised Multitask Learners" (GPT-2) by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in English, but the presentation is done in Korean)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (gohyunwoong)
This presentation reviews SotA NLP models, namely the Transformer and BERT. It also covers many related models: Word2Vec, ELMo, GPT, etc.
reference 1 : Kim Dong Ha (https://www.youtube.com/watch?v=xhY7m8QVKjo)
reference 2 : Raimi Karim (https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)
"Attention Is All You Need" Grazie a queste semplici parole, nel 2017 il Deep Learning ha subito un profondo cambiamento. I Transformers, inizialmente introdotti nel campo del Natural Language Processing, si sono recentemente dimostrati estremamente efficaci anche al di fuori di questo settore, ottenendo un enorme - e forse inaspettato - successo nel campo della Computer Vision. I Vision Transformers e moltissime delle sue varianti stanno ridefinendo oggi lo stato dell'arte su molti task di visione artificiale, dalla classificazione di immagini fino ai sistemi di visione per la guida autonoma. Ma cosa sono i Transformers? In che cosa consiste il meccanismo della self-attention che è alla base del loro funzionamento? Quali sono i suoi limiti? Saranno in grado di rimpiazzare le famose reti convoluzionali che hanno, a loro tempo, rivoluzionato la Computer Vision? In questo talk cercheremo di rispondere a tutte queste domande, offrendo un'ampia panoramica sulle idee fondanti, sulle architetture Transformer più utilizzate, e sulle applicazioni più promettenti.
A brief introduction to the attention mechanism and its application in neural machine translation, especially in the Transformer, where attention was used to remove RNNs from NMT entirely.
This Edureka Recurrent Neural Networks tutorial will help you understand why we need Recurrent Neural Networks (RNNs) and what exactly they are. It also explains a few issues with training a Recurrent Neural Network and how to overcome those challenges using LSTMs. The last section includes an LSTM use-case: predicting the next word in a sample short story.
Below are the topics covered in this tutorial:
1. Why Not Feedforward Networks?
2. What Are Recurrent Neural Networks?
3. Training A Recurrent Neural Network
4. Issues With Recurrent Neural Networks - Vanishing And Exploding Gradient
5. Long Short-Term Memory Networks (LSTMs)
6. LSTM Use-Case
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in the field of Natural Language Processing to enhance language translation, but they demonstrated astonishing results even outside language processing. In particular, they recently spread in the Computer Vision community, advancing the state-of-the-art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is worth 16x16 words
- Video Understanding using Transformers: the space + time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
Speaker: Davide Coccomini, Nicola Messina
Website: https://www.aicamp.ai/event/eventdetails/W2021101110
An introduction to the Transformer architecture and BERT (Suman Debnath)
The Transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, mostly used for natural language processing (NLP) tasks. Since its advent, it has replaced RNNs and LSTMs for various tasks. The Transformer was a major breakthrough in NLP and paved the way for revolutionary architectures such as BERT.
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 6000-6010. 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Attention Mechanism in Language Understanding and its Applications (Artifacia)
This is the presentation from our AI Meet March 2017 on Attention Mechanism in Language Understanding and its Applications.
You can join Artifacia AI Meet Bangalore Group: https://www.meetup.com/Artifacia-AI-Meet/
BERT: Bidirectional Encoder Representations from Transformers (Liangqun Lu)
BERT was developed by Google AI Language and released in October 2018. It has achieved the best performance on many NLP tasks, so if you are interested in NLP, studying BERT is a good way to go.
"Attention Is All You Need" Grazie a queste semplici parole, nel 2017 il Deep Learning ha subito un profondo cambiamento. I Transformers, inizialmente introdotti nel campo del Natural Language Processing, si sono recentemente dimostrati estremamente efficaci anche al di fuori di questo settore, ottenendo un enorme - e forse inaspettato - successo nel campo della Computer Vision. I Vision Transformers e moltissime delle sue varianti stanno ridefinendo oggi lo stato dell'arte su molti task di visione artificiale, dalla classificazione di immagini fino ai sistemi di visione per la guida autonoma. Ma cosa sono i Transformers? In che cosa consiste il meccanismo della self-attention che è alla base del loro funzionamento? Quali sono i suoi limiti? Saranno in grado di rimpiazzare le famose reti convoluzionali che hanno, a loro tempo, rivoluzionato la Computer Vision? In questo talk cercheremo di rispondere a tutte queste domande, offrendo un'ampia panoramica sulle idee fondanti, sulle architetture Transformer più utilizzate, e sulle applicazioni più promettenti.
Brief introduction on attention mechanism and its application in neural machine translation, especially in transformer, where attention was used to remove RNNs completely from NMT.
This Edureka Recurrent Neural Networks tutorial will help you in understanding why we need Recurrent Neural Networks (RNN) and what exactly it is. It also explains few issues with training a Recurrent Neural Network and how to overcome those challenges using LSTMs. The last section includes a use-case of LSTM to predict the next word using a sample short story
Below are the topics covered in this tutorial:
1. Why Not Feedforward Networks?
2. What Are Recurrent Neural Networks?
3. Training A Recurrent Neural Network
4. Issues With Recurrent Neural Networks - Vanishing And Exploding Gradient
5. Long Short-Term Memory Networks (LSTMs)
6. LSTM Use-Case
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in the field of Natural Language Processing to enhance language translation, but they demonstrated astonishing results even outside language processing. In particular, they recently spread in the Computer Vision community, advancing the state-of-the-art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is worth 16x16 words
- Video Understanding using Transformers: the space + time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
Speaker: Davide Coccomini, Nicola Messina
Website: https://www.aicamp.ai/event/eventdetails/W2021101110
An introduction to the Transformers architecture and BERTSuman Debnath
The transformer is one of the most popular state-of-the-art deep (SOTA) learning architectures that is mostly used for natural language processing (NLP) tasks. Ever since the advent of the transformer, it has replaced RNN and LSTM for various tasks. The transformer also created a major breakthrough in the field of NLP and also paved the way for new revolutionary architectures such as BERT.
Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 6000-6010. 2017.
The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms. We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEnglish-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previoussingle state-of-the-art with model by 0.7 BLEU, achieving a BLEU score of 41.1.
Attention Mechanism in Language Understanding and its ApplicationsArtifacia
This is the presentation from our AI Meet March 2017 on Attention Mechanism in Language Understanding and its Applications.
You can join Artifacia AI Meet Bangalore Group: https://www.meetup.com/Artifacia-AI-Meet/
BERT: Bidirectional Encoder Representations from TransformersLiangqun Lu
BERT was developed by Google AI Language and came out Oct. 2018. It has achieved the best performance in many NLP tasks. So if you are interested in NLP, studying BERT is a good way to go.
PR-317: MLP-Mixer: An all-MLP Architecture for Vision (Jinwon Lee)
Can CNNs survive in the field of computer vision?
Hello, this is the 317th paper review of the TensorFlow Korea paper-reading group PR-12.
This time I reviewed MLP-Mixer: An all-MLP Architecture for Vision from the Google Research Brain Team.
The attack from attention was already hard to fend off, and now comes an attack from MLPs (multi-layer perceptrons).
It performs image classification using only MLPs, with good accuracy and fast speed.
To briefly introduce the architecture: it replaces the self-attention part of ViT (Vision Transformer) with MLPs.
It uses two MLP blocks: one mixes information across patches (tokens), and the other mixes information within each patch (see the sketch below).
Although it uses MLPs, as the paper itself mentions, this part can also be viewed as a kind of convolution.
Still, it reduces the quadratic complexity that Transformer-based networks inevitably have down to linear,
and it is impressive that such a simple architecture, with almost none of convolution's inductive bias, achieves such good performance.
On the other hand, I think the downsides are that it requires a lot of data and that, as an inherent limitation of MLPs, it can only accept fixed-length inputs.
I hope this work becomes an occasion for MLPs to receive attention once again.
Similar works that appeared around the same time are also briefly introduced at the end.
Enjoy, and thank you!
Paper link: https://arxiv.org/abs/2105.01601
Video link: https://youtu.be/KQmZlxdnnuY
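The two-MLP-block structure described above (one token-mixing MLP across patches, one channel-mixing MLP within each patch) can be sketched as follows. This is a minimal reconstruction for illustration only: the layer sizes, the omission of LayerNorm, and all names are my own simplifications, not the authors' code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    return gelu(x @ w1) @ w2

def mixer_block(X, params):
    """X: (num_patches, channels). Token-mixing MLP over patches, then channel-mixing MLP."""
    X = X + mlp(X.T, *params["token"]).T      # mix across patches (acts on each channel's column)
    X = X + mlp(X, *params["channel"])        # mix within each patch (acts on each row)
    return X

rng = np.random.default_rng(0)
P, C, H = 16, 32, 64                          # patches, channels, hidden width
params = {
    "token":   (rng.normal(size=(P, H)) / np.sqrt(P), rng.normal(size=(H, P)) / np.sqrt(H)),
    "channel": (rng.normal(size=(C, H)) / np.sqrt(C), rng.normal(size=(H, C)) / np.sqrt(H)),
}
print(mixer_block(rng.normal(size=(P, C)), params).shape)   # (16, 32)
```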
Paper Study: Melding the data decision pipeline (ChenYiHuang5)
Melding the data decision pipeline: Decision-Focused Learning for Combinatorial Optimization from AAAI2019.
I derived the equations myself and obtained the same results as the two cited CMU papers [Donti et al. 2017, Amos et al. 2017] by applying the same derivation procedure.
Recurrent Neural Networks, from the viewpoint of dynamical systems and state machines (GAYO3)
A general explanation of various recurrent frameworks and the intuitions behind them.
Part 1: Focus on sampling series from continuous time.
Part 2: Explain the connection between state machines and language, plus some ideas from NLP.
Deep learning (also known as deep structured learning or hierarchical learning) is the application of artificial neural networks (ANNs) with more than one hidden layer to learning tasks. Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised, or unsupervised.
Brief History of Visual Representation Learning (Sangwoo Mo)
- [2012-2015] Evolution of deep learning architectures
- [2016-2019] Learning paradigms for diverse tasks
- [2020-current] Scaling laws and foundation models
Learning Visual Representations from Uncurated Data (Sangwoo Mo)
Slides for the defense of my Ph.D. dissertation: "Learning Visual Representations from Uncurated Data"
It includes four papers on:
- Learning from multi-object images for contrastive learning [1] and Vision Transformer (ViT) [2]
- Learning with limited labels (semi-sup) for image classification [3] and vision-language [4] models
[1] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21.
[2] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW’22.
[3] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23.
[4] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review.
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model... (Sangwoo Mo)
This lab seminar introduces three recent works by Ting Chen:
- Pix2seq: A Language Modeling Framework for Object Detection (ICLR’22)
- A Unified Sequence Interface for Vision Tasks (NeurIPS’22)
- A Generalist Framework for Panoptic Segmentation of Images and Videos (submitted to ICLR’23)
Lab seminar on
- Sharpness-Aware Minimization for Efficiently Improving Generalization (ICLR 2021)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations (under review)
5. Self-attention with O(L²) complexity
• For a sequence of length L, the self-attention module converts a feature X ∈ ℝ^{L×d} to another feature Y ∈ ℝ^{L×d}
[Figure (image from Synthesizer paper): X: L×d → linear layers → Q: L×d_k, K: L×d_k, V: L×d_v → attention A: L×L → head outputs Y_i: L×d_v → concat the Y_i's → linear layer → Y: L×d]
6. Self-attention with O(L²) complexity
• For a sequence of length L, the self-attention module converts a feature X ∈ ℝ^{L×d} to another feature Y ∈ ℝ^{L×d}
• Compute query, key, value (Q, K, V)
• Query and key/value can come from non-identical inputs, e.g., for an encoder-decoder, the query is a decoder feature and the key/value are encoder features
[Figure (image from Synthesizer paper): same diagram as slide 5]
7. Self-attention with O(L²) complexity
• For a sequence of length L, the self-attention module converts a feature X ∈ ℝ^{L×d} to another feature Y ∈ ℝ^{L×d}
• Compute query, key, value (Q, K, V)
• Dot-product attention is defined as Y_i := softmax(QKᵀ / √d_k) V
[Figure (image from Synthesizer paper): same diagram as slide 5]
8. Self-attention with O(L²) complexity
• For a sequence of length L, the self-attention module converts a feature X ∈ ℝ^{L×d} to another feature Y ∈ ℝ^{L×d}
• Compute query, key, value (Q, K, V)
• Dot-product attention is defined as Y_i := softmax(QKᵀ / √d_k) V
• Do this h times in parallel (multi-head attention), concatenate the Y_i's, and apply a final linear layer to get Y (a minimal code sketch of this computation follows below)
[Figure (image from Synthesizer paper): same diagram as slide 5]
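As a reference point for the O(L²) cost, here is a minimal NumPy sketch of the multi-head dot-product attention described on slides 5-8. The shapes follow the slides (L, d, d_k, d_v, h); fixed random matrices stand in for the learned linear layers, so this is only an illustration of the computation, not any particular library's implementation.

```python
import numpy as np

def multi_head_self_attention(X, h=4, d_k=16, d_v=16, seed=0):
    """Vanilla multi-head self-attention; time and memory scale as O(L^2)."""
    L, d = X.shape
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        # Random projections stand in for the learned linear layers.
        W_q, W_k = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
        W_v = rng.normal(size=(d, d_v))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v            # L x d_k, L x d_k, L x d_v
        scores = Q @ K.T / np.sqrt(d_k)                # A: L x L  <- the quadratic term
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
        heads.append(A @ V)                            # Y_i: L x d_v
    Y = np.concatenate(heads, axis=-1) @ rng.normal(size=(h * d_v, d))
    return Y                                           # Y: L x d

X = np.random.randn(128, 64)                           # L = 128, d = 64
print(multi_head_self_attention(X).shape)              # (128, 64)
```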
12. Towards Sparse Transformers
• There are 3 major approaches to reduce the attention complexity:
1. Forget old memories and focus on new information
2. Restrict the sparsity pattern to look at a limited window (see the mask sketch below)
3. Learn the sparsity pattern using extra components
• Adaptive Span Transformer (ACL 2019) - binary mask
• Reformer (ICLR 2020) - locality-sensitive hashing
• Routing Transformer (arXiv 2020) - k-means clustering
• BP-Transformer (arXiv 2019) - bipartite partitioning
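As a small illustration of approach 2 (restricting attention to a limited window), the sketch below builds a sliding-window mask and applies it before the softmax. The window size and function names are my own choices for illustration, not taken from any of the papers listed above.

```python
import numpy as np

def local_window_mask(L, window=4):
    """Boolean L x L mask: position i may attend to j only if |i - j| <= window."""
    idx = np.arange(L)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # disallowed positions get -inf
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

L, d = 16, 8
Q = K = V = np.random.randn(L, d)
Y = masked_attention(Q, K, V, local_window_mask(L, window=2))
print(Y.shape)                                         # (16, 8); each row attends to at most 5 keys
```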
15. LSH attention with O(L log L) complexity
• Since query and key are computed from the same input in self-attention, the authors set Q = K (shared-QK attention)
• This additional constraint does not degrade the performance
• Thanks to the symmetry, one can define a similarity between indices
16. LSH attention with O(L log L) complexity
• Idea: For each query q_i, consider only the closest subset of keys
• Since the softmax is dominated by the largest elements, this may be sufficient
• To find the nearest neighbors, the authors use locality-sensitive hashing (LSH)
• The hash function h maps similar vectors x to the same bucket h(x) ∈ {0, …, b − 1}
• The vectors should be evenly distributed, i.e., the bucket sizes should be similar
• Define h(x) = arg max([xR; −xR]) for a (fixed) random matrix R ∈ ℝ^{d_k × b/2}
Andoni et al. Practical and optimal LSH for angular distance. NeurIPS 2015.
17. LSH attention with O(L log L) complexity
• Sort the tokens by bucket (O(L log L)) and compute attention only with keys in the same bucket
• Since the buckets may not be evenly sized, chunk the sorted sequence into chunks of fixed size
• Then the cost is governed by chunk_size rather than max_bucket_size (a small sketch of the bucketing and chunking follows below)
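The sketch below illustrates the hashing-and-chunking idea from slides 16-17, assuming the angular LSH scheme h(x) = argmax([xR; −xR]) and shared-QK attention. It omits details of the actual Reformer (attending to neighboring chunks, multiple hash rounds, masking the token itself), and the bucket count, chunk size, and names are my own choices.

```python
import numpy as np

def lsh_buckets(X, n_buckets=8, seed=0):
    """Angular LSH: h(x) = argmax([xR; -xR]) with R in R^{d x n_buckets/2}."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_buckets // 2))
    proj = X @ R                                       # L x n_buckets/2
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_attention(X, V, n_buckets=8, chunk=16):
    """Shared-QK attention restricted to fixed-size chunks of the bucket-sorted sequence."""
    L, d = X.shape
    order = np.argsort(lsh_buckets(X, n_buckets))      # sort tokens by bucket: O(L log L)
    Y = np.zeros_like(V)
    for start in range(0, L, chunk):                   # attention only inside each chunk
        idx = order[start:start + chunk]
        Q = K = X[idx]                                 # shared QK (Q = K)
        scores = Q @ K.T / np.sqrt(d)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)
        Y[idx] = A @ V[idx]
    return Y

X = np.random.randn(64, 32); V = np.random.randn(64, 32)
print(lsh_attention(X, V).shape)                       # (64, 32)
```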
21. Low-rank approx. with O(L) complexity
• For Q, K ∈ ℝ^{L×d} with d ≪ L, the attention A = softmax(QKᵀ) ∈ ℝ^{L×L} is approximately low-rank
• Note that Ā := QKᵀ has rank at most d, but A does not, due to the non-linearity of the softmax
• Instead, one may apply a random projection (Johnson-Lindenstrauss, or JL, lemma): P Rᵀ R wᵀ ≈ P wᵀ for a Gaussian matrix R ∈ ℝ^{k×L} with k = Ω(log L) (a numerical check follows below)
• Experiments show that A is approximately low-rank
• e.g., L = 512 and d = 128; the rank of A is not exactly 128, but A is still approximately low-rank
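The JL claim above is an additive guarantee: each entry of P Rᵀ R wᵀ matches the corresponding entry of P wᵀ up to roughly ε·‖P_i‖·‖w‖, with the distortion ε shrinking as k grows. A toy NumPy check of that distortion, with a random row-stochastic P standing in for an attention matrix and sizes chosen arbitrarily, might look like this.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 512
# Toy row-stochastic "attention" matrix P (L x L) and value vector w (1 x L).
P = rng.random((L, L)); P /= P.sum(axis=1, keepdims=True)
w = rng.normal(size=(1, L))

exact = (P @ w.T).ravel()                              # entries P_i . w
scale = np.linalg.norm(P, axis=1) * np.linalg.norm(w)  # JL error scale ||P_i|| ||w||
for k in (16, 64, 256):
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(k, L))   # E[R^T R] = I_L
    approx = (P @ R.T @ (R @ w.T)).ravel()             # entries (R P_i^T) . (R w^T)
    eps = np.max(np.abs(approx - exact) / scale)       # worst-case distortion over rows
    print(f"k={k:4d}  max JL distortion = {eps:.3f}")  # shrinks roughly like sqrt(log L / k)
```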
22. Low-rank approx. with O(L) complexity
• For Q, K ∈ ℝ^{L×d} with d ≪ L, the attention A = softmax(QKᵀ) ∈ ℝ^{L×L} is approximately low-rank
• Note that Ā := QKᵀ has rank at most d, but A does not, due to the non-linearity of the softmax
• Instead, one may apply a random projection (JL lemma): P Rᵀ R wᵀ ≈ P wᵀ for a Gaussian matrix R ∈ ℝ^{k×L} with k = Ω(log L)
• There are two challenges in naively applying a low-rank approximation to A:
1. How to reduce k (the JL bound k = Ω(log L) still grows with L)?
2. How to get a low-rank A_low ≈ A ∈ ℝ^{L×L}, e.g., without a costly SVD?
• Contribution:
1. Using the property rank(Ā) = d, the authors reduce k to Θ(log d)
2. Instead of an SVD, the authors compute Y_i from a reduced attention matrix in ℝ^{L×k} and reduced values in ℝ^{k×d_v}
24. Low-rank approx. with O(L) complexity
• Apply projections E, F ∈ ℝ^{L×k} to K and V, respectively; the attention is now given by
Y_i := softmax(Q(KᵀE) / √d_k) (FᵀV)
• Applying the JL lemma to a submatrix of size Θ(d) instead of the original matrix of size O(L), one can approximate the output with k = Θ(log d)
• In practice, the authors learn E, F instead of using random projections (but share the parameters) (a code sketch of this head follows below)
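Below is a minimal NumPy sketch of the projected attention head written above: keys and values are projected down to k rows before the softmax, so the score matrix is L×k instead of L×L. Random matrices stand in for the learned projections E and F, so treat this as an illustration of the formula rather than the paper's implementation.

```python
import numpy as np

def low_rank_attention_head(Q, K, V, k=32, seed=0):
    """Linformer-style head: softmax(Q (K^T E) / sqrt(d_k)) (F^T V); cost O(L k)."""
    rng = np.random.default_rng(seed)
    L, d_k = Q.shape
    E = rng.normal(scale=1.0 / np.sqrt(k), size=(L, k))   # stands in for the learned E
    F = rng.normal(scale=1.0 / np.sqrt(k), size=(L, k))   # stands in for the learned F
    scores = Q @ (K.T @ E) / np.sqrt(d_k)                 # L x k  (instead of L x L)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ (F.T @ V)                                   # L x d_v

L, d = 256, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
print(low_rank_attention_head(Q, K, V).shape)              # (256, 64)
```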
29. Transformer without self-attention
• Surprisingly, this synthesized attention shows comparable results on many NLP tasks
• It works well for machine translation, language modeling, and text generation
• However, it does not work well for natural language inference (NLI)
• Remark: This is because the attention maps for the former tasks are aligned (i.e., diagonal-like), while NLI needs a more complex attention structure
33. Universal approx. for Transformers
• Definition. Let 𝒯^{h,m,r} be the family of Transformers without positional encoding (PE) that have h heads of size m each and a feed-forward layer with r hidden nodes
• Definition. Let 𝒯_P^{h,m,r} be the family of Transformers with PE such that 𝒯_P^{h,m,r} := {g_P(X) = g(X + E) ∣ g ∈ 𝒯^{h,m,r}, E ∈ ℝ^{d×L}}
• Theorem 1. A Transformer without PE, specifically g ∈ 𝒯^{2,1,4}, can approximate any permutation equivariant function f ∈ ℱ_PE
• Theorem 2. A Transformer with PE, specifically g_P ∈ 𝒯_P^{2,1,4}, can approximate any continuous seq2seq function (on a compact domain) f ∈ ℱ_CD
• Remark: This is nontrivial since self-attention is pair-wise and shared among layers
34. Universal approx. for Transformers
• Theorem 1. A Transformer without positional encoding (PE), specifically g ∈ 𝒯^{2,1,4}, can approximate any permutation equivariant function f ∈ ℱ_PE
• Proof sketch:
1. Approximate f ∈ ℱ_PE with a piece-wise constant function f̄ ∈ ℱ̄_PE
• Classical result in analysis
2. Approximate f̄ ∈ ℱ̄_PE with a modified Transformer ḡ ∈ 𝒯̄^{2,1,1}, where
• softmax → max, and ReLU → piece-wise linear activation φ with ≤ 3 pieces
3. Approximate the modified Transformer ḡ ∈ 𝒯̄^{2,1,1} with an original Transformer g ∈ 𝒯^{2,1,4}
• Approximate φ with 4 ReLUs (hence 𝒯̄^{2,1,1} → 𝒯^{2,1,4})
(Main contribution: step 2)
35. Universal approx. for Transformers
• Lemma 1.1. Approximate f̄ ∈ ℱ̄_PE with a modified Transformer ḡ ∈ 𝒯̄^{2,1,1}
• softmax → max, and ReLU → piece-wise linear activation φ with ≤ 3 pieces
• Proof sketch:
1. Convert the input X to a quantized set 𝑳 with a series of feed-forward layers
• The "piece-wise linear activation φ with ≤ 3 pieces" condition is used here
2. Convert 𝑳 to a distinct embedding q(𝑳) with a series of self-attention layers
• The "max operation" condition is used here
3. Convert q(𝑳) to the desired output of f̄ with a series of feed-forward layers
(Main contribution: step 2)
36. Universal approx. for Transformers
• Lemma 1.1. Approximate f̄ ∈ ℱ̄_PE with a modified Transformer ḡ ∈ 𝒯̄^{2,1,1}
• Lemma 1.2. Convert 𝑳 to a distinct embedding q(𝑳) with a series of self-attention layers
• Definition. A mapping q: 𝕃 ⊂ ℝ^{d×L} → ℝ^{1×L} is a contextual embedding if it satisfies
1. For any 𝑳 ∈ 𝕃, all L entries of q(𝑳) are distinct
2. For any 𝑳 ≠ 𝑳′ ∈ 𝕃, all L entries of q(𝑳) and q(𝑳′) are distinct
• Namely, the contextual embedding maps all sets/entries to distinct values
37. Universal approx. for Transformers
• Lemma 1.1. Approximate f̄ ∈ ℱ̄_PE with a modified Transformer ḡ ∈ 𝒯̄^{2,1,1}
• Lemma 1.2. Convert 𝑳 to a distinct embedding q(𝑳) with a series of self-attention layers
• Proof sketch:
• Using two attention heads of size 1, one can implement the selective shift operation, which shifts the entries in a specific interval while leaving all others intact
• Recall: ḡ is a modified Transformer using the max operation and the φ activation
• Concretely, the attention layer is given by Z ↦ Z + Ψ(Z; b, b′), where Ψ is the selective shift operation (formula shown on the slide)
• Stacking this operation, one can construct the contextual embedding q
38. Universal approx. for Transformers
• Theorem 2. A Transformer with PE, specifically g_P ∈ 𝒯_P^{2,1,4}, can approximate any continuous seq2seq function (on a compact domain) f ∈ ℱ_CD
• Proof sketch:
• For X ∈ [0,1]^{d×L}, define the positional encoding E as shown on the slide
• Then the columns are monotonically increasing in every row
• Following steps similar to Theorem 1, one can express any continuous seq2seq function
39. Universal approx. for sparse Transformers
• Definition. Let {𝒜_k^l} be the sparsity pattern of the k-th token for l ∈ [p] := {1, 2, …, p}
• Dense Transformer: p = 1 and 𝒜_k^1 = [n] for all k ∈ [n]
• Theorem 3. If the sparsity pattern satisfies the conditions listed on the slide, the sparse Transformer can approximate any continuous seq2seq function (on a compact domain)
• Proof sketch:
• Due to the assumption, every index becomes connected to every other index as the layers go deeper
40. Universal approx. for sparse Transformers
• Definition. Let {𝒜_k^l} be the sparsity pattern of the k-th token for l ∈ [p] := {1, 2, …, p}
• Theorem 3. If the sparsity pattern satisfies the conditions listed on the slide, the sparse Transformer can approximate any continuous seq2seq function (on a compact domain)
• In particular, the following architectures satisfy the condition:
• Sparse Transformer - O(L^{3/2}) connections
• Star-Transformer - O(L) connections
• Longformer - O(L) connections
41. Discussion
• Linformer reduces the complexity of self-attention from O(L²) to O(L)
• However, several questions remain:
1. Empirical performance
• While Linformer has the best provable complexity, other architectures (e.g., Reformer, or methods without provable guarantees) may show better performance, especially on problems with moderately long sequences
• We may need an extensive comparison of the numerous Transformer architectures
2. Expressive power
• It is unclear whether Reformer and Linformer are as expressive as the dense Transformer
• It is hard to apply Yun et al. since these models do not assume a fixed sparsity pattern