Transformers for Time Series
Is the new State of the Art (SOTA) approaching?
#ossummit @eze_lanza
AI Open Source Evangelist, Intel
#ossummit
Agenda
– Transformers 101
– Time Series + Transformers
– Informer & Spacetimeformer
– Use case
– Conclusions
Transformers 101
#ossummit
Transformers Family Tree
https://arxiv.org/abs/2302.07730v2
#ossummit
Stable Diffusion
Transformer
#ossummit
What is a Transformer? 1/4
• Vanilla transformer (Language Translation)
– 1. Embedding + Pos encoding
– 2. Encoder: Multi-Head SELF-ATTENTION
– 3. Decoder: Multi-Head SELF-ATTENTION
https://arxiv.org/abs/1706.03762 “Attention is all you need”
#ossummit
1- Embedding + Pos Encoding (1/4)
Convert text into vectors (word2vec)
[Figure: the phrase "I love dogs" as word vectors (word2vec), each summed (+) with a positional vector.]
Message: words are converted to vectors, and these vectors carry information about their position and the relationships between them (word2vec).
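To make the figure concrete, here is a minimal sketch (not from the talk) of the two ingredients: toy word vectors standing in for word2vec embeddings, summed element-wise with the fixed sinusoidal positional encoding from the original paper. All values are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos positional encoding from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d_model)

# Toy vectors standing in for word2vec embeddings of "I love dogs"
tokens = ["I", "love", "dogs"]
word_vectors = np.random.rand(len(tokens), 8)               # (3, 8), illustrative only

# The transformer input is the element-wise sum: embedding + positional vector
model_input = word_vectors + sinusoidal_positional_encoding(len(tokens), 8)
print(model_input.shape)                                    # (3, 8)
```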
#ossummit
What is a transformer? 2/4 (Self-Attention) ENCODER
https://jalammar.github.io/illustrated-transformer/
Message: embedded vectors are converted to new vectors with self-attention information incorporated.
Self-Attention : What part of the input should I focus on?
Children playing in the park
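A compact sketch of scaled dot-product self-attention over that example phrase. The projection matrices are random stand-ins for learned weights; only the mechanics (Q, K, V, softmax over word-to-word scores) are the point.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: each output row mixes all input rows."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                  # (L, L) word-to-word scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V, weights

L, d = 5, 8                                                  # "Children playing in the park"
X = np.random.rand(L, d)                                     # embedded + position-encoded tokens
W_q, W_k, W_v = (np.random.rand(d, d) for _ in range(3))     # stand-ins for trained weights
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn[0].round(2))  # how much "Children" attends to each of the 5 words
```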
#ossummit
What is a transformer? 3/4 (Self Attention) ENCODER
https://jalammar.github.io/illustrated-transformer/
Message: multi-head attention computes several attention patterns that are combined to capture the relationship of each word with the others (a word may well be referring to two different words).
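A rough sketch of the multi-head idea: several heads attend independently, each capturing one kind of relationship, and their outputs are concatenated and projected back to a single vector per word. Head count and sizes are illustrative.

```python
import numpy as np

def multi_head_attention(X, n_heads, d_model):
    """Each head attends independently; outputs are concatenated and re-projected."""
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        W_q, W_k, W_v = (np.random.rand(d_model, d_head) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        heads.append(w @ V)                      # one "relationship" per head
    W_o = np.random.rand(d_model, d_model)       # output projection back to d_model
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.rand(5, 16)
print(multi_head_attention(X, n_heads=4, d_model=16).shape)  # (5, 16)
```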
#ossummit
What is a transformer? 4/4 DECODER
https://jalammar.github.io/illustrated-transformer/
Message: the encoder runs once, while the decoder runs repeatedly until it generates the EOS token.
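To make "the decoder runs multiple times" concrete, here is a generic greedy decoding loop. The `decoder_step` callable is hypothetical; it stands in for the full decoder plus output softmax.

```python
import numpy as np

def greedy_decode(decoder_step, encoder_output, bos_id, eos_id, max_len=50):
    """Re-run the decoder one token at a time until it emits EOS."""
    generated = [bos_id]
    for _ in range(max_len):
        scores = decoder_step(encoder_output, generated)  # decoder re-runs every step
        next_id = int(scores.argmax())                    # pick the most likely token
        generated.append(next_id)
        if next_id == eos_id:                             # stop once EOS is produced
            break
    return generated

# Toy stand-in decoder: emits tokens 1, 2, 3 in turn, where 3 plays the role of EOS
dummy_step = lambda enc, ids: np.eye(4)[min(len(ids), 3)]
print(greedy_decode(dummy_step, encoder_output=None, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```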
How can it be used for time series, then?
#ossummit
Time series – LSTF (long sequence time-series forecasting)
A time series is a succession of data points measured at specific moments and ordered chronologically.
https://www.researchgate.net/figure/The-time-series-plot-of-Bitcoin-prices_fig1_333470720
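Forecasting is usually framed as mapping a context window of past points to a target window of future points. A minimal sketch of that windowing (window lengths are arbitrary):

```python
import numpy as np

def make_windows(series, context_len, pred_len):
    """Slice a 1-D series into (context, target) pairs for supervised forecasting."""
    X, y = [], []
    for start in range(len(series) - context_len - pred_len + 1):
        X.append(series[start:start + context_len])                           # past values
        y.append(series[start + context_len:start + context_len + pred_len])  # future values
    return np.array(X), np.array(y)

prices = np.cumsum(np.random.randn(500))        # stand-in for e.g. Bitcoin prices
X, y = make_windows(prices, context_len=36, pred_len=12)
print(X.shape, y.shape)                         # (453, 36) (453, 12)
```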
#ossummit
Is it like language?
https://spacy.io/usage/rule-based-matching
#ossummit
Classic methods
• ARIMA (analytical approach)
– Linear dependencies!
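For reference, a classic ARIMA fit with statsmodels. The (p, d, q) order below is arbitrary; in practice it would come from ACF/PACF analysis or an information criterion.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.randn(300))        # toy non-stationary series

# ARIMA models only linear dependencies on past values and past errors
model = ARIMA(series, order=(2, 1, 1))          # (p, d, q) chosen arbitrarily here
fitted = model.fit()
forecast = fitted.forecast(steps=12)            # predict the next 12 points
print(forecast)
```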
#ossummit
Neural networks approach
• Feed-forward (6 lags) • RNN
• LSTM
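A minimal PyTorch LSTM forecaster along these lines; the hidden size and single-layer choice are illustrative, not taken from the talk.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Encode the context window with an LSTM, predict pred_len future values."""
    def __init__(self, n_features=1, hidden=64, pred_len=12):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pred_len)

    def forward(self, x):                  # x: (batch, context_len, n_features)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden), last hidden state
        return self.head(h_n[-1])          # (batch, pred_len)

model = LSTMForecaster()
dummy = torch.randn(8, 36, 1)              # batch of 8 context windows of length 36
print(model(dummy).shape)                  # torch.Size([8, 12])
```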
#ossummit
Seq2seq
#ossummit
TRANSFORMERS – How do they help?
They capture both long- and short-term dependencies thanks to multi-head attention (different heads can focus on short-term and long-term patterns).
https://proceedings.neurips.cc/paper/2019/file/6775a0635c302542da2c32aa19d86be0-Paper.pdf
#ossummit
TRANSFORMERS-ISSUES
• Quadratic complexity: O(N²), where N is the input time series length; this is a computational bottleneck
https://open4tech.com/time-complexity-of-algorithms/
Full self-attention
https://proceedings.neurips.cc/paper/2019/file/6775a0635c302542da2c32aa19d86be0-Paper.pdf
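The bottleneck is easy to see: full self-attention materializes an N×N score matrix, so memory and compute grow quadratically with the series length. A quick illustration:

```python
import numpy as np

d = 64
for N in (100, 1_000, 5_000):                        # input time series length
    Q = K = np.random.rand(N, d).astype(np.float32)
    scores = Q @ K.T                                 # full self-attention: N x N matrix
    print(f"N={N:>5}: score matrix {scores.shape}, ~{scores.nbytes / 1e6:.0f} MB")
# 10x longer input -> ~100x more memory/compute for the attention scores
```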
#ossummit
Transformers in Time Series: A Survey (Feb 2023)
• “Transformers have shown great modeling ability for long-range dependencies and interactions in sequential data and thus are appealing to time series modeling.”
https://arxiv.org/pdf/2202.07125.pdf
#ossummit
NETWORK MODIFICATIONS : Positional encoding
• Vanilla (2019) (https://arxiv.org/abs/1907.00235)
– Unable to fully exploit the important features of TS
• Learnable (2021) (https://arxiv.org/abs/2010.02803)
– Shown to be more flexible
• Time stamp encoding (see the sketch after this list)
– Encode time stamps as additional positional encoding by using learnable embedding layers.
• Informer, Autoformer and FEDformer
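A sketch of the time-stamp-encoding idea used by Informer-style models: calendar features go through learnable embedding layers and are added to the value embedding and positional encoding. The specific features (month, weekday, hour) and sizes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TimeStampEmbedding(nn.Module):
    """Learnable embeddings for calendar time stamps (month, weekday, hour)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.month = nn.Embedding(13, d_model)      # months 1..12
        self.weekday = nn.Embedding(7, d_model)
        self.hour = nn.Embedding(24, d_model)

    def forward(self, stamps):                      # stamps: (batch, seq_len, 3) int tensor
        return (self.month(stamps[..., 0])
                + self.weekday(stamps[..., 1])
                + self.hour(stamps[..., 2]))        # (batch, seq_len, d_model)

# Added to the value embedding + positional encoding before the encoder
stamps = torch.randint(0, 7, (2, 36, 3))            # toy (month, weekday, hour) indices
print(TimeStampEmbedding()(stamps).shape)           # torch.Size([2, 36, 64])
```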
#ossummit
NETWORK MODIFICATIONS : Attention Module
• Goal: reduce the quadratic complexity
– LogTrans and Pyraformer
– Informer and FEDformer
Image from https://arxiv.org/pdf/2202.07125.pdf
TS Transformers
#ossummit
Informer – Input Representation
https://wuhaixu2016.github.io/pdf/NeurIPS2021_Autoformer.pdf
#ossummit
Informer – Attention module modified
• ProbSparse Attention
https://wuhaixu2016.github.io/pdf/NeurIPS2021_Autoformer.pdf
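A heavily simplified sketch of the ProbSparse idea: score each query by how far its attention distribution is from uniform (max minus mean of its scores), give full attention only to the top-u "active" queries, and let the "lazy" queries fall back to the mean of V. The real Informer also approximates this measurement by sampling keys, which is omitted here.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Simplified ProbSparse: full attention only for the u most 'active' queries."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (L_q, L_k)
    # Sparsity measure per query: max score minus mean score (long-tail queries score high)
    M = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(M)[-u:]                             # indices of "active" queries

    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))       # lazy queries get mean(V)
    w = np.exp(scores[top]) / np.exp(scores[top]).sum(axis=1, keepdims=True)
    out[top] = w @ V                                     # active queries get real attention
    return out

L, d = 96, 32
Q = K = V = np.random.rand(L, d)
u = int(np.ceil(np.log(L)))                              # Informer uses u ~ c * ln(L)
print(probsparse_attention(Q, K, V, u).shape)            # (96, 32)
```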
#ossummit
INFORMER : Results
#ossummit
Spacetimeformer (Mar 2023)
Representing Time and value – Positional embedding
https://arxiv.org/pdf/2109.12218.pdf
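The spatiotemporal flattening Spacetimeformer performs, sketched in a few lines: a multivariate sequence of length L with N variables becomes L·N tokens, each carrying (time index, variable id, value), so attention can relate any variable at any time step to any other. The exact token layout here is my own illustration.

```python
import numpy as np

def flatten_spacetime(series):
    """(L, N) multivariate series -> (L*N, 3) tokens of (time, variable_id, value)."""
    L, N = series.shape
    time_idx = np.repeat(np.arange(L), N)        # each time stamp copied N times
    var_idx = np.tile(np.arange(N), L)           # variable id for every scalar value
    values = series.reshape(-1)
    return np.stack([time_idx, var_idx, values], axis=1)

series = np.random.rand(360, 4)                  # e.g. 4 latency signals, 360 steps
tokens = flatten_spacetime(series)
print(tokens.shape)                              # (1440, 3): sequence grows from L to L*N
```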
#ossummit
Spacetimeformer
https://arxiv.org/pdf/2109.12218.pdf
Use case
#ossummit
Time series – Microservices latency – Online Boutique
https://github.com/GoogleCloudPlatform/microservices-demo
#ossummit
What will be predicted? (Multivariate)
#ossummit
Results
Seq_len: 360; pred_len: 36
Model      MSE     Time (batch)
LSTM       0.078   424 s
Informer   0.056   670 s

Seq_len: 360; pred_len: 120
Model      MSE     Time (batch)
LSTM       0.145   703 s
Informer   0.089   850 s
#ossummit
Processing time
#ossummit
Conclusions
• Transformers seem promising, but there is still work to be done (optimizations).
• Even the most advanced architecture may not be the best for every use case.
• Get involved: the community drives the SOTA for TS.
• Try cool TS projects like
https://github.com/timeseriesAI/tsai
Thank you!
Editor's Notes
1. This talk walks through my own learning process for solving a problem: latency prediction in Kubernetes. I believe that process can be useful to others, since the amount of papers and information to learn is huge, and I would like to share my experience with you; this is a hot topic and it can be beneficial wherever a time-series problem has to be solved. I promise it won't be overly technical and you don't need full prior knowledge. My idea is to introduce the topic for further reading; I will cite and mention the papers that were most useful in my journey.
2. (Animation: all items appear, starting from the vanilla transformer.) Transformers have changed the world since their appearance in 2017. Why are they important? They have come to monopolize state-of-the-art performance across virtually all NLP tasks.
3. (Highlight the transformer with a circle or square.) The architecture has evolved in different ways as researchers found new ways to transform it. Diffusion models are one example: they use Transformers + U-Net, and that is the SOTA today. For time series this is not settled yet; there are multiple approaches and multiple conversations. It's not just the transformer itself, but how it is adapted.
4. Why is the transformer so important, and why are we seeing these architectures in most use cases today? To explain a transformer we need to start with the concepts. There is a lot of math behind it, and I'll try to explain it in an easy way; let's understand the basics and how everything started. The parts are almost the same across models: this is the initial idea, some models keep this architecture and others modify it, but it's important to understand how it works so we can later understand how to modify the architecture, or why we need a different architecture for our use case. For instance, GPT uses the decoder, while BERT uses the encoder. It's important to understand the concept in order to later understand the problems and how it can be adapted to time series. Initially the transformer was used for text translation, and it was disruptive because of its results. We can divide it into big parts: how it adapts the information so it can be interpreted, and how it extracts the relationships between the words in a seq2seq setting.
5. Let's focus only on text. Computers don't understand words; they get numbers, vectors, and matrices. Embeddings map words with similar meanings close together, but those words can have different meanings depending on their position, so we need to provide positional context. The input embedding captures the dependencies across different variables without considering position information; it keeps the similarity structure. Transformers are permutation invariant, meaning they cannot interpret the order of input tokens by default, so we need to provide that information through positional encoding. This part is important and has seen multiple changes over time in how the model interprets positions: some approaches add the position as a token, and there are hybrid positional encodings.
6. https://jalammar.github.io/illustrated-transformer/ Self-attention models the relationship between each word and the others; as we can see, "Children" has a strong relationship with the other words. The idea is to convert each word into a vector that contains its relationship to the other words, so the converted vector is the result of each relation with the other words. For that, three matrices (Q, K, V) are used, and these are what gets trained. A single Q will give us just one relationship, but we need multiple relationships; this is why we use multi-head attention.
7. https://jalammar.github.io/illustrated-transformer/ We are only showing one head, but the vectors contain information about each word in relation to the other words. Multi-head attention allows the model to capture multiple relationships; the heads are concatenated and multiplied by a new weight matrix to get a single vector again.
8. https://jalammar.github.io/illustrated-transformer/ We are only showing one, but the vectors contain information about each word in relation to the other words. Once the inputs are encoded you have the relationships between them, but you still need to predict the output; this is where the model uses the decoder part. The decoder works the same way the encoder does: the encoder outputs Z are transformed into K and V matrices that the decoder attends to.
9. Now let's dive into time series. Can the transformer be used as-is for time series? There is a trend to reuse what works well in one topic for others, mainly because it was such a huge advance, so we can probably do the same for time series. But, similar to what we've seen for NLP, we can face some problems when we try to use it.
10. Time series appear in many use cases around the world: weather forecasting or, as in this case, Bitcoin prices. Time series forecasting plays an important role in daily life, helping people manage resources and make decisions. For example, in the retail industry, probabilistic forecasting of product demand and supply based on historical data can help with inventory planning to maximize profit. Anything where a value depends on the previous ones is considered a time series: point x depends on the previous points, which is very important to mention because it is what distinguishes time series. We use a context window to predict a target window, which can be 1 point or 10.
11. Following that reasoning, if a time step depends on the previous ones, isn't this similar to language? It can be associated with language, but language is more forgiving: we can slightly change the order and still get the message, whereas in a time series the order is fixed. It's important to have a proper description or model. We can imagine that an algorithm that performs well on language could perform well on time series. But what if I have a long phrase?
12. Classic methods involve an exhaustive study of the characteristics of the time series in order to model it; parameters such as trend and seasonality are analyzed, and well-known methods such as ARIMA or autoregressive models are used. Fixed length, autoregressive, linear dependencies.
13. The main problem here is that a special network has to be designed for the task: if I want to predict the next value the architecture has to be built that way, and if I want to use the previous 8 values I can't reuse it, which makes this challenging. RNN: one of the appeals of RNNs is the idea that they can connect previous information to the current task. Where the gap between the relevant information and the place it is needed is small, RNNs can learn to use past information. There are also cases where more context is needed, and the gap between the relevant information and the point where it is needed can become very large. Because of vanishing gradients, earlier values count less and less. LSTM: LSTMs can be used to build large recurrent networks that, in turn, can tackle difficult sequence problems in machine learning and achieve state-of-the-art results. Instead of neurons, LSTM networks have memory blocks connected across layers.
14. Since self-attention enables the Transformer to capture both long- and short-term dependencies, and different attention heads learn to focus on different aspects of temporal patterns, the Transformer is a good candidate for time series forecasting. Transformer models are based on a multi-headed attention mechanism that offers several key advantages and renders them particularly suitable for time series data. They can concurrently take into account long contexts of input sequence elements and learn to represent each sequence element by selectively attending to those input sequence elements the model considers most relevant. They do so without position-dependent prior bias; this is to be contrasted with RNN-based models: a) even bi-directional RNNs treat elements in the middle of the input sequence differently from elements close to the two endpoints, and b) despite careful design, even LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks practically only retain information from a limited number of time steps stored inside their hidden state (vanishing gradient problem (Hochreiter, 1998; Pascanu et al., 2013)), so the context used for representing each sequence element is inevitably local. Multiple attention heads can consider different representation subspaces, i.e., multiple aspects of relevance between input elements. For example, in the context of a signal with two frequency components, 1/T1 and 1/T2, one attention head can attend to neighboring time points, while another may attend to points spaced a period T1 before the currently examined time point, a third to a period T2 before, etc. This is to be contrasted with attention mechanisms in RNN models, which learn a single global aspect/mode of relevance between sequence elements. After each stage of contextual representation (i.e., transformer encoder layer), attention is redistributed over the sequence elements, taking into account progressively more abstract representations of the input elements as information flows from the input towards the output. By contrast, RNN models with attention use a single distribution of attention weights to extract a representation of the input, and most typically attend over a single layer of representation (hidden states).
15. But there are some problems, more related to the algorithm itself. If you have worked with algorithms, you know that when you write a function it can fall into several complexity classes. This part is important because we would like an algorithm whose cost does not blow up as the input size grows. Constant: 1+1 costs the same as 200+200, it always takes the same number of units. Linear: a single pass over the input, e.g. accumulating sum_elements + prev inside one for loop. Quadratic: a for loop nested inside another for loop. This is the case for transformers: you compare each word with all the other words, and that is the challenge.
16. The transformer therefore has to be adapted, and there are different ways to do it. To summarize the existing time series Transformers, the survey proposes a taxonomy from the perspectives of network modifications and application domains. From the network-modification perspective, it summarizes the changes made at both the module level and the architecture level to accommodate time series modeling. From the application-domain perspective, there are multiple ways to use time series: forecasting, anomaly detection, classification.
17. As you may imagine, it is extremely important to encode the positions of the input time series. In the vanilla transformer we first encode the positional information as vectors and then inject them into the model as an additional input. Vanilla: using the vanilla encoding alone will definitely not be enough. Learnable: instead of using the fixed sine/cosine encoding, they decided to learn it and adapt it to multiple tasks.
18. Now let's dive into time series. Can the transformer be used as-is? There is a trend to reuse what works well in one topic for others, mainly because it was such a huge advance, so we can probably do the same for time series. But, similar to what we've seen for NLP, we can face some problems when we try to use it.
19. The vanilla transformer (Vaswani et al. 2017; Devlin et al. 2018) uses a point-wise self-attention mechanism, and the time stamps serve as local positional context. However, in the LSTF problem, the ability to capture long-range dependencies requires global information such as hierarchical time stamps (week, month and year) and agnostic time stamps (holidays, events). These are hardly leveraged in canonical self-attention, and the resulting query-key mismatches between the encoder and decoder degrade forecasting performance. The Informer paper proposes a uniform input representation to mitigate the issue; its Fig. 6 gives an intuitive overview. The time stamps can also be hardcoded.
20. The main goal of sparse attention is to reduce computation. Sparse attention reduces computation time and the memory requirements of the attention mechanism by computing a limited selection of similarity scores from a sequence rather than all possible pairs, resulting in a sparse matrix rather than a full one. The main idea of ProbSparse is that the canonical self-attention scores form a long-tail distribution, where the "active" queries lie in the "head" scores and the "lazy" queries lie in the "tail" area. By "active" query we mean a query q_i whose dot products with the keys contribute to the major attention, whereas a "lazy" query forms dot products that generate only trivial attention. Here, q_i and k_i are the i-th rows of the Q and K attention matrices, respectively. ProbSparse takes care of selecting which queries receive full attention.
21. Because of ProbSparse self-attention, the encoder's feature map has some redundancy that can be removed. The distilling operation is therefore used to reduce the input size between encoder layers to half, in theory removing this redundancy. In practice, Informer's "distilling" operation just adds 1D convolution layers with max pooling between each pair of encoder layers, halving the output X_n of the n-th encoder layer before it reaches the next one.
22. Of course we don't have time to explain everything here. It's not a huge improvement, but it's better.
23. There is a recent approach that builds on the Informer style. Informer generates d-dimensional embeddings of the sequence, with the result expressed as a matrix. This approach modifies the token-embedding input sequence by flattening each multivariate vector into N scalars, each with a copy of its timestamp, leading to a new sequence. Unlike Informer, Spacetimeformer uses a different representation of the inputs: each token carries positional information. It converts a seq2seq problem of length L into a new format of length L·N, but we can imagine a problem here as the number of variables grows.
24. To run our experiments, we selected the problem of predicting latency between microservices. For those who are not aware of it, Online Boutique is basically an implementation of an e-commerce site whose architecture is based on microservices. We will predict latencies; we have the information for all the services, and we will predict the latency to the front-end service. We could pick any, but we selected the user-facing latency.
25. As we said, there are multiple ways and moments to use forecasting. We can be at a stage where we have very few points for training. How many points are enough? It depends on each series. But transformers can be useful if we have long series and we want to predict a long horizon. The easier way, of course, is to predict the next point based on the past; we will target predicting long sequences.
26. Be careful with the MSE; check how close the predictions are to the original values.
27. It seems to be better, but in reality what matters are the peaks. MSE is a mean over all the measurements, so even if the model is tracking the series well on average, it may not capture the peaks.
28. But why are transformers now being used in all environments? Before moving to the architecture details, see the survey: https://arxiv.org/pdf/2202.07125.pdf
29. Seasonality is important. Three things are important here.
30. The goal is to be able to observe; of course the most valuable thing would be an algorithm that can predict a long sequence, and that is the challenge. (Frame the Kubernetes problem here; is it worth it?)