RTSS Jun Young Park
Introduction to PyTorch
Objective
 Understanding AutoGrad
 Review
 Logistic Classifier
 Loss Function
 Backpropagation
 Chain Rule
 Example : Find gradient from a matrix
 AutoGrad
 Solve the example with AutoGrad
 Data Parallelism in PyTorch
 Why should we use GPUs?
 Inside CUDA
 How to parallelize our models
 Experiment
Simple but powerful implementation of backpropagation
Understanding AutoGrad
Logistic Classifier (Fully-Connected)

$WX + b = y$

The logits $y$ are mapped by $S(y)$ to probabilities, e.g. for classes A, B, C:

Class   Logit   Probability
A       2.0     p = 0.7
B       1.0     p = 0.2
C       0.1     p = 0.1

X : Input
W, b : To be trained
y : Prediction
S(y) : Softmax function (can be other activation functions)

$S(y)_i = \dfrac{e^{y_i}}{\sum_j e^{y_j}}$ represents the probabilities of the elements in vector $y$.
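To make the logits -> probability mapping concrete, the softmax can be evaluated directly in PyTorch (a minimal sketch using the slide's logits; note the slide rounds the results):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])   # y for classes A, B, C
probs = torch.softmax(logits, dim=0)     # S(y)_i = exp(y_i) / sum_j exp(y_j)
print(probs)                             # tensor([0.6590, 0.2424, 0.0986]) ~ (0.7, 0.2, 0.1)
```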
For an instance of class A, the classifier outputs the probability vector (0.7, 0.2, 0.1) over classes A, B, C; taking the MAX predicts the label A.
The true label is one-hot encoded: A = (1, 0, 0).
The loss is the distance between the predicted probability vector and the one-hot label.
Find W, b that minimize the loss (error).
Loss Function
 The vector can be very large when there are a lot of classes.
 How can we find the distance between the vector S (prediction) and L (label)?

$D(S, L) = -\sum_i L_i \log(S_i)$

e.g. $S(y) = (0.7,\ 0.2,\ 0.1)$, $L = (1.0,\ 0.0,\ 0.0)$

※ $D(S, L) \ne D(L, S)$
No need to worry about taking $\log(0)$: every entry of $S(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$ is strictly positive.
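A direct transcription of $D(S, L)$ (a minimal sketch; in practice PyTorch's nn.CrossEntropyLoss combines the softmax and this distance in one call):

```python
import torch

S = torch.tensor([0.7, 0.2, 0.1])   # softmax output S(y)
L = torch.tensor([1.0, 0.0, 0.0])   # one-hot label

D = -(L * torch.log(S)).sum()       # D(S, L) = -sum_i L_i * log(S_i)
print(D)                            # tensor(0.3567) = -log(0.7)
```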
In-depth of Classifier
Consider the following equations:
1. Affine Sum : $\sigma(x) = Wx + B$
2. Activation Function : $y(\sigma) = \mathrm{ReLU}(\sigma)$
3. Loss Function : $E(y) = \frac{1}{2}(y_{target} - y)^2$
4. Gradient Descent : $w \leftarrow w - \alpha \frac{\partial E}{\partial w}$, $b \leftarrow b - \alpha \frac{\partial E}{\partial b}$

• Gradient descent requires $\frac{\partial E}{\partial w}$ and $\frac{\partial E}{\partial b}$.
• How can we find them? -> Use the chain rule!

$y_{target}$ : Training data
$y$ : Prediction result
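One gradient-descent update for this affine + ReLU + squared-error unit can be written out in plain Python (a minimal sketch; the initial values, learning rate, and training example are illustrative, and the partial derivatives are the ones derived on the following Chain Rule slides):

```python
# Manual gradient-descent step for E = 1/2 * (y_target - y)^2 with y = ReLU(w*x + b).
w, b, alpha = 0.5, 0.0, 0.1            # illustrative initial values and learning rate
x, y_target = 1.0, 2.0                 # one illustrative training example

sigma = w * x + b                      # 1. affine sum
y = max(sigma, 0.0)                    # 2. ReLU activation
dE_dy = y - y_target                   # dE/dy
dy_dsigma = 1.0 if sigma > 0 else 0.0  # ReLU differentiated piecewise
w -= alpha * dE_dy * dy_dsigma * x     # dE/dw = (y - y_target) * 1{sigma>0} * x
b -= alpha * dE_dy * dy_dsigma         # dE/db = (y - y_target) * 1{sigma>0}
```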
Chain Rule
• Let $y(x)$ be defined as below: $x$ influences $g(x)$, and $g(x)$ influences $f(g(x))$.

$y(x) = f(g(x)) = (f \circ g)(x)$

• Find the derivative of $y(x)$:

$y'(x) = f'(g(x))\, g'(x)$

• In Leibniz notation:

$\frac{dy}{dx} = \frac{dy}{df} \frac{df}{dg} \frac{dg}{dx} = 1 \cdot f'(g(x)) \cdot g'(x)$
Chain Rule

$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial \sigma} \frac{\partial \sigma}{\partial w} = \begin{cases} x\,(y - y_{target}) & (\sigma > 0) \\ 0 & (\sigma \le 0) \end{cases}$

where

$\frac{\partial E}{\partial y} = y - y_{target}, \quad \frac{\partial y}{\partial \sigma} = \begin{cases} 1 & (\sigma > 0) \\ 0 & (\sigma \le 0) \end{cases}, \quad \frac{\partial \sigma}{\partial w} = x$

(Recall the equations: 1. Affine Sum $\sigma(x) = Wx + B$; 2. Activation $y(\sigma) = \mathrm{ReLU}(\sigma)$; 3. Loss $E(y) = \frac{1}{2}(y_{target} - y)^2$; 4. Gradient Descent $w \leftarrow w - \alpha \frac{\partial E}{\partial w}$, $b \leftarrow b - \alpha \frac{\partial E}{\partial b}$.)
Example : Finding gradient of $X$
 Let the input tensor $X$ be initialized as the following square matrix of order 3:

$X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$

 And let $Y$, $Z$ be defined (elementwise) as follows:

$Y = X + 3, \qquad Z = 6Y^2 = 6(X + 3)^2$

 And the output $\delta$ is the average of tensor $Z$:

$\delta = \mathrm{mean}(Z) = \frac{1}{9} \sum_i \sum_j Z_{ij}$
Example : Finding gradient of $X$
 We can write the scalar $Z_{ij}$ from its definition (elementwise):

$Z_{ij} = 6(Y_{ij})^2, \qquad Y_{ij} = X_{ij} + 3$

 To find the gradient, we use the chain rule to combine the partial gradients:

$\frac{\partial \delta}{\partial Z_{ij}} = \frac{1}{9}, \quad \frac{\partial Z_{ij}}{\partial Y_{ij}} = 12 Y_{ij}, \quad \frac{\partial Y_{ij}}{\partial X_{ij}} = 1$

$\frac{\partial \delta}{\partial X_{ij}} = \frac{\partial \delta}{\partial Z_{ij}} \frac{\partial Z_{ij}}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ij}} = \frac{1}{9} \cdot 12 Y_{ij} \cdot 1 = \frac{4}{3}(X_{ij} + 3)$
Example : Finding gradient of $X$
 Thus, we get the gradient of the (1,1) element of $X$:

$\frac{\partial \delta}{\partial X_{ij}}\Big|_{(i,j)=(1,1)} = \frac{4}{3}(X_{11} + 3) = \frac{4}{3}(1 + 3) = \frac{16}{3}$

 Likewise, we can get the whole gradient matrix of $X$:

$\frac{\partial \delta}{\partial X} = \begin{bmatrix} \frac{\partial \delta}{\partial X_{11}} & \frac{\partial \delta}{\partial X_{12}} & \frac{\partial \delta}{\partial X_{13}} \\ \frac{\partial \delta}{\partial X_{21}} & \frac{\partial \delta}{\partial X_{22}} & \frac{\partial \delta}{\partial X_{23}} \\ \frac{\partial \delta}{\partial X_{31}} & \frac{\partial \delta}{\partial X_{32}} & \frac{\partial \delta}{\partial X_{33}} \end{bmatrix} = \begin{bmatrix} \frac{16}{3} & \frac{20}{3} & \frac{24}{3} \\ \frac{28}{3} & \frac{32}{3} & \frac{36}{3} \\ \frac{40}{3} & \frac{44}{3} & \frac{48}{3} \end{bmatrix}$
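As a sanity check, the closed-form result can be evaluated directly (a minimal PyTorch sketch, computing (4/3)(X + 3) by hand):

```python
import torch

# Evaluate the closed-form gradient d(delta)/dX = (4/3) * (X + 3) elementwise.
X = torch.arange(1.0, 10.0).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]
manual_grad = (4.0 / 3.0) * (X + 3)
print(manual_grad)   # [[5.33, 6.67, 8.00], [9.33, 10.67, 12.00], [13.33, 14.67, 16.00]]
```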
AutoGrad : Finding gradient of $X$

$X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \qquad Y = X + 3, \qquad Z = 6Y^2 = 6(X + 3)^2, \qquad \delta = \mathrm{mean}(Z) = \frac{1}{9} \sum_i \sum_j Z_{ij}$

$\frac{\partial \delta}{\partial X} = \begin{bmatrix} \frac{16}{3} & \frac{20}{3} & \frac{24}{3} \\ \frac{28}{3} & \frac{32}{3} & \frac{36}{3} \\ \frac{40}{3} & \frac{44}{3} & \frac{48}{3} \end{bmatrix}$

Each operation has its gradient function.
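The same gradient falls out of AutoGrad by tracking operations on X and calling backward() (a minimal sketch reproducing the example above, not the author's exact code):

```python
import torch

# Track operations on X so AutoGrad can build the computation graph.
X = torch.arange(1.0, 10.0).reshape(3, 3).requires_grad_(True)
Y = X + 3
Z = 6 * Y ** 2
delta = Z.mean()     # scalar output

delta.backward()     # backpropagate: each operation applies its gradient function
print(X.grad)        # tensor([[ 5.3333, 6.6667, 8.0000], ...]) = (4/3)(X + 3)
```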
Back Propagation
 Get derivatives using 'Back Propagation'.

Addition node: $z = x + y$, so $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} = 1$.
From the output signal $L$, the incoming gradient $\frac{\partial L}{\partial z}$ passes through unchanged:

$\frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z}, \qquad \frac{\partial L}{\partial z} \frac{\partial z}{\partial y} = \frac{\partial L}{\partial z}$

Multiplication node: $z = xy$, so $\frac{\partial z}{\partial x} = y$ and $\frac{\partial z}{\partial y} = x$.
From the output signal $L$, the incoming gradient is multiplied by the opposite input:

$\frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z} \cdot y, \qquad \frac{\partial L}{\partial z} \frac{\partial z}{\partial y} = \frac{\partial L}{\partial z} \cdot x$
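These two node types can be sketched as tiny forward/backward classes (illustrative only, not PyTorch's internal implementation):

```python
class AddNode:
    def forward(self, x, y):
        return x + y

    def backward(self, dL_dz):
        # dz/dx = dz/dy = 1: pass the upstream gradient through unchanged.
        return dL_dz, dL_dz


class MulNode:
    def forward(self, x, y):
        self.x, self.y = x, y   # cache inputs for the backward pass
        return x * y

    def backward(self, dL_dz):
        # dz/dx = y, dz/dy = x: multiply by the opposite input.
        return dL_dz * self.y, dL_dz * self.x
```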
Back Propagation
 How about an exponentiation node?

$z = x^n, \qquad \frac{\partial z}{\partial x} = n x^{n-1}, \qquad \frac{\partial z}{\partial n} = x^n \ln x$

From the output signal $L$:

$\frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z} (n x^{n-1}), \qquad \frac{\partial L}{\partial z} \frac{\partial z}{\partial n} = \frac{\partial L}{\partial z} (x^n \ln x)$

Deriving $\frac{\partial z}{\partial n}$ by logarithmic differentiation:

$z = x^n \;\Rightarrow\; \ln z = n \ln x \;\Rightarrow\; \frac{1}{z}\, dz = \ln x \, dn \;\Rightarrow\; \frac{dz}{dn} = z \ln x = x^n \ln x$
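The exponentiation node fits the same pattern (an illustrative sketch; it assumes x > 0 so that ln x is defined):

```python
import math

class PowNode:
    def forward(self, x, n):
        self.x, self.n = x, n   # cache inputs for the backward pass
        return x ** n

    def backward(self, dL_dz):
        dz_dx = self.n * self.x ** (self.n - 1)       # dz/dx = n * x^(n-1)
        dz_dn = self.x ** self.n * math.log(self.x)   # dz/dn = x^n * ln(x)
        return dL_dz * dz_dx, dL_dz * dz_dn
```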
Appendix : Operation Graph of $\delta$ (Matrix)
[Operation graph: each element $X_{ij}$ flows through (+3) -> $Y_{ij}$ -> (^2) -> (×6) -> $Z_{ij}$; all nine $Z_{ij}$ are summed and scaled by $\frac{1}{9}$ to produce $\delta$.]

$Z_{ij} = 6(Y_{ij})^2, \qquad \delta = \mathrm{mean}(Z)$
Appendix : Operation Graph of $\delta$ (Scalar) - Backpropagation
[Operation graph for a single element, with the intermediate signals named: $X_{ij}$ -> (+3) -> $Y_{ij}$ -> (^2) -> $\alpha_{ij}$ -> (×6) -> $Z_{ij}$ -> (sum) -> $\beta_{ij} = Z_{sum}$ -> (×$\frac{1}{9}$) -> $\delta$.]

Backpropagating node by node from $\frac{\partial \delta}{\partial \delta} = 1$:

$\frac{\partial \delta}{\partial \beta_{ij}} = \frac{1}{9}, \qquad \frac{\partial \delta}{\partial Z_{ij}} = \frac{1}{9} \cdot 1, \qquad \frac{\partial \delta}{\partial \alpha_{ij}} = \frac{1}{9} \cdot 1 \cdot 6, \qquad \frac{\partial \delta}{\partial Y_{ij}} = \frac{1}{9} \cdot 1 \cdot 6 \cdot 2Y_{ij}$

$\frac{\partial \delta}{\partial X_{ij}} = \frac{\partial \delta}{\partial Z_{ij}} \frac{\partial Z_{ij}}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ij}} = \frac{\partial \delta}{\partial \beta_{ij}} \frac{\partial \beta_{ij}}{\partial Z_{ij}} \frac{\partial Z_{ij}}{\partial \alpha_{ij}} \frac{\partial \alpha_{ij}}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ij}} = \frac{1}{9} \cdot 1 \cdot 6 \cdot 2Y_{ij} \cdot 1 = \frac{4}{3}(X_{ij} + 3)$
Comparison
[Side-by-side code screenshots: the raw (manual backpropagation) implementation vs. the AutoGrad implementation of the same gradient computation.]
Data Parallelism in PyTorch
Why GPU? (CUDA)
[Diagram: a CPU with a few large cores vs. a GPU with thousands of small cores (3584 CUDA cores).]
 CPU : a few cores @ 3.6 GHz – good for a few huge tasks.
 GPU : 3584 cores @ 1.6 GHz (2.0 GHz overclocked) – good for enormous numbers of small tasks.
Dataflow Diagram
[Diagram: the GPU as a co-processor. A CUDA source file (hello.cu) is compiled by NVCC. Host (CPU) memory holds h_a, h_b, h_out; device (GPU) memory holds d_a, d_b, d_out, allocated with cudaMalloc().]
1. Memcpy : copy the inputs from host to device with cudaMemcpy().
2. Kernel call : run the __global__ kernel (e.g. sum(), or a cuBLAS routine) on the device.
3. Memcpy : copy the result back from device to host with cudaMemcpy().
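In PyTorch, the same host -> device -> host flow is expressed with tensor transfers (a minimal sketch; the variable names mirror the diagram):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

h_a = torch.randn(1000)   # host tensors
h_b = torch.randn(1000)

d_a = h_a.to(device)      # 1. Memcpy: host -> device
d_b = h_b.to(device)
d_out = d_a + d_b         # 2. Kernel call: the addition runs on the GPU
h_out = d_out.cpu()       # 3. Memcpy: device -> host
```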
CUDA on Multi GPU System
 Quad SLI : 14,336 CUDA cores, 48 GB of VRAM.
 How can we use multiple GPUs in PyTorch?
Problem - Low Utilization
 Only a single GPU is allocated; the remaining GPUs sit at zero utilization with redundant, unused memory.
Problem - Duration & Memory Allocation
 A large batch size causes a lack of memory.
 An out-of-memory error from PyTorch kills the Python kernel.
 So we can't set a large batch size: only batch_size = 5, num_workers = 2 is affordable.
 We can't divide up the work among the other GPUs.
 Elapsed time : 25m 44s (10 epochs).
 Reached 99% accuracy in 9 epochs (on the training set).
 It takes too much time.
Data Parallelism in PyTorch
 Implemented using torch.nn.DataParallel().
 Can be used for wrapping a module or model.
 Also supports primitives (torch.nn.parallel.*) – composed in the sketch below:
 Replicate : replicate the model on multiple devices (GPUs).
 Scatter : distribute the input in the first dimension.
 Gather : gather and concatenate the input in the first dimension.
 Apply-Parallel : apply a set of already-distributed inputs to a set of already-distributed models.
 PyTorch Tutorials – Multi-GPU examples
 https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
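The linked tutorial composes these four primitives roughly as follows (a condensed sketch of its data_parallel helper):

```python
import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    # Replicate -> Scatter -> Apply-Parallel -> Gather, as in the linked tutorial.
    if not device_ids:
        return module(input)
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)    # Replicate the model
    inputs = nn.parallel.scatter(input, device_ids)         # Scatter along dim 0
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)  # Apply-Parallel
    return nn.parallel.gather(outputs, output_device)       # Gather along dim 0
```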
Easy to Use : nn.DataParallel(model)
- Practical Example
1. Define the model.
2. Wrap the model with nn.DataParallel().
3. Access layers through 'module'.
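A minimal sketch of these three steps, assuming a CUDA machine (the toy model is illustrative; the original slide shows the author's model as a screenshot):

```python
import torch
import torch.nn as nn

# 1. Define the model (an illustrative toy network).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# 2. Wrap the model; DataParallel scatters each batch across all visible GPUs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

# 3. Access the underlying layers through the 'module' attribute when wrapped.
wrapped = isinstance(model, nn.DataParallel)
first_layer = model.module[0] if wrapped else model[0]
```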
After Parallelism - GPU Utilization
 Hyperparameters
 Batch Size : 128
 Number of Workers : 16
 High utilization.
 Can use a large memory space.
 All GPUs are allocated.
After Parallelism - Training Performance
 Hyperparameters
 Batch Size : 128
 A large batch size needs more memory space.
 Number of Workers : 16
 Recommended to set 4 × NUM_GPUs (from the forum).
 Elapsed time : 7m 50s (10 epochs).
 Reached 99% accuracy in 4 epochs (on the training set) – it took just 3m 10s.
Q & A

Editor's Notes

  1. To understand AutoGrad, the automatic differentiation feature provided by PyTorch... we build up the basic theory of deep learning and take a closer look at backpropagation. Then we compare the implementations of plain backpropagation and AutoGrad to understand the difference. We look at why GPUs are used and how a CUDA computation proceeds, then at how to use the data-parallelism methods provided by PyTorch. Finally, we compare multi-GPU and single-GPU performance.
  2. A module that provides an easy implementation of backpropagation.
  3. The basic form of a logistic classifier is a first-order linear function (WX + b = y). Here X is the input and W, b are the weights and bias (training means finding appropriate weights and biases). y is the prediction result -> this result (the logits) is converted into probabilities (softmax function). Why? Logits can grow very large, so they are converted into simple values between 0 and 1. Classify into the class with the highest probability. Two classes? Logistic classification. Several classes? Softmax/multinomial classification.
  4. How do we represent a class as a number? Make the corresponding entry of the vector true (the class with the highest probability). E.g. class A? -> [1 0 0 0 0 ...]: only the index corresponding to class A is true; the rest are false.
  5. The distance between the answer and the prediction: cross-entropy. Softmax will not be 0; mind the argument order. A small value (a short distance) means a correct judgment. Since the entries of S(y) sum to 1 and each entry is greater than 0, the log(0) problem does not arise.
  6. By the chain rule, the derivative of the loss function E with respect to w is as follows. In other words, how much E changes when w changes equals the product of the changes through the composed functions: y affects E, sigma affects y, and w affects sigma. Computing each partial derivative gives the results shown. Since ReLU is a non-linear function, it is differentiated piecewise.
  7. Define the operations as above...
  8. Since operating on the matrix directly is cumbersome, use a scalar expression for a single element. Computing the partial derivatives gives the results shown, and expressing them as a composite function gives the chain above.
  9. Substituting the (1,1) element of X, which is 1, gives the value shown. Rewriting the other elements back in the original matrix form in the same way gives the result above.
  10. A gradient function is, in the end, the backpropagation of the most basic computation node.
  11. Now that we properly understand composite functions, let's move on to backpropagation. How much did x and y affect the value of z? That is, how does z change as x and y change? Backpropagation: multiply the signal by the node's local derivative and pass it to the next node (in reverse). The backpropagation of an addition node propagates the previous signal unchanged. The backpropagation of a multiplication node propagates the previous signal multiplied by the opposite input.
  12. The power node and its forward and backward passes look like this. As before, we find the influence of x and n on z. Working it out gives the results shown.
  13. The computation graph for the matrix looks like this. Each element is computed, and the mean is obtained from the sum and the number of elements.
  14. Since the matrix expression is hard to follow, we write each element as a scalar. Applying the backpropagation principle covered earlier gives the result below.