EE-5351: Course project presentation
Accelerated Logistic Regression on GPU(s)
Rahul Bhojwani, Swaraj Khadanga, Anand Saharan
12/16/2018
Outline
• Problem description
• Key concepts
• Problem Understanding
• Datasets used
• Our Solutions
• Results
• References
• Model training and selection are the most costly
and repetitive steps.
Our Focus
The Logistic Regression Model
● y = f(X), where y ∈ {0, 1}
● The "logit" model solves these problems:
ln[p/(1-p)] = W^T X + b
● p is the probability that the event Y occurs, p(Y=1)
● p/(1-p) is the odds
● ln[p/(1-p)] is the log odds, or "logit"
The Logistic Regression Model
● The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
● The estimated probability is:
p = 1/[1 + exp(-(W^T X + b))]
Logistic Regression Training
● Training set {(x1,y1), ..., (xn,yn)} with each yi in {0,1}
● Likelihood, assuming independence
● Log-likelihood, to be maximized
Logistic Regression Training
● The negative log-likelihood, to be minimized
● The gradient of the objective function
● The weights are then updated using the gradient
So, the gradient is the main computation that needs to be optimized.
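For reference, the quantities named on these two slides have the following standard forms. This is a sketch under two assumptions that are not stated in the slides: the bias b is folded into w (a constant-1 feature), and η denotes a learning rate.

```latex
\begin{aligned}
p_i &= \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}} \\
L(w) &= \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i} \\
\ell(w) &= \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right] \\
J(w) &= -\ell(w) \\
\nabla_w J(w) &= \sum_{i=1}^{n} (p_i - y_i)\, x_i = X^\top \left( \sigma(Xw) - y \right) \\
w &\leftarrow w - \eta \, \nabla_w J(w)
\end{aligned}
```

The last two lines are what the kernels in the rest of the deck compute: first the vector σ(Xw) − y, then its product with X^T.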
Problem understanding
[Figure: X is N_Data × N_features, w is N_features × 1, y is N_Data × 1]
N_Data ~ 10^6 - 10^8
N_features ~ 10^1 - 10^3
Problem understanding
Sub-problem 1: let's call it the Sigmoid Kernel
● Compute X w
● Apply sigmoid to each element
● Subtract y
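The Sigmoid Kernel step can be sketched as a plain CPU reference in Python. This is not the authors' CUDA code; the function names `sigmoid` and `intermediate_vector` and the toy data are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def intermediate_vector(X, w, y):
    """Return sigmoid(x_i . w) - y_i for each row x_i of X."""
    out = []
    for x_i, y_i in zip(X, y):
        z = sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))
        out.append(sigmoid(z) - y_i)
    return out

# Toy data (hypothetical): 3 data points, 2 features.
X = [[1.0, 2.0], [0.0, 0.0], [-1.0, 1.0]]
w = [0.5, -0.25]
y = [1, 0, 1]
im = intermediate_vector(X, w, y)
```

The result has one entry per data point, which is why the GPU versions later assign one thread (or one block) per row of X.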
Problem understanding
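Sub-problem 2: let's call it the Grad_compute kernel
[Figure: X^T (N_features × N_Data) times the intermediate vector (N_Data × 1) gives Grad (N_features × 1), the same shape as Weights]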
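A CPU reference for the Grad_compute step, again a Python sketch rather than the authors' kernel: it multiplies the transpose of X by the intermediate vector. The name `gradient` and the sample values are assumptions.

```python
def gradient(X, im):
    """grad[j] = sum over i of X[i][j] * im[i], i.e. X^T times im."""
    n_features = len(X[0])
    grad = [0.0] * n_features
    for x_i, im_i in zip(X, im):
        for j in range(n_features):
            grad[j] += x_i[j] * im_i
    return grad

# Toy data (hypothetical): 3 data points, 2 features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
im = [0.5, -1.0, 2.0]
grad = gradient(X, im)
```

Every output element is a sum over all N_Data rows, so on the GPU this becomes a reduction along the data dimension.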
Training Routine (Pseudo Code)
1. initialize params
2. for epoch in 1,2,3...N
   a. get batch from the file
   b. compute intermediate vector [sigmoid(w.T*x) - y]
   c. compute gradient
   d. update weights
3. repeat 2 while there is a next batch
4. end
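The routine above can be sketched as runnable Python. The learning rate `lr`, the epoch count, averaging the gradient over the batch, and the toy data are all assumptions for illustration; the slides do not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(batches, n_features, lr=0.1, n_epochs=200):
    """Batch gradient descent; `batches` is a list of (X, y) pairs."""
    w = [0.0] * n_features                       # 1. initialize params
    for _ in range(n_epochs):                    # 2. for each epoch
        for X, y in batches:                     #    a. get the next batch
            # b. intermediate vector: sigmoid(x_i . w) - y_i
            im = [sigmoid(sum(a * b for a, b in zip(x, w))) - t
                  for x, t in zip(X, y)]
            # c. gradient: X^T * im, averaged over the batch (assumption)
            grad = [sum(x[j] * e for x, e in zip(X, im)) / len(X)
                    for j in range(n_features)]
            # d. update weights
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy data: the last column is a constant 1 acting as the bias term.
X = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0, 0, 1, 1]
w = train([(X, y)], n_features=2)
preds = [int(sigmoid(sum(a * b for a, b in zip(x, w))) > 0.5) for x in X]
```

Steps b and c are the two sub-problems; step d is cheap by comparison because it touches only N_features values.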
Datasets used
• HIGGS Dataset
– N_features = 28
– N_data = 500000
• DIGIT Dataset
– N_features = 784
– N_data = 10000
• We couldn't load the entire HIGGS dataset on the
machine, so we used a subset and N_data was small.
• We ran multiple epochs to increase the effective N_data
in both cases.
Sequential Version
[Figure: Sigmoid Kernel and Grad compute stages]
Sigmoid Kernel - 1
• Each thread processes one data point (Xi,Yi)
• X is stored in row-major format
• Uncoalesced access: consecutive threads read elements that are N_features apart
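The per-thread work in this first kernel can be sketched by simulating threads in Python. The function name `sigmoid_kernel_thread` and its argument layout are hypothetical; only the "one thread per data point, row-major X" scheme comes from the slide.

```python
import math

def sigmoid_kernel_thread(tid, X_flat, w, y, out, n_data, n_features):
    """Work done by one simulated thread: one data point (x_i, y_i)."""
    if tid >= n_data:                 # guard for a partially filled block
        return
    z = 0.0
    for j in range(n_features):
        # Row-major: element (i, j) lives at i * n_features + j, so threads
        # tid and tid + 1 read addresses n_features apart at the same step j.
        z += X_flat[tid * n_features + j] * w[j]
    out[tid] = 1.0 / (1.0 + math.exp(-z)) - y[tid]

n_data, n_features = 3, 2
X_flat = [1.0, 2.0,  0.0, 0.0,  -1.0, 1.0]    # 3 rows of 2 features
w = [0.5, -0.25]
y = [1, 0, 1]
out = [0.0] * n_data
for tid in range(4):                           # 4 simulated threads, one idle
    sigmoid_kernel_thread(tid, X_flat, w, y, out, n_data, n_features)
```

The stride of `n_features` between neighbouring threads is what makes the global-memory reads uncoalesced.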
Sigmoid Kernel - 2
• Each thread processes one data point (Xi,Yi)
• X is stored in column-major format
• Coalesced data access: consecutive threads read adjacent elements
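To see why the layout matters, the following Python sketch lists the flat indices that the 32 threads of one warp read at the same loop step, then counts how many aligned memory segments those indices fall into. The segment size of 32 floats (128 bytes) and the helper names are illustrative assumptions, not measurements from the project.

```python
def warp_addresses(layout, j, n_data, n_features, warp_size=32):
    """Flat element indices read by threads 0..warp_size-1 at feature step j."""
    if layout == "row":
        return [i * n_features + j for i in range(warp_size)]
    return [j * n_data + i for i in range(warp_size)]   # column-major

def segments_touched(addresses, elems_per_segment=32):
    """Number of distinct aligned segments covering the addresses.
    32 four-byte floats per 128-byte segment is an illustrative assumption."""
    return len({a // elems_per_segment for a in addresses})

n_data, n_features = 1024, 28                  # 28 features, as in HIGGS
row = segments_touched(warp_addresses("row", 0, n_data, n_features))
col = segments_touched(warp_addresses("col", 0, n_data, n_features))
```

With these numbers the row-major warp touches 28 segments while the column-major warp touches 1, which is the difference between the first and second kernels.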
Sigmoid Kernel - 3 (Shared memory)
• The weights are reused by every thread.
• So we load them into shared memory.
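A Python simulation of one thread block illustrates the saving. The threads first copy the weights cooperatively into a block-local list (standing in for shared memory), then every thread reuses that copy. The function name and the read counter are assumptions for illustration.

```python
def block_sigmoid_inputs(block_id, block_dim, X, w, n_data):
    """Simulate one thread block; returns (dot products, global reads of w)."""
    n_features = len(w)
    w_shared = [0.0] * n_features          # stands in for __shared__ memory
    global_w_reads = 0
    # Cooperative load: thread t copies elements t, t + block_dim, ...
    for t in range(block_dim):
        for j in range(t, n_features, block_dim):
            w_shared[j] = w[j]
            global_w_reads += 1
    # (a __syncthreads() barrier would go here in CUDA)
    dots = {}
    for t in range(block_dim):
        i = block_id * block_dim + t
        if i < n_data:
            dots[i] = sum(X[i][j] * w_shared[j] for j in range(n_features))
    return dots, global_w_reads

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]
w = [1.0, 0.0, -1.0]
dots, reads = block_sigmoid_inputs(0, 4, X, w, n_data=4)
```

Here the block reads each weight from global memory once (3 reads) instead of once per thread (4 × 3 = 12 reads).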
Sigmoid Kernel - 4 (Constant memory)
• The weight values are constant within the kernel.
• So we tried storing them in constant memory.
• Problem:
– The weights need to be updated in the next kernel.
– So we need to copy the weights to the host and then copy
them back to constant memory before training on the next
batch.
– This drawback led to no improvement in the computation
speed.
Sigmoid Kernel - 5 (Parallelized reduction)
• Problems in previous kernels:
– In all the above kernels, each thread runs a loop over the feature
dimension to get the sum.
– A higher feature dimension would make it slow.
• Solution:
– Each of the consecutive threads does one multiplication xij * wj.
– It stores the result in shared memory.
– Every block does a private reduction on shared
memory to compute the result for one data point (Xi,Yi).
– We did the same with the weights in constant memory.
PS: If FEATURE_SIZE > 1024, we use thread coarsening: each thread does multiple
computations. (Data is not transposed for memory coalescing.)
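The block-level reduction described above can be sketched by simulating one block in Python. The tree-shaped reduction and the strided coarsening loop are common choices and are assumed here; the slides do not show which reduction pattern the authors used.

```python
def block_dot_product(x_row, w, block_dim):
    """Simulate one block computing x_row . w with a shared-memory tree
    reduction. block_dim must be a power of two."""
    n_features = len(w)
    shared = [0.0] * block_dim
    # Thread coarsening: thread t handles features t, t + block_dim, ...
    for t in range(block_dim):
        acc = 0.0
        for j in range(t, n_features, block_dim):
            acc += x_row[j] * w[j]
        shared[t] = acc
    # Tree reduction: halve the number of active threads each step.
    stride = block_dim // 2
    while stride > 0:
        for t in range(stride):            # threads t < stride are active
            shared[t] += shared[t + stride]
        stride //= 2                       # (barrier between steps in CUDA)
    return shared[0]

x_row = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 1.0, 1.0, 1.0, 2.0]
z = block_dot_product(x_row, w, block_dim=4)
```

With 5 features and 4 threads, thread 0 handles two products (coarsening) and the reduction finishes in two steps instead of a 5-iteration serial loop.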
[Figure: reduction step]
Data is needed in row-major format for memory coalescing.
Next sub-problem
Let's call it the Grad compute kernel
[Figure: X^T times the intermediate vector gives Grad (N_features × 1), the same shape as Weights]
Grad Computation Kernel - Basic
• 1 block in the grid, 2D block.
• Each thread computes an individual product Xij * IMi.
• Tiled computation over the entire data.
• At each tile, each thread adds its product to shared memory.
• In the end, a set of threads loops to reduce the shared
memory values.
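The basic scheme can be sketched as a Python simulation of a single 2D block that walks over the data tile by tile. The tile height `tile_rows` and the function name are assumptions; the slides do not give the block dimensions.

```python
def grad_basic(X, im, tile_rows):
    """Simulate a single 2D block of tile_rows x n_features threads."""
    n_data, n_features = len(X), len(X[0])
    # shared[r][j] accumulates the products seen by thread (r, j)
    shared = [[0.0] * n_features for _ in range(tile_rows)]
    for tile_start in range(0, n_data, tile_rows):      # loop over tiles
        for r in range(tile_rows):
            i = tile_start + r
            if i >= n_data:                             # ragged last tile
                continue
            for j in range(n_features):
                shared[r][j] += X[i][j] * im[i]
    # Final step: one thread per feature loops over the tile rows
    return [sum(shared[r][j] for r in range(tile_rows))
            for j in range(n_features)]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
im = [0.5, -1.0, 2.0]
grad = grad_basic(X, im, tile_rows=2)
```

Because only one block exists, the tile loop is serial, which is the limitation the later grad kernels address.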
Grad Computation Kernel - 2 (1D Grid, 2D block)
• Problems with previous kernel:
– Not exploiting all the threads.
• The blocks are used only in the N_data dimension.
• Instead of 1 set of threads processing all tiles, each
tile is processed by one block.
• A private reduction is applied to get each tile's value.
• Later, each block atomically adds its value to
global memory.
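A Python simulation of the 1D-grid scheme: each simulated block reduces one tile privately and then adds its partial result to the global gradient, where CUDA would use atomicAdd. The tile height and the atomic-add counter are illustrative assumptions.

```python
def grad_1d_grid(X, im, tile_rows):
    """Each simulated block reduces one tile of rows, then adds it to grad."""
    n_data, n_features = len(X), len(X[0])
    grad = [0.0] * n_features                    # global memory, zeroed first
    n_blocks = (n_data + tile_rows - 1) // tile_rows
    atomic_adds = 0
    for block in range(n_blocks):                # blocks may run in any order
        rows = range(block * tile_rows, min((block + 1) * tile_rows, n_data))
        # Private reduction over the rows of this tile (shared memory in CUDA)
        partial = [sum(X[i][j] * im[i] for i in rows)
                   for j in range(n_features)]
        for j in range(n_features):
            grad[j] += partial[j]                # atomicAdd in CUDA
            atomic_adds += 1
    return grad, atomic_adds

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
im = [0.5, -1.0, 2.0]
grad, atomic_adds = grad_1d_grid(X, im, tile_rows=2)
```

The number of atomic adds is the number of blocks times N_features, so taller tiles mean fewer atomics.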
Grad Computation Kernel - 3 (2D Grid, 2D block)
• Problems with previous kernel:
– The max number of threads in a block is limited, so we can't
increase the data_dim threads, leading to more atomic adds.
– A higher num_features dimension can't be handled.
• A 2D grid is used to run blocks in the N_data and
N_feature dimensions.
• A private reduction is applied to get each tile's value.
• Later, each block atomically adds its value to
global memory.
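The launch geometry for a 2D grid can be worked out with a small Python helper. The 32 × 32 block shape, the convention that x spans features and y spans data rows, and the helper name are assumptions; the 1024-threads-per-block limit matches the FEATURE_SIZE threshold mentioned earlier in the deck.

```python
def launch_geometry(n_data, n_features, block_x, block_y, max_threads=1024):
    """Grid size for a 2D grid of 2D blocks covering an n_data x n_features
    matrix. block_x spans features, block_y spans data rows (assumed)."""
    assert block_x * block_y <= max_threads, "block exceeds thread limit"
    grid_x = (n_features + block_x - 1) // block_x     # ceil division
    grid_y = (n_data + block_y - 1) // block_y
    # One atomic add per feature column for each block row of the grid
    atomic_adds = grid_y * n_features
    return (grid_x, grid_y), atomic_adds

# DIGIT dataset sizes from the "Datasets used" slide.
grid, adds = launch_geometry(n_data=10000, n_features=784,
                             block_x=32, block_y=32)
```

Splitting the feature dimension across blocks lets 784 features fit, while the block height in the data dimension determines how many atomic adds each feature receives.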
Transpose Kernel (For memory coalescing)
• Solving sub-problem 1 using parallelized reduction
needs data in row-major for memory coalescing.
• Solving sub-problem 2 using 1D Grid and 2D Grid
needs data in column-major for memory coalescing.
• So, a kernel is required to transpose the data matrix
X.
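A tiled transpose can be sketched in Python on a flat row-major array. This follows the general staged-tile idea used for CUDA transposes; the tile size and function name are assumptions, and the code is a CPU simulation rather than the authors' kernel.

```python
def transpose_tiled(src, n_rows, n_cols, tile=2):
    """Transpose a row-major n_rows x n_cols flat array, one tile at a time."""
    dst = [0.0] * (n_rows * n_cols)
    for tile_r in range(0, n_rows, tile):             # one block per tile
        for tile_c in range(0, n_cols, tile):
            # Stage the tile (shared memory in CUDA), reading rows in order
            staged = [[src[r * n_cols + c]
                       for c in range(tile_c, min(tile_c + tile, n_cols))]
                      for r in range(tile_r, min(tile_r + tile, n_rows))]
            # Write the tile back with row and column swapped; in CUDA the
            # staged copy lets both the read and the write be coalesced.
            for dr, row in enumerate(staged):
                for dc, value in enumerate(row):
                    r, c = tile_r + dr, tile_c + dc
                    dst[c * n_rows + r] = value
    return dst

src = [1.0, 2.0, 3.0,
       4.0, 5.0, 6.0]                                 # 2 x 3, row-major
dst = transpose_tiled(src, n_rows=2, n_cols=3)
```

The output is the 3 × 2 row-major transpose, which is the same memory image as the original matrix stored column-major.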
Weight Update Kernel
• Kernel to update the weights using the computed gradient.
• Since the weights are already in device memory,
this kernel was very inexpensive.
• The kernel exploits memory coalescing in reads and
writes.
Hardware Accelerated Exponentiation
• Sigmoid is computed repeatedly throughout the entire
computation.
• We accelerated it by using hardware-accelerated intrinsic
functions.
• __expf() instead of exp(), trading some accuracy for speed
Interleaving CPU and GPU computations using streams
• One key thing to note is that continuous training for
multiple epochs makes this process I/O expensive.
• In the baseline, data loading on the CPU and computation on the
GPU are sequential.
• We exploit this by overlapping the loading of the next batch
on the CPU with computation on the GPU, using streams and
double buffering.
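The double-buffering idea can be sketched on the CPU alone with a Python producer thread and a bounded queue. This is an analogy, not CUDA stream code: the loader thread stands in for file I/O, the `sum` call stands in for the GPU kernels, and the batch contents are hypothetical.

```python
import queue
import threading

def load_batches(n_batches, out_q):
    """Producer: stands in for the CPU reading batches from the file."""
    for b in range(n_batches):
        out_q.put([float(b)] * 4)      # hypothetical batch contents
    out_q.put(None)                    # sentinel: no more batches

def train_overlapped(n_batches):
    # maxsize=1: one batch waits in the queue while another is processed,
    # so two buffers are in flight at any time.
    q = queue.Queue(maxsize=1)
    loader = threading.Thread(target=load_batches, args=(n_batches, q))
    loader.start()
    processed = []
    while True:
        batch = q.get()                # blocks only if the loader is behind
        if batch is None:
            break
        processed.append(sum(batch))   # stands in for the GPU kernels
    loader.join()
    return processed

result = train_overlapped(3)
```

While one batch is being processed, the loader is already preparing the next, so loading time is hidden whenever it is shorter than compute time.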
Results
GPU accelerated the logistic regression computation by 57x
References
• https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/
• Class notes
• https://www.kaggle.com/c/digit-recognizer/data
• https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf
• https://archive.ics.uci.edu/ml/datasets/HIGGS
• https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
  • 1. EE-5351: Course project presentation Accelerated Logistic Regression on GPU(s) Rahul Bhojwani, Swaraj Khadanga, Anand Saharan 12/16/2018
  • 2. Outline • Problem description • Key concepts • Problem Understanding • Datasets used • Our Solutions • Results • References
  • 3. • Model training and selection is the most costly and repetitive step.
  • 5. The Logistic Regression Model ● y = f(X), where y ∈ {0, 1} ● The "logit" model solves these problems: ln[p/(1-p)] = W^T X + b ● p is the probability that the event Y occurs, p(Y=1) ● p/(1-p) is the odds ● ln[p/(1-p)] is the log odds, or "logit"
  • 6. The Logistic Regression Model ● The logistic distribution constrains the estimated probabilities to lie between 0 and 1. ● The estimated probability is: p = 1/[1 + exp(-(W^T X + b))]
  • 7. Logistic Regression Training ● Training set {(x_1, y_1), ..., (x_n, y_n)} with each y_i ∈ {0, 1} ● Likelihood, assuming independence ● Log-likelihood, to be maximized
  • 8. Logistic Regression Training ● Let [equation image] ● The negative log-likelihood, to be minimized ● The gradient of the objective function (this is the main computation that needs to be optimized) ● The weights are then updated using the gradient
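The likelihood, log-likelihood, gradient and update named on slides 7 and 8 can be written out in their standard textbook form. This is a reconstruction under assumptions, not a copy of the slides: the symbols $z_i$, $p_i$, $\sigma$, $J$ and the learning rate $\eta$ are chosen here for illustration.

```latex
\begin{aligned}
z_i &= W^T x_i + b, \qquad p_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
L(W) &= \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i} \\
\ell(W) &= \ln L(W) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right] \\
J(W) &= -\ell(W) \\
\nabla_W J &= \sum_{i=1}^{n} (p_i - y_i)\, x_i = X^T \left( \sigma(XW + b) - y \right) \\
W &\leftarrow W - \eta \, \nabla_W J
\end{aligned}
```

The vector $\sigma(XW + b) - y$ is the "intermediate vector" of the training routine, and multiplying it by $X^T$ is the gradient computation.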
  • 9. Problem understanding [Figure: data matrix X (N_Data × N_features), label vector y (N_Data × 1), weight vector w (N_features × 1)] N_Data ~ 10^6 - 10^8, N_features ~ 10^1 - 10^3
  • 10. Problem understanding: let's call it the Sigmoid kernel [Figure: X multiplied by w] ● Apply sigmoid to each element of the product ● Subtract y
  • 11. Problem understanding: let's call it the Grad_compute kernel [Figure: X combined with the intermediate vector gives Grad; Grad and Weights are N_features × 1]
  • 12. Training Routine (Pseudo Code)
      1. initialize params
      2. for epoch in 1, 2, 3, ..., N:
         a. get the next batch from the file
         b. compute the intermediate vector [sigmoid(w.T*x) - y]
         c. compute the gradient
         d. update the weights
      3. repeat step 2 while there is a next batch
      4. end
  • 13. Datasets used • HIGGS Dataset – N_features = 28 – N_data = 500000 • DIGIT Dataset – N_features = 784 – N_data = 10000 • We couldn’t load the entire HIGGS dataset on the machine, so N_data was small. • We ran multiple epochs to increase the effective N_data dimension in both cases.
  • 15. Sigmoid Kernel 1 • Each thread processes one data point (Xi, Yi) • X is stored in row-major format • Uncoalesced access: consecutive threads read addresses that are N_features elements apart
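A minimal sketch of what such a kernel could look like, assuming single-precision data and no bias term. The names `sigmoidRowMajor`, `toDevice` and `runSigmoidRowMajor` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per data point i. X is row-major: element (i, j) is X[i * nFeatures + j].
// For a fixed j, neighbouring threads read addresses nFeatures floats apart (uncoalesced).
__global__ void sigmoidRowMajor(const float* X, const float* y, const float* w,
                                float* im, int nData, int nFeatures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nData) return;
    float z = 0.0f;
    for (int j = 0; j < nFeatures; ++j)
        z += X[(size_t)i * nFeatures + j] * w[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];   // intermediate vector: sigmoid(w.x) - y
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Host wrapper: nData is taken from y.size(), nFeatures from w.size().
std::vector<float> runSigmoidRowMajor(const std::vector<float>& X,
                                      const std::vector<float>& y,
                                      const std::vector<float>& w) {
    int nData = (int)y.size(), nFeatures = (int)w.size();
    float *dX = toDevice(X), *dY = toDevice(y), *dW = toDevice(w), *dIm = nullptr;
    cudaMalloc(&dIm, nData * sizeof(float));
    sigmoidRowMajor<<<(nData + 255) / 256, 256>>>(dX, dY, dW, dIm, nData, nFeatures);
    std::vector<float> im(nData);
    cudaMemcpy(im.data(), dIm, nData * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dW); cudaFree(dIm);
    return im;
}
```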
  • 16. Figure for understanding: Sigmoid kernel (apply sigmoid to each element of the product of X and w, then subtract y)
  • 17. Sigmoid Kernel 2 • Each thread processes one data point (Xi, Yi) • X is stored in column-major format • Coalesced access: consecutive threads read consecutive addresses
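The only change needed for coalescing is the index expression. The sketch below assumes the same setup as a one-thread-per-data-point kernel with X stored column-major; `sigmoidColMajor`, `toDevice` and `runSigmoidColMajor` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per data point i. X is column-major: element (i, j) is X[j * nData + i].
// For a fixed j, neighbouring threads read neighbouring floats, so warp loads coalesce.
__global__ void sigmoidColMajor(const float* X, const float* y, const float* w,
                                float* im, int nData, int nFeatures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nData) return;
    float z = 0.0f;
    for (int j = 0; j < nFeatures; ++j)
        z += X[(size_t)j * nData + i] * w[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Host wrapper: nData is taken from y.size(), nFeatures from w.size().
std::vector<float> runSigmoidColMajor(const std::vector<float>& X,
                                      const std::vector<float>& y,
                                      const std::vector<float>& w) {
    int nData = (int)y.size(), nFeatures = (int)w.size();
    float *dX = toDevice(X), *dY = toDevice(y), *dW = toDevice(w), *dIm = nullptr;
    cudaMalloc(&dIm, nData * sizeof(float));
    sigmoidColMajor<<<(nData + 255) / 256, 256>>>(dX, dY, dW, dIm, nData, nFeatures);
    std::vector<float> im(nData);
    cudaMemcpy(im.data(), dIm, nData * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dW); cudaFree(dIm);
    return im;
}
```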
  • 18. Figure for understanding: Sigmoid kernel (apply sigmoid to each element of the product of X and w, then subtract y)
  • 19. Sigmoid Kernel 3 (shared memory) • The weights are reused by every thread. • So they are loaded into shared memory.
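One way to stage the weights in shared memory is sketched below. It assumes column-major X and a weight vector small enough to fit in a block's shared memory; `sigmoidSharedW`, `toDevice` and `runSigmoidSharedW` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block first copies w into shared memory cooperatively, then every thread
// handles one data point and reads the weights from the shared copy.
__global__ void sigmoidSharedW(const float* X, const float* y, const float* w,
                               float* im, int nData, int nFeatures) {
    extern __shared__ float wShared[];                 // nFeatures floats, sized at launch
    for (int j = threadIdx.x; j < nFeatures; j += blockDim.x)
        wShared[j] = w[j];
    __syncthreads();                                   // all threads reach this, even idle ones
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nData) {
        float z = 0.0f;
        for (int j = 0; j < nFeatures; ++j)
            z += X[(size_t)j * nData + i] * wShared[j];  // X is column-major
        im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
    }
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

std::vector<float> runSigmoidSharedW(const std::vector<float>& X,
                                     const std::vector<float>& y,
                                     const std::vector<float>& w) {
    int nData = (int)y.size(), nFeatures = (int)w.size();
    float *dX = toDevice(X), *dY = toDevice(y), *dW = toDevice(w), *dIm = nullptr;
    cudaMalloc(&dIm, nData * sizeof(float));
    size_t sharedBytes = nFeatures * sizeof(float);
    sigmoidSharedW<<<(nData + 255) / 256, 256, sharedBytes>>>(dX, dY, dW, dIm, nData, nFeatures);
    std::vector<float> im(nData);
    cudaMemcpy(im.data(), dIm, nData * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dW); cudaFree(dIm);
    return im;
}
```

The bounds check comes after `__syncthreads()` on purpose: returning early before the barrier would leave some threads of the last block out of the synchronization.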
  • 20. Sigmoid Kernel 4 (constant memory) • The weight values are constant within the kernel. • So we tried storing them in constant memory. • Problem: – The weights need to be updated in the next kernel. – So the weights have to be copied to the host and then back to constant memory before training on the next batch. – Because of this drawback, there was no improvement in computation speed.
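A hedged sketch of the constant-memory variant follows. The bound `MAX_FEATURES` and the names `wConst`, `sigmoidConstW` and `runSigmoidConstW` are assumptions made for illustration. The sketch refreshes the constant buffer with `cudaMemcpyToSymbol` using `cudaMemcpyDeviceToDevice`, which differs from the host round trip described on the slide; whether that removes the overhead the authors measured is not something this sketch establishes.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAX_FEATURES 1024                  // assumed bound; constant memory totals 64 KB
__constant__ float wConst[MAX_FEATURES];   // read-only inside kernels

// One thread per data point, X column-major. All threads of a warp read the same
// wConst[j] in the same iteration, which is the access pattern constant memory favours.
__global__ void sigmoidConstW(const float* X, const float* y, float* im,
                              int nData, int nFeatures) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nData) return;
    float z = 0.0f;
    for (int j = 0; j < nFeatures; ++j)
        z += X[(size_t)j * nData + i] * wConst[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

std::vector<float> runSigmoidConstW(const std::vector<float>& X,
                                    const std::vector<float>& y,
                                    const std::vector<float>& w) {
    int nData = (int)y.size(), nFeatures = (int)w.size();
    if (nFeatures > MAX_FEATURES) return {};           // does not fit in wConst
    float *dX = toDevice(X), *dY = toDevice(y), *dW = toDevice(w), *dIm = nullptr;
    cudaMalloc(&dIm, nData * sizeof(float));
    // Weights that already live in device memory are copied straight into the symbol.
    cudaMemcpyToSymbol(wConst, dW, nFeatures * sizeof(float), 0, cudaMemcpyDeviceToDevice);
    sigmoidConstW<<<(nData + 255) / 256, 256>>>(dX, dY, dIm, nData, nFeatures);
    std::vector<float> im(nData);
    cudaMemcpy(im.data(), dIm, nData * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dW); cudaFree(dIm);
    return im;
}
```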
  • 21. Sigmoid Kernel 5 (parallelized reduction) • Problems in previous kernels: – All of the above kernels run a loop over the feature dimension to compute the sum. – A higher feature dimension makes them slow. • Solution: – Each of the consecutive threads does one multiplication xij * wj. – Each thread stores its result in shared memory. – Every block does a private reduction on shared memory to compute the result for one data point (Xi, Yi). – We did the same with the weights in constant memory. PS: if FEATURE_SIZE > 1024, thread coarsening is used, so each thread does multiple computations. (Data is not transposed for memory coalescing.)
  • 22. Figure for understanding: Sigmoid kernel (apply sigmoid to each element of the product of X and w, then subtract y)
  • 23. Sigmoid Kernel 5 (parallelized reduction) [Figure: reduction step] Data is needed in row-major format for memory coalescing.
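The block-per-data-point reduction can be sketched as follows. It assumes row-major X, a power-of-two block size (256 here, an arbitrary choice) and a strided loop as the thread-coarsening step; `sigmoidReduce`, `toDevice` and `runSigmoidReduce` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per data point. Threads stride over the features (thread coarsening when
// nFeatures > blockDim.x), store partial sums in shared memory, then tree-reduce them.
// X is row-major, so consecutive threads read consecutive features (coalesced).
__global__ void sigmoidReduce(const float* X, const float* y, const float* w,
                              float* im, int nFeatures) {
    extern __shared__ float partial[];     // blockDim.x floats
    int i = blockIdx.x;                    // data point handled by this block
    int t = threadIdx.x;
    float sum = 0.0f;
    for (int j = t; j < nFeatures; j += blockDim.x)
        sum += X[(size_t)i * nFeatures + j] * w[j];
    partial[t] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // requires power-of-two blockDim.x
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) im[i] = 1.0f / (1.0f + expf(-partial[0])) - y[i];
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

std::vector<float> runSigmoidReduce(const std::vector<float>& X,
                                    const std::vector<float>& y,
                                    const std::vector<float>& w) {
    int nData = (int)y.size(), nFeatures = (int)w.size();
    const int threads = 256;
    float *dX = toDevice(X), *dY = toDevice(y), *dW = toDevice(w), *dIm = nullptr;
    cudaMalloc(&dIm, nData * sizeof(float));
    sigmoidReduce<<<nData, threads, threads * sizeof(float)>>>(dX, dY, dW, dIm, nFeatures);
    std::vector<float> im(nData);
    cudaMemcpy(im.data(), dIm, nData * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dW); cudaFree(dIm);
    return im;
}
```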
  • 24. Next sub-problem: let's call it the Grad compute kernel [Figure: X combined with the intermediate vector gives Grad; Grad and Weights are N_features × 1]
  • 25. Grad Computation Kernel - Basic • 1 block in the grid, 2D block. • Each thread computes an individual product Xij * IMi, where IM is the intermediate vector. • Tiled computation over the entire data. • At each tile, the products are added to shared memory. • At the end, a set of threads loops over shared memory to reduce the values.
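A single-block version along these lines might look like the sketch below. The thread mapping (x along data, y along features), the tile sizes, the column-major layout of X and the names `gradBasic`, `toDevice` and `runGradBasic` are all assumptions; with one block it only handles nFeatures up to TILE_F.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_D 32   // threads along the data dimension (threadIdx.x)
#define TILE_F 32   // threads along the feature dimension (threadIdx.y)

// Single block. Thread (tx, j) walks over the data in tiles of TILE_D rows and
// accumulates X(i, j) * im[i] into its own shared-memory slot; afterwards the tx == 0
// threads loop over the slots of their feature to produce grad[j].
__global__ void gradBasic(const float* X, const float* im, float* grad,
                          int nData, int nFeatures) {
    __shared__ float acc[TILE_F][TILE_D];
    int tx = threadIdx.x, j = threadIdx.y;
    acc[j][tx] = 0.0f;
    if (j < nFeatures)
        for (int i = tx; i < nData; i += TILE_D)          // one step per tile of data
            acc[j][tx] += X[(size_t)j * nData + i] * im[i];  // X is column-major
    __syncthreads();
    if (tx == 0 && j < nFeatures) {
        float s = 0.0f;
        for (int k = 0; k < TILE_D; ++k) s += acc[j][k];
        grad[j] = s;
    }
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Returns an empty vector if nFeatures exceeds what one block can cover.
std::vector<float> runGradBasic(const std::vector<float>& X,
                                const std::vector<float>& im, int nFeatures) {
    if (nFeatures > TILE_F) return {};
    int nData = (int)im.size();
    float *dX = toDevice(X), *dIm = toDevice(im), *dGrad = nullptr;
    cudaMalloc(&dGrad, nFeatures * sizeof(float));
    gradBasic<<<1, dim3(TILE_D, TILE_F)>>>(dX, dIm, dGrad, nData, nFeatures);
    std::vector<float> grad(nFeatures);
    cudaMemcpy(grad.data(), dGrad, nFeatures * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dIm); cudaFree(dGrad);
    return grad;
}
```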
  • 26. Figure for explanation: Grad compute kernel [Figure: X combined with the intermediate vector gives Grad; Grad and Weights are N_features × 1]
  • 28. Grad Computation Kernel 2 (1D grid, 2D block) • Problem with the previous kernel: – It does not exploit all the threads. • Blocks are used only along the N_data dimension. • Instead of one set of threads processing all tiles, each tile is processed by one block. • A private reduction is applied to get each tile's value. • Each block then atomically adds its value to global memory.
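The one-block-per-tile scheme with a private reduction and a final atomic add can be sketched like this. Tile sizes, the column-major layout of X and the names `gradAtomic1D`, `toDevice` and `runGradAtomic1D` are assumptions; the gradient buffer must be zeroed before the launch, and nFeatures is limited to TILE_F.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_D 32   // data points per block (threadIdx.x), must be a power of two
#define TILE_F 32   // features per block (threadIdx.y)

// 1D grid over the data dimension. Each block multiplies one tile of X with the matching
// slice of im, reduces the products privately in shared memory, then adds one value
// per feature to global memory with atomicAdd.
__global__ void gradAtomic1D(const float* X, const float* im, float* grad,
                             int nData, int nFeatures) {
    __shared__ float acc[TILE_F][TILE_D];
    int tx = threadIdx.x, j = threadIdx.y;
    int i = blockIdx.x * TILE_D + tx;
    acc[j][tx] = (i < nData && j < nFeatures) ? X[(size_t)j * nData + i] * im[i] : 0.0f;
    __syncthreads();
    for (int s = TILE_D / 2; s > 0; s >>= 1) {        // private tree reduction per feature
        if (tx < s) acc[j][tx] += acc[j][tx + s];
        __syncthreads();
    }
    if (tx == 0 && j < nFeatures) atomicAdd(&grad[j], acc[j][0]);
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Returns an empty vector if nFeatures exceeds what the block's y dimension covers.
std::vector<float> runGradAtomic1D(const std::vector<float>& X,
                                   const std::vector<float>& im, int nFeatures) {
    if (nFeatures > TILE_F) return {};
    int nData = (int)im.size();
    float *dX = toDevice(X), *dIm = toDevice(im), *dGrad = nullptr;
    cudaMalloc(&dGrad, nFeatures * sizeof(float));
    cudaMemset(dGrad, 0, nFeatures * sizeof(float));   // atomicAdd accumulates into zeros
    int blocks = (nData + TILE_D - 1) / TILE_D;
    gradAtomic1D<<<blocks, dim3(TILE_D, TILE_F)>>>(dX, dIm, dGrad, nData, nFeatures);
    std::vector<float> grad(nFeatures);
    cudaMemcpy(grad.data(), dGrad, nFeatures * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dIm); cudaFree(dGrad);
    return grad;
}
```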
  • 29. Figure for explanation: Grad compute kernel [Figure: X combined with the intermediate vector gives Grad; Grad and Weights are N_features × 1]
  • 30. Grad Computation Kernel-2(1D Grid, 2D block)
  • 31. Grad Computation Kernel-2 (2D Grid, 2D block) • Problems with the previous kernel: – The maximum number of threads in a block is limited, so the number of threads along the data dimension can't be increased, leading to more atomic adds. – A higher num_features dimension can't be handled. • A 2D grid is used to run blocks along both the N_data and N_features dimensions. • A private reduction is applied to get each tile's value. • Each block then atomically adds its value to global memory.
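Extending the grid to two dimensions mainly changes how the feature index is derived. A minimal sketch, assuming column-major X, 32 × 32 tiles and a zero-initialized gradient buffer; `gradAtomic2D`, `toDevice` and `runGradAtomic2D` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_D 32   // data points per block (threadIdx.x), must be a power of two
#define TILE_F 32   // features per block (threadIdx.y)

// 2D grid: blockIdx.x selects a tile of data points, blockIdx.y a tile of features,
// so any number of features can be covered.
__global__ void gradAtomic2D(const float* X, const float* im, float* grad,
                             int nData, int nFeatures) {
    __shared__ float acc[TILE_F][TILE_D];
    int tx = threadIdx.x, ty = threadIdx.y;
    int i = blockIdx.x * TILE_D + tx;      // data index
    int j = blockIdx.y * TILE_F + ty;      // feature index
    acc[ty][tx] = (i < nData && j < nFeatures) ? X[(size_t)j * nData + i] * im[i] : 0.0f;
    __syncthreads();
    for (int s = TILE_D / 2; s > 0; s >>= 1) {        // private tree reduction per feature
        if (tx < s) acc[ty][tx] += acc[ty][tx + s];
        __syncthreads();
    }
    if (tx == 0 && j < nFeatures) atomicAdd(&grad[j], acc[ty][0]);
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

std::vector<float> runGradAtomic2D(const std::vector<float>& X,
                                   const std::vector<float>& im, int nFeatures) {
    int nData = (int)im.size();
    float *dX = toDevice(X), *dIm = toDevice(im), *dGrad = nullptr;
    cudaMalloc(&dGrad, nFeatures * sizeof(float));
    cudaMemset(dGrad, 0, nFeatures * sizeof(float));   // atomicAdd accumulates into zeros
    dim3 grid((nData + TILE_D - 1) / TILE_D, (nFeatures + TILE_F - 1) / TILE_F);
    gradAtomic2D<<<grid, dim3(TILE_D, TILE_F)>>>(dX, dIm, dGrad, nData, nFeatures);
    std::vector<float> grad(nFeatures);
    cudaMemcpy(grad.data(), dGrad, nFeatures * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dIm); cudaFree(dGrad);
    return grad;
}
```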
  • 32. Grad Computation Kernel-2(2D Grid, 2D block)
  • 33. Transpose Kernel (for memory coalescing) • Solving sub-problem 1 with the parallelized reduction needs the data in row-major format for memory coalescing. • Solving sub-problem 2 with the 1D-grid and 2D-grid kernels needs the data in column-major format for memory coalescing. • So a kernel is required to transpose the data matrix X.
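A common way to write such a transpose is the tiled shared-memory pattern, sketched here under assumptions: 32 × 32 tiles, one padding column to avoid shared-memory bank conflicts, and the hypothetical names `transposeTiled`, `toDevice` and `runTranspose`. It is not claimed to be the authors' kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32

// Transposes a row-major rows x cols matrix into a row-major cols x rows matrix
// (equivalently: row-major to column-major). Reads and writes are both coalesced
// because the tile is turned around inside shared memory.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];            // +1 column avoids bank conflicts
    int r = blockIdx.x * TILE + threadIdx.y;          // input row
    int c = blockIdx.y * TILE + threadIdx.x;          // input column
    if (r < rows && c < cols)
        tile[threadIdx.y][threadIdx.x] = in[(size_t)r * cols + c];
    __syncthreads();
    int outR = blockIdx.y * TILE + threadIdx.y;       // output row = input column
    int outC = blockIdx.x * TILE + threadIdx.x;       // output column = input row
    if (outR < cols && outC < rows)
        out[(size_t)outR * rows + outC] = tile[threadIdx.x][threadIdx.y];
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

std::vector<float> runTranspose(const std::vector<float>& in, int rows, int cols) {
    float *dIn = toDevice(in), *dOut = nullptr;
    cudaMalloc(&dOut, in.size() * sizeof(float));
    dim3 grid((rows + TILE - 1) / TILE, (cols + TILE - 1) / TILE);
    transposeTiled<<<grid, dim3(TILE, TILE)>>>(dIn, dOut, rows, cols);
    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), dOut, in.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    return out;
}
```

The data dimension is mapped to the grid's x axis because that axis allows far more blocks than the y axis, which matters when rows is N_data.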
  • 34. Weight Update Kernel • A kernel updates the weights using the computed gradient. • Since the weights are already in device memory, this kernel is very inexpensive. • The kernel exploits memory coalescing in both reads and writes.
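As an illustration, a plain gradient-descent step on device-resident weights could be written as below. The update rule with a scalar learning rate `lr` and the names `updateWeights`, `toDevice` and `runUpdateWeights` are assumptions, since the slide's formula is not available here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per weight: w[j] <- w[j] - lr * grad[j].
// Consecutive threads touch consecutive elements, so reads and writes coalesce.
__global__ void updateWeights(float* w, const float* grad, float lr, int nFeatures) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < nFeatures) w[j] -= lr * grad[j];
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Host wrapper used only to exercise the kernel; in training, w and grad stay on the device.
std::vector<float> runUpdateWeights(const std::vector<float>& w,
                                    const std::vector<float>& grad, float lr) {
    int n = (int)w.size();
    float *dW = toDevice(w), *dGrad = toDevice(grad);
    updateWeights<<<(n + 255) / 256, 256>>>(dW, dGrad, lr, n);
    std::vector<float> updated(n);
    cudaMemcpy(updated.data(), dW, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dW); cudaFree(dGrad);
    return updated;
}
```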
  • 35. Hardware Accelerated Exponentiation • The sigmoid is evaluated repeatedly throughout this computation. • We accelerated it by using hardware-accelerated intrinsic functions. • __expf() instead of exp()
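The trade-off is speed against accuracy: `__expf()` is a fast intrinsic with reduced precision compared to `expf()`. The sketch below evaluates the sigmoid both ways so the difference can be inspected; `sigmoidBoth`, `toDevice` and `maxSigmoidDifference` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Evaluates the sigmoid with the standard expf() and with the fast intrinsic __expf().
__global__ void sigmoidBoth(const float* z, float* accurate, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    accurate[i] = 1.0f / (1.0f + expf(-z[i]));     // standard single-precision exp
    fast[i]     = 1.0f / (1.0f + __expf(-z[i]));   // hardware-accelerated, less accurate
}

// Copy a host vector into freshly allocated device memory.
static float* toDevice(const std::vector<float>& h) {
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Returns the largest absolute difference between the two sigmoid variants.
float maxSigmoidDifference(const std::vector<float>& z) {
    int n = (int)z.size();
    float *dZ = toDevice(z), *dAcc = nullptr, *dFast = nullptr;
    cudaMalloc(&dAcc, n * sizeof(float));
    cudaMalloc(&dFast, n * sizeof(float));
    sigmoidBoth<<<(n + 255) / 256, 256>>>(dZ, dAcc, dFast, n);
    std::vector<float> acc(n), fast(n);
    cudaMemcpy(acc.data(), dAcc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(fast.data(), dFast, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dZ); cudaFree(dAcc); cudaFree(dFast);
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i) maxDiff = std::max(maxDiff, std::fabs(acc[i] - fast[i]));
    return maxDiff;
}
```

Compiling with nvcc's `-use_fast_math` flag maps `expf()` to the intrinsic as well, in which case both columns coincide.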
  • 36. Interleaving CPU and GPU computations using streams • One key thing to note is that continuous training for multiple epochs makes this process I/O expensive. • Currently, data loading on the CPU and computation on the GPU are sequential. • We exploit this by overlapping the loading of the next batch on the CPU with computation on the GPU, using streams and double buffering.
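The double-buffering idea can be sketched with two streams and two pinned host buffers. Everything named here is a stand-in: `loadBatch` fakes the file read, `consumeBatch` replaces the real training kernels with an order-independent sum, and `trainDoubleBuffered` is a hypothetical driver. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Stand-in for the training kernels: adds every element of the batch to *total.
__global__ void consumeBatch(const float* batch, int n, float* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, batch[i]);
}

// Hypothetical stand-in for reading one batch from disk on the CPU.
static void loadBatch(float* dst, int n, int batchIdx) {
    for (int i = 0; i < n; ++i) dst[i] = (float)(batchIdx + 1);
}

float trainDoubleBuffered(int nBatches, int n) {
    cudaStream_t stream[2];
    float *hBuf[2], *dBuf[2], *dTotal = nullptr;
    size_t bytes = (size_t)n * sizeof(float);
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost(&hBuf[s], bytes);    // pinned host memory, needed for async copies
        cudaMalloc(&dBuf[s], bytes);
    }
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemset(dTotal, 0, sizeof(float));
    for (int b = 0; b < nBatches; ++b) {
        int s = b % 2;                      // alternate between the two buffers
        cudaStreamSynchronize(stream[s]);   // buffer s is free once its earlier work is done
        loadBatch(hBuf[s], n, b);           // CPU load overlaps GPU work in the other stream
        cudaMemcpyAsync(dBuf[s], hBuf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        consumeBatch<<<(n + 255) / 256, 256, 0, stream[s]>>>(dBuf[s], n, dTotal);
    }
    cudaDeviceSynchronize();
    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(hBuf[s]); cudaFree(dBuf[s]); cudaStreamDestroy(stream[s]);
    }
    cudaFree(dTotal);
    return total;
}
```

In real training each batch's weight update depends on the previous one, so the compute kernels of consecutive batches would also need to be ordered, for example with events and `cudaStreamWaitEvent`; the toy sum above does not need that.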
  • 37. Results GPU accelerated the logistic regression computation by 57x
  • 38. References • https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/ • Class notes • https://www.kaggle.com/c/digit-recognizer/data • https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf • https://archive.ics.uci.edu/ml/datasets/HIGGS • https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/