SlideShare a Scribd company logo
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
CuMF: Large-Scale Matrix Factorization on
Just One Machine with GPUs
Agenda
 Why accelerate recommendation (matrix factorization) using GPUs?
 What are the challenges?
 How cuMF tackles the challenges?
 So what is the result?
2
Why we need a fast and scalable recommender system?
3
 Recommendation is pervasive
 Drive 80% of Netflix’s watch hours1
 Digital ads market in US: 37.3 billion2
 Need: fast, scalable, economic
ALS (alternating least square) for collaborative filtering
4
 Input: users ratings on some items
 Output: all missing ratings
 How: factorize the rating matrix R into
and minimize the lost on observed ratings:
 ALS: iteratively solve
R
X
xu
ΘT θv
Matrix factorization/ALS is versatile
5
item
X
xu
ΘT θv
Recommendation
user
X
ΘT
word
wordWord
embedding
X
ΘT
word
documentTopic
model
a very important algorithm
to accelerate
Challenges of fast and scalable ALS
6
• ALS needs to solve:
spmm: cublas
• Challenge 1: access and
aggregate many θvs:
irregular (R is sparse) and
memory intensive
LU or Cholesky
decomposition: cublas
• Challenge 2:
Single GPU can NOT handle big m, n and nnz
Challenge 1: memory access
7
 Nvidia K40: Memory BW: 288 GB/sec, compute: 4.3 Tflops/sec
 Higher flops  higher op intensity (more flops per byte)  caching!
Operational intensity (Flops/Byte)
Gflops/s
4.3T
1
288G ×
15
×
under-utilized
fully-utilized
Address challenge 1: memory-optimized ALS
8
• To obtain
1. Reuse θv s for many users
2. Load a portion into smem
3. Tile and aggregate in register
Address challenge 1: memory-optimized ALS
9
f
f
T
T
T
T
Address challenge 2: scale-up ALS on multiple GPUs
 Model parallel: solve a
portion of the model
10
model
parallel
Address challenge 2: scale-up ALS on multiple GPUs
 Data parallel: solve
with a portion of the
training data
11
Data parallel
model
parallel
Address challenge 2: parallel reduction
 Data parallel needs cross-GPU reduction
12
One-phase parallel reduction. Two-phase parallel reduction.
Inter-socketIntra-socket
Recap: cuMF tackled two challenges
13
• ALS needs to solve:
spmm: cublas
• Challenge 1: access and
aggregate many θvs:
irregular (R is sparse) and
memory intensive
LU or Cholesky
decomposition: cublas
• Challenge 2:
Single GPU can NOT handle big m, n and nnz
Connect cuMF to Spark MLlib
14
 Spark applications relying on mllib/ALS need no change
 Modified mllib/ALS detects GPU and offload computation
 Leverage the best of Spark (scale-out) and GPU (scale-up)
ALS apps
mllib/ALS
cuMFJNI
Connect cuMF to Spark MLlib
1 Power 8 node + 2 K40
CUDA
kernel
GPU1
RDD
CUDA
kernel
…RDD RDD RDD
CUDA
kernel
GPU2
RDD
CUDA
kernel
…RDD RDD RDD
shuffle
1 Power 8 node + 2 K40
CUDA
kernel
GPU1
RDD
CUDA
kernel
…RDD RDD RDD
shuffle
…
 RDD on CPU: to distribute rating data and shuffle parameters
 Solver on GPU: to form and solve
 Able to run on multiple nodes, and multiple GPUs per node
15
Implementation
16
 In C (circa. 10k LOC)
 CUDA 7.0/7.5, GCC OpenMP v3.0
 Baselines:
– Libmf: SGD on 1 node [RecSys14]
– NOMAD: SGD on >1 nodes [VLDB 14]
– SparkALS: ALS on Spark
– FactorBird: SGD + parameter server for MF
– Facebook: enhanced Giraph
CuMF performance
17
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• 1 GPU vs. 30 cores, CuMF slightly
faster than libmf and NOMAD
• CuMF scales well on 1, 2, 4 GPUs
Effectiveness of memory optimization
18
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• Aggressively using registers  2x
• Using texture  25%-35% faster
CuMF performance and cost
19
• CuMF @4 GPUs ≈ NOMAD @64 HPC nodes
≈ 10x NOMAD @32 AWS nodes
• CuMF @4 GPUs ≈ 10x SparkALS @50 nodes
≈ 1% of its cost
CuMF accelerated Spark on Power 8
20
• CuMF @2 K40s achieves 6+x speedup in training (193 sec  1.3k sec)
*GUI designed by Amir Sanjar
Conclusion
 Why accelerate recommendation (matrix factorization) using GPU?
– Need to be fast, scalable and economic
 What are the challenges?
– Memory access, scale to multiple GPUs
 How cuMF tackles the challenges?
– Optimize memory access, parallelism and communication
 So what is the result?
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone or with Spark
– GPU can tackle ML problems beyond deep learning!
21
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
http://ibm.biz/wei_tan
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs.
Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016,
http://arxiv.org/abs/1603.03820
Source code available soon.

More Related Content

What's hot

Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
jemin lee
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Carlo C. del Mundo
 
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
AMD Developer Central
 
Practical Guidelines for Solving Difficult Mixed Integer Programs
Practical Guidelines for Solving Difficult  Mixed Integer ProgramsPractical Guidelines for Solving Difficult  Mixed Integer Programs
Practical Guidelines for Solving Difficult Mixed Integer Programs
IBM Decision Optimization
 
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
jemin lee
 
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation DataUK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
Altair
 
P.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGAP.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGA
NECST Lab @ Politecnico di Milano
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
optimizatiodirectdirect
 
Innovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilitiesInnovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilities
IBM Decision Optimization
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performance
s.rohit
 

What's hot (11)

Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
 
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
 
Practical Guidelines for Solving Difficult Mixed Integer Programs
Practical Guidelines for Solving Difficult  Mixed Integer ProgramsPractical Guidelines for Solving Difficult  Mixed Integer Programs
Practical Guidelines for Solving Difficult Mixed Integer Programs
 
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
 
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation DataUK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
 
Ch1
Ch1Ch1
Ch1
 
P.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGAP.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGA
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
 
Innovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilitiesInnovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilities
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performance
 

Similar to S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalJunli Gu
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Bill Liu
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Databricks
 
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...John Gunnels
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
Faisal Siddiqi
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
byteLAKE
 
PoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HAPoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HA
Ulf Wendel
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
Ganesan Narayanasamy
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Koichi Shirahata
 
Jvm problem diagnostics
Jvm problem diagnosticsJvm problem diagnostics
Jvm problem diagnostics
Danijel Mitar
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
Ganesan Narayanasamy
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
inside-BigData.com
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D... Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Databricks
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
Shivalik college of engineering
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Databricks
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
Facultad de Informática UCM
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Intel® Software
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
inside-BigData.com
 

Similar to S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs (20)

OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
 
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
PoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HAPoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HA
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Jvm problem diagnostics
Jvm problem diagnosticsJvm problem diagnostics
Jvm problem diagnostics
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D... Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
 

Recently uploaded

1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
Gen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needsGen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needs
Laura Szabó
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
CIOWomenMagazine
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
nhiyenphan2005
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
cuobya
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
harveenkaur52
 
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
zoowe
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
cuobya
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
Understanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdfUnderstanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdf
SEO Article Boost
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Florence Consulting
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
vmemo1
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
Trish Parr
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
Danica Gill
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 

Recently uploaded (20)

1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
Gen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needsGen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needs
 
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
Internet of Things in Manufacturing: Revolutionizing Efficiency & Quality | C...
 
Bài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docxBài tập unit 1 English in the world.docx
Bài tập unit 1 English in the world.docx
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027Italy Agriculture Equipment Market Outlook to 2027
Italy Agriculture Equipment Market Outlook to 2027
 
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
Understanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdfUnderstanding User Behavior with Google Analytics.pdf
Understanding User Behavior with Google Analytics.pdf
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 

S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

  • 1. Wei Tan, IBM T. J. Watson Research Center wtan@us.ibm.com CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs
  • 2. Agenda  Why accelerate recommendation (matrix factorization) using GPUs?  What are the challenges?  How cuMF tackles the challenges?  So what is the result? 2
  • 3. Why we need a fast and scalable recommender system? 3  Recommendation is pervasive  Drive 80% of Netflix’s watch hours1  Digital ads market in US: 37.3 billion2  Need: fast, scalable, economic
  • 4. ALS (alternating least square) for collaborative filtering 4  Input: users ratings on some items  Output: all missing ratings  How: factorize the rating matrix R into and minimize the lost on observed ratings:  ALS: iteratively solve R X xu ΘT θv
  • 5. Matrix factorization/ALS is versatile 5 item X xu ΘT θv Recommendation user X ΘT word wordWord embedding X ΘT word documentTopic model a very important algorithm to accelerate
  • 6. Challenges of fast and scalable ALS 6 • ALS needs to solve: spmm: cublas • Challenge 1: access and aggregate many θvs: irregular (R is sparse) and memory intensive LU or Cholesky decomposition: cublas • Challenge 2: Single GPU can NOT handle big m, n and nnz
  • 7. Challenge 1: memory access 7  Nvidia K40: Memory BW: 288 GB/sec, compute: 4.3 Tflops/sec  Higher flops  higher op intensity (more flops per byte)  caching! Operational intensity (Flops/Byte) Gflops/s 4.3T 1 288G × 15 × under-utilized fully-utilized
  • 8. Address challenge 1: memory-optimized ALS 8 • To obtain 1. Reuse θv s for many users 2. Load a portion into smem 3. Tile and aggregate in register
  • 9. Address challenge 1: memory-optimized ALS 9 f f T T T T
  • 10. Address challenge 2: scale-up ALS on multiple GPUs  Model parallel: solve a portion of the model 10 model parallel
  • 11. Address challenge 2: scale-up ALS on multiple GPUs  Data parallel: solve with a portion of the training data 11 Data parallel model parallel
  • 12. Address challenge 2: parallel reduction  Data parallel needs cross-GPU reduction 12 One-phase parallel reduction. Two-phase parallel reduction. Inter-socketIntra-socket
  • 13. Recap: cuMF tackled two challenges 13 • ALS needs to solve: spmm: cublas • Challenge 1: access and aggregate many θvs: irregular (R is sparse) and memory intensive LU or Cholesky decomposition: cublas • Challenge 2: Single GPU can NOT handle big m, n and nnz
  • 14. Connect cuMF to Spark MLlib 14  Spark applications relying on mllib/ALS need no change  Modified mllib/ALS detects GPU and offload computation  Leverage the best of Spark (scale-out) and GPU (scale-up) ALS apps mllib/ALS cuMFJNI
  • 15. Connect cuMF to Spark MLlib 1 Power 8 node + 2 K40 CUDA kernel GPU1 RDD CUDA kernel …RDD RDD RDD CUDA kernel GPU2 RDD CUDA kernel …RDD RDD RDD shuffle 1 Power 8 node + 2 K40 CUDA kernel GPU1 RDD CUDA kernel …RDD RDD RDD shuffle …  RDD on CPU: to distribute rating data and shuffle parameters  Solver on GPU: to form and solve  Able to run on multiple nodes, and multiple GPUs per node 15
  • 16. Implementation 16  In C (circa. 10k LOC)  CUDA 7.0/7.5, GCC OpenMP v3.0  Baselines: – Libmf: SGD on 1 node [RecSys14] – NOMAD: SGD on >1 nodes [VLDB 14] – SparkALS: ALS on Spark – FactorBird: SGD + parameter server for MF – Facebook: enhanced Giraph
  • 17. CuMF performance 17 X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set • 1 GPU vs. 30 cores, CuMF slightly faster than libmf and NOMAD • CuMF scales well on 1, 2, 4 GPUs
  • 18. Effectiveness of memory optimization 18 X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set • Aggressively using registers  2x • Using texture  25%-35% faster
  • 19. CuMF performance and cost 19 • CuMF @4 GPUs ≈ NOMAD @64 HPC nodes ≈ 10x NOMAD @32 AWS nodes • CuMF @4 GPUs ≈ 10x SparkALS @50 nodes ≈ 1% of its cost
  • 20. CuMF accelerated Spark on Power 8 20 • CuMF @2 K40s achieves 6+x speedup in training (193 sec  1.3k sec) *GUI designed by Amir Sanjar
  • 21. Conclusion  Why accelerate recommendation (matrix factorization) using GPU? – Need to be fast, scalable and economic  What are the challenges? – Memory access, scale to multiple GPUs  How cuMF tackles the challenges? – Optimize memory access, parallelism and communication  So what is the result? – Up to 10x as fast, 100x as cost-efficient – Use cuMF standalone or with Spark – GPU can tackle ML problems beyond deep learning! 21
  • 22. Wei Tan, IBM T. J. Watson Research Center wtan@us.ibm.com http://ibm.biz/wei_tan Thank you, questions? Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016, http://arxiv.org/abs/1603.03820 Source code available soon.